SlideShare une entreprise Scribd logo
1  sur  19
Télécharger pour lire hors ligne
#datapopupseattle
Scala and Spark are Ideal
for Big Data
John Nestor
Sr Architect, 47 Degrees
47deg
#datapopupseattle
UNSTRUCTURED
Data Science POP-UP in Seattle
www.dominodatalab.com
D
Produced by Domino Data Lab
Domino’s enterprise data science platform is used
by leading analytical organizations to increase
productivity, enable collaboration, and publish
models into production faster.
Scala and Spark are
Ideal for Big Data
John Nestor
47 Degrees
Seattle Unstructured Data Science Pop-Up
October 7, 2015
www.47deg.com
347deg.com
47deg.com
Why Scala?
• Strong typing
• Concise elegant syntax
• Runs on JVM (Java Virtual Machine)
• Supports both object-oriented and functional
• Small simple programs through large parallel
distributed systems
• Easy to cleanly extend with new libraries and DSL’s
• Ideal for parallel and distributed systems
4
47deg.com
Scala: Strong Typing and Concise Syntax
• Strong typing like Java.
• Compile time checks
• Better modularity via strongly typed interfaces
• Easier maintenance: types make code easier to
understand
• Concise syntax like Python.
• Type inference. Compiler infers most types that had to be
explicit in Java.
• Powerful syntax that avoid much of the boilerplate of Java
code (see next slide).
• Best of both worlds: safety of strong typing with conciseness
(like Python).
5
47deg.com
Scala Case Class
• Java version



class User {

private String name;

private Int age;

public User(String name, Int age) {

this.name = name; this.age = age;

}

public getAge() { return age; }

public setAge(Int age) { this.age = age;}

}

User joe = new User(“Joe”, 30);

• Scala version



case class User(name:String, var age:Int)

val joe = User(“Joe”, 30)
6
47deg.com
Functional Scala
• Anonymous functions. 

(a:Int,b:Int) => a+b
• Functions that take and return other functions.
• Rarely need variables or loops
• Immutable collections: Seq[T], Map[K,V], …
• Works well with concurrent or distributed systems
• Natural for functional programming
• Functional collection operations (a small sample)
• map, flatMap, reduce, …
• filter, groupBy, sortBy, take, drop, …
7
47deg.com
Scala Availability and Support
• Open Source
• Typesafe provides support. Founded my Martin
Odersky who designed Scala.
• IDEs: Intellij IDEA and Eclipse
• Libraries: lots now and more every day
• ScalaNLP - Epic (natural language processing)
• Major Scala users: LinkedIn, Twitter, Goldman Sachs,
Coursera, Angies List, Whitepages
• Major systems written in Scala: Spark, Kafka
8
47deg.com
Typesafe Scala Components
• Scala Compiler (includes REPL)
• Scala Standard Libraries
• SBT - Scala Build Tool
• Play - scaleable web applications
• Scala JS - compiles Scala to JavaScript
• Akka - for parallel and distributed computation
• Spray - high performance asynchronous TCP/ HTTP library
• Spark - Typesafe also supports Spark
• Slick - for SQL database access
• ConductR - Scala deployment/devops tool
• Reactive Monitoring (Beta)
9
47deg.com
Why Spark?
• Support for not only batch but also (near) real-time
• Fast - keeps data in memory as much as possible
• Often 10X to 100X Hadoop speed
• A clean easy-to-use API
• A richer set of functional operations than just map and
reduce
• A foundation for a wide set of integrated data
applications
• Can recover from failures - recompute or (optional)
replication
• Scalable for very large data sets and reduced time
10
47deg.com
Spark RDDs
• RDD[T] - resilient distributed data set
• typed (must be serializable)
• immutable
• ordered
• can be processed in parallel
• lazy evaluation - permits more global optimizations
• Rich set of functional operations ( a small sample)
• map, flatMap, reduce, …
• filter, groupBy, sortBy, take, drop, …
11
47deg.com
Spark Components
• Spark Core
• Scalable multi-node cluster
• Failure detection and recovery
• RDDs and functional operations
• MLLib - for machine learning
• linear regression, SVMs, clustering, collaborative
filtering, dimension reduction
• more on the way!
• GraphX - for graph computation
• Streaming - for near real-time
• Dataframes - for SQL and Json
12
47deg.com
Spark Availability and Support
• Open Source - top level Apache project
• Over 750 contributors from over 200 organizations
• Can process multiple petabytes on clusters of over
8000 nodes
• Databricks. Matei Zaharia who wrote the original Spark
is a founder and CTO
• Packages (more every day)
• Zeppelin - Scala notebooks
• Cassandra, Kafka connectors
13
47deg.com
Clusters and Scalability
• Scala Akka clusters (process distribution, micro services)
• message passing
• remote Actors
• Spark clusters (data distribution)
• local
• Stand alone (optionally with ZooKeeper)
• Apache Mesos
• Hadoop Yarn
• can run above on Amazon and Google clouds
14
47deg.com
Why Scala for Spark?
• Why not Python, R, or Java for Spark?
• Spark is written in Scala
• Scala source code is important Spark documentation
• Spark is best extended in Scala
• The primary API for Spark is Scala
• The functional features of Scala and Spark are a
natural fit and easiest to use in Scala
• If you want to build scalable high performance
production code based on Spark, R by itself is too
specialized, Python is too slow and Java is tedious to
write and maintain
15
47deg.com
Demo
16
47deg.com
Seattle Resources
• Seattle Meetups
• Scala at the Sea Meetup 

http://www.meetup.com/Seattle-Scala-User-Group/
• Seattle Spark Meetup 

http://www.meetup.com/Seattle-Spark-Meetup/
• Seattle Training: Spark and Typesafe Scala Classes
http://www.47deg.com/events#training
• UW Scala Professional Certificate Program 

http://www.pce.uw.edu/certificates/scala-functional-reactive-programming.html
17
#datapopupseattle
@datapopup
#datapopupseattle
#datapopupseattle
Thank You To Our Sponsors

Contenu connexe

Tendances

Big data pipeline with scala by Rohit Rai, Tuplejump - presented at Pune Scal...
Big data pipeline with scala by Rohit Rai, Tuplejump - presented at Pune Scal...Big data pipeline with scala by Rohit Rai, Tuplejump - presented at Pune Scal...
Big data pipeline with scala by Rohit Rai, Tuplejump - presented at Pune Scal...Thoughtworks
 
Future of data visualization
Future of data visualizationFuture of data visualization
Future of data visualizationhadoopsphere
 
Building Google-in-a-box: using Apache SolrCloud and Bigtop to index your big...
Building Google-in-a-box: using Apache SolrCloud and Bigtop to index your big...Building Google-in-a-box: using Apache SolrCloud and Bigtop to index your big...
Building Google-in-a-box: using Apache SolrCloud and Bigtop to index your big...rhatr
 
ELK at LinkedIn - Kafka, scaling, lessons learned
ELK at LinkedIn - Kafka, scaling, lessons learnedELK at LinkedIn - Kafka, scaling, lessons learned
ELK at LinkedIn - Kafka, scaling, lessons learnedTin Le
 
Data Science meets Software Development
Data Science meets Software DevelopmentData Science meets Software Development
Data Science meets Software DevelopmentAlexis Seigneurin
 
Devops Days, 2019 - Charlotte
Devops Days, 2019 - CharlotteDevops Days, 2019 - Charlotte
Devops Days, 2019 - Charlottebotsplash.com
 
SparkOscope: Enabling Apache Spark Optimization through Cross Stack Monitorin...
SparkOscope: Enabling Apache Spark Optimization through Cross Stack Monitorin...SparkOscope: Enabling Apache Spark Optimization through Cross Stack Monitorin...
SparkOscope: Enabling Apache Spark Optimization through Cross Stack Monitorin...Databricks
 
Introduction to apache spark
Introduction to apache sparkIntroduction to apache spark
Introduction to apache sparkMuktadiur Rahman
 
Scala in Model-Driven development for Apparel Cloud Platform
Scala in Model-Driven development for Apparel Cloud PlatformScala in Model-Driven development for Apparel Cloud Platform
Scala in Model-Driven development for Apparel Cloud PlatformTomoharu ASAMI
 
Large Scale Lakehouse Implementation Using Structured Streaming
Large Scale Lakehouse Implementation Using Structured StreamingLarge Scale Lakehouse Implementation Using Structured Streaming
Large Scale Lakehouse Implementation Using Structured StreamingDatabricks
 
Tech Spark Presentation
Tech Spark PresentationTech Spark Presentation
Tech Spark PresentationStephen Borg
 
Presto @ Zalando - Big Data Tech Warsaw 2020
Presto @ Zalando - Big Data Tech Warsaw 2020Presto @ Zalando - Big Data Tech Warsaw 2020
Presto @ Zalando - Big Data Tech Warsaw 2020Piotr Findeisen
 
Apache Spark At Apple with Sam Maclennan and Vishwanath Lakkundi
Apache Spark At Apple with Sam Maclennan and Vishwanath LakkundiApache Spark At Apple with Sam Maclennan and Vishwanath Lakkundi
Apache Spark At Apple with Sam Maclennan and Vishwanath LakkundiDatabricks
 
Lessons from {distributed,remote,virtual} communities and companies
Lessons from {distributed,remote,virtual} communities and companiesLessons from {distributed,remote,virtual} communities and companies
Lessons from {distributed,remote,virtual} communities and companiesColin Charles
 
IMCSummit 2015 - Day 2 Developer Track - Anatomy of an In-Memory Data Fabric:...
IMCSummit 2015 - Day 2 Developer Track - Anatomy of an In-Memory Data Fabric:...IMCSummit 2015 - Day 2 Developer Track - Anatomy of an In-Memory Data Fabric:...
IMCSummit 2015 - Day 2 Developer Track - Anatomy of an In-Memory Data Fabric:...In-Memory Computing Summit
 
Flink in Zalando's world of Microservices
Flink in Zalando's world of Microservices   Flink in Zalando's world of Microservices
Flink in Zalando's world of Microservices ZalandoHayley
 
Bootstrap SaaS startup using Open Source Tools
Bootstrap SaaS startup using Open Source ToolsBootstrap SaaS startup using Open Source Tools
Bootstrap SaaS startup using Open Source Toolsbotsplash.com
 
Presto @ Facebook: Past, Present and Future
Presto @ Facebook: Past, Present and FuturePresto @ Facebook: Past, Present and Future
Presto @ Facebook: Past, Present and FutureDataWorks Summit
 
Introduction to df
Introduction to dfIntroduction to df
Introduction to dfMohit Jaggi
 
Extending Apache Spark APIs Without Going Near Spark Source or a Compiler wi...
 Extending Apache Spark APIs Without Going Near Spark Source or a Compiler wi... Extending Apache Spark APIs Without Going Near Spark Source or a Compiler wi...
Extending Apache Spark APIs Without Going Near Spark Source or a Compiler wi...Databricks
 

Tendances (20)

Big data pipeline with scala by Rohit Rai, Tuplejump - presented at Pune Scal...
Big data pipeline with scala by Rohit Rai, Tuplejump - presented at Pune Scal...Big data pipeline with scala by Rohit Rai, Tuplejump - presented at Pune Scal...
Big data pipeline with scala by Rohit Rai, Tuplejump - presented at Pune Scal...
 
Future of data visualization
Future of data visualizationFuture of data visualization
Future of data visualization
 
Building Google-in-a-box: using Apache SolrCloud and Bigtop to index your big...
Building Google-in-a-box: using Apache SolrCloud and Bigtop to index your big...Building Google-in-a-box: using Apache SolrCloud and Bigtop to index your big...
Building Google-in-a-box: using Apache SolrCloud and Bigtop to index your big...
 
ELK at LinkedIn - Kafka, scaling, lessons learned
ELK at LinkedIn - Kafka, scaling, lessons learnedELK at LinkedIn - Kafka, scaling, lessons learned
ELK at LinkedIn - Kafka, scaling, lessons learned
 
Data Science meets Software Development
Data Science meets Software DevelopmentData Science meets Software Development
Data Science meets Software Development
 
Devops Days, 2019 - Charlotte
Devops Days, 2019 - CharlotteDevops Days, 2019 - Charlotte
Devops Days, 2019 - Charlotte
 
SparkOscope: Enabling Apache Spark Optimization through Cross Stack Monitorin...
SparkOscope: Enabling Apache Spark Optimization through Cross Stack Monitorin...SparkOscope: Enabling Apache Spark Optimization through Cross Stack Monitorin...
SparkOscope: Enabling Apache Spark Optimization through Cross Stack Monitorin...
 
Introduction to apache spark
Introduction to apache sparkIntroduction to apache spark
Introduction to apache spark
 
Scala in Model-Driven development for Apparel Cloud Platform
Scala in Model-Driven development for Apparel Cloud PlatformScala in Model-Driven development for Apparel Cloud Platform
Scala in Model-Driven development for Apparel Cloud Platform
 
Large Scale Lakehouse Implementation Using Structured Streaming
Large Scale Lakehouse Implementation Using Structured StreamingLarge Scale Lakehouse Implementation Using Structured Streaming
Large Scale Lakehouse Implementation Using Structured Streaming
 
Tech Spark Presentation
Tech Spark PresentationTech Spark Presentation
Tech Spark Presentation
 
Presto @ Zalando - Big Data Tech Warsaw 2020
Presto @ Zalando - Big Data Tech Warsaw 2020Presto @ Zalando - Big Data Tech Warsaw 2020
Presto @ Zalando - Big Data Tech Warsaw 2020
 
Apache Spark At Apple with Sam Maclennan and Vishwanath Lakkundi
Apache Spark At Apple with Sam Maclennan and Vishwanath LakkundiApache Spark At Apple with Sam Maclennan and Vishwanath Lakkundi
Apache Spark At Apple with Sam Maclennan and Vishwanath Lakkundi
 
Lessons from {distributed,remote,virtual} communities and companies
Lessons from {distributed,remote,virtual} communities and companiesLessons from {distributed,remote,virtual} communities and companies
Lessons from {distributed,remote,virtual} communities and companies
 
IMCSummit 2015 - Day 2 Developer Track - Anatomy of an In-Memory Data Fabric:...
IMCSummit 2015 - Day 2 Developer Track - Anatomy of an In-Memory Data Fabric:...IMCSummit 2015 - Day 2 Developer Track - Anatomy of an In-Memory Data Fabric:...
IMCSummit 2015 - Day 2 Developer Track - Anatomy of an In-Memory Data Fabric:...
 
Flink in Zalando's world of Microservices
Flink in Zalando's world of Microservices   Flink in Zalando's world of Microservices
Flink in Zalando's world of Microservices
 
Bootstrap SaaS startup using Open Source Tools
Bootstrap SaaS startup using Open Source ToolsBootstrap SaaS startup using Open Source Tools
Bootstrap SaaS startup using Open Source Tools
 
Presto @ Facebook: Past, Present and Future
Presto @ Facebook: Past, Present and FuturePresto @ Facebook: Past, Present and Future
Presto @ Facebook: Past, Present and Future
 
Introduction to df
Introduction to dfIntroduction to df
Introduction to df
 
Extending Apache Spark APIs Without Going Near Spark Source or a Compiler wi...
 Extending Apache Spark APIs Without Going Near Spark Source or a Compiler wi... Extending Apache Spark APIs Without Going Near Spark Source or a Compiler wi...
Extending Apache Spark APIs Without Going Near Spark Source or a Compiler wi...
 

En vedette

El wki
El wkiEl wki
El wkiSA1YCL
 
Fake Pokémon GO Apps Are Released in the Wild
Fake Pokémon GO Apps Are Released in the WildFake Pokémon GO Apps Are Released in the Wild
Fake Pokémon GO Apps Are Released in the WildHouse of I.T.
 
Solidarity - server architecture
Solidarity - server architectureSolidarity - server architecture
Solidarity - server architectureYury Chernushenko
 
Soloway. Простой programmatic для высокой результативности performance-кампаний.
Soloway. Простой programmatic для высокой результативности performance-кампаний.Soloway. Простой programmatic для высокой результативности performance-кампаний.
Soloway. Простой programmatic для высокой результативности performance-кампаний.HybridRussia
 
Hybrid. Увеличение количества звонков с помощью mobile programmatic
Hybrid. Увеличение количества звонков с помощью mobile programmaticHybrid. Увеличение количества звонков с помощью mobile programmatic
Hybrid. Увеличение количества звонков с помощью mobile programmaticHybridRussia
 
فهرست مطالب مطالعه بازار آنلاین املاک در ایران
فهرست مطالب مطالعه بازار آنلاین املاک در ایران فهرست مطالب مطالعه بازار آنلاین املاک در ایران
فهرست مطالب مطالعه بازار آنلاین املاک در ایران E-Commerce Monitor (ECM)
 
مطالعه فروشگاه های آنلاین ایران
مطالعه فروشگاه های آنلاین ایرانمطالعه فروشگاه های آنلاین ایران
مطالعه فروشگاه های آنلاین ایرانE-Commerce Monitor (ECM)
 
Scheduling and management of seed fund allocation across multiple npd projects
Scheduling and management of seed fund allocation across multiple npd projectsScheduling and management of seed fund allocation across multiple npd projects
Scheduling and management of seed fund allocation across multiple npd projectsAshok Rangaswamy
 
Apache flink - prise en main rapide
Apache flink - prise en main rapideApache flink - prise en main rapide
Apache flink - prise en main rapideBilal Baltagi
 

En vedette (15)

El wki
El wkiEl wki
El wki
 
Fake Pokémon GO Apps Are Released in the Wild
Fake Pokémon GO Apps Are Released in the WildFake Pokémon GO Apps Are Released in the Wild
Fake Pokémon GO Apps Are Released in the Wild
 
Solidarity - server architecture
Solidarity - server architectureSolidarity - server architecture
Solidarity - server architecture
 
start_up 01
start_up 01start_up 01
start_up 01
 
Soloway. Простой programmatic для высокой результативности performance-кампаний.
Soloway. Простой programmatic для высокой результативности performance-кампаний.Soloway. Простой programmatic для высокой результативности performance-кампаний.
Soloway. Простой programmatic для высокой результативности performance-кампаний.
 
Hybrid. Увеличение количества звонков с помощью mobile programmatic
Hybrid. Увеличение количества звонков с помощью mobile programmaticHybrid. Увеличение количества звонков с помощью mobile programmatic
Hybrid. Увеличение количества звонков с помощью mobile programmatic
 
Elena duque blog
Elena duque blogElena duque blog
Elena duque blog
 
How to test vocabulary
How to test vocabularyHow to test vocabulary
How to test vocabulary
 
лекция 4
лекция 4лекция 4
лекция 4
 
Hotels
Hotels Hotels
Hotels
 
فهرست مطالب مطالعه بازار آنلاین املاک در ایران
فهرست مطالب مطالعه بازار آنلاین املاک در ایران فهرست مطالب مطالعه بازار آنلاین املاک در ایران
فهرست مطالب مطالعه بازار آنلاین املاک در ایران
 
مطالعه فروشگاه های آنلاین ایران
مطالعه فروشگاه های آنلاین ایرانمطالعه فروشگاه های آنلاین ایران
مطالعه فروشگاه های آنلاین ایران
 
Scheduling and management of seed fund allocation across multiple npd projects
Scheduling and management of seed fund allocation across multiple npd projectsScheduling and management of seed fund allocation across multiple npd projects
Scheduling and management of seed fund allocation across multiple npd projects
 
Entrepisos2015
Entrepisos2015Entrepisos2015
Entrepisos2015
 
Apache flink - prise en main rapide
Apache flink - prise en main rapideApache flink - prise en main rapide
Apache flink - prise en main rapide
 

Similaire à Scala and Spark are Ideal for Big Data - Data Science Pop-up Seattle

Sa introduction to big data pipelining with cassandra & spark west mins...
Sa introduction to big data pipelining with cassandra & spark   west mins...Sa introduction to big data pipelining with cassandra & spark   west mins...
Sa introduction to big data pipelining with cassandra & spark west mins...Simon Ambridge
 
Apache Spark for Everyone - Women Who Code Workshop
Apache Spark for Everyone - Women Who Code WorkshopApache Spark for Everyone - Women Who Code Workshop
Apache Spark for Everyone - Women Who Code WorkshopAmanda Casari
 
Big Data pipeline with Scala by Rohit Rai, Tuplejump - presented at Pune Scal...
Big Data pipeline with Scala by Rohit Rai, Tuplejump - presented at Pune Scal...Big Data pipeline with Scala by Rohit Rai, Tuplejump - presented at Pune Scal...
Big Data pipeline with Scala by Rohit Rai, Tuplejump - presented at Pune Scal...Thoughtworks
 
Apache Spark Fundamentals
Apache Spark FundamentalsApache Spark Fundamentals
Apache Spark FundamentalsZahra Eskandari
 
Building data pipelines for modern data warehouse with Apache® Spark™ and .NE...
Building data pipelines for modern data warehouse with Apache® Spark™ and .NE...Building data pipelines for modern data warehouse with Apache® Spark™ and .NE...
Building data pipelines for modern data warehouse with Apache® Spark™ and .NE...Michael Rys
 
.NET for Azure Synapse (and viceversa)
.NET for Azure Synapse (and viceversa).NET for Azure Synapse (and viceversa)
.NET for Azure Synapse (and viceversa)Marco Parenzan
 
Hadoop world overview trends and topics
Hadoop world overview trends and topicsHadoop world overview trends and topics
Hadoop world overview trends and topicsValentin Kropov
 
Big Data tools in practice
Big Data tools in practiceBig Data tools in practice
Big Data tools in practiceDarko Marjanovic
 
Processing Large Data with Apache Spark -- HasGeek
Processing Large Data with Apache Spark -- HasGeekProcessing Large Data with Apache Spark -- HasGeek
Processing Large Data with Apache Spark -- HasGeekVenkata Naga Ravi
 
Real Time Analytics with Dse
Real Time Analytics with DseReal Time Analytics with Dse
Real Time Analytics with DseDataStax Academy
 
.NET per la Data Science e oltre
.NET per la Data Science e oltre.NET per la Data Science e oltre
.NET per la Data Science e oltreMarco Parenzan
 
Intro to Apache Spark by CTO of Twingo
Intro to Apache Spark by CTO of TwingoIntro to Apache Spark by CTO of Twingo
Intro to Apache Spark by CTO of TwingoMapR Technologies
 
One Tool to Rule Them All- Seamless SQL on MongoDB, MySQL and Redis with Apac...
One Tool to Rule Them All- Seamless SQL on MongoDB, MySQL and Redis with Apac...One Tool to Rule Them All- Seamless SQL on MongoDB, MySQL and Redis with Apac...
One Tool to Rule Them All- Seamless SQL on MongoDB, MySQL and Redis with Apac...Tim Vaillancourt
 
Etu Solution Day 2014 Track-D: 掌握Impala和Spark
Etu Solution Day 2014 Track-D: 掌握Impala和SparkEtu Solution Day 2014 Track-D: 掌握Impala和Spark
Etu Solution Day 2014 Track-D: 掌握Impala和SparkJames Chen
 
Lessons learned from running Spark on Docker
Lessons learned from running Spark on DockerLessons learned from running Spark on Docker
Lessons learned from running Spark on DockerDataWorks Summit
 
Data Science at Scale: Using Apache Spark for Data Science at Bitly
Data Science at Scale: Using Apache Spark for Data Science at BitlyData Science at Scale: Using Apache Spark for Data Science at Bitly
Data Science at Scale: Using Apache Spark for Data Science at BitlySarah Guido
 
Deep Learning on Apache® Spark™ : Workflows and Best Practices
Deep Learning on Apache® Spark™ : Workflows and Best PracticesDeep Learning on Apache® Spark™ : Workflows and Best Practices
Deep Learning on Apache® Spark™ : Workflows and Best PracticesJen Aman
 

Similaire à Scala and Spark are Ideal for Big Data - Data Science Pop-up Seattle (20)

Sa introduction to big data pipelining with cassandra & spark west mins...
Sa introduction to big data pipelining with cassandra & spark   west mins...Sa introduction to big data pipelining with cassandra & spark   west mins...
Sa introduction to big data pipelining with cassandra & spark west mins...
 
Apache Spark for Everyone - Women Who Code Workshop
Apache Spark for Everyone - Women Who Code WorkshopApache Spark for Everyone - Women Who Code Workshop
Apache Spark for Everyone - Women Who Code Workshop
 
Big Data pipeline with Scala by Rohit Rai, Tuplejump - presented at Pune Scal...
Big Data pipeline with Scala by Rohit Rai, Tuplejump - presented at Pune Scal...Big Data pipeline with Scala by Rohit Rai, Tuplejump - presented at Pune Scal...
Big Data pipeline with Scala by Rohit Rai, Tuplejump - presented at Pune Scal...
 
Spark SQL
Spark SQLSpark SQL
Spark SQL
 
Apache Spark Fundamentals
Apache Spark FundamentalsApache Spark Fundamentals
Apache Spark Fundamentals
 
Building data pipelines for modern data warehouse with Apache® Spark™ and .NE...
Building data pipelines for modern data warehouse with Apache® Spark™ and .NE...Building data pipelines for modern data warehouse with Apache® Spark™ and .NE...
Building data pipelines for modern data warehouse with Apache® Spark™ and .NE...
 
.NET for Azure Synapse (and viceversa)
.NET for Azure Synapse (and viceversa).NET for Azure Synapse (and viceversa)
.NET for Azure Synapse (and viceversa)
 
Apache Spark on HDinsight Training
Apache Spark on HDinsight TrainingApache Spark on HDinsight Training
Apache Spark on HDinsight Training
 
Hadoop world overview trends and topics
Hadoop world overview trends and topicsHadoop world overview trends and topics
Hadoop world overview trends and topics
 
Big Data tools in practice
Big Data tools in practiceBig Data tools in practice
Big Data tools in practice
 
Processing Large Data with Apache Spark -- HasGeek
Processing Large Data with Apache Spark -- HasGeekProcessing Large Data with Apache Spark -- HasGeek
Processing Large Data with Apache Spark -- HasGeek
 
Real Time Analytics with Dse
Real Time Analytics with DseReal Time Analytics with Dse
Real Time Analytics with Dse
 
.NET per la Data Science e oltre
.NET per la Data Science e oltre.NET per la Data Science e oltre
.NET per la Data Science e oltre
 
Intro to Apache Spark by CTO of Twingo
Intro to Apache Spark by CTO of TwingoIntro to Apache Spark by CTO of Twingo
Intro to Apache Spark by CTO of Twingo
 
One Tool to Rule Them All- Seamless SQL on MongoDB, MySQL and Redis with Apac...
One Tool to Rule Them All- Seamless SQL on MongoDB, MySQL and Redis with Apac...One Tool to Rule Them All- Seamless SQL on MongoDB, MySQL and Redis with Apac...
One Tool to Rule Them All- Seamless SQL on MongoDB, MySQL and Redis with Apac...
 
Etu Solution Day 2014 Track-D: 掌握Impala和Spark
Etu Solution Day 2014 Track-D: 掌握Impala和SparkEtu Solution Day 2014 Track-D: 掌握Impala和Spark
Etu Solution Day 2014 Track-D: 掌握Impala和Spark
 
Intro to Apache Spark
Intro to Apache SparkIntro to Apache Spark
Intro to Apache Spark
 
Lessons learned from running Spark on Docker
Lessons learned from running Spark on DockerLessons learned from running Spark on Docker
Lessons learned from running Spark on Docker
 
Data Science at Scale: Using Apache Spark for Data Science at Bitly
Data Science at Scale: Using Apache Spark for Data Science at BitlyData Science at Scale: Using Apache Spark for Data Science at Bitly
Data Science at Scale: Using Apache Spark for Data Science at Bitly
 
Deep Learning on Apache® Spark™ : Workflows and Best Practices
Deep Learning on Apache® Spark™ : Workflows and Best PracticesDeep Learning on Apache® Spark™ : Workflows and Best Practices
Deep Learning on Apache® Spark™ : Workflows and Best Practices
 

Plus de Domino Data Lab

What's in your workflow? Bringing data science workflows to business analysis...
What's in your workflow? Bringing data science workflows to business analysis...What's in your workflow? Bringing data science workflows to business analysis...
What's in your workflow? Bringing data science workflows to business analysis...Domino Data Lab
 
The Proliferation of New Database Technologies and Implications for Data Scie...
The Proliferation of New Database Technologies and Implications for Data Scie...The Proliferation of New Database Technologies and Implications for Data Scie...
The Proliferation of New Database Technologies and Implications for Data Scie...Domino Data Lab
 
Racial Bias in Policing: an analysis of Illinois traffic stops data
Racial Bias in Policing: an analysis of Illinois traffic stops dataRacial Bias in Policing: an analysis of Illinois traffic stops data
Racial Bias in Policing: an analysis of Illinois traffic stops dataDomino Data Lab
 
Data Quality Analytics: Understanding what is in your data, before using it
Data Quality Analytics: Understanding what is in your data, before using itData Quality Analytics: Understanding what is in your data, before using it
Data Quality Analytics: Understanding what is in your data, before using itDomino Data Lab
 
Supporting innovation in insurance with randomized experimentation
Supporting innovation in insurance with randomized experimentationSupporting innovation in insurance with randomized experimentation
Supporting innovation in insurance with randomized experimentationDomino Data Lab
 
Leveraging Data Science in the Automotive Industry
Leveraging Data Science in the Automotive IndustryLeveraging Data Science in the Automotive Industry
Leveraging Data Science in the Automotive IndustryDomino Data Lab
 
Summertime Analytics: Predicting E. coli and West Nile Virus
Summertime Analytics: Predicting E. coli and West Nile VirusSummertime Analytics: Predicting E. coli and West Nile Virus
Summertime Analytics: Predicting E. coli and West Nile VirusDomino Data Lab
 
Reproducible Dashboards and other great things to do with Jupyter
Reproducible Dashboards and other great things to do with JupyterReproducible Dashboards and other great things to do with Jupyter
Reproducible Dashboards and other great things to do with JupyterDomino Data Lab
 
GeoViz: A Canvas for Data Science
GeoViz: A Canvas for Data ScienceGeoViz: A Canvas for Data Science
GeoViz: A Canvas for Data ScienceDomino Data Lab
 
Managing Data Science | Lessons from the Field
Managing Data Science | Lessons from the Field Managing Data Science | Lessons from the Field
Managing Data Science | Lessons from the Field Domino Data Lab
 
Doing your first Kaggle (Python for Big Data sets)
Doing your first Kaggle (Python for Big Data sets)Doing your first Kaggle (Python for Big Data sets)
Doing your first Kaggle (Python for Big Data sets)Domino Data Lab
 
Leveraged Analytics at Scale
Leveraged Analytics at ScaleLeveraged Analytics at Scale
Leveraged Analytics at ScaleDomino Data Lab
 
How I Learned to Stop Worrying and Love Linked Data
How I Learned to Stop Worrying and Love Linked DataHow I Learned to Stop Worrying and Love Linked Data
How I Learned to Stop Worrying and Love Linked DataDomino Data Lab
 
Software Engineering for Data Scientists
Software Engineering for Data ScientistsSoftware Engineering for Data Scientists
Software Engineering for Data ScientistsDomino Data Lab
 
Moving Data Science from an Event to A Program: Considerations in Creating Su...
Moving Data Science from an Event to A Program: Considerations in Creating Su...Moving Data Science from an Event to A Program: Considerations in Creating Su...
Moving Data Science from an Event to A Program: Considerations in Creating Su...Domino Data Lab
 
Building Data Analytics pipelines in the cloud using serverless technology
Building Data Analytics pipelines in the cloud using serverless technologyBuilding Data Analytics pipelines in the cloud using serverless technology
Building Data Analytics pipelines in the cloud using serverless technologyDomino Data Lab
 
Leveraging Open Source Automated Data Science Tools
Leveraging Open Source Automated Data Science ToolsLeveraging Open Source Automated Data Science Tools
Leveraging Open Source Automated Data Science ToolsDomino Data Lab
 
Domino and AWS: collaborative analytics and model governance at financial ser...
Domino and AWS: collaborative analytics and model governance at financial ser...Domino and AWS: collaborative analytics and model governance at financial ser...
Domino and AWS: collaborative analytics and model governance at financial ser...Domino Data Lab
 
The Role and Importance of Curiosity in Data Science
The Role and Importance of Curiosity in Data ScienceThe Role and Importance of Curiosity in Data Science
The Role and Importance of Curiosity in Data ScienceDomino Data Lab
 

Plus de Domino Data Lab (20)

What's in your workflow? Bringing data science workflows to business analysis...
What's in your workflow? Bringing data science workflows to business analysis...What's in your workflow? Bringing data science workflows to business analysis...
What's in your workflow? Bringing data science workflows to business analysis...
 
The Proliferation of New Database Technologies and Implications for Data Scie...
The Proliferation of New Database Technologies and Implications for Data Scie...The Proliferation of New Database Technologies and Implications for Data Scie...
The Proliferation of New Database Technologies and Implications for Data Scie...
 
Racial Bias in Policing: an analysis of Illinois traffic stops data
Racial Bias in Policing: an analysis of Illinois traffic stops dataRacial Bias in Policing: an analysis of Illinois traffic stops data
Racial Bias in Policing: an analysis of Illinois traffic stops data
 
Data Quality Analytics: Understanding what is in your data, before using it
Data Quality Analytics: Understanding what is in your data, before using itData Quality Analytics: Understanding what is in your data, before using it
Data Quality Analytics: Understanding what is in your data, before using it
 
Supporting innovation in insurance with randomized experimentation
Supporting innovation in insurance with randomized experimentationSupporting innovation in insurance with randomized experimentation
Supporting innovation in insurance with randomized experimentation
 
Leveraging Data Science in the Automotive Industry
Leveraging Data Science in the Automotive IndustryLeveraging Data Science in the Automotive Industry
Leveraging Data Science in the Automotive Industry
 
Summertime Analytics: Predicting E. coli and West Nile Virus
Summertime Analytics: Predicting E. coli and West Nile VirusSummertime Analytics: Predicting E. coli and West Nile Virus
Summertime Analytics: Predicting E. coli and West Nile Virus
 
Reproducible Dashboards and other great things to do with Jupyter
Reproducible Dashboards and other great things to do with JupyterReproducible Dashboards and other great things to do with Jupyter
Reproducible Dashboards and other great things to do with Jupyter
 
GeoViz: A Canvas for Data Science
GeoViz: A Canvas for Data ScienceGeoViz: A Canvas for Data Science
GeoViz: A Canvas for Data Science
 
Managing Data Science | Lessons from the Field
Managing Data Science | Lessons from the Field Managing Data Science | Lessons from the Field
Managing Data Science | Lessons from the Field
 
Doing your first Kaggle (Python for Big Data sets)
Doing your first Kaggle (Python for Big Data sets)Doing your first Kaggle (Python for Big Data sets)
Doing your first Kaggle (Python for Big Data sets)
 
Leveraged Analytics at Scale
Leveraged Analytics at ScaleLeveraged Analytics at Scale
Leveraged Analytics at Scale
 
How I Learned to Stop Worrying and Love Linked Data
How I Learned to Stop Worrying and Love Linked DataHow I Learned to Stop Worrying and Love Linked Data
How I Learned to Stop Worrying and Love Linked Data
 
Software Engineering for Data Scientists
Software Engineering for Data ScientistsSoftware Engineering for Data Scientists
Software Engineering for Data Scientists
 
Making Big Data Smart
Making Big Data SmartMaking Big Data Smart
Making Big Data Smart
 
Moving Data Science from an Event to A Program: Considerations in Creating Su...
Moving Data Science from an Event to A Program: Considerations in Creating Su...Moving Data Science from an Event to A Program: Considerations in Creating Su...
Moving Data Science from an Event to A Program: Considerations in Creating Su...
 
Building Data Analytics pipelines in the cloud using serverless technology
Building Data Analytics pipelines in the cloud using serverless technologyBuilding Data Analytics pipelines in the cloud using serverless technology
Building Data Analytics pipelines in the cloud using serverless technology
 
Leveraging Open Source Automated Data Science Tools
Leveraging Open Source Automated Data Science ToolsLeveraging Open Source Automated Data Science Tools
Leveraging Open Source Automated Data Science Tools
 
Domino and AWS: collaborative analytics and model governance at financial ser...
Domino and AWS: collaborative analytics and model governance at financial ser...Domino and AWS: collaborative analytics and model governance at financial ser...
Domino and AWS: collaborative analytics and model governance at financial ser...
 
The Role and Importance of Curiosity in Data Science
The Role and Importance of Curiosity in Data ScienceThe Role and Importance of Curiosity in Data Science
The Role and Importance of Curiosity in Data Science
 

Dernier

Identifying Appropriate Test Statistics Involving Population Mean
Identifying Appropriate Test Statistics Involving Population MeanIdentifying Appropriate Test Statistics Involving Population Mean
Identifying Appropriate Test Statistics Involving Population MeanMYRABACSAFRA2
 
20240419 - Measurecamp Amsterdam - SAM.pdf
20240419 - Measurecamp Amsterdam - SAM.pdf20240419 - Measurecamp Amsterdam - SAM.pdf
20240419 - Measurecamp Amsterdam - SAM.pdfHuman37
 
MK KOMUNIKASI DATA (TI)komdat komdat.docx
MK KOMUNIKASI DATA (TI)komdat komdat.docxMK KOMUNIKASI DATA (TI)komdat komdat.docx
MK KOMUNIKASI DATA (TI)komdat komdat.docxUnduhUnggah1
 
9711147426✨Call In girls Gurgaon Sector 31. SCO 25 escort service
9711147426✨Call In girls Gurgaon Sector 31. SCO 25 escort service9711147426✨Call In girls Gurgaon Sector 31. SCO 25 escort service
9711147426✨Call In girls Gurgaon Sector 31. SCO 25 escort servicejennyeacort
 
RadioAdProWritingCinderellabyButleri.pdf
RadioAdProWritingCinderellabyButleri.pdfRadioAdProWritingCinderellabyButleri.pdf
RadioAdProWritingCinderellabyButleri.pdfgstagge
 
Generative AI for Social Good at Open Data Science East 2024
Generative AI for Social Good at Open Data Science East 2024Generative AI for Social Good at Open Data Science East 2024
Generative AI for Social Good at Open Data Science East 2024Colleen Farrelly
 
Call Girls In Dwarka 9654467111 Escorts Service
Call Girls In Dwarka 9654467111 Escorts ServiceCall Girls In Dwarka 9654467111 Escorts Service
Call Girls In Dwarka 9654467111 Escorts ServiceSapana Sha
 
Saket, (-DELHI )+91-9654467111-(=)CHEAP Call Girls in Escorts Service Saket C...
Saket, (-DELHI )+91-9654467111-(=)CHEAP Call Girls in Escorts Service Saket C...Saket, (-DELHI )+91-9654467111-(=)CHEAP Call Girls in Escorts Service Saket C...
Saket, (-DELHI )+91-9654467111-(=)CHEAP Call Girls in Escorts Service Saket C...Sapana Sha
 
High Class Call Girls Noida Sector 39 Aarushi 🔝8264348440🔝 Independent Escort...
High Class Call Girls Noida Sector 39 Aarushi 🔝8264348440🔝 Independent Escort...High Class Call Girls Noida Sector 39 Aarushi 🔝8264348440🔝 Independent Escort...
High Class Call Girls Noida Sector 39 Aarushi 🔝8264348440🔝 Independent Escort...soniya singh
 
From idea to production in a day – Leveraging Azure ML and Streamlit to build...
From idea to production in a day – Leveraging Azure ML and Streamlit to build...From idea to production in a day – Leveraging Azure ML and Streamlit to build...
From idea to production in a day – Leveraging Azure ML and Streamlit to build...Florian Roscheck
 
Consent & Privacy Signals on Google *Pixels* - MeasureCamp Amsterdam 2024
Consent & Privacy Signals on Google *Pixels* - MeasureCamp Amsterdam 2024Consent & Privacy Signals on Google *Pixels* - MeasureCamp Amsterdam 2024
Consent & Privacy Signals on Google *Pixels* - MeasureCamp Amsterdam 2024thyngster
 
Multiple time frame trading analysis -brianshannon.pdf
Multiple time frame trading analysis -brianshannon.pdfMultiple time frame trading analysis -brianshannon.pdf
Multiple time frame trading analysis -brianshannon.pdfchwongval
 
How we prevented account sharing with MFA
How we prevented account sharing with MFAHow we prevented account sharing with MFA
How we prevented account sharing with MFAAndrei Kaleshka
 
Heart Disease Classification Report: A Data Analysis Project
Heart Disease Classification Report: A Data Analysis ProjectHeart Disease Classification Report: A Data Analysis Project
Heart Disease Classification Report: A Data Analysis ProjectBoston Institute of Analytics
 
IMA MSN - Medical Students Network (2).pptx
IMA MSN - Medical Students Network (2).pptxIMA MSN - Medical Students Network (2).pptx
IMA MSN - Medical Students Network (2).pptxdolaknnilon
 
GA4 Without Cookies [Measure Camp AMS]
GA4 Without Cookies [Measure Camp AMS]GA4 Without Cookies [Measure Camp AMS]
GA4 Without Cookies [Measure Camp AMS]📊 Markus Baersch
 
Kantar AI Summit- Under Embargo till Wednesday, 24th April 2024, 4 PM, IST.pdf
Kantar AI Summit- Under Embargo till Wednesday, 24th April 2024, 4 PM, IST.pdfKantar AI Summit- Under Embargo till Wednesday, 24th April 2024, 4 PM, IST.pdf
Kantar AI Summit- Under Embargo till Wednesday, 24th April 2024, 4 PM, IST.pdfSocial Samosa
 
办理学位证纽约大学毕业证(NYU毕业证书)原版一比一
办理学位证纽约大学毕业证(NYU毕业证书)原版一比一办理学位证纽约大学毕业证(NYU毕业证书)原版一比一
办理学位证纽约大学毕业证(NYU毕业证书)原版一比一fhwihughh
 

Dernier (20)

Identifying Appropriate Test Statistics Involving Population Mean
Identifying Appropriate Test Statistics Involving Population MeanIdentifying Appropriate Test Statistics Involving Population Mean
Identifying Appropriate Test Statistics Involving Population Mean
 
20240419 - Measurecamp Amsterdam - SAM.pdf
20240419 - Measurecamp Amsterdam - SAM.pdf20240419 - Measurecamp Amsterdam - SAM.pdf
20240419 - Measurecamp Amsterdam - SAM.pdf
 
MK KOMUNIKASI DATA (TI)komdat komdat.docx
MK KOMUNIKASI DATA (TI)komdat komdat.docxMK KOMUNIKASI DATA (TI)komdat komdat.docx
MK KOMUNIKASI DATA (TI)komdat komdat.docx
 
9711147426✨Call In girls Gurgaon Sector 31. SCO 25 escort service
9711147426✨Call In girls Gurgaon Sector 31. SCO 25 escort service9711147426✨Call In girls Gurgaon Sector 31. SCO 25 escort service
9711147426✨Call In girls Gurgaon Sector 31. SCO 25 escort service
 
RadioAdProWritingCinderellabyButleri.pdf
RadioAdProWritingCinderellabyButleri.pdfRadioAdProWritingCinderellabyButleri.pdf
RadioAdProWritingCinderellabyButleri.pdf
 
Generative AI for Social Good at Open Data Science East 2024
Generative AI for Social Good at Open Data Science East 2024Generative AI for Social Good at Open Data Science East 2024
Generative AI for Social Good at Open Data Science East 2024
 
Call Girls In Dwarka 9654467111 Escorts Service
Call Girls In Dwarka 9654467111 Escorts ServiceCall Girls In Dwarka 9654467111 Escorts Service
Call Girls In Dwarka 9654467111 Escorts Service
 
Call Girls in Saket 99530🔝 56974 Escort Service
Call Girls in Saket 99530🔝 56974 Escort ServiceCall Girls in Saket 99530🔝 56974 Escort Service
Call Girls in Saket 99530🔝 56974 Escort Service
 
Saket, (-DELHI )+91-9654467111-(=)CHEAP Call Girls in Escorts Service Saket C...
Saket, (-DELHI )+91-9654467111-(=)CHEAP Call Girls in Escorts Service Saket C...Saket, (-DELHI )+91-9654467111-(=)CHEAP Call Girls in Escorts Service Saket C...
Saket, (-DELHI )+91-9654467111-(=)CHEAP Call Girls in Escorts Service Saket C...
 
High Class Call Girls Noida Sector 39 Aarushi 🔝8264348440🔝 Independent Escort...
High Class Call Girls Noida Sector 39 Aarushi 🔝8264348440🔝 Independent Escort...High Class Call Girls Noida Sector 39 Aarushi 🔝8264348440🔝 Independent Escort...
High Class Call Girls Noida Sector 39 Aarushi 🔝8264348440🔝 Independent Escort...
 
From idea to production in a day – Leveraging Azure ML and Streamlit to build...
From idea to production in a day – Leveraging Azure ML and Streamlit to build...From idea to production in a day – Leveraging Azure ML and Streamlit to build...
From idea to production in a day – Leveraging Azure ML and Streamlit to build...
 
Consent & Privacy Signals on Google *Pixels* - MeasureCamp Amsterdam 2024
Consent & Privacy Signals on Google *Pixels* - MeasureCamp Amsterdam 2024Consent & Privacy Signals on Google *Pixels* - MeasureCamp Amsterdam 2024
Consent & Privacy Signals on Google *Pixels* - MeasureCamp Amsterdam 2024
 
Multiple time frame trading analysis -brianshannon.pdf
Multiple time frame trading analysis -brianshannon.pdfMultiple time frame trading analysis -brianshannon.pdf
Multiple time frame trading analysis -brianshannon.pdf
 
How we prevented account sharing with MFA
How we prevented account sharing with MFAHow we prevented account sharing with MFA
How we prevented account sharing with MFA
 
Heart Disease Classification Report: A Data Analysis Project
Heart Disease Classification Report: A Data Analysis ProjectHeart Disease Classification Report: A Data Analysis Project
Heart Disease Classification Report: A Data Analysis Project
 
IMA MSN - Medical Students Network (2).pptx
IMA MSN - Medical Students Network (2).pptxIMA MSN - Medical Students Network (2).pptx
IMA MSN - Medical Students Network (2).pptx
 
GA4 Without Cookies [Measure Camp AMS]
GA4 Without Cookies [Measure Camp AMS]GA4 Without Cookies [Measure Camp AMS]
GA4 Without Cookies [Measure Camp AMS]
 
Kantar AI Summit- Under Embargo till Wednesday, 24th April 2024, 4 PM, IST.pdf
Kantar AI Summit- Under Embargo till Wednesday, 24th April 2024, 4 PM, IST.pdfKantar AI Summit- Under Embargo till Wednesday, 24th April 2024, 4 PM, IST.pdf
Kantar AI Summit- Under Embargo till Wednesday, 24th April 2024, 4 PM, IST.pdf
 
Deep Generative Learning for All - The Gen AI Hype (Spring 2024)
Deep Generative Learning for All - The Gen AI Hype (Spring 2024)Deep Generative Learning for All - The Gen AI Hype (Spring 2024)
Deep Generative Learning for All - The Gen AI Hype (Spring 2024)
 
办理学位证纽约大学毕业证(NYU毕业证书)原版一比一
办理学位证纽约大学毕业证(NYU毕业证书)原版一比一办理学位证纽约大学毕业证(NYU毕业证书)原版一比一
办理学位证纽约大学毕业证(NYU毕业证书)原版一比一
 

Scala and Spark are Ideal for Big Data - Data Science Pop-up Seattle

  • 1. #datapopupseattle Scala and Spark are Ideal for Big Data John Nestor Sr Architect, 47 Degrees 47deg
  • 2. #datapopupseattle UNSTRUCTURED Data Science POP-UP in Seattle www.dominodatalab.com D Produced by Domino Data Lab Domino’s enterprise data science platform is used by leading analytical organizations to increase productivity, enable collaboration, and publish models into production faster.
  • 3. Scala and Spark are Ideal for Big Data John Nestor 47 Degrees Seattle Unstructured Data Science Pop-Up October 7, 2015 www.47deg.com 347deg.com
  • 4. 47deg.com Why Scala? • Strong typing • Concise elegant syntax • Runs on JVM (Java Virtual Machine) • Supports both object-oriented and functional • Small simple programs through large parallel distributed systems • Easy to cleanly extend with new libraries and DSL’s • Ideal for parallel and distributed systems 4
  • 5. 47deg.com Scala: Strong Typing and Concise Syntax • Strong typing like Java. • Compile time checks • Better modularity via strongly typed interfaces • Easier maintenance: types make code easier to understand • Concise syntax like Python. • Type inference. Compiler infers most types that had to be explicit in Java. • Powerful syntax that avoid much of the boilerplate of Java code (see next slide). • Best of both worlds: safety of strong typing with conciseness (like Python). 5
  • 6. 47deg.com Scala Case Class • Java version
 
 class User {
 private String name;
 private Int age;
 public User(String name, Int age) {
 this.name = name; this.age = age;
 }
 public getAge() { return age; }
 public setAge(Int age) { this.age = age;}
 }
 User joe = new User(“Joe”, 30);
 • Scala version
 
 case class User(name:String, var age:Int)
 val joe = User(“Joe”, 30) 6
  • 7. 47deg.com Functional Scala • Anonymous functions. 
 (a:Int,b:Int) => a+b • Functions that take and return other functions. • Rarely need variables or loops • Immutable collections: Seq[T], Map[K,V], … • Works well with concurrent or distributed systems • Natural for functional programming • Functional collection operations (a small sample) • map, flatMap, reduce, … • filter, groupBy, sortBy, take, drop, … 7
  • 8. 47deg.com Scala Availability and Support • Open Source • Typesafe provides support. Founded my Martin Odersky who designed Scala. • IDEs: Intellij IDEA and Eclipse • Libraries: lots now and more every day • ScalaNLP - Epic (natural language processing) • Major Scala users: LinkedIn, Twitter, Goldman Sachs, Coursera, Angies List, Whitepages • Major systems written in Scala: Spark, Kafka 8
  • 9. 47deg.com Typesafe Scala Components • Scala Compiler (includes REPL) • Scala Standard Libraries • SBT - Scala Build Tool • Play - scaleable web applications • Scala JS - compiles Scala to JavaScript • Akka - for parallel and distributed computation • Spray - high performance asynchronous TCP/ HTTP library • Spark - Typesafe also supports Spark • Slick - for SQL database access • ConductR - Scala deployment/devops tool • Reactive Monitoring (Beta) 9
  • 10. 47deg.com Why Spark? • Support for not only batch but also (near) real-time • Fast - keeps data in memory as much as possible • Often 10X to 100X Hadoop speed • A clean easy-to-use API • A richer set of functional operations than just map and reduce • A foundation for a wide set of integrated data applications • Can recover from failures - recompute or (optional) replication • Scalable for very large data sets and reduced time 10
  • 11. 47deg.com Spark RDDs • RDD[T] - resilient distributed data set • typed (must be serializable) • immutable • ordered • can be processed in parallel • lazy evaluation - permits more global optimizations • Rich set of functional operations ( a small sample) • map, flatMap, reduce, … • filter, groupBy, sortBy, take, drop, … 11
  • 12. 47deg.com Spark Components • Spark Core • Scalable multi-node cluster • Failure detection and recovery • RDDs and functional operations • MLLib - for machine learning • linear regression, SVMs, clustering, collaborative filtering, dimension reduction • more on the way! • GraphX - for graph computation • Streaming - for near real-time • Dataframes - for SQL and Json 12
  • 13. 47deg.com Spark Availability and Support • Open Source - top level Apache project • Over 750 contributors from over 200 organizations • Can process multiple petabytes on clusters of over 8000 nodes • Databricks. Matei Zaharia who wrote the original Spark is a founder and CTO • Packages (more every day) • Zeppelin - Scala notebooks • Cassandra, Kafka connectors 13
  • 14. 47deg.com Clusters and Scalability • Scala Akka clusters (process distribution, micro services) • message passing • remote Actors • Spark clusters (data distribution) • local • Stand alone (optionally with ZooKeeper) • Apache Mesos • Hadoop Yarn • can run above on Amazon and Google clouds 14
  • 15. 47deg.com Why Scala for Spark? • Why not Python, R, or Java for Spark? • Spark is written in Scala • Scala source code is important Spark documentation • Spark is best extended in Scala • The primary API for Spark is Scala • The functional features of Scala and Spark are a natural fit and easiest to use in Scala • If you want to build scalable high performance production code based on Spark, R by itself is too specialized, Python is too slow and Java is tedious to write and maintain 15
  • 17. 47deg.com Seattle Resources • Seattle Meetups • Scala at the Sea Meetup 
 http://www.meetup.com/Seattle-Scala-User-Group/ • Seattle Spark Meetup 
 http://www.meetup.com/Seattle-Spark-Meetup/ • Seattle Training: Spark and Typesafe Scala Classes http://www.47deg.com/events#training • UW Scala Professional Certificate Program 
 http://www.pce.uw.edu/certificates/scala-functional-reactive-programming.html 17