An Introduction to Spark 
Jai Ranganathan, Senior Director Product Management, Cloudera 
Denny Lee, Senior Director Data Sciences Engineering, Concur
Agenda 
• Cloudera’s Enterprise Data Hub 
• Why Spark? 
• Spark Use Cases 
• Concur Case Study 
• Cloudera and Spark 
• Future of Spark 
2 ©2014 Cloudera, Inc. All rights reserved.
Cloudera’s Enterprise Data Hub 
[Diagram: platform layers]
• 3rd-party apps
• Batch processing: MapReduce, Spark
• Analytic SQL: Impala
• Search engine: Solr
• Machine learning: Spark, partners, Mahout, MLlib
• Stream processing: Spark
• Workload management: YARN
• Storage for any type of data (unified, elastic, resilient, secure): filesystem (HDFS), online NoSQL (HBase)
• Data management: Cloudera Navigator
• System management: Cloudera Manager
• Sentry
Spark: Easy and Fast Big Data 
Easy to Develop
• Rich APIs in Java, Scala, Python
• Interactive shell
• 2-5× less code
Fast to Run
• General execution graphs
• In-memory storage
• Up to 10× faster on disk, 100× in memory
Easy: Expressive API 
• map 
• filter 
• groupBy 
• sort 
• union 
• join 
• leftOuterJoin 
• rightOuterJoin 
• reduce 
• count 
• fold 
• reduceByKey 
• groupByKey 
• cogroup 
• cross 
• zip 
• sample 
• take 
• first 
• partitionBy 
• mapWith 
• pipe 
• save ...
Example: Logistic Regression 
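from math import exp
import numpy

# readPoint, D and iterations are defined elsewhere (not shown on the slide)
data = spark.textFile(...).map(readPoint).cache()
w = numpy.random.rand(D)
for i in range(iterations):
    # Sum the per-point gradients of the logistic loss (labels y in {-1, +1})
    gradient = (data
        .map(lambda p: (1 / (1 + exp(-p.y * w.dot(p.x))) - 1) * p.y * p.x)
        .reduce(lambda x, y: x + y))
    w -= gradient
print("Final w: %s" % w)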
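The same update can be traced without a cluster. The sketch below is a plain-Python stand-in, not Spark code: the toy data set, the helper names (dot, point_gradient, add, train) and the iteration count are all assumptions made for illustration. It shows what the map step (one gradient per point) and the reduce step (summing them) compute.

```python
from functools import reduce
from math import exp

# Hypothetical toy data: (features, label) pairs with labels in {-1, +1}.
POINTS = [
    ((1.0, 2.0), 1.0),
    ((2.0, 1.5), 1.0),
    ((-1.0, -1.5), -1.0),
    ((-2.0, -0.5), -1.0),
]


def dot(a, b):
    """Inner product of two equal-length tuples."""
    return sum(ai * bi for ai, bi in zip(a, b))


def point_gradient(w, x, y):
    """Gradient of log(1 + exp(-y * w.x)) with respect to w for one point."""
    scale = (1.0 / (1.0 + exp(-y * dot(w, x))) - 1.0) * y
    return tuple(scale * xi for xi in x)


def add(a, b):
    """Element-wise sum, used as the reduce function."""
    return tuple(ai + bi for ai, bi in zip(a, b))


def train(points, iterations=20):
    w = (0.0, 0.0)
    for _ in range(iterations):
        # "map" each point to its gradient, then "reduce" by summation
        gradient = reduce(add, (point_gradient(w, x, y) for x, y in points))
        w = tuple(wi - gi for wi, gi in zip(w, gradient))
    return w


w = train(POINTS)
# Classify each training point by the sign of w.x
predictions = [1.0 if dot(w, x) > 0 else -1.0 for x, _ in POINTS]
```

In Spark the per-point gradients are computed in parallel on cached partitions, which is why only the first pass pays the cost of reading the input.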
Spark Introduces the Concept of RDDs to Take Advantage of Memory
RDD = Resilient Distributed Dataset
• Memory caching layer that stores data in a distributed, fault-tolerant cache
• Created by parallel transformations on data in stable storage
Two observations:
a. Can fall back to disk when the dataset does not fit in memory
b. Provides fault tolerance through the concept of lineage
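Lineage can be illustrated with a toy model. The class below is a hypothetical single-process stand-in (ToyDataset and its methods are invented for this sketch and are not Spark's API): each dataset remembers its parent and the transformation that produced it, so a lost cached copy can be rebuilt by replaying those steps from stable storage.

```python
class ToyDataset:
    """Hypothetical stand-in for an RDD: remembers how it was derived."""

    def __init__(self, source=None, parent=None, transform=None):
        self.source = source        # data in "stable storage" (root only)
        self.parent = parent        # lineage: dataset this one derives from
        self.transform = transform  # lineage: function applied to the parent
        self._cache = None          # in-memory copy, which may be lost

    def map(self, fn):
        return ToyDataset(parent=self,
                          transform=lambda rows: [fn(r) for r in rows])

    def filter(self, pred):
        return ToyDataset(parent=self,
                          transform=lambda rows: [r for r in rows if pred(r)])

    def cache(self):
        self._cache = self.compute()
        return self

    def evict(self):
        """Simulate losing the cached copy, for example on a failed node."""
        self._cache = None

    def compute(self):
        if self._cache is not None:
            return self._cache
        if self.parent is None:
            return list(self.source)
        # No cached copy: replay the recorded transformation on the parent
        return self.transform(self.parent.compute())


numbers = ToyDataset(source=[1, 2, 3, 4, 5, 6])
even_squares = numbers.map(lambda n: n * n).filter(lambda n: n % 2 == 0).cache()

before = even_squares.compute()   # served from the cache
even_squares.evict()
after = even_squares.compute()    # rebuilt by replaying map and filter
```

The design point is that only the recipe needs to survive a failure, not a replica of the data.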
Fast: Using RAM, Operator Graphs 
In-Memory Caching
• Data partitions read from RAM instead of disk
Operator Graphs
• Scheduling optimizations
• Fault tolerance
[Diagram: operator graph of RDDs A to F connected by map, filter, groupBy, join and take; the legend distinguishes RDDs from cached partitions]
Easy: Out of the Box Functionality 
Hadoop Integration 
• Standard Hadoop data formats 
• Runs under YARN in mixed clusters 
Libraries 
• MLlib – Machine Learning toolkit
• GraphX (alpha) – Graph analytics based on PowerGraph abstractions
• Spark Streaming – Near real-time analytics
• Spark SQL – direct SQL interface in a Spark application
Language support:
• SparkR (upcoming)
• Java 8
• Schema support in Spark’s APIs
• SQL support in Spark Streaming (upcoming)
Logistic Regression Performance 
(Data Fits in Memory) 
[Chart: running time (s) versus number of iterations (1, 5, 10, 20, 30), Hadoop compared with Spark]
• Hadoop: 110 s / iteration
• Spark: first iteration 80 s, further iterations 1 s
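A quick worked calculation shows what these per-iteration figures imply for total running time. The sketch assumes the costs quoted on the slide stay constant across iterations; the function names are invented for illustration.

```python
# Per-iteration costs quoted on the slide, in seconds.
HADOOP_SECONDS_PER_ITERATION = 110
SPARK_FIRST_ITERATION_SECONDS = 80
SPARK_LATER_ITERATION_SECONDS = 1


def hadoop_total(iterations):
    """Every iteration re-reads the input, so the cost is linear."""
    return HADOOP_SECONDS_PER_ITERATION * iterations


def spark_total(iterations):
    """The first iteration loads and caches the data; later ones hit memory."""
    if iterations == 0:
        return 0
    return (SPARK_FIRST_ITERATION_SECONDS
            + SPARK_LATER_ITERATION_SECONDS * (iterations - 1))


for n in (1, 5, 10, 20, 30):
    print("%2d iterations: Hadoop %4d s, Spark %3d s"
          % (n, hadoop_total(n), spark_total(n)))
```

At 30 iterations this gives 3300 s for Hadoop against 109 s for Spark, roughly a 30× difference under these assumptions.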
Spark Streaming 
What is it? 
• Runs continuous processing of data using Spark’s core API, extending Spark concepts to fault-tolerant, transformable streams
• Adds “rolling window” operations, e.g. computing rolling averages or counts over the last five minutes of data
Why do you care?
• Same programming paradigm for streaming and batch – reuse knowledge and code in both contexts
• High-level API with automatic DAG generation – simplicity of development
• Excellent throughput – scales easily to very large volumes of ingested data
• Combine elements like MLlib & Oryx into a streaming application
Example use cases:
• “On-the-fly” ETL as data is ingested into Hadoop/HDFS
• Detecting anomalous behavior and triggering alerts
• Continuous reporting of summary metrics for incoming data
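The rolling-window idea can be sketched in plain Python. This is not the Spark Streaming API: RollingWindow, its methods and the sample batches are hypothetical, and each inner list stands for one micro-batch (say, one minute of data) with the window covering the last five.

```python
from collections import deque


class RollingWindow:
    """Keeps the most recent micro-batches and aggregates over them."""

    def __init__(self, window_batches):
        # deque with maxlen silently drops the oldest batch when full
        self.batches = deque(maxlen=window_batches)

    def add_batch(self, values):
        self.batches.append(list(values))

    def count(self):
        return sum(len(batch) for batch in self.batches)

    def average(self):
        n = self.count()
        if n == 0:
            return None
        return sum(sum(batch) for batch in self.batches) / n


window = RollingWindow(window_batches=5)
results = []
for batch in ([10, 20], [30], [], [40, 50, 60], [70], [80, 90]):
    window.add_batch(batch)
    results.append((window.count(), window.average()))
```

When the sixth batch arrives, the first one falls out of the window, so the count and average always describe only the most recent five batches.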
Streaming Architectures with Spark 
[Diagram: streaming pipeline]
• Data sources
• Integration layer: Flume, Kafka
• Ingest
• HDFS
• Spark stream processing: data prep, aggregation / scoring
• HBase
• Spark long-term analytics / model building
• Real-time result serving
Cloudera Customer Use Cases – Core Spark 
Sector: Financial Services
Use cases:
• Multiple use cases to calculate VaR for portfolio risk analysis: Monte Carlo simulations as well as Var-Covar methods
• ETL pipeline speed-up
• Analyzing stock data for 20 years
Replaces:
• Home grown applications
Sector: Genomics
Use cases:
• Two use cases to identify disease-causing genes in the full human genome
Replaces:
• MySQL engine
Sector: Data services
Use cases:
• Trend analysis using statistical methods on large data sets
• Document classification (LDA)
• Fraud analytics
Replaces:
• Netezza replacement
• Net new
Sector: Healthcare
Use cases:
• Calculating Jaccard scores on health care data sets
Replaces:
• Net new
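For reference, a Jaccard score is the size of the intersection of two sets divided by the size of their union. The sketch below is a minimal single-machine version; the patient records are invented for illustration, and returning 1.0 for two empty sets is a convention chosen here rather than anything stated in the slides.

```python
def jaccard(a, b):
    """Jaccard score |A ∩ B| / |A ∪ B| for two collections of items."""
    a, b = set(a), set(b)
    union = a | b
    if not union:
        return 1.0  # convention chosen here: two empty sets count as identical
    return len(a & b) / len(union)


# Hypothetical records: the set of conditions recorded for each patient.
patient_a = {"diabetes", "hypertension", "asthma"}
patient_b = {"hypertension", "asthma", "migraine"}

score = jaccard(patient_a, patient_b)  # 2 shared items out of 4 distinct
```

On a cluster the same function would be applied to many pairs of records in parallel.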
Cloudera Customer Use Cases – Streaming 
Sector: Financial Services
Use cases:
• On-line fraud detection
Replaces:
• Net new
Sector: Many
Use cases:
• Continuous ETL
Sector: Retail
Use cases:
• On-line recommender systems
• Inventory management
Replaces:
• Custom apps
Spark at Concur
About Concur 
What do we do? 
• Leading provider of spend management solutions and services (Travel, Invoice, TripIt, etc.) in the world
• Global customer base of 20,000 clients and 25 million users
• Processing more than $50 Billion in Travel & Expense (T&E) spend each year
About the Speaker
Who Am I?
• Long time SQL Server BI guy (24TB Yahoo! Cube)
• Project Isotope (Hadoop on Windows and Azure)
• At Concur, helping with Big Data and Data Sciences
A long time ago… 
• We started using Hadoop because 
• It was free 
• i.e. Didn’t want to pay for a big data warehouse 
• Could slowly extract from hundreds of relational data sources, consolidate it, and query it 
• We were not thinking about advanced analytics 
• We were thinking …. “cheaper reporting” 
• We have some hardware lying around … let’s cobble it together and now we have reports
Themes 
Consolidate Visualize Insight Recommend
[Diagram: BTS, Travel, Weather, Invoice, Web Analytics, Expense]
Can quickly switch to map mode and determine where most itineraries originated in 2013
Or even quickly plot the airport locations on a map to see that Sun Moon Lake Airport is in the center of Taiwan
Starbucks Store #3313 
601 108th Ave NE 
Bellevue, WA (425) 646-9602 
------------------------------- 
Chk 713452 
05/14/2014 11:04 AM 
1961558 Drawer: 1 Reg: 1 
------------------------------- 
Bacon Art Brkfst 3.45 
Warmed 
T1 Latte 2.70 
Triple 1.50 
Soy 0.60 
Gr Vanilla Mac 4.15 
Reload Card 50.00 
AMEX $50.00 
XXXXXXXXXXXXXXXXXX1004 
SBUX Card $13.56 
SUBTOTAL $62.40 
New Caffe Espresso 
Frappuccino(R) Blended beverage 
Our Signature 
Frappuccino(R) roast coffee and 
fresh milk, blended with ice. 
Topped with our new espresso 
whipped cream and new 
Italian roast drizzle 
Expense Categorization 
One of my receipts that I had OCRed
One of the issues we’re trying to solve is auto-categorizing receipts like this one, so how can we do this?
Below is a simplistic solution using WordCount
Note: a real solution should involve machine learning algorithms
Spark assembly has been built with Hive, including Datanucleus jars on classpath
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 1.1.0
      /_/
Using Scala version 2.10.4 (Java HotSpot(TM) 64-Bit Server VM, Java 1.7.0_45)
Type in expressions to have them evaluated.
Type :help for more information.
2014-09-07 22:31:21.064 java[1871:15527] Unable to load realm info from SCDynamicStore
14/09/07 22:31:21 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Spark context available as sc.
scala> val receipt = sc.textFile("/usr/local/Cellar/workspace/data/receipt/receipt.txt")
receipt: org.apache.spark.rdd.RDD[String] = /usr/local/Cellar/workspace/data/receipt/receipt.txt MappedRDD[1] at textFile at <console>:12
scala> receipt.count
res0: Long = 30
scala> val words = receipt.flatMap(_.split(" "))
words: org.apache.spark.rdd.RDD[String] = FlatMappedRDD[2] at flatMap at <console>:14
scala> words.count
res1: Long = 161
scala> words.distinct.count
res2: Long = 72
scala> val wordCounts = words.map(x => (x, 1)).reduceByKey(_ + _).map{case(x,y) => (y,x)}.sortByKey(false).map{case(i,j) => (j, i)}
wordCounts: org.apache.spark.rdd.RDD[(String, Int)] = MappedRDD[12] at map at <console>:16
scala> wordCounts.take(12)
res5: Array[(String, Int)] = Array(("",82), (with,2), (Card,2), (new,2), (-------------------------------,2), (Frappuccino(R),2), (roast,2), (1,2), (and,2), (New,1), (Topped,1), (Starbucks,1))
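The top entry in that result, the empty string with 82 occurrences, comes from splitting on a single space: every run of consecutive spaces in the receipt yields empty tokens. The plain-Python sketch below reproduces the effect on a hypothetical excerpt (not the full receipt file) and shows one way to drop the empty tokens before counting; word_counts and drop_empty are names invented for this illustration.

```python
from collections import Counter

# Hypothetical excerpt with runs of spaces, as OCR output often has.
RECEIPT_LINES = [
    "Starbucks Store #3313",
    "Reload Card" + " " * 3 + "50.00",
    "SBUX Card" + " " * 4 + "$13.56",
    "Topped with our new espresso",
    "whipped cream and new",
]


def word_counts(lines, drop_empty=False):
    """Count tokens produced by splitting each line on a single space."""
    # Equivalent of flatMap(_.split(" ")): one flat list of tokens
    words = [word for line in lines for word in line.split(" ")]
    if drop_empty:
        words = [word for word in words if word]
    return Counter(words)


raw = word_counts(RECEIPT_LINES)                     # includes "" tokens
clean = word_counts(RECEIPT_LINES, drop_empty=True)  # empty tokens removed
top = clean.most_common(2)
```

In the Scala session the equivalent change would be a filter step on words before the map and reduceByKey.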
What’s next…
• With Spark 1.1
• Sort-based shuffling
• MLlib: correlations, sampling, feature extraction, decision trees
• GraphX: label propagation
Using AtScale to build up a dimensional model based on the data that is stored within Impala / Hive
Slice and filter the Impala model using Tableau
Spark and Cloudera
Why Cloudera? 
Expertise 
• Deep engineering investment – only distribution vendor with engineering contributions to Spark and actual technical know-how
• Field team, support, training and services with experience in many Spark use cases
• Driving roadmap for Spark
Experience
• Most customers running Spark across all distributions put together
• Ranging from a few nodes to 800+ nodes
• Longest field presence – first vendor to support Spark and still one of only two vendors with official support
Partnerships
• Intel partnership brings 15 Spark developers focused on Cloudera customer use cases
• Business relationship with Databricks to do joint development on Spark
Spark Takes Over From MapReduce 
Stage 1 
• Crunch on Spark 
• Search on Spark 
Stage 2 
• Hive on Spark 
• Pig on Spark 
Stage 3 
• MR equivalence 
• Sqoop on Spark 
Cloudera-led multi-organization effort: MapR, Intel, Databricks, IBM
Spark is Great but… 
• Opaque API limitations 
• Debugging and troubleshooting 
• Complex configuration 
Cloudera University Spark Training
Questions & Next Steps 
Download Now – www.cloudera.com/download 
Spark Training - www.cloudera.com/content/cloudera/en/training/courses/spark-training.html
Thank You

Contenu connexe

Tendances

Intro to Apache Spark by Marco Vasquez
Intro to Apache Spark by Marco VasquezIntro to Apache Spark by Marco Vasquez
Intro to Apache Spark by Marco VasquezMapR Technologies
 
Apache Drill and Zeppelin: Two Promising Tools You've Never Heard Of
Apache Drill and Zeppelin: Two Promising Tools You've Never Heard OfApache Drill and Zeppelin: Two Promising Tools You've Never Heard Of
Apache Drill and Zeppelin: Two Promising Tools You've Never Heard OfCharles Givre
 
Logical Data Warehouse: How to Build a Virtualized Data Services Layer
Logical Data Warehouse: How to Build a Virtualized Data Services LayerLogical Data Warehouse: How to Build a Virtualized Data Services Layer
Logical Data Warehouse: How to Build a Virtualized Data Services LayerDataWorks Summit
 
Transitioning Compute Models: Hadoop MapReduce to Spark
Transitioning Compute Models: Hadoop MapReduce to SparkTransitioning Compute Models: Hadoop MapReduce to Spark
Transitioning Compute Models: Hadoop MapReduce to SparkSlim Baltagi
 
Have your Cake and Eat it Too - Architecture for Batch and Real-time processing
Have your Cake and Eat it Too - Architecture for Batch and Real-time processingHave your Cake and Eat it Too - Architecture for Batch and Real-time processing
Have your Cake and Eat it Too - Architecture for Batch and Real-time processingDataWorks Summit
 
Interactive SQL-on-Hadoop and JethroData
Interactive SQL-on-Hadoop and JethroDataInteractive SQL-on-Hadoop and JethroData
Interactive SQL-on-Hadoop and JethroDataOfir Manor
 
A New "Sparkitecture" for modernizing your data warehouse
A New "Sparkitecture" for modernizing your data warehouseA New "Sparkitecture" for modernizing your data warehouse
A New "Sparkitecture" for modernizing your data warehouseDataWorks Summit/Hadoop Summit
 
Scaling up with hadoop and banyan at ITRIX-2015, College of Engineering, Guindy
Scaling up with hadoop and banyan at ITRIX-2015, College of Engineering, GuindyScaling up with hadoop and banyan at ITRIX-2015, College of Engineering, Guindy
Scaling up with hadoop and banyan at ITRIX-2015, College of Engineering, GuindyRohit Kulkarni
 
Hive, Impala, and Spark, Oh My: SQL-on-Hadoop in Cloudera 5.5
Hive, Impala, and Spark, Oh My: SQL-on-Hadoop in Cloudera 5.5Hive, Impala, and Spark, Oh My: SQL-on-Hadoop in Cloudera 5.5
Hive, Impala, and Spark, Oh My: SQL-on-Hadoop in Cloudera 5.5Cloudera, Inc.
 
Disaster Recovery and Cloud Migration for your Apache Hive Warehouse
Disaster Recovery and Cloud Migration for your Apache Hive WarehouseDisaster Recovery and Cloud Migration for your Apache Hive Warehouse
Disaster Recovery and Cloud Migration for your Apache Hive WarehouseDataWorks Summit
 
Sherlock: an anomaly detection service on top of Druid
Sherlock: an anomaly detection service on top of Druid Sherlock: an anomaly detection service on top of Druid
Sherlock: an anomaly detection service on top of Druid DataWorks Summit
 
Hadoop World 2011: Hadoop and RDBMS with Sqoop and Other Tools - Guy Harrison...
Hadoop World 2011: Hadoop and RDBMS with Sqoop and Other Tools - Guy Harrison...Hadoop World 2011: Hadoop and RDBMS with Sqoop and Other Tools - Guy Harrison...
Hadoop World 2011: Hadoop and RDBMS with Sqoop and Other Tools - Guy Harrison...Cloudera, Inc.
 
A Container-based Sizing Framework for Apache Hadoop/Spark Clusters
A Container-based Sizing Framework for Apache Hadoop/Spark ClustersA Container-based Sizing Framework for Apache Hadoop/Spark Clusters
A Container-based Sizing Framework for Apache Hadoop/Spark ClustersDataWorks Summit/Hadoop Summit
 
Securing data in hybrid environments using Apache Ranger
Securing data in hybrid environments using Apache RangerSecuring data in hybrid environments using Apache Ranger
Securing data in hybrid environments using Apache RangerDataWorks Summit
 
A Closer Look at Apache Kudu
A Closer Look at Apache KuduA Closer Look at Apache Kudu
A Closer Look at Apache KuduAndriy Zabavskyy
 
Building a Hadoop Data Warehouse with Impala
Building a Hadoop Data Warehouse with ImpalaBuilding a Hadoop Data Warehouse with Impala
Building a Hadoop Data Warehouse with ImpalaSwiss Big Data User Group
 
Let Spark Fly: Advantages and Use Cases for Spark on Hadoop
 Let Spark Fly: Advantages and Use Cases for Spark on Hadoop Let Spark Fly: Advantages and Use Cases for Spark on Hadoop
Let Spark Fly: Advantages and Use Cases for Spark on HadoopMapR Technologies
 

Tendances (20)

Intro to Apache Spark by Marco Vasquez
Intro to Apache Spark by Marco VasquezIntro to Apache Spark by Marco Vasquez
Intro to Apache Spark by Marco Vasquez
 
Apache Drill and Zeppelin: Two Promising Tools You've Never Heard Of
Apache Drill and Zeppelin: Two Promising Tools You've Never Heard OfApache Drill and Zeppelin: Two Promising Tools You've Never Heard Of
Apache Drill and Zeppelin: Two Promising Tools You've Never Heard Of
 
Logical Data Warehouse: How to Build a Virtualized Data Services Layer
Logical Data Warehouse: How to Build a Virtualized Data Services LayerLogical Data Warehouse: How to Build a Virtualized Data Services Layer
Logical Data Warehouse: How to Build a Virtualized Data Services Layer
 
Transitioning Compute Models: Hadoop MapReduce to Spark
Transitioning Compute Models: Hadoop MapReduce to SparkTransitioning Compute Models: Hadoop MapReduce to Spark
Transitioning Compute Models: Hadoop MapReduce to Spark
 
Have your Cake and Eat it Too - Architecture for Batch and Real-time processing
Have your Cake and Eat it Too - Architecture for Batch and Real-time processingHave your Cake and Eat it Too - Architecture for Batch and Real-time processing
Have your Cake and Eat it Too - Architecture for Batch and Real-time processing
 
Interactive SQL-on-Hadoop and JethroData
Interactive SQL-on-Hadoop and JethroDataInteractive SQL-on-Hadoop and JethroData
Interactive SQL-on-Hadoop and JethroData
 
A New "Sparkitecture" for modernizing your data warehouse
A New "Sparkitecture" for modernizing your data warehouseA New "Sparkitecture" for modernizing your data warehouse
A New "Sparkitecture" for modernizing your data warehouse
 
Introduction to Apache Kudu
Introduction to Apache KuduIntroduction to Apache Kudu
Introduction to Apache Kudu
 
Scaling up with hadoop and banyan at ITRIX-2015, College of Engineering, Guindy
Scaling up with hadoop and banyan at ITRIX-2015, College of Engineering, GuindyScaling up with hadoop and banyan at ITRIX-2015, College of Engineering, Guindy
Scaling up with hadoop and banyan at ITRIX-2015, College of Engineering, Guindy
 
Introduction to Hadoop
Introduction to HadoopIntroduction to Hadoop
Introduction to Hadoop
 
Hive, Impala, and Spark, Oh My: SQL-on-Hadoop in Cloudera 5.5
Hive, Impala, and Spark, Oh My: SQL-on-Hadoop in Cloudera 5.5Hive, Impala, and Spark, Oh My: SQL-on-Hadoop in Cloudera 5.5
Hive, Impala, and Spark, Oh My: SQL-on-Hadoop in Cloudera 5.5
 
Intro to Apache Spark
Intro to Apache SparkIntro to Apache Spark
Intro to Apache Spark
 
Disaster Recovery and Cloud Migration for your Apache Hive Warehouse
Disaster Recovery and Cloud Migration for your Apache Hive WarehouseDisaster Recovery and Cloud Migration for your Apache Hive Warehouse
Disaster Recovery and Cloud Migration for your Apache Hive Warehouse
 
Sherlock: an anomaly detection service on top of Druid
Sherlock: an anomaly detection service on top of Druid Sherlock: an anomaly detection service on top of Druid
Sherlock: an anomaly detection service on top of Druid
 
Hadoop World 2011: Hadoop and RDBMS with Sqoop and Other Tools - Guy Harrison...
Hadoop World 2011: Hadoop and RDBMS with Sqoop and Other Tools - Guy Harrison...Hadoop World 2011: Hadoop and RDBMS with Sqoop and Other Tools - Guy Harrison...
Hadoop World 2011: Hadoop and RDBMS with Sqoop and Other Tools - Guy Harrison...
 
A Container-based Sizing Framework for Apache Hadoop/Spark Clusters
A Container-based Sizing Framework for Apache Hadoop/Spark ClustersA Container-based Sizing Framework for Apache Hadoop/Spark Clusters
A Container-based Sizing Framework for Apache Hadoop/Spark Clusters
 
Securing data in hybrid environments using Apache Ranger
Securing data in hybrid environments using Apache RangerSecuring data in hybrid environments using Apache Ranger
Securing data in hybrid environments using Apache Ranger
 
A Closer Look at Apache Kudu
A Closer Look at Apache KuduA Closer Look at Apache Kudu
A Closer Look at Apache Kudu
 
Building a Hadoop Data Warehouse with Impala
Building a Hadoop Data Warehouse with ImpalaBuilding a Hadoop Data Warehouse with Impala
Building a Hadoop Data Warehouse with Impala
 
Let Spark Fly: Advantages and Use Cases for Spark on Hadoop
 Let Spark Fly: Advantages and Use Cases for Spark on Hadoop Let Spark Fly: Advantages and Use Cases for Spark on Hadoop
Let Spark Fly: Advantages and Use Cases for Spark on Hadoop
 

En vedette

Why Apache Spark is the Heir to MapReduce in the Hadoop Ecosystem
Why Apache Spark is the Heir to MapReduce in the Hadoop EcosystemWhy Apache Spark is the Heir to MapReduce in the Hadoop Ecosystem
Why Apache Spark is the Heir to MapReduce in the Hadoop EcosystemCloudera, Inc.
 
Spark on YARN: The Road Ahead
Spark on YARN: The Road AheadSpark on YARN: The Road Ahead
Spark on YARN: The Road AheadCloudera, Inc.
 
Решения Oracle для Big Data
Решения Oracle для Big DataРешения Oracle для Big Data
Решения Oracle для Big DataAndrey Akulov
 
Hadoopトレーニング番外編 〜間違えられやすいHadoopの7つの仕様〜
Hadoopトレーニング番外編 〜間違えられやすいHadoopの7つの仕様〜Hadoopトレーニング番外編 〜間違えられやすいHadoopの7つの仕様〜
Hadoopトレーニング番外編 〜間違えられやすいHadoopの7つの仕様〜Cloudera Japan
 
Introduction to Cloudera Search Training
Introduction to Cloudera Search TrainingIntroduction to Cloudera Search Training
Introduction to Cloudera Search TrainingCloudera, Inc.
 
Large-Scale Stream Processing in the Hadoop Ecosystem
Large-Scale Stream Processing in the Hadoop Ecosystem Large-Scale Stream Processing in the Hadoop Ecosystem
Large-Scale Stream Processing in the Hadoop Ecosystem DataWorks Summit/Hadoop Summit
 
Cloudera Impala: A modern SQL Query Engine for Hadoop
Cloudera Impala: A modern SQL Query Engine for HadoopCloudera Impala: A modern SQL Query Engine for Hadoop
Cloudera Impala: A modern SQL Query Engine for HadoopCloudera, Inc.
 
Apache HBase 入門 (第2回)
Apache HBase 入門 (第2回)Apache HBase 入門 (第2回)
Apache HBase 入門 (第2回)tatsuya6502
 
Integrating Hadoop Into the Enterprise
Integrating Hadoop Into the EnterpriseIntegrating Hadoop Into the Enterprise
Integrating Hadoop Into the EnterpriseDataWorks Summit
 
Introduction to Hadoop and Cloudera, Louisville BI & Big Data Analytics Meetup
Introduction to Hadoop and Cloudera, Louisville BI & Big Data Analytics MeetupIntroduction to Hadoop and Cloudera, Louisville BI & Big Data Analytics Meetup
Introduction to Hadoop and Cloudera, Louisville BI & Big Data Analytics Meetupiwrigley
 
Big Data Governance in Hadoop Environments with Cloudera Navigatorfeb2017meetu
Big Data Governance in Hadoop Environments with Cloudera Navigatorfeb2017meetuBig Data Governance in Hadoop Environments with Cloudera Navigatorfeb2017meetu
Big Data Governance in Hadoop Environments with Cloudera Navigatorfeb2017meetuEmre Sevinç
 
CET(Capture EveryThing)プロジェクトにおけるﰀ機械学 習・データマイニング最前線
CET(Capture EveryThing)プロジェクトにおけるﰀ機械学 習・データマイニング最前線CET(Capture EveryThing)プロジェクトにおけるﰀ機械学 習・データマイニング最前線
CET(Capture EveryThing)プロジェクトにおけるﰀ機械学 習・データマイニング最前線Recruit Lifestyle Co., Ltd.
 
Gremlin: A Graph-Based Programming Language
Gremlin: A Graph-Based Programming LanguageGremlin: A Graph-Based Programming Language
Gremlin: A Graph-Based Programming LanguageMarko Rodriguez
 
An introduction To Apache Spark
An introduction To Apache SparkAn introduction To Apache Spark
An introduction To Apache SparkAmir Sedighi
 
Five Tips for Running Cloudera on AWS
Five Tips for Running Cloudera on AWSFive Tips for Running Cloudera on AWS
Five Tips for Running Cloudera on AWSCloudera, Inc.
 
Architectural considerations for Hadoop Applications
Architectural considerations for Hadoop ApplicationsArchitectural considerations for Hadoop Applications
Architectural considerations for Hadoop Applicationshadooparchbook
 
Financial security and machine learning
Financial security and machine learningFinancial security and machine learning
Financial security and machine learningMk Kim
 
Impala Unlocks Interactive BI on Hadoop
Impala Unlocks Interactive BI on HadoopImpala Unlocks Interactive BI on Hadoop
Impala Unlocks Interactive BI on HadoopCloudera, Inc.
 
Cloudera Impala Overview (via Scott Leberknight)
Cloudera Impala Overview (via Scott Leberknight)Cloudera Impala Overview (via Scott Leberknight)
Cloudera Impala Overview (via Scott Leberknight)Cloudera, Inc.
 

En vedette (20)

Why Apache Spark is the Heir to MapReduce in the Hadoop Ecosystem
Why Apache Spark is the Heir to MapReduce in the Hadoop EcosystemWhy Apache Spark is the Heir to MapReduce in the Hadoop Ecosystem
Why Apache Spark is the Heir to MapReduce in the Hadoop Ecosystem
 
Spark on YARN: The Road Ahead
Spark on YARN: The Road AheadSpark on YARN: The Road Ahead
Spark on YARN: The Road Ahead
 
Решения Oracle для Big Data
Решения Oracle для Big DataРешения Oracle для Big Data
Решения Oracle для Big Data
 
Hadoopトレーニング番外編 〜間違えられやすいHadoopの7つの仕様〜
Hadoopトレーニング番外編 〜間違えられやすいHadoopの7つの仕様〜Hadoopトレーニング番外編 〜間違えられやすいHadoopの7つの仕様〜
Hadoopトレーニング番外編 〜間違えられやすいHadoopの7つの仕様〜
 
Introduction to Cloudera Search Training
Introduction to Cloudera Search TrainingIntroduction to Cloudera Search Training
Introduction to Cloudera Search Training
 
Large-Scale Stream Processing in the Hadoop Ecosystem
Large-Scale Stream Processing in the Hadoop Ecosystem Large-Scale Stream Processing in the Hadoop Ecosystem
Large-Scale Stream Processing in the Hadoop Ecosystem
 
Cloudera Impala: A modern SQL Query Engine for Hadoop
Cloudera Impala: A modern SQL Query Engine for HadoopCloudera Impala: A modern SQL Query Engine for Hadoop
Cloudera Impala: A modern SQL Query Engine for Hadoop
 
Apache HBase 入門 (第2回)
Apache HBase 入門 (第2回)Apache HBase 入門 (第2回)
Apache HBase 入門 (第2回)
 
Integrating Hadoop Into the Enterprise
Integrating Hadoop Into the EnterpriseIntegrating Hadoop Into the Enterprise
Integrating Hadoop Into the Enterprise
 
Introduction to Hadoop and Cloudera, Louisville BI & Big Data Analytics Meetup
Introduction to Hadoop and Cloudera, Louisville BI & Big Data Analytics MeetupIntroduction to Hadoop and Cloudera, Louisville BI & Big Data Analytics Meetup
Introduction to Hadoop and Cloudera, Louisville BI & Big Data Analytics Meetup
 
Big Data Governance in Hadoop Environments with Cloudera Navigatorfeb2017meetu
Big Data Governance in Hadoop Environments with Cloudera Navigatorfeb2017meetuBig Data Governance in Hadoop Environments with Cloudera Navigatorfeb2017meetu
Big Data Governance in Hadoop Environments with Cloudera Navigatorfeb2017meetu
 
CET(Capture EveryThing)プロジェクトにおけるﰀ機械学 習・データマイニング最前線
CET(Capture EveryThing)プロジェクトにおけるﰀ機械学 習・データマイニング最前線CET(Capture EveryThing)プロジェクトにおけるﰀ機械学 習・データマイニング最前線
CET(Capture EveryThing)プロジェクトにおけるﰀ機械学 習・データマイニング最前線
 
Gremlin: A Graph-Based Programming Language
Gremlin: A Graph-Based Programming LanguageGremlin: A Graph-Based Programming Language
Gremlin: A Graph-Based Programming Language
 
An introduction To Apache Spark
An introduction To Apache SparkAn introduction To Apache Spark
An introduction To Apache Spark
 
Five Tips for Running Cloudera on AWS
Five Tips for Running Cloudera on AWSFive Tips for Running Cloudera on AWS
Five Tips for Running Cloudera on AWS
 
Architectural considerations for Hadoop Applications
Architectural considerations for Hadoop ApplicationsArchitectural considerations for Hadoop Applications
Architectural considerations for Hadoop Applications
 
Financial security and machine learning
Financial security and machine learningFinancial security and machine learning
Financial security and machine learning
 
Impala Unlocks Interactive BI on Hadoop
Impala Unlocks Interactive BI on HadoopImpala Unlocks Interactive BI on Hadoop
Impala Unlocks Interactive BI on Hadoop
 
Big Data: Myths and Realities
Big Data: Myths and RealitiesBig Data: Myths and Realities
Big Data: Myths and Realities
 
Cloudera Impala Overview (via Scott Leberknight)
Cloudera Impala Overview (via Scott Leberknight)Cloudera Impala Overview (via Scott Leberknight)
Cloudera Impala Overview (via Scott Leberknight)
 

Similaire à The Future of Hadoop: A deeper look at Apache Spark

Data Engineer's Lunch #82: Automating Apache Cassandra Operations with Apache...
Data Engineer's Lunch #82: Automating Apache Cassandra Operations with Apache...Data Engineer's Lunch #82: Automating Apache Cassandra Operations with Apache...
Data Engineer's Lunch #82: Automating Apache Cassandra Operations with Apache...Anant Corporation
 
Processing Large Data with Apache Spark -- HasGeek
Processing Large Data with Apache Spark -- HasGeekProcessing Large Data with Apache Spark -- HasGeek
Processing Large Data with Apache Spark -- HasGeekVenkata Naga Ravi
 
Paris Data Geek - Spark Streaming
Paris Data Geek - Spark Streaming Paris Data Geek - Spark Streaming
Paris Data Geek - Spark Streaming Djamel Zouaoui
 
Spark One Platform Webinar
Spark One Platform WebinarSpark One Platform Webinar
Spark One Platform WebinarCloudera, Inc.
 
Designing & Optimizing Micro Batching Systems Using 100+ Nodes (Ananth Ram, R...
Designing & Optimizing Micro Batching Systems Using 100+ Nodes (Ananth Ram, R...Designing & Optimizing Micro Batching Systems Using 100+ Nodes (Ananth Ram, R...
Designing & Optimizing Micro Batching Systems Using 100+ Nodes (Ananth Ram, R...DataStax
 
Big Telco - Yousun Jeong
Big Telco - Yousun JeongBig Telco - Yousun Jeong
Big Telco - Yousun JeongSpark Summit
 
Big Telco Real-Time Network Analytics
Big Telco Real-Time Network AnalyticsBig Telco Real-Time Network Analytics
Big Telco Real-Time Network AnalyticsYousun Jeong
 
Cleveland Hadoop Users Group - Spark
Cleveland Hadoop Users Group - SparkCleveland Hadoop Users Group - Spark
Cleveland Hadoop Users Group - SparkVince Gonzalez
 
Apache Spark in Scientific Applciations
Apache Spark in Scientific ApplciationsApache Spark in Scientific Applciations
Apache Spark in Scientific ApplciationsDr. Mirko Kämpf
 
Apache Spark in Scientific Applications
Apache Spark in Scientific ApplicationsApache Spark in Scientific Applications
Apache Spark in Scientific ApplicationsDr. Mirko Kämpf
 
Real Time Data Processing Using Spark Streaming
Real Time Data Processing Using Spark StreamingReal Time Data Processing Using Spark Streaming
Real Time Data Processing Using Spark StreamingHari Shreedharan
 
Real Time Data Processing Using Spark Streaming
Real Time Data Processing Using Spark StreamingReal Time Data Processing Using Spark Streaming
Real Time Data Processing Using Spark StreamingHari Shreedharan
 
Real Time Data Processing using Spark Streaming | Data Day Texas 2015
Real Time Data Processing using Spark Streaming | Data Day Texas 2015Real Time Data Processing using Spark Streaming | Data Day Texas 2015
Real Time Data Processing using Spark Streaming | Data Day Texas 2015Cloudera, Inc.
 
New Developments in Spark
New Developments in SparkNew Developments in Spark
New Developments in SparkDatabricks
 
Large Scale Data Analytics with Spark and Cassandra on the DSE Platform
Large Scale Data Analytics with Spark and Cassandra on the DSE PlatformLarge Scale Data Analytics with Spark and Cassandra on the DSE Platform
Large Scale Data Analytics with Spark and Cassandra on the DSE PlatformDataStax Academy
 
Jump Start with Apache Spark 2.0 on Databricks
Jump Start with Apache Spark 2.0 on DatabricksJump Start with Apache Spark 2.0 on Databricks
Jump Start with Apache Spark 2.0 on DatabricksAnyscale
 
Spark Streaming& Kafka-The Future of Stream Processing by Hari Shreedharan of...
Spark Streaming& Kafka-The Future of Stream Processing by Hari Shreedharan of...Spark Streaming& Kafka-The Future of Stream Processing by Hari Shreedharan of...
Spark Streaming& Kafka-The Future of Stream Processing by Hari Shreedharan of...Data Con LA
 
Spark Streaming & Kafka-The Future of Stream Processing
Spark Streaming & Kafka-The Future of Stream ProcessingSpark Streaming & Kafka-The Future of Stream Processing
Spark Streaming & Kafka-The Future of Stream ProcessingJack Gudenkauf
 




The Future of Hadoop: A deeper look at Apache Spark

  • 1. An Introduction to Spark. Jai Ranganathan, Senior Director Product Management, Cloudera; Denny Lee, Senior Director Data Sciences Engineering, Concur
  • 2. Agenda • Cloudera’s Enterprise Data Hub • Why Spark? • Spark Use Cases • Concur Case Study • Cloudera and Spark • Future of Spark ©2014 Cloudera, Inc. All rights reserved.
  • 3. Cloudera’s Enterprise Data Hub [architecture diagram] Processing: batch processing (MapReduce, Spark), analytic SQL (Impala), search engine (Solr), machine learning (Spark, partners, Mahout, MLlib), stream processing (Spark), 3rd party apps. Workload management: YARN. Storage for any type of data (unified, elastic, resilient, secure): filesystem (HDFS), online NoSQL (HBase). Data management: Cloudera Navigator. System management: Cloudera Manager. Security: Sentry.
  • 4. Spark: Easy and Fast Big Data. Easy to Develop: • Rich APIs in Java, Scala, Python • Interactive shell • 2-5× less code. Fast to Run: • General execution graphs • In-memory storage • Up to 10× faster on disk, 100× in memory
  • 5. Easy: Expressive API • map • filter • groupBy • sort • union • join • leftOuterJoin • rightOuterJoin • reduce • count • fold • reduceByKey • groupByKey • cogroup • cross • zip • sample • take • first • partitionBy • mapWith • pipe • save ...
  • 6. Example: Logistic Regression
      import numpy
      from math import exp

      # Load the points once and keep them in memory across iterations
      data = spark.textFile(...).map(readPoint).cache()
      w = numpy.random.rand(D)
      for i in range(iterations):
          # Sum the per-point gradients of the logistic loss (labels p.y are -1 or +1)
          gradient = (data
              .map(lambda p: (1 / (1 + exp(-p.y * w.dot(p.x))) - 1) * p.y * p.x)
              .reduce(lambda x, y: x + y))
          w -= gradient
      print("Final w: %s" % w)
  • 7. Spark Introduces Concept of RDD to Take Advantage of Memory. RDD = Resilient Distributed Datasets • Memory caching layer that stores data in a distributed, fault-tolerant cache • Created by parallel transformations on data in stable storage. Two observations: a. Can fall back to disk when the dataset does not fit in memory b. Provides fault tolerance through the concept of lineage
  • 8. Fast: Using RAM, Operator Graphs. In-Memory Caching • Data partitions read from RAM instead of disk. Operator Graphs • Scheduling optimizations • Fault tolerance. [Diagram: operator graph over RDDs A to F with map, filter, groupBy, join and take operations; the legend distinguishes RDDs from cached partitions]
  • 9. Easy: Out of the Box Functionality. Hadoop Integration • Standard Hadoop data formats • Runs under YARN in mixed clusters. Libraries • MLlib – Machine Learning toolkit • GraphX (alpha) – Graph analytics based on PowerGraph abstractions • Spark Streaming – Near real-time analytics • Spark SQL – direct SQL interface in a Spark application. Language support: • SparkR (upcoming) • Java 8 • Schema support in Spark’s APIs • SQL support in Spark Streaming (upcoming)
  • 10. Logistic Regression Performance (Data Fits in Memory) [Chart: running time (s) against number of iterations (1, 5, 10, 20, 30). Hadoop: 110 s / iteration. Spark: first iteration 80 s, further iterations 1 s]
  • 11. Spark Streaming. What is it? • Run continuous processing of data using Spark’s core API. Extends Spark concepts to fault-tolerant, transformable streams • Adds “rolling window” operations, e.g. compute rolling averages or counts for data over the last five minutes. Why do you care? • Same programming paradigm for streaming and batch – reuse knowledge and code in both contexts • High-level API with automatic DAG generation – simplicity of development • Excellent throughput – can scale easily to really large volumes of data ingest • Combine elements like MLlib & Oryx into a streaming application. Example use cases: • “On-the-fly” ETL as data is ingested into Hadoop/HDFS • Detecting anomalous behavior and triggering alerts • Continuous reporting of summary metrics for incoming data
  • 12. Streaming Architectures with Spark [Diagram components: data sources; integration layer / ingest (Flume, Kafka); Spark stream processing (data prep, aggregation / scoring); HDFS; HBase; Spark long-term analytics / model building; real-time result serving]
  • 13. Cloudera Customer Use Cases – Core Spark
      Sector | Use case | Replaces
      Financial Services | Multiple use cases to calculate VaR for portfolio risk analysis (Monte Carlo simulations as well as Var-Covar methods); ETL pipeline speed-up; analyzing stock data for 20 years | Home grown applications
      Genomics | Two use cases to identify disease causing genes in full human genome | MySQL engine
      Data services | Trend analysis using statistical methods on large data sets; document classification (LDA); fraud analytics | Netezza replacement; net new
      Healthcare | Calculating Jaccard scores on health care data sets | Net new
  • 14. Cloudera Customer Use Cases – Streaming (Sector / Use case / Replaces). Financial Services • On-line fraud detection • Net new. Many • Continuous ETL. Retail • On-line recommender systems • Inventory management • Custom apps
  • 15. Spark at Concur
  • 16. About Concur. What do we do? • Leading provider of spend management solutions and services (Travel, Invoice, TripIt, etc.) in the world • Global customer base of 20,000 clients and 25 million users • Processing more than $50 billion in Travel & Expense (T&E) spend each year
  • 17. About the Speaker. Who Am I? • Long-time SQL Server BI guy (24TB Yahoo! Cube) • Project Isotope (Hadoop on Windows and Azure) • At Concur, helping with Big Data and Data Sciences
  • 18. A long time ago… • We started using Hadoop because • It was free • i.e. we didn’t want to pay for a big data warehouse • We could slowly extract from hundreds of relational data sources, consolidate the data, and query it • We were not thinking about advanced analytics • We were thinking … “cheaper reporting” • We have some hardware lying around … let’s cobble it together and now we have reports
  • 19. Themes • Consolidate • Visualize • Insight • Recommend
  • 20. BTS • Travel • Weather • Invoice • Web Analytics • Expense
  • 21. Can quickly switch to map mode and determine where most itineraries are from in 2013
  • 22. Or even quickly map out the airport locations on a map to see that Sun Moon Lake Airport is in the center of Taiwan
  • 23. Expense Categorization. One of my receipts that I had OCRed. One of the issues we’re trying to solve is to auto-categorize this, so how can we do this? Below is a simplistic solution using WordCount. Note: a real solution should involve machine learning algorithms. Receipt text: Starbucks Store #3313 601 108th Ave NE Bellevue, WA (425) 646-9602 ------------------------------- Chk 713452 05/14/2014 11:04 AM 1961558 Drawer: 1 Reg: 1 ------------------------------- Bacon Art Brkfst 3.45 Warmed T1 Latte 2.70 Triple 1.50 Soy 0.60 Gr Vanilla Mac 4.15 Reload Card 50.00 AMEX $50.00 XXXXXXXXXXXXXXXXXX1004 SBUX Card $13.56 SUBTOTAL $62.40 New Caffe Espresso Frappuccino(R) Blended beverage Our Signature Frappuccino(R) roast coffee and fresh milk, blended with ice. Topped with our new espresso whipped cream and new Italian roast drizzle
  • 24. Spark assembly has been built with Hive, including Datanucleus jars on classpath
      Welcome to
            ____              __
           / __/__  ___ _____/ /__
          _\ \/ _ \/ _ `/ __/  '_/
         /___/ .__/\_,_/_/ /_/\_\   version 1.1.0
            /_/
      Using Scala version 2.10.4 (Java HotSpot(TM) 64-Bit Server VM, Java 1.7.0_45)
      Type in expressions to have them evaluated.
      Type :help for more information.
      2014-09-07 22:31:21.064 java[1871:15527] Unable to load realm info from SCDynamicStore
      14/09/07 22:31:21 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
      Spark context available as sc.
      scala> val receipt = sc.textFile("/usr/local/Cellar/workspace/data/receipt/receipt.txt")
      receipt: org.apache.spark.rdd.RDD[String] = /usr/local/Cellar/workspace/data/receipt/receipt.txt MappedRDD[1] at textFile at <console>:12
      scala> receipt.count
      res0: Long = 30
  • 25. scala> val words = receipt.flatMap(_.split(" "))
      words: org.apache.spark.rdd.RDD[String] = FlatMappedRDD[2] at flatMap at <console>:14
      scala> words.count
      res1: Long = 161
      scala> words.distinct.count
      res2: Long = 72
      scala> val wordCounts = words.map(x => (x, 1)).reduceByKey(_ + _).map{case(x,y) => (y,x)}.sortByKey(false).map{case(i,j) => (j, i)}
      wordCounts: org.apache.spark.rdd.RDD[(String, Int)] = MappedRDD[12] at map at <console>:16
      scala> wordCounts.take(12)
      res5: Array[(String, Int)] = Array(("",82), (with,2), (Card,2), (new,2), (-------------------------------,2), (Frappuccino(R),2), (roast,2), (1,2), (and,2), (New,1), (Topped,1), (Starbucks,1))
  • 26. (no text on slide)
  • 27. What’s next… • With Spark 1.1 • Sort-based shuffling • MLlib: correlations, sampling, feature extraction, decision trees • GraphX: label propagation
  • 28. Using AtScale to build up a dimensional model based on the data that is stored within Impala / Hive
  • 29. Slice and filter the Impala model using Tableau
  • 30. Spark and Cloudera
  • 31. Why Cloudera? Expertise • Deep engineering investment – only distribution vendor with engineering contributions to Spark and actual technical know-how • Field team, support, training and services with experience in many Spark use cases • Driving roadmap for Spark. Experience • Most customers running Spark across all distributions put together • Deployments range from a few nodes to 800+ nodes • Longest field presence – first vendor to support Spark, and still one of only two vendors with official support. Partnerships • Intel partnership brings 15 Spark developers focused on Cloudera customer use cases • Business relationship with Databricks to do joint development on Spark
  • 32. Spark Takes Over From MapReduce Stage 1 • Crunch on Spark • Search on Spark Stage 2 • Hive on Spark • Pig on Spark Stage 3 • MR equivalence • Sqoop on Spark Cloudera-led multi-organization effort: MapR, Intel, Databricks, IBM
  • 33. Spark is Great but… • Opaque API limitations • Debugging and troubleshooting • Complex configuration Cloudera University Spark Training
  • 34. Questions & Next Steps Download Now – www.cloudera.com/download Spark Training - www.cloudera.com/content/cloudera/en/training/courses/spark-training.html
  • 35. Thank You

Editor's Notes

  1. Cloudera’s enterprise data hub (powered by Hadoop) is a data management platform that is unified, compliance-ready, accessible, and open. It brings everything together in one unified layer, with no copying of data: a single transparent view that allows you to easily meet auditing and compliance goals. It offers a single, unified solution for storage and serialization, data ingest and egress, security and governance, metadata, and resource management. It is compliance-ready for security and governance and includes authentication, authorization, encryption, audit, RBAC, and lineage, behind a single interface with integrated controls. It is accessible through multiple frameworks and through familiar tools and skills. And it is completely open: a 100% open source, Apache-licensed platform that is extensible to 3rd-party frameworks, with zero lock-in. As mentioned, Cloudera’s enterprise data hub integrates multiple frameworks into the platform for robust querying. One of the newest and most exciting is Spark, an open source, flexible data processing framework for machine learning and stream processing. Before we dive into Spark, we need to understand why Spark is necessary, and that requires an understanding of MapReduce.
  2. Key idea: add “variables” to the “functions” in functional programming
  3. This is for a 29 GB dataset on 20 EC2 m1.xlarge machines (4 cores each)
  4. Quick view of Android vs. iOS mobile sessions