Jai Ranganathan, Senior Director of Product Management, discusses why Spark has seen such wide adoption and provides a technical deep dive into its architecture. Additionally, he presents some use cases in production today. Finally, he shares Cloudera's vision for the Hadoop ecosystem and why Cloudera believes Spark is the successor to MapReduce for Hadoop data processing.
The Future of Hadoop: A deeper look at Apache Spark
1.
An Introduction to Spark
Jai Ranganathan, Senior Director Product Management, Cloudera
Denny Lee, Senior Director Data Sciences Engineering, Concur
16.
About Concur
What do we do?
• Leading provider of spend management solutions and services (Travel, Invoice, TripIt, etc.) in the world
• Global customer base of 20,000 clients and 25 million users
• Processing more than $50 Billion in Travel & Expense (T&E)
spend each year
17.
About the Speaker
Who Am I?
• Long time SQL Server BI guy
(24TB Yahoo! Cube)
• Project Isotope (Hadoop on
Windows and Azure)
• At Concur, helping with Big
Data and Data Sciences
18.
A long time ago…
• We started using Hadoop because
• It was free
• i.e. Didn’t want to pay for a big data warehouse
• Could slowly extract from hundreds of relational data sources, consolidate it, and query it
• We were not thinking about advanced analytics
• We were thinking … "cheaper reporting"
• We had some hardware lying around … let's cobble it together, and now we have reports
21. Can quickly switch to map mode and determine where most itineraries were from in 2013
22.
Or even quickly plot the airport locations on a map to see that Sun Moon Lake Airport is in the center of Taiwan
23.
Starbucks Store #3313
601 108th Ave NE
Bellevue, WA (425) 646-9602
-------------------------------
Chk 713452
05/14/2014 11:04 AM
1961558 Drawer: 1 Reg: 1
-------------------------------
Bacon Art Brkfst 3.45
Warmed
T1 Latte 2.70
Triple 1.50
Soy 0.60
Gr Vanilla Mac 4.15
Reload Card 50.00
AMEX $50.00
XXXXXXXXXXXXXXXXXX1004
SBUX Card $13.56
SUBTOTAL $62.40
New Caffe Espresso
Frappuccino(R) Blended beverage
Our Signature
Frappuccino(R) roast coffee and
fresh milk, blended with ice.
Topped with our new espresso
whipped cream and new
Italian roast drizzle
Expense Categorization
This is one of my receipts, which I had OCRed. One of the issues we're trying to solve is how to auto-categorize it. Below is a simplistic solution using WordCount. Note that a real solution should involve machine learning algorithms.
24.
Spark assembly has been built with Hive, including Datanucleus jars on classpath
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 1.1.0
      /_/
Using Scala version 2.10.4 (Java HotSpot(TM) 64-Bit Server VM, Java 1.7.0_45)
Type in expressions to have them evaluated.
Type :help for more information.
2014-09-07 22:31:21.064 java[1871:15527] Unable to load realm info from SCDynamicStore
14/09/07 22:31:21 WARN NativeCodeLoader: Unable to load native-hadoop library for your
platform... using builtin-java classes where applicable
Spark context available as sc.
scala> val receipt = sc.textFile("/usr/local/Cellar/workspace/data/receipt/receipt.txt")
receipt: org.apache.spark.rdd.RDD[String] =
/usr/local/Cellar/workspace/data/receipt/receipt.txt MappedRDD[1] at textFile at
<console>:12
scala> receipt.count
res0: Long = 30
25.
scala> val words = receipt.flatMap(_.split(" "))
words: org.apache.spark.rdd.RDD[String] = FlatMappedRDD[2] at flatMap at <console>:14
scala> words.count
res1: Long = 161
scala> words.distinct.count
res2: Long = 72
scala> val wordCounts = words.map(x => (x, 1)).reduceByKey(_ + _).map{case(x,y) =>
(y,x)}.sortByKey(false).map{case(i,j) => (j, i)}
wordCounts: org.apache.spark.rdd.RDD[(String, Int)] = MappedRDD[12] at map at <console>:16
scala> wordCounts.take(12)
res5: Array[(String, Int)] = Array(("",82), (with,2), (Card,2), (new,2), (-------------------------------,2), (Frappuccino(R),2), (roast,2), (1,2), (and,2), (New,1), (Topped,1), (Starbucks,1))
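The top result, ("", 82), shows that splitting on a single space produces a flood of empty tokens from the receipt's aligned columns. A minimal sketch of a cleaned-up pipeline, written here with plain Scala collections rather than Spark (the sample lines and the local `groupBy` stand-in for `reduceByKey` are assumptions for illustration, not the speaker's code):

```scala
// Sample lines standing in for the OCRed receipt file.
val text = List("T1 Latte    2.70", "Triple    1.50", "Soy    0.60")

val wordCounts = text
  .flatMap(_.split("\\s+"))          // split on runs of whitespace, not a single space
  .filter(_.nonEmpty)                // drop empty tokens like the ("", 82) entry above
  .map(w => (w, 1))
  .groupBy { case (w, _) => w }      // local analogue of reduceByKey(_ + _)
  .map { case (w, pairs) => (w, pairs.map(_._2).sum) }
  .toList
  .sortBy { case (_, n) => -n }      // descending by count, like sortByKey(false)
```

The same `filter(_.nonEmpty)` step would slot into the RDD chain unchanged, since RDDs expose the same combinator names.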
Cloudera’s enterprise data hub (powered by Hadoop) is a data management platform that provides a unique offering that’s unified, compliance-ready, accessible, and open.
This enterprise data hub brings everything together in one unified layer. No copying of data. Simply one single transparent view that allows you to easily meet auditing and compliance goals.
It offers a single, unified solution for:
• Storage & serialization
• Data ingest & egress
• Security & governance
• Metadata
• Resource management
It’s compliance-ready for security and governance and includes:
• Authentication, authorization, encryption, audit, RBAC, lineage
• Single interface with integrated controls
It’s accessible through:
• Multiple frameworks
• Familiar tools and skills
And it’s completely open:
• 100% open source, Apache-licensed platform
• Extensible to 3rd-party frameworks
• Zero lock-in platform
As mentioned, Cloudera’s enterprise data hub has multiple frameworks integrated into the platform for robust querying. One of the newest and most exciting of these is Spark, an open source, flexible data processing framework for machine learning and stream processing. Before we dive into Spark, we need to understand why Spark is necessary, and that requires an understanding of MapReduce.
Key idea: add “variables” to the “functions” of functional programming — named, reusable distributed datasets (RDDs) that can be cached in memory alongside purely functional transformations.
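That idea can be sketched in plain Scala (no Spark; the data and names below are illustrative, not from the talk): an intermediate dataset is bound to a name once and then reused by several computations, the way a cached RDD is reused across actions instead of being recomputed from disk on every query, as in MapReduce.

```scala
// Input lines standing in for a file on HDFS.
val lines = List("to be or", "not to be")

// Bind the intermediate result to a "variable" -- in Spark this would be
// an RDD marked with .cache() so it stays in memory after first use.
val words = lines.flatMap(_.split(" "))

// Several "actions" reuse the same named dataset without re-reading input.
val total    = words.size            // 6
val distinct = words.distinct.size   // 4
```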
This is for a 29 GB dataset on 20 EC2 m1.xlarge machines (4 cores each)