2014.06.16 - BGI - Genomics BigData Workloads - Shenzhen China
1. © 2014 MapR Technologies 1
Genomics Use Cases @ MapR
2. © 2014 MapR Technologies 2© 2014 MapR Technologies
DNA Sequencing Company
3. © 2014 MapR Technologies 3
Parallelize Primary Analytics
.fastq → short read alignment → reads & mappings → genotype calling → .vcf
4. © 2014 MapR Technologies 4
Sequence Analysis, Quick Overview
[…] G A C T A G A fragment1
A C A G T T T A C A fragment2
A G A T A - - A G A fragment3
A A C A G C T T A C A […] fragment4
C T A T A G A T A A fragment5
[…] G A T T A C A G A T T A C A G A T T A C A […] referenceDNA
[…] G A C T A C A G A T A A C A G A T T A C A […] sampleDNA
5. © 2014 MapR Technologies 5
What is the (Probable) Color of Each Column?
6. © 2014 MapR Technologies 6
Which Columns are (probably) Not White?
Strategy 1: for each column, examine each row. O(rows*cols) ops + O(1 col) memory
7. © 2014 MapR Technologies 7
Which Columns are (probably) Not White?
Strategy 2: examine each row, keeping running tallies. O(rows) + O(rows*cols) memory
8. © 2014 MapR Technologies 8
Which Columns are (probably) Not White?
Strategy 3: rotate the matrix, then examine each column. O(rows log rows) + O(cols) + O(1 col) memory
9. © 2014 MapR Technologies 9
Comparison of Strategies
Strategy 1: O(rows*cols) + O(1 col) memory
• Low memory requirement
• Random access pattern, many ops
Strategy 2: O(rows) + O(rows*cols) memory
• High memory requirement
• Sequential access pattern
Strategy 3: O(rows log rows) + O(cols) + O(1 col) memory
• Low memory requirement
• Sequential access pattern
• Requires sort
10. © 2014 MapR Technologies 10
Comparison of Strategies
Strategy 1: O(rows*cols) + O(1 col) memory
• Low memory requirement
• Random access pattern, many ops
Strategy 2: O(rows) + O(rows*cols) memory
• High memory requirement
• Sequential access pattern
Strategy 3: (O(rows log rows) + O(cols)) ÷ shards, + O(1 col) memory
• Low memory requirement
• Sequential access pattern
• Requires sort
As # of rows & columns increases
Strategy 3 becomes more attractive
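As a toy illustration (my own sketch, not MapR code), the three strategies can be written out in Python; here `matrix` is a row-major list of rows, and a column is "not white" if any cell in it is nonzero:

```python
# Hypothetical sketch of the three column-scan strategies.
# A column is "not white" if any row has a nonzero value in it.

def strategy1(matrix):
    """For each column, scan every row: O(rows*cols) ops, O(1 col) memory,
    but a random (column-major) access pattern over row-major data."""
    cols = len(matrix[0])
    return {c for c in range(cols) if any(row[c] for row in matrix)}

def strategy2(matrix):
    """One sequential pass over the rows, keeping a running tally per
    column: sequential access, but the tally state must fit in memory."""
    tally = [0] * len(matrix[0])
    for row in matrix:                  # sequential access
        for c, v in enumerate(row):
            tally[c] += v
    return {c for c, t in enumerate(tally) if t}

def strategy3(matrix):
    """'Rotate' the matrix first (the sort, in a distributed setting),
    then scan each column sequentially with O(1 col) memory."""
    rotated = list(zip(*matrix))        # stands in for the O(rows log rows) sort
    return {c for c, col in enumerate(rotated) if any(col)}

m = [[0, 1, 0], [0, 1, 0], [0, 0, 0]]
assert strategy1(m) == strategy2(m) == strategy3(m) == {1}
```

Strategy 2 trades memory for a single sequential pass; strategy 3 pays a one-time sort (faked here with `zip`) to regain the sequential access pattern while keeping memory small, which is why it wins as rows and columns grow.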
11. © 2014 MapR Technologies 11
Primary Sequence Analysis (ETL), MapReduce style
.fastq → short read alignment (MAP) → .bam → genotype calling (MAP, then REDUCE: rotate matrix 90º) → .vcf
Cost: O(mn) / 1 machine → (O(mn) + O(n log n)) / s shards
12. © 2014 MapR Technologies 12
Clinical Applications: Performance Matters
DNA sequencers → raw DNA (via NFS) → MapR Filesystem
Raw DNA → 1º analytics → SNP calls → static clinical reporting → physician / patient
Reference DBs → ETL → SNP DB → 2º analytics → researcher / subject
13. © 2014 MapR Technologies 13
Variant Collection Enables Downstream Apps
• GWAS (genome-wide association studies)
• Versioned, personalized medicine
• Companion diagnostics
SNP DB → 2º analytics → new markets
More linear algebra: [Spark, Summingbird, Lambda Architecture slides]
14. © 2014 MapR Technologies 14
The Post-Sequencing Genomics Workload
Sboner, et al, 2011. The real cost of sequencing: higher than you think!
15. © 2014 MapR Technologies 15
Example GWAS/SNP Analysis
• Find me related SNPs…
– From other experiments
• Given a phenotype…
– And an associated SNP from my
experiment
• That elucidate genetic basis of
phenotype…
• And rank order them by
impact/likelihood/etc
16. © 2014 MapR Technologies 16
Example GWAS/SNP Analysis
• Find me related SNPs…
– From other experiments
• Given a phenotype…
– And an associated SNP from my
experiment
• That elucidate genetic basis of
phenotype…
• And rank order them by
impact/likelihood/etc
• In context of, e.g.
– ε1: Racial, etc. background
– ε2: Experimental design-specific concerns (e.g. familial IBD/IBS)
– ε3: Environmental factors and
penetrance
– ε4: Assay-specific biases and
noise
phenotype = α·genotype + β + ε1 + ε2 + ε3 + ε4
At risk of over-simplifying as
business-level concept…
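To make the business-level model concrete, here is a toy ordinary-least-squares fit in plain Python (hypothetical data; the ε terms are lumped into one residual here, which a real GWAS analysis would never do):

```python
# Hypothetical sketch: fit phenotype = alpha*genotype + beta + epsilon
# by ordinary least squares on toy data. The separate epsilon_1..epsilon_4
# terms from the slide are collapsed into a single residual.

def fit_ols(genotype, phenotype):
    n = len(genotype)
    gm = sum(genotype) / n
    pm = sum(phenotype) / n
    cov = sum((g - gm) * (p - pm) for g, p in zip(genotype, phenotype))
    var = sum((g - gm) ** 2 for g in genotype)
    alpha = cov / var          # effect size of the variant
    beta = pm - alpha * gm     # baseline phenotype
    return alpha, beta

# Toy example: genotype coded 0/1/2 (copies of the minor allele).
genotype = [0, 1, 2, 0, 1, 2]
phenotype = [1.0, 2.1, 2.9, 0.9, 2.0, 3.1]   # roughly 1 + 1*g plus noise
alpha, beta = fit_ols(genotype, phenotype)
assert abs(alpha - 1.0) < 0.2 and abs(beta - 1.0) < 0.2
```

The point of the slide stands: the hard part is not the fit itself but modeling the ε terms (population structure, design, environment, assay noise) across every SNP × phenotype pair.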
17. © 2014 MapR Technologies 17
HUGE PROBLEM
COMBINATORIAL EXPLOSION
18. © 2014 MapR Technologies 18
What’s a Percolator?
• Google Percolator
– “Caffeine” update 2010
• Iterative, incremental prioritized
updates
• No batch processing
• Decouple computational results
from data size
Peng & Dabek, 2010. Large-scale Incremental Processing Using Distributed Transactions and Notifications
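A caricature of the idea in Python (my own sketch; the real Percolator is built on distributed transactions and notifications over Bigtable, not an in-memory dict):

```python
# Hypothetical caricature of incremental (percolator-style) processing:
# each new observation triggers a small update to derived state, instead
# of re-running a batch job over the whole data set.

class IncrementalCounts:
    def __init__(self):
        self.counts = {}        # derived state, kept continuously up to date
        self.dirty = set()      # keys whose downstream views need refresh

    def observe(self, key):
        """Notification handler: O(1) work per new record."""
        self.counts[key] = self.counts.get(key, 0) + 1
        self.dirty.add(key)     # downstream work is re-run only for this key

    def refresh(self):
        """(Re)process only what changed since the last refresh."""
        changed, self.dirty = self.dirty, set()
        return {k: self.counts[k] for k in changed}

idx = IncrementalCounts()
for snp in ["rs123", "rs456", "rs123"]:
    idx.observe(snp)
assert idx.refresh() == {"rs123": 2, "rs456": 1}
assert idx.refresh() == {}      # nothing changed, nothing reprocessed
```

This is what decouples computational cost from data size: work is proportional to the change, not to the corpus.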
19. © 2014 MapR Technologies 19
Solution: Percolate
Inputs: SNPs, experimental groupings, assay technologies, assayed phenotypes, annotations/ontologies
Denormalize and percolate: (re)prioritize & (re)process; buffer new models
Outputs: service queries, drive dashboards, create reports, denormalize for display
20. © 2014 MapR Technologies 20
Robot Scientist
Sparkes, et al. 2010. Towards Robot Scientists for autonomous scientific discovery
21. © 2014 MapR Technologies 21
Robot (Data?) Scientist
Sparkes, et al. 2010. Towards Robot Scientists for autonomous scientific discovery
22. © 2014 MapR Technologies 22© 2014 MapR Technologies
Genealogy Company
Slides credit: Bill Yetman, Hadoop Summit 2014
http://slidesha.re/1vRh3kY
23. © 2014 MapR Technologies 23
GERMLINE is…
• …an algorithm that finds hidden relationships within a pool of
DNA
• …the reference implementation of that algorithm written in C++.
• You can find it here:
http://www1.cs.columbia.edu/~gusev/germline/
24. © 2014 MapR Technologies 24
Projected GERMLINE run times (in hours)
[Chart: GERMLINE run times in hours vs. number of samples (2,500 → 122,500); measured and projected run times climb steeply to ~700 hours.]
700 hours = 29+ days
EXPONENTIAL COMPLEXITY
25. © 2014 MapR Technologies 25
GERMLINE: What’s the Problem?
• GERMLINE (the implementation) was not meant to be used in
an industrial setting
– Stateless, single threaded, prone to swapping (heavy memory usage)
– GERMLINE performs poorly on large data sets
• Our metrics predicted exactly where the process would slow to
a crawl
• Put simply: GERMLINE couldn't scale
26. © 2014 MapR Technologies 26
Run times for matching (in hours)
[Chart: run times for matching, in hours, vs. number of samples. GERMLINE run times (measured and projected) grow exponentially; after the HBase refactor, Jermline run times stay linear.]
27. © 2014 MapR Technologies 27
• Paper submitted describing the implementation
• Releasing as an Open Source project soon
• [HBase Schema/Algorithm Slides]
28. © 2014 MapR Technologies 28© 2014 MapR Technologies
Further Growth & Optimization
29. © 2014 MapR Technologies 29
Underdog (Strand Phasing) performance
– Went from 12 hours to process 1,000 samples to under 25 minutes with a MapReduce implementation, with improved accuracy!
Underdog replaces Beagle.
[Chart: total run size vs. total Beagle/Underdog duration, up to ~80,000 samples.]
30. © 2014 MapR Technologies 30
Pipeline steps and incremental change…
– Incremental change over time
– Supporting the business in a “just in time” Agile way
[Stacked chart: per-step pipeline duration vs. run size, from 500 to ~395,000 samples. Series: Beagle/Underdog phasing, pipeline finalize, relationship processing, Germline/Jermline results processing, Germline/Jermline processing, Beagle post-phasing, Admixture, Plink prep, pipeline initialization. Annotated milestones: Jermline replaces Germline; Ethnicity V2 release; Underdog replaces Beagle; AdMixture on Hadoop.]
31. © 2014 MapR Technologies 31
…while the business continues to grow rapidly
[Chart: DNA database size (number of processed samples), Jan 2012 → Apr 2014, growing to ~400,000+.]
32. © 2014 MapR Technologies 32© 2014 MapR Technologies
BigData App Development Lifecycle
33. © 2014 MapR Technologies 33
BigData App Development Lifecycle
input (1M rows) → tail | grep | sort | uniq -c → output
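For instance (a hypothetical log, with an awk stage added to project the field of interest; at 1M rows this kind of pipeline is perfectly adequate on one machine):

```shell
# Hypothetical prototype: count status codes in a small synthetic access log.
printf '%s\n' 'GET /a 200' 'GET /b 404' 'GET /c 200' > access.log
tail -n 1000 access.log | grep 'GET' | awk '{print $3}' | sort | uniq -c
```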
34. © 2014 MapR Technologies 34
Evolution of Data Storage
Functionality
Compatibility
Scalability
Linux
POSIX
Over decades of progress,
Unix-based systems have set the
standard for compatibility and
functionality
35. © 2014 MapR Technologies 35
BigData App Development Lifecycle
input (1M rows) → tail | grep | sort | uniq -c → output
36. © 2014 MapR Technologies 36
BigData App Development Lifecycle
input (1M rows → 1B rows) → tail | grep | sort | uniq -c → output
37. © 2014 MapR Technologies 37
BigData App Development Lifecycle
input (1M → 1B → 1T rows) → tail | grep | sort | uniq -c → output
38. © 2014 MapR Technologies 38
Evolution of Data Storage
Functionality
Compatibility
Scalability
Linux
POSIX
Hadoop
Hadoop achieves much higher scalability
by trading away essentially all of this
compatibility
39. © 2014 MapR Technologies 39
BigData App Development Lifecycle
tail | grep | sort | uniq -c
input (1T rows) → output, via the prototype pipeline
input (1T rows) → output, after a costly port to BigData tools ($$$$)
40. © 2014 MapR Technologies 40
Evolution of Data Storage
Functionality
Compatibility
Scalability
Linux
POSIX
Hadoop
MapR enhances Apache Hadoop by restoring
the compatibility while increasing scalability and
performance
41. © 2014 MapR Technologies 41
BigData App Development Lifecycle
input (1T rows) → tail | grep | sort | uniq -c → output
Port target: POSIX (NFS) vs. Hadoop HDFS
42. © 2014 MapR Technologies 42
BigData App Development Lifecycle
tail | grep | sort | uniq -c
Prototype tools dev cost: 1 per stage · use when possible
BigData tools dev cost: 100 per stage · use when needed
43. © 2014 MapR Technologies 43
BigData App Development Lifecycle
tail | grep | sort | uniq -c
Prototype tools dev cost: 1 for most stages · use when possible
BigData tools dev cost: 100 for the one stage that needs it · use when needed
44. © 2014 MapR Technologies 44© 2014 MapR Technologies
Aadhaar – World’s Largest Biometric
Database
45. © 2014 MapR Technologies 45
Largest Biometric Database in the World
1.2B PEOPLE
46. © 2014 MapR Technologies 46
India: Problem
• 1.2 billion residents
– 640,000 villages; ~60% live on under $2/day
– ~75% literacy; <3% pay income tax; <20% have bank accounts
– ~800 million mobile subscribers; ~200-300 million migrant workers
• Govt. spends about $25-40 billion on direct subsidies
– Residents have no standard identity document
– Most programs plagued with ghost and multiple identities causing
leakage of 30-40%
47. © 2014 MapR Technologies 47
India: Vision
• Create a common “national identity” for every “resident”
– Biometric backed identity to eliminate duplicates
– “Verifiable online identity” for portability
• Applications ecosystem using open APIs
– Aadhaar enabled bank account and payment platform
– Aadhaar enabled electronic, paperless KYC
• Enrolment
– One time in a person’s lifetime
– Multi-modal biometrics (fingerprints, iris)
48. © 2014 MapR Technologies 48
Aadhaar Biometric Capture & Index
49. © 2014 MapR Technologies 49
Aadhaar Biometric Capture & Index
50. © 2014 MapR Technologies 50
Aadhaar Biometric Capture & Index
51. © 2014 MapR Technologies 51
Architectural Principles
• Design for Scale
– Every component needs to scale to large volumes
– Millions of transactions and billions of records
– Accommodate failure and design for recovery
• Open Architecture
– Open Source
– Open APIs
• Security
– End-to-end security of resident data
52. © 2014 MapR Technologies 52
Design for Scale
• Horizontal scale-out
• Distributed computing
• Distributed data storage and partitioning
• No single points of failure
• No single points of bottleneck
• Asynchronous processing throughout the system
– Allows loose coupling of various components
– Allows independent component level scaling
53. © 2014 MapR Technologies 53
Aadhaar multi-DC data storage stack* (on the MapR Filesystem):
• ID + biometrics (M7 HBase): Enrollment ID API
• All raw packets (HDFS + NFS)
• ID + demographics + photo + benefits (MySQL, Solr): Authentication API, Authorization API
* as best I understand from public documents
54. © 2014 MapR Technologies 54
Enrollment Volume
• 600 to 800 million UIDs in 4 years
– 1 million a day
– 200+ trillion matches every day!!!
• ~5MB per resident
– Maps to about 10-15 PB of raw data (2048-bit PKI encrypted!)
– About 30 TB I/O every day
– Replication and backup across DCs of about 5+ TB of incremental data every day
– Lifecycle updates and new enrolments will continue forever
• Additional process data
– Several million events on an average moving through async channels (some
persistent and some transient)
– Needing complete update and insert guarantees across data stores
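A back-of-envelope check shows these volumes hang together (using the slide's own round numbers; the ×3 replication factor is my assumption):

```python
# Back-of-envelope check of the enrollment volumes on this slide.
residents_per_day = 1_000_000
mb_per_resident = 5                          # ~5MB of raw packets per resident

daily_ingest_tb = residents_per_day * mb_per_resident / 1_000_000
assert daily_ingest_tb == 5.0                # matches "5+ TB of incremental data" per day

total_residents = 700_000_000                # midpoint of 600-800 million UIDs
total_pb = total_residents * mb_per_resident / 1_000_000_000
replication = 3                              # assumed cross-DC replication factor
assert 10 <= total_pb * replication <= 15    # consistent with "10-15 PB of raw data"
```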
55. © 2014 MapR Technologies 55
Authentication Volume
• 100+ million authentications per day (10 hrs)
– Possible high variance on peak and average
– Sub second response
– Guaranteed audits
• Multi-DC architecture
– All changes needs to be propagated from enrolment data stores to all authentication
sites
• Authentication request is about 4 KB
– 100 million authentications a day
– 1 billion audit records in 10 days (30+ billion a year)
– 4 TB encrypted audit logs in 10 days
– Audit write must be guaranteed
56. © 2014 MapR Technologies 56
How Do Biometrics Relate to Genomics?
Data Shape and Size
• Aadhaar: 5MB of features (minutiae)
• Genome: ~3M features (variants)
Data Set Operations
• Aadhaar: ƒ(x) Unique feature subset => identity
• Genome: “ “ “ “ “
• Genome: Variant × Phenotype
Commonality => Causal Genes
ƒ-1(x) !
SNP DB 2º
Analytics
57. © 2014 MapR Technologies 57
Data Shape and Size
• Aadhaar: 5MB of features (minutiae)
• Genome: ~3M features (variants)
Data Set Operations
• Aadhaar: ƒ(x) Unique feature subset => identity
• Genome: “ “ “ “ “
• Genome: Variant × Phenotype
Commonality => Causal Genes
ƒ-1(x) !
Vector Pattern Matching
SNP DB 2º
Analytics
ƒ-1(x): common features
ƒ(x): unique features
ƒ(x): uncommon features
ƒ(x): other features
58. © 2014 MapR Technologies 58
Data Shape and Size
• Aadhaar: 5MB of features (minutiae)
• Genome: ~3M features (variants)
Data Set Operations
• Aadhaar: ƒ(x) Unique feature subset => identity
• Genome: “ “ “ “ “
• Genome: Variant × Phenotype
Commonality => Causal Genes
ƒ-1(x) !
Topological Pattern Matching
SNP DB 2º
Analytics
59. © 2014 MapR Technologies 59© 2014 MapR Technologies
MapR Platform
60. © 2014 MapR Technologies 60
Apache Hadoop NameNode High Availability (HA)
HDFS-based distributions:
• Single NameNode (blocks A-F, many DataNodes): single point of failure; limited to 50-200 million files; performance bottleneck; metadata must fit in memory.
• HDFS HA (primary + standby NameNode on a NAS appliance): only one active NameNode; limited to 50-200 million files; commercial NAS possibly needed; metadata must fit in memory; performance bottleneck; double the block reports.
• HDFS Federation (NameNodes partitioned across A-B, C-D, E-F): multiple single points of failure without HA; needs 20 NameNodes for 1 billion files; commercial NAS needed; metadata must fit in memory; performance bottleneck; double the block reports.
61. © 2014 MapR Technologies 61
No NameNode Architecture
The NameNode function is distributed across the cluster's DataNodes:
• Up to 1T files (>5000× advantage)
• Significantly less hardware & OpEx
• Higher performance
• No special config to enable HA
• Automatic failover & re-replication
• Metadata is persisted to disk
62. © 2014 MapR Technologies 62
MapR M7: The Best In-Hadoop Database
NoSQL columnar store · Apache HBase API · in-Hadoop database
Other distros: HBase (JVM) → HDFS (JVM) → ext3/ext4 → disks
MapR M7: tables/files → disks
The most scalable, enterprise-grade NoSQL database that supports online applications and analytics.
63. © 2014 MapR Technologies 63
MapR M7: The Best In-Hadoop Database
NoSQL columnar store · Apache HBase API · in-Hadoop database
BigData application →
Other distros: HBase interface (JVM) → HDFS interface (JVM) → ext3/ext4 → disks
MapR M7: tables/files → disks
The most scalable, enterprise-grade NoSQL database that supports online applications and analytics.
64. © 2014 MapR Technologies 64
HBase Apps: High Performance with Consistent Low Latency
[Chart: M7 read latency vs. others' read latency; M7 stays consistently low.]
65. © 2014 MapR Technologies 65© 2014 MapR Technologies
MapR Services
66. © 2014 MapR Technologies 66
Professional Services: installation, migrations, SLA plans, best practices, performance tuning
• Hadoop Core Services · audience: IT/infrastructure · skills: Linux, networking, data center, storage, operations
• Big Data Workflows (Hive/Pig, Oozie/Sqoop, Flume, M7/HBase, data flow) · audience: BI/DBA · skills: BI/ETL/reporting, scripting/Java, Hadoop MR, eco projects (HBase, Hive, …)
• Solution Design (HBase/M7, MapReduce, application development, integration development) · audience: Hadoop developer · skills: Java, architectural design
• Advanced Analytics (use-case discovery, use-case modeling, POC, workshops) · audience: modeler/analyst · skills: PhD statistics/math, MatLab/R/SAS, scripting/Java, BI/ETL/reporting
Spectrum: data engineering ↔ data science
67. © 2014 MapR Technologies 67
Global PS Resources
17 Today (+8 in Q3)
D.C.
Keys Botzum (DE/Security/Developer)
Joe Blue (Data Scientist)
Venkat Gunnup (DE/Development)
Alex Rodriguez (DE/Development)
Kannappan Sirchabesa (DE/OPS)
SAN JOSE
Wayne Cappas (Director/DE)
John Benninghoff (DE/OPS)
Dmitry Gomerman (DE/OPS & Security)
Ivan Bishop (DE/OPS)
James Caseletto (Data Scientist)
Sungwook Yoon (Data Scientist)
Sridhar Reddy (Director - M7/HBase)
LOS ANGELES
John Ewing (DE/OPS)
Marco Vasquez (Data Scientist/DE)
SOUTH CAROLINA
David Schexnayder (DE/OPS)
PHOENIX
Michael Farnbach (DE/OPS)
SINGAPORE
Allen Day (Data Scientist)
68. © 2014 MapR Technologies 68
Use Case Data Flow Example
Data sources: clickstream, billing data, mobile data, product catalog, social media, server logs, merchant listings, online chat, call detail records, set-top box data
→ Ingest: Sqoop, Flume, HDFS, NFS
→ MapR Data Platform
→ Processing and analytics: MapReduce v1 & v2, YARN, Pig, Cascading, Storm, Solr, Mahout, Oozie, Hive, MLlib, M7/HBase
→ Access: Tez, Drill, Hive, Pig, Impala
→ Visualization
69. © 2014 MapR Technologies 69
Engagement Types
• Customer engagement is typically 1-4 weeks (longer okay)
• Well established partners (15,000 resources globally)
• Custom training based on customer use-case
• Small 1-3 days workshops
• Extended support / Staff augmentation
70. © 2014 MapR Technologies 70
Q&A
Thanks!
twitter.com/allenday · slideshare.net/allenday · linkedin.com/in/allenday · aday@mapr.com
71. © 2014 MapR Technologies 71© 2014 MapR Technologies
An Overview of Apache Spark
72. © 2014 MapR Technologies 72
Agenda
• MapReduce Refresher
• What is Spark?
• The Difference with Spark
• Preexisting MapReduce
• Examples and Resources
73. © 2014 MapR Technologies 73© 2014 MapR Technologies
MapReduce Refresher
74. © 2014 MapR Technologies 74
MapReduce Basics
• Foundational model is based on a distributed file system
– Scalability and fault-tolerance
• Map
– Loading of the data and defining a set of keys
• Reduce
– Collects the organized key-based data to process and output
• Performance can be tweaked based on known details of your
source files and cluster shape (size, total number)
75. © 2014 MapR Technologies 75
Languages and Frameworks
• Languages
– Java, Scala, Clojure
– Python, Ruby
• Higher Level Languages
– Hive
– Pig
• Frameworks
– Cascading, Crunch
• DSLs
– Scalding, Scrunch, Scoobi, Cascalog
76. © 2014 MapR Technologies 76
MapReduce Processing Model
• Define mappers
• Shuffling is automatic
• Define reducers
• For complex work, chain jobs together
– Or use a higher level language or DSL that does this for you
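The model above can be sketched locally in plain Python (an illustration of the phases only, not the Hadoop API):

```python
from collections import defaultdict

# Local sketch of the MapReduce phases for word count (not Hadoop code).

def map_phase(records):
    """Map: load the data and emit (key, value) pairs."""
    for line in records:
        for word in line.split():
            yield (word, 1)

def shuffle(pairs):
    """Shuffle: group values by key (automatic in a real framework)."""
    groups = defaultdict(list)
    for k, v in pairs:
        groups[k].append(v)
    return groups

def reduce_phase(groups):
    """Reduce: collapse each key's values to an output record."""
    return {k: sum(vs) for k, vs in groups.items()}

counts = reduce_phase(shuffle(map_phase(["a b a", "b a"])))
assert counts == {"a": 3, "b": 2}
```

Chaining jobs means feeding one reduce's output into the next map; higher-level languages generate that chain for you.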
77. © 2014 MapR Technologies 77© 2014 MapR Technologies
What is Spark?
78. © 2014 MapR Technologies 78
Apache Spark
spark.apache.org
github.com/apache/spark
user@spark.apache.org
• Originally developed in 2009 in UC
Berkeley’s AMP Lab
• Fully open sourced in 2010 – now
a Top Level Project at the Apache
Software Foundation
80. © 2014 MapR Technologies 80
Spark is the Most Active Open Source Project in Big Data
[Bar chart: project contributors in the past year. Spark leads by a wide margin over Giraph, Storm, and Tez (scale 0-140).]
81. © 2014 MapR Technologies 81
Unified Platform
Spark (general execution engine), with:
• Shark (SQL)
• Spark Streaming (streaming)
• MLlib (machine learning)
• GraphX (graph computation)
Continued innovation bringing new functionality, e.g.:
• Java 8 (closures, lambda expressions)
• Spark SQL (SQL on Spark, not just Hive)
• BlinkDB (approximate queries)
• SparkR (R wrapper for Spark)
82. © 2014 MapR Technologies 82
Supported Languages
• Java
• Scala
• Python
• Hive?
83. © 2014 MapR Technologies 83
Data Sources
• Local Files
– file:///opt/httpd/logs/access_log
• S3
• Hadoop Distributed Filesystem
– Regular files, sequence files, any other Hadoop InputFormat
• HBase
84. © 2014 MapR Technologies 84
Machine Learning - MLlib
• K-Means
• L1 and L2-regularized Linear Regression
• L1 and L2-regularized Logistic Regression
• Alternating Least Squares
• Naive Bayes
• Stochastic Gradient Descent
* As of May 14, 2014
** Don’t be surprised if you see the Mahout library converting to Spark soon
85. © 2014 MapR Technologies 85© 2014 MapR Technologies
The Difference with Spark
86. © 2014 MapR Technologies 86
Easy and Fast Big Data
• Easy to Develop
– Rich APIs in Java, Scala,
Python
– Interactive shell
• Fast to Run
– General execution graphs
– In-memory storage
2-5× less code · up to 10× faster on disk, 100× in memory
87. © 2014 MapR Technologies 87
Resilient Distributed Datasets (RDD)
• Spark revolves around RDDs
• Fault-tolerant collection of elements that can be operated on in
parallel
– Parallelized Collection: Scala collection which is run in parallel
– Hadoop Dataset: records of files supported by Hadoop
http://www.cs.berkeley.edu/~matei/papers/2012/nsdi_spark.pdf
88. © 2014 MapR Technologies 88
RDD Operations
• Transformations
– Creation of a new dataset from an existing
• map, filter, distinct, union, sample, groupByKey, join, etc…
• Actions
– Return a value after running a computation
• collect, count, first, takeSample, foreach, etc…
Check the documentation for a complete list
http://spark.apache.org/docs/latest/scala-programming-guide.html#rdd-operations
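The lazy-transformation vs. eager-action split can be mimicked with Python generators (a local analogy only; in Spark this would be `rdd.filter(...).map(...)` followed by `collect()` or `count()`):

```python
# Local analogy for RDD laziness: "transformations" build a pipeline of
# generators and do no work; only the "action" at the end forces execution.

data = ["ERROR a", "INFO b", "ERROR c"]

# "Transformations": nothing is computed yet.
errors = (line for line in data if line.startswith("ERROR"))   # ~ rdd.filter
fields = (line.split()[1] for line in errors)                  # ~ rdd.map

# "Action": pulls data through the whole pipeline at once.
result = list(fields)                                          # ~ rdd.collect()
assert result == ["a", "c"]
```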
89. © 2014 MapR Technologies 89
RDD Persistence / Caching
• Variety of storage levels
– memory_only (default), memory_and_disk, etc…
• API Calls
– persist(StorageLevel)
– cache() – shorthand for persist(StorageLevel.MEMORY_ONLY)
• Considerations
– Read from disk vs. recompute (memory_and_disk)
– Total memory storage size (memory_only_ser)
– Replicate to second node for faster fault recovery (memory_only_2)
• Think about this option if supporting a web application
http://spark.apache.org/docs/latest/scala-programming-guide.html#rdd-persistence
90. © 2014 MapR Technologies 90
Cache Scaling Matters
[Bar chart: execution time (s) vs. % of working set in cache: 69s with cache disabled, 58s at 25%, 41s at 50%, 30s at 75%, 12s fully cached.]
91. © 2014 MapR Technologies 91
Directed Acyclic Graph (DAG)
• Directed
– Only in a single direction
• Acyclic
– No looping
• Why does this matter?
– This supports fault-tolerance
92. © 2014 MapR Technologies 92
RDD Fault Recovery
RDDs track lineage information that can be used to efficiently
recompute lost data
msgs = textFile.filter(lambda s: s.startswith("ERROR"))
               .map(lambda s: s.split("\t")[2])
HDFS File Filtered RDD Mapped RDD
filter
(func = startsWith(…))
map
(func = split(...))
93. © 2014 MapR Technologies 93
Comparison to Storm
• Higher throughput than Storm
– Spark Streaming: 670k records/sec/node
– Storm: 115k records/sec/node
– Commercial systems: 100-500k records/sec/node
[Charts: throughput per node (MB/s) vs. record size (bytes, 100-1000) for WordCount and Grep. Spark outperforms Storm on both benchmarks.]
94. © 2014 MapR Technologies 94
Interactive Shell
• Iterative Development
– Cache those RDDs
– Open the shell and ask questions
• We have all wished we could do this with MapReduce
– Compile / save your code for scheduled jobs later
• Scala – spark-shell
• Python – pyspark
95. © 2014 MapR Technologies 95
The Game Changer!
96. © 2014 MapR Technologies 96© 2014 MapR Technologies
Preexisting MapReduce
97. © 2014 MapR Technologies 97
Existing Jobs
• Java MapReduce
– Port them over if you need better performance
• Be sure to share the results and learnings
• Pig Scripts
– Port them over
– Try SPORK!
• Hive Queries….
98. © 2014 MapR Technologies 98
Shark – SQL over Spark
• Hive-compatible (HiveQL, UDFs, metadata)
– Works in existing Hive warehouses without changing queries or data!
• Augments Hive
– In-memory tables and columnar memory store
• Fast execution engine
– Uses Spark as the underlying execution engine
– Low-latency, interactive queries
– Scale-out and tolerates worker failures
99. © 2014 MapR Technologies 99© 2014 MapR Technologies
Examples and Resources
100. © 2014 MapR Technologies 100
JavaSparkContext sc = new JavaSparkContext(master, appName, [sparkHome], [jars]);
JavaRDD<String> file = sc.textFile("hdfs://...");
JavaRDD<String> counts = file.flatMap(line -> Arrays.asList(line.split(" ")))
.mapToPair(w -> new Tuple2<String, Integer>(w, 1))
.reduceByKey((x, y) -> x + y);
counts.saveAsTextFile("hdfs://...");
val sc = new SparkContext(master, appName, [sparkHome], [jars])
val file = sc.textFile("hdfs://...")
val counts = file.flatMap(line => line.split(" "))
.map(word => (word, 1))
.reduceByKey(_ + _)
counts.saveAsTextFile("hdfs://...")
Word Count
• Java MapReduce (~15 lines of code)
• Java Spark (~ 7 lines of code)
• Scala and Python (4 lines of code)
– interactive shell: skip line 1 and replace the last line with counts.collect()
• Java8 (4 lines of code)
101. © 2014 MapR Technologies 101
Network Word Count – Streaming
// Create the context with a 1 second batch size
val ssc = new StreamingContext(args(0), "NetworkWordCount", Seconds(1),
System.getenv("SPARK_HOME"), StreamingContext.jarOfClass(this.getClass))
// Create a NetworkInputDStream on target host:port and count the
// words in the input stream of \n-delimited text (e.g. generated by 'nc')
val lines = ssc.socketTextStream("localhost", 9999, StorageLevel.MEMORY_ONLY_SER)
val words = lines.flatMap(_.split(" "))
val wordCounts = words.map(x => (x, 1)).reduceByKey(_ + _)
wordCounts.print()
ssc.start()
102. © 2014 MapR Technologies 102
Deploying Spark – Cluster Manager Types
• Standalone mode
– Comes bundled (EC2 capable)
• YARN
• Mesos
103. © 2014 MapR Technologies 103
Remember
• If you want to use a new technology you must learn that new
technology
• For those who have been using Hadoop for a while, at one time
you had to learn all about MapReduce and how to manage and
tune it
• To get the most out of a new technology you need to learn that
technology, this includes tuning
– There are switches you can use to optimize your work
104. © 2014 MapR Technologies 104
Configuration
http://spark.apache.org/docs/latest/
Most Important
• Application Configuration
http://spark.apache.org/docs/latest/configuration.html
• Standalone Cluster Configuration
http://spark.apache.org/docs/latest/spark-standalone.html
• Tuning Guide
http://spark.apache.org/docs/latest/tuning.html
105. © 2014 MapR Technologies 105
Resources
• Pig on Spark
– http://apache-spark-user-list.1001560.n3.nabble.com/Pig-on-Spark-td2367.html
– https://github.com/aniket486/pig
– https://github.com/twitter/pig/tree/spork
– http://docs.sigmoidanalytics.com/index.php/Setting_up_spork_with_spark_0.8.1
– https://github.com/sigmoidanalytics/pig/tree/spork-hadoopasm-fix
• Latest on Spark
– http://databricks.com/categories/spark/
– http://www.spark-stack.org/
106. © 2014 MapR Technologies 106
• San Francisco
June 30 – July 2
• Use Cases
• Tech Talks
• Training
http://spark-summit.org/
107. © 2014 MapR Technologies 107
Q&A
@mapr maprtech
jscott@mapr.com
Engage with us!
MapR
maprtech
mapr-technologies
Editor's notes: Graph of each step in the pipeline for every run. This graph shows how important it is to measure everything. Some steps have been greatly reduced or eliminated. Light blue is the matching step. You can see it going quadratic and then the change when 'J' Jermline was released.
Gives up random access read on files
Gives up strong authentication / authorization model
Gives up random access write / append on files. Historically, the NameNode in Hadoop is a single point of failure, a scalability limitation, and a performance bottleneck.
Other distributions you have a bottleneck regardless of the number of nodes in the cluster. With other distributions the most number of files that you can support is 200M at the maximum and that is with an extremely high end server. 50% of the processing of Hadoop in Facebook is to pack and unpack files to try to work around this limitation.
As you add more nodes to your cluster and want to configure HA, you have to add expensive NAS and have warm standby’s for the NN and related metadata which is persisted in memory. Even more, once you surpass the file limit in HDFS, you have to have region NameNode servers to support those additional nodes. A “federated NameNode approach”.
Think of the additional dedicated hardware and configurations/administration required to set up NameNode HA in Hadoop! And this is ONLY for NameNode HA. What if you could distribute the NameNode metadata and have it share resources in your cluster? What if Hadoop was a truly distributed environment?
With MapR there is no dedicated NameNode. The NameNode function is distributed across the cluster. This provides major advantages in terms of HA, data loss avoidance, scalability and performance.
(advantages of this approach are called out on the left and right side of the diagram Because of architecture.
Apache Hbase runs in a JVM which read/writes to HDFS which is also running is a separate JVM, storing data in the Linux OS which is reading and writing to disk.
As data is collected, it needs to be written to disk and “compacted” (i.e, maintenance is performed), this introduces many layers and steps that need to happen
MapR M7 has integrated tables and files which are a true file system, reading and writing directly on disks.
MapR M7 is a tightly integrated, in-Hadoop database which is NoSQL, columnar store which is 100% Apache Hbase API compatible Because of architecture.
Apache Hbase runs in a JVM which read/writes to HDFS which is also running is a separate JVM, storing data in the Linux OS which is reading and writing to disk.
As data is collected, it needs to be written to disk and “compacted” (i.e, maintenance is performed), this introduces many layers and steps that need to happen
MapR M7 has integrated tables and files which are a true file system, reading and writing directly on disks.
MapR M7 is a tightly integrated, in-Hadoop database which is NoSQL, columnar store which is 100% Apache Hbase API compatible **Consistent** low latency on read due to compactions
Recall Aadhar
Why? Spark is really cool… When do you use regular mapreduce over higher level languages? When Hive? When Pig? When anything? You can find Project Resources on the Apache. You’ll also find information about the mailing list there (including archives) Yahoo and Adobe are in production with Spark. This sounds a lot like the reason to consider Pig vs. Java MapReduce Gracefully Looks kind of like a source control tree You can import the MLlib to use here in the shell! Best use case? Standalone followed by Mesos… My personal opinion is that Mesos is where the future will take us. Don’t forget to share your experiences. This is really what the community is about.
Don’t have time to contribute to open source, use it and share your experiences! This isn’t all proven out yet, but some of it should just work already. This is a really simple example. Reality is 22 chromosomes and 96 characters in a word
‘G’ Germline would have to rebuild the hash table for all samples and then re-run all comparisons. An all by all comparison
This is where HBase shines. It is easy to add columns and rows, very efficient with empty cells (sparse matrix). Hammer HBase with multiple processes doing this at the same time.