2014.06.16 - BGI - Genomics BigData Workloads - Shenzhen China
1. © 2014 MapR Technologies 1
Genomics Use Cases @ MapR
2. © 2014 MapR Technologies 2© 2014 MapR Technologies
DNA Sequencing Company
3. © 2014 MapR Technologies 3
Parallelize Primary Analytics
.fastq → short read alignment → reads & mappings → genotype calling → .vcf
4. © 2014 MapR Technologies 4
Sequence Analysis, Quick Overview
[…] G A C T A G A fragment1
A C A G T T T A C A fragment2
A G A T A - - A G A fragment3
A A C A G C T T A C A […] fragment4
C T A T A G A T A A fragment5
[…] G A T T A C A G A T T A C A G A T T A C A […] referenceDNA
[…] G A C T A C A G A T A A C A G A T T A C A […] sampleDNA
5. © 2014 MapR Technologies 5
What is the (Probable) Color of Each Column?
6. © 2014 MapR Technologies 6
Which Columns are (probably) Not White?
Strategy 1: for each column, examine each row. O(rows*cols) ops + O(1 col) memory
7. © 2014 MapR Technologies 7
Which Columns are (probably) Not White?
Strategy 2: examine each row, keeping running tallies. O(rows) + O(rows*cols) memory
8. © 2014 MapR Technologies 8
Which Columns are (probably) Not White?
Strategy 3: rotate the matrix, then examine each column. O(rows log rows) + O(cols) + O(1 col) memory
9. © 2014 MapR Technologies 9
Comparison of Strategies
Strategy 1: O(rows*cols) + O(1 col) memory
• Low memory requirement
• Random access pattern, many ops
Strategy 2: O(rows) + O(rows*cols) memory
• High memory requirement
• Sequential access pattern
Strategy 3: O(rows log rows) + O(cols) + O(1 col) memory
• Low memory requirement
• Sequential access pattern
• Requires sort
10. © 2014 MapR Technologies 10
Comparison of Strategies
Strategy 1: O(rows*cols) + O(1 col) memory
• Low memory requirement
• Random access pattern, many ops
Strategy 2: O(rows) + O(rows*cols) memory
• High memory requirement
• Sequential access pattern
Strategy 3: (O(rows log rows) + O(cols)) ÷ shards, + O(1 col) memory
• Low memory requirement
• Sequential access pattern
• Requires sort
As # of rows & columns increases
Strategy 3 becomes more attractive
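As a toy illustration (my own sketch, not MapR code), the three strategies can be written out in Python; here `matrix` is a row-major list of rows, and a column is "not white" if any cell in it is nonzero:

```python
# Hypothetical sketch of the three column-scan strategies.
# A column is "not white" if any row has a nonzero value in it.

def strategy1(matrix):
    """For each column, scan every row: O(rows*cols) ops, O(1 col) memory,
    but a random (column-major) access pattern over row-major data."""
    cols = len(matrix[0])
    return {c for c in range(cols) if any(row[c] for row in matrix)}

def strategy2(matrix):
    """One sequential pass over the rows, keeping a running tally per
    column: sequential access, but the tally state must fit in memory."""
    tally = [0] * len(matrix[0])
    for row in matrix:                  # sequential access
        for c, v in enumerate(row):
            tally[c] += v
    return {c for c, t in enumerate(tally) if t}

def strategy3(matrix):
    """'Rotate' the matrix first (the sort, in a distributed setting),
    then scan each column sequentially with O(1 col) memory."""
    rotated = list(zip(*matrix))        # stands in for the O(rows log rows) sort
    return {c for c, col in enumerate(rotated) if any(col)}

m = [[0, 1, 0], [0, 1, 0], [0, 0, 0]]
assert strategy1(m) == strategy2(m) == strategy3(m) == {1}
```

Strategy 2 trades memory for a single sequential pass; strategy 3 pays a one-time sort (faked here with `zip`) to regain the sequential access pattern while keeping memory small, which is why it wins as rows and columns grow.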
11. © 2014 MapR Technologies 11
Primary Sequence Analysis (ETL), MapReduce style
.fastq → short read alignment (MAP) → .bam → genotype calling (MAP, then REDUCE: rotate matrix 90º) → .vcf
Cost: O(mn) / 1 machine → (O(mn) + O(n log n)) / s shards
12. © 2014 MapR Technologies 12
Clinical Applications: Performance Matters
DNA sequencers → raw DNA (via NFS) → MapR Filesystem
Raw DNA → 1º analytics → SNP calls → static clinical reporting → physician / patient
Reference DBs → ETL → SNP DB → 2º analytics → researcher / subject
13. © 2014 MapR Technologies 13
Variant Collection Enables Downstream Apps
• GWAS (genome-wide association studies)
• Versioned, personalized medicine
• Companion diagnostics
SNP DB → 2º analytics → new markets
More linear algebra: [Spark, Summingbird, Lambda Architecture slides]
14. © 2014 MapR Technologies 14
The Post-Sequencing Genomics Workload
Sboner, et al, 2011. The real cost of sequencing: higher than you think!
15. © 2014 MapR Technologies 15
Example GWAS/SNP Analysis
• Find me related SNPs…
– From other experiments
• Given a phenotype…
– And an associated SNP from my
experiment
• That elucidate genetic basis of
phenotype…
• And rank order them by
impact/likelihood/etc
16. © 2014 MapR Technologies 16
Example GWAS/SNP Analysis
• Find me related SNPs…
– From other experiments
• Given a phenotype…
– And an associated SNP from my
experiment
• That elucidate genetic basis of
phenotype…
• And rank order them by
impact/likelihood/etc
• In context of, e.g.
– ε1: Racial, etc. background
– ε2: Experimental design-specific concerns (e.g. familial IBD/IBS)
– ε3: Environmental factors and
penetrance
– ε4: Assay-specific biases and
noise
phenotype = α·genotype + β + ε1 + ε2 + ε3 + ε4
At risk of over-simplifying as
business-level concept…
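To make the business-level model concrete, here is a toy ordinary-least-squares fit in plain Python (hypothetical data; the ε terms are lumped into one residual here, which a real GWAS analysis would never do):

```python
# Hypothetical sketch: fit phenotype = alpha*genotype + beta + epsilon
# by ordinary least squares on toy data. The separate epsilon_1..epsilon_4
# terms from the slide are collapsed into a single residual.

def fit_ols(genotype, phenotype):
    n = len(genotype)
    gm = sum(genotype) / n
    pm = sum(phenotype) / n
    cov = sum((g - gm) * (p - pm) for g, p in zip(genotype, phenotype))
    var = sum((g - gm) ** 2 for g in genotype)
    alpha = cov / var          # effect size of the variant
    beta = pm - alpha * gm     # baseline phenotype
    return alpha, beta

# Toy example: genotype coded 0/1/2 (copies of the minor allele).
genotype = [0, 1, 2, 0, 1, 2]
phenotype = [1.0, 2.1, 2.9, 0.9, 2.0, 3.1]   # roughly 1 + 1*g plus noise
alpha, beta = fit_ols(genotype, phenotype)
assert abs(alpha - 1.0) < 0.2 and abs(beta - 1.0) < 0.2
```

The point of the slide stands: the hard part is not the fit itself but modeling the ε terms (population structure, design, environment, assay noise) across every SNP × phenotype pair.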
17. © 2014 MapR Technologies 17
HUGE PROBLEM
COMBINATORIAL EXPLOSION
18. © 2014 MapR Technologies 18
What’s a Percolator?
• Google Percolator
– “Caffeine” update 2010
• Iterative, incremental prioritized
updates
• No batch processing
• Decouple computational results
from data size
Peng & Dabek, 2010. Large-scale Incremental Processing Using Distributed Transactions and Notifications
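A caricature of the idea in Python (my own sketch; the real Percolator is built on distributed transactions and notifications over Bigtable, not an in-memory dict):

```python
# Hypothetical caricature of incremental (percolator-style) processing:
# each new observation triggers a small update to derived state, instead
# of re-running a batch job over the whole data set.

class IncrementalCounts:
    def __init__(self):
        self.counts = {}        # derived state, kept continuously up to date
        self.dirty = set()      # keys whose downstream views need refresh

    def observe(self, key):
        """Notification handler: O(1) work per new record."""
        self.counts[key] = self.counts.get(key, 0) + 1
        self.dirty.add(key)     # downstream work is re-run only for this key

    def refresh(self):
        """(Re)process only what changed since the last refresh."""
        changed, self.dirty = self.dirty, set()
        return {k: self.counts[k] for k in changed}

idx = IncrementalCounts()
for snp in ["rs123", "rs456", "rs123"]:
    idx.observe(snp)
assert idx.refresh() == {"rs123": 2, "rs456": 1}
assert idx.refresh() == {}      # nothing changed, nothing reprocessed
```

This is what decouples computational cost from data size: work is proportional to the change, not to the corpus.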
19. © 2014 MapR Technologies 19
Solution: Percolate
Inputs: SNPs, experimental groupings, assay technologies, assayed phenotypes, annotations/ontologies
Denormalize and percolate: (re)prioritize & (re)process; buffer new models
Outputs: service queries, drive dashboards, create reports, denormalize for display
20. © 2014 MapR Technologies 20
Robot Scientist
Sparkes, et al. 2010. Towards Robot Scientists for autonomous scientific discovery
21. © 2014 MapR Technologies 21
Robot (Data?) Scientist
Sparkes, et al. 2010. Towards Robot Scientists for autonomous scientific discovery
22. © 2014 MapR Technologies 22© 2014 MapR Technologies
Genealogy Company
Slides credit: Bill Yetman, Hadoop Summit 2014
http://slidesha.re/1vRh3kY
23. © 2014 MapR Technologies 23
GERMLINE is…
• …an algorithm that finds hidden relationships within a pool of
DNA
• …the reference implementation of that algorithm written in C++.
• You can find it here:
http://www1.cs.columbia.edu/~gusev/germline/
24. © 2014 MapR Technologies 24
Projected GERMLINE run times (in hours)
[Chart: GERMLINE run times in hours vs. number of samples (2,500 → 122,500); measured and projected run times climb steeply to ~700 hours.]
700 hours = 29+ days
EXPONENTIAL COMPLEXITY
25. © 2014 MapR Technologies 25
GERMLINE: What’s the Problem?
• GERMLINE (the implementation) was not meant to be used in
an industrial setting
– Stateless, single threaded, prone to swapping (heavy memory usage)
– GERMLINE performs poorly on large data sets
• Our metrics predicted exactly where the process would slow to
a crawl
• Put simply: GERMLINE couldn't scale
26. © 2014 MapR Technologies 26
Run times for matching (in hours)
[Chart: run times for matching, in hours, vs. number of samples. GERMLINE run times (measured and projected) grow exponentially; after the HBase refactor, Jermline run times stay linear.]
27. © 2014 MapR Technologies 27
• Paper submitted describing the implementation
• Releasing as an Open Source project soon
• [HBase Schema/Algorithm Slides]
28. © 2014 MapR Technologies 28© 2014 MapR Technologies
Further Growth & Optimization
29. © 2014 MapR Technologies 29
Underdog (Strand Phasing) performance
– Went from 12 hours to process 1,000 samples to under 25 minutes with a MapReduce implementation, with improved accuracy!
Underdog replaces Beagle.
[Chart: total run size vs. total Beagle/Underdog duration, up to ~80,000 samples.]
30. © 2014 MapR Technologies 30
Pipeline steps and incremental change…
– Incremental change over time
– Supporting the business in a “just in time” Agile way
[Stacked chart: per-step pipeline duration vs. run size, from 500 to ~395,000 samples. Series: Beagle/Underdog phasing, pipeline finalize, relationship processing, Germline/Jermline results processing, Germline/Jermline processing, Beagle post-phasing, Admixture, Plink prep, pipeline initialization. Annotated milestones: Jermline replaces Germline; Ethnicity V2 release; Underdog replaces Beagle; AdMixture on Hadoop.]
31. © 2014 MapR Technologies 31
…while the business continues to grow rapidly
[Chart: DNA database size (number of processed samples), Jan 2012 → Apr 2014, growing to ~400,000+.]
32. © 2014 MapR Technologies 32© 2014 MapR Technologies
BigData App Development Lifecycle
33. © 2014 MapR Technologies 33
BigData App Development Lifecycle
input (1M rows) → tail | grep | sort | uniq -c → output
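For instance (a hypothetical log, with an awk stage added to project the field of interest; at 1M rows this kind of pipeline is perfectly adequate on one machine):

```shell
# Hypothetical prototype: count status codes in a small synthetic access log.
printf '%s\n' 'GET /a 200' 'GET /b 404' 'GET /c 200' > access.log
tail -n 1000 access.log | grep 'GET' | awk '{print $3}' | sort | uniq -c
```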
34. © 2014 MapR Technologies 34
Evolution of Data Storage
Functionality
Compatibility
Scalability
Linux
POSIX
Over decades of progress,
Unix-based systems have set the
standard for compatibility and
functionality
35. © 2014 MapR Technologies 35
BigData App Development Lifecycle
input (1M rows) → tail | grep | sort | uniq -c → output
36. © 2014 MapR Technologies 36
BigData App Development Lifecycle
input (1M rows → 1B rows) → tail | grep | sort | uniq -c → output
37. © 2014 MapR Technologies 37
BigData App Development Lifecycle
input (1M → 1B → 1T rows) → tail | grep | sort | uniq -c → output
38. © 2014 MapR Technologies 38
Evolution of Data Storage
Functionality
Compatibility
Scalability
Linux
POSIX
Hadoop
Hadoop achieves much higher scalability
by trading away essentially all of this
compatibility
39. © 2014 MapR Technologies 39
BigData App Development Lifecycle
tail | grep | sort | uniq -c
input (1T rows) → output, via the prototype pipeline
input (1T rows) → output, after a costly port to BigData tools ($$$$)
40. © 2014 MapR Technologies 40
Evolution of Data Storage
Functionality
Compatibility
Scalability
Linux
POSIX
Hadoop
MapR enhances Apache Hadoop by restoring
the compatibility while increasing scalability and
performance
41. © 2014 MapR Technologies 41
BigData App Development Lifecycle
input (1T rows) → tail | grep | sort | uniq -c → output
Port target: POSIX (NFS) vs. Hadoop HDFS
42. © 2014 MapR Technologies 42
BigData App Development Lifecycle
tail | grep | sort | uniq -c
Prototype tools dev cost: 1 per stage · use when possible
BigData tools dev cost: 100 per stage · use when needed
43. © 2014 MapR Technologies 43
BigData App Development Lifecycle
tail | grep | sort | uniq -c
Prototype tools dev cost: 1 for most stages · use when possible
BigData tools dev cost: 100 for the one stage that needs it · use when needed
44. © 2014 MapR Technologies 44© 2014 MapR Technologies
Aadhaar – World’s Largest Biometric
Database
45. © 2014 MapR Technologies 45
Largest Biometric Database in the World
1.2B PEOPLE
46. © 2014 MapR Technologies 46
India: Problem
• 1.2 billion residents
– 640,000 villages; ~60% live on under $2/day
– ~75% literacy; <3% pay income tax; <20% have bank accounts
– ~800 million mobile subscribers; ~200-300 million migrant workers
• Govt. spends about $25-40 billion on direct subsidies
– Residents have no standard identity document
– Most programs plagued with ghost and multiple identities causing
leakage of 30-40%
47. © 2014 MapR Technologies 47
India: Vision
• Create a common “national identity” for every “resident”
– Biometric backed identity to eliminate duplicates
– “Verifiable online identity” for portability
• Applications ecosystem using open APIs
– Aadhaar enabled bank account and payment platform
– Aadhaar enabled electronic, paperless KYC
• Enrolment
– One time in a person’s lifetime
– Multi-modal biometrics (fingerprints, iris)
48. © 2014 MapR Technologies 48
Aadhaar Biometric Capture & Index
49. © 2014 MapR Technologies 49
Aadhaar Biometric Capture & Index
50. © 2014 MapR Technologies 50
Aadhaar Biometric Capture & Index
51. © 2014 MapR Technologies 51
Architectural Principles
• Design for Scale
– Every component needs to scale to large volumes
– Millions of transactions and billions of records
– Accommodate failure and design for recovery
• Open Architecture
– Open Source
– Open APIs
• Security
– End-to-end security of resident data
52. © 2014 MapR Technologies 52
Design for Scale
• Horizontal scale-out
• Distributed computing
• Distributed data storage and partitioning
• No single points of failure
• No single points of bottleneck
• Asynchronous processing throughout the system
– Allows loose coupling of various components
– Allows independent component level scaling
53. © 2014 MapR Technologies 53
Aadhaar multi-DC data storage stack* (on the MapR Filesystem):
• ID + biometrics (M7 HBase): Enrollment ID API
• All raw packets (HDFS + NFS)
• ID + demographics + photo + benefits (MySQL, Solr): Authentication API, Authorization API
* as best I understand from public documents
54. © 2014 MapR Technologies 54
Enrollment Volume
• 600 to 800 million UIDs in 4 years
– 1 million a day
– 200+ trillion matches every day!!!
• ~5MB per resident
– Maps to about 10-15 PB of raw data (2048-bit PKI encrypted!)
– About 30 TB I/O every day
– Replication and backup across DCs of about 5+ TB of incremental data every day
– Lifecycle updates and new enrolments will continue forever
• Additional process data
– Several million events on an average moving through async channels (some
persistent and some transient)
– Needing complete update and insert guarantees across data stores
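A back-of-envelope check shows these volumes hang together (using the slide's own round numbers; the ×3 replication factor is my assumption):

```python
# Back-of-envelope check of the enrollment volumes on this slide.
residents_per_day = 1_000_000
mb_per_resident = 5                          # ~5MB of raw packets per resident

daily_ingest_tb = residents_per_day * mb_per_resident / 1_000_000
assert daily_ingest_tb == 5.0                # matches "5+ TB of incremental data" per day

total_residents = 700_000_000                # midpoint of 600-800 million UIDs
total_pb = total_residents * mb_per_resident / 1_000_000_000
replication = 3                              # assumed cross-DC replication factor
assert 10 <= total_pb * replication <= 15    # consistent with "10-15 PB of raw data"
```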
55. © 2014 MapR Technologies 55
Authentication Volume
• 100+ million authentications per day (10 hrs)
– Possible high variance on peak and average
– Sub second response
– Guaranteed audits
• Multi-DC architecture
– All changes needs to be propagated from enrolment data stores to all authentication
sites
• Authentication request is about 4 KB
– 100 million authentications a day
– 1 billion audit records in 10 days (30+ billion a year)
– 4 TB encrypted audit logs in 10 days
– Audit write must be guaranteed
56. © 2014 MapR Technologies 56
How Do Biometrics Relate to Genomics?
Data Shape and Size
• Aadhaar: 5MB of features (minutiae)
• Genome: ~3M features (variants)
Data Set Operations
• Aadhaar: ƒ(x) Unique feature subset => identity
• Genome: “ “ “ “ “
• Genome: Variant × Phenotype
Commonality => Causal Genes
ƒ-1(x) !
SNP DB 2º
Analytics
57. © 2014 MapR Technologies 57
Data Shape and Size
• Aadhaar: 5MB of features (minutiae)
• Genome: ~3M features (variants)
Data Set Operations
• Aadhaar: ƒ(x) Unique feature subset => identity
• Genome: “ “ “ “ “
• Genome: Variant × Phenotype
Commonality => Causal Genes
ƒ-1(x) !
Vector Pattern Matching
SNP DB 2º
Analytics
ƒ-1(x): common features
ƒ(x): unique features
ƒ(x): uncommon features
ƒ(x): other features
58. © 2014 MapR Technologies 58
Data Shape and Size
• Aadhaar: 5MB of features (minutiae)
• Genome: ~3M features (variants)
Data Set Operations
• Aadhaar: ƒ(x) Unique feature subset => identity
• Genome: “ “ “ “ “
• Genome: Variant × Phenotype
Commonality => Causal Genes
ƒ-1(x) !
Topological Pattern Matching
SNP DB 2º
Analytics
59. © 2014 MapR Technologies 59© 2014 MapR Technologies
MapR Platform
60. © 2014 MapR Technologies 60
Apache Hadoop NameNode High Availability (HA)
HDFS-based distributions:
• Single NameNode (blocks A-F, many DataNodes): single point of failure; limited to 50-200 million files; performance bottleneck; metadata must fit in memory.
• HDFS HA (primary + standby NameNode on a NAS appliance): only one active NameNode; limited to 50-200 million files; commercial NAS possibly needed; metadata must fit in memory; performance bottleneck; double the block reports.
• HDFS Federation (NameNodes partitioned across A-B, C-D, E-F): multiple single points of failure without HA; needs 20 NameNodes for 1 billion files; commercial NAS needed; metadata must fit in memory; performance bottleneck; double the block reports.
61. © 2014 MapR Technologies 61
No NameNode Architecture
The NameNode function is distributed across the cluster's DataNodes:
• Up to 1T files (>5000× advantage)
• Significantly less hardware & OpEx
• Higher performance
• No special config to enable HA
• Automatic failover & re-replication
• Metadata is persisted to disk
62. © 2014 MapR Technologies 62
MapR M7: The Best In-Hadoop Database
NoSQL columnar store · Apache HBase API · in-Hadoop database
Other distros: HBase (JVM) → HDFS (JVM) → ext3/ext4 → disks
MapR M7: tables/files → disks
The most scalable, enterprise-grade NoSQL database that supports online applications and analytics.
63. © 2014 MapR Technologies 63
MapR M7: The Best In-Hadoop Database
NoSQL columnar store · Apache HBase API · in-Hadoop database
BigData application →
Other distros: HBase interface (JVM) → HDFS interface (JVM) → ext3/ext4 → disks
MapR M7: tables/files → disks
The most scalable, enterprise-grade NoSQL database that supports online applications and analytics.
64. © 2014 MapR Technologies 64
HBase Apps: High Performance with Consistent Low Latency
[Chart: M7 read latency vs. others' read latency; M7 stays consistently low.]
65. © 2014 MapR Technologies 65© 2014 MapR Technologies
MapR Services
66. © 2014 MapR Technologies 66
Professional Services: installation, migrations, SLA plans, best practices, performance tuning
• Hadoop Core Services · audience: IT/infrastructure · skills: Linux, networking, data center, storage, operations
• Big Data Workflows (Hive/Pig, Oozie/Sqoop, Flume, M7/HBase, data flow) · audience: BI/DBA · skills: BI/ETL/reporting, scripting/Java, Hadoop MR, eco projects (HBase, Hive, …)
• Solution Design (HBase/M7, MapReduce, application development, integration development) · audience: Hadoop developer · skills: Java, architectural design
• Advanced Analytics (use-case discovery, use-case modeling, POC, workshops) · audience: modeler/analyst · skills: PhD statistics/math, MatLab/R/SAS, scripting/Java, BI/ETL/reporting
Spectrum: data engineering ↔ data science
67. © 2014 MapR Technologies 67
Global PS Resources
17 Today (+8 in Q3)
D.C.
Keys Botzum (DE/Security/Developer)
Joe Blue (Data Scientist)
Venkat Gunnup (DE/Development)
Alex Rodriguez (DE/Development)
Kannappan Sirchabesa (DE/OPS)
SAN JOSE
Wayne Cappas (Director/DE)
John Benninghoff (DE/OPS)
Dmitry Gomerman (DE/OPS & Security)
Ivan Bishop (DE/OPS)
James Caseletto (Data Scientist)
Sungwook Yoon (Data Scientist)
Sridhar Reddy (Director - M7/HBase)
LOS ANGELES
John Ewing (DE/OPS)
Marco Vasquez (Data Scientist/DE)
SOUTH CAROLINA
David Schexnayder (DE/OPS)
PHOENIX
Michael Farnbach (DE/OPS)
SINGAPORE
Allen Day (Data Scientist)
68. © 2014 MapR Technologies 68
Use Case Data Flow Example
Data sources: clickstream, billing data, mobile data, product catalog, social media, server logs, merchant listings, online chat, call detail records, set-top box data
→ Ingest: Sqoop, Flume, HDFS, NFS
→ MapR Data Platform
→ Processing and analytics: MapReduce v1 & v2, YARN, Pig, Cascading, Storm, Solr, Mahout, Oozie, Hive, MLlib, M7/HBase
→ Access: Tez, Drill, Hive, Pig, Impala
→ Visualization
69. © 2014 MapR Technologies 69
Engagement Types
• Customer engagement is typically 1-4 weeks (longer okay)
• Well established partners (15,000 resources globally)
• Custom training based on customer use-case
• Small 1-3 days workshops
• Extended support / Staff augmentation
70. © 2014 MapR Technologies 70
Q&A
Thanks!
twitter.com/allenday · slideshare.net/allenday · linkedin.com/in/allenday · aday@mapr.com
71. © 2014 MapR Technologies 71© 2014 MapR Technologies
An Overview of Apache Spark
72. © 2014 MapR Technologies 72
Agenda
• MapReduce Refresher
• What is Spark?
• The Difference with Spark
• Preexisting MapReduce
• Examples and Resources
73. © 2014 MapR Technologies 73© 2014 MapR Technologies
MapReduce Refresher
74. © 2014 MapR Technologies 74
MapReduce Basics
• Foundational model is based on a distributed file system
– Scalability and fault-tolerance
• Map
– Loading of the data and defining a set of keys
• Reduce
– Collects the organized key-based data to process and output
• Performance can be tweaked based on known details of your
source files and cluster shape (size, total number)
75. © 2014 MapR Technologies 75
Languages and Frameworks
• Languages
– Java, Scala, Clojure
– Python, Ruby
• Higher Level Languages
– Hive
– Pig
• Frameworks
– Cascading, Crunch
• DSLs
– Scalding, Scrunch, Scoobi, Cascalog
76. © 2014 MapR Technologies 76
MapReduce Processing Model
• Define mappers
• Shuffling is automatic
• Define reducers
• For complex work, chain jobs together
– Or use a higher level language or DSL that does this for you
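The model above can be sketched locally in plain Python (an illustration of the phases only, not the Hadoop API):

```python
from collections import defaultdict

# Local sketch of the MapReduce phases for word count (not Hadoop code).

def map_phase(records):
    """Map: load the data and emit (key, value) pairs."""
    for line in records:
        for word in line.split():
            yield (word, 1)

def shuffle(pairs):
    """Shuffle: group values by key (automatic in a real framework)."""
    groups = defaultdict(list)
    for k, v in pairs:
        groups[k].append(v)
    return groups

def reduce_phase(groups):
    """Reduce: collapse each key's values to an output record."""
    return {k: sum(vs) for k, vs in groups.items()}

counts = reduce_phase(shuffle(map_phase(["a b a", "b a"])))
assert counts == {"a": 3, "b": 2}
```

Chaining jobs means feeding one reduce's output into the next map; higher-level languages generate that chain for you.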
77. © 2014 MapR Technologies 77© 2014 MapR Technologies
What is Spark?
78. © 2014 MapR Technologies 78
Apache Spark
spark.apache.org
github.com/apache/spark
user@spark.apache.org
• Originally developed in 2009 in UC
Berkeley’s AMP Lab
• Fully open sourced in 2010 – now
a Top Level Project at the Apache
Software Foundation
80. © 2014 MapR Technologies 80
Spark is the Most Active Open Source Project in Big Data
[Bar chart: project contributors in the past year. Spark leads by a wide margin over Giraph, Storm, and Tez (scale 0-140).]
81. © 2014 MapR Technologies 81
Unified Platform
Spark (general execution engine), with:
• Shark (SQL)
• Spark Streaming (streaming)
• MLlib (machine learning)
• GraphX (graph computation)
Continued innovation bringing new functionality, e.g.:
• Java 8 (closures, lambda expressions)
• Spark SQL (SQL on Spark, not just Hive)
• BlinkDB (approximate queries)
• SparkR (R wrapper for Spark)
82. © 2014 MapR Technologies 82
Supported Languages
• Java
• Scala
• Python
• Hive?
83. © 2014 MapR Technologies 83
Data Sources
• Local Files
– file:///opt/httpd/logs/access_log
• S3
• Hadoop Distributed Filesystem
– Regular files, sequence files, any other Hadoop InputFormat
• HBase
84. © 2014 MapR Technologies 84
Machine Learning - MLlib
• K-Means
• L1 and L2-regularized Linear Regression
• L1 and L2-regularized Logistic Regression
• Alternating Least Squares
• Naive Bayes
• Stochastic Gradient Descent
* As of May 14, 2014
** Don’t be surprised if you see the Mahout library converting to Spark soon
85. © 2014 MapR Technologies 85© 2014 MapR Technologies
The Difference with Spark
86. © 2014 MapR Technologies 86
Easy and Fast Big Data
• Easy to Develop
– Rich APIs in Java, Scala,
Python
– Interactive shell
• Fast to Run
– General execution graphs
– In-memory storage
2-5× less code · up to 10× faster on disk, 100× in memory
87. © 2014 MapR Technologies 87
Resilient Distributed Datasets (RDD)
• Spark revolves around RDDs
• Fault-tolerant collection of elements that can be operated on in
parallel
– Parallelized Collection: Scala collection which is run in parallel
– Hadoop Dataset: records of files supported by Hadoop
http://www.cs.berkeley.edu/~matei/papers/2012/nsdi_spark.pdf
88. © 2014 MapR Technologies 88
RDD Operations
• Transformations
– Creation of a new dataset from an existing
• map, filter, distinct, union, sample, groupByKey, join, etc…
• Actions
– Return a value after running a computation
• collect, count, first, takeSample, foreach, etc…
Check the documentation for a complete list
http://spark.apache.org/docs/latest/scala-programming-guide.html#rdd-operations
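The lazy-transformation vs. eager-action split can be mimicked with Python generators (a local analogy only; in Spark this would be `rdd.filter(...).map(...)` followed by `collect()` or `count()`):

```python
# Local analogy for RDD laziness: "transformations" build a pipeline of
# generators and do no work; only the "action" at the end forces execution.

data = ["ERROR a", "INFO b", "ERROR c"]

# "Transformations": nothing is computed yet.
errors = (line for line in data if line.startswith("ERROR"))   # ~ rdd.filter
fields = (line.split()[1] for line in errors)                  # ~ rdd.map

# "Action": pulls data through the whole pipeline at once.
result = list(fields)                                          # ~ rdd.collect()
assert result == ["a", "c"]
```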
89. © 2014 MapR Technologies 89
RDD Persistence / Caching
• Variety of storage levels
– memory_only (default), memory_and_disk, etc…
• API Calls
– persist(StorageLevel)
– cache() – shorthand for persist(StorageLevel.MEMORY_ONLY)
• Considerations
– Read from disk vs. recompute (memory_and_disk)
– Total memory storage size (memory_only_ser)
– Replicate to second node for faster fault recovery (memory_only_2)
• Think about this option if supporting a web application
http://spark.apache.org/docs/latest/scala-programming-guide.html#rdd-persistence
90. © 2014 MapR Technologies 90
Cache Scaling Matters
[Bar chart: execution time (s) vs. % of working set in cache: 69s with cache disabled, 58s at 25%, 41s at 50%, 30s at 75%, 12s fully cached.]
91. © 2014 MapR Technologies 91
Directed Acyclic Graph (DAG)
• Directed
– Only in a single direction
• Acyclic
– No looping
• Why does this matter?
– This supports fault-tolerance
92. © 2014 MapR Technologies 92
RDD Fault Recovery
RDDs track lineage information that can be used to efficiently
recompute lost data
msgs = textFile.filter(lambda s: s.startswith("ERROR"))
               .map(lambda s: s.split("\t")[2])
HDFS File Filtered RDD Mapped RDD
filter
(func = startsWith(…))
map
(func = split(...))
93. © 2014 MapR Technologies 93
Comparison to Storm
• Higher throughput than Storm
– Spark Streaming: 670k records/sec/node
– Storm: 115k records/sec/node
– Commercial systems: 100-500k records/sec/node
[Charts: throughput per node (MB/s) vs. record size (bytes, 100-1000) for WordCount and Grep. Spark outperforms Storm on both benchmarks.]
94. © 2014 MapR Technologies 94
Interactive Shell
• Iterative Development
– Cache those RDDs
– Open the shell and ask questions
• We have all wished we could do this with MapReduce
– Compile / save your code for scheduled jobs later
• Scala – spark-shell
• Python – pyspark
95. © 2014 MapR Technologies 95
The Game Changer!
96. © 2014 MapR Technologies 96© 2014 MapR Technologies
Preexisting MapReduce
97. © 2014 MapR Technologies 97
Existing Jobs
• Java MapReduce
– Port them over if you need better performance
• Be sure to share the results and learnings
• Pig Scripts
– Port them over
– Try SPORK!
• Hive Queries….
98. © 2014 MapR Technologies 98
Shark – SQL over Spark
• Hive-compatible (HiveQL, UDFs, metadata)
– Works in existing Hive warehouses without changing queries or data!
• Augments Hive
– In-memory tables and columnar memory store
• Fast execution engine
– Uses Spark as the underlying execution engine
– Low-latency, interactive queries
– Scale-out and tolerates worker failures
99. © 2014 MapR Technologies 99© 2014 MapR Technologies
Examples and Resources
100. © 2014 MapR Technologies 100
JavaSparkContext sc = new JavaSparkContext(master, appName, [sparkHome], [jars]);
JavaRDD<String> file = sc.textFile("hdfs://...");
JavaRDD<String> counts = file.flatMap(line -> Arrays.asList(line.split(" ")))
.mapToPair(w -> new Tuple2<String, Integer>(w, 1))
.reduceByKey((x, y) -> x + y);
counts.saveAsTextFile("hdfs://...");
val sc = new SparkContext(master, appName, [sparkHome], [jars])
val file = sc.textFile("hdfs://...")
val counts = file.flatMap(line => line.split(" "))
.map(word => (word, 1))
.reduceByKey(_ + _)
counts.saveAsTextFile("hdfs://...")
Word Count
• Java MapReduce (~15 lines of code)
• Java Spark (~ 7 lines of code)
• Scala and Python (4 lines of code)
– interactive shell: skip line 1 and replace the last line with counts.collect()
• Java8 (4 lines of code)
101. © 2014 MapR Technologies 101
Network Word Count – Streaming
// Create the context with a 1 second batch size
val ssc = new StreamingContext(args(0), "NetworkWordCount", Seconds(1),
System.getenv("SPARK_HOME"), StreamingContext.jarOfClass(this.getClass))
// Create a NetworkInputDStream on target host:port and count the
// words in the input stream of \n-delimited text (e.g. generated by 'nc')
val lines = ssc.socketTextStream("localhost", 9999, StorageLevel.MEMORY_ONLY_SER)
val words = lines.flatMap(_.split(" "))
val wordCounts = words.map(x => (x, 1)).reduceByKey(_ + _)
wordCounts.print()
ssc.start()
102. © 2014 MapR Technologies 102
Deploying Spark – Cluster Manager Types
• Standalone mode
– Comes bundled (EC2 capable)
• YARN
• Mesos
103. © 2014 MapR Technologies 103
Remember
• If you want to use a new technology you must learn that new
technology
• For those who have been using Hadoop for a while, at one time
you had to learn all about MapReduce and how to manage and
tune it
• To get the most out of a new technology you need to learn that
technology, this includes tuning
– There are switches you can use to optimize your work
104. © 2014 MapR Technologies 104
Configuration
http://spark.apache.org/docs/latest/
Most Important
• Application Configuration
http://spark.apache.org/docs/latest/configuration.html
• Standalone Cluster Configuration
http://spark.apache.org/docs/latest/spark-standalone.html
• Tuning Guide
http://spark.apache.org/docs/latest/tuning.html
105. © 2014 MapR Technologies 105
Resources
• Pig on Spark
– http://apache-spark-user-list.1001560.n3.nabble.com/Pig-on-Spark-td2367.html
– https://github.com/aniket486/pig
– https://github.com/twitter/pig/tree/spork
– http://docs.sigmoidanalytics.com/index.php/Setting_up_spork_with_spark_0.8.1
– https://github.com/sigmoidanalytics/pig/tree/spork-hadoopasm-fix
• Latest on Spark
– http://databricks.com/categories/spark/
– http://www.spark-stack.org/
106. © 2014 MapR Technologies 106
• San Francisco
June 30 – July 2
• Use Cases
• Tech Talks
• Training
http://spark-summit.org/
107. © 2014 MapR Technologies 107
Q&A
@mapr maprtech
jscott@mapr.com
Engage with us!
MapR
maprtech
mapr-technologies
Editor's notes: Graph of each step in the pipeline for every run. This graph shows how important it is to measure everything. Some steps have been greatly reduced or eliminated. Light blue is the matching step. You can see it going quadratic and then the change when 'J' Jermline was released.
Gives up random access read on files
Gives up strong authentication / authorization model
Gives up random access write / append on files. Historically, the NameNode in Hadoop is a single point of failure, a scalability limitation, and a performance bottleneck.
Other distributions you have a bottleneck regardless of the number of nodes in the cluster. With other distributions the most number of files that you can support is 200M at the maximum and that is with an extremely high end server. 50% of the processing of Hadoop in Facebook is to pack and unpack files to try to work around this limitation.
As you add more nodes to your cluster and want to configure HA, you have to add expensive NAS and have warm standby’s for the NN and related metadata which is persisted in memory. Even more, once you surpass the file limit in HDFS, you have to have region NameNode servers to support those additional nodes. A “federated NameNode approach”.
Think of the additional dedicated hardware and configurations/administration required to set up NameNode HA in Hadoop! And this is ONLY for NameNode HA. What if you could distribute the NameNode metadata and have it share resources in your cluster? What if Hadoop was a truly distributed environment?
With MapR there is no dedicated NameNode. The NameNode function is distributed across the cluster. This provides major advantages in terms of HA, data loss avoidance, scalability and performance.
(advantages of this approach are called out on the left and right side of the diagram Because of architecture.
Apache Hbase runs in a JVM which read/writes to HDFS which is also running is a separate JVM, storing data in the Linux OS which is reading and writing to disk.
As data is collected, it needs to be written to disk and “compacted” (i.e, maintenance is performed), this introduces many layers and steps that need to happen
MapR M7 has integrated tables and files which are a true file system, reading and writing directly on disks.
MapR M7 is a tightly integrated, in-Hadoop database which is NoSQL, columnar store which is 100% Apache Hbase API compatible Because of architecture.
Apache Hbase runs in a JVM which read/writes to HDFS which is also running is a separate JVM, storing data in the Linux OS which is reading and writing to disk.
As data is collected, it needs to be written to disk and “compacted” (i.e, maintenance is performed), this introduces many layers and steps that need to happen
MapR M7 has integrated tables and files which are a true file system, reading and writing directly on disks.
MapR M7 is a tightly integrated, in-Hadoop database which is NoSQL, columnar store which is 100% Apache Hbase API compatible **Consistent** low latency on read due to compactions
Recall Aadhar
Why? Spark is really cool… When do you use regular mapreduce over higher level languages? When Hive? When Pig? When anything? You can find Project Resources on the Apache. You’ll also find information about the mailing list there (including archives) Yahoo and Adobe are in production with Spark. This sounds a lot like the reason to consider Pig vs. Java MapReduce Gracefully Looks kind of like a source control tree You can import the MLlib to use here in the shell! Best use case? Standalone followed by Mesos… My personal opinion is that Mesos is where the future will take us. Don’t forget to share your experiences. This is really what the community is about.
Don’t have time to contribute to open source, use it and share your experiences! This isn’t all proven out yet, but some of it should just work already. This is a really simple example. Reality is 22 chromosomes and 96 characters in a word
‘G’ Germline would have to rebuild the hash table for all samples and then re-run all comparisons. An all by all comparison
This is where HBase shines. It is easy to add columns and rows, very efficient with empty cells (sparse matrix). Hammer HBase with multiple processes doing this at the same time.