Hadoop world overview trends and topics

Trends and Topics
Valentyn Kropov
Solutions Architect, SAG, SoftServe

Agenda
1. Conference Overview
2. Bright Future of Hadoop MapReduce
3. Apache Spark Data Frames
4. Cloudera Kudu
5. Most Popular Reference Architecture
6. Use Cases

#2
Bright Future of
Hadoop MapReduce

Spark is a Future
Cloudera Announces One Platform Initiative (Sep 9, 2015)

Spark is a Present
It appeared in 72% of presentations and use cases
at the Hadoop World conference

Spark is Easier to Code
MapReduce / Java vs. Spark / Scala

Spark is Faster
Up to 100x
faster!

And they have Power 
• 400 contributors
• From 100+ companies
• Databricks (1 year old, grew from 30 to 100 people, $47 million)
• Cloudera (370 patches, 43k lines of code)

Cloudera One Platform:
Read More
http://goo.gl/jSK0h6

Most Data is Still Structured!
• No Sorting?
• No Joins?
• No Aggregations?
• No Filtering?
• No cross-DB connections?

Data Frame is…
• API
• like a Table (RDBMS)
• or Data Frame (Python/R)
• Abstraction layer over RDD

Construct Data Frame
# Constructs a DataFrame from Hive
users = context.table("users")
# from JSON files in S3
logs = context.load("s3n://data.json", "json")

Filtering
# Create a new DataFrame that contains "young users" only
young = users.filter(users.age < 21)

Group By
# Count the number of young users by gender
young.groupBy("gender").count()

Joins!
# Join users with another DataFrame called logs
users.join(logs, logs.userId == users.userId, "left_outer")
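
For readers without a Spark cluster at hand, the filter, group-by, and join calls above can be mimicked on plain Python lists of dicts. This is only a sketch of what the three operations compute, not of how Spark executes them; the sample rows and the helper name left_outer_join are made up for illustration.

```python
from collections import Counter

# Hypothetical sample rows standing in for the "users" and "logs" tables.
users = [
    {"userId": 1, "age": 19, "gender": "F"},
    {"userId": 2, "age": 34, "gender": "M"},
    {"userId": 3, "age": 20, "gender": "M"},
]
logs = [
    {"userId": 1, "page": "/home"},
    {"userId": 1, "page": "/cart"},
    {"userId": 2, "page": "/home"},
]

# Equivalent of users.filter(users.age < 21)
young = [u for u in users if u["age"] < 21]

# Equivalent of young.groupBy("gender").count()
young_by_gender = Counter(u["gender"] for u in young)


def left_outer_join(left, right, key):
    """Keep every left row; pair it with each matching right row, or with None."""
    joined = []
    for left_row in left:
        matches = [r for r in right if r[key] == left_row[key]]
        # A left row without matches still appears once, paired with None.
        for right_row in matches or [None]:
            joined.append((left_row, right_row))
    return joined


# Equivalent of users.join(logs, logs.userId == users.userId, "left_outer")
user_logs = left_outer_join(users, logs, "userId")
```

The left outer join keeps the user with no log entries, which an inner join would drop.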

Spark Languages
Spark Survey 2015

Data Frames and Python
• Compiled into JVM bytecode
• Data Never Leaves the JVM
• Python passes commands only
• Commands are pushed down

Data Frames: Read More
http://www.slideshare.net/JonHaddad/enter-the-snake-pit-for-fast-and-easy-spark

What’s Kudu?
• Columnar Storage for Hadoop
• Not just a file-format
• Supports low-latency random access (ms)
• Good alternative for Impala + Parquet
• Integrates with Spark, Hadoop, Impala
• It’s in Beta now

Kudu: use-cases
• Write: newly arrived data is immediately available to users
• Time-series applications that need to support both random and scattered reads

Kudu: Read More
http://getkudu.io/

#5
Most Popular
Reference Architecture

Reference Architecture
Yarn (90%)
Mesos (10%)

Kafka
• Highly-scalable
• Fault-tolerant (commit-log)
• Partition-based load-balancing
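
The idea behind partition-based load balancing can be sketched in a few lines of plain Python: messages with the same key always land in the same partition, and partitions are spread over the consumers in a group. This is an illustration only, not Kafka client code; the function names are hypothetical, and CRC32 stands in for whatever hash a real Kafka client uses.

```python
import zlib


def choose_partition(key: str, num_partitions: int) -> int:
    """Map a message key to a partition; equal keys always share a partition."""
    # CRC32 is a stand-in hash for this sketch, not Kafka's own partitioner.
    return zlib.crc32(key.encode("utf-8")) % num_partitions


def assign_partitions(num_partitions: int, consumers: list) -> dict:
    """Spread partitions round-robin over the consumers of one group."""
    assignment = {consumer: [] for consumer in consumers}
    for partition in range(num_partitions):
        owner = consumers[partition % len(consumers)]
        assignment[owner].append(partition)
    return assignment


# Six partitions shared by three (hypothetical) consumers: two partitions each.
assignment = assign_partitions(6, ["consumer-a", "consumer-b", "consumer-c"])
```

Adding a consumer to the group and calling assign_partitions again gives each consumer fewer partitions, which is the scaling lever the slide refers to.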

Spark Streaming
• Processes data in micro-batches (DStream, sliding windows)
• Supports data locality with Cassandra
• Real-time data science (Data Frames, MLlib)
• BI Support (Spark SQL)
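
Micro-batching and sliding windows can be illustrated without Spark. The sketch below groups timestamped events into fixed-length batches and then builds windows that span several batches and advance by a given number of batches. The function names and sample events are assumptions made for this example; Spark Streaming's DStream API does this work on a cluster.

```python
def micro_batches(events, batch_interval):
    """Group (timestamp, value) events into consecutive fixed-length batches."""
    by_index = {}
    for timestamp, value in events:
        by_index.setdefault(int(timestamp // batch_interval), []).append(value)
    if not by_index:
        return []
    # Intervals with no events still produce an (empty) batch.
    return [by_index.get(i, []) for i in range(max(by_index) + 1)]


def sliding_windows(batches, window_length, slide):
    """Every `slide` batches, emit the values of the last `window_length` batches."""
    windows = []
    for end in range(slide, len(batches) + 1, slide):
        start = max(0, end - window_length)
        windows.append([v for batch in batches[start:end] for v in batch])
    return windows


# Hypothetical events: (seconds since start, payload).
events = [(0.2, "a"), (0.7, "b"), (1.1, "c"), (2.5, "d"), (3.9, "e")]
batches = micro_batches(events, batch_interval=1.0)
windows = sliding_windows(batches, window_length=3, slide=2)
```

Here the window covers three batches and moves forward two batches at a time, so consecutive windows overlap by one batch.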

Cassandra
• No SPOF
• Masterless (easy operations and scaling)
• Replicates data across data-centers
• Most mature and fast growing
• Evolves into NewSQL (transactions)
• SQL-like CQL

Spark
• Is Awesome for Analytics (both
real-time and batch)

Reference Architecture:
Read More
http://www.datastax.com/dev/blog/streaming-big-data-with-spark-spark-streaming-kafka-cassandra-and-akka

Netflix: Size
• 20 PB DW on S3
• Read ~10% of data daily
• Write ~10% of read data daily
• 500 billion events daily
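
To put these figures in perspective, a rough back-of-the-envelope calculation follows. It assumes decimal units (1 PB = 1000 TB), reads "write ~10% of read data" as a write volume equal to 10% of the daily read volume, and assumes events arrive evenly across the day; none of these assumptions come from the talk itself.

```python
# Figures from the slide.
WAREHOUSE_PB = 20
READ_FRACTION = 0.10            # ~10% of the warehouse is read daily
WRITE_FRACTION_OF_READ = 0.10   # assumed: writes are ~10% of the read volume
EVENTS_PER_DAY = 500e9

SECONDS_PER_DAY = 24 * 60 * 60
TB_PER_PB = 1000                # assumed decimal units

# Roughly 2 PB read per day.
read_pb_per_day = WAREHOUSE_PB * READ_FRACTION

# Roughly 200 TB written per day.
written_tb_per_day = read_pb_per_day * WRITE_FRACTION_OF_READ * TB_PER_PB

# Roughly 5.8 million events per second on average.
events_per_second = EVENTS_PER_DAY / SECONDS_PER_DAY
```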

Netflix: Analyze
• 300 Data Scientists
• Python, R, Scala, etc.

Netflix: Compute and
Storage
• Separate Compute and
storage (S3)
• To have heterogeneous
clusters
• And no-downtime upgrades

Netflix: Read More
http://strataconf.com/big-data-conference-ny-2015/public/schedule/detail/43373

Mission Orion: Size
• 350k measurands
• 2TB / hour
• 1200 telemetry sensors
• 3 x 1GB networks busy
• Data retention is 25 years 

Mach-5 Data Ingest for Orion
[Architecture diagram; only its labels survive]
• Stages: Data Reader/Simulator, Ingest, Storage, Analytics, Web/UI
• Components: C++ client (reads packets and decommutates Tlm data), Splitter + Decom (GDS), mach5-sample (Spark), Kafka message bus, Deduplication (Spark), HBase Writer (Spark), HBase, HDFS HFiles (HBase-RDD), Aggregation (Spark), Alerting (Spark), Limit Checking (Spark), Tomcat, Glassfish, etc., Trace, FOSS widgets
• Data exchanged: packet measurands (GPBs). A packet measurands GPB file represents one or more packets and contains decommutated measurands: header metadata followed by records of the form apid:seqctr:time: value1 … apid:seqctr:time: valueN
Lockheed Martin Proprietary Information
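
The deduplication step named in the Orion ingest pipeline can be sketched in plain Python. The sketch assumes that a record looks like "apid:seqctr:time: value", that the time field itself contains no colon, and that (apid, seqctr, time) identifies a measurand uniquely. These are guesses drawn from the record layout on the slide, not the actual Mach-5 logic, and the function names are hypothetical.

```python
def parse_measurand(line):
    """Split 'apid:seqctr:time: value' into a (key, value) pair."""
    # Assumes exactly three colon-separated key fields before the value.
    apid, seqctr, time, value = (part.strip() for part in line.split(":", 3))
    return (apid, seqctr, time), value


def deduplicate(lines):
    """Keep the first record seen for each (apid, seqctr, time) key."""
    seen = set()
    unique = []
    for line in lines:
        key, value = parse_measurand(line)
        if key not in seen:
            seen.add(key)
            unique.append((key, value))
    return unique


# Made-up sample records; the third repeats the first.
sample = [
    "100:1:1000: 3.14",
    "100:2:1001: 2.71",
    "100:1:1000: 3.14",
]
unique_records = deduplicate(sample)
```

In a Spark job the same effect would typically come from keying records and keeping one per key, but the in-memory set above is enough to show the idea.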

Orion: Read More
http://strataconf.com/big-data-conference-ny-2015/public/schedule/detail/43181

Thanks!
valentine.kropov@gmail.com
