Valentyn Kropov, Big Data Solutions Architect, recently attended "Hadoop World / Strata", the biggest and coolest Big Data conference in the world, and he can't wait to share fresh trends and topics straight from New York. Come and learn how a Hadoop cluster will help NASA explore Mars, how Netflix built a 10 PB platform, what the latest trends in Spark are, what Kudu, the newly announced storage engine from Cloudera, has to offer, and much more.
2. Agenda
1. Conference Overview
2. Bright Future of Hadoop MapReduce
3. Apache Spark Data Frames
4. Cloudera Kudu
5. Most Popular Reference Architecture
6. Use Cases
15. And they have Power
• 400 contributors
• From 100+ companies
• Databricks (1 y.o., 30->100 people, $47 million)
• Cloudera (370 patches, 43k lines of code)
18. Most of the Data is Still Structured!
• No Sorting?
• No Joins?
• No Aggregations?
• No Filtering?
• No cross-DB connections?
19. Data Frame is…
• API
• like a Table (RDBMS)
• or Data Frame (Python/R)
• Abstraction layer over RDD
20. Construct Data Frame
# Constructs a DataFrame from Hive
users = context.table("users")
# from JSON files in S3
logs = context.load("s3n://data.json", "json")
21. Filtering
# Create a new DataFrame that contains "young users" only
young = users.filter(users.age < 21)
22. Group By
# Count the number of young users by gender
young.groupBy("gender").count()
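To make the filter and group-by semantics concrete without a cluster, here is a minimal plain-Python sketch of the same two steps. The sample rows and the `userId` field values are hypothetical, and this only mirrors what the DataFrame calls compute, not how Spark executes them.

```python
from collections import Counter

# Hypothetical sample rows standing in for the "users" table.
users = [
    {"userId": 1, "age": 19, "gender": "F"},
    {"userId": 2, "age": 34, "gender": "M"},
    {"userId": 3, "age": 20, "gender": "M"},
    {"userId": 4, "age": 17, "gender": "F"},
]

# Plain-Python counterpart of users.filter(users.age < 21)
young = [row for row in users if row["age"] < 21]

# Plain-Python counterpart of young.groupBy("gender").count()
young_by_gender = Counter(row["gender"] for row in young)

print(dict(young_by_gender))
```

With a real DataFrame the same logic is declared once and the engine decides how to distribute it; the list comprehension and `Counter` here run on a single machine.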
23. Joins!
# Join users with another DataFrame called logs
users.join(logs, logs.userId == users.userId, "left_outer")
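The "left_outer" join type keeps every row of the left side (users) even when no log entry matches. The sketch below shows that behaviour in plain Python; the helper `left_outer_join` and the sample rows are hypothetical and exist only for illustration.

```python
from collections import defaultdict

# Hypothetical sample rows; only userId is taken from the slides.
users = [{"userId": 1, "name": "ann"}, {"userId": 2, "name": "bob"}]
logs = [
    {"userId": 1, "page": "/home"},
    {"userId": 1, "page": "/cart"},
    {"userId": 3, "page": "/faq"},
]

def left_outer_join(left, right, key):
    """Pair every left row with each matching right row, or with None."""
    # Index the right side by key so each lookup is a dictionary access.
    index = defaultdict(list)
    for right_row in right:
        index[right_row[key]].append(right_row)

    joined = []
    for left_row in left:
        matches = index.get(left_row[key])
        if matches:
            joined.extend((left_row, right_row) for right_row in matches)
        else:
            # No match: the left row is kept, the right side is empty.
            joined.append((left_row, None))
    return joined

result = left_outer_join(users, logs, "userId")
for user, log in result:
    print(user["name"], log["page"] if log else None)
```

Note that the log entry for userId 3 disappears from the result because it has no matching user, while "bob" is kept with an empty right side.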
30. What’s Kudu?
• Columnar Storage for Hadoop
• Not just a file-format
• Supports low-latency random access (ms)
• Good alternative for Impala + Parquet
• Integrates with Spark, Hadoop, Impala
• It’s in Beta now
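The combination of columnar storage and low-latency random access can be illustrated with a toy in-memory table. This is a sketch of the general idea only, under the assumption that rows are addressed by a primary key; the class `ColumnarTable` and its methods are hypothetical and are not Kudu's API or its internal design.

```python
class ColumnarTable:
    """Toy column-oriented table with a primary-key index (illustration only)."""

    def __init__(self, columns, key):
        self.key = key
        # One list per column: values of the same column sit next to each other.
        self.columns = {name: [] for name in columns}
        # Primary key -> row position, which is what makes point lookups cheap.
        self.row_of_key = {}

    def insert(self, row):
        if row[self.key] in self.row_of_key:
            raise ValueError("duplicate primary key")
        self.row_of_key[row[self.key]] = len(self.columns[self.key])
        for name, values in self.columns.items():
            values.append(row[name])

    def get(self, key_value):
        """Random access: rebuild a single row from its position."""
        pos = self.row_of_key[key_value]
        return {name: values[pos] for name, values in self.columns.items()}

    def scan(self, column):
        """Analytic access: read one column without touching the others."""
        return list(self.columns[column])


table = ColumnarTable(columns=["id", "sensor", "value"], key="id")
table.insert({"id": 1, "sensor": "a", "value": 10})
table.insert({"id": 2, "sensor": "b", "value": 32})

print(table.get(2))
print(sum(table.scan("value")))
```

A scan reads a single column list, which is the access pattern that favours columnar layouts, while `get` goes through the key index, which is the access pattern a plain file format does not offer.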
33. Kudu: use-cases
• Write: newly arrived data is immediately available to users
• Time-series applications that need to support both random and scattered reads
38. Spark Streaming
• Processes data in micro-batches (DStream, sliding windows)
• Supports data locality with Cassandra
• Real-time data science (Data Frames, MLlib)
• BI Support (Spark SQL)
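The micro-batch and sliding-window model mentioned in the first bullet can be sketched in plain Python. The functions `micro_batches` and `sliding_windows` are hypothetical helpers, timestamps are assumed to be non-negative seconds, and the code only imitates the grouping logic, not Spark Streaming itself.

```python
def micro_batches(events, batch_interval):
    """Group (timestamp, value) events into consecutive fixed-length batches."""
    batches = {}
    for timestamp, value in events:
        batches.setdefault(int(timestamp // batch_interval), []).append(value)
    if not batches:
        return []
    # Intervals with no events still produce an (empty) batch.
    return [batches.get(i, []) for i in range(max(batches) + 1)]


def sliding_windows(batches, window_length, slide):
    """Every `slide` batches, emit the values of the last `window_length` batches."""
    windows = []
    for end in range(slide, len(batches) + 1, slide):
        start = max(0, end - window_length)
        windows.append([v for batch in batches[start:end] for v in batch])
    return windows


# Hypothetical event stream: (seconds since start, payload).
events = [(0.2, "a"), (0.7, "b"), (1.1, "c"), (2.5, "d"), (3.9, "e")]

batches = micro_batches(events, batch_interval=1)
windows = sliding_windows(batches, window_length=2, slide=1)

print(batches)
print(windows)
```

Here the window length and slide are expressed in numbers of batches; each window overlaps the previous one whenever the slide is shorter than the window.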
39. Cassandra
• No SPOF
• Masterless (easy operations and scaling)
• Replicates data across data-centers
• Most mature and fast growing
• Evolves into NewSQL (transactions)
• SQL-like CQL