What do you talk about to a hall full of database gurus? Instead of the science, my talk focused on the art. What made Hadoop successful? What can we learn from it? What principles work well in building software for large-scale services? What are some interesting unsolved problems in a world overrun by open source (and VC investments :-))?
1. The Meta of Hadoop
Joydeep Sen Sarma
Ex-Facebook DI Lead, Founder Qubole
2. Intro
• File/Database Systems developer (ex-NetApp/Oracle)
• Yahoo (2005-07), Facebook (2007-11)
• @Facebook:
– SysAdmin: operated massive Hadoop/Hive installs
– Architect: conceived/wrote Apache Hive, made HBase@FB happen
– Herded cats: first manager of Data Infra team
– IT engineer/DBA: built ETL tools, warehouse/reporting for FB Virtual Currency
• Founder Qubole Inc. (2011-)
3. Why Hadoop Succeeded
• Complete Solution and Extensible
– useful to Engineers, Data Scientists, Analysts
– Performance isn’t everything
– Agile: businesses move much faster than before
• Market Dynamics
– Captive Super-Reference Customer – Yahoo
– Had the early market to itself for a long time
• Separation of Compute and Storage
– Parallel Computing != Database
4. Why Hadoop Succeeded
• Data Consolidation!
– Just store everything in HDFS
– MR/Hive/Pig can chew anything
• Lights Out Architecture
– Low System Operational Cost
– Low Data Management Cost
• Don’t need Data Priests
6. Adaptive Lights-Out Software
• Successful efforts:
– Automatic map-join/skew join implementations
– Automatic local mode, resource cache
• Failed:
– Statistics: manual alter table/analyze table
– Pre-Bucketing tables
Learning Frameworks for Systems Software
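
One way to picture the "automatic map-join" idea is a join that picks its own strategy from the observed input sizes, so nobody has to supply a hint. The sketch below is a toy illustration in Python, not Hive's implementation; the names (`auto_join`, `hash_join`, `partitioned_join`) and the row-count threshold `SMALL_TABLE_ROWS` are assumptions made up for this example.

```python
from collections import defaultdict

# Hypothetical threshold: at or below this many rows, the smaller input is held in memory.
SMALL_TABLE_ROWS = 1000


def hash_join(small, large, key):
    """Map-side style join: index the small input in memory, stream the large one."""
    index = defaultdict(list)
    for row in small:
        index[row[key]].append(row)
    for row in large:
        for match in index.get(row[key], []):
            yield {**match, **row}


def partitioned_join(left, right, key, partitions=4):
    """Shuffle style join: partition both inputs by key, then join partition by partition."""
    buckets_left = defaultdict(list)
    buckets_right = defaultdict(list)
    for row in left:
        buckets_left[hash(row[key]) % partitions].append(row)
    for row in right:
        buckets_right[hash(row[key]) % partitions].append(row)
    for p in range(partitions):
        yield from hash_join(buckets_left[p], buckets_right[p], key)


def auto_join(left, right, key):
    """Choose the strategy from input sizes alone; the caller gives no hint."""
    small, large = (left, right) if len(left) <= len(right) else (right, left)
    if len(small) <= SMALL_TABLE_ROWS:
        return "map-join", list(hash_join(small, large, key))
    return "shuffle-join", list(partitioned_join(left, right, key))
```

The contrast with the "Failed" items is that nothing here depends on a step the user must remember to run beforehand.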
7. Adaptive Lights-Out Software
• Caching + Prefetching is Adaptive
– Replication is not
– Can bridge gap between Compute and Storage
• Page Cache over Disk >> In-memory
– Degrades gracefully
• Provide APIs – not packages
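
To make "Caching + Prefetching is Adaptive" concrete, here is a minimal sketch of a read-through LRU cache with sequential read-ahead. It is an illustration only: the class name `PrefetchingCache`, the `fetch` callback, and the `readahead` parameter are hypothetical, and it assumes integer block ids and a capacity of at least one block.

```python
from collections import OrderedDict


class PrefetchingCache:
    """Read-through LRU cache that also pulls in the next few blocks on each read."""

    def __init__(self, fetch, capacity, readahead=1):
        self.fetch = fetch          # hypothetical callback: block_id -> data (the slow storage)
        self.capacity = capacity    # maximum number of cached blocks, assumed >= 1
        self.readahead = readahead  # number of following blocks to prefetch
        self.blocks = OrderedDict()
        self.hits = 0
        self.misses = 0

    def _put(self, block_id):
        # Load the block if absent, mark it most recently used, evict the oldest if over capacity.
        if block_id not in self.blocks:
            self.blocks[block_id] = self.fetch(block_id)
        self.blocks.move_to_end(block_id)
        while len(self.blocks) > self.capacity:
            self.blocks.popitem(last=False)

    def read(self, block_id):
        if block_id in self.blocks:
            self.hits += 1
        else:
            self.misses += 1
        self._put(block_id)
        data = self.blocks[block_id]
        # Sequential read-ahead: warm the cache for the blocks likely to be read next.
        for nxt in range(block_id + 1, block_id + 1 + self.readahead):
            self._put(nxt)
        return data
```

The cache contents follow whatever the workload actually touches, which is the sense in which caching adapts and a fixed replication factor does not.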
8. Murphy’s Law
• No Trusted Components
• Defend everything
– Rate-Limit access to every resource
– Log and Monitor everything
• Clear and Overwhelming Force
– Oversize it!
• Think QoS from Day-1
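
"Rate-Limit access to every resource" can be sketched with a token bucket, one common way to cap the request rate per caller. This is a minimal single-threaded illustration, not something taken from the talk; the class name `TokenBucket` and its parameters are assumptions, and the injectable `clock` exists only to make the behavior easy to test.

```python
import time


class TokenBucket:
    """Allows bursts up to `capacity` and a sustained rate of `rate` tokens per second."""

    def __init__(self, rate, capacity, clock=time.monotonic):
        self.rate = rate
        self.capacity = capacity
        self.tokens = capacity  # start with a full bucket
        self.clock = clock
        self.last = clock()

    def allow(self, cost=1):
        # Refill in proportion to the time elapsed since the last call, capped at capacity.
        now = self.clock()
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False
```

A service might keep one bucket per client and per resource and log every rejected call, which lines up with "Log and Monitor everything".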
9. Open Source
• Small is Beautiful
– Build small, easy to use/understand components
– Redis!
• Iterative Small Changes
– Operators HATE large releases
– Hive (2 weeks) vs. Hadoop (2 years?)
11. Interesting Problems - I
• Collaborative Analysis
– Most analysis is repeated
– Tracking and Searching historical analysis
• Consistency Aware Querying
– OLAP: Snapshots instead of live tables
– OLTP: Lookup stale caches instead of master
12. Interesting Problems - II
• SQL is Rope
– Better than procedural – but still Rope
– Higher Level templates: moving averages
• Data = Mutating + Immutable
– Immutable data is easy to manage
– Cheap: One copy per data center (Facebook Haystack)
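
As a rough illustration of a "higher level template", a moving average can be packaged as one reusable function instead of being rewritten by hand in every query. The sketch below is an assumption about what such a template might look like; the name `moving_average` and its trailing-window semantics are chosen for this example.

```python
from collections import deque


def moving_average(values, window):
    """Trailing moving average; early positions average over the values seen so far."""
    if window <= 0:
        raise ValueError("window must be positive")
    buf = deque(maxlen=window)
    total = 0.0
    out = []
    for v in values:
        if len(buf) == window:
            total -= buf[0]  # the oldest value is about to fall out of the window
        buf.append(v)
        total += v
        out.append(total / len(buf))
    return out
```

For example, `moving_average([1, 2, 3, 4, 5], 3)` returns `[1.0, 1.5, 2.0, 3.0, 4.0]`. The caller states the intent (a window of 3) and never touches the bookkeeping, which is the part that makes hand-written versions error-prone.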
13. Think Services, not Software
• Software is getting less interesting
– Even Distributed Systems Software
• Run/Operate long-running, hot services
– Innovate inside this boundary