Understand the path Jeff Hammerbacher took from Facebook, where he built scalable systems on Hadoop, to co-founding Cloudera and building an organization that provides the leading Hadoop platform.
3. Data Applications Scientist
“I have only heard back from one person about that
‘Data Applications Scientist’ thing. I had anticipated
more discussion” – me, February 29, 2008
4. “I guess I’m arguing for ‘Data’ to replace ‘Research’ in
those titles (I am happy to drop the ‘Applications’) as
the primary focus of our organization is not corporate
research.” – me, March 1, 2008
5. Data Scientist
“I’d like to avoid specialization at this early stage and I
expect every member of our group to have a mix of
research, engineering, and analysis in their workload.”
– me, March 1, 2008
6. Facebook Data Team
The Facebook Data Team built scalable platforms for
the collection, management, and analysis of data.
We used these platforms to drive informed decisions in
areas critical to the success of the company and to
build data-intensive products and services.
10. Philosophy
• Instrument everything
• Put all of your data in one place
• Data first, questions later
• Store first, structure later
• Keep raw data forever
• Let everyone party on the data
• Produce tools to support the whole research cycle
• Modular and composable infrastructure
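As a rough illustration of "Store first, structure later" and "Keep raw data forever", here is a minimal Python sketch. It is not taken from the talk: the function names `store_raw` and `read_with_schema`, the in-memory list standing in for a raw log, and the sample events are all hypothetical. The point is only that records are appended untouched and a schema is chosen at read time.

```python
import json


def store_raw(log, event_line):
    """Append the event exactly as received: no parsing, no schema check."""
    log.append(event_line)


def read_with_schema(log, fields):
    """Apply structure at read time by projecting each raw record onto `fields`."""
    rows = []
    for line in log:
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            continue  # unparseable lines remain in the raw log for later inspection
        if not isinstance(record, dict):
            continue  # valid JSON, but not an object we can project
        rows.append({field: record.get(field) for field in fields})
    return rows


# Hypothetical events; the list stands in for an append-only raw store.
raw_log = []
store_raw(raw_log, '{"user": 1, "action": "click", "page": "/home"}')
store_raw(raw_log, '{"user": 2, "action": "view"}')
store_raw(raw_log, 'not json at all')

# Two different questions, asked later, each with its own structure.
actions_view = read_with_schema(raw_log, ["user", "action"])
pages_view = read_with_schema(raw_log, ["user", "page"])
```

Because nothing is discarded on write, a new question only needs a new field list, not a new ingestion pipeline.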
11. CDH
• Storage
  • Append-only unstructured data
  • Append-only tabular data
  • Mutable tabular data
18. Cloudera Customer Survey
• 67% use Hive
• 54% use HBase
• 51% load data every 90 minutes or less
• 71% move data from Hadoop to RDBMS for
interactive SQL
• 62% would like to consolidate into single platform
19. Cloudera Impala
• General-purpose SQL query engine
• Should work both for analytic and transactional workloads
• Will support queries that take from microseconds to hours
20. Cloudera Impala
• Runs directly within Hadoop
• Reads widely used Hadoop file formats
• Talks to widely used Hadoop storage managers
• Runs on same nodes that run Hadoop processes
21. Cloudera Impala
• High performance
• C++ instead of Java
• Runtime code generation
• Completely new execution engine—not MapReduce
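To make "runtime code generation" concrete, the following Python sketch contrasts a generic predicate interpreter with a function generated for one specific query. This is not Impala's implementation, which is written in C++; the names `interpret` and `generate`, the tuple-based query format, and the sample rows are assumptions made purely for illustration.

```python
import operator

OPS = {"<": operator.lt, ">": operator.gt, "==": operator.eq}


def interpret(predicates, row):
    """Generic path: look up the column and operator for every predicate on every row."""
    return all(OPS[op](row[col], value) for col, op, value in predicates)


def generate(predicates):
    """Specialised path: build source for this exact query once, then compile it."""
    for _, op, _ in predicates:
        if op not in OPS:
            raise ValueError(f"unsupported operator: {op!r}")
    conditions = " and ".join(
        f"row[{col!r}] {op} {value!r}" for col, op, value in predicates
    ) or "True"
    source = f"def matches(row):\n    return {conditions}\n"
    namespace = {}
    exec(compile(source, "<generated>", "exec"), namespace)
    return namespace["matches"]


# Hypothetical rows and query.
rows = [
    {"age": 31, "country": "US"},
    {"age": 17, "country": "US"},
    {"age": 45, "country": "DE"},
]
query = [("age", ">", 18), ("country", "==", "US")]

matches = generate(query)  # pay the generation cost once per query
selected = [row for row in rows if matches(row)]
```

The generated function contains no operator lookups or loops over predicates, so the per-row work is limited to the comparisons the query actually needs. That trade, a one-time generation cost for a cheaper inner loop, is the general idea behind the technique.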
27. The Last Mile
• Data libraries
• Language
• Libraries
• IDE for Data Scientists
• Mixed-initiative
• Memory
• Collaboration
• Model and analysis path selection
28. Doing Data Science
• More data sources
• More rows
• More columns (novel or derived)
• Better data quality
• Better outcomes
• Better loss functions
• Causal inference in observational studies
• Effect size estimates
• Meta-analysis
• Model lifecycle
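Two of the items above, effect size estimates and meta-analysis, can be sketched in a few lines of Python. The slide names no particular method, so the choices here are assumptions: Cohen's d with a pooled standard deviation as the effect size, and a fixed-effect inverse-variance average as the pooling rule. The function names and the toy numbers are hypothetical.

```python
import math
from statistics import mean, variance


def cohens_d(treated, control):
    """Standardised mean difference using the pooled sample standard deviation."""
    n1, n2 = len(treated), len(control)
    pooled_var = (
        (n1 - 1) * variance(treated) + (n2 - 1) * variance(control)
    ) / (n1 + n2 - 2)
    return (mean(treated) - mean(control)) / math.sqrt(pooled_var)


def fixed_effect_pool(effects, variances):
    """Combine per-study effects, weighting each by the inverse of its variance."""
    weights = [1.0 / v for v in variances]
    pooled = sum(w * e for w, e in zip(weights, effects)) / sum(weights)
    pooled_variance = 1.0 / sum(weights)
    return pooled, pooled_variance


# Toy inputs for illustration only, not data from the talk.
d = cohens_d([2, 4, 6], [1, 3, 5])
pooled_effect, pooled_var = fixed_effect_pool([0.2, 0.6], [0.1, 0.1])
```

Reporting `d` alongside a significance test says how large a difference is, not just whether one exists. Pooling across studies gives more weight to the more precise estimates.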