Talk on 31.10.2013 by Matthew Moloney, the founder of Tsunami (tsunami.io). Matthew previously worked on Big Data tooling at eBay and Microsoft and is particularly interested in Functional Programming and Machine Learning.
Abstract: Many common mistakes in Hive programming are preventable and waste both user time and cluster time. Matt will present an interface that not only prevents these mistakes but can also give you helpful hints while you're typing.
2.
Founder of Lift Analytics
(F# IDE)
Big Data / Machine Learning
Applied Researcher
Business Intelligence
Process Engineering
Social
@tsunamiide
tsunami.io
Earthquake Enterprises
3.
Two main tasks on the Big Data Pipeline are Getting More Data and Finding More Factors
[Diagram labels: Get More Data; Find More Factors; Machine Learning; A/B Testing]
4.
Data Science is an inherently exploratory pursuit
Over 50% of the queries you write will only be executed once
Very few queries are worth saving
Exploration involves writing a whole lot of new code and then throwing most of it away
There are no QA teams and no test suites
Many more new opportunities to make mistakes
5.
Most queries are exploratory in nature
The decision on which query to run next often depends on the result of the previous query
Queries are often very short (e.g. 5 minutes)
Queues are often very long (e.g. 2 hours)
Mistakes waste a lot of your time
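A rough illustration of the numbers on this slide, using the example figures of a 5 minute query and a 2 hour queue. The function name and the assumption that a failed attempt is only discovered after it has queued and run in full are illustrative, not taken from the talk.

```python
def turnaround_minutes(queue_minutes, run_minutes, attempts):
    """Total wall-clock time when every attempt waits in the queue and then runs.

    Assumption: a mistake is only noticed once the attempt has finished,
    so each failed attempt costs a full queue wait plus a full run.
    """
    return attempts * (queue_minutes + run_minutes)


QUEUE_MINUTES = 120  # "Queues are often very long (e.g. 2 hours)"
RUN_MINUTES = 5      # "Queries are often very short (e.g. 5 minutes)"

first_try = turnaround_minutes(QUEUE_MINUTES, RUN_MINUTES, attempts=1)
one_mistake = turnaround_minutes(QUEUE_MINUTES, RUN_MINUTES, attempts=2)
wasted = one_mistake - first_try

print(f"correct on first try: {first_try} min")
print(f"one mistake, then a retry: {one_mistake} min ({wasted} min wasted)")
```

Under these assumptions a single typo costs the same queue wait again, even though the query itself only needs minutes of compute.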
6.
Clusters are a shared resource
A simple mistake may kill a big job after 12 hours of cluster processing time
Mistakes waste everyone's time
7.
Democratized write access to the cluster
Meta-data is often kept as tribal knowledge, team wikis, and files sent in emails
8.
Users will start to share datasets amongst themselves without any formal agreements or dependency management
Other people can now break your code
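One way to picture how a shared dataset can break a query: a minimal sketch that compares the columns a query expects against the columns a table currently has. The table columns, types, and the `schema_drift` helper are hypothetical; in practice the current schema might come from something like the output of Hive's DESCRIBE statement. This is not the interface presented in the talk.

```python
def schema_drift(expected, actual):
    """Return the expected columns that are missing or whose type has changed.

    Both arguments map column name -> type name. Extra columns in `actual`
    are ignored, since they do not break a query that never mentions them.
    """
    return {
        name: (wanted, actual.get(name))
        for name, wanted in expected.items()
        if actual.get(name) != wanted
    }


# Hypothetical schema the query author had in mind when writing the query.
expected = {"user_id": "bigint", "event_time": "timestamp"}

# Hypothetical current schema after another team changed the shared table.
actual = {"user_id": "string", "event_time": "timestamp", "country": "string"}

problems = schema_drift(expected, actual)
for column, (wanted, found) in problems.items():
    print(f"{column}: expected {wanted}, found {found}")
```

Running a check like this before submitting a job would surface the change in seconds rather than after a wait in the queue.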
9.
There are many more chances to make mistakes; they have become easier to make and far more costly.
Most of the time they are not even your fault, and there is nothing you could reasonably have done to prevent them.