Agile Data Science
January 2014
Agile Analytics Applications with Hadoop
2
About Me…Bearding.
• Bearding is my #1 natural talent.
• I’m going to beat this guy.
• Seriously.
• Salty Sea Beard
• Fortified with Pacific Ocean Minerals
2
3
Agile Data Science: The Book
A philosophy.
Not the only way,
but it’s a really good way!
Code: ‘AUTHD’ – 50% off
3
4
We Go Fast, But Don’t Worry!
• Download the slides - click the links - read examples!
• If it’s not on the blog (Hortonworks, Data Syndrome), it’s in the book!
• Order now: http://shop.oreilly.com/product/0636920025054.do
4
5
Agile Application
Development: Check
• LAMP stack mature
• Post-Rails frameworks to choose from
• Enable rapid feedback and agility
+ NoSQL
5
6
Data Warehousing
6
7
Scientific Computing / HPC
Tubes and Mercury (Old School) Cores and Spindles (New School)
UNIVAC and Deep Blue both fill a warehouse. We’re back!
7
‘Smart Kid’ Only: MPI, Globus, etc. Until Hadoop
8
Data Science?
Application
Development
Data Warehousing
Scientific Computing / HPC
8
9
Data Center as Computer
“A key challenge for architects of WSCs is to smooth out these discrepancies in a cost efficient
manner.” Click here for a paper on operating a ‘data center as computer.’
9
Warehouse Scale Computers and Applications
10
Hadoop to the Rescue!
• Easy to use (Pig, Hive, Cascading)
• CHEAP: 1% the cost of SAN/NAS
• A department can afford its own Hadoop cluster!
• Dump all your data in one place: Hadoop DFS
• Silos come CRASHING DOWN!
• JOIN like crazy!
• ETL like whoa!
• An army of mappers and reducers at your command
• OMGWTFBBQ IT’S SO GREAT! I FEEL AWESOME!
10
11
NOW
WHAT?
11
12
Analytics Apps: It takes a Team
• Broad skill-set
• Nobody has them all
• Inherently collaborative
12
13
Data Science Team
• 3-4 team members with broad, diverse skill-sets that overlap
• Transactional overhead dominates at 5+ people
• Expert researchers: lend 25-50% of their time to teams
• Creative workers. Like a studio, not an assembly line
• Total freedom... with goals and deliverables.
• Work environment matters most
13
14
How To Get Insight Into Product
• Back-end has gotten THICKER
• Generating $$$ insight can take 10-100x app dev
• Timeline disjoint: analytics vs agile app-dev/design
• How do you ship insights efficiently?
• Can you collaborate on research vs developer timeline?
14
15
The Wrong Way - Part One
“We made a great design.
Your job is to predict the future for it.”
15
16
The Wrong Way - Part Two
“What is taking you so long
to reliably predict the future?”
16
17
The Wrong Way - Part Three
“The users don’t understand
what 86% true means.”
17
18
The Wrong Way - Part Four
GHJIAEHGIEhjagigehganb!!!!!RJ(@J?!!
18
19
The Wrong Way - Conclusion
Inevitable Conclusion
Plane Mountain
19
20
Reminds me of... the waterfall model
:(
21
Chief Problem
You can’t design insight in analytics applications.
You discover it.
You discover by exploring.
21
22
-> Strategy
So make an app for exploring your data.
Which becomes a palette for what you ship.
Iterate and publish intermediate results.
22
23
Data Design
• Not the 1st query that = insight, it’s the 15th, or 150th
• Capturing “Ah ha!” moments
• Slow to do those in batch…
• Faster, better context in an interactive web application.
• Pre-designed charts wind up terrible. So bad.
• Easy to invest man-years in wrong statistical models
• Semantics of presenting predictions are complex
• Opportunity lies at intersection of data & design
23
24
How Do We Get Back to Agile?
24
25
Statement of Principles
(Then Tricks With Code)
25
26
Setup An Environment Where:
• Insights repeatedly produced
• Iterative work shared with entire team
• Interactive from day Zero
• Data model is consistent end-to-end
• Minimal impedance between layers
• Scope and depth of insights grow
• Insights form the palette for what you ship
• Until the application pays for itself and more
26
27
Snowballing Audience
27
28
Value Document > Relation
Most data is dirty. Most data is semi-structured or unstructured. Rejoice!
28
29
Value Document > Relation
Note: Hive/ArrayQL/NewSQL’s support of documents/array types blur this distinction.
29
30
Relational Data = Legacy Format
• Why JOIN? Storage is fundamentally cheap!
• Duplicate that JOIN data in one big record type!
• ETL once to document format on import, NOT every job
• Not zero JOINs, but far fewer JOINs
• Semi-structured documents preserve data’s actual structure
• Column compressed document formats beat JOINs!
30
31
Value Imperative > Declarative
• We don’t know what we want to SELECT.
• Data is dirty - check each step, clean iteratively.
• 85% of data scientist’s time spent munging. ETL.
• Imperative is optimized for our process.
• Process = iterative, snowballing insight
• Efficiency matters, self optimize
31
32
Value Dataflow > SELECT
32
33
Ex. Dataflow: ETL +
Email Sent Count
(I can’t read this either. Get a big version here.)
33
34
Value Pig > Hive (for app-dev)
• Pigs eat ANYTHING
• Pig is optimized for refining data, as opposed to consuming it
• Pig is imperative, iterative
• Pig is dataflows, and SQLish (but not SQL)
• Code modularization/re-use: Pig Macros
• ILLUSTRATE speeds dev time (even UDFs)
• Easy UDFs in Java, JRuby, Jython, Javascript
• Pig Streaming = use any tool, period.
• Easily prepare our data as it will appear in our app.
• If you prefer Hive, use Hive.
Actually, I wish Pig and Hive were one tool. Pig, then Hive, then Pig, then Hive.
See: HCatalog for Pig/Hive integration.
34
35
Localhost vs Petabyte Scale:
Same Tools
• Simplicity essential to scalability: highest level tools we can
• Prepare a good sample - tricky with joins, easy with documents
• Local mode: pig -l /tmp -x local -v -w
• Frequent use of ILLUSTRATE
• 1st: Iterate, debug & publish locally
• 2nd: Run on cluster, publish to team/customer
• Consider skipping Object-Relational-Mapping (ORM)
• We do not trust ‘databases,’ only HDFS @ n=3
• Everything we serve in our app is re-creatable via Hadoop.
35
36
Data-Value Pyramid
Climb it. Do not skip steps. See here.
36
37
0/1) Display Atomic Records
On The Web
37
38
0.0) Document - Serialize Events
• Protobuf
• Thrift
• JSON
• Avro - I use Avro because the schema is onboard.
38
39
0.1) Documents Via Relation ETL
enron_messages = load '/enron/enron_messages.tsv' as (
    message_id:chararray,
    sql_date:chararray,
    from_address:chararray,
    from_name:chararray,
    subject:chararray,
    body:chararray);
enron_recipients = load '/enron/enron_recipients.tsv' as (
    message_id:chararray,
    reciptype:chararray,
    address:chararray,
    name:chararray);
/* Split recipients by type, then regroup them per message */
split enron_recipients into tos IF reciptype=='to', ccs IF reciptype=='cc', bccs IF reciptype=='bcc';
headers = cogroup tos by message_id, ccs by message_id, bccs by message_id parallel 10;
with_headers = join headers by group, enron_messages by message_id parallel 10;
/* Emit one nested document per email */
emails = foreach with_headers generate
    enron_messages::message_id as message_id,
    CustomFormatToISO(enron_messages::sql_date, 'yyyy-MM-dd HH:mm:ss') as date,
    TOTUPLE(enron_messages::from_address, enron_messages::from_name) as from:tuple(address:chararray, name:chararray),
    enron_messages::subject as subject,
    enron_messages::body as body,
    headers::tos.(address, name) as tos,
    headers::ccs.(address, name) as ccs,
    headers::bccs.(address, name) as bccs;
store emails into '/enron/emails.avro' using AvroStorage(
Example here.
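The same relation-to-document idea can be sketched outside Pig. This is a minimal illustration in plain Python, not the talk's code: the function name build_email_documents and the in-memory list-of-dicts inputs are assumptions made for the example, and the date conversion assumes the source timestamps are already UTC.

```python
from collections import defaultdict
from datetime import datetime


def build_email_documents(messages, recipients):
    """Nest recipient rows under their message, one document per email."""
    # Group recipient rows by message_id and recipient type (to / cc / bcc).
    headers = defaultdict(lambda: {"to": [], "cc": [], "bcc": []})
    for row in recipients:
        if row["reciptype"] in ("to", "cc", "bcc"):
            headers[row["message_id"]][row["reciptype"]].append(
                {"address": row["address"], "name": row["name"]}
            )

    documents = []
    for msg in messages:
        # Inner-join semantics: a message with no recipient rows is skipped.
        by_type = headers.get(msg["message_id"])
        if by_type is None:
            continue
        # Assumption: sql_date is 'YYYY-MM-DD HH:MM:SS' and already in UTC.
        parsed = datetime.strptime(msg["sql_date"], "%Y-%m-%d %H:%M:%S")
        documents.append({
            "message_id": msg["message_id"],
            "date": parsed.strftime("%Y-%m-%dT%H:%M:%S.000Z"),
            "from": {"address": msg["from_address"], "name": msg["from_name"]},
            "subject": msg["subject"],
            "body": msg["body"],
            "tos": by_type["to"],
            "ccs": by_type["cc"],
            "bccs": by_type["bcc"],
        })
    return documents
```

The join happens once, at import time; every later job reads the nested document and needs no JOIN to see who an email was sent to.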
39
40
0.2) Serialize Events From Streams
class GmailSlurper(object):
    ...
    def init_imap(self, username, password):
        self.username = username
        self.password = password
        # Close any previous connection before opening a new one
        try:
            self.imap.shutdown()
        except Exception:
            pass
        self.imap = imaplib.IMAP4_SSL('imap.gmail.com', 993)
        self.imap.login(username, password)
        self.imap.is_readonly = True
    ...
    def write(self, record):
        # Append one email record to the Avro file
        self.avro_writer.append(record)
    ...
    def slurp(self):
        if(self.imap and self.imap_folder):
            for email_id in self.id_list:
                (status, email_hash, charset) = self.fetch_email(email_id)
                # Only keep emails that parsed cleanly and have a thread and a sender
                if(status == 'OK' and charset and 'thread_id' in email_hash and 'froms' in email_hash):
                    print email_id, charset, email_hash['thread_id']
                    self.write(email_hash)
Scrape your own gmail in Python and Ruby.
40
41
0.3) ETL Logs
log_data = LOAD 'access_log'
    USING org.apache.pig.piggybank.storage.apachelog.CommonLogLoader
    AS (remoteAddr,
        remoteLogname,
        user,
        time,
        method,
        uri,
        proto,
        status,
        bytes);
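For checking a few log lines locally before running the Pig job, the field split can be sketched in Python. This is an illustration under assumptions, not the Piggybank loader's implementation: the regular expression and the function name parse_common_log_line are made up for the example, and the field names follow the standard Common Log Format, which carries a status code between the protocol and the byte count.

```python
import re

# Assumed pattern for Common Log Format:
# host ident user [time] "method uri proto" status bytes
COMMON_LOG_PATTERN = re.compile(
    r'^(\S+) (\S+) (\S+) \[([^\]]+)\] "(\S+) (\S+) (\S+)" (\d{3}) (\d+|-)$'
)

FIELDS = (
    "remoteAddr", "remoteLogname", "user", "time",
    "method", "uri", "proto", "status", "bytes",
)


def parse_common_log_line(line):
    """Return a dict of string fields, or None if the line does not match."""
    match = COMMON_LOG_PATTERN.match(line.strip())
    if match is None:
        return None
    return dict(zip(FIELDS, match.groups()))
```

Lines that do not match come back as None, so dirty records can be counted and inspected instead of silently dropped.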
41
42
1) Plumb Atomic Events->Browser
(Example stack that enables high productivity)
42
43
1.1) Cat Avro Serialized Events
me$ cat_avro ~/Data/enron.avro
{
    u'bccs': [],
    u'body': u'scamming people, blah blah',
    u'ccs': [],
    u'date': u'2000-08-28T01:50:00.000Z',
    u'from': {u'address': u'bob.dobbs@enron.com', u'name': None},
    u'message_id': u'<1731.10095812390082.JavaMail.evans@thyme>',
    u'subject': u'Re: Enron trade for frop futures',
    u'tos': [
        {u'address': u'connie@enron.com', u'name': None}
    ]
}
Get cat_avro in python, ruby
43
44
1.2) Load Events in Pig
me$ pig -l /tmp -x local -v -w
grunt> enron_emails = LOAD '/enron/emails.avro' USING AvroStorage();
grunt> describe enron_emails;
enron_emails: {
    message_id: chararray,
    datetime: chararray,
    from: tuple(address: chararray, name: chararray),
    subject: chararray,
    body: chararray,
    tos: {to: (address: chararray, name: chararray)},
    ccs: {cc: (address: chararray, name: chararray)},
    bccs: {bcc: (address: chararray, name: chararray)}
}
44
45
1.3) ILLUSTRATE Events in Pig
grunt> illustrate enron_emails;
---------------------------------------------------------------------------
| emails |
| message_id:chararray |
| datetime:chararray |
| from:tuple(address:chararray,name:chararray) |
| subject:chararray |
| body:chararray |
| tos:bag{to:tuple(address:chararray,name:chararray)} |
| ccs:bag{cc:tuple(address:chararray,name:chararray)} |
| bccs:bag{bcc:tuple(address:chararray,name:chararray)} |
---------------------------------------------------------------------------
| |
| <1731.10095812390082.JavaMail.evans@thyme> |
| 2001-01-09T06:38:00.000Z |
| (bob.dobbs@enron.com, J.R. Bob Dobbs) |
| Re: Enron trade for frop futures |
| scamming people, blah blah |
| {(connie@enron.com,)} |
| {} |
| {} |
---------------------------------------------------------------------------
Upgrade to Pig 0.10+
45
46
1.4) Publish Events to a ‘Database’
From Avro to MongoDB in one command:
pig -l /tmp -x local -v -w -param avros=enron.avro \
    -param mongourl='mongodb://localhost/enron.emails' avro_to_mongo.pig
Which does this:
/* MongoDB libraries and configuration */
register /me/mongo-hadoop/mongo-2.7.3.jar
register /me/mongo-hadoop/core/target/mongo-hadoop-core-1.1.0-SNAPSHOT.jar
register /me/mongo-hadoop/pig/target/mongo-hadoop-pig-1.1.0-SNAPSHOT.jar
/* Set speculative execution off to avoid chance of duplicate records in Mongo */
set mapred.map.tasks.speculative.execution false
set mapred.reduce.tasks.speculative.execution false
define MongoStorage com.mongodb.hadoop.pig.MongoStorage(); /* Shortcut */
/* By default, let's have 5 reducers */
set default_parallel 5
avros = load '$avros' using AvroStorage();
store avros into '$mongourl' using MongoStorage();
Full instructions here.
46
47
1.5) Check Events in ‘Database’
$ mongo enron
MongoDB shell version: 2.0.2
connecting to: enron
> show collections
emails
system.indexes
> db.emails.findOne({message_id: "<1731.10095812390082.JavaMail.evans@thyme>"})
{
    "_id" : ObjectId("502b4ae703643a6a49c8d180"),
    "message_id" : "<1731.10095812390082.JavaMail.evans@thyme>",
    "date" : "2001-01-09T06:38:00.000Z",
    "from" : { "address" : "bob.dobbs@enron.com", "name" : "J.R. Bob Dobbs" },
    "subject" : "Re: Enron trade for frop futures",
    "body" : "Scamming more people...",
    "tos" : [ { "address" : "connie@enron.com", "name" : null } ],
    "ccs" : [ ],
    "bccs" : [ ]
}
47
48
1.6) Publish Events on the Web
require 'rubygems'
require 'sinatra'
require 'mongo'
require 'json'
connection = Mongo::Connection.new
database = connection['agile_data']
collection = database['emails']
# Look up one email by message_id and return it as JSON
get '/email/:message_id' do |message_id|
  data = collection.find_one({:message_id => message_id})
  JSON.generate(data)
end
48
49
1.6) Publish events on the web
49
50
One-Liner to Transition Stack
50
51
What’s the Point?
• A designer can work against real data.
• An application developer can work against real data.
• A product manager can think in terms of real data.
• Entire team is grounded in reality!
• You’ll see how ugly your data really is.
• You’ll see how much work you have yet to do.
• Ship early and often!
• Feels agile, don’t it? Keep it up!
51
52
1.7) Wrap Events with Bootstrap
<link href="/static/bootstrap/docs/assets/css/bootstrap.css" rel="stylesheet">
</head>
<body>
<div class="container" style="margin-top: 100px;">
<table class="table table-striped table-bordered table-condensed">
<thead>
{% for key in data['keys'] %}
<th>{{ key }}</th>
{% endfor %}
</thead>
<tbody>
<tr>
{% for value in data['values'] %}
<td>{{ value }}</td>
{% endfor %}
</tr>
</tbody>
</table>
</div>
</body>
Complete example here with code here.
52
53
1.7) Wrap Events with Bootstrap
53
54
Refine. Add Links
Between Documents.
Not the Mona Lisa, but coming along... See: here
54
56
1.8) List Links to Sorted Events
Use your ‘database’, if it can sort:
mongo enron
> db.emails.ensureIndex({message_id: 1})
> db.emails.find().sort({date: -1}).limit(10).pretty()
{
    "_id" : ObjectId("4f7a5da2414e4dd0645d1176"),
    "message_id" : "<CA+bvURyn-rLcH_JXeuzhyq8T9RNq+YJ_Hkvhnrpk8zfYshL-wA@mail.gmail.com>",
    "from" : [
...
Use Pig, serve/cache a bag/array of email documents:
pig -l /tmp -x local -v -w
emails_per_user = foreach (group emails by from.address) {
    /* Newest first, so the limit keeps the most recent 1000 */
    sorted = order emails by date desc;
    last_1000 = limit sorted 1000;
    generate group as from_address, last_1000 as emails;
};
store emails_per_user into '$mongourl' using MongoStorage();
56
57
1.8) List Links
to Sorted Documents
57
58
1.9) Make It Searchable
If you have a list, search is easy with ElasticSearch and Wonderdog...
/* Load ElasticSearch integration */
register '/me/wonderdog/target/wonderdog-1.0-SNAPSHOT.jar';
register '/me/elasticsearch-0.18.6/lib/*';
define ElasticSearch com.infochimps.elasticsearch.pig.ElasticSearchStorage();
emails = load '/me/tmp/emails' using AvroStorage();
store emails into 'es://email/email?json=false&size=1000' using ElasticSearch('/me/elasticsearch-0.18.6/config/elasticsearch.yml', '/me/elasticsearch-0.18.6/plugins');
Test it with curl:
curl -XGET 'http://localhost:9200/email/email/_search?q=hadoop&pretty=true&size=1'
ElasticSearch (0.18.x, as used here) has no built-in security features. Take note. Isolate.
58
59
2) Create Simple Charts
59
60
2) Create Simple Tables and
Charts
60
61
2) Create Simple Charts
• Start with an HTML table on general principle.
• Then use nvd3.js - reusable charts for d3.js
• Aggregate by properties & displaying is first step in entity resolution
• Start extracting entities. Ex: people, places, topics, time series
• Group documents by entities, rank and count.
• Publish top N, time series, etc.
• Fill a page with charts.
• Add a chart to your event page.
61
62
2.1) Top N (of Anything) in Pig
pig -l /tmp -x local -v -w
top_things = foreach (group things by key) {
    sorted = order things by arbitrary_rank desc;
    top_10_things = limit sorted 10;
    generate group as key, top_10_things as top_10_things;
};
store top_things into '$mongourl' using MongoStorage();
Remember, this is the same structure the browser gets as JSON.
This would make a good Pig Macro.
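To see the shape of the record the browser receives, here is the same top-N grouping as a small Python sketch. The function name top_n_per_key and the dict layout are assumptions for illustration; the field names key and arbitrary_rank are taken from the Pig above.

```python
from collections import defaultdict
import heapq


def top_n_per_key(things, n=10):
    """Group records by 'key' and keep the n highest-ranked records per group."""
    grouped = defaultdict(list)
    for thing in things:
        grouped[thing["key"]].append(thing)

    # One output record per key, holding a nested list (Pig's bag) of top items.
    return [
        {
            "key": key,
            "top_things": heapq.nlargest(
                n, group, key=lambda t: t["arbitrary_rank"]
            ),
        }
        for key, group in sorted(grouped.items())
    ]
```

Each output record is one document per key with a nested array, which is what gets stored and later served as JSON.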
62
63
2.2) Time Series (of Anything) in Pig
pig -l /tmp -x local -v -w
/* Group by our key and date rounded to the month, get a total */
things_by_month = foreach (group things by (key, ISOToMonth(datetime)))
    generate flatten(group) as (key, month),
             COUNT_STAR(things) as total;
/* Sort our totals per key by month to get a time series */
things_timeseries = foreach (group things_by_month by key) {
    timeseries = order things_by_month by month;
    generate group as key, timeseries as timeseries;
};
store things_timeseries into '$mongourl' using MongoStorage();
Yet another good Pig Macro.
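The two-step shape (count per key and month, then sort each key's months) can be checked on a small sample in Python. This is a sketch under assumptions: monthly_timeseries is a made-up name, datetimes are assumed to be ISO 8601 strings, and the month is represented as the 'YYYY-MM' prefix, which may differ from what ISOToMonth emits.

```python
from collections import Counter, defaultdict


def monthly_timeseries(things):
    """Count records per (key, month), then order each key's months."""
    # Step 1: total per (key, month). Assumes ISO 8601 strings, so the
    # first seven characters are 'YYYY-MM'.
    totals = Counter((t["key"], t["datetime"][:7]) for t in things)

    # Step 2: collect each key's months in chronological order.
    series = defaultdict(list)
    for (key, month), total in sorted(totals.items()):
        series[key].append({"month": month, "total": total})
    return dict(series)
```

Months with no records simply do not appear, so a chart built on this output may need its gaps filled with zeros.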
63
64
Data Processing in Our Stack
A new feature in our application might begin at any layer…
GREAT!
Any team member can add new features, no problemo!
I’m creative!
I know Pig!
I’m creative too!
I <3 Javascript!
omghi2u!
where r my legs?
send halp
64
65
Data Processing in Our Stack
... but we shift the data-processing towards batch, as we are able.
Ex: Overall total emails calculated in each layer
See real example here.
65
66
3) Exploring with Reports
66
67
3) Exploring with Reports
67
68
3.0) From Charts to Reports
• Extract entities from properties we aggregated by in charts (Step 2)
• Each entity gets its own type of web page
• Each unique entity gets its own web page
• Link to entities as they appear in atomic event documents (Step 1)
• Link most related entities together, same and between types.
• More visualizations!
• Parameterize results via forms.
68
69
3.1) Looks Like This:
69
70
3.2) Cultivate Common Keyspaces
70
71
3.3) Get People Clicking. Learn.
• Explore this web of generated pages, charts and links!
• Everyone on the team gets to know your data.
• Keep trying out different charts, metrics, entities, links.
• See what’s interesting.
• Figure out what data needs cleaning and clean it.
• Start thinking about predictions & recommendations.
‘People’ could be just your team, if data is sensitive.
71
72
4) Predictions and
Recommendations
72
73
4.0) Preparation
• We’ve already extracted entities, their properties and relationships
• Our charts show where our signal is rich
• We’ve cleaned our data to make it presentable
• The entire team has an intuitive understanding of the data
• They got that understanding by exploring the data
• We are all on the same page!
73
74
4.2) Think in Different
Perspectives
• Networks
• Time Series / Distributions
• Natural Language Processing
• Conditional Probabilities / Bayesian Inference
• Check out Chapter 2 of the book
74
75
4.3) Networks
75
76
4.3.1) Weighted Email
Networks in Pig
76
77
4.3.2) Networks Viz with Gephi
77
78
4.3.3) Gephi = Easy
78
79
4.3.4) Social Network Analysis
79
80
4.4) Time Series & Distributions
80
81
4.4.1) Smooth Sparse Data
See here.
82
4.4.2) Regress to Find Trends
JRuby Linear Regression UDF Pig to use the UDF
Trend Line in your Application
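The slide's JRuby UDF is not reproduced in this transcript. As a stand-in, here is an ordinary least-squares fit in plain Python; the function name linear_trend is hypothetical, and treating the time series as evenly spaced points at x = 0, 1, 2, ... is an assumption.

```python
def linear_trend(values):
    """Fit y = slope * x + intercept by least squares over x = 0..n-1."""
    n = len(values)
    if n < 2:
        raise ValueError("need at least two points to fit a trend")

    mean_x = (n - 1) / 2.0
    mean_y = sum(values) / float(n)

    # Sums of squared deviations in x and co-deviations of x and y.
    sxx = sum((x - mean_x) ** 2 for x in range(n))
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in enumerate(values))

    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept
```

A positive slope on a monthly count series would be drawn as a rising trend line in the application.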
82
83
4.5.1) Natural Language
Processing
Example with code here and macro here.
83
84
4.5.2) NLP: Extract Topics!
84
85
4.5.3) NLP for All: Extract Topics!
• TF-IDF in Pig - 2 lines of code with Pig Macros:
• http://hortonworks.com/blog/pig-macro-for-tf-idf-makes-topic-summarization-2-lines-of-pig/
• LDA with Pig and the Lucene Tokenizer:
• http://thedatachef.blogspot.be/2012/03/topic-discovery-with-apache-pig-and.html
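For readers who want to see what the TF-IDF macro computes, here is a small Python sketch of one textbook variant. It is not the linked macro: the function name tf_idf is made up, term frequency is taken as count divided by document length, and inverse document frequency as the natural log of (documents / documents containing the term).

```python
import math
from collections import Counter


def tf_idf(documents):
    """documents: dict of doc_id -> list of tokens.

    Returns a dict of doc_id -> {term: score}.
    """
    doc_count = len(documents)

    # Document frequency: in how many documents does each term appear?
    doc_freq = Counter()
    for tokens in documents.values():
        doc_freq.update(set(tokens))

    scores = {}
    for doc_id, tokens in documents.items():
        term_counts = Counter(tokens)
        total = float(len(tokens))
        scores[doc_id] = {
            term: (count / total) * math.log(doc_count / float(doc_freq[term]))
            for term, count in term_counts.items()
        }
    return scores
```

The highest-scoring terms per document serve as its topics; a term that appears in every document scores zero.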
85
86
4.6) Probability & Bayesian
Inference
86
87
4.6.1) Gmail Suggested Recipients
87
88
4.6.1) Reproducing it with Pig
88
89
4.6.2) Step 1: COUNT (From -> To)
89
90
4.6.2) Step 2: COUNT
(From, To, Cc)/Total
P(cc | to) = Probability of cc’ing someone, given that you’ve to’d someone
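The two counting steps can be sketched in Python. This is an illustration under assumptions, not the talk's Pig: the function name cc_given_to and the email dict layout are made up, and "Total" is read here as the number of emails a sender addressed to a given recipient.

```python
from collections import Counter


def cc_given_to(emails):
    """Estimate P(cc | from, to) from co-occurrence counts.

    emails: list of dicts with 'from' (str), 'tos' and 'ccs' (lists of str).
    Returns a dict of (from, to, cc) -> probability.
    """
    to_counts = Counter()      # Step 1: COUNT(from -> to)
    to_cc_counts = Counter()   # Step 2: COUNT(from, to, cc)
    for email in emails:
        sender = email["from"]
        for to in set(email["tos"]):
            to_counts[(sender, to)] += 1
            for cc in set(email["ccs"]):
                to_cc_counts[(sender, to, cc)] += 1

    # Divide each (from, to, cc) count by its (from, to) total.
    return {
        (sender, to, cc): count / float(to_counts[(sender, to)])
        for (sender, to, cc), count in to_cc_counts.items()
    }
```

Sorting one sender-recipient pair's entries by probability gives a ranked list of suggested cc addresses.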
90
91
4.6.3) Wait - Stop Here! It Works!
They match…
91
92
4.4) Add Predictions to Reports
92
93
5) Enable New Actions
93
94
Why Doesn’t Kate Reply
to My Emails?
• What time is best to catch her?
• Are they too long?
• Are they meant to be replied to (original content)?
• Are they nice? (sentiment analysis)
• Do I reply to her emails (reciprocity)?
• Do I cc the wrong people (my mom)?
94
97
Thank You!
• Questions & Answers
• Follow: @rjurney
• Read the Blog: datasyndrome.com

Contenu connexe

Tendances

Agile Data Science 2.0
Agile Data Science 2.0Agile Data Science 2.0
Agile Data Science 2.0Russell Jurney
 
Social Network Analysis in Your Problem Domain
Social Network Analysis in Your Problem DomainSocial Network Analysis in Your Problem Domain
Social Network Analysis in Your Problem DomainRussell Jurney
 
Networks All Around Us: Extracting networks from your problem domain
Networks All Around Us: Extracting networks from your problem domainNetworks All Around Us: Extracting networks from your problem domain
Networks All Around Us: Extracting networks from your problem domainRussell Jurney
 
Networks All Around Us: Extracting networks from your problem domain
Networks All Around Us: Extracting networks from your problem domainNetworks All Around Us: Extracting networks from your problem domain
Networks All Around Us: Extracting networks from your problem domainRussell Jurney
 
Running Intelligent Applications inside a Database: Deep Learning with Python...
Running Intelligent Applications inside a Database: Deep Learning with Python...Running Intelligent Applications inside a Database: Deep Learning with Python...
Running Intelligent Applications inside a Database: Deep Learning with Python...Miguel González-Fierro
 
Applied Machine learning using H2O, python and R Workshop
Applied Machine learning using H2O, python and R WorkshopApplied Machine learning using H2O, python and R Workshop
Applied Machine learning using H2O, python and R WorkshopAvkash Chauhan
 
Big Data Analytics - Best of the Worst : Anti-patterns & Antidotes
Big Data Analytics - Best of the Worst : Anti-patterns & AntidotesBig Data Analytics - Best of the Worst : Anti-patterns & Antidotes
Big Data Analytics - Best of the Worst : Anti-patterns & AntidotesKrishna Sankar
 
Architecture in action 01
Architecture in action 01Architecture in action 01
Architecture in action 01Krishna Sankar
 
Data science with Windows Azure - A Brief Introduction
Data science with Windows Azure - A Brief IntroductionData science with Windows Azure - A Brief Introduction
Data science with Windows Azure - A Brief IntroductionAdnan Masood
 
Data Science in 2016: Moving Up
Data Science in 2016: Moving UpData Science in 2016: Moving Up
Data Science in 2016: Moving UpPaco Nathan
 
Data Science with Spark - Training at SparkSummit (East)
Data Science with Spark - Training at SparkSummit (East)Data Science with Spark - Training at SparkSummit (East)
Data Science with Spark - Training at SparkSummit (East)Krishna Sankar
 
Fishing Graphs in a Hadoop Data Lake by Jörg Schad and Max Neunhoeffer at Big...
Fishing Graphs in a Hadoop Data Lake by Jörg Schad and Max Neunhoeffer at Big...Fishing Graphs in a Hadoop Data Lake by Jörg Schad and Max Neunhoeffer at Big...
Fishing Graphs in a Hadoop Data Lake by Jörg Schad and Max Neunhoeffer at Big...Big Data Spain
 
Reproducible, Open Data Science in the Life Sciences
Reproducible, Open  Data Science in the  Life SciencesReproducible, Open  Data Science in the  Life Sciences
Reproducible, Open Data Science in the Life SciencesEamonn Maguire
 
Seeing at the Speed of Thought: Empowering Others Through Data Exploration
Seeing at the Speed of Thought: Empowering Others Through Data ExplorationSeeing at the Speed of Thought: Empowering Others Through Data Exploration
Seeing at the Speed of Thought: Empowering Others Through Data ExplorationGreg Goltsov
 
So your boss says you need to learn data science
So your boss says you need to learn data scienceSo your boss says you need to learn data science
So your boss says you need to learn data scienceSusan Ibach
 
Self Evolving Model to Attain to State of Dynamic System Accuracy
Self Evolving Model to Attain to State of Dynamic System AccuracySelf Evolving Model to Attain to State of Dynamic System Accuracy
Self Evolving Model to Attain to State of Dynamic System AccuracyDataWorks Summit
 
What is a distributed data science pipeline. how with apache spark and friends.
What is a distributed data science pipeline. how with apache spark and friends.What is a distributed data science pipeline. how with apache spark and friends.
What is a distributed data science pipeline. how with apache spark and friends.Andy Petrella
 
Hadoop for Java Professionals
Hadoop for Java ProfessionalsHadoop for Java Professionals
Hadoop for Java ProfessionalsEdureka!
 

Tendances (19)

Agile Data Science 2.0
Agile Data Science 2.0Agile Data Science 2.0
Agile Data Science 2.0
 
Social Network Analysis in Your Problem Domain
Social Network Analysis in Your Problem DomainSocial Network Analysis in Your Problem Domain
Social Network Analysis in Your Problem Domain
 
Networks All Around Us: Extracting networks from your problem domain
Networks All Around Us: Extracting networks from your problem domainNetworks All Around Us: Extracting networks from your problem domain
Networks All Around Us: Extracting networks from your problem domain
 
Networks All Around Us: Extracting networks from your problem domain
Networks All Around Us: Extracting networks from your problem domainNetworks All Around Us: Extracting networks from your problem domain
Networks All Around Us: Extracting networks from your problem domain
 
Running Intelligent Applications inside a Database: Deep Learning with Python...
Running Intelligent Applications inside a Database: Deep Learning with Python...Running Intelligent Applications inside a Database: Deep Learning with Python...
Running Intelligent Applications inside a Database: Deep Learning with Python...
 
Applied Machine learning using H2O, python and R Workshop
Applied Machine learning using H2O, python and R WorkshopApplied Machine learning using H2O, python and R Workshop
Applied Machine learning using H2O, python and R Workshop
 
Big Data Analytics - Best of the Worst : Anti-patterns & Antidotes
Big Data Analytics - Best of the Worst : Anti-patterns & AntidotesBig Data Analytics - Best of the Worst : Anti-patterns & Antidotes
Big Data Analytics - Best of the Worst : Anti-patterns & Antidotes
 
Architecture in action 01
Architecture in action 01Architecture in action 01
Architecture in action 01
 
Data science with Windows Azure - A Brief Introduction
Data science with Windows Azure - A Brief IntroductionData science with Windows Azure - A Brief Introduction
Data science with Windows Azure - A Brief Introduction
 
Data Science in 2016: Moving Up
Data Science in 2016: Moving UpData Science in 2016: Moving Up
Data Science in 2016: Moving Up
 
Data Science with Spark - Training at SparkSummit (East)
Data Science with Spark - Training at SparkSummit (East)Data Science with Spark - Training at SparkSummit (East)
Data Science with Spark - Training at SparkSummit (East)
 
Fishing Graphs in a Hadoop Data Lake by Jörg Schad and Max Neunhoeffer at Big...
Fishing Graphs in a Hadoop Data Lake by Jörg Schad and Max Neunhoeffer at Big...Fishing Graphs in a Hadoop Data Lake by Jörg Schad and Max Neunhoeffer at Big...
Fishing Graphs in a Hadoop Data Lake by Jörg Schad and Max Neunhoeffer at Big...
 
Reproducible, Open Data Science in the Life Sciences
Reproducible, Open  Data Science in the  Life SciencesReproducible, Open  Data Science in the  Life Sciences
Reproducible, Open Data Science in the Life Sciences
 
Seeing at the Speed of Thought: Empowering Others Through Data Exploration
Seeing at the Speed of Thought: Empowering Others Through Data ExplorationSeeing at the Speed of Thought: Empowering Others Through Data Exploration
Seeing at the Speed of Thought: Empowering Others Through Data Exploration
 
So your boss says you need to learn data science
So your boss says you need to learn data scienceSo your boss says you need to learn data science
So your boss says you need to learn data science
 
Self Evolving Model to Attain to State of Dynamic System Accuracy
Self Evolving Model to Attain to State of Dynamic System AccuracySelf Evolving Model to Attain to State of Dynamic System Accuracy
Self Evolving Model to Attain to State of Dynamic System Accuracy
 
What is a distributed data science pipeline. how with apache spark and friends.
What is a distributed data science pipeline. how with apache spark and friends.What is a distributed data science pipeline. how with apache spark and friends.
What is a distributed data science pipeline. how with apache spark and friends.
 
Demo Eclipse Science
Demo Eclipse ScienceDemo Eclipse Science
Demo Eclipse Science
 
Hadoop for Java Professionals
Hadoop for Java ProfessionalsHadoop for Java Professionals
Hadoop for Java Professionals
 

En vedette

Introduction to PySpark
Introduction to PySparkIntroduction to PySpark
Introduction to PySparkRussell Jurney
 
Enabling Multimodel Graphs with Apache TinkerPop
Enabling Multimodel Graphs with Apache TinkerPopEnabling Multimodel Graphs with Apache TinkerPop
Enabling Multimodel Graphs with Apache TinkerPopJason Plurad
 
SF Python Meetup: TextRank in Python
SF Python Meetup: TextRank in PythonSF Python Meetup: TextRank in Python
SF Python Meetup: TextRank in PythonPaco Nathan
 
Blistering fast access to Hadoop with SQL
Blistering fast access to Hadoop with SQLBlistering fast access to Hadoop with SQL
Blistering fast access to Hadoop with SQLSimon Harris
 
Bitraf - Particle Photon IoT workshop
Bitraf - Particle Photon IoT workshopBitraf - Particle Photon IoT workshop
Bitraf - Particle Photon IoT workshopJens Brynildsen
 
Mapa mental de un lider tahi
Mapa mental de un lider  tahiMapa mental de un lider  tahi
Mapa mental de un lider tahiTahi04
 
ConsumerLab: The Self-Driving Future
ConsumerLab: The Self-Driving FutureConsumerLab: The Self-Driving Future
ConsumerLab: The Self-Driving FutureEricsson
 
Your moment is Waiting
Your moment is WaitingYour moment is Waiting
Your moment is Waitingrittujacob
 
Teraproc Application Cluster-as-a-Service Overview Presentation
Teraproc Application Cluster-as-a-Service Overview PresentationTeraproc Application Cluster-as-a-Service Overview Presentation
Teraproc Application Cluster-as-a-Service Overview PresentationGord Sissons
 
Creating HTML Pages
Creating HTML PagesCreating HTML Pages
Creating HTML PagesMike Crabb
 
Top Insights from SaaStr by Leading Enterprise Software Experts
Top Insights from SaaStr by Leading Enterprise Software ExpertsTop Insights from SaaStr by Leading Enterprise Software Experts
Top Insights from SaaStr by Leading Enterprise Software ExpertsOpenView
 
”’I den svenska och tyska litteraturens mittpunkt’: Svenska Pommerns roll som...
”’I den svenska och tyska litteraturens mittpunkt’: Svenska Pommerns roll som...”’I den svenska och tyska litteraturens mittpunkt’: Svenska Pommerns roll som...
”’I den svenska och tyska litteraturens mittpunkt’: Svenska Pommerns roll som...Andreas Önnerfors
 
CSS Grid Layout for Topconf, Linz
CSS Grid Layout for Topconf, LinzCSS Grid Layout for Topconf, Linz
CSS Grid Layout for Topconf, LinzRachel Andrew
 
IBM Hadoop-DS Benchmark Report - 30TB
IBM Hadoop-DS Benchmark Report - 30TBIBM Hadoop-DS Benchmark Report - 30TB
IBM Hadoop-DS Benchmark Report - 30TBGord Sissons
 
Motivación laboral
Motivación laboralMotivación laboral
Motivación laboralalexander_hv
 


Agile Data Science Applications with Hadoop

  • 1. Agile Data Science January 2014 Agile Analytics Applications with Hadoop
  • 2. About Me…Bearding. • Bearding is my #1 natural talent. • I’m going to beat this guy. • Seriously. • Salty Sea Beard • Fortified with Pacific Ocean Minerals
  • 3. Agile Data Science: The Book. A philosophy. Not the only way, but it’s a really good way! Code: ‘AUTHD’ – 50% off
  • 4. We Go Fast, But Don’t Worry! • Download the slides - click the links - read examples! • If it’s not on the blog (Hortonworks, Data Syndrome), it’s in the book! • Order now: http://shop.oreilly.com/product/0636920025054.do
  • 5. Agile Application Development: Check • LAMP stack mature • Post-Rails frameworks to choose from • Enable rapid feedback and agility + NoSQL
  • 7. Scientific Computing / HPC. Tubes and Mercury (Old School). Cores and Spindles (New School). UNIVAC and Deep Blue both fill a warehouse. We’re back! ‘Smart Kid’ Only: MPI, Globus, etc. Until Hadoop
  • 9. Data Center as Computer: Warehouse Scale Computers and Applications. “A key challenge for architects of WSCs is to smooth out these discrepancies in a cost efficient manner.” Click here for a paper on operating a ‘data center as computer.’
  • 10. Hadoop to the Rescue! • Easy to use (Pig, Hive, Cascading) • CHEAP: 1% the cost of SAN/NAS • A department can afford its own Hadoop cluster! • Dump all your data in one place: Hadoop DFS • Silos come CRASHING DOWN! • JOIN like crazy! • ETL like whoa! • An army of mappers and reducers at your command • OMGWTFBBQ ITS SO GREAT! I FEEL AWESOME!
  • 12. Analytics Apps: It takes a Team • Broad skill-set • Nobody has them all • Inherently collaborative
  • 13. Data Science Team • 3-4 team members with broad, diverse skill-sets that overlap • Transactional overhead dominates at 5+ people • Expert researchers: lend 25-50% of their time to teams • Creative workers. Like a studio, not an assembly line • Total freedom... with goals and deliverables. • Work environment matters most
  • 14. How To Get Insight Into Product • Back-end has gotten THICKER • Generating $$$ insight can take 10-100x app dev • Timeline disjoint: analytics vs agile app-dev/design • How do you ship insights efficiently? • Can you collaborate on research vs developer timeline?
  • 15. The Wrong Way - Part One “We made a great design. Your job is to predict the future for it.”
  • 16. The Wrong Way - Part Two “What is taking you so long to reliably predict the future?”
  • 17. The Wrong Way - Part Three “The users don’t understand what 86% true means.”
  • 18. The Wrong Way - Part Four GHJIAEHGIEhjagigehganb!!!!!RJ(@J?!!
  • 19. The Wrong Way - Conclusion. Inevitable Conclusion: Plane, Mountain
  • 20. Reminds me of... the waterfall model :(
  • 21. Chief Problem: You can’t design insight in analytics applications. You discover it. You discover by exploring.
  • 22. -> Strategy: So make an app for exploring your data. Which becomes a palette for what you ship. Iterate and publish intermediate results.
  • 23. Data Design • Not the 1st query that = insight, it’s the 15th, or 150th • Capturing “Ah ha!” moments • Slow to do those in batch… • Faster, better context in an interactive web application. • Pre-designed charts wind up terrible. So bad. • Easy to invest man-years in wrong statistical models • Semantics of presenting predictions are complex • Opportunity lies at intersection of data & design
  • 24. How Do We Get Back to Agile?
  • 25. Statement of Principles (Then Tricks With Code)
  • 26. Setup An Environment Where: • Insights repeatedly produced • Iterative work shared with entire team • Interactive from day Zero • Data model is consistent end-to-end • Minimal impedance between layers • Scope and depth of insights grow • Insights form the palette for what you ship • Until the application pays for itself and more
  • 28. Value Document > Relation. Most data is dirty. Most data is semi-structured or unstructured. Rejoice!
  • 29. Value Document > Relation. Note: Hive/ArrayQL/NewSQL’s support of documents/array types blur this distinction.
  • 30. Relational Data = Legacy Format • Why JOIN? Storage is fundamentally cheap! • Duplicate that JOIN data in one big record type! • ETL once to document format on import, NOT every job • Not zero JOINs, but far fewer JOINs • Semi-structured documents preserve data’s actual structure • Column compressed document formats beat JOINs!
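The "ETL once to document format on import" idea can be sketched in plain Python. This is only an illustration under stated assumptions: the two row lists are made-up stand-ins for a messages table and a recipients table, and `to_documents` is a hypothetical helper, not code from the deck. It joins once and embeds the recipient lists in each message, so later jobs read one record instead of re-joining.

```python
from collections import defaultdict

# Hypothetical rows standing in for two relational tables.
messages = [
    {"message_id": "m1", "subject": "Re: lunch", "from_address": "a@example.com"},
    {"message_id": "m2", "subject": "Status", "from_address": "b@example.com"},
]
recipients = [
    {"message_id": "m1", "reciptype": "to", "address": "b@example.com"},
    {"message_id": "m1", "reciptype": "cc", "address": "c@example.com"},
    {"message_id": "m2", "reciptype": "to", "address": "a@example.com"},
]


def to_documents(messages, recipients):
    """Join once at import time, embedding recipients in each message."""
    by_message = defaultdict(lambda: {"tos": [], "ccs": [], "bccs": []})
    for row in recipients:
        # "to" -> "tos", "cc" -> "ccs", "bcc" -> "bccs"
        by_message[row["message_id"]][row["reciptype"] + "s"].append(row["address"])
    documents = []
    for message in messages:
        document = dict(message)
        headers = by_message[message["message_id"]]
        document.update({key: list(value) for key, value in headers.items()})
        documents.append(document)
    return documents


documents = to_documents(messages, recipients)
```

Each resulting document carries its own `tos`, `ccs` and `bccs` lists, which is the shape the later slides publish and display.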
  • 31. Value Imperative > Declarative • We don’t know what we want to SELECT. • Data is dirty - check each step, clean iteratively. • 85% of data scientist’s time spent munging. ETL. • Imperative is optimized for our process. • Process = iterative, snowballing insight • Efficiency matters, self optimize
  • 32. Value Dataflow > SELECT
  • 33. Ex. Dataflow: ETL + Email Sent Count (I can’t read this either. Get a big version here.)
  • 34. Value Pig > Hive (for app-dev) • Pigs eat ANYTHING • Pig is optimized for refining data, as opposed to consuming it • Pig is imperative, iterative • Pig is dataflows, and SQLish (but not SQL) • Code modularization/re-use: Pig Macros • ILLUSTRATE speeds dev time (even UDFs) • Easy UDFs in Java, JRuby, Jython, Javascript • Pig Streaming = use any tool, period. • Easily prepare our data as it will appear in our app. • If you prefer Hive, use Hive. Actually, I wish Pig and Hive were one tool. Pig, then Hive, then Pig, then Hive. See: HCatalog for Pig/Hive integration.
  • 35. Localhost vs Petabyte Scale: Same Tools • Simplicity essential to scalability: highest level tools we can • Prepare a good sample - tricky with joins, easy with documents • Local mode: pig -l /tmp -x local -v -w • Frequent use of ILLUSTRATE • 1st: Iterate, debug & publish locally • 2nd: Run on cluster, publish to team/customer • Consider skipping Object-Relational-Mapping (ORM) • We do not trust ‘databases,’ only HDFS @ n=3 • Everything we serve in our app is re-creatable via Hadoop.
  • 36. Data-Value Pyramid. Climb it. Do not skip steps. See here.
  • 37. 0/1) Display Atomic Records On The Web
  • 38. 0.0) Document - Serialize Events • Protobuf • Thrift • JSON • Avro - I use Avro because the schema is onboard.
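To make "the schema is onboard" concrete: an Avro schema is itself a JSON document, so it can be written as a Python dict and serialized with the standard library. The sketch below is an assumption-laden illustration, not the deck's schema. The record names `Email` and `Address` are hypothetical, and the field list simply mirrors the email documents shown in the later slides.

```python
import json

# Nested record for a sender or recipient; "name" may be missing, so it is nullable.
ADDRESS = {
    "type": "record",
    "name": "Address",  # hypothetical type name
    "fields": [
        {"name": "address", "type": ["null", "string"]},
        {"name": "name", "type": ["null", "string"]},
    ],
}

EMAIL_SCHEMA = {
    "type": "record",
    "name": "Email",  # hypothetical record name
    "fields": [
        {"name": "message_id", "type": "string"},
        {"name": "date", "type": "string"},
        {"name": "from", "type": ADDRESS},
        {"name": "subject", "type": ["null", "string"]},
        {"name": "body", "type": ["null", "string"]},
        # Once defined, a named type can be referenced by name.
        {"name": "tos", "type": {"type": "array", "items": "Address"}},
        {"name": "ccs", "type": {"type": "array", "items": "Address"}},
        {"name": "bccs", "type": {"type": "array", "items": "Address"}},
    ],
}

# This JSON text is what an Avro data file stores in its header.
schema_json = json.dumps(EMAIL_SCHEMA, indent=2)
```

Because the schema travels inside the data file, a reader such as Pig's `AvroStorage` can discover the field names and types without a separate definition.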
  • 39. 0.1) Documents Via Relation ETL
      enron_messages = load '/enron/enron_messages.tsv' as (
          message_id:chararray,
          sql_date:chararray,
          from_address:chararray,
          from_name:chararray,
          subject:chararray,
          body:chararray);
      enron_recipients = load '/enron/enron_recipients.tsv' as (
          message_id:chararray,
          reciptype:chararray,
          address:chararray,
          name:chararray);
      split enron_recipients into tos IF reciptype=='to', ccs IF reciptype=='cc', bccs IF reciptype=='bcc';
      headers = cogroup tos by message_id, ccs by message_id, bccs by message_id parallel 10;
      with_headers = join headers by group, enron_messages by message_id parallel 10;
      emails = foreach with_headers generate
          enron_messages::message_id as message_id,
          CustomFormatToISO(enron_messages::sql_date, 'yyyy-MM-dd HH:mm:ss') as date,
          TOTUPLE(enron_messages::from_address, enron_messages::from_name) as from:tuple(address:chararray, name:chararray),
          enron_messages::subject as subject,
          enron_messages::body as body,
          headers::tos.(address, name) as tos,
          headers::ccs.(address, name) as ccs,
          headers::bccs.(address, name) as bccs;
      store emails into '/enron/emails.avro' using AvroStorage(
      Example here.
  • 40. 0.2) Serialize Events From Streams
      import imaplib

      class GmailSlurper(object):
          ...
          def init_imap(self, username, password):
              self.username = username
              self.password = password
              # Close any previous connection; ignore the error if there is none.
              try:
                  self.imap.shutdown()
              except:
                  pass
              self.imap = imaplib.IMAP4_SSL('imap.gmail.com', 993)
              self.imap.login(username, password)
              self.imap.is_readonly = True
          ...
          def write(self, record):
              self.avro_writer.append(record)
          ...
          def slurp(self):
              if(self.imap and self.imap_folder):
                  for email_id in self.id_list:
                      (status, email_hash, charset) = self.fetch_email(email_id)
                      if(status == 'OK' and charset and 'thread_id' in email_hash and 'froms' in email_hash):
                          print email_id, charset, email_hash['thread_id']
                          self.write(email_hash)
      Scrape your own gmail in Python and Ruby.
  • 41. 0.3) ETL Logs
      log_data = LOAD 'access_log'
          USING org.apache.pig.piggybank.storage.apachelog.CommonLogLoader
          AS (remoteAddr, remoteLogname, user, time, method, uri, proto, bytes);
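For readers without Pig at hand, the same log ETL step can be approximated in Python with a regular expression. This is a rough sketch, not the piggybank loader: the pattern, the `parse_clf` helper and the sample line are assumptions for illustration, and a `status` field is captured in addition to the fields named on the slide because the Common Log Format includes it.

```python
import re

# One named group per field; names follow the slide's field list, plus "status".
CLF_PATTERN = re.compile(
    r'(?P<remoteAddr>\S+) (?P<remoteLogname>\S+) (?P<user>\S+) '
    r'\[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<uri>\S+) (?P<proto>[^"]+)" '
    r'(?P<status>\d{3}) (?P<bytes>\d+|-)'
)


def parse_clf(line):
    """Return a dict of fields for one access-log line, or None if it does not match."""
    match = CLF_PATTERN.match(line)
    return match.groupdict() if match else None


sample = '127.0.0.1 - frank [10/Oct/2000:13:55:36 -0700] "GET /apache_pb.gif HTTP/1.0" 200 2326'
record = parse_clf(sample)
```

Lines that do not match return `None`, which makes dirty records easy to count and inspect before cleaning them.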
  • 42. 1) Plumb Atomic Events->Browser (Example stack that enables high productivity)
  • 43. 1.1) Cat Avro Serialized Events
      me$ cat_avro ~/Data/enron.avro
      {
        u'bccs': [],
        u'body': u'scamming people, blah blah',
        u'ccs': [],
        u'date': u'2000-08-28T01:50:00.000Z',
        u'from': {u'address': u'bob.dobbs@enron.com', u'name': None},
        u'message_id': u'<1731.10095812390082.JavaMail.evans@thyme>',
        u'subject': u'Re: Enron trade for frop futures',
        u'tos': [ {u'address': u'connie@enron.com', u'name': None} ]
      }
      Get cat_avro in python, ruby
  • 44. 1.2) Load Events in Pig
      me$ pig -l /tmp -x local -v -w
      grunt> enron_emails = LOAD '/enron/emails.avro' USING AvroStorage();
      grunt> describe enron_emails
      emails: {
        message_id: chararray,
        datetime: chararray,
        from:tuple(address:chararray,name:chararray)
        subject: chararray,
        body: chararray,
        tos: {to: (address: chararray,name: chararray)},
        ccs: {cc: (address: chararray,name: chararray)},
        bccs: {bcc: (address: chararray,name: chararray)}
      }
  • 45. 1.3) ILLUSTRATE Events in Pig
      grunt> illustrate enron_emails
      ---------------------------------------------------------------------------
      | emails | message_id:chararray                                   |
      |        | datetime:chararray                                     |
      |        | from:tuple(address:chararray,name:chararray)           |
      |        | subject:chararray                                      |
      |        | body:chararray                                         |
      |        | tos:bag{to:tuple(address:chararray,name:chararray)}    |
      |        | ccs:bag{cc:tuple(address:chararray,name:chararray)}    |
      |        | bccs:bag{bcc:tuple(address:chararray,name:chararray)}  |
      ---------------------------------------------------------------------------
      |        | <1731.10095812390082.JavaMail.evans@thyme>             |
      |        | 2001-01-09T06:38:00.000Z                               |
      |        | (bob.dobbs@enron.com, J.R. Bob Dobbs)                  |
      |        | Re: Enron trade for frop futures                       |
      |        | scamming people, blah blah                             |
      |        | {(connie@enron.com,)}                                  |
      |        | {}                                                     |
      |        | {}                                                     |
      Upgrade to Pig 0.10+
  • 46. 1.4) Publish Events to a ‘Database’
      From Avro to MongoDB in one command:
      pig -l /tmp -x local -v -w -param avros=enron.avro -param mongourl='mongodb://localhost/enron.emails' avro_to_mongo.pig
      Which does this:
      /* MongoDB libraries and configuration */
      register /me/mongo-hadoop/mongo-2.7.3.jar
      register /me/mongo-hadoop/core/target/mongo-hadoop-core-1.1.0-SNAPSHOT.jar
      register /me/mongo-hadoop/pig/target/mongo-hadoop-pig-1.1.0-SNAPSHOT.jar
      /* Set speculative execution off to avoid chance of duplicate records in Mongo */
      set mapred.map.tasks.speculative.execution false
      set mapred.reduce.tasks.speculative.execution false
      define MongoStorage com.mongodb.hadoop.pig.MongoStorage(); /* Shortcut */
      /* By default, lets have 5 reducers */
      set default_parallel 5
      avros = load '$avros' using AvroStorage();
      store avros into '$mongourl' using MongoStorage();
      Full instructions here.
  • 47. 1.5) Check Events in ‘Database’
      $ mongo enron
      MongoDB shell version: 2.0.2
      connecting to: enron
      > show collections
      emails
      system.indexes
      > db.emails.findOne({message_id: "<1731.10095812390082.JavaMail.evans@thyme>"})
      {
        "_id" : ObjectId("502b4ae703643a6a49c8d180"),
        "message_id" : "<1731.10095812390082.JavaMail.evans@thyme>",
        "date" : "2001-01-09T06:38:00.000Z",
        "from" : { "address" : "bob.dobbs@enron.com", "name" : "J.R. Bob Dobbs" },
        "subject" : "Re: Enron trade for frop futures",
        "body" : "Scamming more people...",
        "tos" : [ { "address" : "connie@enron", "name" : null } ],
        "ccs" : [ ],
        "bccs" : [ ]
      }
  • 48. 1.6) Publish Events on the Web
      require 'rubygems'
      require 'sinatra'
      require 'mongo'
      require 'json'

      connection = Mongo::Connection.new
      database = connection['agile_data']
      collection = database['emails']

      get '/email/:message_id' do |message_id|
        data = collection.find_one({:message_id => message_id})
        JSON.generate(data)
      end
  • 49. 1.6) Publish events on the web
  • 51. What’s the Point? • A designer can work against real data. • An application developer can work against real data. • A product manager can think in terms of real data. • Entire team is grounded in reality! • You’ll see how ugly your data really is. • You’ll see how much work you have yet to do. • Ship early and often! • Feels agile, don’t it? Keep it up!
  • 52. 1.7) Wrap Events with Bootstrap
      <link href="/static/bootstrap/docs/assets/css/bootstrap.css" rel="stylesheet">
      </head>
      <body>
        <div class="container" style="margin-top: 100px;">
          <table class="table table-striped table-bordered table-condensed">
            <thead>
              {% for key in data['keys'] %}
              <th>{{ key }}</th>
              {% endfor %}
            </thead>
            <tbody>
              <tr>
                {% for value in data['values'] %}
                <td>{{ value }}</td>
                {% endfor %}
              </tr>
            </tbody>
          </table>
        </div>
      </body>
      Complete example here with code here.
  • 53. 1.7) Wrap Events with Bootstrap
  • 54. Refine. Add Links Between Documents. Not the Mona Lisa, but coming along... See: here
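The "publish events on the web" step is small enough to sketch with only the Python standard library. This is an illustrative stand-in rather than the deck's stack: the `EMAILS` dict is a hypothetical in-memory substitute for the MongoDB collection, and `application` is a plain WSGI callable that serves `/email/<message_id>` as JSON.

```python
import json

# Hypothetical in-memory stand-in for the MongoDB 'emails' collection.
EMAILS = {
    "m1": {
        "message_id": "m1",
        "subject": "Re: lunch",
        "tos": [{"address": "b@example.com", "name": None}],
    },
}


def application(environ, start_response):
    """Serve /email/<message_id> as JSON; answer 404 for anything else."""
    path = environ.get("PATH_INFO", "")
    prefix = "/email/"
    document = EMAILS.get(path[len(prefix):]) if path.startswith(prefix) else None
    if document is None:
        start_response("404 Not Found", [("Content-Type", "application/json")])
        return [b'{"error": "not found"}']
    start_response("200 OK", [("Content-Type", "application/json")])
    return [json.dumps(document).encode("utf-8")]

# To try it locally (not run here):
#   from wsgiref.simple_server import make_server
#   make_server("localhost", 8000, application).serve_forever()
```

The point is the same as on the slide: one route, one lookup by `message_id`, and the raw document goes straight to the browser.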
  • 55. 56 1.8) List Links to Sorted Events Use your ‘database’, if it can sort:
      mongo enron
      > db.emails.ensureIndex({message_id: 1})
      > db.emails.find().sort({date: -1}).limit(10).pretty()
      {
        "_id" : ObjectId("4f7a5da2414e4dd0645d1176"),
        "message_id" : "<CA+bvURyn-rLcH_JXeuzhyq8T9RNq+YJ_Hkvhnrpk8zfYshL-wA@mail.gmail.com>",
        "from" : [ ...
    Use Pig, serve/cache a bag/array of email documents:
      pig -l /tmp -x local -v -w
      emails_per_user = foreach (group emails by from.address) {
        sorted = order emails by date desc;  /* newest first */
        last_1000 = limit sorted 1000;
        generate group as from_address, last_1000 as emails;
      };
      store emails_per_user into '$mongourl' using MongoStorage();
    56
  • 56. 57 1.8) List Links to Sorted Documents 57
  • 57. 58 1.9) Make It Searchable If you have a list, search is easy with ElasticSearch and Wonderdog...
      /* Load ElasticSearch integration */
      register '/me/wonderdog/target/wonderdog-1.0-SNAPSHOT.jar';
      register '/me/elasticsearch-0.18.6/lib/*';
      define ElasticSearch com.infochimps.elasticsearch.pig.ElasticSearchStorage();
      emails = load '/me/tmp/emails' using AvroStorage();
      store emails into 'es://email/email?json=false&size=1000' using ElasticSearch('/me/elasticsearch-0.18.6/config/elasticsearch.yml', '/me/elasticsearch-0.18.6/plugins');
    Test it with curl:
      curl -XGET 'http://localhost:9200/email/email/_search?q=hadoop&pretty=true&size=1'
    ElasticSearch has no security features. Take note. Isolate it. 58
  • 58. 59 2) Create Simple Charts 59
  • 59. 60 2) Create Simple Tables and Charts 60
  • 60. 61 2) Create Simple Charts • Start with an HTML table on general principle. • Then use nvd3.js - reusable charts for d3.js. • Aggregating by properties and displaying the results is the first step in entity resolution. • Start extracting entities. Ex: people, places, topics, time series. • Group documents by entities, rank and count. • Publish top N, time series, etc. • Fill a page with charts. • Add a chart to your event page. 61
  • 61. 62 2.1) Top N (of Anything) in Pig
      pig -l /tmp -x local -v -w
      top_things = foreach (group things by key) {
        sorted = order things by arbitrary_rank desc;
        top_10_things = limit sorted 10;
        generate group as key, top_10_things as top_10_things;
      };
      store top_things into '$mongourl' using MongoStorage();
    Remember, this is the same structure the browser gets as JSON. This would make a good Pig Macro. 62
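For readers less familiar with Pig's nested foreach, the same group, sort, and limit pattern can be sketched in plain Python. This is an illustrative equivalent under the assumption that each record is a dict with `key` and `arbitrary_rank` fields; the function name `top_n_per_key` is hypothetical:

```python
from collections import defaultdict
import heapq


def top_n_per_key(things, n=10):
    """Group records by 'key' and keep the n highest-ranked records per group."""
    groups = defaultdict(list)
    for thing in things:
        groups[thing["key"]].append(thing)
    return [
        {
            "key": key,
            # nlargest returns the records sorted by rank, highest first
            "top_things": heapq.nlargest(
                n, members, key=lambda t: t["arbitrary_rank"]
            ),
        }
        for key, members in groups.items()
    ]
```

Each element of the result mirrors one stored document: a key plus a nested list of its top records.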
  • 62. 63 2.2) Time Series (of Anything) in Pig
      pig -l /tmp -x local -v -w
      /* Group by our key and date rounded to the month, get a total */
      things_by_month = foreach (group things by (key, ISOToMonth(datetime)))
                        generate flatten(group) as (key, month),
                                 COUNT_STAR(things) as total;
      /* Sort our totals per key by month to get a time series */
      things_timeseries = foreach (group things_by_month by key) {
        timeseries = order things_by_month by month;
        generate group as key, timeseries as timeseries;
      };
      store things_timeseries into '$mongourl' using MongoStorage();
    Yet another good Pig Macro. 63
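The two-stage shape of this script (count per key and month, then sort each key's counts by month) can be sketched in Python as follows. It assumes records are dicts with a `key` and an ISO 8601 `datetime` string; slicing the first seven characters is a simplified stand-in for the ISOToMonth UDF, and `monthly_timeseries` is a hypothetical name:

```python
from collections import Counter, defaultdict


def monthly_timeseries(things):
    """Count records per (key, month), then sort each key's totals by month."""
    # ISO 8601 timestamps start with YYYY-MM, so the first 7 characters
    # identify the month.
    totals = Counter((t["key"], t["datetime"][:7]) for t in things)
    series = defaultdict(list)
    # Sorting the (key, month) pairs puts each key's months in order.
    for (key, month), total in sorted(totals.items()):
        series[key].append({"month": month, "total": total})
    return dict(series)
```

Months with no records are simply absent from the output, which is why sparse series may need smoothing later.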
  • 63. 64 Data Processing in Our Stack A new feature in our application might begin at any layer… GREAT! Any team member can add new features, no problemo! I’m creative! I know Pig! I’m creative too! I <3 Javascript! omghi2u! where r my legs? send halp 64
  • 64. 65 Data Processing in Our Stack ... but we shift the data-processing towards batch, as we are able. Ex: Overall total emails calculated in each layer See real example here. 65
  • 65. 66 3) Exploring with Reports 66
  • 66. 67 3) Exploring with Reports 67
  • 67. 68 3.0) From Charts to Reports • Extract entities from properties we aggregated by in charts (Step 2) • Each entity gets its own type of web page • Each unique entity gets its own web page • Link to entities as they appear in atomic event documents (Step 1) • Link the most related entities together, within and across types. • More visualizations! • Parameterize results via forms. 68
  • 68. 69 3.1) Looks Like This: 69
  • 69. 70 3.2) Cultivate Common Keyspaces 70
  • 70. 71 3.3) Get People Clicking. Learn. • Explore this web of generated pages, charts and links! • Everyone on the team gets to know your data. • Keep trying out different charts, metrics, entities, links. • See what’s interesting. • Figure out what data needs cleaning and clean it. • Start thinking about predictions & recommendations. ‘People’ could be just your team, if data is sensitive. 71
  • 72. 73 4.0) Preparation • We’ve already extracted entities, their properties and relationships • Our charts show where our signal is rich • We’ve cleaned our data to make it presentable • The entire team has an intuitive understanding of the data • They got that understanding by exploring the data • We are all on the same page! 73
  • 73. 74 4.2) Think in Different Perspectives • Networks • Time Series / Distributions • Natural Language Processing • Conditional Probabilities / Bayesian Inference • Check out Chapter 2 of the book 74
  • 76. 77 4.3.2) Networks Viz with Gephi 77
  • 77. 78 4.3.3) Gephi = Easy 78
  • 79. 80 4.4) Time Series & Distributions 80
  • 80. 81 4.4.1) Smooth Sparse Data See here. 81
  • 81. 82 4.4.2) Regress to Find Trends JRuby Linear Regression UDF Pig to use the UDF Trend Line in your Application 82
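The slide names a JRuby linear-regression UDF, but the transcript does not include its code. As a rough sketch of the underlying idea only (not the author's UDF), here is ordinary least squares over an evenly spaced series in Python; the name `fit_trend` and the choice of x = 0, 1, 2, ... for the time axis are assumptions:

```python
def fit_trend(ys):
    """Least-squares line through (0, ys[0]), (1, ys[1]), ...

    Returns (slope, intercept).
    """
    n = len(ys)
    if n < 2:
        raise ValueError("need at least two points to fit a trend")
    mean_x = (n - 1) / 2
    mean_y = sum(ys) / n
    # Sum of squared deviations in x, and of cross deviations in x and y
    sxx = sum((x - mean_x) ** 2 for x in range(n))
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in enumerate(ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x
```

A positive slope on a monthly count series would indicate a rising trend; the two endpoints of the fitted line are enough to draw it on a chart.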
  • 82. 83 4.5.1) Natural Language Processing Example with code here and macro here. 83
  • 84. 85 4.5.3) NLP for All: Extract Topics! • TF-IDF in Pig - 2 lines of code with Pig Macros: • http://hortonworks.com/blog/pig-macro-for-tf-idf-makes-topic-summarization-2-lines-of-pig/ • LDA with Pig and the Lucene Tokenizer: • http://thedatachef.blogspot.be/2012/03/topic-discovery-with-apache-pig-and.html 85
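To make the TF-IDF idea concrete, here is a small Python sketch. It is not the linked Pig macro: the length-normalized term frequency and the plain log(N/df) inverse document frequency are assumptions, and the macro may weight terms differently. The name `tf_idf` is hypothetical:

```python
import math
from collections import Counter


def tf_idf(docs):
    """docs: {doc_id: [tokens]}. Returns {doc_id: {term: score}}."""
    n_docs = len(docs)
    # Document frequency: number of documents containing each term.
    df = Counter(term for tokens in docs.values() for term in set(tokens))
    scores = {}
    for doc_id, tokens in docs.items():
        counts = Counter(tokens)
        scores[doc_id] = {
            # term frequency (share of the document) times inverse doc frequency
            term: (count / len(tokens)) * math.log(n_docs / df[term])
            for term, count in counts.items()
        }
    return scores
```

Terms that appear in every document score zero, so the highest-scoring terms per email are the ones that set it apart, which is what makes them usable as topics.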
  • 85. 86 4.6) Probability & Bayesian Inference 86
  • 86. 87 4.6.1) Gmail Suggested Recipients 87
  • 88. 89 4.6.2) Step 1: COUNT (From -> To) 89
  • 89. 90 4.6.2) Step 2: COUNT (From, To, Cc) / Total. P(cc | to) = Probability of cc’ing someone, given that you’ve to’d someone 90
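The two counting steps combine into a conditional probability estimate. A minimal Python sketch, assuming the total in Step 2 is the Step 1 count for the same (from, to) pair, and that each email is a dict with a `from` string plus `tos` and `ccs` lists of address strings (a simplification of the nested documents shown earlier); `cc_probabilities` is a hypothetical name:

```python
from collections import Counter


def cc_probabilities(emails):
    """Estimate P(cc | from, to) from co-occurrence counts."""
    pair_counts = Counter()    # Step 1: how often `from` wrote to `to`
    triple_counts = Counter()  # Step 2: ... and cc'd `cc` on the same email
    for email in emails:
        for to in email["tos"]:
            pair_counts[(email["from"], to)] += 1
            for cc in email["ccs"]:
                triple_counts[(email["from"], to, cc)] += 1
    return {
        (sender, to, cc): count / pair_counts[(sender, to)]
        for (sender, to, cc), count in triple_counts.items()
    }
```

Sorting the results for a given sender and recipient by probability would give a ranked list of suggested cc addresses.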
  • 90. 91 4.6.3) Wait - Stop Here! It Works! They match… 91
  • 91. 92 4.4) Add Predictions to Reports 92
  • 92. 93 5) Enable New Actions 93
  • 93. 94 Why Doesn’t Kate Reply to My Emails? • What time is best to catch her? • Are they too long? • Are they meant to be replied to (original content)? • Are they nice? (sentiment analysis) • Do I reply to her emails (reciprocity)? • Do I cc the wrong people (my mom)? 94
  • 94. 97 Thank You! • Questions & Answers • Follow: @rjurney • Read the Blog: datasyndrome.com 97