Paris HUG - Agile Analytics Applications on Hadoop
1. Agile Analytics Applications
Russell Jurney (@rjurney) - Hadoop Evangelist @Hortonworks
Formerly Viz, Data Science at Ning, LinkedIn
HBase Dashboards, Career Explorer, InMaps
© Hortonworks Inc. 2012
1
2. Agile Data - The Book (March 2013)
Read it now on OFPS
A philosophy,
not the only way
But still, it's good! Really!
3. We go fast... but don’t worry!
• Examples for EVERYTHING on the Hortonworks blog:
http://hortonworks.com/blog/authors/russell_jurney
• Download the slides - click the links - read examples!
• If it's not on the blog, it's in the book!
• Order now: http://shop.oreilly.com/product/0636920025054.do
• Read the book NOW on OFPS:
• http://ofps.oreilly.com/titles/9781449326265/chapter_2.html
4. Agile Application Development: Check
• LAMP stack mature
• Post-Rails frameworks to choose from
• Enable rapid feedback and agility
+ NoSQL
6. Scientific Computing / HPC
• ‘Smart kid’ only: MPI, Globus, etc. until Hadoop
Tubes and Mercury (old school) Cores and Spindles (new school)
UNIVAC and Deep Blue both fill a warehouse. We’re back...
8. Data Center as Computer
• Warehouse Scale Computers and applications
“A key challenge for architects of WSCs is to smooth out these discrepancies in a cost efficient manner.”
Click here for a paper on operating a ‘data center as computer.’
9. Hadoop to the Rescue!
Big data refinery / Modernize ETL
(Diagram: new data sources and interactions - audio, video, images; CRM, ERP, SCM transactions; docs, text, XML; web logs and clicks; social, graph, feeds; sensors, devices, RFID; spatial, GPS; events - flow into HDFS, where Apache Hadoop acts as a "Big Data Refinery". Refined data feeds ETL into SQL, NoSQL and NewSQL stores and the EDW/MPP, powering Business Intelligence & Analytics: dashboards, reports, visualization.)
I stole this slide from Eric. Update: He stole it from someone else.
10. Hadoop to the Rescue!
• Easy to use! (Pig, Hive, Cascading)
• CHEAP: 1% the cost of SAN/NAS
• A department can afford its own Hadoop cluster!
• Dump all your data in one place: Hadoop DFS
• Silos come CRASHING DOWN!
• JOIN like crazy!
• ETL like whoah!
• An army of mappers and reducers at your command
• OMGWTFBBQ ITS SO GREAT! I FEEL AWESOME!
12. Analytics Apps: It takes a Team
• Broad skill-set to make useful apps
• Basically nobody has them all
• Application development is inherently collaborative
13. Data Science Team
• 3-4 team members with broad, diverse skill-sets that overlap
• Transactional overhead dominates at 5+ people
• Expert researchers: lend 25-50% of their time to teams
• Pick relevant researchers. Leave them alone. They’ll spawn
new products by accident. Not just CS/Math. Design. Art?
• Creative workers. Run like a studio, not an assembly line
• Total freedom... with goals and deliverables.
• Work environment matters most: private, social & quiet space
• Desks/cubes optional
14. How to get insight into product?
• Back-end has gotten t-h-i-c-k-e-r
• Generating $$$ insight can take 10-100x app dev
• Timeline disjoint: analytics vs agile app-dev/design
• How do you ship insights efficiently?
• How do you collaborate on research vs developer timeline?
15. The Wrong Way - Part One
“We made a great design. Your job is to predict the future for it.”
16. The Wrong Way - Part Two
“What's taking you so long to reliably predict the future?”
17. The Wrong Way - Part Three
“The users don’t understand what 86% true means.”
18. The Wrong Way - Part Four
GHJIAEHGIEhjagigehganbanbigaebjnain!!!!!RJ(@J?!!
19. The Wrong Way - Inevitable Conclusion
(Image: a plane flying into a mountain.)
21. Chief Problem
You can’t design insight in analytics applications.
You discover it.
You discover by exploring.
22. -> Strategy
So make an app for exploring your data.
Iterate and publish intermediate results.
Which becomes a palette for what you ship.
23. Data Design
• Not the 1st query that = insight, it's the 15th, or the 150th
• Capturing “Ah ha!” moments
• Slow to do those in batch...
• Faster, better context in an interactive web application.
• Pre-designed charts wind up terrible. So bad.
• Easy to invest man-years in the wrong statistical models
• Semantics of presenting predictions are complex, delicate
• Opportunity lies at intersection of data & design
24. How do we get back to Agile?
26. Set up an environment where...
• Insights repeatedly produced
• Iterative work shared with entire team
• Interactive from day 0
• Data model is consistent end-to-end
• Minimal impedance between layers
• Scope and depth of insights grow
• Insights form the palette for what you ship
• Until the application pays for itself and more
27. Value document > relation
Most data is dirty. Most data is semi-structured or un-structured. Rejoice!
28. Value document > relation
Note: Hive/ArrayQL/NewSQL's support of document/array types blurs this distinction.
29. Relational Data = Legacy?
• Why JOIN? Storage is fundamentally cheap!
• Duplicate that JOIN data in one big record type!
• ETL once to document format on import, NOT every job
• Not zero JOINs, but far fewer JOINs
• Semi-structured documents preserve data’s actual structure
• Column-compressed document formats beat JOINs! (paper coming)
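To make the "duplicate the JOIN data into one big record type" idea concrete, here is a minimal Python sketch - field names are hypothetical, loosely modeled on the Enron messages/recipients tables that appear later in this deck - that folds a recipients relation into nested email documents once, at import time:

```python
# Hypothetical illustration: denormalize a relational 'recipients' table
# into nested email documents, so downstream jobs need no JOIN.
messages = [
    {"message_id": "m1", "subject": "Re: frop futures"},
]
recipients = [
    {"message_id": "m1", "reciptype": "to", "address": "connie@enron.com"},
    {"message_id": "m1", "reciptype": "cc", "address": "bob.dobbs@enron.com"},
]

def denormalize(messages, recipients):
    # Index recipients by message_id once, at import time
    by_msg = {}
    for r in recipients:
        by_msg.setdefault(r["message_id"], {"to": [], "cc": [], "bcc": []})
        by_msg[r["message_id"]][r["reciptype"]].append(r["address"])
    # Embed the recipient lists inside each message document
    docs = []
    for m in messages:
        headers = by_msg.get(m["message_id"], {"to": [], "cc": [], "bcc": []})
        docs.append(dict(m, tos=headers["to"], ccs=headers["cc"],
                         bccs=headers["bcc"]))
    return docs

docs = denormalize(messages, recipients)
```

After this one-time ETL, every job reads self-contained documents instead of re-joining tables.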
30. Value imperative > declarative
• We don’t know what we want to SELECT.
• Data is dirty - check each step, clean iteratively.
• 85% of a data scientist's time is spent munging. See: ETL.
• Imperative is optimized for our process.
• Process = iterative, snowballing insight
• Efficiency matters, self optimize
32. Ex. dataflow: ETL + email sent count
(I can’t read this either. Get a big version here.)
33. Value Pig > Hive (for app-dev)
• Pigs eat ANYTHING
• Pig is optimized for refining data, as opposed to consuming it
• Pig is imperative, iterative
• Pig is dataflows, and SQLish (but not SQL)
• Code modularization/re-use: Pig Macros
• ILLUSTRATE speeds dev time (even UDFs)
• Easy UDFs in Java, JRuby, Jython, JavaScript
• Pig Streaming = use any tool, period.
• Easily prepare our data as it will appear in our app.
• If you prefer Hive, use Hive.
But actually, I wish Pig and Hive were one tool. Pig, then Hive, then Pig, then Hive...
See: HCatalog for Pig/Hive integration, and this post.
34. Localhost vs Petabyte scale: same tools
• Simplicity is essential to scalability: use the highest-level tools we can
• Prepare a good sample - tricky with joins, easy with documents
• Local mode: pig -l /tmp -x local -v -w
• Frequent use of ILLUSTRATE
• 1st: Iterate, debug & publish locally
• 2nd: Run on cluster, publish to team/customer
• Consider skipping Object-Relational-Mapping (ORM)
• We do not trust ‘databases,’ only HDFS @ n=3.
• Everything we serve in our app is re-creatable via Hadoop.
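One way to "prepare a good sample" for local mode is a single-pass reservoir sample over your document collection; a minimal stdlib sketch (the fixed seed is just for reproducibility, not part of the technique):

```python
import random

def reservoir_sample(records, k, seed=42):
    """Uniform random sample of k records in one pass over a stream --
    handy for cutting a local-mode test set from a large collection."""
    rng = random.Random(seed)
    sample = []
    for i, rec in enumerate(records):
        if i < k:
            sample.append(rec)           # fill the reservoir first
        else:
            j = rng.randint(0, i)        # replace with decreasing odds
            if j < k:
                sample[j] = rec
    return sample

sample = reservoir_sample(range(100000), 10)
```

Because it streams, this works the same on ten records or ten million; sampling joined relations consistently is much harder, which is another point for documents.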
38. 0.1) Documents via Relation ETL
enron_messages = load '/enron/enron_messages.tsv' as (
message_id:chararray,
sql_date:chararray,
from_address:chararray,
from_name:chararray,
subject:chararray,
body:chararray
);
enron_recipients = load '/enron/enron_recipients.tsv' as ( message_id:chararray, reciptype:chararray, address:chararray, name:chararray);
split enron_recipients into tos IF reciptype=='to', ccs IF reciptype=='cc', bccs IF reciptype=='bcc';
headers = cogroup tos by message_id, ccs by message_id, bccs by message_id parallel 10;
with_headers = join headers by group, enron_messages by message_id parallel 10;
emails = foreach with_headers generate enron_messages::message_id as message_id,
CustomFormatToISO(enron_messages::sql_date, 'yyyy-MM-dd HH:mm:ss') as date,
TOTUPLE(enron_messages::from_address, enron_messages::from_name) as from:tuple(address:chararray, name:chararray),
enron_messages::subject as subject,
enron_messages::body as body,
headers::tos.(address, name) as tos,
headers::ccs.(address, name) as ccs,
headers::bccs.(address, name) as bccs;
store emails into '/enron/emails.avro' using AvroStorage();
Example here.
39. 0.2) Serialize events from streams
class GmailSlurper(object):
  ...
  def init_imap(self, username, password):
    self.username = username
    self.password = password
    try:
      self.imap.shutdown()
    except:
      pass
    self.imap = imaplib.IMAP4_SSL('imap.gmail.com', 993)
    self.imap.login(username, password)
    self.imap.is_readonly = True
  ...
  def write(self, record):
    self.avro_writer.append(record)
  ...
  def slurp(self):
    if(self.imap and self.imap_folder):
      for email_id in self.id_list:
        (status, email_hash, charset) = self.fetch_email(email_id)
        if(status == 'OK' and charset and 'thread_id' in email_hash and 'froms' in email_hash):
          print email_id, charset, email_hash['thread_id']
          self.write(email_hash)
Scrape your own gmail in Python and Ruby.
40. 0.3) ETL Logs
log_data = LOAD 'access_log'
USING org.apache.pig.piggybank.storage.apachelog.CommonLogLoader
AS (remoteAddr,
remoteLogname,
user,
time,
method,
uri,
proto,
bytes);
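For intuition, here is a rough stdlib Python analogue of what CommonLogLoader does - one Apache common-log line split into the same fields the Pig schema names. The regex is a simplification for illustration, not piggybank's actual implementation:

```python
import re

# Parse one Apache common-log line into the fields named in the Pig
# schema above: remoteAddr, remoteLogname, user, time, method, uri,
# proto, bytes (the HTTP status code is matched but discarded).
LOG_PATTERN = re.compile(
    r'(?P<remoteAddr>\S+) (?P<remoteLogname>\S+) (?P<user>\S+) '
    r'\[(?P<time>[^\]]+)\] "(?P<method>\S+) (?P<uri>\S+) (?P<proto>[^"]+)" '
    r'\S+ (?P<bytes>\S+)'
)

def parse_log_line(line):
    m = LOG_PATTERN.match(line)
    return m.groupdict() if m else None

rec = parse_log_line(
    '127.0.0.1 - frank [10/Oct/2000:13:55:36 -0700] '
    '"GET /apache_pb.gif HTTP/1.0" 200 2326'
)
```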
41. 1) Plumb atomic events -> browser
(Example stack that enables high productivity)
42. Lots of Stack Options with Examples
• Pig with Voldemort, Ruby, Sinatra: example
• Pig with ElasticSearch: example
• Pig with MongoDB, Node.js: example
• Pig with Cassandra, Python Streaming, Flask: example
• Pig with HBase, JRuby, Sinatra: example
• Pig with Hive via HCatalog: example (trivial on HDP)
• Up next: Accumulo, Redis, MySQL, etc.
43. 1.1) cat our Avro serialized events
me$ cat_avro ~/Data/enron.avro
{
u'bccs': [],
u'body': u'scamming people, blah blah',
u'ccs': [],
u'date': u'2000-08-28T01:50:00.000Z',
u'from': {u'address': u'bob.dobbs@enron.com', u'name': None},
u'message_id': u'<1731.10095812390082.JavaMail.evans@thyme>',
u'subject': u'Re: Enron trade for frop futures',
u'tos': [
{u'address': u'connie@enron.com', u'name': None}
]
}
Get cat_avro in python, ruby
44. 1.2) Load our events in Pig
me$ pig -l /tmp -x local -v -w
grunt> enron_emails = LOAD '/enron/emails.avro' USING AvroStorage();
grunt> describe enron_emails
emails: {
message_id: chararray,
datetime: chararray,
from: tuple(address: chararray,name: chararray),
subject: chararray,
body: chararray,
tos: {to: (address: chararray,name: chararray)},
ccs: {cc: (address: chararray,name: chararray)},
bccs: {bcc: (address: chararray,name: chararray)}
}
45. 1.3) ILLUSTRATE our events in Pig
grunt> illustrate enron_emails
---------------------------------------------------------------------------
| emails |
| message_id:chararray |
| datetime:chararray |
| from:tuple(address:chararray,name:chararray) |
| subject:chararray |
| body:chararray |
| tos:bag{to:tuple(address:chararray,name:chararray)} |
| ccs:bag{cc:tuple(address:chararray,name:chararray)} |
| bccs:bag{bcc:tuple(address:chararray,name:chararray)} |
---------------------------------------------------------------------------
| |
| <1731.10095812390082.JavaMail.evans@thyme> |
| 2001-01-09T06:38:00.000Z |
| (bob.dobbs@enron.com, J.R. Bob Dobbs) |
| Re: Enron trade for frop futures |
| scamming people, blah blah |
| {(connie@enron.com,)} |
| {} |
| {} |
Upgrade to Pig 0.10+
46. 1.4) Publish our events to a ‘database’
From Avro to MongoDB in one command:
pig -l /tmp -x local -v -w -param avros=enron.avro \
  -param mongourl='mongodb://localhost/enron.emails' avro_to_mongo.pig
Which does this:
/* MongoDB libraries and configuration */
register /me/mongo-hadoop/mongo-2.7.3.jar
register /me/mongo-hadoop/core/target/mongo-hadoop-core-1.1.0-SNAPSHOT.jar
register /me/mongo-hadoop/pig/target/mongo-hadoop-pig-1.1.0-SNAPSHOT.jar
/* Set speculative execution off to avoid chance of duplicate records in Mongo */
set mapred.map.tasks.speculative.execution false
set mapred.reduce.tasks.speculative.execution false
define MongoStorage com.mongodb.hadoop.pig.MongoStorage(); /* Shortcut */
/* By default, let's have 5 reducers */
set default_parallel 5
avros = load '$avros' using AvroStorage();
store avros into '$mongourl' using MongoStorage();
Full instructions here.
47. 1.5) Check events in our ‘database’
$ mongo enron
MongoDB shell version: 2.0.2
connecting to: enron
> show collections
emails
system.indexes
> db.emails.findOne({message_id: "<1731.10095812390082.JavaMail.evans@thyme>"})
{
  "_id" : ObjectId("502b4ae703643a6a49c8d180"),
  "message_id" : "<1731.10095812390082.JavaMail.evans@thyme>",
  "date" : "2001-01-09T06:38:00.000Z",
  "from" : { "address" : "bob.dobbs@enron.com", "name" : "J.R. Bob Dobbs" },
  "subject" : "Re: Enron trade for frop futures",
  "body" : "Scamming more people...",
  "tos" : [ { "address" : "connie@enron.com", "name" : null } ],
  "ccs" : [ ],
  "bccs" : [ ]
}
48. 1.6) Publish events on the web
require 'rubygems'
require 'sinatra'
require 'mongo'
require 'json'
connection = Mongo::Connection.new
database = connection['agile_data']
collection = database['emails']
get '/email/:message_id' do |message_id|
data = collection.find_one({:message_id => message_id})
JSON.generate(data)
end
50. What's the point?
• A designer can work against real data.
• An application developer can work against real data.
• A product manager can think in terms of real data.
• Entire team is grounded in reality!
• You’ll see how ugly your data really is.
• You’ll see how much work you have yet to do.
• Ship early and often!
• Feels agile, don’t it? Keep it up!
51. 1.7) Wrap events with Bootstrap
<link href="/static/bootstrap/docs/assets/css/bootstrap.css" rel="stylesheet">
</head>
<body>
<div class="container" style="margin-top: 100px;">
<table class="table table-striped table-bordered table-condensed">
<thead>
{% for key in data['keys'] %}
<th>{{ key }}</th>
{% endfor %}
</thead>
<tbody>
<tr>
{% for value in data['values'] %}
<td>{{ value }}</td>
{% endfor %}
</tr>
</tbody>
</table>
</div>
</body>
Complete example here with code here.
53. Refine. Add links between documents.
Not the Mona Lisa, but coming along... See: here
54. 1.8) List links to sorted events
Use Pig, serve/cache a bag/array of email documents:
pig -l /tmp -x local -v -w
emails_per_user = foreach (group emails by from.address) {
  sorted = order emails by date;
  last_1000 = limit sorted 1000;
  generate group as from_address, last_1000 as emails;
};
store emails_per_user into '$mongourl' using MongoStorage();
Use your ‘database’, if it can sort.
mongo enron
> db.emails.ensureIndex({message_id: 1})
> db.emails.find().sort({date: -1}).limit(10).pretty()
{
  "_id" : ObjectId("4f7a5da2414e4dd0645d1176"),
  "message_id" : "<CA+bvURyn-rLcH_JXeuzhyq8T9RNq+YJ_Hkvhnrpk8zfYshL-wA@mail.gmail.com>",
  "from" : [
...
56. 1.9) Make it searchable...
If you have a list, search is easy with ElasticSearch and Wonderdog...
/* Load ElasticSearch integration */
register '/me/wonderdog/target/wonderdog-1.0-SNAPSHOT.jar';
register '/me/elasticsearch-0.18.6/lib/*';
define ElasticSearch com.infochimps.elasticsearch.pig.ElasticSearchStorage();
emails = load '/me/tmp/emails' using AvroStorage();
store emails into 'es://email/email?json=false&size=1000' using ElasticSearch('/me/elasticsearch-0.18.6/config/elasticsearch.yml', '/me/elasticsearch-0.18.6/plugins');
Test it with curl:
curl -XGET 'http://localhost:9200/email/email/_search?q=hadoop&pretty=true&size=1'
ElasticSearch has no security features. Take note. Isolate.
57. From now on we speed up...
Don’t worry, it's in the book and on the blog.
http://hortonworks.com/blog/
60. 2) Create Simple Charts
• Start with an HTML table on general principle.
• Then use nvd3.js - reusable charts for d3.js
• Aggregating by properties & displaying them is the first step in entity resolution
• Start extracting entities. Ex: people, places, topics, time series
• Group documents by entities, rank and count.
• Publish top N, time series, etc.
• Fill a page with charts.
• Add a chart to your event page.
61. 2.1) Top N (of anything) in Pig
pig -l /tmp -x local -v -w
top_things = foreach (group things by key) {
  sorted = order things by arbitrary_rank desc;
  top_10_things = limit sorted 10;
  generate group as key, top_10_things as top_10_things;
};
store top_things into '$mongourl' using MongoStorage();
Remember, this is the same structure the browser gets as JSON.
This would make a good Pig Macro.
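The same top-N-per-key pattern, sketched in plain Python for comparison. The field names 'key' and 'arbitrary_rank' mirror the Pig script above and are purely illustrative:

```python
from heapq import nlargest
from itertools import groupby
from operator import itemgetter

def top_n_per_key(things, n=10):
    """Group records by 'key' and keep the n highest-ranked per key --
    the same shape the Pig job above ships to MongoDB."""
    out = {}
    # groupby needs its input sorted by the grouping key
    srt = sorted(things, key=itemgetter("key"))
    for key, grp in groupby(srt, key=itemgetter("key")):
        out[key] = nlargest(n, grp, key=itemgetter("arbitrary_rank"))
    return out

things = [{"key": "a", "arbitrary_rank": r} for r in range(20)]
top = top_n_per_key(things, n=3)
```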
62. 2.2) Time Series (of anything) in Pig
pig -l /tmp -x local -v -w
/* Group by our key and date rounded to the month, get a total */
things_by_month = foreach (group things by (key, ISOToMonth(datetime)))
  generate flatten(group) as (key, month),
           COUNT_STAR(things) as total;
/* Sort our totals per key by month to get a time series */
things_timeseries = foreach (group things_by_month by key) {
  timeseries = order things_by_month by month;
  generate group as key, timeseries as timeseries;
};
store things_timeseries into '$mongourl' using MongoStorage();
Yet another good Pig Macro.
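A plain-Python mirror of the same group-by-month pattern, assuming ISO 8601 datetime strings so the month is just a string prefix (field names are illustrative):

```python
from collections import defaultdict

def monthly_counts(things):
    """Count records per (key, month), then emit a month-sorted
    time series per key -- what the Pig job above computes."""
    counts = defaultdict(int)
    for t in things:
        month = t["datetime"][:7]   # '2001-01-09T06:38:00Z' -> '2001-01'
        counts[(t["key"], month)] += 1
    series = defaultdict(list)
    for (key, month), total in sorted(counts.items()):
        series[key].append((month, total))
    return dict(series)

series = monthly_counts([
    {"key": "bob", "datetime": "2001-01-09T06:38:00Z"},
    {"key": "bob", "datetime": "2001-01-12T10:00:00Z"},
    {"key": "bob", "datetime": "2001-02-01T00:00:00Z"},
])
```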
63. Data processing in our stack
A new feature in our application might begin at any layer... great!
(Cartoon: team members at each layer of the stack chime in - "I'm creative!", "I know Pig!", "I <3 JavaScript!")
Any team member can add new features, no problemo!
64. Data processing in our stack
... but we shift the data-processing towards batch, as we are able.
See real example here.
Ex: Overall total emails calculated in each layer
67. 3.0) From charts to reports...
• Extract entities from properties we aggregated by in charts (Step 2)
• Each entity gets its own type of web page
• Each unique entity gets its own web page
• Link to entities as they appear in atomic event documents (Step 1)
• Link most related entities together, same and between types.
• More visualizations!
• Parameterize results via forms.
70. 3.3) Get people clicking. Learn.
• Explore this web of generated pages, charts and links!
• Everyone on the team gets to know your data.
• Keep trying out different charts, metrics, entities, links.
• See what's interesting.
• Figure out what data needs cleaning and clean it.
• Start thinking about predictions & recommendations.
‘People’ could be just your team, if data is sensitive.
72. 4.0) Preparation
• We’ve already extracted entities, their properties and relationships
• Our charts show where our signal is rich
• We’ve cleaned our data to make it presentable
• The entire team has an intuitive understanding of the data
• They got that understanding by exploring the data
• We are all on the same page!
74. 4.2) Think in different perspectives
• Networks
• Time Series
• Distributions
• Natural Language
• Probability / Bayes
See here.
75. 4.3) Sink more time in deeper analysis
TF-IDF
import 'tfidf.macro';
my_tf_idf_scores = tf_idf(id_body, 'message_id', 'body');
/* Get the top 10 Tf*Idf scores per message */
per_message_cassandra = foreach (group tfidf_all by message_id) {
sorted = order tfidf_all by value desc;
top_10_topics = limit sorted 10;
generate group, top_10_topics.(score, value);
};
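For readers who want the math behind the tf_idf macro call, here is a minimal stdlib TF-IDF over tokenized documents - a sketch of the standard formula, not the macro's actual implementation:

```python
import math
from collections import Counter

def tf_idf(docs):
    """Minimal TF-IDF over {doc_id: [tokens]}: term frequency within
    a document, weighted down by how many documents contain the term."""
    n = len(docs)
    df = Counter()                       # document frequency per token
    for tokens in docs.values():
        df.update(set(tokens))
    scores = {}
    for doc_id, tokens in docs.items():
        tf = Counter(tokens)
        scores[doc_id] = {
            tok: (count / len(tokens)) * math.log(n / df[tok])
            for tok, count in tf.items()
        }
    return scores

scores = tf_idf({
    "m1": ["enron", "trade", "frop"],
    "m2": ["enron", "meeting"],
})
```

Tokens that appear in every document (like "enron" here) score zero, which is exactly why TF-IDF surfaces the distinctive topics per message.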
Probability / Bayes
sent_replies = join sent_counts by (from, to), reply_counts by (from, to);
reply_ratios = foreach sent_replies generate sent_counts::from as from,
                                             sent_counts::to as to,
                                             (float)reply_counts::total/(float)sent_counts::total as ratio;
reply_ratios = foreach reply_ratios generate from, to, (ratio > 1.0 ? 1.0 : ratio) as ratio;
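The reply-ratio calculation above, restated as a small Python function to make the capping at 1.0 explicit. Inputs map (from, to) pairs to message counts; the names are illustrative:

```python
def reply_ratios(sent_counts, reply_counts):
    """Estimate P(reply | sent) per (from, to) pair, capped at 1.0
    since replies can outnumber sends in a sampled window."""
    ratios = {}
    for pair, sent in sent_counts.items():
        replies = reply_counts.get(pair, 0)
        ratios[pair] = min(1.0, replies / float(sent))
    return ratios

ratios = reply_ratios(
    {("bob", "connie"): 4, ("bob", "eve"): 2},
    {("bob", "connie"): 2, ("bob", "eve"): 5},
)
```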
Example with code here and macro here.
78. Example: Packetpig and PacketLoop
snort_alerts = LOAD '$pcap'
USING com.packetloop.packetpig.loaders.pcap.detection.SnortLoader('$snortconfig');
countries = FOREACH snort_alerts
GENERATE
com.packetloop.packetpig.udf.geoip.Country(src) as country,
priority;
countries = GROUP countries BY country;
countries = FOREACH countries
GENERATE
group,
AVG(countries.priority) as average_severity;
STORE countries into 'output/choropleth_countries' using PigStorage(',');
Code here.
80. Hadoop Summit Amsterdam
• Amsterdam, March 20-21st
• Call for papers now open!
• Submit a lightning talk!
• http://hadoopsummit.org/amsterdam/
• Discount coupons - 10% off!
81. Hortonworks Data Platform
• Simplify deployment to get started quickly and easily
• Monitor and manage any size cluster with familiar console and tools
• Only platform to include data integration services to interact with any data
• Metadata services open the platform for integration with existing applications
• Dependable high availability architecture
• Tested at scale to future-proof your cluster growth
Reduce risks and cost of adoption. Lower the total cost to administer and provision. Integrate with your existing ecosystem.
82. Hortonworks Training
The expert source for Apache Hadoop training & certification
Role-based Developer and Administration training
– Coursework built and maintained by the core Apache Hadoop development team
– The “right” course, with the most extensive and realistic hands-on materials
– Provides an immersive experience into real-world Hadoop scenarios
– Public and private courses available
Comprehensive Apache Hadoop Certification
– Become a trusted and valuable Apache Hadoop expert
83. Next Steps?
1. Download Hortonworks Data Platform: hortonworks.com/download
2. Use the getting started guide: hortonworks.com/get-started
3. Learn more... get support
Hortonworks Training (hortonworks.com/training):
• Expert role-based training
• Courses for admins, developers and operators
• Certification program
• Custom onsite options
Hortonworks Support (hortonworks.com/support):
• Full lifecycle technical support across four service levels
• Delivered by Apache Hadoop Experts/Committers
• Forward-compatible
84. Thank You!
Questions & Answers
Slides: http://slidesha.re/O8kjaF
Follow: @hortonworks and @rjurney
Read: hortonworks.com/blog
Editor's notes: Hortonworks Data Platform (HDP) is the only 100% open source Apache Hadoop distribution that provides a complete and reliable foundation for enterprises that want to build, deploy and manage big data solutions. It allows you to confidently capture, process and share data in any format, at scale, on commodity hardware and/or in a cloud environment.
As the foundation for the next generation enterprise data architecture, HDP delivers all of the necessary components to uncover business insights from the growing streams of data flowing into and throughout your business. HDP is a fully integrated data platform that includes the stable core functions of Apache Hadoop (HDFS and MapReduce), the baseline tools to process big data (Apache Hive, Apache HBase, Apache Pig) as well as a set of advanced capabilities (Apache Ambari, Apache HCatalog and High Availability) that make big data operational and ready for the enterprise.