Paris HUG - Agile Analytics Applications on Hadoop
1. Agile Analytics Applications
Russell Jurney (@rjurney) - Hadoop Evangelist @Hortonworks
Formerly Viz, Data Science at Ning, LinkedIn
HBase Dashboards, Career Explorer, InMaps
© Hortonworks Inc. 2012
1
2. Agile Data - The Book (March 2013)
Read it now on OFPS
A philosophy,
not the only way
But still, it's good! Really!
3. We go fast... but don’t worry!
• Examples for EVERYTHING on the Hortonworks blog:
http://hortonworks.com/blog/authors/russell_jurney
• Download the slides - click the links - read examples!
• If it's not on the blog, it's in the book!
• Order now: http://shop.oreilly.com/product/0636920025054.do
• Read the book NOW on OFPS:
• http://ofps.oreilly.com/titles/9781449326265/chapter_2.html
4. Agile Application Development: Check
• LAMP stack mature
• Post-Rails frameworks to choose from
• Enable rapid feedback and agility
+ NoSQL
6. Scientific Computing / HPC
• ‘Smart kid’ only: MPI, Globus, etc. until Hadoop
Tubes and Mercury (old school) Cores and Spindles (new school)
UNIVAC and Deep Blue both fill a warehouse. We’re back...
8. Data Center as Computer
• Warehouse Scale Computers and applications
“A key challenge for architects of WSCs is to smooth out these discrepancies in a cost efficient manner.”
Click here for a paper on operating a ‘data center as computer.’
9. Hadoop to the Rescue!
Big data refinery / Modernize ETL
(Diagram: new data sources and interactions - audio, video, images; CRM, ERP, SCM transactions; docs, text, XML; web logs and clicks; social, graph, feeds; sensors, devices, RFID; spatial, GPS; events - flow into HDFS, where Apache Hadoop acts as a "Big Data Refinery". Refined data feeds ETL into SQL, NoSQL and NewSQL stores and the EDW/MPP, powering Business Intelligence & Analytics: dashboards, reports, visualization.)
I stole this slide from Eric. Update: He stole it from someone else.
10. Hadoop to the Rescue!
• Easy to use! (Pig, Hive, Cascading)
• CHEAP: 1% the cost of SAN/NAS
• A department can afford its own Hadoop cluster!
• Dump all your data in one place: Hadoop DFS
• Silos come CRASHING DOWN!
• JOIN like crazy!
• ETL like whoah!
• An army of mappers and reducers at your command
• OMGWTFBBQ ITS SO GREAT! I FEEL AWESOME!
12. Analytics Apps: It takes a Team
• Broad skill-set to make useful apps
• Basically nobody has them all
• Application development is inherently collaborative
13. Data Science Team
• 3-4 team members with broad, diverse skill-sets that overlap
• Transactional overhead dominates at 5+ people
• Expert researchers: lend 25-50% of their time to teams
• Pick relevant researchers. Leave them alone. They’ll spawn
new products by accident. Not just CS/Math. Design. Art?
• Creative workers. Run like a studio, not an assembly line
• Total freedom... with goals and deliverables.
• Work environment matters most: private, social & quiet space
• Desks/cubes optional
14. How to get insight into product?
• Back-end has gotten t-h-i-c-k-e-r
• Generating $$$ insight can take 10-100x app dev
• Timeline disjoint: analytics vs agile app-dev/design
• How do you ship insights efficiently?
• How do you collaborate on research vs developer timeline?
15. The Wrong Way - Part One
“We made a great design. Your job is to predict the future for it.”
16. The Wrong Way - Part Two
“What's taking you so long to reliably predict the future?”
17. The Wrong Way - Part Three
“The users don’t understand what 86% true means.”
18. The Wrong Way - Part Four
GHJIAEHGIEhjagigehganbanbigaebjnain!!!!!RJ(@J?!!
19. The Wrong Way - Inevitable Conclusion
(Image: a plane flying into a mountain.)
21. Chief Problem
You can’t design insight in analytics applications.
You discover it.
You discover by exploring.
22. -> Strategy
So make an app for exploring your data.
Iterate and publish intermediate results.
Which becomes a palette for what you ship.
23. Data Design
• Not the 1st query that = insight, it's the 15th, or the 150th
• Capturing “Ah ha!” moments
• Slow to do those in batch...
• Faster, better context in an interactive web application.
• Pre-designed charts wind up terrible. So bad.
• Easy to invest man-years in the wrong statistical models
• Semantics of presenting predictions are complex, delicate
• Opportunity lies at intersection of data & design
24. How do we get back to Agile?
26. Set up an environment where...
• Insights repeatedly produced
• Iterative work shared with entire team
• Interactive from day 0
• Data model is consistent end-to-end
• Minimal impedance between layers
• Scope and depth of insights grow
• Insights form the palette for what you ship
• Until the application pays for itself and more
27. Value document > relation
Most data is dirty. Most data is semi-structured or un-structured. Rejoice!
28. Value document > relation
Note: Hive/ArrayQL/NewSQL's support of document/array types blurs this distinction.
29. Relational Data = Legacy?
• Why JOIN? Storage is fundamentally cheap!
• Duplicate that JOIN data in one big record type!
• ETL once to document format on import, NOT every job
• Not zero JOINs, but far fewer JOINs
• Semi-structured documents preserve data’s actual structure
• Column-compressed document formats beat JOINs! (paper coming)
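To make the "duplicate the JOIN data into one big record type" idea concrete, here is a minimal Python sketch - field names are hypothetical, loosely modeled on the Enron messages/recipients tables that appear later in this deck - that folds a recipients relation into nested email documents once, at import time:

```python
# Hypothetical illustration: denormalize a relational 'recipients' table
# into nested email documents, so downstream jobs need no JOIN.
messages = [
    {"message_id": "m1", "subject": "Re: frop futures"},
]
recipients = [
    {"message_id": "m1", "reciptype": "to", "address": "connie@enron.com"},
    {"message_id": "m1", "reciptype": "cc", "address": "bob.dobbs@enron.com"},
]

def denormalize(messages, recipients):
    # Index recipients by message_id once, at import time
    by_msg = {}
    for r in recipients:
        by_msg.setdefault(r["message_id"], {"to": [], "cc": [], "bcc": []})
        by_msg[r["message_id"]][r["reciptype"]].append(r["address"])
    # Embed the recipient lists inside each message document
    docs = []
    for m in messages:
        headers = by_msg.get(m["message_id"], {"to": [], "cc": [], "bcc": []})
        docs.append(dict(m, tos=headers["to"], ccs=headers["cc"],
                         bccs=headers["bcc"]))
    return docs

docs = denormalize(messages, recipients)
```

After this one-time ETL, every job reads self-contained documents instead of re-joining tables.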
30. Value imperative > declarative
• We don’t know what we want to SELECT.
• Data is dirty - check each step, clean iteratively.
• 85% of a data scientist's time is spent munging. See: ETL.
• Imperative is optimized for our process.
• Process = iterative, snowballing insight
• Efficiency matters, self optimize
32. Ex. dataflow: ETL + email sent count
(I can’t read this either. Get a big version here.)
33. Value Pig > Hive (for app-dev)
• Pigs eat ANYTHING
• Pig is optimized for refining data, as opposed to consuming it
• Pig is imperative, iterative
• Pig is dataflows, and SQLish (but not SQL)
• Code modularization/re-use: Pig Macros
• ILLUSTRATE speeds dev time (even UDFs)
• Easy UDFs in Java, JRuby, Jython, JavaScript
• Pig Streaming = use any tool, period.
• Easily prepare our data as it will appear in our app.
• If you prefer Hive, use Hive.
But actually, I wish Pig and Hive were one tool. Pig, then Hive, then Pig, then Hive...
See: HCatalog for Pig/Hive integration, and this post.
34. Localhost vs Petabyte scale: same tools
• Simplicity is essential to scalability: use the highest-level tools we can
• Prepare a good sample - tricky with joins, easy with documents
• Local mode: pig -l /tmp -x local -v -w
• Frequent use of ILLUSTRATE
• 1st: Iterate, debug & publish locally
• 2nd: Run on cluster, publish to team/customer
• Consider skipping Object-Relational-Mapping (ORM)
• We do not trust ‘databases,’ only HDFS @ n=3.
• Everything we serve in our app is re-creatable via Hadoop.
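One way to "prepare a good sample" for local mode is a single-pass reservoir sample over your document collection; a minimal stdlib sketch (the fixed seed is just for reproducibility, not part of the technique):

```python
import random

def reservoir_sample(records, k, seed=42):
    """Uniform random sample of k records in one pass over a stream --
    handy for cutting a local-mode test set from a large collection."""
    rng = random.Random(seed)
    sample = []
    for i, rec in enumerate(records):
        if i < k:
            sample.append(rec)           # fill the reservoir first
        else:
            j = rng.randint(0, i)        # replace with decreasing odds
            if j < k:
                sample[j] = rec
    return sample

sample = reservoir_sample(range(100000), 10)
```

Because it streams, this works the same on ten records or ten million; sampling joined relations consistently is much harder, which is another point for documents.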
38. 0.1) Documents via Relation ETL
enron_messages = load '/enron/enron_messages.tsv' as (
message_id:chararray,
sql_date:chararray,
from_address:chararray,
from_name:chararray,
subject:chararray,
body:chararray
);
enron_recipients = load '/enron/enron_recipients.tsv' as ( message_id:chararray, reciptype:chararray, address:chararray, name:chararray);
split enron_recipients into tos IF reciptype=='to', ccs IF reciptype=='cc', bccs IF reciptype=='bcc';
headers = cogroup tos by message_id, ccs by message_id, bccs by message_id parallel 10;
with_headers = join headers by group, enron_messages by message_id parallel 10;
emails = foreach with_headers generate enron_messages::message_id as message_id,
CustomFormatToISO(enron_messages::sql_date, 'yyyy-MM-dd HH:mm:ss') as date,
TOTUPLE(enron_messages::from_address, enron_messages::from_name) as from:tuple(address:chararray, name:chararray),
enron_messages::subject as subject,
enron_messages::body as body,
headers::tos.(address, name) as tos,
headers::ccs.(address, name) as ccs,
headers::bccs.(address, name) as bccs;
store emails into '/enron/emails.avro' using AvroStorage();
Example here.
39. 0.2) Serialize events from streams
class GmailSlurper(object):
  ...
  def init_imap(self, username, password):
    self.username = username
    self.password = password
    try:
      self.imap.shutdown()
    except:
      pass
    self.imap = imaplib.IMAP4_SSL('imap.gmail.com', 993)
    self.imap.login(username, password)
    self.imap.is_readonly = True
  ...
  def write(self, record):
    self.avro_writer.append(record)
  ...
  def slurp(self):
    if(self.imap and self.imap_folder):
      for email_id in self.id_list:
        (status, email_hash, charset) = self.fetch_email(email_id)
        if(status == 'OK' and charset and 'thread_id' in email_hash and 'froms' in email_hash):
          print email_id, charset, email_hash['thread_id']
          self.write(email_hash)
Scrape your own gmail in Python and Ruby.
40. 0.3) ETL Logs
log_data = LOAD 'access_log'
USING org.apache.pig.piggybank.storage.apachelog.CommonLogLoader
AS (remoteAddr,
remoteLogname,
user,
time,
method,
uri,
proto,
bytes);
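For intuition, here is a rough stdlib Python analogue of what CommonLogLoader does - one Apache common-log line split into the same fields the Pig schema names. The regex is a simplification for illustration, not piggybank's actual implementation:

```python
import re

# Parse one Apache common-log line into the fields named in the Pig
# schema above: remoteAddr, remoteLogname, user, time, method, uri,
# proto, bytes (the HTTP status code is matched but discarded).
LOG_PATTERN = re.compile(
    r'(?P<remoteAddr>\S+) (?P<remoteLogname>\S+) (?P<user>\S+) '
    r'\[(?P<time>[^\]]+)\] "(?P<method>\S+) (?P<uri>\S+) (?P<proto>[^"]+)" '
    r'\S+ (?P<bytes>\S+)'
)

def parse_log_line(line):
    m = LOG_PATTERN.match(line)
    return m.groupdict() if m else None

rec = parse_log_line(
    '127.0.0.1 - frank [10/Oct/2000:13:55:36 -0700] '
    '"GET /apache_pb.gif HTTP/1.0" 200 2326'
)
```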
41. 1) Plumb atomic events -> browser
(Example stack that enables high productivity)
42. Lots of Stack Options with Examples
• Pig with Voldemort, Ruby, Sinatra: example
• Pig with ElasticSearch: example
• Pig with MongoDB, Node.js: example
• Pig with Cassandra, Python Streaming, Flask: example
• Pig with HBase, JRuby, Sinatra: example
• Pig with Hive via HCatalog: example (trivial on HDP)
• Up next: Accumulo, Redis, MySQL, etc.
43. 1.1) cat our Avro serialized events
me$ cat_avro ~/Data/enron.avro
{
u'bccs': [],
u'body': u'scamming people, blah blah',
u'ccs': [],
u'date': u'2000-08-28T01:50:00.000Z',
u'from': {u'address': u'bob.dobbs@enron.com', u'name': None},
u'message_id': u'<1731.10095812390082.JavaMail.evans@thyme>',
u'subject': u'Re: Enron trade for frop futures',
u'tos': [
{u'address': u'connie@enron.com', u'name': None}
]
}
Get cat_avro in python, ruby
44. 1.2) Load our events in Pig
me$ pig -l /tmp -x local -v -w
grunt> enron_emails = LOAD '/enron/emails.avro' USING AvroStorage();
grunt> describe enron_emails
emails: {
message_id: chararray,
datetime: chararray,
from: tuple(address: chararray,name: chararray),
subject: chararray,
body: chararray,
tos: {to: (address: chararray,name: chararray)},
ccs: {cc: (address: chararray,name: chararray)},
bccs: {bcc: (address: chararray,name: chararray)}
}
45. 1.3) ILLUSTRATE our events in Pig
grunt> illustrate enron_emails
---------------------------------------------------------------------------
| emails |
| message_id:chararray |
| datetime:chararray |
| from:tuple(address:chararray,name:chararray) |
| subject:chararray |
| body:chararray |
| tos:bag{to:tuple(address:chararray,name:chararray)} |
| ccs:bag{cc:tuple(address:chararray,name:chararray)} |
| bccs:bag{bcc:tuple(address:chararray,name:chararray)} |
---------------------------------------------------------------------------
| |
| <1731.10095812390082.JavaMail.evans@thyme> |
| 2001-01-09T06:38:00.000Z |
| (bob.dobbs@enron.com, J.R. Bob Dobbs) |
| Re: Enron trade for frop futures |
| scamming people, blah blah |
| {(connie@enron.com,)} |
| {} |
| {} |
Upgrade to Pig 0.10+
46. 1.4) Publish our events to a ‘database’
From Avro to MongoDB in one command:
pig -l /tmp -x local -v -w -param avros=enron.avro \
  -param mongourl='mongodb://localhost/enron.emails' avro_to_mongo.pig
Which does this:
/* MongoDB libraries and configuration */
register /me/mongo-hadoop/mongo-2.7.3.jar
register /me/mongo-hadoop/core/target/mongo-hadoop-core-1.1.0-SNAPSHOT.jar
register /me/mongo-hadoop/pig/target/mongo-hadoop-pig-1.1.0-SNAPSHOT.jar
/* Set speculative execution off to avoid chance of duplicate records in Mongo */
set mapred.map.tasks.speculative.execution false
set mapred.reduce.tasks.speculative.execution false
define MongoStorage com.mongodb.hadoop.pig.MongoStorage(); /* Shortcut */
/* By default, let's have 5 reducers */
set default_parallel 5
avros = load '$avros' using AvroStorage();
store avros into '$mongourl' using MongoStorage();
Full instructions here.
47. 1.5) Check events in our ‘database’
$ mongo enron
MongoDB shell version: 2.0.2
connecting to: enron
> show collections
emails
system.indexes
> db.emails.findOne({message_id: "<1731.10095812390082.JavaMail.evans@thyme>"})
{
  "_id" : ObjectId("502b4ae703643a6a49c8d180"),
  "message_id" : "<1731.10095812390082.JavaMail.evans@thyme>",
  "date" : "2001-01-09T06:38:00.000Z",
  "from" : { "address" : "bob.dobbs@enron.com", "name" : "J.R. Bob Dobbs" },
  "subject" : "Re: Enron trade for frop futures",
  "body" : "Scamming more people...",
  "tos" : [ { "address" : "connie@enron.com", "name" : null } ],
  "ccs" : [ ],
  "bccs" : [ ]
}
48. 1.6) Publish events on the web
require 'rubygems'
require 'sinatra'
require 'mongo'
require 'json'
connection = Mongo::Connection.new
database = connection['agile_data']
collection = database['emails']
get '/email/:message_id' do |message_id|
data = collection.find_one({:message_id => message_id})
JSON.generate(data)
end
50. What's the point?
• A designer can work against real data.
• An application developer can work against real data.
• A product manager can think in terms of real data.
• Entire team is grounded in reality!
• You’ll see how ugly your data really is.
• You’ll see how much work you have yet to do.
• Ship early and often!
• Feels agile, don’t it? Keep it up!
51. 1.7) Wrap events with Bootstrap
<link href="/static/bootstrap/docs/assets/css/bootstrap.css" rel="stylesheet">
</head>
<body>
<div class="container" style="margin-top: 100px;">
<table class="table table-striped table-bordered table-condensed">
<thead>
{% for key in data['keys'] %}
<th>{{ key }}</th>
{% endfor %}
</thead>
<tbody>
<tr>
{% for value in data['values'] %}
<td>{{ value }}</td>
{% endfor %}
</tr>
</tbody>
</table>
</div>
</body>
Complete example here with code here.
53. Refine. Add links between documents.
Not the Mona Lisa, but coming along... See: here
54. 1.8) List links to sorted events
Use Pig, serve/cache a bag/array of email documents:
pig -l /tmp -x local -v -w
emails_per_user = foreach (group emails by from.address) {
  sorted = order emails by date;
  last_1000 = limit sorted 1000;
  generate group as from_address, last_1000 as emails;
};
store emails_per_user into '$mongourl' using MongoStorage();
Use your ‘database’, if it can sort.
mongo enron
> db.emails.ensureIndex({message_id: 1})
> db.emails.find().sort({date: -1}).limit(10).pretty()
{
  "_id" : ObjectId("4f7a5da2414e4dd0645d1176"),
  "message_id" : "<CA+bvURyn-rLcH_JXeuzhyq8T9RNq+YJ_Hkvhnrpk8zfYshL-wA@mail.gmail.com>",
  "from" : [
...
56. 1.9) Make it searchable...
If you have a list, search is easy with ElasticSearch and Wonderdog...
/* Load ElasticSearch integration */
register '/me/wonderdog/target/wonderdog-1.0-SNAPSHOT.jar';
register '/me/elasticsearch-0.18.6/lib/*';
define ElasticSearch com.infochimps.elasticsearch.pig.ElasticSearchStorage();
emails = load '/me/tmp/emails' using AvroStorage();
store emails into 'es://email/email?json=false&size=1000' using ElasticSearch('/me/elasticsearch-0.18.6/config/elasticsearch.yml', '/me/elasticsearch-0.18.6/plugins');
Test it with curl:
curl -XGET 'http://localhost:9200/email/email/_search?q=hadoop&pretty=true&size=1'
ElasticSearch has no security features. Take note. Isolate.
57. From now on we speed up...
Don’t worry, it's in the book and on the blog.
http://hortonworks.com/blog/
60. 2) Create Simple Charts
• Start with an HTML table on general principle.
• Then use nvd3.js - reusable charts for d3.js
• Aggregating by properties & displaying them is the first step in entity resolution
• Start extracting entities. Ex: people, places, topics, time series
• Group documents by entities, rank and count.
• Publish top N, time series, etc.
• Fill a page with charts.
• Add a chart to your event page.
61. 2.1) Top N (of anything) in Pig
pig -l /tmp -x local -v -w
top_things = foreach (group things by key) {
  sorted = order things by arbitrary_rank desc;
  top_10_things = limit sorted 10;
  generate group as key, top_10_things as top_10_things;
};
store top_things into '$mongourl' using MongoStorage();
Remember, this is the same structure the browser gets as JSON.
This would make a good Pig Macro.
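The same top-N-per-key pattern, sketched in plain Python for comparison. The field names 'key' and 'arbitrary_rank' mirror the Pig script above and are purely illustrative:

```python
from heapq import nlargest
from itertools import groupby
from operator import itemgetter

def top_n_per_key(things, n=10):
    """Group records by 'key' and keep the n highest-ranked per key --
    the same shape the Pig job above ships to MongoDB."""
    out = {}
    # groupby needs its input sorted by the grouping key
    srt = sorted(things, key=itemgetter("key"))
    for key, grp in groupby(srt, key=itemgetter("key")):
        out[key] = nlargest(n, grp, key=itemgetter("arbitrary_rank"))
    return out

things = [{"key": "a", "arbitrary_rank": r} for r in range(20)]
top = top_n_per_key(things, n=3)
```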
62. 2.2) Time Series (of anything) in Pig
pig -l /tmp -x local -v -w
/* Group by our key and date rounded to the month, get a total */
things_by_month = foreach (group things by (key, ISOToMonth(datetime)))
  generate flatten(group) as (key, month),
           COUNT_STAR(things) as total;
/* Sort our totals per key by month to get a time series */
things_timeseries = foreach (group things_by_month by key) {
  timeseries = order things_by_month by month;
  generate group as key, timeseries as timeseries;
};
store things_timeseries into '$mongourl' using MongoStorage();
Yet another good Pig Macro.
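A plain-Python mirror of the same group-by-month pattern, assuming ISO 8601 datetime strings so the month is just a string prefix (field names are illustrative):

```python
from collections import defaultdict

def monthly_counts(things):
    """Count records per (key, month), then emit a month-sorted
    time series per key -- what the Pig job above computes."""
    counts = defaultdict(int)
    for t in things:
        month = t["datetime"][:7]   # '2001-01-09T06:38:00Z' -> '2001-01'
        counts[(t["key"], month)] += 1
    series = defaultdict(list)
    for (key, month), total in sorted(counts.items()):
        series[key].append((month, total))
    return dict(series)

series = monthly_counts([
    {"key": "bob", "datetime": "2001-01-09T06:38:00Z"},
    {"key": "bob", "datetime": "2001-01-12T10:00:00Z"},
    {"key": "bob", "datetime": "2001-02-01T00:00:00Z"},
])
```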
63. Data processing in our stack
A new feature in our application might begin at any layer... great!
(Cartoon: team members at each layer of the stack chime in - "I'm creative!", "I know Pig!", "I <3 JavaScript!")
Any team member can add new features, no problemo!
64. Data processing in our stack
... but we shift the data-processing towards batch, as we are able.
See real example here.
Ex: Overall total emails calculated in each layer
67. 3.0) From charts to reports...
• Extract entities from properties we aggregated by in charts (Step 2)
• Each entity gets its own type of web page
• Each unique entity gets its own web page
• Link to entities as they appear in atomic event documents (Step 1)
• Link most related entities together, same and between types.
• More visualizations!
• Parameterize results via forms.
70. 3.3) Get people clicking. Learn.
• Explore this web of generated pages, charts and links!
• Everyone on the team gets to know your data.
• Keep trying out different charts, metrics, entities, links.
• See what's interesting.
• Figure out what data needs cleaning and clean it.
• Start thinking about predictions & recommendations.
‘People’ could be just your team, if data is sensitive.
72. 4.0) Preparation
• We’ve already extracted entities, their properties and relationships
• Our charts show where our signal is rich
• We’ve cleaned our data to make it presentable
• The entire team has an intuitive understanding of the data
• They got that understanding by exploring the data
• We are all on the same page!
74. 4.2) Think in different perspectives
• Networks
• Time Series
• Distributions
• Natural Language
• Probability / Bayes
See here.
75. 4.3) Sink more time in deeper analysis
TF-IDF
import 'tfidf.macro';
my_tf_idf_scores = tf_idf(id_body, 'message_id', 'body');
/* Get the top 10 Tf*Idf scores per message */
per_message_cassandra = foreach (group tfidf_all by message_id) {
sorted = order tfidf_all by value desc;
top_10_topics = limit sorted 10;
generate group, top_10_topics.(score, value);
};
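For readers who want the math behind the tf_idf macro call, here is a minimal stdlib TF-IDF over tokenized documents - a sketch of the standard formula, not the macro's actual implementation:

```python
import math
from collections import Counter

def tf_idf(docs):
    """Minimal TF-IDF over {doc_id: [tokens]}: term frequency within
    a document, weighted down by how many documents contain the term."""
    n = len(docs)
    df = Counter()                       # document frequency per token
    for tokens in docs.values():
        df.update(set(tokens))
    scores = {}
    for doc_id, tokens in docs.items():
        tf = Counter(tokens)
        scores[doc_id] = {
            tok: (count / len(tokens)) * math.log(n / df[tok])
            for tok, count in tf.items()
        }
    return scores

scores = tf_idf({
    "m1": ["enron", "trade", "frop"],
    "m2": ["enron", "meeting"],
})
```

Tokens that appear in every document (like "enron" here) score zero, which is exactly why TF-IDF surfaces the distinctive topics per message.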
Probability / Bayes
sent_replies = join sent_counts by (from, to), reply_counts by (from, to);
reply_ratios = foreach sent_replies generate sent_counts::from as from,
                                             sent_counts::to as to,
                                             (float)reply_counts::total/(float)sent_counts::total as ratio;
reply_ratios = foreach reply_ratios generate from, to, (ratio > 1.0 ? 1.0 : ratio) as ratio;
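The reply-ratio calculation above, restated as a small Python function to make the capping at 1.0 explicit. Inputs map (from, to) pairs to message counts; the names are illustrative:

```python
def reply_ratios(sent_counts, reply_counts):
    """Estimate P(reply | sent) per (from, to) pair, capped at 1.0
    since replies can outnumber sends in a sampled window."""
    ratios = {}
    for pair, sent in sent_counts.items():
        replies = reply_counts.get(pair, 0)
        ratios[pair] = min(1.0, replies / float(sent))
    return ratios

ratios = reply_ratios(
    {("bob", "connie"): 4, ("bob", "eve"): 2},
    {("bob", "connie"): 2, ("bob", "eve"): 5},
)
```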
Example with code here and macro here.
78. Example: Packetpig and PacketLoop
snort_alerts = LOAD '$pcap'
USING com.packetloop.packetpig.loaders.pcap.detection.SnortLoader('$snortconfig');
countries = FOREACH snort_alerts
GENERATE
com.packetloop.packetpig.udf.geoip.Country(src) as country,
priority;
countries = GROUP countries BY country;
countries = FOREACH countries
GENERATE
group,
AVG(countries.priority) as average_severity;
STORE countries into 'output/choropleth_countries' using PigStorage(',');
Code here.
80. Hadoop Summit Amsterdam
• Amsterdam, March 20-21st
• Call for papers now open!
• Submit a lightning talk!
• http://hadoopsummit.org/amsterdam/
• Discount coupons - 10% off!
81. Hortonworks Data Platform
• Simplify deployment to get started quickly and easily
• Monitor and manage any size cluster with familiar console and tools
• Only platform to include data integration services to interact with any data
• Metadata services open the platform for integration with existing applications
• Dependable high availability architecture
• Tested at scale to future-proof your cluster growth
Reduce risks and cost of adoption. Lower the total cost to administer and provision. Integrate with your existing ecosystem.
82. Hortonworks Training
The expert source for Apache Hadoop training & certification
Role-based Developer and Administration training
– Coursework built and maintained by the core Apache Hadoop development team
– The “right” course, with the most extensive and realistic hands-on materials
– Provides an immersive experience into real-world Hadoop scenarios
– Public and private courses available
Comprehensive Apache Hadoop Certification
– Become a trusted and valuable Apache Hadoop expert
83. Next Steps?
1. Download Hortonworks Data Platform: hortonworks.com/download
2. Use the getting started guide: hortonworks.com/get-started
3. Learn more... get support
Hortonworks Training (hortonworks.com/training):
• Expert role-based training
• Courses for admins, developers and operators
• Certification program
• Custom onsite options
Hortonworks Support (hortonworks.com/support):
• Full lifecycle technical support across four service levels
• Delivered by Apache Hadoop Experts/Committers
• Forward-compatible
84. Thank You!
Questions & Answers
Slides: http://slidesha.re/O8kjaF
Follow: @hortonworks and @rjurney
Read: hortonworks.com/blog
Editor's notes: Hortonworks Data Platform (HDP) is the only 100% open source Apache Hadoop distribution that provides a complete and reliable foundation for enterprises that want to build, deploy and manage big data solutions. It allows you to confidently capture, process and share data in any format, at scale, on commodity hardware and/or in a cloud environment.
As the foundation for the next generation enterprise data architecture, HDP delivers all of the necessary components to uncover business insights from the growing streams of data flowing into and throughout your business. HDP is a fully integrated data platform that includes the stable core functions of Apache Hadoop (HDFS and MapReduce), the baseline tools to process big data (Apache Hive, Apache HBase, Apache Pig) as well as a set of advanced capabilities (Apache Ambari, Apache HCatalog and High Availability) that make big data operational and ready for the enterprise.