Architecting the Future of Big Data & Search - Eric Baldeschwieler
1. Architecting the Future of Big Data
and Search
Eric Baldeschwieler, Hortonworks
e14@hortonworks.com, 19 October 2011
2. What I Will Cover
§ Architecting the Future of Big Data and
Search
• Lucene, a technology for managing big data
• Hadoop, a technology built for search
• Could they work together?
§ Topics:
• What is Apache Hadoop?
• History and Use Cases
• Current State
• Where Hadoop is Going
• Investigating Apache Hadoop and Lucene
4. Apache Hadoop is…
A set of open source projects owned by the Apache Software Foundation that transforms commodity computers and networks into a distributed service
• HDFS – Stores petabytes of data reliably
• MapReduce – Allows huge distributed computations
§ Key attributes:
• Reliable and redundant – Doesn’t slow down or lose data even as hardware fails
• Simple and flexible APIs – Our rocket scientists use it directly!
• Very powerful – Harnesses huge clusters, supports best-of-breed analytics
• Batch processing-centric – Hence its great simplicity and speed; not a fit for all use cases
7. MapReduce
§ MapReduce is a distributed computing programming model
§ It works like a Unix pipeline:
• cat input | grep | sort | uniq -c > output
• Input | Map | Shuffle & Sort | Reduce | Output
§ Strengths:
• Easy to use! Developer just writes a couple of functions
• Moves compute to data
§ Schedules work on HDFS node with data if possible
• Scans through data, reducing seeks
• Automatic reliability and re-execution on failure
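The Unix-pipeline analogy above can be sketched as a local Python simulation. This is not the Hadoop API (which runs map and reduce tasks across a cluster); it is only a minimal sketch of the three phases, using a word count as the classic example:

```python
from itertools import groupby
from operator import itemgetter

def map_phase(lines):
    # Map: emit a (key, value) pair per word, analogous to `cat | grep`
    for line in lines:
        for word in line.split():
            yield (word, 1)

def shuffle_and_sort(pairs):
    # Shuffle & sort: sort by key and group all values for the same key,
    # analogous to `sort` in the pipeline
    for key, group in groupby(sorted(pairs, key=itemgetter(0)), key=itemgetter(0)):
        yield key, [value for _, value in group]

def reduce_phase(grouped):
    # Reduce: aggregate the values per key, analogous to `uniq -c`
    for key, values in grouped:
        yield key, sum(values)

input_lines = ["big data big search", "hadoop search"]
output = dict(reduce_phase(shuffle_and_sort(map_phase(input_lines))))
# output == {'big': 2, 'data': 1, 'search': 2, 'hadoop': 1}
```

In real Hadoop the developer writes only the map and reduce functions; the framework handles the shuffle, data locality, and re-execution on failure.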
8. HDFS: Scalable, Reliable, Manageable
§ Scale IO, storage, CPU
• Add commodity servers & JBODs
• 4K nodes in a cluster
§ Fault tolerant & easy management
• Built-in redundancy
• Tolerates disk and node failures
• Automatically manages addition/removal of nodes
• One operator per 8K nodes!!
§ Storage servers used for computation
• Move computation to data
• Not a SAN, but high-bandwidth network access to data via Ethernet
§ Immutable file system
• Read, write, sync/flush
• No random writes
[Figure: cluster network – core switches above rack switches above racks of servers]
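The two HDFS properties above – write-once files and block replication across nodes – can be illustrated with a toy sketch. All names here are hypothetical; real HDFS splits files into large blocks (64/128 MB) and a NameNode tracks replica placement:

```python
import random

class MiniDFS:
    """Toy sketch of HDFS-style write-once storage with replication.
    Hypothetical names, not the real HDFS API."""

    def __init__(self, nodes, replication=3):
        self.nodes = {n: {} for n in nodes}  # node -> {path: data}
        self.replication = replication
        self.closed = set()

    def write(self, path, data):
        if path in self.closed:
            # immutable file system: no random writes or overwrites
            raise IOError("file already written and closed")
        # place a copy on `replication` distinct nodes
        for n in random.sample(list(self.nodes), self.replication):
            self.nodes[n][path] = data
        self.closed.add(path)

    def fail_node(self, node):
        del self.nodes[node]  # simulate a crashed server

    def read(self, path):
        # any surviving replica can serve the read
        for blocks in self.nodes.values():
            if path in blocks:
                return blocks[path]
        raise IOError("all replicas lost")

dfs = MiniDFS(nodes=["n1", "n2", "n3", "n4"], replication=3)
dfs.write("/logs/day1", b"click data")
dfs.fail_node("n1")                             # one node dies...
assert dfs.read("/logs/day1") == b"click data"  # ...the data survives
```

With three replicas, any single disk or node failure leaves at least two copies readable, which is why the cluster "doesn't slow down or lose data even as hardware fails."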
9. HBase
§ Hadoop ecosystem “NoSQL store”
• Very large tables interoperable with Hadoop
• Inspired by Google’s BigTable
§ Features
• Multidimensional sorted Map
§ Table => Row => Column => Version => Value
• Distributed column-oriented store
• Scale – Sharding etc. done automatically
§ No SQL, CRUD etc.
§ Billions of rows × millions of columns
• Uses HDFS for its storage layer
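The "multidimensional sorted map" above can be sketched in a few lines. This is a toy model with a hypothetical API, not the real HBase client; it only shows the row → column → version → value structure and why sorted row keys make range scans cheap:

```python
class ToyHTable:
    """Toy sketch of HBase's data model:
    row key -> column -> [(version timestamp, value), ...]"""

    def __init__(self):
        self.rows = {}

    def put(self, row, column, value, ts):
        cells = self.rows.setdefault(row, {}).setdefault(column, [])
        cells.append((ts, value))
        cells.sort()  # versions kept ordered by timestamp

    def get(self, row, column):
        # return the newest version, as HBase does by default
        return self.rows[row][column][-1][1]

    def scan(self, start_row, stop_row):
        # rows are kept sorted by key, so a range scan is a cheap walk
        for key in sorted(self.rows):
            if start_row <= key < stop_row:
                yield key, self.rows[key]

t = ToyHTable()
t.put("com.example/a", "content:title", "old title", ts=1)
t.put("com.example/a", "content:title", "new title", ts=2)
assert t.get("com.example/a", "content:title") == "new title"
```

In real HBase the sorted row-key space is also what gets split into regions and sharded across servers automatically.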
11. A Brief History
§ Early adopters, 2006 – present
• Scale and productize Apache Hadoop
§ Other Internet companies, 2008 – present
• Add tools / frameworks, enhance Hadoop
§ Service providers, 2010 – present
• Provide training, support, hosting
• Cloudera, MapR, Microsoft, IBM, EMC, Oracle…
§ Wide enterprise adoption, nascent / 2011
• Funds further development and enhancements
12. Early Adopters & Uses
• analyzing web logs
• data analytics
• advertising optimization
• machine learning
• text mining
• web search
• mail anti-spam
• content optimization
• customer trend analysis
• ad selection
• video & audio processing
• data mining
• user interest prediction
• social media
21. Adoption Drivers
§ Business drivers
• ROI and business advantage from mastering big data
• High-value projects that require use of more data
• Opportunity to interact with customers at point of procurement
§ Financial drivers
• Growing cost of data systems as a percentage of IT spend
• Cost advantage of commodity hardware + open source
§ Technical drivers
• Existing solutions not well suited to the volume, variety and velocity of big data
• Proliferation of unstructured data
§ Callouts: Gartner predicts 800% data growth over the next 5 years; 80–90% of data produced today is unstructured
22. Key Success Factors
§ Opportunity
• Apache Hadoop has the potential to become a center of the
next generation enterprise data platform
• My prediction is that 50% of the world’s data will be stored in
Hadoop within 5 years
§ In order to achieve this opportunity, there is work to do:
• Make Hadoop easier to install, use and manage
• Make Hadoop more robust (performance, reliability,
availability, etc.)
• Make Hadoop easier to integrate and extend to enable a
vibrant ecosystem
• Overcome current knowledge gaps
§ Hortonworks’ mission is to enable Apache Hadoop to become the de facto platform and unified distribution for big data
23. Our Roadmap
Phase 1 – Making Apache Hadoop Accessible 2011
• Release the most stable version of Hadoop ever
• Hadoop 0.20.205
• Release directly usable code from Apache
• RPMs & .debs…
• Improve project integration
• HBase support
Phase 2 – Next-Generation Apache Hadoop 2012
• Address key product gaps (HA, management…) – alphas in Q4 2011
• Ambari
• Enable ecosystem innovation via open APIs
• HCatalog, WebHDFS, HBase
• Enable community innovation via modular architecture
• Next Generation MapReduce, HDFS Federation
25. Developer Questions
§ We know we want to integrate Lucene into Hadoop
• How is this best done?
§ Log & merge problems (search indexes & HBase)
• Are there opportunities for Solr and HBase to share?
• Knowledge? Lessons learned? Code?
§ Hadoop is moving closer to online
• Lower latency and fast batch
§ Outsource more indexing work to Hadoop?
• HBase maturing
§ Better crawlers, document processing and serving?
26. Business Questions
§ Users of Hadoop are natural users of Lucene
• How can we help them search all that data?
§ Are users of Solr natural users of Hadoop?
• How can we improve search with Hadoop?
• How many of you use both?
§ What are the opportunities?
• Integration points? New projects? Training?
• Win-Win if communities help each other