High Speed Smart Data Ingest into Hadoop
1. High Speed Smart Data Ingest into Hadoop
Oleg Zhurakousky
Principal Architect @ Hortonworks
Twitter: z_oleg
© Hortonworks Inc. 2012
2. Watch the video with slide
synchronization on InfoQ.com!
http://www.infoq.com/presentations
/real-time-hadoop
InfoQ.com: News & Community Site
• 750,000 unique visitors/month
• Published in 4 languages (English, Chinese, Japanese and Brazilian
Portuguese)
• Post content from our QCon conferences
• News 15-20 / week
• Articles 3-4 / week
• Presentations (videos) 12-15 / week
• Interviews 2-3 / week
• Books 1 / month
3. Presented at QCon New York
www.qconnewyork.com
Purpose of QCon
- to empower software development by facilitating the spread of
knowledge and innovation
Strategy
- practitioner-driven conference designed for YOU: influencers of
change and innovation in your teams
- speakers and topics driving the evolution and innovation
- connecting and catalyzing the influencers and innovators
Highlights
- attended by more than 12,000 delegates since 2007
- held in 9 cities worldwide
4. Agenda
• Domain Problem
– Event Sourcing
– Streaming Sources
– Why Hadoop?
• Architecture
– Compute Mechanics
– Why things are slow – understand the problem
– Mechanics of the ingest
– Write >= Read | Else ;(
– Achieving Mechanical Sympathy
– Suitcase pattern – optimize Storage, I/O and CPU
6. Domain Events
• Domain Events represent the state of entities at a
given time when an important event occurred
• Every state change of interest to the business is
materialized in a Domain Event
• Examples by industry:
– Retail & Web: Point of Sale, Web Logs
– Energy: Metering, Diagnostics
– Telco: Network Data, Device Interactions
– Financial Services: Trades, Market Data
– Healthcare: Clinical Trials, Claims
7. Event Sourcing
• All Events are sent to an Event Processor
• The Event Processor stores all events in an Event Log
• Many different Event Listeners can be added to the
Event Processor (or listen directly on the Event Log)
[Diagram: event sources (Web, Messaging, Batch) feed the Event Processor, which writes the Event Log and fans out to consumers (Search, Analytics, Apps)]
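The flow on this slide can be sketched in a few lines. All names here (`EventProcessor`, `subscribe`, `process`) are hypothetical illustrations of the pattern, not APIs from the talk; the in-memory list stands in for the durable event log.

```python
import time

class EventProcessor:
    """Minimal event-sourcing sketch: store first, then notify listeners."""

    def __init__(self):
        self.event_log = []   # append-only; stands in for files on HDFS
        self.listeners = []   # search, analytics, apps, ...

    def subscribe(self, listener):
        self.listeners.append(listener)

    def process(self, event):
        record = {"ts": time.time(), **event}
        self.event_log.append(record)   # the log is the source of truth
        for listener in self.listeners:
            listener(record)            # fan out to interested consumers

seen = []
proc = EventProcessor()
proc.subscribe(lambda e: seen.append(e["type"]))
proc.process({"type": "page_view", "url": "/home"})
proc.process({"type": "purchase", "sku": "A-42"})
```

Because the log is appended to before listeners run, a failed listener can always be replayed from the log later.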
8. Streaming Sources
• When dealing with Event data (log messages, call
detail records, other event / message type data) there
are some important observations:
– Source data is streamed (not a static copy), thus always in motion
– There is more than one streaming source
– Streaming (by definition) never stops
– Events are rather small in size (up to several hundred bytes), but
there are a lot of them
Hadoop was not designed with this in mind
Telco Use Case
[Diagram: multiple event streams (E) arriving from different geographies]
– Tailing the syslog of a network device
– 200K events per second per streaming source
– Bursts that can double or triple event production
9. Why Hadoop for Event Sourcing?
• Event data is immutable, an event log is append-only
• Event Sourcing is not just event storage, it is event
processing (HDFS + MapReduce)
• While event data is generally small in size, event
volumes are HUGE across the enterprise
• For an Event Sourcing Solution to deliver its benefits
to the enterprise, it must be able to retain all history
• An Event Sourcing solution is ecosystem dependent –
it must be easily integrated into the Enterprise
technology portfolio (management, data, reporting)
11. Mechanics of Data Ingest – which things are slow and/or inefficient, and why?
• Demo
12. Mechanics of a compute
The Athos, Porthos, Aramis and D’Artagnan of the compute
13. Why are things slow?
• I/O | Network bound
• Inefficient use of storage
• CPU is not helping
• Memory is not helping
Imbalanced resource usage leads to limited growth potential and
misleading assumptions about the capability of your hardware
14. Mechanics of the ingest
• Multiple sources of data (readers)
• Write (Ingest) >= Read (Source) | Else
• “Else” is never good
– Slows everything down, or
– Fills the accumulation buffer to capacity, which slows everything
down, or
– Renders part or all of the system unavailable, which slows or brings
everything down
– Etc…
“Else” is just not good!
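The Write >= Read constraint above can be made concrete with a small sketch. The `BoundedBuffer` below is a hypothetical illustration (not code from the talk): a fast source offers events into a bounded buffer while a slower ingest drains it, and once ingest falls behind, the buffer fills and offers start failing — the “Else”.

```python
from collections import deque

class BoundedBuffer:
    """Bounded accumulation buffer between a source (reader) and the ingest (writer)."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.items = deque()
        self.rejected = 0   # events the source could not hand off

    def offer(self, item):
        if len(self.items) >= self.capacity:
            self.rejected += 1   # backpressure: the source must slow down
            return False
        self.items.append(item)
        return True

    def drain(self, n):
        """The ingest side takes up to n events per pass."""
        taken = []
        while self.items and len(taken) < n:
            taken.append(self.items.popleft())
        return taken

buf = BoundedBuffer(capacity=3)
for i in range(5):          # the source produces 5 events this tick...
    buf.offer(i)
written = buf.drain(2)      # ...but the ingest keeps up with only 2
# Write < Read: 2 ingested, buffer at/near capacity, 2 rejected outright
```

In a real system the rejected path would block, spill, or shed load — all of which are the “slows or brings everything down” outcomes listed above.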
15. Strive for Mechanical Sympathy
• Mechanical Sympathy – the ability to write software with a deep
understanding of its impact (or lack of impact) on the hardware
it is running on
• http://mechanical-sympathy.blogspot.com/
• Ensure that all available resources (hardware and
software) work in balance to help achieve the end
goal.
17. The Suitcase Pattern
• Storing event data in uncompressed form is not feasible
– Event Data just keeps on coming, and storage is finite
– The more history we can keep in Hadoop the better our trend
analytics will be
– Not just a Storage problem, but also Processing
– I/O is #1 impact on both Ingest & MapReduce performance
• Suitcase Pattern
– Before we travel, we take our clothes off the
rack and pack them (easier to store)
– We then unpack them when we arrive and put
them back on the rack (easier to process)
– Consider event data “traveling” over the network to Hadoop
– we want to compress it before it makes the trip, but in a way that
facilitates how we intend to process it once it arrives
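A minimal sketch of the Suitcase pattern, using Python's standard `gzip` for the "packing" step (the helper names `pack`/`unpack` are hypothetical, and the talk does not prescribe a specific codec): many small records are compressed into one block before they travel over the network, then expanded back into individual records on arrival.

```python
import gzip

def pack(records):
    """Pack: join a batch of small records and compress them into one payload."""
    block = "\n".join(records).encode("utf-8")
    return gzip.compress(block)

def unpack(payload):
    """Unpack: decompress the payload back into individual records for processing."""
    return gzip.decompress(payload).decode("utf-8").split("\n")

# Many small, highly similar event records -- typical syslog-style data
events = ["ts=1,host=a,code=200"] * 1000
suitcase = pack(events)

# The round trip is lossless, and the compressed block is far smaller
# than the raw records, cutting both network and file I/O
assert unpack(suitcase) == events
raw_size = len("\n".join(events).encode("utf-8"))
```

Because the whole block travels (and lands in Hadoop) as a single record, the downstream job also pays one decompression per block rather than per-record overhead.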
18. What’s next?
• How much of the data that is (or could be) available to us
during the ingest is lost after the ingest?
• How valuable is the data available to us during the
ingest after the ingest completed?
• How expensive would it be to retain (not lose) the
data available during the ingest?
19. Data available during the ingest?
• Record count
• Highest/Lowest record length
• Average record length
• Compression ratio
But with a little more work. . .
• Column parsing
– Unique values
– Unique values per column
– Access to values of each column independently from the record
– Relatively fast column-based searches, without indexing
– Value encoding
– Etc…
Imagine the implications on later data access. . .
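The statistics on this slide are nearly free to gather while records stream past. The sketch below is a hypothetical illustration (it assumes comma-separated columns, which the talk does not specify): one pass accumulates record count, min/max/average length, and per-column unique values.

```python
def ingest_stats(records, sep=","):
    """Gather ingest-time statistics in a single streaming pass."""
    stats = {"count": 0, "min_len": None, "max_len": 0, "total_len": 0}
    uniques = {}   # column index -> set of distinct values seen

    for rec in records:
        stats["count"] += 1
        n = len(rec)
        stats["min_len"] = n if stats["min_len"] is None else min(stats["min_len"], n)
        stats["max_len"] = max(stats["max_len"], n)
        stats["total_len"] += n
        # "a little more work": column parsing during the ingest
        for i, val in enumerate(rec.split(sep)):
            uniques.setdefault(i, set()).add(val)

    stats["avg_len"] = stats["total_len"] / stats["count"] if stats["count"] else 0
    stats["unique_per_column"] = {i: len(v) for i, v in uniques.items()}
    return stats

s = ingest_stats(["a,200,GET", "b,404,GET", "a,200,POST"])
```

Stored alongside the data, such summaries later let queries skip whole blocks (e.g. a column's value set excludes the search term) without any index.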
20. How expensive is it to do that extra work?
• Demo
21. Conclusion - 1
• Hadoop wasn’t optimized to process <1 kB records
– Mappers are event-driven consumers: they do not have a running
thread until a callback thread delivers a message
– Message delivery is an event that triggers the Mapper into action
• “Don’t wake me up unless it is important”
– In Hadoop, generally speaking, a payload of several thousand to
several hundred thousand bytes is deemed important
– Buffering records during collection allows us to collect more data
before waking up the elephant
– Buffering records during collection also allows us to compress the
whole block of records as a single record to be sent over the
network to Hadoop – resulting in lower network and file I/O
– Buffering records during collection also helps us handle bursts
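The buffering idea in the bullets above can be sketched as a size-threshold collector (a hypothetical illustration, not the talk's implementation): tiny events accumulate until the batch reaches a Hadoop-friendly size, and only then is the elephant woken with one large write.

```python
class BatchingCollector:
    """Accumulate small events; flush only when the batch is 'important' enough."""

    def __init__(self, flush_bytes=256 * 1024):   # several hundred KB threshold
        self.flush_bytes = flush_bytes
        self.buffer = []
        self.buffered_bytes = 0
        self.flushes = []   # each entry stands in for one write to Hadoop

    def collect(self, event: bytes):
        self.buffer.append(event)
        self.buffered_bytes += len(event)
        if self.buffered_bytes >= self.flush_bytes:
            self.flush()

    def flush(self):
        if self.buffer:
            # one large record per flush instead of thousands of tiny ones
            self.flushes.append(b"".join(self.buffer))
            self.buffer = []
            self.buffered_bytes = 0

c = BatchingCollector(flush_bytes=1024)
for _ in range(10_000):
    c.collect(b"x" * 100)   # 10,000 tiny 100-byte events
c.flush()                   # final partial batch
```

The same buffer doubles as the burst absorber: a 2–3x spike in event production just fills batches faster rather than overwhelming the writer.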
22. Conclusion - 2
• Understand your SLAs, but think ahead
– While you may have met your SLA for the ingest, your overall
system may still be in an unbalanced state
– Don’t settle for “throw more hardware at it” until you can validate
that you have approached the limits of your existing hardware
– Strive for mechanical sympathy. Profile your software to
understand its impact on the hardware it is running on.
– Don’t think about the ingest in isolation. Think about how the
ingested data is going to be used.
– Analyze the information that is available to you during the ingest
and devise a convenient mechanism for storing it. Your data
analysts will thank you for that.