1. Making sense of streaming Big Data: Flume – HBase Real-Time Big Data Analytics with SLA – Dani Abel Rayan
2. Who am I? Interned with Cloudera. Flume contributor. HBase user. Work with Karsten Schwan @ GaTech. Joining as "Big Data Engineer" in a lead role to manage exponentially growing data for the makers of League of Legends (a Multiplayer Online Battle Arena), which recently received 400M
3. This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 3.0 Unported License.
4. Why near Real-Time? Activity-stream data is a normal part of any website, used for reporting on usage of the site. Activity data includes things like page views, information about what content was shown, searches, etc. This kind of data is usually handled by logging the activity out to some kind of file and then periodically aggregating these files for analysis. In recent years, however, activity data has become a critical part of the production features of websites, and a slightly more sophisticated set of infrastructure is needed.
5. The Big Picture Why would you want to build this? - Customer retargeting
6. The Big Picture Content serving driven by measuring current audience interests. Product patterns – Twitter streams. S4 is being used for applications such as personalization, user feedback, malicious-traffic detection, and real-time search. Location-based streams – find people matching a specific threshold in near real-time, e.g. real-time shopping/restaurant discounts. So many possibilities!
7. A Million Impressions in a Second … 100 nodes (soon to be 500) in CERCS. Each one can generate 10,000 impressions per second. Specific products are given "known" impression rates and the others are pseudo-random. The challenging task is to ensure that we can bucket the product impressions into the proper column families within an SLA of a few seconds.
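The workload generator above can be sketched roughly as follows. This is a minimal illustrative model, not the actual CERCS harness: the product names and the 50%/25% "known" rates are assumptions made up for the example.

```python
import random

# Sketch of the workload: each node emits ~10,000 impressions per second;
# a few products have fixed ("known") impression rates, and the remainder
# are drawn pseudo-randomly. Rates and product names are illustrative.
KNOWN_RATES = {"product-A": 0.50, "product-B": 0.25}  # assumed example rates
OTHER_PRODUCTS = ["product-C", "product-D", "product-E"]

def generate_impressions(n, rng=random.Random(42)):
    """Return n product impressions honouring the known rates."""
    impressions = []
    for _ in range(n):
        r = rng.random()
        cumulative = 0.0
        for product, rate in KNOWN_RATES.items():
            cumulative += rate
            if r < cumulative:
                impressions.append(product)
                break
        else:  # remaining probability mass: pick a pseudo-random product
            impressions.append(rng.choice(OTHER_PRODUCTS))
    return impressions

batch = generate_impressions(10_000)  # one second's worth from one node
```

Fixing the rates of a few products is what makes the demo checkable: the downstream counts per column family can be compared against the known rates to verify nothing was dropped.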
8. What's the storage? HBase. Currently used for real-time analytics at companies like Facebook (also FB Messaging), Yahoo!, and Twitter. The high-throughput stream of immutable activity data represents a real computational challenge, as the volume may easily be 10x or 100x larger than the next-largest data source on a site. Do we really need to store everything? Nope! HBase has a TTL for column families. Wait! What are "column families"?
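The TTL mentioned above is set per column family (in the HBase shell, as a `TTL => seconds` option on `create`), and cells older than it are removed at compaction time. A minimal sketch of the semantics, in plain Python rather than HBase code, with an assumed one-week retention:

```python
# Sketch (not HBase code) of per-column-family TTL semantics: cells whose
# timestamp is older than the TTL become eligible for removal, so the store
# never accumulates stale activity data. The one-week value is an example.
TTL_SECONDS = 7 * 24 * 3600

def expire_cells(cells, now, ttl=TTL_SECONDS):
    """cells: list of (timestamp, value); keep only those within the TTL."""
    return [(ts, v) for ts, v in cells if now - ts <= ttl]

now = 1_000_000_000
cells = [(now - 10, "fresh"), (now - TTL_SECONDS - 1, "stale")]
kept = expire_cells(cells, now)  # only the fresh cell survives
```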
13. HBase Architecture Why NoSQL? Why HBase? Horizontal scalability. Commodity hardware. Hadoop-based. Only one index possible – the RowKey. Very high write load -> 20 billion events per day (200,000 events per second) with a lag of less than 30 seconds (at Facebook).
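"Only one index – the RowKey" is the key design constraint: HBase keeps rows sorted by rowkey, and the only efficient reads are point gets and range scans over that order, so the access pattern must be encoded into the key itself. A toy sketch of why a `product|timestamp` key (an assumed layout for this workload) makes per-product scans cheap:

```python
from bisect import bisect_left, bisect_right

# Sketch of HBase's single-index model: rows live in lexicographic rowkey
# order, and a "query" is just a range scan over that order. The key layout
# (product id, then zero-padded timestamp) is an assumption for illustration.
rows = sorted([
    "productA|0000001000",
    "productA|0000001005",
    "productB|0000001001",
])

def scan(prefix):
    """Prefix scan: all rowkeys starting with prefix, via one range lookup."""
    lo = bisect_left(rows, prefix)
    hi = bisect_right(rows, prefix + "\xff")
    return rows[lo:hi]

hits = scan("productA|")  # all impressions for productA, in time order
```

Anything not encoded in the rowkey requires a full table scan, which is why the bucketing scheme for product impressions matters so much here.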
14. Flume Flume is a distributed, reliable, and available service for efficiently moving large amounts of data soon after the data is produced. The primary use case for Flume is as a logging system that gathers a set of log files on every machine in a cluster and aggregates them to a centralized persistent store such as the Hadoop Distributed File System (HDFS). The system was designed with these four key goals in mind: Reliability Scalability Manageability Extensibility
15. Where is it used? Mozilla, Shopzilla, AOL, SimpleGeo, Path ….
17. Flume Data Model Flume internally converts every external source of data into a stream of events. Events are Flume's unit of data and are a simple and flexible representation. An event is composed of a body and metadata. The event body is a string of bytes representing the content of an event. For example, a line in a log file is represented as an event whose body is the actual byte representation of that line. The event metadata is a table of key/value pairs that capture some detail about the event, such as the time it was created or the name of the machine on which it originated. This table can be appended to as an event travels along a Flume flow, and it can be read to control the operation of individual components of that flow. For example, the machine name attached to an event can be used to control the output path where the event is written at the end of the flow. An event's body can be up to 32KB long; although this limit can be controlled via a system property, it is recommended that it not be changed, to preserve performance.
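The event model above can be sketched in a few lines. This is an illustrative model of the data structure, not Flume's actual Java classes; the attribute names and output path are made up:

```python
# Sketch of Flume's event model as described above: a byte-string body plus
# a key/value metadata table that components may append to along the flow.
MAX_BODY_BYTES = 32 * 1024  # the 32KB default body limit mentioned above

class Event:
    def __init__(self, body: bytes, metadata=None):
        if len(body) > MAX_BODY_BYTES:
            raise ValueError("event body exceeds the 32KB default limit")
        self.body = body
        self.metadata = dict(metadata or {})

# A log line becomes an event; the source tags it with host and timestamp...
ev = Event(b"GET /product/42 200", {"host": "web-07", "ts": "1318012800"})
# ...and a later component reads/appends metadata to steer output routing.
ev.metadata["output.path"] = "/flume/" + ev.metadata["host"]
```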
18. Flume – HBase connector This is challenging since we need to map Flume's flat key-value pairs onto HBase's multi-dimensional key-value model. Many possible approaches: 1. usage: hbase("table", "rowkey", "cf1", "c1", "val1"[, "cf2", "c2", "val2", ....]{, KW_BUFFER_SIZE=int, KW_USE_WAL=true|false}) 2. usage: attr2hbase("table"[, "sysFamily"[, "writeBody"[, "attrPrefix"[, "writeBufferSize"[, "writeToWal"]]]]]) https://issues.cloudera.org/browse/FLUME-6
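The core of the mapping problem is deriving HBase's (rowkey, column family, qualifier, value) coordinates from a flat event-attribute table. A sketch of one plausible scheme, loosely modelled on attr2hbase's attribute-prefix convention; the exact prefix and attribute names here are assumptions, not the sink's actual contract:

```python
# Hypothetical sketch: attributes carrying a reserved prefix are interpreted
# as HBase coordinates. One attribute names the rowkey; other prefixed
# attributes are "family:qualifier" columns. Unprefixed attributes pass.
ATTR_PREFIX = "2hb_"  # assumed prefix, in the spirit of attr2hbase

def event_to_puts(attributes):
    """Flat attribute table -> list of (rowkey, family, qualifier, value)."""
    rowkey = attributes[ATTR_PREFIX + "ROW_KEY"]
    puts = []
    for key, value in attributes.items():
        if key.startswith(ATTR_PREFIX) and ":" in key:
            family, qualifier = key[len(ATTR_PREFIX):].split(":", 1)
            puts.append((rowkey, family, qualifier, value))
    return puts

puts = event_to_puts({
    "2hb_ROW_KEY": "productA|0000001000",
    "2hb_impressions:count": "1",
    "host": "web-07",  # unprefixed attribute: not written to HBase
})
```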
19. The story so far … for a demo. Deployed Flume agent nodes on ~100 machines. Agents monitor a "specific" log directory on all the machines; any logfile matching the "*.hb" suffix is continuously tailed. Deployed Flume collector nodes on 5 machines. One HBase – pseudo-distributed.
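In Flume 0.9's flow-specification style, the agent/collector topology above might look roughly like the fragment below. The hostnames, port, directory, and the choice of attr2hbase on the collector are all assumptions for illustration; only the shape (tailDir on agents, collectorSource on collectors) follows the deployment described above.

```
# On each of the ~100 agent machines: tail every *.hb file in the log dir
agent-host : tailDir("/var/log/impressions", ".*\.hb") | agentBESink("collector-host", 35853);

# On one of the 5 collectors: receive events and hand them to the HBase sink
collector-host : collectorSource(35853) | attr2hbase("impressions");
```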
21. Demo Single event – choose a machine (7, 9, 10, 11, 12), choose a VM (2, 3, … 20). 1000s of events – let's do it again on multiple machines. Failover-chain demo. Inactive demo.
22. Evaluation Flume is an awesome product, but it isn't perfect: there are instances where certain flows, though they appear "ACTIVE", aren't spewing out any data, and sometimes they go inactive or run slower because of long queues. No SLA guarantees. The same is true of other products like S4 from Yahoo! and Kafka from LinkedIn.
24. SLA Monalytics – combined monitoring and analysis systems used for managing large-scale data-center systems. Due to the scale and complexity of commodity software/hardware in data centers, performance problems in a streaming MapReduce system are inevitable. Solving those problems is extremely hard, and the requirements become much more stringent if you need to meet an SLA: "We will bucket the impressions in less than 2 sec, no matter what."
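The 2-second budget above implies end-to-end measurement: stamp each impression with its creation time at the agent and check the lag when it lands in the store. A minimal sketch of that check; the event IDs and timestamps are made-up example data, and only the 2-second threshold comes from the slide:

```python
# Sketch of an end-to-end SLA check: an event violates the SLA when the gap
# between its creation at the agent and its arrival in HBase exceeds 2s.
SLA_SECONDS = 2.0  # "bucket the impressions in less than 2 sec"

def sla_violations(events, landed_at):
    """events: list of (event_id, created_ts); landed_at: id -> arrival ts."""
    return [eid for eid, created in events
            if landed_at[eid] - created > SLA_SECONDS]

events = [("e1", 100.0), ("e2", 100.0)]
landed = {"e1": 101.2, "e2": 103.5}   # e2 was bucketed 3.5s later
late = sla_violations(events, landed)
```

Note this requires reasonably synchronized clocks between agents and the store, which is itself one of the "moving parts" a Monalytics-style monitor has to account for.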
25. So many moving parts Still needs to be adaptable. Needs to be configurable with new stores. End-to-end monitoring of the SLA. Reliability. Garbage-collection pauses. A good use case for Monalytics. New technologies are really cool to implement, but once you're past the initial honeymoon phase, complexities start surfacing, and one enters "cul-de-sac" mode and falls back on the more traditional methods.
27. Similar systems to Flume S4 from Yahoo!. Kafka from LinkedIn. FlumeBase – an extension supporting SQL-like constructs operating on data streams – another startup.