Jive is using Flume to deliver the content of the social web (250M messages/day) to HDFS and HBase. Flume's flexible architecture allows us to stream data to our production data center as well as Amazon Web Services' data center. We periodically build and merge Lucene indices with Hadoop jobs and deploy them to Katta to provide near-real-time search results. This talk explores our infrastructure and the decisions we've made to handle a fast-growing set of real-time data feeds. We will also explore other uses for Flume throughout Jive, including log collection and our distributed event bus.
Hadoop World 2011: Storing and Indexing Social Media Content in the Hadoop Ecosystem - Lance Riedel & Brent Halsey - Jive Software
1. Storing and Indexing Social Media Content in the Hadoop Ecosystem - Lance Riedel & Brent Halsey, Jive Software
2. Jive: Social Networking for the Enterprise - Engage Employees, Engage Customers, Engage the Social Web, What Matters, Apps
3. Jive Social Media Monitoring Overview - Jive Social Media Engagement stores social media for monitoring (e.g. brand sentiment), searching, and analysis.
16. Flume Overview: The Canonical Use Case - [diagram: an agent tier (a Flume agent on each server) feeds a collector tier, which writes to HDFS]
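The agent/collector topology on this slide can be sketched in the configuration language of the Flume of that era (pre-Flume-NG, circa 2011). This is an illustrative fragment, not Jive's actual config: node names, ports, and paths are assumptions.

```
# Hypothetical Flume (OG) node mappings -- names, ports, and paths are
# illustrative. Each application server runs an agent that tails its
# event log and forwards to a collector with end-to-end guarantees.
agent1 : tail("/var/log/app/events.log") | agentE2ESink("collector1", 35853) ;
agent2 : tail("/var/log/app/events.log") | agentE2ESink("collector1", 35853) ;

# The collector receives agent traffic and rolls files into HDFS.
collector1 : collectorSource(35853) | collectorSink("hdfs://namenode/flume/events/", "raw") ;
```

The end-to-end (E2E) agent sink is what provides the delivery guarantees mentioned in the abstract; Flume OG also offered best-effort and disk-failover modes with weaker guarantees.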
18. Katta - distributed Lucene - [diagram: a Katta master assigns index shards (Index 1, Index 2) to Katta nodes, with replicas across nodes; the shards and Raw.seq live in Hadoop HDFS]
26. Distributed Lucene Indexer Job - [diagram: map tasks read raw-event input blocks from HDFS; each map produces a Lucene index (Index 1 through Index 4)]
27. Distributed Lucene Indexer Job - [diagram: map outputs are shuffled/sorted with key = shard number and value = path to index; reduce tasks merge the indices into Shard 1 and Shard 2]
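The shuffle step on this slide (key = shard number, value = path to a map-built index) can be illustrated with a small sketch. Plain Python stands in for the MapReduce framework here; the shard count and index paths are made up for the example.

```python
# Sketch of the indexer job's shuffle: each map task emits
# (shard_number, path_to_index); the reduce for a shard then merges
# every index path assigned to it. Shard count/paths are illustrative.
from collections import defaultdict

NUM_SHARDS = 2

def map_phase(index_paths):
    """Assign each map-built index to a shard (round-robin here)."""
    for i, path in enumerate(index_paths):
        yield i % NUM_SHARDS, path

def shuffle_sort(pairs):
    """Group index paths by shard number, as the MR framework would."""
    groups = defaultdict(list)
    for shard, path in pairs:
        groups[shard].append(path)
    return groups

def reduce_phase(groups):
    """Each reducer 'merges' the indices assigned to one shard."""
    return {f"shard-{shard}": sorted(paths) for shard, paths in groups.items()}

indices = ["index-1", "index-2", "index-3", "index-4"]
shards = reduce_phase(shuffle_sort(map_phase(indices)))
# shard-0 merges index-1 and index-3; shard-1 merges index-2 and index-4
```

The real job would run a Lucene index merge in the reducer; the point of the sketch is only the shard-number keying that routes four map-side indices into two merged shards.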
28. 5 Minute Index Deployment - [diagram: an incremental indexer job reads Raw.seq]
42.-45. Zoie Flume Sink - [diagram sequence: the Zoie sink accumulates rolling 5-minute index windows (0-5 min, 5-10 min, 10-15 min), served through a Jetty server to the search broker; windows older than 15 minutes are handed off to Katta]
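The rolling windows in slides 42-45 can be sketched as time-bucketed indices. The window and handoff sizes below follow the slide labels (5-minute windows, handoff after 15 minutes); everything else is an illustrative assumption.

```python
# Sketch of the Zoie sink's rolling windows: events land in 5-minute
# index buckets; buckets older than 15 minutes are handed off to Katta
# (modeled here as simply moving to a 'katta' list). Illustrative only.
WINDOW_SECS = 5 * 60        # each in-memory index covers 5 minutes
HANDOFF_AGE_SECS = 15 * 60  # windows at least this old go to Katta

def bucket_for(event_ts):
    """Return the 5-minute window start an event timestamp falls into."""
    return event_ts - (event_ts % WINDOW_SECS)

def split_windows(window_starts, now):
    """Partition windows into locally served vs handed off to Katta."""
    local, katta = [], []
    for start in window_starts:
        (katta if now - start >= HANDOFF_AGE_SECS else local).append(start)
    return local, katta

now = 4 * WINDOW_SECS            # pretend 20 minutes have elapsed
windows = [0, 300, 600, 900]     # four consecutive 5-minute windows
local, katta = split_windows(windows, now)
# the two oldest windows are Katta's; the two newest stay in the sink
```

The search broker then federates queries across both sides, which is what makes the sub-15-minute data searchable before the batch indexer and Katta have caught up.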
46. Real-time Search and Indexing - [diagram: events fan out through the collector to HBase and to Raw.seq in Hadoop HDFS; a job controller triggers the distributed indexer job (5-minute cycle) to build indices (Index 1, Index 2) deployed to Katta; the Zoie Flume sink covers the newest events (10-second latency); a search broker queries Katta and the Zoie sink to return search results]
Editor's Notes
We collect content from Twitter, Facebook, blogs, and news outlets, and allow our users to search this content, monitor it, and analyze it.
A screenshot of the app shows a user's list of monitors and the content matching those monitors. Users can filter by sentiment and by content source. They can engage in social conversations through Twitter and Facebook, and they can create discussions within Jive SBS.
Users can analyze social media trends over time with graph views for sentiment and content sources.
The old system takes data from content sources and puts it on a queue. The queue acts as a buffer for processors that process the content and insert it into a MySQL DB. There is some fault tolerance, with multiple servers connecting to multiple queues, but it required a fair bit of monitoring and manual intervention when problems arose.
This was limited because we threw away most of our content, and pushing the limits of MySQL was painful.
We wanted to store all content (within a limited window), search it, and analyze it.
We chose HBase for random lookup, HDFS for chronological streaming, Katta for distributing Lucene shards, and Hadoop for running MapReduce.
We built a prototype of the new system on Amazon EC2 and needed a way to stream data into these servers. EC2's internal/external IP addressing made it difficult to connect directly to HDFS and HBase; Flume provided this connectivity along with desirable delivery guarantees.
Additionally, Flume can fan the data out, feeding EC2 alongside our production system.
KATTA: For those not familiar with Katta, it is a distributed search engine with two major responsibilities.
The first is distributing indexes from HDFS to any number of Katta nodes. Katta nodes can run across as many machines as you want, it is easy to add more, and Katta will redistribute indexes if nodes fail. Katta has a highly customizable distribution policy: you can round-robin, or have hot/cold topologies where newer indexes are placed on faster machines. Distribution also includes replication of indexes for increased load performance and failover. All of this is managed through ZooKeeper, so it is quite resilient and does a very good job of keeping indexes where ZooKeeper says they should be.
The second responsibility of Katta is to take a single search request, send it to every Katta node, and gather the results.
OVERVIEW OF SEARCH - 30 days of Twitter, Facebook, major news, and blogs: the next few slides show how we tackled searching a moving window of 30 days of Twitter (full firehose), the public Facebook feed, and Spinn3r (which includes all major news and blog sites).
HOW SEARCH IS USED - investigating monitor creation and ad-hoc analytics: search is used to investigate what monitor to create, so searching historical data is of course key. It also allows ad-hoc analytics over recent history: show me sentiment, or raw counts, for an ad-hoc query over the last 30 days.
TRANSITION - OTHER REQUIREMENTS NEED FLEXIBILITY: Other requirements of course pop up, so it was good that we chose Flume, because we could easily add new functionality. One of the key customization areas of Flume is the custom sources, sinks, and decorators you can supply.
SOURCES: Sources allow you to create custom hooks into data providers. A huge list of sources is provided out of the box, from tailing files to Avro HTTP endpoints, where you can send raw events to Flume over HTTP with a Flume event Avro schema.
SINKS: Sinks allow you to create custom places to put the events. Again, there is a slew of out-of-the-box sinks, such as HBase and HDFS.
DECORATORS: Decorators can be placed pretty much anywhere in the topology; they let you inspect each event and add metadata, change the contents, or throw events on the floor.
SOME OF OUR OWN: We want to highlight a few customizations we did (rest on slide).
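The decorator idea described in the notes above (inspect each event, add metadata, change the contents, or drop the event) follows a simple wrapping pattern. This is a minimal language-agnostic sketch in Python, not Flume's actual Java decorator API; all class and field names are illustrative.

```python
# Minimal sketch of a Flume-style sink decorator: it wraps a downstream
# sink and gets a chance to annotate, rewrite, or drop each event
# before passing it along. Names and fields are illustrative.
class ListSink:
    """Stand-in for a real sink (e.g. HDFS or HBase); just collects events."""
    def __init__(self):
        self.events = []

    def append(self, event):
        self.events.append(event)

class SentimentTagDecorator:
    """Decorator: adds a crude 'sentiment' attribute and drops empty events."""
    def __init__(self, downstream):
        self.downstream = downstream

    def append(self, event):
        body = event.get("body", "")
        if not body:
            return  # throw the event on the floor
        # add metadata before handing the event to the wrapped sink
        event["sentiment"] = "positive" if "love" in body else "neutral"
        self.downstream.append(event)

sink = ListSink()
deco = SentimentTagDecorator(sink)
deco.append({"body": "I love this product"})
deco.append({"body": ""})                 # dropped by the decorator
deco.append({"body": "shipping update"})  # tagged neutral
```

Because decorators share the sink interface, they can be chained anywhere in the topology, which is what makes them useful for cross-cutting concerns like tagging, filtering, and rewriting.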