The world is going real-time. MapReduce, SQL-on-Hadoop, and similar batch processing tools are fine for analyzing and processing data after the fact, but sometimes you need to process data continuously as it comes in and react to it within a few seconds or less. How do you do that at Hadoop scale?
Apache Samza is an open source stream processing framework designed to solve these kinds of problems. It is built upon YARN/Hadoop 2.0 and Apache Kafka. You can think of Samza as a real-time, continuously running version of MapReduce.
Samza has some unique features that make it powerful. It provides high performance for stateful processing jobs, including aggregation and joins between many input streams. It is designed to support an ecosystem of many different jobs written by different teams, and it isolates them from each other, so that one badly behaved job can’t affect the others.
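To make "stateful processing" concrete, here is a minimal sketch of a per-key aggregation that handles one message at a time and keeps its running state between messages. It is plain Python and does not use Samza's API; the names `PageViewCounter` and `run` are hypothetical, and the in-memory dictionary only stands in for whatever durable state a real job would keep.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, Tuple


class PageViewCounter:
    """Hypothetical stateful task: counts events per key as they arrive."""

    def __init__(self) -> None:
        # State kept across messages (an in-memory stand-in for a
        # durable store; an assumption made for illustration only).
        self.counts: Dict[str, int] = defaultdict(int)

    def process(self, key: str) -> Tuple[str, int]:
        # Handle a single message and emit the updated aggregate right away.
        self.counts[key] += 1
        return key, self.counts[key]


def run(task: PageViewCounter, stream: Iterable[str]) -> Iterator[Tuple[str, int]]:
    # Unlike a batch job, this consumes input lazily, one message at a time,
    # so it keeps producing results for as long as the stream keeps going.
    for key in stream:
        yield task.process(key)
```

Feeding it the keys `"a"`, `"b"`, `"a"` yields `("a", 1)`, `("b", 1)`, `("a", 2)`: each result is available as soon as its message has been processed, rather than after the whole input has been read.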