Contenu connexe Similaire à Twitter Storm: Ereignisverarbeitung in Echtzeit (20) Plus de Guido Schmutz (20) Twitter Storm: Ereignisverarbeitung in Echtzeit1. 2013 © Trivadis
BASEL BERN LAUSANNE ZÜRICH DÜSSELDORF FRANKFURT A.M. FREIBURG I.BR. HAMBURG MÜNCHEN STUTTGART WIEN
WELCOME Twitter Storm:
Ereignisverarbeitung in
Echtzeit
Guido Schmutz
Java Forum Stuttgart 2013
4.7.2013
4.7.2013
Twitter Storm: Ereignisverarbeitung in Echtzeit
1
2. 2013 © Trivadis
Trivadis ist führend bei der IT-Beratung, der Systemintegration,
dem Solution-Engineering und der Erbringung von IT-Services
mit Fokussierung auf und Technologien
im D-A-CH-Raum.
Unsere Leistungen erbringen wir aus den strategischen Geschäftsfeldern:
Trivadis Services übernimmt den korrespondierenden Betrieb
Ihrer IT Systeme.
Unser Unternehmen
4.7.2013
Twitter Storm: Ereignisverarbeitung in Echtzeit
B E T R I E B
2
3. 2013 © Trivadis
Mit über 600 IT- und Fachexperten bei Ihnen vor Ort
3
11 Trivadis Niederlassungen mit
über 600 Mitarbeitenden
200 Service Level Agreements
Mehr als 4'000 Trainingsteilnehmer
Forschungs- und Entwicklungs-
budget: CHF 5.0 / EUR 4 Mio.
Finanziell unabhängig und
nachhaltig profitabel
Erfahrung aus mehr als 1'900
Projekten pro Jahr bei über 800
Kunden
Stand 12/2012
Hamburg
Düsseldorf
Frankfurt
Freiburg
München
Wien
Basel
ZürichBern
Lausanne
3
Stuttgart
4.7.2013
Twitter Storm: Ereignisverarbeitung in Echtzeit
3
4. 2013 © Trivadis
Guido Schmutz
• Working for Trivadis for more than 16 years
• Oracle ACE Director for Fusion Middleware and SOA
• Co-Author of different books
• Consultant, Trainer Software Architect for Java, Oracle, SOA,
EDA, BigData und FastData
• Member of Trivadis Architecture Board
• Technology Manager @ Trivadis
• More than 20 years of software development
experience
• Contact: guido.schmutz@trivadis.com
• Blog: http://guidoschmutz.wordpress.com
• Twitter: gschmutz
4.7.2013
4
Twitter Storm: Ereignisverarbeitung in Echtzeit
5. 2013 © Trivadis
Big Data Definition (Gartner et al)
14.02.2013
Big Data 4 Sales
5
Velocity
Tera-, Peta-, Exa-, Zetta-, Yota- bytes and constantly growing
“Traditional” computing in RDBMS
is not scalable enough.
We search for “linear scalability”
“Only … structured information
is not enough” – “95% of produced data in
unstructured”
Characteristics of Big Data: Its
Volume, Velocity and Variety in
combination
+ Veracity (IBM) - information uncertainty
+ Time to action ? – Big Data + Event Processing = Fast Data
7. 2013 © Trivadis
Velocity
24. April 2013
Big Data und Fast Data
7
§ Velocity requirement examples:
§ Recommendation Engine
§ Predictive Analytics
§ Marketing Campaign Analysis
§ Customer Retention and Churn Analysis
§ Social Graph Analysis
§ Capital Markets Analysis
§ Risk Management
§ Rogue Trading
§ Fraud Detection
§ Retail Banking
§ Network Monitoring
§ Research and Development
8. 2013 © Trivadis
Agenda
1. What is Twitter Storm
2. Main Concepts
3. Trident
4. BigData and FastData combined – Lambda Architecture
5. Summary
4.7.2013
Twitter Storm: Ereignisverarbeitung in Echtzeit
8
9. 2013 © Trivadis
What is Twitter Storm ?
A platform for doing analysis on streams of data as they come in, so you
can react to data as it happens.
• A highly distributed real-time computation system
• Provides general primitives to do real-time computation
• To simplify working with queues & workers
• scalable and fault-tolerant
• complementary to Hadoop
Originated at Backtype, acquired by Twitter in 2011
Open Sourced late 2011
4.7.2013
Twitter Storm: Ereignisverarbeitung in Echtzeit
9
10. 2013 © Trivadis
Hadoop vs. Storm
4.7.2013
Twitter Storm: Ereignisverarbeitung in Echtzeit
10
Storm
Real-time processing
Topologies run forever
Stateless nodes
Scalable
Guarantees no data loss
Open Source
=> Fast, reactive, real time procesing
Hadoop
Batch processing
Jobs runs to completion
Stateful nodes
Scalable
Guarantees no data loss
Open Source
=> big batch processing
=
!=
11. 2013 © Trivadis
Idea: Simplify dealing with queue & workers
4.7.2013
Twitter Storm: Ereignisverarbeitung in Echtzeit
11
Scaling is painful – queue partitioning & worker deploy
Operational overhead – worker failures & queue backups
No guarantees on data processing
12. 2013 © Trivadis
What does Storm provide?
§ At least once message processing
§ Fault-tolerant
§ Horizontal scalability
§ No intermediate queues
§ Less operational overhead
§ „Just works“
§ Broad set of use cases
§ Stream processing
§ Continous computation
§ Distributed RPC
4.7.2013
Twitter Storm: Ereignisverarbeitung in Echtzeit
12
13. 2013 © Trivadis
Agenda
1. What is Twitter Storm
2. Main Concepts
3. Trident
4. BigData and FastData combined – Lambda Architecture
5. Summary
4.7.2013
Twitter Storm: Ereignisverarbeitung in Echtzeit
13
14. 2013 © Trivadis
Main Concepts – Shown based on Twitter example
Show the tweets related to Java Forum (use hashtag #JFS2013) and show
the count of the hashtags used
4.7.2013
Twitter Storm: Ereignisverarbeitung in Echtzeit
14
not showing #JFS2013
15. 2013 © Trivadis
Main Concepts - Streams
Stream
• an unbounded sequence of tuples
• core abstraction in Storm
• Defined with a schema that names the
fields in the tuple
• Value must be serializable
• Every stream has an id
Topology
• graph where each node is a spout or bolt
• edges indicating which bolt subscribes
to which streams
4.7.2013
Twitter Storm: Ereignisverarbeitung in Echtzeit
15
Source of
stream A
Source of
stream B
Subscribes: A
Emits: C
Subscribes: A
Emits: D
Subscribes: A & B
Emits: -
Subscribes: C & D
Emits: -
T T T T T T T T
16. 2013 © Trivadis
Worker
• executes a subset of a topology, may run one or more executors for one or
more components
Executor
• thread spawned by worker
Task
• performs the actual data processing
Main Concepts – Worker, Executor, Task
4.7.2013
Twitter Storm: Ereignisverarbeitung in Echtzeit
16
Worker Process
Task
Task
Task
Task
17. 2013 © Trivadis
Main Concepts - Spout
A spout is the component which acts as the source of streams in a
topology
Generally spouts read tuples from an external source and emit them into
the topology
Spouts can emit more than one stream
Step 1: Use a Spout to subscribe to the Twitter Filter Streaming API to get
all the tweets containing the hashtag #JFS2013
4.7.2013
Twitter Storm: Ereignisverarbeitung in Echtzeit
17
Twitter
Stream
Tweets tweet
Tweet
Stream
18. 2013 © Trivadis
Main Concepts - Spout
4.7.2013
Twitter Storm: Ereignisverarbeitung in Echtzeit
18
Twitter
Stream
Tweets tweet
Tweet
Stream
By calling this method, Storm requests that the
Spout emits tuples to the output collector
Called when a task for this
component is initialized
Used to declare streams and their schmeas. it is possible to declare
serveral streams and specifying the one to use when emiting tuples
19. 2013 © Trivadis
Bolt is doing the processing of the message
• filtering, functions, aggregations, joins, talking to databases….
• Can do simple stream transformations
• Complex transformations often require multiple steps and therefore multiple
bolts
Bolts can emit one or more streams
Step 2: Use a bolt to retrieve the hashtags for each tweet
Main Concepts - Bolt
4.7.2013
Twitter Storm: Ereignisverarbeitung in Echtzeit
19
Twitter
Stream
Tweets tweet
Hashtag
Splitter
Tweet
Stream
hashtag
20. 2013 © Trivadis
Main Concepts - Bolt
4.7.2013
Twitter Storm: Ereignisverarbeitung in Echtzeit
20
Twitter
Stream
Tweets tweet
Hashtag
Splitter
Tweet
Stream
hashtag
Execute is called for each tuple
arriving on the input stream
Declares the output fields for the component and
is called when bolt is created
21. 2013 © Trivadis
Main Concepts - Bolt
Step 3: Use a bolt to count the occurences of the hashtag in Redis NoSQL
4.7.2013
Twitter Storm: Ereignisverarbeitung in Echtzeit
21
Twitter
Stream
Tweets tweet
Hashtag
Splitter
Tweet
Stream
hashtag Hashtag
Counter
22. 2013 © Trivadis
Main Concepts - Bolt
4.7.2013
Twitter Storm: Ereignisverarbeitung in Echtzeit
22
Twitter
Stream
Tweets tweet
Hashtag
Splitter
Tweet
Stream
hashtag Hashtag
Counter
Prepare is called when bolt is
created
Execute is called for each tuple
arriving on the input stream
23. 2013 © Trivadis
Main Concepts - Topology
Step 4: Setup the topology with the necessary groupings (distribution)
Each Spout or Bolt are running N instances in parallel
4.7.2013
Twitter Storm: Ereignisverarbeitung in Echtzeit
23
Twitter
Stream
Tweets
Hashtag
Splitter
Tweet
Stream
Hashtag
Counter
Hashtag
Splitter
Hashtag
Counter
Shuffle Fields
Shuffle grouping is random grouping
Fields grouping is grouped by value, such that equal value results in equal task
All grouping replicates to all tasks
Global grouping makes all tuples go to one task
None grouping makes bolt run in the same thread as bold/spout it subscribes to
Direct grouping producer (task that emits) controls which consumer will receive
24. 2013 © Trivadis
Main Concepts - Topology
4.7.2013
Twitter Storm: Ereignisverarbeitung in Echtzeit
24
Create Stream called „tweet-stream“
Run this bolt by 2 workers
Run this spout
by 1 worker
Subscribe to stream „tweet-stream“
using shuffle grouping
Create stream called „hashtag-splitter“
Subscribe to stream „hashtag-splitter“
using fields grouping on field „hashtag“
25. 2013 © Trivadis
Sample – How does it work ?
4.7.2013
Twitter Storm: Ereignisverarbeitung in Echtzeit
25
Twitter
Stream
Mein Präsentation
„Twitter Storm:
Ereignisverarbeitung
in Echtzeit“ Morgen
um 15:35 am Java
Forum Stuttgart
#JFS2013 #storm
#fastdata CU there!
Hashtag
Splitter
Tweet
Stream
Hashtag
Counter
Hashtag
Splitter
Hashtag
Counter
26. 2013 © Trivadis
Sample – How does it work ?
4.7.2013
Twitter Storm: Ereignisverarbeitung in Echtzeit
26
Twitter
Stream
Mein Präsentation
„Twitter Storm:
Ereignisverarbeitung
in Echtzeit“ Morgen
um 15:35 am Java
Forum Stuttgart
#JFS2013 #storm
#fastdata CU there!
Hashtag
Splitter
Tweet
Stream
Hashtag
Counter
Hashtag
Splitter
Hashtag
Counter
storm
fastdata
jsf2013
27. 2013 © Trivadis
Sample – How does it work ?
4.7.2013
Twitter Storm: Ereignisverarbeitung in Echtzeit
27
Twitter
Stream
Mein Präsentation
„Twitter Storm:
Ereignisverarbeitung
in Echtzeit“ Morgen
um 15:35 am Java
Forum Stuttgart
#JFS2013 #storm
#fastdata CU there!
Hashtag
Splitter
Tweet
Stream
Hashtag
Counter
Hashtag
Splitter
Hashtag
Counter
storm
fastdata
jfs2013
INCR
jfs2013
INCR
storm
INCR
fastdata storm = 1
fastdata =1
jfs2013= 1
28. 2013 © Trivadis
Agenda
1. What is Twitter Storm
2. Main Concepts
3. Trident
4. BigData and FastData combined – Lambda Architecture
5. Summary
4.7.2013
Twitter Storm: Ereignisverarbeitung in Echtzeit
28
29. 2013 © Trivadis
What is Trident?
• High-Level abstraction on top of storm
• Makes it easier to build topologies
• Core data model is the stream
• Processed as a series of batches
• Stream is partitioned among nodes in cluster
• 5 kinds of operations in Trident
• Operations that apply locally to each partition and cause no network
transfer
• Repartitioning operations that don‘t change the contents
• Aggregation operations that do network transfer
• Operations on grouped streams
• Merges and Joins
4.7.2013
Twitter Storm: Ereignisverarbeitung in Echtzeit
29
30. 2013 © Trivadis
What is Trident?
Supports
• Joins
• Aggregations
• Grouping
• Functions
• Filters
Similar to Hadoop and Pig/Cascading
4.7.2013
Twitter Storm: Ereignisverarbeitung in Echtzeit
30
31. 2013 © Trivadis
Trident Concepts - Topology
4.7.2013
Twitter Storm: Ereignisverarbeitung in Echtzeit
31
Twitter
Stream
Tweets tweet
Hashtag
Splitter
Tweet
Stream
hashtag Hashtag
Normalizer
Persistent
Aggregate
hashtag
groupBylocal
Bolt Bolt
32. 2013 © Trivadis
Trident Concepts - Function
• takes in a set of input fields and emits zero or more tuples as output
• fields of the output tuple are appended to the original input tuple in the
stream
• If a function emits no tuples, the original input tuple is filtered out
• Otherwise the input tuple is duplicated for each output tuple
4.7.2013
Twitter Storm: Ereignisverarbeitung in Echtzeit
32
33. 2013 © Trivadis
Agenda
1. What is Twitter Storm
2. Main Concepts
3. Trident
4. BigData and FastData combined – Lambda Architecture
5. Summary
4.7.2013
Twitter Storm: Ereignisverarbeitung in Echtzeit
33
34. 2013 © Trivadis
Lambda Architecture
4.7.2013
Twitter Storm: Ereignisverarbeitung in Echtzeit
34
Immutable
data
Batch
View
Query
Data
Stream
Realtime
View
Incoming
Data
Serving Layer
Speed Layer
Batch Layer
A
B
C D
E
F
G
35. 2013 © Trivadis
Lambda Architecture
4.7.2013
Twitter Storm: Ereignisverarbeitung in Echtzeit
35
Speed Layer
Precompute
Views
query
Source: Marz, N. & Warren, J. (2013) Big Data. Manning.
Batch Layer
Precomputed
information
All data
Incremented
information
Process stream
Incoming
Data
Batch
recompute
Realtime
increment
Serving Layer
batch view
batch view
real time view
real time view
Merge
36. 2013 © Trivadis
Lambda Architecture
24. April 2013
Big Data und Fast Data
36
one possible product/framework mapping
Speed Layer
Precompute
Views
query
Batch Layer
Precomputed
information
All data
Incremented
information
Process stream
Incoming
Data
Batch
recompute
Realtime
increment
Serving Layer
batch view
batch view
real time view
real time view
Merge
37. 2013 © Trivadis
Agenda
1. What is Twitter Storm
2. Main Concepts
3. Trident
4. BigData and FastData combined – Lambda Architecture
5. Summary
4.7.2013
Twitter Storm: Ereignisverarbeitung in Echtzeit
37
38. 2013 © Trivadis
Summary
Express real-time processing naturally
Not a Hadoop/MapReduce competitor
Supports other languages
Hard to debug
Object Serialization
Didn‘t cover
§ Reliability through guaranteed message processing
§ Distributed RPC
§ Storm cluster setup and deploy
4.7.2013
Twitter Storm: Ereignisverarbeitung in Echtzeit
38
39. 2013 © Trivadis
Resources
Website (http://storm-project.net/)
Wiki (https://github.com/nathanmarz/storm/wiki)
Doc (https://github.com/nathanmarz/storm/wiki/Documentation)
Storm-starter project (https://github.com/nathanmarz/storm-starter)
Mailing list (http://groups.google.com/group/storm-user)
4.7.2013
Twitter Storm: Ereignisverarbeitung in Echtzeit
39
40. 2013 © Trivadis
BASEL BERN LAUSANNE ZÜRICH DÜSSELDORF FRANKFURT A.M. FREIBURG I.BR. HAMBURG MÜNCHEN STUTTGART WIEN
Thank You!
Trivadis AG
Guido Schmutz
guido.schmutz@trivadis.com
4.7.2013
Twitter Storm: Ereignisverarbeitung in Echtzeit
40