5. Spark Model
Write programs in terms of transformations
on distributed datasets
Resilient Distributed Datasets (RDDs)
• Collections of objects that can be stored in
memory or disk across a cluster
• Parallel functional transformations (map,
filter, …)
• Automatically rebuilt on failure
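The transformations above can be illustrated on a plain Scala collection (a sketch of the semantics only; object and variable names are made up here, and a real RDD would partition the data across the cluster and rebuild lost partitions from the lineage of transformations):

```scala
// Sketch: RDD-style functional transformations shown locally.
// filter and map are the same operations the RDD API exposes;
// Spark applies them in parallel to each partition.
object RddSketch {
  def main(args: Array[String]): Unit = {
    val data = Seq("error: disk", "info: ok", "error: net")
    val errors = data.filter(_.startsWith("error"))
                     .map(_.stripPrefix("error: "))
    println(errors.mkString(","))  // disk,net
  }
}
```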
7. 2.1 Goals
1. Scalability to hundreds of nodes
2. Minimal cost beyond base processing
3. Second-scale latency
4. Second-scale recovery from faults and
stragglers
16. 3.3 D-Stream API (1/3)
• Transformations: create a new D-Stream
• pairs = words.map(w => (w, 1))
• counts = pairs.reduceByKey((a, b) => a + b)
Stateless API
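The stateless word-count pipeline can be sketched on local Scala collections (names are illustrative; in Spark Streaming these transformations run per batch interval on each RDD of the D-Stream, with `reduceByKey` merging values per key, shown here as a groupBy plus a sum):

```scala
// Sketch: map + reduceByKey semantics on a local collection.
object WordCountSketch {
  def main(args: Array[String]): Unit = {
    val words = Seq("a", "b", "a")
    val pairs = words.map(w => (w, 1))          // (word, 1) pairs
    val counts = pairs.groupBy(_._1)            // group by word
                      .map { case (k, vs) => (k, vs.map(_._2).sum) }
    println(counts)  // Map(a -> 2, b -> 1)
  }
}
```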
17. 3.3 D-Stream API (2/3)
Stateful API
ex.
pairs.reduceByWindow("5s", (a, b) => a + b)
pairs.reduceByWindow("5s", (a, b) => a + b, (a, b) => a - b)
Incremental aggregation: the second form also takes an inverse function ((a, b) => a - b), so each new window result is computed from the previous one by adding the entering batch and subtracting the batch that slid out, instead of re-reducing the whole window.
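The incremental update step can be sketched as follows (a minimal illustration; the `slide` helper and all values are hypothetical, not part of the Spark API):

```scala
// Sketch: incremental window aggregation with an inverse function.
// Rather than re-summing the whole window each interval, the new
// batch's value is added and the expired batch's value is removed.
object IncrementalWindow {
  def slide(prev: Int, entering: Int, leaving: Int,
            add: (Int, Int) => Int, inv: (Int, Int) => Int): Int =
    inv(add(prev, entering), leaving)

  def main(args: Array[String]): Unit = {
    val add = (a: Int, b: Int) => a + b
    val inv = (a: Int, b: Int) => a - b
    // previous window sum 8; batch worth 4 enters, batch worth 2 leaves
    println(slide(8, 4, 2, add, inv))  // 10
  }
}
```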