In-situ MapReduce for Log Processing
1. In-situ MapReduce for Log Processing
Speaker: LIN Qian
http://www.comp.nus.edu.sg/~linqian
2. Log analytics
• Data centers with 1000s of servers
• Data-intensive computing: store and analyze TBs of logs
Examples:
• Click logs
– ad-targeting, personalization
• Social media feeds
– brand monitoring
• Purchase logs
– fraud detection
• System logs
– anomaly detection, debugging
3. Log analytics today
• “Store-first, query-later”: servers store logs first, then a dedicated MapReduce cluster queries them later
Problems:
• Scale
– Stresses network and disks
• Failures
– Delay analysis or process incomplete data
• Timeliness
– Hinders real-time apps
4. In-situ MapReduce (iMR)
Idea:
• Move analysis to the servers instead of a dedicated cluster
• MapReduce for continuous data
• Ability to trade fidelity for latency
Optimized for:
• Highly selective workloads
– e.g., up to 80% of the data filtered or summarized
• Online analytics
– e.g., ad re-targeting based on the most recent clicks
5. An iMR query
The same:
• MapReduce API
– map(r) → {k,v} : extract/filter data
– reduce({k,v[]}) → v′ : data aggregation
– combine({k,v[]}) → v′ : early, partial aggregation
The new:
• Provides continuous results
– because logs are continuous
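The map/combine/reduce API above can be sketched in Python. This is a hedged illustration, not the iMR prototype's actual code: the function names, the log-record format, and the single-process driver are all assumptions. It counts HTTP status codes in web-server log lines.

```python
from collections import defaultdict

def map_fn(record):
    """map(r) -> {k,v}: extract/filter data from one log record."""
    status = record.split()[-1]      # assume the status code is the last field
    return [(status, 1)]             # emit key-value pairs

def combine_fn(key, values):
    """combine({k,v[]}) -> v': early, partial aggregation near the source."""
    return sum(values)

def reduce_fn(key, values):
    """reduce({k,v[]}) -> v': final aggregation at the root."""
    return sum(values)

def run_query(records):
    """Toy single-process driver: map, then combine, then reduce."""
    grouped = defaultdict(list)
    for r in records:
        for k, v in map_fn(r):
            grouped[k].append(v)
    combined = {k: combine_fn(k, vs) for k, vs in grouped.items()}
    return {k: reduce_fn(k, [v]) for k, v in combined.items()}

logs = ["GET /a 200", "GET /b 404", "GET /c 200"]
print(run_query(logs))   # {'200': 2, '404': 1}
```

Because combine and reduce both just sum partial counts, the combiner can run early on each server without changing the final answer, which is what makes in-network aggregation possible.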
6. Continuous MapReduce
• Input
– An infinite stream of log entries
• Bound input with sliding windows
– Range of data (R)
– Update frequency (S)
• Output
– Stream of results, one for each window
(Figure: a timeline of log entries at 0″, 30″, 60″, 90″ flowing through Map, Combine, and Reduce.)
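The sliding-window bound can be sketched as follows. This is an illustrative assumption, not iMR code: timestamps in seconds, a finite event list standing in for the infinite stream, and windows of range R that advance every S seconds.

```python
def windows(events, R, S):
    """events: time-ordered (timestamp, value) pairs.
    Yields (window_start, [values]) for each window [t, t + R)."""
    events = list(events)
    if not events:
        return
    t = events[0][0]
    end = events[-1][0]
    while t <= end:
        yield t, [v for ts, v in events if t <= ts < t + R]
        t += S                      # windows overlap whenever S < R

stream = [(0, 'a'), (10, 'b'), (35, 'c'), (50, 'd'), (65, 'e')]
for start, vals in windows(stream, R=60, S=30):
    print(start, vals)
# 0 ['a', 'b', 'c', 'd']
# 30 ['c', 'd', 'e']
# 60 ['e']
```

Note how entries such as 'c' and 'd' fall into two consecutive windows; the next slides address this overlap.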
7. Processing windows in-network
• Overlapping data is shared between consecutive windows
• User’s reduce function runs at the root
• Aggregation tree for efficiency
(Figure: Map and Combine run in-network over the timeline 0″–90″; combined results feed Reduce.)
8. Efficient processing with panes
• Divide window into panes (sub-windows) P1–P5
– Each pane is processed and sent only once
– Root combines panes to produce the window
• Eliminate redundant work
– Save CPU & network resources, faster analysis
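Pane-based processing can be illustrated with a small sketch (assumptions, not the prototype's code: counting entries per pane, a dict of pane partials standing in for the aggregation tree). Each pane is aggregated exactly once; the root then sums pane partials into each window's result, so overlapping windows reuse work instead of repeating it.

```python
def pane_partials(events, pane_len):
    """Aggregate each pane [p, p + pane_len) exactly once -> {pane_start: count}."""
    partials = {}
    for ts, _ in events:
        p = (ts // pane_len) * pane_len     # pane this timestamp falls into
        partials[p] = partials.get(p, 0) + 1
    return partials

def window_results(partials, R, S, pane_len, end):
    """Root combines pane partials into a result for each window [t, t + R)."""
    results = {}
    t = 0
    while t <= end:
        panes = range(t, t + R, pane_len)
        results[t] = sum(partials.get(p, 0) for p in panes)
        t += S
    return results

events = [(0, 'a'), (10, 'b'), (35, 'c'), (50, 'd'), (65, 'e')]
partials = pane_partials(events, pane_len=15)
print(window_results(partials, R=60, S=30, pane_len=15, end=60))
# {0: 4, 30: 3, 60: 1}
```

The per-window counts match what a naive re-scan of the stream would produce, but each event is touched only once during pane aggregation.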
9. Impact of data loss on analysis
• Servers may get overloaded or fail, so panes can be lost
Challenges:
• Characterize incomplete results
• Allow users to trade fidelity for latency
10. Quantifying data fidelity
• Data are naturally distributed
– Space (server nodes)
– Time (processing window)
• C2 metric
– Annotates result windows with a “scoreboard”
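One plausible shape for such a scoreboard is a boolean grid over (node, pane) cells; this sketch is an assumed structure for illustration, not the paper's definition. Fidelity is then the fraction of cells whose data made it into the result window.

```python
nodes = ['s1', 's2', 's3']
panes = ['P1', 'P2', 'P3', 'P4']

# True = that node's data for that pane was included in the result.
scoreboard = {(n, p): True for n in nodes for p in panes}
scoreboard[('s2', 'P3')] = False    # assume s2 failed while processing P3
scoreboard[('s2', 'P4')] = False

def fidelity(board):
    """Fraction of (node, pane) cells present in the result window."""
    return sum(board.values()) / len(board)

print(f"fidelity = {fidelity(scoreboard):.0%}")   # fidelity = 83%
```

Unlike a single percentage, the grid also records *where* the loss occurred (which nodes, which panes), which is what the C2 specifications in the next slides exploit.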
11. Trading fidelity for latency
• Use C2 to trade fidelity for latency
– Maximum latency requirement
– Minimum fidelity requirement
• Different ways to meet minimum fidelity
– 4 useful classes of C2 specifications
12. Minimizing result latency
• Minimum fidelity with earliest results
• Gives the most freedom to decrease latency
– Return the earliest data available
• Appropriate for uniformly distributed events
13. Sampling non-uniform events
• Minimum fidelity with random sampling
• Less freedom to decrease latency
– Included data may not be the first available
• Appropriate even for non-uniform data
14. Correlating events across time and space
Leverage knowledge about data distribution:
• Temporal completeness
– Include all data from a node or no data at all
• Spatial completeness
– Each pane contains data from all nodes
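The two completeness classes can be expressed as predicates over a node-by-pane scoreboard like the one sketched earlier. These predicate forms are assumptions for illustration, not code from the system.

```python
def temporally_complete(board, nodes, panes):
    """Each node contributes all of its panes or none of them."""
    return all(
        all(board[(n, p)] for p in panes) or not any(board[(n, p)] for p in panes)
        for n in nodes
    )

def spatially_complete(board, nodes, panes):
    """Every pane that appears contains data from all nodes."""
    return all(
        all(board[(n, p)] for n in nodes) or not any(board[(n, p)] for n in nodes)
        for p in panes
    )

nodes, panes = ['s1', 's2'], ['P1', 'P2']
# s2's data lost entirely: temporally complete, but not spatially complete.
board = {('s1', 'P1'): True, ('s1', 'P2'): True,
         ('s2', 'P1'): False, ('s2', 'P2'): False}
print(temporally_complete(board, nodes, panes))   # True
print(spatially_complete(board, nodes, panes))    # False
```

The example shows why the distinction matters: losing one node wholesale keeps per-node statistics intact (temporal completeness) while breaking any analysis that needs every node's view of each pane (spatial completeness).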
15. Prototype
• Built upon Mortar
– Sliding windows
– In-network aggregation trees
• Extended to support:
– MapReduce API
– Pane-based processing
– Fault tolerance mechanisms
16. Processing data in-situ
• Useful when ...
• Goal: use available resources intelligently
• Load shedding mechanism
– Nodes monitor local processing rate
– Shed panes that cannot be processed on time
• Increases result fidelity under time and resource constraints
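The shedding decision can be sketched as a simple budget check; this is a minimal illustration under assumed inputs (per-pane cost estimates and a single latency budget), not the prototype's actual policy.

```python
def schedule_panes(pane_costs, deadline):
    """pane_costs: estimated processing time per pane, in stream order.
    Returns (processed, shed) pane indices given a total time budget."""
    processed, shed, elapsed = [], [], 0.0
    for i, cost in enumerate(pane_costs):
        if elapsed + cost <= deadline:
            processed.append(i)
            elapsed += cost
        else:
            shed.append(i)   # drop this pane: lower fidelity, on-time result
    return processed, shed

print(schedule_panes([1.0, 2.0, 4.0, 1.0, 1.0], deadline=5.0))
# ([0, 1, 3, 4], [2])
```

Shedding the expensive pane (index 2) lets the node deliver four of five panes on time instead of delivering all five late, which is exactly the fidelity-for-latency trade the C2 specifications make explicit.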
17. Evaluation
• System scalability
• Usefulness of the C2 metric
– Understanding incomplete results
– Trading fidelity for latency
• Processing data in-situ
– Improving fidelity under load with load shedding
– Minimizing impact on services
18. Scaling
• Synthetic input data, word-count reducer
• 3 reducers provide sufficient processing capacity to handle the 30 map tasks
19. Exploring fidelity-latency trade-offs
• Data loss affects the accuracy of the computed distribution
– Compared: temporal completeness vs. spatial completeness and random sampling
– (Figure callouts: 100% accuracy; >25% decrease)
• C2 allows trading fidelity for lower latency
20. In-situ performance
• iMR runs side-by-side with a real service (Hadoop)
• Vary CPU allocated to iMR; measure:
– Result fidelity
– Hadoop performance (job throughput)
– (Figure callouts: 560%; <11% overhead)
21. Conclusion
• In-situ architecture processes logs at the sources, avoids bulk data transfers, and reduces analysis time
• Model allows incomplete data under failures or server load, providing timely analysis
• C2 metric helps understand incomplete data and trade fidelity for latency
• Pro-actively sheds load, improving data fidelity under resource and time constraints