2. Berkeley’s AMPLab
2011 – 2016
• Mission: “Make sense of big data”
• 8 faculty, 60+ students
Government and industry funding
Algorithms
Machines
People
3. AMPLab Goal and Impact
Goal: next generation of open source data analytics stack for industry & academia
Berkeley Data Analytics Stack (BDAS)
…
8. Why?
What does this mean?
• Faster decisions better than slower decisions
• Decisions on fresh data better than decisions on stale data
• Decisions on personalized data better than on generic data
Data only as valuable as the decisions it enables
9. Goal
Real-time decisions (decide in ms)
on live data (the current state of the environment)
with strong security (privacy, confidentiality, integrity)
10. Typical decision system
[Diagram: Data → Preprocess (e.g., train) → Intermediate data (e.g., model) → Query engine / Automatic decision engine → Decision. Update latency covers preprocessing; decision latency covers querying.]
Want low update latency & low decision latency
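The split between the two latency paths can be sketched in a few lines of plain Python. This is an illustrative toy, not code from any of the systems mentioned; the class and method names (`DecisionSystem`, `update`, `decide`) are hypothetical, and the "model" is just a per-key event counter standing in for real intermediate data.

```python
# Toy sketch of the two latency paths in a decision system.
from collections import defaultdict

class DecisionSystem:
    def __init__(self):
        # Intermediate data (e.g., a trained model): here, per-key event counts.
        self.model = defaultdict(int)

    def update(self, event):
        # Update path: fold fresh data into the intermediate state.
        # The cost of this path determines the update latency.
        self.model[event["key"]] += 1

    def decide(self, key, threshold=3):
        # Decision path: query the intermediate state.
        # The cost of this path determines the decision latency.
        return "alert" if self.model[key] >= threshold else "ok"

system = DecisionSystem()
for _ in range(4):
    system.update({"key": "host-1"})
print(system.decide("host-1"))  # alert
print(system.decide("host-2"))  # ok
```

The point of the separation is that the two paths can be engineered independently: a slow, thorough update path (e.g., retraining) and a fast decision path that only reads the intermediate state.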
11. Why is it hard?
Want high quality decisions
• Sophisticated, e.g., fraud, forecast, fleet of drones
• Accuracy, low false positives and negatives
• Robust to noisy and unforeseen data
Want low latency for both updates and decisions
Want strong security: privacy, confidentiality, integrity
12. Example: Zero-time defense
Problem: zero-day attacks can compromise
millions of hosts in seconds
Solution: analyze network flows to detect
attacks and patch hosts/software in real-time
• Intermediate data: create attack model
• Decision: detect attack, patch
Quality: sophisticated, accurate, robust
Latency: update (sec) / decision (ms)
Security: privacy (encourage users to share logs), integrity
13. Application requirements

Application | Quality | Latency (update / decision) | Security
Zero-time defense | sophisticated, accurate, robust | sec / ms | privacy, integrity
Parking assistant | sophisticated, robust | sec / sec | privacy
Disease discovery | sophisticated, accurate | hours / sec–min | privacy, integrity
IoT (smart buildings) | sophisticated, robust | min–hour / sec | privacy, integrity
Earthquake warning | sophisticated, accurate, robust | min / ms | integrity
Chip manufacturing | sophisticated, accurate, robust | min / sec–min | confidentiality, integrity
Fraud detection | sophisticated, accurate | min / ms | privacy, integrity
“Fleet” driving | sophisticated, accurate, robust | sec / sec | privacy, integrity
Virtual companion | sophisticated, robust | min–hour / sec | integrity
Video QoS at scale | sophisticated | min / ms–sec | privacy, integrity
14. Challenges
Automated decisions on live data are hard
→ real-time, sophisticated decisions that guarantee worst-case behavior on noisy and unforeseen live data
Poor security: exploits are daily occurrences
→ ensure privacy and integrity without impacting functionality
One-off solutions: expensive, slow to build
→ a general platform: the Secure Real-time Decision Stack
RISE Lab
15. Research directions
Systems: 100x lower latency, 1,000x higher concurrency than today’s Spark
Machine learning: robust, online ML algorithms
Security: achieve privacy, confidentiality, and integrity without impacting performance or functionality
17. Streaming
Micro-batching vs. record-at-a-time
Micro-batching (e.g., Spark) inherits batch’s properties
• fault-tolerance
• straggler mitigation
• optimizations
• unification with other libraries
Record-at-a-time (e.g., Storm, Flink) typically provides lower latency
18. Yahoo’s streaming benchmark
Input: 20M JSON ad-events / second, 100 campaigns
Output: ad counts per campaign over a 10sec window
Latency: (end of window) – (time last event was processed)
SLA: 1sec
Findings: Storm and Flink indeed provide lower latency than Spark
[Diagram: ads → streaming system → ad counts per campaign]
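The computation in the benchmark is simple windowed aggregation. Here is a minimal pure-Python sketch of it (not Spark, Storm, or Flink code; the function name and event format are illustrative): count events per campaign in 10-second tumbling windows.

```python
# Windowed ad counts per campaign, the core of the Yahoo streaming benchmark.
from collections import Counter

WINDOW = 10  # window size in seconds, as in the benchmark

def window_counts(events):
    """events: iterable of (timestamp_sec, campaign_id) pairs.
    Returns counts keyed by (window_start, campaign_id)."""
    counts = Counter()
    for ts, campaign in events:
        window_start = (ts // WINDOW) * WINDOW  # tumbling-window assignment
        counts[(window_start, campaign)] += 1
    return counts

events = [(0, "c1"), (3, "c1"), (9, "c2"), (12, "c1")]
print(dict(window_counts(events)))
# {(0, 'c1'): 2, (0, 'c2'): 1, (10, 'c1'): 1}
```

What the benchmark measures is not this logic but the latency of delivering each window's result after the window closes, under a 20M events/second load.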
23. Drizzle
Goal: reduce Spark streaming latency by at least 10x
Key observation: consecutive iterations use the same DAG
Solution: push scheduling decisions to workers
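A back-of-the-envelope model shows why reusing the DAG helps. This is an assumed simplification of Drizzle's idea, with made-up cost numbers: if the driver schedules each micro-batch centrally, every iteration pays a scheduling round trip; if it schedules a whole group of identical iterations at once, that cost is amortized.

```python
# Toy latency model: centralized per-iteration scheduling vs. group scheduling.

def centralized(iterations, schedule_cost, task_cost):
    # Baseline: one scheduling round trip per iteration.
    return iterations * (schedule_cost + task_cost)

def group_scheduled(iterations, schedule_cost, task_cost, group_size):
    # Drizzle-style: one scheduling round trip per group of iterations,
    # since consecutive iterations run the same DAG.
    groups = -(-iterations // group_size)  # ceiling division
    return groups * schedule_cost + iterations * task_cost

# Hypothetical costs in milliseconds.
base = centralized(100, schedule_cost=9.0, task_cost=1.0)          # 1000.0
drizzle = group_scheduled(100, schedule_cost=9.0, task_cost=1.0,
                          group_size=100)                          # 109.0
print(base / drizzle)  # scheduling overhead amortized across the group
```

When scheduling overhead dominates task time, as in sub-second micro-batches, amortizing it across a group is where the order-of-magnitude latency reduction comes from.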
37. Opaque: two modes
Encryption mode
• Protect against compromised software (e.g., OS)
• Full data encryption, authentication, and computation verification in a hardware enclave
Oblivious mode
• Additionally, hide data access patterns
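To see why hiding access patterns matters (and costs extra), compare a direct lookup with an oblivious one. This is an illustrative sketch, not Opaque's implementation: a direct lookup touches one record and thereby leaks which record was requested, while an oblivious scan touches every record, so the access pattern is independent of the query.

```python
# Direct vs. oblivious lookup: same result, different access patterns.

def direct_lookup(table, key):
    # Touches exactly one entry; memory accesses reveal which record was read.
    return table[key]

def oblivious_lookup(table, key):
    # Touches every entry in a fixed order; accesses look identical for
    # any key, so an observer of the access pattern learns nothing.
    result = None
    for k, v in table.items():
        match = (k == key)
        result = v if match else result  # constant-shape select per entry
    return result

table = {"a": 1, "b": 2, "c": 3}
print(oblivious_lookup(table, "b"))  # 2, same as direct_lookup(table, "b")
```

The linear scan is the source of the oblivious mode's overhead: every operator must do work proportional to the whole input rather than to the records it actually needs.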
39. Opaque: Big Data Benchmark
[Chart: runtime (s, log scale) on Queries 1–3 for SparkSQL, Opaque encryption mode, and Opaque oblivious mode]
Encrypted operators implemented in C++
40. Opaque: Big Data Benchmark
[Chart: same runtimes as the previous slide, annotated]
Up to 100x slower, but 1,000x faster than the state of the art
41. Next AMPLab: RISELab
Already promising results
Expect much more over the next five years!
Goal: develop the Secure Real-time Decision Stack, an open source platform, tools, and algorithms for real-time decisions on live data with strong security
Good morning, and welcome to Spark Summit Europe. It’s fantastic to see so much excitement.
Today, I’m very excited to tell you about our plans at Berkeley following AMPLab.
But first, let me say a few words about AMPLab, as many of you might not be familiar with it.
AMPLab is a project at the University of California, Berkeley that started in 2011 and will end this year. The vision of AMPLab was to make sense of big data by using a holistic approach involving algorithms (in particular, ML algorithms), machines (i.e., systems), and people (i.e., crowdsourcing). Hence AMP, the name of the lab.
The lab was funded both by the government, through NSF and DARPA, and by almost 40 companies, including Amazon, Google, IBM, SAP, and many others.
The goal of the lab was to build next gen….
So why should you care? Well, AMPLab was the original place where Apache Spark, which all of us are using today, was developed. In addition, at AMPLab we developed a bunch of other systems that you might have heard of or might have used, including Apache Mesos and Alluxio, formerly known as Tachyon.
As you know there are basically two types of streaming systems: record-at-a-time and micro-batching.
Spark is an example of a micro-batching system. As such, as Matei emphasized in his talk, it inherits all the desirable properties of the batching model: fault-tolerance, straggler mitigation, optimizations, and unification with other libraries.
In contrast, record-at-a-time systems such as Storm and Flink typically provide lower latency.