2. Who am I? @postwait on twitter
Author of “Scalable Internet Architectures”
Pearson, ISBN: 067232699X
Contributor to “Web Operations”
O’Reilly, ISBN: 1449377440
Founder of OmniTI, Message Systems, Fontdeck, & Circonus
I like to tackle problems that are “always on” and “always growing.”
I am an Engineer
A practitioner of academic computing.
IEEE member and Senior ACM member.
On the Editorial Board of ACM’s Queue magazine.
On the ACM professions board.
3. What is BigData?
• Few agree.
• I say it is any data-related problem that
can’t be solved (well) on one machine.
• Never use a distributed system to solve a problem
that can be easily solved on a single system:
• performance
• simplicity
• debuggability
4. Framing the data problem
• events... to make it web-related, let’s say it is web activity
• for every user action, we have an event
• an event is composed of about 20-30 known attributes
(say ~400 bytes)
• url, referrer, site category,
• ip address, ASN, geo location info,
• user-perceived performance info (like load time)
5. Framing the volume problem
• We see 100 of these per second on a site
• Easy problem (more or less)
• We run SaaS, so we need to support 2000 customers:
• 200,000 events/second
(or 30x = 6,000,000 column appends/second)
6. What do we want?
• I want answers, dammit
• I would like to know what is slow (or fast) by
• ASN
• geo location
• browser type
• I’d also like to know given an event:
• is it outside the average ± 2σ
• over the last 5 minutes
8. What else do we want?
• I want answers now, dammit
defined: not later
9. What is real-time?
• The correctness of the answer depends on both the logical
correctness of the result and temporal proximity of the result and
the question.
• hard real-time: old answers are worthless.
• soft real-time: old answers are worth less.
10. Real-time on the Internet
• Hard real-time systems on the Internet;
this sort of thing ain’t my bag, baby!
• Someone is just going to get hurt.
11. Soft real-time?
• We need soft real-time systems any time we are going to react to a user.
• If the answer is either wrong or late, it is less relevant to them.
• The problems we look at have temporal constraints ranging from
5 seconds (counters and statistics) to
1 second (fraud detection) to
10 milliseconds (user-action reaction) and
everywhere in between.
12. Enter CEP
• Complex Event Processing...
• Queries always running.
• Tuples introduced.
• Tuples emitted.
• EsperTech’s Esper is my hero.
14. More concretely
• node.js listens for web requests and submits data to Esper via AMQP
• Esper runs “magic”
• The output of that magic is pushed back via AMQP
• node.js listens and returns the data over JSONP.
20. First steps for simplicity
• I want to create a view on 30 minutes of data for a specific client and
populate that view with those “hit” events:
create window fl9875309_hit30m.win:time(30 minute) as hit
insert into fl9875309_hit30m select * from hit(_ls_part='fl9875309')
• Some useful thoughts:
• data flowing into this window: “istream”
• data also flowing out of this window (after 30 minutes): “rstream”
• if you are interested in both streams, we call it: “irstream”
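The window/istream/rstream idea can be modeled outside Esper; here is a minimal Python sketch (class and field names are my own, not Esper’s) of a sliding time window that reports what enters and what expires on each insert:

```python
import time
from collections import deque

class TimeWindow:
    """Toy sliding time window: insert() returns (istream, rstream)."""

    def __init__(self, length_seconds):
        self.length = length_seconds
        self.events = deque()  # (timestamp, event) pairs, oldest first

    def insert(self, event, now=None):
        now = time.time() if now is None else now
        self.events.append((now, event))
        # rstream: anything older than the window length falls out
        rstream = []
        while self.events and self.events[0][0] <= now - self.length:
            rstream.append(self.events.popleft()[1])
        return [event], rstream  # istream is just the arriving event

win = TimeWindow(30 * 60)  # 30-minute window, like win:time(30 minute)
win.insert({"_ls_part": "fl9875309"}, now=0)
ist, rst = win.insert({"_ls_part": "fl9875309"}, now=1900)
# rst now holds the first event: it has aged past 30 minutes
```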
21. Asking a question:
• EPL, as you can see, looks much like SQL... so
select count(*) from fl9875309_hit30m
• SQLers will be very surprised by the result of this...
• ideas?
• Hint: this query runs forever and emits results as available
• Esper defaults to use the istream of events from which it selects
• So:
• this statement emits a result on each event entering the window
• and the return set is the total number of events within the window
• We really wanted:
select irstream count(*) from fl9875309_hit30m
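To see why the irstream count emits on every change, here is a hypothetical Python analogue (names are mine): the count is re-emitted both when an event enters the window and when one expires out of it.

```python
from collections import deque

class WindowCount:
    """count(*) over a sliding time window, emitting on the irstream:
    an updated count for every event entering or leaving the window."""

    def __init__(self, length_seconds):
        self.length = length_seconds
        self.stamps = deque()  # arrival times, oldest first

    def on_event(self, now):
        emitted = []
        # rstream side: expire old events, emitting after each removal
        while self.stamps and self.stamps[0] <= now - self.length:
            self.stamps.popleft()
            emitted.append(len(self.stamps))
        # istream side: count the arriving event
        self.stamps.append(now)
        emitted.append(len(self.stamps))
        return emitted

wc = WindowCount(30 * 60)
wc.on_event(0)     # -> [1]
wc.on_event(500)   # -> [2]
wc.on_event(1900)  # -> [1, 2]: the first event expired, then the new one counted
```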
22. Asking a (cooler) question:
• I’d like to know the view volume by referring site.. so
select irstream referrer_host, count(*) as views
from fl9875309_hit30m
where referrer_host <> url_host
group by referrer_host
• This outputs on any event entering or leaving the window... but,
• it only outputs the group that is being updated by the event(s)
entering and/or leaving the window...
• (perhaps) not so useful
23. Snapshots
• Sometimes you want to see the complete state.
• Given that we’re asynch, we can decouple the output from the input.
• Let’s get the top 10 referrers, every 5 seconds.
select irstream referrer_host, count(*) as views
from fl9875309_hit30m
where referrer_host <> url_host
group by referrer_host
output snapshot every 5 seconds
order by count(*) desc
limit 10
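The snapshot behavior can be mimicked by keeping grouped counts updated per event but only reading them out on a timer; a rough Python analogue (structure assumed, not Esper internals), where `snapshot()` would be driven by a 5-second timer:

```python
from collections import Counter

class ReferrerCounts:
    """Grouped count(*) with periodic snapshot output instead of
    per-event output (mirrors `output snapshot every 5 seconds`)."""

    def __init__(self):
        self.counts = Counter()

    def on_enter(self, referrer_host):   # istream: event enters the window
        self.counts[referrer_host] += 1

    def on_leave(self, referrer_host):   # rstream: event expires out
        self.counts[referrer_host] -= 1
        if self.counts[referrer_host] <= 0:
            del self.counts[referrer_host]

    def snapshot(self, limit=10):
        """Full state, ordered by views desc -- call every 5 seconds."""
        return self.counts.most_common(limit)
```

Unlike the per-event output on the previous slide, a snapshot always reports the complete grouped state, not just the group the latest event touched.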
28. Finding anomalies...
• Note: this is very very simplistic.
• I’d like to break the dataset out by network (AS)
• I’d like to find individual hits whose load_time is
greater than the average + 3 times the standard deviation
• I’d like details about the hit’s IP, browser and load_time
select asn_orgname, browser_version, ip, load_time,
average, stddev, datapoints as sample_size
from fl9875309_hit30m(load_time is not null)
.std:groupwin(asn_orgname)
.stat:uni(load_time, ip, browser_version, load_time) as s
where s.load_time > s.average + 3 * s.stddev
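The std:groupwin/stat:uni pair keeps per-group statistics; a rough Python equivalent of the mean + 3σ test (using Welford’s online algorithm; all names here are mine, not Esper’s) would look like:

```python
import math
from collections import defaultdict

class GroupStats:
    """Per-group running mean/stddev; flags a hit whose load_time
    exceeds mean + 3 * stddev for its group (ASN, here)."""

    def __init__(self):
        self.stats = defaultdict(lambda: [0, 0.0, 0.0])  # n, mean, M2

    def on_hit(self, asn, load_time):
        n, mean, m2 = self.stats[asn]
        anomalous = False
        if n >= 2:
            stddev = math.sqrt(m2 / (n - 1))  # sample stddev so far
            anomalous = load_time > mean + 3 * stddev
        # fold the new observation into the group's running stats (Welford)
        n += 1
        delta = load_time - mean
        mean += delta / n
        m2 += delta * (load_time - mean)
        self.stats[asn] = [n, mean, m2]
        return anomalous
```

With a handful of ~100 ms loads on one ASN, a 500 ms hit trips the threshold while a 103 ms hit does not.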
32. Mapping it all out.
• Looking at performance: a world’s-eye view
33. What’s this all mean?
• Big data is all relative.
• 100 records/s at 400 bytes each is... ~3GB/day or ~1TB/year
• 100,000 records/s is... ~3TB/day or 1PB/year
• 500,000 records/s is... ~15TB/day or 5PB/year
• Which is big data? you choose.
• The technology that can act on this in real time exists, and it is different
from the technologies that store it and crunch it.
• Don’t think big... think efficient.
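A quick back-of-envelope check of these rates (pure arithmetic; the slide’s figures are rounded):

```python
def daily_and_yearly(events_per_sec, bytes_per_event=400):
    """Raw bytes per day and per year for a fixed-size event stream."""
    per_day = events_per_sec * bytes_per_event * 86_400  # seconds/day
    return per_day, per_day * 365

for rate in (100, 100_000, 500_000):
    day, year = daily_and_yearly(rate)
    print(f"{rate:,}/s: {day / 1e9:,.1f} GB/day, {year / 1e12:,.1f} TB/year")
```

At 500,000 events/second the exact figure comes out nearer 17 TB/day; the slide rounds down, but the order of magnitude is the point.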
34. Thank You
• Thank you
• Thank you
• Thank you
• Consider attending:
Surge 2011
discussing scalability matters,
because scalability matters
• Thank you!