2. Who am I? @postwait on twitter
Author of “Scalable Internet Architectures”
Pearson, ISBN: 067232699X
Contributor to “Web Operations”
O’Reilly, ISBN: 1449377440
Founder of OmniTI, Message Systems, Fontdeck, & Circonus
I like to tackle problems that are “always on” and “always growing.”
I am an Engineer
A practitioner of academic computing.
IEEE member and Senior ACM member.
On the Editorial Board of ACM’s Queue magazine.
On the ACM professions board.
3. What is BigData?
• Few agree.
• I say it is any data-related problem that
can’t be solved (well) on one machine.
• Never use a distributed system to solve a problem
that can be easily solved on a single system:
• performance
• simplicity
• debuggability
4. Framing the data problem
• events... to make it web-related, let’s say it is web activity
• for every user action, we have an event
• an event is composed of about 20-30 known attributes
(say ~400 bytes)
• url, referrer, site category,
• ip address, ASN, geo location info,
• user-perceived performance info (like load time)
5. Framing the volume problem
• We see 100 of these per second on a site
• Easy problem (more or less)
• We run SaaS, so we need to support 2000 customers:
• 200,000 events/second
(or 30x = 6,000,000 column appends/second)
6. What do we want?
• I want answers, dammit
• I would like to know what is slow (or fast) by
• ASN
• geo location
• browser type
• I’d also like to know given an event:
• is it outside the average ± 2σ
• over the last 5 minutes
8. What else do we want?
• I want answers now, dammit
defined: not later
9. What is real-time?
• The correctness of the answer depends on both the logical
correctness of the result and temporal proximity of the result and
the question.
• hard real-time: old answers are worthless.
• soft real-time: old answers are worth less.
10. Real-time on the Internet
• Hard real-time systems on the Internet;
this sort of thing ain’t my bag, baby!
• Someone is just going to get hurt.
11. Soft real-time?
• We need soft real-time systems any time we are going to react to a user.
• If the answer is either wrong or late, it is less relevant to them.
• The problems we look at have temporal constraints ranging from
5 seconds (counters and statistics) to
1 second (fraud detection) to
10 milliseconds (user-action reaction) and
everywhere in between.
12. Enter CEP
• Complex Event Processing...
• Queries always running.
• Tuples introduced.
• Tuples emitted.
• EsperTech’s Esper is my hero.
14. More concretely
• node.js listens for web requests and submits data to Esper via AMQP
• Esper runs “magic”
• The output of that magic is pushed back via AMQP
• node.js listens and returns the data over JSONP.
20. First steps for simplicity
• I want to create a view on 30 minutes of data for a specific client and
populate that view with those “hit” events:
create window fl9875309_hit30m.win:time(30 minute) as hit
insert into fl9875309_hit30m select * from hit(_ls_part='fl9875309')
• Some useful thoughts:
• data flowing into this window: “istream”
• data also flowing out of this window (after 30 minutes): “rstream”
• if you are interested in both streams, we call it: “irstream”
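The window/istream/rstream idea can be modeled outside Esper; here is a minimal Python sketch (class and field names are my own, not Esper’s) of a sliding time window that reports what enters and what expires on each insert:

```python
import time
from collections import deque

class TimeWindow:
    """Toy sliding time window: insert() returns (istream, rstream)."""

    def __init__(self, length_seconds):
        self.length = length_seconds
        self.events = deque()  # (timestamp, event) pairs, oldest first

    def insert(self, event, now=None):
        now = time.time() if now is None else now
        self.events.append((now, event))
        # rstream: anything older than the window length falls out
        rstream = []
        while self.events and self.events[0][0] <= now - self.length:
            rstream.append(self.events.popleft()[1])
        return [event], rstream  # istream is just the arriving event

win = TimeWindow(30 * 60)  # 30-minute window, like win:time(30 minute)
win.insert({"_ls_part": "fl9875309"}, now=0)
ist, rst = win.insert({"_ls_part": "fl9875309"}, now=1900)
# rst now holds the first event: it has aged past 30 minutes
```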
21. Asking a question:
• EPL, as you can see, looks much like SQL... so
select count(*) from fl9875309_hit30m
• SQLers will be very surprised by the result of this...
• ideas?
• Hint: this query runs forever and emits results as available
• Esper defaults to use the istream of events from which it selects
• So:
• this statement emits a result on each event entering the window
• and the return set is the total number of events within the window
• We really wanted:
select irstream count(*) from fl9875309_hit30m
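To see why the irstream count emits on every change, here is a hypothetical Python analogue (names are mine): the count is re-emitted both when an event enters the window and when one expires out of it.

```python
from collections import deque

class WindowCount:
    """count(*) over a sliding time window, emitting on the irstream:
    an updated count for every event entering or leaving the window."""

    def __init__(self, length_seconds):
        self.length = length_seconds
        self.stamps = deque()  # arrival times, oldest first

    def on_event(self, now):
        emitted = []
        # rstream side: expire old events, emitting after each removal
        while self.stamps and self.stamps[0] <= now - self.length:
            self.stamps.popleft()
            emitted.append(len(self.stamps))
        # istream side: count the arriving event
        self.stamps.append(now)
        emitted.append(len(self.stamps))
        return emitted

wc = WindowCount(30 * 60)
wc.on_event(0)     # -> [1]
wc.on_event(500)   # -> [2]
wc.on_event(1900)  # -> [1, 2]: the first event expired, then the new one counted
```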
22. Asking a (cooler) question:
• I’d like to know the view volume by referring site.. so
select irstream referrer_host, count(*) as views
from fl9875309_hit30m
where referrer_host <> url_host
group by referrer_host
• This outputs on any event entering or leaving the window... but,
• it only outputs the group that is being updated by the event(s)
entering and/or leaving the window...
• (perhaps) not so useful
23. Snapshots
• Sometimes you want to see the complete state.
• Given that we’re asynch, we can decouple the output from the input.
• Let’s get the top 10 referrers, every 5 seconds.
select irstream referrer_host, count(*) as views
from fl9875309_hit30m
where referrer_host <> url_host
group by referrer_host
output snapshot every 5 seconds
order by count(*) desc
limit 10
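The snapshot behavior can be mimicked by keeping grouped counts updated per event but only reading them out on a timer; a rough Python analogue (structure assumed, not Esper internals), where `snapshot()` would be driven by a 5-second timer:

```python
from collections import Counter

class ReferrerCounts:
    """Grouped count(*) with periodic snapshot output instead of
    per-event output (mirrors `output snapshot every 5 seconds`)."""

    def __init__(self):
        self.counts = Counter()

    def on_enter(self, referrer_host):   # istream: event enters the window
        self.counts[referrer_host] += 1

    def on_leave(self, referrer_host):   # rstream: event expires out
        self.counts[referrer_host] -= 1
        if self.counts[referrer_host] <= 0:
            del self.counts[referrer_host]

    def snapshot(self, limit=10):
        """Full state, ordered by views desc -- call every 5 seconds."""
        return self.counts.most_common(limit)
```

Unlike the per-event output on the previous slide, a snapshot always reports the complete grouped state, not just the group the latest event touched.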
28. Finding anomalies...
• Note: this is very very simplistic.
• I’d like to break the dataset out by network (AS)
• I’d like to find individual hits whose load_time is
greater than the average + 3 times the standard deviation
• I’d like details about the hit’s IP, browser and load_time
select asn_orgname, browser_version, ip, load_time,
average, stddev, datapoints as sample_size
from fl9875309_hit30m(load_time is not null)
.std:groupwin(asn_orgname)
.stat:uni(load_time, ip, browser_version, load_time) as s
where s.load_time > s.average + 3 * s.stddev
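The std:groupwin/stat:uni pair keeps per-group statistics; a rough Python equivalent of the mean + 3σ test (using Welford’s online algorithm; all names here are mine, not Esper’s) would look like:

```python
import math
from collections import defaultdict

class GroupStats:
    """Per-group running mean/stddev; flags a hit whose load_time
    exceeds mean + 3 * stddev for its group (ASN, here)."""

    def __init__(self):
        self.stats = defaultdict(lambda: [0, 0.0, 0.0])  # n, mean, M2

    def on_hit(self, asn, load_time):
        n, mean, m2 = self.stats[asn]
        anomalous = False
        if n >= 2:
            stddev = math.sqrt(m2 / (n - 1))  # sample stddev so far
            anomalous = load_time > mean + 3 * stddev
        # fold the new observation into the group's running stats (Welford)
        n += 1
        delta = load_time - mean
        mean += delta / n
        m2 += delta * (load_time - mean)
        self.stats[asn] = [n, mean, m2]
        return anomalous
```

With a handful of ~100 ms loads on one ASN, a 500 ms hit trips the threshold while a 103 ms hit does not.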
32. Mapping it all out.
• Looking at performance: a world’s-eye view
33. What’s this all mean?
• Big data is all relative.
• 100 records/s at 400 bytes each is... ~3GB/day or ~1TB/year
• 100,000 records/s is... ~3TB/day or 1PB/year
• 500,000 records/s is... ~15TB/day or 5PB/year
• Which is big data? you choose.
• The technology that can act on this in real time exists, and it is different
from the technologies that store it and crunch it.
• Don’t think big... think efficient.
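A quick back-of-envelope check of these rates (pure arithmetic; the slide’s figures are rounded):

```python
def daily_and_yearly(events_per_sec, bytes_per_event=400):
    """Raw bytes per day and per year for a fixed-size event stream."""
    per_day = events_per_sec * bytes_per_event * 86_400  # seconds/day
    return per_day, per_day * 365

for rate in (100, 100_000, 500_000):
    day, year = daily_and_yearly(rate)
    print(f"{rate:,}/s: {day / 1e9:,.1f} GB/day, {year / 1e12:,.1f} TB/year")
```

At 500,000 events/second the exact figure comes out nearer 17 TB/day; the slide rounds down, but the order of magnitude is the point.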
34. Thank You
• Thank you
• Thank you
• Thank you
• Consider attending:
Surge 2011
discussing scalability matters,
because scalability matters
• Thank you!