How to Create 80% of a Big Data Pilot Project
1. © 2015 ligaDATA, Inc. All Rights Reserved.
October 2015
Download, Forums, Docs, Events http://Kamanja.org
Meet 80% of the Needs of a Pilot Project
With a CC Fraud Detection Example
By Greg Makowski
ACM Data Science Camp, Saturday 10/24/2015
http://www.sfbayacm.org/event/silicon-valley-data-science-camp-2015
http://kamanja.org/white-papers/
2. Summary
Question) How to help an OSS pilot evaluation go faster?
Answer) Develop “design patterns” for applications
Pick a specific app (Credit Card Fraud Detection)
Get data (end up generating it)
Need to vary arch config (like performance testing)
Given requirements, generate a multi-node example pilot system, involving many OSS components
PMML can abstract the production step from model building
3. Problem
When evaluating any new data mining or big data software, companies want to “try it out” and see how it meets their requirements. A common step is a pilot project.
A pilot would commonly involve integration with related software systems.
Open Source Software (OSS) may come with examples.
Need an example “production system”
Q) What can be done to shorten the time to finish a Pilot?
4. Problem: Questions to be answered from Pilot
How fast is it?
• It depends (yes, that is an annoying answer)
How to configure the system with other OSS software?
• It depends (yes, that is an annoying answer)
5. Problem: Questions to be answered from Pilot
How fast is it?
• It depends (yes, that is an annoying answer)
• Show example configs with performance results
How to configure the system with other OSS software?
• It depends (yes, that is an annoying answer)
• Consider different application “design patterns”
How will the system grow as complexity grows?
• The answer is specific per design pattern
How should DevOps monitor and manage?
6. Kamanja Platform
[Architecture diagram. Labels: CDC, Logs, Apps; Databases; ESBs; Social; 3rd Party Data Sources; Batch Stores; Input Queues; kamanja Decision Engine; Data Store; Storage; Admin Management; Output Queues; Decisioning Actions; Next Best Action; Application Updates; Alerts & Notifications]
7. See Kamanja.org, and github
Kamanja is used as an example. The process in this talk is general and can be broadly applied to other OSS.
Kamanja is a big data continuous decisioning system
Apache license, available on github
8. Application Design Pattern
Departmental Model Scoring Application
Scaling challenges
• transaction growth and type (quantity & speed)
• model complexity (hybrid systems)
• quantity of models: 10’s to 10k’s
• for most models, most fields, need to access the data store for preprocessing
[Diagram labels: Financial; Log; Consumer; Business; Input queue; Model Scoring (PMML); Real time Output Queue; Cache + Data Store; Management and Control System; Preprocessing & Scores; Reporting; Analysis; Lambda Architecture Combines Real time And Batch]
9. Application Design Pattern
Social Network Analysis
Scaling challenges
• transaction growth and source (quantity & speed)
• model: sentiment, graph
• quantity of models: a few
• data store lookup for base user info
[Diagram labels: Twitter; Facebook; ...; Input queue; Model Scoring (Java, Scala); Real Time Charting, Alerting; Cache + Data Store; Management and Control System; User baseline; Network; Trend Analysis; Deep Dive]
10. Application Design Pattern
Text Mining, Search
Scaling challenges
• transaction growth
• some projects: very heavy computing for NLP parsing
• quickly score on tagged results
[Diagram labels: Pages; Documents; Posts; Tweets; Input queue; Model Scoring (Java, Stanford NLP); Output Queue; Cache + Data Store; Management and Control System; Parse trees; Inverted indexes; Trending topics; Update Thesaurus; Docs ↔ Topics]
11. Details on Departmental Scoring:
Credit Card Fraud Detection System
How to develop an example system?
• There is no public data. Private won’t be shared
• Generate the data (then can also test BIG DATA)
• Focus on 5 use cases of “normal” and 5 “fraud”
Configuring architecture can be used for
1) Performance testing for different requirements
2) Pilot system, example included w/ Kamanja
Train models, generate PMML for scoring
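The "generate the data" step can be sketched as below. This is a minimal illustration, not the generator shipped with Kamanja: the `Transaction` fields, category names, amounts, and the 5000 credit limit are all hypothetical, and only two of the ten use cases (a steady-state normal card and a "drain the account" fraud card) are modeled.

```python
import random
from dataclasses import dataclass


@dataclass
class Transaction:
    card_id: str
    minute: int       # minutes since the start of the simulation
    amount: float
    category: str     # hypothetical category names
    is_fraud: bool


def steady_state(card_id, rng, n=20):
    """'Normal' profile: small, similar purchases spread over 30 days."""
    return [
        Transaction(card_id, rng.randrange(30 * 24 * 60),
                    round(rng.uniform(5, 80), 2),
                    rng.choice(["grocery", "gas", "restaurant"]), False)
        for _ in range(n)
    ]


def drain_account(card_id, rng, credit_limit=5000.0, start=0):
    """'Fraud' profile: one small test purchase, then rapid resellable buys."""
    txns = [Transaction(card_id, start, 0.99, "digital_music", True)]
    spent, minute = 0.99, start
    while True:
        amount = round(rng.uniform(200, 900), 2)
        if spent + amount > credit_limit:
            break  # stop once the next purchase would exceed the limit
        minute += rng.randrange(1, 4)
        txns.append(Transaction(card_id, minute, amount,
                                rng.choice(["gift_card", "jewelry", "electronics"]),
                                True))
        spent += amount
    return txns


def generate(n_normal, n_fraud, seed=42):
    """Build a labeled data set; a fixed seed keeps runs repeatable."""
    rng = random.Random(seed)
    data = []
    for i in range(n_normal):
        data += steady_state(f"N{i:04d}", rng)
    for i in range(n_fraud):
        data += drain_account(f"F{i:04d}", rng)
    return data
```

Because every record carries an `is_fraud` label, the same output can feed both model training and volume testing by raising `n_normal` and `n_fraud`.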
12. Credit Card Fraud Detection System
FRAUD Use Cases
Fraudster extracting value out of hacked card
• Likely a first “test” of CC info. iTunes or unmanned gas pump w/o camera
• Drain account up to CC limit in 15 min, up to 2-3 days
• Purchase things “easy to cash out or resell” – launder money
• gift cards, gems, jewelry, small electronics easy to sell, burner phones
F1) Elder abuse – either PII or CC info gets copied
• Fraudster opens first web or mobile account (surprising for grandmother)
• Higher credit limit, long time with no web/mobile
• Long time CC holder (high tenure), little spend variation
F2) Hacker bought PII (Personally Identifiable Information)
• Fraudster used PII to apply for a new account
• new account likely has a lower credit limit
• Over 1st month, slowly changes PII to the fraudster’s, so as not to alert the victim
• use in “card not present” situations
13. Credit Card Fraud Detection System
FRAUD Use Cases
F3) Physical clone
• Fraudster may have bought CC info online ($1/account) or copied mag strip from the victim in the store.
• Fraudster card use can be concurrent with normal consumer use – or very different place and time zone
F4) Rare Behavior (may be part of other use cases)
• Unusual time of day, geography, spending by type of goods / services
F5) Risky Behavior – fraudster may visit blacklisted web page
• Fraudster is engaging with
• Geography changes are not plausible (noon in San Jose, 1pm in Hong Kong)
• Relate to past labeled cases of CC fraud.
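The "geography changes are not plausible" check can be sketched as an implied-speed test between two consecutive uses of a card. This is an illustrative sketch only: the 1000 km/h cutoff, the tuple layout, and the city coordinates are assumptions, and the slide's "noon" and "1pm" are treated here as timestamps one hour apart on the same clock.

```python
import math
from datetime import datetime, timezone

EARTH_RADIUS_KM = 6371.0
MAX_PLAUSIBLE_KMH = 1000.0  # hypothetical cutoff, roughly airliner speed


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two points given in degrees."""
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dphi = p2 - p1
    dlmb = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(p1) * math.cos(p2) * math.sin(dlmb / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def implausible_travel(prev, curr, max_kmh=MAX_PLAUSIBLE_KMH):
    """prev and curr are (utc_datetime, latitude, longitude) tuples."""
    hours = (curr[0] - prev[0]).total_seconds() / 3600.0
    km = haversine_km(prev[1], prev[2], curr[1], curr[2])
    if hours <= 0:
        return km > 1.0  # simultaneous use in two different places
    return km / hours > max_kmh


# Example card uses (approximate city coordinates).
san_jose = (datetime(2015, 10, 24, 12, 0, tzinfo=timezone.utc), 37.3382, -121.8863)
hong_kong = (datetime(2015, 10, 24, 13, 0, tzinfo=timezone.utc), 22.3193, 114.1694)
san_francisco = (datetime(2015, 10, 24, 13, 0, tzinfo=timezone.utc), 37.7749, -122.4194)

print(implausible_travel(san_jose, hong_kong))      # True
print(implausible_travel(san_jose, san_francisco))  # False
```

A flag like this would be one input feature among several; a traveling salesperson (normal use case 5) can legitimately produce large but slower location changes.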
14. Credit Card Fraud Detection System
NORMAL Use Cases
1) Steady State use – the CC use by these people is fairly consistent and stable. Can have a die vei
2) New Card, 1st month – this example is set up to make it difficult to compare with fraudulently opened new cards. Spending may max out
3) Young and starting singles or newly married. These people don’t have much of a credit rating
• More likely to use web and mobile channels.
• More likely to wander to dangerous areas of the web.
• Likely to spend in a bigger array of categories
• Possibly many geographic locations
4) Normal Case, Family – Medium to higher income limit, many don’t hit limit
• Low to moderate showing up in new geographies, or spending on new categories
5) Work Travel – Work in sales or consulting. New locations are no surprise.
• Higher spending limit and amounts, many flight, hotel, car rental, high mobile
15. Pilot Project & Performance Testing
Credit Card Fraud Detection
[Diagram: Input queue, Model Scoring, and Real time Output nodes shown at two scales]
• 1 Kafka, 1 Kamanja, 1 Kafka
• ~3 Kafka, 16 Kamanja, ~3 Kafka, with Cache + Data Store
• Add Preprocessing Logic and HBase table lookup
16. Performance Testing – Model Node
Credit Card Fraud Detection
Fields per record: testing network speed between nodes
• 30, 120, 480 fields (yes, could go 10k, 100k)
Single model complexity: testing compute load
• Small, Medium & Large (100, 2k, 32.5k elements)
Preprocessing lookup tables: testing cache to HB & network
• none, some
Ensemble Models per score: testing compute & network
• 1, 5, 20
Number of Models in department:
• 1, 10, 100
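The five test dimensions above multiply into a grid of test cases. The sketch below enumerates that grid; the dictionary keys and the `build_grid` helper are illustrative names, while the level values come from the slide.

```python
from itertools import product

# Levels per dimension, taken from the slide; key names are illustrative.
TEST_DIMENSIONS = {
    "fields_per_record": [30, 120, 480],
    "model_elements": [100, 2_000, 32_500],
    "preprocessing_lookup": ["none", "some"],
    "models_per_ensemble": [1, 5, 20],
    "models_in_department": [1, 10, 100],
}


def build_grid(dimensions=TEST_DIMENSIONS):
    """Return one dict per combination of levels (full factorial design)."""
    names = list(dimensions)
    return [dict(zip(names, combo)) for combo in product(*dimensions.values())]


grid = build_grid()
print(len(grid))  # 3 * 3 * 2 * 3 * 3 = 162
```

Running all 162 combinations is rarely necessary; the grid is mainly useful for picking the handful of corner cases that bracket a given set of requirements.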
17. Solution to Developer Questions
(How Fast, How to Configure?)
How many fields per record? 30, 120, 480 (SML)
What model complexity? 100, 2k, 32.5k (SML)
Is data already preprocessed? Yes, No (YN)
Average models / ensemble? 1, 5, 20 (SML)
How many models in the department? 1, 10, 100 (SML)
What language? PMML, Java, Scala
(I want to create a table like…)
Requirements →    Then need configure           For speed rec/s
S,S,Y,M,S         1 Kaf, 1 Kam, 1 Kaf           1.1mm
M,L,Y,M,S         1 Kaf, 1 Kam, 1 Kaf           200K
L,L,N,L,L         3 Kaf, 16 Kam, 1 Kaf, 3HB     1.6mm
Generate Architecture and run an 80% relevant Pilot
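The requirements-to-configuration table can be sketched as a simple lookup. The three rows are copied from the slide; the helper names and the rule of rounding a value up to the next tested level are assumptions made for illustration, and the throughput strings are kept exactly as written rather than interpreted.

```python
# Levels per question in Small/Medium/Large order (values from the slide).
LEVELS = {
    "fields": [30, 120, 480],
    "complexity": [100, 2_000, 32_500],
    "ensemble": [1, 5, 20],
    "models": [1, 10, 100],
}

# Rows of the slide's table: code -> (configuration, records/second as written).
KNOWN_CONFIGS = {
    "S,S,Y,M,S": ("1 Kaf, 1 Kam, 1 Kaf", "1.1mm"),
    "M,L,Y,M,S": ("1 Kaf, 1 Kam, 1 Kaf", "200K"),
    "L,L,N,L,L": ("3 Kaf, 16 Kam, 1 Kaf, 3HB", "1.6mm"),
}


def size_code(question, value):
    """Assumed rule: pick the smallest tested level that covers the value."""
    for letter, limit in zip("SML", LEVELS[question]):
        if value <= limit:
            return letter
    return "L"  # beyond the largest tested level


def requirement_code(fields, complexity, preprocessed, ensemble, models):
    """Encode the five answers as a code such as 'S,S,Y,M,S'."""
    return ",".join([
        size_code("fields", fields),
        size_code("complexity", complexity),
        "Y" if preprocessed else "N",
        size_code("ensemble", ensemble),
        size_code("models", models),
    ])


def suggest(code):
    """Return (configuration, speed) or None if the table has no such row."""
    return KNOWN_CONFIGS.get(code)


code = requirement_code(fields=30, complexity=100, preprocessed=True,
                        ensemble=5, models=1)
print(code, suggest(code))
```

A full version of the table would need a measured row for every code a pilot is likely to ask about; codes without a row return `None` here instead of a guess.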
18. Social Network Analysis: Example System Configuration
[Diagram. Components: Text or Twitter API; Java 1 and GUI; Kafka; Kamanja; Java 2 for Features (Sentiment or Stanford NLP); Java 3 for analysis; Data Store; Matched_tags_per_text table; Alerts table. Flow labels, numbered 1 to 13 in the diagram:]
• Java calls API, and Kafka producer
• Tweets returned in JSON
• JSON tweets sent to Kafka
• Kafka JSON to Kamanja
• JSON with features saved in DB
• JAVA: Every “time window”, queries the DB to aggregate (i.e. count (tags) by (tags) by..)
• JSON returns the aggregate query results to JAVA
• JSON query results to Kafka
• JSON results of rule scoring, alert text
• 13 Tomcat web service displays data and charts
• results to Java 3 for scoring, with thresholds
• Save results to DB
• JAVA 1: check for updates to the alerts table
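The "every time window, aggregate tag counts, then score with thresholds" step can be sketched in memory as follows. The slide describes Java code querying a database; this Python version replaces that with plain dictionaries purely for illustration, and the record layout, window size, and threshold are assumptions.

```python
from collections import Counter, defaultdict


def count_tags_by_window(records, window_seconds=60):
    """records: iterable of (epoch_seconds, [tags]).

    Returns {window_start_seconds: Counter of tag occurrences}.
    """
    windows = defaultdict(Counter)
    for ts, tags in records:
        start = ts - ts % window_seconds  # align to the window boundary
        windows[start].update(tags)
    return dict(windows)


def alerts(windows, threshold):
    """Return sorted (window_start, tag, count) where count >= threshold."""
    return sorted(
        (start, tag, n)
        for start, counts in windows.items()
        for tag, n in counts.items()
        if n >= threshold
    )


# Hypothetical tagged tweets: (timestamp in seconds, matched tags).
records = [
    (0, ["fraud", "bank"]),
    (10, ["fraud"]),
    (59, ["fraud"]),
    (60, ["bank"]),
    (61, ["fraud"]),
]
windows = count_tags_by_window(records)
print(alerts(windows, threshold=3))  # [(0, 'fraud', 3)]
```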
19. PMML Diagram
Predictive Model Markup Language
[Diagram, two phases]
Training Project Phase: Training & test data (batch) → Data Mining Tool (PMML Producer, 18 available) → File, Save As PMML → PMML File (full model specification)
Production Scoring Project Phase: PMML File + Scoring data (real time streaming) → Scoring Engine (Kamanja), a PMML Consumer → Output data has new score field
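The producer/consumer split can be illustrated with a toy consumer. The sketch below parses a small hand-written PMML 4.2 regression document with the Python standard library and applies it to one record. It is not how Kamanja consumes PMML and handles only plain `NumericPredictor` terms; the field names and coefficients are invented for the example.

```python
import xml.etree.ElementTree as ET

PMML_NS = {"p": "http://www.dmg.org/PMML-4_2"}

# A minimal, hypothetical model as a data mining tool might export it.
PMML_TEXT = """
<PMML version="4.2" xmlns="http://www.dmg.org/PMML-4_2">
  <Header description="hypothetical fraud-risk regression"/>
  <DataDictionary numberOfFields="3">
    <DataField name="amount" optype="continuous" dataType="double"/>
    <DataField name="foreign" optype="continuous" dataType="double"/>
    <DataField name="risk" optype="continuous" dataType="double"/>
  </DataDictionary>
  <RegressionModel functionName="regression">
    <MiningSchema>
      <MiningField name="amount"/>
      <MiningField name="foreign"/>
      <MiningField name="risk" usageType="target"/>
    </MiningSchema>
    <RegressionTable intercept="0.1">
      <NumericPredictor name="amount" coefficient="0.002"/>
      <NumericPredictor name="foreign" coefficient="0.5"/>
    </RegressionTable>
  </RegressionModel>
</PMML>
"""


def load_linear_model(pmml_text):
    """Read the intercept and coefficients from the first RegressionTable."""
    root = ET.fromstring(pmml_text.strip())
    table = root.find(".//p:RegressionTable", PMML_NS)
    intercept = float(table.get("intercept"))
    coefs = {p.get("name"): float(p.get("coefficient"))
             for p in table.findall("p:NumericPredictor", PMML_NS)}
    return intercept, coefs


def score(record, model):
    """Apply the linear model: intercept + sum(coefficient * field value)."""
    intercept, coefs = model
    return intercept + sum(c * record[name] for name, c in coefs.items())


model = load_linear_model(PMML_TEXT)
print(score({"amount": 100.0, "foreign": 1.0}, model))
```

The point of the split is visible even at this scale: the scoring side needs only the PMML file, not the tool or the training data that produced it.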
20. Given industry fragmentation,
PMML is a solution for Data Mining scoring
(* = Open Source)
PMML Producers (18 data mining packages)
• R (Rattle, PMML)*
• RapidMiner
• KNIME*
• Weka*
• SAS Enterprise Miner
PMML Consumers (12 co)
• Zementis
• IBM SPSS
• KNIME
• Microstrategy
• SAS
• Kamanja* (Open Source)
• Spark (MLlib)*
PREDICTIVE
• Naïve Bayes, Neural Net, Regression, Rules, Scorecard, Sequence, SVM, Time Series, Trees
DESCRIPTIVE / OTHER
• Association Rules
• Cluster, K-Nearest Nb
• Text Models
• model ensembles & composition (i.e. Gradient Boosting)
21. Summary
Question) How to help an OSS pilot evaluation go faster?
Answer) Develop “design patterns” for applications
Pick a specific app (Credit Card Fraud Detection)
Get data (end up generating it)
Need to vary arch config (like performance testing)
Given requirements, generate a multi-node example pilot system, involving many OSS components
PMML can abstract the production step from model building
22. Try out Kamanja
Download, Forums, Docs, Events http://Kamanja.org
http://kamanja.org/white-papers/
23. Kamanja: One Speed Analysis
Kamanja: 220k to 230k messages / second
CONFIGURATION:
• 16 core box, using Solid State Disc
• Sample Tool to generate messages of size 1k (not being reduced)
• Data Mining uses 100’s to 100k fields – not 100 byte message
• Kafka Queue
• 3 input queues, each queue has 8 partitions
• Kamanja Engine
• Using the remaining 12-13 cores
• Not saving score results per record in this test
SO WHAT? COMPARISON:
• Storm is currently the lowest latency Apache big data system
• Storm integration, got up to 90k to 100k for same data
• Kamanja is 2.4 times faster than Storm = (225k/95k) in this test
• Spark streaming is with mini-batches, with higher latency than Storm or Kamanja
Why is Kamanja faster than Storm?
Storm reads the data from the input queue (spout) and passes that to Bolts. On each pass from spout to bolt, the data is serialized & deserialized. There is other overhead.
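The 2.4x figure follows from the midpoints of the two reported ranges. The short calculation below reproduces it and the total Kafka partition count; the variable names are illustrative, and the inputs are the numbers quoted on the slide.

```python
def midpoint(low, high):
    """Middle of a reported throughput range."""
    return (low + high) / 2


kamanja = midpoint(220_000, 230_000)  # 225,000 messages/second
storm = midpoint(90_000, 100_000)     # 95,000 messages/second
speedup = kamanja / storm

total_partitions = 3 * 8              # 3 input queues x 8 partitions each

print(f"speedup ~ {speedup:.1f}x, Kafka partitions = {total_partitions}")
```

This is a single-configuration comparison (1k messages, one 16 core box), so the ratio describes this test rather than a general bound.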