These days companies are collecting more and more data, and it is up to data scientists to create business value out of it. Typically this is done by training models on historical data stored on HDFS. Once a model has been trained, it is ready to be used for scoring. At ING Bank we need to score models in real time, blocking potentially fraudulent transactions before they cause damage to either the customer or the bank. As fraudsters invent new ways to commit fraud, we also need to add new models to a running system, without downtime. In this talk we’ll present our implementation of a real-time streaming analytics platform that enables us to dynamically change the behaviour of our stateful Flink application. The end result is an environment in which end users are given a DSL that they can use to stream new models into the Flink job and to change the transformations within the operators. This gives them full control of the streaming analytics platform at runtime.
2. IT Chapter Lead within the Fraud & Cybersecurity
department, based in Amsterdam
Before ING: implemented enterprise software,
mainly knowledge management and CRM related
Background in: Scala, Java, C# (MCSD), Tomcat, WebSphere,
Oracle, Cassandra and now... Flink
https://www.linkedin.com/in/erik-de-nooij-93ab1a/
Erik.g.de.Nooij@ing.nl
Who Am I?
4. Worldwide
35 Million customers
51.000 Employees
Presence in over 40 countries
Netherlands
9 Million Customers
Billion logins yearly on https://www.ing.nl
1 million transactions per day
About ING
Market leaders Benelux
Growth markets
Commercial Banking
Challengers
The Netherlands
5. Threats: Fake ID, Skimming, Phishing, APT, ?
Criminal organization: Individuals, Small groups, Worldwide groups, Organized crime
Response: Manual detection, Rule-based detection, Model-based detection, Anomaly detection
Timeline: 2008, 2010, 2012, 2014, 2017
Threats related to fraud & cybersecurity
7. Support various types of (ML) models
Tools to create models versus scoring models
One codebase, SaaS deployment model
Make changes instantly (no downtime)
Multiple domains
Goals
8. Support various types of (ML) models
One codebase, SaaS deployment model
Pre-processor, Decoupled architecture
Make changes instantly (no downtime)
Multiple domains
Goals
9. Support various types of (ML) models
One codebase, SaaS deployment model
Make changes instantly (no downtime)
Use case
Feature extraction
Enriching streams
End user tooling
Demo
Multiple domains
Goals
10. Support various types of (ML) models
One codebase, SaaS deployment model
Make changes instantly (no downtime)
Multiple domains
examples
Goals
13. The Predictive Model Markup Language (PMML)
is an XML-based predictive model interchange format
Predictive Model Markup Language (PMML)
<SimpleRule score="Alert" weight="1.0">
  <CompoundPredicate booleanOperator="and">
    <SimplePredicate field="field1" operator="greaterThan" value="500"/>
    <SimplePredicate field="field2" operator="equal" value="1"/>
    <SimplePredicate field="field3" operator="greaterThan" value="1"/>
  </CompoundPredicate>
</SimpleRule>
In other words: if field1 > 500 AND field2 == 1 AND field3 > 1, then the score is "Alert"
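To make the mapping between the XML rule and its plain-language reading concrete, here is a minimal sketch that evaluates this one fragment with Python's standard library. It is an illustration only, not how the talk's platform or the OpenScoring.io library works internally: the names `RULE_XML`, `OPERATORS` and `evaluate_rule` are made up for this example, only the two operators that appear on the slide are handled, and a full PMML document would also carry an XML namespace and much more structure.

```python
import xml.etree.ElementTree as ET

RULE_XML = """
<SimpleRule score="Alert" weight="1.0">
  <CompoundPredicate booleanOperator="and">
    <SimplePredicate field="field1" operator="greaterThan" value="500"/>
    <SimplePredicate field="field2" operator="equal" value="1"/>
    <SimplePredicate field="field3" operator="greaterThan" value="1"/>
  </CompoundPredicate>
</SimpleRule>
"""

# Assumption: only the two operators used on the slide are covered here.
OPERATORS = {
    "greaterThan": lambda actual, expected: actual > expected,
    "equal": lambda actual, expected: actual == expected,
}


def evaluate_rule(rule_xml, features):
    """Return the rule's score if all predicates hold, otherwise None."""
    rule = ET.fromstring(rule_xml)
    compound = rule.find("CompoundPredicate")
    if compound.get("booleanOperator") != "and":
        raise ValueError("this sketch only handles booleanOperator='and'")
    for predicate in compound.findall("SimplePredicate"):
        compare = OPERATORS[predicate.get("operator")]
        actual = float(features[predicate.get("field")])
        expected = float(predicate.get("value"))
        if not compare(actual, expected):
            return None  # one failing predicate is enough to reject an "and"
    return rule.get("score")


print(evaluate_rule(RULE_XML, {"field1": 1000, "field2": 1, "field3": 2}))  # Alert
print(evaluate_rule(RULE_XML, {"field1": 100, "field2": 1, "field3": 2}))   # None
```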
16. Parse the PMML file(s)
Pass on the feature set to the model(s)
Run the ‘predict’ function, which returns the output of the model(s)
Model scoring using the OpenScoring.io library
Diagram: Control stream and Data stream (feature sets) → Model scoring → Score
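The interplay of the control stream and the data stream can be sketched in a few lines. This is a simplified, single-process illustration under stated assumptions, not the talk's Flink code: `ModelScoringOperator`, `on_control`, `on_data` and `amount_rule` are hypothetical names, and a plain Python function stands in for a parsed PMML model. In Flink itself, two streams can be combined with connected streams or broadcast state; the slides do not say which mechanism was used.

```python
from typing import Callable, Dict

FeatureSet = Dict[str, float]
Model = Callable[[FeatureSet], str]


class ModelScoringOperator:
    """Holds the currently deployed models and scores each feature set."""

    def __init__(self) -> None:
        self.models: Dict[str, Model] = {}

    def on_control(self, model_id: str, model: Model) -> None:
        # A control message adds or replaces a model while the job keeps running.
        self.models[model_id] = model

    def on_data(self, features: FeatureSet) -> Dict[str, str]:
        # Every deployed model scores the same feature set.
        return {model_id: model(features) for model_id, model in self.models.items()}


def amount_rule(features: FeatureSet) -> str:
    # Stand-in for a parsed PMML model (assumption for this sketch).
    return "Alert" if features["amount"] > 500 else "OK"


operator = ModelScoringOperator()
before = operator.on_data({"amount": 1000.0})    # no model deployed yet
operator.on_control("amount_rule", amount_rule)  # model arrives on the control stream
after = operator.on_data({"amount": 1000.0})
print(before, after)
```

The point of the design is that deploying a model is just another event: nothing is restarted, and the next feature set on the data stream is scored by the new model.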
17. Supported models (*)
Association rules
Cluster model
General regression
Naive Bayes
k-Nearest neighbours
Neural network
Regression
Rule set
Scorecard
Support Vector Machine
Tree model
Ensemble model
(*) Models supported by http://openscoring.io/
18. Goals
Use of various types of models
One codebase, SaaS Deployment model
Pre-processor, Decoupled architecture
Make changes instantly (no downtime)
Multiple domains
22. Goals
Use of various types of models
One codebase, SaaS Deployment model
Make changes instantly (no downtime)
Use case
Feature extraction
Enriching streams
End user tooling
Demo
Multiple domains
23. • Your phone with the banking app installed is stolen
• Limit on the banking app is 1.000,-
• Funds are transferred from your account (A) to a mule account (B)
Use case
24. Model features and model output
Features: Amount > 500; NrOf Trxs Last 1h; First Trx < 24h ago
Model output: Alert || OK
25. Stream with stateless operators
Diagram: Ev.1 (A → B, amount 1000) → Feature extraction (FeX) → Model scoring (PMML)
Features (Amount, Unknown, PrevTrxs): (1000, ?, ?)
26. Stream with stateful operators
Diagram: Ev.1 and Ev.2 (each A → B, amount 1000) → Feature extraction (FeX) with STATE → Model scoring (PMML) → Alert || OK
Key Value
(A,B, FirstTrx) Ev.1
(A,B, HistoricalTrxs) ev1 1000
Features (Amount, Unknown, PrevTrxs): (1000, true, 1)
Key Value
(A,B, FirstTrx) Ev.1
(A,B, HistoricalTrxs) ev1 1000, ev2 1000
Features (Amount, Unknown, PrevTrxs): (1000, true, 0)
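The difference between the stateless and the stateful variant can be sketched as a small feature extractor that keeps per-key state. This is a minimal illustration, not the talk's implementation: `Transaction`, `FeatureExtractor` and the in-memory dictionaries are hypothetical stand-ins for Flink keyed state, and two choices are assumptions of this sketch (the current event is counted in its own history, and event times are plain seconds). The features follow the use case's model inputs: amount, number of transactions in the last hour, and whether the first transaction from A to B was less than 24 hours ago.

```python
from collections import defaultdict
from dataclasses import dataclass

HOUR = 3600  # seconds


@dataclass
class Transaction:
    source: str
    dest: str
    amount: float
    event_time: int  # seconds since an arbitrary epoch


class FeatureExtractor:
    def __init__(self):
        # Keyed state: one entry per (source, dest) pair.
        self.first_trx = {}
        self.history = defaultdict(list)

    def process(self, trx):
        key = (trx.source, trx.dest)
        # Update state first, so the current event is part of the history.
        self.first_trx.setdefault(key, trx.event_time)
        self.history[key].append(trx.event_time)

        window_start = trx.event_time - HOUR
        nr_last_hour = sum(
            1 for t in self.history[key] if window_start <= t <= trx.event_time
        )
        first_within_24h = self.first_trx[key] >= trx.event_time - 24 * HOUR
        return (trx.amount, nr_last_hour, first_within_24h)


fex = FeatureExtractor()
f1 = fex.process(Transaction("A", "B", 1000, 0))
f2 = fex.process(Transaction("A", "B", 1000, 600))
f3 = fex.process(Transaction("A", "B", 1000, 30 * HOUR))
print(f1, f2, f3)
```

Without the two dictionaries, only the amount could be derived from the event itself, which is the situation the "(1000, ?, ?)" tuple on the stateless slide describes.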
27. How to perform aggregate functions on a stream?
Average amount last week: € 37,04
Max amount last month: € 834,12
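One common way to answer this question is a sliding time window that evicts old events as new ones arrive. The sketch below is a generic illustration of that technique, not the mechanism used in the talk: `SlidingWindow` is a hypothetical helper, the amounts are invented, and it assumes events arrive in time order.

```python
from collections import deque

DAY = 24 * 3600  # seconds


class SlidingWindow:
    """Keeps (event_time, amount) pairs no older than `size` seconds."""

    def __init__(self, size):
        self.size = size
        self.events = deque()

    def add(self, event_time, amount):
        self.events.append((event_time, amount))
        # Evict everything that fell out of the window; assumes in-order events.
        while self.events and self.events[0][0] <= event_time - self.size:
            self.events.popleft()

    def average(self):
        return sum(amount for _, amount in self.events) / len(self.events)

    def maximum(self):
        return max(amount for _, amount in self.events)


last_week = SlidingWindow(7 * DAY)
last_month = SlidingWindow(30 * DAY)
# Invented sample data: (day, amount)
for day, amount in [(0, 800.0), (20, 30.0), (25, 50.0), (26, 10.0)]:
    last_week.add(day * DAY, amount)
    last_month.add(day * DAY, amount)

print(last_week.average(), last_month.maximum())
```

Keeping every event in the window is the simplest option; for large windows, pre-aggregated buckets (for example one partial sum per hour) trade precision at the window edge for much smaller state.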
28. Enriching the stream based on multiple keys
Diagram: Ev.1 (A, B, IP, amount 1000) → Split by key (A, B, A.B, IP) → Aggregation step → Calculating features (A’, B’, A.B’, IP’)
Accounts are distributed across the task managers (for example: A, E, I, ..; B, D, F, ..; C, G, H, ..; J, K, ..)
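The split step and the distribution of keys can be sketched as follows. This is an illustration under assumptions rather than the talk's code: `split_by_keys`, `task_manager_for` and `NUM_TASK_MANAGERS` are hypothetical names, the IP address is a documentation placeholder, and `zlib.crc32` is only a stand-in for Flink's own key-to-partition assignment.

```python
import zlib

NUM_TASK_MANAGERS = 4  # hypothetical parallelism


def split_by_keys(event):
    """Emit one copy of the event per key it has to be aggregated on."""
    keys = {
        "A": event["source"],
        "B": event["dest"],
        "A.B": f'{event["source"]}.{event["dest"]}',
        "IP": event["ip"],
    }
    return [
        {"key_type": key_type, "key": key, "event": event}
        for key_type, key in keys.items()
    ]


def task_manager_for(key):
    # A stable hash keeps the same key on the same partition across runs.
    return zlib.crc32(key.encode("utf-8")) % NUM_TASK_MANAGERS


event = {"id": "Ev.1", "source": "A", "dest": "B", "ip": "192.0.2.1", "amount": 1000}
parts = split_by_keys(event)
for part in parts:
    print(part["key_type"], part["key"], "->", task_manager_for(part["key"]))
```

Because every copy is routed by its own key, the state for account A, account B, the pair A.B and the IP address can each live on a different task manager, and each is only ever updated in one place.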
29. Aggregating and model scoring
Diagram: Ev.1 (A, B, IP, 1000) → Aggregation → Model Scoring
Feature list for (A.B’, 1000):
1. Amount
2. (A.B).FirstTrx
3. (A.B).NrTrxs
Other diagram labels: A.B’, B’, (B’): 1. B’; 1. IP’, 2. ….
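One way to picture the aggregation step is as a join: the parts of an event that were enriched per key are collected again until all of them have arrived, and only then is the combined feature set handed to model scoring. The sketch below illustrates that idea under assumptions; the slides do not show how the talk's platform does this, and `Aggregator`, `on_part`, `EXPECTED_KEY_TYPES` and the feature names are hypothetical.

```python
EXPECTED_KEY_TYPES = {"A", "B", "A.B", "IP"}


class Aggregator:
    """Buffers enriched parts per event until every key type has reported."""

    def __init__(self):
        self.pending = {}

    def on_part(self, event_id, key_type, features):
        parts = self.pending.setdefault(event_id, {})
        parts[key_type] = features
        if set(parts) != EXPECTED_KEY_TYPES:
            return None  # still waiting for parts from other partitions
        del self.pending[event_id]
        merged = {}
        for part_features in parts.values():
            merged.update(part_features)
        return merged  # complete feature set, ready for model scoring


aggregator = Aggregator()
r1 = aggregator.on_part("Ev.1", "A", {"A.NrTrxs": 3})
r2 = aggregator.on_part("Ev.1", "B", {"B.NrTrxs": 7})
r3 = aggregator.on_part("Ev.1", "A.B", {"A.B.NrTrxs": 1})
r4 = aggregator.on_part("Ev.1", "IP", {"IP.NrAccounts": 2})
print(r1, r2, r3, r4)
```

A production version would also need a timeout for events whose parts never all arrive, otherwise the `pending` buffer grows without bound.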
30. A DSL is a domain-specific language. We use it to define the
behaviour of our operators:
Persist rules (which data to store within state)
Feature calculation rules
Model definition rules
Domain Specific Language (DSL)
32. NrOf Trxs Last 1h
count(between(@(sourceAccntNr.destAccntNr).Trxs, $eventtime, $eventtime-1hour));
First Trx A to B <24h
@(sourceAccntNr.destAccntNr).FirstUsed >= $eventtime-24hours;
Feature Calculation rules
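To show what these two rules compute, here is a plain Python reading of them. It is an interpretation, not the DSL's actual semantics: the state layout, the function names, the use of seconds for timestamps, and the choice to treat both window bounds as inclusive are all assumptions made for this sketch.

```python
HOUR = 3600  # seconds

# Hypothetical state layout: one record per "sourceAccntNr.destAccntNr" key.
state = {
    "A.B": {"Trxs": [100, 4000, 7000], "FirstUsed": 100},
}


def nr_of_trxs_last_1h(state, key, event_time):
    # "NrOf Trxs Last 1h": count the stored transaction times in the last hour.
    trxs = state[key]["Trxs"]
    return sum(1 for t in trxs if event_time - HOUR <= t <= event_time)


def first_trx_within_24h(state, key, event_time):
    # "First Trx A to B <24h": was the pair first used less than 24 hours ago?
    return state[key]["FirstUsed"] >= event_time - 24 * HOUR


print(nr_of_trxs_last_1h(state, "A.B", 7200))
print(first_trx_within_24h(state, "A.B", 7200))
print(first_trx_within_24h(state, "A.B", 100000))
```

Both rules only read from state keyed by the account pair, which is why the persist rules (what to store per key) have to be defined before the feature calculation rules can be evaluated.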
37. Goals
Use of various types of models
One codebase, SaaS Deployment model
Make changes instantly (no downtime)
Multiple domains
38. We have built a feature-extraction engine and used it to make a
Fraud-Risk Engine
Can we also build these?
Customer notifications?
Calculating RFQs for bond prices?
Product fulfilment engine?
Other?
Multiple domains – ponder on this