Hundreds of queries in the time of one - Gianmario Spacagna

Hundreds of queries
in the time of one
Gianmario Spacagna
gianmario.spacagna@barclayscorp.com

• Retail Banking
• 1M+ Barclays business
customers in UK
• 95% small businesses
• Many of them accept
debit card payments
• Huge potential for
Business Analytics

Small businesses can’t
harness their own data and/or
compare their performance
with the competitors
• Monetary cost
• Lack of IT infrastructures
• Lack of analytics expertise
• Lack of market data

Insights Engine
• Calculates a business’ Key
Performance Indicators (KPIs) by
combining hundreds of Business
Intelligence (BI) queries
• Compares them to those of similar
local businesses, i.e. market
competitors
• Filters them to expose only relevant
insights and to preserve privacy
• Presents them in natural language
form
• Collects feedback

Example of Insights
Archetype Comparison Granularity
Time Year
Industry Hairdressing
Location Blackpool
Archetype 1: “Compare to customers in
Segment/Location”
f(A) vs. f(B)
“This year, your business spent £1,000 on electricity, other hairdressing
businesses in Blackpool spent on average £1,200 on electricity”
Archetype 2: “Calculate Growth and
compare to customers in
Segment/Location”
f(A)/f(A’) vs. f(B)/f(B’)
“Based on the transactions for the past 12 months, your customers spend £25
on average each time they visit your business. By comparison, customers of
other hairdressing businesses in Blackpool spend £23.”
Archetype 3: “Calculate % of wallet and
compare to customers in
Segment/Location”
f(A)/g(A) vs. f(B)/g(B)
“This year, your customers spent 2.3% of their income in your business.
Customers of other hairdressers in Blackpool spent 3.5% of their income with
them.”

Technical Challenges
Multiple Operations on the Same Dataset  Optimized
In-Memory Execution Plan  No Unnecessary I/O
Agile  Safe Refactoring Statically Typed
Parallel Architecture  Map/Reduce &
Composable High-Orders Functions
Functional Programming
None of the above would be feasible in traditional RDBMS!

Technical Challenges
Multiple Operations on the Same Dataset  Optimized
In-Memory Execution Plan  No Unnecessary I/O
Agile  Safe Refactoring Statically Typed
Parallel Architecture  Map/Reduce &
Composable High-Orders Functions
Functional Programming
Building complex production-quality applications 
• Flexibility
• Richness
• High-level features
• Native to the computation framework
None of the above would be feasible in traditional RDBMS!

Domain Specific Language
• Elegant: few lines where SQL would use more than 200
• Natural: use English to specify, implement and test
• Compiled: easy to spot mistakes

Commutative Monoids
Algebraic Structure made of (T, |+|, Zero):
• Type T
• Binary operator |+|: (T, T) => T
• Neutral element Zero: T
sum = (Int, +, 0), multiplication = (Int, *, 1), distinct = (Set, ++, Set.empty)
Properties:
• (Associativity, Commutability) => Parallelizable aggregation
• Identity: t |+| Zero = t
Our composable “Count and Sum of Amount” monoid:
– ((Int, Int), (count1 + count2, sum1 + sum2), (0, 0))
– val toMonoid = (t: Transaction) => (1, t.amount)
– e.g. (4, 120) |+| (1, 80) = (5, 200)
– Can even achieve median using probabilistic monoids, or
DistinctCountBounds using hash sets
(13,790)
(9, 630)
(3, 400)
(1, 120)
(1, 200)
(1, 80)
(2, 90)
(1, 60)
(1, 30)
(4, 140)
(1, 90)
(1, 10)
(1, 30)
(1, 10)
(4, 160)
(2, 50)
(1, 20)
(1, 30)
(2, 110)
(1, 60)
(1, 50)

Insight Part Optimization
Minimum Set of Insight Parts
Insight
Type 3
Insight
Type 2
Insight
Type 1
Part1
Part2
Part3 Part4
Part5
Given the set of insight types:
• Optimize the minimum number of informative insight parts to compute
• Each part consists of:
– Filters (bookkeeping, paymentType, spendCategory)
– Monoids (Sum, Count, Distinct, Median…)
Example:
- Type: “Energy spending growth”
- 4 Insight Parts:
“(Count & Sum of Amount) of Outgoing Transactions for Energy” of:
1. this business in current timeslot
2. this business in previous timeslot
3. competitors in current timeslot
4. competitors in previous timeslot

Hierarchical Aggregation
Time:
• Month
• Quarter
• Half year
• Year
Industry:
• Sub-segment
• Segment
• All
Location:
• Area
• County
• Country
• Whole UK
day
businessId
filters:
• bookkeeping = Outgoing
• spendCategory = Electricity
Cell contains all of the
monoids sliced by
filters of each extracted
Part and aggregated at
that granularity
combination
(1, t.amount)
Given the set of insight parts and granularities:
• Fill the OLAP cube starting from the finest
granularity levels
• Example:
(Month, Hairdressing, Blackpool) ->
List(part1@Part(filters = List(Ingoing),
monoid = (Count(122), Sum(11928)),
part2@Part(filters =
List(Electricity, Outgoing),
monoid = (Sum(89))
• Further aggregation on ascending levels

Derive Insights from the
“memoized” results
Atomic
Monoids
Memoized
Parts
Insight
1. this busines
s in current
year
2. this business
in previous
year
4. competitors in:
previous year,
Blackpool,
hairdressing
“This year, your
business spent £1,000
on electricity, other
hairdressing businesses
in Blackpool spent on
average £1,200 on
electricity”3. competitors in:
current year,
Blackpool,
hairdressing
This business parts (1 and 2)
comparison parts in each
granularity combination (3 and 4)
Collate
Function
• Growth
• Compare
• Ratios
• Best day of week

DSL
monoids
comparison levels
comparison time
filters
collate function

Some Numbers
• 700,000,000 rows of data (2 years worth)
• 275,000 UK Businesses
• 66 Insights for each Business
– Upper Bound:
9 main queries (insight types) *
4 sub-queries (insight parts) *
36 granularities (dimension combos) =
1296 Queries
– Filtered on Privacy and Ranked by Relevance
• The Engine ran in 30 minutes
– on a small low-performance cluster
6 x (20 CPUs, 48G RAM)
– 500x faster than Hive and probably wouldn’t
return on Teradata

Summary
Insights Engine is an analytical engine that takes hundreds of queries as input and generates an
optimized single execution plan by combining and re-using intermediate results for each business and
each combination of granularity over multiple hierarchical dimensions.
What is cool about it:
1. Composable ”Monoids” allowing aggregations at multiple
levels of granularity, like a tree
2. A DSL that defines Insights succinctly
(3 lines of code vs ~250 lines of SQL)
3. Inspection of the queries specified in the DSL to find
"duplicate" structures of computation (Insight Parts), and
up-front “memoization" to ensure they are only computed
once
4. Ensure all of the results are privacy-safe and relevance-
ranked

Follow-up Links
• Insights Engine Blog on Cloudera:
http://blog.cloudera.com/blog/2015/08/how-apache-spark-scala-and-functional-
programming-made-hard-problems-easy-at-barclays/
• Fast accurate low-memory aggregations DSL for Spark:
https://github.com/samthebest/aggregations
• Contribute to the Agile Data Science Manifesto:
www.datasciencemanifesto.com

Hundreds of queries in the time of one - Gianmario Spacagna

Recommended

Recommended

More Related Content

What's hot

What's hot (20)

Viewers also liked

Viewers also liked (18)

Similar to Hundreds of queries in the time of one - Gianmario Spacagna

Similar to Hundreds of queries in the time of one - Gianmario Spacagna (20)

More from Spark Summit

More from Spark Summit (20)

Recently uploaded

Recently uploaded (20)

Hundreds of queries in the time of one - Gianmario Spacagna