Hundreds of queries
in the time of one
Gianmario Spacagna
gianmario.spacagna@barclayscorp.com
• Retail Banking
• 1M+ Barclays business
customers in UK
• 95% small businesses
• Many of them accept
debit card payments
• Huge potential for
Business Analytics
Small businesses can’t
harness their own data and/or
compare their performance
with that of their competitors
• Monetary cost
• Lack of IT infrastructure
• Lack of analytics expertise
• Lack of market data
Insights Engine
• Calculates a business’ Key
Performance Indicators (KPIs) by
combining hundreds of Business
Intelligence (BI) queries
• Compares them to those of similar
local businesses, i.e. market
competitors
• Filters them to expose only relevant
insights and to preserve privacy
• Presents them in natural language
form
• Collects feedback
Example of Insights
Archetype comparison granularity:
• Time: Year
• Industry: Hairdressing
• Location: Blackpool
Archetype 1: “Compare to customers in
Segment/Location”
f(A) vs. f(B)
“This year, your business spent £1,000 on electricity, other hairdressing
businesses in Blackpool spent on average £1,200 on electricity”
Archetype 2: “Calculate Growth and
compare to customers in
Segment/Location”
f(A)/f(A’) vs. f(B)/f(B’)
“Based on the transactions for the past 12 months, your customers spend £25
on average each time they visit your business. By comparison, customers of
other hairdressing businesses in Blackpool spend £23.”
Archetype 3: “Calculate % of wallet and
compare to customers in
Segment/Location”
f(A)/g(A) vs. f(B)/g(B)
“This year, your customers spent 2.3% of their income in your business.
Customers of other hairdressers in Blackpool spent 3.5% of their income with
them.”
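The three archetypes differ only in how the measures for this business (A) and its competitors (B) are combined. A minimal Scala sketch of the three formulas, assuming hypothetical function names (`compare`, `growth`, `shareOfWallet`) that do not come from the talk; the primed values A’ and B’ are read here as the previous timeslot:

```scala
object Archetypes {
  // Archetype 1: f(A) vs. f(B)
  def compare(fA: Double, fB: Double): (Double, Double) = (fA, fB)

  // Archetype 2: f(A)/f(A') vs. f(B)/f(B'); the "Prev" arguments are the
  // same measure in the previous timeslot (an assumption of this sketch).
  def growth(fA: Double, fAPrev: Double, fB: Double, fBPrev: Double): (Double, Double) =
    (fA / fAPrev, fB / fBPrev)

  // Archetype 3: f(A)/g(A) vs. f(B)/g(B), e.g. spend in the business
  // divided by customer income.
  def shareOfWallet(fA: Double, gA: Double, fB: Double, gB: Double): (Double, Double) =
    (fA / gA, fB / gB)
}
```

Each function returns the pair (this business, competitors), which a later step can turn into a sentence.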
Technical Challenges
• Multiple Operations on the Same Dataset -> Optimized In-Memory Execution Plan -> No Unnecessary I/O
• Agile -> Safe Refactoring -> Statically Typed
• Parallel Architecture -> Map/Reduce & Composable Higher-Order Functions -> Functional Programming
None of the above would be feasible in traditional RDBMS!
Building complex production-quality applications calls for:
• Flexibility
• Richness
• High-level features
• Native to the computation framework
Domain Specific Language
• Elegant: few lines where SQL would use more than 200
• Natural: use English to specify, implement and test
• Compiled: easy to spot mistakes
Commutative Monoids
Algebraic Structure made of (T, |+|, Zero):
• Type T
• Binary operator |+|: (T, T) => T
• Neutral element Zero: T
sum = (Int, +, 0), multiplication = (Int, *, 1), distinct = (Set, ++, Set.empty)
Properties:
• (Associativity, Commutativity) => Parallelizable aggregation
• Identity: t |+| Zero = t
Our composable “Count and Sum of Amount” monoid:
– ((Int, Int), (count1 + count2, sum1 + sum2), (0, 0))
– val toMonoid = (t: Transaction) => (1, t.amount)
– e.g. (4, 120) |+| (1, 80) = (5, 200)
– Can even achieve median using probabilistic monoids, or
DistinctCountBounds using hash sets
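The “Count and Sum of Amount” monoid above can be sketched in a few lines of Scala. This is an illustrative sketch, not the engine's code: the `Monoid` trait, the `Transaction` case class and the `aggregate` helper are assumptions; only `toMonoid` and the (count, sum) pair come from the slide.

```scala
object CountSum {
  // Hypothetical transaction type; only `amount` appears on the slide.
  final case class Transaction(amount: Int)

  // A commutative monoid (T, |+|, Zero) written as a small trait.
  trait Monoid[T] {
    def zero: T
    def plus(a: T, b: T): T
  }

  // "Count and Sum of Amount": ((Int, Int), pairwise +, (0, 0)).
  val countAndSum: Monoid[(Int, Int)] = new Monoid[(Int, Int)] {
    val zero: (Int, Int) = (0, 0)
    def plus(a: (Int, Int), b: (Int, Int)): (Int, Int) = (a._1 + b._1, a._2 + b._2)
  }

  val toMonoid: Transaction => (Int, Int) = t => (1, t.amount)

  // Map every transaction to a monoid value, then fold from Zero.
  // Associativity and commutativity mean partial results computed on
  // different partitions can be merged in any order.
  def aggregate(ts: Seq[Transaction]): (Int, Int) =
    ts.map(toMonoid).foldLeft(countAndSum.zero)(countAndSum.plus)
}
```

For example, `CountSum.countAndSum.plus((4, 120), (1, 80))` gives `(5, 200)`, matching the slide.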
Aggregation tree of (count, sum) pairs; each node is the |+| of its children:
(13, 790)
  (9, 630)
    (3, 400): (1, 120), (1, 200), (1, 80)
    (2, 90): (1, 60), (1, 30)
    (4, 140): (1, 90), (1, 10), (1, 30), (1, 10)
  (4, 160)
    (2, 50): (1, 20), (1, 30)
    (2, 110): (1, 60), (1, 50)
Insight Part Optimization
[Diagram: Minimum Set of Insight Parts. Insight Types 1, 2 and 3 share Parts 1 to 5.]
Given the set of insight types:
• Find the minimum set of informative insight parts that need to be computed
• Each part consists of:
– Filters (bookkeeping, paymentType, spendCategory)
– Monoids (Sum, Count, Distinct, Median…)
Example:
- Type: “Energy spending growth”
- 4 Insight Parts:
“(Count & Sum of Amount) of Outgoing Transactions for Energy” of:
1. this business in current timeslot
2. this business in previous timeslot
3. competitors in current timeslot
4. competitors in previous timeslot
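One way to picture the extraction of a minimum set of parts is as a set union over all insight types, so that structurally equal parts collapse into one. A minimal Scala sketch under that assumption; `Part`, `InsightType`, their fields and the second insight type are hypothetical and only mirror the example above:

```scala
object PartExtraction {
  // Hypothetical model: names and fields are illustrative, not the engine's API.
  final case class Part(filters: Set[String], monoid: String, subject: String, timeslot: String)
  final case class InsightType(name: String, parts: Set[Part])

  // The minimum set of parts is the union of the parts of all insight types.
  // Case-class equality makes identical parts collapse, so each one is
  // computed only once.
  def minimalParts(types: Seq[InsightType]): Set[Part] =
    types.flatMap(_.parts).toSet

  private def energy(subject: String, timeslot: String): Part =
    Part(Set("Outgoing", "Energy"), "CountAndSum", subject, timeslot)

  // The four parts of "Energy spending growth" from the example.
  val energyGrowth: InsightType = InsightType("Energy spending growth", Set(
    energy("business", "current"), energy("business", "previous"),
    energy("competitors", "current"), energy("competitors", "previous")))

  // A second, invented insight type that reuses two of those four parts.
  val energyCompare: InsightType = InsightType("Energy spending comparison", Set(
    energy("business", "current"), energy("competitors", "current")))
}
```

Here the two insight types declare six parts in total, but `minimalParts` returns only the four distinct ones.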
Hierarchical Aggregation
Time:
• Month
• Quarter
• Half year
• Year
Industry:
• Sub-segment
• Segment
• All
Location:
• Area
• County
• Country
• Whole UK
[Cube diagram; axis labels: day, businessId; cell value: (1, t.amount)]
Example filters:
• bookkeeping = Outgoing
• spendCategory = Electricity
Each cell contains all of the monoids, sliced by the filters of each extracted Part
and aggregated at that granularity combination.
Given the set of insight parts and granularities:
• Fill the OLAP cube starting from the finest
granularity levels
• Example:
(Month, Hairdressing, Blackpool) ->
  List(part1@Part(filters = List(Ingoing),
                  monoid = (Count(122), Sum(11928))),
       part2@Part(filters = List(Electricity, Outgoing),
                  monoid = Sum(89)))
• Further aggregation on ascending levels
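Because the cell values are monoids, a coarser level can be derived from the finer one without rescanning the transactions. A minimal Scala sketch of one roll-up step, assuming a hypothetical `Key` type and a caller-supplied `coarsen` function; it uses plain in-memory collections rather than Spark:

```scala
object RollUp {
  // Hypothetical cell key: one level per dimension.
  final case class Key(time: String, industry: String, location: String)

  type CountSum = (Int, Int)
  val zero: CountSum = (0, 0)
  def plus(a: CountSum, b: CountSum): CountSum = (a._1 + b._1, a._2 + b._2)

  // Aggregate finer cells into coarser ones: `coarsen` maps each fine key to
  // its parent key, and cells that share a parent are merged with |+|.
  def rollUp(cells: Map[Key, CountSum])(coarsen: Key => Key): Map[Key, CountSum] =
    cells.toSeq
      .groupBy { case (key, _) => coarsen(key) }
      .map { case (parent, group) => parent -> group.map(_._2).foldLeft(zero)(plus) }
}
```

For instance, two monthly cells for (Hairdressing, Blackpool) holding (2, 90) and (4, 140) roll up into a single yearly cell holding (6, 230) when `coarsen` truncates the month to its year.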
Derive Insights from the
“memoized” results
Stages: Atomic Monoids, Memoized Parts, Insight
Memoized parts:
1. this business in current year
2. this business in previous year
3. competitors in: current year, Blackpool, hairdressing
4. competitors in: previous year, Blackpool, hairdressing
Parts 1 and 2 are this business’ parts; parts 3 and 4 are the comparison parts
in each granularity combination.
Insight:
“This year, your business spent £1,000 on electricity, other hairdressing
businesses in Blackpool spent on average £1,200 on electricity”
Collate
Function
• Growth
• Compare
• Ratios
• Best day of week
DSL components:
• monoids
• comparison levels
• comparison time
• filters
• collate function
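The talk does not show the DSL itself, so the following is only a guess at its shape: a hypothetical Scala case class with one field per component listed above. Every name here (`InsightDef`, its fields, `electricitySpend`) is an assumption made for illustration.

```scala
object DslSketch {
  // All names are hypothetical; the fields only mirror the five listed components.
  final case class InsightDef(
    monoids: List[String],
    comparisonLevels: List[String],
    comparisonTime: String,
    filters: List[String],
    collate: (Double, Double) => String
  )

  // An electricity-spend insight in the style of Archetype 1.
  val electricitySpend: InsightDef = InsightDef(
    monoids = List("Sum"),
    comparisonLevels = List("Sub-segment", "Area"),
    comparisonTime = "Year",
    filters = List("Outgoing", "Electricity"),
    // The collate function turns the two aggregated values into a sentence.
    collate = (mine, theirs) =>
      f"your business spent £$mine%.0f on electricity, similar businesses spent on average £$theirs%.0f"
  )
}
```

Keeping the definition as plain data is what would let an engine inspect many such definitions and find the parts they have in common.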
Some Numbers
• 700,000,000 rows of data (2 years’ worth)
• 275,000 UK Businesses
• 66 Insights for each Business
– Upper Bound:
9 main queries (insight types) *
4 sub-queries (insight parts) *
36 granularities (dimension combos) =
1296 Queries
– Filtered on Privacy and Ranked by Relevance
• The Engine ran in 30 minutes
– on a small, low-performance cluster:
6 x (20 CPUs, 48 GB RAM)
– 500x faster than Hive; the same workload probably
would not complete on Teradata
Summary
Insights Engine is an analytical engine that takes hundreds of queries as input and generates an
optimized single execution plan by combining and re-using intermediate results for each business and
each combination of granularity over multiple hierarchical dimensions.
What is cool about it:
1. Composable “Monoids” allowing aggregations at multiple
levels of granularity, like a tree
2. A DSL that defines Insights succinctly
(3 lines of code vs ~250 lines of SQL)
3. Inspection of the queries specified in the DSL to find
“duplicate” structures of computation (Insight Parts), and
up-front “memoization” to ensure they are only computed
once
4. Filtering and ranking so that all of the results are
privacy-safe and relevance-ranked
Follow-up Links
• Insights Engine Blog on Cloudera:
http://blog.cloudera.com/blog/2015/08/how-apache-spark-scala-and-functional-programming-made-hard-problems-easy-at-barclays/
• Fast accurate low-memory aggregations DSL for Spark:
https://github.com/samthebest/aggregations
• Contribute to the Agile Data Science Manifesto:
www.datasciencemanifesto.com

Gulbai Tekra * Cheap Call Girls In Ahmedabad Phone No 8005736733 Elite Escort...
 
Vadodara 💋 Call Girl 7737669865 Call Girls in Vadodara Escort service book now
Vadodara 💋 Call Girl 7737669865 Call Girls in Vadodara Escort service book nowVadodara 💋 Call Girl 7737669865 Call Girls in Vadodara Escort service book now
Vadodara 💋 Call Girl 7737669865 Call Girls in Vadodara Escort service book now
 
Lecture_2_Deep_Learning_Overview-newone1
Lecture_2_Deep_Learning_Overview-newone1Lecture_2_Deep_Learning_Overview-newone1
Lecture_2_Deep_Learning_Overview-newone1
 
High Profile Call Girls Service in Jalore { 9332606886 } VVIP NISHA Call Girl...
High Profile Call Girls Service in Jalore { 9332606886 } VVIP NISHA Call Girl...High Profile Call Girls Service in Jalore { 9332606886 } VVIP NISHA Call Girl...
High Profile Call Girls Service in Jalore { 9332606886 } VVIP NISHA Call Girl...
 
Statistics notes ,it includes mean to index numbers
Statistics notes ,it includes mean to index numbersStatistics notes ,it includes mean to index numbers
Statistics notes ,it includes mean to index numbers
 
Sealdah % High Class Call Girls Kolkata - 450+ Call Girl Cash Payment 8005736...
Sealdah % High Class Call Girls Kolkata - 450+ Call Girl Cash Payment 8005736...Sealdah % High Class Call Girls Kolkata - 450+ Call Girl Cash Payment 8005736...
Sealdah % High Class Call Girls Kolkata - 450+ Call Girl Cash Payment 8005736...
 
Charbagh + Female Escorts Service in Lucknow | Starting ₹,5K To @25k with A/C...
Charbagh + Female Escorts Service in Lucknow | Starting ₹,5K To @25k with A/C...Charbagh + Female Escorts Service in Lucknow | Starting ₹,5K To @25k with A/C...
Charbagh + Female Escorts Service in Lucknow | Starting ₹,5K To @25k with A/C...
 
Top profile Call Girls In Begusarai [ 7014168258 ] Call Me For Genuine Models...
Top profile Call Girls In Begusarai [ 7014168258 ] Call Me For Genuine Models...Top profile Call Girls In Begusarai [ 7014168258 ] Call Me For Genuine Models...
Top profile Call Girls In Begusarai [ 7014168258 ] Call Me For Genuine Models...
 
Top Call Girls in Balaghat 9332606886Call Girls Advance Cash On Delivery Ser...
Top Call Girls in Balaghat  9332606886Call Girls Advance Cash On Delivery Ser...Top Call Girls in Balaghat  9332606886Call Girls Advance Cash On Delivery Ser...
Top Call Girls in Balaghat 9332606886Call Girls Advance Cash On Delivery Ser...
 
Top profile Call Girls In dimapur [ 7014168258 ] Call Me For Genuine Models W...
Top profile Call Girls In dimapur [ 7014168258 ] Call Me For Genuine Models W...Top profile Call Girls In dimapur [ 7014168258 ] Call Me For Genuine Models W...
Top profile Call Girls In dimapur [ 7014168258 ] Call Me For Genuine Models W...
 
Abortion pills in Jeddah | +966572737505 | Get Cytotec
Abortion pills in Jeddah | +966572737505 | Get CytotecAbortion pills in Jeddah | +966572737505 | Get Cytotec
Abortion pills in Jeddah | +966572737505 | Get Cytotec
 
Reconciling Conflicting Data Curation Actions: Transparency Through Argument...
Reconciling Conflicting Data Curation Actions:  Transparency Through Argument...Reconciling Conflicting Data Curation Actions:  Transparency Through Argument...
Reconciling Conflicting Data Curation Actions: Transparency Through Argument...
 
In Riyadh ((+919101817206)) Cytotec kit @ Abortion Pills Saudi Arabia
In Riyadh ((+919101817206)) Cytotec kit @ Abortion Pills Saudi ArabiaIn Riyadh ((+919101817206)) Cytotec kit @ Abortion Pills Saudi Arabia
In Riyadh ((+919101817206)) Cytotec kit @ Abortion Pills Saudi Arabia
 
TrafficWave Generator Will Instantly drive targeted and engaging traffic back...
TrafficWave Generator Will Instantly drive targeted and engaging traffic back...TrafficWave Generator Will Instantly drive targeted and engaging traffic back...
TrafficWave Generator Will Instantly drive targeted and engaging traffic back...
 
Kings of Saudi Arabia, information about them
Kings of Saudi Arabia, information about themKings of Saudi Arabia, information about them
Kings of Saudi Arabia, information about them
 

Hundreds of queries in the time of one - Gianmario Spacagna

  • 1. Hundreds of queries in the time of one Gianmario Spacagna gianmario.spacagna@barclayscorp.com
  • 2. • Retail Banking • 1M+ Barclays business customers in UK • 95% small businesses • Many of them accept debit card payments • Huge potential for Business Analytics
  • 3. Small businesses can’t harness their own data and/or compare their performance with that of their competitors • Monetary cost • Lack of IT infrastructure • Lack of analytics expertise • Lack of market data
  • 4. Insights Engine • Calculates a business’ Key Performance Indicators (KPIs) by combining hundreds of Business Intelligence (BI) queries • Compares them to those of similar local businesses, i.e. market competitors • Filters them to expose only relevant insights and to preserve privacy • Presents them in natural language form • Collects feedback
  • 5. Example of Insights • Comparison granularity: Time = Year, Industry = Hairdressing, Location = Blackpool • Archetype 1: “Compare to customers in Segment/Location”, f(A) vs. f(B): “This year, your business spent £1,000 on electricity; other hairdressing businesses in Blackpool spent on average £1,200 on electricity.” • Archetype 2: “Calculate Growth and compare to customers in Segment/Location”, f(A)/f(A’) vs. f(B)/f(B’): “Based on the transactions for the past 12 months, your customers spend £25 on average each time they visit your business. By comparison, customers of other hairdressing businesses in Blackpool spend £23.” • Archetype 3: “Calculate % of wallet and compare to customers in Segment/Location”, f(A)/g(A) vs. f(B)/g(B): “This year, your customers spent 2.3% of their income in your business. Customers of other hairdressers in Blackpool spent 3.5% of their income with them.”
  • 6. Technical Challenges • Multiple Operations on the Same Dataset → Optimized In-Memory Execution Plan → No Unnecessary I/O • Agile → Safe Refactoring, Statically Typed • Parallel Architecture → Map/Reduce & Composable Higher-Order Functions, Functional Programming • None of the above would be feasible in a traditional RDBMS!
  • 7. Technical Challenges • Multiple Operations on the Same Dataset → Optimized In-Memory Execution Plan → No Unnecessary I/O • Agile → Safe Refactoring, Statically Typed • Parallel Architecture → Map/Reduce & Composable Higher-Order Functions, Functional Programming • Building complex production-quality applications → Flexibility, Richness, High-level features, Native to the computation framework • None of the above would be feasible in a traditional RDBMS!
  • 8. Domain Specific Language • Elegant: few lines where SQL would use more than 200 • Natural: use English to specify, implement and test • Compiled: easy to spot mistakes
  • 9. Commutative Monoids • Algebraic structure made of (T, |+|, Zero): – Type T – Binary operator |+|: (T, T) => T – Neutral element Zero: T • Examples: sum = (Int, +, 0), multiplication = (Int, *, 1), distinct = (Set, ++, Set.empty) • Properties: – (Associativity, Commutativity) => Parallelizable aggregation – Identity: t |+| Zero = t • Our composable “Count and Sum of Amount” monoid: – ((Int, Int), (count1 + count2, sum1 + sum2), (0, 0)) – val toMonoid = (t: Transaction) => (1, t.amount) – e.g. (4, 120) |+| (1, 80) = (5, 200) – Can even achieve median using probabilistic monoids, or DistinctCountBounds using hash sets • [Figure: aggregation tree of (count, sum) pairs; leaves such as (1, 120), (1, 200) and (1, 80) combine into (3, 400), which feeds (9, 630), which combines with (4, 160) into the root (13, 790)]
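
The “Count and Sum of Amount” monoid from slide 9 could be sketched in Scala roughly as follows. This is a minimal illustration rather than the engine's actual code: the `Monoid` trait, the `CountAndSum` object, the `aggregate` helper and the `businessId` field on `Transaction` are assumed names; only `toMonoid` and the (count, sum) pair come from the slide.

```scala
// Hypothetical sketch: Monoid, CountAndSum and this Transaction shape are
// illustrative names, not the Insights Engine's real types.
final case class Transaction(businessId: String, amount: Int)

trait Monoid[T] {
  def zero: T               // neutral element
  def plus(a: T, b: T): T   // the |+| operator from the slide
}

object CountAndSum extends Monoid[(Int, Int)] {
  val zero: (Int, Int) = (0, 0)

  // Component-wise addition: (count1 + count2, sum1 + sum2)
  def plus(a: (Int, Int), b: (Int, Int)): (Int, Int) =
    (a._1 + b._1, a._2 + b._2)

  // Lift one transaction into the monoid, as on the slide.
  val toMonoid: Transaction => (Int, Int) = t => (1, t.amount)

  // Fold any collection of transactions down to a single (count, sum) pair.
  def aggregate(ts: Seq[Transaction]): (Int, Int) =
    ts.map(toMonoid).foldLeft(zero)(plus)
}
```

Because `plus` is associative and commutative, aggregating two halves of a dataset separately and then combining them with `plus` gives the same pair as aggregating the whole dataset, which is the property that makes the aggregation parallelizable.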
  • 10. Insight Part Optimization • [Figure: Insight Types 1, 2 and 3 mapped onto a minimum set of shared Insight Parts, Part1 to Part5] • Given the set of insight types: – Find the minimum set of informative insight parts to compute – Each part consists of: Filters (bookkeeping, paymentType, spendCategory) and Monoids (Sum, Count, Distinct, Median…) • Example: – Type: “Energy spending growth” – 4 Insight Parts: “(Count & Sum of Amount) of Outgoing Transactions for Energy” of: 1. this business in current timeslot 2. this business in previous timeslot 3. competitors in current timeslot 4. competitors in previous timeslot
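
One way to picture the insight-part optimization on slide 10 is as de-duplication: if parts are plain values, the minimum set is the set union of the parts requested by every insight type. The sketch below assumes that reading. `InsightPart`, `InsightType`, `PartPlanner` and the second insight type (“Energy spending comparison”) are hypothetical; the real engine inspects its DSL and may work quite differently.

```scala
// Hypothetical model: field names and string encodings are assumptions.
final case class InsightPart(filters: Set[String], monoid: String,
                             subject: String, timeslot: String)
final case class InsightType(name: String, parts: Seq[InsightPart])

object PartPlanner {
  // Case-class equality makes identical parts collapse when collected in a Set.
  def minimalParts(types: Seq[InsightType]): Set[InsightPart] =
    types.flatMap(_.parts).toSet

  private val energy = Set("Outgoing", "Energy")
  private def part(subject: String, timeslot: String): InsightPart =
    InsightPart(energy, "CountAndSum", subject, timeslot)

  // The four parts listed on the slide for "Energy spending growth".
  val growth: InsightType = InsightType("Energy spending growth", Seq(
    part("thisBusiness", "current"), part("thisBusiness", "previous"),
    part("competitors", "current"), part("competitors", "previous")))

  // An assumed second insight type that reuses two of those parts.
  val comparison: InsightType = InsightType("Energy spending comparison", Seq(
    part("thisBusiness", "current"), part("competitors", "current")))
}
```

Here `PartPlanner.minimalParts(Seq(PartPlanner.growth, PartPlanner.comparison))` contains four parts rather than six, since the two parts of the comparison insight are already needed by the growth insight.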
  • 11. Hierarchical Aggregation • Time: Month, Quarter, Half year, Year • Industry: Sub-segment, Segment, All • Location: Area, County, Country, Whole UK • [Figure: OLAP cube with day and businessId labels; filters: bookkeeping = Outgoing, spendCategory = Electricity; input (1, t.amount). Each cell contains all of the monoids sliced by the filters of each extracted Part and aggregated at that granularity combination] • Given the set of insight parts and granularities: – Fill the OLAP cube starting from the finest granularity levels – Example: (Month, Hairdressing, Blackpool) -> List(part1@Part(filters = List(Ingoing), monoid = (Count(122), Sum(11928))), part2@Part(filters = List(Electricity, Outgoing), monoid = (Sum(89)))) – Further aggregation on ascending levels
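
The “further aggregation on ascending levels” step from slide 11 can be illustrated with a small roll-up over (count, sum) cells. This is only a sketch under assumptions: the `Rollup` object, the `(time, industry, location)` key layout and the `"yyyy-MM"` month encoding are invented for the example and are not taken from the deck.

```scala
object Rollup {
  type Cell = (Int, Int)               // (count, sum of amount)
  type Key  = (String, String, String) // (time, industry, location), assumed layout

  def plus(a: Cell, b: Cell): Cell = (a._1 + b._1, a._2 + b._2)

  // Re-key every fine-grained cell to its coarser parent and merge the
  // cells that land on the same parent with the monoid operation.
  def rollUp(cells: Map[Key, Cell])(coarsen: Key => Key): Map[Key, Cell] =
    cells
      .groupBy { case (key, _) => coarsen(key) }
      .map { case (coarseKey, group) =>
        coarseKey -> group.values.foldLeft((0, 0))(plus)
      }

  // Assumed encoding: months are "yyyy-MM" strings, so the year is the first four characters.
  val monthToYear: Key => Key = {
    case (month, industry, location) => (month.take(4), industry, location)
  }
}
```

For instance, monthly cells `("2015-01", "Hairdressing", "Blackpool") -> (1, 90)` and `("2015-02", "Hairdressing", "Blackpool") -> (2, 200)` roll up to a yearly cell `("2015", "Hairdressing", "Blackpool") -> (3, 290)` without re-reading any transactions; the same function could be reused with a different `coarsen` for the industry or location dimension.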
  • 12. Derive Insights from the “memoized” results • [Figure labels: Atomic Monoids, Memoized Parts, Insight] • This business parts: 1. this business in current year 2. this business in previous year • Comparison parts in each granularity combination: 3. competitors in: current year, Blackpool, hairdressing 4. competitors in: previous year, Blackpool, hairdressing • Collate Function: Growth, Compare, Ratios, Best day of week • Resulting insight: “This year, your business spent £1,000 on electricity, other hairdressing businesses in Blackpool spent on average £1,200 on electricity”
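
A collate function in the sense of slide 12 takes already-computed (count, sum) parts and derives the numbers an insight reports. The sketch below is an assumption about what “Growth” and “Compare” might look like, following the f(A)/f(A’) vs. f(B)/f(B’) archetype; the `Collate` object and its method names are hypothetical.

```scala
object Collate {
  type Cell = (Int, Int) // (count, sum of amount) for one memoized part

  // Average amount per transaction; None when there is nothing to average.
  def average(cell: Cell): Option[Double] =
    if (cell._1 == 0) None else Some(cell._2.toDouble / cell._1)

  // Growth as the ratio of current to previous total, f(A) / f(A').
  def growth(current: Cell, previous: Cell): Option[Double] =
    if (previous._2 == 0) None else Some(current._2.toDouble / previous._2)

  // Pair the business's growth with its competitors' growth; the result
  // is None if either side lacks a usable previous-period total.
  def compareGrowth(thisCur: Cell, thisPrev: Cell,
                    compCur: Cell, compPrev: Cell): Option[(Double, Double)] =
    for {
      mine   <- growth(thisCur, thisPrev)
      theirs <- growth(compCur, compPrev)
    } yield (mine, theirs)
}
```

Returning `Option` keeps the missing-data case explicit, so an insight with no previous-year spend can simply be dropped instead of reporting a division by zero.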
  • 14. Some Numbers • 700,000,000 rows of data (2 years’ worth) • 275,000 UK Businesses • 66 Insights for each Business – Upper Bound: 9 main queries (insight types) * 4 sub-queries (insight parts) * 36 granularities (dimension combos) = 1,296 queries – Filtered on Privacy and Ranked by Relevance • The Engine ran in 30 minutes – on a small low-performance cluster: 6 x (20 CPUs, 48 GB RAM) – 500x faster than Hive, and probably wouldn’t return on Teradata
  • 15. Summary • Insights Engine is an analytical engine that takes hundreds of queries as input and generates a single optimized execution plan by combining and re-using intermediate results for each business and each combination of granularity over multiple hierarchical dimensions. • What is cool about it: 1. Composable “Monoids” allowing aggregations at multiple levels of granularity, like a tree 2. A DSL that defines Insights succinctly (3 lines of code vs ~250 lines of SQL) 3. Inspection of the queries specified in the DSL to find “duplicate” structures of computation (Insight Parts), and up-front “memoization” to ensure they are only computed once 4. Ensuring that all of the results are privacy-safe and relevance-ranked
  • 16. Follow-up Links • Insights Engine Blog on Cloudera: http://blog.cloudera.com/blog/2015/08/how-apache-spark-scala-and-functional-programming-made-hard-problems-easy-at-barclays/ • Fast accurate low-memory aggregations DSL for Spark: https://github.com/samthebest/aggregations • Contribute to the Agile Data Science Manifesto: www.datasciencemanifesto.com