Large-scale Feature Aggregation Using Apache Spark
Pulkit Bhanot, Amit Nene
Risk Platform
#Dev1SAIS
Agenda
• Motivation
• Challenges
• Architecture Deep Dive
• Role of Spark
• Takeaways
Team Mission
Build a scalable, self-service Feature Engineering Platform for
predictive decisioning (based on ML and Business Rules)
• Feature Engineering: use of domain knowledge to create Features
• Self-Service for Data Scientists and Analysts without reliance on Engineers
• Generic Platform: consolidating work towards wider ML efforts at Uber
Sample use case: Fraud detection
Detect and prevent bad actors in real-time
• Promotions
• Payments
Example aggregations:
• number of trips over X hours/weeks
• trips cancelled over Y months
• count of referrals over lifetime
• ...
Time Series Aggregations
Needs
• Lifetime of entity
• Sliding long-window: days/weeks/months
• Sliding short-window: mins/hours
• Real-time
Existing solutions
• Warehouse, indexed databases, streaming aggregations
• None satisfies all of the above ("None fits the bill!")
• Complex onboarding
Technical Challenges
• Scale: thousands of aggregations for hundreds of millions of
business entities
• Long-window aggregation queries are slow even with
indexes (seconds); millisecond latency at high QPS is needed
• Onboarding complexity: many moving parts
• Recovery from accumulated errors
Our approach
One-stop shop for aggregation
• Single system to interact with
• Single spec: automate configuration of the underlying systems
Scalable
• Leverage the scale of the batch system for long windows
• Combine with real-time aggregation for freshness
• Rollups: aggregate over time intervals
• Fast query over rolled-up aggregates
• Caveat: summable functions
Self-healing
• Batch system auto-corrects any errors
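The rollup idea above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the platform's code: the class name `RollupWindow`, the in-memory `TreeMap`, and the sample trip counts are all hypothetical. It shows why the caveat matters: per-day aggregates can simply be added to answer any N-day window, which only works for summable functions such as sum and count.

```java
import java.time.LocalDate;
import java.util.NavigableMap;
import java.util.TreeMap;

// Hypothetical sketch: answer a sliding-window query from daily rollups.
final class RollupWindow {
    // Sums the daily buckets in the window of `days` days ending at `end` (inclusive).
    // Assumes days >= 1.
    static long windowSum(NavigableMap<LocalDate, Long> dailyRollups, LocalDate end, int days) {
        LocalDate start = end.minusDays(days - 1L);
        long total = 0;
        for (long value : dailyRollups.subMap(start, true, end, true).values()) {
            total += value;
        }
        return total;
    }

    public static void main(String[] args) {
        // Sample data (made up): trips per day for one entity.
        NavigableMap<LocalDate, Long> trips = new TreeMap<>();
        trips.put(LocalDate.of(2018, 4, 10), 3L);
        trips.put(LocalDate.of(2018, 4, 11), 0L);
        trips.put(LocalDate.of(2018, 4, 12), 5L);
        trips.put(LocalDate.of(2018, 4, 13), 2L);
        System.out.println(windowSum(trips, LocalDate.of(2018, 4, 13), 2)); // 5 + 2 = 7
    }
}
```

A function such as median could not be served this way, because the median of a window cannot be derived from per-day medians.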
Aggregator as a black box
Raw events (Streaming + Hive) ➔ Aggregator ➔ Aggregated features
Input parameters to the black box
➔ Source events: Streaming + Hive
➔ Grain size: 5 min (realtime), 1 day (offline), etc.
➔ Aggregation functions: sum, count, etc.
➔ Aggregation windows: LTD, 7 days, 30 days, etc.
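The four input parameters could be captured in one spec object, sketched below. Everything here is hypothetical: the class `AggregationSpec`, its field names, and the rule that each window must be a whole number of grains are assumptions made for illustration, not the platform's actual spec format.

```java
import java.time.Duration;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

// Hypothetical spec object; field names are illustrative, not the platform's schema.
final class AggregationSpec {
    enum Function { SUM, COUNT }

    final String sourceEvents;       // e.g. a Hive table or a streaming topic
    final Duration grainSize;        // e.g. 5 min (realtime) or 1 day (offline)
    final List<Function> functions;  // aggregation functions to compute
    final List<Duration> windows;    // sliding windows; empty means lifetime-to-date only

    AggregationSpec(String sourceEvents, Duration grainSize,
                    List<Function> functions, List<Duration> windows) {
        if (grainSize.isZero() || grainSize.isNegative()) {
            throw new IllegalArgumentException("grain size must be positive");
        }
        for (Duration window : windows) {
            // Assumption: a window is answered from rollups, so it must be a whole number of grains.
            if (window.toMillis() % grainSize.toMillis() != 0) {
                throw new IllegalArgumentException("window " + window + " is not a multiple of " + grainSize);
            }
        }
        this.sourceEvents = sourceEvents;
        this.grainSize = grainSize;
        this.functions = Collections.unmodifiableList(new ArrayList<>(functions));
        this.windows = Collections.unmodifiableList(new ArrayList<>(windows));
    }

    public static void main(String[] args) {
        AggregationSpec spec = new AggregationSpec(
                "trips",  // made-up source name
                Duration.ofDays(1),
                Arrays.asList(Function.COUNT),
                Arrays.asList(Duration.ofDays(7), Duration.ofDays(30)));
        System.out.println(spec.windows.size() + " windows at grain " + spec.grainSize);
    }
}
```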
Overall architecture
Components: Specs, Batch Aggregator (Spark Apps), Real-time Aggregator (Streaming), Feature Store, Feature Access (Microservice)
• Batch (Spark)
– Long-window: weeks, months
– Bootstrap, incremental modes
• Streaming (e.g. Kafka events)
– Short-window (<24 hrs)
– Near-real time
• Real-time Access
– Merge offline and streaming
• Feature Store
– Save rolled-up aggregates in Hive and Cassandra
Batch Spark Engine
[Diagram: Batch Aggregator (Spark Apps)]
• Inputs: Specs; Hive (Tbl1:<2018-04-09>, Tbl1:<2018-04-10>)
• Computation: Scheduler, Feature Extractor, Rollup Generator, Snapshot Manager, Optimizer
• Modes: Bootstrap, Periodic
• Snapshots: Full Snapshot, Incremental Snapshot, Optimized Snapshot
• Dispersal to the Feature Store (Cassandra)
• Consumers: Feature Access, Decisioning System
Batch Storage
Daily Partitioned Source Tables (Hive)
Table-1
  partitions: <2018-04-10>, <2018-04-11>, <2018-04-12>, <2018-04-13>
  columns: col1, col2, col3, col4
Rolled-up Tables
ATable-1_daily (Daily Incremental Rollup): features involving sliding window computation
  partitions: <2018-04-10>, <2018-04-11>, <2018-04-12>, <2018-04-13>
  columns: uuid, f1, f2
ATable-1_Lifetime (Daily Lifetime Snapshot): features involving Lifetime computation
  partitions: <2018-04-10>, <2018-04-11>, <2018-04-12>, <2018-04-13>
  columns: uuid, f1_ltd, f2_ltd
ATable-1_joined: dispersed to real-time store
  partitions: <2018-04-10>, <2018-04-11>, <2018-04-12>, <2018-04-13>
  columns: uuid, f1, f2, f1_ltd, f2_ltd
Example for one day: Tbl1:<2018-04-13> ➔ ATable-1_daily:<2018-04-13>, ATable-1_LTD:<2018-04-13>
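One way to read the relationship between the daily incremental rollup and the daily lifetime snapshot is that each day's lifetime value is the previous day's value plus that day's rollup. The sketch below shows this for a summable feature. It is an assumption-laden illustration: the class `LifetimeSnapshot`, the in-memory maps standing in for Hive partitions, and the sample values are hypothetical.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch: ltd[day] = ltd[day - 1] + daily[day], per entity (uuid).
final class LifetimeSnapshot {
    static Map<String, Long> nextSnapshot(Map<String, Long> previousLtd, Map<String, Long> dailyRollup) {
        // Start from yesterday's snapshot so entities with no activity today keep their value.
        Map<String, Long> next = new HashMap<>(previousLtd);
        for (Map.Entry<String, Long> entry : dailyRollup.entrySet()) {
            // Entities seen for the first time today start from their daily value.
            next.merge(entry.getKey(), entry.getValue(), Long::sum);
        }
        return next;
    }

    public static void main(String[] args) {
        Map<String, Long> previous = new HashMap<>();
        previous.put("44b7dc88", 1534L);
        Map<String, Long> today = new HashMap<>();
        today.put("44b7dc88", 4L);
        System.out.println(nextSnapshot(previous, today).get("44b7dc88")); // 1534 + 4 = 1538
    }
}
```

Keeping entities from both sides behaves like a full outer join on the entity key.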
Role of Spark
● Orchestrator of ETL pipelines
○ Scheduling of subtasks
○ Record incremental progress
● Optimally resize HDFS files: scale with
size of data set.
● Rich set of APIs to enable complex
optimizations
e.g. an optimization in bootstrap dispersal:
    import java.util.Arrays;
    import org.apache.spark.sql.Dataset;
    import org.apache.spark.sql.Row;
    import scala.collection.JavaConverters;

    // Full outer join of the daily rollup with the lifetime (LTD) data on the entity key.
    Dataset<Row> joined = dailyDataset.join(
        ltdData,
        JavaConverters.asScalaIteratorConverter(
                Arrays.asList(pipelineConfig.getEntityKey()).iterator())
            .asScala()
            .toSeq(),
        "outer");
Resulting record:
uuid | _ltd | daily_buckets
44b7dc88 | 1534 | [{"2017-10-24":"4"},{"2017-08-22":"3"},{"2017-09-21":"4"},{"2017-08-08":"3"},{"2017-10-03":"3"},{"2017-10-19":"5"},{"2017-09-06":"1"},{"2017-08-17":"5"},{"2017-09-09":"12"},{"2017-10-05":"5"},{"2017-09-25":"4"},{"2017-09-17":"13"}]
Role of Spark (continued)
• Ability to disperse billions of records
– HashPartitioner to the rescue
    // Partition the data by hash of the key
    HashPartitioner hashPartitioner = new HashPartitioner(partitionNumber);
    JavaPairRDD<String, Row> hashedRDD = keyedRDD.partitionBy(hashPartitioner);
    // Fetch each hash partition and process it
    for (int partitionId = 0; partitionId < partitionNumber; partitionId++) {
        JavaRDD<Tuple2<String, Row>> filteredHashRDD = filterRows(hashedRDD, index, partitionId);
        // Raise an error here if the partition does not match.
        Dataset<Row> filteredDataSet = etlContext.getSparkSession().createDataset(
            filteredHashRDD.map(tuple -> tuple._2()).rdd(),
            data.org$apache$spark$sql$Dataset$$encoder);
        // Repartition filteredDataSet, then update the checkpoint and the number of
        // records processed after successful completion.
    }
[Diagram: Parent RDD ➔ P1, P2, P3, …, Pn ➔ Process Each Partition]
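The partition-at-a-time pattern can be shown without a cluster. The sketch below is plain Java, not Spark code: `PartitionedDispersal`, the `sink` callback, and the integer checkpoint are hypothetical stand-ins. The partition assignment uses a non-negative modulo of the key's `hashCode`, which is the rule Spark's `HashPartitioner` applies to non-null keys.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.Consumer;

// Hypothetical sketch: disperse records one hash partition at a time, with a checkpoint.
final class PartitionedDispersal {
    // Non-negative modulo of hashCode, as HashPartitioner does for non-null keys.
    static int partitionOf(String key, int numPartitions) {
        return Math.floorMod(key.hashCode(), numPartitions);
    }

    // Processes partitions starting at `checkpoint` (the first partition not yet completed).
    // Returns the new checkpoint; it equals numPartitions when everything was dispersed.
    static int disperse(List<String> keys, int numPartitions, int checkpoint,
                        Consumer<List<String>> sink) {
        int next = checkpoint;
        try {
            for (int p = checkpoint; p < numPartitions; p++) {
                List<String> batch = new ArrayList<>();
                for (String key : keys) {
                    if (partitionOf(key, numPartitions) == p) {
                        batch.add(key);
                    }
                }
                sink.accept(batch);
                next = p + 1;  // advance only after the partition completed successfully
            }
        } catch (RuntimeException e) {
            // Stop at the first failure; a later run resumes from `next`.
        }
        return next;
    }

    public static void main(String[] args) {
        List<String> keys = Arrays.asList("a", "b", "c", "d", "e");
        int checkpoint = disperse(keys, 4, 0, batch -> System.out.println(batch));
        System.out.println("checkpoint = " + checkpoint);
    }
}
```

Because the checkpoint only advances after a partition succeeds, a rerun after a failure does not repeat partitions that were already written.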
Role of Spark in Dispersal
• Global Throttling
– Feature Store can be the bottleneck
– coalesce() to limit the executors
• Inspect data
– Disperse only if any column has changed
• Monitoring and alerting
– Create custom metrics
[Diagram: Bootstrap and Dispersal to C*; full computation snapshots and optimized snapshots for 2018-02-01, 2018-02-02, 2018-02-03, …, 2018-03-01]
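The "disperse only if any column has changed" check amounts to diffing the current snapshot against the previous one. A minimal sketch follows, assuming each row is a list of feature values keyed by uuid. The class `OptimizedSnapshot` and the in-memory maps are hypothetical, and how the real pipeline compares rows is not described in the deck.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch: keep only rows that are new or differ from the previous snapshot.
final class OptimizedSnapshot {
    static Map<String, List<Long>> changedRows(Map<String, List<Long>> previous,
                                               Map<String, List<Long>> current) {
        Map<String, List<Long>> changed = new HashMap<>();
        for (Map.Entry<String, List<Long>> row : current.entrySet()) {
            // previous.get returns null for new entities, which never equals a row.
            if (!row.getValue().equals(previous.get(row.getKey()))) {
                changed.put(row.getKey(), row.getValue());
            }
        }
        return changed;
    }

    public static void main(String[] args) {
        Map<String, List<Long>> previous = new HashMap<>();
        previous.put("e1", Arrays.asList(1L, 2L));
        Map<String, List<Long>> current = new HashMap<>();
        current.put("e1", Arrays.asList(1L, 2L));  // unchanged, not dispersed
        current.put("e2", Arrays.asList(3L, 5L));  // new, dispersed
        System.out.println(changedRows(previous, current).keySet()); // [e2]
    }
}
```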
Real-time streaming engine
● Real-time, summable aggregations for < 24 hours
● Semantically equivalent to offline computation
● Aggregation rollups (5 mins) maintained in feature store (Cassandra)
[Diagram: Uber Athena streaming. raw kafka events ➔ event enrichment (RPCs to microservices) ➔ streaming computation pipelines (xform_0, xform_1, xform_2) ➔ time window aggregator ➔ C*]
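A 5-minute rollup assigns each event to the bucket that contains its timestamp and increments that bucket. The sketch below illustrates this for a count aggregation. The class `GrainBuckets` and the in-memory map are hypothetical; the deck keeps these rollups in Cassandra.

```java
import java.util.Map;
import java.util.TreeMap;

// Hypothetical sketch of 5-minute rollups for a count aggregation.
final class GrainBuckets {
    static final long GRAIN_MILLIS = 5 * 60 * 1000L;

    // Start (epoch millis) of the 5-minute bucket that contains the event timestamp.
    static long bucketStart(long eventMillis) {
        return Math.floorDiv(eventMillis, GRAIN_MILLIS) * GRAIN_MILLIS;
    }

    // Adds one event to its bucket.
    static void count(Map<Long, Long> rollups, long eventMillis) {
        rollups.merge(bucketStart(eventMillis), 1L, Long::sum);
    }

    public static void main(String[] args) {
        Map<Long, Long> tripCount = new TreeMap<>();
        count(tripCount, 0L);
        count(tripCount, 299_999L);  // still inside the first 5 minutes
        count(tripCount, 300_000L);  // first millisecond of the second bucket
        System.out.println(tripCount); // {0=2, 300000=1}
    }
}
```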
Final aggregation in real time
● Uses time series and clustering key support in Cassandra
○ 1 table for Lifetime & LTD values.
○ Multiple tables for realtime values with grain size 5 min
● Consult metadata and assemble into single result at feature access time
Tables:
entity_common_aggr_bt_ltd: UUID (PK), trip_count_ltd
entity_common_aggr_bt: UUID (PK), eventbucket (CK), trip_count
entity_common_aggr_rt_2018_05_08, entity_common_aggr_rt_2018_05_09, entity_common_aggr_rt_2018_05_10: UUID (PK), eventbucket (CK), trip_count
Feature access Service: Metadata Service, Query Planner
e.g. - lifetime trip count
- trips over last 51 hrs
- trips over previous 2 days
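For a lifetime feature, assembling a single result can be pictured as adding the batch LTD value to the real-time buckets that arrived after the batch cutoff. The sketch below is one possible reading, not the service's query planner: `FeatureAccessSketch`, the cutoff parameter, and the assumption that the cutoff falls on a bucket boundary are all hypothetical.

```java
import java.util.NavigableMap;
import java.util.TreeMap;

// Hypothetical sketch: merge an offline lifetime value with real-time rollups.
final class FeatureAccessSketch {
    // ltdSnapshot covers all events before cutoffMillis; realtimeBuckets maps
    // bucket start (epoch millis) to the aggregate for that bucket.
    // Assumes cutoffMillis is aligned to a bucket boundary.
    static long lifetime(long ltdSnapshot, long cutoffMillis,
                         NavigableMap<Long, Long> realtimeBuckets) {
        long total = ltdSnapshot;
        for (long value : realtimeBuckets.tailMap(cutoffMillis, true).values()) {
            total += value;
        }
        return total;
    }

    public static void main(String[] args) {
        NavigableMap<Long, Long> buckets = new TreeMap<>();
        buckets.put(0L, 2L);        // before the cutoff, already in the snapshot
        buckets.put(300_000L, 1L);  // at the cutoff, counted
        buckets.put(600_000L, 3L);  // after the cutoff, counted
        System.out.println(lifetime(1534L, 300_000L, buckets)); // 1534 + 1 + 3 = 1538
    }
}
```

Skipping buckets before the cutoff avoids counting the same events twice, once in the snapshot and once in the real-time table.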
Self-service onboarding
• Create Query (Spark SQL)
• Configure Spec
• Test Spec
• Commit to Prod
Machine learning support
Backfill Support: what is the value of a feature f1 for
an entity E1 from Thist to Tnow?
• Bootstrap to historic point in time: Thist
• Incrementally compute from Thist to Tnow
How?
• Lifetime: feature f1 on Thist: access partition Thist
• Windowed: feature f2 on Thist with window N days: merge partitions Thist-N to Thist
[Diagram: timeline T-120, T-119, …, T-90, …, T-1, T, showing the lifetime value on a given date, last 30 day trips at T-90, and last 30 day trips at T-89]
Takeaways
● Use of Spark to achieve massive scale
● Combine with Streaming aggregation for freshness
● Low latency access in production (P99 <= 20ms) at high QPS
● Simplify onboarding via single spec, onboarding time in hours
● Huge computational cost improvements
Proprietary and confidential © 2018 Uber Technologies, Inc. All rights reserved. No part of this
document may be reproduced or utilized in any form or by any means, electronic or mechanical,
including photocopying, recording, or by any information storage or retrieval systems, without
permission in writing from Uber. This document is intended only for the use of the individual or entity
to whom it is addressed and contains information that is privileged, confidential or otherwise exempt
from disclosure under applicable law. All recipients of this document are notified that the information
contained herein includes proprietary and confidential information of Uber, and recipient may not
make use of, disseminate, or in any way disclose this document or any of the enclosed information
to any person other than employees of addressee to the extent necessary for consultations with
authorized personnel of Uber.
bhanotp@uber.com
anene@uber.com

More Related Content

What's hot

Oracle to Postgres Migration - part 2
Oracle to Postgres Migration - part 2Oracle to Postgres Migration - part 2
Oracle to Postgres Migration - part 2PgTraining
 
Differentiate Big Data vs Data Warehouse use cases for a cloud solution
Differentiate Big Data vs Data Warehouse use cases for a cloud solutionDifferentiate Big Data vs Data Warehouse use cases for a cloud solution
Differentiate Big Data vs Data Warehouse use cases for a cloud solutionJames Serra
 
Data Engineer's Lunch #83: Strategies for Migration to Apache Iceberg
Data Engineer's Lunch #83: Strategies for Migration to Apache IcebergData Engineer's Lunch #83: Strategies for Migration to Apache Iceberg
Data Engineer's Lunch #83: Strategies for Migration to Apache IcebergAnant Corporation
 
Modernizing to a Cloud Data Architecture
Modernizing to a Cloud Data ArchitectureModernizing to a Cloud Data Architecture
Modernizing to a Cloud Data ArchitectureDatabricks
 
Meetup: Streaming Data Pipeline Development
Meetup:  Streaming Data Pipeline DevelopmentMeetup:  Streaming Data Pipeline Development
Meetup: Streaming Data Pipeline DevelopmentTimothy Spann
 
Building the Enterprise Data Lake - Important Considerations Before You Jump In
Building the Enterprise Data Lake - Important Considerations Before You Jump InBuilding the Enterprise Data Lake - Important Considerations Before You Jump In
Building the Enterprise Data Lake - Important Considerations Before You Jump InSnapLogic
 
Enabling a Data Mesh Architecture and Data Sharing Culture with Denodo
Enabling a Data Mesh Architecture and Data Sharing Culture with DenodoEnabling a Data Mesh Architecture and Data Sharing Culture with Denodo
Enabling a Data Mesh Architecture and Data Sharing Culture with DenodoDenodo
 
Building a Virtual Data Lake with Apache Arrow
Building a Virtual Data Lake with Apache ArrowBuilding a Virtual Data Lake with Apache Arrow
Building a Virtual Data Lake with Apache ArrowDremio Corporation
 
Real-time Analytics with Trino and Apache Pinot
Real-time Analytics with Trino and Apache PinotReal-time Analytics with Trino and Apache Pinot
Real-time Analytics with Trino and Apache PinotXiang Fu
 
Data platform architecture
Data platform architectureData platform architecture
Data platform architectureSudheer Kondla
 
Data Architecture Best Practices for Advanced Analytics
Data Architecture Best Practices for Advanced AnalyticsData Architecture Best Practices for Advanced Analytics
Data Architecture Best Practices for Advanced AnalyticsDATAVERSITY
 
Choosing an HDFS data storage format- Avro vs. Parquet and more - StampedeCon...
Choosing an HDFS data storage format- Avro vs. Parquet and more - StampedeCon...Choosing an HDFS data storage format- Avro vs. Parquet and more - StampedeCon...
Choosing an HDFS data storage format- Avro vs. Parquet and more - StampedeCon...StampedeCon
 
Performance Analysis of Apache Spark and Presto in Cloud Environments
Performance Analysis of Apache Spark and Presto in Cloud EnvironmentsPerformance Analysis of Apache Spark and Presto in Cloud Environments
Performance Analysis of Apache Spark and Presto in Cloud EnvironmentsDatabricks
 
Performance Optimizations in Apache Impala
Performance Optimizations in Apache ImpalaPerformance Optimizations in Apache Impala
Performance Optimizations in Apache ImpalaCloudera, Inc.
 
Oracle to Postgres Migration - part 1
Oracle to Postgres Migration - part 1Oracle to Postgres Migration - part 1
Oracle to Postgres Migration - part 1PgTraining
 
Data Warehouse Design and Best Practices
Data Warehouse Design and Best PracticesData Warehouse Design and Best Practices
Data Warehouse Design and Best PracticesIvo Andreev
 

What's hot (20)

Oracle to Postgres Migration - part 2
Oracle to Postgres Migration - part 2Oracle to Postgres Migration - part 2
Oracle to Postgres Migration - part 2
 
From Data Warehouse to Lakehouse
From Data Warehouse to LakehouseFrom Data Warehouse to Lakehouse
From Data Warehouse to Lakehouse
 
Differentiate Big Data vs Data Warehouse use cases for a cloud solution
Differentiate Big Data vs Data Warehouse use cases for a cloud solutionDifferentiate Big Data vs Data Warehouse use cases for a cloud solution
Differentiate Big Data vs Data Warehouse use cases for a cloud solution
 
Data Engineer's Lunch #83: Strategies for Migration to Apache Iceberg
Data Engineer's Lunch #83: Strategies for Migration to Apache IcebergData Engineer's Lunch #83: Strategies for Migration to Apache Iceberg
Data Engineer's Lunch #83: Strategies for Migration to Apache Iceberg
 
Modernizing to a Cloud Data Architecture
Modernizing to a Cloud Data ArchitectureModernizing to a Cloud Data Architecture
Modernizing to a Cloud Data Architecture
 
Meetup: Streaming Data Pipeline Development
Meetup:  Streaming Data Pipeline DevelopmentMeetup:  Streaming Data Pipeline Development
Meetup: Streaming Data Pipeline Development
 
Building the Enterprise Data Lake - Important Considerations Before You Jump In
Building the Enterprise Data Lake - Important Considerations Before You Jump InBuilding the Enterprise Data Lake - Important Considerations Before You Jump In
Building the Enterprise Data Lake - Important Considerations Before You Jump In
 
Enabling a Data Mesh Architecture and Data Sharing Culture with Denodo
Enabling a Data Mesh Architecture and Data Sharing Culture with DenodoEnabling a Data Mesh Architecture and Data Sharing Culture with Denodo
Enabling a Data Mesh Architecture and Data Sharing Culture with Denodo
 
Vue d'ensemble Dremio
Vue d'ensemble DremioVue d'ensemble Dremio
Vue d'ensemble Dremio
 
Building a Virtual Data Lake with Apache Arrow
Building a Virtual Data Lake with Apache ArrowBuilding a Virtual Data Lake with Apache Arrow
Building a Virtual Data Lake with Apache Arrow
 
Achieving 100k Queries per Hour on Hive on Tez
Achieving 100k Queries per Hour on Hive on TezAchieving 100k Queries per Hour on Hive on Tez
Achieving 100k Queries per Hour on Hive on Tez
 
Real-time Analytics with Trino and Apache Pinot
Real-time Analytics with Trino and Apache PinotReal-time Analytics with Trino and Apache Pinot
Real-time Analytics with Trino and Apache Pinot
 
Data platform architecture
Data platform architectureData platform architecture
Data platform architecture
 
Data Architecture Best Practices for Advanced Analytics
Data Architecture Best Practices for Advanced AnalyticsData Architecture Best Practices for Advanced Analytics
Data Architecture Best Practices for Advanced Analytics
 
Choosing an HDFS data storage format- Avro vs. Parquet and more - StampedeCon...
Choosing an HDFS data storage format- Avro vs. Parquet and more - StampedeCon...Choosing an HDFS data storage format- Avro vs. Parquet and more - StampedeCon...
Choosing an HDFS data storage format- Avro vs. Parquet and more - StampedeCon...
 
Performance Analysis of Apache Spark and Presto in Cloud Environments
Performance Analysis of Apache Spark and Presto in Cloud EnvironmentsPerformance Analysis of Apache Spark and Presto in Cloud Environments
Performance Analysis of Apache Spark and Presto in Cloud Environments
 
Performance Optimizations in Apache Impala
Performance Optimizations in Apache ImpalaPerformance Optimizations in Apache Impala
Performance Optimizations in Apache Impala
 
Oracle to Postgres Migration - part 1
Oracle to Postgres Migration - part 1Oracle to Postgres Migration - part 1
Oracle to Postgres Migration - part 1
 
Data Warehouse Design and Best Practices
Data Warehouse Design and Best PracticesData Warehouse Design and Best Practices
Data Warehouse Design and Best Practices
 
Migrating Oracle to PostgreSQL
Migrating Oracle to PostgreSQLMigrating Oracle to PostgreSQL
Migrating Oracle to PostgreSQL
 

Similar to Large Scale Feature Aggregation Using Apache Spark with Pulkit Bhanot and Amit Nene

Flink powered stream processing platform at Pinterest
Flink powered stream processing platform at PinterestFlink powered stream processing platform at Pinterest
Flink powered stream processing platform at PinterestFlink Forward
 
Flink Forward San Francisco 2018: Robert Metzger & Patrick Lucas - "dA Platfo...
Flink Forward San Francisco 2018: Robert Metzger & Patrick Lucas - "dA Platfo...Flink Forward San Francisco 2018: Robert Metzger & Patrick Lucas - "dA Platfo...
Flink Forward San Francisco 2018: Robert Metzger & Patrick Lucas - "dA Platfo...Flink Forward
 
Uber Business Metrics Generation and Management Through Apache Flink
Uber Business Metrics Generation and Management Through Apache FlinkUber Business Metrics Generation and Management Through Apache Flink
Uber Business Metrics Generation and Management Through Apache FlinkWenrui Meng
 
Lessons learned from embedding Cassandra in xPatterns
Lessons learned from embedding Cassandra in xPatternsLessons learned from embedding Cassandra in xPatterns
Lessons learned from embedding Cassandra in xPatternsClaudiu Barbura
 
Big data Argentina meetup 2020-09: Intro to presto on docker
Big data Argentina meetup 2020-09: Intro to presto on dockerBig data Argentina meetup 2020-09: Intro to presto on docker
Big data Argentina meetup 2020-09: Intro to presto on dockerFederico Palladoro
 
Openstack Summit Tokyo 2015 - Building a private cloud to efficiently handle ...
Openstack Summit Tokyo 2015 - Building a private cloud to efficiently handle ...Openstack Summit Tokyo 2015 - Building a private cloud to efficiently handle ...
Openstack Summit Tokyo 2015 - Building a private cloud to efficiently handle ...Pierre GRANDIN
 
[DBA]_HiramFleitas_SQL_PASS_Summit_2017_Summary
[DBA]_HiramFleitas_SQL_PASS_Summit_2017_Summary[DBA]_HiramFleitas_SQL_PASS_Summit_2017_Summary
[DBA]_HiramFleitas_SQL_PASS_Summit_2017_SummaryHiram Fleitas León
 
Functioning incessantly of Data Science Platform with Kubeflow - Albert Lewan...
Functioning incessantly of Data Science Platform with Kubeflow - Albert Lewan...Functioning incessantly of Data Science Platform with Kubeflow - Albert Lewan...
Functioning incessantly of Data Science Platform with Kubeflow - Albert Lewan...GetInData
 
Kubernetes Forum Seoul 2019: Re-architecting Data Platform with Kubernetes
Kubernetes Forum Seoul 2019: Re-architecting Data Platform with KubernetesKubernetes Forum Seoul 2019: Re-architecting Data Platform with Kubernetes
Kubernetes Forum Seoul 2019: Re-architecting Data Platform with KubernetesSeungYong Oh
 
The Lyft data platform: Now and in the future
The Lyft data platform: Now and in the futureThe Lyft data platform: Now and in the future
The Lyft data platform: Now and in the futuremarkgrover
 
Lyft data Platform - 2019 slides
Lyft data Platform - 2019 slidesLyft data Platform - 2019 slides
Lyft data Platform - 2019 slidesKarthik Murugesan
 
Adtech scala-performance-tuning-150323223738-conversion-gate01
Adtech scala-performance-tuning-150323223738-conversion-gate01Adtech scala-performance-tuning-150323223738-conversion-gate01
Adtech scala-performance-tuning-150323223738-conversion-gate01Giridhar Addepalli
 
Adtech x Scala x Performance tuning
Adtech x Scala x Performance tuningAdtech x Scala x Performance tuning
Adtech x Scala x Performance tuningYosuke Mizutani
 
Real-Time Health Score Application using Apache Spark on Kubernetes
Real-Time Health Score Application using Apache Spark on KubernetesReal-Time Health Score Application using Apache Spark on Kubernetes
Real-Time Health Score Application using Apache Spark on KubernetesDatabricks
 
OSMC 2023 | What’s new with Grafana Labs’s Open Source Observability stack by...
OSMC 2023 | What’s new with Grafana Labs’s Open Source Observability stack by...OSMC 2023 | What’s new with Grafana Labs’s Open Source Observability stack by...
OSMC 2023 | What’s new with Grafana Labs’s Open Source Observability stack by...NETWAYS
 
Pulsar: Real-time Analytics at Scale with Kafka, Kylin and Druid
Pulsar: Real-time Analytics at Scale with Kafka, Kylin and DruidPulsar: Real-time Analytics at Scale with Kafka, Kylin and Druid
Pulsar: Real-time Analytics at Scale with Kafka, Kylin and DruidTony Ng
 
Building a Pluggable Analytics Stack with Cassandra (Jim Peregord, Element Co...
Building a Pluggable Analytics Stack with Cassandra (Jim Peregord, Element Co...Building a Pluggable Analytics Stack with Cassandra (Jim Peregord, Element Co...
Building a Pluggable Analytics Stack with Cassandra (Jim Peregord, Element Co...DataStax
 
Track A-2 基於 Spark 的數據分析
Track A-2 基於 Spark 的數據分析Track A-2 基於 Spark 的數據分析
Track A-2 基於 Spark 的數據分析Etu Solution
 

Similar to Large Scale Feature Aggregation Using Apache Spark with Pulkit Bhanot and Amit Nene (20)

Flink powered stream processing platform at Pinterest
Flink powered stream processing platform at PinterestFlink powered stream processing platform at Pinterest
Flink powered stream processing platform at Pinterest
 
dA Platform Overview
dA Platform OverviewdA Platform Overview
dA Platform Overview
 
Flink Forward San Francisco 2018: Robert Metzger & Patrick Lucas - "dA Platfo...
Flink Forward San Francisco 2018: Robert Metzger & Patrick Lucas - "dA Platfo...Flink Forward San Francisco 2018: Robert Metzger & Patrick Lucas - "dA Platfo...
Flink Forward San Francisco 2018: Robert Metzger & Patrick Lucas - "dA Platfo...
 
Uber Business Metrics Generation and Management Through Apache Flink
Uber Business Metrics Generation and Management Through Apache FlinkUber Business Metrics Generation and Management Through Apache Flink
Uber Business Metrics Generation and Management Through Apache Flink
 
Lessons learned from embedding Cassandra in xPatterns
Lessons learned from embedding Cassandra in xPatternsLessons learned from embedding Cassandra in xPatterns
Lessons learned from embedding Cassandra in xPatterns
 
Big data Argentina meetup 2020-09: Intro to presto on docker
Big data Argentina meetup 2020-09: Intro to presto on dockerBig data Argentina meetup 2020-09: Intro to presto on docker
Big data Argentina meetup 2020-09: Intro to presto on docker
 
Shaik Niyas Ahamed M Resume
Shaik Niyas Ahamed M ResumeShaik Niyas Ahamed M Resume
Shaik Niyas Ahamed M Resume
 
Openstack Summit Tokyo 2015 - Building a private cloud to efficiently handle ...
Openstack Summit Tokyo 2015 - Building a private cloud to efficiently handle ...Openstack Summit Tokyo 2015 - Building a private cloud to efficiently handle ...
Openstack Summit Tokyo 2015 - Building a private cloud to efficiently handle ...
 
[DBA]_HiramFleitas_SQL_PASS_Summit_2017_Summary
[DBA]_HiramFleitas_SQL_PASS_Summit_2017_Summary[DBA]_HiramFleitas_SQL_PASS_Summit_2017_Summary
[DBA]_HiramFleitas_SQL_PASS_Summit_2017_Summary
 
Functioning incessantly of Data Science Platform with Kubeflow - Albert Lewan...
Functioning incessantly of Data Science Platform with Kubeflow - Albert Lewan...Functioning incessantly of Data Science Platform with Kubeflow - Albert Lewan...
Functioning incessantly of Data Science Platform with Kubeflow - Albert Lewan...
 
Kubernetes Forum Seoul 2019: Re-architecting Data Platform with Kubernetes
Kubernetes Forum Seoul 2019: Re-architecting Data Platform with KubernetesKubernetes Forum Seoul 2019: Re-architecting Data Platform with Kubernetes
Kubernetes Forum Seoul 2019: Re-architecting Data Platform with Kubernetes
 
The Lyft data platform: Now and in the future
The Lyft data platform: Now and in the futureThe Lyft data platform: Now and in the future
The Lyft data platform: Now and in the future
 
Lyft data Platform - 2019 slides
Lyft data Platform - 2019 slidesLyft data Platform - 2019 slides
Lyft data Platform - 2019 slides
 
Adtech scala-performance-tuning-150323223738-conversion-gate01
Adtech scala-performance-tuning-150323223738-conversion-gate01Adtech scala-performance-tuning-150323223738-conversion-gate01
Adtech scala-performance-tuning-150323223738-conversion-gate01
 
Adtech x Scala x Performance tuning
Adtech x Scala x Performance tuningAdtech x Scala x Performance tuning
Adtech x Scala x Performance tuning
 
Real-Time Health Score Application using Apache Spark on Kubernetes
Real-Time Health Score Application using Apache Spark on KubernetesReal-Time Health Score Application using Apache Spark on Kubernetes
Real-Time Health Score Application using Apache Spark on Kubernetes
 
OSMC 2023 | What’s new with Grafana Labs’s Open Source Observability stack by...
OSMC 2023 | What’s new with Grafana Labs’s Open Source Observability stack by...OSMC 2023 | What’s new with Grafana Labs’s Open Source Observability stack by...
OSMC 2023 | What’s new with Grafana Labs’s Open Source Observability stack by...
 
Pulsar: Real-time Analytics at Scale with Kafka, Kylin and Druid
Pulsar: Real-time Analytics at Scale with Kafka, Kylin and DruidPulsar: Real-time Analytics at Scale with Kafka, Kylin and Druid
Pulsar: Real-time Analytics at Scale with Kafka, Kylin and Druid
 
Building a Pluggable Analytics Stack with Cassandra (Jim Peregord, Element Co...
Building a Pluggable Analytics Stack with Cassandra (Jim Peregord, Element Co...Building a Pluggable Analytics Stack with Cassandra (Jim Peregord, Element Co...
Building a Pluggable Analytics Stack with Cassandra (Jim Peregord, Element Co...
 
Track A-2 基於 Spark 的數據分析
Track A-2 基於 Spark 的數據分析Track A-2 基於 Spark 的數據分析
Track A-2 基於 Spark 的數據分析
 

More from Databricks

DW Migration Webinar-March 2022.pptx
DW Migration Webinar-March 2022.pptxDW Migration Webinar-March 2022.pptx
DW Migration Webinar-March 2022.pptxDatabricks
 
Data Lakehouse Symposium | Day 1 | Part 1
Data Lakehouse Symposium | Day 1 | Part 1Data Lakehouse Symposium | Day 1 | Part 1
Data Lakehouse Symposium | Day 1 | Part 1Databricks
 
Data Lakehouse Symposium | Day 1 | Part 2
Data Lakehouse Symposium | Day 1 | Part 2Data Lakehouse Symposium | Day 1 | Part 2
Data Lakehouse Symposium | Day 1 | Part 2Databricks
 
Data Lakehouse Symposium | Day 2
Data Lakehouse Symposium | Day 2Data Lakehouse Symposium | Day 2
Data Lakehouse Symposium | Day 2Databricks
 
Data Lakehouse Symposium | Day 4
Data Lakehouse Symposium | Day 4Data Lakehouse Symposium | Day 4
Data Lakehouse Symposium | Day 4Databricks
 
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
5 Critical Steps to Clean Your Data Swamp When Migrating Off of HadoopDatabricks
 
Democratizing Data Quality Through a Centralized Platform
Democratizing Data Quality Through a Centralized PlatformDemocratizing Data Quality Through a Centralized Platform
Democratizing Data Quality Through a Centralized PlatformDatabricks
 
Learn to Use Databricks for Data Science
Learn to Use Databricks for Data ScienceLearn to Use Databricks for Data Science
Learn to Use Databricks for Data ScienceDatabricks
 
Why APM Is Not the Same As ML Monitoring
Why APM Is Not the Same As ML MonitoringWhy APM Is Not the Same As ML Monitoring
Why APM Is Not the Same As ML MonitoringDatabricks
 
The Function, the Context, and the Data—Enabling ML Ops at Stitch Fix
The Function, the Context, and the Data—Enabling ML Ops at Stitch FixThe Function, the Context, and the Data—Enabling ML Ops at Stitch Fix
The Function, the Context, and the Data—Enabling ML Ops at Stitch FixDatabricks
 
Stage Level Scheduling Improving Big Data and AI Integration
Stage Level Scheduling Improving Big Data and AI IntegrationStage Level Scheduling Improving Big Data and AI Integration
Stage Level Scheduling Improving Big Data and AI IntegrationDatabricks
 
Simplify Data Conversion from Spark to TensorFlow and PyTorch
Simplify Data Conversion from Spark to TensorFlow and PyTorchSimplify Data Conversion from Spark to TensorFlow and PyTorch
Simplify Data Conversion from Spark to TensorFlow and PyTorchDatabricks
 
Scaling your Data Pipelines with Apache Spark on Kubernetes
Scaling your Data Pipelines with Apache Spark on KubernetesScaling your Data Pipelines with Apache Spark on Kubernetes
Scaling your Data Pipelines with Apache Spark on KubernetesDatabricks
 
Scaling and Unifying SciKit Learn and Apache Spark Pipelines
Scaling and Unifying SciKit Learn and Apache Spark PipelinesScaling and Unifying SciKit Learn and Apache Spark Pipelines
Scaling and Unifying SciKit Learn and Apache Spark PipelinesDatabricks
 
Sawtooth Windows for Feature Aggregations
Sawtooth Windows for Feature AggregationsSawtooth Windows for Feature Aggregations
Sawtooth Windows for Feature AggregationsDatabricks
 
Redis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
Redis + Apache Spark = Swiss Army Knife Meets Kitchen SinkRedis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
Redis + Apache Spark = Swiss Army Knife Meets Kitchen SinkDatabricks
 
Re-imagine Data Monitoring with whylogs and Spark
Re-imagine Data Monitoring with whylogs and SparkRe-imagine Data Monitoring with whylogs and Spark
Re-imagine Data Monitoring with whylogs and SparkDatabricks
 
Raven: End-to-end Optimization of ML Prediction Queries
Raven: End-to-end Optimization of ML Prediction QueriesRaven: End-to-end Optimization of ML Prediction Queries
Raven: End-to-end Optimization of ML Prediction QueriesDatabricks
 
Processing Large Datasets for ADAS Applications using Apache Spark
Processing Large Datasets for ADAS Applications using Apache SparkProcessing Large Datasets for ADAS Applications using Apache Spark
Processing Large Datasets for ADAS Applications using Apache SparkDatabricks
 
Massive Data Processing in Adobe Using Delta Lake
Massive Data Processing in Adobe Using Delta LakeMassive Data Processing in Adobe Using Delta Lake
Massive Data Processing in Adobe Using Delta LakeDatabricks
 

Large Scale Feature Aggregation Using Apache Spark with Pulkit Bhanot and Amit Nene

  • 1. Large-scale Feature Aggregation Using Apache Spark. Pulkit Bhanot, Amit Nene, Risk Platform. #Dev1SAIS
  • 2. Agenda: Motivation; Challenges; Architecture Deep Dive; Role of Spark; Takeaways.
  • 3. Team Mission: build a scalable, self-service Feature Engineering Platform for predictive decisioning (based on ML and business rules). Feature Engineering: use of domain knowledge to create features. Self-service: for data scientists and analysts, without reliance on engineers. Generic platform: consolidating work towards wider ML efforts at Uber.
  • 4. Sample use case: Fraud detection (Payments, Promotions). Detect and prevent bad actors in real time, using features such as: number of trips over X hours/weeks; trips cancelled over Y months; count of referrals over lifetime; ...
  • 5. Time Series Aggregations. Needs: lifetime of entity; sliding long window (days/weeks/months); sliding short window (mins/hours); real-time. Existing solutions (warehouse, indexed databases, streaming aggregations): none satisfies all of the above, and onboarding is complex. None fits the bill!
  • 6. Technical Challenges. Scale: 1000s of aggregations for 100s of millions of business entities. Long-window aggregation queries are slow even with indexes (seconds); millisecond latency at high QPS is needed. Onboarding complexity: many moving parts. Recovery from accrual of errors.
  • 7. Our approach. One-stop shop for aggregation: a single system to interact with; a single spec that automates configuration of the underlying systems. Scalable: leverage the scale of the batch system for long windows; combine with real-time aggregation for freshness; rollups (aggregate over time intervals) allow fast queries over rolled-up aggregates; caveat: applies to summable functions. Self-healing: the batch system auto-corrects any errors.
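
The rollup idea on slide 7 can be illustrated with a minimal, stdlib-only sketch. It is not the platform's implementation: the class `RollupSketch`, its methods, and the use of millisecond timestamps are assumptions made for illustration. Events are counted into fixed 5-minute buckets, and a window query sums the pre-aggregated buckets instead of rescanning raw events, which only works because count is a summable function.

```java
import java.util.TreeMap;

// Illustrative sketch only: names are hypothetical, not taken from the talk.
final class RollupSketch {
    // Grain size in milliseconds (5 minutes, the real-time grain mentioned in the slides).
    static final long GRAIN_MS = 5 * 60 * 1000L;

    // Bucket start time -> number of events that fell into that bucket.
    private final TreeMap<Long, Long> buckets = new TreeMap<>();

    // Roll one raw event into the bucket that contains its timestamp.
    void record(long eventTimeMs) {
        long bucketStart = Math.floorDiv(eventTimeMs, GRAIN_MS) * GRAIN_MS;
        buckets.merge(bucketStart, 1L, Long::sum);
    }

    // Count for buckets whose start lies in [fromMs, toMs). Accuracy is limited
    // to the grain: callers are assumed to pass bucket-aligned bounds.
    long countBetween(long fromMs, long toMs) {
        long total = 0;
        for (long c : buckets.subMap(fromMs, true, toMs, false).values()) {
            total += c;
        }
        return total;
    }

    public static void main(String[] args) {
        RollupSketch r = new RollupSketch();
        r.record(0);
        r.record(60_000);
        r.record(GRAIN_MS + 1);
        r.record(3 * GRAIN_MS);
        System.out.println(r.countBetween(0, 2 * GRAIN_MS)); // prints 3
    }
}
```

A non-summable function such as a distinct count could not be answered by adding bucket values this way, which is presumably the caveat the slide refers to.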
  • 8. Aggregator as a black box. Input parameters: source events (raw events from streaming and Hive); grain size (5 min for real-time, 1 day for offline, etc.); aggregation functions (sum, count, etc.); aggregation windows (LTD, 7 days, 30 days, etc.). Output: aggregated features.
  • 9. Overall architecture. Components: Specs; Batch Aggregator (Spark apps); Real-time Aggregator (streaming); Feature Store; Feature Access (microservice). Batch (Spark): long window (weeks, months); bootstrap and incremental modes. Streaming (e.g. Kafka events): short window (<24 hrs); near-real time. Real-time access: merges offline and streaming aggregates. Feature Store: saves rolled-up aggregates in Hive and Cassandra.
  • 10. Batch Spark Engine. Diagram labels (steps numbered 1 to 10 on the slide): Hive source partitions (Tbl1:<2018-04-10>, Tbl1:<2018-04-09>); Specs; Batch Aggregator (Spark apps); Feature Extractor; Scheduler; Rollup Generator; Bootstrap; Periodic Snapshot Manager; Optimizer; Computation; Full Snapshot; Incremental Snapshot; Optimized Snapshot; Dispersal; Feature Store (Cassandra); Feature Access; Decisioning System.
  • 11. Batch Storage (Hive).
    ● Daily partitioned source tables
      ○ Table-1: partitions <2018-04-10>, <2018-04-11>, <2018-04-12>, <2018-04-13>; columns col1, col2, col3, col4
    ● Rolled-up tables
      ○ ATable-1_daily (daily incremental rollup; features involving sliding window computation): partitions <2018-04-10> to <2018-04-13>; columns uuid, f1, f2
      ○ ATable-1_Lifetime (daily lifetime snapshot; features involving lifetime computation): partitions <2018-04-10> to <2018-04-13>; columns uuid, f1_ltd, f2_ltd
      ○ ATable-1_joined (dispersed to real-time store): partitions <2018-04-10> to <2018-04-13>; columns uuid, f1, f2, f1_ltd, f2_ltd
    ● Diagram labels: Tbl1:<2018-04-10> ... Tbl1:<2018-04-13>; ATable-1_daily:<2018-04-13>; ATable-1_LTD:<2018-04-13>
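
One plausible way to maintain the daily lifetime snapshot from slide 11 is to add each day's rollup onto the previous day's lifetime values. The sketch below assumes this incremental scheme; the talk does not spell it out, and `LifetimeSnapshotSketch` and `nextSnapshot` are hypothetical names. It behaves like a full outer join on the entity key followed by a sum, which is the same shape as the join shown on slide 12.

```java
import java.util.HashMap;
import java.util.Map;

// Illustrative sketch only: names are hypothetical, not taken from the talk.
final class LifetimeSnapshotSketch {
    // previousLtd: uuid -> lifetime value as of yesterday's partition.
    // daily:       uuid -> value from today's daily rollup partition.
    static Map<String, Long> nextSnapshot(Map<String, Long> previousLtd,
                                          Map<String, Long> daily) {
        // Entities with no activity today carry their lifetime value over unchanged.
        Map<String, Long> next = new HashMap<>(previousLtd);
        // Entities seen today are added to (or inserted into) the snapshot.
        daily.forEach((uuid, value) -> next.merge(uuid, value, Long::sum));
        return next;
    }

    public static void main(String[] args) {
        Map<String, Long> ltd = Map.of("a", 10L, "b", 2L);
        Map<String, Long> today = Map.of("b", 3L, "c", 1L);
        System.out.println(nextSnapshot(ltd, today).get("b")); // prints 5
    }
}
```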
  • 12. Role of Spark.
    ● Orchestrator of ETL pipelines
      ○ Scheduling of subtasks
      ○ Record incremental progress
    ● Optimally resize HDFS files: scale with size of data set.
    ● Rich set of APIs to enable complex optimizations, e.g. an optimization in bootstrap dispersal:
        // Full outer join on the entity key, so entities present on only one side are kept.
        dailyDataset.join(
            ltdData,
            JavaConverters.asScalaIteratorConverter(
                    Arrays.asList(pipelineConfig.getEntityKey()).iterator())
                .asScala()
                .toSeq(),
            "outer");
    ● Sample joined row:
        uuid: 44b7dc88
        _ltd: 1534
        daily_buckets: [{"2017-10-24":"4"},{"2017-08-22":"3"},{"2017-09-21":"4"},{"2017-08-08":"3"},{"2017-10-03":"3"},{"2017-10-19":"5"},{"2017-09-06":"1"},{"2017-08-17":"5"},{"2017-09-09":"12"},{"2017-10-05":"5"},{"2017-09-25":"4"},{"2017-09-17":"13"}]
  • 13. Role of Spark (continued). Ability to disperse billions of records: HashPartitioner to the rescue. The parent RDD is split into hash partitions P1, P2, P3, ..., Pn, and each partition is processed in turn. Slide pseudocode:
        // Partition the data by hash
        HashPartitioner hashPartitioner = new HashPartitioner(partitionNumber);
        JavaPairRDD<String, Row> hashedRDD = keyedRDD.partitionBy(hashPartitioner);
        // Fetch each hash partition and process
        foreach partition {
            JavaRDD<Tuple2<String, Row>> filteredHashRDD = filterRows(hashedRDD, index, partitionId);
            // raise error if partition mismatch
            Dataset<Row> filteredDataSet = etlContext.getSparkSession().createDataset(
                filteredHashRDD.map(tuple -> tuple._2()).rdd(),
                data.org$apache$spark$sql$Dataset$$encoder);
            // repartition filteredDataSet, update checkpoint and records processed after successful completion
        }
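
The partition-then-process loop on slide 13 can be imitated without Spark to show the mechanics. This is a sketch under assumptions: `HashPartitionSketch` and its methods are hypothetical, keys are plain strings, and the checkpoint is just a partition index passed in by the caller. To my understanding Spark's `HashPartitioner` also assigns a key by a non-negative modulus of its `hashCode`, which is what `Math.floorMod` does here.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative sketch only: names are hypothetical, not taken from the talk.
final class HashPartitionSketch {
    // Non-negative modulus, so keys with negative hash codes still map to a valid partition.
    static int partitionFor(String key, int numPartitions) {
        return Math.floorMod(key.hashCode(), numPartitions);
    }

    // Split the keys into numPartitions buckets by hash.
    static List<List<String>> partition(List<String> keys, int numPartitions) {
        List<List<String>> parts = new ArrayList<>();
        for (int i = 0; i < numPartitions; i++) {
            parts.add(new ArrayList<>());
        }
        for (String key : keys) {
            parts.get(partitionFor(key, numPartitions)).add(key);
        }
        return parts;
    }

    // Process partitions one at a time, starting at the checkpointed index,
    // and return the number of records processed.
    static int disperseFrom(List<List<String>> parts, int checkpoint) {
        int processed = 0;
        for (int i = checkpoint; i < parts.size(); i++) {
            for (String key : parts.get(i)) {
                if (partitionFor(key, parts.size()) != i) {
                    throw new IllegalStateException("partition mismatch for " + key);
                }
                processed++; // a real job would write the record to the feature store here
            }
            // a real job would persist i + 1 as the new checkpoint here
        }
        return processed;
    }

    public static void main(String[] args) {
        List<List<String>> parts = partition(List.of("44b7dc88", "a", "b", "c"), 4);
        System.out.println(disperseFrom(parts, 0)); // prints 4
    }
}
```

Because a key always hashes to the same partition, a job that fails partway can resume from the last completed partition index rather than starting over.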
  • 14. Role of Spark in Dispersal. Global throttling: the Feature Store can be the bottleneck; use coalesce() to limit the executors. Inspect data: disperse only if any column has changed. Monitoring and alerting: create custom metrics. Diagram labels: Bootstrap; full computation snapshots per date (2018-02-01, 2018-02-02, 2018-02-03, ..., 2018-03-01); optimized snapshots; Dispersal to Cassandra (C*).
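
The "disperse only if any column has changed" rule on slide 14 amounts to diffing consecutive snapshots. The following stdlib-only sketch shows one way to do that; `OptimizedSnapshotSketch`, `changedRows`, and the row representation (a list of feature values per uuid) are assumptions, and entities that disappear between snapshots are deliberately not handled.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Illustrative sketch only: names are hypothetical, not taken from the talk.
final class OptimizedSnapshotSketch {
    // Keep only rows that are new or differ from the previous full snapshot.
    static Map<String, List<Long>> changedRows(Map<String, List<Long>> previous,
                                               Map<String, List<Long>> current) {
        Map<String, List<Long>> changed = new HashMap<>();
        current.forEach((uuid, row) -> {
            // previous.get(uuid) is null for new entities, so equals() returns false.
            if (!row.equals(previous.get(uuid))) {
                changed.put(uuid, row);
            }
        });
        return changed;
    }

    public static void main(String[] args) {
        Map<String, List<Long>> previous = Map.of("a", List.of(1L, 2L), "b", List.of(3L, 4L));
        Map<String, List<Long>> current =
            Map.of("a", List.of(1L, 2L), "b", List.of(3L, 5L), "c", List.of(0L, 1L));
        System.out.println(changedRows(previous, current).size()); // prints 2
    }
}
```

Writing only the changed rows reduces the load on the feature store, which the slide identifies as the likely bottleneck.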
  • 15. Real-time streaming engine. Real-time, summable aggregations for < 24 hours. Semantically equivalent to offline computation. Aggregation rollups (5 mins) maintained in the feature store (Cassandra). Diagram labels: raw Kafka events; event enrichment; microservices (RPCs); streaming computation pipelines (xform_0, xform_1, xform_2); time window aggregator; C*; Uber Athena streaming.
  • 16. Final aggregation in real time.
    ● Uses time series and clustering key support in Cassandra
      ○ 1 table for Lifetime & LTD values
      ○ Multiple tables for real-time values with a grain size of 5 minutes
    ● Consult metadata and assemble into a single result at feature access time
    ● Tables
      ○ entity_common_aggr_bt_ltd: UUID (PK), trip_count_ltd
      ○ entity_common_aggr_bt: UUID (PK), eventbucket (CK), trip_count
      ○ entity_common_aggr_rt_2018_05_08, entity_common_aggr_rt_2018_05_09, entity_common_aggr_rt_2018_05_10: UUID (PK), eventbucket (CK), trip_count
    ● Feature Access Service, with Metadata Service and Query Planner, e.g.
      ○ lifetime trip count
      ○ trips over last 51 hrs
      ○ trips over previous 2 days
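
A query such as "trips over last 51 hrs" from slide 16 has to combine coarse batch buckets with fine real-time buckets. The sketch below is one possible planning rule, not the service's actual logic: whole days inside the window are read from daily buckets, and the partial days at either edge are read from 5-minute buckets. `WindowQuerySketch`, its fields, and the use of minutes since an arbitrary epoch are assumptions, and it presumes the batch side already holds every full day in the window.

```java
import java.util.NavigableMap;
import java.util.TreeMap;

// Illustrative sketch only: names are hypothetical, not taken from the talk.
final class WindowQuerySketch {
    static final long MINUTES_PER_DAY = 24 * 60;

    // Day index -> count, as produced by the batch rollup.
    final NavigableMap<Long, Long> dailyBuckets = new TreeMap<>();
    // Bucket start minute -> count, as produced by the streaming rollup.
    final NavigableMap<Long, Long> realtimeBuckets = new TreeMap<>();

    // Sum of values whose key lies in [fromInclusive, toExclusive).
    static long sum(NavigableMap<Long, Long> buckets, long fromInclusive, long toExclusive) {
        if (fromInclusive >= toExclusive) {
            return 0;
        }
        long total = 0;
        for (long v : buckets.subMap(fromInclusive, true, toExclusive, false).values()) {
            total += v;
        }
        return total;
    }

    // Count over [fromMin, toMin), both in minutes and assumed non-negative.
    long count(long fromMin, long toMin) {
        // First day that starts at or after fromMin (ceiling division).
        long firstFullDay = Math.floorDiv(fromMin + MINUTES_PER_DAY - 1, MINUTES_PER_DAY);
        // Exclusive end of the run of days that finish at or before toMin.
        long endFullDay = Math.floorDiv(toMin, MINUTES_PER_DAY);
        if (firstFullDay >= endFullDay) {
            // No whole day fits in the window: use real-time buckets only.
            return sum(realtimeBuckets, fromMin, toMin);
        }
        return sum(realtimeBuckets, fromMin, firstFullDay * MINUTES_PER_DAY)
            + sum(dailyBuckets, firstFullDay, endFullDay)
            + sum(realtimeBuckets, endFullDay * MINUTES_PER_DAY, toMin);
    }

    public static void main(String[] args) {
        WindowQuerySketch q = new WindowQuerySketch();
        q.dailyBuckets.put(2L, 7L);
        q.realtimeBuckets.put(1700L, 2L);
        q.realtimeBuckets.put(4400L, 3L);
        // 51 hours ending at day 3, 06:00 (minute 4680) starts at minute 1620.
        System.out.println(q.count(4680 - 51 * 60, 4680)); // prints 12
    }
}
```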
  • 17. Self-service onboarding. Steps: Create Query (Spark SQL); Configure Spec; Test Spec; Commit to Prod.
  • 18. Machine learning support. Backfill support: what is the value of a feature f1 for an entity E1 from Thist to Tnow? Bootstrap to a historic point in time, Thist; then incrementally compute from Thist to Tnow. How? Lifetime: for feature f1 on Thist, access partition Thist. Windowed: for feature f2 on Thist with a window of N days, merge partitions Thist-N to Thist. Timeline: T-120, T-119, ..., T-90, ..., T-1, T; lifetime value on a given date; last 30 day trips at T-90; last 30 day trips at T-89.
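
The windowed backfill on slide 18 can be sketched as a sum over daily partitions for each date between Thist and Tnow. The names `BackfillSketch`, `windowed`, and `backfill` are hypothetical, and the window is assumed here to cover the N days ending on the as-of date inclusive; the slide's "Thist-N to Thist" does not say which endpoint is excluded.

```java
import java.time.LocalDate;
import java.util.Map;
import java.util.TreeMap;

// Illustrative sketch only: names are hypothetical, not taken from the talk.
final class BackfillSketch {
    // Sum of the daily partitions for the windowDays days ending on asOf (inclusive).
    static long windowed(TreeMap<LocalDate, Long> dailyPartitions, LocalDate asOf, int windowDays) {
        if (windowDays < 1) {
            throw new IllegalArgumentException("windowDays must be at least 1");
        }
        LocalDate from = asOf.minusDays(windowDays - 1L);
        long total = 0;
        for (long v : dailyPartitions.subMap(from, true, asOf, true).values()) {
            total += v;
        }
        return total;
    }

    // Point-in-time feature value for every date from tHist to tNow, inclusive.
    static Map<LocalDate, Long> backfill(TreeMap<LocalDate, Long> dailyPartitions,
                                         LocalDate tHist, LocalDate tNow, int windowDays) {
        Map<LocalDate, Long> values = new TreeMap<>();
        for (LocalDate d = tHist; !d.isAfter(tNow); d = d.plusDays(1)) {
            values.put(d, windowed(dailyPartitions, d, windowDays));
        }
        return values;
    }

    public static void main(String[] args) {
        TreeMap<LocalDate, Long> daily = new TreeMap<>();
        LocalDate d0 = LocalDate.of(2018, 4, 10);
        daily.put(d0, 1L);
        daily.put(d0.plusDays(1), 2L);
        daily.put(d0.plusDays(2), 4L);
        System.out.println(windowed(daily, d0.plusDays(2), 2)); // prints 6
    }
}
```

Each backfilled value depends only on partitions at or before its own date, so training rows built this way do not leak future data into a historic feature.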
  • 19. Takeaways. Use of Spark to achieve massive scale. Combine with streaming aggregation for freshness. Low-latency access in production (P99 <= 20 ms) at high QPS. Simplify onboarding via a single spec; onboarding time in hours. Huge computational cost improvements.
  • 20. Proprietary and confidential © 2018 Uber Technologies, Inc. All rights reserved. No part of this document may be reproduced or utilized in any form or by any means, electronic or mechanical, including photocopying, recording, or by any information storage or retrieval systems, without permission in writing from Uber. This document is intended only for the use of the individual or entity to whom it is addressed and contains information that is privileged, confidential or otherwise exempt from disclosure under applicable law. All recipients of this document are notified that the information contained herein includes proprietary and confidential information of Uber, and recipient may not make use of, disseminate, or in any way disclose this document or any of the enclosed information to any person other than employees of addressee to the extent necessary for consultations with authorized personnel of Uber. bhanotp@uber.com anene@uber.com