PayPal Merchant ecosystem using Spark,
Hive, Druid, HBase & Elasticsearch
Who we are
Deepika Khera
Kasi Natarajan
• Big Data Technologist for over a decade
• Focused on building scalable platforms with Hadoop ecosystem – Map Reduce,
HBase, Spark, Elasticsearch, Druid
• Senior Engineering Manager - Merchant Analytics at PayPal
• Contributed to Druid for the Spark Streaming integration
• 15+ years of industry experience
• Spark Engineer @PayPal Merchant Analytics
• Building solutions using Apache Spark, Scala, Hive, HBase, Druid and Spark ML.
• Passionate about providing Analytics at scale from Big Data platforms
Agenda
PayPal Data & Scale
Merchant Use Case Review
Data Pipeline
Learnings - Spark & HBase
Tools & Utilities
Behavioral Driven Development
Data Quality Tool using Spark
BI with Druid & Tableau
PayPal Data & Scale
PayPal is more than a button
• Loyalty
• Faster conversion
• Reduction in cart abandonment
• Credit
• Customer acquisition
• APV lift
• Invoicing
• Offers
Across CBT, Mobile, In-Store and Online
PayPal Datasets
• Social Media
• Demographics
• Marketing Activity
• Email
• Application Logs
• Invoice
• Credit
• Reversals
• Disputes
• CBT
• Risk
• Consumer
• Merchants
• Partners
• Location
• Payment Products
• Transaction
• Spending
• PayPal operates one of the largest PRIVATE CLOUDS in the world*
• Petabytes of data*
• 42 markets
• 237M active customer accounts**
• 7.6 BILLION payments in 2017**
• 19M merchants
• ~600 payments/second at peak*

The power of our platform: dedicated to a customer-focused, strong-performance, highly scalable, continuously available PLATFORM.

• PayPal has one of the top five Kafka deployments in the world, handling over 200 billion messages per day
• PayPal operates one of the largest Hadoop deployments in the world: a 1,600-node Hadoop cluster with 230TB of memory and 78PB of storage, running 50,000 jobs per day
Merchant Use Case Review
Use Case Overview: INSIGHTS and MARKETING SOLUTIONS
• Help Merchants engage their customers with personalized shopping experiences
• Offers & Campaigns
• Shoppers Insights
• Revenue & transaction trends
• Cross-Border Insights
• Customer Shopping Segments
• Product performance
• Checkout Funnel
• Behavior analysis
• Measuring effectiveness
PAYPAL ANALYTICS.com: Merchant Data Platform
1. Fast processing platform crunching multi-terabytes of data
2. Scalable, highly available, low-latency serving platform
Technologies
• Processing
• Serving
• Movement
Merchant Analytics
• Merchant Data Platform
• Pre-aggregated cubes
• Denormalized schema
• Analytics
Data Pipeline
Data Pipeline Architecture
Data Sources → Data Ingestion (PayPal replication into the Data Lake) → Data Processing → Data Serving → Visualization (SQL clients, custom UI, web servers)
Learnings – Spark & HBase
Design Considerations for Spark: Best Practices Checklist
• Data Serialization: use the Kryo serializer with SparkConf, which is faster and more compact; tune the Kryo serializer buffer to hold large objects
• Garbage Collection: tuned concurrent abortable preclean time from 10s to 30s to push out stop-the-world GC
• Memory Management: avoided using executors with too much memory
• Parallelism: optimize the number of cores & partitions*
• Action-Transformation: minimize shuffles on join() by broadcasting the smaller collection; optimize wider transformations as much as possible*
• Caching & Persisting: used MEMORY_AND_DISK storage level for caching large datasets; clean up cached/persisted collections when they are no longer needed; repartition data before persisting to HDFS for better performance in downstream jobs
*Specific examples later
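As a sketch, the serialization settings in the checklist could look like this in a Spark properties file (the buffer size is an illustrative value, not from the deck):

```properties
spark.serializer=org.apache.spark.serializer.KryoSerializer
# Raise the max buffer so Kryo can hold large objects (value illustrative)
spark.kryoserializer.buffer.max=512m
```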
© 2018 PayPal Inc. Confidential and proprietary.
Learnings: Spark job failures with fetch exceptions and long shuffle read times
Observations
• Executors spend a long time on shuffle reads, then time out, terminate, and cause job failure
• Resource constraints on executor nodes cause the delays
Resolution: to address the memory constraints, tuned
1. the configuration from 200 executors × 4 cores to 400 executors × 2 cores
2. the executor memory allocation (reduced)
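A hedged sketch of that resolution, spreading the same work over more, smaller executors (the class name and memory value are placeholders, not from the deck):

```shell
# Before: 200 executors x 4 cores; after: 400 executors x 2 cores, less memory each
spark-submit \
  --num-executors 400 \
  --executor-cores 2 \
  --executor-memory 6g \
  --class com.example.MerchantJob merchant-job.jar
```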
Learnings: Parallelism for long-running jobs
Observations
• A series of left joins on large datasets caused shuffle exceptions: a single job joined each window of a time-series data source (7-day, 30-day, 60-day, 90-day and 180-day, with multiple partitions) to the other data sources, then unioned the results into Hive
Resolution
1. Split the work into small jobs (Job1 through Job5, one per window) and run them in parallel
2. Faster reprocessing and fail-fast jobs
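One way to sketch the split on the driver side, assuming a hypothetical runWindowJob wrapper around each window's Spark work (all names here are illustrative):

```scala
import scala.concurrent.{Await, Future}
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.duration._

// Hypothetical: each call runs one small job for a single time window
def runWindowJob(windowDays: Int): Unit = { /* read, join, write Hive partition */ }

// Launch the five window jobs in parallel; Future.sequence fails fast
// if any single window job throws
val windows = Seq(7, 30, 60, 90, 180)
val jobs = windows.map(w => Future(runWindowJob(w)))
Await.result(Future.sequence(jobs), 2.hours)
```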
Learnings: Tuning between Spark driver and executors
Observations
• The Spark driver was left with too many executor heartbeat requests to process, even after the job was complete
• YARN kills the Spark job after waiting on the driver to finish processing the heartbeats
Resolution
• The setting "spark.executor.heartbeatInterval" was set too low; increasing it to 50s fixed the issue
• Allocate more memory to the driver to handle overheads beyond the typical driver processes
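The corresponding properties, as a sketch (the memory values are illustrative, not from the deck):

```properties
# Fewer heartbeat messages for the driver to process; 50s fixed the issue
spark.executor.heartbeatInterval=50s
# Extra driver memory for overheads beyond typical driver work (values illustrative)
spark.driver.memory=8g
spark.driver.memoryOverhead=2g
```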
Learnings: Optimize joins for efficient use of cluster resources (memory, CPU, etc.)
Observation
• With the default of 200 shuffle partitions, the join stage (read Table 1 and Table 2, join, process) ran with too many tasks, causing performance overhead
Resolution
• Reduce the spark.sql.shuffle.partitions setting to a lower threshold
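For example (the value below is illustrative; the right threshold depends on data volume):

```properties
# Default is 200; a lower value avoids many tiny tasks in the join stage
spark.sql.shuffle.partitions=64
```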
Learnings: Optimize wide transformations
• Convert expensive left joins into combinations of lightweight joins and except/union operations
Example 1: Left outer join (T1: 7 billion rows, T2: 1 billion rows). Results of the sub-joins were being sent back to the driver, causing poor performance. The left join on T1.C1 = T2.C1, whose output splits into T2 IS NOT NULL and T2 IS NULL rows, was rewritten as an inner join (yielding the T2 IS NOT NULL rows) combined via except (yielding the T2 IS NULL rows).
Example 2: Left outer join with OR operators (T1 and T2: 25 million rows each). A left join on T1.C1 = T2.C1 OR T1.C2 = T2.C2 was rewritten as the union (T3) of two single-predicate left joins, one on T1.C1 = T2.C1 and one on T1.C2 = T2.C2.
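The two rewrites can be sketched in SQL (table and column names taken from the diagram; exact equivalence depends on duplicate and null handling, so treat this as the shape of the rewrite rather than a drop-in replacement):

```sql
-- Example 2: left join with OR predicates, as a union of single-predicate left joins
-- Original: SELECT ... FROM t1 LEFT JOIN t2 ON t1.c1 = t2.c1 OR t1.c2 = t2.c2
SELECT t1.*, t2.* FROM t1 LEFT JOIN t2 ON t1.c1 = t2.c1
UNION
SELECT t1.*, t2.* FROM t1 LEFT JOIN t2 ON t1.c2 = t2.c2;

-- Example 1: expensive left join split into its matched (T2 IS NOT NULL)
-- and unmatched (T2 IS NULL) halves via an inner join plus EXCEPT
SELECT t1.* FROM t1 JOIN t2 ON t1.c1 = t2.c1              -- matched rows
UNION ALL
(SELECT * FROM t1
 EXCEPT
 SELECT t1.* FROM t1 JOIN t2 ON t1.c1 = t2.c1);           -- unmatched rows
```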
Learnings: Optimize throughput for the HBase–Spark connection
Observations
• Batch puts and gets were slow due to overloaded HBase connections
• Since our HBase rows were wide, HBase operations for partitions containing larger groups were slow
Resolution
• Implemented a sliding window for HBase operations to reduce HBase connection overload: repartition the RDD, then for each RDD partition, perform the HBase batch operation one sliding window at a time
Example (pseudocode):
val rePartitionedRDD: RDD[Event] = filledRDD.repartition(2000)
…
groupedEventRDD.mapPartitions { p =>
  p.sliding(2000, 2000).foreach { window =>
    // create HBase connection
    // batch HBase read or write for this window
    // close HBase connection
  }
  …
}
Tools & Utilities
Behavioral Driven Development
• While unit tests are more about the implementation, BDD puts the emphasis on the behavior of the code
• "Specifications" are written in pseudo-English
• Enables testing at the external touch-points of your application

Feature: Identify the activity related to an event
  Scenario: Should perform an iteration on events, join to the activity table, and identify the activity name
    Given I have a set of events
      |cookie_id:String|page_id:String|last_actvty:String|
      |263FHFBCBBCBV   |login_provide |review_next_page  |
      |HFFDJFLUFBFNJL  |home_page     |provide_credent   |
    And I have a Activity table
      |last_activity_id:String|activity_id:String|activity_name:String|
      |review_next_page       |1494300886856     |Reviewing Next Page |
      |provide_credent        |2323232323232     |Provide Credentials |
    When I implement Event Activity joins
    Then the final result is
      |cookie_id:String|activity_id:String|activity_name:String|
      |263FHFBCBBCBV   |1494300886856     |Reviewing Next Page |
      |HFFDJFLUFBFNJL  |2323232323232     |Provide Credentials |
import cucumber.api.scala.{EN, ScalaDsl}
import cucumber.api.DataTable
import org.scalatest.Matchers

Given("""^I have a set of events$""") { (data: DataTable) =>
  eventdataDF = dataTableToDataFrame(data)
}
Given("""^I have a Activity table$""") { (data: DataTable) =>
  activityDataDF = dataTableToDataFrame(data)
}
When("""^I implement Event Activity joins$""") { () =>
  eventActivityDF = Activity.findAct(eventdataDF, activityDataDF)
}
Then("""^the final result is$""") { (expectedData: DataTable) =>
  val expectedDF = dataTableToDataFrame(expectedData)
  val resultDF = eventActivityDF
  resultDF.except(expectedDF).count shouldBe 0
}
(pseudo code)
Data Quality Tool
1. Define the source query, target query, and test operation in a config file (in SQL format)
2. A Spark job takes the config, runs the test cases against the source tables and output table, and produces reports and alerts

Config example:
|Operation|Source Query|Target Query           |Key Column|
|Count    |Select c1 … |Select c1,c2,c3 ….     |C1        |
|Values   |Select c1…  |Select c1,c2,c3 from t1|C1        |

Quality operations: Count, Aggregation, Values, DuplicateRows, MissingInLookup

• Config-driven automated tool written in Spark for quality control
• Used extensively during functional testing of the application and, once live, as a quality check for our data pipeline
• Compares tables (schema-agnostic and at scale) for data validation, helping engineers troubleshoot effectively
Quality Tool Flow
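A minimal sketch of how the "Count" and "Values" operations from the config could run in Spark, assuming an active SparkSession and the two queries from a config row (all names here are hypothetical, not the tool's actual API):

```scala
import org.apache.spark.sql.SparkSession

// Hypothetical check for one config row: compare a source query to a target query
def runCheck(spark: SparkSession, sourceQuery: String, targetQuery: String): Boolean = {
  val sourceDF = spark.sql(sourceQuery)
  val targetDF = spark.sql(targetQuery)

  // "Count" operation: row counts must match
  val countsMatch = sourceDF.count() == targetDF.count()

  // "Values" operation: rows in source missing from target (works across tables
  // as long as both queries project comparable columns)
  val missingInTarget = sourceDF.except(targetDF).count()

  countsMatch && missingInTarget == 0 // failures would feed reports/alerts
}
```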
Druid Integration with BI: Visualization at Scale
• Druid is an open-source time-series data store designed for sub-second queries on real-time and historical data; it is primarily used for business intelligence queries on event data*
• Traditional databases did not scale and perform well with Tableau dashboards (for many use cases)
• Enable Tableau dashboards with Druid as the serving platform
• A live connection from Tableau to Druid avoids being limited by storage at any layer
Architecture: our datasets are batch-ingested from Hadoop (HDFS) into the Druid cluster; historicals serve data segments backed by deep storage (HDFS), and the Druid broker exposes Druid SQL to visualization tools, SQL clients and custom apps.
*from http://Druid.io
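For illustration, the kind of Druid SQL a Tableau live connection or SQL client might issue through the broker (the datasource and column names are hypothetical):

```sql
SELECT merchant_id, SUM(txn_amount) AS total_volume
FROM merchant_metrics
WHERE __time >= CURRENT_TIMESTAMP - INTERVAL '7' DAY
GROUP BY merchant_id
ORDER BY total_volume DESC
LIMIT 100
```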
Conclusion
 Spark applications on YARN (Hortonworks distribution)
 Spark jobs were easy to write and performed well (though a little hard to troubleshoot)
 Spark–HBase optimization improved performance
 Pre-aggregated datasets served to Elasticsearch
 Lowest-granularity denormalized datasets pushed to Druid
 Behavior-Driven Development is a great add-on for product-backed applications
QUESTIONS?
Contenu connexe

Tendances

Introduction to Stream Processing
Introduction to Stream ProcessingIntroduction to Stream Processing
Introduction to Stream Processing
Guido Schmutz
 

Tendances (20)

Introduction to Stream Processing
Introduction to Stream ProcessingIntroduction to Stream Processing
Introduction to Stream Processing
 
0-60: Tesla's Streaming Data Platform ( Jesse Yates, Tesla) Kafka Summit SF 2019
0-60: Tesla's Streaming Data Platform ( Jesse Yates, Tesla) Kafka Summit SF 20190-60: Tesla's Streaming Data Platform ( Jesse Yates, Tesla) Kafka Summit SF 2019
0-60: Tesla's Streaming Data Platform ( Jesse Yates, Tesla) Kafka Summit SF 2019
 
Simplify and Scale Data Engineering Pipelines with Delta Lake
Simplify and Scale Data Engineering Pipelines with Delta LakeSimplify and Scale Data Engineering Pipelines with Delta Lake
Simplify and Scale Data Engineering Pipelines with Delta Lake
 
Apache Iceberg - A Table Format for Hige Analytic Datasets
Apache Iceberg - A Table Format for Hige Analytic DatasetsApache Iceberg - A Table Format for Hige Analytic Datasets
Apache Iceberg - A Table Format for Hige Analytic Datasets
 
A Thorough Comparison of Delta Lake, Iceberg and Hudi
A Thorough Comparison of Delta Lake, Iceberg and HudiA Thorough Comparison of Delta Lake, Iceberg and Hudi
A Thorough Comparison of Delta Lake, Iceberg and Hudi
 
Databricks Platform.pptx
Databricks Platform.pptxDatabricks Platform.pptx
Databricks Platform.pptx
 
Building an analytics workflow using Apache Airflow
Building an analytics workflow using Apache AirflowBuilding an analytics workflow using Apache Airflow
Building an analytics workflow using Apache Airflow
 
Introduction SQL Analytics on Lakehouse Architecture
Introduction SQL Analytics on Lakehouse ArchitectureIntroduction SQL Analytics on Lakehouse Architecture
Introduction SQL Analytics on Lakehouse Architecture
 
Simplify CDC Pipeline with Spark Streaming SQL and Delta Lake
Simplify CDC Pipeline with Spark Streaming SQL and Delta LakeSimplify CDC Pipeline with Spark Streaming SQL and Delta Lake
Simplify CDC Pipeline with Spark Streaming SQL and Delta Lake
 
DataOps: An Agile Method for Data-Driven Organizations
DataOps: An Agile Method for Data-Driven OrganizationsDataOps: An Agile Method for Data-Driven Organizations
DataOps: An Agile Method for Data-Driven Organizations
 
ODSC May 2019 - The DataOps Manifesto
ODSC May 2019 - The DataOps ManifestoODSC May 2019 - The DataOps Manifesto
ODSC May 2019 - The DataOps Manifesto
 
Internal Hive
Internal HiveInternal Hive
Internal Hive
 
Data Lake Overview
Data Lake OverviewData Lake Overview
Data Lake Overview
 
Unlocking the Power of Lakehouse Architectures with Apache Pulsar and Apache ...
Unlocking the Power of Lakehouse Architectures with Apache Pulsar and Apache ...Unlocking the Power of Lakehouse Architectures with Apache Pulsar and Apache ...
Unlocking the Power of Lakehouse Architectures with Apache Pulsar and Apache ...
 
Data Lakehouse, Data Mesh, and Data Fabric (r1)
Data Lakehouse, Data Mesh, and Data Fabric (r1)Data Lakehouse, Data Mesh, and Data Fabric (r1)
Data Lakehouse, Data Mesh, and Data Fabric (r1)
 
Flink Streaming
Flink StreamingFlink Streaming
Flink Streaming
 
Scalability, Availability & Stability Patterns
Scalability, Availability & Stability PatternsScalability, Availability & Stability Patterns
Scalability, Availability & Stability Patterns
 
Disaster Recovery Options Running Apache Kafka in Kubernetes with Rema Subra...
 Disaster Recovery Options Running Apache Kafka in Kubernetes with Rema Subra... Disaster Recovery Options Running Apache Kafka in Kubernetes with Rema Subra...
Disaster Recovery Options Running Apache Kafka in Kubernetes with Rema Subra...
 
Monitoring using Prometheus and Grafana
Monitoring using Prometheus and GrafanaMonitoring using Prometheus and Grafana
Monitoring using Prometheus and Grafana
 
From Data Warehouse to Lakehouse
From Data Warehouse to LakehouseFrom Data Warehouse to Lakehouse
From Data Warehouse to Lakehouse
 

Similaire à PayPal merchant ecosystem using Apache Spark, Hive, Druid, and HBase

Spark as part of a Hybrid RDBMS Architecture-John Leach Cofounder Splice Machine
Spark as part of a Hybrid RDBMS Architecture-John Leach Cofounder Splice MachineSpark as part of a Hybrid RDBMS Architecture-John Leach Cofounder Splice Machine
Spark as part of a Hybrid RDBMS Architecture-John Leach Cofounder Splice Machine
Data Con LA
 
Apache Spark for RDBMS Practitioners: How I Learned to Stop Worrying and Lov...
 Apache Spark for RDBMS Practitioners: How I Learned to Stop Worrying and Lov... Apache Spark for RDBMS Practitioners: How I Learned to Stop Worrying and Lov...
Apache Spark for RDBMS Practitioners: How I Learned to Stop Worrying and Lov...
Databricks
 

Similaire à PayPal merchant ecosystem using Apache Spark, Hive, Druid, and HBase (20)

HKOSCon18 - Chetan Khatri - Scaling TB's of Data with Apache Spark and Scala ...
HKOSCon18 - Chetan Khatri - Scaling TB's of Data with Apache Spark and Scala ...HKOSCon18 - Chetan Khatri - Scaling TB's of Data with Apache Spark and Scala ...
HKOSCon18 - Chetan Khatri - Scaling TB's of Data with Apache Spark and Scala ...
 
Spark as part of a Hybrid RDBMS Architecture-John Leach Cofounder Splice Machine
Spark as part of a Hybrid RDBMS Architecture-John Leach Cofounder Splice MachineSpark as part of a Hybrid RDBMS Architecture-John Leach Cofounder Splice Machine
Spark as part of a Hybrid RDBMS Architecture-John Leach Cofounder Splice Machine
 
Pivotal Real Time Data Stream Analytics
Pivotal Real Time Data Stream AnalyticsPivotal Real Time Data Stream Analytics
Pivotal Real Time Data Stream Analytics
 
Big Data Analytics Platforms by KTH and RISE SICS
Big Data Analytics Platforms by KTH and RISE SICSBig Data Analytics Platforms by KTH and RISE SICS
Big Data Analytics Platforms by KTH and RISE SICS
 
Apache CarbonData+Spark to realize data convergence and Unified high performa...
Apache CarbonData+Spark to realize data convergence and Unified high performa...Apache CarbonData+Spark to realize data convergence and Unified high performa...
Apache CarbonData+Spark to realize data convergence and Unified high performa...
 
In-memory ColumnStore Index
In-memory ColumnStore IndexIn-memory ColumnStore Index
In-memory ColumnStore Index
 
PayPal datalake journey | teradata - edge of next | san diego | 2017 october ...
PayPal datalake journey | teradata - edge of next | san diego | 2017 october ...PayPal datalake journey | teradata - edge of next | san diego | 2017 october ...
PayPal datalake journey | teradata - edge of next | san diego | 2017 october ...
 
Deep Dive into Spark SQL with Advanced Performance Tuning with Xiao Li & Wenc...
Deep Dive into Spark SQL with Advanced Performance Tuning with Xiao Li & Wenc...Deep Dive into Spark SQL with Advanced Performance Tuning with Xiao Li & Wenc...
Deep Dive into Spark SQL with Advanced Performance Tuning with Xiao Li & Wenc...
 
Apache Spark for RDBMS Practitioners: How I Learned to Stop Worrying and Lov...
 Apache Spark for RDBMS Practitioners: How I Learned to Stop Worrying and Lov... Apache Spark for RDBMS Practitioners: How I Learned to Stop Worrying and Lov...
Apache Spark for RDBMS Practitioners: How I Learned to Stop Worrying and Lov...
 
Nike tech talk.2
Nike tech talk.2Nike tech talk.2
Nike tech talk.2
 
Best Practices for Building and Deploying Data Pipelines in Apache Spark
Best Practices for Building and Deploying Data Pipelines in Apache SparkBest Practices for Building and Deploying Data Pipelines in Apache Spark
Best Practices for Building and Deploying Data Pipelines in Apache Spark
 
New Developments in Spark
New Developments in SparkNew Developments in Spark
New Developments in Spark
 
Big Data with SQL Server
Big Data with SQL ServerBig Data with SQL Server
Big Data with SQL Server
 
The Future of Hadoop: A deeper look at Apache Spark
The Future of Hadoop: A deeper look at Apache SparkThe Future of Hadoop: A deeper look at Apache Spark
The Future of Hadoop: A deeper look at Apache Spark
 
Realtime Analytics on AWS
Realtime Analytics on AWSRealtime Analytics on AWS
Realtime Analytics on AWS
 
SnappyData overview NikeTechTalk 11/19/15
SnappyData overview NikeTechTalk 11/19/15SnappyData overview NikeTechTalk 11/19/15
SnappyData overview NikeTechTalk 11/19/15
 
SQL Engines for Hadoop - The case for Impala
SQL Engines for Hadoop - The case for ImpalaSQL Engines for Hadoop - The case for Impala
SQL Engines for Hadoop - The case for Impala
 
Artur Borycki - Beyond Lambda - how to get from logical to physical - code.ta...
Artur Borycki - Beyond Lambda - how to get from logical to physical - code.ta...Artur Borycki - Beyond Lambda - how to get from logical to physical - code.ta...
Artur Borycki - Beyond Lambda - how to get from logical to physical - code.ta...
 
Cloud-native Semantic Layer on Data Lake
Cloud-native Semantic Layer on Data LakeCloud-native Semantic Layer on Data Lake
Cloud-native Semantic Layer on Data Lake
 
ADV Slides: When and How Data Lakes Fit into a Modern Data Architecture
ADV Slides: When and How Data Lakes Fit into a Modern Data ArchitectureADV Slides: When and How Data Lakes Fit into a Modern Data Architecture
ADV Slides: When and How Data Lakes Fit into a Modern Data Architecture
 

Plus de DataWorks Summit

HBase Global Indexing to support large-scale data ingestion at Uber
HBase Global Indexing to support large-scale data ingestion at UberHBase Global Indexing to support large-scale data ingestion at Uber
HBase Global Indexing to support large-scale data ingestion at Uber
DataWorks Summit
 
Security Framework for Multitenant Architecture
Security Framework for Multitenant ArchitectureSecurity Framework for Multitenant Architecture
Security Framework for Multitenant Architecture
DataWorks Summit
 
Computer Vision: Coming to a Store Near You
Computer Vision: Coming to a Store Near YouComputer Vision: Coming to a Store Near You
Computer Vision: Coming to a Store Near You
DataWorks Summit
 

Plus de DataWorks Summit (20)

Data Science Crash Course
Data Science Crash CourseData Science Crash Course
Data Science Crash Course
 
Floating on a RAFT: HBase Durability with Apache Ratis
Floating on a RAFT: HBase Durability with Apache RatisFloating on a RAFT: HBase Durability with Apache Ratis
Floating on a RAFT: HBase Durability with Apache Ratis
 
Tracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFi
Tracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFiTracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFi
Tracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFi
 
HBase Tales From the Trenches - Short stories about most common HBase operati...
HBase Tales From the Trenches - Short stories about most common HBase operati...HBase Tales From the Trenches - Short stories about most common HBase operati...
HBase Tales From the Trenches - Short stories about most common HBase operati...
 
Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...
Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...
Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...
 
Managing the Dewey Decimal System
Managing the Dewey Decimal SystemManaging the Dewey Decimal System
Managing the Dewey Decimal System
 
Practical NoSQL: Accumulo's dirlist Example
Practical NoSQL: Accumulo's dirlist ExamplePractical NoSQL: Accumulo's dirlist Example
Practical NoSQL: Accumulo's dirlist Example
 
HBase Global Indexing to support large-scale data ingestion at Uber
HBase Global Indexing to support large-scale data ingestion at UberHBase Global Indexing to support large-scale data ingestion at Uber
HBase Global Indexing to support large-scale data ingestion at Uber
 
Scaling Cloud-Scale Translytics Workloads with Omid and Phoenix
Scaling Cloud-Scale Translytics Workloads with Omid and PhoenixScaling Cloud-Scale Translytics Workloads with Omid and Phoenix
Scaling Cloud-Scale Translytics Workloads with Omid and Phoenix
 
Building the High Speed Cybersecurity Data Pipeline Using Apache NiFi
Building the High Speed Cybersecurity Data Pipeline Using Apache NiFiBuilding the High Speed Cybersecurity Data Pipeline Using Apache NiFi
Building the High Speed Cybersecurity Data Pipeline Using Apache NiFi
 
Supporting Apache HBase : Troubleshooting and Supportability Improvements
Supporting Apache HBase : Troubleshooting and Supportability ImprovementsSupporting Apache HBase : Troubleshooting and Supportability Improvements
Supporting Apache HBase : Troubleshooting and Supportability Improvements
 
Security Framework for Multitenant Architecture
Security Framework for Multitenant ArchitectureSecurity Framework for Multitenant Architecture
Security Framework for Multitenant Architecture
 
Presto: Optimizing Performance of SQL-on-Anything Engine
Presto: Optimizing Performance of SQL-on-Anything EnginePresto: Optimizing Performance of SQL-on-Anything Engine
Presto: Optimizing Performance of SQL-on-Anything Engine
 
Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...
Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...
Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...
 
Extending Twitter's Data Platform to Google Cloud
Extending Twitter's Data Platform to Google CloudExtending Twitter's Data Platform to Google Cloud
Extending Twitter's Data Platform to Google Cloud
 
Event-Driven Messaging and Actions using Apache Flink and Apache NiFi
Event-Driven Messaging and Actions using Apache Flink and Apache NiFiEvent-Driven Messaging and Actions using Apache Flink and Apache NiFi
Event-Driven Messaging and Actions using Apache Flink and Apache NiFi
 
Securing Data in Hybrid on-premise and Cloud Environments using Apache Ranger
Securing Data in Hybrid on-premise and Cloud Environments using Apache RangerSecuring Data in Hybrid on-premise and Cloud Environments using Apache Ranger
Securing Data in Hybrid on-premise and Cloud Environments using Apache Ranger
 
Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...
Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...
Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...
 
Computer Vision: Coming to a Store Near You
Computer Vision: Coming to a Store Near YouComputer Vision: Coming to a Store Near You
Computer Vision: Coming to a Store Near You
 
Big Data Genomics: Clustering Billions of DNA Sequences with Apache Spark
Big Data Genomics: Clustering Billions of DNA Sequences with Apache SparkBig Data Genomics: Clustering Billions of DNA Sequences with Apache Spark
Big Data Genomics: Clustering Billions of DNA Sequences with Apache Spark
 

Dernier

Architecting Cloud Native Applications
Architecting Cloud Native ApplicationsArchitecting Cloud Native Applications
Architecting Cloud Native Applications
WSO2
 
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
?#DUbAI#??##{{(☎️+971_581248768%)**%*]'#abortion pills for sale in dubai@
 

Dernier (20)

Apidays Singapore 2024 - Modernizing Securities Finance by Madhu Subbu
Apidays Singapore 2024 - Modernizing Securities Finance by Madhu SubbuApidays Singapore 2024 - Modernizing Securities Finance by Madhu Subbu
Apidays Singapore 2024 - Modernizing Securities Finance by Madhu Subbu
 
Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...
 
Ransomware_Q4_2023. The report. [EN].pdf
Ransomware_Q4_2023. The report. [EN].pdfRansomware_Q4_2023. The report. [EN].pdf
Ransomware_Q4_2023. The report. [EN].pdf
 
Polkadot JAM Slides - Token2049 - By Dr. Gavin Wood
Polkadot JAM Slides - Token2049 - By Dr. Gavin WoodPolkadot JAM Slides - Token2049 - By Dr. Gavin Wood
Polkadot JAM Slides - Token2049 - By Dr. Gavin Wood
 
Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...
Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...
Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...
 
Architecting Cloud Native Applications
Architecting Cloud Native ApplicationsArchitecting Cloud Native Applications
Architecting Cloud Native Applications
 
presentation ICT roal in 21st century education
presentation ICT roal in 21st century educationpresentation ICT roal in 21st century education
presentation ICT roal in 21st century education
 
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
 
"I see eyes in my soup": How Delivery Hero implemented the safety system for ...
"I see eyes in my soup": How Delivery Hero implemented the safety system for ..."I see eyes in my soup": How Delivery Hero implemented the safety system for ...
"I see eyes in my soup": How Delivery Hero implemented the safety system for ...
 
TrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data DiscoveryTrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
 
Data Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonData Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt Robison
 
DBX First Quarter 2024 Investor Presentation
DBX First Quarter 2024 Investor PresentationDBX First Quarter 2024 Investor Presentation
DBX First Quarter 2024 Investor Presentation
 
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
 
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
 
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
 
A Beginners Guide to Building a RAG App Using Open Source Milvus
A Beginners Guide to Building a RAG App Using Open Source MilvusA Beginners Guide to Building a RAG App Using Open Source Milvus
A Beginners Guide to Building a RAG App Using Open Source Milvus
 
Corporate and higher education May webinar.pptx
Corporate and higher education May webinar.pptxCorporate and higher education May webinar.pptx
Corporate and higher education May webinar.pptx
 
Boost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfBoost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdf
 
AWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of TerraformAWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of Terraform
 
Strategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a FresherStrategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a Fresher
 

PayPal merchant ecosystem using Apache Spark, Hive, Druid, and HBase

  • 1. PayPal Merchant ecosystem using Spark, Hive, Druid, HBase & Elasticsearch
  • 2. 2 Who we are? Deepika Khera Kasi Natarajan • Big Data Technologist for over a decade • Focused on building scalable platforms with Hadoop ecosystem – Map Reduce, HBase, Spark, Elasticsearch, Druid • Senior Engineering Manager - Merchant Analytics at PayPal • Contributed to Druid for the Spark Streaming integration • 15+ years of industry experience • Spark Engineer @PayPal Merchant Analytics • Building solutions using Apache Spark, Scala, Hive, HBase, Druid and Spark ML. • Passionate about providing Analytics at scale from Big Data platforms
  • 3. 3 Agenda PayPal Data & Scale Merchant Use Case Review Data Pipeline Learnings - Spark & HBase Tools & Utilities Behavioral Driven Development Data Quality Tool using Spark BI with Druid & Tableau
  • 4. PayPal Data & Scale 4
  • 5. PayPal is more than a button: Loyalty, Faster Conversion, Reduction in Cart Abandonment, Credit, Customer Acquisition, APV Lift, Invoicing, Offers, CBT, Mobile, In-Store, Online.
  • 6. PayPal Datasets: Social Media, Demographics, Marketing Activity, Email, Application Logs, Invoice, Credit, Reversals, Disputes, CBT, Risk, Consumer, Merchants, Partners, Location, Payment Products, Transaction, Spending.
  • 7. The power of our platform. PayPal operates one of the largest private clouds in the world*: petabytes of data*, 42 markets, 237M active customer accounts**, 7.6 billion payments in 2017**, 19M merchants, ~600 payments/second at peak*. Dedicated to our customers with a customer-focused, strong-performance, highly scalable, continuously available platform. PayPal has one of the top five Kafka deployments in the world, handling over 200 billion messages per day. PayPal operates one of the largest Hadoop deployments in the world: a 1,600-node Hadoop cluster with 230TB of memory and 78PB of storage, running 50,000 jobs per day.
  • 8. Merchant Use Case Review 8
  • 9. Use Case Overview. INSIGHTS and MARKETING SOLUTIONS: help merchants engage their customers with a personalized shopping experience; Offers & Campaigns; Shoppers Insights; Revenue & transaction trends; Cross-Border Insights; Customer Shopping Segments; Product performance; Checkout Funnel; Behavior analysis; Measuring effectiveness. PAYPAL ANALYTICS.com. Merchant Data Platform: (1) a fast processing platform crunching multi-terabytes of data; (2) a scalable, highly available, low-latency serving platform.
  • 13. Data Pipeline Architecture: Data Sources -> Data Ingestion (PayPal Replication) -> Data Lake -> Data Processing -> Data Serving (SQL) -> Visualization (Custom UI, Web Servers).
  • 14. Learnings – Spark & HBase 14
  • 15. Design Considerations for Spark: a best-practices checklist. Data Serialization: use the Kryo serializer with SparkConf, which is faster and more compact; tune the Kryo serializer buffer to hold large objects. Garbage Collection: tuned the concurrent abortable preclean time from 10 sec to 30 sec to push out stop-the-world GC. Memory Management: avoided using executors with too much memory. Parallelism: optimize the number of cores and partitions*. Action-Transformation: minimize shuffles on join() by broadcasting the smaller collection; optimize wider transformations as much as possible*. Caching & Persisting: clean up cached/persisted collections when they are no longer needed; used the MEMORY_AND_DISK storage level for caching large datasets; repartition data before persisting to HDFS for better performance in downstream jobs. *Specific examples later. © 2018 PayPal Inc. Confidential and proprietary.
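The serializer items in the checklist above map to a couple of Spark properties; a minimal illustrative fragment (values are examples, not the deck's actual settings):

```
# Illustrative spark-defaults entries (example values)
spark.serializer                org.apache.spark.serializer.KryoSerializer
# Raise the Kryo buffer ceiling so large objects serialize without overflow errors
spark.kryoserializer.buffer     1m
spark.kryoserializer.buffer.max 512m
```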
  • 16. Learnings: Spark job failures with fetch exceptions and long shuffle read times. Observations: an executor spends a long time on shuffle reads, then times out and terminates, resulting in job failure; resource constraints on executor nodes caused delays. Resolution: to address memory constraints, tuned (1) the configuration from 200 executors * 4 cores to 400 executors * 2 cores, and (2) the executor memory allocation (reduced). © 2018 PayPal Inc. Confidential and proprietary.
  • 17. Learnings: Hive parallelism for long-running jobs. Observation: a series of left joins on large datasets caused shuffle exceptions. Resolution: (1) split into small jobs and run them in parallel; (2) faster reprocessing and fail-fast jobs. The time-series data source (7day, 30day, 60day, 90day, 180day) is joined with the other data sources in separate jobs (Job1 through Job5), and the results are unioned into Hive (multiple partitions). © 2018 PayPal Inc. Confidential and proprietary.
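The job split above can be sketched in illustrative Python (the real pipeline runs separate Spark/Hive jobs per window): each time window is processed independently and the per-window partitions are unioned at the end, so a failure only requires reprocessing one window.

```python
from concurrent.futures import ThreadPoolExecutor

WINDOWS = ["7day", "30day", "60day", "90day", "180day"]

def process_window(window):
    # Stand-in for the per-window join job; in the real pipeline this
    # is a separate Spark/Hive job writing its own Hive partition.
    return [(window, "aggregated_result")]

def run_all(windows):
    # Run each window's job in parallel, then union the partitions,
    # mirroring the "split into small jobs" resolution above.
    with ThreadPoolExecutor(max_workers=len(windows)) as pool:
        parts = pool.map(process_window, windows)
    return [row for part in parts for row in part]
```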
  • 18. Learnings: tuning between the Spark driver and executors. Observations: the Spark driver was left with too many heartbeat requests to process even after the job was complete; YARN kills the Spark job after waiting on the driver to finish processing the heartbeats. Resolution: the setting "spark.executor.heartbeatInterval" was set too low; increasing it to 50s fixed the issue. Allocate more memory to the driver to handle overheads other than the typical driver processes. (Diagram: executors send heartbeats to the driver; the YARN RM waits on the driver.) © 2018 PayPal Inc. Confidential and proprietary.
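The heartbeat and driver tuning above as illustrative spark-submit flags (the memory values are examples, not the deck's actual configuration):

```
# Illustrative flags; memory values are examples
--conf spark.executor.heartbeatInterval=50s
--conf spark.driver.memoryOverhead=2g
--driver-memory 8g
```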
  • 19. Learnings: optimize joins for efficient use of cluster resources (memory, CPU, etc.). Observation: with the default of 200 shuffle partitions, the join stage ran with too many tasks, causing performance overhead. Resolution: reduce the spark.sql.shuffle.partitions setting to a lower threshold. (Stages: read Table 1, read Table 2, start the join process.) © 2018 PayPal Inc. Confidential and proprietary.
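The fix above is a one-line setting; an illustrative value (the right number depends on data volume and cluster cores):

```
# Illustrative: lower the shuffle-partition count from the 200 default
--conf spark.sql.shuffle.partitions=64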
  • 20. Learnings: optimize wide transformations. Observation: results of the sub-joins were being sent back to the driver, causing poor performance. Resolution: convert expensive left joins into a combination of lightweight join and except/union. Example 1: T1 (1 billion rows) left join T2 (7 billion rows) filtered to "T2 is NULL" was rewritten as T1 join T2 where T2 is NOT NULL, then except against T1, to produce the "T2 is NULL" set. Example 2: a left outer join with OR operators (T1 left join T2 on T1.C1 = T2.C1 OR T1.C2 = T2.C2, about 25 million rows per side) was rewritten as a union of two single-condition left joins (one on T1.C1 = T2.C1, one on T1.C2 = T2.C2, with T3 handled similarly). © 2018 PayPal Inc. Confidential and proprietary.
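The first rewrite above (a left join filtered to "T2 is NULL" replaced by a join plus except) can be illustrated with plain Python sets. This is a sketch of the set logic on keys, not the actual Spark SQL:

```python
def anti_join_via_null_filter(t1_keys, t2_keys):
    # "T1 left join T2 where T2 is NULL": T1 keys with no match in T2.
    return {k for k in t1_keys if k not in t2_keys}

def anti_join_via_except(t1_keys, t2_keys):
    # Rewrite: the matched keys (inner join) excepted from T1.
    matched = t1_keys & t2_keys
    return t1_keys - matched
```

Both produce identical key sets; the point of the rewrite in Spark is that the lighter join-plus-except plan avoids the expensive sub-join behavior observed above.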
  • 21. Learnings: optimize throughput for the HBase Spark connection. Observations: batch puts and gets were slow due to overloaded HBase connections; since our HBase rows were wide, HBase operations for partitions containing larger groups were slow. Resolution: implemented a sliding window for HBase operations to reduce HBase connection overload. Flow: repartition the RDD; for each RDD partition, for each sliding window, perform the HBase batch operation. Example:
val rePartitionedRDD: RDD[Event] = filledRDD.repartition(2000)
...
groupedEventRDD.mapPartitions { p =>
  p.sliding(2000, 2000).foreach { window =>
    // create HBase connection, batch HBase read or write, close HBase connection
  }
  ...
}
© 2018 PayPal Inc. Confidential and proprietary.
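The sliding-window idea above, sketched generically in Python (the connection handling is a stand-in for the real HBase client; batch size 2000 mirrors the Scala sliding(2000, 2000)):

```python
def batched(iterable, size):
    """Yield successive fixed-size batches, like Scala's sliding(size, size)."""
    batch = []
    for item in iterable:
        batch.append(item)
        if len(batch) == size:
            yield batch
            batch = []
    if batch:  # final partial batch
        yield batch

def write_partition(rows, batch_size=2000):
    # Per window: open a connection, write the batch, close the connection,
    # so no single connection is held open across the whole partition.
    written = 0
    for window in batched(rows, batch_size):
        # conn = open_hbase_connection()   (hypothetical client call)
        written += len(window)
        # conn.close()
    return written
```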
  • 23. Behavioral Driven Development. While unit tests are more about the implementation, BDD emphasizes the behavior of the code; "specifications" are written in pseudo-English; this enables testing at the external touch-points of your application.
Feature: Identify the activity related to an event
Scenario: Should perform an iteration on events, join to the activity table, and identify the activity name
Given I have a set of events
|cookie_id:String |page_id:String|last_actvty:String|
|263FHFBCBBCBV |login_provide |review_next_page|
|HFFDJFLUFBFNJL|home_page |provide_credent|
And I have an Activity table
|last_activity_id:String|activity_id:String|activity_name:String|
|review_next_page | 1494300886856 |Reviewing Next Page |
|provide_credent | 2323232323232 |Provide Credentials |
When I implement Event Activity joins
Then the final result is
|cookie_id:String |activity_id:String|activity_name:String|
|263FHFBCBBCBV | 1494300886856 |Reviewing Next Page |
|HFFDJFLUFBFNJL | 2323232323232 |Provide Credentials |
Step definitions (pseudo code):
import cucumber.api.scala.{EN, ScalaDsl}
import cucumber.api.DataTable
import org.scalatest.Matchers

Given("""^I have a set of events$""") { (data: DataTable) =>
  eventdataDF = dataTableToDataFrame(data)
}
Given("""^I have an Activity table$""") { (data: DataTable) =>
  activityDataDF = dataTableToDataFrame(data)
}
When("""^I implement Event Activity joins$""") { () =>
  eventActivityDF = Activity.findAct(eventdataDF, activityDataDF)
}
Then("""^the final result is$""") { (expectedData: DataTable) =>
  val expectedDF = dataTableToDataFrame(expectedData)
  eventActivityDF.except(expectedDF).count shouldBe 0
}
© 2018 PayPal Inc. Confidential and proprietary.
  • 24. Data Quality Tool. A config-driven automated tool written in Spark for quality control; used extensively during functional testing of the application and, once live, as a quality check for our data pipeline; includes a feature to compare tables (schema-agnostic and at scale) for data validation, helping engineers troubleshoot effectively. Quality Tool Flow: (1) define the source query, target query, and test operation in a config file (source tables, output table); (2) a Spark job takes the config and runs the test cases, producing reports and alerts. Quality operations: Count, Aggregation, DuplicateRows, MissingInLookup, Values. Config in SQL format, e.g.:
|Operation|Source Query |Target Query |Key Column|
|Count |Select c1 ...|Select c1,c2,c3 ... |C1 |
|Values |Select c1 ...|Select c1,c2,c3 from t1|C1 |
© 2018 PayPal Inc. Confidential and proprietary.
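A minimal sketch of the Count and Values operations (illustrative Python; the real tool runs the configured Spark SQL queries): compare the source and target result sets keyed on a column and report mismatches.

```python
def counts_match(source_rows, target_rows):
    # "Count" operation: do source and target return the same row count?
    return len(source_rows) == len(target_rows)

def compare_values(source_rows, target_rows, key_idx=0):
    # "Values" operation: key both sides on a column and report
    # keys that are missing in the target or whose rows differ.
    src = {r[key_idx]: r for r in source_rows}
    tgt = {r[key_idx]: r for r in target_rows}
    return {
        "mismatched": sorted(k for k in src if k in tgt and tgt[k] != src[k]),
        "missing_in_target": sorted(k for k in src if k not in tgt),
    }
```

In the real tool this report is rolled up by key columns such as product and geography, per the editor's notes.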
  • 25. Druid Integration with BI: visualization at scale. Druid is an open-source time-series data store designed for sub-second queries on real-time and historical data; it is primarily used for business intelligence queries on event data (*from http://Druid.io). Traditional databases did not scale and perform with Tableau dashboards (for many use cases); we enable Tableau dashboards with Druid as the serving platform; a live connection from Tableau to Druid avoids being limited by storage at any layer. Flow: batch ingestion from our datasets in Hadoop HDFS into the Druid cluster (historicals serving data segments, deep storage on HDFS); the Druid broker exposes Druid SQL to SQL clients, custom apps, and visualization. © 2018 PayPal Inc. Confidential and proprietary.
  • 26. Conclusion. Spark applications on YARN (Hortonworks distribution). Spark jobs were easy to write and had excellent performance (though a little hard to troubleshoot). Spark-HBase optimization improved performance. Pre-aggregated datasets were pushed to Elasticsearch; the lowest-granularity denormalized datasets were pushed to Druid. Behavior Driven Development is a great add-on for product-backed applications. © 2018 PayPal Inc. Confidential and proprietary.

Editor's notes

  1. Today PayPal is much more than a button on a website. We have an extensive portfolio of products and services. Enabling CBT, easy mobile and web access, credit options for customers, marketing solutions for merchants, and much more helps merchants grow their business and enables safe digital commerce for customers.
  2. All of these also translate into a rich set of data that PayPal has to inform strategic and operational decisions
  3. Concurrent Mark Sweep: if it doesn't finish garbage collection, it starts a stop-the-world GC. Tuned CMSMaxAbortablePrecleanTime from 10 seconds to 30 seconds.
  4. https://community.hortonworks.com/questions/44950/spark-memory-issue.html (org.apache.spark.shuffle.MetadataFetchFailedException). Running this job with 4 cores and 200 executors. Although there could be multiple reasons for the delay, like skew in the data, for us it turned out that the datanode the executor was running on was busy; a lot of the time this happened on nodes with limited capacity. Having more tasks per executor theoretically puts more pressure on the executor, so where there are memory constraints the chances of an executor failure increase; MetadataFetchFailed usually happens due to executor failure or executor termination.
  5. https://engineering.paypalcorp.com/confluence/display/EDS/Muse+Visitor+Count+Job+Split+Design
  6. Combinations
  7. Points: the tool was completely customizable for each project; it was built to be schema-agnostic and scalable to run on datasets of large size; a report was generated on match/mismatch counts by key columns like product and geography, as needed.