As one of the few closed-loop payment platforms, PayPal is uniquely positioned to provide merchants with insights aimed at identifying opportunities to grow and manage their business. PayPal processes billions of data events every day across our users, risk, payments, web behavior and identity. We are motivated to use this data to build solutions that help our merchants maximize the number of successful transactions (checkout conversion), better understand who their customers are, and find opportunities to grow and attract new customers.
As part of the Merchant Data Analytics team, we have built a platform that serves low-latency, scalable analytics and insights by leveraging established and emerging platforms to best realize returns on the many business objectives at PayPal.
Join us to learn how we leveraged platforms and technologies like Spark, Hive, Druid, Elasticsearch and HBase to process large-scale data for impactful merchant solutions. We'll share the architecture of our data pipelines, some real dashboards, and the challenges involved.
Speakers
Kasiviswanathan Natarajan, Member of Technical Staff, PayPal
Deepika Khera, Senior Manager - Merchant Data Analytics, PayPal
Who we are
Deepika Khera
• Big Data Technologist for over a decade
• Focused on building scalable platforms with the Hadoop ecosystem – MapReduce, HBase, Spark, Elasticsearch, Druid
• Senior Engineering Manager - Merchant Analytics at PayPal
• Contributed to Druid for the Spark Streaming integration

Kasi Natarajan
• 15+ years of industry experience
• Spark Engineer @PayPal Merchant Analytics
• Building solutions using Apache Spark, Scala, Hive, HBase, Druid and Spark ML
• Passionate about providing analytics at scale from Big Data platforms
Agenda
PayPal Data & Scale
Merchant Use Case Review
Data Pipeline
Learnings - Spark & HBase
Tools & Utilities
Behavioral Driven Development
Data Quality Tool using Spark
BI with Druid & Tableau
PayPal is more than a button
Loyalty | Faster Conversion | Reduction in Cart Abandonment | Credit | Customer Acquisition | APV Lift | Invoicing | Offers (CBT, Mobile, In-Store, Online)
PayPal operates one of the largest PRIVATE CLOUDS in the world*
42 petabytes of data*
237M active customer accounts**
7.6 BILLION payments in 2017**
19M merchants
~600 payments/second at peak*
The power of our platform: dedicated to a customer-focused, high-performance, highly scalable, continuously available PLATFORM.
PayPal has one of the top five Kafka
deployments in the world, handling over
200 billion messages per day
PayPal operates one of the largest Hadoop deployments in the world: a 1,600-node Hadoop cluster with 230TB of memory and 78PB of storage, running 50,000 jobs per day.
Today PayPal is much more than a button on a website. We have an extensive portfolio of products and services. Enabling CBT, easy mobile and web access, credit options for customers, marketing solutions for merchants, and much more helps merchants grow their business and enables safe digital commerce for customers.
All of this also translates into a rich set of data that PayPal uses to inform strategic and operational decisions.
Concurrent Mark Sweep – if concurrent garbage collection doesn't finish in time, it falls back to a stop-the-world collection. We tuned CMSMaxAbortablePrecleanTime from 10 seconds to 30 seconds.
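The GC tuning above can be sketched as Spark executor JVM options; this is an illustrative config fragment, not the exact settings from the talk, and note that CMSMaxAbortablePrecleanTime is specified in milliseconds:

```
# spark-defaults.conf sketch (illustrative): enable CMS on executors and raise
# the abortable-preclean timeout from the 10s default to 30s (value in ms)
spark.executor.extraJavaOptions  -XX:+UseConcMarkSweepGC -XX:CMSMaxAbortablePrecleanTime=30000
```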
https://community.hortonworks.com/questions/44950/spark-memory-issue.html
org.apache.spark.shuffle.MetadataFetchFailedException
We were running this job with 4 cores per executor and 200 executors.
Although there can be multiple reasons for the delay, such as skew in the data, in our case it turned out that the DataNode the executor was running on was busy; a lot of the time this happened on nodes with limited capacity.
Having more tasks per executor theoretically puts more pressure on the executor, and where there are memory constraints the chance of an executor failure increases.
MetadataFetchFailedException usually happens due to executor failure or executor termination.
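The executor sizing described above can be sketched as a spark-submit invocation; this is a hedged example, and the memory value and job name are assumptions, not figures from the talk:

```
# Illustrative spark-submit sketch: 200 executors with 4 cores each, as in the
# job above. Fewer cores (tasks) per executor reduces memory pressure per JVM.
spark-submit \
  --num-executors 200 \
  --executor-cores 4 \
  --conf spark.executor.memory=8g \
  merchant_analytics_job.jar
```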
Points
• The tool was fully customizable for each project
• The tool was built to be schema-agnostic and to scale to large datasets
• Reports were generated on match/mismatch counts by key columns, such as Product and Geography, as needed
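The match/mismatch report logic above can be sketched in plain Python; the real tool ran on Spark and was schema-agnostic, so the column names, row shapes, and function name here are illustrative assumptions only:

```python
from collections import defaultdict

def mismatch_report(source_rows, target_rows, key_cols, compare_cols):
    """Count matching vs. mismatching rows between two datasets,
    grouped by the given key columns (e.g. product and geography)."""
    def key(row):
        return tuple(row[c] for c in key_cols)

    # Index the target dataset by key for O(1) lookups
    target_by_key = {key(r): r for r in target_rows}

    report = defaultdict(lambda: {"match": 0, "mismatch": 0})
    for row in source_rows:
        k = key(row)
        other = target_by_key.get(k)
        # A row matches only if the key exists in the target and all
        # compared columns agree
        matched = other is not None and all(row[c] == other[c] for c in compare_cols)
        report[k]["match" if matched else "mismatch"] += 1
    return dict(report)

source = [
    {"product": "wallet", "geo": "US", "txn_count": 10},
    {"product": "wallet", "geo": "UK", "txn_count": 7},
]
target = [
    {"product": "wallet", "geo": "US", "txn_count": 10},
    {"product": "wallet", "geo": "UK", "txn_count": 9},  # differs from source
]
report = mismatch_report(source, target, ["product", "geo"], ["txn_count"])
```

In the Spark version the same shape would typically be a join on the key columns followed by a groupBy/count on a match flag; the plain-Python version just makes the comparison logic explicit.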