Spark Summit Keynote by Seshu Adunuthula

Enterprise Data Platform and role
of Apache Spark
Seshu Adunuthula (@SeshuAd)

42%
GMV VIA MOBILE
291M
MOBILE APP DOWNLOADS
GLOBALLY
1.3B
LISTINGS CREATED
VIA MOBILE EVERY WEEK
162M
ACTIVE BUYERS
25M
ACTIVE SELLERS
800M
ACTIVE LISTINGS
8.8M
NEW LISTINGS
EVERY WEEK
Q4 2015 Q4 2015
eBay Marketplace

*Q3 2015 data
Items sold are
new
79%
Items are fixed
price
84%
Evolving from our Auction Roots..
Transactions offer
free shipping
63%
Data is eBay’s
Most Important
Asset

Data Sciences
Personalization
User propensity modeling based on 5 quarters
behavioral data. Cluster based unstructured data.
User (Badges, Activity Synopsis, Word Cloud),
Deals Personalization
Merchandising
Similar items - recommending similar items on
key placements on site and mobile. Powered
among other things by co-clicked items.
Structured Data
Deal discovery leverages machine learning
model.
Trust
Predictive machine learning models for fraud
prevention, account take-over, prevent loss, bad
buying experience prevention, buyer/seller risk
prediction
Shipping
Delivery experience: Building a model to predict
more accurate and shorter delivery estimates.
BI & Analytical Apps
Search Backend
950M items/ ~15TB indexes in 2.5 hours on a
daily basis.
2M item/11TB index subsets generated near
real-time in 3 minutes
Search Sciences
Search ranking/spam/recall factors or data (like
phrase table, query-rewrite table, etc.)
preparation, including many pipelines built on
top of search Scala platform
s
Structured Data
Convert 800M Item listings into Product pages
that are automatically curated and persisted
Traffic – Paid Search
10% efficiency lift for paid search (Amber model)
Identify low performing items (Google search)
Data Preparation.
Bot detection, data transformation, sessionization
for user behavior and core data sets
Experimentation
A/B Testing for new features and user
Experiences released on the site.
Behavioral Data
Search
SPD: Provides global B2C and C2C seller
performance overview
Nous, DNA: Provide self service product
experiences analysis on product health,
behavior analytics and product domain specific
reporting
System Services
Sherlock Monitoring: Applications logs (CAL
logs) are stored and processed on Hadoop.
Data is eBay’s DNA

ENTERPRISE DATA PLATFORM
9
Agile Data Warehouse Data Streams
Batch
Humans
Sets of data
Streams
Systems
Sets of data
Data Services
Services
Applications
Specific calls
Populated
Used by
How
Enterprise
Populated
Used by
How
Enterprise Data Platform

Structured Data
SQL
Interactive Relational Analysis
Semi Structured Data
Java/Scala…
Batch and Ad-Hoc Algorithmic Analysis
Relational Analysis
Programmatic Exploration
EDW Hadoop
10
Analysts/BU PM/Executives/Tools Analysts/Scientists/Tools Scientists/Engineers
Agile Data Warehouse

11
Simplify Access to DW
Cross Platform VDMs
Geo Distributed Caches
Data Virtualization

Apache Kylin: Extreme OLAP Engine
12
Cube Build
Engine
SQL
Low Latency -
Seconds
Mid Latency -
Minutes
Routing
3rd Party App
(Web App, Mobile…)
Metadata
SQL-Based Tool
(BI Tools: Tableau…)
Query Engine
Hadoop
Hive
REST API JDBC/ODBC
Star Schema Data Key Value Data
Data
Cube
OLAP
Cube
(HBase)
SQL
REST Server
MOLAP Cubes
ANSI SQL on Hadoop
Interactive Query on Billions
of Rows
Apache Kylin: Extreme OLAP Engine

Data Streams Platform
Apache Kafka
Behavior TXN User
Streaming
Apps
Sand
box
Stream Processing ClustersOpen
Sample
Streams Needs
Approval
Staging Pool1 Pool2
Rheos App
Manager
Configuration
Deployment
GitHub
Tor
a
ETL
Connectors
Hadoop Teradat
a
Pool3
Risk Real-time Representation
Of EDW Datasets
Centralized Shared Data
Streams
DW populated using the
Streams Platform
Data Streams Platform

Q3 2015
DataSearch&Discovery
CollaborativeAnalytics
DataGovernance
Data ServicesData Services

•Execution environment tailored for
Spark
•Governance of Big Data Apps with
a PaaS layer
•Application life cycle spanning
development to deployment
Spades – Spark Provisioning & Deployment Environment

Ingest
Servers
Listeners
AnalystsAnalytics Platform & DeliveryCAL
Application
Server
eBay Visitors
Application
servers
SingularityHadoopCentral App
Logging
BEHAVIORAL DATA PIPELINEBehavioral Data Pipeline

Behavior Data: A/B Testing
18
Behavioral Data: A/B Testing

Transactional Data Pipelines
19
• Initial pattern
• More data available faster on Hadoop
• Leverage Hadoop SQL / Spark
• Subsequent pattern
• >10% datasets available on Hadoop
• New innovation avenues via OpenSource
• Give humans more Teradata capacity
• Teradata data available no later than before
• <95% extract / daily batch
• >5% stream / frequent batch
Transactional Data Pipelines

Spark Summit Keynote by Seshu Adunuthula

Spark Summit Keynote by Seshu Adunuthula

Recommandé

Recommandé

Contenu connexe

Tendances

Tendances (20)

En vedette

En vedette (19)

Similaire à Spark Summit Keynote by Seshu Adunuthula

Similaire à Spark Summit Keynote by Seshu Adunuthula (20)

Plus de Spark Summit

Plus de Spark Summit (20)

Dernier

Dernier (20)

Spark Summit Keynote by Seshu Adunuthula

Notes de l'éditeur