Building data pipelines at Shopee with DEC

Rim Zaydullin (zaydullinr@seagroup.com) 
Platform Engineering Group (PEG), Shopee 2018
BUILDING DATA PIPELINES IN SHOPEE
WITH

WHY?
*LONG INTRO
Before diving in we need to
understand context and reasoning

As some of you are from the outside
the company, I need to give a bit more
details on how things work

So, bear with me

Behind the scenes of any internet company
Any projects begins with real life

And real life shows that every
company has a mess of a various
scale 
 
Separate parts or subsystems can be
very clean and pretty, but we never
stop our progress and even clean
systems deteriorate with time, due to
project evolution and new features,
nigher loads that require new
architectures, etc.

Engineers are the creators and the
cleaners of this mess, today we’ll talk
about cleaning-up

One example would be:

Shopee app We’re Shopee, we’re doing e-
commerse :D

People buy and sell stuﬀ, and when
they do, they have this useful info
numbers on their “orders” page

To ship, etc. Now, we found out that
having these numbers can cause
some nasty pain during sale events

Why?

CORE SERVER
Mobile app
&
Web clients
DB1 DB2 DB3
CORE SERVER*
*INSANELY SIMPLIFIEDVIEW
Some intro about core server

create transaction
commit transaction
query 
query 
query 
query 
query 
SLOW (locking) query 
query
query
CORE SERVER
Mobile app
&
Web clients
DB1 DB2 DB3
BOTTLENECKS!
When ppl buy stuff, number “to_ship”
changes for the seller.

All those numbers, to_ship, to_receive
(returns), etc are bunch of values in a
single row in a table.

When a lot of ppl buy stuff, this row
has many simultaneous updates
which leads to row locks, which leads
to transactions being timed out and
we have this avalanche effect

avalanche, when users can’t make a
purchase, they retry this whole big
transactions again and again, we
can’t serve new users, they
accumulate, everyone’s retrying to
make a purchase again and the whole
system is bought to a crawl

Shopee users are not happy, out DBA
are not happy, we gotta do something

CORE SERVER
Mobile app
&
Web clients
DB1 DB2 DB3
Let’s process slow (locking)
queries in background,
asynchronously
?
?
??
?
This info numbers are not absolutely
important in the big scheme of things,
they can be processed in background.

They can even be a bit delayed, it’s no
problem.

So, we need some new system
outside of core server, that could
handle these requests in background

DB3
Let’s process slow (locking)
queries in background,
asynchronously
?
?
??
?
This info numbers are not absolutely
important in the big scheme of things,
they can be processed in background.

They can even be a bit delayed, it’s no
problem.

So, we need some new system
outside of core server, that could
handle these requests in background

In fact we don’t need core server to
care about this logic at all. External
system could track buyer actions from
DB changes and update seller records
accordingly

Source
DB
Destination
DB
Magic Data
Pipeline??
Looks like we need something like
this? General solution

CORE SERVER
Mobile app
&
Web clients
SERVICE
redis queue A
redis queue B
transformation
server
DB1 DB2 DB3
CODE / INFRA BLOAT!
Let’s continue cleaning things up!Another
example!

Explain what’s going on.

It’s already outside the core server, but
requires core server to have additional code
(that needs support, monitoring and is not a
general solution)

External system can be a complicated mess
that’s reinvented over and over again by
diﬀerent teams

CORE SERVER
Mobile app
&
Web clients
DB1 DB2 DB3
CODE / INFRA BLOAT!

CORE SERVER
Mobile app
&
Web clients
DB1 DB2 DB3
CODE / INFRA BLOAT
This magic piece of infra is a bicycle
reinvention every time. It needs
servers, maintenance and it’s a
custom solution every time
SERVICE
redis queue A
redis queue B
transformation
server

CORE SERVER
Mobile app
&
Web clients
SERVICE
DB1 DB2 DB3
CODE / INFRA BLOAT
?
?
??
?
Data transformation

Source
DB
Destination
Service
Magic Data
Pipeline??
Again, looks like we need something
like this? General solution

EXISTING DBTOOLS?
TRIGGERS? 
FUNCTIONS?
- Triggers allows to modify only storage itself using set of
predeﬁned functions. React to insert/update/delete queries,
executes before or after the query
- Works only on DB host itself
- Limited in data processing capabilities
- Are bound to speciﬁc DB (mysql, oracle, etc)
- Can not send request to outside systems, queues
- Extending functionality is pretty much impossible

All problems in computer science can be solved
by another level of indirection. © David Wheeler
* the guy who invented
subroutines in software
He knows a lot about indirection!

Data
Source
Data
Destination
Magic Data
Pipeline??
Again, looks like we need something
like this? General solution
- Works as independent service
- Has ﬂexible data processing capabilities
- Not bound to speciﬁc data sources or destinations
- Connects completely unrelated systems in generic way
- Is easily extensible to support new systems
- It’s like DB functions/triggers taken to another level

Data
Source
Data
Destination
Magic Data
Pipeline??
But, the requirements are tough
- No additional point of failure
- Source consistency preservation
- Zero loss, low latency
- Highly available, scalable

• REPLICATION
• SIMPLE TRANSFORMS
INITIAL IDEA(S)

SIMPLE WEB INTERFACE
source transformation destination
Table Type Sharding Key Operations
Description

source
Enabled Source column Destination column
Source table
Destination table
Columns mapping
Operations mapping

OTHERS
LinkedIn's Change Data Capture Pipeline
SOCC 2012
We looked at other systems there are
not many in open source. It’s all
mostly internal systems never shared
with the outside world

This speciﬁc system is closely
connected with Oracle DB that’s used
at linkedin

DEC
DATABASE
QUEUE
DATABASE
QUEUE
SOURCE
MAPPING &
SIMPLE TRANSFORM
K /V
DESTINATION
• HARDCODED FUNCTIONS
• SIMPLE JSON CONFIGS
INITIAL DESIGN EXPLANATION

WE NEED SIMPLE SYSTEM

OH WAIT…

By nature, the more complex the
system is, the more prone it is to
breaking.

BUT

WAY MORE COMPLEXSometimes we need a transformation
function, that generated a request to
celery, for example. How are we going
to do that?

DEC
DATABASE
QUEUE
DATABASE
QUEUE
SOURCE
K /V
DESTINATION
• HARDCODED FUNCIONS
• SIMPLE JSON CONFIGS
• SCRIPTABLE ENGINE
MAPPING &TRANSFORM

• REPLICATION + SHARDING
• MAPPING + SIMPLE TRANSFORMS
• SCRIPTABLE ENGINE (LUA!)
• HA, LOW LATENCY, ZERO DATA LOSS

TRACKING DB EVENTS
- GDS connects directly to MySQL instance as slave 
- Receives logical replication log (modiﬁcations only) 
- Converts received events to json 
- Pushes those json events onto Kafka topic(s) 
- Highly conﬁgurable

SOURCE MAPPING &TRANSFORM DESTINATION

DEC ARCHITECTURE:
1) Reads events from datasource

2) Applies transformations to events using simple transforms or (LUA script)

3) Serializes resulting queries to internal format using msgpack

4) Writes resulting binary queries to conﬁgured kafka topics
1) Reads binary queries

2) Deserializes queries and sends them to speciﬁed destination

3) Takes care of retry logic and events deduplication

CONFIGURATION
Make sure your data source is conﬁgured
(we have a DB replication stream from GDS)
Step 1

Make sure DEC conﬁguration has 
correct data source and data destination
Step 2
CONFIGURATION

Implement and deploy necessary data
transformation scripts.
Step 3
CONFIGURATION

CONSUMER
EVENT TRANSFORMATIONS
1) DEC Consumer takes event from GDS queue
2) Filters event by table/event type (insert/update/delete)
3) Process with corresponding LUA script

CONSUMER EVENT TRANSFORMATIONS

CONSUMER EVENT TRANSFORMATIONS
2018/12/17 16:41:20.882435 [INFO] [buyer_seller_count.dec_shopee_order_details.order_2_seller]
[4097018849][619930454] 
SQL: UPDATE order_cnt_seller_tab_00000002 SET `mtime` = 1545036080, `seller_toreceice` =
`seller_toreceice` + 1 , `seller_toship` = `seller_toship` - 1 WHERE `shopid` = 27045752;
rows affected: 1

SO,WHERE WE’RE AT?
LIVE FOR ALL 7 COUNTRIES
USED BY 3TEAMS,
MORE COMING ON BOARD

create transaction
commit transaction
query 
query 
query 
query 
query 
SLOW (locking) query 
query
query
CORE SERVER
Mobile app
&
Web clients
DB1 DB2 DB3
BOTTLENECKS!

CORE SERVER
Mobile app
&
Web clients
DB1 DB2 DB3
DEC
NO BOTTLENECKS!

CORE SERVER
Mobile app
&
Web clients
SERVICE
redis queue A
redis queue B
transformation
DB1 DB2 DB3
CODE / INFRA BLOAT

CORE SERVER
Mobile app
&
Web clients
SERVICE
DB1 DB2 DB3
NO CODE / INFRA BLOAT
DEC

CORE SERVER
Mobile app
&
Web clients
SERVICE
DB1 DB2 DB3
DEC
SERVICE
SERVICE

CONCLUDING
All software projects are evolving and it’s always a mess 
but we need to create decent tools to keep the entropy at bay
and DEC is one such attempt in this never ending battle :)

THANKYOU!
Q&A
Rim Zaydullin (zaydullinr@seagroup.com) 
Platform Engineering Group (PEG), Shopee 2018

Building data pipelines at Shopee with DEC

Recommandé

Recommandé

Contenu connexe

Tendances

Tendances (20)

Similaire à Building data pipelines at Shopee with DEC

Similaire à Building data pipelines at Shopee with DEC (20)

Dernier

Dernier (20)

Building data pipelines at Shopee with DEC