Why use a data lake? Why use Lambda? A conversation starter for Toronto Data Unconference 2015. We will discuss technologies such as Hadoop, Kafka, Spark Streaming, and Cassandra.
Moving to a data-centric architecture: Toronto Data Unconference 2015
1. Moving to a data-centric architecture
Toronto Data Unconference
June 19th, 2015
Adam Muise
Chief Architect
Paytm Labs
adam@paytm.com
2. Who am I?
• Chief Architect at Paytm Labs
• Paytm Labs is a data-driven lab founded to take on the really hard problems of scaling up Fraud, Recommendation, Rating, and Platform at Paytm
• Paytm is an Indian payments/wallet company: it already has 50 million wallets, adds almost 1 million wallets a day, and will pass 100 million customers by the end of the year. Alibaba recently invested in us; perhaps you heard.
• I've also worked with data science teams at IBM, Cloudera, and Hortonworks
10. In most cases, more data is better.
Work with the population, not just a sample.
11. Your view of a client today.
• Male / Female
• Age: 25-30
• Town/City
• Middle income band
• Product category preferences
12. Your view with more data.
• Male / Female
• Age: 27 but feels old
• GPS coordinates
• $65-68k per year
• Product recommendations
• Tea Party / Hippie
• Looking to start a business
• Walking into Starbucks right now…
• A depressed Toronto Maple Leafs fan
• Products left in basket indicate drunk Amazon shopper
• Gene expression for risk taker
• Thinking about a new house
• Unhappy with his cell phone plan
• Pregnant
• Spent 25 minutes looking at tea cozies
13. New types of data don't quite fit into your pristine view of the world.
[Diagram: "My Little Data Empire" surrounded by a sprawl of unstructured sources (data feeds, logs, machine data) marked with question marks]
14. To resolve this, some people make Data Warehouses with fixed schemas
[Diagram: many data sources funneled into a single EDW schema]
16. What if the data was processed and stored centrally? What if you didn't need to force it into a single schema? Data Lake.
[Diagram: data sources land in the Data Lake as-is; processing applies schemas downstream to feed the EDW and BI & Analytics]
17. A Data Lake architecture enables:
- Landing data without forcing a single schema
- Landing a large volume and variety of data efficiently
- Retaining data for a long period of time at a very low $/TB
- A platform to feed other analytical DBs
- A platform to execute next-gen data analytics and processing applications (graph analytics, machine learning, SAP, etc.)
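The first point, landing data without forcing a single schema, is often called schema-on-read: raw records keep whatever shape their source produced, and each consumer applies its own schema at query time. A minimal sketch in plain Python (the event fields and views here are invented for illustration):

```python
import json

# Hypothetical raw events as they might land in a data lake: no schema
# is enforced at write time, and each source keeps its own shape.
raw_events = [
    '{"user": "u1", "action": "login", "ts": 1434672000}',
    '{"user": "u2", "amount": 250.0, "currency": "INR"}',
    '{"user": "u1", "action": "purchase", "amount": 99.0}',
]

def project(raw_lines, fields):
    """Apply a schema at read time: keep only the requested fields,
    filling missing ones with None instead of rejecting the record."""
    rows = []
    for line in raw_lines:
        record = json.loads(line)
        rows.append({f: record.get(f) for f in fields})
    return rows

# Two consumers, two schemas, one copy of the raw data.
activity_view = project(raw_events, ["user", "action"])
payment_view = project(raw_events, ["user", "amount", "currency"])
```

The point of the sketch is that the raw store never rejects a record for not matching a schema; mismatches surface as nulls in a particular view instead.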
21. Batch Layer
- Handles ETL
- Traditional integration
- Often the system of record
- Archive
- Large-scale analytics
22. Speed Layer
- Handles event streams
- Near-realtime predictive analytics
- Alerting/trending
- Processing/parsing for micro-batch ETL
- Often an ingest layer for NoSQL DB data or search indexes (Solr, ES, etc.)
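The micro-batch idea behind the speed layer can be shown without any framework: chop the event stream into small fixed-size batches, aggregate each batch, and alert when a batch crosses a threshold. This is a toy pure-Python sketch; the event types, batch size, and threshold are made up, and a real deployment would use something like Spark Streaming:

```python
from collections import Counter

def micro_batches(events, batch_size):
    """Split an ordered event stream into fixed-size micro-batches."""
    for i in range(0, len(events), batch_size):
        yield events[i:i + batch_size]

def process_stream(events, batch_size=4, alert_threshold=3):
    """Count event types per micro-batch and alert on hot spikes."""
    alerts = []
    for batch in micro_batches(events, batch_size):
        counts = Counter(e["type"] for e in batch)
        for event_type, n in counts.items():
            if n >= alert_threshold:
                alerts.append((event_type, n))
    return alerts

stream = [{"type": "login"}, {"type": "fraud_check"}, {"type": "fraud_check"},
          {"type": "fraud_check"}, {"type": "login"}]
alerts = process_stream(stream)
```

A real speed layer batches by time window rather than by count, but the shape is the same: small bounded batches, cheap aggregation, fast reaction.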
24. Our Datalake
We had to build a data lake with realtime capability. It looks like this:
25. Our Datalake
Lambda Architecture
Batch Ingest:
• Sqoop from MySQL instances
• Keep as much in HDFS as you can; offload to S3 for DR/archive and when you have colder data
• Spark and other Hadoop processing tools can run natively over S3 data, so it's never really gone (don't use Glacier in a processing workflow)
Realtime Ingest:
• Mypipe to get events from MySQL binary log data and push them into Kafka topics (under construction)
• Applications push critical events to Kafka
• Kafka acts as a buffered ingest and can be archived to HDFS with Camus
• All realtime data is processed with Spark Streaming (micro-batch) or Camus (archive to Avro)
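For the Sqoop-from-MySQL batch ingest above, the invocation might look roughly like this. This is a config fragment, not a definitive command: the host, database, table, user, and HDFS path are all placeholders, and mapper count and file format depend on the source table.

```shell
# Pull one MySQL table into HDFS as Avro (all names are placeholders).
sqoop import \
  --connect jdbc:mysql://mysql-host:3306/payments \
  --username ingest_user -P \
  --table wallet_transactions \
  --target-dir /datalake/raw/wallet_transactions \
  --as-avrodatafile \
  --num-mappers 8
```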