{…where the best deals find you in real time.
Emmanuel Awa
MOTIVATION
• For the love of deals: we all love them.
• A real-world engineering challenge.
MOTIVATION
• ONE platform: searches and shopping inspired by the user's preferences.
Sample Data
• Sqoot API.
• Scaled to all categories offered by the API.
Current Data Source
• User interaction: engineered for 1B users.
Sample Queries
• Any trending deals?
• Top-selling providers.
• Categorize deals by price and discount percentage.
• Friends' purchase patterns.
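Two of the queries above can be sketched in plain Python. This is only an illustration, not the pipeline's actual query code: the field names (`provider_name`, `number_sold`, `price`, `discount_percentage`) are taken from the sample table later in the deck, while the function names, the band widths, and the three example deals are hypothetical.

```python
from collections import defaultdict
from typing import Any, Dict, Iterable, List, Tuple

Deal = Dict[str, Any]


def top_providers(deals: Iterable[Deal], n: int = 3) -> List[Tuple[str, float]]:
    """Rank providers by the total number of deals sold."""
    sold: Dict[str, float] = defaultdict(float)
    for deal in deals:
        sold[deal["provider_name"]] += float(deal["number_sold"])
    return sorted(sold.items(), key=lambda item: item[1], reverse=True)[:n]


def price_discount_bucket(deal: Deal, price_step: float = 50.0,
                          discount_step: float = 25.0) -> Tuple[int, int]:
    """Map a deal to a (price band, discount band) pair of integer indexes."""
    # Band widths are hypothetical; band 0 covers [0, step), band 1 [step, 2*step), ...
    return (int(deal["price"] // price_step),
            int(deal["discount_percentage"] // discount_step))


# Made-up deals, for illustration only.
deals = [
    {"provider_name": "A", "number_sold": 10, "price": 20.0, "discount_percentage": 30.0},
    {"provider_name": "B", "number_sold": 25, "price": 120.0, "discount_percentage": 60.0},
    {"provider_name": "A", "number_sold": 5, "price": 75.0, "discount_percentage": 10.0},
]

print(top_providers(deals))             # [('B', 25.0), ('A', 15.0)]
print(price_discount_bucket(deals[1]))  # (2, 2)
```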
Sample Queries
• Complex queries? Real-time response?
Current Pipeline
[Diagram. Stage labels: API, Ingestion, Batch layer, Serving layer, Hybrid, Streaming, API interaction and deals collection]
Biggest Engineering Challenges
• API design.
• Bad or good?
Data Source Constraints
• Pagination limits and constant API updates.
  http://api.sqoot.com/v2/deals?api_key=xxxxxx;category_slug=home_goods;page=1;per_page=100
• Freezing time is hard for a real-time data source that is not a fire hose.
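As a rough sketch of what the pagination constraint means for a crawler, the snippet below builds request URLs in the format shown on the slide and works out how many pages a category needs. The function names are hypothetical, and it assumes the total deal count for the category is already known to the caller; how that total is obtained from the API is not shown in the deck.

```python
import math

BASE_URL = "http://api.sqoot.com/v2/deals"  # endpoint as shown on the slide


def page_count(total: int, per_page: int = 100) -> int:
    """Number of pages needed to cover `total` deals."""
    return math.ceil(total / per_page)


def page_url(api_key: str, category_slug: str, page: int, per_page: int = 100) -> str:
    """Build one paginated request URL in the slide's semicolon-separated format."""
    return (f"{BASE_URL}?api_key={api_key};category_slug={category_slug};"
            f"page={page};per_page={per_page}")


# Example: a category with 250 deals needs 3 requests at 100 deals per page.
urls = [page_url("xxxxxx", "home_goods", p) for p in range(1, page_count(250) + 1)]
```

Because the number of pages is derived from the total, any change in the total while a crawl is running shifts which deals land on which page.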
Biggest Project Challenge
• Three queries issued at the same time return inconsistent results. Not fun.
• Pagination depends largely on the total.
• [Screenshot labels: New, Page refresh, New]
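A small simulation can show why a page-by-page crawl of a changing listing is inconsistent. It assumes the listing is ordered newest first, which the deck does not state explicitly, and the deal ids and helper names are made up for illustration.

```python
from typing import List


def get_page(deals: List[int], page: int, per_page: int) -> List[int]:
    """Return one 1-indexed page from a listing of deal ids."""
    start = (page - 1) * per_page
    return deals[start:start + per_page]


def dedupe(ids: List[int]) -> List[int]:
    """Keep the first occurrence of each deal id, preserving order."""
    return list(dict.fromkeys(ids))


listing = [5, 4, 3, 2, 1]           # assumed ordering: newest deal first
crawl = get_page(listing, 1, 2)     # [5, 4]
listing.insert(0, 6)                # a new deal arrives between two requests
crawl += get_page(listing, 2, 2)    # [4, 3]: deal 4 was pushed down to page 2

print(crawl)          # [5, 4, 4, 3]
print(dedupe(crawl))  # [5, 4, 3]
```

The crawl fetched deal 4 twice and never saw deal 6, which is why uniqueness has to be enforced downstream rather than assumed.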
Design to solve this?
• ASYNC DISTRIBUTED QUERYING ENGINE
• First Stage Master Producer (FSM)
• Intermediate Hybrid Consumer-Producer
• Final Stage Consumer
Architecture
• FIRST STAGE MASTER
  Compute page chunks.
  Leaky bucket approach.
• FIRST STAGE MASTER (cont'd)
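The two steps named for the first stage master could look roughly like the sketch below. It is a simplified reading, not the project's code: `chunk_pages` and `leak` are hypothetical names, the chunk size and interval are arbitrary, and the leaky bucket here only models the constant-rate outflow, with no overflow or capacity handling.

```python
import time
from typing import Callable, Iterator, List, Sequence


def chunk_pages(last_page: int, chunk_size: int) -> List[List[int]]:
    """Split page numbers 1..last_page into consecutive chunks."""
    pages = list(range(1, last_page + 1))
    return [pages[i:i + chunk_size] for i in range(0, len(pages), chunk_size)]


def leak(chunks: Sequence[List[int]], interval_s: float,
         sleep: Callable[[float], None] = time.sleep) -> Iterator[List[int]]:
    """Release queued chunks one at a time, at a constant rate."""
    for index, chunk in enumerate(chunks):
        if index > 0:
            sleep(interval_s)  # constant outflow, regardless of backlog size
        yield chunk


# Example: 7 pages in chunks of 3, released one chunk every 0.01 seconds.
for chunk in leak(chunk_pages(7, 3), interval_s=0.01):
    print(chunk)  # in the real pipeline this would be produced to a Kafka topic
```

Passing `sleep` as a parameter keeps the pacing testable without real waiting.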
• HYBRID CONSUMER-PRODUCER
  Fetch and produce the actual data.
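The notes for this stage mention spinning up one thread per entry in a page chunk and syncing consumer output with bounded semaphores. A minimal sketch of that idea follows, under loud assumptions: a `queue.Queue` stands in for the output Kafka topic, `fetch_page` is a caller-supplied placeholder for the HTTP request, and `fetch_chunk` and `max_writers` are hypothetical names.

```python
import queue
import threading
from typing import Any, Callable, Dict, List


def fetch_chunk(pages: List[int],
                fetch_page: Callable[[int], List[Dict[str, Any]]],
                out_topic: "queue.Queue[Dict[str, Any]]",
                max_writers: int = 1) -> None:
    """Fetch every page of a chunk on its own thread and publish the deals."""
    write_slots = threading.BoundedSemaphore(max_writers)

    def worker(page: int) -> None:
        deals = fetch_page(page)   # a network call in a real pipeline
        with write_slots:          # at most max_writers threads publish at once
            for deal in deals:
                out_topic.put(deal)

    # One thread per page, so thread count == length of the page chunk.
    threads = [threading.Thread(target=worker, args=(page,)) for page in pages]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()


def fake_fetch(page: int) -> List[Dict[str, Any]]:
    """Stand-in for the API call: two made-up deals per page."""
    return [{"id": page * 10 + i} for i in range(2)]


topic: "queue.Queue[Dict[str, Any]]" = queue.Queue()
fetch_chunk([1, 2, 3], fake_fetch, topic)
```

With `max_writers=1`, each page's deals are written as one uninterrupted run, although the order of the pages themselves still depends on thread scheduling.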
• FINAL STAGE CONSUMER
  Persist data to HDFS.
About Me
• Nigerian.
• Master's in Computer Science, Brandeis University, MA.
• Software engineer for 2½ years.
• Hobbyist photographer.
Other Challenges
• PyKafka vs. kafka-python.
• Balanced consumer.
• Topic-to-partition assignment: hash partitioning.
• Engineering an architecture to handle a complex real-world data source.
• Deep dives: tweaking library source code for the use case.
• DevOps.
• General learning curves.
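Hash partitioning, mentioned in the list above, means deriving the partition from a hash of the message key so that the same key always lands on the same partition. The sketch below shows only the idea: `partition_for` is a hypothetical helper, and CRC32 is chosen because it is in the standard library and stable across runs, not because PyKafka or kafka-python use it.

```python
import zlib


def partition_for(key: str, num_partitions: int) -> int:
    """Map a message key to a partition index with a stable hash."""
    # CRC32 is an illustrative choice; Python's built-in hash() is salted
    # per process for strings, so it would not give repeatable assignments.
    return zlib.crc32(key.encode("utf-8")) % num_partitions


# Example: every message keyed by the same category goes to the same partition.
first = partition_for("home_goods", 4)
second = partition_for("home_goods", 4)
print(first == second)  # True
```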
Sample tables
CREATE TABLE trending_categories_with_price (
    category text, created_at timestamp, updated_at timestamp, expires_at timestamp,
    description text, fine_print text, price float, discount_percentage float, id bigint,
    merchant_address text, merchant_country text, merchant_id bigint, merchant_latitude text,
    merchant_longitude text, merchant_locality text, merchant_name text, merchant_phone_number text,
    merchant_region text, number_sold float, online boolean, provider_name text, title text, url text,
    -- partition key: price; clustering columns: category, discount_percentage
    PRIMARY KEY ((price), category, discount_percentage)
) WITH CLUSTERING ORDER BY (category ASC, discount_percentage DESC);
Elasticsearch vs. Cassandra
• Elasticsearch, Cassandra, or Elasticsearch on Cassandra?
• Elasticsearch:
  • Good at preserving indexed data.
  • Great for read-heavy workloads.
  • Analytics.
  • Search.
• Cassandra:
  • Good for fast writes.
  • Preserves the data schema.
  • Uptime-critical workloads.
  • Time series.
Benchmarking Pipeline
[Diagram. Stage labels: API, Ingestion, Batch layer, Serving layer, Hybrid, Streaming, API interaction and deals collection]

ExStreamly Cheap - Insight Data Engineering 2016a Project


Editor's notes

  1. The engineering challenge of using external data sources whose many technical constraints you have no control over.
  2. The choice of tools, and the reasons for each choice.
  3. The velocity of change in such APIs can cause terrible behavior in your app. Getting a snapshot in order to fetch unique data is hard: the time to crawl was large, and so were the API changes.
  4. Crawling the API synchronously? Duplicates and dead. Deals are constantly pushed down to other pages. I engineered a bespoke solution for that. My project depends largely on the total in order to fetch the complete set of deals.
  5. (1) An asynchronous distributed engine queries the API and tries to compute which pages to fetch. (2) It sends them to multiple consumers in a leaky-bucket fashion; the consumers then write their output synchronously, using bounded semaphores to try to maintain consistency. (3) The order of fetching was not important; aggregation and sorting are done in Spark. (4) The main point is UNIQUENESS, as far as possible.
  6. One producer per category communicates with the Sqoot API, intelligently computes the page numbers to fetch (also considering time deltas), and produces URLs with page chunks to a Kafka topic queue. Consumer-producers quickly fetch the data and produce the actual data to another topic for further processing.
  7. Compute with the API server. Determine which categories to fetch.
  8. Computes page chunks for the available consumers to fetch, in a leaky-bucket fashion.
  9. Consumers receive the URLs and page-chunk lists defined by the FSM. Without blocking, each spins up as many threads as there are entries in its page-chunk list. The producer-defined URLs are consumed and fetched for aggregation of the data. Syncing consumer output? Bounded semaphores.
  10. Hash partitions. I started building my own partitioner but found a more robust tool that handled that. kafka-python vs. PyKafka.
  11. Elasticsearch: loaded 15 GB of data, read and processed it, and profiled each stage.