6. Any trending deals?
Top-selling providers
Categorize deals by price and discount
percentage.
Friends' purchase patterns
Sample Queries.
9. API DESIGN
Bad or Good?
Biggest Engineering
Challenges
10. Pagination limits and constant API updates.
http://api.sqoot.com/v2/deals?api_key=xxxxxx&category_slug=home_goods&page=1&per_page=100
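As a minimal illustration, a paginated query like the one above can be assembled with Python's standard library (the helper name `build_deals_url` is mine, not from the project):

```python
from urllib.parse import urlencode

API_BASE = "http://api.sqoot.com/v2/deals"  # base endpoint from the slide

def build_deals_url(api_key, category_slug, page=1, per_page=100):
    """Build a paginated Sqoot deals URL; parameter names match the slide."""
    query = urlencode({
        "api_key": api_key,
        "category_slug": category_slug,
        "page": page,
        "per_page": per_page,
    })
    return f"{API_BASE}?{query}"

url = build_deals_url("xxxxxx", "home_goods", page=1, per_page=100)
```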
Freezing a point-in-time snapshot of a real-time,
non-firehose data source is hard
Data Source Constraints
11. Biggest Project Challenge
Three queries running at the same time.
Not fun – inconsistent. Pagination depends largely on the total deal count.
Refresh a page and you get new results.
12. ASYNC DISTRIBUTED QUERYING ENGINE
First Stage Master Producer (FSM)
Intermediate Hybrid Consumer-Producer
Final Stage Consumer
Design to solve this?
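The three stages named on this slide can be sketched end to end, with in-memory queues standing in for the Kafka topics between them (a simulation of the shape of the pipeline, not the project's actual code):

```python
import threading
import queue

# In-memory stand-ins for the two Kafka topics between the three stages.
url_topic = queue.Queue()    # FSM producer -> hybrid consumer-producers
data_topic = queue.Queue()   # hybrid stage -> final-stage consumer
DONE = object()              # sentinel to shut each stage down

def first_stage_master(categories, pages_per_category):
    """FSM: computes which pages to fetch and produces URL messages."""
    for cat in categories:
        for page in range(1, pages_per_category + 1):
            url_topic.put(f"/v2/deals?category_slug={cat}&page={page}")
    url_topic.put(DONE)

def hybrid_consumer_producer():
    """Intermediate stage: consumes URLs, 'fetches', produces raw data."""
    while (msg := url_topic.get()) is not DONE:
        data_topic.put({"url": msg, "payload": f"deals-from:{msg}"})
    data_topic.put(DONE)

def final_stage_consumer(sink):
    """Final stage: consumes raw data for aggregation/storage."""
    while (msg := data_topic.get()) is not DONE:
        sink.append(msg)

results = []
stages = [
    threading.Thread(target=first_stage_master, args=(["home_goods", "travel"], 3)),
    threading.Thread(target=hybrid_consumer_producer),
    threading.Thread(target=final_stage_consumer, args=(results,)),
]
for t in stages:
    t.start()
for t in stages:
    t.join()
# results now holds 2 categories x 3 pages of messages
```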
18. Nigerian.
Master's in Computer Science – Brandeis
University, MA
Software Engineer, 2½ years.
Hobbyist Photographer.
About Me.
19. PyKafka vs. Kafka-Python.
Balanced consumer.
Topic-to-partition assignment – hash partitioning.
Engineering an architecture to handle a complex, real-world data source.
Deep dive: tweaking source code for the use case.
DevOps
General learning curves.
Other Challenges
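Hash partitioning for topic-to-partition assignment can be sketched as below (a generic hash partitioner for illustration, not PyKafka's actual implementation; `crc32` is used only to make the hash deterministic across runs):

```python
import zlib

def assign_partition(key: str, num_partitions: int) -> int:
    """Hash partitioning: the same key always maps to the same
    partition, so all messages for one category stay together."""
    return zlib.crc32(key.encode("utf-8")) % num_partitions

# Every message keyed by the same category lands on the same partition.
p1 = assign_partition("home_goods", 8)
p2 = assign_partition("home_goods", 8)
```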
21. Elasticsearch or Cassandra or Elasticsearch on Cassandra
Elasticsearch –
Good at preserving indexed data.
Great when reads outnumber writes.
Analytics.
Search.
Cassandra –
Good for fast writes.
Preserves the data schema.
Uptime-critical workloads.
Time series.
Elasticsearch vs. Cassandra
Engineering challenge of utilizing external data sources with vast technical constraints you have no control over.
Choice of tools, and the reasons behind those choices.
The velocity of change in such APIs can cause terrible behavior in your app.
Getting a snapshot to fetch unique data
The time to crawl was large relative to how fast the API's data changed.
Crawling the API synchronously? Duplicates and dead deals – deals are constantly pushed down to later pages. I engineered a bespoke solution for that.
My project depends largely on the reported total in order to fetch the complete set of deals.
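The dependence on the reported total can be made concrete: the number of pages is `ceil(total / per_page)`, and those pages are then split into chunks for the consumers (function names are mine, for illustration):

```python
import math

def pages_to_fetch(total_deals: int, per_page: int = 100) -> int:
    """The complete crawl depends on the API-reported total:
    number of pages = ceil(total / per_page)."""
    return math.ceil(total_deals / per_page)

def chunk_pages(num_pages: int, num_consumers: int):
    """Split the page numbers into one chunk per consumer."""
    pages = list(range(1, num_pages + 1))
    size = math.ceil(len(pages) / num_consumers)
    return [pages[i:i + size] for i in range(0, len(pages), size)]

n = pages_to_fetch(1234, per_page=100)   # 13 pages
chunks = chunk_pages(n, num_consumers=4)
```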
1. An asynchronous distributed engine that queries the API and computes which pages to fetch.
2. It sends them to multiple consumers in a leaky-bucket fashion, then synchronously writes output using bounded semaphores to maintain consistency.
3. The order of fetches wasn't important; aggregation and sorting were done in Spark.
4. The main goal is UNIQUENESS, as much as possible.
One producer per category
Communicates with the Sqoot API
Intelligently computes which page numbers to fetch, also considering time deltas
Produces URLs with page chunks to a Kafka topic
Consumer-producers quickly fetch the data and produce it to another topic for further processing.
Communicates with the API server
Determines which categories to fetch
Computes page chunks for available consumers to fetch in a leaky-bucket fashion.
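A leaky bucket smooths bursts by spacing dispatches at a fixed rate. A minimal sketch of that pacing (the rate of 50 dispatches per second is illustrative, not the project's real limit):

```python
import time

class LeakyBucket:
    """Leaky bucket: allow at most `rate` dispatches per second,
    smoothing bursts of page-chunk handouts to consumers."""
    def __init__(self, rate: float):
        self.interval = 1.0 / rate
        self.next_free = time.monotonic()

    def acquire(self):
        """Block until the next dispatch slot is available."""
        now = time.monotonic()
        if now < self.next_free:
            time.sleep(self.next_free - now)
        self.next_free = max(now, self.next_free) + self.interval

bucket = LeakyBucket(rate=50)            # illustrative: 50 dispatches/s
start = time.monotonic()
for chunk in range(10):                  # hand out 10 page chunks
    bucket.acquire()
elapsed = time.monotonic() - start       # 10 acquires are paced apart
```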
Consumers receive the defined URLs and page-chunk lists from the FSM
Non-blocking: spins up multiple threads == length of the page-chunk list
The producer-defined URLs are consumed and the data aggregated.
Syncing consumer output? Bounded Semaphores
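One thread per entry in the page chunk, with a bounded semaphore gating the writes, could look like this (`BoundedSemaphore` is from Python's `threading` module; the limit of 2 concurrent writers and the function names are illustrative):

```python
import threading

page_chunk = [1, 2, 3, 4, 5]                 # one page chunk from the FSM
write_gate = threading.BoundedSemaphore(2)   # at most 2 concurrent writers
output = []
output_lock = threading.Lock()

def fetch_and_write(page: int):
    data = f"deals-page-{page}"              # stand-in for the HTTP fetch
    with write_gate:                         # bound concurrent writes
        with output_lock:                    # keep the shared list consistent
            output.append(data)

# Non-blocking fan-out: one thread per page in the chunk.
threads = [threading.Thread(target=fetch_and_write, args=(p,)) for p in page_chunk]
for t in threads:
    t.start()
for t in threads:
    t.join()
```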
Hash partitions.
Started building my own but found a more robust tool that handled that.
Kafka-python vs pykafka
Elasticsearch –
Loaded 15GB of data
Read and processed
Profiled each stage
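Profiling each stage can be as simple as a timing context manager around every step (a generic sketch, not the project's actual profiler; the loaded data here is a stand-in, not the 15GB dataset):

```python
import time
from contextlib import contextmanager

timings = {}

@contextmanager
def profiled(stage: str):
    """Record wall-clock time for one pipeline stage."""
    start = time.perf_counter()
    try:
        yield
    finally:
        timings[stage] = time.perf_counter() - start

with profiled("load"):
    data = list(range(100_000))          # stand-in for loading the data
with profiled("process"):
    total = sum(data)
```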