2. About Me - Jared L Kerim
- Software Developer (Python)
- Mozilla Geolocation Cloud Services Team
- CTO at PressureNET
3. PressureNET (Shameless Plug)
- Gathers sensor data from
smartphones
- Constant stream of data to
servers
- API to retrieve data
- Visualization
- Analysis
5. The Problem: MySQL
- Slow lookups
- Takes a lot of disk space
- Cost (Large Relational DBs are expensive)
- Schema changes (become slow or impossible)
6. How Big is “Big”
- PressureNET 100 req/s, 1.5 billion records
- Analytics Systems 5000 req/s, 100s of
billions of records
- Ad Buying Service 500k req/s, trillions of
records
8. What do we want to accomplish?
- Receive and store large amounts of data
- Access it quickly
- Small fast lookups (visualization)
- Large batch computations (mapreduce)
10. Durability
- Data in a durable store should be ‘safe’
- Don’t remove data from one durable data
store until it is confirmed to be in another
durable data store
- Durable data stores should have redundant
backups (hot standbys)
11. Redundancy
- Each stage of your system should have
multiple copies
- If one copy goes down, another should take
over
- Redundancy ensures availability
12. Scalability
- The rate of data intake can grow or spike
- Your system should be able to add more
resources to handle that growth
- Require that your workload is partitionable
14. We Are Not Alone
- This architecture is widely adopted
- Analytics
- Ad Serving/Views
- Log Analysis
- Sensor Data
- Game Events
- Video Events
15. Ingestors
- A redundant, scalable set of nodes which
receive data over http
- Can apply early validation and
authentication
- Stateless, low latency
16. Queue
- A scalable, durable storage mechanism for
data ‘in flight’
- Only holds data temporarily
- Typically preserves the order data was
received in
17. Aggregator
- A scalable, stateless set of workers which
consume data from the queue
- Can process data in small batches
- Write raw or transformed data to persistent
storage such as S3, Databases, etc.