Dealing with data requires more than a data store; it requires an infrastructure. At SimpleReach, we have five data storage layers to service all of our data needs. These range from high-volume, high-velocity data ingestion with real-time analytics to ad-hoc historical analysis with search capabilities. To communicate effectively between applications, the data stores sit behind a service architecture that provides consistent data access patterns and failover/redundancy. This talk is the story of how we arrived at this architecture and some of the lessons we learned along the way.
2. Overview
• Evolution
• SimpleReach
• Data Stores / Languages
• Architecture Implementation
Big Data Revolution is an Evolution
Eric Lubow @elubow #NYCassandra2013
3. We're in the midst of an
evolution, not a revolution.
4. The 2 Truths
5. The Real Truth
Even with the right tools, 80% of
the work of building a big data
system is acquiring and refining
6. 30m plays/day + 4m user ratings + 75k movies metadata + 24.4m user metadata =
David Fincher + Kevin Spacey + British House of Cards
Mitch Hurwitz + Will Arnett + Jason Bateman + Arrested Development
7. BRING IT TOGETHER
8. revolution evolution
[Diagram labels: Insufficient Capabilities, New Products, Scale/Need Changes, Development & Integration]
11. SimpleReach
• Millions of URLs per day
• Over 1 billion pageviews per month
• 250m events per day (~3k events/second)
• Auto-scale 90-130 machines depending on traffic
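As a quick sanity check on the figures above, the per-second rate follows from dividing the daily event count by the number of seconds in a day. This is a minimal sketch with illustrative variable names; the result is a daily average, so peak traffic will be higher.

```python
# Back-of-the-envelope check of the slide's event rate.
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400

events_per_day = 250_000_000  # "250m events per day" from the slide
events_per_second = events_per_day / SECONDS_PER_DAY

# Prints the daily average, which lands close to the slide's "~3k".
print(f"{events_per_second:,.0f} events/second on average")
```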
12. HUMBLE BEGINNINGS
13. Scale
14. AND THEN...
C*
15. Cassandra C*
• Large data volume ingestion at high velocity
• Really fast writes to many locations (eventual
consistency)
• Query by column groups within rows (slicing)
• TTLs for small group aggregation
• Wrote Helenus, Node.js driver for Cassandra
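To make the slicing bullet concrete, here is a toy in-memory model of a wide row: columns are kept sorted by name, so a contiguous range of columns can be read with two binary searches. It only illustrates the idea and is not Cassandra's or Helenus's API; the `WideRow` class, the hourly column names, and the counts are all hypothetical.

```python
import bisect


class WideRow:
    """Toy model of a wide row: columns stay sorted by name (hypothetical)."""

    def __init__(self):
        self._names = []   # sorted column names
        self._values = {}  # column name -> value

    def insert(self, name, value):
        # Only new column names need a slot in the sorted index.
        if name not in self._values:
            bisect.insort(self._names, name)
        self._values[name] = value

    def slice(self, start, finish):
        """Return (name, value) pairs with start <= name <= finish."""
        lo = bisect.bisect_left(self._names, start)
        hi = bisect.bisect_right(self._names, finish)
        return [(n, self._values[n]) for n in self._names[lo:hi]]


# Hypothetical row: pageview counts for one URL, one column per hour.
row = WideRow()
for hour, count in [("2013-03-01T10", 120), ("2013-03-01T09", 95),
                    ("2013-03-01T11", 143), ("2013-03-01T12", 80)]:
    row.insert(hour, count)

# Read the 09:00 through 11:00 columns in one contiguous slice.
morning = row.slice("2013-03-01T09", "2013-03-01T11")
print(morning)
```

Because the column names sort chronologically, a time-range query becomes a single contiguous read rather than a scan of the whole row.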
16. MongoDB
• Fast atomic increments (Node.js is native JSON)
• Sharding
• Solid ORM for Rails (Mongoid)
• B-Tree Indexes
• Document based via JSON
• TTLs for ephemeral data
17. Redis
• Supports hundreds of thousands of transactions per second
• Great caching engine
• Supports useful variable types like sets, sorted sets, and lists
• Everything is guaranteed to be held in memory
18. Infobright
• Works with standard MySQL driver
• Column Stores for ad-hoc analytics queries
in SQL
• Heavy compression of data (avg 12:1)
19. The c0dez
• Polyglottany doesn’t only apply to data stores
• Each language has its own benefit to each stack
layer
• Each language has its own individual benefits
• Each language has its own development benefits
21. Cons
• Redis - Can only utilize a single core. SerDe price.
• Infobright - DELETE/UPDATEs are VERY expensive
• Cassandra - No btree indexes or probabilistic counters
• Mongo - Indexes must fit in memory. Forced Replica ping times
• Python - Whitespace. Community
• Ruby - Not high performance enough for our standards
22. Evolution Takes Work
• Service Oriented Architecture (Internal API)
• Data accuracy checks: visual and programmatic
• Built framework for testing out engines (Storage,
Queueing, etc)
• Access to many toolsets (for all languages, DBs, Engines)
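One way to picture the internal API is as a thin service layer that hides which store answers a request and retries on a replica when a node fails. The sketch below is a guess at the general shape only, not the actual SimpleReach implementation; `InternalAPI`, `InMemoryStore`, and `StoreUnavailable` are hypothetical names, and in-memory dictionaries stand in for real database clients.

```python
class StoreUnavailable(Exception):
    """Raised by a backend that cannot serve the request."""


class InMemoryStore:
    """Stand-in for a real database client (hypothetical)."""

    def __init__(self, name, data=None, healthy=True):
        self.name = name
        self.data = dict(data or {})
        self.healthy = healthy

    def get(self, key):
        if not self.healthy:
            raise StoreUnavailable(self.name)
        return self.data.get(key)


class InternalAPI:
    """Single access path: callers name a dataset, not a database."""

    def __init__(self):
        self._routes = {}  # dataset name -> ordered list of replicas

    def register(self, dataset, replicas):
        self._routes[dataset] = list(replicas)

    def get(self, dataset, key):
        last_error = None
        for store in self._routes[dataset]:
            try:
                return store.get(key)
            except StoreUnavailable as err:
                last_error = err  # fall through to the next replica
        raise StoreUnavailable(f"all replicas failed for {dataset}") from last_error


counts = {"example.com/a": 42}
primary = InMemoryStore("redis-a", counts, healthy=False)  # simulated outage
replica = InMemoryStore("redis-b", counts)

api = InternalAPI()
api.register("realtime_counts", [primary, replica])
result = api.get("realtime_counts", "example.com/a")
print(result)
```

Callers only ever see the dataset name, so a store can be swapped or a replica added without touching application code.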
23. Service
[Diagram labels: Solr, C*, Real-time, C*, Internal API]
24. Path of a Packet
[Diagram labels: Internet, EP, Queue, Consumers, Fire Hose, API, Internal API, SC, Solr, C*, Mongo, Redis, IB]
25. Architecture Distribution
US-EAST-1a US-EAST-1b US-EAST-1e
CASSANDRA-0001 CASSANDRA-0002 CASSANDRA-0003
CASSANDRA-0010 CASSANDRA-0011 CASSANDRA-0012
REDIS-0001A REDIS-0001B
INFOBRIGHT-0001 INFOBRIGHT-0002
MONGO-SHARD-0000-A MONGO-SHARD-0000-B
MONGO-SHARD-0001-B MONGO-SHARD-0001-A
MONGO-SHARD-0002-B MONGO-SHARD-0002-A
iAPI-0001 iAPI-0002 iAPI-0003
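The layout above spreads each replicated component across availability zones so that losing one zone does not take a whole tier offline. A small check of that property might look like the following. The extracted slide text does not show which zone each two-node pair sits in, so the zone assignments for Redis, Infobright, and the Mongo shards are assumptions, as are the group names and the helper function.

```python
ALL_ZONES = ["us-east-1a", "us-east-1b", "us-east-1e"]

# group -> {host: zone}. Cassandra and iAPI follow the slide's column order;
# zones for the two-node groups are ASSUMED for illustration only.
GROUPS = {
    "cassandra": {
        "CASSANDRA-0001": "us-east-1a", "CASSANDRA-0002": "us-east-1b",
        "CASSANDRA-0003": "us-east-1e", "CASSANDRA-0010": "us-east-1a",
        "CASSANDRA-0011": "us-east-1b", "CASSANDRA-0012": "us-east-1e",
    },
    "redis-0001": {"REDIS-0001A": "us-east-1a", "REDIS-0001B": "us-east-1b"},
    "infobright": {"INFOBRIGHT-0001": "us-east-1a", "INFOBRIGHT-0002": "us-east-1b"},
    "mongo-shard-0000": {"MONGO-SHARD-0000-A": "us-east-1a", "MONGO-SHARD-0000-B": "us-east-1b"},
    "mongo-shard-0001": {"MONGO-SHARD-0001-B": "us-east-1a", "MONGO-SHARD-0001-A": "us-east-1b"},
    "mongo-shard-0002": {"MONGO-SHARD-0002-B": "us-east-1a", "MONGO-SHARD-0002-A": "us-east-1b"},
    "iapi": {"iAPI-0001": "us-east-1a", "iAPI-0002": "us-east-1b", "iAPI-0003": "us-east-1e"},
}


def survives_zone_loss(hosts, lost_zone):
    """True if at least one host in the group is outside the lost zone."""
    return any(zone != lost_zone for zone in hosts.values())


# A group passes only if it survives the loss of any single zone.
report = {
    group: all(survives_zone_loss(hosts, zone) for zone in ALL_ZONES)
    for group, hosts in GROUPS.items()
}
print(report)
```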
26. The Schrute of the Problem
27. Evolving Amazon Tools
• CloudSearch
• Full Featured API
• Elastic Beanstalk
• Simple Queue Service
• Elastic MapReduce
• Data Pipelining
• Simple Workflow Coordinator
• OpsWorks
• S3 / Glacier
• CloudFormation
• Redshift Analytics
28. DevOps Wizardry
• Extensive use of AWS
• Monitor: Nagios, StatsD, and Graphite
• Manage: Chef, OpsWorks, csshX
• Deployments
29. Summary
• Solutions Require Evolution
• Build, Use, and Integrate Tools
• Abstraction
• Distribution
• Monitoring & Automation
30. Evolution Takes Time
A revolution only lasts fifteen
years, a period which
coincides with the
31. We’re
(Ask us about Food Coma Fridays)
32. Questions are guaranteed in life.
Answers aren’t.
Eric Lubow
@elubow
elubow@simplereach.co
Thank you.