
NoSQL in Real-time Architectures


  1. NoSQL in Real-time Architectures. Ronen Botzer, Aerospike. © 2014 Aerospike. All rights reserved.
  2. NoSQL? What is NoSQL Anyway?
     ■ Strozzi NoSQL (1998): an RDBMS that lacks support for the Structured Query Language
     ■ A collective term for non-relational data stores (~2009)
       ■ Column: Cassandra, HBase, BigTable
       ■ Document: MongoDB, CouchDB
       ■ Key-value: Redis, Aerospike
       ■ Graph: OrientDB, Neo4j
     ■ BTW, SQL-like query languages are emerging in NoSQL
     ■ "Not Only SQL" is one of the worst backronyms, ever.
     ■ A vague "marketing" term describing NotREL databases
  3. Old Architecture (scale out in 2000)
     [Diagram: content delivery network; load balancer; app servers; cache cluster; sharded RDBMS; shard manager; storage]
  4. We Have a Problem, Part 1 - The RDBMS
     ■ Relational databases don't cluster well.
     ■ Most are not designed to scale well vertically, either.
     ■ They don't work at the velocities required by web applications under high loads.
     ■ Schemas are too rigid for modern applications.
     ■ Relational databases were not designed for this
       ■ Designed in the era of single cores, expensive RAM, rotational disks, and accounting for the huge speed difference between disk and RAM. For example, disk-based indexes.
       ■ The days when DBAs controlled the design and access to the schema, and dictated a glacial rate of change, with long design and implementation cycles. Not adaptive or responsive.
       ■ Designed to power a single app, not a growing number of them.
  5. We Have a Problem, Part 2 - Architectural Impact
     ■ Architecting around the weakness of the RDBMS
       ■ Caches are added to compensate for slow reads and to reduce query load.
       ■ Increased the complexity of application logic.
       ■ Caches have their own clustering problems.
       ■ Broke database consistency.
       ■ Only improves reads; write load is still an issue.
       ■ Increasing use of denormalization.
     ■ Various attempts at sharding relational databases
       ■ Shard managers are usually written incorrectly.
       ■ Hotspots often emerge due to unbalanced hashing.
       ■ Cluster rebalancing once nodes are added is painful.
       ■ Does not provide high availability.
  6. Social Media
     [Diagram: MySQL or Postgres (rotational disk); recent user-generated content; Java application tier; data abstraction and sharding; modified Redis (SSD enabled); content and historical data]
  7. Travel Portal
     [Diagram: travel app; pricing database (rate limited); poll for pricing changes; pricing data; store latest price; read price; session management; session data; XDR]
     ■ Airlines forced interstate banking
     ■ Legacy mainframe technology
     ■ Multi-company reservation and pricing
     ■ Requirement: 1M TPS allowing overhead
  8. Advertising Technology Stack
     [Diagram: millions of consumers, billions of devices; app servers; data warehouse; insights; in-memory NoSQL; write context; write real-time context; read recent content]
     ■ Profile store: cookies, email, deviceID, IP address, location, segments, clicks, likes, tweets, search terms...
     ■ Real-time analytics: best sellers, top scores, trending tweets
     ■ Batch analytics: discover patterns, segment data (location patterns, audience affinity)
  9. North American RTB speeds & feeds
     ■ 1 to 6 billion cookies tracked
     ■ Some companies track 200M, some track 20B
     ■ Each bidder has their own data pool
     ■ Data is your weapon
     ■ Recent searches, behavior, IP addresses
     ■ Audience clusters (K-cluster, K-means) from offline Hadoop
     ■ "Remnant" from Google, Yahoo is about 0.6 million / sec
     ■ Facebook exchange: about 0.6 million / sec
     ■ "Other" is 0.5 million / sec
     Currently about 3.0M / sec in North America
  10. Advertising Ecosystem
  11. Modern Scale Out Architecture
      [Diagram: content delivery network; load balancer; simple stateless app servers; in-memory NoSQL; research warehouse; long term cold storage; fast stateless]
  12. Modern Scale Out Architecture
      [Diagram: content delivery network; load balancer; simple stateless app servers; in-memory NoSQL; research warehouse; long term cold storage; fast stateless; HDFS based]
  13. Financial Services – Intraday Positions
      [Diagram: legacy database (mainframe); read/write; start of day data loading; end of day reconciliation; query; real-time data feed; account positions; XDR; finance app; records app; RT reporting app]
      ■ 10M+ user records
      ■ Primary key access
      ■ 1M+ TPS planned
  14. Live analytics without ETL
      http://www.aerospike.com/community/labs/
      ■ 'Old Hadoop' involves using MapReduce for ELT/ETL.
      ■ Integration points with fast NoSQL
        ■ Input format connector: using NoSQL as a faster storage layer.
        ■ Output format connector: skipping the L and the T.
        ■ Dynamic programming paradigm: shared-nothing MR tasks have to wait until the reduce phase to consolidate information. You can look up and update row-level data during the map phase instead.
  15. Live Analytics
      [Diagram: content delivery network; load balancer; simple stateless app servers; in-memory NoSQL; research warehouse; long term cold storage; fast stateless; live analytics; Kafka]
  16. How fast can you go?
  17. "We run Aerospike heavily, peaking at 3 Million reads per second and well over 1 1/2 million writes a second in a very cost effective way. I don't think there's any technology we've run into that even comes close."
      – Geir Magnusson, CTO of AppNexus, Strata Santa Clara, 2014
  18. "Adform scaled from a 32 node Cassandra cluster to a 3 node Aerospike cluster, managing 1 TB data at 120k tps."
      – Tada Pivorius, Developer at Adform, "Married to Cassandra", 2014, http://vimeo.com/102812401
  19. Native Flash Performance
      [Chart, "High throughput": throughput (TPS) for balanced and read-heavy workloads; series: Aerospike, Cassandra, MongoDB, Couchbase 2.0*]
      [Charts, "Low latency": balanced workload read latency and balanced workload update latency; average latency (ms) vs. throughput (ops/sec); series: Aerospike, Cassandra, MongoDB]
      * "We were forced to exclude Couchbase...since when run with either disk or replica durability on it was unable to complete the test." – Thumbtack Technology
  20. YCSB Performance Comparison 2014
  21. Hot Analytics
      ■ High throughput queries
        ■ 2 node cluster, 10 indexes
        ■ Query returns 100 of 50M records
      ■ Predictable low latency
      [Chart: QPS; latency labels: "UN-PREDICTABLE LATENCY", 128 – 300 ms, 70 – 760 ms, 7 – 10 ms]
  22. Amazon EC2 results
  23. Amazon EC2 results
  24. Lots of Clients & Examples
  25. Use Open Source
  26. How do we do it?
  27. Writing Reliably with High Performance
      Writing with Immediate Consistency
      1. Write sent to row master
      2. Latch against simultaneous writes
      3. Apply write to master memory and replica memory synchronously
      4. Queue operations to disk
      5. Signal completed transaction (optional storage commit wait)
      6. Master applies conflict resolution policy (rollback/rollforward)
      Adding a Node
      1. Cluster discovers new node via gossip protocol
      2. Paxos vote determines new data organization
      3. Partition migrations scheduled
      4. When a partition migration starts, write journal starts on destination
      5. Partition moves atomically
      6. Journal is applied and source data deleted
      [Diagram labels: master; replica; transactions continue]
  28. Flash Optimized High Performance
      [Diagram, conventional stack: database; OS file system; page cache; block interface; SSD, HDD]
      [Diagram, Aerospike stack: database; Hybrid Memory System™; block interface; SSD, SSD; Open NVM SSD]
      [Captions: "Ask me and I'll tell you the answer." / "Ask me. I'll look up the answer and then tell it to you."]
      • Direct device access
      • Large block writes
      • Indexes in DRAM
      • Highly parallelized
      • Log-structured FS, "copy-on-write"
      • Fast restart with shared memory
  29. Shared-Nothing System: 100% Data Availability
      ■ Every node in a cluster is identical, handles both transactions and long running tasks
      ■ Data is replicated synchronously with immediate consistency within the cluster
      ■ Data is replicated asynchronously across data centers
      [Diagram label: OHIO Data Center]
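
Slide 5's remark that cluster rebalancing is painful once nodes are added can be illustrated with a minimal sketch. It assumes the simplest possible shard manager, one that hashes the key and takes it modulo the shard count; the function names, the `user:N` keys, and the choice of SHA-1 are illustrative assumptions, not something taken from the slides.

```python
import hashlib


def shard_for(key: str, num_shards: int) -> int:
    """Naive shard manager (assumed design): hash the key, take it modulo the shard count."""
    digest = hashlib.sha1(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_shards


def fraction_moved(keys, old_shards: int, new_shards: int) -> float:
    """Fraction of keys whose shard assignment changes when the cluster is resized."""
    moved = sum(
        1 for k in keys if shard_for(k, old_shards) != shard_for(k, new_shards)
    )
    return moved / len(keys)


# Hypothetical key set, used only for illustration.
keys = [f"user:{i}" for i in range(100_000)]
moved = fraction_moved(keys, old_shards=4, new_shards=5)
print(f"{moved:.1%} of keys move when growing from 4 to 5 shards")
```

With modulo placement a key stays put only when its hash gives the same remainder for both shard counts, so growing from 4 to 5 shards relocates roughly four keys out of five. That bulk data movement is what the custom rebalancing scripts mentioned in the notes have to manage.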

Editor's notes

  • In the Oracle world you scaled vertically - bigger and bigger iron. Expensive specialized hardware was necessary to power scaling (SAN trays and heads). Complex to configure, storage app DSLs.
    Mention the problems with MySQL on a large dedicated EC2 instance
    8 cores, only 1 at 100% load during heavy query loads
    Horrible utilization of the 64GB of RAM it was given
    Modern applications are designed and built using agile methodologies.
    The focus is on time-to-market, quickly launching with a minimum-viable-product, and following with quick iterations.
    Modifying the schema in a large enough application that is in production is a nightmare.
    'Solution' of BLOB columns defeated the notion of being relational
    In MySQL and PostgreSQL indexes are brittle and require constant maintenance or they break and we're back to scan-speed performance (if the table is still usable at all).
  • A glaring bottleneck requires urgent measures. Think how much engineering effort goes into fixing and working around the problems of relational databases at FB, Yahoo, Amazon, Google, etc.
    Not until recent years do decent algorithms for caching logic surface (Russian-doll caching, timestamp-based expiration). Cache invalidation is usually done wrong.
    Wait, if you need to ditch the relational model to make your RDBMS-backed app scale…
    Lots of large scale internet companies have that realization and start building their own NoSQL
    Most of it is in-house efforts, custom KVS. Google publishes a white paper about BigTable (2004), HDFS/Hadoop follow it (2005), CouchDB will show up around 2005. Amazon is inspired to do Dynamo in 2007, Cassandra is influenced by Dynamo in 2008.
    Managing TCP connection pools isn't trivial, evenly distributing data isn't trivial
    Custom scripts needed to manage data rebalancing (move and clear out) as cluster size changes.
    What happens when a cluster node goes down? Heartbeat? read-writes go where?
  • Twitter starts off with sharding MySQL and hits difficulties quickly.
    Introduces their in-house graph database FlockDB to reduce load on MySQL, which is now used mostly as a key-value store. A big, complex, low-performance one. Heavy reliance on caching.
    In late 2013 Twitter reports a peak of 140K writes-per-second for the site. Very large read/write ratio.
    Combining MySQL for storage and FlockDB for queries they report a latency of 350ms for a new write (tweet)
    In China, Weibo, Alibaba, and Tencent follow a new trend in application design, at scale. A recent discussion with Pinterest showed a similar design approach.
    * These companies use in-memory NoSQL on the front application tier, but also abstract the application logic from database choice and scale using a separate data-access middleware layer.
    The middleware provides abstractions like ‘get list of friends’, ‘get list of tweets’.
    It decides what’s cached, what’s in the RDBMS, and what is in NoSQL.
    It separates users into “high traffic” and “low traffic” groups with different infrastructure, which allows for different optimization patterns. The application does not have direct access to the databases.
    These companies discover that fast NoSQL is faster than RDBMS + caching.
    The separation is due to the economics of DRAM-only fast NoSQL (Redis): it is expensive to scale.
  • The travel industry is the first to have realtime pricing on inventory, as early as the 70s. Travel agents required SABRE to be built with 350K agents using it. Based on old mainframe computers.
    When web-based travel portals were created, suddenly everybody acted as their own travel agent.
    The same API supports all the travel portals, and they are charged for lookups due to limited bandwidth on the provider side (airlines). The infrastructure is old, and the amount of queries and bookings keeps increasing.
    Another problem is that as opposed to agents, you don't necessarily know anything about travel portal users. Most do not log in when they search or even have a user account at the travel portal. You need to track these users.
    Travel portals applied massive cache layers, but have consistency problems. How often do you try to reserve and find the seat is no longer available, or at a different price? How frustrating is that as a user - bad user experience.
    Travel portals also compete on an even basis. The user experience (better UIs) is about all they can offer as a differentiator. These snappier UIs add even more load onto the app, but to keep their profit margins the portals can't pass those extra calls on to the APIs they use.
    * Moving to fast NoSQL allowed for removal of the caching layer and supported a much higher rate of queries per second.
  • This is the technology stack that major advertising technology companies built to sustain the crushing load of aggregating the clicks and views from so many websites.

    * Individual retailers are now using this same tech stack for the same reason: they wish to present a near real-time experience and include analytics-based results.
  • The modern scale out architecture replaces the cache, database, and storage tier with a single, straightforward system.
    less hardware to purchase and maintain
    less development time patching antique systems (or searching for 'solutions')
    remove the caching logic
    easier to administer (configure and monitor), and now affordable because of flash

    If you don’t have a database that’s good at writes, you don’t even create an application that uses it.
    Key-value is what you need 99.999% of the time; the exception is a real query that needs to pull in actual analytics.
    Many web applications built around an RDBMS treat it as a key-value store, denormalizing and removing foreign key constraints.
  • Predictable low latency is one of Aerospike's core strengths.
    In the RTB world, the entire time frame for user lookup, ad lookup, the decision whether to bid, and a potential auction war is 100-150ms, so databases with unpredictable latency spikes cause whole opportunities to be lost.
  • Aerospike is a real distributed database. Clustering was built in from the very beginning and is core to the operation and performance of the database. It is not an after-the-fact bolted-on feature.
    masterless with replication
    Smart client connects and learns about the cluster topology. It only needs a single IP address.
    Records are identified by a RIPEMD-160 digest of the PK. Indexes are always 20 bytes wide.
    The client knows the partition map and will seek to write to the master and replica partitions synchronously.
    The client knows which partition to read a record from, and knows where the replica is for failover.
    Stop writing sharding logic.
  • Scaling in DRAM only is simply not economical.
    SSDs are formatted and used to expand the memory space.
    Raw device, direct access pattern.
    Indexes are kept in DRAM to save on an extra IOP. Enterprise feature: fast restart (shared memory).
    Sets (tables) are kept contiguous for efficient bulk reads, scans
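
The smart-client notes above (a 20-byte RIPEMD-160 digest of the PK, a partition map known to the client, a master and a replica per partition) can be sketched roughly as follows. This is a simplified illustration under stated assumptions, not Aerospike's client code: the partition count of 4096, the use of the first two digest bytes, the round-robin partition map, the node names, and hashing only the raw key string are all assumptions made for the example. SHA-1 is used purely as a 20-byte stand-in when the local OpenSSL build does not provide RIPEMD-160.

```python
import hashlib

NUM_PARTITIONS = 4096  # assumption: fixed partition count, not stated in the slides


def record_digest(primary_key: str) -> bytes:
    """Return a 20-byte digest of the primary key.

    Uses RIPEMD-160 when available; otherwise falls back to SHA-1
    (also 20 bytes) so the sketch still runs.
    """
    data = primary_key.encode("utf-8")
    try:
        return hashlib.new("ripemd160", data).digest()
    except ValueError:
        return hashlib.sha1(data).digest()


def partition_id(digest: bytes) -> int:
    """Map a digest to a partition using its first two bytes (assumed scheme)."""
    return int.from_bytes(digest[:2], "little") % NUM_PARTITIONS


def build_partition_map(nodes):
    """Hypothetical partition map: partition -> (master, replica), round-robin."""
    n = len(nodes)
    return {
        pid: (nodes[pid % n], nodes[(pid + 1) % n])
        for pid in range(NUM_PARTITIONS)
    }


def route(primary_key: str, partition_map):
    """Return the digest plus the master and replica nodes for a key."""
    digest = record_digest(primary_key)
    master, replica = partition_map[partition_id(digest)]
    return digest, master, replica


nodes = ["node-a", "node-b", "node-c"]  # hypothetical node names
partition_map = build_partition_map(nodes)
digest, master, replica = route("user:42", partition_map)
print(len(digest), master, replica)
```

Because every client derives the same partition from the same digest, no shard manager sits in the request path; when the cluster changes, only the partition map has to be refreshed.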