Learning from Google Megastore

 Part1: Data Model and Transactions in single datacenter
                (w/o Replication and Paxos)

                     Schubert Zhang
                      April 9, 2011
Megastore Introduction




2012/3/25                            2
Three Aspects of Megastore

• Data Model to be a DB
   – Data layout
   – Indexing


• Transactions and ACID
   – Within Entity Group
   – Across Entity Group


• Replication across datacenters (not covered in detail in
  this presentation)
   – Synchronous replication
   – Optimized Paxos




What is Megastore?

                              Megastore is:

                        A database over Bigtable,
                with High Availability across datacenters.



                          Bigdata Philosophy:

          fine-grained partitioning to make things easy,
                   data placement for relations,
                             and Paxos

   then, a simple API/Language for convenience of usage!


Target Applications
•   Interactive online services
     – User-facing applications

•   Conflicting requirements
     – Highly scalable (size, throughput)
     – Rapid development, fast time-to-market
     – Responsive, low latency
     – Consistent view of data
     – Highly available

•   Reads vs. Writes
     – 20 billion : 3 billion, daily, at Google
     – Roughly 7:1

•   Bigdata
     – Petabyte of primary data
     – Across datacenters

•   Application developers
     – Are familiar with RDBMS and SQL
     – Find it difficult to give up the “read-modify-write” idiom
     – But now need high scalability for bigdata
NoSQL + RDBMS = Megastore

•   NoSQL datastore (Bigtable)
    – Pros
         • Highly scalable
         • Highly available within DC (across hosts)
    – Cons
         • Limited API
         • Loose consistency models
         • Complicates application development

•   RDBMS
    – Pros
         • Rich set of features for convenient, rapid application development
         • Transactions
         • ACID semantics
    – Cons
         • Difficult to scale

•   Megastore database (a blend of the two)
    – High scalability
    – Distributed transactions
    – Consistency guarantees
    – Fully serializable ACID semantics within entity groups
    – Convenience, rapid development for applications

•   + High Availability
    – Within-DC (Bigtable)
    – Across-DC replication via Paxos (synchronous writes within an EG)
    – Strong consistency guarantees (synchronous replication)
    – Reasonable latency, seamless failover

Design Principles

•   Taking a middle ground in the RDBMS vs. NoSQL design space:
    –   partition the datastore and
    –   replicate each partition separately,
    –   providing full ACID semantics within partitions,
    –   but only limited/loose consistency guarantees across them.


•   Use Paxos to build a highly available system:
    – provides reasonable latencies for interactive applications while
    – synchronously replicating writes across geographically distributed
      datacenters,
    – to achieve across-DC high availability and a consistent view of the data.


•   Approaches:
    – for database scale, partitioning data into a vast space of small
      databases, each with its own replicated log stored in a per-replica
      Bigtable;
    – for availability, implementing a synchronous, fault-tolerant log replicator
      optimized for cross-DC replication.

EG: Entity-Groups
•   The Entity-Group concept is the cornerstone of scalability and availability!
     –   Fine-grained partitions of data
     –   Fine-grained control over the data’s partitioning and locality
     –   Like many mini-databases
     –   To scale throughput and localize outages
     –   Each independently and synchronously replicated across DCs

•   A physical EG in Bigtable consists of
     –   A write-ahead log (for ACID transactions)
     –   Related data (pre-joined)
     –   Local indexes (also ACID)
     –   … Like a mini-database (locally complete)
     –   And an inbox for receiving across-EG messages

•   Size of an EG
     –   Not too large, not too small
     –   A priori/natural or deliberate grouping of data for fast operations
     –   If too large: serializable ACID causes long latency and low throughput
     –   If too small: many expensive across-EG consistency operations (e.g. 2PC), or
         looser-consistency asynchronous messaging

The data for most Internet services can be suitably partitioned (e.g., by user) to
make this approach viable.

Nearly all applications built on Megastore have found ways to draw EG boundaries.
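
The EG structure described above can be pictured with a small sketch. It is only an illustration under stated assumptions: the class names `EntityGroup` and `Datastore` are hypothetical, partitioning is by user id, and plain Python containers stand in for Bigtable rows.

```python
from dataclasses import dataclass, field


@dataclass
class EntityGroup:
    """One EG: a mini-database with its own log, data, local indexes and inbox."""
    key: str
    wal: list = field(default_factory=list)            # write-ahead log entries
    data: dict = field(default_factory=dict)           # primary data by primary key
    local_indexes: dict = field(default_factory=dict)  # index name -> entries
    inbox: list = field(default_factory=list)          # messages from other EGs


class Datastore:
    """Hypothetical container that routes every write to the EG of its root key."""

    def __init__(self):
        self.groups = {}

    def group_for(self, eg_key):
        if eg_key not in self.groups:
            self.groups[eg_key] = EntityGroup(eg_key)
        return self.groups[eg_key]

    def put(self, eg_key, primary_key, entity):
        eg = self.group_for(eg_key)
        eg.wal.append(("put", primary_key, entity))  # log first
        eg.data[primary_key] = entity                # then apply


store = Datastore()
store.put("U1", ("U1",), {"name": "Jack"})
store.put("U1", ("U1", "P1"), {"url": "URL1"})
store.put("U2", ("U2",), {"name": "Tom"})
```

Each user ends up in a separate group with a separate log, so the two writes for U1 never touch the state kept for U2.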

Schematic Diagrams
[Figure: Megastore layout in Bigtable. Each EG is like a mini-DB holding a WAL (logs),
primary data, local indexes, and an inbox for queue messages; EG 2 … EG n follow the
same layout.]




Many WAL vs. Single WAL

• Many replicated logs each governing its own EG, to improve
  availability and throughput.
   – Independent and concurrent operations for different EGs
   – Only operations within an EG need to be serialized
   – Temporary long waits and failed operations do not impact
     other EGs


• Many WAL to scale throughput and localize outages

• WAL is stored with each EG in Bigtable

• Examples with the same tenet
   – The asynchronous and concurrent RPC communication
     framework of HBase and Hadoop IPC.
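
A rough sketch of why many logs localize stalls. It rests on assumptions: the class `PerGroupLogs` is hypothetical, and a per-EG lock merely stands in for whatever serializes writes to one log (the slides do not say Megastore uses locks).

```python
import threading


class PerGroupLogs:
    """One log and one serialization point per EG (hypothetical illustration)."""

    def __init__(self):
        self._locks = {}
        self._logs = {}
        self._guard = threading.Lock()

    def hold(self, eg_key):
        """Return the EG's lock so a caller can simulate a long-running operation."""
        with self._guard:
            if eg_key not in self._locks:
                self._locks[eg_key] = threading.Lock()
                self._logs[eg_key] = []
            return self._locks[eg_key]

    def try_append(self, eg_key, entry):
        """Append to the EG's log unless another operation on that EG is in progress."""
        lock = self.hold(eg_key)
        if not lock.acquire(blocking=False):
            return False
        try:
            self._logs[eg_key].append(entry)
            return True
        finally:
            lock.release()


logs = PerGroupLogs()
stalled = logs.hold("U1")
stalled.acquire()                           # a slow operation on EG U1 begins
blocked = logs.try_append("U1", "m1")       # same EG: has to wait
independent = logs.try_append("U2", "m2")   # other EG: unaffected
stalled.release()
after = logs.try_append("U1", "m3")
```

With a single shared log, the stalled operation on U1 would have delayed the write to U2 as well.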



Consistency Levels and the Approaches
•   Within each EG: Full ACID semantics
     –   Single-phase-commit ACID transactions
     –   Each commit log entry is replicated across DCs via Paxos


•   Across EGs: Limited consistency guarantees (two methods for two levels)
     –   Two-phase commit (expensive, long latency) -> strong consistency
     –   Or, more typically, efficient asynchronous messaging (queues; inexpensive, low latency) ->
         loose (or eventual) consistency


•   Two-phase-commit vs. asynchronous-messaging
     –   Two-Phase-Commit transactions
           •   Strong consistency
           •   Expensive
           •   Long latency and low throughput
           •   Usually for low-traffic operations
     –   Asynchronous-messaging
           •   Loose consistency; data may be inconsistent (or eventually consistent)
           •   Inexpensive
           •   Usually for heavy-traffic operations


•   Objects to be made consistent:
     –   Data, Local Indexes, within EG : strong (via WAL, ACID)
     –   Data, Global Indexes, cross-EG : strong (via 2PC) or looser (via messaging)
     –   Replicas within DC : strong (via GFS and Bigtable)
     –   Replicas across DC : strong (via Paxos)
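
The asynchronous-messaging option can be sketched as follows. This is a minimal illustration, not Megastore's API: `Group`, `send_async` and `apply_inbox` are hypothetical names, and a `deque` stands in for the EG inbox.

```python
from collections import deque


class Group:
    """A stand-in for one EG with local data and an inbox queue."""

    def __init__(self, key):
        self.key = key
        self.data = {}
        self.inbox = deque()

    def apply_inbox(self):
        """Apply queued messages later, as a separate local transaction."""
        applied = 0
        while self.inbox:
            name, delta = self.inbox.popleft()
            self.data[name] = self.data.get(name, 0) + delta
            applied += 1
        return applied


def send_async(sender, receiver, name, delta):
    """One local write on the sender plus an enqueue; the receiver catches up later."""
    sender.data[name] = sender.data.get(name, 0) - delta
    receiver.inbox.append((name, delta))


a, b = Group("U1"), Group("U2")
a.data["credits"] = 10
send_async(a, b, "credits", 3)
before = b.data.get("credits", 0)   # receiver has not applied the message yet
b.apply_inbox()
after = b.data["credits"]
```

Between the send and the apply, the two groups disagree. That window is the loose consistency the slide refers to, and it is the price for avoiding two-phase commit.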
The two Faces of ACID Transactions

• Upside:
  – Simplifies application development
  – Makes it easier to reason about correctness


• Downside:
  – Reduced performance
  – Higher latency
  – Lower throughput




Architecture of Megastore – How is it deployed?

• How it is deployed
   – a client library (DB logic)
   – and auxiliary servers (for across-DC replication)


• Applications link to the client library




Data Model and Semantics
                to be a database …




Principles to be a DBMS


• Provides traditional database features, such as secondary
  indexes, etc.

• but only those features that can scale within user-tolerable
  latency limits,

• and only with the semantics that EG partitioning scheme can
  support.



                 Feature set carefully chosen, tradeoffs.




Data Model (concepts for database)

•   A Data Model is a notation for describing data or information.

•   Consists of 3 parts, generally
     – Structure of the data
     – Operations on the data
     – Constraints on the data


•   Megastore Data Model: Relational Model + Scale
     – Limited relational model
     – Bigtable’s scalability


•   High Level Model vs. Physical Level Model
     – Physical Level
         • Complicates application development
         • Bigtable’s data model is at the physical level
     – High Level
         • Lets programmers write code conveniently
         • Language, SQL

Data Model

•     Schemaful
       −      Strongly typed (primitives or Protocol Buffers)
       −      Required, optional or repeated
       −      All entities in a table have the same set of allowable properties.
       −      Nested Protocol Buffers?

•     Primary key
       −      Built from a sequence of properties
       −      Must be unique within the table

•     An EG = a root entity + all entities in child tables that reference it

[Figure, left (schema): Schemas (name) -> Tables (name) -> Entities (primary key) ->
Properties (name, type). Example: a Schema contains Table-1 (Entity-11, Entity-12) and
Table-2 (Entity-21, Entity-22); entities carry properties such as Property-111,
Property-112 and Property-113.]

[Figure, right (related hierarchical data): EG root table (EG key) -> child tables
(foreign key = EG key) -> entities. Example: root table User with child tables Photo
and Book, each holding entities.]

SQL-Like Schema Language (DDL)


CREATE SCHEMA DemoApp;

CREATE TABLE User {
    required int64 userId;
    required string name;
} PRIMARY KEY(userId), ENTITY GROUP ROOT;

CREATE TABLE Photo {
    required int64 userId;
    required int32 photoId;
    required int64 time;
    required string url;
    optional string thumbUrl;
    repeated string tag;
} PRIMARY KEY(userId, photoId),
  IN TABLE User,
  ENTITY GROUP KEY(userId) REFERENCES User;

CREATE LOCAL INDEX PhotosByTime
       ON Photo(userId, time);

CREATE GLOBAL INDEX PhotosByTag
       ON Photo(tag) STORING (thumbUrl);

------------------------------------
Additional Qualifiers: DESC|ASC|SCATTER

CREATE TABLE Book {
    required int64 userId;
    required int32 bookId;
    required int64 time;
    required string url;
    repeated string tag;
} PRIMARY KEY([DESC|ASC|SCATTER] userId, [DESC|ASC|SCATTER] bookId),
  IN TABLE User,
  ENTITY GROUP KEY(userId) REFERENCES User;

CREATE LOCAL INDEX BooksByTime
       ON Book([DESC|ASC|SCATTER] userId, [DESC|ASC] time);




Data Placement in Bigtable (principles)
Pre-join with Keys, for performance …
•   Lets applications control the placement of hierarchical/related data, to
    minimize latency and maximize throughput
     –   Storing data that is accessed together in nearby rows, or
     –   Denormalized into the same row


•   The data for an EG are held in contiguous ranges of Bigtable rows, for
     –   Low latency
     –   High throughput
     –   Cache efficiency


•   Pre-Joining with keys
     –   Primary keys to cluster entities that will be read together.
     –   Each entity maps into a single Bigtable row.
     –   Primary key values are concatenated to form the Bigtable row key
     –   Each remaining property occupies its own Bigtable column
     –   Entity-group key as the prefix of Primary key (row key)
     –   Sorted keys ascending or descending
     –   SCATTER (two-byte hash prefix), to prevent hotspots in Bigtable

     –   Recursive for arbitrary join depths (multiple levels of “IN TABLE”)
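
The key rules above (concatenation, ASC/DESC ordering, SCATTER prefix) can be sketched in a few lines. The byte encoding here is an assumption made for illustration: non-negative integers as 8 big-endian bytes, DESC as a bitwise complement, and the first two bytes of an MD5 digest as the scatter prefix. The slides do not specify Megastore's actual encoding or hash function.

```python
import hashlib
import struct


def encode_int(value, descending=False):
    """Encode a non-negative integer so that byte order matches numeric order."""
    raw = struct.pack(">Q", value)
    if descending:
        raw = bytes(0xFF - b for b in raw)  # complementing reverses the sort order
    return raw


def row_key(parts, scatter=False):
    """Concatenate encoded key parts; optionally prepend a two-byte hash prefix."""
    body = b"".join(parts)
    if scatter:
        body = hashlib.md5(body).digest()[:2] + body
    return body


user1 = row_key([encode_int(1)])
user2 = row_key([encode_int(2)])
photo_keys = [row_key([encode_int(1), encode_int(p)]) for p in (2, 1)]
ordered = sorted(photo_keys + [user2, user1])
```

Because the EG key is the prefix of every child key, sorting places user 1's root row first, then that user's photos, and only then user 2. That contiguous range is the pre-join.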

Data Placement in Bigtable (details)
Pre-join with Keys, for performance …
•   Bigtable row key = primary key of each table

•   Bigtable column name = <table name>.<property name>
    – Allowing entities from different Megastore tables to be mapped into the
      same Bigtable row without collision.


•   Store the transaction and replication log and metadata for the EG
    in root entity’s Bigtable row.
    – Because Bigtable provides per-row transactions.


•   Indexes: Each index entry is represented as a single Bigtable row
    – Bigtable row key = <indexed property values> + <primary key>
    – Bigtable cell columns: denormalized properties




Data Placement in Bigtable (examples)

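Column groups: Transaction Meta (Root.WAL, Root.meta), User Table (User.name),
Photo Table (Photo.time, Photo.url, Photo.thumbUrl, Photo.tag),
STORING Denormalized (PhotosByTag.thumbUrl).

Data rows (figure label: EG for U1)

Rows         Row Key    Root.WAL          Root.meta                        User.name  Photo.time  Photo.url  Photo.thumbUrl  Photo.tag
Root User    <U1>       Log3, Log2, Log1  commit offset, applied offset …  Jack
Photo Data   <U1,P1>                                                                  T1          URL1       TURL1           girl, car
Photo Data   <U1,P2>                                                                  T2          URL2       TURL2           dress, girl

Index rows

Rows                        Row Key           PhotosByTag.thumbUrl
Local Index PhotosByTime    <U1,T1><U1,P1>
Local Index PhotosByTime    <U1,T2><U1,P2>
Global Index PhotosByTag    <car><U1,P1>      TURL1
Global Index PhotosByTag    <dress><U1,P2>    TURL2
Global Index PhotosByTag    <girl><U1,P1>     TURL1
Global Index PhotosByTag    <girl><U1,P2>     TURL2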


Secondary Indexes

•   Secondary indexes can be declared on any list of entity
    properties (optional properties are fine), including repeated properties and
    sub-fields within Protocol Buffers, as well as full-text indexes.

•   Local Indexes
     – Within EG
     – Obey ACID semantics
         • The index entries are stored in the entity group and are updated atomically
           and consistently with the primary entity data.


•   Global Indexes
     – Span EGs
     – Looser consistency (possibly only eventual)
         • Not guaranteed to reflect all recent updates. (May be inconsistent with the
           primary data?)
         • Keeping global indexes consistent with the primary data is tricky. Possible approaches:
                   – A special two-phase commit?
                   – Read-repair?



Secondary Indexes and Denormalization
•   STORING clause for copying data into index entries
      –   Avoids indirect access to primary entities, which is very expensive random access.
      –   But keeping the copies consistent is an issue!


•   Inline Indexes
      –   Index entries from the source entities appear as a virtual repeated column in the
          target entry.
      –   An inline index can be created on any table (child) that has a foreign key
          referencing another table (parent) by using the first primary key of the target
          entity as the first components of the index.
Inline Index: repeated columns inline

Table                  Row Key   User.name  PhotosByTime.T1  PhotosByTime.T2  Photo.time  Photo.thumbUrl
User (parent table)    <U1>      Jack       <P1>             <P2>
Photo (child table)    <U1,P1>                                                T1          TURL1
Photo (child table)    <U1,P2>                                                T2          TURL2

              CREATE INLINE INDEX PhotosByTime ON Photo(userId, time);

Inline Indexes for many-to-many Relationships
•      Coupled with repeated indexes, inline indexes can also be used to
       implement many-to-many relationships more efficiently than by maintaining
       a many-to-many link table.
Inline Index: repeated columns inline (many-to-many)

Table                  Row Key   User.name  PhotosByTag.car  PhotosByTag.dress  PhotosByTag.girl  Photo.time  Photo.thumbUrl
User (parent table)    <U1>      Jack       <P1>             <P2>               <P1>, <P2>
Photo (child table)    <U1,P1>                                                                    T1          TURL1
Photo (child table)    <U1,P2>                                                                    T2          TURL2
User (parent table)    <U2>      Tom                                            <P1>
Photo (child table)    <U2,P1>                                                                    T3          TURL3

                    CREATE INLINE INDEX PhotosByTag ON Photo(userId, tag);
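
The table above can be reproduced with a short sketch that folds child index entries into the parent row as repeated columns. The function `inline_index` and its input shape are hypothetical; this shows the resulting layout, not how Megastore maintains it.

```python
from collections import defaultdict


def inline_index(index_name, child_rows, indexed_property):
    """Group child keys under their parent row as repeated columns.

    child_rows maps (parent_key, child_key) -> properties.
    Returns {parent_key: {"<index>.<value>": [child_key, ...]}}.
    """
    parents = defaultdict(lambda: defaultdict(list))
    for (parent_key, child_key), props in sorted(child_rows.items()):
        values = props[indexed_property]
        if not isinstance(values, list):
            values = [values]
        for value in values:
            parents[parent_key][f"{index_name}.{value}"].append(child_key)
    return {parent: dict(columns) for parent, columns in parents.items()}


photos = {
    ("U1", "P1"): {"tag": ["girl", "car"]},
    ("U1", "P2"): {"tag": ["dress", "girl"]},
    ("U2", "P1"): {"tag": ["girl"]},
}
by_tag = inline_index("PhotosByTag", photos, "tag")
```

Reading the single parent row for U1 then answers "which photos carry which tags" without a separate link table.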



API
•   Cost-transparent API
     –   Match application developers’ intuitions
     –   High-volume interactive workloads benefit more from predictable performance than from an
         expressive query language.


•   Normalized relational schemas, which rely on joins at query time to service user operations, are
    not the right model for Megastore applications.
     –   Pre-joins
     –   Denormalization


•   SQL-Like Schema language (DDL, for data structures and data placement)
     –   Fine-grained control over physical locality
           •   Hierarchical layouts (pre-joins)
           •   Declarative denormalization
     –   Eliminate the need for most joins


•   Queries API against particular tables and indexes
     –   Range Scans
     –   Lookups


•   Schema changes require corresponding modifications to the query implementation
    code



Query Joins

• Query Joins, when required, are implemented in application
  code.

• Index-based join

• Merge joins
   – Multiple queries return primary keys for the same table, in the
     same order.
   – The key lists are then intersected.


• Outer joins
   – Index lookup (return small result set)
   – Parallel index lookups using the results of the above lookup


• Other joins …?

Query Joins - Merge Joins

Query-1:  SELECT * FROM Photo WHERE tag=girl
Query-2:  SELECT * FROM Photo WHERE tag=car

Both queries use the global index PhotosByTag. Their results are merged by
intersection (&) or union (|): girl & car, or girl | car.

Just like:
    SELECT * FROM Photo WHERE tag=girl AND tag=car
or
    SELECT * FROM Photo WHERE tag=girl OR tag=car

Strictly speaking, a merge join is not a join in the SQL sense, but it does join the results of several queries.
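
A minimal sketch of the intersection step, assuming each index scan returns primary keys in the same ascending order. The function name `merge_intersect` and the sample key lists are illustrative only.

```python
def merge_intersect(left, right):
    """Intersect two ascending lists of primary keys in one linear pass."""
    result = []
    i = j = 0
    while i < len(left) and j < len(right):
        if left[i] == right[j]:
            result.append(left[i])
            i += 1
            j += 1
        elif left[i] < right[j]:
            i += 1
        else:
            j += 1
    return result


# Primary keys returned by the two PhotosByTag scans in the example data.
girl = [("U1", "P1"), ("U1", "P2")]
car = [("U1", "P1")]
both = merge_intersect(girl, car)
```

Because both inputs are already sorted, no hashing or re-sorting is needed. This is why the slide stresses that the queries must return keys in the same order.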



Query Joins - Outer Joins

Query-1 (index lookup):
    SELECT name, userId FROM User
    WHERE name=Jack
    (suppose there is an index: UsersByName)
    Result: name=Jack, userId=U1,U2

Query-2 (parallel index lookups, one for each userId in U1,U2, with T1<time<T10):
    SELECT thumbUrl FROM Photo
    WHERE time>T1 AND time<T10;

Just like:
    SELECT User.name, User.userId, Photo.thumbUrl FROM User
       LEFT OUTER JOIN Photo ON Photo.userId=User.userId
           AND Photo.time>T1 AND Photo.time<T10
    WHERE User.name=Jack

Example of result:
    Jack, U1, TURL1
    Jack, U2, NULL
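
The two-step outer join can be sketched in application code as follows. The in-memory dictionaries stand in for the UsersByName and PhotosByTime indexes, and the function names are hypothetical. A thread pool plays the role of the parallel index lookups.

```python
from concurrent.futures import ThreadPoolExecutor

# Stand-ins for index contents (illustrative data only).
USERS_BY_NAME = {"Jack": ["U1", "U2"]}
PHOTOS_BY_TIME = {"U1": [(5, "TURL1")], "U2": []}  # userId -> [(time, thumbUrl)]


def photos_in_range(user_id, start, end):
    """Second-stage lookup: thumbnails of one user with start < time < end."""
    return [thumb for t, thumb in PHOTOS_BY_TIME.get(user_id, []) if start < t < end]


def left_outer_join(name, start, end):
    """First look up the users, then fetch each user's photos in parallel."""
    user_ids = USERS_BY_NAME.get(name, [])
    with ThreadPoolExecutor(max_workers=4) as pool:
        per_user = list(pool.map(lambda u: photos_in_range(u, start, end), user_ids))
    rows = []
    for user_id, thumbs in zip(user_ids, per_user):
        if thumbs:
            rows.extend((name, user_id, thumb) for thumb in thumbs)
        else:
            rows.append((name, user_id, None))  # outer join keeps users with no photos
    return rows


rows = left_outer_join("Jack", 1, 10)
```

The first lookup has to return a small result set for this to stay cheap, since every returned user triggers one more index lookup.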


Transactions and Concurrency Control
•   An EG acts as a mini-database with serializable ACID transactions.
•   Transactions within-EG
     –   A transaction writes its mutations into the EG's WAL, then the mutations are applied to the data.
     –   Readers use the timestamp of the last fully applied transaction to avoid seeing partial updates.


•   MVCC: Multi-Version Concurrency Control (very important)
     –   Use Bigtable cell’s timestamps/versions
     –   Readers and writers don't block each other, and reads are isolated from writes for the duration
         of a transaction. (How? See MVCC in Wikipedia)


•   Write patterns
     –   A write transaction always begins with a current read to determine the next available log
         position. (This current read only ensures that all previously committed writes are applied.)
     –   The commit operation gathers mutations into a log entry, assigns it a timestamp higher than
         any previous one, and appends it to the log (using Paxos to replicate it across DCs).
     –   The write operation can return to the client at any point after commit.
•   Read patterns
     –   Current Read
     –   Snapshot Read
     –   Inconsistent Read

[Figure: Write Op -> Commit -> metadata and WAL of the EG root (in Bigtable) -> Apply
(the apply may be async) -> table data and index data in Bigtable. Read Op -> check
for and recover committed logs -> Read.]
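
A minimal sketch of how timestamped cell versions let readers avoid partial updates. `VersionedCells` is a hypothetical in-memory stand-in for Bigtable cells. The behaviour shown for snapshot and inconsistent reads follows the usual description of these patterns and should be read as an assumption, since the slide only names them here.

```python
class VersionedCells:
    """Multi-version store: each key keeps (timestamp, value) pairs in ascending order."""

    def __init__(self):
        self.cells = {}
        self.last_applied = 0  # timestamp of the last fully applied transaction

    def apply(self, timestamp, mutations, complete=True):
        """Write new versions; mark the transaction applied only when complete."""
        for key, value in mutations.items():
            self.cells.setdefault(key, []).append((timestamp, value))
        if complete:
            self.last_applied = timestamp

    def read_at(self, key, timestamp):
        """Return the newest value whose version is <= timestamp, or None."""
        visible = [value for ts, value in self.cells.get(key, []) if ts <= timestamp]
        return visible[-1] if visible else None

    def snapshot_read(self, key):
        """Read at the last fully applied timestamp, ignoring newer partial writes."""
        return self.read_at(key, self.last_applied)

    def inconsistent_read(self, key):
        """Read the latest version regardless of whether its transaction is applied."""
        versions = self.cells.get(key, [])
        return versions[-1][1] if versions else None


store = VersionedCells()
store.apply(1, {"name": "Jack", "url": "URL1"})
store.apply(2, {"name": "Jacky"}, complete=False)  # transaction 2 only partly applied
snapshot = store.snapshot_read("name")
latest = store.inconsistent_read("name")
```

Writers add new versions instead of overwriting, so a reader pinned to timestamp 1 is never blocked by, and never sees, the half-applied transaction 2.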


Transactions and Concurrency Control -
                      Write
                                                                       Last committed position
                            Writer
                                                                               (ts2)
                                                                                                                                                         When failure occurs here:
                                                                   Last fully applied position
                                                                               (ts1)
                                                                                                                                                          Transaction-1:
                                                                              Metadata
                                                                                                                                                           Very safe.
                               Transaction-3         Writing
                                                                                                                                                          Transaction-2:
Serializable Transactions




                                 (ongoing)           Writ                                               writing but not commit
                                                          e
                                                                                                    not gather and append into log
                                                                                                                                                            Safe, no data loss, but
                               Transaction-2
                                (commited)
                                                        Write
                                                       Commit             Mutation-22-ts2
                                                                                                                                                         should be recovered from
                                                                                                     committed but not
                                                                                                       fully applied
                                                                                                                                            partially
                                                                                                                                          applied data
log to data, when doing "current read" or "write" operations.

Transaction-3: Not complete, failed. The application will get a failed return.

Transaction-1 (committed, applied): on commit, writes are gathered into a log entry, which is assigned a timestamp.

WAL entries shown: Mutation-21-ts2, Mutation-12-ts1, Mutation-11-ts1 (the ts1 entries are committed and fully applied).

Data (uses Bigtable timestamps for MVCC): Data-part1-ts2, Data-part2-ts1, Data-part1-ts1.

Figure: The state of the transaction system for an EG.



Note: The commit operation gathers mutations into a log entry and assigns it a timestamp.
A write transaction always begins by ensuring that all previously committed writes are
applied (via a current read)!
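
The note above can be sketched in a few lines of Python. This is a minimal single-process model under stated assumptions, not Megastore's implementation: the names `EntityGroup`, `LogEntry`, `begin_write`, `commit` and `apply_pending` are hypothetical, timestamps are plain integers, and the rule that a commit fails when another writer has committed since the read is an assumption added to illustrate a failed transaction such as Transaction-3.

```python
from dataclasses import dataclass, field


@dataclass
class LogEntry:
    ts: int          # timestamp assigned at commit
    mutations: dict  # key -> new value


@dataclass
class EntityGroup:
    wal: list = field(default_factory=list)   # committed log entries, oldest first
    data: dict = field(default_factory=dict)  # key -> {ts: value}, like versioned cells
    last_committed_ts: int = 0
    last_applied_ts: int = 0

    def apply_pending(self):
        """Roll committed-but-unapplied log entries forward into data."""
        for entry in self.wal:
            if entry.ts > self.last_applied_ts:
                for key, value in entry.mutations.items():
                    self.data.setdefault(key, {})[entry.ts] = value
                self.last_applied_ts = entry.ts

    def begin_write(self):
        """A write begins like a current read: catch up, then note the read timestamp."""
        self.apply_pending()
        return self.last_committed_ts

    def commit(self, read_ts, mutations):
        """Gather mutations into one log entry and assign it a higher timestamp."""
        if read_ts != self.last_committed_ts:
            return None  # assumption: someone else committed first, so this write fails
        entry = LogEntry(ts=self.last_committed_ts + 1, mutations=dict(mutations))
        self.wal.append(entry)
        self.last_committed_ts = entry.ts
        return entry.ts
```

In this sketch a successful `commit` only appends to the log; the data is updated later, the next time `apply_pending` runs.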
Transactions Read Patterns and Lifecycle
•   Current Read
    – Only within an EG.
    – When starting a current read, the transaction system first ensures that all previously committed writes are applied (just like the recovery of commit logs).
    – Then the application reads at the timestamp of the latest committed transaction.

•   Snapshot Read
    – Only within an EG.
    – Picks up the timestamp of the last known fully applied transaction and reads from there.
    – Some committed transactions may not yet be applied.

•   Inconsistent Read
    – Reads the latest values directly; may return partially applied data, in exchange for aggressive latency.

•   A complete transaction lifecycle
    – Read
         • Get the timestamp of the last committed transaction from metadata.
    – Application logic
         • Read-modify-write.
    – Commit
         • Gather mutations into a log entry and assign it a higher timestamp.
         • Replicate across DCs via Paxos.
         • Can return to the client here.
    (The following steps may be asynchronous.)
    – Apply
         • Write mutations into data and indexes.
    – Clean up
         • Delete fully applied logs.
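
The three read patterns differ only in which timestamp they read at and whether they first apply the log. The following Python sketch models that difference for a single entity group. It is an illustration under stated assumptions, not Megastore code: `EntityGroupState`, `read_at` and all method names are hypothetical, timestamps are integers, and replication is left out.

```python
def read_at(data, key, ts):
    """MVCC read: newest version of key whose timestamp is <= ts."""
    versions = data.get(key, {})
    visible = [t for t in versions if t <= ts]
    return versions[max(visible)] if visible else None


class EntityGroupState:
    def __init__(self):
        self.wal = []   # committed log entries as (ts, {key: value})
        self.data = {}  # key -> {ts: value}
        self.last_committed_ts = 0
        self.last_applied_ts = 0

    def commit(self, mutations):
        """Append one log entry with a higher timestamp; data is not touched yet."""
        ts = self.last_committed_ts + 1
        self.wal.append((ts, dict(mutations)))
        self.last_committed_ts = ts
        return ts

    def apply_pending(self):
        """Apply step: write logged mutations into data, then advance metadata."""
        for ts, mutations in self.wal:
            if ts > self.last_applied_ts:
                for key, value in mutations.items():
                    self.data.setdefault(key, {})[ts] = value
                self.last_applied_ts = ts

    def cleanup(self):
        """Clean-up step: drop log entries that are fully applied."""
        self.wal = [(ts, m) for ts, m in self.wal if ts > self.last_applied_ts]

    def current_read(self, key):
        """Apply all committed writes first, then read at the last committed ts."""
        self.apply_pending()
        return read_at(self.data, key, self.last_committed_ts)

    def snapshot_read(self, key):
        """Read at the last fully applied ts; newer commits may be missed."""
        return read_at(self.data, key, self.last_applied_ts)

    def inconsistent_read(self, key):
        """Ignore the log state and return whatever version is newest in data."""
        versions = self.data.get(key, {})
        return versions[max(versions)] if versions else None
```

With one transaction committed but only half applied, a snapshot read in this model returns the older consistent values, an inconsistent read may mix old and new values across keys, and a current read returns the new values after forcing the apply.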

Transactions Read Patterns – Current Read

Figure: The state of the transaction system for an EG, with a current reader. (Bigtable timestamps are used for MVCC.)

• Metadata
   – Last committed position: ts2
   – Last fully applied position: ts1 -> ts2
• Serializable transactions
   – Transaction-1 (committed, applied): Mutation-11-ts1 and Mutation-12-ts1 in the WAL; committed and fully applied.
   – Transaction-2 (committed): Mutation-21-ts2 and Mutation-22-ts2 in the WAL; committed but not fully applied.
   – Transaction-3 (ongoing): writing but not committed; its mutations are not yet gathered and appended to the log.
   – On commit, writes are gathered into a log entry, which is assigned a timestamp.
• Data
   – Data-part1-ts1, Data-part2-ts1, Data-part1-ts2, Data-part2-ts2
• Current reader
   – (1) Check latest committed writes
   – (2) Apply previous committed writes
   – (3) Update metadata
   – (4) Read data at ts2

Recovery (applying committed writes) is done before the data is read.

Transactions Read Patterns – Snapshot Read

Figure: The state of the transaction system for an EG, with a snapshot reader. (Bigtable timestamps are used for MVCC.)

• Metadata
   – Last committed position: ts2
   – Last fully applied position: ts1
• Serializable transactions
   – Transaction-1 (committed, applied): Mutation-11-ts1 and Mutation-12-ts1 in the WAL; committed and fully applied.
   – Transaction-2 (committed): Mutation-21-ts2 and Mutation-22-ts2 in the WAL; committed but not fully applied.
   – Transaction-3 (ongoing): writing but not committed; its mutations are not yet gathered and appended to the log.
• Data
   – Data-part1-ts1, Data-part2-ts1, Data-part1-ts2
• Snapshot reader
   – (1) Get last fully applied ts
   – (2) Read data at ts1

This is the simplest read pattern.

Transactions Read Patterns – Inconsistent Read

Figure: The state of the transaction system for an EG, with an inconsistent reader. (Bigtable timestamps are used for MVCC.)

• Metadata
   – Last committed position: ts2
   – Last fully applied position: ts1
• Serializable transactions
   – Transaction-1 (committed, applied): Mutation-11-ts1 and Mutation-12-ts1 in the WAL; committed and fully applied.
   – Transaction-2 (committed): Mutation-21-ts2 and Mutation-22-ts2 in the WAL; committed but not fully applied.
   – Transaction-3 (ongoing): writing but not committed; its mutations are not yet gathered and appended to the log.
• Data
   – Data-part1-ts1, Data-part2-ts1, Data-part1-ts2 (partially applied data)
• Inconsistent reader
   – (1) Directly read partial data

The application must tolerate stale or partially applied data.

Two-Phase-Commit




                   Expensive, Long Latency




Replication
             for High Availability …

I need to study Paxos further, so it is not covered in detail here.




Replication

• Within-DC
   – Across hosts
   – Built-in from Bigtable and GFS


• Across-DC
   – … synchronous and consistent for each write




Replication cross-DC
•   Traditional strategies (do not work)
     –   Asynchronous Master/Slave
           •   Asynchronously propagate
           •   Master supports fast ACID transactions
           •   Low latency
           •   Data loss risk
           •   Downtime for failover
           •   Heavyweight master
           •   Requires a mediator for mastership (e.g. ZooKeeper)
     –   Synchronous Master/Slave
           •   No data loss
           •   Downtime for failover
           •   Long latency
           •   Heavyweight master
           •   Requires a mediator for mastership (e.g. ZooKeeper)
     –   Optimistic Replication
           •   No distinguished master
           •   Asynchronously propagate
           •   Availability and latency are excellent
           •   No mutation order, so transactions are impossible
           •   Like Cassandra/Dynamo

•   EG-based: synchronously replicate each write

•   Use Paxos
     – No distinguished master
     – Replicate write-ahead-log
     – Synchronously replicating writes (each log append blocks on acknowledgments from a majority of replicas, and replicas in the minority catch up as they are able)
     – Any node can initiate writes and reads
     – Reasonable latency
     – Extensions
          • Allows local reads at any up-to-date replica
          • Permits single-roundtrip writes
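
The "majority acknowledgment, minority catches up" idea can be illustrated with a small Python sketch. It is deliberately not Paxos: there are no proposal numbers, no leader or round handling, and no recovery from a failed round. `Replica`, `replicate_append` and `catch_up` are hypothetical names, and each replica's log is just an in-memory list.

```python
class Replica:
    def __init__(self, name, up=True):
        self.name = name
        self.up = up    # whether the replica is currently reachable
        self.log = []   # replicated write-ahead log

    def append(self, position, entry):
        """Acknowledge only if reachable and the log has no gap before position."""
        if not self.up or position != len(self.log):
            return False
        self.log.append(entry)
        return True

    def catch_up(self, source_log):
        """Copy the entries this replica is missing from an up-to-date log."""
        self.log.extend(source_log[len(self.log):])


def replicate_append(replicas, position, entry):
    """Return True when a majority of replicas acknowledged the append."""
    acks = sum(1 for replica in replicas if replica.append(position, entry))
    return acks > len(replicas) // 2
```

With three replicas and one of them down, an append in this sketch still succeeds on two acknowledgments; the lagging replica rejects later positions until `catch_up` fills the gap in its log.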
Paxos

• Traditional usages
   – Locking
   – Master election
   – Replication of metadata and configurations


• Megastore's use of Paxos
   – Replicate primary user data across-DC on every write
   – For across-DC high availability




To Study More …




Valuable References

•   P. Helland. Life beyond distributed transactions: an apostate's
    opinion. In CIDR, pages 132-141, 2007.
     – The philosophy that inspired Megastore





Contenu connexe

Tendances

Cassandra Virtual Node talk
Cassandra Virtual Node talkCassandra Virtual Node talk
Cassandra Virtual Node talkPatrick McFadin
 
MongoDB Days UK: Building an Enterprise Data Fabric at Royal Bank of Scotland...
MongoDB Days UK: Building an Enterprise Data Fabric at Royal Bank of Scotland...MongoDB Days UK: Building an Enterprise Data Fabric at Royal Bank of Scotland...
MongoDB Days UK: Building an Enterprise Data Fabric at Royal Bank of Scotland...MongoDB
 
Presto @ Uber Hadoop summit2017
Presto @ Uber Hadoop summit2017Presto @ Uber Hadoop summit2017
Presto @ Uber Hadoop summit2017Zhenxiao Luo
 
Always on in SQL Server 2012
Always on in SQL Server 2012Always on in SQL Server 2012
Always on in SQL Server 2012Fadi Abdulwahab
 
Building an open data platform with apache iceberg
Building an open data platform with apache icebergBuilding an open data platform with apache iceberg
Building an open data platform with apache icebergAlluxio, Inc.
 
Sec016 詳説 -_rights_management_services__azure_information_protection
Sec016 詳説 -_rights_management_services__azure_information_protectionSec016 詳説 -_rights_management_services__azure_information_protection
Sec016 詳説 -_rights_management_services__azure_information_protectionTech Summit 2016
 
Comparison of MPP Data Warehouse Platforms
Comparison of MPP Data Warehouse PlatformsComparison of MPP Data Warehouse Platforms
Comparison of MPP Data Warehouse PlatformsDavid Portnoy
 
DI&A Slides: Data Lake vs. Data Warehouse
DI&A Slides: Data Lake vs. Data WarehouseDI&A Slides: Data Lake vs. Data Warehouse
DI&A Slides: Data Lake vs. Data WarehouseDATAVERSITY
 
Google Bigtable Paper Presentation
Google Bigtable Paper PresentationGoogle Bigtable Paper Presentation
Google Bigtable Paper Presentationvanjakom
 
NoSQL Architecture Overview
NoSQL Architecture OverviewNoSQL Architecture Overview
NoSQL Architecture OverviewChristopher Foot
 
MSA 전략 1: 마이크로서비스, 어떻게 디자인 할 것인가?
MSA 전략 1: 마이크로서비스, 어떻게 디자인 할 것인가?MSA 전략 1: 마이크로서비스, 어떻게 디자인 할 것인가?
MSA 전략 1: 마이크로서비스, 어떻게 디자인 할 것인가?VMware Tanzu Korea
 
Schemaless Databases
Schemaless DatabasesSchemaless Databases
Schemaless DatabasesDan Gunter
 
Messaging queue - Kafka
Messaging queue - KafkaMessaging queue - Kafka
Messaging queue - KafkaMayank Bansal
 
RedisConf17- Using Redis at scale @ Twitter
RedisConf17- Using Redis at scale @ TwitterRedisConf17- Using Redis at scale @ Twitter
RedisConf17- Using Redis at scale @ TwitterRedis Labs
 
CAP Theorem - Theory, Implications and Practices
CAP Theorem - Theory, Implications and PracticesCAP Theorem - Theory, Implications and Practices
CAP Theorem - Theory, Implications and PracticesYoav Francis
 

Tendances (20)

Cassandra Virtual Node talk
Cassandra Virtual Node talkCassandra Virtual Node talk
Cassandra Virtual Node talk
 
MongoDB Days UK: Building an Enterprise Data Fabric at Royal Bank of Scotland...
MongoDB Days UK: Building an Enterprise Data Fabric at Royal Bank of Scotland...MongoDB Days UK: Building an Enterprise Data Fabric at Royal Bank of Scotland...
MongoDB Days UK: Building an Enterprise Data Fabric at Royal Bank of Scotland...
 
Presto @ Uber Hadoop summit2017
Presto @ Uber Hadoop summit2017Presto @ Uber Hadoop summit2017
Presto @ Uber Hadoop summit2017
 
NoSQL databases
NoSQL databasesNoSQL databases
NoSQL databases
 
Google File System
Google File SystemGoogle File System
Google File System
 
Always on in SQL Server 2012
Always on in SQL Server 2012Always on in SQL Server 2012
Always on in SQL Server 2012
 
Building an open data platform with apache iceberg
Building an open data platform with apache icebergBuilding an open data platform with apache iceberg
Building an open data platform with apache iceberg
 
080827 abramson inmon vs kimball
080827 abramson   inmon vs kimball080827 abramson   inmon vs kimball
080827 abramson inmon vs kimball
 
Kafka 101
Kafka 101Kafka 101
Kafka 101
 
Sec016 詳説 -_rights_management_services__azure_information_protection
Sec016 詳説 -_rights_management_services__azure_information_protectionSec016 詳説 -_rights_management_services__azure_information_protection
Sec016 詳説 -_rights_management_services__azure_information_protection
 
Comparison of MPP Data Warehouse Platforms
Comparison of MPP Data Warehouse PlatformsComparison of MPP Data Warehouse Platforms
Comparison of MPP Data Warehouse Platforms
 
DI&A Slides: Data Lake vs. Data Warehouse
DI&A Slides: Data Lake vs. Data WarehouseDI&A Slides: Data Lake vs. Data Warehouse
DI&A Slides: Data Lake vs. Data Warehouse
 
Google Bigtable Paper Presentation
Google Bigtable Paper PresentationGoogle Bigtable Paper Presentation
Google Bigtable Paper Presentation
 
NoSQL Architecture Overview
NoSQL Architecture OverviewNoSQL Architecture Overview
NoSQL Architecture Overview
 
MSA 전략 1: 마이크로서비스, 어떻게 디자인 할 것인가?
MSA 전략 1: 마이크로서비스, 어떻게 디자인 할 것인가?MSA 전략 1: 마이크로서비스, 어떻게 디자인 할 것인가?
MSA 전략 1: 마이크로서비스, 어떻게 디자인 할 것인가?
 
Schemaless Databases
Schemaless DatabasesSchemaless Databases
Schemaless Databases
 
Messaging queue - Kafka
Messaging queue - KafkaMessaging queue - Kafka
Messaging queue - Kafka
 
RedisConf17- Using Redis at scale @ Twitter
RedisConf17- Using Redis at scale @ TwitterRedisConf17- Using Redis at scale @ Twitter
RedisConf17- Using Redis at scale @ Twitter
 
Big Data Tech Stack
Big Data Tech StackBig Data Tech Stack
Big Data Tech Stack
 
CAP Theorem - Theory, Implications and Practices
CAP Theorem - Theory, Implications and PracticesCAP Theorem - Theory, Implications and Practices
CAP Theorem - Theory, Implications and Practices
 

En vedette

Google Megastore
Google MegastoreGoogle Megastore
Google Megastorebergwolf
 
Cassandra Compression and Performance Evaluation
Cassandra Compression and Performance EvaluationCassandra Compression and Performance Evaluation
Cassandra Compression and Performance EvaluationSchubert Zhang
 
Scrum Agile Development
Scrum Agile DevelopmentScrum Agile Development
Scrum Agile DevelopmentSchubert Zhang
 
Db presentation google_megastore
Db presentation google_megastoreDb presentation google_megastore
Db presentation google_megastoreAlanoud Alqoufi
 
TestNet thema avond 11-12-2013 - De T-Shaped Tester
TestNet thema avond 11-12-2013 - De T-Shaped TesterTestNet thema avond 11-12-2013 - De T-Shaped Tester
TestNet thema avond 11-12-2013 - De T-Shaped TesterRemi-Armand Collaris
 
Training sessions: User Experience
Training sessions: User ExperienceTraining sessions: User Experience
Training sessions: User Experiencesoftwareallies
 
MORE Mega Store .........
MORE Mega Store .........MORE Mega Store .........
MORE Mega Store .........PESHWA ACHARYA
 
MySQL High Availability Solutions - Feb 2015 webinar
MySQL High Availability Solutions - Feb 2015 webinarMySQL High Availability Solutions - Feb 2015 webinar
MySQL High Availability Solutions - Feb 2015 webinarAndrew Morgan
 
Dental caries ppt
Dental caries pptDental caries ppt
Dental caries pptRubab000
 

En vedette (11)

Google Megastore
Google MegastoreGoogle Megastore
Google Megastore
 
Cassandra Compression and Performance Evaluation
Cassandra Compression and Performance EvaluationCassandra Compression and Performance Evaluation
Cassandra Compression and Performance Evaluation
 
Megastore by Google
Megastore by GoogleMegastore by Google
Megastore by Google
 
Scrum Agile Development
Scrum Agile DevelopmentScrum Agile Development
Scrum Agile Development
 
Db presentation google_megastore
Db presentation google_megastoreDb presentation google_megastore
Db presentation google_megastore
 
TestNet thema avond 11-12-2013 - De T-Shaped Tester
TestNet thema avond 11-12-2013 - De T-Shaped TesterTestNet thema avond 11-12-2013 - De T-Shaped Tester
TestNet thema avond 11-12-2013 - De T-Shaped Tester
 
Noha mega store
Noha mega storeNoha mega store
Noha mega store
 
Training sessions: User Experience
Training sessions: User ExperienceTraining sessions: User Experience
Training sessions: User Experience
 
MORE Mega Store .........
MORE Mega Store .........MORE Mega Store .........
MORE Mega Store .........
 
MySQL High Availability Solutions - Feb 2015 webinar
MySQL High Availability Solutions - Feb 2015 webinarMySQL High Availability Solutions - Feb 2015 webinar
MySQL High Availability Solutions - Feb 2015 webinar
 
Dental caries ppt
Dental caries pptDental caries ppt
Dental caries ppt
 

Similaire à Learning from google megastore (Part-1)

A Survey of Advanced Non-relational Database Systems: Approaches and Applicat...
A Survey of Advanced Non-relational Database Systems: Approaches and Applicat...A Survey of Advanced Non-relational Database Systems: Approaches and Applicat...
A Survey of Advanced Non-relational Database Systems: Approaches and Applicat...Qian Lin
 
Cidr11 paper32
Cidr11 paper32Cidr11 paper32
Cidr11 paper32jujukoko
 
Megastore providing scalable, highly available storage for interactive services
Megastore providing scalable, highly available storage for interactive servicesMegastore providing scalable, highly available storage for interactive services
Megastore providing scalable, highly available storage for interactive servicesJoão Gabriel Lima
 
In memory grids IMDG
In memory grids IMDGIn memory grids IMDG
In memory grids IMDGPrateek Jain
 
Navigating NoSQL in cloudy skies
Navigating NoSQL in cloudy skiesNavigating NoSQL in cloudy skies
Navigating NoSQL in cloudy skiesshnkr_rmchndrn
 
SpringPeople - Introduction to Cloud Computing
SpringPeople - Introduction to Cloud ComputingSpringPeople - Introduction to Cloud Computing
SpringPeople - Introduction to Cloud ComputingSpringPeople
 
Beyond The Data Grid: Coherence, Normalisation, Joins and Linear Scalability
Beyond The Data Grid: Coherence, Normalisation, Joins and Linear ScalabilityBeyond The Data Grid: Coherence, Normalisation, Joins and Linear Scalability
Beyond The Data Grid: Coherence, Normalisation, Joins and Linear ScalabilityBen Stopford
 
Using Distributed In-Memory Computing for Fast Data Analysis
Using Distributed In-Memory Computing for Fast Data AnalysisUsing Distributed In-Memory Computing for Fast Data Analysis
Using Distributed In-Memory Computing for Fast Data AnalysisScaleOut Software
 
Big iron 2 (published)
Big iron 2 (published)Big iron 2 (published)
Big iron 2 (published)Ben Stopford
 
Where Does Big Data Meet Big Database - QCon 2012
Where Does Big Data Meet Big Database - QCon 2012Where Does Big Data Meet Big Database - QCon 2012
Where Does Big Data Meet Big Database - QCon 2012Ben Stopford
 
Data Lake and the rise of the microservices
Data Lake and the rise of the microservicesData Lake and the rise of the microservices
Data Lake and the rise of the microservicesBigstep
 
North Bay Ruby Meetup 101911
North Bay Ruby Meetup 101911North Bay Ruby Meetup 101911
North Bay Ruby Meetup 101911Ines Sombra
 
NoSQL A brief look at Apache Cassandra Distributed Database
NoSQL A brief look at Apache Cassandra Distributed DatabaseNoSQL A brief look at Apache Cassandra Distributed Database
NoSQL A brief look at Apache Cassandra Distributed DatabaseJoe Alex
 
SQL, NoSQL, BigData in Data Architecture
SQL, NoSQL, BigData in Data ArchitectureSQL, NoSQL, BigData in Data Architecture
SQL, NoSQL, BigData in Data ArchitectureVenu Anuganti
 
start_your_datacenter_sds_v3
start_your_datacenter_sds_v3start_your_datacenter_sds_v3
start_your_datacenter_sds_v3David Byte
 
Yes sql08 inmemorydb
Yes sql08 inmemorydbYes sql08 inmemorydb
Yes sql08 inmemorydbDaniel Austin
 
GEN-Z: An Overview and Use Cases
GEN-Z: An Overview and Use CasesGEN-Z: An Overview and Use Cases
GEN-Z: An Overview and Use Casesinside-BigData.com
 

Similaire à Learning from google megastore (Part-1) (20)

A Survey of Advanced Non-relational Database Systems: Approaches and Applicat...
A Survey of Advanced Non-relational Database Systems: Approaches and Applicat...A Survey of Advanced Non-relational Database Systems: Approaches and Applicat...
A Survey of Advanced Non-relational Database Systems: Approaches and Applicat...
 
Cidr11 paper32
Cidr11 paper32Cidr11 paper32
Cidr11 paper32
 
Megastore providing scalable, highly available storage for interactive services
Megastore providing scalable, highly available storage for interactive servicesMegastore providing scalable, highly available storage for interactive services
Megastore providing scalable, highly available storage for interactive services
 
In memory grids IMDG
In memory grids IMDGIn memory grids IMDG
In memory grids IMDG
 
Navigating NoSQL in cloudy skies
Navigating NoSQL in cloudy skiesNavigating NoSQL in cloudy skies
Navigating NoSQL in cloudy skies
 
SpringPeople - Introduction to Cloud Computing
SpringPeople - Introduction to Cloud ComputingSpringPeople - Introduction to Cloud Computing
SpringPeople - Introduction to Cloud Computing
 
Beyond The Data Grid: Coherence, Normalisation, Joins and Linear Scalability
Beyond The Data Grid: Coherence, Normalisation, Joins and Linear ScalabilityBeyond The Data Grid: Coherence, Normalisation, Joins and Linear Scalability
Beyond The Data Grid: Coherence, Normalisation, Joins and Linear Scalability
 
Using Distributed In-Memory Computing for Fast Data Analysis
Using Distributed In-Memory Computing for Fast Data AnalysisUsing Distributed In-Memory Computing for Fast Data Analysis
Using Distributed In-Memory Computing for Fast Data Analysis
 
Big iron 2 (published)
Big iron 2 (published)Big iron 2 (published)
Big iron 2 (published)
 
Where Does Big Data Meet Big Database - QCon 2012
Where Does Big Data Meet Big Database - QCon 2012Where Does Big Data Meet Big Database - QCon 2012
Where Does Big Data Meet Big Database - QCon 2012
 
Data Lake and the rise of the microservices
Data Lake and the rise of the microservicesData Lake and the rise of the microservices
Data Lake and the rise of the microservices
 
North Bay Ruby Meetup 101911
North Bay Ruby Meetup 101911North Bay Ruby Meetup 101911
North Bay Ruby Meetup 101911
 
NoSQL A brief look at Apache Cassandra Distributed Database
NoSQL A brief look at Apache Cassandra Distributed DatabaseNoSQL A brief look at Apache Cassandra Distributed Database
NoSQL A brief look at Apache Cassandra Distributed Database
 
No sql
No sqlNo sql
No sql
 
SQL, NoSQL, BigData in Data Architecture
SQL, NoSQL, BigData in Data ArchitectureSQL, NoSQL, BigData in Data Architecture
SQL, NoSQL, BigData in Data Architecture
 
Kafka & Hadoop in Rakuten
Kafka & Hadoop in RakutenKafka & Hadoop in Rakuten
Kafka & Hadoop in Rakuten
 
Redis meetup
Redis meetupRedis meetup
Redis meetup
 
start_your_datacenter_sds_v3
start_your_datacenter_sds_v3start_your_datacenter_sds_v3
start_your_datacenter_sds_v3
 
Yes sql08 inmemorydb
Yes sql08 inmemorydbYes sql08 inmemorydb
Yes sql08 inmemorydb
 
GEN-Z: An Overview and Use Cases
GEN-Z: An Overview and Use CasesGEN-Z: An Overview and Use Cases
GEN-Z: An Overview and Use Cases
 

Learning from Google Megastore (Part-1)

  • 1. Learning from Google Megastore, Part 1: Data Model and Transactions in a Single Datacenter (without Replication and Paxos). Schubert Zhang, April 9, 2011
  • 3. Three Aspects of Megastore
      • Data model to be a DB
          – Data layout
          – Indexing
      • Transactions and ACID
          – Within an entity group
          – Across entity groups
      • Replication across datacenters (not covered in detail in this presentation)
          – Synchronous replication
          – Optimized Paxos
  • 4. What Is Megastore?
      • Megastore is a database over Bigtable, with high availability across datacenters.
      • Bigdata philosophy: fine-grained partitioning to make things easy, data placement for relations, and Paxos; then a simple API/language for convenience of use.
  • 5. Target Applications
      • Interactive online services
          – User-facing applications
      • Conflicting requirements
          – Highly scalable (size, throughput)
          – Rapid development, fast time-to-market
          – Responsive, low latency
          – Consistent view of data
          – Highly available
      • Reads vs. writes
          – 20 billion : 3 billion daily at Google
          – About 7:1
      • Bigdata
          – Petabytes of primary data
          – Across datacenters
      • Application developers
          – Are familiar with RDBMS and SQL
          – Find it difficult to give up the "read-modify-write" idiom
          – But now need high scalability for bigdata
  • 6. NoSQL + RDBMS = Megastore
      • NoSQL datastore (Bigtable)
          – Pros
              • Highly scalable
              • Highly available within a DC (across hosts)
          – Cons
              • Limited API
              • Loose consistency models
              • Complicates application development
      • RDBMS
          – Pros
              • Rich set of features for convenient, rapid application development
              • Transactions
              • ACID semantics
          – Cons
              • Difficult to scale
      • Megastore database (a blend of the two)
          – High scalability
          – Distributed transactions
          – Consistency guarantees
          – Fully serializable ACID semantics within entity groups
          – Convenient, rapid application development
      • + High availability
          – Within a DC (Bigtable)
          – Across-DC replication with Paxos (synchronous writes within an EG)
          – Strong consistency guarantees (synchronous replication)
          – Reasonable latency, seamless failover
  • 7. Design Principles
      • Take a middle ground in the RDBMS vs. NoSQL design space:
          – partition the datastore and
          – replicate each partition separately,
          – providing full ACID semantics within partitions,
          – but only limited/loose consistency guarantees across them.
      • Use Paxos to build a highly available system that:
          – provides reasonable latencies for interactive applications while
          – synchronously replicating writes across geographically distributed datacenters,
          – to achieve across-DC high availability and a consistent view of the data.
      • Approaches:
          – for database scale, partition data into a vast space of small databases, each with its own replicated log stored in a per-replica Bigtable;
          – for availability, implement a synchronous, fault-tolerant log replicator optimized for cross-DC replication.
  • 8. EG: Entity Groups
      • The entity-group concept is the cornerstone of scalability and availability!
          – Fine-grained partitions of data
          – Fine-grained control over the data's partitioning and locality
          – Like many mini-databases
          – To scale throughput and localize outages
          – Each independently and synchronously replicated across DCs
      • A physical EG in Bigtable consists of
          – A write-ahead log (for ACID transactions)
          – Related data (pre-joined)
          – Local indexes (also ACID)
          – … like a mini-database (locally complete)
          – And an inbox for receiving across-EG messages
      • Size of an EG
          – Not too large, not too small
          – A priori/natural or deliberate grouping of data for fast operations
          – If too large: serializable ACID causes long latency and low throughput
          – If too small: many expensive across-EG consistency operations (e.g. 2PC), or looser consistency via asynchronous messaging
      • Side notes
          – The data for most Internet services can be suitably partitioned (e.g., by user) to make this approach viable.
          – Nearly all applications built on Megastore have found ways to draw EG boundaries.
  • 9. Schematic Diagrams
      [Figure: Megastore layout in Bigtable. Each EG is like a mini-DB that holds a WAL (logs), primary data, local indexes, and an inbox for queued messages; EG 2 … EG n follow the same layout.]
  • 10. Many WALs vs. a Single WAL
      • Many replicated logs, each governing its own EG, to improve availability and throughput.
          – Independent and concurrent operations for different EGs
          – Only operations within an EG need to be serialized
          – Temporary long waits and failed operations do not impact other EGs
      • Many WALs to scale throughput and localize outages
      • The WAL is stored with each EG in Bigtable
      • Examples with the same tenet
          – The asynchronous and concurrent RPC communication framework of HBase and Hadoop IPC.
  • 11. Consistency Levels and the Approaches
      • Within each EG: full ACID semantics
          – Single-phase-commit ACID transactions
          – Each commit is replicated across DCs via Paxos
      • Across EGs: limited consistency guarantees (two methods for two levels)
          – Two-phase commit (expensive, long latency) -> strong consistency
          – Or, typically, efficient asynchronous messaging (a queue: inexpensive, low latency) -> loose (or eventual) consistency
      • Two-phase commit vs. asynchronous messaging
          – Two-phase-commit transactions
              • Strong consistency
              • Expensive
              • Long latency and low throughput
              • Usually for low-traffic operations
          – Asynchronous messaging
              • Loose consistency; data may be temporarily inconsistent (or eventually consistent)
              • Inexpensive
              • Usually for heavy-traffic operations
      • Objects to be made consistent:
          – Data and local indexes, within an EG: strong (via WAL, ACID)
          – Data and global indexes, across EGs: strong (via 2PC) or looser (via messaging)
          – Replicas within a DC: strong (via GFS and Bigtable)
          – Replicas across DCs: strong (via Paxos)
  • 12. The Two Faces of ACID Transactions
      • Front face:
          – Simplifies application development
          – Makes it easier to reason about correctness
      • Back face:
          – Reduced performance
          – Latency
          – Throughput
  • 13. Architecture of Megastore: How Is It Deployed?
      • How it is deployed
          – a client library (DB logic)
          – and auxiliary servers (for across-DC replication)
      • Applications link to the client library
  • 14. Data Model and Semantics to Be a Database …
  • 15. Principles to Be a DBMS
      • Provides traditional database features, such as secondary indexes,
      • but only those features that can scale within user-tolerable latency limits,
      • and only with the semantics that the EG partitioning scheme can support.
      • The feature set is carefully chosen: tradeoffs.
  • 16. Data Model (concepts for database)
      • A data model is a notation for describing data or information.
      • It generally consists of 3 parts
          – Structure of the data
          – Operations on the data
          – Constraints on the data
      • Megastore data model: relational model + scale
          – Limited relational model
          – Bigtable's scalability
      • High-level model vs. physical-level model
          – Physical level
              • Complicates application development
              • Bigtable's data model is at the physical level
          – High level
              • Lets programmers write code conveniently
              • Language, SQL
  • 17. Data Model
      • Schemaful
          – Strongly typed (primitives or PB)
          – Required, optional or repeated
          – All entities in a table have the same set of allowable properties.
          – Nested Protocol Buffers?
      • Primary key
          – Built from a sequence of properties
          – Must be unique within the table
      • An EG = a root entity + all entities in child tables that reference it
      [Figure: schema-related hierarchical data. Schemas (name) contain tables (name), tables contain entities (primary key), and entities contain properties (name, type). An EG root table carries the EG key; child tables carry a foreign key equal to the EG key. Example: a schema with root table User and child tables Photo and Book.]
  • 18. SQL-Like Schema Language (DDL)
        CREATE SCHEMA DemoApp;

        CREATE TABLE User {
          required int64 userId;
          required string name;
        } PRIMARY KEY(userId), ENTITY GROUP ROOT;

        CREATE TABLE Photo {
          required int64 userId;
          required int32 photoId;
          required int64 time;
          required string url;
          optional string thumbUrl;
          repeated string tag;
        } PRIMARY KEY(userId, photoId),
          IN TABLE User,
          ENTITY GROUP KEY(userId) REFERENCES User;

        CREATE LOCAL INDEX PhotosByTime
          ON Photo(userId, time);

        CREATE GLOBAL INDEX PhotosByTag
          ON Photo(tag) STORING (thumbUrl);

      Additional qualifiers: DESC|ASC|SCATTER

        CREATE TABLE Book {
          required int64 userId;
          required int32 bookId;
          required int64 time;
          required string url;
          repeated string tag;
        } PRIMARY KEY([DESC|ASC|SCATTER] userId,
                      [DESC|ASC|SCATTER] bookId),
          IN TABLE User,
          ENTITY GROUP KEY(userId) REFERENCES User;

        CREATE LOCAL INDEX BooksByTime
          ON Book([DESC|ASC|SCATTER] userId, [DESC|ASC] time);
  • 19. Data Placement in Bigtable (principles): pre-join with keys, for performance
      • Lets applications control the placement of hierarchical/related data, to minimize latency and maximize throughput
          – Storing data that is accessed together in nearby rows, or
          – Denormalized into the same row
      • The data for an EG are held in contiguous ranges of Bigtable rows, for
          – Low latency
          – High throughput
          – Cache efficiency
      • Pre-joining with keys
          – Primary keys cluster entities that will be read together.
          – Each entity maps into a single Bigtable row.
          – Primary key values are concatenated to form the Bigtable row key
          – Each remaining property occupies its own Bigtable column
          – The entity-group key is the prefix of the primary key (row key)
          – Keys are sorted ascending or descending
          – SCATTER (two-byte hash prefix) prevents hotspots in Bigtable
          – Recursive for arbitrary join depths (multiple levels of "IN TABLE")
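The key-concatenation rule above can be illustrated with a small sketch. It is not Megastore's actual encoding: the fixed-width integer encoding, the NUL-terminated strings, the use of MD5, and the choice to hash only the entity-group key for SCATTER are all assumptions made for illustration, and `encode_component` and `row_key` are hypothetical names.

```python
import hashlib

def encode_component(value) -> bytes:
    """Encode one key component as bytes that sort in a useful order."""
    if isinstance(value, int):
        # Assumption: non-negative integers, fixed 8-byte big-endian.
        return value.to_bytes(8, "big")
    # Assumption: strings are UTF-8 with a NUL terminator.
    return str(value).encode("utf-8") + b"\x00"

def row_key(primary_key, scatter=False) -> bytes:
    """Concatenate primary-key components; the entity-group key comes first."""
    body = b"".join(encode_component(c) for c in primary_key)
    if scatter:
        # Assumption: hash only the entity-group key, so that all rows of
        # one EG share the same two-byte prefix and stay contiguous.
        prefix = hashlib.md5(encode_component(primary_key[0])).digest()[:2]
        return prefix + body
    return body

# User 1, its two photos, and user 2 (hypothetical ids).
user_1 = row_key([1])
photo_1_1 = row_key([1, 1])
photo_1_2 = row_key([1, 2])
user_2 = row_key([2])

# Sorting by raw bytes keeps each EG in one contiguous range.
ordered = sorted([user_2, photo_1_2, user_1, photo_1_1])
```

Because the root row key is a byte prefix of every child row key, a plain byte-wise sort places the root entity first and its children directly after it, which is the "pre-join" effect the slide describes.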
  • 20. Data Placement in Bigtable (details): pre-join with keys, for performance
      • Bigtable row key = primary key of each table
      • Bigtable column name = <table name>.<property name>
          – Allows entities from different Megastore tables to be mapped into the same Bigtable row without collision.
      • The transaction and replication log and the metadata for the EG are stored in the root entity's Bigtable row.
          – Because Bigtable provides per-row transactions.
      • Indexes: each index entry is represented as a single Bigtable row
          – Bigtable row key = <indexed property values> + <primary key>
          – Bigtable cell columns: denormalized properties
  • 21. Data Placement in Bigtable (examples)
      Bigtable rows of the EG for U1 (column groups: transaction meta, User table, Photo table):
          Row key    Root.WAL             Root.meta                      User.name  Photo.time  Photo.url  Photo.thumbUrl  Photo.tag
          <U1>       Log3, Log2, Log1, …  commit offset, applied offset  Jack
          <U1,P1>                                                                   T1          URL1       TURL1           girl, car
          <U1,P2>                                                                   T2          URL2       TURL2           dress, girl
      Local index PhotosByTime (row keys):
          <U1,T1><U1,P1>
          <U1,T2><U1,P2>
      Global index PhotosByTag (STORING thumbUrl, denormalized):
          Row key          PhotosByTag.thumbUrl
          <car><U1,P1>     TURL1
          <dress><U1,P2>   TURL2
          <girl><U1,P1>    TURL1
          <girl><U1,P2>    TURL2
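As a rough illustration of the mapping rules and the example above, the following sketch builds the entity rows and index rows for user U1 in plain Python dictionaries. The helper names `entity_row` and `index_rows` are hypothetical, row keys are modelled as tuples rather than encoded bytes, and nothing here talks to a real Bigtable.

```python
def entity_row(table, primary_key, properties):
    """Map one entity to (row_key, columns); columns are named <table>.<property>."""
    columns = {f"{table}.{name}": value for name, value in properties.items()}
    return tuple(primary_key), columns

def index_rows(index_name, indexed_tuples, primary_key, stored=None):
    """One index row per indexed tuple: key = indexed values + primary key."""
    columns = {f"{index_name}.{k}": v for k, v in (stored or {}).items()}
    return [(tuple(t) + tuple(primary_key), dict(columns)) for t in indexed_tuples]

photos = {
    ("U1", "P1"): {"time": "T1", "url": "URL1", "thumbUrl": "TURL1",
                   "tag": ["girl", "car"]},
    ("U1", "P2"): {"time": "T2", "url": "URL2", "thumbUrl": "TURL2",
                   "tag": ["dress", "girl"]},
}

# Root entity row, then the child rows keyed by (userId, photoId).
bigtable = dict([entity_row("User", ("U1",), {"name": "Jack"})])
photos_by_time, photos_by_tag = [], []
for pk, props in photos.items():
    key, columns = entity_row("Photo", pk, props)
    bigtable[key] = columns
    # Local index on (userId, time).
    photos_by_time += index_rows("PhotosByTime", [(pk[0], props["time"])], pk)
    # Global index on the repeated property `tag`, STORING thumbUrl.
    photos_by_tag += index_rows("PhotosByTag", [(t,) for t in props["tag"]], pk,
                                stored={"thumbUrl": props["thumbUrl"]})

photos_by_tag.sort(key=lambda row: row[0])
```

A repeated property such as `tag` yields one index row per value, which is why the two photos produce four PhotosByTag rows.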
  • 22. Secondary Indexes
      • Secondary indexes can be declared on any list of entity properties (optional properties are allowed), including repeated properties and sub-fields within Protocol Buffers, and as full-text indexes.
      • Local indexes
          – Within an EG
          – Obey ACID semantics
              • The index entries are stored in the entity group and are updated atomically and consistently with the primary entity data.
      • Global indexes
          – Span EGs
          – Looser consistency (possibly eventual)
              • Not guaranteed to reflect all recent updates. (May be inconsistent with the primary data?)
      • Keeping global indexes consistent with the primary data is tricky!?
          – Special two-phase commit? and
          – Read-repair?
  • 23. Secondary Indexes and Denormalization
      • STORING clause for data copied into index entries
          – Avoids the indirect access to primary entities, which is a very expensive random access.
          – But keeping the copies consistent is an issue!
      • Inline indexes
          – Index entries from the source entities appear as a virtual repeated column in the target entry.
          – An inline index can be created on any table (child) that has a foreign key referencing another table (parent), by using the first primary key of the target entity as the first components of the index.
      Example (inline index as repeated columns in the parent row):
          CREATE INLINE INDEX PhotosByTime ON Photo(userId, time);
          Row key    User.name  PhotosByTime.T1  PhotosByTime.T2  Photo.time  Photo.thumbUrl
          <U1>       Jack       <P1>             <P2>                                          (User, parent table)
          <U1,P1>                                                 T1          TURL1            (Photo, child table)
          <U1,P2>                                                 T2          TURL2            (Photo, child table)
  • 24. Inline Indexes for Many-to-Many Relationships
      • Coupled with repeated indexes, inline indexes can also be used to implement many-to-many relationships more efficiently than by maintaining a many-to-many link table.
      Example (inline index on a repeated property, stored as repeated columns in the parent row):
          CREATE INLINE INDEX PhotosByTag ON Photo(userId, tag);
          Row key    User.name  PhotosByTag.car  PhotosByTag.dress  PhotosByTag.girl  Photo.time  Photo.thumbUrl
          <U1>       Jack       <P1>             <P2>               <P1> <P2>                                      (User, parent table)
          <U1,P1>                                                                     T1          TURL1            (Photo, child table)
          <U1,P2>                                                                     T2          TURL2            (Photo, child table)
          <U2>       Tom        <P1> (tag column not recoverable from the slide)
          <U2,P1>                                                                     T3          TURL3            (Photo, child table)
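To make the "virtual repeated column" idea concrete, here is a minimal sketch that derives the inline-index columns of each parent row from its child entities. The function name `inline_index_columns` is hypothetical, and the tag of U2's photo is an assumption, since the slide does not show which tag column it falls under.

```python
from collections import defaultdict

def inline_index_columns(index_name, children, indexed_property):
    """Return {parent_key: {"<index>.<value>": [child ids]}} for the parent rows."""
    by_parent = defaultdict(lambda: defaultdict(list))
    for (parent_key, child_id), props in children:
        values = props[indexed_property]
        if not isinstance(values, list):
            values = [values]          # single-valued property
        for value in values:           # repeated property: one column per value
            by_parent[parent_key][f"{index_name}.{value}"].append(child_id)
    return {parent: dict(cols) for parent, cols in by_parent.items()}

children = [
    (("U1", "P1"), {"time": "T1", "tag": ["girl", "car"]}),
    (("U1", "P2"), {"time": "T2", "tag": ["dress", "girl"]}),
    (("U2", "P1"), {"time": "T3", "tag": ["car"]}),   # tag assumed for illustration
]

by_time = inline_index_columns("PhotosByTime", children, "time")
by_tag = inline_index_columns("PhotosByTag", children, "tag")
```

With the index on `tag`, the parent row for U1 lists P1 and P2 under `PhotosByTag.girl`, so the user-to-tag-to-photo relationship can be read from one row without a separate link table.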
  • 25. API
      • Cost-transparent API
          – Matches application developers' intuitions
          – High-volume interactive workloads benefit more from predictable performance than from an expressive query language.
      • Normalized relational schemas rely on joins at query time to service user operations, which is not the right model for Megastore applications.
          – Pre-joins
          – Denormalization
      • SQL-like schema language (DDL, for data structures and data placement)
          – Fine-grained control over physical locality
              • Hierarchical layouts (pre-joins)
              • Declarative denormalization
          – Eliminates the need for most joins
      • Query API against particular tables and indexes
          – Range scans
          – Lookups
      • Schema changes require corresponding modifications to the query implementation code
  • 26. Query Joins
      • Query joins, when required, are implemented in application code.
      • Index-based joins
      • Merge joins
          – Multiple queries return primary keys for the same table, in the same order.
          – The application then intersects the returned keys.
      • Outer joins
          – An index lookup (returning a small result set)
          – Parallel index lookups using the results of the lookup above
      • Other joins …?
  • 27. Query Joins - Merge Joins
      • Query-1: SELECT * FROM Photo WHERE tag=girl
      • Query-2: SELECT * FROM Photo WHERE tag=car
      • Both queries use the global index PhotosByTag; their key lists are then combined by intersection (girl & car) or union (girl | car).
      • The result is just like:
          SELECT * FROM Photo WHERE tag=girl AND tag=car
        or
          SELECT * FROM Photo WHERE tag=girl OR tag=car
      • Strictly, a merge join is not a real join in SQL terms, but it does "join" the results of several queries.
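A minimal sketch of the application-side merge step, assuming each index scan returns primary keys for the same table in the same sorted order. The function names are hypothetical and the key lists are written out by hand instead of being read from a real PhotosByTag index.

```python
def merge_intersect(keys_a, keys_b):
    """Keys present in both sorted lists (tag=girl AND tag=car)."""
    out, i, j = [], 0, 0
    while i < len(keys_a) and j < len(keys_b):
        if keys_a[i] == keys_b[j]:
            out.append(keys_a[i])
            i += 1
            j += 1
        elif keys_a[i] < keys_b[j]:
            i += 1
        else:
            j += 1
    return out

def merge_union(keys_a, keys_b):
    """Keys present in either sorted list, without duplicates (tag=girl OR tag=car)."""
    out, i, j = [], 0, 0
    while i < len(keys_a) or j < len(keys_b):
        if j == len(keys_b) or (i < len(keys_a) and keys_a[i] < keys_b[j]):
            out.append(keys_a[i])
            i += 1
        elif i == len(keys_a) or keys_b[j] < keys_a[i]:
            out.append(keys_b[j])
            j += 1
        else:  # same key in both lists
            out.append(keys_a[i])
            i += 1
            j += 1
    return out

# Primary keys as they would come back from the two index scans.
girl = [("U1", "P1"), ("U1", "P2")]
car = [("U1", "P1")]

both = merge_intersect(girl, car)
either = merge_union(girl, car)
```

Both functions make a single pass over the two lists, so the cost is linear in the number of keys returned by the index scans.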
  • 28. Query Joins - Outer Joins
      • Query-1 (index lookup; suppose there is an index UsersByName):
          SELECT name, userId FROM User WHERE name=Jack
        Result: name=Jack, userId=U1,U2
      • Query-2 (parallel index lookups, one for each userId returned by Query-1):
          SELECT thumbUrl FROM Photo WHERE time>T1 AND time<T10;
      • The result is just like:
          SELECT User.name, User.userId, Photo.thumbUrl
          FROM User LEFT OUTER JOIN Photo
            ON Photo.userId=User.userId AND Photo.time>T1 AND Photo.time<T10
          WHERE User.name=Jack
      • Example of result:
          Jack, U1, TURL1
          Jack, U2, NULL
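The two-step outer join can be sketched in application code as follows. The in-memory dictionaries stand in for the assumed UsersByName index and a per-user time index, the numeric timestamps are made up, and the function names are hypothetical; only `concurrent.futures.ThreadPoolExecutor` is a real library API.

```python
from concurrent.futures import ThreadPoolExecutor

# Stand-ins for index contents (illustrative data only).
USERS_BY_NAME = {"Jack": ["U1", "U2"]}
PHOTOS_BY_TIME = {"U1": [(5, "TURL1"), (20, "TURL9")], "U2": []}

def lookup_users(name):
    """Query-1: index lookup that returns a small set of user ids."""
    return USERS_BY_NAME.get(name, [])

def lookup_photos(user_id, t_min, t_max):
    """Query-2: range scan within one user's entity group."""
    return [thumb for t, thumb in PHOTOS_BY_TIME.get(user_id, [])
            if t_min < t < t_max]

def left_outer_join(name, t_min, t_max):
    user_ids = lookup_users(name)
    # One lookup per user id, issued in parallel; map() keeps input order.
    with ThreadPoolExecutor(max_workers=4) as pool:
        per_user = list(pool.map(lambda u: lookup_photos(u, t_min, t_max),
                                 user_ids))
    rows = []
    for user_id, thumbs in zip(user_ids, per_user):
        if thumbs:
            rows.extend((name, user_id, thumb) for thumb in thumbs)
        else:
            rows.append((name, user_id, None))   # outer join keeps the user
    return rows

result = left_outer_join("Jack", 1, 10)
```

A user with no matching photos still produces one row with `None` in the photo column, which corresponds to the `Jack, U2, NULL` line in the slide's example result.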
  • 29. Transactions and Concurrency Control
      • An EG acts as a mini-database with serializable ACID transactions.
      • Transactions within an EG
          – A transaction writes its mutations into the EG's WAL, then the mutations are applied to the data.
          – Readers use the timestamp of the last fully applied transaction to avoid seeing partial updates.
      • MVCC: multi-version concurrency control (very important)
          – Uses Bigtable cell timestamps/versions
          – Readers and writers don't block each other, and reads are isolated from writes for the duration of a transaction. (How? See MVCC in Wikipedia)
      • Write patterns
          – A write transaction always begins with a current read to determine the next available log position. (This current read only ensures that all previously committed writes are applied.)
          – The commit operation gathers mutations into a log entry, assigns it a timestamp higher than any previous one, and appends it to the log (using Paxos to replicate across DCs).
          – The write operation can return to the client at any point after commit.
      • Read patterns
          – Current read
          – Snapshot read
          – Inconsistent read
      [Figure: a write op commits to the metadata and WAL of the EG root in Bigtable; a read op checks for and recovers committed logs there; the apply step, which may be asynchronous, writes table data and index data in Bigtable.]
  • 30. Transactions and Concurrency Control - Write
      [Figure: the state of the transaction system for an EG. Metadata: last committed position (ts2), last fully applied position (ts1). WAL (serializable transactions, one log entry each, each assigned a timestamp): Transaction-1 (committed, applied): Mutation-11-ts1, Mutation-12-ts1; Transaction-2 (committed but not fully applied): Mutation-21-ts2, Mutation-22-ts2; Transaction-3 (ongoing): writing but not committed, not yet gathered and appended into the log. Data (uses Bigtable timestamps for MVCC): Data-part1-ts1 and Data-part2-ts1 are committed and fully applied; Data-part1-ts2 is partially applied data.]
      • When a failure occurs in this state:
          – Transaction-1: very safe.
          – Transaction-2: safe, no data loss, but it must be recovered from the log into the data when doing a "current read" or "write" operation.
          – Transaction-3: not complete, failed. The application gets a failure return.
      • Note: the commit operation gathers mutations into a log entry and assigns a timestamp to it. A write transaction always begins by ensuring that all previously committed writes are applied (via a current read)!
  • 31. Transactions Read Patterns and Lifecycle
      • Current read
          – Only within an EG
          – When starting a current read, the transaction system first ensures that all previously committed writes are applied. (Just like the recovery of commit logs.)
          – Then the application reads at the timestamp of the latest committed transaction.
      • Snapshot read
          – Only within an EG
          – Picks up the timestamp of the last known fully applied transaction and reads from there.
          – Some committed transactions may not yet be applied.
      • Inconsistent read
          – Reads the latest values directly and may get partially applied data, for aggressive latency.
      • A complete transaction lifecycle
          – Read
              • Get the timestamp of the last committed transaction from the metadata.
          – Application logic
              • Read-modify-write.
          – Commit
              • Gathers mutations into a log entry and assigns it a higher timestamp.
              • Replicates across DCs via Paxos.
              • Can return to the client here.
          – (The following steps may be asynchronous.)
          – Apply
              • Write mutations into data and indexes.
          – Clean up
              • Delete fully applied logs.
  • 32. Transactions Read Patterns – Current Read
      Figure: The state of the transaction system for an EG, with a current reader
      – Metadata: last committed position (ts2); last fully applied position (ts1) -> (ts2)
      – WAL: Mutation-11-ts1, Mutation-12-ts1 (Transaction-1: committed, applied); Mutation-21-ts2, Mutation-22-ts2 (Transaction-2: committed but not fully applied); Transaction-3 is ongoing and not yet in the log
      – Data: Data-part1-ts1, Data-part2-ts1, Data-part1-ts2, Data-part2-ts2
      Steps of the current reader:
      – (1) Check the latest committed writes
      – (2) Apply previously committed writes
      – (3) Update metadata
      – (4) Read data at ts2
      The recovery write is done before the data is read.
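The four steps of the current read can be sketched as follows. This is an illustrative toy under stated assumptions: the entity group is a plain dictionary whose keys (`wal`, `data`, `last_committed`, `last_applied`) and cell values are invented for the example, and timestamps are small integers standing in for ts1 and ts2.

```python
def current_read(eg, key):
    """Follow the four steps of a current read on a toy entity group `eg`."""
    # (1) Check the latest committed position in the metadata.
    committed = eg["last_committed"]
    # (2) Apply committed log entries that have not fully reached the data yet.
    for ts in range(eg["last_applied"] + 1, committed + 1):
        for k, v in eg["wal"][ts].items():
            eg["data"][(k, ts)] = v
        # (3) Advance the last fully applied position in the metadata.
        eg["last_applied"] = ts
    # (4) Read the newest version of the key at or below the committed timestamp.
    versions = [ts for (k, ts) in eg["data"] if k == key and ts <= committed]
    return eg["data"][(key, max(versions))] if versions else None


# State from the figure: ts1 fully applied, ts2 committed but only part1 applied.
eg = {
    "wal": {1: {"part1": "p1@ts1", "part2": "p2@ts1"},
            2: {"part1": "p1@ts2", "part2": "p2@ts2"}},
    "data": {("part1", 1): "p1@ts1", ("part2", 1): "p2@ts1", ("part1", 2): "p1@ts2"},
    "last_committed": 2,
    "last_applied": 1,
}
result = current_read(eg, "part2")
```

Because the roll-forward in steps (2) and (3) runs before the read, `result` holds the ts2 version of `part2` even though that cell was missing from the data when the read started.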
  • 33. Transactions Read Patterns – Snapshot Read
      Figure: The state of the transaction system for an EG, with a snapshot reader
      – Metadata: last committed position (ts2); last fully applied position (ts1)
      – WAL: Mutation-11-ts1, Mutation-12-ts1 (Transaction-1: committed, applied); Mutation-21-ts2, Mutation-22-ts2 (Transaction-2: committed but not fully applied); Transaction-3 is ongoing and not yet in the log
      – Data: Data-part1-ts1, Data-part2-ts1 (committed and fully applied); Data-part1-ts2
      Steps of the snapshot reader:
      – (1) Get the last fully applied timestamp
      – (2) Read data at ts1
      This is the simplest read pattern.
  • 34. Transactions Read Patterns – Inconsistent Read
      Figure: The state of the transaction system for an EG, with an inconsistent reader
      – Metadata: last committed position (ts2); last fully applied position (ts1)
      – WAL: Mutation-11-ts1, Mutation-12-ts1 (Transaction-1: committed, applied); Mutation-21-ts2, Mutation-22-ts2 (Transaction-2: committed but not fully applied); Transaction-3 is ongoing and not yet in the log
      – Data: Data-part1-ts1, Data-part2-ts1 (committed and fully applied); Data-part1-ts2 (partially applied data)
      Steps of the inconsistent reader:
      – (1) Directly read the partial data
      The application must tolerate stale or partially applied data.
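The difference between a snapshot read and an inconsistent read shows up clearly on the state drawn in these figures. The sketch below is a toy under assumptions: the dictionary layout, the helper names `read_at`, `snapshot_read`, and `inconsistent_read`, and the cell values are all hypothetical.

```python
def read_at(data, key, ts):
    """MVCC read: newest version of `key` with timestamp <= ts, or None."""
    versions = [t for (k, t) in data if k == key and t <= ts]
    return data[(key, max(versions))] if versions else None


def snapshot_read(eg, keys):
    """Read every key at the last fully applied timestamp (consistent, maybe stale)."""
    ts = eg["last_applied"]
    return {k: read_at(eg["data"], k, ts) for k in keys}


def inconsistent_read(eg, keys):
    """Read the latest stored value of every key, ignoring the log state."""
    return {k: read_at(eg["data"], k, float("inf")) for k in keys}


# ts1 is fully applied; ts2 is committed but only part1 has been applied.
eg = {
    "data": {("part1", 1): "p1@ts1", ("part2", 1): "p2@ts1", ("part1", 2): "p1@ts2"},
    "last_committed": 2,
    "last_applied": 1,
}
snap = snapshot_read(eg, ["part1", "part2"])       # both values from ts1
mixed = inconsistent_read(eg, ["part1", "part2"])  # part1 from ts2, part2 from ts1
```

The snapshot read returns a consistent but older view, while the inconsistent read mixes versions from two transactions, which is the situation the application has to tolerate in exchange for lower latency.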
  • 35. Two-Phase-Commit: Expensive, Long Latency
  • 36. Replication for High Availability … I need to study Paxos further, so it is not covered in detail here.
  • 37. Replication
      • Within-DC
         – Across hosts
         – Built-in from Bigtable and GFS
      • Across-DC
         – … synchronous and consistent for each write
  • 38. Replication cross-DC
      • Traditional strategies (do not work)
         – Asynchronous Master/Slave
            • Asynchronously propagate
            • Master supports fast ACID transactions
            • Low latency
            • Data loss risk
            • Downtime for failover
            • Heavyweight master
            • Requires a mediator for mastership (e.g. ZooKeeper)
         – Synchronous Master/Slave
            • No data loss
            • Downtime for failover
            • Long latency
            • Heavyweight master
            • Requires a mediator for mastership (e.g. ZooKeeper)
         – Optimistic Replication
            • No distinguished master
            • Asynchronously propagate
            • Availability and latency are excellent
            • No mutation order, so transactions are impossible
            • Like Cassandra/Dynamo
      • EG-based: synchronously replicate each write
      • Use Paxos
         – No distinguished master
         – Replicate write-ahead-log
         – Synchronously replicating writes (each log append blocks on acknowledgments from a majority of replicas, and replicas in the minority catch up as they are able)
         – Any node can initiate writes and reads
         – Reasonable latency
         – Extensions
            • Allows local reads at any up-to-date replica
            • Permits single-roundtrip writes
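The "blocks on acknowledgments from a majority" rule can be illustrated with a very small sketch. It is not Paxos: proposal numbers, competing writers, and failure handling are all omitted, and the replica dictionaries with their `up` and `log` fields, as well as both function names, are assumptions made for the example.

```python
def replicate_log_append(entry, replicas):
    """Append `entry` to every reachable replica; succeed only on a majority of acks.

    `replicas` is a list of dicts with hypothetical fields "up" (bool) and "log" (list).
    """
    acks = 0
    for replica in replicas:
        if replica["up"]:
            replica["log"].append(entry)
            acks += 1
    return acks > len(replicas) // 2


def catch_up(replica, source):
    """Let a lagging replica copy the log entries it missed from an up-to-date peer."""
    replica["log"].extend(source["log"][len(replica["log"]):])
    replica["up"] = True


# Three replicas (for example, three datacenters); one is unreachable.
replicas = [{"up": True, "log": []}, {"up": True, "log": []}, {"up": False, "log": []}]
ok = replicate_log_append("mutation@ts1", replicas)   # 2 of 3 replicas acknowledge
catch_up(replicas[2], replicas[0])                    # the minority replica catches up later
```

With two of three replicas acknowledging, the append succeeds without waiting for the third, which is why the write latency stays reasonable while a single unreachable datacenter does not block writes.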
  • 39. Paxos
      • Traditional usages
         – Locking
         – Master election
         – Replication of metadata and configurations
      • How Megastore uses Paxos
         – Replicates primary user data across DCs on every write
         – For high availability across DCs
  • 40. To Study More …
  • 41. Valuable References
      • P. Helland. Life beyond distributed transactions: an apostate's opinion. In CIDR, pages 132-141, 2007.
         – The philosophy that inspired Megastore