www.twosigma.com
Smooth Storage
September 13, 2018
Proprietary and Confidential – Not for Redistribution
A storage system for managing structured time
series data at Two Sigma
Saurabh Goel
saurabh.goel@twosigma.com
Disclaimer
This document is being distributed for informational and educational purposes only and is not an offer to sell or the solicitation of an offer
to buy any securities or other instruments. The information contained herein is not intended to provide, and should not be relied upon
for, investment advice. The views expressed herein are not necessarily the views of Two Sigma Investments, LP or any of its affiliates
(collectively, “Two Sigma”). Such views reflect the assumptions of the author(s) of the document and are subject to change without
notice. The document may employ data derived from third-party sources. No representation is made by Two Sigma as to the accuracy of
such information and the use of such information in no way implies an endorsement of the source of such information or its validity.
The copyrights and/or trademarks in some of the images, logos or other material used herein may be owned by entities other than Two
Sigma. If so, such copyrights and/or trademarks are most likely owned by the entity that created the material and are used purely for
identification and comment as fair use under international copyright and/or trademark laws. Use of such image, copyright or trademark
does not imply any association with such organization (or endorsement of such organization) by Two Sigma, nor vice versa.
Outline
• Motivation and design emphasis
• Data Model and API
• Implementation of the data model
• System Architecture
• Looking Forward
Motivation
• Why have specialized storage for time series data?
  • Extremely common at Two Sigma
  • Time is one of the primary dimensions along which applications want to partition and
    filter data
  • Scale – in terms of both size and access
  • Optimizing for the target application workload and requirements
Smooth’s design emphasis
• Optimized for range queries and range updates executed in parallel per table
• File-system-like operations, but with database-like properties such as atomicity
and an isolation model for concurrent access
• Centrally managed service at Two Sigma (TS)
• Higher expectations around reliability, availability, and multi-tenancy
(security, access control, fair sharing of resources, etc.)
• Storage efficiency is also a major concern given the overall size of data stored
File system ------------------------------ Smooth --------------- Database
Target Application characteristics
• Parallel, time-partitioned jobs that move a lot of data
• Tend to be batch oriented; care more about throughput than latency
• New use cases are demanding better latency, smaller IO, more query power
• Not good for workloads that require very low latencies or issue large numbers
of small reads and writes
Data Model
• Tables with schema; mandatory time column
• Rows ordered and indexed by time
• Not relational – duplicate timestamps/rows allowed; no notion of a primary key (PK),
but users can enforce PK constraints in their applications
• Easy to update schema
• Can store wide sparse schemas efficiently
Write API
Updates a given time range atomically; the existing rows belonging to the range
are replaced by the given set of new rows
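// open a write session over the half-open time range [10, 42)
WriteSession s = write(table, [10, 42));
s.addRow(<10, ..>);
s.addRow(<15, ..>);
// repeated timestamp is ok
s.addRow(<15, ..>);
// error: rows must be added in non-decreasing time order (10 comes after 15)
s.addRow(<10, ..>);
// error: rows must lie within the given time range (50 is outside [10, 42))
s.addRow(<50, ..>);
// atomically replace the existing rows in [10, 42) with the rows added above
s.commit();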
Write API
• Set of write operations to a table forms a total order; internally each write
gets a unique, strictly monotonically increasing logical commit timestamp
• Distributed atomic writes are possible
• Delete is just a special case of update where no new rows are written
Read API
• Snapshot reads over a given time range
Iterator<Row> i = read(table, timeRange);
while (i.hasNext()) {
    doSomething(i.next());
}
• Rows returned are based on the latest committed view of the table at the
start of the read operation. The read remains isolated from concurrent writes.
Other Operations
• Some operations are not officially supported but are a natural fit for Smooth:
  • Distributed snapshot reads
  • Reads in the past, permanent snapshots
  • Atomic read-modify-write operations using optimistic concurrency control
    (OCC) on the commit time
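The read-modify-write idea can be illustrated with a toy in-memory model. This is only a sketch under stated assumptions, not Smooth's API: the class `ToyTable`, its methods `read`, `writeIfUnchanged` and `doubleValues`, and the single table-wide commit counter are all hypothetical names chosen for illustration. A reader remembers the commit time its snapshot reflects, and the write is accepted only if no other write has committed since.

```java
import java.util.ArrayList;
import java.util.List;

/** Toy in-memory stand-in for a Smooth table. Every name here is hypothetical. */
class ToyTable {
    /** A row is just a timestamp plus one value in this sketch. */
    static final class Row {
        final long time;
        final double value;
        Row(long time, double value) { this.time = time; this.value = value; }
    }

    /** Result of a snapshot read: the rows plus the commit time they reflect. */
    static final class Snapshot {
        final long commitTime;
        final List<Row> rows;
        Snapshot(long commitTime, List<Row> rows) { this.commitTime = commitTime; this.rows = rows; }
    }

    private List<Row> rows = new ArrayList<>(); // kept sorted by time
    private long commitTime = 0;                // logical commit timestamp of the latest write

    /** Snapshot read of the half-open range [start, end) at the latest committed view. */
    synchronized Snapshot read(long start, long end) {
        List<Row> out = new ArrayList<>();
        for (Row r : rows) {
            if (r.time >= start && r.time < end) out.add(r);
        }
        return new Snapshot(commitTime, out);
    }

    /**
     * Atomically replaces the rows in [start, end) with newRows, but only if no
     * write has committed since expectedCommitTime. Returns false on conflict.
     * The caller must supply newRows sorted by time and inside the range.
     */
    synchronized boolean writeIfUnchanged(long start, long end, List<Row> newRows, long expectedCommitTime) {
        if (commitTime != expectedCommitTime) return false; // another write committed first
        List<Row> next = new ArrayList<>();
        for (Row r : rows) if (r.time < start) next.add(r);
        next.addAll(newRows);
        for (Row r : rows) if (r.time >= end) next.add(r);
        rows = next;
        commitTime++; // strictly increasing commit timestamp
        return true;
    }

    /** Read-modify-write loop: doubles every value in [start, end); returns the number of attempts. */
    static int doubleValues(ToyTable t, long start, long end) {
        int attempts = 0;
        while (true) {
            attempts++;
            Snapshot s = t.read(start, end);
            List<Row> updated = new ArrayList<>();
            for (Row r : s.rows) updated.add(new Row(r.time, r.value * 2));
            if (t.writeIfUnchanged(start, end, updated, s.commitTime)) return attempts;
        }
    }
}
```

The sketch treats any intervening commit to the table as a conflict, even one on a disjoint time range. That is the simplest reading of "OCC on the commit time"; a real system could be more selective.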
Table Implementation
[Figure: two shards, Shard 1 and Shard 2, drawn against a time-column axis and a commit-time axis with commit times c1 and c2 marked; part of the time axis is labeled “overwritten time range”. Shards sit in the metadata layer; data files and their replicas sit in the data layer.]
• Data file: contains the new set of ordered rows; immutable and indexed; potentially replicated
• Shard: the internal representation of an update operation; semantically immutable
Read Algorithm
[Figure: Shards 1 to 4 drawn against a time-column axis and a commit-time axis, with the requested range (“Read this range”) and the start of the read marked.]
• Reads are implemented by concatenating together visible subranges of overlapping shards; we
call this the “read plan”
• The underlying data file per shard is ordered and indexed and can efficiently
select rows belonging to visible subranges
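One way to picture a read plan is the sketch below. It is an illustration under assumptions, not the actual Smooth algorithm: `ReadPlanner`, `Shard`, `Fragment` and `plan` are hypothetical names, shards are modeled as half-open time ranges with a commit time, and for every point in time the visible shard is taken to be the one with the highest commit time at or before the read's snapshot.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeSet;

/** Hypothetical read planner over shards; a sketch, not Smooth's implementation. */
class ReadPlanner {
    /** One committed update: it covers [start, end) and was committed at commitTime. */
    static final class Shard {
        final String id;
        final long start, end, commitTime;
        Shard(String id, long start, long end, long commitTime) {
            this.id = id; this.start = start; this.end = end; this.commitTime = commitTime;
        }
    }

    /** A visible subrange [start, end) that must be read from one shard's data file. */
    static final class Fragment {
        final String shardId;
        final long start, end;
        Fragment(String shardId, long start, long end) {
            this.shardId = shardId; this.start = start; this.end = end;
        }
        @Override public String toString() { return shardId + "[" + start + "," + end + ")"; }
    }

    /** Builds the time-ordered fragments for reading [qStart, qEnd) as of commit time snapshot. */
    static List<Fragment> plan(List<Shard> shards, long qStart, long qEnd, long snapshot) {
        List<Fragment> plan = new ArrayList<>();
        if (qStart >= qEnd) return plan;
        // Cut the query range at every boundary of a shard visible at the snapshot.
        TreeSet<Long> cuts = new TreeSet<>();
        cuts.add(qStart);
        cuts.add(qEnd);
        for (Shard s : shards) {
            if (s.commitTime > snapshot) continue; // committed after the read started
            if (s.start > qStart && s.start < qEnd) cuts.add(s.start);
            if (s.end > qStart && s.end < qEnd) cuts.add(s.end);
        }
        long prev = qStart;
        for (long cut : cuts) {
            if (cut == qStart) continue; // the first cut only opens the first interval
            // Among visible shards covering [prev, cut), the latest commit wins.
            Shard winner = null;
            for (Shard s : shards) {
                boolean visible = s.commitTime <= snapshot;
                boolean covers = s.start <= prev && s.end >= cut;
                if (visible && covers && (winner == null || s.commitTime > winner.commitTime)) {
                    winner = s;
                }
            }
            if (winner != null) {
                Fragment last = plan.isEmpty() ? null : plan.get(plan.size() - 1);
                if (last != null && last.shardId.equals(winner.id) && last.end == prev) {
                    // Extend the previous fragment instead of emitting a new one.
                    plan.set(plan.size() - 1, new Fragment(winner.id, last.start, cut));
                } else {
                    plan.add(new Fragment(winner.id, prev, cut));
                }
            }
            prev = cut;
        }
        return plan;
    }
}
```

Because a winning shard hides everything older in its range even when it holds no rows, a delete falls out of the same logic. The quadratic scan over shards keeps the sketch short; it says nothing about how the real system finds overlapping shards.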
Data File format
The underlying data file is indexed using a simple two-level static B+ tree
A data file has one index block and individually compressed data blocks laid out
contiguously
• A data block is the unit of read: variable-sized and compressed, typically a
small number of MBs; data blocks allow random access and parallelization
• We currently use LZ4 for most of the files; it has very low overhead but still
gives us about 2x compression on average; we have used gzip for some of the
cold data files
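A minimal sketch of how a two-level layout can serve a range read follows. It assumes, hypothetically, that the index block stores the first row timestamp of each data block; the slides do not say what the index holds, and `BlockIndex` and `blocksFor` are illustrative names. A binary search over the index picks the data blocks that could contain rows in the requested range, so only those blocks need to be fetched and decompressed.

```java
/** Hypothetical index block: the first row timestamp of each data block, in file order. */
class BlockIndex {
    private final long[] firstTimes; // non-decreasing, one entry per data block

    BlockIndex(long[] firstTimes) {
        this.firstTimes = firstTimes.clone();
    }

    /** Number of blocks whose first timestamp is strictly less than t (binary search). */
    private int countFirstTimesBelow(long t) {
        int lo = 0, hi = firstTimes.length;
        while (lo < hi) {
            int mid = (lo + hi) >>> 1;
            if (firstTimes[mid] < t) lo = mid + 1; else hi = mid;
        }
        return lo;
    }

    /**
     * Returns {firstBlock, lastBlock} (inclusive) of the blocks that may hold rows
     * with time in [start, end), or null if no block can.
     */
    int[] blocksFor(long start, long end) {
        if (start >= end || firstTimes.length == 0) return null;
        // Blocks starting at or after 'end' cannot contain rows in the range.
        int last = countFirstTimesBelow(end) - 1;
        if (last < 0) return null;
        // Duplicate timestamps may straddle a block boundary, so begin at the last
        // block that starts strictly before 'start' (or at block 0 if there is none).
        int first = Math.max(0, countFirstTimesBelow(start) - 1);
        return new int[] { first, last };
    }
}
```

The selection is deliberately conservative: a chosen block may turn out to hold no matching rows, but a block that holds matching rows is never skipped.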
Compaction
Problem: overwrites of random time ranges and small writes
• Excessive fragmentation of the read plan, which leads to slow reads and excessive
seeks on the backend data stores, reducing overall serving capacity
• Metadata bloat: small shards/files mean larger metadata on Smooth and the
object stores
• Garbage: data under hidden ranges can be garbage collected
Compaction Process
[Figure: Shards 1 to 4 drawn against a time-column axis and a commit-time axis, with a new compacted shard and the point where it is committed marked.]
• Shards replaced by the new compacted shard are deleted after the new shard is committed
• Underlying data files are not immediately deleted, to support ongoing reads
• Only contiguous fragments can be combined together!
Comparing with LSM
Similar to a log-structured merge (LSM) tree
• The Smooth implementation is log-structured
• Immutable shards with embedded B-trees are similar to “sstables”
• Both have compaction processes aimed at similar objectives
• They differ in details: each shard carries with it a “bulk delete” tombstone
whose handling is deferred until compaction time
• The read algorithm is different: no row-level comparison for the “next” operation
• Key-value stores can use similar ideas to optimize bulk deletes
Write Amplification
• Write amplification = actual bytes written to storage / bytes written by user
• Has not been an issue in practice: less than 10 on average
• If the write workload gets more challenging (i.e. a higher rate of small random
writes):
  • Use leveled compaction, similar to traditional key-value based LSM storage
    engines
  • This works by allowing non-contiguous shards to be combined; shards essentially
    get moved into data files
  • It would make our read algorithm more complex, since read plans from all
    levels need to be merged
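The write amplification ratio defined above can be worked through with a small helper. This is only an arithmetic sketch: `WriteAmplification` and `ratio` are hypothetical names, and it assumes that the bytes written to storage are the user's original write plus whatever each later compaction pass rewrites.

```java
/** Arithmetic sketch of the write amplification ratio; names are hypothetical. */
class WriteAmplification {
    /**
     * Write amplification = actual bytes written to storage / bytes written by user.
     * Storage bytes are modeled as the user's write plus the bytes rewritten by
     * each compaction pass.
     */
    static double ratio(long userBytes, long... compactionBytes) {
        if (userBytes <= 0) throw new IllegalArgumentException("userBytes must be positive");
        long total = userBytes;                    // the initial write of the user's data
        for (long b : compactionBytes) total += b; // each compaction pass rewrites data again
        return (double) total / userBytes;
    }
}
```

With made-up numbers, data written once and then fully rewritten by three compaction passes gives `ratio(1000L, 1000L, 1000L, 1000L)`, which is 4.0. Data that is never compacted gives 1.0.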
System Architecture
• All Smooth metadata is stored on Microsoft SQL Server, which is replicated
to backup servers in a remote data center
• Stateless metadata servers front the database, providing functions like
authorization, quota enforcement, and QoS (fair sharing of resources)
• Applications link with a Smooth client library in order to access Smooth
• Data files are stored in object stores
• Multiple different types of object stores can be plugged into Smooth and federated
together for scaling, replicated across for geo-redundancy/availability, or
used for storage tiering.
• Currently we use HDFS for warm data and CELFS for cold data; CELFS is an
internal archival file system at TS
Virtues of Immutability
• A design principle we have been using is immutability - both physical (write-
once data files) and semantic (shards)
• The combination of linear metadata (i.e. strictly increasing commit
timestamps) and immutable elements means that user reads and updates, the
shard compaction process, and physical data movement process can operate
in parallel with no interference and with minimal coordination
• Data files can be cached without worrying about consistency
This simple model has been central to keeping the system simple, robust and
scalable.
Some Statistics
• Multiple PBs of unique compressed data
• Read peaks in excess of 100 GB/s (before decompressing)
• 100s of millions of files/shards
• 10s of millions of tables
• 10s of thousands of concurrent requests
Looking Forward
• Multi-datacenter and public cloud read scaling
• CDN-like distributed caching layer that extends even to sites that don’t store
data
• Encryption at rest may be important for cloud use cases
• More cost-efficient multi-datacenter replication and cold data storage
• Data stores that use erasure coding
• More efficient data encoding and compression
• Data stores that can replicate data across data centers and support
desirable failover semantics
• Performance
• Performance consistency is a major concern - tail latencies are a major issue
with HDFS
• Issues with slow serialization and parsing of rows
• More challenging workloads
• Interactive workloads are becoming common – latency sensitive
• Column filtering
• Complex read queries
Complex queries
• Time series datasets commonly have multiple sub-series merged together
by time, like prices per stock ticker. The sub-series is typically identified by
another column. The cardinality of this column is generally in the 10k to 20k
range
• Example query: given an arbitrary subset of tickers and a time range, return all
matching rows ordered by time
• In reality each ticker has its own time range, and there are several variations
of this query
• Looking at new kinds of indexing
• Moving away from a “thick” Smooth client
• Enables quick iteration and bug fixes
• Multi-language support
• Opens up many architectural possibilities like caching, easier access control,
QoS, etc.
• Various other reliability, multi-tenancy, metadata scaling, security, and
operability improvements
Thank You!
Proprietary and Confidential – Not for Redistribution

Contenu connexe

Tendances

Apache Knox - Hadoop Security Swiss Army Knife
Apache Knox - Hadoop Security Swiss Army KnifeApache Knox - Hadoop Security Swiss Army Knife
Apache Knox - Hadoop Security Swiss Army Knife
DataWorks Summit
 

Tendances (20)

Server monitoring using grafana and prometheus
Server monitoring using grafana and prometheusServer monitoring using grafana and prometheus
Server monitoring using grafana and prometheus
 
Zuul @ Netflix SpringOne Platform
Zuul @ Netflix SpringOne PlatformZuul @ Netflix SpringOne Platform
Zuul @ Netflix SpringOne Platform
 
KNIME Software Overview
KNIME Software OverviewKNIME Software Overview
KNIME Software Overview
 
New Relic
New RelicNew Relic
New Relic
 
Prometheus + Grafana = Awesome Monitoring
Prometheus + Grafana = Awesome MonitoringPrometheus + Grafana = Awesome Monitoring
Prometheus + Grafana = Awesome Monitoring
 
Apache Knox - Hadoop Security Swiss Army Knife
Apache Knox - Hadoop Security Swiss Army KnifeApache Knox - Hadoop Security Swiss Army Knife
Apache Knox - Hadoop Security Swiss Army Knife
 
Elk stack
Elk stackElk stack
Elk stack
 
Principles of System Observability
Principles of System Observability Principles of System Observability
Principles of System Observability
 
VictoriaMetrics 2023 Roadmap
VictoriaMetrics 2023 RoadmapVictoriaMetrics 2023 Roadmap
VictoriaMetrics 2023 Roadmap
 
Grafana
GrafanaGrafana
Grafana
 
Secure your Application with Google cloud armor
Secure your Application with Google cloud armorSecure your Application with Google cloud armor
Secure your Application with Google cloud armor
 
Wazuh Security Platform
Wazuh Security PlatformWazuh Security Platform
Wazuh Security Platform
 
Graylog Engineering - Design Your Architecture
Graylog Engineering - Design Your ArchitectureGraylog Engineering - Design Your Architecture
Graylog Engineering - Design Your Architecture
 
2021/0/15 - Solarwinds supply chain attack: why we should take it sereously
2021/0/15 - Solarwinds supply chain attack: why we should take it sereously2021/0/15 - Solarwinds supply chain attack: why we should take it sereously
2021/0/15 - Solarwinds supply chain attack: why we should take it sereously
 
From AWS Data Pipeline to Airflow - managing data pipelines in Nielsen Market...
From AWS Data Pipeline to Airflow - managing data pipelines in Nielsen Market...From AWS Data Pipeline to Airflow - managing data pipelines in Nielsen Market...
From AWS Data Pipeline to Airflow - managing data pipelines in Nielsen Market...
 
5 Anti-Patterns in Api Design - buildstuff
5 Anti-Patterns in Api Design - buildstuff5 Anti-Patterns in Api Design - buildstuff
5 Anti-Patterns in Api Design - buildstuff
 
Graylog for open stack 3 steps to know why
Graylog for open stack    3 steps to know whyGraylog for open stack    3 steps to know why
Graylog for open stack 3 steps to know why
 
Infrastructure-as-Code with Pulumi - Better than all the others (like Ansible)?
Infrastructure-as-Code with Pulumi- Better than all the others (like Ansible)?Infrastructure-as-Code with Pulumi- Better than all the others (like Ansible)?
Infrastructure-as-Code with Pulumi - Better than all the others (like Ansible)?
 
Taking advantage of Prometheus relabeling
Taking advantage of Prometheus relabelingTaking advantage of Prometheus relabeling
Taking advantage of Prometheus relabeling
 
Prometheus with Grafana - AddWeb Solution
Prometheus with Grafana - AddWeb SolutionPrometheus with Grafana - AddWeb Solution
Prometheus with Grafana - AddWeb Solution
 

Similaire à Smooth Storage - A distributed storage system for managing structured time series data at Two Sigma

Similaire à Smooth Storage - A distributed storage system for managing structured time series data at Two Sigma (20)

Big data journey to the cloud rohit pujari 5.30.18
Big data journey to the cloud   rohit pujari 5.30.18Big data journey to the cloud   rohit pujari 5.30.18
Big data journey to the cloud rohit pujari 5.30.18
 
Make your data fly - Building data platform in AWS
Make your data fly - Building data platform in AWSMake your data fly - Building data platform in AWS
Make your data fly - Building data platform in AWS
 
Serverless Datalake Day with AWS
Serverless Datalake Day with AWSServerless Datalake Day with AWS
Serverless Datalake Day with AWS
 
Choosing the Right Database for My Workload: Purpose-Built Databases
Choosing the Right Database for My Workload: Purpose-Built Databases Choosing the Right Database for My Workload: Purpose-Built Databases
Choosing the Right Database for My Workload: Purpose-Built Databases
 
Amazon Aurora
Amazon AuroraAmazon Aurora
Amazon Aurora
 
Big Data@Scale
 Big Data@Scale Big Data@Scale
Big Data@Scale
 
ADV Slides: When and How Data Lakes Fit into a Modern Data Architecture
ADV Slides: When and How Data Lakes Fit into a Modern Data ArchitectureADV Slides: When and How Data Lakes Fit into a Modern Data Architecture
ADV Slides: When and How Data Lakes Fit into a Modern Data Architecture
 
Achieve data democracy in data lake with data integration
Achieve data democracy in data lake with data integration Achieve data democracy in data lake with data integration
Achieve data democracy in data lake with data integration
 
Big Data on Cloud Native Platform
Big Data on Cloud Native PlatformBig Data on Cloud Native Platform
Big Data on Cloud Native Platform
 
Big Data on Cloud Native Platform
Big Data on Cloud Native PlatformBig Data on Cloud Native Platform
Big Data on Cloud Native Platform
 
Big Data Analytics on the Cloud Oracle Applications AWS Redshift & Tableau
Big Data Analytics on the Cloud Oracle Applications AWS Redshift & TableauBig Data Analytics on the Cloud Oracle Applications AWS Redshift & Tableau
Big Data Analytics on the Cloud Oracle Applications AWS Redshift & Tableau
 
Using Data Platforms That Are Fit-For-Purpose
Using Data Platforms That Are Fit-For-PurposeUsing Data Platforms That Are Fit-For-Purpose
Using Data Platforms That Are Fit-For-Purpose
 
Key aspects of big data storage and its architecture
Key aspects of big data storage and its architectureKey aspects of big data storage and its architecture
Key aspects of big data storage and its architecture
 
Building the Enterprise Data Lake - Important Considerations Before You Jump In
Building the Enterprise Data Lake - Important Considerations Before You Jump InBuilding the Enterprise Data Lake - Important Considerations Before You Jump In
Building the Enterprise Data Lake - Important Considerations Before You Jump In
 
unit 1 big data.pptx
unit 1 big data.pptxunit 1 big data.pptx
unit 1 big data.pptx
 
Amazon Aurora: Database Week SF
Amazon Aurora: Database Week SFAmazon Aurora: Database Week SF
Amazon Aurora: Database Week SF
 
Cassandra Summit 2014: Internet of Complex Things Analytics with Apache Cassa...
Cassandra Summit 2014: Internet of Complex Things Analytics with Apache Cassa...Cassandra Summit 2014: Internet of Complex Things Analytics with Apache Cassa...
Cassandra Summit 2014: Internet of Complex Things Analytics with Apache Cassa...
 
BI & Analytics
BI & AnalyticsBI & Analytics
BI & Analytics
 
Effective Data Lakes: Challenges and Design Patterns (ANT316) - AWS re:Invent...
Effective Data Lakes: Challenges and Design Patterns (ANT316) - AWS re:Invent...Effective Data Lakes: Challenges and Design Patterns (ANT316) - AWS re:Invent...
Effective Data Lakes: Challenges and Design Patterns (ANT316) - AWS re:Invent...
 
ACDKOCHI19 - Next Generation Data Analytics Platform on AWS
ACDKOCHI19 - Next Generation Data Analytics Platform on AWSACDKOCHI19 - Next Generation Data Analytics Platform on AWS
ACDKOCHI19 - Next Generation Data Analytics Platform on AWS
 

Plus de Two Sigma

Waiter: An Open-Source Distributed Auto-Scaler
Waiter: An Open-Source Distributed Auto-ScalerWaiter: An Open-Source Distributed Auto-Scaler
Waiter: An Open-Source Distributed Auto-Scaler
Two Sigma
 
The Language of Compression - Leif Walsh
The Language of Compression - Leif WalshThe Language of Compression - Leif Walsh
The Language of Compression - Leif Walsh
Two Sigma
 

Plus de Two Sigma (20)

The State of Open Data on School Bullying
The State of Open Data on School BullyingThe State of Open Data on School Bullying
The State of Open Data on School Bullying
 
Halite @ Google Cloud Next 2018
Halite @ Google Cloud Next 2018Halite @ Google Cloud Next 2018
Halite @ Google Cloud Next 2018
 
Future of Pandas - Jeff Reback
Future of Pandas - Jeff RebackFuture of Pandas - Jeff Reback
Future of Pandas - Jeff Reback
 
BeakerX - Tiezheng Li
BeakerX - Tiezheng LiBeakerX - Tiezheng Li
BeakerX - Tiezheng Li
 
Engineering with Open Source - Hyonjee Joo
Engineering with Open Source - Hyonjee JooEngineering with Open Source - Hyonjee Joo
Engineering with Open Source - Hyonjee Joo
 
Bringing Linux back to the Server BIOS with LinuxBoot - Trammel Hudson
Bringing Linux back to the Server BIOS with LinuxBoot - Trammel HudsonBringing Linux back to the Server BIOS with LinuxBoot - Trammel Hudson
Bringing Linux back to the Server BIOS with LinuxBoot - Trammel Hudson
 
Waiter: An Open-Source Distributed Auto-Scaler
Waiter: An Open-Source Distributed Auto-ScalerWaiter: An Open-Source Distributed Auto-Scaler
Waiter: An Open-Source Distributed Auto-Scaler
 
Responsive and Scalable Real-time Data Analytics for SHPE 2017 - Cecilia Ye
Responsive and Scalable Real-time Data Analytics for SHPE 2017 - Cecilia YeResponsive and Scalable Real-time Data Analytics for SHPE 2017 - Cecilia Ye
Responsive and Scalable Real-time Data Analytics for SHPE 2017 - Cecilia Ye
 
Archival Storage at Two Sigma - Josh Leners
Archival Storage at Two Sigma - Josh LenersArchival Storage at Two Sigma - Josh Leners
Archival Storage at Two Sigma - Josh Leners
 
The Language of Compression - Leif Walsh
The Language of Compression - Leif WalshThe Language of Compression - Leif Walsh
The Language of Compression - Leif Walsh
 
Identifying Emergent Behaviors in Complex Systems - Jane Adams
Identifying Emergent Behaviors in Complex Systems - Jane AdamsIdentifying Emergent Behaviors in Complex Systems - Jane Adams
Identifying Emergent Behaviors in Complex Systems - Jane Adams
 
Algorithmic Data Science = Theory + Practice
Algorithmic Data Science = Theory + PracticeAlgorithmic Data Science = Theory + Practice
Algorithmic Data Science = Theory + Practice
 
HUOHUA: A Distributed Time Series Analysis Framework For Spark
HUOHUA: A Distributed Time Series Analysis Framework For SparkHUOHUA: A Distributed Time Series Analysis Framework For Spark
HUOHUA: A Distributed Time Series Analysis Framework For Spark
 
Improving Python and Spark Performance and Interoperability with Apache Arrow
Improving Python and Spark Performance and Interoperability with Apache ArrowImproving Python and Spark Performance and Interoperability with Apache Arrow
Improving Python and Spark Performance and Interoperability with Apache Arrow
 
TRIEST: Counting Local and Global Triangles in Fully-Dynamic Streams with Fix...
TRIEST: Counting Local and Global Triangles in Fully-Dynamic Streams with Fix...TRIEST: Counting Local and Global Triangles in Fully-Dynamic Streams with Fix...
TRIEST: Counting Local and Global Triangles in Fully-Dynamic Streams with Fix...
 
Exploring the Urban – Rural Incarceration Divide: Drivers of Local Jail Incar...
Exploring the Urban – Rural Incarceration Divide: Drivers of Local Jail Incar...Exploring the Urban – Rural Incarceration Divide: Drivers of Local Jail Incar...
Exploring the Urban – Rural Incarceration Divide: Drivers of Local Jail Incar...
 
Graph Summarization with Quality Guarantees
Graph Summarization with Quality GuaranteesGraph Summarization with Quality Guarantees
Graph Summarization with Quality Guarantees
 
Rademacher Averages: Theory and Practice
Rademacher Averages: Theory and PracticeRademacher Averages: Theory and Practice
Rademacher Averages: Theory and Practice
 
Credit-Implied Volatility
Credit-Implied VolatilityCredit-Implied Volatility
Credit-Implied Volatility
 
Principles of REST API Design
Principles of REST API DesignPrinciples of REST API Design
Principles of REST API Design
 

Dernier

Cara Menggugurkan Sperma Yang Masuk Rahim Biyar Tidak Hamil
Cara Menggugurkan Sperma Yang Masuk Rahim Biyar Tidak HamilCara Menggugurkan Sperma Yang Masuk Rahim Biyar Tidak Hamil
Cara Menggugurkan Sperma Yang Masuk Rahim Biyar Tidak Hamil
Cara Menggugurkan Kandungan 087776558899
 
Call Girls In Bangalore ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Bangalore ☎ 7737669865 🥵 Book Your One night StandCall Girls In Bangalore ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Bangalore ☎ 7737669865 🥵 Book Your One night Stand
amitlee9823
 
Call Girls in Netaji Nagar, Delhi 💯 Call Us 🔝9953056974 🔝 Escort Service
Call Girls in Netaji Nagar, Delhi 💯 Call Us 🔝9953056974 🔝 Escort ServiceCall Girls in Netaji Nagar, Delhi 💯 Call Us 🔝9953056974 🔝 Escort Service
Call Girls in Netaji Nagar, Delhi 💯 Call Us 🔝9953056974 🔝 Escort Service
9953056974 Low Rate Call Girls In Saket, Delhi NCR
 
Call Now ≽ 9953056974 ≼🔝 Call Girls In New Ashok Nagar ≼🔝 Delhi door step de...
Call Now ≽ 9953056974 ≼🔝 Call Girls In New Ashok Nagar  ≼🔝 Delhi door step de...Call Now ≽ 9953056974 ≼🔝 Call Girls In New Ashok Nagar  ≼🔝 Delhi door step de...
Call Now ≽ 9953056974 ≼🔝 Call Girls In New Ashok Nagar ≼🔝 Delhi door step de...
9953056974 Low Rate Call Girls In Saket, Delhi NCR
 

Dernier (20)

Cara Menggugurkan Sperma Yang Masuk Rahim Biyar Tidak Hamil
Cara Menggugurkan Sperma Yang Masuk Rahim Biyar Tidak HamilCara Menggugurkan Sperma Yang Masuk Rahim Biyar Tidak Hamil
Cara Menggugurkan Sperma Yang Masuk Rahim Biyar Tidak Hamil
 
Water Industry Process Automation & Control Monthly - April 2024
Water Industry Process Automation & Control Monthly - April 2024Water Industry Process Automation & Control Monthly - April 2024
Water Industry Process Automation & Control Monthly - April 2024
 
Call Girls In Bangalore ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Bangalore ☎ 7737669865 🥵 Book Your One night StandCall Girls In Bangalore ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Bangalore ☎ 7737669865 🥵 Book Your One night Stand
 
Call Girls Wakad Call Me 7737669865 Budget Friendly No Advance Booking
Call Girls Wakad Call Me 7737669865 Budget Friendly No Advance BookingCall Girls Wakad Call Me 7737669865 Budget Friendly No Advance Booking
Call Girls Wakad Call Me 7737669865 Budget Friendly No Advance Booking
 
Call Girls in Netaji Nagar, Delhi 💯 Call Us 🔝9953056974 🔝 Escort Service
Call Girls in Netaji Nagar, Delhi 💯 Call Us 🔝9953056974 🔝 Escort ServiceCall Girls in Netaji Nagar, Delhi 💯 Call Us 🔝9953056974 🔝 Escort Service
Call Girls in Netaji Nagar, Delhi 💯 Call Us 🔝9953056974 🔝 Escort Service
 
Hostel management system project report..pdf
Hostel management system project report..pdfHostel management system project report..pdf
Hostel management system project report..pdf
 
UNIT - IV - Air Compressors and its Performance
UNIT - IV - Air Compressors and its PerformanceUNIT - IV - Air Compressors and its Performance
UNIT - IV - Air Compressors and its Performance
 
Hazard Identification (HAZID) vs. Hazard and Operability (HAZOP): A Comparati...
Hazard Identification (HAZID) vs. Hazard and Operability (HAZOP): A Comparati...Hazard Identification (HAZID) vs. Hazard and Operability (HAZOP): A Comparati...
Hazard Identification (HAZID) vs. Hazard and Operability (HAZOP): A Comparati...
 
VIP Model Call Girls Kothrud ( Pune ) Call ON 8005736733 Starting From 5K to ...
VIP Model Call Girls Kothrud ( Pune ) Call ON 8005736733 Starting From 5K to ...VIP Model Call Girls Kothrud ( Pune ) Call ON 8005736733 Starting From 5K to ...
VIP Model Call Girls Kothrud ( Pune ) Call ON 8005736733 Starting From 5K to ...
 
ONLINE FOOD ORDER SYSTEM PROJECT REPORT.pdf
ONLINE FOOD ORDER SYSTEM PROJECT REPORT.pdfONLINE FOOD ORDER SYSTEM PROJECT REPORT.pdf
ONLINE FOOD ORDER SYSTEM PROJECT REPORT.pdf
 
Introduction to Serverless with AWS Lambda
Introduction to Serverless with AWS LambdaIntroduction to Serverless with AWS Lambda
Introduction to Serverless with AWS Lambda
 
Unit 1 - Soil Classification and Compaction.pdf
Unit 1 - Soil Classification and Compaction.pdfUnit 1 - Soil Classification and Compaction.pdf
Unit 1 - Soil Classification and Compaction.pdf
 
2016EF22_0 solar project report rooftop projects
2016EF22_0 solar project report rooftop projects2016EF22_0 solar project report rooftop projects
2016EF22_0 solar project report rooftop projects
 
Generative AI or GenAI technology based PPT
Generative AI or GenAI technology based PPTGenerative AI or GenAI technology based PPT
Generative AI or GenAI technology based PPT
 
chapter 5.pptx: drainage and irrigation engineering
chapter 5.pptx: drainage and irrigation engineeringchapter 5.pptx: drainage and irrigation engineering
chapter 5.pptx: drainage and irrigation engineering
 
Navigating Complexity: The Role of Trusted Partners and VIAS3D in Dassault Sy...
Navigating Complexity: The Role of Trusted Partners and VIAS3D in Dassault Sy...Navigating Complexity: The Role of Trusted Partners and VIAS3D in Dassault Sy...
Navigating Complexity: The Role of Trusted Partners and VIAS3D in Dassault Sy...
 
(INDIRA) Call Girl Bhosari Call Now 8617697112 Bhosari Escorts 24x7
(INDIRA) Call Girl Bhosari Call Now 8617697112 Bhosari Escorts 24x7(INDIRA) Call Girl Bhosari Call Now 8617697112 Bhosari Escorts 24x7
(INDIRA) Call Girl Bhosari Call Now 8617697112 Bhosari Escorts 24x7
 
Call Girls Pimpri Chinchwad Call Me 7737669865 Budget Friendly No Advance Boo...
Call Girls Pimpri Chinchwad Call Me 7737669865 Budget Friendly No Advance Boo...Call Girls Pimpri Chinchwad Call Me 7737669865 Budget Friendly No Advance Boo...
Call Girls Pimpri Chinchwad Call Me 7737669865 Budget Friendly No Advance Boo...
 
FEA Based Level 3 Assessment of Deformed Tanks with Fluid Induced Loads
FEA Based Level 3 Assessment of Deformed Tanks with Fluid Induced LoadsFEA Based Level 3 Assessment of Deformed Tanks with Fluid Induced Loads
FEA Based Level 3 Assessment of Deformed Tanks with Fluid Induced Loads
 
Call Now ≽ 9953056974 ≼🔝 Call Girls In New Ashok Nagar ≼🔝 Delhi door step de...
Call Now ≽ 9953056974 ≼🔝 Call Girls In New Ashok Nagar  ≼🔝 Delhi door step de...Call Now ≽ 9953056974 ≼🔝 Call Girls In New Ashok Nagar  ≼🔝 Delhi door step de...
Call Now ≽ 9953056974 ≼🔝 Call Girls In New Ashok Nagar ≼🔝 Delhi door step de...
 

Smooth Storage - A distributed storage system for managing structured time series data at Two Sigma

  • 1. www.twosigma.com Smooth Storage September 13, 2018Proprietary and Confidential – Not for Redistribution A storage system for managing structured time series data at Two Sigma Saurabh Goel saurabh.goel@twosigma.com
  • 2. Disclaimer This document is being distributed for informational and educational purposes only and is not an offer to sell or the solicitation of an offer to buy any securities or other instruments. The information contained herein is not intended to provide, and should not be relied upon for, investment advice. The views expressed herein are not necessarily the views of Two Sigma Investments, LP or any of its affiliates (collectively, “Two Sigma”). Such views reflect the assumptions of the author(s) of the document and are subject to change without notice. The document may employ data derived from third-party sources. No representation is made by Two Sigma as to the accuracy of such information and the use of such information in no way implies an endorsement of the source of such information or its validity. The copyrights and/or trademarks in some of the images, logos or other material used herein may be owned by entities other than Two Sigma. If so, such copyrights and/or trademarks are most likely owned by the entity that created the material and are used purely for identification and comment as fair use under international copyright and/or trademark laws. Use of such image, copyright or trademark does not imply any association with such organization (or endorsement of such organization) by Two Sigma, nor vice versa.
  • 3. Outline • Motivation and design emphasis • Data Model and API • Implementation of the data model • System Architecture • Looking Forward
  • 4. Motivation
      • Why have specialized storage for time series data?
          • Extremely common at Two Sigma
          • Time is one of the primary dimensions along which applications want to partition and filter data
          • Scale, in terms of both size and access
          • Optimizing for the target application workload and requirements
  • 5. Smooth’s design emphasis
      • Optimized for range queries and range updates executed in parallel per table
      • File-system-like operations, but with database-like properties such as atomicity and an isolation model for concurrent access
      • Centrally managed service at TS
      • Higher expectations around reliability, availability, and multi-tenancy (security, access control, fair sharing of resources, etc.)
      • Storage efficiency is also a major concern given the overall size of data stored
      • Diagram: a spectrum running from file system to database, with Smooth placed between the two
  • 6. Target Application characteristics
      • Parallel time-partitioned jobs that move a lot of data
      • Tend to be batch oriented; care more about throughput than latency
      • New use cases are demanding better latency, smaller IO, more query power
      • Not good for workloads that require very low latencies or issue large numbers of small reads and writes
  • 7. Outline • Motivation and design emphasis • Data Model and API • Implementation of the data model • System Architecture • Looking Forward
  • 8. Data Model
      • Tables with schema; mandatory time column
      • Rows ordered and indexed by time
      • Not relational: duplicate timestamps/rows are allowed; there is no notion of a primary key, but users can enforce PK constraints in their applications
      • Easy to update schema
      • Can store wide sparse schemas efficiently
  • 9. Write API
      Updates a given time range atomically; the existing rows belonging to the range are replaced by the given set of new rows.
          WriteSession s = write(table, [10, 42));
          s.addRow(<10, ..>);
          s.addRow(<15, ..>);
          // repeated timestamp is ok
          s.addRow(<15, ..>);
          // rows must be added in non-decreasing order, so this call is invalid
          s.addRow(<10, ..>);
          // rows must lie within the given time range, so this call is invalid
          s.addRow(<50, ..>);
          s.commit();
  • 10. Write API
      • The set of write operations to a table forms a total order; internally each write gets a unique, strictly monotonically increasing logical commit timestamp
      • Distributed atomic writes are possible
      • Delete is just a special case of update where no new rows are written
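The write semantics on slides 9 and 10 can be modelled in a few lines. The following is a minimal in-memory sketch, not the Smooth client API: `RangeWriteSketch`, `Session` and `Row` are hypothetical names, the table is a plain sorted list, and a simple counter stands in for the logical commit timestamp.

```java
import java.util.ArrayList;
import java.util.List;

/** Toy in-memory model of a range-replace write; not the Smooth client API. */
class RangeWriteSketch {
    /** A row reduced to its time column plus an opaque payload. */
    static final class Row {
        final long time;
        final String payload;
        Row(long time, String payload) { this.time = time; this.payload = payload; }
    }

    private final List<Row> rows = new ArrayList<>(); // always sorted by time
    private long lastCommit = 0;                      // logical commit timestamp

    /** Opens a session that will replace every row with start <= time < end. */
    Session write(long start, long end) {
        if (start >= end) throw new IllegalArgumentException("empty time range");
        return new Session(start, end);
    }

    synchronized List<Row> snapshot() { return List.copyOf(rows); }

    final class Session {
        private final long start, end;
        private final List<Row> pending = new ArrayList<>();

        private Session(long start, long end) { this.start = start; this.end = end; }

        void addRow(long time, String payload) {
            if (time < start || time >= end)
                throw new IllegalArgumentException("row lies outside the write range");
            if (!pending.isEmpty() && time < pending.get(pending.size() - 1).time)
                throw new IllegalArgumentException("rows must arrive in non-decreasing time order");
            pending.add(new Row(time, payload));
        }

        /** Swaps the old rows of [start, end) for the pending ones in one step. */
        long commit() {
            synchronized (RangeWriteSketch.this) {
                List<Row> next = new ArrayList<>();
                for (Row r : rows) if (r.time < start) next.add(r);
                next.addAll(pending);
                for (Row r : rows) if (r.time >= end) next.add(r);
                rows.clear();
                rows.addAll(next);
                return ++lastCommit; // strictly increasing, so writes form a total order
            }
        }
    }

    public static void main(String[] args) {
        RangeWriteSketch table = new RangeWriteSketch();
        Session s = table.write(10, 42);
        s.addRow(10, "a");
        s.addRow(15, "b");
        s.addRow(15, "c"); // repeated timestamp is fine
        System.out.println("commit " + s.commit());
        // A delete is a write session that commits without adding rows.
        System.out.println("commit " + table.write(15, 16).commit());
        System.out.println("rows left: " + table.snapshot().size());
    }
}
```

Nothing becomes visible until `commit()`, which is what makes the replacement atomic in this toy model; the real system presumably achieves the same effect through its metadata layer rather than a lock.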
  • 11. Read API
      • Snapshot reads over a given time range
          Iterator<Row> i = read(table, timeRange);
          while (i.hasNext()) {
              doSomething(i.next());
          }
      • Rows returned are based on the latest committed view of the table at the start of the read operation. The read remains isolated from concurrent writes.
  • 12. Other Operations
      • Some operations that are not officially supported but are a natural fit for Smooth:
      • Distributed snapshot reads
      • Reads in the past, permanent snapshots
      • Atomic read-modify-write operations using optimistic concurrency control (OCC) on the commit time
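The slide does not say how an OCC read-modify-write would be exposed, so the following is only a sketch of the general idea under assumed names (`OccSketch`, `commitIfUnchanged`, `readModifyWrite` are all hypothetical): read a snapshot together with its commit time, compute the new rows, and commit only if no other write has committed in between, retrying otherwise.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.UnaryOperator;

/** Toy model of optimistic concurrency control keyed on the commit time. */
class OccSketch {
    /** Immutable view of the table as of one commit. */
    static final class Snapshot {
        final long commitTime;
        final List<String> rows;
        Snapshot(long commitTime, List<String> rows) {
            this.commitTime = commitTime;
            this.rows = List.copyOf(rows);
        }
    }

    private Snapshot current = new Snapshot(0, List.of());

    synchronized Snapshot read() { return current; }

    /** Commits only if nothing else has committed since expectedCommitTime. */
    synchronized boolean commitIfUnchanged(long expectedCommitTime, List<String> newRows) {
        if (current.commitTime != expectedCommitTime) return false; // conflict detected
        current = new Snapshot(expectedCommitTime + 1, newRows);
        return true;
    }

    /** Retries the read, the update and the conditional commit until it succeeds. */
    long readModifyWrite(UnaryOperator<List<String>> update) {
        while (true) {
            Snapshot s = read();
            List<String> next = update.apply(s.rows);
            if (commitIfUnchanged(s.commitTime, next)) return s.commitTime + 1;
        }
    }

    public static void main(String[] args) {
        OccSketch table = new OccSketch();
        long commit = table.readModifyWrite(rows -> {
            List<String> next = new ArrayList<>(rows);
            next.add("row@10");
            return next;
        });
        System.out.println(commit + " " + table.read().rows);
    }
}
```

Because commit timestamps are strictly increasing, comparing a single number is enough to detect that a concurrent write slipped in.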
  • 13. Outline • Motivation and design emphasis • Data Model and API • Implementation of the data model • System Architecture • Looking Forward
  • 14. Table Implementation
      • Diagram: shards drawn against two axes, time column and commit time. Labels: Shard 1, Shard 2, overwritten time range, commit times c1 and c2, data file, replica, data layer, metadata layer.
      • A shard is the internal representation of an update operation; semantically immutable
      • A data file contains the new set of ordered rows; immutable and indexed; potentially replicated
  • 15. Read Algorithm
      • Diagram: Shards 1 to 4 drawn against the time column and commit time axes, with the range to read and the start of the read marked.
      • Reads are implemented by concatenating together visible subranges of overlapping shards; we call this the “read plan”
      • The underlying data file per shard is ordered and indexed and can efficiently select rows belonging to visible subranges
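One way to picture the read plan is as a sweep over shard boundaries. The sketch below is an illustration under assumptions, not Smooth's implementation: `ReadPlanSketch`, `Shard` and `Fragment` are hypothetical names, and it takes "visible" to mean the covering shard with the highest commit time at or before the snapshot. It cuts the queried range at every shard boundary and picks the visible shard in each piece.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeSet;

/** Toy read-plan builder: newest committed shard wins on every time subrange. */
class ReadPlanSketch {
    /** One committed update covering the time range [start, end). */
    static final class Shard {
        final String id;
        final long start, end, commitTime;
        Shard(String id, long start, long end, long commitTime) {
            this.id = id; this.start = start; this.end = end; this.commitTime = commitTime;
        }
    }

    /** A visible subrange [start, end) served by one shard. */
    static final class Fragment {
        final String shardId;
        final long start, end;
        Fragment(String shardId, long start, long end) {
            this.shardId = shardId; this.start = start; this.end = end;
        }
        @Override public String toString() { return shardId + "[" + start + "," + end + ")"; }
    }

    /** Plans a read of [qStart, qEnd) as of the given snapshot commit time. */
    static List<Fragment> plan(List<Shard> shards, long qStart, long qEnd, long snapshot) {
        List<Fragment> out = new ArrayList<>();
        if (qStart >= qEnd) return out;
        // Cut the query range at every boundary of a shard visible to this snapshot.
        TreeSet<Long> cutSet = new TreeSet<>();
        cutSet.add(qStart);
        cutSet.add(qEnd);
        for (Shard s : shards) {
            if (s.commitTime > snapshot) continue; // committed after the read started
            if (s.start > qStart && s.start < qEnd) cutSet.add(s.start);
            if (s.end > qStart && s.end < qEnd) cutSet.add(s.end);
        }
        List<Long> cuts = new ArrayList<>(cutSet);
        for (int i = 1; i < cuts.size(); i++) {
            long lo = cuts.get(i - 1), hi = cuts.get(i);
            Shard top = null; // newest shard covering [lo, hi)
            for (Shard s : shards) {
                boolean covers = s.start <= lo && hi <= s.end;
                if (covers && s.commitTime <= snapshot
                        && (top == null || s.commitTime > top.commitTime)) top = s;
            }
            if (top == null) continue; // nothing was ever written here
            Fragment last = out.isEmpty() ? null : out.get(out.size() - 1);
            if (last != null && last.shardId.equals(top.id) && last.end == lo) {
                out.set(out.size() - 1, new Fragment(top.id, last.start, hi)); // extend
            } else {
                out.add(new Fragment(top.id, lo, hi));
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<Shard> shards = List.of(
                new Shard("A", 0, 100, 1), new Shard("B", 20, 40, 2), new Shard("C", 30, 60, 3));
        System.out.println(plan(shards, 10, 70, 3));
    }
}
```

Passing an older snapshot value to `plan` gives a read in the past: shards committed later are simply ignored.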
  • 16. Data File format
      The underlying data file is indexed using a simple two-level static B+Tree
  • 17. Data File format
      A data file has one index block and individually compressed data blocks laid out contiguously
      • The data block is the unit of read; variable sized and compressed; typically a small number of MBs; allows random access and parallelization
      • We currently use lz4 for most of the files; it has very low overhead but still gives about 2x compression on average. We have used gzip for some of the cold data files
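The slides do not spell out what the index block stores. As a sketch only, assume it holds the first timestamp of each data block in file order (the names `BlockIndexSketch`, `blocksFor` and `firstTimes` are hypothetical). A range read then needs two binary searches to find which data blocks to fetch and decompress:

```java
/** Toy lookup over a one-block index; the index layout here is an assumption. */
class BlockIndexSketch {
    /**
     * firstTimes[i] is the first timestamp stored in data block i (non-decreasing).
     * Returns {firstBlock, endBlockExclusive}: the blocks that may hold rows with
     * start <= time < end.
     */
    static int[] blocksFor(long[] firstTimes, long start, long end) {
        if (start >= end) return new int[] {0, 0};
        // The block just before the first one starting at or after `start` can still
        // end with rows at `start`, because duplicate timestamps are allowed.
        int lo = Math.max(lowerBound(firstTimes, start) - 1, 0);
        int hi = lowerBound(firstTimes, end); // blocks starting at or after `end` are irrelevant
        return new int[] {lo, hi};
    }

    /** Index of the first element >= key, or a.length if there is none. */
    static int lowerBound(long[] a, long key) {
        int lo = 0, hi = a.length;
        while (lo < hi) {
            int mid = (lo + hi) >>> 1;
            if (a[mid] < key) lo = mid + 1; else hi = mid;
        }
        return lo;
    }

    public static void main(String[] args) {
        long[] firstTimes = {0, 10, 20};
        int[] r = blocksFor(firstTimes, 12, 15);
        System.out.println("read blocks [" + r[0] + ", " + r[1] + ")");
    }
}
```

Since each block is compressed on its own, the selected blocks can be fetched and decompressed independently, which is what allows the random access and parallel reads mentioned on the slide.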
  • 18. Compaction
      Problem: overwrites of random time ranges and small writes
      • Excessive fragmentation of the read plan; leads to slow reads and to excessive seeks on the backend data stores, reducing overall serving capacity
      • Metadata bloat; small shards/files mean larger metadata on Smooth and the object stores
      • Garbage; data under hidden ranges can be garbage collected
  • 19. Compaction Process
      • Diagram: Shards 1 to 4 drawn against the time column and commit time axes, with the new compacted shard and the commit time at which it is committed.
      • The input shards are deleted after the new shard is committed
      • Underlying data files are not immediately deleted, to support ongoing reads
      • Only contiguous fragments can be combined together!
  • 20. Comparing with LSM
      Similar to a Log Structured Merge (LSM) tree
      • The Smooth implementation is log structured
      • Immutable shards with embedded B-trees are similar to “sstables”
      • Both have compaction processes aimed at similar objectives
      • Differs in details: each shard carries with it a “bulk delete” tombstone whose handling is deferred until compaction time
      • The read algorithm is different: no row-level comparison for the “next” operation
      • Key-value stores can use similar ideas to optimize bulk deletes
  • 21. Write Amplification
      • Write amplification = actual bytes written to storage / bytes written by user
      • Has not been an issue in practice: less than 10 on average
      • If the write workload gets more challenging (i.e. a higher rate of small random writes):
      • Use leveled compaction similar to traditional key-value based LSM storage engines
      • By allowing non-contiguous shards to be combined; shards essentially get moved into data files
      • Would make our read algorithm more complex: need to merge read plans from all levels
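As a worked example of the ratio defined on this slide, with purely hypothetical numbers: suppose a user writes 1 GB, and compaction later rewrites that same data three times. The symbols B_storage and B_user are introduced here only for illustration.

```latex
\[
\mathrm{WA} \;=\; \frac{B_{\text{storage}}}{B_{\text{user}}}
\;=\; \frac{1\,\text{GB} + 3 \times 1\,\text{GB}}{1\,\text{GB}} \;=\; 4
\]
```

Every additional compaction pass over the same bytes adds one to the ratio, which is why a higher rate of small random writes pushes it up.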
  • 22. Outline • Motivation and design emphasis • Data Model and API • Implementation of the data model • System Architecture • Looking Forward
  • 23. System Architecture
  • 24. System Architecture
      • All Smooth metadata is stored on Microsoft SQL Server, which is replicated to backup servers in a remote data center
      • Stateless metadata servers front the database, providing functions like authorization, quota enforcement, and QoS (fair sharing of resources)
      • Applications link with a Smooth client library in order to access Smooth
  • 25. System Architecture
      • Data files are stored in object stores
      • Multiple different types of object stores can be plugged into Smooth and federated together for scaling, replicated across for geo-redundancy/availability, or used for storage tiering
      • Currently we use HDFS for warm data and CELFS for cold data; CELFS is an internal archival file system at TS
  • 26. Virtues of Immutability
      • A design principle we have been using is immutability, both physical (write-once data files) and semantic (shards)
      • The combination of linear metadata (i.e. strictly increasing commit timestamps) and immutable elements means that user reads and updates, the shard compaction process, and the physical data movement process can operate in parallel with no interference and with minimal coordination
      • Data files can be cached without worrying about consistency
      This simple model has been central to keeping the system simple, robust and scalable.
  • 27. Some Statistics
      • Multiple PBs of unique compressed data
      • Read peaks in excess of 100 GB/s (before decompressing)
      • 100s of millions of files/shards
      • 10s of millions of tables
      • 10s of thousands of concurrent requests
  • 28. Outline • Motivation and design emphasis • Data Model and API • Implementation of the data model • System Architecture • Looking Forward
  • 29. Looking Forward
      • Multi-datacenter and public cloud read scaling
      • CDN-like distributed caching layer that spans even to sites that don’t store data
      • Encryption at rest may be important for cloud use cases
      • More cost-efficient multi-DC replication and cold data storage
      • Data stores that use erasure coding
      • More efficient data encoding and compression
      • Data stores that can replicate data across data centers and support desirable failover semantics
  • 30. Looking Forward
      • Performance
      • Performance consistency is a major concern; tail latencies are a major issue with HDFS
      • Issues with slow serialization and parsing of rows
      • More challenging workloads
      • Interactive workloads are becoming common and are latency sensitive
      • Column filtering
      • Complex read queries
  • 31. Looking Forward
      Complex queries
      • It is common for time series datasets to have multiple sub-series merged together by time, like prices per stock ticker. The sub-series is typically identified by another column. The cardinality of this column is generally in the 10k to 20k range
      • Example query: given an arbitrary subset of tickers and a time range, return all matching rows ordered by time
      • In reality each ticker has its own time range, and there are several variations of this query
      • Looking at new kinds of indexing
  • 32. Looking Forward
      • Moving away from a “thick” Smooth client
      • Enables quick iteration and bug fixes
      • Multi-language support
      • Opens up many architectural possibilities like caching, easier access control, QoS, etc.
      • Various other reliability, multi-tenancy, metadata scaling, security and operability improvements
  • 33. Thank You!

Editor's notes

  1. A shard is semantically immutable, i.e. it always returns the same set of rows. The physical representation of the underlying data can change in format or storage location, or be replicated.
  2. Gets the read plan for the entire time range and finds areas with excessive fragmentation (many small fragments). Selects a contiguous segment of the read plan containing fragments to be fixed, and rewrites them as a single new shard. The commit time of the new shard is the max of the commit times of the participating input shards; this makes sure the compaction process does not interfere with ongoing writes. The underlying data files for the deleted shards are not immediately removed, so that references from read plans of ongoing reads remain valid.
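The selection step in note 2 can be sketched as follows. This is an illustration under assumptions rather than the actual compactor: `CompactionSketch`, `Fragment`, `Compacted` and the `smallBytes` threshold are hypothetical, and "excessive fragmentation" is reduced to "the longest run of adjacent fragments that are each below a size threshold".

```java
import java.util.List;

/** Toy compaction planner: merge the longest run of small, contiguous fragments. */
class CompactionSketch {
    /** A read-plan fragment: visible range [start, end) of a shard, with its size. */
    static final class Fragment {
        final long start, end, commitTime, bytes;
        Fragment(long start, long end, long commitTime, long bytes) {
            this.start = start; this.end = end; this.commitTime = commitTime; this.bytes = bytes;
        }
    }

    /** Description of the single shard that would replace the selected run. */
    static final class Compacted {
        final long start, end, commitTime;
        final int inputs;
        Compacted(long start, long end, long commitTime, int inputs) {
            this.start = start; this.end = end; this.commitTime = commitTime; this.inputs = inputs;
        }
    }

    /** Returns the shard to create, or null if fewer than two fragments qualify. */
    static Compacted pick(List<Fragment> plan, long smallBytes) {
        int bestFrom = 0, bestLen = 0, runFrom = 0, runLen = 0;
        for (int i = 0; i < plan.size(); i++) {
            Fragment f = plan.get(i);
            boolean small = f.bytes < smallBytes;
            boolean adjacent = runLen > 0 && plan.get(i - 1).end == f.start;
            if (small && adjacent) {
                runLen++;                 // extend the current contiguous run
            } else if (small) {
                runFrom = i;              // start a new run at this fragment
                runLen = 1;
            } else {
                runLen = 0;               // a large fragment breaks the run
            }
            if (runLen > bestLen) { bestFrom = runFrom; bestLen = runLen; }
        }
        if (bestLen < 2) return null;
        // The new shard takes the max commit time of its inputs, so any write that
        // committed later still has a higher commit time and stays visible over it.
        long commit = 0;
        for (int i = bestFrom; i < bestFrom + bestLen; i++) {
            commit = Math.max(commit, plan.get(i).commitTime);
        }
        return new Compacted(plan.get(bestFrom).start,
                plan.get(bestFrom + bestLen - 1).end, commit, bestLen);
    }

    public static void main(String[] args) {
        List<Fragment> plan = List.of(
                new Fragment(0, 10, 1, 1000), new Fragment(10, 12, 5, 10),
                new Fragment(12, 15, 3, 20), new Fragment(15, 16, 7, 5),
                new Fragment(16, 50, 2, 1000));
        Compacted c = pick(plan, 100);
        System.out.println("[" + c.start + "," + c.end + ") @" + c.commitTime + " from " + c.inputs);
    }
}
```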