This talk is about archival storage at Two Sigma. We begin by presenting CelFS, Two Sigma’s geo-distributed file system, which has been in deployment for over ten years. Although CelFS has scaled to serve tens of petabytes of data, it relies on physical partitioning to provide quality of service guarantees, has a high replication overhead, and cannot take advantage of outsourced cold storage (e.g., Amazon’s Glacier or Google’s Coldline). In the rest of the talk, we describe our response to these limitations in Jaks, a new storage system designed to reduce the TCO of CelFS and serve as the backend for other systems at Two Sigma. We also discuss how we hedge risk in changing such a foundational system.
2. For illustration purposes only. Not an offer to buy or sell securities. Two Sigma may modify its investment approach and portfolio parameters in the future in any manner that it believes is consistent with its fiduciary duty to its clients. There is no
guarantee that Two Sigma or its products will be successful in achieving any or all of their investment objectives. Moreover, all investments involve some degree of risk, not all of which will be successfully mitigated. Please see the last page of this
presentation for important disclosure information.
What is Two Sigma?
September 13, 2018
• Technology company applying a data science platform to investment management
• Follow the scientific method for finding investment strategies
• Over 2/3 technical staff; 72% non-financial
• 10,000 data sources
• 35 PB of data
• 95,000 CPUs; 1.7 PB of memory
Data at Two Sigma
[Figure: data feeds modeling and research, which produces trading tactics.]
• Market data: prices, order books, trades
• Analysis/news: Bloomberg, Thomson Reuters
• Other data: “We look beyond the obvious. So we can find connections that lead to the next great investment idea”
• Modeling/research: “If x, then y and z correlate”
• Trading tactic: “when x, buy y and sell z”
• $$$
This talk
• Celfs: evolution of an archival file store
• Jaks: a next generation backend
• What an academic has learned in industry
Celfs: the architecture
Celfs stores filesystem snapshots, or views.
Root servers name and locate views.
Data servers locate and store files.
[Figure: Celfs architecture across two sites (NYC and CHI), showing a metadata server, two root servers, and ten data servers.]
Celfs: the data model
[Figure: three daily client snapshots of /home/dir/ are stored as cels and exposed as dated views.]
Client directory contents:
• client 9/21, /home/dir/: File 1, File A
• client 9/22, /home/dir/: File 1, File 2, File A, File B
• client 9/23, /home/dir/: File 3, File C
Cels:
• Cel 1: File 1, File A
• Cel 2: File 2, File B
• Cel 3: File 3, File C
Views:
• 2017-09-21: File 1, File A
• 2017-09-22: File 1, File A, File 2, File B
• 2017-09-23: File 1, File A, File 2, File B, File 3, File C
• LATEST: File 1, File A, File 2, File B, File 3, File C
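The relationship between cels and views in this example can be sketched in a few lines of Python. This is only an illustration of the data model as the slide presents it: the `Cel` class and `build_views` function are hypothetical names, and the sketch assumes each dated view is the cumulative union of all cels published up to that date.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Cel:
    """An immutable batch of files published together (hypothetical structure)."""
    name: str
    files: frozenset


def build_views(dated_cels):
    """Map each date to the files visible at that date.

    dated_cels is a chronological list of (date_string, Cel) pairs.
    Assumption: a view is the union of every cel published so far.
    """
    views = {}
    visible = set()
    for day, cel in dated_cels:
        visible |= cel.files
        views[day] = sorted(visible)
    views["LATEST"] = sorted(visible)
    return views


cels = [
    ("2017-09-21", Cel("Cel 1", frozenset({"File 1", "File A"}))),
    ("2017-09-22", Cel("Cel 2", frozenset({"File 2", "File B"}))),
    ("2017-09-23", Cel("Cel 3", frozenset({"File 3", "File C"}))),
]
views = build_views(cels)
```

Because cels are never modified after they are published, an older view stays valid as new cels arrive; only LATEST changes.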
Celfs: the teleology
• Archival storage: root servers and data servers are multi-datacenter
• CDN: publish information in one datacenter to another with strong consistency guarantees
• High bandwidth data source: because cels are randomly distributed, a large view can often make use of the whole cluster’s bandwidth
Celfs drawback: storage TCO
Single unit of scaling: the data server
• Lots of data center real estate, power, cooling, etc.
• Data has three total copies (vulnerable to a small number of disk failures)
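To make the replication cost concrete, the arithmetic below compares raw bytes stored per logical byte for three full copies against an erasure-coded layout. The 10+4 layout is a hypothetical example chosen for illustration; the slides do not state which erasure-coding parameters Jaks uses.

```python
def replication_overhead(copies):
    """Raw bytes stored per logical byte with whole-object replication."""
    return float(copies)


def erasure_overhead(data_shards, parity_shards):
    """Raw bytes stored per logical byte with a data+parity erasure code."""
    return (data_shards + parity_shards) / data_shards


# Three total copies, as on the slide: 3.0x raw storage, survives 2 lost copies.
three_copies = replication_overhead(3)

# Hypothetical 10+4 layout: 1.4x raw storage, survives 4 lost shards.
coded = erasure_overhead(10, 4)
```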
Celfs drawback: performance isolation and scalability
[Figure: large-scale computations reading from a cluster of data servers.]
Fairness is based on per-user limits, so a single user can’t utilize the whole system.
Cluster-level isolation makes scaling trade-offs worse!
This talk
• Celfs: evolution of an archival file store
• Jaks: a next generation backend
• What an academic learned in industry
JAKS: Just another keystone for storage
Most simply:
put(Object) -> id
get(id) -> Object
delete(id) -> ok
Under the hood:
• Tiered storage
• End-to-end encryption
• Quality of service
…
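The three-call interface above can be mimicked with a toy in-memory store. This is a sketch of the interface shape only, not of Jaks itself: the class name, the use of random UUIDs as ids, and the in-memory dictionary are all assumptions made for illustration.

```python
import uuid


class InMemoryObjectStore:
    """Toy stand-in for the put/get/delete interface (hypothetical)."""

    def __init__(self):
        self._objects = {}

    def put(self, obj: bytes) -> str:
        # Assumption: the store, not the caller, chooses the id.
        object_id = uuid.uuid4().hex
        self._objects[object_id] = bytes(obj)
        return object_id

    def get(self, object_id: str) -> bytes:
        return self._objects[object_id]

    def delete(self, object_id: str) -> str:
        del self._objects[object_id]
        return "ok"


store = InMemoryObjectStore()
object_id = store.put(b"payload")
```

Since the store assigns the id on `put`, objects are naturally write-once: there is no call that overwrites an existing id.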
Storage tiers: where data lives
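[Figure: storage tier pyramid with axes bandwidth/speed and cost/GB. Bandwidth labels on the figure: 100s Gbps, 1000s Mbps, 10s Mbps.]
Tiers, from top to bottom:
• RAM
• SSD
• Erasure encoded disk arrays
• Offline storage (Glacier/Coldline/Tape)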
JAKS: implementing storage tiers
[Figure: Jaks architecture. Components shown: client, metadata gateway, metadata servers, data gateways, consistent metadata store, backing store, other sites.]
JAKS: implementing storage tiers
• Clients only talk to gateways in their site
• Freedom to change backing store and metadata store
• Data gateways are unit of scaling for bandwidth; their RAM/SSDs scale cache
• Clients load-balance across gateways to make full use of the cluster
  • Random for metadata
  • Consistent hash for data
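The two client-side load-balancing rules can be sketched as follows. The slides do not describe the actual hashing strategy (it is listed later as not covered), so this is a generic consistent-hash ring with virtual nodes; `HashRing`, `pick_metadata_gateway`, the gateway names, and the vnode count are all hypothetical.

```python
import bisect
import hashlib
import random


def _point(key: str) -> int:
    """Place a key on the ring using the first 8 bytes of its SHA-256 digest."""
    return int.from_bytes(hashlib.sha256(key.encode()).digest()[:8], "big")


class HashRing:
    """Generic consistent-hash ring with virtual nodes (illustrative only)."""

    def __init__(self, gateways, vnodes=64):
        self._ring = sorted(
            (_point(f"{gateway}#{i}"), gateway)
            for gateway in gateways
            for i in range(vnodes)
        )
        self._points = [point for point, _ in self._ring]

    def gateway_for(self, object_id: str) -> str:
        # First ring point clockwise from the object's position, with wraparound.
        index = bisect.bisect(self._points, _point(object_id)) % len(self._ring)
        return self._ring[index][1]


def pick_metadata_gateway(gateways, rng=random):
    """Metadata requests: any gateway will do, so pick one at random."""
    return rng.choice(gateways)


ring = HashRing(["dg1", "dg2", "dg3"])
smaller_ring = HashRing(["dg1", "dg2"])
```

Hashing data requests sends repeated reads of the same object to the same gateway, which is what lets that gateway's RAM/SSD cache be effective. Removing a gateway only remaps the objects that gateway owned.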
Caching in Jaks
• Data in Jaks can be cached with three policies
  • Pinned: data is guaranteed not to be evicted (regardless of use) until some future point in time
  • Long cycle: data is not evicted until it hasn’t been used for a few weeks
  • Short cycle: data is not evicted until it hasn’t been used for a few days
Measuring access time in Jaks
• Use two times
  • mtime (when a file was created)
  • atime (when a file was accessed)
• Can’t use filesystem “atime” because of SSD wear
• Use off-disk Bloom filters measuring daily access
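One way to track daily access without touching the disk is to keep one in-memory Bloom filter per day and look for the most recent day whose filter contains the object. The sketch below assumes that design; the filter size, hash count, and the `DailyAccessLog` class are hypothetical, and the slides do not describe how Jaks sizes or stores its filters.

```python
import hashlib
from datetime import date


class BloomFilter:
    """Small Bloom filter; num_bits must be a multiple of 8."""

    def __init__(self, num_bits=1 << 16, num_hashes=4):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits // 8)

    def _positions(self, key):
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{key}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, key):
        for pos in self._positions(key):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, key):
        return all(
            self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(key)
        )


class DailyAccessLog:
    """One Bloom filter per day, held in memory (illustrative only)."""

    def __init__(self):
        self._days = {}

    def record(self, day, object_id):
        self._days.setdefault(day, BloomFilter()).add(object_id)

    def last_access(self, object_id):
        """Most recent day whose filter reports the object, or None."""
        for day in sorted(self._days, reverse=True):
            if object_id in self._days[day]:
                return day
        return None


log = DailyAccessLog()
log.record(date(2018, 10, 1), "object-27")
log.record(date(2018, 10, 9), "object-27")
```

A Bloom filter can report a false positive but never a false negative, so an error can only make an object look more recently used than it was. For a cache that means keeping data slightly too long rather than evicting it too early.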
Cache eviction in detail (today is Oct 13)
[Figure: cache entries labeled Dec 5, Oct 12, Oct 7, Oct 1, Oct 13, and Oct 9, grouped under Long Cycle and Short Cycle.]
Periodically:
1. Evict aged out entries
2. Check space, evict random if full
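The two periodic steps can be sketched as a single pass over the cache index. Several details here are assumptions rather than facts from the slides: the thresholds of 21 and 3 days stand in for "a few weeks" and "a few days", capacity is counted in entries rather than bytes, pinned entries are treated as aged out once their pin date passes, and random eviction skips entries that are still pinned.

```python
import random
from datetime import date, timedelta

# Hypothetical thresholds; the slides only say "a few weeks" and "a few days".
MAX_IDLE = {"long": timedelta(days=21), "short": timedelta(days=3)}


def eviction_pass(entries, today, capacity, rng=random):
    """Run one eviction pass and return the evicted ids.

    entries maps object id -> (policy, stamp). For "pinned" the stamp is the
    date the pin expires; for "long" and "short" it is the last access date.
    The entries dict is modified in place.
    """
    evicted = []

    # Step 1: evict aged-out entries.
    for object_id, (policy, stamp) in list(entries.items()):
        if policy == "pinned":
            aged_out = today > stamp
        else:
            aged_out = today - stamp > MAX_IDLE[policy]
        if aged_out:
            evicted.append(object_id)
            del entries[object_id]

    # Step 2: if still over capacity, evict random unpinned entries.
    candidates = [oid for oid, (policy, _) in entries.items() if policy != "pinned"]
    rng.shuffle(candidates)
    while len(entries) > capacity and candidates:
        object_id = candidates.pop()
        evicted.append(object_id)
        del entries[object_id]

    return evicted


cache = {
    "a": ("pinned", date(2018, 12, 5)),
    "b": ("long", date(2018, 10, 1)),
    "c": ("short", date(2018, 10, 9)),
    "d": ("short", date(2018, 10, 13)),
}
first_pass = eviction_pass(cache, date(2018, 10, 13), capacity=10)
```

With these illustrative thresholds, only the short-cycle entry last used on Oct 9 ages out on Oct 13.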
End-to-end encryption
[Figure: the Jaks architecture (client, metadata gateway, metadata servers, data gateways, consistent metadata store, backing store) annotated with an encrypted request flow. Labels on the figure: get(27), secret, secret, PUT hash(data), 200 OK.]
End-to-end encryption details
• Use an authenticated encryption scheme (AES-OCB)
• Derive backing store names from the object’s secret
• End-to-end check is powerful!!
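Deriving a backing store name from an object's secret can be illustrated with a keyed hash. The slides do not say which derivation Jaks uses, so HMAC-SHA256 with fixed labels is an assumption, as are the function names. The AES-OCB encryption step is omitted because it needs a third-party library.

```python
import hashlib
import hmac
import secrets


def new_object_secret() -> bytes:
    """Fresh random per-object secret (hypothetical: 32 bytes)."""
    return secrets.token_bytes(32)


def backing_store_name(object_secret: bytes) -> str:
    """Opaque name for the backing store, derived from the secret.

    Assumption: HMAC-SHA256 keyed by the secret over a fixed label.
    """
    return hmac.new(object_secret, b"backing-store-name", hashlib.sha256).hexdigest()


def encryption_key(object_secret: bytes) -> bytes:
    """Separate key for the authenticated cipher, using a different label."""
    return hmac.new(object_secret, b"encryption-key", hashlib.sha256).digest()


demo_secret = bytes(range(32))
demo_name = backing_store_name(demo_secret)
```

With a derivation like this, someone who can list the backing store sees only opaque names and ciphertext. Recovering either the name or the plaintext for a given object requires the secret.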
Performance isolation and bursty workflows
[Figure: large-scale computations reading from data servers.]
Requirements:
• Allow a user to take advantage of the whole system if it is idle
• Prevent oversubscription from degrading service below SLA
Quality of service: Admission controllers
• Need to limit bandwidth resources
  • Inbound/outbound traffic per network interface
  • Inbound/outbound traffic per backend
• Need to limit fixed resources
  • Database connections (in Metadata servers)
  • Staging space (for uncached writes/reads)
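The two kinds of limits call for two kinds of controller: a counter for fixed resources and a rate limiter for bandwidth. The sketch below uses a semaphore and a token bucket as generic stand-ins; the class names and parameters are hypothetical, and the slides do not say which mechanisms Jaks uses.

```python
import threading
import time


class FixedResourceLimiter:
    """Caps a countable resource such as database connections."""

    def __init__(self, limit):
        self._slots = threading.BoundedSemaphore(limit)

    def try_admit(self) -> bool:
        return self._slots.acquire(blocking=False)

    def release(self):
        self._slots.release()


class TokenBucket:
    """Caps a rate such as bytes per second on one interface or backend."""

    def __init__(self, rate_bytes_per_s, burst_bytes, clock=time.monotonic):
        self.rate = rate_bytes_per_s
        self.capacity = burst_bytes
        self._tokens = float(burst_bytes)
        self._clock = clock
        self._last = clock()

    def try_admit(self, nbytes) -> bool:
        # Refill according to elapsed time, up to the burst capacity.
        now = self._clock()
        self._tokens = min(self.capacity, self._tokens + (now - self._last) * self.rate)
        self._last = now
        if nbytes <= self._tokens:
            self._tokens -= nbytes
            return True
        return False


connections = FixedResourceLimiter(limit=1)
fake_now = [0.0]
bucket = TokenBucket(rate_bytes_per_s=100, burst_bytes=100, clock=lambda: fake_now[0])
```

A request would be admitted only if every controller on its path agrees, for example one connection slot plus enough tokens on both the inbound interface and the backend.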
Quality of Service: queuing and allocation
[Figure: per-user queues (Rachel, Barry, Tom, Tina, Beth, Ralph, Randy) feed three workload classes: background work, research workflow, and trading daemons.]
Priority classes:
• Highest priority: guarantee 30%, gets 10% of excess
• Medium priority: guarantee 60%, gets 40% of excess
• Lowest priority: guarantee 10%, gets 50% of excess
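One plausible reading of the guarantee and excess percentages is a two-step allocation. First each class receives up to its guaranteed share. Then any capacity left unused is split among classes that still want more, in proportion to their excess percentages. The function below implements that reading; it is an interpretation of the slide, not a description of the real scheduler, and the `allocate` name and demand figures are hypothetical.

```python
def allocate(capacity, classes, demand):
    """Split capacity among priority classes.

    classes maps name -> (guarantee_fraction, excess_fraction).
    demand maps name -> units requested.
    """
    # Step 1: each class gets its guarantee, capped at what it asked for.
    grant = {
        name: min(demand.get(name, 0.0), guarantee * capacity)
        for name, (guarantee, _) in classes.items()
    }
    spare = capacity - sum(grant.values())

    # Step 2: share spare capacity among classes with unmet demand,
    # weighted by excess fraction; repeat while some class gets capped.
    while spare > 1e-9:
        hungry = [n for n in classes if demand.get(n, 0.0) - grant[n] > 1e-9]
        total_weight = sum(classes[n][1] for n in hungry)
        if not hungry or total_weight <= 0:
            break
        pool = spare
        for name in hungry:
            extra = min(pool * classes[name][1] / total_weight,
                        demand[name] - grant[name])
            grant[name] += extra
            spare -= extra
    return grant


CLASSES = {
    "highest": (0.30, 0.10),
    "medium": (0.60, 0.40),
    "lowest": (0.10, 0.50),
}
busy = allocate(100, CLASSES, {"highest": 10, "medium": 100, "lowest": 100})
idle = allocate(100, CLASSES, {"lowest": 100})
```

In the `busy` case the highest class uses only 10 of its 30 guaranteed units, so 20 units are shared 4:5 between medium and lowest. In the `idle` case the lowest class is the only one asking and receives the whole system.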
Quality of Service: flow control
• How to allocate resources like network bandwidth?
  • Undersubscribe the OS: sub-optimal utilization
  • Oversubscribe the OS: less control over allocation
• Need performance feedback to determine how much flow to allocate
• How can we measure TCP performance from the user level?
Measuring TCP performance from user space
Server:
Send 54 KB
Wait 27 us
Send 54 KB
…
Case 1: client can receive at maximum allowed rate.
- Send buffer never fills up
Case 2: client can’t receive at maximum allowed rate.
- Send buffer fills up
Gotchas:
- This feedback only works when RTT is low
- Feedback only effective if transfers are long
- Still need to account for duty cycle on backend
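The send-and-wait probe can be approximated with a non-blocking socket: offer one chunk per interval and count how often the kernel send buffer cannot take all of it. This is a sketch of the idea only. `paced_send` is a hypothetical helper, a local socket pair stands in for a TCP connection, and Python cannot pace at 27 µs granularity, so the interval here is illustrative.

```python
import socket
import time


def paced_send(sock, chunk_size, interval_s, rounds):
    """Offer one chunk per interval on a non-blocking socket.

    Returns the number of rounds in which the send buffer could not accept
    the whole chunk. A real transfer would track the unsent remainder and
    retry it; this probe only counts the stalls.
    """
    sock.setblocking(False)
    chunk = b"\0" * chunk_size
    full_rounds = 0
    for _ in range(rounds):
        try:
            sent = sock.send(chunk)
        except BlockingIOError:
            sent = 0  # buffer completely full
        if sent < chunk_size:
            full_rounds += 1
        time.sleep(interval_s)
    return full_rounds


# Case 2 from the slide: the receiver never reads, so the buffer fills up.
sender, receiver = socket.socketpair()
stalls = paced_send(sender, 54 * 1024, 0.0, 200)
sender.close()
receiver.close()
```

A server could feed the stall count back into its allocator: no stalls suggests the client keeps up with its allotted rate, while frequent stalls suggest the allocation is higher than the client can use.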
Quality of Service: backpressure
[Figure: a client (Ralph) sends requests to a data gateway; each response carries backlog info.]
Backlog at the server is communicated on every response.
Clients use the backlog to rate limit.
Rejections (queue too full) lead to exponential backoff.
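The client side of this scheme can be sketched as a small pacing helper: slow down in proportion to the reported backlog, and back off exponentially after a rejection. The class name, the linear delay rule, the jitter, and every constant below are assumptions for illustration; the slides do not give the actual policy.

```python
import random


class BackpressureClient:
    """Chooses how long to wait before the next request (illustrative)."""

    def __init__(self, per_item_delay_s=0.001, base_delay_s=0.01, max_delay_s=5.0):
        self.per_item_delay_s = per_item_delay_s
        self.base_delay_s = base_delay_s
        self.max_delay_s = max_delay_s
        self._rejections = 0

    def on_response(self, backlog):
        """Request served; the response carried the server's backlog."""
        self._rejections = 0
        return min(self.max_delay_s, backlog * self.per_item_delay_s)

    def on_rejection(self, rng=random):
        """Queue too full: double the ceiling each time, with jitter."""
        self._rejections += 1
        ceiling = min(self.max_delay_s, self.base_delay_s * 2 ** self._rejections)
        return rng.uniform(ceiling / 2, ceiling)


pacer = BackpressureClient()
delay_when_idle = pacer.on_response(backlog=0)
delay_when_busy = pacer.on_response(backlog=100)
```

Reporting the backlog on every response lets clients slow down before the queue overflows, so rejections and the backoff they trigger should be the exception.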
JAKS: Just another keystone for storage
Most simply:
put(Object) -> id
get(id) -> Object
delete(id) -> ok
Under the hood:
• End-to-end encryption
• Tiered storage (cached, normal, cold)
• Quality of service
…
Not covered:
- slow clients
- high-availability restarts
- fault-tolerance
- consistent hashing strategy
- geographic replication
This talk
• Celfs: evolution of an archival file store
• Jaks: a next generation backend
• What an academic learned in industry
What an academic learned: measurement
Grad school: building measurement framework
• Need to test hypotheses
• Need to get graphs into the paper!
Industry: building measurement framework
• Need to validate changes and measure impact (aka “test hypotheses”)
• Need to understand performance
• Need to detect and anticipate problems
What an academic learned: hedging risk
Celfs is stable, important, and highly integrated
• can’t expect people to jump ship voluntarily
Need extensive exposure to find bugs and gain confidence
• Jaks development started in January 2016; end-to-end deployment in March 2016
• Finally made GA this month (still have a Celfs safety net)
What an academic learned: compatibility
Academic: thick clients allow more sophisticated fault-tolerance and scaling
Industry: thick clients allow more sophisticated bugs to persist
What an academic learned: build vs. buy decisions
Celfs — it’s 2006 and Hadoop is just being born from Apache Nutch
Jaks
• We want to avoid lock-in
• Geo-redundancy not a common ask for vendors
• We need performance isolation
Ultimately, we took a hybrid approach: building gateways
What an academic learned: unexpected failures
Jaks is designed to tolerate faults in gateways, backend stores, and other sites
• Failure handling is most important part of integration testing
Hard to predict all failure scenarios (Byzantine Fault Tolerance won’t help!)
• Firewall configuration creates partition to certain hosts
• MTU settings disable Kerberos negotiation
• Misuse of Kerberos library causes authentication failures
• Stale network info misdirects clients
Backup slides
Gateway caching performance as a function of clients reading 100 MB
[Figure: throughput in MBps (axis 0 to 14,000) versus number of clients (1, 10, 20, 40, 80), with one series each for 0%, 50%, 75%, 90%, and 100% hot data.]
Small read performance (64 KB)
[Figure: IOPS (axis 0 to 50,000) versus number of clients (0 to 160), with one series each for 0%, 50%, 75%, 90%, and 100% hot data.]