Python and MongoDB as a Market Data Platform
Scalable storage of time series data
2014
Opinions expressed are those of the author and may not be shared by all personnel of Man Group plc
(‘Man’). These opinions are subject to change without notice, and are for information purposes only and do not
constitute an offer or invitation to make an investment in any financial instrument or in any product to which any
member of Man’s group of companies provides investment advisory or any other services. Any forward-looking
statements speak only as of the date on which they are made and are subject to risks and uncertainties that may
cause actual results to differ materially from those contained in the statements. Unless stated otherwise this
information is communicated by Man Investments Limited and AHL Partners LLP which are both authorised and
regulated in the UK by the Financial Conduct Authority.
2
Legalese…
3
The Problem
Financial data comes in different sizes…
• ~1MB: 1x a day price data
• ~1GB x 1000s: 9,000 x 9,000 data matrices
• ~40GB: 1-minute data
• ~30TB: Tick data
• > even larger data sets (options, …)
… and different shapes
• Time series of prices
• Event data
• News data
• What’s next?
4
Overview – Data shapes
Quant researchers
• Interactive work – latency sensitive
• Batch jobs run on a cluster – maximize throughput
• Historical data
• New data
• ... want control of storing their own data
Trading system
• Auditable – SVN for data
• Stable
• Performant
5
Overview – Data consumers
6
The Research Problem – Scale
lib.read('Equity Prices')
Out[4]:
<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 9605 entries, 1983-01-31 21:30:00 to 2014-02-14 21:30:00
Columns: 8103 entries, AST10000 to AST9997
dtypes: float64(8631)
Equity Prices: 77M float64s
593MB of data = 4,744Mbits!
600 MB
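The arithmetic behind these figures can be checked directly. The sketch below assumes that "MB" on the slide means binary megabytes (2^20 bytes), which is the reading under which the numbers line up; the variable names are illustrative only.

```python
# Shape reported by the DataFrame summary above.
rows, cols = 9605, 8103
bytes_per_float64 = 8

n_values = rows * cols                    # roughly 77.8 million float64 values
size_bytes = n_values * bytes_per_float64
size_mib = size_bytes // 2**20            # whole binary megabytes
size_mbit = size_mib * 8                  # megabits to move across the network

print(n_values, size_mib, size_mbit)
```

Under that assumption a single read of this one DataFrame moves 593 MB, or 4,744 Mbit, which is why network load matters as much as storage here.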
Many different existing data stores
• Relational databases
• Tick databases
• Flat files
• HDF5 files
• Caches
7
Overview – Databases
8
Can we build one system to rule them all?
Overview – Databases
Goals
• 10 years of 1 minute data in <1s
• 200 instruments x all history x once a day data <1s
• Single data store for all data types
• 1x day data → Tick Data
• Data versioning + Audit
Requirements
• Fast – most data in-memory
• Complete – all data in single location
• Scalable – unbounded in size and number of clients
• Agile – rapid iterative development
9
Project Goals
10
Implementation
Impedance mismatch between Python/Pandas/Numpy and Existing Databases
- Machine cluster operating on data blocks
Vs
- Database doing the analytical work
MongoDB:
- Developer productivity
- Document ≈ Python Dictionary
- Fast out of the box
- Low latency
- High throughput
- Predictable performance
- Sharding / Replication for growth and scale out
- Free
- Great support
- Most widely used NoSQL DB
11
Implementation – Choosing MongoDB
12
Implementation – System Architecture
[Diagram: Python clients (… cn) connect through three mongos routers to five shards, rs0 to rs4, each a mongod labelled 500GB, alongside three config servers.]
{'_id': ObjectId('…'),
 'c': 47,
 'columns': {
     'PRICE': {'data': Binary('...', 0),
               'dtype': 'float64',
               'rowmask': Binary('...', 0)},
     'SIZE': {'data': Binary('...', 0),
              'dtype': 'int64'}},
 'endSeq': -1L,
 'index': Binary('...', 0),
 'segment': 1296568173000L,
 'sha': abcd123456,
 'start': 1296568173000L,
 'end': 1298569664000L,
 'symbol': 'AST1209',
 'v': 2}
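The slide does not explain what `rowmask` holds. One plausible reading, and it is only an assumption here, is a bitmask with one bit per row that marks whether the column has a value for that row, so a sparse column stores only the values that are present. A minimal standard-library sketch of that idea follows; the function names are hypothetical and `struct` stands in for NumPy.

```python
import struct

def pack_sparse_column(values):
    """Pack a column with gaps (None) into dense data plus a bitmask.

    Returns (data, rowmask, n_rows). data holds only the present values as
    little-endian float64; rowmask has one bit per row, most significant first.
    """
    dense = [v for v in values if v is not None]
    data = struct.pack('<%dd' % len(dense), *dense)
    mask = bytearray((len(values) + 7) // 8)
    for row, value in enumerate(values):
        if value is not None:
            mask[row // 8] |= 0x80 >> (row % 8)
    return data, bytes(mask), len(values)

def unpack_sparse_column(data, rowmask, n_rows):
    """Inverse of pack_sparse_column: rebuild the column with None for gaps."""
    dense = iter(struct.unpack('<%dd' % (len(data) // 8), data))
    out = []
    for row in range(n_rows):
        present = rowmask[row // 8] & (0x80 >> (row % 8))
        out.append(next(dense) if present else None)
    return out

# Six ticks, only three of which carry a price.
prices = [101.5, None, 101.75, None, None, 102.0]
data, rowmask, n_rows = pack_sparse_column(prices)
restored = unpack_sparse_column(data, rowmask, n_rows)
```

Here three float64 values (24 bytes) plus a one-byte mask replace six slots, and the saving grows with how sparse the column is.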
Data bucketed into named Libraries
• One minute
• Daily
• User-data: jbloggs.EOD
• Metadata Index
Pluggable library types:
• VersionStore
• TickStore
• Metadata store
• … others …
© Man 2013 13
Implementation – Mongoose
Mongoose key-value store
14
Implementation - MongooseAPI
from ahl.mongo import Mongoose
m = Mongoose('research')                     # Connect to the data store
m.list_libraries()                           # What data libraries are available
library = m['jbloggs.EOD']                   # Get a Library
library.list_symbols()                       # List symbols
library.write('SYMBOL', <TS or other data>)  # Write
library.read('SYMBOL', version=…)            # Read, with an optional version
library.snapshot('snapshot-name')            # Create a named snapshot of the library
library.list_snapshots()                     # List the snapshots of the library
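To make the versioning semantics of this API concrete, here is a toy in-memory stand-in. It is a sketch, not the `ahl.mongo` implementation: the class `ToyLibrary`, the 1-based version numbers, and the `snapshot=` argument to `read` are all assumptions made for illustration.

```python
class ToyLibrary:
    """In-memory stand-in for a versioned library (hypothetical, not ahl.mongo)."""

    def __init__(self):
        self._versions = {}    # symbol -> list of items; index = version - 1
        self._snapshots = {}   # snapshot name -> {symbol: version}

    def list_symbols(self):
        return sorted(self._versions)

    def write(self, symbol, item):
        """Append a new version; earlier versions are never overwritten."""
        self._versions.setdefault(symbol, []).append(item)
        return len(self._versions[symbol])

    def read(self, symbol, version=None, snapshot=None):
        """Read the latest version, an explicit one, or the one a snapshot pinned."""
        if snapshot is not None:
            version = self._snapshots[snapshot][symbol]
        history = self._versions[symbol]
        return history[-1] if version is None else history[version - 1]

    def snapshot(self, name):
        """Pin the current latest version of every symbol under a name."""
        self._snapshots[name] = {s: len(h) for s, h in self._versions.items()}

    def list_snapshots(self):
        return sorted(self._snapshots)


library = ToyLibrary()
library.write('SYMBOL', [1.0, 2.0])
library.snapshot('snapshot-name')
library.write('SYMBOL', [1.0, 2.0, 3.0])   # the snapshot still sees version 1
```

Because writes only append, a snapshot is just a small map from symbol to version number, which is what makes "SVN for data" style auditing cheap.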
15
Implementation – Version Store
[Diagram: snapshots Snap A and Snap B referencing symbol versions Sym1 v1, Sym2 v3 and Sym2 v4]
16
Implementation – VersionStore: A chunk
17
Implementation – VersionStore: A version
18
Implementation – VersionStore: Bringing it together
import cPickle
import lz4
from bson.binary import Binary

_CHUNK_SIZE = 15 * 1024 * 1024  # 15MB

class PickleStore(object):
    def write(self, collection, version, symbol, item):
        # Try to pickle it. This is best effort
        pickled = lz4.compressHC(cPickle.dumps(item))
        # Split the compressed blob into 15MB segments
        for i in xrange(len(pickled) / _CHUNK_SIZE + 1):
            segment = {'data': Binary(pickled[i * _CHUNK_SIZE : (i + 1) * _CHUNK_SIZE])}
            segment['segment'] = i
            sha = checksum(symbol, segment)  # checksum() is a helper not shown on the slide
            # If this segment already exists, only add the new version as a parent
            collection.update({'symbol': symbol, 'sha': sha},
                              {'$set': segment,
                               '$addToSet': {'parent': version['_id']}},
                              upsert=True)
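The write path above keys each segment by a checksum and adds the version to a `parent` set, so writing unchanged data under a new version stores nothing new. The sketch below reproduces that behaviour with the standard library only. It makes several substitutions that are assumptions, not the original code: a plain dict stands in for the MongoDB collection, `zlib` for lz4, SHA-1 for the unspecified `checksum` helper, and the chunk size is tiny so the example yields several segments.

```python
import hashlib
import pickle
import zlib

CHUNK_SIZE = 16  # bytes; deliberately tiny for the example

def write(collection, version_id, symbol, item):
    """Store `item` as compressed, content-addressed segments.

    `collection` is a dict keyed by (symbol, sha); each value mimics a
    segment document.
    """
    blob = zlib.compress(pickle.dumps(item, protocol=2))
    # Stepping by CHUNK_SIZE avoids an empty trailing segment when the
    # blob length is an exact multiple of the chunk size.
    for i, start in enumerate(range(0, len(blob), CHUNK_SIZE)):
        data = blob[start:start + CHUNK_SIZE]
        sha = hashlib.sha1(symbol.encode() + str(i).encode() + data).hexdigest()
        doc = collection.setdefault(
            (symbol, sha), {'symbol': symbol, 'sha': sha, 'parent': set()})
        doc.update(segment=i, data=data)   # plays the role of $set
        doc['parent'].add(version_id)      # plays the role of $addToSet

collection = {}
item = {'note': 'any picklable object', 'values': list(range(50))}
write(collection, 'v1', 'SYM', item)
segments_after_v1 = len(collection)
write(collection, 'v2', 'SYM', item)       # unchanged data: no new segments
```

After the second write the number of stored segments is unchanged and every segment lists both versions as parents, which is the de-duplication the upsert with `$addToSet` provides.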
19
Implementation – Arbitrary Data
class PickleStore(object):
    def read(self, collection, version, symbol):
        data = ''.join([x['data'] for x in collection.find({'symbol': symbol,
                                                            'parent': version['_id']},
                                                           sort=[('segment', pymongo.ASCENDING)])])
        return cPickle.loads(lz4.decompress(data))
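The read path is the mirror image: select the segments whose `parent` contains the requested version, order them by `segment`, join, decompress and unpickle. A standard-library sketch under the same stated substitutions (a list of dicts in place of a MongoDB query result, `zlib` in place of lz4; the names are illustrative):

```python
import pickle
import zlib

def read(segments, version_id):
    """Rebuild an object from the segment documents that belong to `version_id`.

    `segments` is any iterable of dicts with 'segment', 'data' and 'parent' keys.
    """
    mine = [s for s in segments if version_id in s['parent']]
    mine.sort(key=lambda s: s['segment'])      # restore the original byte order
    blob = b''.join(s['data'] for s in mine)
    return pickle.loads(zlib.decompress(blob))

# Build three segments by hand, then shuffle them and add an unrelated one.
payload = zlib.compress(pickle.dumps({'a': [1, 2, 3]}))
third = len(payload) // 3
pieces = [payload[:third], payload[third:2 * third], payload[2 * third:]]
segments = [{'segment': i, 'data': p, 'parent': {'v1'}}
            for i, p in enumerate(pieces)]
segments.reverse()
segments.append({'segment': 0, 'data': b'unrelated', 'parent': {'v0'}})

restored = read(segments, 'v1')
```

The explicit sort matters: segments are independent documents, so nothing guarantees they come back in write order unless the query asks for it.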
22
Implementation – Arbitrary Data
23
Implementation – DataFrames
def do_write(df, version):
    records = df.to_records()
    version['dtype'] = str(records.dtype)
    chunk_size = _CHUNK_SIZE / records.dtype.itemsize
    ... chunk_and_store ...

def do_read(version):
    ... read_chunks ...
    data = ''.join(chunks)
    dtype = np.dtype(version['dtype'])
    recs = np.fromstring(data, dtype=dtype)
    return DataFrame.from_records(recs)
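The key point in `do_write` is that the chunk size is expressed in records, not bytes, so a chunk never splits a row, and the record layout is stored with the version so the bytes can be decoded later. The sketch below shows the same idea with the standard library; `struct` stands in for NumPy record arrays, and the record layout, chunk size and row values are assumptions chosen for the example.

```python
import struct

CHUNK_SIZE = 64                   # bytes; tiny so the example yields several chunks
RECORD = struct.Struct('<qd')     # one row: int64 timestamp + float64 price

def do_write(rows):
    """Serialise rows to fixed-width records and split on record boundaries."""
    raw = b''.join(RECORD.pack(ts, price) for ts, price in rows)
    rows_per_chunk = CHUNK_SIZE // RECORD.size     # whole records per chunk
    step = rows_per_chunk * RECORD.size
    chunks = [raw[i:i + step] for i in range(0, len(raw), step)]
    version = {'dtype': RECORD.format}             # layout needed to decode later
    return version, chunks

def do_read(version, chunks):
    """Join the chunks and decode them with the stored record layout."""
    layout = struct.Struct(version['dtype'])
    return list(layout.iter_unpack(b''.join(chunks)))

rows = [(1296568173000 + i * 60000, 100.0 + i * 0.25) for i in range(10)]
version, chunks = do_write(rows)
restored = do_read(version, chunks)
```

With 16-byte records and 64-byte chunks, ten rows land in three chunks of four, four and two records, and decoding needs nothing beyond the stored layout string.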
24
Results
Flat files on NFS – Random market
25
Results – Performance Once a Day Data
HDF5 files – Random instrument
26
Results – Performance One Minute Data
Random E-Mini S&P contract from 2013
27
Results – TickStore – 8 parallel
Random E-Mini S&P contract from 2013
28
Results – TickStore
Random E-Mini S&P contract from 2013
29
Results – TickStore Throughput
Random E-Mini S&P contract from 2013
30
Results – System Load
[Chart: system load with N Tasks = 32, OtherTick vs. Mongo (x2)]
Built a system to store data of any shape and size
- Reduced impedance between Python language and the data store
Low latency:
- 1xDay data: 4ms for 10,000 rows (vs. 2,210ms from SQL)
- OneMinute / Tick data: 1s for 3.5M rows Python (vs. 15s – 40s+ from OtherTick)
- 1s for 15M rows Java
Parallel Access:
- Cluster with 256+ concurrent data accesses
- Consistent throughput – little load on the Mongo server
Efficient:
- 10-15x reduction in network load
- Negligible decompression cost (lz4: 1.8Gb/s)
31
Conclusions
32
Questions?

Contenu connexe

Tendances

NOSQL Databases types and Uses
NOSQL Databases types and UsesNOSQL Databases types and Uses
NOSQL Databases types and UsesSuvradeep Rudra
 
3. R- list and data frame
3. R- list and data frame3. R- list and data frame
3. R- list and data framekrishna singh
 
Data Analysis with Python Pandas
Data Analysis with Python PandasData Analysis with Python Pandas
Data Analysis with Python PandasNeeru Mittal
 
Dimensional Modeling
Dimensional ModelingDimensional Modeling
Dimensional Modelingaksrauf
 
Python – Object Oriented Programming
Python – Object Oriented Programming Python – Object Oriented Programming
Python – Object Oriented Programming Raghunath A
 
An Enterprise Architect's View of MongoDB
An Enterprise Architect's View of MongoDBAn Enterprise Architect's View of MongoDB
An Enterprise Architect's View of MongoDBMongoDB
 
MongoDB Schema Design (Event: An Evening with MongoDB Houston 3/11/15)
MongoDB Schema Design (Event: An Evening with MongoDB Houston 3/11/15)MongoDB Schema Design (Event: An Evening with MongoDB Houston 3/11/15)
MongoDB Schema Design (Event: An Evening with MongoDB Houston 3/11/15)MongoDB
 
Delivering Trusted Insights with Integrated Data Quality for Collibra
Delivering Trusted Insights with Integrated Data Quality for CollibraDelivering Trusted Insights with Integrated Data Quality for Collibra
Delivering Trusted Insights with Integrated Data Quality for CollibraPrecisely
 
Data Analysis in Python-NumPy
Data Analysis in Python-NumPyData Analysis in Python-NumPy
Data Analysis in Python-NumPyDevashish Kumar
 
MongoDB Memory Management Demystified
MongoDB Memory Management DemystifiedMongoDB Memory Management Demystified
MongoDB Memory Management DemystifiedMongoDB
 
data modeling and models
data modeling and modelsdata modeling and models
data modeling and modelssabah N
 
Getting started with pandas
Getting started with pandasGetting started with pandas
Getting started with pandasmaikroeder
 
Data Science With Python | Python For Data Science | Python Data Science Cour...
Data Science With Python | Python For Data Science | Python Data Science Cour...Data Science With Python | Python For Data Science | Python Data Science Cour...
Data Science With Python | Python For Data Science | Python Data Science Cour...Simplilearn
 
Neo4j GraphTalk Helsinki - Introduction and Graph Use Cases
Neo4j GraphTalk Helsinki - Introduction and Graph Use CasesNeo4j GraphTalk Helsinki - Introduction and Graph Use Cases
Neo4j GraphTalk Helsinki - Introduction and Graph Use CasesNeo4j
 

Tendances (20)

Data warehousing
Data warehousingData warehousing
Data warehousing
 
NOSQL Databases types and Uses
NOSQL Databases types and UsesNOSQL Databases types and Uses
NOSQL Databases types and Uses
 
3. R- list and data frame
3. R- list and data frame3. R- list and data frame
3. R- list and data frame
 
Data Analysis with Python Pandas
Data Analysis with Python PandasData Analysis with Python Pandas
Data Analysis with Python Pandas
 
Pandas
PandasPandas
Pandas
 
Pandas
PandasPandas
Pandas
 
Data frame operations
Data frame operationsData frame operations
Data frame operations
 
Dimensional Modeling
Dimensional ModelingDimensional Modeling
Dimensional Modeling
 
Python – Object Oriented Programming
Python – Object Oriented Programming Python – Object Oriented Programming
Python – Object Oriented Programming
 
An Enterprise Architect's View of MongoDB
An Enterprise Architect's View of MongoDBAn Enterprise Architect's View of MongoDB
An Enterprise Architect's View of MongoDB
 
MongoDB Schema Design (Event: An Evening with MongoDB Houston 3/11/15)
MongoDB Schema Design (Event: An Evening with MongoDB Houston 3/11/15)MongoDB Schema Design (Event: An Evening with MongoDB Houston 3/11/15)
MongoDB Schema Design (Event: An Evening with MongoDB Houston 3/11/15)
 
Delivering Trusted Insights with Integrated Data Quality for Collibra
Delivering Trusted Insights with Integrated Data Quality for CollibraDelivering Trusted Insights with Integrated Data Quality for Collibra
Delivering Trusted Insights with Integrated Data Quality for Collibra
 
Data Analysis in Python-NumPy
Data Analysis in Python-NumPyData Analysis in Python-NumPy
Data Analysis in Python-NumPy
 
MongoDB Memory Management Demystified
MongoDB Memory Management DemystifiedMongoDB Memory Management Demystified
MongoDB Memory Management Demystified
 
data modeling and models
data modeling and modelsdata modeling and models
data modeling and models
 
Getting started with pandas
Getting started with pandasGetting started with pandas
Getting started with pandas
 
Data preprocessing
Data preprocessingData preprocessing
Data preprocessing
 
Data Science With Python | Python For Data Science | Python Data Science Cour...
Data Science With Python | Python For Data Science | Python Data Science Cour...Data Science With Python | Python For Data Science | Python Data Science Cour...
Data Science With Python | Python For Data Science | Python Data Science Cour...
 
Neo4j GraphTalk Helsinki - Introduction and Graph Use Cases
Neo4j GraphTalk Helsinki - Introduction and Graph Use CasesNeo4j GraphTalk Helsinki - Introduction and Graph Use Cases
Neo4j GraphTalk Helsinki - Introduction and Graph Use Cases
 
Data models
Data modelsData models
Data models
 

Similaire à Python and MongoDB as a Market Data Platform by James Blackburn

Python business intelligence (PyData 2012 talk)
Python business intelligence (PyData 2012 talk)Python business intelligence (PyData 2012 talk)
Python business intelligence (PyData 2012 talk)Stefan Urbanek
 
Data Architectures for Robust Decision Making
Data Architectures for Robust Decision MakingData Architectures for Robust Decision Making
Data Architectures for Robust Decision MakingGwen (Chen) Shapira
 
Polyglot metadata for Hadoop
Polyglot metadata for HadoopPolyglot metadata for Hadoop
Polyglot metadata for HadoopJim Dowling
 
Managing your Black Friday Logs - Antonio Bonuccelli - Codemotion Rome 2018
Managing your Black Friday Logs - Antonio Bonuccelli - Codemotion Rome 2018Managing your Black Friday Logs - Antonio Bonuccelli - Codemotion Rome 2018
Managing your Black Friday Logs - Antonio Bonuccelli - Codemotion Rome 2018Codemotion
 
Big Data and Machine Learning with FIWARE
Big Data and Machine Learning with FIWAREBig Data and Machine Learning with FIWARE
Big Data and Machine Learning with FIWAREFernando Lopez Aguilar
 
Black friday logs - Scaling Elasticsearch
Black friday logs - Scaling ElasticsearchBlack friday logs - Scaling Elasticsearch
Black friday logs - Scaling ElasticsearchSylvain Wallez
 
Apache Solr as a compressed, scalable, and high performance time series database
Apache Solr as a compressed, scalable, and high performance time series databaseApache Solr as a compressed, scalable, and high performance time series database
Apache Solr as a compressed, scalable, and high performance time series databaseFlorian Lautenschlager
 
IBM Cognos tutorial - ABC LEARN
IBM Cognos tutorial - ABC LEARNIBM Cognos tutorial - ABC LEARN
IBM Cognos tutorial - ABC LEARNabclearnn
 
Webinar: Scaling MongoDB
Webinar: Scaling MongoDBWebinar: Scaling MongoDB
Webinar: Scaling MongoDBMongoDB
 
MongoDB Days Germany: Data Processing with MongoDB
MongoDB Days Germany: Data Processing with MongoDBMongoDB Days Germany: Data Processing with MongoDB
MongoDB Days Germany: Data Processing with MongoDBMongoDB
 
Java/Scala Lab: Борис Трофимов - Обжигающая Big Data.
Java/Scala Lab: Борис Трофимов - Обжигающая Big Data.Java/Scala Lab: Борис Трофимов - Обжигающая Big Data.
Java/Scala Lab: Борис Трофимов - Обжигающая Big Data.GeeksLab Odessa
 
Making Hadoop Realtime by Dr. William Bain of Scaleout Software
Making Hadoop Realtime by Dr. William Bain of Scaleout SoftwareMaking Hadoop Realtime by Dr. William Bain of Scaleout Software
Making Hadoop Realtime by Dr. William Bain of Scaleout SoftwareData Con LA
 
Data Mining: Concepts and Techniques (3rd ed.) — Chapter _04 olap
Data Mining:  Concepts and Techniques (3rd ed.)— Chapter _04 olapData Mining:  Concepts and Techniques (3rd ed.)— Chapter _04 olap
Data Mining: Concepts and Techniques (3rd ed.) — Chapter _04 olapSalah Amean
 
Ugif 04 2011 france ug04042011-jroy_ts
Ugif 04 2011   france ug04042011-jroy_tsUgif 04 2011   france ug04042011-jroy_ts
Ugif 04 2011 france ug04042011-jroy_tsUGIF
 
NOSQL101, Or: How I Learned To Stop Worrying And Love The Mongo!
NOSQL101, Or: How I Learned To Stop Worrying And Love The Mongo!NOSQL101, Or: How I Learned To Stop Worrying And Love The Mongo!
NOSQL101, Or: How I Learned To Stop Worrying And Love The Mongo!Daniel Cousineau
 
High Performance, Scalable MongoDB in a Bare Metal Cloud
High Performance, Scalable MongoDB in a Bare Metal CloudHigh Performance, Scalable MongoDB in a Bare Metal Cloud
High Performance, Scalable MongoDB in a Bare Metal CloudMongoDB
 
Aggregated queries with Druid on terrabytes and petabytes of data
Aggregated queries with Druid on terrabytes and petabytes of dataAggregated queries with Druid on terrabytes and petabytes of data
Aggregated queries with Druid on terrabytes and petabytes of dataRostislav Pashuto
 
Ten things to consider for interactive analytics on write once workloads
Ten things to consider for interactive analytics on write once workloadsTen things to consider for interactive analytics on write once workloads
Ten things to consider for interactive analytics on write once workloadsAbinasha Karana
 
Agility and Scalability with MongoDB
Agility and Scalability with MongoDBAgility and Scalability with MongoDB
Agility and Scalability with MongoDBMongoDB
 
Cloudera Breakfast Series, Analytics Part 1: Use All Your Data
Cloudera Breakfast Series, Analytics Part 1: Use All Your DataCloudera Breakfast Series, Analytics Part 1: Use All Your Data
Cloudera Breakfast Series, Analytics Part 1: Use All Your DataCloudera, Inc.
 

Similaire à Python and MongoDB as a Market Data Platform by James Blackburn (20)

Python business intelligence (PyData 2012 talk)
Python business intelligence (PyData 2012 talk)Python business intelligence (PyData 2012 talk)
Python business intelligence (PyData 2012 talk)
 
Data Architectures for Robust Decision Making
Data Architectures for Robust Decision MakingData Architectures for Robust Decision Making
Data Architectures for Robust Decision Making
 
Polyglot metadata for Hadoop
Polyglot metadata for HadoopPolyglot metadata for Hadoop
Polyglot metadata for Hadoop
 
Managing your Black Friday Logs - Antonio Bonuccelli - Codemotion Rome 2018
Managing your Black Friday Logs - Antonio Bonuccelli - Codemotion Rome 2018Managing your Black Friday Logs - Antonio Bonuccelli - Codemotion Rome 2018
Managing your Black Friday Logs - Antonio Bonuccelli - Codemotion Rome 2018
 
Big Data and Machine Learning with FIWARE
Big Data and Machine Learning with FIWAREBig Data and Machine Learning with FIWARE
Big Data and Machine Learning with FIWARE
 
Black friday logs - Scaling Elasticsearch
Black friday logs - Scaling ElasticsearchBlack friday logs - Scaling Elasticsearch
Black friday logs - Scaling Elasticsearch
 
Apache Solr as a compressed, scalable, and high performance time series database
Apache Solr as a compressed, scalable, and high performance time series databaseApache Solr as a compressed, scalable, and high performance time series database
Apache Solr as a compressed, scalable, and high performance time series database
 
IBM Cognos tutorial - ABC LEARN
IBM Cognos tutorial - ABC LEARNIBM Cognos tutorial - ABC LEARN
IBM Cognos tutorial - ABC LEARN
 
Webinar: Scaling MongoDB
Webinar: Scaling MongoDBWebinar: Scaling MongoDB
Webinar: Scaling MongoDB
 
MongoDB Days Germany: Data Processing with MongoDB
MongoDB Days Germany: Data Processing with MongoDBMongoDB Days Germany: Data Processing with MongoDB
MongoDB Days Germany: Data Processing with MongoDB
 
Java/Scala Lab: Борис Трофимов - Обжигающая Big Data.
Java/Scala Lab: Борис Трофимов - Обжигающая Big Data.Java/Scala Lab: Борис Трофимов - Обжигающая Big Data.
Java/Scala Lab: Борис Трофимов - Обжигающая Big Data.
 
Making Hadoop Realtime by Dr. William Bain of Scaleout Software
Making Hadoop Realtime by Dr. William Bain of Scaleout SoftwareMaking Hadoop Realtime by Dr. William Bain of Scaleout Software
Making Hadoop Realtime by Dr. William Bain of Scaleout Software
 
Data Mining: Concepts and Techniques (3rd ed.) — Chapter _04 olap
Data Mining:  Concepts and Techniques (3rd ed.)— Chapter _04 olapData Mining:  Concepts and Techniques (3rd ed.)— Chapter _04 olap
Data Mining: Concepts and Techniques (3rd ed.) — Chapter _04 olap
 
Ugif 04 2011 france ug04042011-jroy_ts
Ugif 04 2011   france ug04042011-jroy_tsUgif 04 2011   france ug04042011-jroy_ts
Ugif 04 2011 france ug04042011-jroy_ts
 
NOSQL101, Or: How I Learned To Stop Worrying And Love The Mongo!
NOSQL101, Or: How I Learned To Stop Worrying And Love The Mongo!NOSQL101, Or: How I Learned To Stop Worrying And Love The Mongo!
NOSQL101, Or: How I Learned To Stop Worrying And Love The Mongo!
 
High Performance, Scalable MongoDB in a Bare Metal Cloud
High Performance, Scalable MongoDB in a Bare Metal CloudHigh Performance, Scalable MongoDB in a Bare Metal Cloud
High Performance, Scalable MongoDB in a Bare Metal Cloud
 
Aggregated queries with Druid on terrabytes and petabytes of data
Aggregated queries with Druid on terrabytes and petabytes of dataAggregated queries with Druid on terrabytes and petabytes of data
Aggregated queries with Druid on terrabytes and petabytes of data
 
Ten things to consider for interactive analytics on write once workloads
Ten things to consider for interactive analytics on write once workloadsTen things to consider for interactive analytics on write once workloads
Ten things to consider for interactive analytics on write once workloads
 
Agility and Scalability with MongoDB
Agility and Scalability with MongoDBAgility and Scalability with MongoDB
Agility and Scalability with MongoDB
 
Cloudera Breakfast Series, Analytics Part 1: Use All Your Data
Cloudera Breakfast Series, Analytics Part 1: Use All Your DataCloudera Breakfast Series, Analytics Part 1: Use All Your Data
Cloudera Breakfast Series, Analytics Part 1: Use All Your Data
 

Plus de PyData

Michal Mucha: Build and Deploy an End-to-end Streaming NLP Insight System | P...
Michal Mucha: Build and Deploy an End-to-end Streaming NLP Insight System | P...Michal Mucha: Build and Deploy an End-to-end Streaming NLP Insight System | P...
Michal Mucha: Build and Deploy an End-to-end Streaming NLP Insight System | P...PyData
 
Unit testing data with marbles - Jane Stewart Adams, Leif Walsh
Unit testing data with marbles - Jane Stewart Adams, Leif WalshUnit testing data with marbles - Jane Stewart Adams, Leif Walsh
Unit testing data with marbles - Jane Stewart Adams, Leif WalshPyData
 
The TileDB Array Data Storage Manager - Stavros Papadopoulos, Jake Bolewski
The TileDB Array Data Storage Manager - Stavros Papadopoulos, Jake BolewskiThe TileDB Array Data Storage Manager - Stavros Papadopoulos, Jake Bolewski
The TileDB Array Data Storage Manager - Stavros Papadopoulos, Jake BolewskiPyData
 
Using Embeddings to Understand the Variance and Evolution of Data Science... ...
Using Embeddings to Understand the Variance and Evolution of Data Science... ...Using Embeddings to Understand the Variance and Evolution of Data Science... ...
Using Embeddings to Understand the Variance and Evolution of Data Science... ...PyData
 
Deploying Data Science for Distribution of The New York Times - Anne Bauer
Deploying Data Science for Distribution of The New York Times - Anne BauerDeploying Data Science for Distribution of The New York Times - Anne Bauer
Deploying Data Science for Distribution of The New York Times - Anne BauerPyData
 
Graph Analytics - From the Whiteboard to Your Toolbox - Sam Lerma
Graph Analytics - From the Whiteboard to Your Toolbox - Sam LermaGraph Analytics - From the Whiteboard to Your Toolbox - Sam Lerma
Graph Analytics - From the Whiteboard to Your Toolbox - Sam LermaPyData
 
Do Your Homework! Writing tests for Data Science and Stochastic Code - David ...
Do Your Homework! Writing tests for Data Science and Stochastic Code - David ...Do Your Homework! Writing tests for Data Science and Stochastic Code - David ...
Do Your Homework! Writing tests for Data Science and Stochastic Code - David ...PyData
 
RESTful Machine Learning with Flask and TensorFlow Serving - Carlo Mazzaferro
RESTful Machine Learning with Flask and TensorFlow Serving - Carlo MazzaferroRESTful Machine Learning with Flask and TensorFlow Serving - Carlo Mazzaferro
RESTful Machine Learning with Flask and TensorFlow Serving - Carlo MazzaferroPyData
 
Mining dockless bikeshare and dockless scootershare trip data - Stefanie Brod...
Mining dockless bikeshare and dockless scootershare trip data - Stefanie Brod...Mining dockless bikeshare and dockless scootershare trip data - Stefanie Brod...
Mining dockless bikeshare and dockless scootershare trip data - Stefanie Brod...PyData
 
Avoiding Bad Database Surprises: Simulation and Scalability - Steven Lott
Avoiding Bad Database Surprises: Simulation and Scalability - Steven LottAvoiding Bad Database Surprises: Simulation and Scalability - Steven Lott
Avoiding Bad Database Surprises: Simulation and Scalability - Steven LottPyData
 
Words in Space - Rebecca Bilbro
Words in Space - Rebecca BilbroWords in Space - Rebecca Bilbro
Words in Space - Rebecca BilbroPyData
 
End-to-End Machine learning pipelines for Python driven organizations - Nick ...
End-to-End Machine learning pipelines for Python driven organizations - Nick ...End-to-End Machine learning pipelines for Python driven organizations - Nick ...
End-to-End Machine learning pipelines for Python driven organizations - Nick ...PyData
 
Pydata beautiful soup - Monica Puerto
Pydata beautiful soup - Monica PuertoPydata beautiful soup - Monica Puerto
Pydata beautiful soup - Monica PuertoPyData
 
1D Convolutional Neural Networks for Time Series Modeling - Nathan Janos, Jef...
1D Convolutional Neural Networks for Time Series Modeling - Nathan Janos, Jef...1D Convolutional Neural Networks for Time Series Modeling - Nathan Janos, Jef...
1D Convolutional Neural Networks for Time Series Modeling - Nathan Janos, Jef...PyData
 
Extending Pandas with Custom Types - Will Ayd
Extending Pandas with Custom Types - Will AydExtending Pandas with Custom Types - Will Ayd
Extending Pandas with Custom Types - Will AydPyData
 
Measuring Model Fairness - Stephen Hoover
Measuring Model Fairness - Stephen HooverMeasuring Model Fairness - Stephen Hoover
Measuring Model Fairness - Stephen HooverPyData
 
What's the Science in Data Science? - Skipper Seabold
What's the Science in Data Science? - Skipper SeaboldWhat's the Science in Data Science? - Skipper Seabold
What's the Science in Data Science? - Skipper SeaboldPyData
 
Applying Statistical Modeling and Machine Learning to Perform Time-Series For...
Applying Statistical Modeling and Machine Learning to Perform Time-Series For...Applying Statistical Modeling and Machine Learning to Perform Time-Series For...
Applying Statistical Modeling and Machine Learning to Perform Time-Series For...PyData
 
Solving very simple substitution ciphers algorithmically - Stephen Enright-Ward
Solving very simple substitution ciphers algorithmically - Stephen Enright-WardSolving very simple substitution ciphers algorithmically - Stephen Enright-Ward
Solving very simple substitution ciphers algorithmically - Stephen Enright-WardPyData
 
The Face of Nanomaterials: Insightful Classification Using Deep Learning - An...
The Face of Nanomaterials: Insightful Classification Using Deep Learning - An...The Face of Nanomaterials: Insightful Classification Using Deep Learning - An...
The Face of Nanomaterials: Insightful Classification Using Deep Learning - An...PyData
 

Plus de PyData (20)

Michal Mucha: Build and Deploy an End-to-end Streaming NLP Insight System | P...
Michal Mucha: Build and Deploy an End-to-end Streaming NLP Insight System | P...Michal Mucha: Build and Deploy an End-to-end Streaming NLP Insight System | P...
Michal Mucha: Build and Deploy an End-to-end Streaming NLP Insight System | P...
 
Unit testing data with marbles - Jane Stewart Adams, Leif Walsh
Unit testing data with marbles - Jane Stewart Adams, Leif WalshUnit testing data with marbles - Jane Stewart Adams, Leif Walsh
Unit testing data with marbles - Jane Stewart Adams, Leif Walsh
 
The TileDB Array Data Storage Manager - Stavros Papadopoulos, Jake Bolewski
The TileDB Array Data Storage Manager - Stavros Papadopoulos, Jake BolewskiThe TileDB Array Data Storage Manager - Stavros Papadopoulos, Jake Bolewski
The TileDB Array Data Storage Manager - Stavros Papadopoulos, Jake Bolewski
 
Using Embeddings to Understand the Variance and Evolution of Data Science... ...
Using Embeddings to Understand the Variance and Evolution of Data Science... ...Using Embeddings to Understand the Variance and Evolution of Data Science... ...
Using Embeddings to Understand the Variance and Evolution of Data Science... ...
 
Deploying Data Science for Distribution of The New York Times - Anne Bauer
Deploying Data Science for Distribution of The New York Times - Anne BauerDeploying Data Science for Distribution of The New York Times - Anne Bauer
Deploying Data Science for Distribution of The New York Times - Anne Bauer
 
Graph Analytics - From the Whiteboard to Your Toolbox - Sam Lerma
Graph Analytics - From the Whiteboard to Your Toolbox - Sam LermaGraph Analytics - From the Whiteboard to Your Toolbox - Sam Lerma
Graph Analytics - From the Whiteboard to Your Toolbox - Sam Lerma
 
Do Your Homework! Writing tests for Data Science and Stochastic Code - David ...
Do Your Homework! Writing tests for Data Science and Stochastic Code - David ...Do Your Homework! Writing tests for Data Science and Stochastic Code - David ...
Do Your Homework! Writing tests for Data Science and Stochastic Code - David ...
 
RESTful Machine Learning with Flask and TensorFlow Serving - Carlo Mazzaferro
Python and MongoDB as a Market Data Platform by James Blackburn

  • 1. Python and MongoDB as a Market Data Platform: Scalable storage of time series data (2014)
  • 2. Legalese: Opinions expressed are those of the author and may not be shared by all personnel of Man Group plc (‘Man’). These opinions are subject to change without notice, and are for information purposes only and do not constitute an offer or invitation to make an investment in any financial instrument or in any product to which any member of Man’s group of companies provides investment advisory or any other services. Any forward-looking statements speak only as of the date on which they are made and are subject to risks and uncertainties that may cause actual results to differ materially from those contained in the statements. Unless stated otherwise this information is communicated by Man Investments Limited and AHL Partners LLP which are both authorised and regulated in the UK by the Financial Conduct Authority.
  • 4. Overview – Data shapes
      Financial data comes in different sizes…
      • ~1MB: 1x a day price data
      • ~1GB x 1000s: 9,000 x 9,000 data matrices
      • ~40GB: 1-minute data
      • ~30TB: Tick data
      • > even larger data sets (options, …)
      … and different shapes
      • Time series of prices
      • Event data
      • News data
      • What’s next?
  • 5. Overview – Data consumers
      Quant researchers
      • Interactive work – latency sensitive
      • Batch jobs run on a cluster – maximize throughput
      • Historical data
      • New data
      • ... want control of storing their own data
      Trading system
      • Auditable – SVN for data
      • Stable
      • Performant
  • 6. The Research Problem – Scale
      lib.read('Equity Prices')
      Out[4]:
      <class 'pandas.core.frame.DataFrame'>
      DatetimeIndex: 9605 entries, 1983-01-31 21:30:00 to 2014-02-14 21:30:00
      Columns: 8103 entries, AST10000 to AST9997
      dtypes: float64(8631)
      Equity Prices: 77M float64s
      593MB of data = 4,744Mbits!
      600 MB
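The size figures on this slide can be checked from the reported shape. The sketch below assumes the frame holds 9605 x 8103 float64 values at 8 bytes each, and that the slide's "MB" means 2**20 bytes; both are assumptions inferred from the numbers, not statements from the talk.

```python
# Shape reported by the DataFrame summary on the slide.
ROWS = 9605            # DatetimeIndex entries
COLUMNS = 8103         # column entries
BYTES_PER_FLOAT64 = 8

n_values = ROWS * COLUMNS
n_bytes = n_values * BYTES_PER_FLOAT64

# Assumption: the slide's "MB" is 2**20 bytes.
megabytes = n_bytes / 2**20

# The slide's megabit figure matches the whole-MB value multiplied by 8.
megabits = int(megabytes) * 8

print(f"{n_values:,} float64 values")   # about 77M
print(f"{megabytes:.1f} MB")
print(f"{megabits:,} Mbits")
```

Under these assumptions the arithmetic reproduces the 77M values, 593MB, and 4,744Mbits quoted on the slide.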
  • 7. Overview – Databases
      Many different existing data stores
      • Relational databases
      • Tick databases
      • Flat files
      • HDF5 files
      • Caches
  • 8. Overview – Databases (same list as slide 7)
      Can we build one system to rule them all?
  • 9. Project Goals
      Goals
      • 10 years of 1 minute data in <1s
      • 200 instruments x all history x once a day data <1s
      • Single data store for all data types
      • 1x day data to Tick Data
      • Data versioning + Audit
      Requirements
      • Fast – most data in-memory
      • Complete – all data in single location
      • Scalable – unbounded in size and number of clients
      • Agile – rapid iterative development
  • 11. Implementation – Choosing MongoDB
      Impedance mismatch between Python/Pandas/Numpy and existing databases
      - Machine cluster operating on data blocks
        vs.
      - Database doing the analytical work
      MongoDB:
      - Developer productivity
      - Document maps to Python dictionary
      - Fast out of the box
      - Low latency
      - High throughput
      - Predictable performance
      - Sharding / Replication for growth and scale out
      - Free
      - Great support
      - Most widely used NoSQL DB
  • 12. Implementation – System Architecture
      Architecture diagram: Python clients (one labelled cn… on the slide) connect through three mongos routers to five shards, rs0 to rs4, each a mongod labelled 500GB, alongside three config servers.
      Example document:
      {'_id': ObjectId('…'),
       'c': 47,
       'columns': {
           'PRICE': {'data': Binary('...', 0), 'dtype': 'float64', 'rowmask': Binary('...', 0)},
           'SIZE': {'data': Binary('...', 0), 'dtype': 'int64'}},
       'endSeq': -1L,
       'index': Binary('...', 0),
       'segment': 1296568173000L,
       'sha': abcd123456,
       'start': 1296568173000L,
       'end': 1298569664000L,
       'symbol': 'AST1209',
       'v': 2}
  • 13. Implementation – Mongoose
      Data bucketed into named Libraries
      • One minute
      • Daily
      • User-data: jbloggs.EOD
      • Metadata Index
      Pluggable library types:
      • VersionStore
      • TickStore
      • Metadata store
      • … others …
      © Man 2013
  • 14. Implementation – Mongoose API (Mongoose key-value store)
      from ahl.mongo import Mongoose

      m = Mongoose('research')                     # Connect to the data store
      m.list_libraries()                           # What data libraries are available
      library = m['jbloggs.EOD']                   # Get a Library
      library.list_symbols()                       # List symbols
      library.write('SYMBOL', <TS or other data>)  # Write
      library.read('SYMBOL', version=…)            # Read, with an optional version
      library.snapshot('snapshot-name')            # Create a named snapshot of the library
      library.list_snapshots()                     # List the library's snapshots
  • 15. Implementation – Version Store (diagram: snapshots Snap A and Snap B pointing at symbol versions Sym1 v1, Sym2 v3 and Sym2 v4)
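The library API and the snapshot diagram suggest a store in which every write creates a new version of a symbol and a snapshot pins the versions current at that moment. The following is a minimal in-memory sketch of that idea, not the Mongoose implementation: the class name `VersionedLibrary`, the integer version numbers, and the `snapshot=` keyword on `read` are all hypothetical.

```python
class VersionedLibrary:
    """Toy in-memory library: each write appends a new version of a symbol."""

    def __init__(self):
        self._versions = {}   # symbol -> list of items; list index 0 is version 1
        self._snapshots = {}  # snapshot name -> {symbol: pinned version number}

    def write(self, symbol, item):
        """Store a new version and return its version number."""
        self._versions.setdefault(symbol, []).append(item)
        return len(self._versions[symbol])

    def list_symbols(self):
        return sorted(self._versions)

    def snapshot(self, name):
        """Pin the latest version of every symbol under a name."""
        if name in self._snapshots:
            raise ValueError("snapshot %r already exists" % name)
        self._snapshots[name] = {s: len(v) for s, v in self._versions.items()}

    def list_snapshots(self):
        return sorted(self._snapshots)

    def read(self, symbol, version=None, snapshot=None):
        """Read the latest version, a numbered version, or a snapshot's version."""
        if snapshot is not None:
            version = self._snapshots[snapshot][symbol]
        elif version is None:
            version = len(self._versions[symbol])
        return self._versions[symbol][version - 1]


library = VersionedLibrary()
library.write("Sym1", [1.0, 2.0])
library.write("Sym2", [10.0])
library.snapshot("Snap A")
library.write("Sym2", [10.0, 11.0])      # Snap A still sees the older Sym2

latest = library.read("Sym2")
pinned = library.read("Sym2", snapshot="Snap A")
```

Because a snapshot records only version numbers, taking one is cheap and later writes never disturb it, which is what makes an audit trail ("SVN for data") practical.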
  • 18. Implementation – VersionStore: Bringing it together
  • 19. Implementation – Arbitrary Data
      import cPickle
      import lz4
      from bson.binary import Binary

      _CHUNK_SIZE = 15 * 1024 * 1024  # 15MB

      class PickleStore(object):

          def write(self, collection, version, symbol, item):
              # Try to pickle it. This is best effort
              pickled = lz4.compressHC(cPickle.dumps(item))
              # Split the compressed pickle into segments of at most _CHUNK_SIZE bytes
              for i in xrange(len(pickled) // _CHUNK_SIZE + 1):
                  segment = {'data': Binary(pickled[i * _CHUNK_SIZE : (i + 1) * _CHUNK_SIZE])}
                  segment['segment'] = i
                  # checksum() is a helper that is not shown on the slide
                  sha = checksum(symbol, segment)
                  # Upsert keyed on content, so an unchanged segment gains a new parent version
                  collection.update({'symbol': symbol, 'sha': sha},
                                    {'$set': segment,
                                     '$addToSet': {'parent': version['_id']}},
                                    upsert=True)
  • 22. Implementation – Arbitrary Data
      import cPickle
      import lz4
      import pymongo

      class PickleStore(object):

          def read(self, collection, version, symbol):
              # Fetch this version's segments in order and reassemble the compressed pickle
              data = ''.join([x['data'] for x in collection.find({'symbol': symbol,
                                                                  'parent': version['_id']},
                                                                 sort=[('segment', pymongo.ASCENDING)])])
              return cPickle.loads(lz4.decompress(data))
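The write and read slides together describe a round trip: pickle, compress, split into segments, upsert each segment under a content checksum, then reassemble in segment order. The sketch below imitates that flow with the standard library only. It makes several assumptions that are not from the talk: a plain dict stands in for the MongoDB collection, `zlib` stands in for lz4, SHA-1 stands in for the unshown `checksum` helper, and the chunk size is tiny so that small objects can span several segments.

```python
import hashlib
import pickle
import zlib

CHUNK_SIZE = 64  # bytes; deliberately tiny (the slide uses 15MB)


def write(collection, version_id, symbol, item):
    """Pickle and compress `item`, then upsert it as content-addressed segments."""
    payload = zlib.compress(pickle.dumps(item))
    for offset in range(0, len(payload), CHUNK_SIZE):
        data = payload[offset:offset + CHUNK_SIZE]
        segment = offset // CHUNK_SIZE
        # Hypothetical stand-in for the slide's checksum(symbol, segment).
        sha = hashlib.sha1(f"{symbol}:{segment}:".encode() + data).hexdigest()
        # Emulates an upsert with $set and $addToSet: reuse the stored segment
        # if it exists, and record this version as one of its parents.
        doc = collection.setdefault(
            (symbol, sha),
            {"symbol": symbol, "segment": segment, "data": data, "parent": set()},
        )
        doc["parent"].add(version_id)


def read(collection, version_id, symbol):
    """Reassemble the segments belonging to one version, in segment order."""
    segments = sorted(
        (doc for doc in collection.values()
         if doc["symbol"] == symbol and version_id in doc["parent"]),
        key=lambda doc: doc["segment"],
    )
    payload = b"".join(doc["data"] for doc in segments)
    return pickle.loads(zlib.decompress(payload))


store = {}
item = {"prices": list(range(200))}
write(store, "v1", "SYM", item)
segments_after_first_write = len(store)
write(store, "v2", "SYM", item)          # identical content: no new segments
restored = read(store, "v2", "SYM")
```

Keying segments on a checksum means a second version with unchanged data stores nothing new and only adds a parent reference. Stepping through the payload with `range` also avoids the empty trailing segment that the slide's `len(pickled) / _CHUNK_SIZE + 1` loop produces when the length is an exact multiple of the chunk size.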
  • 23. Implementation – DataFrames
      import numpy as np
      from pandas import DataFrame

      def do_write(df, version):
          records = df.to_records()
          version['dtype'] = str(records.dtype)
          # Rows per chunk, so that chunks split on whole-record boundaries
          chunk_size = _CHUNK_SIZE / records.dtype.itemsize
          ... chunk_and_store ...

      def do_read(version):
          ... read_chunks ...
          data = ''.join(chunks)
          dtype = np.dtype(version['dtype'])
          recs = np.fromstring(data, dtype=dtype)
          return DataFrame.from_records(recs)
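The DataFrame path stores raw fixed-width records plus a description of their layout, and sizes chunks in whole rows rather than bytes. The sketch below shows the same idea with the standard library: a `struct` format string stands in for the NumPy dtype, and a list of byte strings stands in for the stored chunks. The record layout, the tiny chunk size, and the function signatures are illustrative assumptions, not the talk's implementation.

```python
import struct

CHUNK_SIZE = 48          # bytes per stored chunk; tiny for illustration
RECORD_FORMAT = "<qd"    # int64 timestamp + float64 price (hypothetical layout)


def do_write(rows, version):
    """Pack rows into raw bytes, splitting only on whole-record boundaries."""
    record = struct.Struct(RECORD_FORMAT)
    version["dtype"] = RECORD_FORMAT            # saved so the reader can decode
    rows_per_chunk = max(1, CHUNK_SIZE // record.size)
    chunks = []
    for start in range(0, len(rows), rows_per_chunk):
        block = rows[start:start + rows_per_chunk]
        chunks.append(b"".join(record.pack(*row) for row in block))
    return chunks


def do_read(chunks, version):
    """Join the chunks and decode them with the layout saved at write time."""
    record = struct.Struct(version["dtype"])
    return list(record.iter_unpack(b"".join(chunks)))


version = {}
rows = [(1296568173000 + i * 60000, 100.0 + i / 4) for i in range(7)]
chunks = do_write(rows, version)
restored = do_read(chunks, version)
```

Dividing the chunk size by the record size, as the slide does with `records.dtype.itemsize`, guarantees that no record is cut in half, so each chunk can be decoded on its own.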
  • 25. Results – Performance, Once a Day Data (chart: flat files on NFS, random market)
  • 26. Results – Performance, One Minute Data (chart: HDF5 files, random instrument)
  • 27. Results – TickStore, 8 parallel (chart: random E-Mini S&P contract from 2013)
  • 28. Results – TickStore (chart: random E-Mini S&P contract from 2013)
  • 29. Results – TickStore Throughput (chart: random E-Mini S&P contract from 2013)
  • 30. Results – System Load (chart: random E-Mini S&P contract from 2013; OtherTick compared with Mongo (x2), N Tasks = 32)
  • 31. Conclusions
      Built a system to store data of any shape and size
      - Reduced impedance between the Python language and the data store
      Low latency:
      - 1xDay data: 4ms for 10,000 rows (vs. 2,210ms from SQL)
      - OneMinute / Tick data: 1s for 3.5M rows in Python (vs. 15s – 40s+ from OtherTick)
      - 1s for 15M rows in Java
      Parallel access:
      - Cluster with 256+ concurrent data accesses
      - Consistent throughput – little load on the Mongo server
      Efficient:
      - 10-15x reduction in network load
      - Negligible decompression cost (lz4: 1.8Gb/s)