Scalable tabular (SFrame, SArray) and graph (SGraph) data structures built for out-of-core data analysis.
The SFrame package provides the complete implementation of:
SFrame
SArray
SGraph
The C++ SDK surface area (gl_sframe, gl_sarray, gl_sgraph)
6. 6
Python API
netflix_tr.frame (user, movie, rating)
sf = gl.SFrame.read_csv('netflix.csv')
(sf columns: user, item, rating)
netflix_norm.frame (user, movie, rating)
sf2 = gl.SFrame('netflix_norm.frame')
(sf2 columns: user, item, rating)
sf['nrating'] = sf2['rating']
(sf columns: user, item, rating, nrating)
diff = sf['rating'] - sf2['rating']
(diff: an anonymous column)
sf['diff'] = diff
(sf columns: user, item, rating, nrating, diff)
Not a SQL Frontend
Filtering
sf[sf['rating'] >= 3]
Joins
sf.join(user_table, on='user_id')
Random/Array indexing
row10 = sf[10]
table_with_every_other_row = sf[::2]
Rather Fast Parallelized UDFs (Interprocess SHM)
sf['rating'].apply(lambda x: x*x)
7. 7
Column Types Supported
• Boring Scalar Types
- int64, double, string
• Interesting Scalar Types
- datetime.datetime, image
• For the Mathematician Type
- array('d')
• For the "All Real Data Is Ugly" Type
- list, dict
(Arbitrary union types. Ex: a list can contain anything, including other lists and dicts.)
8. 8
What Are SFrames
Physical Storage Layer: Compressed Column Store (with some interesting properties)
Lazy Query Optimization / Execution: C++ Coroutine Exec Pipeline
Python API: Heavily Pandas Inspired (+ immutable data considerations)
File System Abstraction: Local, HDFS, S3, Cache
Type-aware compression methods. Very aggressive numeric compression.
Netflix Dataset,
99M rows, 3 columns, ints
1.4GB raw
289MB gzip compressed
160MB
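To illustrate why a type-aware codec can do well on integer columns, here is a minimal sketch of delta encoding followed by variable-length byte packing. This is an assumed, generic technique chosen for illustration; the slides do not say which codecs SFrame actually uses, and the column values below are made up.

```python
import struct
import zlib


def zigzag(n):
    """Map signed ints to unsigned ones so small negatives stay small."""
    return 2 * n if n >= 0 else -2 * n - 1


def unzigzag(z):
    """Inverse of zigzag()."""
    return (z >> 1) ^ -(z & 1)


def encode_varint(value, out):
    """Append an unsigned int to `out`, 7 payload bits per byte."""
    while value >= 0x80:
        out.append((value & 0x7F) | 0x80)
        value >>= 7
    out.append(value)


def compress_ints(values):
    """Delta-encode a column of ints, then zigzag and varint-pack the deltas."""
    out = bytearray()
    previous = 0
    for v in values:
        encode_varint(zigzag(v - previous), out)
        previous = v
    return bytes(out)


def decompress_ints(blob):
    """Reverse compress_ints(): unpack varints and undo the deltas."""
    values, previous = [], 0
    current, shift = 0, 0
    for byte in blob:
        current |= (byte & 0x7F) << shift
        if byte & 0x80:          # continuation bit set: more bytes follow
            shift += 7
        else:                    # last byte of this varint
            previous += unzigzag(current)
            values.append(previous)
            current, shift = 0, 0
    return values


column = list(range(1000, 2000))                    # made-up, slowly increasing ids
raw = struct.pack(f"<{len(column)}q", *column)      # 8 bytes per value
packed = compress_ints(column)
generic = zlib.compress(raw)                        # generic byte-level baseline

assert decompress_ints(packed) == column
print(len(raw), len(generic), len(packed))
```

Because the codec knows the column holds integers, it can store small differences instead of full 8-byte values; a generic compressor only sees bytes.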
9. 9
Query Planning
p['X4'] = p['X3'] + p['X2']
g = p[p['X1'] < 10]
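The two statements above can be read as building a plan rather than doing work. The following is a rough sketch of that idea in plain Python, with generators standing in for the coroutine pipeline. `LazyFrame`, `with_column`, `filter`, and `materialize` are hypothetical names for illustration only, not the SFrame API, and no query optimization is attempted.

```python
def _map_stage(stream, name, fn):
    """Pipeline stage: add column `name` to each row as it streams past."""
    for row in stream:
        row[name] = fn(row)
        yield row


def _filter_stage(stream, predicate):
    """Pipeline stage: pass through only rows matching `predicate`."""
    for row in stream:
        if predicate(row):
            yield row


class LazyFrame:
    """Hypothetical lazily evaluated table: operations are recorded, not run."""

    def __init__(self, rows, steps=()):
        self._rows = rows            # source rows, a list of dicts
        self._steps = tuple(steps)   # recorded operations, in order

    def with_column(self, name, fn):
        return LazyFrame(self._rows, self._steps + (("map", name, fn),))

    def filter(self, predicate):
        return LazyFrame(self._rows, self._steps + (("filter", None, predicate),))

    def plan(self):
        """Describe the recorded steps without executing them."""
        return [kind if name is None else f"{kind}:{name}"
                for kind, name, _ in self._steps]

    def materialize(self):
        """Chain all stages into one streaming pass and run it."""
        stream = (dict(row) for row in self._rows)   # copy rows; source stays untouched
        for kind, name, fn in self._steps:
            if kind == "map":
                stream = _map_stage(stream, name, fn)
            else:
                stream = _filter_stage(stream, fn)
        return list(stream)


calls = []


def add_x3_x2(row):
    calls.append(1)                  # count how often the expression is evaluated
    return row["X3"] + row["X2"]


p = LazyFrame([{"X1": i, "X2": i, "X3": 2 * i} for i in range(20)])
g = p.with_column("X4", add_x3_x2).filter(lambda row: row["X1"] < 10)

assert calls == []                   # nothing has executed yet
print(g.plan())
result = g.materialize()             # the whole plan runs here, in one pass
print(len(result))
```

A real planner could go further, for example by moving the filter ahead of the column computation; this sketch only shows the deferral.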
10. 10
Language Binding
• Python Bindings
- Our oldest binding.
- Via Cython + interprocess communication to a C++ binary.
• R Bindings
- Via our Rcpp C++11 bindings (exported in the SDK)
• C++11 Bindings
auto g = gl_sframe();
g["hello"] = gl_sarray::from_sequence(0, 1000);
g["world"] = 2;
g["hello"] = (g["hello"] / 2).astype(flex_type_enum::INTEGER);
auto ret = g.groupby({"hello"},
                     {{"sum of world", aggregate::SUM("world")}});
ret = ret.sort({"hello"});
cout << ret;
Columns:
    hello           integer
    sum of world    integer
Rows: 500
Data:
+----------------+----------------+
|     hello      |  sum of world  |
+----------------+----------------+
|       0        |       4        |
|       1        |       4        |
|       2        |       4        |
|       3        |       4        |
|       4        |       4        |
|       5        |       4        |
|       6        |       4        |
|       7        |       4        |
|       8        |       4        |
|       9        |       4        |
+----------------+----------------+
[500 rows x 2 columns]
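The numbers in the printed table can be checked with a plain-Python sketch of the same computation. It assumes from_sequence(0, 1000) yields the integers 0 through 999, which is consistent with the 500 rows shown; it uses no SFrame code.

```python
from collections import defaultdict

# "hello": 0..999 halved and cast to integer, so each key appears twice.
hello = [i // 2 for i in range(1000)]
# "world": the constant 2 in every row.
world = [2] * 1000

# Group by "hello" and sum "world" within each group.
sums = defaultdict(int)
for key, value in zip(hello, world):
    sums[key] += value

rows = sorted(sums.items())   # equivalent of sorting by "hello"
print(len(rows))
print(rows[:3])
```

Each of the 500 keys collects two rows with value 2, which is why every sum in the table is 4.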
11. 11
Common Crawl Graph
1x r3.8xlarge using 1x SSD.
3.5 billion Nodes and 128 billion Edges
PageRank: 9 min per iteration.
Connected Components: ~ 1 hr.
There isn’t any general purpose library out there capable of this.
Somewhat more expressive than SQL-backed dataframe solutions. It shares a lot more properties with Pandas than with SQL. You can append and modify columns, etc. The only thing you cannot do is modify individual values.
- Filtering and joins are standard.
- It is an actual table. Arbitrary indexing is fine. Sometimes it results in a materialization, which is costly, but once materialized, indexing is not too bad!
- Parallelized lambdas! The C++ process talks over interprocess shared memory to C++ workers with embedded libpython.
What are the
I have struggled to present this. It is really difficult to explain what this is.
Only recently did I figure out the reason.
It is not one thing.
It is really three or four things.
- A Python API, heavily Pandas-inspired. It does a ton of stuff, and has a rather nice scalable graph data structure to go with it.
- A physical storage layer: a heavily compressed column store with type-specific compression routines, especially aggressive for numeric types. It comes with a file system abstraction (for C++ people: fstream, general_fstream) that can read from many places.
A special “cache” filesystem, which is basically an “in-memory file” that dumps to disk when memory gets full. This is how we get compressed in-memory performance.
- And I am not even talking about our graph data structure. But talk to me if you want to hear more.
- The query engine: potentially the youngest part of the code base, and the one with the most bang for the buck if you come in and make improvements. Evaluation is lazy, so we can do query optimization, query planning, and query execution.
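The "in-memory file that dumps to disk when memory gets full" idea from the notes above has a close analogue in Python's standard library, tempfile.SpooledTemporaryFile. The sketch below only illustrates the behaviour; it is not how the SFrame cache filesystem is implemented, and the 1 KB threshold is an arbitrary choice. The `_rolled` attribute is a private CPython implementation detail, used here only to observe the spill.

```python
import tempfile

# Keep data in memory until more than 1024 bytes are written, then spill to disk.
with tempfile.SpooledTemporaryFile(max_size=1024) as cache:
    cache.write(b"x" * 512)
    in_memory_first = not cache._rolled   # still below the threshold

    cache.write(b"y" * 4096)
    spilled = cache._rolled               # threshold exceeded, now disk-backed

    # Reads work the same way regardless of where the bytes live.
    cache.seek(0)
    data = cache.read()

print(in_memory_first, spilled, len(data))
```

The caller sees one file-like object throughout; where the bytes are stored is decided behind the interface.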
Python SFrame API. Our oldest language binding. Why? We can talk about this another time; some of it is due to old design decisions. This does mean that copies from Python are slow. That said, the architecture makes it very easy to eliminate interprocess communication entirely, but there is one very interesting oddity which we have to resolve first.
R SFrame API (which we are trying to stabilize right now, and which will be released as open source as well; unfortunately under the GPL, as is traditional in R. But it really just wraps the C++11 SFrame API).
There are some other parts here which I am not talking about, for instance our graph data structure, which is optimized for bulk compute (not ...). But talk to me if you want to hear more.
If you were to try to represent this in memory, it would take a minimum of a TB of memory or so, excluding overheads.
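A back-of-the-envelope check of that figure, assuming "this" refers to the Common Crawl graph from the slide (3.5 billion nodes, 128 billion edges) and assuming a bare edge list with two 4-byte node ids per edge. The id width is an assumption for illustration, not a statement about SGraph's layout.

```python
NODES = 3_500_000_000
EDGES = 128_000_000_000
BYTES_PER_ID = 4                     # assumption: 32-bit node ids

# 3.5 billion node ids still fit in an unsigned 32-bit integer.
assert NODES < 2**32

# Each edge stores a source id and a destination id, nothing else.
edge_list_bytes = EDGES * 2 * BYTES_PER_ID
print(edge_list_bytes / 10**12)      # size in decimal terabytes
```

Under these assumptions the edge list alone is about one terabyte, before any index, vertex data, or allocator overhead.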
Q: Performance?
Pretty good. Single-machine performance is roughly comparable to a 5-node Spark or Hive cluster. There is still much room to improve: recent versions have had a regression as we switched out the query execution engine for something more “correct”.