SFrame

Scalable tabular (SFrame, SArray) and graph (SGraph) data structures built for out-of-core data analysis. The SFrame package provides the complete implementation of SFrame, SArray, and SGraph, along with the C++ SDK surface area (gl_sframe, gl_sarray, gl_sgraph).

SFrames
Yucheng Low
Chief Architect @ Dato

Scalable Machine Learning
recommenders, other task-oriented ML,
boosted decision trees, deep learning,
pattern mining, and many others
GraphLab Create
SGraph, SFrame
Local, HDFS, S3
Compressed in-core or
out-of-core scalable data structures
C++11

SGraph, SFrame
Local, HDFS, S3
Compressed in-core or
out-of-core scalable data structures
https://github.com/dato-code/sframe

Python API
import graphlab as gl
sf = gl.SFrame.read_csv('netflix.csv')
sf2 = gl.SFrame('netflix_norm.frame')
sf['nrating'] = sf2['rating']
Stored frames: netflix_tr.frame (user, movie, rating), netflix_norm.frame (user, movie, rating)
sf2 columns: user, item, rating
sf columns: user, item, rating, nrating

Python API
diff = sf['rating'] - sf2['rating']
diff: an anonymous column (not yet part of any SFrame)
sf columns: user, item, rating, nrating

sf['diff'] = diff
sf columns: user, item, rating, nrating, diff
Not a SQL Frontend
• Filtering
- sf[sf['rating'] >= 3]
• Joins
- sf.join(user_table, on='user_id')
• Random/Array indexing
- row10 = sf[10]
- table_with_every_other_row = sf[::2]
• Rather fast parallelized UDFs (interprocess shared memory)
- sf['rating'].apply(lambda x: x*x)
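The column operations on this slide can be mimicked in plain Python to show the semantics the notes describe: every operation returns a new column, and individual values are never modified in place. This is a minimal sketch only. `Column` is a hypothetical stand-in written for illustration, not the SArray class or any part of the SFrame API, and the rating values are made up.

```python
class Column:
    """Hypothetical, immutable stand-in for an SArray (illustration only)."""

    def __init__(self, values):
        self._values = tuple(values)  # tuple: individual values cannot be modified

    def __len__(self):
        return len(self._values)

    def __iter__(self):
        return iter(self._values)

    def __getitem__(self, key):
        if isinstance(key, Column):          # boolean mask: col[col >= 3]
            return Column(v for v, keep in zip(self._values, key) if keep)
        if isinstance(key, slice):           # slicing: col[::2]
            return Column(self._values[key])
        return self._values[key]             # single element: col[10]

    def __ge__(self, other):
        return Column(v >= other for v in self._values)

    def __sub__(self, other):
        return Column(a - b for a, b in zip(self._values, other))

    def apply(self, fn):
        return Column(fn(v) for v in self._values)

    def to_list(self):
        return list(self._values)


rating = Column([5, 2, 3, 1, 4])         # made-up data
nrating = Column([4, 2, 1, 1, 3])        # made-up data

diff = rating - nrating                  # new column; inputs are untouched
kept = rating[rating >= 3]               # filtering with a boolean mask
every_other = rating[::2]                # slice indexing
squared = rating.apply(lambda x: x * x)  # element-wise UDF

print(diff.to_list(), kept.to_list(), every_other.to_list(), squared.to_list())
```

Storing the values in a tuple is one simple way to make the "no in-place edits" rule hold; the real library enforces it at the storage layer instead.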

Column Types Supported
• Boring Scalar Types
- int64, double, string
• Interesting Scalar Types
- datetime.datetime, image
• For the Mathematician Type
- array('d')
• For the "All Real Data Is Ugly" Type
- list, dict
(Arbitrary union types. Ex: a list can contain anything,
including other lists and dicts.)
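For readers less familiar with the Python-side value types named above, here is a small sketch of one row that uses each of them (the image type is omitted). It uses only the standard library, and the field names and values are invented for illustration; it does not call SFrame.

```python
from array import array
from datetime import datetime

# One made-up row using the Python value types listed on the slide.
row = {
    "user": 42,                                       # int64
    "rating": 4.5,                                    # double
    "title": "Some Movie",                            # string
    "watched_at": datetime(2015, 9, 1, 20, 30),       # datetime.datetime
    "embedding": array("d", [0.1, 0.2, 0.3]),         # array('d'): packed doubles
    "tags": ["drama", 1999, ["nested", {"k": 1}]],    # list: arbitrary nesting
    "meta": {"source": "csv", "scores": [1, 2.5]},    # dict
}

for name, value in row.items():
    print(f"{name:>10}: {type(value).__name__}")
```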

What Are SFrames
• Physical Storage Layer
- Compressed column store (with some interesting properties)
• Lazy Query Optimization / Execution
- C++ coroutine execution pipeline
• Python API
- Heavily Pandas-inspired (+ immutable data considerations)
• File System Abstraction
- Local, HDFS, S3, Cache
Type-aware compression methods. Very aggressive numeric compression.
Netflix dataset, 99M rows, 3 columns, ints:
- 1.4GB raw
- 289MB gzip compressed
- 160MB as an SFrame (about 8.75x smaller than raw, about 1.8x smaller than gzip)
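To give a feel for why type-aware numeric compression can beat a general-purpose compressor on integer columns, here is a minimal sketch of one common technique: delta encoding followed by variable-length integers. This is an assumption chosen for illustration. The slides do not describe SFrame's actual codec, and the function names and the synthetic `user_ids` data are hypothetical.

```python
import random
import zlib


def zigzag(n):
    """Map signed ints to unsigned so small magnitudes stay small."""
    return n * 2 if n >= 0 else -n * 2 - 1


def unzigzag(z):
    return z // 2 if z % 2 == 0 else -(z + 1) // 2


def encode(values):
    """Delta-encode, then write each delta as a variable-length integer."""
    out = bytearray()
    prev = 0
    for v in values:
        z = zigzag(v - prev)
        prev = v
        while z >= 0x80:                 # 7 payload bits per byte
            out.append((z & 0x7F) | 0x80)
            z >>= 7
        out.append(z)
    return bytes(out)


def decode(data):
    values, prev, z, shift = [], 0, 0, 0
    for byte in data:
        z |= (byte & 0x7F) << shift
        if byte & 0x80:                  # continuation bit set: more bytes follow
            shift += 7
            continue
        prev += unzigzag(z)
        values.append(prev)
        z, shift = 0, 0
    return values


random.seed(0)
# Synthetic sorted integer column (made-up data, not the Netflix dataset).
user_ids = sorted(random.randrange(500_000) for _ in range(100_000))

raw_text = "\n".join(map(str, user_ids)).encode()
packed = encode(user_ids)

print("raw text bytes:     ", len(raw_text))
print("zlib on raw text:   ", len(zlib.compress(raw_text)))
print("delta + varint:     ", len(packed))
```

The codec knows the column holds integers, so it can exploit structure (small differences between neighbors) that a byte-oriented compressor only sees indirectly.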

Query Planning
p['X4'] = p['X3'] + p['X2']
g = p[p['X1'] < 10]
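The two statements above only build a query; nothing needs to run until a result is requested. Here is a minimal sketch of that idea in plain Python, with generators standing in for the C++ coroutine pipeline. `Lazy` is a hypothetical class written for this illustration, not SFrame's query engine, and the column values are made up.

```python
class Lazy:
    """Hypothetical lazy column: stores a recipe, computes rows on demand."""

    def __init__(self, make_iter, plan):
        self._make_iter = make_iter   # zero-arg callable returning a fresh iterator
        self.plan = plan              # human-readable description of the query

    @classmethod
    def source(cls, name, values):
        return cls(lambda: iter(values), name)

    def __add__(self, other):
        return Lazy(
            lambda: (a + b for a, b in zip(self._make_iter(), other._make_iter())),
            f"({self.plan} + {other.plan})",
        )

    def __lt__(self, scalar):
        return Lazy(
            lambda: (a < scalar for a in self._make_iter()),
            f"({self.plan} < {scalar})",
        )

    def filter(self, mask):
        return Lazy(
            lambda: (a for a, keep in zip(self._make_iter(), mask._make_iter()) if keep),
            f"filter({self.plan}, {mask.plan})",
        )

    def materialize(self):
        return list(self._make_iter())


x1 = Lazy.source("X1", [3, 12, 7, 20])
x2 = Lazy.source("X2", [1, 2, 3, 4])
x3 = Lazy.source("X3", [10, 20, 30, 40])

x4 = x3 + x2                 # p['X4'] = p['X3'] + p['X2']: nothing computed yet
g_x4 = x4.filter(x1 < 10)    # g = p[p['X1'] < 10], restricted to column X4

print(g_x4.plan)             # the recorded plan, available before execution
print(g_x4.materialize())    # rows stream through the whole chain in one pass
```

Because the plan exists as data before anything executes, an optimizer has the chance to rewrite it, for example by applying the filter before the addition.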

Language Binding
• Python Bindings
- Our oldest binding. Via Cython + interprocess communication to a C++ binary.
• R Bindings
- Via Rcpp to our C++11 bindings (exported in SDK)
• C++11 Bindings
auto g = gl_sframe();
g["hello"] = gl_sarray::from_sequence(0, 1000);
g["world"] = 2;
g["hello"] = (g["hello"] / 2).astype(flex_type_enum::INTEGER);
auto ret = g.groupby({"hello"}, {{"sum of world", aggregate::SUM("world")}});
ret = ret.sort({"hello"});
cout << ret;
Output:
Columns:
    hello           integer
    sum of world    integer
Rows: 500
Data:
+----------------+----------------+
| hello          | sum of world   |
+----------------+----------------+
| 0              | 4              |
| 1              | 4              |
| 2              | 4              |
| 3              | 4              |
| 4              | 4              |
| 5              | 4              |
| 6              | 4              |
| 7              | 4              |
| 8              | 4              |
| 9              | 4              |
+----------------+----------------+
[500 rows x 2 columns]
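The printed result of the C++ example can be checked by hand with a few lines of plain Python. This sketch assumes `from_sequence(0, 1000)` produces 0 through 999 (end-exclusive), which is consistent with the 500 rows shown; it reproduces the arithmetic only and does not use the SFrame bindings.

```python
from collections import defaultdict

hello = [i // 2 for i in range(0, 1000)]   # sequence 0..999, halved, cast to int
world = [2] * len(hello)                   # constant column of 2s

sums = defaultdict(int)
for key, value in zip(hello, world):
    sums[key] += value                     # SUM("world") grouped by "hello"

result = sorted(sums.items())              # sort by "hello"
print(len(result), result[:3])
```

Each value of `hello` appears exactly twice after halving, so every group sums to 2 + 2 = 4 and there are 500 groups, matching the table.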

Common Crawl Graph
1x r3.8xlarge using 1x SSD.
3.5 billion nodes and 128 billion edges
PageRank: 9 min per iteration.
Connected Components: ~1 hr.
There isn’t any general-purpose library out there capable of this.

https://github.com/dato-code/sframe
pip install sframe

Editor's notes

Scalable semi-out-of-core computation.
Fix and update this slide. Align with stages? Can we discuss pricing here?
Somewhat more expressive than SQL-backed dataframe solutions. It shares a lot more properties with Pandas than with SQL. You can append, modify columns, etc. The only thing you cannot do is modify individual values.
- Filtering and joins are standard.
- It is an actual table, so arbitrary indexing is fine. Sometimes it might result in a materialization, which is costly. But once materialized, indexing is not too bad!
- Parallelized lambdas! C++ process -> interprocess shared memory -> C++ with embedded libpython.
What are the
I have struggled to present this. It is really difficult to explain what this is. Only recently did I figure out the reason: it is not one thing. It is really three or four things.
- A Python API, heavily Pandas-inspired. It does a ton of stuff. It also has a rather nice scalable graph data structure to go with it.
- A physical storage layer. It is a heavily compressed column store with type-specific compression routines, especially aggressive for numeric types. It comes with a file system abstraction (for C++ people: fstream, general_fstream) that can read from many places. There is also a special "cache" filesystem, which is basically an "in-memory file" that dumps to disk when memory gets full. This is how we get compressed in-memory performance.
- And I am not even talking about our graph data structure either. But talk to me if you want to hear more.
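The "cache" filesystem idea described above, an in-memory file that dumps to disk when memory gets full, can be sketched in a few lines of Python. `SpillBuffer` and its `limit` parameter are hypothetical names for this illustration; this is not how SFrame implements its cache, only the general pattern.

```python
import io
import tempfile


class SpillBuffer:
    """Hypothetical write buffer: in memory until `limit` bytes, then on disk."""

    def __init__(self, limit):
        self._limit = limit
        self._file = io.BytesIO()
        self.spilled = False

    def write(self, data):
        if not self.spilled and self._file.tell() + len(data) > self._limit:
            disk = tempfile.TemporaryFile()      # deleted automatically on close
            disk.write(self._file.getvalue())    # move what we have so far to disk
            self._file = disk
            self.spilled = True
        return self._file.write(data)

    def read_all(self):
        self._file.seek(0)
        data = self._file.read()
        self._file.seek(0, io.SEEK_END)          # keep appending after a read
        return data

    def close(self):
        self._file.close()


buf = SpillBuffer(limit=16)
buf.write(b"small")
in_memory_before = not buf.spilled               # still below the limit
buf.write(b" but then much larger than the limit")
spilled_after = buf.spilled                      # the second write crossed it
contents = buf.read_all()
buf.close()

print(in_memory_before, spilled_after, contents)
```

Python's standard library ships the same pattern as `tempfile.SpooledTemporaryFile`, whose `max_size` argument plays the role of `limit` here.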
- Potentially the youngest part of the code base, and the one with the most bang for the buck if you come in and make improvements now, is the query engine. Evaluation is lazy, so we can do query optimization, query planning, and query execution.
Python SFrame API: our oldest language binding. Why? We can talk about this another time; some of it is due to old design decisions. This does mean that copies from Python are slow. That said, the architecture makes it very easy to eliminate interprocess communication entirely, but there is one very interesting oddity which we have to resolve first. R SFrame API: we are trying to stabilize it right now, and it will be released as open source as well (unfortunately under the GPL, as is traditional in R). It really just wraps the C++11 SFrame API.
There are some other parts here which I am not talking about, for instance our graph data structure, which is optimized for bulk compute (not ...). But talk to me if you want to hear more. If you were to try to represent this graph in memory, it would take a minimum of a TB of memory or so, excluding overheads.
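The "minimum of a TB" figure is consistent with a bare edge list for the Common Crawl graph quoted in the slides (3.5 billion nodes, 128 billion edges). The back-of-the-envelope check below assumes each edge is stored as a pair of 4-byte node ids with no other overhead; that storage layout is an assumption made for this estimate, not a description of SGraph.

```python
nodes = 3_500_000_000
edges = 128_000_000_000

# Assumption: 32-bit ids are enough because the node count fits below 2**32.
fits_in_32_bits = nodes < 2**32
bytes_per_id = 4

edge_list_bytes = edges * 2 * bytes_per_id   # (source, target) per edge

print(fits_in_32_bits, f"{edge_list_bytes / 1e12:.2f} TB")
```

That comes to about 1.02 TB for the edges alone, before any indexes, vertex data, or allocator overhead.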
Q: Performance? Pretty good. Single-machine performance is roughly comparable to a 5-node Spark or Hive cluster. There is still much room to improve: recent versions have had a regression, as we switched out the query execution engine for something more "correct".