SFrames
Yucheng Low
Chief Architect @ Dato
Scalable Machine Learning
recommenders, other task-oriented ML,
boosted decision trees, deep learning,
pattern mining, and many others
GraphLab Create
SGraph, SFrame
Local, HDFS, S3
2
Compressed in-core or out-of-core scalable data structures
C++11
SGraph, SFrame
Local, HDFS, S3
3
Compressed in-core or out-of-core scalable data structures
https://github.com/dato-code/sframe
4
Python API
import graphlab as gl
netflix_tr.frame (columns: user, movie, rating)
sf = gl.SFrame.read_csv('netflix.csv')
netflix_norm.frame (columns: user, movie, rating)
sf2 = gl.SFrame('netflix_norm.frame')
sf2 (columns: user, item, rating)
sf['nrating'] = sf2['rating']
sf (columns: user, item, rating, nrating)
5
Python API
netflix_tr.frame (columns: user, movie, rating)
sf = gl.SFrame.read_csv('netflix.csv')
netflix_norm.frame (columns: user, movie, rating)
sf2 = gl.SFrame('netflix_norm.frame')
sf2 (columns: user, item, rating)
sf['nrating'] = sf2['rating']
sf (columns: user, item, rating, nrating)
diff = sf['rating'] - sf2['rating']
diff (a single anonymous column)
6
Python API
netflix_tr.frame (columns: user, movie, rating)
sf = gl.SFrame.read_csv('netflix.csv')
netflix_norm.frame (columns: user, movie, rating)
sf2 = gl.SFrame('netflix_norm.frame')
sf2 (columns: user, item, rating)
sf['nrating'] = sf2['rating']
diff = sf['rating'] - sf2['rating']
diff (a single anonymous column)
sf['diff'] = diff
sf (columns: user, item, rating, nrating, diff)
Not a SQL Frontend
Filtering
sf[sf['rating'] >= 3]
Joins
sf.join(user_table, on='user_id')
Random/Array indexing
row10 = sf[10]
table_with_every_other_row = sf[::2]
Rather fast parallelized UDFs (interprocess shared memory)
sf['rating'].apply(lambda x: x * x)
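To make the semantics of the operations above concrete, here is a minimal pure-Python sketch of a column that supports mask filtering, slicing, integer indexing, and element-wise UDFs. The `Column` class is hypothetical and purely illustrative; it is not how SFrame is implemented and does nothing in parallel or out of core.

```python
# Illustrative only: a tiny in-memory column, NOT SFrame's implementation.
class Column:
    def __init__(self, values):
        self.values = list(values)

    def __ge__(self, threshold):
        # Element-wise comparison yields a boolean mask column.
        return Column(v >= threshold for v in self.values)

    def __getitem__(self, key):
        if isinstance(key, Column):
            # Boolean-mask filtering: keep values where the mask is True.
            return Column(v for v, keep in zip(self.values, key.values) if keep)
        if isinstance(key, slice):
            # Slice indexing, e.g. col[::2] for every other element.
            return Column(self.values[key])
        # Integer indexing returns a single element.
        return self.values[key]

    def apply(self, fn):
        # Element-wise user-defined function.
        return Column(fn(v) for v in self.values)


rating = Column([5, 2, 3, 1, 4])
print(rating[rating >= 3].values)            # [5, 3, 4]
print(rating[::2].values)                    # [5, 3, 4]
print(rating[3])                             # 1
print(rating.apply(lambda x: x * x).values)  # [25, 4, 9, 1, 16]
```

The point of the sketch is that `sf[sf['rating'] >= 3]` is two steps: the comparison builds a boolean column, and indexing with that column selects rows.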
7
Column Types Supported
• Boring scalar types
- int64, double, string
• Interesting scalar types
- datetime.datetime, image
• For the mathematician type
- array('d')
• For the "all real data is ugly" types
- list, dict
(Arbitrary union types. Ex: a list can contain anything,
including other lists and dicts.)
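As a rough illustration of the list above, the following sketch shows one Python value per listed type, using only the standard library. The helper `slide_type_name` and the sample row are hypothetical names made up for this example; the image type is left out because it has no standard-library counterpart.

```python
import datetime
from array import array


def slide_type_name(value):
    """Hypothetical helper: map a Python value to a type name from the list above."""
    if isinstance(value, int):
        return "int64"
    if isinstance(value, float):
        return "double"
    if isinstance(value, str):
        return "string"
    if isinstance(value, datetime.datetime):
        return "datetime"
    if isinstance(value, array) and value.typecode == "d":
        return "array('d')"
    if isinstance(value, list):
        return "list"
    if isinstance(value, dict):
        return "dict"
    raise TypeError("no matching column type for %r" % (value,))


# Made-up sample row; note the list cell nests a list and a dict.
row = {
    "user": 42,
    "score": 4.5,
    "name": "alice",
    "seen": datetime.datetime(2015, 6, 1, 12, 0),
    "features": array("d", [0.1, 0.2]),
    "tags": [1, "two", [3.0, {"four": 4}]],
    "profile": {"langs": ["en", "fr"]},
}
for column, value in row.items():
    print(column, "->", slide_type_name(value))
```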
8
What Are SFrames
Physical Storage Layer: compressed column store (with some interesting properties)
Lazy Query Optimization / Execution: C++ coroutine execution pipeline
Python API: heavily Pandas-inspired (+ immutable data considerations)
File System Abstraction: Local, HDFS, S3, Cache
Type-aware compression methods. Very aggressive numeric compression.
Netflix dataset: 99M rows, 3 columns, ints
1.4GB raw
289MB gzip compressed
160MB
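To give a feel for why type-aware numeric compression can beat a generic byte compressor on integer columns, here is a small sketch of delta encoding followed by variable-length integers. This is a generic textbook technique chosen for illustration; the slides do not say which codecs SFrame uses, and all function names and the sample column are assumptions.

```python
import struct
import zlib


def encode_varint(n):
    """Encode a non-negative integer using 7 payload bits per byte."""
    out = bytearray()
    while True:
        byte = n & 0x7F
        n >>= 7
        if n:
            out.append(byte | 0x80)  # high bit set: more bytes follow
        else:
            out.append(byte)
            return bytes(out)


def zigzag(n):
    """Map signed integers to unsigned so small magnitudes stay small."""
    return (n << 1) if n >= 0 else ((-n << 1) - 1)


def unzigzag(z):
    return (z >> 1) if z % 2 == 0 else -((z + 1) >> 1)


def delta_encode(values):
    """Store each value as the varint-encoded difference from its predecessor."""
    out = bytearray()
    previous = 0
    for v in values:
        out += encode_varint(zigzag(v - previous))
        previous = v
    return bytes(out)


def delta_decode(data):
    values, previous = [], 0
    shift = acc = 0
    for byte in data:
        acc |= (byte & 0x7F) << shift
        if byte & 0x80:
            shift += 7
        else:
            previous += unzigzag(acc)
            values.append(previous)
            shift = acc = 0
    return values


# Hypothetical column: sorted user ids with many repeats, as in a ratings table.
user_ids = [1000 + i // 50 for i in range(10_000)]
raw = struct.pack("<%dq" % len(user_ids), *user_ids)  # 8 bytes per int64
encoded = delta_encode(user_ids)

assert delta_decode(encoded) == user_ids
print(len(raw), len(zlib.compress(raw)), len(encoded))
```

The printed sizes compare raw int64 storage, zlib on the raw bytes, and the delta/varint encoding for this made-up column; real ratios depend entirely on the data.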
9
Query Planning
Physical Storage Layer: compressed column store (with some interesting properties)
Lazy Query Optimization / Execution: C++ coroutine execution pipeline
Python API: heavily Pandas-inspired (+ immutable data considerations)
File System Abstraction: Local, HDFS, S3, Cache
p['X4'] = p['X3'] + p['X2']
g = p[p['X1'] < 10]
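The two statements above only describe work; with lazy evaluation nothing needs to run until rows are actually requested. The sketch below shows that idea in pure Python. `LazyTable` and its methods are hypothetical names for this example, and the real engine is a C++ pipeline with an optimizer, which this sketch does not attempt to model.

```python
class LazyTable:
    """Records per-row operations and runs them only when iterated."""

    def __init__(self, rows, steps=()):
        self._rows = rows            # source rows: a list of dicts
        self._steps = tuple(steps)   # deferred operations, in order

    def with_column(self, name, fn):
        # Returns a new table; the original is left untouched (immutable style).
        def step(row):
            new_row = dict(row)
            new_row[name] = fn(row)
            return new_row
        return LazyTable(self._rows, self._steps + (("map", step),))

    def filter(self, predicate):
        return LazyTable(self._rows, self._steps + (("filter", predicate),))

    def __iter__(self):
        # All recorded steps run in one fused pass over the source rows.
        for row in self._rows:
            keep = True
            for kind, fn in self._steps:
                if kind == "map":
                    row = fn(row)
                elif not fn(row):
                    keep = False
                    break
            if keep:
                yield row


evaluated = []  # records each time the column expression actually runs


def add_x3_x2(row):
    evaluated.append(row["X1"])
    return row["X3"] + row["X2"]


source = [{"X1": i, "X2": i * 2, "X3": i * 3} for i in range(20)]
p = LazyTable(source).with_column("X4", add_x3_x2)   # p['X4'] = p['X3'] + p['X2']
g = p.filter(lambda row: row["X1"] < 10)             # g = p[p['X1'] < 10]

print(len(evaluated))                # 0: nothing has run yet
result = list(g)                     # materialization triggers the pipeline
print(len(result), len(evaluated))   # 10 20
```

Note that this sketch computes X4 for all 20 rows even though only 10 survive the filter. Because the filter does not depend on X4, a query optimizer could run the filter first and skip half of that work; the sketch leaves the steps in the order written.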
10
Language Binding
• Python Bindings
- Our oldest binding.
Via Cython + interprocess communication to a C++ binary.
• R Bindings
- Via our Rcpp C++11 bindings (exported in the SDK)
• C++11 Bindings
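// Build an SFrame with two columns.
auto g = gl_sframe();
g["hello"] = gl_sarray::from_sequence(0, 1000);
g["world"] = 2;
// Halve "hello" and cast to integer, so each key appears twice.
g["hello"] = (g["hello"] / 2).astype(flex_type_enum::INTEGER);
// Sum "world" within each "hello" group, then sort by key.
auto ret = g.groupby({"hello"},
                     {{"sum of world", aggregate::SUM("world")}});
ret = ret.sort({"hello"});
cout << ret;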
Columns:
    hello           integer
    sum of world    integer
Rows: 500
Data:
+-------+--------------+
| hello | sum of world |
+-------+--------------+
| 0     | 4            |
| 1     | 4            |
| 2     | 4            |
| 3     | 4            |
| 4     | 4            |
| 5     | 4            |
| 6     | 4            |
| 7     | 4            |
| 8     | 4            |
| 9     | 4            |
+-------+--------------+
[500 rows x 2 columns]
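The output above can be checked by hand with a few lines of plain Python. This sketch assumes that from_sequence(0, 1000) yields the integers 0 through 999 (end exclusive), which is consistent with the 500 rows reported; it uses no SFrame API at all.

```python
from collections import defaultdict

# Assumption: from_sequence(0, 1000) produces 0..999 (end exclusive).
hello = [value // 2 for value in range(0, 1000)]  # 0, 0, 1, 1, 2, 2, ...
world = [2] * len(hello)                          # constant column of 2s

# Group by "hello" and sum "world" within each group.
sums = defaultdict(int)
for key, value in zip(hello, world):
    sums[key] += value

rows = sorted(sums.items())
print(len(rows))   # 500
print(rows[:3])    # [(0, 4), (1, 4), (2, 4)]
```

Each key appears exactly twice after halving, and each occurrence contributes 2, which is why every group sums to 4.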
11
Common Crawl Graph
1x r3.8xlarge using 1x SSD.
3.5 billion nodes and 128 billion edges
PageRank: 9 min per iteration.
Connected Components: ~1 hr.
There isn’t any general-purpose library out there capable of this.
12
https://github.com/dato-code/sframe
pip install sframe

Contenu connexe

Tendances

Spark Summit EU 2015: Reynold Xin Keynote
Spark Summit EU 2015: Reynold Xin KeynoteSpark Summit EU 2015: Reynold Xin Keynote
Spark Summit EU 2015: Reynold Xin KeynoteDatabricks
 
Foundations for Scaling ML in Apache Spark by Joseph Bradley at BigMine16
Foundations for Scaling ML in Apache Spark by Joseph Bradley at BigMine16Foundations for Scaling ML in Apache Spark by Joseph Bradley at BigMine16
Foundations for Scaling ML in Apache Spark by Joseph Bradley at BigMine16BigMine
 
H2O Overview with Amy Wang at useR! Aalborg
H2O Overview with Amy Wang at useR! AalborgH2O Overview with Amy Wang at useR! Aalborg
H2O Overview with Amy Wang at useR! AalborgSri Ambati
 
What's New in Apache Spark 2.3 & Why Should You Care
What's New in Apache Spark 2.3 & Why Should You CareWhat's New in Apache Spark 2.3 & Why Should You Care
What's New in Apache Spark 2.3 & Why Should You CareDatabricks
 
Arun Rathinasabapathy, Senior Software Engineer, LexisNexis at MLconf ATL 2016
Arun Rathinasabapathy, Senior Software Engineer, LexisNexis at MLconf ATL 2016Arun Rathinasabapathy, Senior Software Engineer, LexisNexis at MLconf ATL 2016
Arun Rathinasabapathy, Senior Software Engineer, LexisNexis at MLconf ATL 2016MLconf
 
Ehtsham Elahi, Senior Research Engineer, Personalization Science and Engineer...
Ehtsham Elahi, Senior Research Engineer, Personalization Science and Engineer...Ehtsham Elahi, Senior Research Engineer, Personalization Science and Engineer...
Ehtsham Elahi, Senior Research Engineer, Personalization Science and Engineer...MLconf
 
Predictive churn h20_dsx
Predictive churn h20_dsxPredictive churn h20_dsx
Predictive churn h20_dsxNdjido Ardo BAR
 
Distributed TensorFlow on Hadoop, Mesos, Kubernetes, Spark
Distributed TensorFlow on Hadoop, Mesos, Kubernetes, SparkDistributed TensorFlow on Hadoop, Mesos, Kubernetes, Spark
Distributed TensorFlow on Hadoop, Mesos, Kubernetes, SparkJan Wiegelmann
 
Spark Summit EU 2015: Combining the Strengths of MLlib, scikit-learn, and R
Spark Summit EU 2015: Combining the Strengths of MLlib, scikit-learn, and RSpark Summit EU 2015: Combining the Strengths of MLlib, scikit-learn, and R
Spark Summit EU 2015: Combining the Strengths of MLlib, scikit-learn, and RDatabricks
 
Extending the google_assistant
Extending the google_assistantExtending the google_assistant
Extending the google_assistantNdjido Ardo BAR
 
Mathias Brandewinder, Software Engineer & Data Scientist, Clear Lines Consult...
Mathias Brandewinder, Software Engineer & Data Scientist, Clear Lines Consult...Mathias Brandewinder, Software Engineer & Data Scientist, Clear Lines Consult...
Mathias Brandewinder, Software Engineer & Data Scientist, Clear Lines Consult...MLconf
 
V like Velocity, Predicting in Real-Time with Azure ML
V like Velocity, Predicting in Real-Time with Azure MLV like Velocity, Predicting in Real-Time with Azure ML
V like Velocity, Predicting in Real-Time with Azure MLBarbara Fusinska
 
Python and H2O with Cliff Click at PyData Dallas 2015
Python and H2O with Cliff Click at PyData Dallas 2015Python and H2O with Cliff Click at PyData Dallas 2015
Python and H2O with Cliff Click at PyData Dallas 2015Sri Ambati
 
VEGAS: The Missing Matplotlib for Scala/Apache Spark with Roger Menezes and D...
VEGAS: The Missing Matplotlib for Scala/Apache Spark with Roger Menezes and D...VEGAS: The Missing Matplotlib for Scala/Apache Spark with Roger Menezes and D...
VEGAS: The Missing Matplotlib for Scala/Apache Spark with Roger Menezes and D...Spark Summit
 
Resource-Efficient Deep Learning Model Selection on Apache Spark
Resource-Efficient Deep Learning Model Selection on Apache SparkResource-Efficient Deep Learning Model Selection on Apache Spark
Resource-Efficient Deep Learning Model Selection on Apache SparkDatabricks
 
Managing data workflows with Luigi
Managing data workflows with LuigiManaging data workflows with Luigi
Managing data workflows with LuigiTeemu Kurppa
 
sparklyr - Jeff Allen
sparklyr - Jeff Allensparklyr - Jeff Allen
sparklyr - Jeff AllenSri Ambati
 
h2oensemble with Erin Ledell at useR! Aalborg
h2oensemble with Erin Ledell at useR! Aalborgh2oensemble with Erin Ledell at useR! Aalborg
h2oensemble with Erin Ledell at useR! AalborgSri Ambati
 
Unifying State-of-the-Art AI and Big Data in Apache Spark with Reynold Xin
Unifying State-of-the-Art AI and Big Data in Apache Spark with Reynold XinUnifying State-of-the-Art AI and Big Data in Apache Spark with Reynold Xin
Unifying State-of-the-Art AI and Big Data in Apache Spark with Reynold XinDatabricks
 
Intro to Python Data Analysis in Wakari
Intro to Python Data Analysis in WakariIntro to Python Data Analysis in Wakari
Intro to Python Data Analysis in WakariKarissa Rae McKelvey
 

Tendances (20)

Spark Summit EU 2015: Reynold Xin Keynote
Spark Summit EU 2015: Reynold Xin KeynoteSpark Summit EU 2015: Reynold Xin Keynote
Spark Summit EU 2015: Reynold Xin Keynote
 
Foundations for Scaling ML in Apache Spark by Joseph Bradley at BigMine16
Foundations for Scaling ML in Apache Spark by Joseph Bradley at BigMine16Foundations for Scaling ML in Apache Spark by Joseph Bradley at BigMine16
Foundations for Scaling ML in Apache Spark by Joseph Bradley at BigMine16
 
H2O Overview with Amy Wang at useR! Aalborg
H2O Overview with Amy Wang at useR! AalborgH2O Overview with Amy Wang at useR! Aalborg
H2O Overview with Amy Wang at useR! Aalborg
 
What's New in Apache Spark 2.3 & Why Should You Care
What's New in Apache Spark 2.3 & Why Should You CareWhat's New in Apache Spark 2.3 & Why Should You Care
What's New in Apache Spark 2.3 & Why Should You Care
 
Arun Rathinasabapathy, Senior Software Engineer, LexisNexis at MLconf ATL 2016
Arun Rathinasabapathy, Senior Software Engineer, LexisNexis at MLconf ATL 2016Arun Rathinasabapathy, Senior Software Engineer, LexisNexis at MLconf ATL 2016
Arun Rathinasabapathy, Senior Software Engineer, LexisNexis at MLconf ATL 2016
 
Ehtsham Elahi, Senior Research Engineer, Personalization Science and Engineer...
Ehtsham Elahi, Senior Research Engineer, Personalization Science and Engineer...Ehtsham Elahi, Senior Research Engineer, Personalization Science and Engineer...
Ehtsham Elahi, Senior Research Engineer, Personalization Science and Engineer...
 
Predictive churn h20_dsx
Predictive churn h20_dsxPredictive churn h20_dsx
Predictive churn h20_dsx
 
Distributed TensorFlow on Hadoop, Mesos, Kubernetes, Spark
Distributed TensorFlow on Hadoop, Mesos, Kubernetes, SparkDistributed TensorFlow on Hadoop, Mesos, Kubernetes, Spark
Distributed TensorFlow on Hadoop, Mesos, Kubernetes, Spark
 
Spark Summit EU 2015: Combining the Strengths of MLlib, scikit-learn, and R
Spark Summit EU 2015: Combining the Strengths of MLlib, scikit-learn, and RSpark Summit EU 2015: Combining the Strengths of MLlib, scikit-learn, and R
Spark Summit EU 2015: Combining the Strengths of MLlib, scikit-learn, and R
 
Extending the google_assistant
Extending the google_assistantExtending the google_assistant
Extending the google_assistant
 
Mathias Brandewinder, Software Engineer & Data Scientist, Clear Lines Consult...
Mathias Brandewinder, Software Engineer & Data Scientist, Clear Lines Consult...Mathias Brandewinder, Software Engineer & Data Scientist, Clear Lines Consult...
Mathias Brandewinder, Software Engineer & Data Scientist, Clear Lines Consult...
 
V like Velocity, Predicting in Real-Time with Azure ML
V like Velocity, Predicting in Real-Time with Azure MLV like Velocity, Predicting in Real-Time with Azure ML
V like Velocity, Predicting in Real-Time with Azure ML
 
Python and H2O with Cliff Click at PyData Dallas 2015
Python and H2O with Cliff Click at PyData Dallas 2015Python and H2O with Cliff Click at PyData Dallas 2015
Python and H2O with Cliff Click at PyData Dallas 2015
 
VEGAS: The Missing Matplotlib for Scala/Apache Spark with Roger Menezes and D...
VEGAS: The Missing Matplotlib for Scala/Apache Spark with Roger Menezes and D...VEGAS: The Missing Matplotlib for Scala/Apache Spark with Roger Menezes and D...
VEGAS: The Missing Matplotlib for Scala/Apache Spark with Roger Menezes and D...
 
Resource-Efficient Deep Learning Model Selection on Apache Spark
Resource-Efficient Deep Learning Model Selection on Apache SparkResource-Efficient Deep Learning Model Selection on Apache Spark
Resource-Efficient Deep Learning Model Selection on Apache Spark
 
Managing data workflows with Luigi
Managing data workflows with LuigiManaging data workflows with Luigi
Managing data workflows with Luigi
 
sparklyr - Jeff Allen
sparklyr - Jeff Allensparklyr - Jeff Allen
sparklyr - Jeff Allen
 
h2oensemble with Erin Ledell at useR! Aalborg
h2oensemble with Erin Ledell at useR! Aalborgh2oensemble with Erin Ledell at useR! Aalborg
h2oensemble with Erin Ledell at useR! Aalborg
 
Unifying State-of-the-Art AI and Big Data in Apache Spark with Reynold Xin
Unifying State-of-the-Art AI and Big Data in Apache Spark with Reynold XinUnifying State-of-the-Art AI and Big Data in Apache Spark with Reynold Xin
Unifying State-of-the-Art AI and Big Data in Apache Spark with Reynold Xin
 
Intro to Python Data Analysis in Wakari
Intro to Python Data Analysis in WakariIntro to Python Data Analysis in Wakari
Intro to Python Data Analysis in Wakari
 

En vedette

Introduction to Recommender Systems
Introduction to Recommender SystemsIntroduction to Recommender Systems
Introduction to Recommender SystemsTuri, Inc.
 
Introduction to Deep Learning for Image Analysis at Strata NYC, Sep 2015
Introduction to Deep Learning for Image Analysis at Strata NYC, Sep 2015Introduction to Deep Learning for Image Analysis at Strata NYC, Sep 2015
Introduction to Deep Learning for Image Analysis at Strata NYC, Sep 2015Turi, Inc.
 
2012 DuPage Environmental Summit
2012 DuPage Environmental Summit2012 DuPage Environmental Summit
2012 DuPage Environmental SummitNapervilleNCEC
 
I love free_nsta2010
I love free_nsta2010I love free_nsta2010
I love free_nsta2010Jan Coley
 
Kiss fewer frogs - BNI INSOMNIACS
Kiss fewer frogs - BNI INSOMNIACSKiss fewer frogs - BNI INSOMNIACS
Kiss fewer frogs - BNI INSOMNIACSMuneer Samnani
 
DieHarder (CCS 2010, WOOT 2011)
DieHarder (CCS 2010, WOOT 2011)DieHarder (CCS 2010, WOOT 2011)
DieHarder (CCS 2010, WOOT 2011)Emery Berger
 
Socialising the enterprise
Socialising the enterpriseSocialising the enterprise
Socialising the enterpriseIntranet Future
 
Alba Lucia Sanchez Mejia
Alba Lucia Sanchez Mejia	Alba Lucia Sanchez Mejia
Alba Lucia Sanchez Mejia astrydquintero
 
Presentacion de economica politica
Presentacion de economica politicaPresentacion de economica politica
Presentacion de economica politicaabelardoac
 
La auténtica felicidad alicia hp
La  auténtica felicidad alicia hpLa  auténtica felicidad alicia hp
La auténtica felicidad alicia hpdehp1961
 
Social Media for Events
Social Media for EventsSocial Media for Events
Social Media for EventsJulius Solaris
 
Australian Junior Mining Exploration Company
Australian Junior Mining Exploration CompanyAustralian Junior Mining Exploration Company
Australian Junior Mining Exploration Companyjoel_fishlock
 
BNI Achievers Chapter - 10mins The Story About Me
BNI Achievers Chapter - 10mins The Story About MeBNI Achievers Chapter - 10mins The Story About Me
BNI Achievers Chapter - 10mins The Story About MeLeik Hong, Leow 廖翊翃
 
LINK UP - How your business can benefit from LinkedIn
LINK UP - How your business can benefit from LinkedInLINK UP - How your business can benefit from LinkedIn
LINK UP - How your business can benefit from LinkedInIntranet Future
 
Presentacion de economica politica
Presentacion de economica politicaPresentacion de economica politica
Presentacion de economica politicaabelardoac
 

En vedette (20)

Introduction to Recommender Systems
Introduction to Recommender SystemsIntroduction to Recommender Systems
Introduction to Recommender Systems
 
Introduction to Deep Learning for Image Analysis at Strata NYC, Sep 2015
Introduction to Deep Learning for Image Analysis at Strata NYC, Sep 2015Introduction to Deep Learning for Image Analysis at Strata NYC, Sep 2015
Introduction to Deep Learning for Image Analysis at Strata NYC, Sep 2015
 
2012 DuPage Environmental Summit
2012 DuPage Environmental Summit2012 DuPage Environmental Summit
2012 DuPage Environmental Summit
 
Ehealthhoy
EhealthhoyEhealthhoy
Ehealthhoy
 
I love free_nsta2010
I love free_nsta2010I love free_nsta2010
I love free_nsta2010
 
Green it
Green itGreen it
Green it
 
Jean marie delbecq
Jean marie delbecqJean marie delbecq
Jean marie delbecq
 
Kiss fewer frogs - BNI INSOMNIACS
Kiss fewer frogs - BNI INSOMNIACSKiss fewer frogs - BNI INSOMNIACS
Kiss fewer frogs - BNI INSOMNIACS
 
DieHarder (CCS 2010, WOOT 2011)
DieHarder (CCS 2010, WOOT 2011)DieHarder (CCS 2010, WOOT 2011)
DieHarder (CCS 2010, WOOT 2011)
 
Socialising the enterprise
Socialising the enterpriseSocialising the enterprise
Socialising the enterprise
 
Alba Lucia Sanchez Mejia
Alba Lucia Sanchez Mejia	Alba Lucia Sanchez Mejia
Alba Lucia Sanchez Mejia
 
Presentacion de economica politica
Presentacion de economica politicaPresentacion de economica politica
Presentacion de economica politica
 
La auténtica felicidad alicia hp
La  auténtica felicidad alicia hpLa  auténtica felicidad alicia hp
La auténtica felicidad alicia hp
 
Social Media for Events
Social Media for EventsSocial Media for Events
Social Media for Events
 
Missao Piaui Diario da Serra 2016
Missao Piaui Diario da Serra 2016Missao Piaui Diario da Serra 2016
Missao Piaui Diario da Serra 2016
 
Australian Junior Mining Exploration Company
Australian Junior Mining Exploration CompanyAustralian Junior Mining Exploration Company
Australian Junior Mining Exploration Company
 
BNI Achievers Chapter - 10mins The Story About Me
BNI Achievers Chapter - 10mins The Story About MeBNI Achievers Chapter - 10mins The Story About Me
BNI Achievers Chapter - 10mins The Story About Me
 
LINK UP - How your business can benefit from LinkedIn
LINK UP - How your business can benefit from LinkedInLINK UP - How your business can benefit from LinkedIn
LINK UP - How your business can benefit from LinkedIn
 
Presentacion de economica politica
Presentacion de economica politicaPresentacion de economica politica
Presentacion de economica politica
 
Herbert Allen
Herbert AllenHerbert Allen
Herbert Allen
 

Similaire à SFrame

Relational Database Access with Python ‘sans’ ORM
Relational Database Access with Python ‘sans’ ORM  Relational Database Access with Python ‘sans’ ORM
Relational Database Access with Python ‘sans’ ORM Mark Rees
 
FastR+Apache Flink
FastR+Apache FlinkFastR+Apache Flink
FastR+Apache FlinkJuan Fumero
 
Build Large-Scale Data Analytics and AI Pipeline Using RayDP
Build Large-Scale Data Analytics and AI Pipeline Using RayDPBuild Large-Scale Data Analytics and AI Pipeline Using RayDP
Build Large-Scale Data Analytics and AI Pipeline Using RayDPDatabricks
 
Apache Flink internals
Apache Flink internalsApache Flink internals
Apache Flink internalsKostas Tzoumas
 
FP - Découverte de Play Framework Scala
FP - Découverte de Play Framework ScalaFP - Découverte de Play Framework Scala
FP - Découverte de Play Framework ScalaKévin Margueritte
 
New Developments in Spark
New Developments in SparkNew Developments in Spark
New Developments in SparkDatabricks
 
How to Get Your Website Into the Cloud
How to Get Your Website Into the CloudHow to Get Your Website Into the Cloud
How to Get Your Website Into the CloudAll Things Open
 
carrow - Go bindings to Apache Arrow via C++-API
carrow - Go bindings to Apache Arrow via C++-APIcarrow - Go bindings to Apache Arrow via C++-API
carrow - Go bindings to Apache Arrow via C++-APIYoni Davidson
 
ClojureScript for the web
ClojureScript for the webClojureScript for the web
ClojureScript for the webMichiel Borkent
 
Reproducible Computational Research in R
Reproducible Computational Research in RReproducible Computational Research in R
Reproducible Computational Research in RSamuel Bosch
 
Apache Spark Workshop, Apr. 2016, Euangelos Linardos
Apache Spark Workshop, Apr. 2016, Euangelos LinardosApache Spark Workshop, Apr. 2016, Euangelos Linardos
Apache Spark Workshop, Apr. 2016, Euangelos LinardosEuangelos Linardos
 
Relational Database Access with Python
Relational Database Access with PythonRelational Database Access with Python
Relational Database Access with PythonMark Rees
 
Online Security Analytics on Large Scale Video Surveillance System by Yu Cao ...
Online Security Analytics on Large Scale Video Surveillance System by Yu Cao ...Online Security Analytics on Large Scale Video Surveillance System by Yu Cao ...
Online Security Analytics on Large Scale Video Surveillance System by Yu Cao ...Spark Summit
 
Adios hadoop, Hola Spark! T3chfest 2015
Adios hadoop, Hola Spark! T3chfest 2015Adios hadoop, Hola Spark! T3chfest 2015
Adios hadoop, Hola Spark! T3chfest 2015dhiguero
 
Natural Language Processing with CNTK and Apache Spark with Ali Zaidi
Natural Language Processing with CNTK and Apache Spark with Ali ZaidiNatural Language Processing with CNTK and Apache Spark with Ali Zaidi
Natural Language Processing with CNTK and Apache Spark with Ali ZaidiDatabricks
 
Kerberizing Spark: Spark Summit East talk by Abel Rincon and Jorge Lopez-Malla
Kerberizing Spark: Spark Summit East talk by Abel Rincon and Jorge Lopez-MallaKerberizing Spark: Spark Summit East talk by Abel Rincon and Jorge Lopez-Malla
Kerberizing Spark: Spark Summit East talk by Abel Rincon and Jorge Lopez-MallaSpark Summit
 

Similaire à SFrame (20)

FleetDB
FleetDBFleetDB
FleetDB
 
Flink internals web
Flink internals web Flink internals web
Flink internals web
 
Relational Database Access with Python ‘sans’ ORM
Relational Database Access with Python ‘sans’ ORM  Relational Database Access with Python ‘sans’ ORM
Relational Database Access with Python ‘sans’ ORM
 
FastR+Apache Flink
FastR+Apache FlinkFastR+Apache Flink
FastR+Apache Flink
 
Build Large-Scale Data Analytics and AI Pipeline Using RayDP
Build Large-Scale Data Analytics and AI Pipeline Using RayDPBuild Large-Scale Data Analytics and AI Pipeline Using RayDP
Build Large-Scale Data Analytics and AI Pipeline Using RayDP
 
Apache Flink internals
Apache Flink internalsApache Flink internals
Apache Flink internals
 
FP - Découverte de Play Framework Scala
FP - Découverte de Play Framework ScalaFP - Découverte de Play Framework Scala
FP - Découverte de Play Framework Scala
 
New Developments in Spark
New Developments in SparkNew Developments in Spark
New Developments in Spark
 
How to Get Your Website Into the Cloud
How to Get Your Website Into the CloudHow to Get Your Website Into the Cloud
How to Get Your Website Into the Cloud
 
carrow - Go bindings to Apache Arrow via C++-API
carrow - Go bindings to Apache Arrow via C++-APIcarrow - Go bindings to Apache Arrow via C++-API
carrow - Go bindings to Apache Arrow via C++-API
 
ClojureScript for the web
ClojureScript for the webClojureScript for the web
ClojureScript for the web
 
Reproducible Computational Research in R
Reproducible Computational Research in RReproducible Computational Research in R
Reproducible Computational Research in R
 
Apache Spark Workshop, Apr. 2016, Euangelos Linardos
Apache Spark Workshop, Apr. 2016, Euangelos LinardosApache Spark Workshop, Apr. 2016, Euangelos Linardos
Apache Spark Workshop, Apr. 2016, Euangelos Linardos
 
Relational Database Access with Python
Relational Database Access with PythonRelational Database Access with Python
Relational Database Access with Python
 
Online Security Analytics on Large Scale Video Surveillance System by Yu Cao ...
Online Security Analytics on Large Scale Video Surveillance System by Yu Cao ...Online Security Analytics on Large Scale Video Surveillance System by Yu Cao ...
Online Security Analytics on Large Scale Video Surveillance System by Yu Cao ...
 
Adios hadoop, Hola Spark! T3chfest 2015
Adios hadoop, Hola Spark! T3chfest 2015Adios hadoop, Hola Spark! T3chfest 2015
Adios hadoop, Hola Spark! T3chfest 2015
 
De Java 8 a Java 17
De Java 8 a Java 17De Java 8 a Java 17
De Java 8 a Java 17
 
R sharing 101
R sharing 101R sharing 101
R sharing 101
 
Natural Language Processing with CNTK and Apache Spark with Ali Zaidi
Natural Language Processing with CNTK and Apache Spark with Ali ZaidiNatural Language Processing with CNTK and Apache Spark with Ali Zaidi
Natural Language Processing with CNTK and Apache Spark with Ali Zaidi
 
Kerberizing Spark: Spark Summit East talk by Abel Rincon and Jorge Lopez-Malla
Kerberizing Spark: Spark Summit East talk by Abel Rincon and Jorge Lopez-MallaKerberizing Spark: Spark Summit East talk by Abel Rincon and Jorge Lopez-Malla
Kerberizing Spark: Spark Summit East talk by Abel Rincon and Jorge Lopez-Malla
 

Plus de Turi, Inc.

Webinar - Analyzing Video
Webinar - Analyzing VideoWebinar - Analyzing Video
Webinar - Analyzing VideoTuri, Inc.
 
Webinar - Patient Readmission Risk
Webinar - Patient Readmission RiskWebinar - Patient Readmission Risk
Webinar - Patient Readmission RiskTuri, Inc.
 
Webinar - Know Your Customer - Arya (20160526)
Webinar - Know Your Customer - Arya (20160526)Webinar - Know Your Customer - Arya (20160526)
Webinar - Know Your Customer - Arya (20160526)Turi, Inc.
 
Webinar - Product Matching - Palombo (20160428)
Webinar - Product Matching - Palombo (20160428)Webinar - Product Matching - Palombo (20160428)
Webinar - Product Matching - Palombo (20160428)Turi, Inc.
 
Webinar - Pattern Mining Log Data - Vega (20160426)
Webinar - Pattern Mining Log Data - Vega (20160426)Webinar - Pattern Mining Log Data - Vega (20160426)
Webinar - Pattern Mining Log Data - Vega (20160426)Turi, Inc.
 
Webinar - Fraud Detection - Palombo (20160428)
Webinar - Fraud Detection - Palombo (20160428)Webinar - Fraud Detection - Palombo (20160428)
Webinar - Fraud Detection - Palombo (20160428)Turi, Inc.
 
Scaling Up Machine Learning: How to Benchmark GraphLab Create on Huge Datasets
Scaling Up Machine Learning: How to Benchmark GraphLab Create on Huge DatasetsScaling Up Machine Learning: How to Benchmark GraphLab Create on Huge Datasets
Scaling Up Machine Learning: How to Benchmark GraphLab Create on Huge DatasetsTuri, Inc.
 
Pattern Mining: Extracting Value from Log Data
Pattern Mining: Extracting Value from Log DataPattern Mining: Extracting Value from Log Data
Pattern Mining: Extracting Value from Log DataTuri, Inc.
 
Intelligent Applications with Machine Learning Toolkits
Intelligent Applications with Machine Learning ToolkitsIntelligent Applications with Machine Learning Toolkits
Intelligent Applications with Machine Learning ToolkitsTuri, Inc.
 
Text Analysis with Machine Learning
Text Analysis with Machine LearningText Analysis with Machine Learning
Text Analysis with Machine LearningTuri, Inc.
 
Machine Learning with GraphLab Create
Machine Learning with GraphLab CreateMachine Learning with GraphLab Create
Machine Learning with GraphLab CreateTuri, Inc.
 
Machine Learning in Production with Dato Predictive Services
Machine Learning in Production with Dato Predictive ServicesMachine Learning in Production with Dato Predictive Services
Machine Learning in Production with Dato Predictive ServicesTuri, Inc.
 
Machine Learning in 2016: Live Q&A with Carlos Guestrin
Machine Learning in 2016: Live Q&A with Carlos GuestrinMachine Learning in 2016: Live Q&A with Carlos Guestrin
Machine Learning in 2016: Live Q&A with Carlos GuestrinTuri, Inc.
 
Machine learning in production
Machine learning in productionMachine learning in production
Machine learning in productionTuri, Inc.
 
Overview of Machine Learning and Feature Engineering
Overview of Machine Learning and Feature EngineeringOverview of Machine Learning and Feature Engineering
Overview of Machine Learning and Feature EngineeringTuri, Inc.
 
Building Personalized Data Products with Dato
Building Personalized Data Products with DatoBuilding Personalized Data Products with Dato
Building Personalized Data Products with DatoTuri, Inc.
 
Getting Started With Dato - August 2015
Getting Started With Dato - August 2015Getting Started With Dato - August 2015
Getting Started With Dato - August 2015Turi, Inc.
 
Towards a Comprehensive Machine Learning Benchmark
Towards a Comprehensive Machine Learning BenchmarkTowards a Comprehensive Machine Learning Benchmark
Towards a Comprehensive Machine Learning BenchmarkTuri, Inc.
 
Anomaly Detection Using Isolation Forests
Anomaly Detection Using Isolation ForestsAnomaly Detection Using Isolation Forests
Anomaly Detection Using Isolation ForestsTuri, Inc.
 

Plus de Turi, Inc. (20)

Webinar - Analyzing Video
Webinar - Analyzing VideoWebinar - Analyzing Video
Webinar - Analyzing Video
 
Webinar - Patient Readmission Risk
Webinar - Patient Readmission RiskWebinar - Patient Readmission Risk
Webinar - Patient Readmission Risk
 
Webinar - Know Your Customer - Arya (20160526)
Webinar - Know Your Customer - Arya (20160526)Webinar - Know Your Customer - Arya (20160526)
Webinar - Know Your Customer - Arya (20160526)
 
Webinar - Product Matching - Palombo (20160428)
Webinar - Product Matching - Palombo (20160428)Webinar - Product Matching - Palombo (20160428)
Webinar - Product Matching - Palombo (20160428)
 
Webinar - Pattern Mining Log Data - Vega (20160426)
Webinar - Pattern Mining Log Data - Vega (20160426)Webinar - Pattern Mining Log Data - Vega (20160426)
Webinar - Pattern Mining Log Data - Vega (20160426)
 
Webinar - Fraud Detection - Palombo (20160428)
Webinar - Fraud Detection - Palombo (20160428)Webinar - Fraud Detection - Palombo (20160428)
Webinar - Fraud Detection - Palombo (20160428)
 
Scaling Up Machine Learning: How to Benchmark GraphLab Create on Huge Datasets
Scaling Up Machine Learning: How to Benchmark GraphLab Create on Huge DatasetsScaling Up Machine Learning: How to Benchmark GraphLab Create on Huge Datasets
Scaling Up Machine Learning: How to Benchmark GraphLab Create on Huge Datasets
 
Pattern Mining: Extracting Value from Log Data
Pattern Mining: Extracting Value from Log DataPattern Mining: Extracting Value from Log Data
Pattern Mining: Extracting Value from Log Data
 
Intelligent Applications with Machine Learning Toolkits
Intelligent Applications with Machine Learning ToolkitsIntelligent Applications with Machine Learning Toolkits
Intelligent Applications with Machine Learning Toolkits
 
Text Analysis with Machine Learning
Text Analysis with Machine LearningText Analysis with Machine Learning
Text Analysis with Machine Learning
 
Machine Learning with GraphLab Create
Machine Learning with GraphLab CreateMachine Learning with GraphLab Create
Machine Learning with GraphLab Create
 
Machine Learning in Production with Dato Predictive Services
Machine Learning in Production with Dato Predictive ServicesMachine Learning in Production with Dato Predictive Services
Machine Learning in Production with Dato Predictive Services
 
Machine Learning in 2016: Live Q&A with Carlos Guestrin
Machine Learning in 2016: Live Q&A with Carlos GuestrinMachine Learning in 2016: Live Q&A with Carlos Guestrin
Machine Learning in 2016: Live Q&A with Carlos Guestrin
 
Machine learning in production
Machine learning in productionMachine learning in production
Machine learning in production
 
Overview of Machine Learning and Feature Engineering
Overview of Machine Learning and Feature EngineeringOverview of Machine Learning and Feature Engineering
Overview of Machine Learning and Feature Engineering
 
Building Personalized Data Products with Dato
Building Personalized Data Products with DatoBuilding Personalized Data Products with Dato
Building Personalized Data Products with Dato
 
Getting Started With Dato - August 2015
Getting Started With Dato - August 2015Getting Started With Dato - August 2015
Getting Started With Dato - August 2015
 
Towards a Comprehensive Machine Learning Benchmark
Towards a Comprehensive Machine Learning BenchmarkTowards a Comprehensive Machine Learning Benchmark
Towards a Comprehensive Machine Learning Benchmark
 
Dato Keynote
Dato KeynoteDato Keynote
Dato Keynote
 
Anomaly Detection Using Isolation Forests
Anomaly Detection Using Isolation ForestsAnomaly Detection Using Isolation Forests
Anomaly Detection Using Isolation Forests
 

 


SFrame

  • 2. Scalable Machine Learning: recommenders, other task-oriented ML, boosted decision trees, deep learning, pattern mining, many others. Stack: GraphLab Create / SGraph, SFrame / Local, HDFS, S3. Compressed in-core or out-of-core scalable data structures, in C++11.
  • 3. Stack: SGraph, SFrame / Local, HDFS, S3. Compressed in-core or out-of-core scalable data structures. https://github.com/dato-code/sframe
  • 4. Python API
      sf = gl.SFrame.read_csv('netflix.csv')
      sf2 = gl.SFrame('netflix_norm.frame')
      sf['nrating'] = sf2['rating']
    Files: netflix_tr.frame (user, movie, rating); netflix_norm.frame (user, movie, rating). sf2 has columns user, item, rating; sf has columns user, item, rating, nrating.
  • 5. Python API (continued)
      diff = sf['rating'] - sf2['rating']
    diff is an anonymous column.
  • 6. Python API (continued)
      sf['diff'] = diff
    Not a SQL frontend.
      Filtering: sf[sf['rating'] >= 3]
      Joins: sf.join(user_table, on='user_id')
      Random/array indexing: row10 = sf[10]; table_with_every_other_row = sf[::2]
      Rather fast parallelized UDFs (interprocess SHM): sf['rating'].apply(lambda x: x*x)
  • 7. Column Types Supported
      - Boring scalar types: int64, double, string
      - Interesting scalar types: datetime.datetime, image
      - For the mathematician type: array('d')
      - For the "all real data is ugly" types: list, dict (arbitrary union types; e.g., a list can contain anything, including other lists and dicts)
  • 8. What Are SFrames
      - Python API: heavily Pandas-inspired (+ immutable data considerations)
      - Lazy Query Optimization / Execution: C++ coroutine exec pipeline
      - Physical Storage Layer: compressed column store (with some interesting properties)
      - File System Abstraction: Local, HDFS, S3, Cache
    Type-aware compression methods. Very aggressive numeric compression. Netflix dataset, 99M rows, 3 columns, ints: 1.4GB raw, 289MB gzip compressed, 160MB.
  • 9. Query Planning (same layers as slide 8)
      p['X4'] = p['X3'] + p['X2']
      g = p[p['X1'] < 10]
  • 10. Language Binding
      - Python Bindings: our oldest binding. Via Cython + interprocess comm to a C++ binary.
      - R Bindings: via our Rcpp → C++11 bindings (exported in SDK)
      - C++11 Bindings:
          auto g = gl_sframe();
          g["hello"] = gl_sarray::from_sequence(0,1000);
          g["world"] = 2;
          g["hello"] = (g["hello"] / 2).astype(flex_type_enum::INTEGER);
          auto ret = g.groupby({"hello"}, {{"sum of world", aggregate::SUM("world")}});
          ret = ret.sort({"hello"});
          cout << ret;
        Output:
          Columns:
              hello           integer
              sum of world    integer
          Rows: 500
          Data:
          +----------------+----------------+
          | hello          | sum of world   |
          +----------------+----------------+
          | 0              | 4              |
          | 1              | 4              |
          | 2              | 4              |
          | 3              | 4              |
          | 4              | 4              |
          | 5              | 4              |
          | 6              | 4              |
          | 7              | 4              |
          | 8              | 4              |
          | 9              | 4              |
          +----------------+----------------+
          [500 rows x 2 columns]
  • 11. Common Crawl Graph: 3.5 billion nodes and 128 billion edges, on 1x r3.8xlarge using 1x SSD. PageRank: 9 min per iteration. Connected Components: ~1 hr. There isn't any general-purpose library out there capable of this.
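
The lazy evaluation behind the query-planning slide (p['X4'] = p['X3'] + p['X2'], then g = p[p['X1'] < 10]) can be illustrated with a small pure-Python sketch. This is not SFrame's implementation or API: the LazyColumn class, its methods, and the sample values are all hypothetical, and the real engine is a C++ pipeline. The sketch only shows the general idea that each operation records a plan and nothing is computed until the result is materialized.

```python
import operator


class LazyColumn:
    """Hypothetical column that records a plan and computes only on demand."""

    def __init__(self, compute, description):
        self._compute = compute          # zero-argument callable returning a list
        self.description = description   # human-readable form of the plan
        self._cache = None               # filled in on first materialization

    @classmethod
    def from_values(cls, name, values):
        data = list(values)
        return cls(lambda: data, name)

    def __add__(self, other):
        # Element-wise addition; only the plan is built here.
        return LazyColumn(
            lambda: [operator.add(a, b)
                     for a, b in zip(self.materialize(), other.materialize())],
            f"({self.description} + {other.description})",
        )

    def __lt__(self, scalar):
        # Element-wise comparison against a scalar, producing a boolean mask.
        return LazyColumn(
            lambda: [v < scalar for v in self.materialize()],
            f"({self.description} < {scalar})",
        )

    def filter(self, mask):
        # Keep the rows where the mask is true.
        return LazyColumn(
            lambda: [v for v, keep in zip(self.materialize(), mask.materialize())
                     if keep],
            f"filter({self.description}, {mask.description})",
        )

    def materialize(self):
        # Run the recorded plan once and cache the result.
        if self._cache is None:
            self._cache = self._compute()
        return self._cache


# Made-up sample data, named after the columns on the slide.
x1 = LazyColumn.from_values("X1", [5, 20, 7, 30])
x2 = LazyColumn.from_values("X2", [1, 2, 3, 4])
x3 = LazyColumn.from_values("X3", [10, 20, 30, 40])

x4 = x3 + x2                   # plan only, nothing computed yet
g_x4 = x4.filter(x1 < 10)      # still only a plan
plan = g_x4.description        # "filter((X3 + X2), (X1 < 10))"
result = g_x4.materialize()    # [11, 33]
```

Because the whole expression is available as a plan before anything runs, an engine built this way has the chance to reorder or fuse steps, for example applying the filter before the addition.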

Editor's Notes

  1. Scalable semi-out-of-core computation.
  2. Fix and update this slide. Align with stages? Can we discuss pricing here?
  3. Somewhat more expressive than SQL-backed dataframe solutions. It shares a lot more properties with Pandas than with SQL. You can append, modify columns, etc. The only thing you cannot do is modify individual values. - Filtering and joins are standard. - It is an actual table, so arbitrary indexing is fine. Sometimes it might result in a materialization, which is costly, but once materialized, indexing is not too bad! - Parallelized lambdas! C++ process → interprocess shared memory → C++ embedded libpython.
  4. What are the
  5. I have struggled to present this. It is really difficult to explain what this is. Only recently did I figure out the reason: it is not 1 thing, it is really 3 or 4 things. - A Python API, heavily Pandas-inspired. Does a ton of stuff. Also has a rather nice scalable graph datastructure to go with it. - A physical storage layer. A heavily compressed column store with type-specific compression routines, especially aggressive for numeric types. It comes with a file system abstraction (for C++ people: fstream, general_fstream) that can read from many places, plus a special "cache" filesystem, which is basically an "in-memory file" that dumps to disk when memory gets full. This is how we get compressed in-memory performance. - And I am not even talking about our Graph Datastructure. But talk to me if you want to hear more.
  6. - Potentially the youngest part of the code base, and the one with the most bang for the buck right now if you come in and make improvements, is the query engine. Evaluation is lazy, so we can do query optimization, query planning, and query execution.
  7. Python SFrame API: our oldest language binding. Why? We can talk about this another time; some of it is due to old design decisions. This does mean that copies from Python are slow. That said, the architecture makes it very easy to eliminate interprocess comm entirely, but there is one very interesting oddity which we have to resolve first. R SFrame API: we are trying to stabilize it right now, and it will be released open source as well (unfortunately under GPL, as is traditional in R). It really just wraps the C++11 SFrame API.
  8. There are some other parts here which I am not talking about, for instance our Graph Datastructure, which is optimized for bulk compute (not ... But talk to me if you want to hear more. If you were to try to represent this in memory, it would take a minimum of a TB of memory or so, excluding overheads. Canonical
  9. Q: Performance? Pretty good. Single-machine performance is roughly comparable to a 5-node Spark or Hive cluster. There is still much room to go: recent versions have had a regression as we switched out the query execution engine for something more "correct".
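
The "minimum of a TB of memory" figure for the Common Crawl graph can be sanity-checked with some quick arithmetic. The sketch below is one plausible way to arrive at that order of magnitude, not necessarily how the speaker computed it. It assumes a plain edge list in which every edge stores two node IDs, each packed into the smallest whole number of bytes, and it ignores all overheads. The function name is hypothetical; the node and edge counts are taken from the slide.

```python
NODES = 3_500_000_000      # node count from the slide
EDGES = 128_000_000_000    # edge count from the slide


def edge_list_bytes(nodes, edges):
    """Bytes needed for a raw (source, target) edge list with packed integer IDs."""
    bits_per_id = (nodes - 1).bit_length()    # bits needed for the largest ID
    bytes_per_id = (bits_per_id + 7) // 8     # round up to whole bytes
    return edges * 2 * bytes_per_id           # two IDs per edge


total_bytes = edge_list_bytes(NODES, EDGES)
total_tb = total_bytes / 10**12               # decimal terabytes
print(f"{total_tb:.3f} TB")                   # prints "1.024 TB"
```

Under these assumptions, 3.5 billion nodes just fit in 32-bit IDs, so each edge costs 8 bytes and 128 billion edges come to roughly 1 TB before any indexing or per-vertex data is counted.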