This video was recorded in San Francisco on February 5th, 2019 and can be viewed here: https://youtu.be/EIzih12_xmk
Bio: Pasha is a Hacker Scientist at H2O.ai. He holds an MS in Applied Physics and Mathematics from Moscow Institute of Physics and Technology, an MA in Economics from New Economic School (Moscow), and a Ph.D. in Economics (econometrics) from Stanford University. During his education, he obtained knowledge in Computer Science, Machine Learning, Statistics, and Econometrics. Prior to coming to H2O.ai, Pasha worked at the stealth-mode machine learning startup Machinify.com as a data scientist/frontend engineer; before that as an engineer at Facebook; and before that as a senior quantitative analyst at the business consulting company Keystone Strategy, working on big data analysis.
Bio: Oleksiy is a maker scientist and hacker at H2O.ai, focusing on highly optimized algorithms for machine learning and data analysis. He holds M.S., summa cum laude, and Ph.D. degrees in applied mathematics from the National University of Kharkiv, Ukraine. In 2009, Oleksiy was selected as a research fellow by CERN and contributed to R&D for the Large Hadron Collider and the next generation of high energy particle accelerators. In 2013 he joined SLAC and Stanford University to develop a high-performance simulation suite for 3D multi-physics modeling. Oleksiy has authored more than 60 scientific papers and has been an invited speaker at major international conferences, prominent institutions, and companies worldwide. In his free time, he enjoys snowboarding, playing soccer and basketball, guitar and drums, What? Where? When? and Jeopardy!
Transcript:
Machine Learning and Data Munging in H2O Driverless AI with datatable
1. Machine Learning and Data Munging in H2O Driverless AI with datatable
Pasha Stetsenko & Oleksiy Kononenko
H2O.ai
@pydatatable
#H2OWORLD
2. Machine Learning with datatable
https://www.kaggle.com/c/microsoft-malware-prediction/discussion/75478
3. datatable.models
Follow the Regularized Leader (FTRL) algorithm for binomial classification, proposed by H. B. McMahan et al., “Ad click prediction: a view from the trenches” https://research.google.com/pubs/archive/41159.pdf
• Python frontend, C++ backend
• Parallelized with OpenMP and Hogwild
• Supports boolean, integer, real and string features
• Hashing trick based on Murmur hash function
• Second-order feature interactions
• One-vs.-rest multinomial classification and regression for continuous targets (experimental)
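To make the bullet points above concrete, here is a minimal pure-Python sketch of the per-coordinate FTRL-Proximal update from the McMahan et al. paper, combined with the hashing trick. It is not datatable's C++ implementation: the class name `TinyFtrl` is hypothetical, CRC32 from the standard library stands in for the Murmur hash, and there is no parallelism (no OpenMP or Hogwild) and no feature interactions.

```python
import math
import zlib


class TinyFtrl:
    """Illustrative FTRL-Proximal for logistic loss on hashed categorical features."""

    def __init__(self, alpha=0.1, beta=1.0, lambda1=0.0, lambda2=0.0, nbins=2**16):
        self.alpha, self.beta = alpha, beta
        self.lambda1, self.lambda2 = lambda1, lambda2
        self.nbins = nbins
        self.z = [0.0] * nbins  # accumulated adjusted gradients
        self.n = [0.0] * nbins  # accumulated squared gradients

    def _indices(self, row):
        # Hashing trick: map each "name=value" pair to one of nbins buckets.
        # CRC32 is an assumption made for portability; datatable uses Murmur.
        return {zlib.crc32(f"{k}={v}".encode()) % self.nbins for k, v in row.items()}

    def _weight(self, i):
        # Closed-form weight; L1 regularization zeroes out small coordinates.
        z = self.z[i]
        if abs(z) <= self.lambda1:
            return 0.0
        sign = 1.0 if z > 0 else -1.0
        denom = (self.beta + math.sqrt(self.n[i])) / self.alpha + self.lambda2
        return -(z - sign * self.lambda1) / denom

    @staticmethod
    def _sigmoid(wx):
        wx = max(min(wx, 35.0), -35.0)  # clip to avoid overflow in exp
        return 1.0 / (1.0 + math.exp(-wx))

    def predict(self, row):
        return self._sigmoid(sum(self._weight(i) for i in self._indices(row)))

    def update(self, row, y):
        idx = self._indices(row)
        weights = {i: self._weight(i) for i in idx}
        p = self._sigmoid(sum(weights.values()))
        g = p - y  # gradient of log loss for a feature whose value is 1
        for i in idx:
            sigma = (math.sqrt(self.n[i] + g * g) - math.sqrt(self.n[i])) / self.alpha
            self.z[i] += g - sigma * weights[i]
            self.n[i] += g * g
        return p


# Toy data, made up for illustration: "red" rows are positive, "blue" rows negative.
model = TinyFtrl()
for row, y in [({"color": "red"}, 1), ({"color": "blue"}, 0)] * 50:
    model.update(row, y)

p_red = model.predict({"color": "red"})
p_blue = model.predict({"color": "blue"})
print(p_red, p_blue)
```

Because every feature is hashed into a fixed number of buckets, the model's memory footprint is set by `nbins` rather than by the number of distinct feature values, which is what lets the same code handle boolean, integer, real and string inputs.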
4. Python code example
For detailed help please refer to https://datatable.readthedocs.io/en/latest/ftrl.html
5. Five reasons to use datatable FTRL
1. Reliable: integrated into H2O Driverless AI as of v1.5
2. Fast: million rows in seconds
3. It’s all datatable: read data, munge data, fit/predict, save results
4. Already on Kaggle, thanks to Bojan Tunguz, Kaggle GM: https://www.kaggle.com/tunguz/eda-with-python-datatable
5. Open source: MIT license
6. What is datatable anyways?
• R data.table is one of the top 10 most popular R packages
• Python datatable was started in an attempt to mimic the internal design and API of R data.table
• First customer: Driverless AI
Capabilities:
• Efficient multi-threaded algorithms
• Memory-thrifty
• Memory-mapped on-disk datasets
• Native C++ implementation
• Open source
7. Load and view data
• Shows progress bar while parsing
• Type and size of each column
• Integer columns with NAs are parsed as integer
• > 5x faster than pandas.read_csv()
8. fread: a doorway to Driverless AI
• A large portion of data is ingested into DAI through fread
• Automatically detects parse parameters
• Multi-threaded parsing
• Recovers from encoding errors
• Reads CSV and Excel files
• Reads files inside archives
9. Save to binary file
• 300 ms to write a 750 MB file, or 2.5 GB/s (on a macOS laptop)
• Writes are cached at OS level and deferred to a later stage
• Write immediately followed by a read from another process is equivalent to direct memory sharing
• Opening a .jay file is nearly instant
11. Examples
Find the average flight duration for each flight
Remove from DT all records where average flight duration is either negative or NA
For each carrier, select 3 longest flights
12. Notes:
R data.table logo is available in Source Code Form at https://github.com/Rdatatable/data.table/blob/master/rdatatable.svg
The End
Thanks for watching