This presentation was made on June 9th, 2020.
Video recording of the session can be viewed here: https://youtu.be/OCB9sTUnUug
In this meetup, Sanyam Bhutani, Machine Learning Engineer at H2O.ai, gives a recap of the eighth annual ICLR (International Conference on Learning Representations) 2020 - a niche deep learning conference whose focus is studying how to learn representations of data, which is essentially what deep learning does.
Sanyam walks through a few of his favorite papers from this year's ICLR; note that this session cannot capture the full richness of every paper or allow a detailed discussion of each.
You can find Sanyam in our community Slack (https://www.h2o.ai/slack-community/). Please feel free to start a discussion with him there; send him a greeting and he will be happy to answer your questions.
Following are the papers we will look into:
U-GAT-IT: Unsupervised Generative Attentional Networks with Adaptive Layer-Instance Normalization for Image-to-Image Translation
AugMix: A Simple Data Processing Method to Improve Robustness and Uncertainty
Your classifier is secretly an energy based model and you should treat it like one
ELECTRA: Pre-training Text Encoders as Discriminators Rather Than Generators
ALBERT: A Lite BERT for Self-supervised Learning of Language Representations
Reformer: The Efficient Transformer
Generative Models for Effective ML on Private, Decentralized Datasets
Once for All: Train One Network and Specialize it for Efficient Deployment
Thieves on Sesame Street! Model Extraction of BERT-based APIs
Plug and Play Language Models: A Simple Approach to Controlled Text Generation
BatchEnsemble: An Alternative Approach to Efficient Ensemble and Lifelong Learning
Real or Not Real, that is the Question
1. 1
ICLR 2020 Recap
Selected Paper summaries and discussions
Sanyam Bhutani
ML Engineer & AI Content Creator
bhutanisanyam1
🎙: ctdsshow
2. Democratizing AI
Our mission to use AI for Good permeates into everything we do.
• AI Transformation (Trusted Partner): Bringing AI to industry by helping companies transform their businesses with H2O.ai.
• AI4GOOD (Impact/Social): Bringing AI to impact by augmenting non-profits and social ventures with technological resources and capabilities.
• Open Source (Community): An industry leader in providing open source, cutting-edge AI & ML platforms (H2O-3).
3. H2O.ai Snapshot
• We are Established: Founded in Silicon Valley, 2012. Funding: $147M, Series D. Investors: Goldman Sachs, Ping An, Wells Fargo, NVIDIA, Nexus Ventures.
• We Make World-class AI Platforms: H2O Open Source Machine Learning; H2O Driverless AI: Automatic Machine Learning; H2O Q: AI platform for business users.
• We are Global: Mountain View, NYC, London, Paris, Ottawa, Prague, Chennai, Singapore.
• Community: 20K companies using H2O Open Source, 180K meetup members, 220+ universities, 1K experts.
• We are Passionate about Customers: 4X customers in 2 years, all industries, all continents. Aetna/CVS, Allergan, AT&T, Capital One, CBA, Citi, Coca-Cola, Bradesco, Dish, Disney, Franklin Templeton, Genentech, Kaiser Permanente, Lego, Merck, Pepsi, Reckitt Benckiser, Roche.
4. Our Team is Made up of the World's Leading Data Scientists
Your projects are backed by 10% of the World's Data Science Grandmasters, who are relentless in solving your critical problems.
14. 14
• Using attention to guide different geometric transforms
• Introduction of a new normalising function, AdaLIN (sketched below)
• Image-to-image translation (and backwards!)
To Summarise
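The following is a minimal NumPy sketch of the Adaptive Layer-Instance Normalization (AdaLIN) idea: a learnable ratio rho blends instance-normalised and layer-normalised activations before the usual scale and shift. The shapes, epsilon and parameter names are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def adalin(x, gamma, beta, rho, eps=1e-5):
    """AdaLIN sketch: blend Instance Norm and Layer Norm with a ratio rho in [0, 1].

    x: activations of shape (N, C, H, W); gamma, beta: per-channel scale/shift (C,).
    """
    # Instance Norm statistics: per sample, per channel, over spatial dims
    mu_in = x.mean(axis=(2, 3), keepdims=True)
    var_in = x.var(axis=(2, 3), keepdims=True)
    x_in = (x - mu_in) / np.sqrt(var_in + eps)

    # Layer Norm statistics: per sample, over channels and spatial dims
    mu_ln = x.mean(axis=(1, 2, 3), keepdims=True)
    var_ln = x.var(axis=(1, 2, 3), keepdims=True)
    x_ln = (x - mu_ln) / np.sqrt(var_ln + eps)

    # rho decides how much of each normalisation to use (learned during training)
    x_hat = rho * x_in + (1.0 - rho) * x_ln
    return gamma.reshape(1, -1, 1, 1) * x_hat + beta.reshape(1, -1, 1, 1)

# Toy usage
x = np.random.randn(2, 4, 8, 8)
out = adalin(x, gamma=np.ones(4), beta=np.zeros(4), rho=0.7)
print(out.shape)  # (2, 4, 8, 8)
```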
16. 16
• Why do you need image augmentations?
• Test and train splits should be similar
• Comparison of recent techniques
• Why is AugMix promising?
Image Augmentations
18. 18
• Mixes augmented images and enforces consistent embeddings of the augmented images, which results in increased robustness and improved uncertainty calibration.
• Unlike AutoAugment, AugMix does not require tuning to work correctly: it enables plug-and-play data augmentation (see the sketch below)
To Summarise
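Below is a rough NumPy sketch of the AugMix recipe as described in the paper: mix several randomly chosen augmentation chains with Dirichlet weights, blend the result back with the original image using a Beta-sampled weight, and during training add a Jensen-Shannon consistency term between predictions on the clean and augmented views. The tiny "augmentation ops" here are placeholders, not the paper's op set.

```python
import numpy as np

rng = np.random.default_rng(0)

# Placeholder augmentation ops; the paper uses AutoAugment-style ops
# (rotate, shear, posterize, ...) chosen to avoid overlapping with the test corruptions.
def flip(img):   return img[:, ::-1]
def invert(img): return 1.0 - img
def darken(img): return img * 0.7
OPS = [flip, invert, darken]

def augmix(image, width=3, depth=2, alpha=1.0):
    """Return an AugMix'd image: Dirichlet-weighted mix of augmentation chains,
    then a Beta-weighted blend with the original image."""
    w = rng.dirichlet([alpha] * width)      # weights over the chains
    m = rng.beta(alpha, alpha)              # blend weight with the original
    mixed = np.zeros_like(image)
    for i in range(width):
        chain = image.copy()
        for _ in range(depth):
            chain = rng.choice(OPS)(chain)  # random op, applied in sequence
        mixed += w[i] * chain
    return m * image + (1.0 - m) * mixed

def js_consistency(p_clean, p_aug1, p_aug2, eps=1e-12):
    """Jensen-Shannon consistency between predictions on the clean and two AugMix views."""
    m = (p_clean + p_aug1 + p_aug2) / 3.0
    kl = lambda p, q: np.sum(p * (np.log(p + eps) - np.log(q + eps)))
    return (kl(p_clean, m) + kl(p_aug1, m) + kl(p_aug2, m)) / 3.0

img = rng.random((32, 32))
print(augmix(img).shape, js_consistency(np.array([0.9, 0.1]),
                                        np.array([0.8, 0.2]),
                                        np.array([0.7, 0.3])))
```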
23. 23
• Progress in NLP as a measure of GLUE score
• What is GLUE Score?
• Normalised by Pre-Training FLOPs
Pre-Training Progress
24. 24
• BERT family uses MLM (masked language modelling)
• Suggested: a bi-directional model that learns from all of the tokens rather than only the small % that are masked
Masked LM & ELECTRA
ELECTRA Pre-Training outperforms MLM Pre-Training
27. 27
• Replaced token detection: a new self-supervised task for language representation learning (see the toy example below)
• Training a text encoder to distinguish input tokens from high-quality negative samples produced by a small generator network
• It works well even when using relatively small amounts of compute
• 45x/8x speedup in training/inference when compared to BERT-Base
To Summarise
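Here is a toy sketch of how the replaced-token-detection training signal is constructed: corrupt a subset of positions with plausible replacements and label every token for the discriminator as "original" or "replaced". The random "generator" below is only a stand-in for the small masked-LM generator the paper trains jointly.

```python
import random

random.seed(0)
VOCAB = ["the", "chef", "cooked", "ate", "meal", "a", "delicious"]

def build_rtd_example(tokens, mask_rate=0.15):
    """Return (corrupted tokens, per-token labels) for replaced token detection.
    Label 1 = token was replaced by the generator, 0 = original."""
    corrupted, labels = [], []
    for tok in tokens:
        if random.random() < mask_rate:
            # A real ELECTRA samples this from a small masked-LM generator;
            # here we just pick a random plausible word as a stand-in.
            replacement = random.choice(VOCAB)
            corrupted.append(replacement)
            # If the generator happens to produce the original token, the label stays 0.
            labels.append(int(replacement != tok))
        else:
            corrupted.append(tok)
            labels.append(0)
    return corrupted, labels

sentence = ["the", "chef", "cooked", "the", "meal"]
print(build_rtd_example(sentence))
# The discriminator (the model you keep) is trained to predict these labels for
# *every* position, which is why it learns from all tokens, not just the masked %.
```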
29. 29
• At some point, further increases in model size become harder due to GPU/TPU memory limitations
• Is getting better NLP models as easy as training larger models?
• How can we reduce parameters?
Introduction
30. 30
• Token embeddings are sparsely populated -> reduce their size by factorising them into smaller projections (see the parameter-count check below)
• Re-use parameters of repeated operations (cross-layer parameter sharing)
Proposed Changes
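A quick back-of-the-envelope check of the factorised embedding idea: instead of a V x H embedding table, ALBERT uses a V x E table plus an E x H projection, which shrinks embedding parameters dramatically once H is large. The sizes below are BERT-like values chosen for illustration.

```python
# Embedding parameter count: V x H (BERT-style) vs V x E + E x H (ALBERT-style factorisation)
V = 30_000   # vocabulary size (WordPiece), roughly BERT's
H = 768      # hidden size
E = 128      # small embedding size used by ALBERT

bert_style   = V * H
albert_style = V * E + E * H

print(f"V x H         = {bert_style:,}")     # 23,040,000
print(f"V x E + E x H = {albert_style:,}")   # 3,938,304
print(f"reduction     = {bert_style / albert_style:.1f}x")
```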
31. 31
•Sentence Order Prediction for
capturing inter-sentence coherence
•Remove Dropout!
•Adding more data increases
performance
Three More Tricks!
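A small sketch of how sentence-order-prediction pairs can be built, following the paper's description: two consecutive segments from the same document form a positive pair, and the same two segments with their order swapped form a negative pair (unlike NSP, whose negatives come from a different document). The segmenting here is naive and just for illustration.

```python
def make_sop_pairs(document_sentences):
    """Yield (segment_a, segment_b, label) with label 1 = in order, 0 = swapped."""
    pairs = []
    for i in range(len(document_sentences) - 1):
        a, b = document_sentences[i], document_sentences[i + 1]
        pairs.append((a, b, 1))  # positive: consecutive segments, original order
        pairs.append((b, a, 0))  # negative: same segments, order swapped
    return pairs

doc = ["The cat sat on the mat.", "It fell asleep in the sun.", "Later it woke up hungry."]
for a, b, label in make_sop_pairs(doc):
    print(label, "|", a, "->", b)
```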
33. 33
• Efficient deployment of DL models across devices
• Conventional approach: train specialised models (think SqueezeNet, MobileNet, etc.)
• Training costs $$$, engineering costs $$$
Introduction
34. 34
• Train once, specialise for deployment
• Key idea: decouple model training from architecture search
• Algorithm proposed: Progressive Shrinking (sub-network sampling sketched below)
Proposed Approach
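A very rough sketch of the sub-network sampling that progressive shrinking relies on: the once-for-all network is trained while randomly sampling architectural configurations (kernel size, depth, width), with the sampling space enlarged in stages from the largest settings downward. The stage schedule and value ranges below are illustrative assumptions, not the paper's exact schedule.

```python
import random

random.seed(0)

# Progressive shrinking: start by supporting only the biggest settings, then
# gradually allow smaller kernels, shallower depths and thinner widths.
STAGES = [
    {"kernel": [7],       "depth": [4],       "width": [6]},
    {"kernel": [7, 5, 3], "depth": [4],       "width": [6]},
    {"kernel": [7, 5, 3], "depth": [4, 3, 2], "width": [6]},
    {"kernel": [7, 5, 3], "depth": [4, 3, 2], "width": [6, 4, 3]},
]

def sample_subnet(stage, num_units=5):
    """Sample one sub-network configuration per unit from the current stage's space."""
    space = STAGES[stage]
    return [
        {
            "kernel": random.choice(space["kernel"]),
            "depth":  random.choice(space["depth"]),
            "width":  random.choice(space["width"]),
        }
        for _ in range(num_units)
    ]

# During training, each step would activate one sampled sub-network and update its
# shared weights; after training, a search picks the best subnet for each device.
for stage in range(len(STAGES)):
    print(f"stage {stage}:", sample_subnet(stage))
```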
38. 38
• Query the model with random sentences to probe its behaviour
• After performing a large number of such queries, you have labels and a dataset (see the sketch below)
• Note: these attacks are economically practical (cheaper than trying to train a model yourself)
• Note 2: this is not model distillation, it's IP theft
Attacks
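A schematic sketch of the extraction attack: query the victim API with (near-)random word sequences, keep its predictions as labels, and then fine-tune your own model on the resulting dataset. `victim_api` and the word list below are hypothetical stand-ins; in the paper the victim is a deployed BERT-based service and the attacker fine-tunes another pretrained BERT on the collected outputs.

```python
import random

random.seed(0)
WORDS = ["movie", "terrible", "great", "plot", "acting", "boring", "loved", "the", "was"]

def victim_api(text):
    """Hypothetical victim endpoint: returns a label for the query.
    In the paper this is a commercial BERT-based API (e.g. sentiment or QA)."""
    return "positive" if ("great" in text or "loved" in text) else "negative"

def extract_dataset(num_queries=1000, length=8):
    """Build a transfer set from random-word queries and the victim's answers."""
    dataset = []
    for _ in range(num_queries):
        query = " ".join(random.choices(WORDS, k=length))
        dataset.append((query, victim_api(query)))   # victim's output becomes the label
    return dataset

transfer_set = extract_dataset()
print(len(transfer_set), transfer_set[:2])
# Next step (not shown): fine-tune your own pretrained encoder on `transfer_set`,
# yielding a model that closely mimics the victim at a fraction of its training cost.
```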
40. 40
• Membership classification: flagging suspicious queries
• API watermarking: a small % of queries return a wrong output; these "watermarked" queries and their outputs are stored on the API side (sketched below)
• Note: both of these would fail against smart attacks
Suggested Solutions
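Here is a minimal sketch of the API-watermarking idea: for a small fraction of queries the API deliberately returns a flipped answer and records the (query, flipped answer) pair; if an extracted model later reproduces those exact wrong answers, that is evidence it was trained on the API's outputs. The function names and flip rule are illustrative assumptions.

```python
import random

random.seed(0)
watermark_log = []   # stored server-side as evidence of extraction

def true_model(text):
    """Stand-in for the real classifier behind the API."""
    return "positive" if "great" in text else "negative"

def watermarked_api(text, watermark_rate=0.001):
    answer = true_model(text)
    if random.random() < watermark_rate:
        # Deliberately return the wrong label and remember the query.
        answer = "negative" if answer == "positive" else "positive"
        watermark_log.append((text, answer))
    return answer

for q in ["great plot", "boring acting"] * 2000:
    watermarked_api(q)
print(f"{len(watermark_log)} watermarked queries logged")
# A suspect model that agrees with these stored wrong answers far above chance was
# very likely trained on this API's outputs. (As the slide notes, an adaptive
# attacker can still defeat both defences.)
```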
43. 43
• LMs can generate coherent, relatable text, either from scratch or by completing a passage started by the user.
• BUT, they are hard to steer or control.
• Can also be triggered by certain adversarial attacks
Introduction
44. 44
• Controlled generation: adding knobs with conditional probability
• Consists of 3 steps
Controlling the Mammoth
46. 46
• Controlled generation: adding knobs with conditional probability (toy example below)
• Consists of 3 steps
• Also allows a reduction in toxicity: 63% to ~5%!
Controlling the Mammoth
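The following toy illustrates the control knob: nudge the model's hidden state in the direction that makes a simple attribute classifier happier, then re-decode from the perturbed state. The one-layer "language model" and logistic attribute model are hypothetical stand-ins for GPT-2 and PPLM's attribute models, but the gradient-ascent-on-latents step is the same idea.

```python
import numpy as np

rng = np.random.default_rng(0)
VOCAB = ["good", "bad", "movie", "science", "politics"]

# Toy "LM": a hidden state h is decoded into next-token logits with W_out.
W_out = rng.standard_normal((len(VOCAB), 8))
# Toy attribute model p(attribute | h) = sigmoid(w_attr . h), e.g. "positive sentiment".
w_attr = rng.standard_normal(8)

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def steer(h, steps=10, step_size=0.5):
    """Perturb the latent h by gradient ascent on log p(attribute | h)."""
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-w_attr @ h))
        grad = (1.0 - p) * w_attr          # d/dh of log sigmoid(w_attr . h)
        h = h + step_size * grad           # push the latent toward the attribute
    return h

h = rng.standard_normal(8)
before = softmax(W_out @ h)
after = softmax(W_out @ steer(h))
for tok, p0, p1 in zip(VOCAB, before, after):
    print(f"{tok:8s}  p before {p0:.2f}  p after {p1:.2f}")
# In PPLM the analogous update is applied to the transformer's key/value history at
# each decoding step, so generation is steered without retraining the LM.
```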
48. 48
• Modelling is important, but looking at data is a large part of the pipeline
• Manual data inspection is problematic for privacy-sensitive datasets
• Problem: your model resides on your server, the data on end devices
Introduction
50. 50
• DP Federated GANs (sketched below):
 - Train on user devices
 - Inspect generated (synthetic) data instead of raw user data
• Repository showcases:
 - Language modelling with a DP RNN
 - Image modelling with DP GANs
Suggested Solutions
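Below is a bare-bones sketch of the differentially-private federated step behind this idea: each device computes an update to the shared generator locally, and the server clips each update's norm, averages, and adds Gaussian noise, so raw user data never leaves the device. The clipping norm, noise scale and update shapes are arbitrary placeholders, not the paper's DP-FedAvg settings.

```python
import numpy as np

rng = np.random.default_rng(0)

def dp_federated_average(client_updates, clip_norm=1.0, noise_multiplier=1.1):
    """Clip each client's update, average, and add Gaussian noise (DP-FedAvg style)."""
    clipped = []
    for update in client_updates:
        norm = np.linalg.norm(update)
        clipped.append(update * min(1.0, clip_norm / (norm + 1e-12)))
    mean_update = np.mean(clipped, axis=0)
    sigma = noise_multiplier * clip_norm / len(client_updates)
    return mean_update + rng.normal(0.0, sigma, size=mean_update.shape)

# Toy round: 100 devices each send a local update to the shared generator's weights.
updates = [rng.standard_normal(16) * 0.1 for _ in range(100)]
new_delta = dp_federated_average(updates)
print(new_delta.shape, float(np.linalg.norm(new_delta)))
# The analyst then inspects *samples from the trained generator* (synthetic data)
# instead of looking at raw, privacy-sensitive user data.
```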
Who is H2O.ai?
H2O.ai was founded in Silicon Valley in 2012 and closed a Series D round in August 2019, with Goldman Sachs leading the round and Ping An, an insurance and finance group out of China, contributing as well. On the customer side, investors included Wells Fargo, along with strategic partner NVIDIA.
H2O.ai is the creator of the open source H2O platform. Nearly 20,000 organizations, businesses, governments, and universities use H2O.
H2O.ai also brought H2O Driverless AI to market in late 2017. It is the premier product for automatic machine learning.
The team is over 200 people, including some of the world's best AI experts, among them Kaggle Grandmasters. Kaggle is an online competition platform for data scientists, who compete for fame and money by delivering the best data science results. Companies offer a challenge and some prize money, and data scientists spend time fine-tuning their models to get the best results. When they win a number of competitions, they can claim a Grandmaster title, similar to a Chess Grandmaster. H2O.ai has 13 of the roughly 140 Kaggle Grandmasters on the planet today. H2O.ai talent also extends to distributed computing experts and visualization experts (Leland Wilkinson), among others.
Finally, H2O.ai is global. Headquartered in Mountain View, CA, we have offices in Prague (AI Center of Excellence), London, NYC, and India.