PR-043: HyperNetworks


Paper review: "HyperNetworks" by David Ha, Andrew Dai, Quoc V. Le (ICLR 2017)
Presented at the TensorFlow-KR paper review forum (#PR12) by Taesu Kim
Paper link: https://arxiv.org/abs/1609.09106
Video link: https://www.youtube.com/watch?v=-tUQXSdEsMk (in Korean)
http://www.neosapience.com


HyperNetworks
Presented by Taesu Kim
Oct 29, 2017
David Ha, Andrew Dai, Quoc V. Le
Google Brain
Published at ICLR 2017

HyperNetworks overview
› An approach that uses one network (the hypernetwork) to generate the weights of another network (the main network)
› Motivated by HyperNEAT (Stanley et al. 2009); it mirrors the relationship between genotype and phenotype in nature
› A hypernetwork can be viewed as a relaxed form of weight sharing across layers
› It generates non-shared weights for an LSTM and achieves near state-of-the-art results
› It generates shared weights for a CNN and achieves respectable results with fewer learnable parameters
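
The weight-generation idea in the overview can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: it assumes a single linear projection from a per-layer embedding z to a flattened weight matrix, and the names `make_hypernetwork`, `generate_weights`, and `layer_embeddings` are hypothetical.

```python
import random

def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]

def make_hypernetwork(embedding_size, rows, cols, seed=0):
    """Create parameters of a linear hypernetwork: z -> flattened weight matrix.

    Assumption: one linear projection plus bias, chosen for brevity.
    """
    rng = random.Random(seed)
    projection = [[rng.gauss(0.0, 0.1) for _ in range(embedding_size)]
                  for _ in range(rows * cols)]
    bias = [0.0] * (rows * cols)
    return projection, bias

def generate_weights(hypernetwork, z, rows, cols):
    """Map a layer embedding z to a (rows x cols) weight matrix for the main network."""
    projection, bias = hypernetwork
    flat = [p + b for p, b in zip(matvec(projection, z), bias)]
    return [flat[r * cols:(r + 1) * cols] for r in range(rows)]

ROWS, COLS, EMBEDDING_SIZE = 3, 4, 2
hypernetwork = make_hypernetwork(EMBEDDING_SIZE, ROWS, COLS)

# One small embedding per main-network layer; the hypernetwork itself is shared.
layer_embeddings = [[1.0, 0.0], [0.0, 1.0]]
layer_weights = [generate_weights(hypernetwork, z, ROWS, COLS)
                 for z in layer_embeddings]

# The generated matrices are then used like ordinary layer weights.
x = [1.0, 2.0, 3.0, 4.0]
layer_outputs = [matvec(W, x) for W in layer_weights]
```

In this sketch the only per-layer parameters are the two-element embeddings, which is what makes the scheme a relaxed form of weight sharing: layers get different weights, but all of them come from the same generator.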

Conventional Networks
› Feedforward Networks
› Recurrent Networks

Static HyperNetworks

HyperCNN

Dynamic HyperNetworks

HyperRNN

Modified HyperRNN
› The basic HyperRNN requires Nz times more memory than a basic RNN
› The modified version is more scalable and memory efficient
› It uses an intermediate hidden vector d(z) to parameterize a weight matrix, where d(z) is a linear projection of z
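
The memory argument can be checked with a small sketch. It assumes that the weight matrix is parameterized by scaling row i of a fixed matrix W_h by d_i(z), and it covers only the hidden-to-hidden matrix (input weights and biases are left out). The names `W_h`, `W_d`, `scale_rows`, and the two parameter-count helpers are illustrative, not taken from the slides.

```python
def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]

def scale_rows(matrix, d):
    """Explicitly build W(d): row i of the matrix multiplied by d[i]."""
    return [[d_i * w for w in row] for d_i, row in zip(d, matrix)]

def full_hyper_rnn_params(n_h, n_z):
    """One N_h x N_h matrix per element of z (N_z times a basic RNN)."""
    return n_z * n_h * n_h

def modified_hyper_rnn_params(n_h, n_z):
    """One shared N_h x N_h matrix plus an N_h x N_z projection for d(z)."""
    return n_h * n_h + n_z * n_h

W_h = [[1.0, 2.0], [3.0, 4.0]]            # shared hidden-to-hidden weights
W_d = [[0.5, 0.0, 1.0], [0.0, 2.0, 0.0]]  # linear projection z -> d(z)
z = [1.0, 1.0, 1.0]
h_prev = [1.0, -1.0]

d = matvec(W_d, z)

# Building the scaled matrix explicitly ...
explicit = matvec(scale_rows(W_h, d), h_prev)
# ... gives the same result as scaling the ordinary matrix-vector product.
cheap = [d_i * a for d_i, a in zip(d, matvec(W_h, h_prev))]
```

Because the row scaling can be applied after the matrix-vector product, the scaled matrix never has to be materialized; under these assumptions, a 1000-unit main network with Nz = 4 needs about 1.0M parameters for this matrix instead of 4.0M.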

HyperLSTM
https://github.com/hardmaru/supercell/
LSTM implementation

MNIST and CIFAR-10
40-1: N=6 k=1
40-2: N=6 k=2

Character-level Penn Treebank Language Model
› Main LSTM with 1000 units and two versions of the HyperLSTM
– HyperLSTM cell with 128 units and embedding size 4
– HyperLSTM cell with 128 units and embedding size 16 → dropout keep probability of 85%
› HyperLSTM outperforms the standard LSTM
› HyperLSTM achieves improvements similar to Layer Normalization → combining Layer Normalization with HyperLSTM achieves the best test result

Hutter Prize Wikipedia Language Model
› Main LSTM with 1800 units, HyperLSTM cell with 256 units and embedding size 64, max sequence length 250
› Main LSTM with 2048 units, HyperLSTM cell with 256 units and embedding size 64, max sequence length 300
› HyperLSTM achieves improvements similar to Layer Normalization → combining Layer Normalization with HyperLSTM achieves the best test result
› HyperLSTM converges more quickly than LSTM and Layer Norm LSTM

Hutter Prize Wikipedia Language Model
› Visualization of how the weight scaling vectors of the main LSTM change during character sampling
› In regions of low intensity, where the weights of the main LSTM are relatively static, the generated phrases seem more deterministic
– For example, the weights do not change much during the words "Europeans", "possessions" and "reservation"
› Regions of high intensity are where the HyperLSTM cell makes relatively large changes to the weights of the main LSTM
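
One way to turn a sequence of weight scaling vectors into a per-character "intensity" is sketched below. The slide does not state which measure the visualization uses, so the L2 norm of the step-to-step change is an assumption here, and `change_intensity` and `d_over_time` are hypothetical names with made-up values.

```python
import math

def change_intensity(scaling_vectors):
    """L2 norm of the change in the scaling vector between consecutive steps.

    Assumption: this is one plausible intensity measure, not necessarily
    the one behind the original figure.
    """
    intensities = []
    for previous, current in zip(scaling_vectors, scaling_vectors[1:]):
        squared = sum((c - p) ** 2 for p, c in zip(previous, current))
        intensities.append(math.sqrt(squared))
    return intensities

# Toy scaling vectors d_t for four sampling steps (illustrative values only).
d_over_time = [
    [1.0, 1.0],
    [1.0, 1.0],  # no change: low intensity
    [4.0, 5.0],  # large change: high intensity
    [4.0, 5.0],  # static again
]
intensity = change_intensity(d_over_time)
```

Steps where the vector stays put give an intensity of zero, while a jump in the scaling vector shows up as a spike at that character.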

Hutter Prize Wikipedia Language Model
› Normalized histogram plots of φ(c_t) for different models during sampling
– φ(c_t) is the hidden state of the LSTM before the output gate is applied
› Layer Norm reduces the saturation effects compared to the vanilla LSTM
› In HyperLSTM, the cell is saturated most of the time
– The HyperLSTM cell's dynamic weight adjustment policy appears to do something very different from statistical normalization
– Yet the policy it arrives at ends up providing performance similar to Layer Norm
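
A normalized histogram of φ(c_t) and a simple saturation measure can be sketched as follows. The sketch assumes φ is tanh; the 0.9 saturation threshold, the bin count, and the sample cell states are illustrative choices, not values from the slides.

```python
import math

def normalized_histogram(values, num_bins=4, low=-1.0, high=1.0):
    """Fraction of values falling into each of num_bins equal-width bins."""
    counts = [0] * num_bins
    width = (high - low) / num_bins
    for v in values:
        # Clamp so that a value equal to `high` lands in the last bin.
        index = min(int((v - low) / width), num_bins - 1)
        counts[index] += 1
    return [c / len(values) for c in counts]

def saturated_fraction(values, threshold=0.9):
    """Fraction of activations whose magnitude exceeds the threshold."""
    return sum(1 for v in values if abs(v) > threshold) / len(values)

# Toy cell states c_t collected during sampling (made-up values).
cell_states = [-5.0, -0.2, 0.0, 0.1, 3.0, 6.0, -4.0, 0.3]
activations = [math.tanh(c) for c in cell_states]  # phi(c_t), assuming phi = tanh

histogram = normalized_histogram(activations)
saturation = saturated_fraction(activations)
```

A model whose histogram mass sits in the outermost bins is saturated most of the time; mass concentrated near the center indicates the opposite.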

Handwriting sequence generation
› 12,179 handwritten lines from 221 writers
› The LSTM input is the (x, y) coordinate of the pen location and a binary pen-up/pen-down indicator
› Many of the weight changes occur at the boundaries between words and between characters
› Dynamically generating the generative model is one of the key advantages of HyperLSTM over a normal LSTM

Machine translation
› WMT’14 En→Fr, using the same test/validation set split described in the GNMT paper
– The GNMT network has 8 layers in each of the encoder and decoder
› The HyperLSTM cell improves the performance of the existing GNMT model, achieving state-of-the-art single-model results for this dataset
› This demonstrates the applicability of hypernetworks to large-scale models used in production systems

Contact us: contact@neosapience.com
For more information: http://www.neosapience.com
