SlideShare a Scribd company logo
1 of 28
Download to read offline
Image to Prompts
Stable Diffusion - Image to Prompts
Featured Code Competition, Kaggle
Pin-Yen Liu, Po-Chun Huang, Po-Chuan Chen
June 2, 2023
1 / 28
Image to Prompts
Table of contents
1 Motivation
2 Dataset
3 Method
4 Analysis
5 Discussion
6 Reference
2 / 28
Image to Prompts
Motivation
Table of contents
1 Motivation
2 Dataset
3 Method
4 Analysis
5 Discussion
6 Reference
3 / 28
Image to Prompts
Motivation
Motivation
This project is a Kaggle competition called Image to Prompts [1].
In this competition, we need to reverse the typical direction of a
generative text-to-image model: instead of generating an image from
a text prompt.
We want to create a model which can predict the text prompt given a
generated image. And making predictions on a dataset containing a
wide variety of (prompt, image) pairs generated by Stable Diffusion
2.0, to understand how reversible the latent relationship is.
4 / 28
Image to Prompts
Motivation
Here has an example, if we have an image like this below
We may need to generate a prompt ultrasaurus holding a black bean
taco and a piece of cheese in the woods for this image.
5 / 28
Image to Prompts
Dataset
Table of contents
1 Motivation
2 Dataset
3 Method
4 Analysis
5 Discussion
6 Reference
6 / 28
Image to Prompts
Dataset
Dataset
Because our method is to use pre-trained model. So there has no
dataset for our method.
But, there have some datasets that can be used in this competition.
The dataset is build in (image, prompt) pairs. For example:
DiffusuinDB [8] : 14 million image-text pairs
COCO (Microsoft Common Objects in Context) [6]
: 2.5 million labeled instances in 328k images
Laion2B-en [10] : 2.3 billion image-text pairs
7 / 28
Image to Prompts
Method
Table of contents
1 Motivation
2 Dataset
3 Method
ViT
CLIP interrogator
OFA
4 Analysis
5 Discussion
8 / 28
Image to Prompts
Method
Method
Our method is to ensemble the CLIP Interrogator, OFA model,and
ViT model.
Here’s the ratio for three different model
1 Vision Transformer (ViT) model: 74.88%
2 CLIP Interrogator: 21.12%
3 OFA model fine-tuned for image captioning: 4%
9 / 28
Image to Prompts
Method
ViT
ViT
The Vision Transformer (ViT) [3] is a transformer encoder model
(BERT-like), and pretrained on ImageNet-21k (14 million images,
21,843 classes) at resolution 224x224, and fine-tuned on ImageNet
2012 (1 million images, 1,000 classes) at resolution 224x224.
10 / 28
Image to Prompts
Method
CLIP interrogator
CLIP interrogator
The CLIP Interrogator is a prompt engineering tool that combines
OpenAI’s CLIP [7] and Salesforce’s BLIP [5] to optimize text
prompts to match a given image.
Figure: CLIP interrogator : A tool on Hugging Face Space
11 / 28
Image to Prompts
Method
CLIP interrogator
CLIP interrogator pipeline
Figure: CLIP Interrogator pipeline
12 / 28
Image to Prompts
Method
CLIP interrogator
BLIP
Figure: Framework of BLIP
13 / 28
Image to Prompts
Method
OFA
OFA
OFA[9] is a unified multimodal pretrained model that unifies
modalities (i.e., cross-modality, vision, language) and tasks (e.g.,
image generation, visual grounding, image captioning, image
classification, text generation, etc.) to a simple sequence-to-sequence
learning framework.
In here, we use OFA-large-caption for image captioning for
pre-trained model, it uses COCO dataset for training the model.
14 / 28
Image to Prompts
Analysis
Table of contents
1 Motivation
2 Dataset
3 Method
4 Analysis
Sentence Transformer
ViT
CLIP interrogator
OFA
5 Discussion
15 / 28
Image to Prompts
Analysis
Sentence Transformer
For the competition, we need to calculate Stable Diffusion Prompt
Embeddings.
Input will be the prompt sentence, output will be the embedding. We
can get the embedding with using sentence transformer. The model
called Sentence-BERT [8].
Figure: SBERT architecture at inference
16 / 28
Image to Prompts
Analysis
Sentence Transformer
Figure: Ground Truth Prompts
Figure: Text Embedding for each prompt. Each of them has 384 dimensions.
17 / 28
Image to Prompts
Analysis
ViT
ViT
Figure: Label generated by Vision Transformer
18 / 28
Image to Prompts
Analysis
CLIP interrogator
CLIP interrogator
Figure: Prompts generated by CLIP interrogator
19 / 28
Image to Prompts
Analysis
OFA
OFA Model EDA
Figure: Prompts generated by OFA model
20 / 28
Image to Prompts
Analysis
OFA
Model Caption SBERT Score
ViT
comic book, triceratops,
coffee mug, jigsaw puzzle,
book jacket, dust cover,
dust jacket
0.553
CLIP
Interrogator
a cartoon dinosaur with
a piece of cheese in its mouth,
boardgamegeek, gaia,
woodland background,
pancake flat head, inspired by
Charles Fremont Conner,
mobile learning app prototype,
is essentially arbitrary, blue scales,
museum item, nachos, prehistory, plates
0.739
OFA
A blue dinosaur eating a
piece of cheese in a forest
0.772
21 / 28
Image to Prompts
Discussion
Table of contents
1 Motivation
2 Dataset
3 Method
4 Analysis
5 Discussion
6 Reference
22 / 28
Image to Prompts
Discussion
Discussion
This competition aims to explore the reversibility of the latent
relationship between text prompts and generated images, challenging
participants to predict the text prompt from a given image.
It highlights the importance of prompt engineering in text-to-image
models and the potential for understanding the intricate relationships
between prompts and images.
The combination of CLIP Interrogator, OFA, and ViT models provides
a robust approach for analyzing and predicting prompt embeddings.
Figure: Ranking on Kaggle Leader-board (Top 36%)
23 / 28
Image to Prompts
Discussion
Future work
But the method for this competition can still be improved.
Using the generation of a large dataset of images and prompts, along
with modifications and training of models, demonstrates the potential
for further improvements in the field of text-to-image generation and
prompt engineering (Score: 0.66316, Rank 1) [2].
Or, the utilization of a ViT-based method and the creation of a
customized dataset, indicating the importance of supervised learning
in predicting sentence embeddings and its relevance to advancing
text-to-image models (Score: 0.65179, Rank 2) [4].
24 / 28
Image to Prompts
Reference
Table of contents
1 Motivation
2 Dataset
3 Method
4 Analysis
5 Discussion
6 Reference
25 / 28
Image to Prompts
Reference
References I
[1] Will Cukierski Ashley Chow inversion. Stable Diffusion -
Image to Prompts. 2023. url:
https://kaggle.com/competitions/stable-
diffusion-image-to-prompts.
[2] bestfitting. 1st Place Solution. 2023. url:
https://www.kaggle.com/competitions/stable-
diffusion-image-to-prompts/discussion/411237.
[3] Alexey Dosovitskiy et al. An Image is Worth 16x16 Words:
Transformers for Image Recognition at Scale. 2021. arXiv:
2010.11929 [cs.CV].
[4] KaizaburoChubachi. 2nd place solution. 2023. url:
https://www.kaggle.com/competitions/stable-
diffusion-image-to-prompts/discussion/410606.
26 / 28
Image to Prompts
Reference
References II
[5] Junnan Li et al. BLIP: Bootstrapping Language-Image
Pre-training for Unified Vision-Language Understanding and
Generation. 2022. arXiv: 2201.12086 [cs.CV].
[6] Tsung-Yi Lin et al. Microsoft COCO: Common Objects in
Context. 2015. arXiv: 1405.0312 [cs.CV].
[7] Alec Radford et al. Learning Transferable Visual Models From
Natural Language Supervision. 2021. arXiv: 2103.00020
[cs.CV].
[8] Nils Reimers and Iryna Gurevych. “Sentence-BERT: Sentence
Embeddings using Siamese BERT-Networks”. In: CoRR
abs/1908.10084 (2019). arXiv: 1908.10084. url:
http://arxiv.org/abs/1908.10084.
27 / 28
Image to Prompts
Reference
References III
[9] Peng Wang et al. OFA: Unifying Architectures, Tasks, and
Modalities Through a Simple Sequence-to-Sequence Learning
Framework. 2022. arXiv: 2202.03052 [cs.CV].
[10] Ryan Webster et al. On the De-duplication of LAION-2B. 2023.
arXiv: 2303.12733 [cs.CV].
28 / 28

More Related Content

Similar to Image_to_Prompts.pdf

Web Image Retrieval Using Visual Dictionary
Web Image Retrieval Using Visual DictionaryWeb Image Retrieval Using Visual Dictionary
Web Image Retrieval Using Visual Dictionaryijwscjournal
 
Web Image Retrieval Using Visual Dictionary
Web Image Retrieval Using Visual DictionaryWeb Image Retrieval Using Visual Dictionary
Web Image Retrieval Using Visual Dictionaryijwscjournal
 
Comprehensive Performance Comparison of Cosine, Walsh, Haar, Kekre, Sine, Sla...
Comprehensive Performance Comparison of Cosine, Walsh, Haar, Kekre, Sine, Sla...Comprehensive Performance Comparison of Cosine, Walsh, Haar, Kekre, Sine, Sla...
Comprehensive Performance Comparison of Cosine, Walsh, Haar, Kekre, Sine, Sla...CSCJournals
 
Extended Performance Appraise of Image Retrieval Using the Feature Vector as ...
Extended Performance Appraise of Image Retrieval Using the Feature Vector as ...Extended Performance Appraise of Image Retrieval Using the Feature Vector as ...
Extended Performance Appraise of Image Retrieval Using the Feature Vector as ...Waqas Tariq
 
International Journal of Engineering Research and Development (IJERD)
International Journal of Engineering Research and Development (IJERD)International Journal of Engineering Research and Development (IJERD)
International Journal of Engineering Research and Development (IJERD)IJERD Editor
 
自然方策勾配法の基礎と応用
自然方策勾配法の基礎と応用自然方策勾配法の基礎と応用
自然方策勾配法の基礎と応用Ryo Iwaki
 
A comparative analysis of retrieval techniques in content based image retrieval
A comparative analysis of retrieval techniques in content based image retrievalA comparative analysis of retrieval techniques in content based image retrieval
A comparative analysis of retrieval techniques in content based image retrievalcsandit
 
A COMPARATIVE ANALYSIS OF RETRIEVAL TECHNIQUES IN CONTENT BASED IMAGE RETRIEVAL
A COMPARATIVE ANALYSIS OF RETRIEVAL TECHNIQUES IN CONTENT BASED IMAGE RETRIEVALA COMPARATIVE ANALYSIS OF RETRIEVAL TECHNIQUES IN CONTENT BASED IMAGE RETRIEVAL
A COMPARATIVE ANALYSIS OF RETRIEVAL TECHNIQUES IN CONTENT BASED IMAGE RETRIEVALcscpconf
 
IMAGE CAPTIONING USING TRANSFORMER: VISIONAID
IMAGE CAPTIONING USING TRANSFORMER: VISIONAIDIMAGE CAPTIONING USING TRANSFORMER: VISIONAID
IMAGE CAPTIONING USING TRANSFORMER: VISIONAIDIRJET Journal
 
IRJET-Image Question Answering: A Review
IRJET-Image Question Answering: A ReviewIRJET-Image Question Answering: A Review
IRJET-Image Question Answering: A ReviewIRJET Journal
 
Cartoon Based Image Retrieval : An Indexing Approach
Cartoon Based Image Retrieval : An Indexing ApproachCartoon Based Image Retrieval : An Indexing Approach
Cartoon Based Image Retrieval : An Indexing Approachmlaij
 
WEB IMAGE RETRIEVAL USING CLUSTERING APPROACHES
WEB IMAGE RETRIEVAL USING CLUSTERING APPROACHESWEB IMAGE RETRIEVAL USING CLUSTERING APPROACHES
WEB IMAGE RETRIEVAL USING CLUSTERING APPROACHEScscpconf
 
A Hybrid Approach for Content Based Image Retrieval System
A Hybrid Approach for Content Based Image Retrieval SystemA Hybrid Approach for Content Based Image Retrieval System
A Hybrid Approach for Content Based Image Retrieval SystemIOSR Journals
 
Implementation of Picwords to Warping Pictures and Keywords through Calligram
Implementation of Picwords to Warping Pictures and Keywords through CalligramImplementation of Picwords to Warping Pictures and Keywords through Calligram
Implementation of Picwords to Warping Pictures and Keywords through CalligramIRJET Journal
 
Modelling User Interaction utilising Information Foraging Theory (and a bit o...
Modelling User Interaction utilising Information Foraging Theory (and a bit o...Modelling User Interaction utilising Information Foraging Theory (and a bit o...
Modelling User Interaction utilising Information Foraging Theory (and a bit o...Ingo Frommholz
 
TEMPLATE MATCHING TECHNIQUE FOR SEARCHING WORDS IN DOCUMENT IMAGES
TEMPLATE MATCHING TECHNIQUE FOR SEARCHING WORDS IN DOCUMENT IMAGESTEMPLATE MATCHING TECHNIQUE FOR SEARCHING WORDS IN DOCUMENT IMAGES
TEMPLATE MATCHING TECHNIQUE FOR SEARCHING WORDS IN DOCUMENT IMAGESIJCI JOURNAL
 
LEARNING TO RANK IMAGE TAGS WITH LIMITED TRAINING EXAMPLES - IEEE PROJECTS I...
LEARNING TO RANK IMAGE TAGS WITH LIMITED TRAINING EXAMPLES  - IEEE PROJECTS I...LEARNING TO RANK IMAGE TAGS WITH LIMITED TRAINING EXAMPLES  - IEEE PROJECTS I...
LEARNING TO RANK IMAGE TAGS WITH LIMITED TRAINING EXAMPLES - IEEE PROJECTS I...Nexgen Technology
 
Learning to rank image tags with limited
Learning to rank image tags with limitedLearning to rank image tags with limited
Learning to rank image tags with limitednexgentech15
 
LEARNING TO RANK IMAGE TAGS WITH LIMITED TRAINING EXAMPLES
LEARNING TO RANK IMAGE TAGS WITH LIMITED TRAINING EXAMPLESLEARNING TO RANK IMAGE TAGS WITH LIMITED TRAINING EXAMPLES
LEARNING TO RANK IMAGE TAGS WITH LIMITED TRAINING EXAMPLESnexgentechnology
 

Similar to Image_to_Prompts.pdf (20)

Web Image Retrieval Using Visual Dictionary
Web Image Retrieval Using Visual DictionaryWeb Image Retrieval Using Visual Dictionary
Web Image Retrieval Using Visual Dictionary
 
Web Image Retrieval Using Visual Dictionary
Web Image Retrieval Using Visual DictionaryWeb Image Retrieval Using Visual Dictionary
Web Image Retrieval Using Visual Dictionary
 
Comprehensive Performance Comparison of Cosine, Walsh, Haar, Kekre, Sine, Sla...
Comprehensive Performance Comparison of Cosine, Walsh, Haar, Kekre, Sine, Sla...Comprehensive Performance Comparison of Cosine, Walsh, Haar, Kekre, Sine, Sla...
Comprehensive Performance Comparison of Cosine, Walsh, Haar, Kekre, Sine, Sla...
 
Extended Performance Appraise of Image Retrieval Using the Feature Vector as ...
Extended Performance Appraise of Image Retrieval Using the Feature Vector as ...Extended Performance Appraise of Image Retrieval Using the Feature Vector as ...
Extended Performance Appraise of Image Retrieval Using the Feature Vector as ...
 
International Journal of Engineering Research and Development (IJERD)
International Journal of Engineering Research and Development (IJERD)International Journal of Engineering Research and Development (IJERD)
International Journal of Engineering Research and Development (IJERD)
 
自然方策勾配法の基礎と応用
自然方策勾配法の基礎と応用自然方策勾配法の基礎と応用
自然方策勾配法の基礎と応用
 
A comparative analysis of retrieval techniques in content based image retrieval
A comparative analysis of retrieval techniques in content based image retrievalA comparative analysis of retrieval techniques in content based image retrieval
A comparative analysis of retrieval techniques in content based image retrieval
 
A COMPARATIVE ANALYSIS OF RETRIEVAL TECHNIQUES IN CONTENT BASED IMAGE RETRIEVAL
A COMPARATIVE ANALYSIS OF RETRIEVAL TECHNIQUES IN CONTENT BASED IMAGE RETRIEVALA COMPARATIVE ANALYSIS OF RETRIEVAL TECHNIQUES IN CONTENT BASED IMAGE RETRIEVAL
A COMPARATIVE ANALYSIS OF RETRIEVAL TECHNIQUES IN CONTENT BASED IMAGE RETRIEVAL
 
IMAGE CAPTIONING USING TRANSFORMER: VISIONAID
IMAGE CAPTIONING USING TRANSFORMER: VISIONAIDIMAGE CAPTIONING USING TRANSFORMER: VISIONAID
IMAGE CAPTIONING USING TRANSFORMER: VISIONAID
 
IRJET-Image Question Answering: A Review
IRJET-Image Question Answering: A ReviewIRJET-Image Question Answering: A Review
IRJET-Image Question Answering: A Review
 
Cartoon Based Image Retrieval : An Indexing Approach
Cartoon Based Image Retrieval : An Indexing ApproachCartoon Based Image Retrieval : An Indexing Approach
Cartoon Based Image Retrieval : An Indexing Approach
 
IPT.pdf
IPT.pdfIPT.pdf
IPT.pdf
 
WEB IMAGE RETRIEVAL USING CLUSTERING APPROACHES
WEB IMAGE RETRIEVAL USING CLUSTERING APPROACHESWEB IMAGE RETRIEVAL USING CLUSTERING APPROACHES
WEB IMAGE RETRIEVAL USING CLUSTERING APPROACHES
 
A Hybrid Approach for Content Based Image Retrieval System
A Hybrid Approach for Content Based Image Retrieval SystemA Hybrid Approach for Content Based Image Retrieval System
A Hybrid Approach for Content Based Image Retrieval System
 
Implementation of Picwords to Warping Pictures and Keywords through Calligram
Implementation of Picwords to Warping Pictures and Keywords through CalligramImplementation of Picwords to Warping Pictures and Keywords through Calligram
Implementation of Picwords to Warping Pictures and Keywords through Calligram
 
Modelling User Interaction utilising Information Foraging Theory (and a bit o...
Modelling User Interaction utilising Information Foraging Theory (and a bit o...Modelling User Interaction utilising Information Foraging Theory (and a bit o...
Modelling User Interaction utilising Information Foraging Theory (and a bit o...
 
TEMPLATE MATCHING TECHNIQUE FOR SEARCHING WORDS IN DOCUMENT IMAGES
TEMPLATE MATCHING TECHNIQUE FOR SEARCHING WORDS IN DOCUMENT IMAGESTEMPLATE MATCHING TECHNIQUE FOR SEARCHING WORDS IN DOCUMENT IMAGES
TEMPLATE MATCHING TECHNIQUE FOR SEARCHING WORDS IN DOCUMENT IMAGES
 
LEARNING TO RANK IMAGE TAGS WITH LIMITED TRAINING EXAMPLES - IEEE PROJECTS I...
LEARNING TO RANK IMAGE TAGS WITH LIMITED TRAINING EXAMPLES  - IEEE PROJECTS I...LEARNING TO RANK IMAGE TAGS WITH LIMITED TRAINING EXAMPLES  - IEEE PROJECTS I...
LEARNING TO RANK IMAGE TAGS WITH LIMITED TRAINING EXAMPLES - IEEE PROJECTS I...
 
Learning to rank image tags with limited
Learning to rank image tags with limitedLearning to rank image tags with limited
Learning to rank image tags with limited
 
LEARNING TO RANK IMAGE TAGS WITH LIMITED TRAINING EXAMPLES
LEARNING TO RANK IMAGE TAGS WITH LIMITED TRAINING EXAMPLESLEARNING TO RANK IMAGE TAGS WITH LIMITED TRAINING EXAMPLES
LEARNING TO RANK IMAGE TAGS WITH LIMITED TRAINING EXAMPLES
 

More from Po-Chuan Chen

E-CORE: Emotion Correlation Enhanced Empathetic Dialogue Generation.pdf
E-CORE: Emotion Correlation Enhanced Empathetic Dialogue Generation.pdfE-CORE: Emotion Correlation Enhanced Empathetic Dialogue Generation.pdf
E-CORE: Emotion Correlation Enhanced Empathetic Dialogue Generation.pdfPo-Chuan Chen
 
Effective Structured Prompting by Meta-Learning and Representative Verbalizer...
Effective Structured Prompting by Meta-Learning and Representative Verbalizer...Effective Structured Prompting by Meta-Learning and Representative Verbalizer...
Effective Structured Prompting by Meta-Learning and Representative Verbalizer...Po-Chuan Chen
 
Quark: Controllable Text Generation with Reinforced [Un]learning.pdf
Quark: Controllable Text Generation with Reinforced [Un]learning.pdfQuark: Controllable Text Generation with Reinforced [Un]learning.pdf
Quark: Controllable Text Generation with Reinforced [Un]learning.pdfPo-Chuan Chen
 
Empathetic Dialogue Generation via Sensitive Emotion Recognition and Sensible...
Empathetic Dialogue Generation via Sensitive Emotion Recognition and Sensible...Empathetic Dialogue Generation via Sensitive Emotion Recognition and Sensible...
Empathetic Dialogue Generation via Sensitive Emotion Recognition and Sensible...Po-Chuan Chen
 
On the Effectiveness of Offline RL for Dialogue Response Generation.pdf
On the Effectiveness of Offline RL for Dialogue Response Generation.pdfOn the Effectiveness of Offline RL for Dialogue Response Generation.pdf
On the Effectiveness of Offline RL for Dialogue Response Generation.pdfPo-Chuan Chen
 
GPTQ: Accurate Post-Training Quantization for Generative Pre-trained Transfor...
GPTQ: Accurate Post-Training Quantization for Generative Pre-trained Transfor...GPTQ: Accurate Post-Training Quantization for Generative Pre-trained Transfor...
GPTQ: Accurate Post-Training Quantization for Generative Pre-trained Transfor...Po-Chuan Chen
 
A Statistical Perspective on Retrieval-Based Models.pdf
A Statistical Perspective on Retrieval-Based Models.pdfA Statistical Perspective on Retrieval-Based Models.pdf
A Statistical Perspective on Retrieval-Based Models.pdfPo-Chuan Chen
 
A Neural Corpus Indexer for Document Retrieval.pdf
A Neural Corpus Indexer for Document Retrieval.pdfA Neural Corpus Indexer for Document Retrieval.pdf
A Neural Corpus Indexer for Document Retrieval.pdfPo-Chuan Chen
 
AdaMix: Mixture-of-Adaptations for Parameter-efficient Model Tuning.pdf
AdaMix: Mixture-of-Adaptations for Parameter-efficient Model Tuning.pdfAdaMix: Mixture-of-Adaptations for Parameter-efficient Model Tuning.pdf
AdaMix: Mixture-of-Adaptations for Parameter-efficient Model Tuning.pdfPo-Chuan Chen
 
LLaMA-Adapter: Efficient Fine-tuning of Language Models with Zero-init Attent...
LLaMA-Adapter: Efficient Fine-tuning of Language Models with Zero-init Attent...LLaMA-Adapter: Efficient Fine-tuning of Language Models with Zero-init Attent...
LLaMA-Adapter: Efficient Fine-tuning of Language Models with Zero-init Attent...Po-Chuan Chen
 
Active Retrieval Augmented Generation.pdf
Active Retrieval Augmented Generation.pdfActive Retrieval Augmented Generation.pdf
Active Retrieval Augmented Generation.pdfPo-Chuan Chen
 
Offline Reinforcement Learning for Informal Summarization in Online Domains.pdf
Offline Reinforcement Learning for Informal Summarization in Online Domains.pdfOffline Reinforcement Learning for Informal Summarization in Online Domains.pdf
Offline Reinforcement Learning for Informal Summarization in Online Domains.pdfPo-Chuan Chen
 
Cold_Start_Reinforcement_Learning_with_Softmax_Policy_Gradient.pdf
Cold_Start_Reinforcement_Learning_with_Softmax_Policy_Gradient.pdfCold_Start_Reinforcement_Learning_with_Softmax_Policy_Gradient.pdf
Cold_Start_Reinforcement_Learning_with_Softmax_Policy_Gradient.pdfPo-Chuan Chen
 
Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks.pdf
Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks.pdfRetrieval-Augmented Generation for Knowledge-Intensive NLP Tasks.pdf
Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks.pdfPo-Chuan Chen
 
Evaluating Parameter Efficient Learning for Generation.pdf
Evaluating Parameter Efficient Learning for Generation.pdfEvaluating Parameter Efficient Learning for Generation.pdf
Evaluating Parameter Efficient Learning for Generation.pdfPo-Chuan Chen
 
Off-Policy Deep Reinforcement Learning without Exploration.pdf
Off-Policy Deep Reinforcement Learning without Exploration.pdfOff-Policy Deep Reinforcement Learning without Exploration.pdf
Off-Policy Deep Reinforcement Learning without Exploration.pdfPo-Chuan Chen
 
A Mixture-of-Expert Approach to RL-based Dialogue Management.pdf
A Mixture-of-Expert Approach to RL-based Dialogue Management.pdfA Mixture-of-Expert Approach to RL-based Dialogue Management.pdf
A Mixture-of-Expert Approach to RL-based Dialogue Management.pdfPo-Chuan Chen
 
Is Reinforcement Learning (Not) for Natural Language Processing.pdf
Is Reinforcement Learning (Not) for Natural
Language Processing.pdfIs Reinforcement Learning (Not) for Natural
Language Processing.pdf
Is Reinforcement Learning (Not) for Natural Language Processing.pdfPo-Chuan Chen
 
HyperPrompt:Prompt-based Task-Conditioning of Transformerspdf
HyperPrompt:Prompt-based Task-Conditioning of TransformerspdfHyperPrompt:Prompt-based Task-Conditioning of Transformerspdf
HyperPrompt:Prompt-based Task-Conditioning of TransformerspdfPo-Chuan Chen
 
Training language models to follow instructions with human feedback.pdf
Training language models to follow instructions
with human feedback.pdfTraining language models to follow instructions
with human feedback.pdf
Training language models to follow instructions with human feedback.pdfPo-Chuan Chen
 

More from Po-Chuan Chen (20)

E-CORE: Emotion Correlation Enhanced Empathetic Dialogue Generation.pdf
E-CORE: Emotion Correlation Enhanced Empathetic Dialogue Generation.pdfE-CORE: Emotion Correlation Enhanced Empathetic Dialogue Generation.pdf
E-CORE: Emotion Correlation Enhanced Empathetic Dialogue Generation.pdf
 
Effective Structured Prompting by Meta-Learning and Representative Verbalizer...
Effective Structured Prompting by Meta-Learning and Representative Verbalizer...Effective Structured Prompting by Meta-Learning and Representative Verbalizer...
Effective Structured Prompting by Meta-Learning and Representative Verbalizer...
 
Quark: Controllable Text Generation with Reinforced [Un]learning.pdf
Quark: Controllable Text Generation with Reinforced [Un]learning.pdfQuark: Controllable Text Generation with Reinforced [Un]learning.pdf
Quark: Controllable Text Generation with Reinforced [Un]learning.pdf
 
Empathetic Dialogue Generation via Sensitive Emotion Recognition and Sensible...
Empathetic Dialogue Generation via Sensitive Emotion Recognition and Sensible...Empathetic Dialogue Generation via Sensitive Emotion Recognition and Sensible...
Empathetic Dialogue Generation via Sensitive Emotion Recognition and Sensible...
 
On the Effectiveness of Offline RL for Dialogue Response Generation.pdf
On the Effectiveness of Offline RL for Dialogue Response Generation.pdfOn the Effectiveness of Offline RL for Dialogue Response Generation.pdf
On the Effectiveness of Offline RL for Dialogue Response Generation.pdf
 
GPTQ: Accurate Post-Training Quantization for Generative Pre-trained Transfor...
GPTQ: Accurate Post-Training Quantization for Generative Pre-trained Transfor...GPTQ: Accurate Post-Training Quantization for Generative Pre-trained Transfor...
GPTQ: Accurate Post-Training Quantization for Generative Pre-trained Transfor...
 
A Statistical Perspective on Retrieval-Based Models.pdf
A Statistical Perspective on Retrieval-Based Models.pdfA Statistical Perspective on Retrieval-Based Models.pdf
A Statistical Perspective on Retrieval-Based Models.pdf
 
A Neural Corpus Indexer for Document Retrieval.pdf
A Neural Corpus Indexer for Document Retrieval.pdfA Neural Corpus Indexer for Document Retrieval.pdf
A Neural Corpus Indexer for Document Retrieval.pdf
 
AdaMix: Mixture-of-Adaptations for Parameter-efficient Model Tuning.pdf
AdaMix: Mixture-of-Adaptations for Parameter-efficient Model Tuning.pdfAdaMix: Mixture-of-Adaptations for Parameter-efficient Model Tuning.pdf
AdaMix: Mixture-of-Adaptations for Parameter-efficient Model Tuning.pdf
 
LLaMA-Adapter: Efficient Fine-tuning of Language Models with Zero-init Attent...
LLaMA-Adapter: Efficient Fine-tuning of Language Models with Zero-init Attent...LLaMA-Adapter: Efficient Fine-tuning of Language Models with Zero-init Attent...
LLaMA-Adapter: Efficient Fine-tuning of Language Models with Zero-init Attent...
 
Active Retrieval Augmented Generation.pdf
Active Retrieval Augmented Generation.pdfActive Retrieval Augmented Generation.pdf
Active Retrieval Augmented Generation.pdf
 
Offline Reinforcement Learning for Informal Summarization in Online Domains.pdf
Offline Reinforcement Learning for Informal Summarization in Online Domains.pdfOffline Reinforcement Learning for Informal Summarization in Online Domains.pdf
Offline Reinforcement Learning for Informal Summarization in Online Domains.pdf
 
Cold_Start_Reinforcement_Learning_with_Softmax_Policy_Gradient.pdf
Cold_Start_Reinforcement_Learning_with_Softmax_Policy_Gradient.pdfCold_Start_Reinforcement_Learning_with_Softmax_Policy_Gradient.pdf
Cold_Start_Reinforcement_Learning_with_Softmax_Policy_Gradient.pdf
 
Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks.pdf
Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks.pdfRetrieval-Augmented Generation for Knowledge-Intensive NLP Tasks.pdf
Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks.pdf
 
Evaluating Parameter Efficient Learning for Generation.pdf
Evaluating Parameter Efficient Learning for Generation.pdfEvaluating Parameter Efficient Learning for Generation.pdf
Evaluating Parameter Efficient Learning for Generation.pdf
 
Off-Policy Deep Reinforcement Learning without Exploration.pdf
Off-Policy Deep Reinforcement Learning without Exploration.pdfOff-Policy Deep Reinforcement Learning without Exploration.pdf
Off-Policy Deep Reinforcement Learning without Exploration.pdf
 
A Mixture-of-Expert Approach to RL-based Dialogue Management.pdf
A Mixture-of-Expert Approach to RL-based Dialogue Management.pdfA Mixture-of-Expert Approach to RL-based Dialogue Management.pdf
A Mixture-of-Expert Approach to RL-based Dialogue Management.pdf
 
Is Reinforcement Learning (Not) for Natural Language Processing.pdf
Is Reinforcement Learning (Not) for Natural
Language Processing.pdfIs Reinforcement Learning (Not) for Natural
Language Processing.pdf
Is Reinforcement Learning (Not) for Natural Language Processing.pdf
 
HyperPrompt:Prompt-based Task-Conditioning of Transformerspdf
HyperPrompt:Prompt-based Task-Conditioning of TransformerspdfHyperPrompt:Prompt-based Task-Conditioning of Transformerspdf
HyperPrompt:Prompt-based Task-Conditioning of Transformerspdf
 
Training language models to follow instructions with human feedback.pdf
Training language models to follow instructions
with human feedback.pdfTraining language models to follow instructions
with human feedback.pdf
Training language models to follow instructions with human feedback.pdf
 

Recently uploaded

High Profile Call Girls Nagpur Meera Call 7001035870 Meet With Nagpur Escorts
High Profile Call Girls Nagpur Meera Call 7001035870 Meet With Nagpur EscortsHigh Profile Call Girls Nagpur Meera Call 7001035870 Meet With Nagpur Escorts
High Profile Call Girls Nagpur Meera Call 7001035870 Meet With Nagpur EscortsCall Girls in Nagpur High Profile
 
Structural Analysis and Design of Foundations: A Comprehensive Handbook for S...
Structural Analysis and Design of Foundations: A Comprehensive Handbook for S...Structural Analysis and Design of Foundations: A Comprehensive Handbook for S...
Structural Analysis and Design of Foundations: A Comprehensive Handbook for S...Dr.Costas Sachpazis
 
247267395-1-Symmetric-and-distributed-shared-memory-architectures-ppt (1).ppt
247267395-1-Symmetric-and-distributed-shared-memory-architectures-ppt (1).ppt247267395-1-Symmetric-and-distributed-shared-memory-architectures-ppt (1).ppt
247267395-1-Symmetric-and-distributed-shared-memory-architectures-ppt (1).pptssuser5c9d4b1
 
UNIT-III FMM. DIMENSIONAL ANALYSIS
UNIT-III FMM.        DIMENSIONAL ANALYSISUNIT-III FMM.        DIMENSIONAL ANALYSIS
UNIT-III FMM. DIMENSIONAL ANALYSISrknatarajan
 
(TARA) Talegaon Dabhade Call Girls Just Call 7001035870 [ Cash on Delivery ] ...
(TARA) Talegaon Dabhade Call Girls Just Call 7001035870 [ Cash on Delivery ] ...(TARA) Talegaon Dabhade Call Girls Just Call 7001035870 [ Cash on Delivery ] ...
(TARA) Talegaon Dabhade Call Girls Just Call 7001035870 [ Cash on Delivery ] ...ranjana rawat
 
MANUFACTURING PROCESS-II UNIT-5 NC MACHINE TOOLS
MANUFACTURING PROCESS-II UNIT-5 NC MACHINE TOOLSMANUFACTURING PROCESS-II UNIT-5 NC MACHINE TOOLS
MANUFACTURING PROCESS-II UNIT-5 NC MACHINE TOOLSSIVASHANKAR N
 
Introduction to Multiple Access Protocol.pptx
Introduction to Multiple Access Protocol.pptxIntroduction to Multiple Access Protocol.pptx
Introduction to Multiple Access Protocol.pptxupamatechverse
 
Top Rated Pune Call Girls Budhwar Peth ⟟ 6297143586 ⟟ Call Me For Genuine Se...
Top Rated  Pune Call Girls Budhwar Peth ⟟ 6297143586 ⟟ Call Me For Genuine Se...Top Rated  Pune Call Girls Budhwar Peth ⟟ 6297143586 ⟟ Call Me For Genuine Se...
Top Rated Pune Call Girls Budhwar Peth ⟟ 6297143586 ⟟ Call Me For Genuine Se...Call Girls in Nagpur High Profile
 
(RIA) Call Girls Bhosari ( 7001035870 ) HI-Fi Pune Escorts Service
(RIA) Call Girls Bhosari ( 7001035870 ) HI-Fi Pune Escorts Service(RIA) Call Girls Bhosari ( 7001035870 ) HI-Fi Pune Escorts Service
(RIA) Call Girls Bhosari ( 7001035870 ) HI-Fi Pune Escorts Serviceranjana rawat
 
Processing & Properties of Floor and Wall Tiles.pptx
Processing & Properties of Floor and Wall Tiles.pptxProcessing & Properties of Floor and Wall Tiles.pptx
Processing & Properties of Floor and Wall Tiles.pptxpranjaldaimarysona
 
(SHREYA) Chakan Call Girls Just Call 7001035870 [ Cash on Delivery ] Pune Esc...
(SHREYA) Chakan Call Girls Just Call 7001035870 [ Cash on Delivery ] Pune Esc...(SHREYA) Chakan Call Girls Just Call 7001035870 [ Cash on Delivery ] Pune Esc...
(SHREYA) Chakan Call Girls Just Call 7001035870 [ Cash on Delivery ] Pune Esc...ranjana rawat
 
VIP Call Girls Service Kondapur Hyderabad Call +91-8250192130
VIP Call Girls Service Kondapur Hyderabad Call +91-8250192130VIP Call Girls Service Kondapur Hyderabad Call +91-8250192130
VIP Call Girls Service Kondapur Hyderabad Call +91-8250192130Suhani Kapoor
 
(ANVI) Koregaon Park Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...
(ANVI) Koregaon Park Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...(ANVI) Koregaon Park Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...
(ANVI) Koregaon Park Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...ranjana rawat
 
(PRIYA) Rajgurunagar Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...
(PRIYA) Rajgurunagar Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...(PRIYA) Rajgurunagar Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...
(PRIYA) Rajgurunagar Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...ranjana rawat
 
Introduction to IEEE STANDARDS and its different types.pptx
Introduction to IEEE STANDARDS and its different types.pptxIntroduction to IEEE STANDARDS and its different types.pptx
Introduction to IEEE STANDARDS and its different types.pptxupamatechverse
 
OSVC_Meta-Data based Simulation Automation to overcome Verification Challenge...
OSVC_Meta-Data based Simulation Automation to overcome Verification Challenge...OSVC_Meta-Data based Simulation Automation to overcome Verification Challenge...
OSVC_Meta-Data based Simulation Automation to overcome Verification Challenge...Soham Mondal
 
College Call Girls Nashik Nehal 7001305949 Independent Escort Service Nashik
College Call Girls Nashik Nehal 7001305949 Independent Escort Service NashikCollege Call Girls Nashik Nehal 7001305949 Independent Escort Service Nashik
College Call Girls Nashik Nehal 7001305949 Independent Escort Service NashikCall Girls in Nagpur High Profile
 

Recently uploaded (20)

High Profile Call Girls Nagpur Meera Call 7001035870 Meet With Nagpur Escorts
High Profile Call Girls Nagpur Meera Call 7001035870 Meet With Nagpur EscortsHigh Profile Call Girls Nagpur Meera Call 7001035870 Meet With Nagpur Escorts
High Profile Call Girls Nagpur Meera Call 7001035870 Meet With Nagpur Escorts
 
Structural Analysis and Design of Foundations: A Comprehensive Handbook for S...
Structural Analysis and Design of Foundations: A Comprehensive Handbook for S...Structural Analysis and Design of Foundations: A Comprehensive Handbook for S...
Structural Analysis and Design of Foundations: A Comprehensive Handbook for S...
 
247267395-1-Symmetric-and-distributed-shared-memory-architectures-ppt (1).ppt
247267395-1-Symmetric-and-distributed-shared-memory-architectures-ppt (1).ppt247267395-1-Symmetric-and-distributed-shared-memory-architectures-ppt (1).ppt
247267395-1-Symmetric-and-distributed-shared-memory-architectures-ppt (1).ppt
 
UNIT-III FMM. DIMENSIONAL ANALYSIS
UNIT-III FMM.        DIMENSIONAL ANALYSISUNIT-III FMM.        DIMENSIONAL ANALYSIS
UNIT-III FMM. DIMENSIONAL ANALYSIS
 
★ CALL US 9953330565 ( HOT Young Call Girls In Badarpur delhi NCR
★ CALL US 9953330565 ( HOT Young Call Girls In Badarpur delhi NCR★ CALL US 9953330565 ( HOT Young Call Girls In Badarpur delhi NCR
★ CALL US 9953330565 ( HOT Young Call Girls In Badarpur delhi NCR
 
Roadmap to Membership of RICS - Pathways and Routes
Roadmap to Membership of RICS - Pathways and RoutesRoadmap to Membership of RICS - Pathways and Routes
Roadmap to Membership of RICS - Pathways and Routes
 
(TARA) Talegaon Dabhade Call Girls Just Call 7001035870 [ Cash on Delivery ] ...
(TARA) Talegaon Dabhade Call Girls Just Call 7001035870 [ Cash on Delivery ] ...(TARA) Talegaon Dabhade Call Girls Just Call 7001035870 [ Cash on Delivery ] ...
(TARA) Talegaon Dabhade Call Girls Just Call 7001035870 [ Cash on Delivery ] ...
 
MANUFACTURING PROCESS-II UNIT-5 NC MACHINE TOOLS
MANUFACTURING PROCESS-II UNIT-5 NC MACHINE TOOLSMANUFACTURING PROCESS-II UNIT-5 NC MACHINE TOOLS
MANUFACTURING PROCESS-II UNIT-5 NC MACHINE TOOLS
 
Introduction to Multiple Access Protocol.pptx
Introduction to Multiple Access Protocol.pptxIntroduction to Multiple Access Protocol.pptx
Introduction to Multiple Access Protocol.pptx
 
Top Rated Pune Call Girls Budhwar Peth ⟟ 6297143586 ⟟ Call Me For Genuine Se...
Top Rated  Pune Call Girls Budhwar Peth ⟟ 6297143586 ⟟ Call Me For Genuine Se...Top Rated  Pune Call Girls Budhwar Peth ⟟ 6297143586 ⟟ Call Me For Genuine Se...
Top Rated Pune Call Girls Budhwar Peth ⟟ 6297143586 ⟟ Call Me For Genuine Se...
 
(RIA) Call Girls Bhosari ( 7001035870 ) HI-Fi Pune Escorts Service
(RIA) Call Girls Bhosari ( 7001035870 ) HI-Fi Pune Escorts Service(RIA) Call Girls Bhosari ( 7001035870 ) HI-Fi Pune Escorts Service
(RIA) Call Girls Bhosari ( 7001035870 ) HI-Fi Pune Escorts Service
 
Processing & Properties of Floor and Wall Tiles.pptx
Processing & Properties of Floor and Wall Tiles.pptxProcessing & Properties of Floor and Wall Tiles.pptx
Processing & Properties of Floor and Wall Tiles.pptx
 
(SHREYA) Chakan Call Girls Just Call 7001035870 [ Cash on Delivery ] Pune Esc...
(SHREYA) Chakan Call Girls Just Call 7001035870 [ Cash on Delivery ] Pune Esc...(SHREYA) Chakan Call Girls Just Call 7001035870 [ Cash on Delivery ] Pune Esc...
(SHREYA) Chakan Call Girls Just Call 7001035870 [ Cash on Delivery ] Pune Esc...
 
VIP Call Girls Service Kondapur Hyderabad Call +91-8250192130
VIP Call Girls Service Kondapur Hyderabad Call +91-8250192130VIP Call Girls Service Kondapur Hyderabad Call +91-8250192130
VIP Call Girls Service Kondapur Hyderabad Call +91-8250192130
 
(ANVI) Koregaon Park Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...
(ANVI) Koregaon Park Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...(ANVI) Koregaon Park Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...
(ANVI) Koregaon Park Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...
 
(PRIYA) Rajgurunagar Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...
(PRIYA) Rajgurunagar Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...(PRIYA) Rajgurunagar Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...
(PRIYA) Rajgurunagar Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...
 
Introduction to IEEE STANDARDS and its different types.pptx
Introduction to IEEE STANDARDS and its different types.pptxIntroduction to IEEE STANDARDS and its different types.pptx
Introduction to IEEE STANDARDS and its different types.pptx
 
OSVC_Meta-Data based Simulation Automation to overcome Verification Challenge...
OSVC_Meta-Data based Simulation Automation to overcome Verification Challenge...OSVC_Meta-Data based Simulation Automation to overcome Verification Challenge...
OSVC_Meta-Data based Simulation Automation to overcome Verification Challenge...
 
Call Us -/9953056974- Call Girls In Vikaspuri-/- Delhi NCR
Call Us -/9953056974- Call Girls In Vikaspuri-/- Delhi NCRCall Us -/9953056974- Call Girls In Vikaspuri-/- Delhi NCR
Call Us -/9953056974- Call Girls In Vikaspuri-/- Delhi NCR
 
College Call Girls Nashik Nehal 7001305949 Independent Escort Service Nashik
College Call Girls Nashik Nehal 7001305949 Independent Escort Service NashikCollege Call Girls Nashik Nehal 7001305949 Independent Escort Service Nashik
College Call Girls Nashik Nehal 7001305949 Independent Escort Service Nashik
 

Image_to_Prompts.pdf

  • 1. Image to Prompts Stable Diffusion - Image to Prompts Featured Code Competition, Kaggle Pin-Yen Liu, Po-Chun Huang, Po-Chuan Chen June 2, 2023 1 / 28
  • 2. Image to Prompts Table of contents 1 Motivation 2 Dataset 3 Method 4 Analysis 5 Discussion 6 Reference 2 / 28
  • 3. Image to Prompts Motivation Table of contents 1 Motivation 2 Dataset 3 Method 4 Analysis 5 Discussion 6 Reference 3 / 28
  • 4. Image to Prompts Motivation Motivation This project is a Kaggle competition called Image to Prompts [1]. In this competition, we need to reverse the typical direction of a generative text-to-image model: instead of generating an image from a text prompt. We want to create a model which can predict the text prompt given a generated image. And making predictions on a dataset containing a wide variety of (prompt, image) pairs generated by Stable Diffusion 2.0, to understand how reversible the latent relationship is. 4 / 28
  • 5. Image to Prompts Motivation Here has an example, if we have an image like this below We may need to generate a prompt ultrasaurus holding a black bean taco and a piece of cheese in the woods for this image. 5 / 28
  • 6. Image to Prompts Dataset Table of contents 1 Motivation 2 Dataset 3 Method 4 Analysis 5 Discussion 6 Reference 6 / 28
  • 7. Image to Prompts Dataset Dataset Because our method is to use pre-trained model. So there has no dataset for our method. But, there have some datasets that can be used in this competition. The dataset is build in (image, prompt) pairs. For example: DiffusuinDB [8] : 14 million image-text pairs COCO (Microsoft Common Objects in Context) [6] : 2.5 million labeled instances in 328k images Laion2B-en [10] : 2.3 billion image-text pairs 7 / 28
  • 8. Image to Prompts Method Table of contents 1 Motivation 2 Dataset 3 Method ViT CLIP interrogator OFA 4 Analysis 5 Discussion 8 / 28
  • 9. Image to Prompts Method Method Our method is to ensemble the CLIP Interrogator, OFA model,and ViT model. Here’s the ratio for three different model 1 Vision Transformer (ViT) model: 74.88% 2 CLIP Interrogator: 21.12% 3 OFA model fine-tuned for image captioning: 4% 9 / 28
  • 10. Image to Prompts Method ViT ViT The Vision Transformer (ViT) [3] is a transformer encoder model (BERT-like), and pretrained on ImageNet-21k (14 million images, 21,843 classes) at resolution 224x224, and fine-tuned on ImageNet 2012 (1 million images, 1,000 classes) at resolution 224x224. 10 / 28
  • 11. Image to Prompts Method CLIP interrogator CLIP interrogator The CLIP Interrogator is a prompt engineering tool that combines OpenAI’s CLIP [7] and Salesforce’s BLIP [5] to optimize text prompts to match a given image. Figure: CLIP interrogator : A tool on Hugging Face Space 11 / 28
  • 12. Image to Prompts Method CLIP interrogator CLIP interrogator pipeline Figure: CLIP Interrogator pipeline 12 / 28
  • 13. Image to Prompts Method CLIP interrogator BLIP Figure: Framework of BLIP 13 / 28
  • 14. Image to Prompts Method OFA OFA OFA[9] is a unified multimodal pretrained model that unifies modalities (i.e., cross-modality, vision, language) and tasks (e.g., image generation, visual grounding, image captioning, image classification, text generation, etc.) to a simple sequence-to-sequence learning framework. In here, we use OFA-large-caption for image captioning for pre-trained model, it uses COCO dataset for training the model. 14 / 28
  • 15. Image to Prompts Analysis Table of contents 1 Motivation 2 Dataset 3 Method 4 Analysis Sentence Transformer ViT CLIP interrogator OFA 5 Discussion 15 / 28
  • 16. Image to Prompts Analysis Sentence Transformer For the competition, we need to calculate Stable Diffusion Prompt Embeddings. Input will be the prompt sentence, output will be the embedding. We can get the embedding with using sentence transformer. The model called Sentence-BERT [8]. Figure: SBERT architecture at inference 16 / 28
  • 17. Image to Prompts Analysis Sentence Transformer Figure: Ground Truth Prompts Figure: Text Embedding for each prompt. Each of them has 384 dimensions. 17 / 28
  • 18. Image to Prompts Analysis ViT ViT Figure: Label generated by Vision Transformer 18 / 28
  • 19. Image to Prompts Analysis CLIP interrogator CLIP interrogator Figure: Prompts generated by CLIP interrogator 19 / 28
  • 20. Image to Prompts Analysis OFA OFA Model EDA Figure: Prompts generated by OFA model 20 / 28
  • 21. Image to Prompts Analysis OFA Model Caption SBERT Score ViT comic book, triceratops, coffee mug, jigsaw puzzle, book jacket, dust cover, dust jacket 0.553 CLIP Interrogator a cartoon dinosaur with a piece of cheese in its mouth, boardgamegeek, gaia, woodland background, pancake flat head, inspired by Charles Fremont Conner, mobile learning app prototype, is essentially arbitrary, blue scales, museum item, nachos, prehistory, plates 0.739 OFA A blue dinosaur eating a piece of cheese in a forest 0.772 21 / 28
  • 22. Image to Prompts Discussion Table of contents 1 Motivation 2 Dataset 3 Method 4 Analysis 5 Discussion 6 Reference 22 / 28
  • 23. Image to Prompts Discussion Discussion This competition aims to explore the reversibility of the latent relationship between text prompts and generated images, challenging participants to predict the text prompt from a given image. It highlights the importance of prompt engineering in text-to-image models and the potential for understanding the intricate relationships between prompts and images. The combination of CLIP Interrogator, OFA, and ViT models provides a robust approach for analyzing and predicting prompt embeddings. Figure: Ranking on Kaggle Leader-board (Top 36%) 23 / 28
  • 24. Image to Prompts Discussion Future work But the method for this competition can still be improved. Using the generation of a large dataset of images and prompts, along with modifications and training of models, demonstrates the potential for further improvements in the field of text-to-image generation and prompt engineering (Score: 0.66316, Rank 1) [2]. Or, the utilization of a ViT-based method and the creation of a customized dataset, indicating the importance of supervised learning in predicting sentence embeddings and its relevance to advancing text-to-image models (Score: 0.65179, Rank 2) [4]. 24 / 28
  • 25. Image to Prompts Reference Table of contents 1 Motivation 2 Dataset 3 Method 4 Analysis 5 Discussion 6 Reference 25 / 28
  • 26. Image to Prompts Reference References I [1] Will Cukierski Ashley Chow inversion. Stable Diffusion - Image to Prompts. 2023. url: https://kaggle.com/competitions/stable- diffusion-image-to-prompts. [2] bestfitting. 1st Place Solution. 2023. url: https://www.kaggle.com/competitions/stable- diffusion-image-to-prompts/discussion/411237. [3] Alexey Dosovitskiy et al. An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale. 2021. arXiv: 2010.11929 [cs.CV]. [4] KaizaburoChubachi. 2nd place solution. 2023. url: https://www.kaggle.com/competitions/stable- diffusion-image-to-prompts/discussion/410606. 26 / 28
  • 27. Image to Prompts Reference References II [5] Junnan Li et al. BLIP: Bootstrapping Language-Image Pre-training for Unified Vision-Language Understanding and Generation. 2022. arXiv: 2201.12086 [cs.CV]. [6] Tsung-Yi Lin et al. Microsoft COCO: Common Objects in Context. 2015. arXiv: 1405.0312 [cs.CV]. [7] Alec Radford et al. Learning Transferable Visual Models From Natural Language Supervision. 2021. arXiv: 2103.00020 [cs.CV]. [8] Nils Reimers and Iryna Gurevych. “Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks”. In: CoRR abs/1908.10084 (2019). arXiv: 1908.10084. url: http://arxiv.org/abs/1908.10084. 27 / 28
  • 28. Image to Prompts Reference References III [9] Peng Wang et al. OFA: Unifying Architectures, Tasks, and Modalities Through a Simple Sequence-to-Sequence Learning Framework. 2022. arXiv: 2202.03052 [cs.CV]. [10] Ryan Webster et al. On the De-duplication of LAION-2B. 2023. arXiv: 2303.12733 [cs.CV]. 28 / 28