4. Image to Prompts
Motivation
This project is a Kaggle competition called Image to Prompts [1].
In this competition, we need to reverse the typical direction of a
generative text-to-image model: instead of generating an image from
a text prompt, we want to build a model that predicts the text prompt
given a generated image.
Predictions are made on a dataset containing a wide variety of
(prompt, image) pairs generated by Stable Diffusion 2.0, in order to
understand how reversible the latent relationship between prompts and
images is.
5. Image to Prompts
Motivation
Here is an example. If we have an image like the one below, we need
to predict a prompt such as "ultrasaurus holding a black bean taco
and a piece of cheese in the woods" for this image.
7. Image to Prompts
Dataset
Because our method uses pre-trained models, we do not need to build
a dataset of our own.
However, several existing datasets of (image, prompt) pairs can be
used in this competition, for example:
DiffusionDB [8] : 14 million image-text pairs (see the loading sketch after this list)
COCO (Microsoft Common Objects in Context) [6]
: 2.5 million labeled instances in 328k images
LAION-2B-en [10] : 2.3 billion image-text pairs
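As an illustration of how such (image, prompt) data can be accessed, here is a minimal loading sketch for DiffusionDB; the poloclub/diffusiondb dataset name and the 2m_first_1k configuration are assumptions about how the data is hosted on the Hugging Face Hub, not part of the competition setup.

```python
# Minimal sketch: load a small (image, prompt) subset of DiffusionDB.
# Assumption: the data is hosted on the Hugging Face Hub as
# "poloclub/diffusiondb" with a "2m_first_1k" configuration (1,000 pairs).
from datasets import load_dataset

subset = load_dataset("poloclub/diffusiondb", "2m_first_1k", split="train")

example = subset[0]
image = example["image"]    # PIL image generated by Stable Diffusion
prompt = example["prompt"]  # text prompt that produced the image
print(prompt, image.size)
```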
8. Image to Prompts
Method
Table of contents
1 Motivation
2 Dataset
3 Method
ViT
CLIP interrogator
OFA
4 Analysis
5 Discussion
9. Image to Prompts
Method
Our method is to ensemble the CLIP Interrogator, the OFA model, and
the ViT model.
Here are the ensemble weights for the three models (a weighted-average
sketch follows this list):
1 Vision Transformer (ViT) model: 74.88%
2 CLIP Interrogator: 21.12%
3 OFA model fine-tuned for image captioning: 4%
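A minimal sketch of the weighted ensemble, assuming each branch already yields a 384-dimensional prompt embedding (the ViT branch regresses it directly, while the CLIP Interrogator and OFA captions are first encoded with the sentence transformer described in the Analysis section); whether each branch is normalized before blending is not stated, so this sketch only normalizes the final vector.

```python
import numpy as np

# Ensemble weights from the list above.
WEIGHTS = {"vit": 0.7488, "clip_interrogator": 0.2112, "ofa": 0.04}

def ensemble_embedding(emb_vit, emb_clip, emb_ofa):
    """Weighted average of three 384-dim prompt embeddings,
    L2-normalized at the end (cosine similarity is scale-invariant)."""
    blended = (WEIGHTS["vit"] * emb_vit
               + WEIGHTS["clip_interrogator"] * emb_clip
               + WEIGHTS["ofa"] * emb_ofa)
    return blended / np.linalg.norm(blended)
```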
10. Image to Prompts
Method
ViT
The Vision Transformer (ViT) [3] is a transformer encoder model
(BERT-like), pretrained on ImageNet-21k (14 million images,
21,843 classes) at resolution 224x224 and fine-tuned on ImageNet
2012 (1 million images, 1,000 classes) at the same resolution.
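As a hedged illustration of how such a pretrained ViT could be adapted to this task, the sketch below replaces its classification head with a 384-dimensional regression head that predicts the prompt embedding; the timm library and the vit_base_patch16_224 checkpoint name are assumptions, not necessarily the exact model used here.

```python
import timm
import torch
import torch.nn.functional as F

# Assumed checkpoint name; num_classes=384 swaps the 1,000-class ImageNet
# head for a linear head matching the Sentence-BERT embedding size.
model = timm.create_model("vit_base_patch16_224", pretrained=True, num_classes=384)

images = torch.randn(2, 3, 224, 224)   # dummy batch of 224x224 images
pred = model(images)                   # shape: (2, 384)
pred = F.normalize(pred, dim=-1)       # unit norm, ready for a cosine loss
```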
11. Image to Prompts
Method
CLIP interrogator
The CLIP Interrogator is a prompt engineering tool that combines
OpenAI’s CLIP [7] and Salesforce’s BLIP [5] to optimize text
prompts to match a given image.
Figure: CLIP Interrogator, a tool on Hugging Face Spaces
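A minimal sketch of querying the CLIP Interrogator from Python with the clip-interrogator package; the ViT-L-14/openai CLIP backbone is an assumption.

```python
from PIL import Image
from clip_interrogator import Config, Interrogator

# Assumed CLIP backbone; the tool pairs a CLIP model with BLIP captioning.
ci = Interrogator(Config(clip_model_name="ViT-L-14/openai"))

image = Image.open("generated_image.png").convert("RGB")
prompt = ci.interrogate(image)  # BLIP caption plus CLIP-ranked modifier phrases
print(prompt)
```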
14. Image to Prompts
Method
OFA
OFA[9] is a unified multimodal pretrained model that unifies
modalities (i.e., cross-modality, vision, language) and tasks (e.g.,
image generation, visual grounding, image captioning, image
classification, text generation, etc.) to a simple sequence-to-sequence
learning framework.
Here, we use the pre-trained OFA-large-caption model for image
captioning; it is trained on the COCO dataset.
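A sketch of generating a caption with OFA-large-caption, following the usage shown on its model card; it assumes the OFA-Sys fork of transformers is installed (the stock library does not provide OFATokenizer/OFAModel) and that the checkpoint has been downloaded to a local directory.

```python
from PIL import Image
from torchvision import transforms
from transformers import OFATokenizer, OFAModel  # assumes the OFA-Sys fork

ckpt_dir = "OFA-large-caption"  # assumed local checkpoint directory
tokenizer = OFATokenizer.from_pretrained(ckpt_dir)
model = OFAModel.from_pretrained(ckpt_dir, use_cache=False)

# Preprocess the image to the resolution OFA-large expects.
transform = transforms.Compose([
    transforms.Resize((480, 480), interpolation=Image.BICUBIC),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.5, 0.5, 0.5], std=[0.5, 0.5, 0.5]),
])
patch_img = transform(Image.open("generated_image.png").convert("RGB")).unsqueeze(0)

# OFA frames image captioning as answering a fixed question.
inputs = tokenizer([" what does the image describe?"], return_tensors="pt").input_ids
out = model.generate(inputs, patch_images=patch_img, num_beams=5, no_repeat_ngram_size=3)
caption = tokenizer.batch_decode(out, skip_special_tokens=True)[0].strip()
print(caption)
```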
15. Image to Prompts
Analysis
Table of contents
1 Motivation
2 Dataset
3 Method
4 Analysis
Sentence Transformer
ViT
CLIP interrogator
OFA
5 Discussion
16. Image to Prompts
Analysis
Sentence Transformer
For the competition, we need to calculate Stable Diffusion prompt
embeddings.
The input is the prompt sentence and the output is its embedding,
which we obtain with a sentence transformer: the Sentence-BERT
(SBERT) model [8].
Figure: SBERT architecture at inference
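A minimal sketch of computing these prompt embeddings with the sentence-transformers library; the all-MiniLM-L6-v2 checkpoint is an assumption (a common Sentence-BERT model whose 384-dimensional output matches the embeddings shown on the next slide).

```python
from sentence_transformers import SentenceTransformer

# Assumed checkpoint: a Sentence-BERT model with 384-dim output.
st_model = SentenceTransformer("all-MiniLM-L6-v2")

prompts = [
    "ultrasaurus holding a black bean taco and a piece of cheese in the woods",
]
embeddings = st_model.encode(prompts, normalize_embeddings=True)
print(embeddings.shape)  # (1, 384)
```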
17. Image to Prompts
Analysis
Sentence Transformer
Figure: Ground Truth Prompts
Figure: Text Embedding for each prompt. Each of them has 384 dimensions.
21. Image to Prompts
Analysis
OFA
Captions produced by each model and their SBERT scores:

ViT (SBERT score 0.553):
comic book, triceratops, coffee mug, jigsaw puzzle, book jacket,
dust cover, dust jacket

CLIP Interrogator (SBERT score 0.739):
a cartoon dinosaur with a piece of cheese in its mouth, boardgamegeek,
gaia, woodland background, pancake flat head, inspired by Charles
Fremont Conner, mobile learning app prototype, is essentially arbitrary,
blue scales, museum item, nachos, prehistory, plates

OFA (SBERT score 0.772):
A blue dinosaur eating a piece of cheese in a forest
23. Image to Prompts
Discussion
This competition aims to explore the reversibility of the latent
relationship between text prompts and generated images, challenging
participants to predict the text prompt from a given image.
It highlights the importance of prompt engineering in text-to-image
models and the potential for understanding the intricate relationships
between prompts and images.
The combination of CLIP Interrogator, OFA, and ViT models provides
a robust approach for analyzing and predicting prompt embeddings.
Figure: Ranking on the Kaggle leaderboard (top 36%)
24. Image to Prompts
Discussion
Future work
The method used for this competition can still be improved.
The 1st place solution generated a large dataset of images and
prompts and modified and trained its own models, demonstrating the
potential for further improvements in text-to-image generation and
prompt engineering (Score: 0.66316, Rank 1) [2].
Alternatively, the 2nd place solution used a ViT-based method with a
customized dataset, indicating the importance of supervised learning
for predicting sentence embeddings and its relevance to advancing
text-to-image models (Score: 0.65179, Rank 2) [4].
26. Image to Prompts
Reference
References I
[1] Will Cukierski, Ashley Chow, and inversion. Stable Diffusion -
Image to Prompts. 2023. url:
https://kaggle.com/competitions/stable-diffusion-image-to-prompts.
[2] bestfitting. 1st Place Solution. 2023. url:
https://www.kaggle.com/competitions/stable-diffusion-image-to-prompts/discussion/411237.
[3] Alexey Dosovitskiy et al. An Image is Worth 16x16 Words:
Transformers for Image Recognition at Scale. 2021. arXiv:
2010.11929 [cs.CV].
[4] KaizaburoChubachi. 2nd place solution. 2023. url:
https://www.kaggle.com/competitions/stable-diffusion-image-to-prompts/discussion/410606.
27. Image to Prompts
Reference
References II
[5] Junnan Li et al. BLIP: Bootstrapping Language-Image
Pre-training for Unified Vision-Language Understanding and
Generation. 2022. arXiv: 2201.12086 [cs.CV].
[6] Tsung-Yi Lin et al. Microsoft COCO: Common Objects in
Context. 2015. arXiv: 1405.0312 [cs.CV].
[7] Alec Radford et al. Learning Transferable Visual Models From
Natural Language Supervision. 2021. arXiv: 2103.00020
[cs.CV].
[8] Nils Reimers and Iryna Gurevych. “Sentence-BERT: Sentence
Embeddings using Siamese BERT-Networks”. In: CoRR
abs/1908.10084 (2019). arXiv: 1908.10084. url:
http://arxiv.org/abs/1908.10084.
28. Image to Prompts
Reference
References III
[9] Peng Wang et al. OFA: Unifying Architectures, Tasks, and
Modalities Through a Simple Sequence-to-Sequence Learning
Framework. 2022. arXiv: 2202.03052 [cs.CV].
[10] Ryan Webster et al. On the De-duplication of LAION-2B. 2023.
arXiv: 2303.12733 [cs.CV].