charan1102 committed · Commit b2ca582 · verified · 1 Parent(s): f8571a5

Update README.md

Files changed (1):
  1. README.md +438 -39
README.md CHANGED
@@ -17,48 +17,447 @@ widget:
  example_title: Airport
  ---

- # The Illustrated Image Captioning using transformers

  ![](https://ankur3107.github.io/assets/images/vision-encoder-decoder.png)

- * https://ankur3107.github.io/blogs/the-illustrated-image-captioning-using-transformers/
-
-
- # Sample running code
-
- ```python
- from transformers import VisionEncoderDecoderModel, ViTImageProcessor, AutoTokenizer
- import torch
- from PIL import Image
- model = VisionEncoderDecoderModel.from_pretrained("nlpconnect/vit-gpt2-image-captioning")
- feature_extractor = ViTImageProcessor.from_pretrained("nlpconnect/vit-gpt2-image-captioning")
- tokenizer = AutoTokenizer.from_pretrained("nlpconnect/vit-gpt2-image-captioning")
- device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
- model.to(device)
- max_length = 16
- num_beams = 4
- gen_kwargs = {"max_length": max_length, "num_beams": num_beams}
- def predict_step(image_paths):
-     images = []
-     for image_path in image_paths:
-         i_image = Image.open(image_path)
-         if i_image.mode != "RGB":
-             i_image = i_image.convert(mode="RGB")
-         images.append(i_image)
-     pixel_values = feature_extractor(images=images, return_tensors="pt").pixel_values
-     pixel_values = pixel_values.to(device)
-     output_ids = model.generate(pixel_values, **gen_kwargs)
-     preds = tokenizer.batch_decode(output_ids, skip_special_tokens=True)
-     preds = [pred.strip() for pred in preds]
-     return preds
- predict_step(['doctor.e16ba4e4.jpg'])  # ['a woman in a hospital bed with a woman in a hospital bed']
  ```

- # Sample running code using transformers pipeline
- ```python
- from transformers import pipeline
- image_to_text = pipeline("image-to-text", model="nlpconnect/vit-gpt2-image-captioning")
- image_to_text("https://ankur3107.github.io/assets/images/image-captioning-example.png")
- # [{'generated_text': 'a soccer game with a player jumping to catch the ball '}]
- ```
17
  example_title: Airport
18
  ---
19
 
20
+ # The Illustrated Image Captioning Using a Transformer Model
21
 
22
  ![](https://ankur3107.github.io/assets/images/vision-encoder-decoder.png)
23
 
24
+ # Table of Contents
25
+ - [1. Introduction](#1-introduction)
26
+ - [2. Dataset Used](#2-dataset-used)
27
+ - [3. Installation](#3-installation)
28
+ - [4. Models and Technologies Used](#4-models-and-technologies-used)
29
+ - [5. Steps for Code Explanation](#5-steps-for-code-explanation)
30
+ - [6. Results and Analysis](#6-results-and-analysis)
31
+ - [7. Evaluation Metrics](#7-evaluation-metrics)
32
+ - [8. References](#8-references)
33
+
34
+ ## 1. Introduction
35
+
36
+ Image captioning is a challenging problem that involves generating human-like descriptions for images. By utilizing Vision Transformers, this project aims to achieve improved image understanding and caption generation. The combination of computer vision and Transformers has shown promising results in various natural language processing tasks, and this project explores their application to image captioning.
37
+
38
+ ## 2. Dataset Used
39
+
40
+ ### About MS COCO dataset
41
+ The Microsoft **C**ommon **O**bjects in **CO**ntext (MS COCO) dataset is a large-scale dataset for scene understanding. The dataset is commonly used to train and benchmark object detection, segmentation, and captioning algorithms.
42
+
43
+ ![Image 11-15-24 at 5 12 PM](https://github.com/user-attachments/assets/1656bf8b-f13b-42ad-aeaa-4eef012f10d6)
44
+
45
+
46
+ You can read more about the dataset on the [website](http://cocodataset.org/#home), in the [research paper](https://arxiv.org/pdf/1405.0312.pdf), or in the [references](#8-references) at the end of this page.
47
+
48
+ ## 3. Installation
49
+
50
+ ### Install COCO API
51
+
52
+ 1. Clone this repo: https://github.com/cocodataset/cocoapi
53
+ ```
54
+ git clone https://github.com/cocodataset/cocoapi.git
55
+ ```
56
+
57
+ 2. Set up the COCO API (also described in the README [here](https://github.com/cocodataset/cocoapi))
58
+ ```
59
+ cd cocoapi/PythonAPI
60
+ make
61
+ cd ..
62
  ```
63
 
64
+ 3. Download some specific data from here: http://cocodataset.org/#download (described below)
65
+
66
+ * Under **Annotations**, download:
67
+ * **2017 Train/Val annotations [241MB]** (extract captions_train2017.json and captions_val2017.json, and place at locations cocoapi/annotations/captions_train2017.json and cocoapi/annotations/captions_val2017.json, respectively)
68
+ * **2017 Testing Image info [1MB]** (extract image_info_test2017.json and place at location cocoapi/annotations/image_info_test2017.json)
69
+
70
+ * Under **Images**, download:
71
+ * **2017 Train images [83K/13GB]** (extract the train2017 folder and place at location cocoapi/images/train2017/)
72
+ * **2017 Val images [41K/6GB]** (extract the val2017 folder and place at location cocoapi/images/val2017/)
73
+ * **2017 Test images [41K/6GB]** (extract the test2017 folder and place at location cocoapi/images/test2017/)
74
+
75
+ ### Preparing the environment
+ **Note**: I developed this project on macOS. It should also run on Windows and Linux with minor changes.
78
+
79
+ 1. Clone the repository, and navigate to the downloaded folder.
80
+ ```
81
+ git clone https://github.com/CapstoneProjectimagecaptioning/image_captioning_transformer.git
82
+ cd image_captioning_transformer
83
+ ```
84
+
85
+ 2. Create (and activate) a new environment, named `captioning_env` with Python 3.7. If prompted to proceed with the install `(Proceed [y]/n)` type y.
86
+
87
+ ```shell
88
+ conda create -n captioning_env python=3.7
89
+ source activate captioning_env
90
+ ```
91
+
92
+ At this point your command line should look something like: `(captioning_env) <User>:image_captioning <user>$`. The `(captioning_env)` indicates that your environment has been activated, and you can proceed with further package installations.
93
+
94
+ 3. Before you can experiment with the code, make sure you have all the libraries and dependencies required for this project. You will mainly need Python 3.7+, PyTorch with torchvision, OpenCV, and Matplotlib. You can install the dependencies using:
95
+ ```
96
+ pip install -r requirements.txt
97
+ ```
98
+
99
+ 4. Navigate back to the repo. (Your environment should still be activated at this point.)
100
+ ```shell
101
+ cd image_captioning
102
+ ```
103
+
104
+ 5. Open the notebooks directory using the command below. You'll see all of the project files appear in your local environment; open the first notebook and follow the instructions.
105
+ ```shell
106
+ jupyter notebook
107
+ ```
108
+
109
+ 6. Once you open any of the project notebooks, make sure you are in the correct `captioning_env` environment by clicking `Kernel > Change Kernel > captioning_env`.
110
+
111
+
112
+ ## 4. Models and Technologies Used
113
+
114
+ ### The following methods and techniques are employed in this project:
115
+
116
+ - Vision Transformers (ViTs)
117
+ - Attention mechanisms
118
+ - Language modeling
119
+ - Transfer learning
120
+ - Evaluation metrics for image captioning (e.g., BLEU, METEOR, CIDEr)
121
+
122
+ ### The project is implemented in Python and utilizes the following libraries:
123
+
124
+ - PyTorch
125
+ - Transformers
126
+ - TorchVision
127
+ - NumPy
128
+ - NLTK
129
+ - Matplotlib
130
+
131
+ ### Introduction
132
+
133
+ This project uses a transformer-based model [[3]](#3) to generate a description
+ for images, a task known as image captioning. Researchers have used
+ many methodologies to approach this problem. One of these methodologies is the
+ encoder-decoder neural network [[4]](#4). The encoder transforms the source image
+ into a representation space; then, the decoder translates the information from
+ the encoded space into natural language. The goal of the encoder-decoder is
+ to minimize the loss of generating a description from an image.
+
+ As shown in the survey by MD Zakir Hossain et al. [[4]](#4), models
+ that use the encoder-decoder architecture mainly consist of a language model
+ based on LSTM [[5]](#5), which decodes the encoded image received from a CNN (see
+ Figure 1). The limitations of LSTMs on long sequences and the success of
+ transformers in machine translation and other NLP tasks drew attention to
+ utilizing them in machine vision. Alexey Dosovitskiy et al. introduced an image
+ classification model (ViT) based on a classical transformer encoder, showing
+ good performance [[6]](#6). Based on ViT, Wei Liu et al. presented an image captioning
+ model (CPTR) using an encoder-decoder transformer [[1]](#1). The source image is fed
+ to the transformer encoder as a sequence of patches. Hence, one can treat the image
+ captioning problem as a machine translation task.
152
+
153
+ <img
154
+ src="https://github.com/zarzouram/xformer_img_captnng/blob/main/images/report/Encoder-Decoder.png"
155
+ width="80%" padding="100px 100px 100px 10px">
156
+
157
+ Figure 1: Encoder Decoder Architecture
158
+
159
+ ### Framework
160
+
161
+ The CPTR [[1]](#1) consists of an image patcher that converts images
162
+ ![x\in\mathbb{R}^{H\times W\times
163
+ C}](https://latex.codecogs.com/svg.latex?x\in\mathbb{R}^{H\times%20W\times%20C})
164
+ to a sequence of patches ![x_p\in\mathbb{R}^{N(P^2\times
165
+ E)}](https://latex.codecogs.com/svg.latex?x_p\in\mathbb{R}^{N(P^2\times%20E)}),
166
+ where _N_ is the number of patches; _H_, _W_, and _C_ are the image height, width, and
+ number of channels (_C=3_), respectively; _P_ is the patch resolution; and _E_ is the image
+ embedding size. Position embeddings are then added to the image patches,
+ which form the input to twelve layers of identical transformer encoders. The
170
+ output of the last encoder layer goes to four layers of identical transformer
171
+ decoders. The decoder also takes words with sinusoid positional embedding.
172
+
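For illustration, here is a rough PyTorch sketch of the image patcher described above. It is not the project's code; the 16x16 patch size and 512-dimensional embedding are assumed values.

```python
import torch
import torch.nn as nn

def patchify(images: torch.Tensor, patch_size: int) -> torch.Tensor:
    """Split images of shape (B, C, H, W) into flattened patches (B, N, C*P*P)."""
    b, c, h, w = images.shape
    p = patch_size
    patches = images.unfold(2, p, p).unfold(3, p, p)        # (B, C, H/P, W/P, P, P)
    patches = patches.permute(0, 2, 3, 1, 4, 5).reshape(b, -1, c * p * p)
    return patches

# Example: 16x16 patches projected to an assumed embedding size E = 512
images = torch.randn(2, 3, 224, 224)
patches = patchify(images, patch_size=16)                   # (2, 196, 768)
projection = nn.Linear(3 * 16 * 16, 512)
patch_embeddings = projection(patches)                      # (2, 196, 512)
```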
173
+ The pre-trained ViT weights initialize the CPTR encoder [[1]](#1). I omitted
+ this initialization and the image positional embeddings, and instead added an image
+ embedding module to the image patcher that uses the feature map extracted from the
+ ResNet101 network [[7]](#7). The number of encoder layers is reduced to two. For
+ ResNet101, I deleted the last two layers and the final softmax layer used for
+ image classification.
179
+
180
+ Another modification takes place at the encoder side. The feedforward network
181
+ consists of two convolution layers with a ReLU activation function in between.
182
+ The encoder side deals solely with the image part, where it is beneficial to
183
+ exploit the relative position of the features we have. Refer to Figure 2 for
184
+ the model architecture.
185
+
186
+ <img
187
+ src="https://github.com/zarzouram/xformer_img_captnng/blob/main/images/report/Architectures.png"
188
+ width="80%" padding="100px 100px 100px 10px">
189
+
190
+ Figure 2: Model Architecture
191
+
192
+ ### Training
193
+
194
+ The transformer decoder output goes to one fully connected layer, which
195
+ provides, given the previous token, a probability distribution
+ (![\in\mathbb{R}^k](https://latex.codecogs.com/svg.latex?\in\mathbb{R}^k), where *k*
+ is the vocabulary size) for each token in the sequence.
198
+
199
+ I trained the model using cross-entropy loss given the target ground truth
200
+ (![y_{1:T}](https://latex.codecogs.com/svg.latex?y_{1:T})) where _T_ is the
201
+ length of the sequence. Also, I add the doubly stochastic attention
202
+ regularization [[8]](#8) to the cross-entropy loss to penalize high weights in
203
+ the encoder-decoder attention. This term encourages the summation of attention
204
+ weights across the sequence to be approximatively equal to one. By doing so,
205
+ the model will not concentrate on specific parts in the image when generating a
206
+ caption. Instead, it will look all over the image, leading to a richer and more
207
+ descriptive text [[8]](#8).
208
+
209
+ The loss function is defined as:
210
+
211
+ ![\large L=-\sum_{c=1}^{T}{log\left(p\left(y_c\middle| y_{c-1}\right)\right)\ +\ \sum_{l=1}^{L}{\frac{1}{L}\left(\sum_{d=1}^{D}\sum_{i=1}^{P^2}\left(1-\sum_{c=1}^{T}\alpha_{cidl}\right)^2\right)}}](https://latex.codecogs.com/svg.latex?\large%20L=-\sum_{c=1}^{T}{log\left(p\left(y_c\middle|%20y_{c-1}\right)\right)\%20+\%20\sum_{l=1}^{L}{\frac{1}{L}\left(\sum_{d=1}^{D}\sum_{i=1}^{P^2}\left(1-\sum_{c=1}^{T}\alpha_{cidl}\right)^2\right)}})
212
+
213
+ where _D_ is the number of heads and _L_ is the number of layers.
214
+
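A minimal PyTorch sketch of this loss is shown below. It is not the project's implementation; the tensor shapes and the padding index are assumptions.

```python
import torch
import torch.nn.functional as F

def captioning_loss(logits, targets, attn_weights, pad_id=0):
    """Cross-entropy plus doubly stochastic attention regularization (sketch).

    logits:       (batch, T, vocab_size) decoder outputs
    targets:      (batch, T) ground-truth token ids
    attn_weights: (batch, L, D, T, P^2) encoder-decoder attention weights
    """
    # Cross-entropy term (averaged here rather than summed, as a simplification)
    ce = F.cross_entropy(
        logits.reshape(-1, logits.size(-1)), targets.reshape(-1), ignore_index=pad_id
    )
    # Penalize patches whose attention, summed over the caption, is far from 1
    attn_sum = attn_weights.sum(dim=3)                       # (batch, L, D, P^2)
    num_layers = attn_weights.size(1)
    reg = ((1.0 - attn_sum) ** 2).sum(dim=(1, 2, 3)) / num_layers
    return ce + reg.mean()
```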
215
+ I used the Adam optimizer with a batch size of thirty-two. The reader can find the
+ model sizes in the configuration file `code/config.json`. The evaluation metrics
+ used are BLEU [[9]](#9), METEOR [[10]](#10), and GLEU [[11]](#11).
218
+
219
+ I trained the model for one hundred epochs, with a stopping criterion that ends
+ training if the tracked evaluation metric (BLEU-4) does not improve for twenty
+ successive epochs. Also, the learning rate is reduced by 0.25% if the tracked
+ evaluation metric (BLEU-4) does not improve for ten consecutive epochs. The model
+ is evaluated against the validation split every two epochs.
224
+
225
+ The pre-trained GloVe embeddings [[12]](#12) initialize the word embedding
+ weights. The word embeddings are frozen for ten epochs. The ResNet101 network
+ is fine-tuned from the beginning.
228
+
229
+ ### Inference
230
+
231
+ A beam search of size five is used to generate a caption for the images in the
+ test split. Generation starts by feeding the model the image and the "start of
+ sentence" special token. At each iteration, the five tokens with the highest
+ scores are kept. Generation stops when the "end of sentence" token is
+ generated or the maximum length limit is reached.
236
+
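The following is a schematic beam-search sketch rather than the project's implementation. It assumes a hypothetical `decode_step(image_features, tokens)` function that returns a NumPy array of log-probabilities for the next token, and an assumed length cap.

```python
import numpy as np

def beam_search(image_features, decode_step, start_id, end_id, beam_size=5, max_len=52):
    """Minimal beam search over a hypothetical `decode_step` scoring function."""
    beams = [([start_id], 0.0)]          # (token sequence, cumulative log-prob)
    finished = []
    for _ in range(max_len):
        candidates = []
        for tokens, score in beams:
            log_probs = decode_step(image_features, tokens)
            top_ids = np.argsort(log_probs)[-beam_size:]
            for tid in top_ids:
                candidates.append((tokens + [int(tid)], score + float(log_probs[tid])))
        # Keep the `beam_size` highest-scoring sequences
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = []
        for tokens, score in candidates[:beam_size]:
            if tokens[-1] == end_id:
                finished.append((tokens, score))
            else:
                beams.append((tokens, score))
        if not beams:
            break
    finished.extend(beams)
    return max(finished, key=lambda c: c[1])[0]
```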
237
+ ## 5. Steps for Code Explanation
238
+
239
+ ### 1. Data Loading and Preprocessing
240
+ - Load Annotations: The code first loads image-caption pairs from the COCO 2017 dataset. It uses JSON files containing images and corresponding captions (captions_train2017.json).
241
+ - Pairing Images and Captions: The code then creates a list (img_cap_pairs) that pairs image filenames with their respective captions.
242
+ - Dataframe for Captions: It organizes the data in a pandas DataFrame for easier manipulation, including creating a path to each image file.
243
+ - Sampling Data: 70,000 image-caption pairs are randomly sampled, making the dataset manageable without needing all data.
244
+
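A condensed sketch of this loading step is shown below. The file locations, column names, and random seed are assumptions based on the description above, not the exact project code.

```python
import json
import pandas as pd

# Load the COCO 2017 caption annotations (path is an assumption)
with open("annotations/captions_train2017.json", "r") as f:
    data = json.load(f)

# Map image ids to file names, then pair every caption with its image file
id_to_file = {img["id"]: img["file_name"] for img in data["images"]}
img_cap_pairs = [
    (id_to_file[ann["image_id"]], ann["caption"]) for ann in data["annotations"]
]

captions = pd.DataFrame(img_cap_pairs, columns=["image", "caption"])
captions["path"] = "train2017/" + captions["image"]

# Keep a manageable random sample of 70,000 image-caption pairs
captions = captions.sample(70000, random_state=42).reset_index(drop=True)
```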
245
+ ### 2. Text Preprocessing
246
+ - The code preprocesses captions to prepare them for the model. It lowercases the text, removes punctuation, replaces multiple spaces with single spaces, and adds [start] and [end] tokens, marking the beginning and end of each caption.
247
+
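A small sketch of such a preprocessing function (the regular expressions used here are illustrative assumptions):

```python
import re

def preprocess_caption(caption: str) -> str:
    caption = caption.lower()
    caption = re.sub(r"[^a-z0-9 ]", "", caption)    # drop punctuation/special characters
    caption = re.sub(r"\s+", " ", caption).strip()  # collapse multiple spaces
    return f"[start] {caption} [end]"

print(preprocess_caption("A red double-decker bus,  driving down a street."))
# [start] a red doubledecker bus driving down a street [end]
```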
248
+ ### 3. Tokenization
249
+ - Vocabulary Setup: A tokenizer (TextVectorization) is created with a vocabulary size of 15,000 words and a maximum token length of 40. It tokenizes captions, transforming them into sequences of integers.
250
+ - Saving Vocabulary: The vocabulary is saved to a file so that it can be reused later without retraining.
251
+ - Mapping Words to Indexes: word2idx and idx2word are mappings that convert words to indices and vice versa.
252
+
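Continuing the sketches above, the tokenizer setup could look roughly like this with Keras' `TextVectorization`; the `standardize=None` choice and the vocabulary file name are assumptions.

```python
import pickle
import tensorflow as tf

VOCAB_SIZE = 15000
MAX_LENGTH = 40

tokenizer = tf.keras.layers.TextVectorization(
    max_tokens=VOCAB_SIZE,
    standardize=None,                 # captions are already preprocessed above
    output_sequence_length=MAX_LENGTH,
)
texts = captions["caption"].map(preprocess_caption).tolist()  # from the sketches above
tokenizer.adapt(texts)                # learn the vocabulary from the captions

# Save the vocabulary for later reuse
vocab = tokenizer.get_vocabulary()
with open("vocabulary.pkl", "wb") as f:
    pickle.dump(vocab, f)

# Word <-> index lookups
word2idx = {w: i for i, w in enumerate(vocab)}
idx2word = {i: w for i, w in enumerate(vocab)}
```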
253
+ ### 4. Dataset Preparation
254
+ - Image-Caption Mapping: Using a dictionary, each image is mapped to its list of captions. Then, the images are shuffled, and a train-validation split is made (80% for training, 20% for validation).
255
+ - Creating TensorFlow Datasets: Using the load_data function, images are resized, preprocessed, and tokenized captions are created as tensors. These tensors are batched for training and validation, improving memory efficiency and allowing parallel processing.
256
+
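Continuing the same sketch, the `load_data` mapping and `tf.data` pipeline might be wired up as follows; the image size, batch size, and split logic are assumptions based on the description above.

```python
import tensorflow as tf

IMAGE_SIZE = (299, 299)   # InceptionV3 input size
BATCH_SIZE = 64

def load_data(img_path, caption):
    img = tf.io.read_file(img_path)
    img = tf.io.decode_jpeg(img, channels=3)
    img = tf.image.resize(img, IMAGE_SIZE)
    img = tf.keras.applications.inception_v3.preprocess_input(img)
    return img, tokenizer(caption)

def make_dataset(paths, texts, training):
    ds = tf.data.Dataset.from_tensor_slices((paths, texts))
    if training:
        ds = ds.shuffle(len(paths))
    return (ds.map(load_data, num_parallel_calls=tf.data.AUTOTUNE)
              .batch(BATCH_SIZE)
              .prefetch(tf.data.AUTOTUNE))

# 80% / 20% train-validation split, as described above
paths = captions["path"].tolist()
texts = captions["caption"].map(preprocess_caption).tolist()
split = int(0.8 * len(paths))
train_ds = make_dataset(paths[:split], texts[:split], training=True)
val_ds = make_dataset(paths[split:], texts[split:], training=False)
```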
257
+ ### 5. Data Augmentation
258
+ - Basic image augmentations (RandomFlip, RandomRotation, and RandomContrast) are applied to training images to help the model generalize better by learning from slightly altered versions of each image.
259
+
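The augmentation described above can be expressed with standard Keras preprocessing layers; the parameter values below are assumptions.

```python
import tensorflow as tf

image_augmentation = tf.keras.Sequential([
    tf.keras.layers.RandomFlip("horizontal"),
    tf.keras.layers.RandomRotation(0.1),
    tf.keras.layers.RandomContrast(0.2),
])

# Applied to a batch of training images before the CNN encoder:
# augmented = image_augmentation(images, training=True)
```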
260
+ ### 6. Model Architecture
261
+ #### CNN Encoder:
262
+ - An InceptionV3 model (pre-trained on ImageNet) is used to process images and extract features, which serve as input to the transformer.
263
+ #### Transformer Encoder Layer:
264
+ - A TransformerEncoderLayer with multi-head self-attention and normalization layers learns the relationships between image features.
265
+ #### Embeddings Layer:
266
+ - This layer adds positional embeddings, allowing the model to capture the order of words in captions.
267
+ #### Transformer Decoder Layer:
268
+ - The TransformerDecoderLayer generates captions. It includes multi-head attention, feedforward neural networks, and dropout to prevent overfitting. Masking ensures that tokens don’t “see” future tokens when predicting the next word.
269
+
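As a rough sketch of the CNN encoder and a single transformer encoder block described above (layer sizes are assumptions, and the decoder and positional embeddings are omitted for brevity):

```python
import tensorflow as tf

def build_cnn_encoder():
    base = tf.keras.applications.InceptionV3(include_top=False, weights="imagenet")
    # (batch, 8, 8, 2048) feature map -> (batch, 64, 2048) sequence of image features
    output = tf.keras.layers.Reshape((-1, base.output.shape[-1]))(base.output)
    return tf.keras.Model(inputs=base.input, outputs=output)

class TransformerEncoderBlock(tf.keras.layers.Layer):
    def __init__(self, embed_dim=512, num_heads=2, **kwargs):
        super().__init__(**kwargs)
        self.dense = tf.keras.layers.Dense(embed_dim, activation="relu")
        self.attn = tf.keras.layers.MultiHeadAttention(num_heads=num_heads, key_dim=embed_dim)
        self.norm1 = tf.keras.layers.LayerNormalization()
        self.norm2 = tf.keras.layers.LayerNormalization()

    def call(self, inputs, training=False):
        # Project image features, then apply multi-head self-attention with a residual
        x = self.norm1(self.dense(inputs))
        attn_out = self.attn(query=x, value=x, key=x, training=training)
        return self.norm2(x + attn_out)
```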
270
+ ### 7. Image Captioning Model Class
271
+ - The ImageCaptioningModel class wraps the encoder, decoder, and CNN encoder into a unified model for training and inference.
272
+ - Loss and Accuracy Calculation: Custom functions track model performance by calculating the loss and accuracy using the tokenized captions and generated predictions.
273
+
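The masked loss and accuracy described here might be computed roughly as below; this is a sketch that assumes the padding token id is 0 and that `y_pred` holds per-token probability distributions.

```python
import tensorflow as tf

loss_object = tf.keras.losses.SparseCategoricalCrossentropy(reduction="none")

def masked_loss(y_true, y_pred):
    mask = tf.cast(tf.not_equal(y_true, 0), tf.float32)   # ignore padding (id 0)
    loss = loss_object(y_true, y_pred) * mask
    return tf.reduce_sum(loss) / tf.reduce_sum(mask)

def masked_accuracy(y_true, y_pred):
    mask = tf.cast(tf.not_equal(y_true, 0), tf.float32)
    pred_ids = tf.cast(tf.argmax(y_pred, axis=-1), y_true.dtype)
    correct = tf.cast(tf.equal(y_true, pred_ids), tf.float32)
    return tf.reduce_sum(correct * mask) / tf.reduce_sum(mask)
```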
274
+ ### 8. Training
275
+ - Loss Function: Sparse categorical cross-entropy is used to calculate the difference between predicted and actual tokens, excluding padding tokens.
276
+ - Early Stopping: Monitors validation loss to stop training if performance on the validation set stops improving.
277
+ - Model Compilation and Training: The model is compiled, optimized, and trained over multiple epochs with early stopping.
278
+
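Training could then be wired up along these lines; the `caption_model` object, learning rate, and patience are assumptions, not the project's exact settings.

```python
import tensorflow as tf

early_stopping = tf.keras.callbacks.EarlyStopping(
    monitor="val_loss", patience=3, restore_best_weights=True
)

caption_model.compile(
    optimizer=tf.keras.optimizers.Adam(learning_rate=1e-4),
    loss=masked_loss,   # masked loss from the sketch above
)

history = caption_model.fit(
    train_ds,
    validation_data=val_ds,
    epochs=30,
    callbacks=[early_stopping],
)
```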
279
+ ### 9. Evaluation and Caption Generation
280
+ - The generate_caption function generates a caption for a new image by feeding it through the model. The function iteratively predicts tokens, appending each token to the generated sequence until the [end] token appears.
281
+
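A simplified, greedy version of such a `generate_caption` loop is sketched below; the `caption_model` attributes (`cnn_model`, `encoder`, `decoder`) and the tokenizer objects are assumptions carried over from the earlier sketches.

```python
import numpy as np
import tensorflow as tf

def generate_caption(image, max_length=40):
    # Extract and encode image features
    img_features = caption_model.cnn_model(tf.expand_dims(image, 0))
    encoded = caption_model.encoder(img_features, training=False)

    caption = "[start]"
    for _ in range(max_length):
        tokens = tokenizer([caption])[:, :-1]
        preds = caption_model.decoder(tokens, encoded, training=False)
        # Take the most likely next token at the current position
        next_id = int(np.argmax(preds[0, len(caption.split()) - 1, :]))
        next_word = idx2word[next_id]
        if next_word == "[end]":
            break
        caption += " " + next_word
    return caption.replace("[start] ", "")
```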
282
+ ### 10. Saving the Model
283
+ - The model weights are saved to a file (Image_Captioning_Model) to reload the model for future use without retraining.
284
+
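Saving and restoring the weights might look like this (the checkpoint name follows the description above; `caption_model` is the model from the earlier sketches):

```python
# Save the trained weights
caption_model.save_weights("Image_Captioning_Model")

# Later: rebuild the model with the same architecture, then restore the weights
caption_model.load_weights("Image_Captioning_Model")
```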
285
+ ## 6. Results and Analysis
286
+
287
+ ### Deployment to Hugging Face Spaces: sharing the image captioning service with Gradio
288
+ The Hugging Face Space Image Captioning GenAI serves as a user-friendly deployment of an image captioning model, designed to generate descriptive captions for uploaded images. The deployment leverages the Hugging Face Spaces infrastructure, which is ideal for hosting machine learning applications with interactive interfaces.
289
+
290
+ ### Key Features of the Deployment:
291
+ - *Web-Based Interaction*: The Space offers an intuitive graphical interface for users to upload images and receive real-time AI-generated captions.
292
+ - *Scalability*: Built on Hugging Face’s robust hosting environment, the application ensures smooth operation, accommodating multiple users simultaneously.
293
+ - *Efficient Framework*: Powered by Gradio, the interface integrates seamlessly with the underlying Transformer-based model, enabling fast inference and visually engaging outputs.
294
+ - *Accessibility*: Users do not need any technical knowledge or setup to use the tool—everything is available in-browser.
295
+
296
+ [Gradio](https://www.gradio.app/) is a package that allows users to create simple web apps with just a few lines of code. It serves essentially the same purpose as Streamlit and Flask but is much simpler to use. Many types of web interface tools can be selected, including sketchpad, text boxes, file upload buttons, webcam, etc. Using these tools to receive various types of data as input, machine learning tasks such as classification and regression can easily be demoed.
297
+
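For illustration, a minimal Gradio app for this kind of service could look like the following; the `caption_image` function is a placeholder, not the code behind the Space.

```python
import gradio as gr

def caption_image(image):
    # Placeholder: call the trained captioning model here,
    # e.g. return generate_caption(image)
    return "a caption would appear here"

demo = gr.Interface(
    fn=caption_image,
    inputs=gr.Image(type="pil"),
    outputs=gr.Textbox(label="Generated caption"),
    title="Image Captioning GenAI",
)

demo.launch()
```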
298
+
299
+ You can deploy an interactive version of the image captioning service on your browser by running the following command. Please don't forget to set the `cocoapi_dir` and encoder/decoder model paths to the correct values.
300
+
301
+ ```shell
302
+ python gradio_main.py
303
+ ```
304
+
305
+ Access the service URL: https://huggingface.co/spaces/premanthcharan/Image_Captioining_GenAI
306
+
307
+ ![Image 11-15-24 at 4 45 PM](https://github.com/user-attachments/assets/42c8dddc-112e-424c-b29b-e45116ee0a97)
308
+ - A web interface developed with Gradio and deployed to Hugging Face Spaces for user interaction
309
+
310
+
311
+ ![Image 11-15-24 at 4 49 PM](https://github.com/user-attachments/assets/398c8761-4d71-46d5-9f0d-19a0fdb272b7)
312
+ - Caption Generated: a red double decker bus driving down a street
313
+
314
+ ### Model Training
315
+
316
+ Figure 3 and Figure 4 show the loss and BLEU-4 scores during the training and
+ validation phases. These figures show that the model starts to overfit early,
+ around epoch eight. The BLEU-4 score and loss value stop improving after epoch 20.
+ The overfitting may be due to the following reasons:
320
+
321
+ 1. Not enough training data:
322
+
323
+ - The CPTR encoder is initialized with the pre-trained ViT model [[1]](#1). In
+ the ViT paper, the model performs relatively well when trained on a
+ large dataset such as ImageNet-21k, which has about 14 million images [[6]](#6). In our
+ case, the model weights are randomly initialized, and we have fewer than
+ 18.5K images.
328
+
329
+ - Typically, the dataset split is 113,287, 5,000, and 5,000
+ images for training, validation, and testing, based on Karpathy et al.'s work
+ [[13]](#13). My split has far fewer images in the training dataset and is
+ based on the 80%, 20%, 20% configuration.
333
+
334
+ 2. The image features learned from ResNet101 are split into N patches of
+ size _P x P_. Such a configuration may not be the best design, as these
+ features do not have to represent an image that can be transformed into a
+ sequence of subgrids. Flattening the ResNet101 features may be a better
+ design.
339
+
340
+ 3. The pre-trained ResNet101 is fine-tuned from the beginning, unlike the
+ word embedding layer. Gradient updates during the early training stages,
+ when the model has not learned much yet, may distort the image features of
+ ResNet101.
344
+
345
+ 4. Unsuitable hyperparameters
346
+
347
+ | <img src="https://github.com/zarzouram/xformer_img_captnng/blob/main/images/report/LossChart.png"/> | <img src="https://github.com/zarzouram/xformer_img_captnng/blob/main/images/report/Bleu4Chart.png"> |
348
+ | :--: | :--: |
349
+ | Figure 3: Loss curve | Figure 4: BLEU-4 score curve |
350
+
351
+ ### Inference Output
352
+
353
+ #### Generated Text Length
354
+
355
+ Figure 5 shows the distribution of the generated captions' lengths. The figure
+ indicates that the model tends to generate shorter captions. The distribution
+ of the training captions' lengths (left) explains that behavior; the
+ distribution of the lengths is positively skewed. More specifically, the
+ maximum caption length generated by the model (21 tokens) covers 98.66%
+ of the lengths in the training set. See “code/experiment.ipynb Section 1.3”.
361
+
362
+ <img
363
+ src="https://github.com/zarzouram/xformer_img_captnng/blob/main/images/report/lens.png"
364
+ padding="100px 100px 100px 10px">
365
+
366
+ Figure 5: Generated caption's lengths distribution
367
+
368
+ ## 7. Evaluation Metrics
369
+
370
+ The table below shows the mean and standard deviation of the performance
+ metrics across the test dataset. BLEU-4 has the highest variation,
+ suggesting that performance varies across the dataset. This high variation
+ is expected, as the model training needs improvement, as discussed above. Also,
+ the distribution of the BLEU-4 scores over the test set shows that 83.3% of the
+ scores are less than 0.5. See “code/experiment.ipynb Section 1.4”.
376
+
377
+ | | bleu1 | bleu2 | bleu3 | bleu4 | gleu | meteor |
378
+ | :--- | :----: |:----: |:----: |:----: |:----: |:----: |
379
+ |mean ± std | 0.7180 ± 0.17 | 0.5116 ± 0.226 | 0.3791 ± 0.227 | 0.2918 ± 0.215 | 0.2814 ± 0.174 | 0.4975 ± 0.193
380
+
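For reference, sentence-level scores of this kind can be computed with NLTK; the tokens below are made-up examples, not outputs of the model.

```python
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction
from nltk.translate.gleu_score import sentence_gleu
from nltk.translate.meteor_score import meteor_score  # requires nltk.download("wordnet")

reference = "a red double decker bus driving down a street".split()
candidate = "a red bus driving down the street".split()

smooth = SmoothingFunction().method1
bleu4 = sentence_bleu([reference], candidate, weights=(0.25, 0.25, 0.25, 0.25),
                      smoothing_function=smooth)
gleu = sentence_gleu([reference], candidate)
meteor = meteor_score([reference], candidate)
print(f"BLEU-4: {bleu4:.3f}  GLEU: {gleu:.3f}  METEOR: {meteor:.3f}")
```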
381
+ ### Attention Visualisation
382
+
383
+ I will examine the encoder-decoder attention of the last transformer layer. The weights are averaged across its heads. Section 1.5 in the notebook "code/experiment.ipynb" shows that the weights contain outliers. I considered weights at or above the 99.95th percentile to be outliers; their values are capped at the 99.95th percentile.
+
+ Fourteen samples were randomly selected from the test split to be examined. Each sample image is superimposed with the attention weights for each generated token. The output is saved either in GIF format (one image for all generated tokens) or in PNG format (one image for each token). All superimposed images are saved under "images/tests". The reader can examine the selected fourteen superimposed images under Section 2.0 of the experiments notebook (you need to rerun all cells under Section 2.0). The samples are categorized as follows:
+
+ - Category 1: two samples with the highest BLEU-4 score (= 1.0)
+ - Category 2: four samples with the lowest BLEU-4 scores
+ - Category 3: two samples with low BLEU-4 scores (up to 0.5)
+ - Category 4: two samples with BLEU-4 scores in (0.5, 0.7]
+ - Category 5: two samples with BLEU-4 scores in (0.7, 0.8]
+ - Category 6: two samples with BLEU-4 scores in (0.8, 1.0)
393
+
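The outlier capping described above amounts to a simple NumPy clip; the array shape below is illustrative.

```python
import numpy as np

attention_weights = np.random.rand(8, 21, 196)  # (heads, tokens, patches), illustrative only

# Cap attention weights above the 99.95th percentile before visualisation
cap = np.percentile(attention_weights, 99.95)
attention_weights = np.clip(attention_weights, None, cap)
```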
394
+ ## 8. References
395
+
396
+ <a id="1">[1]</a> Liu, W., Chen, S., Guo, L., Zhu, X., & Liu, J. (2021). CPTR:
397
+ Full transformer network for image captioning. arXiv preprint
398
+ [arXiv:2101.10804](https://arxiv.org/abs/2101.10804).
399
+
400
+ <a id="2">[2]</a> Lin, T. Y., Maire, M., Belongie, S., Hays, J., Perona, P.,
401
+ Ramanan, D., ... & Zitnick, C. L. (2014, September). Microsoft coco: Common
402
+ objects in context. In European conference on computer vision (pp. 740-755).
403
+ Springer, Cham.
404
+
405
+ <a id="3">[3]</a> A. Vaswani et al., 'Attention is all you need', Advances in neural
406
+ information processing systems, vol. 30, 2017.
407
+
408
+ <a id="4">[4]</a> M. Z. Hossain, F. Sohel, M. F. Shiratuddin, and H. Laga, 'A Comprehensive
409
+ Survey of Deep Learning for Image Captioning', arXiv:1810.04020 [cs, stat],
410
+ Oct. 2018, Accessed: Mar. 03, 2022. [Online]. Available:
411
+ http://arxiv.org/abs/1810.04020.
412
+
413
+ <a id="5">[5]</a> S. Hochreiter and J. Schmidhuber, ‘Long short-term memory’, Neural
414
+ computation, vol. 9, no. 8, pp. 1735–1780, 1997.
415
+
416
+ <a id="6">[6]</a> A. Dosovitskiy et al., 'An image is worth 16x16 words: Transformers for
417
+ image recognition at scale', arXiv preprint arXiv:2010.11929, 2020.
418
+
419
+ <a id="7">[7]</a> K. He, X. Zhang, S. Ren, and J. Sun, 'Deep Residual Learning for Image
420
+ Recognition', arXiv:1512.03385 [cs], Oct. 2015, Accessed: Mar. 06, 2022.
421
+ [Online]. Available: http://arxiv.org/abs/1512.03385.
422
+
423
+ <a id="8">[8]</a> K. Xu et al., 'Show, Attend and Tell: Neural Image Caption Generation with
424
+ Visual Attention', arXiv:1502.03044 [cs], Apr. 2016, Accessed: Mar. 07, 2022.
425
+ [Online]. Available: http://arxiv.org/abs/1502.03044.
426
+
427
+ <a id="9">[9]</a> K. Papineni, S. Roukos, T. Ward, and W.-J. Zhu, 'Bleu: a method for
428
+ automatic evaluation of machine translation', in Proceedings of the 40th annual
429
+ meeting of the Association for Computational Linguistics, 2002, pp. 311–318.
430
+
431
+ <a id="10">[10]</a> S. Banerjee and A. Lavie, 'METEOR: An automatic metric for MT evaluation
432
+ with improved correlation with human judgments', in Proceedings of the acl
433
+ workshop on intrinsic and extrinsic evaluation measures for machine translation
434
+ and/or summarization, 2005, pp. 65–72.
435
+
436
+ <a id="11">[11]</a> A. Mutton, M. Dras, S. Wan, and R. Dale, 'GLEU: Automatic evaluation of
437
+ sentence-level fluency', in Proceedings of the 45th Annual Meeting of the
438
+ Association of Computational Linguistics, 2007, pp. 344–351.
439
+
440
+ <a id="12">[12]</a> J. Pennington, R. Socher, and C. D. Manning, 'Glove: Global vectors for
441
+ word representation', in Proceedings of the 2014 conference on empirical
442
+ methods in natural language processing (EMNLP), 2014, pp. 1532–1543.
443
+
444
+ <a id="13">[13]</a> A. Karpathy and L. Fei-Fei, 'Deep visual-semantic alignments for
445
+ generating image descriptions', in Proceedings of the IEEE conference on
446
+ computer vision and pattern recognition, 2015, pp. 3128–3137.
447
+
448
+ <a id="14">[14]</a> O. Vinyals, A. Toshev, S. Bengio, and D. Erhan, 'Show and tell: A neural
+ image caption generator', in Proceedings of the IEEE Conference on Computer Vision and
+ Pattern Recognition (CVPR), 2015, pp. 3156-3164.
+
+ <a id="15">[15]</a> Hugging Face Transformers documentation on image captioning.
+ https://huggingface.co/docs/transformers/main/en/tasks/image_captioning
+
+ <a id="16">[16]</a> Quickstart guide to GitHub Pages.
+ https://docs.github.com/en/pages/quickstart
+
+ <a id="17">[17]</a> Microsoft COCO: Common Objects in Context (cs.CV). arXiv:1405.0312 [cs.CV].
+ https://doi.org/10.48550/arXiv.1405.0312
+
+ <a id="18">[18]</a> Show, Attend and Tell: Neural Image Caption Generation with Visual Attention.
+ arXiv:1502.03044v3 [cs.LG], 19 Apr 2016. https://doi.org/10.48550/arXiv.1502.03044
+
+ <a id="19">[19]</a> Deep Residual Learning for Image Recognition. arXiv:1512.03385v1 [cs.CV], 10 Dec 2015.

+ <a id="20">[20]</a> Gradio quickstart guide. https://www.gradio.app/guides/quickstart