A Comparative Evaluation of Transformer-Based Vision Encoder-Decoder Models for Brazilian Portuguese Image Captioning

Programa de Pós-Graduação em Computação Aplicada - PPComp
Instituto Federal do Espírito Santo, Serra, Brazil - IFES

Example of Transformer-based Vision Encoder-Decoder Architecture

Abstract

Image captioning refers to the process of creating a natural language description for one or more images. This task has several practical applications, from aiding in medical diagnoses through image descriptions to promoting social inclusion by providing visual context to people with visual impairments.

Despite recent progress, especially in English, low-resource languages like Brazilian Portuguese face a shortage of datasets, models, and studies. This work seeks to contribute to this context by fine-tuning and investigating the performance of vision language models based on the Transformer architecture in Brazilian Portuguese. We leverage pre-trained vision model checkpoints (ViT, Swin, and DeiT) and neural language models (BERTimbau, DistilBERTimbau, and GPorTuguese-2). Several experiments were carried out to compare the efficiency of different model combinations using the #PraCegoVer-63K, a native Portuguese dataset, and a translated version of the Flickr30K dataset.
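As a concrete illustration of how such encoder-decoder combinations can be assembled, the sketch below pairs a Swin encoder with a BERTimbau decoder through the Hugging Face Transformers VisionEncoderDecoderModel API. It is a minimal sketch: the checkpoint identifiers are public Hub names used for illustration and are not necessarily the exact checkpoints used in our experiments.

# Minimal sketch: pairing a pre-trained vision encoder with a Portuguese
# language decoder through the VisionEncoderDecoderModel API. Checkpoint ids
# are illustrative, not necessarily the ones used in the experiments.
from transformers import (
    AutoImageProcessor,
    AutoTokenizer,
    VisionEncoderDecoderModel,
)

encoder_id = "microsoft/swin-base-patch4-window7-224-in22k"  # Swin encoder
decoder_id = "neuralmind/bert-base-portuguese-cased"         # BERTimbau decoder

# The decoder gains randomly initialized cross-attention layers, which are
# trained during fine-tuning on the captioning datasets.
model = VisionEncoderDecoderModel.from_encoder_decoder_pretrained(
    encoder_id, decoder_id
)

image_processor = AutoImageProcessor.from_pretrained(encoder_id)
tokenizer = AutoTokenizer.from_pretrained(decoder_id)

# Special tokens required by the caption generation loop.
model.config.decoder_start_token_id = tokenizer.cls_token_id
model.config.pad_token_id = tokenizer.pad_token_id
model.config.eos_token_id = tokenizer.sep_token_id

# Fine-tuning then follows the usual seq2seq recipe: pixel_values produced by
# the image processor as inputs, tokenized reference captions as labels.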

The experimental results demonstrated that configurations using the Swin, DistilBERTimbau, and GPorTuguese-2 models generally achieved the best outcomes. Furthermore, the #PraCegoVer-63K dataset presents a series of challenges, such as descriptions made up of multiple sentences and the presence of proper names of places and people, which significantly decrease the performance of the investigated models.

Our main contributions are summarized as follows:

  1. This is the first work to conduct a comprehensive experimental investigation of a fully Transformer-based vision encoder-decoder architecture for Brazilian Portuguese Image Captioning. We provide a performance comparison of different encoder and decoder model choices.
  2. Our extensive evaluation was conducted on two datasets: the native Portuguese #PraCegoVer-63K and our translated version of the traditional Flickr30K.
  3. Our source code, Portuguese translated version of Flickr30K, and the models that achieved the highest performance are publicly available.

Related Work

The main advantage of our work compared to related studies lies in its comprehensive and detailed approach to Image Captioning (IC) in Brazilian Portuguese, a low-resource language. Unlike previous research focusing on generic models or translated datasets, our study explores both translated datasets and native ones (#PraCegoVer), enabling a deeper analysis of the linguistic and cultural nuances of Brazilian Portuguese. Additionally, by combining advanced visual encoders (ViT, Swin, DeiT) with textual decoders (BERTimbau, DistilBERTimbau, GPorTuguese-2), we identified the most effective combinations for specific scenarios. Our work is also pioneering in its extensive evaluation of model performance using a diverse set of metrics, including BLEU, ROUGE, METEOR, CIDEr, and BERTScore, offering a broad and quantitative view of performance across different contexts. This provides a solid foundation for future research and advancements in IC for Brazilian Portuguese.
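As a reference for how this metric suite can be computed, the sketch below relies on the Hugging Face evaluate library for BLEU, ROUGE, METEOR, and BERTScore. It is only an illustrative example, not necessarily the exact evaluation pipeline used in our experiments; CIDEr-D is typically obtained from the pycocoevalcap package instead.

# Illustrative sketch of caption evaluation with the `evaluate` library.
import evaluate

predictions = ["um cachorro corre na grama"]            # generated captions
references = [["um cachorro está correndo na grama"]]   # reference captions

bleu = evaluate.load("bleu")
rouge = evaluate.load("rouge")
meteor = evaluate.load("meteor")
bertscore = evaluate.load("bertscore")

print(bleu.compute(predictions=predictions, references=references)["bleu"])
print(rouge.compute(predictions=predictions, references=references)["rougeL"])
print(meteor.compute(predictions=predictions, references=references)["meteor"])
print(bertscore.compute(predictions=predictions, references=references,
                        lang="pt")["f1"])

# CIDEr-D is not bundled with `evaluate`; it is commonly computed with the
# pycocoevalcap package.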

Comparison of our work with other Brazilian Portuguese IC works.

Author(s) Fully Transformer-Based Leverage Pre-trained Checkpoints Compare Several Models Brazilian Portuguese Dataset Translated Dataset
Santos et al. [1]
Gondim et al. [2]
Alencar et al. [3]
Ours

Experimental Results

The models with the highest evaluation metrics are available on Hugging Face.
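A minimal sketch of loading one of these checkpoints and generating a caption is shown below; the repository identifier is a placeholder, and the actual model ids are listed on our Hugging Face page.

# Sketch of caption generation with a released checkpoint from the Hub.
# The repository id below is a placeholder, not an actual model name.
from PIL import Image
from transformers import (
    AutoImageProcessor,
    AutoTokenizer,
    VisionEncoderDecoderModel,
)

repo_id = "your-org/swin-distilbertimbau-flickr30k-pt"  # placeholder id

model = VisionEncoderDecoderModel.from_pretrained(repo_id)
image_processor = AutoImageProcessor.from_pretrained(repo_id)
tokenizer = AutoTokenizer.from_pretrained(repo_id)

image = Image.open("example.jpg").convert("RGB")
pixel_values = image_processor(images=image, return_tensors="pt").pixel_values

# Beam search usually yields more fluent captions than greedy decoding.
output_ids = model.generate(pixel_values, max_length=64, num_beams=4)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))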

Here, the language model names refer to their Portuguese versions: BERT stands for BERTimbau, DistilBERT for DistilBERTimbau, and GPT-2 for GPorTuguese-2.

Flickr30K Portuguese Dataset


All models using the Swin Transformer as encoder achieved the best results across the evaluation metrics. The best configuration pairs Swin with the Portuguese version of DistilBERT, which surpassed its teacher model (the Portuguese version of BERT).

Evaluation results (%) for the Flickr30K Portuguese dataset. The three highest scores per metric are shown in bold, and the highest has a green background.

Encoder      Decoder            CIDEr-D   BLEU-4   ROUGE-L   METEOR   BERTScore
DeiT-Base    BERT-Base            49.53    19.20     36.00    39.80       69.58
DeiT-Base    DistilBERT-Base      50.58    19.24     35.77    39.93       69.50
DeiT-Base    GPT-2-Small          50.61    19.83     36.30    40.52       69.66
Swin-Base    BERT-Base            62.42    22.78     38.71    43.47       71.19
Swin-Base    DistilBERT-Base      66.73    24.65     39.98    44.71       72.30
Swin-Base    GPT-2-Small          64.71    23.15     39.39    44.36       71.70
ViT-Base     BERT-Base            57.32    22.12     37.50    41.72       70.63
ViT-Base     DistilBERT-Base      59.32    21.19     37.74    42.70       71.15
ViT-Base     GPT-2-Small          59.02    21.39     37.68    42.64       71.03

It is worth pointing out that Flickr30K has shorter captions and lower caption variance compared to other datasets such as #PraCegoVer. The scenes are generic and depict few perspectives, and the descriptions are quite simple.

Left: an image for which Swin-DistilBERT achieved lower performance. Right: an image for which Swin-DistilBERT achieved higher performance. The side comments summarize the dataset's overall characteristics.

#PraCegoVer-63K Dataset

All models using GPT-2 as decoder achieved the best results across the evaluation metrics. The best configuration pairs Swin with the Portuguese version of GPT-2.

Evaluation results (%) for the #PraCegoVer-63K dataset. The three highest scores per metric are shown in bold, and the highest has a green background.

Encoder      Decoder            CIDEr-D   BLEU-4   ROUGE-L   METEOR   BERTScore
DeiT-Base    BERT-Base             0.99     0.00      4.02     3.49       36.20
DeiT-Base    DistilBERT-Base       1.59     0.11      9.22     7.74       45.36
DeiT-Base    GPT-2-Small           5.95     1.00     12.44    13.87       49.11
Swin-Base    BERT-Base             1.29     0.00      4.53     3.90       30.84
Swin-Base    DistilBERT-Base       0.31     0.01      7.91     5.76       40.95
Swin-Base    GPT-2-Small           9.45     1.60     13.43    15.58       49.85
ViT-Base     BERT-Base             0.83     0.00      3.03     2.61       27.69
ViT-Base     DistilBERT-Base       1.70     0.12      9.01     7.89       45.71
ViT-Base     GPT-2-Small           8.27     1.49     13.23    15.74       49.57

It is worth pointing out that #PraCegoVer-63K has longer captions and higher caption variance compared to other datasets such as Flickr30K and COCO Captions. The dataset contains proper names of people and places, mismatches between images and reference captions, linguistic errors, and complex images, which demand additional resources to help smaller models generate more accurate captions.

Left: an image for which Swin-GPT-2 achieved lower performance. Right: an image for which Swin-GPT-2 achieved higher performance. The side comments summarize the dataset's overall characteristics.

BibTeX

@inproceedings{bromonschenkel2024comparative,
    title={A Comparative Evaluation of Transformer-Based Vision Encoder-Decoder Models for Brazilian Portuguese Image Captioning},
    author={Bromonschenkel, Gabriel and Oliveira, Hil{\'a}rio and Paix{\~a}o, Thiago M},
    booktitle={2024 37th SIBGRAPI Conference on Graphics, Patterns and Images (SIBGRAPI)},
    pages={1--6},
    year={2024},
    organization={IEEE}
}