# FLAN-T5 COVID-19 Vaccine Stance Classification
This repository contains my submission for the take-home coding assessment for the LLM research opportunity under Sean Yun-Shiuan Chuang, Junjie Hu, and Tim Rogers.
This model is currently public for easy visibility during the coding assessment and will be made private afterwards.
## Task Summary
Predict the stance of each tweet (`in-favor`, `against`, or `neutral-or-unclear`) from a CSV of 5,751 tweets about COVID-19 vaccination using `flan-t5-large`.
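As an illustration of the task, a minimal zero-shot sketch with `flan-t5-large` is shown below; the prompt wording and example tweet are assumptions, not the exact prompt used in this repo's code.

```python
# Minimal zero-shot stance-classification sketch. The prompt wording and the
# example tweet are illustrative assumptions, not the repo's actual prompt.
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("google/flan-t5-large")
model = AutoModelForSeq2SeqLM.from_pretrained("google/flan-t5-large")

tweet = "Just got my second dose. Feeling grateful!"
prompt = (
    "Classify the stance of this tweet toward COVID-19 vaccination as "
    "in-favor, against, or neutral-or-unclear.\n"
    f"Tweet: {tweet}\n"
    "Stance:"
)

inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=8)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```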
## Project Structure
- `predict.py` - Runs the model's predictions on a given dataset and saves the results into `output/`.
- `eval.py` - Model evaluation.
- `utils.py` - Shared helper functions.
- `train.py` - Fine-tuning code for a given dataset.
- `requirements.txt` - Package installs for reproducibility.
- `data/` - Contains the original dataset.
- `output/` - Contains prediction output files and held-out dataset files from train/test splitting.
- `finetune/` - Contains all files of the fine-tuned model, including epoch checkpoint files.
## Setup
Install dependencies:

```bash
pip install transformers torch pandas scikit-learn sentencepiece datasets
```

or

```bash
pip install -r requirements.txt
```
## Quick Start
To run the fine-tuned model as-is:

```bash
python3 predict.py  # manually change the dataset path if needed
python3 eval.py     # for evaluation
```
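For reference, the core of the evaluation can be sketched with scikit-learn as below. The CSV path and column names (`label`, `prediction`) are assumptions; see `eval.py` for the actual implementation.

```python
# Hypothetical evaluation sketch; the path and column names are assumptions.
import pandas as pd
from sklearn.metrics import accuracy_score, f1_score

df = pd.read_csv("output/predictions.csv")
y_true, y_pred = df["label"], df["prediction"]

print("Accuracy:", accuracy_score(y_true, y_pred))
print("Macro F1:", f1_score(y_true, y_pred, average="macro"))
```

Macro averaging is shown here; the "overall F1" figures below may use a different averaging scheme.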
## Development Summary
- Initial zero-shot prompting (no fine-tuning) revealed that the model never predicted `neutral-or-unclear`. Overall F1 score was 0.428.
- To speed up initial fine-tuning on a T4 GPU, `flan-t5-base` was used until final evaluations, which were done with `flan-t5-large`.
- The initial attempt at fine-tuning (no upsampling) had poor `neutral-or-unclear` recall (0.18). Overall F1 score was 0.518.
- Fine-tuning with upsampling on `neutral-or-unclear`, using an 80/20 train/test split on the first 2,000 records and running predictions on the following 1,500 records, yielded an F1 score of 0.562 (3 epochs).
- Fine-tuning with upsampling only on `neutral-or-unclear` over the entire dataset with an 80/20 train/test split showed only average precision for `against` (0.59). Overall F1 score was 0.690 (3 epochs).
- Fine-tuning with upsampling on both `neutral-or-unclear` and `against` led to an F1 score of 0.724 (3 epochs). A sketch of the upsampling approach appears after this list.
- Final fine-tuning on `flan-t5-large` was done over 2 epochs in bf16 format to account for T4 GPU limitations; a sketch of the corresponding training arguments also appears after this list.
- Fine-tuning `flan-t5-large` on an 80/20 split, with predictions run on the held-out dataset, resulted in an F1 score of 0.782.
- The final version of the fine-tuned model, with predictions run on the entirety of the original dataset, achieved a final F1 score of 0.772 with an accuracy of 0.801.
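As referenced in the list above, upsampling can be done by duplicating minority-class rows until the classes are balanced. This is a minimal sketch, assuming a pandas DataFrame with a `label` column and a `data/tweets.csv` path; both names are assumptions, not the repo's actual code.

```python
# Hypothetical upsampling sketch; the column name and CSV path are assumptions.
import pandas as pd
from sklearn.model_selection import train_test_split

def upsample(df: pd.DataFrame, label_col: str = "label") -> pd.DataFrame:
    """Duplicate minority-class rows until each label matches the majority count."""
    max_n = df[label_col].value_counts().max()
    parts = [
        grp.sample(n=max_n, replace=True, random_state=42)
        for _, grp in df.groupby(label_col)
    ]
    # Shuffle so duplicated rows are not clustered together.
    return pd.concat(parts).sample(frac=1, random_state=42).reset_index(drop=True)

df = pd.read_csv("data/tweets.csv")
# Split first, then upsample only the training portion, so duplicated
# rows cannot leak into the test set.
train_df, test_df = train_test_split(df, test_size=0.2, random_state=42)
train_df = upsample(train_df)
```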
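The final run's settings can likewise be sketched with `Seq2SeqTrainingArguments`; only the epoch count and the bf16 flag come from the summary above, and the remaining values are illustrative assumptions.

```python
# Illustrative training arguments; batch size and output_dir are assumptions.
from transformers import Seq2SeqTrainingArguments

training_args = Seq2SeqTrainingArguments(
    output_dir="finetune",          # checkpoints, per the project structure above
    num_train_epochs=2,             # final run used 2 epochs
    bf16=True,                      # reduced precision for the T4 GPU
    per_device_train_batch_size=4,  # assumed; tune for available GPU memory
    save_strategy="epoch",
)
```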
## Potential Improvements
- Further tinkering with the prompt could yield improved results; brevity is a key obstacle in prompt design. (A hypothetical compact prompt variant is sketched below.)
- Fine-tuning `flan-t5-large` for 3 epochs instead of 2 could also improve F1 on more powerful GPUs.
- Experimenting with different train/test splits.
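For example, a more compact prompt variant (hypothetical wording, not the repo's current prompt) that could be compared against the existing one:

```python
# Hypothetical compact prompt template for comparison experiments.
prompt_template = (
    "Tweet: {tweet}\n"
    "Stance toward COVID-19 vaccination (in-favor / against / neutral-or-unclear):"
)
```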
## Model Tree

`akashmohan/finetuned-flan-t5-large` is fine-tuned from the base model `google/flan-t5-large`.