Movie Review Sentiment Analysis

Project Task

This project focuses on predicting movie reviews using the IMDb movie review dataset from Stanford. The goal is to classify reviews as either positive (1) or negative (0) using a pre-trained language model.

Dataset

The dataset consists of:

25,000 labeled training reviews
25,000 labeled test reviews
Additional unlabeled data for potential use
The data is categorical, containing text-based movie reviews and corresponding sentiment labels (0 for negative, 1 for positive).

Loading the Original Dataset

You can load the original dataset using the datasets library:

from datasets import load_dataset

ds = load_dataset("stanfordnlp/imdb")

Preprocessed Dataset Links

The cleaned and preprocessed datasets are stored in Google Drive and can be accessed via the following links:

Unsupervised Data: unsupervised_clean.csv
Training Data: clean_train.csv
Testing Data: clean_test.csv

Pre-trained Model

The DistilBERT pre-trained model was selected because it is a smaller, faster, and more efficient version of BERT, making it well-suited for NLP tasks while maintaining high accuracy.

Performance Metrics

The model was evaluated using the following metrics:

Metric	Training Set	Test Set
Accuracy	89.7%	89.9%
Precision	0.8977	0.8991
Recall	0.8976	0.8991
F1 Score	0.8976	0.8990
Loss	-	0.2705

Hyperparameters

The most relevant hyperparameters used for optimization include:

Number of epochs: Determines the number of training cycles
Weight decay: Helps prevent overfitting
Warmup steps: Gradually increases learning rate for stability
Hidden dropout probability: Reduces overfitting by randomly dropping connections
Attention dropout probability: Prevents over-reliance on specific tokens

This project leverages transfer learning with fine-tuning to achieve high accuracy in sentiment analysis.