Cheese Origin Classification with DistilBERT

Model Description

This repository contains a fine-tuned DistilBERT model for classifying cheese descriptions into their country/region of origin. The model was trained on the aslan-ng/cheese-text dataset as part of Homework 1 for CMU 24-679 (Designing and Deploying AI/ML).

Base model: distilbert-base-uncased
Task: Multiclass text classification
Labels (21 classes): Belgium/Germany, Bulgaria, Cyprus, Denmark, England, France, Germany, Greece, India, Italy, Levant, Mexico, Netherlands, Norway, Peru, Philippines, Poland, Portugal, Spain, Switzerland, USA

Training and Evaluation

Train/Val/Test split (augmented data): 640 / 160 / 200
External validation (original data): 100
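
The 640 / 160 / 200 counts are consistent with holding out 20% of the 1,000 augmented examples for testing and then 20% of the remainder for validation. The card does not state how the split was actually produced, so the sketch below is only an assumption: `split_indices`, its fractions, and its seed are hypothetical names and values chosen to reproduce the arithmetic.

```python
import random

def split_indices(n, test_frac=0.2, val_frac=0.2, seed=0):
    """Shuffle range(n), hold out test_frac, then val_frac of the remainder."""
    indices = list(range(n))
    random.Random(seed).shuffle(indices)       # hypothetical seed, not from the card
    n_test = round(n * test_frac)              # 1000 * 0.2 -> 200
    n_val = round((n - n_test) * val_frac)     # 800 * 0.2 -> 160
    test = indices[:n_test]
    val = indices[n_test:n_test + n_val]
    train = indices[n_test + n_val:]           # remaining 640
    return train, val, test

train, val, test = split_indices(1000)
print(len(train), len(val), len(test))  # 640 160 200
```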

| Dataset | Accuracy | F1 (Weighted) | Precision | Recall |
|---|---|---|---|---|
| Augmented Test | 0.9300 | 0.9123 | 0.9168 | 0.9300 |
| External Validation | 0.9500 | 0.9323 | 0.9214 | 0.9500 |

The model holds up on the 100 original, non-augmented examples, reaching 95% accuracy on this external validation set compared with 93% on the augmented test split.
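
The Recall column equals Accuracy in both rows of the table, which is what support-weighted averaging produces: each class's recall is weighted by its share of the true labels, so the weighted sum collapses to overall accuracy. The following is a minimal pure-Python sketch of that averaging, assuming the card's precision and recall are support-weighted like the F1 column. The function name `weighted_metrics` and the toy labels are illustrative and do not come from the original evaluation code.

```python
from collections import Counter

def weighted_metrics(y_true, y_pred):
    """Accuracy plus support-weighted precision, recall, and F1."""
    n = len(y_true)
    support = Counter(y_true)      # number of true examples per class
    predicted = Counter(y_pred)    # number of predictions per class
    correct = Counter(t for t, p in zip(y_true, y_pred) if t == p)

    precision = recall = f1 = 0.0
    for label, count in support.items():
        # A class that is never predicted gets precision 0 here.
        p = correct[label] / predicted[label] if predicted[label] else 0.0
        r = correct[label] / count
        f = 2 * p * r / (p + r) if p + r else 0.0
        weight = count / n         # class share of the true labels
        precision += weight * p
        recall += weight * r
        f1 += weight * f

    accuracy = sum(correct.values()) / n
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}

# Toy labels for illustration only, not model outputs.
y_true = ["Greece", "Greece", "Cyprus", "Italy"]
y_pred = ["Greece", "Greece", "Greece", "Italy"]
print(weighted_metrics(y_true, y_pred))
```

Weighted F1 can fall below accuracy, as it does in the table, because classes the model rarely or never predicts contribute a low per-class F1.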

Error Analysis

Common confusions occur between geographically or culturally close regions:

- Germany vs Switzerland (e.g., Limburger → Switzerland)
- Norway vs Denmark (e.g., Jarlsberg → Denmark)
- Cyprus vs Greece (e.g., Halloumi → Greece)
- Philippines vs Spain (e.g., Queso de Bola → Spain)

These errors reflect real-world overlaps in cheese naming and history.
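
One way to surface confusions like these is to tally the (true, predicted) pairs among the misclassified examples. The sketch below shows the idea in plain Python. The helper name `top_confusions` and the label lists are illustrative assumptions, not the analysis code or predictions behind this card.

```python
from collections import Counter

def top_confusions(y_true, y_pred, k=5):
    """Return the k most frequent (true, predicted) pairs among the errors."""
    errors = Counter((t, p) for t, p in zip(y_true, y_pred) if t != p)
    return errors.most_common(k)

# Illustrative labels only, not actual model outputs.
y_true = ["Norway", "Cyprus", "Cyprus", "Italy", "Philippines"]
y_pred = ["Denmark", "Greece", "Greece", "Italy", "Spain"]

for (true_label, pred_label), count in top_confusions(y_true, y_pred):
    print(f"{true_label} -> {pred_label}: {count}")
```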

How to Use

Install dependencies

```bash
pip install transformers torch datasets
```

Load model and tokenizer

```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

model_id = "cassieli226/cheese-text-distilbert-predictor"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForSequenceClassification.from_pretrained(model_id)

text = "Halloumi is a semi-hard Cypriot cheese known for its high melting point."
inputs = tokenizer(text, return_tensors="pt")

# Run inference without tracking gradients
with torch.no_grad():
    logits = model(**inputs).logits
    predicted_class = logits.argmax(dim=-1).item()

# Map the predicted class index back to its label name
print("Predicted class:", model.config.id2label[predicted_class])
```
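
Because several regions are easily confused, it can help to look at the top few candidates rather than only the argmax. The sketch below converts a list of raw logits into softmax probabilities and returns the highest-scoring labels. The helper `top_k_labels` and the three-class example values are hypothetical and not part of this repository.

```python
import math

def top_k_labels(logits, id2label, k=3):
    """Convert raw logits to softmax probabilities and return the k best labels."""
    peak = max(logits)                          # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    return [(id2label[i], probs[i]) for i in ranked[:k]]

# Hypothetical three-class example; the real model has 21 labels.
example_logits = [2.0, 0.5, -1.0]
example_id2label = {0: "Cyprus", 1: "Greece", 2: "Italy"}

for label, prob in top_k_labels(example_logits, example_id2label, k=2):
    print(f"{label}: {prob:.3f}")
```

With the model loaded as above, the same helper should accept `logits[0].tolist()` and `model.config.id2label` as its inputs.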

Dataset

Source: aslan-ng/cheese-text
Structure:
- `original`: 100 manually curated examples
- `augmented`: 1000 synthetic examples (paraphrased and simplified)

Intended Use

Educational demonstration of fine-tuning DistilBERT for multiclass classification.
Baseline for exploring text augmentation and error analysis in NLP coursework.

Limitations

Dataset is small (≈1,100 examples), so predictions may be sensitive to phrasing.
Cultural and regional overlaps in cheese descriptions can lead to ambiguities.

Acknowledgments

Dataset by aslan-ng
Fine-tuning completed for CMU 24-679 coursework.
Model training, evaluation, and upload were carried out with Hugging Face Transformers.
