# Cheese Origin Classification with DistilBERT
## Model Description
This repository contains a fine-tuned DistilBERT model for classifying cheese descriptions into their country/region of origin.
The model was trained on the aslan-ng/cheese-text
dataset as part of Homework 1 for CMU 24-679 (Designing and Deploying AI/ML).
- Base model: distilbert-base-uncased
- Task: Multiclass text classification
- Labels (21 classes): Belgium/Germany, Bulgaria, Cyprus, Denmark, England, France, Germany, Greece, India, Italy, Levant, Mexico, Netherlands, Norway, Peru, Philippines, Poland, Portugal, Spain, Switzerland, USA
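The full id-to-label mapping ships with the model config and can be inspected directly; a minimal sketch (the model id is the one used in the How to Use section below):

```python
from transformers import AutoConfig

# Load only the config to inspect the label mapping without downloading the weights
config = AutoConfig.from_pretrained("cassieli226/cheese-text-distilbert-predictor")
print(config.num_labels)   # expected: 21
print(config.id2label)     # maps class ids to country/region names
```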
## Training and Evaluation
- Train/Val/Test split (augmented data): 640 / 160 / 200
- External validation (original data): 100
| Dataset | Accuracy | F1 (Weighted) | Precision | Recall |
|---|---|---|---|---|
| Augmented Test | 0.9300 | 0.9123 | 0.9168 | 0.9300 |
| External Validation | 0.9500 | 0.9323 | 0.9214 | 0.9500 |
The model generalizes well beyond the augmented split, achieving 95% accuracy on the original validation set.
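The weighted metrics above can be reproduced along these lines with scikit-learn (a minimal sketch; the exact evaluation script is not part of this card, and `y_true`/`y_pred` stand in for the test labels and model predictions):

```python
from sklearn.metrics import accuracy_score, precision_recall_fscore_support

def compute_metrics(y_true, y_pred):
    """Accuracy plus weighted precision/recall/F1 over the 21 classes."""
    precision, recall, f1, _ = precision_recall_fscore_support(
        y_true, y_pred, average="weighted", zero_division=0
    )
    return {
        "accuracy": accuracy_score(y_true, y_pred),
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }
```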
## Error Analysis
Common confusions occur between geographically or culturally close regions:
- Germany vs Switzerland (e.g., Limburger misclassified as Swiss)
- Norway vs Denmark (e.g., Jarlsberg → Denmark)
- Cyprus vs Greece (e.g., Halloumi → Greece)
- Philippines vs Spain (Queso de Bola → Spain)
These errors reflect real-world overlaps in cheese naming and history.
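Confusions like these can be surfaced with a per-class confusion matrix over the test predictions; a minimal sketch with placeholder data (in practice `y_true`/`y_pred` come from running the model on the test split and `label_names` from `model.config.id2label`):

```python
import matplotlib.pyplot as plt
from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay

# Placeholder labels and predictions for illustration only
label_names = ["Cyprus", "Denmark", "Germany", "Greece", "Norway", "Switzerland"]
y_true = [2, 4, 0, 2, 4, 0]
y_pred = [5, 1, 3, 2, 4, 0]

cm = confusion_matrix(y_true, y_pred, labels=range(len(label_names)))
ConfusionMatrixDisplay(cm, display_labels=label_names).plot(xticks_rotation=45)
plt.tight_layout()
plt.show()
```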
## How to Use
### Install dependencies
```bash
pip install transformers torch datasets
```
### Load model and tokenizer
```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

model_id = "cassieli226/cheese-text-distilbert-predictor"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForSequenceClassification.from_pretrained(model_id)

text = "Halloumi is a semi-hard Cypriot cheese known for its high melting point."
inputs = tokenizer(text, return_tensors="pt")

# Run inference without tracking gradients
with torch.no_grad():
    logits = model(**inputs).logits

predicted_class = logits.argmax(dim=-1).item()
print("Predicted class:", model.config.id2label[predicted_class])
```
## Dataset
- Source: aslan-ng/cheese-text
- Structure:
- `original`: 100 manually curated examples
- `augmented`: 1000 synthetic examples (paraphrased and simplified)
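The dataset can be pulled straight from the Hub with the datasets library; a minimal sketch (whether `original` and `augmented` are exposed as splits or as configurations should be confirmed against the returned object):

```python
from datasets import load_dataset

# Download the dataset and print its structure to see the available splits/subsets
ds = load_dataset("aslan-ng/cheese-text")
print(ds)
```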
## Intended Use
- Educational demonstration of fine-tuning DistilBERT for multiclass classification.
- Baseline for exploring text augmentation and error analysis in NLP coursework.
## Limitations
- Dataset is small (≈1,100 examples), so predictions may be sensitive to phrasing.
- Cultural and regional overlaps in cheese descriptions can lead to ambiguities.
## Acknowledgments
- Dataset by aslan-ng
- Fine-tuning completed for CMU 24-679 coursework.
- Model training, evaluation, and upload assisted with Hugging Face Transformers.