Cheese Origin Classification with DistilBERT

Model Description

This repository contains a fine-tuned DistilBERT model for classifying cheese descriptions into their country/region of origin. The model was trained on the aslan-ng/cheese-text dataset as part of Homework 1 for CMU 24-679 (Designing and Deploying AI/ML).

  • Base model: distilbert-base-uncased
  • Task: Multiclass text classification
  • Labels (21 classes): Belgium/Germany, Bulgaria, Cyprus, Denmark, England, France, Germany, Greece, India, Italy, Levant, Mexico, Netherlands, Norway, Peru, Philippines, Poland, Portugal, Spain, Switzerland, USA

Training and Evaluation

  • Train/Val/Test split (augmented data): 640 / 160 / 200
  • External validation (original data): 100
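The split sizes above are consistent with holding out 20% of the 1000 augmented examples for testing and then 20% of the remainder for validation. The sketch below reproduces those counts; the two-stage 80/20 procedure, the function name `split_indices`, and the seed are assumptions for illustration, since the card states only the resulting sizes.

```python
import random


def split_indices(n, val_frac=0.2, test_frac=0.2, seed=42):
    """Shuffle range(n) and cut it into train/val/test index lists."""
    indices = list(range(n))
    random.Random(seed).shuffle(indices)      # seeded for reproducibility
    n_test = round(n * test_frac)             # hold out the test set first
    n_val = round((n - n_test) * val_frac)    # then carve validation from the rest
    test = indices[:n_test]
    val = indices[n_test:n_test + n_val]
    train = indices[n_test + n_val:]
    return train, val, test


train, val, test = split_indices(1000)
print(len(train), len(val), len(test))  # 640 160 200
```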

| Dataset | Accuracy | F1 (Weighted) | Precision | Recall |
|---|---|---|---|---|
| Augmented Test | 0.9300 | 0.9123 | 0.9168 | 0.9300 |
| External Validation | 0.9500 | 0.9323 | 0.9214 | 0.9500 |
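In both rows Recall equals Accuracy, which is what support-weighted recall always does: weighting each class's recall by its share of the true labels collapses to the overall fraction of correct predictions. The following is a minimal pure-Python sketch of that averaging. It assumes the reported precision, recall, and F1 all use the same weighted scheme as the F1 column (the card labels only F1 as weighted), and the labels in the example are toy values, not model outputs.

```python
from collections import Counter


def weighted_metrics(y_true, y_pred):
    """Accuracy plus support-weighted precision, recall, and F1."""
    n = len(y_true)
    support = Counter(y_true)       # number of true examples per class
    predicted = Counter(y_pred)     # number of predictions per class
    correct = Counter(t for t, p in zip(y_true, y_pred) if t == p)

    precision = recall = f1 = 0.0
    for label, count in support.items():
        tp = correct[label]
        p = tp / predicted[label] if predicted[label] else 0.0
        r = tp / count
        f = 2 * p * r / (p + r) if (p + r) else 0.0
        weight = count / n          # class share of the true labels
        precision += weight * p
        recall += weight * r
        f1 += weight * f
    accuracy = sum(correct.values()) / n
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Toy labels for illustration only.
y_true = ["Greece", "Greece", "Cyprus", "Italy"]
y_pred = ["Greece", "Greece", "Greece", "Italy"]
print(weighted_metrics(y_true, y_pred))
```

A class that is never predicted (Cyprus in the toy data) contributes zero precision, which is one way weighted precision can fall below accuracy, as it does in the table.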

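The model holds up on the original, non-augmented data: it reaches 95% accuracy on the 100 external validation examples (5 errors), compared with 93% on the 200-example augmented test split. With an evaluation set this small, each misclassification shifts accuracy by a full percentage point.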


Error Analysis

Common confusions occur between geographically or culturally close regions:

  • Germany vs Switzerland (e.g., Limburger misclassified as Swiss)
  • Norway vs Denmark (e.g., Jarlsberg → Denmark)
  • Cyprus vs Greece (e.g., Halloumi → Greece)
  • Philippines vs Spain (Queso de Bola → Spain)

These errors reflect real-world overlaps in cheese naming and history.
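Confusions like these can be surfaced by counting (true, predicted) pairs among the misclassified examples. The sketch below shows one way to do that; the helper `top_confusions` is hypothetical, and the label lists are toy data chosen to mirror the pairs above, not the model's actual predictions.

```python
from collections import Counter


def top_confusions(y_true, y_pred, k=5):
    """Return the k most frequent (true, predicted) pairs among the errors."""
    errors = Counter((t, p) for t, p in zip(y_true, y_pred) if t != p)
    return errors.most_common(k)


# Toy labels for illustration only; these are not real model outputs.
y_true = ["Cyprus", "Cyprus", "Norway", "Germany", "Italy"]
y_pred = ["Greece", "Greece", "Denmark", "Switzerland", "Italy"]

for (true_label, pred_label), count in top_confusions(y_true, y_pred):
    print(f"{true_label} -> {pred_label}: {count}")
```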


How to Use

Install dependencies

```bash
pip install transformers torch datasets
```

Load model and tokenizer

```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

model_id = "cassieli226/cheese-text-distilbert-predictor"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForSequenceClassification.from_pretrained(model_id)

text = "Halloumi is a semi-hard Cypriot cheese known for its high melting point."
inputs = tokenizer(text, return_tensors="pt")

with torch.no_grad():
    logits = model(**inputs).logits
predicted_class = logits.argmax(dim=-1).item()

print("Predicted class:", model.config.id2label[predicted_class])
```
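Because several regions are easily confused, it can help to look at the runner-up labels rather than only the argmax. The sketch below converts a flat list of logits into probabilities with a softmax and returns the top k labels. The helper `top_k_labels` is hypothetical and the logits and label map are toy values, so the block runs without downloading the model.

```python
import math


def top_k_labels(logits, id2label, k=3):
    """Softmax a flat list of logits and return the k most probable labels."""
    shift = max(logits)                           # subtract the max for numerical stability
    exps = [math.exp(x - shift) for x in logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    ranked = sorted(range(len(probs)), key=probs.__getitem__, reverse=True)
    return [(id2label[i], probs[i]) for i in ranked[:k]]


# Toy values for illustration; with the real model, pass
# logits[0].tolist() and model.config.id2label instead.
toy_id2label = {0: "Cyprus", 1: "Greece", 2: "Italy"}
for label, prob in top_k_labels([2.0, 1.0, -1.0], toy_id2label, k=2):
    print(f"{label}: {prob:.2f}")
```

With the snippet above, the equivalent call would be `top_k_labels(logits[0].tolist(), model.config.id2label)`, since `logits` has shape `(1, num_labels)` for a single input.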


Dataset

  • Source: aslan-ng/cheese-text
  • Structure:
    • `original`: 100 manually curated examples
    • `augmented`: 1000 synthetic examples (paraphrased and simplified)

Intended Use

  • Educational demonstration of fine-tuning DistilBERT for multiclass classification.
  • Baseline for exploring text augmentation and error analysis in NLP coursework.

Limitations

  • Dataset is small (≈1,100 examples), so predictions may be sensitive to phrasing.
  • Cultural and regional overlaps in cheese descriptions can lead to ambiguities.

Acknowledgments
