DistilBERT Text Classification Model

This model is a fine-tuned version of distilbert-base-uncased for text classification tasks.

Model Description

This model is a fine-tuned DistilBERT model for binary text classification, designed to classify whether a text relates to Pittsburgh or to Shanghai. It reaches 99.5% accuracy on the test set.

  • Model type: Text Classification (Binary)
  • Language(s) (NLP): English
  • Base model: distilbert-base-uncased
  • Classes: Pittsburgh, Shanghai

Intended Uses & Limitations

Intended Uses

  • Binary text classification between Pittsburgh and Shanghai-related content
  • City-based text categorization tasks
  • Research and educational purposes in NLP and text classification

Limitations

  • Limited to English language text
  • Performance may vary on out-of-domain data
  • Maximum input length of 256 tokens due to truncation

Training and Evaluation Data

Training Data

  • Base dataset: cassieli226/cities-text-dataset
  • Original dataset: 100 samples (50 Pittsburgh, 50 Shanghai)
  • Data augmentation: Applied to grow the dataset from 100 to 1,000 samples
  • Classes after augmentation: Pittsburgh (507 samples) and Shanghai (493 samples)
  • Train/test split: 80/20 stratified split (800 train, 200 test), as sketched below
  • External validation: The original 100 samples were used for additional validation
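
The stratified split could be reproduced roughly as follows (a sketch; it assumes the augmented 1,000-sample data is available with a string "label" column, which may differ from what the hub dataset actually contains):

from datasets import load_dataset

# Load the base dataset; augmentation from 100 to 1,000 samples is assumed
# to have been applied before (or right after) this step
dataset = load_dataset("cassieli226/cities-text-dataset", split="train")

# Encode the string labels as a ClassLabel feature so stratification is possible
dataset = dataset.class_encode_column("label")

# 80/20 stratified split (800 train / 200 test on the augmented data)
splits = dataset.train_test_split(test_size=0.2, stratify_by_column="label", seed=42)
train_ds, test_ds = splits["train"], splits["test"]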

Preprocessing

  • Text tokenization using DistilBERT tokenizer
  • Maximum sequence length: 256 tokens
  • Truncation applied to sequences longer than 256 tokens, as sketched below
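
In code, this preprocessing step might look like this (a sketch; the "text" column name and batched mapping are assumptions):

from transformers import AutoTokenizer

# Tokenizer matching the base checkpoint
tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")

def tokenize(batch):
    # Truncate any sequence longer than 256 tokens
    return tokenizer(batch["text"], truncation=True, max_length=256)

# Applied to the splits from the previous sketch
train_ds = train_ds.map(tokenize, batched=True)
test_ds = test_ds.map(tokenize, batched=True)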

Training Procedure

Training Hyperparameters

  • Learning rate: 5e-5
  • Training batch size: 16
  • Evaluation batch size: 32
  • Number of epochs: 4
  • Weight decay: 0.01
  • Warmup ratio: 0.1
  • LR scheduler: Linear
  • Gradient accumulation steps: 1
  • Mixed precision: FP16 when a GPU is available (see the sketch below)
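
These hyperparameters map onto transformers TrainingArguments roughly as follows (a sketch; the output directory name is an assumption):

import torch
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="hw2-text-distilbert",   # assumed output directory
    learning_rate=5e-5,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=32,
    num_train_epochs=4,
    weight_decay=0.01,
    warmup_ratio=0.1,
    lr_scheduler_type="linear",
    gradient_accumulation_steps=1,
    fp16=torch.cuda.is_available(),     # mixed precision only when a GPU is present
)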

Training Configuration

  • Optimizer: AdamW (default)
  • Early stopping: Enabled with patience of 2 epochs
  • Best model selection: Based on F1 score (macro)
  • Evaluation strategy: Every epoch
  • Save strategy: Every epoch, keeping the best model only (see the Trainer sketch below)
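
This configuration corresponds to a Trainer setup along these lines (a sketch; the evaluation/save strategies and best-model selection also live in TrainingArguments, and the f1_macro metric key, compute_metrics function, tokenizer, and tokenized splits come from the other sketches and are assumptions):

from transformers import (AutoModelForSequenceClassification, DataCollatorWithPadding,
                          EarlyStoppingCallback, Trainer, TrainingArguments)

training_args = TrainingArguments(
    output_dir="hw2-text-distilbert",
    # ...hyperparameters from the previous sketch...
    eval_strategy="epoch",              # "evaluation_strategy" in older transformers releases
    save_strategy="epoch",
    save_total_limit=1,                 # keep only the best checkpoint (assumed)
    load_best_model_at_end=True,        # restore the best checkpoint when training ends
    metric_for_best_model="f1_macro",   # matches the key returned by compute_metrics
)

model = AutoModelForSequenceClassification.from_pretrained(
    "distilbert-base-uncased", num_labels=2)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_ds,
    eval_dataset=test_ds,
    data_collator=DataCollatorWithPadding(tokenizer),
    compute_metrics=compute_metrics,
    callbacks=[EarlyStoppingCallback(early_stopping_patience=2)],
)
trainer.train()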

Evaluation

Metrics

The model was evaluated using:

  • Accuracy: Overall classification accuracy
  • F1 Score (Macro): Macro-averaged F1 score across all classes
  • Per-class accuracy: Individual accuracy for the Pittsburgh and Shanghai classes (see the sketch below)
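
A compute_metrics function covering these metrics could be written as follows (a sketch using scikit-learn; the "f1_macro" key is chosen to match the best-model selection above):

import numpy as np
from sklearn.metrics import accuracy_score, f1_score

def compute_metrics(eval_pred):
    logits, labels = eval_pred
    preds = np.argmax(logits, axis=-1)
    # Per-class accuracy: fraction of correct predictions within each class
    per_class = {
        f"accuracy_class_{c}": accuracy_score(labels[labels == c], preds[labels == c])
        for c in np.unique(labels)
    }
    return {
        "accuracy": accuracy_score(labels, preds),
        "f1_macro": f1_score(labels, preds, average="macro"),
        **per_class,
    }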

Results

  • Test Set Performance:
    • Accuracy: 99.5%
    • F1 Score (Macro): 99.5%
  • External Validation:
    • Accuracy: 100.0%
    • F1 Score (Macro): 100.0%

Detailed Performance

  • Pittsburgh Class: 99.01% accuracy (101 samples)
  • Shanghai Class: 100.0% accuracy (99 samples)
  • Confusion Matrix: Only 1 misclassification out of 200 test samples

Usage

from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

# Load model and tokenizer
model_name = "Anyuhhh/hw2-text-distilbert"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)

# Example usage
text = "Your input text here"
inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=256)

with torch.no_grad():
    outputs = model(**inputs)
    predictions = torch.nn.functional.softmax(outputs.logits, dim=-1)
    predicted_class = torch.argmax(predictions, dim=-1)

print(f"Predicted class: {predicted_class.item()}")