# DistilBERT Text Classification Model
This model is a fine-tuned version of `distilbert-base-uncased` for text classification tasks.
## Model Description
This is a fine-tuned DistilBERT model for binary text classification, designed to label text as relating to either Pittsburgh or Shanghai. It reaches 99.5% accuracy on the held-out test set.
- Model type: Text Classification (Binary)
- Language(s) (NLP): English
- Base model: distilbert-base-uncased
- Classes: Pittsburgh, Shanghai
## Intended Uses & Limitations

### Intended Uses
- Binary classification of Pittsburgh- versus Shanghai-related text
- City-based text categorization tasks
- Research and educational purposes in NLP and text classification
### Limitations
- Limited to English language text
- Performance may vary on out-of-domain data
- Inputs longer than 256 tokens are truncated
## Training and Evaluation Data

### Training Data
- Base dataset: `cassieli226/cities-text-dataset`
- Original dataset: 100 samples (50 Pittsburgh, 50 Shanghai)
- Data augmentation: expanded the dataset from 100 to 1,000 samples
- Class balance after augmentation: Pittsburgh (507 samples), Shanghai (493 samples)
- Train/test split: stratified 80/20 split (800 train, 200 test); a split sketch follows this list
- External validation: the original 100 samples were used for additional validation
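A minimal sketch of the stratified split using scikit-learn; the dataset's split name and `text`/`label` column names are assumptions about the repository layout:

```python
from datasets import load_dataset
from sklearn.model_selection import train_test_split

# Split and column names are assumptions about the dataset layout
dataset = load_dataset("cassieli226/cities-text-dataset", split="train")
texts, labels = dataset["text"], dataset["label"]

# Stratification preserves the Pittsburgh/Shanghai ratio in both halves
train_texts, test_texts, train_labels, test_labels = train_test_split(
    texts, labels, test_size=0.2, stratify=labels, random_state=42
)
```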
### Preprocessing
- Text tokenization using DistilBERT tokenizer
- Maximum sequence length: 256 tokens
- Truncation applied to longer sequences (a tokenization sketch follows this list)
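A minimal sketch of this preprocessing step, assuming the data lives in a Hugging Face `datasets` object with a `text` column:

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")

def tokenize(batch):
    # Truncate anything beyond 256 tokens
    return tokenizer(batch["text"], truncation=True, max_length=256)

# tokenized = dataset.map(tokenize, batched=True)
```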
## Training Procedure

### Training Hyperparameters
- Learning rate: 5e-5
- Training batch size: 16
- Evaluation batch size: 32
- Number of epochs: 4
- Weight decay: 0.01
- Warmup ratio: 0.1
- LR scheduler: Linear
- Gradient accumulation steps: 1
- Mixed precision: FP16 (if GPU available)
### Training Configuration
- Optimizer: AdamW (default)
- Early stopping: Enabled with patience of 2 epochs
- Best model selection: Based on F1 score (macro)
- Evaluation strategy: Every epoch
- Save strategy: Every epoch (best model only)
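Taken together, these hyperparameters and configuration settings correspond roughly to the `Trainer` setup sketched below. This is a reconstruction, not the author's training script: the output path, dataset variables, model, and `compute_metrics` function are assumptions, and `eval_strategy` is named `evaluation_strategy` in older `transformers` releases.

```python
import torch
from transformers import Trainer, TrainingArguments, EarlyStoppingCallback

training_args = TrainingArguments(
    output_dir="out",                # placeholder path
    learning_rate=5e-5,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=32,
    num_train_epochs=4,
    weight_decay=0.01,
    warmup_ratio=0.1,
    lr_scheduler_type="linear",
    gradient_accumulation_steps=1,
    fp16=torch.cuda.is_available(),  # FP16 only when a GPU is present
    eval_strategy="epoch",
    save_strategy="epoch",
    load_best_model_at_end=True,     # restore the best checkpoint after training
    metric_for_best_model="f1",      # assumes compute_metrics returns an "f1" key
    save_total_limit=1,              # keep only the best checkpoint
)

trainer = Trainer(
    model=model,                      # DistilBERT classifier (assumed defined)
    args=training_args,
    train_dataset=train_dataset,      # assumed tokenized splits
    eval_dataset=test_dataset,
    compute_metrics=compute_metrics,  # see the Metrics section below
    callbacks=[EarlyStoppingCallback(early_stopping_patience=2)],
)
trainer.train()
```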
## Evaluation

### Metrics

The model was evaluated using the following metrics (a `compute_metrics` sketch follows the list):
- Accuracy: Overall classification accuracy
- F1 Score (Macro): Macro-averaged F1 score across all classes
- Per-class accuracy: Individual class performance metrics
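A minimal sketch of a `compute_metrics` function that produces these values with scikit-learn; the returned key names are assumptions, chosen to match `metric_for_best_model="f1"` in the training sketch above:

```python
import numpy as np
from sklearn.metrics import accuracy_score, f1_score

def compute_metrics(eval_pred):
    logits, labels = eval_pred
    preds = np.argmax(logits, axis=-1)
    return {
        "accuracy": accuracy_score(labels, preds),
        "f1": f1_score(labels, preds, average="macro"),  # macro-averaged F1
    }
```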
### Results

- Test set:
  - Accuracy: 99.5%
  - F1 score (macro): 99.5%
- External validation (original 100 samples):
  - Accuracy: 100.0%
  - F1 score (macro): 100.0%
### Detailed Performance
- Pittsburgh Class: 99.01% accuracy (101 samples)
- Shanghai Class: 100.0% accuracy (99 samples)
- Confusion Matrix: Only 1 misclassification out of 200 test samples
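Per-class accuracy and the confusion matrix can be recomputed with scikit-learn; `test_labels` and `test_preds` are assumed arrays of true and predicted class ids:

```python
from sklearn.metrics import confusion_matrix

# Class-id order (0/1 -> Pittsburgh/Shanghai) is an assumption; check model.config.id2label
cm = confusion_matrix(test_labels, test_preds, labels=[0, 1])
per_class_acc = cm.diagonal() / cm.sum(axis=1)  # per-class accuracy is recall per class
print(cm)
print(per_class_acc)
```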
## Usage
```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

# Load model and tokenizer
model_name = "Anyuhhh/hw2-text-distilbert"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)

# Example usage: tokenize with the same 256-token truncation used in training
text = "Your input text here"
inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=256)

with torch.no_grad():
    outputs = model(**inputs)

predictions = torch.nn.functional.softmax(outputs.logits, dim=-1)
predicted_class = torch.argmax(predictions, dim=-1)
print(f"Predicted class: {predicted_class.item()}")
```
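To turn the predicted index into a city name, the config's `id2label` mapping can be used, assuming the uploaded model populates it:

```python
# Assumes model.config.id2label is set, e.g. {0: "Pittsburgh", 1: "Shanghai"}
label = model.config.id2label[predicted_class.item()]
print(f"Predicted label: {label}")
```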