DistilBERT-Base-Uncased Quantized Model for Scientific Paper Classification
This repository hosts a quantized version of the DistilBERT model, fine-tuned for scientific paper classification into three categories: Biology, Mathematics, and Physics. The model has been optimized for efficient deployment while maintaining high accuracy, making it suitable for real-world applications, including academic research and automated categorization of scientific literature.
Model Details
- Model Architecture: DistilBERT Base Uncased
- Task: Scientific Paper Classification
- Dataset: Custom dataset labeled with three categories: Biology, Mathematics, and Physics
- Quantization: Float16 (FP16)
- Fine-tuning Framework: Hugging Face Transformers
Usage
Installation
pip install transformers torch
Loading the Model
from transformers import DistilBertForSequenceClassification, DistilBertTokenizer
import torch
# Load quantized model
quantized_model_path = "/kaggle/working/distilbert_finetuned_fp16"
quantized_model = DistilBertForSequenceClassification.from_pretrained(quantized_model_path)
quantized_model.eval() # Set to evaluation mode
quantized_model.half() # Convert model to FP16
# Load tokenizer
tokenizer = DistilBertTokenizer.from_pretrained("distilbert-base-uncased")
# Define a test input
test_paper = "The quantum mechanics of atomic structures are governed by Schrödinger's equation."
# Tokenize input
inputs = tokenizer(test_paper, return_tensors="pt", padding=True, truncation=True, max_length=512)
# Ensure input tensors are in correct dtype
inputs["input_ids"] = inputs["input_ids"].long() # Convert to long type
inputs["attention_mask"] = inputs["attention_mask"].long() # Convert to long type
# Make prediction
with torch.no_grad():
    outputs = quantized_model(**inputs)
# Get predicted class
predicted_class = torch.argmax(outputs.logits, dim=1).item()
# Class labels
label_mapping = {0: "Biology", 1: "Mathematics", 2: "Physics"}
predicted_label = label_mapping[predicted_class]
print(f"Predicted Label: {predicted_label}")
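The example above reports only the top class. As a minimal sketch, assuming a per-class confidence is also useful, the raw logits can be passed through a softmax. `logits_to_prediction` is a hypothetical helper (not part of this repository) that works on plain Python floats, and the logits in the demo call are made up for illustration.

```python
import math

# Label order taken from the label_mapping in the example above.
LABELS = {0: "Biology", 1: "Mathematics", 2: "Physics"}

def logits_to_prediction(logits):
    """Hypothetical helper: softmax over raw logits, returns (label, probabilities)."""
    # Subtract the largest logit before exponentiating for numerical stability.
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    best = max(range(len(probs)), key=probs.__getitem__)
    return LABELS[best], {LABELS[i]: p for i, p in enumerate(probs)}

# Made-up logits, for illustration only.
label, probs = logits_to_prediction([0.2, -1.0, 3.1])
print(label, probs)
```

With the model loaded as above, the call would look like `logits_to_prediction(outputs.logits[0].float().tolist())`; the `.float()` cast converts the FP16 logits before they are turned into Python numbers.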
Performance Metrics
- Accuracy: 0.95 (after fine-tuning)
- F1-Score: 0.91 (weighted)
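For readers who want to reproduce these two metrics on their own predictions, the following is a minimal pure-Python sketch of accuracy and support-weighted F1. It is not the evaluation script used for this model, and the labels in the demo are toy values, not results from the dataset.

```python
from collections import Counter

def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def weighted_f1(y_true, y_pred):
    """Per-class F1, averaged with each class weighted by its number of true samples."""
    support = Counter(y_true)
    total = len(y_true)
    score = 0.0
    for label, count in support.items():
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
        fn = count - tp
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        score += f1 * count / total
    return score

# Toy labels (0 = Biology, 1 = Mathematics, 2 = Physics), for illustration only.
y_true = [0, 0, 1, 1, 2, 2]
y_pred = [0, 0, 1, 2, 2, 2]
print(accuracy(y_true, y_pred), weighted_f1(y_true, y_pred))
```

Weighting by support means that a class with more true examples contributes more to the final score, which is why the weighted F1 can differ from accuracy when errors are unevenly spread across classes.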
Fine-Tuning Details
Dataset
The dataset consists of scientific papers categorized into three domains:
- Biology
- Mathematics
- Physics
The dataset was preprocessed and tokenized using the DistilBERT tokenizer.
Training
- Number of epochs: 3
- Batch size: 8
- Learning rate: 2e-5
- Optimizer: AdamW
- Evaluation strategy: epoch
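The hyperparameters above could map onto Hugging Face `TrainingArguments` roughly as sketched below. This is an assumption about the setup, not the original training script: `output_dir` is a placeholder, the batch size is assumed to be per device, and `build_training_arguments` is a hypothetical helper. The evaluation-schedule argument has been named differently across transformers releases, so the helper checks which name is available.

```python
import inspect

# Hyperparameters as listed above. output_dir is a placeholder, and mapping
# "batch size" to per_device_train_batch_size is an assumption.
TRAINING_KWARGS = {
    "output_dir": "./results",
    "num_train_epochs": 3,
    "per_device_train_batch_size": 8,
    "learning_rate": 2e-5,
    "optim": "adamw_torch",  # AdamW implementation from PyTorch
}

def build_training_arguments():
    """Hypothetical helper: build TrainingArguments (requires transformers and torch)."""
    from transformers import TrainingArguments  # imported lazily

    params = inspect.signature(TrainingArguments.__init__).parameters
    # Use whichever name this transformers version accepts for the evaluation schedule.
    eval_key = "eval_strategy" if "eval_strategy" in params else "evaluation_strategy"
    return TrainingArguments(**TRAINING_KWARGS, **{eval_key: "epoch"})
```

The returned object would then be passed to a `Trainer` together with the model and the tokenized train and evaluation datasets.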
Quantization
Post-training quantization to FP16 (half precision) was applied with PyTorch to reduce the model size and improve inference efficiency.
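To give a rough sense of the size reduction, the sketch below estimates the weight storage at 4 bytes per parameter (FP32) versus 2 bytes (FP16). The parameter count of about 66 million for DistilBERT base is an approximation; the exact figure for this checkpoint depends on the classification head and is not stated in this repository.

```python
# Approximate parameter count for DistilBERT base (an assumption, see above).
APPROX_PARAMS = 66_000_000
BYTES_PER_PARAM = {"fp32": 4, "fp16": 2}

def weights_size_mb(num_params, dtype):
    """Estimated size of the weights in megabytes (10**6 bytes)."""
    return num_params * BYTES_PER_PARAM[dtype] / 1e6

fp32_mb = weights_size_mb(APPROX_PARAMS, "fp32")
fp16_mb = weights_size_mb(APPROX_PARAMS, "fp16")
print(f"FP32: ~{fp32_mb:.0f} MB, FP16: ~{fp16_mb:.0f} MB")
```

Under these assumptions the FP16 weights take about half the space of the FP32 weights; files on disk will differ slightly because of metadata and any tensors kept in another dtype.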
Repository Structure
.
├── model/ # Contains the quantized model files
├── tokenizer_config/ # Tokenizer configuration and vocabulary files
├── model.safetensors # Fine-tuned model weights
└── README.md # Model documentation
Limitations
- The model is trained on a limited dataset and may not generalize well to niche scientific subdomains.
- Quantization may result in slight accuracy degradation compared to full-precision models.
Contributing
Contributions are welcome! Feel free to open an issue or submit a pull request if you have suggestions or improvements.