---
license: cc0-1.0
task_categories:
- token-classification
language:
- en
tags:
- named-entity-recognition
- ner
- scientific
- unit-conversion
- units
- measurement
- natural-language-understanding
- automatic-annotations
---

# DistilBERT Token Classification Model for Unit Conversion

### Model Overview

This model is a fine-tuned version of `distilbert/distilbert-base-uncased` for token classification on unit-conversion text. It recognizes unit values and conversion entities, enabling automatic extraction of unit-related data.

### Dataset

The model is trained on the `maliknaik/natural_unit_conversion` dataset, which contains:

- **Training set**: 583,863 examples
- **Validation set**: 100,091 examples
- **Test set**: 150,137 examples

Each example consists of:

- **text**: The input sentence containing unit-related phrases.
- **entities**: The labeled entities specifying unit values and types.

Dataset URL: [https://huggingface.co/datasets/maliknaik/natural_unit_conversion](https://huggingface.co/datasets/maliknaik/natural_unit_conversion)

### Labels

The model classifies tokens into the following categories:

- `B-FROM_UNIT`: Beginning of the source unit
- `I-FROM_UNIT`: Inside the source unit
- `B-TO_UNIT`: Beginning of the target unit
- `I-TO_UNIT`: Inside the target unit
- `B-FEET_VALUE`: Beginning of a feet value
- `I-FEET_VALUE`: Inside a feet value
- `B-INCH_VALUE`: Beginning of an inch value
- `I-INCH_VALUE`: Inside an inch value

### Training Details

- **Base Model**: `distilbert/distilbert-base-uncased`
- **Tokenization**: `AutoTokenizer` from Hugging Face Transformers
- **Training Framework**: Hugging Face `Trainer`
- **Data Collator**: `DataCollatorForTokenClassification`
- **Loss Function**: CrossEntropyLoss
- **Batch Size**: 64
- **Epochs**: 10
- **GPU**: 1x NVIDIA Tesla P4 (8GB GDDR5)
- **CPU**: 56 vCPUs
- **RAM**: 283GB

### Usage

To use this model for inference:

```python
from transformers import AutoTokenizer, AutoModelForTokenClassification, pipeline

model_name = 'maliknaik/distilbert-natural-unit-conversion'

model = AutoModelForTokenClassification.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained('distilbert/distilbert-base-uncased')

text = 'How many miles are there in 50 kilometers?'

# aggregation_strategy='simple' merges B-/I- sub-tokens into whole entities,
# which is what produces the 'entity_group' keys shown below.
unit_pipeline = pipeline('ner', model=model, tokenizer=tokenizer, aggregation_strategy='simple')

print(unit_pipeline(text))
```

Output:

```bash
[{'entity_group': 'TO_UNIT', 'score': np.float32(0.9999982), 'word': 'miles', 'start': 9, 'end': 14},
 {'entity_group': 'FROM_UNIT', 'score': np.float32(0.9999473), 'word': 'kilometers', 'start': 31, 'end': 41}]
```
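The grouped predictions can then be folded into a structured conversion request. Below is a minimal sketch; the `parse_conversion` helper and its output format are illustrative and not part of this repository:

```python
def parse_conversion(entities):
    """Collect grouped NER entities into a flat conversion request.

    Assumes the pipeline above was created with aggregation_strategy='simple',
    so each prediction carries an 'entity_group' key (e.g. FROM_UNIT, TO_UNIT).
    """
    request = {}
    for ent in entities:
        key = ent['entity_group'].lower()  # e.g. 'from_unit', 'to_unit'
        request.setdefault(key, ent['word'])
    return request

print(parse_conversion(unit_pipeline('How many miles are there in 50 kilometers?')))
# {'to_unit': 'miles', 'from_unit': 'kilometers'}
```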
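If you prefer not to use the pipeline API, per-token BIO labels can also be read directly from the model's logits. A minimal sketch, reusing `model`, `tokenizer`, and `text` from the example above:

```python
import torch

inputs = tokenizer(text, return_tensors='pt')
with torch.no_grad():
    logits = model(**inputs).logits

# Map each sub-token to its highest-scoring label via the model config.
predicted_ids = logits.argmax(dim=-1)[0]
tokens = tokenizer.convert_ids_to_tokens(inputs['input_ids'][0])
for token, label_id in zip(tokens, predicted_ids):
    print(token, model.config.id2label[label_id.item()])
```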
### Performance

The model achieves a high F1 score at identifying unit values and conversion entities. F1 scores on the validation and test sets are expected to improve with further optimization.

### Intended Use

This model can be used for named entity recognition (NER) in unit-conversion text and related natural language understanding tasks.

### License

This model is available under the CC0-1.0 license. It is free to use for any purpose without restriction.

### Contributions

Developed by [Malik N. Mohammed](https://maliknaik.me/), leveraging **DistilBERT** for efficient NLP token classification.

### Citation

If you use this model in your work, please cite it as follows:

```
@misc{unit-conversion-model,
  author       = {Malik N. Mohammed},
  title        = {Natural Language Unit Conversion Model for Named-Entity Recognition},
  year         = {2025},
  publisher    = {HuggingFace},
  journal      = {HuggingFace repository},
  howpublished = {\url{https://huggingface.co/maliknaik/distilbert-natural-unit-conversion/}}
}
```