---
license: cc0-1.0
task_categories:
- token-classification
language:
- en
tags:
- named-entity-recognition
- ner
- scientific
- unit-conversion
- units
- measurement
- natural-language-understanding
- automatic-annotations
---

# DistilBERT Token Classification Model for Unit Conversion

### Model Overview

This model is a fine-tuned version of `distilbert/distilbert-base-uncased` for token classification on unit-conversion text. It recognizes unit values and conversion entities, enabling automatic extraction of unit-related data.

### Dataset

The model is trained on the `maliknaik/natural_unit_conversion` dataset, which contains:

- **Training set**: 583,863 examples
- **Validation set**: 100,091 examples
- **Test set**: 150,137 examples

Each example consists of:

- **text**: The input sentence containing unit-related phrases.
- **entities**: The labeled entities specifying unit values and types.

Dataset URL: [https://huggingface.co/datasets/maliknaik/natural_unit_conversion](https://huggingface.co/datasets/maliknaik/natural_unit_conversion)

### Labels

The model classifies tokens into the following categories:

- `B-FROM_UNIT`: Beginning of the source unit
- `I-FROM_UNIT`: Inside the source unit
- `B-TO_UNIT`: Beginning of the target unit
- `I-TO_UNIT`: Inside the target unit
- `B-FEET_VALUE`: Beginning of a feet value
- `I-FEET_VALUE`: Inside a feet value
- `B-INCH_VALUE`: Beginning of an inch value
- `I-INCH_VALUE`: Inside an inch value

### Training Details

- **Base Model**: `distilbert/distilbert-base-uncased`
- **Tokenization**: `AutoTokenizer` from Hugging Face Transformers
- **Training Framework**: Hugging Face `Trainer`
- **Data Collator**: `DataCollatorForTokenClassification`
- **Loss Function**: CrossEntropyLoss
- **Batch Size**: 64
- **Epochs**: 10
- **GPU**: 1x NVIDIA Tesla P4 (8GB GDDR5)
- **CPU**: 56 vCPUs
- **RAM**: 283GB

### Usage

To use this model for inference:

```python
from transformers import AutoTokenizer, AutoModelForTokenClassification, pipeline

model_name = 'maliknaik/distilbert-natural-unit-conversion'

model = AutoModelForTokenClassification.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained('distilbert/distilbert-base-uncased')

text = 'How many miles are there in 50 kilometers?'

# aggregation_strategy='simple' merges B-/I- sub-tokens into whole entities,
# which is what produces the 'entity_group' keys shown below.
unit_pipeline = pipeline('ner', model=model, tokenizer=tokenizer, aggregation_strategy='simple')

print(unit_pipeline(text))
```

Output:

```bash
[{'entity_group': 'TO_UNIT', 'score': np.float32(0.9999982), 'word': 'miles', 'start': 9, 'end': 14},
 {'entity_group': 'FROM_UNIT', 'score': np.float32(0.9999473), 'word': 'kilometers', 'start': 31, 'end': 41}]
```
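The grouped predictions can then be folded into a structured conversion request. Below is a minimal sketch; the `parse_conversion` helper and its output format are illustrative and not part of this repository:

```python
def parse_conversion(entities):
    """Collect grouped NER entities into a flat conversion request.

    Assumes the pipeline above was created with aggregation_strategy='simple',
    so each prediction carries an 'entity_group' key (e.g. FROM_UNIT, TO_UNIT).
    """
    request = {}
    for ent in entities:
        key = ent['entity_group'].lower()  # e.g. 'from_unit', 'to_unit'
        request.setdefault(key, ent['word'])
    return request

print(parse_conversion(unit_pipeline('How many miles are there in 50 kilometers?')))
# {'to_unit': 'miles', 'from_unit': 'kilometers'}
```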
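If you prefer not to use the pipeline API, per-token BIO labels can also be read directly from the model's logits. A minimal sketch, reusing `model`, `tokenizer`, and `text` from the example above:

```python
import torch

inputs = tokenizer(text, return_tensors='pt')
with torch.no_grad():
    logits = model(**inputs).logits

# Map each sub-token to its highest-scoring label via the model config.
predicted_ids = logits.argmax(dim=-1)[0]
tokens = tokenizer.convert_ids_to_tokens(inputs['input_ids'][0])
for token, label_id in zip(tokens, predicted_ids):
    print(token, model.config.id2label[label_id.item()])
```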
### Performance

The model achieves a high F1 score at identifying unit values and conversion entities. F1 scores on the validation and test sets are expected to improve with further optimization.

### Intended Use

This model can be used for named entity recognition (NER) in unit-conversion text and related natural language understanding tasks.

### License

This model is available under the CC0-1.0 license. It is free to use for any purpose without restriction.

### Contributions

Developed by [Malik N. Mohammed](https://maliknaik.me/), leveraging **DistilBERT** for efficient NLP token classification.

### Citation

If you use this model in your work, please cite it as follows:

```
@misc{unit-conversion-model,
  author       = {Malik N. Mohammed},
  title        = {Natural Language Unit Conversion Model for Named-Entity Recognition},
  year         = {2025},
  publisher    = {HuggingFace},
  journal      = {HuggingFace repository},
  howpublished = {\url{https://huggingface.co/maliknaik/distilbert-natural-unit-conversion/}}
}
```