Update model card with evaluation results and training config.
README.md
CHANGED
@@ -1,95 +1,88 @@
---
library_name: transformers
license: apache-2.0
base_model: distilbert/distilgpt2
tags:
model-index:
- name: clinical-field-mapper-classification
---
| Training Loss | Epoch | Step | Validation Loss |
|:-------------:|:-----:|:----:|:---------------:|
| 0.9364        | 18.0  | 1422 | 0.9989          |
| 0.935         | 19.0  | 1501 | 0.9984          |
| 0.9328        | 20.0  | 1580 | 0.9976          |
| 0.9305        | 21.0  | 1659 | 0.9951          |
| 0.9289        | 22.0  | 1738 | 0.9949          |
| 0.9273        | 23.0  | 1817 | 0.9966          |
| 0.9263        | 24.0  | 1896 | 0.9954          |
| 0.9256        | 25.0  | 1975 | 0.9954          |
| 0.9243        | 26.0  | 2054 | 0.9937          |
| 0.923         | 27.0  | 2133 | 0.9928          |
| 0.9222        | 28.0  | 2212 | 0.9938          |
| 0.9219        | 29.0  | 2291 | 0.9916          |
| 0.9209        | 30.0  | 2370 | 0.9920          |
| 0.9196        | 31.0  | 2449 | 0.9918          |
| 0.9207        | 32.0  | 2528 | 0.9923          |
| 0.9212        | 33.0  | 2607 | 0.9923          |
| 0.9195        | 34.0  | 2686 | 0.9970          |

### Framework versions

- Transformers 4.51.3
- Pytorch 2.6.0+cu124
- Datasets 3.5.1
- Tokenizers 0.21.1
---
library_name: transformers
license: apache-2.0
tags:
- healthcare
- column-normalization
- text-classification
- distilgpt2
model-index:
- name: tsilva/clinical-field-mapper-classification
  results:
  - task:
      name: Field Classification
      type: text-classification
    dataset:
      name: tsilva/clinical-field-mappings
      type: healthcare
    metrics:
    - name: train Accuracy
      type: accuracy
      value: 0.9471
    - name: validation Accuracy
      type: accuracy
      value: 0.9144
    - name: test Accuracy
      type: accuracy
      value: 0.9156
---

# Model Card for tsilva/clinical-field-mapper-classification

This model is a fine-tuned version of `distilbert/distilgpt2` on the [`tsilva/clinical-field-mappings`](https://huggingface.co/datasets/tsilva/clinical-field-mappings/tree/4d4cdba1b7e9b1eff2893c7014cfc08fe58a73bc) dataset.
Its purpose is to normalize healthcare database column names to a standardized set of target column names.

## Task

This is a sequence classification model that maps free-text field names to a fixed set of standardized schema terms.

## Usage

```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained("tsilva/clinical-field-mapper-classification")
model = AutoModelForSequenceClassification.from_pretrained("tsilva/clinical-field-mapper-classification")

def predict(input_text):
    # Tokenize the raw column name and run it through the classifier
    inputs = tokenizer(input_text, return_tensors="pt")
    outputs = model(**inputs)
    # Take the highest-scoring class and map it back to its label string
    pred = outputs.logits.argmax(-1).item()
    label = model.config.id2label[pred] if getattr(model.config, "id2label", None) else pred
    print(f"Predicted label: {label}")

predict("cardi@")
```
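
A batched variant can be handy when mapping many column names at once. The sketch below continues from the snippet above (it reuses `tokenizer` and `model`); the padding setup and the example inputs are assumptions for illustration, since GPT-2 style tokenizers often ship without a pad token.

```python
import torch

# GPT-2 tokenizers usually have no pad token; reuse EOS so batched padding works.
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token
    model.config.pad_token_id = tokenizer.pad_token_id

def predict_batch(field_names):
    # Tokenize a list of raw column names with padding and classify them in one pass
    inputs = tokenizer(field_names, return_tensors="pt", padding=True)
    with torch.no_grad():
        preds = model(**inputs).logits.argmax(-1).tolist()
    return [model.config.id2label[p] for p in preds]

# Hypothetical inputs, for illustration only
print(predict_batch(["cardi@", "pat_dob", "smoker_yn"]))
```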

## Evaluation Results

- **train accuracy**: 94.71%
- **validation accuracy**: 91.44%
- **test accuracy**: 91.56%
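
A minimal sketch of how the test accuracy above could be recomputed is shown below; the split name and the `text`/`label` column names are assumptions, so check the dataset card for the actual schema before running it.

```python
import torch
from datasets import load_dataset
from transformers import AutoTokenizer, AutoModelForSequenceClassification

dataset = load_dataset("tsilva/clinical-field-mappings", split="test")  # assumed split name
tokenizer = AutoTokenizer.from_pretrained("tsilva/clinical-field-mapper-classification")
model = AutoModelForSequenceClassification.from_pretrained("tsilva/clinical-field-mapper-classification")
model.eval()

correct = 0
for row in dataset:
    inputs = tokenizer(row["text"], return_tensors="pt")  # assumed input column
    with torch.no_grad():
        pred = model(**inputs).logits.argmax(-1).item()
    correct += int(pred == row["label"])  # assumed integer class-id column
print(f"test accuracy: {correct / len(dataset):.4f}")
```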

## Training Details

- **Seed**: 42
- **Epochs scheduled**: 50
- **Epochs completed**: 34
- **Early stopping triggered**: Yes
- **Final training loss**: 1.0888
- **Final evaluation loss**: 0.9916
- **Optimizer**: adamw_bnb_8bit
- **Learning rate**: 0.0005
- **Batch size**: 1024
- **Precision**: fp16
- **DeepSpeed enabled**: True
- **Gradient accumulation steps**: 1
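
For reference, here is a minimal sketch of `TrainingArguments` matching the configuration above; the output directory, DeepSpeed config path, evaluation/save strategies, and early-stopping patience are assumptions that are not recorded in this card.

```python
from transformers import TrainingArguments, EarlyStoppingCallback

training_args = TrainingArguments(
    output_dir="clinical-field-mapper-classification",  # assumed
    seed=42,                           # Seed
    num_train_epochs=50,               # Epochs scheduled (early stopping ended training at epoch 34)
    learning_rate=5e-4,                # Learning rate
    per_device_train_batch_size=1024,  # Batch size (assumed per device)
    gradient_accumulation_steps=1,     # Gradient accumulation steps
    optim="adamw_bnb_8bit",            # Optimizer
    fp16=True,                         # Precision
    deepspeed="ds_config.json",        # DeepSpeed enabled (config path assumed)
    eval_strategy="epoch",             # assumed: evaluate once per epoch, as the loss table suggests
    save_strategy="epoch",             # assumed
    load_best_model_at_end=True,       # needed for early stopping
    metric_for_best_model="eval_loss", # assumed
)

# Early stopping was triggered; the patience value is an assumption.
callbacks = [EarlyStoppingCallback(early_stopping_patience=5)]
```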

## License

This model is released under the Apache 2.0 license, as declared in the metadata above.

## Limitations and Bias

- The model was trained on a specific clinical mapping dataset.
- Performance may vary on out-of-distribution column names.
- Validate model outputs before relying on them in production environments.