tsilva committed
Commit c6a7167 · verified · 1 Parent(s): d845123

Update model card with evaluation results and training config.

Files changed (1): README.md +81 -88
README.md CHANGED
@@ -1,95 +1,88 @@
  ---
  library_name: transformers
  license: apache-2.0
- base_model: distilbert/distilgpt2
  tags:
- - generated_from_trainer
  model-index:
- - name: clinical-field-mapper-classification
-   results: []
  ---

- <!-- This model card has been generated automatically according to the information the Trainer had access to. You
- should probably proofread and complete it, then remove this comment. -->
-
- # clinical-field-mapper-classification
-
- This model is a fine-tuned version of [distilbert/distilgpt2](https://huggingface.co/distilbert/distilgpt2) on an unknown dataset.
- It achieves the following results on the evaluation set:
- - Loss: 0.9970
-
- ## Model description
-
- More information needed
-
- ## Intended uses & limitations
-
- More information needed
-
- ## Training and evaluation data
-
- More information needed
-
- ## Training procedure
-
- ### Training hyperparameters
-
- The following hyperparameters were used during training:
- - learning_rate: 0.0005
- - train_batch_size: 1024
- - eval_batch_size: 1024
- - seed: 42
- - distributed_type: multi-GPU
- - optimizer: Use OptimizerNames.ADAMW_BNB with betas=(0.9,0.999) and epsilon=1e-08 and optimizer_args=No additional optimizer arguments
- - lr_scheduler_type: cosine
- - lr_scheduler_warmup_ratio: 0.01
- - num_epochs: 50
- - mixed_precision_training: Native AMP
- - label_smoothing_factor: 0.1
-
- ### Training results
-
- | Training Loss | Epoch | Step | Validation Loss |
- |:-------------:|:-----:|:----:|:---------------:|
- | 5.0235 | 1.0 | 79 | 1.6342 |
- | 1.3733 | 2.0 | 158 | 1.2243 |
- | 1.1531 | 3.0 | 237 | 1.1438 |
- | 1.0853 | 4.0 | 316 | 1.0994 |
- | 1.0467 | 5.0 | 395 | 1.0791 |
- | 1.0201 | 6.0 | 474 | 1.0542 |
- | 1.0019 | 7.0 | 553 | 1.0437 |
- | 0.9885 | 8.0 | 632 | 1.0336 |
- | 0.9777 | 9.0 | 711 | 1.0308 |
- | 0.9693 | 10.0 | 790 | 1.0271 |
- | 0.9626 | 11.0 | 869 | 1.0182 |
- | 0.9572 | 12.0 | 948 | 1.0197 |
- | 0.9523 | 13.0 | 1027 | 1.0101 |
- | 0.9481 | 14.0 | 1106 | 1.0090 |
- | 0.9448 | 15.0 | 1185 | 1.0020 |
- | 0.9417 | 16.0 | 1264 | 1.0049 |
- | 0.9386 | 17.0 | 1343 | 1.0043 |
- | 0.9364 | 18.0 | 1422 | 0.9989 |
- | 0.935 | 19.0 | 1501 | 0.9984 |
- | 0.9328 | 20.0 | 1580 | 0.9976 |
- | 0.9305 | 21.0 | 1659 | 0.9951 |
- | 0.9289 | 22.0 | 1738 | 0.9949 |
- | 0.9273 | 23.0 | 1817 | 0.9966 |
- | 0.9263 | 24.0 | 1896 | 0.9954 |
- | 0.9256 | 25.0 | 1975 | 0.9954 |
- | 0.9243 | 26.0 | 2054 | 0.9937 |
- | 0.923 | 27.0 | 2133 | 0.9928 |
- | 0.9222 | 28.0 | 2212 | 0.9938 |
- | 0.9219 | 29.0 | 2291 | 0.9916 |
- | 0.9209 | 30.0 | 2370 | 0.9920 |
- | 0.9196 | 31.0 | 2449 | 0.9918 |
- | 0.9207 | 32.0 | 2528 | 0.9923 |
- | 0.9212 | 33.0 | 2607 | 0.9923 |
- | 0.9195 | 34.0 | 2686 | 0.9970 |
-
- ### Framework versions
-
- - Transformers 4.51.3
- - Pytorch 2.6.0+cu124
- - Datasets 3.5.1
- - Tokenizers 0.21.1

  ---
  library_name: transformers
  license: apache-2.0
  tags:
+ - healthcare
+ - column-normalization
+ - text-classification
+ - distilgpt2
  model-index:
+ - name: tsilva/clinical-field-mapper-classification
+   results:
+   - task:
+       name: Field Classification
+       type: text-classification
+     dataset:
+       name: tsilva/clinical-field-mappings
+       type: healthcare
+     metrics:
+     - name: train Accuracy
+       type: accuracy
+       value: 0.9471
+     - name: validation Accuracy
+       type: accuracy
+       value: 0.9144
+     - name: test Accuracy
+       type: accuracy
+       value: 0.9156
  ---

+ # Model Card for tsilva/clinical-field-mapper-classification
+
+ This model is a fine-tuned version of [`distilbert/distilgpt2`](https://huggingface.co/distilbert/distilgpt2) on the [`tsilva/clinical-field-mappings`](https://huggingface.co/datasets/tsilva/clinical-field-mappings/tree/4d4cdba1b7e9b1eff2893c7014cfc08fe58a73bc) dataset.
+ Its purpose is to normalize healthcare database column names to a standardized set of target column names.
+
+ ## Task
+
+ A sequence classification model that maps free-text clinical field names to a fixed set of standardized schema terms.
+
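The full set of target schema terms ships with the checkpoint itself; a quick way to inspect it is to read the label maps from the config (a minimal sketch, assuming the exported config carries the standard `id2label`/`label2id` maps, which the usage snippet below also relies on):

```python
from transformers import AutoConfig

# Only the config is downloaded; no model weights are needed to inspect the label space.
config = AutoConfig.from_pretrained("tsilva/clinical-field-mapper-classification")

print(len(config.id2label))     # number of standardized target column names
print(sorted(config.label2id))  # the schema terms themselves
```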
+ ## Usage
+
+ ```python
+ import torch
+ from transformers import AutoTokenizer, AutoModelForSequenceClassification
+
+ tokenizer = AutoTokenizer.from_pretrained("tsilva/clinical-field-mapper-classification")
+ model = AutoModelForSequenceClassification.from_pretrained("tsilva/clinical-field-mapper-classification")
+ model.eval()
+
+ def predict(input_text):
+     inputs = tokenizer(input_text, return_tensors="pt")
+     with torch.no_grad():
+         outputs = model(**inputs)
+     pred = outputs.logits.argmax(-1).item()
+     # id2label keys are integers once the config has been loaded
+     label = model.config.id2label[pred] if hasattr(model.config, "id2label") else pred
+     print(f"Predicted label: {label}")
+     return label
+
+ predict("cardi@")
+ ```
+
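For quick experiments, the same checkpoint can also be driven through the `pipeline` API (a minimal sketch, assuming the uploaded config carries the `id2label` map so that readable labels come back; the second input string is purely illustrative):

```python
from transformers import pipeline

# Wraps tokenizer, model, and label mapping in a single call.
mapper = pipeline("text-classification", model="tsilva/clinical-field-mapper-classification")

# Each result is a dict with the top label and its softmax score.
print(mapper(["cardi@", "pat_dob"]))  # "pat_dob" is a hypothetical raw column name
```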
+ ## Evaluation Results
+
+ - **Train accuracy**: 94.71%
+ - **Validation accuracy**: 91.44%
+ - **Test accuracy**: 91.56%
+
+ ## Training Details
+
+ - **Seed**: 42
+ - **Epochs scheduled**: 50
+ - **Epochs completed**: 34
+ - **Early stopping triggered**: Yes
+ - **Final training loss**: 1.0888
+ - **Final evaluation loss**: 0.9916
+ - **Optimizer**: adamw_bnb_8bit
+ - **Learning rate**: 0.0005
+ - **Batch size**: 1024
+ - **Precision**: fp16
+ - **DeepSpeed enabled**: True
+ - **Gradient accumulation steps**: 1
+
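These settings map roughly onto the following `Trainer` setup (a sketch rather than the author's actual script: the model and dataset handles, output path, early-stopping patience, and DeepSpeed config path are illustrative placeholders, since the card does not publish them):

```python
from transformers import EarlyStoppingCallback, Trainer, TrainingArguments

args = TrainingArguments(
    output_dir="clinical-field-mapper-classification",  # illustrative path
    seed=42,
    num_train_epochs=50,                # 50 scheduled; early stopping ended the run at epoch 34
    per_device_train_batch_size=1024,
    per_device_eval_batch_size=1024,
    gradient_accumulation_steps=1,
    learning_rate=5e-4,
    lr_scheduler_type="cosine",
    warmup_ratio=0.01,
    optim="adamw_bnb_8bit",             # 8-bit AdamW from bitsandbytes
    fp16=True,
    label_smoothing_factor=0.1,
    eval_strategy="epoch",
    save_strategy="epoch",
    load_best_model_at_end=True,        # required for early stopping
    metric_for_best_model="eval_loss",
    # deepspeed="ds_config.json",       # the card notes DeepSpeed was enabled; the JSON is not published
)

trainer = Trainer(
    model=model,                        # placeholder: the distilgpt2-based classifier
    args=args,
    train_dataset=train_ds,             # placeholder dataset handles
    eval_dataset=val_ds,
    callbacks=[EarlyStoppingCallback(early_stopping_patience=3)],  # patience not stated in the card
)
trainer.train()
```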
+ ## License
+
+ Apache 2.0, as declared in the `license` field of the metadata above.
+
+ ## Limitations and Bias
+
+ - The model was trained on a single clinical field-mapping dataset, so its label space is fixed to that dataset's target schema.
+ - Performance may degrade on out-of-distribution column names, such as unfamiliar abbreviations or naming conventions.
+ - Validate the model's outputs before relying on them in production, particularly in clinical settings.