# 🧠 ClinicalBERT-MS-Autoimmune-Neuro

A fine-tuned version of `emilyalsentzer/Bio_ClinicalBERT` for detecting autoimmune neurological disease signals in clinical text notes.

- Maintainer: Vahid Mahmoudian
- Repository: `vhdm/clinicalbert-ms-autoimmune-neuro`
- Status: Research / Proof-of-Concept; not for standalone clinical use


πŸ” Model Summary

Property Value
Base model emilyalsentzer/Bio_ClinicalBERT
Task Binary text classification (autoimmune-neurological vs non-autoimmune)
Language English clinical notes
Domain Neurology / Autoimmune disorders
Dataset Internal β€œMS-autoimmune” corpus (split into train, valid, test)
Sequence length 512 tokens per chunk
Hardware NVIDIA H100
Mixed precision bf16
Trainer seed 42
Epochs 4
Learning rate 2 Γ— 10⁻⁡
Optimizer adamw_torch
Batch sizes train = 24, eval = 48
Warmup ratio 0.1
Best metric recall
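
The model scores 512-token chunks, so longer notes must be split before inference. The exact chunking code and stride are not published with the model; as a minimal sketch, here is one way to split a note into overlapping 512-token windows using the tokenizer's overflow support (the `chunk_note` name and the `stride=64` overlap are illustrative assumptions, not the training pipeline):

```python
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("vhdm/clinicalbert-ms-autoimmune-neuro")

def chunk_note(text, max_length=512, stride=64):
    # return_overflowing_tokens makes the tokenizer emit one window per
    # 512-token span (with `stride` tokens of overlap between windows)
    # instead of truncating the note to its first 512 tokens.
    enc = tok(
        text,
        max_length=max_length,
        stride=stride,
        truncation=True,
        return_overflowing_tokens=True,
    )
    return [tok.decode(ids, skip_special_tokens=True) for ids in enc["input_ids"]]
```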

βš™οΈ Training Log (chunk-level)

Below metrics are auto-generated by Hugging Face Trainer
using the raw validation set (before note aggregation, calibration, or threshold tuning).

| Step | Train Loss | Val Loss | Accuracy | Precision | Recall | F1 | AUC |
|---|---|---|---|---|---|---|---|
| 200 | 0.655 | 0.627 | 0.660 | 0.800 | 0.0377 | 0.0720 | 0.6259 |
| 800 | 0.331 | 0.373 | 0.840 | 0.882 | 0.628 | 0.733 | 0.8856 |
| 1600 | 0.275 | 0.360 | 0.851 | 0.854 | 0.691 | 0.764 | 0.9040 |
| 2400 | 0.291 | 0.333 | 0.858 | 0.878 | 0.689 | 0.772 | 0.9121 |
| 3400 | 0.220 | 0.358 | 0.861 | 0.851 | 0.731 | 0.786 | 0.9169 |
| 4600 | 0.169 | 0.432 | 0.856 | 0.828 | 0.744 | 0.784 | 0.9172 |

The Trainer's final metrics are likewise chunk-level: raw monitoring numbers logged during training, not the final evaluation used for deployment (see the note-level results below).


## 🧩 Note-Level Aggregated Evaluation (final tuned results)

Post-processing applied on top of the chunk classifier:

- Aggregation: logit_topk (k = 3)
- Calibration: temperature scaling (T = 1.372)
- Threshold: tuned on validation (Fβ with β = 1.5 → thr ≈ 1.000); the calibration fit and threshold search are sketched below
- Inference logic: per-note probability = sigmoid(mean logit of the top-k chunks)
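
Neither the calibration nor the threshold search ships with the checkpoint; the sketch below shows one plausible reproduction of both steps, assuming note-level validation logits `logits_val`, labels `y_val`, and calibrated probabilities `p_val` (all hypothetical names):

```python
import numpy as np
from scipy.optimize import minimize_scalar
from sklearn.metrics import fbeta_score

def fit_temperature(logits_val, y_val):
    # Find T > 0 minimizing the negative log-likelihood of
    # sigmoid(logit / T) against the binary validation labels.
    def nll(T):
        p = 1.0 / (1.0 + np.exp(-logits_val / T))
        p = np.clip(p, 1e-6, 1 - 1e-6)
        return -np.mean(y_val * np.log(p) + (1 - y_val) * np.log(1 - p))
    return minimize_scalar(nll, bounds=(0.05, 10.0), method="bounded").x

def tune_threshold(y_val, p_val, beta=1.5):
    # Sweep candidate thresholds over the calibrated validation scores and
    # keep the one maximizing F-beta (beta = 1.5 weights recall over precision).
    best_thr, best_f = 0.5, -1.0
    for thr in np.unique(p_val):
        f = fbeta_score(y_val, (p_val >= thr).astype(int), beta=beta)
        if f > best_f:
            best_f, best_thr = f, thr
    return best_thr
```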

### Validation (n = 493)

| Metric | Value |
|---|---|
| Precision | 0.9118 |
| Recall | 0.9394 |
| F1 | 0.9254 |
| Accuracy | 0.9493 |
| ROC-AUC | 0.9688 |

### Test (n = 493)

| Metric | Value |
|---|---|
| Precision | 0.9618 |
| Recall | 0.9207 |
| F1 | 0.9408 |
| Accuracy | 0.9615 |
| ROC-AUC | 0.9786 |

✅ Final configuration: aggregation = `logit_topk(k=3)`, temperature = 1.372, threshold ≈ 1.0
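
The reported metrics are standard binary-classification scores; assuming they were computed with sklearn (an assumption), here is a minimal sketch for recomputing them from calibrated note-level scores `p_test` and labels `y_test` (hypothetical names):

```python
from sklearn.metrics import (accuracy_score, f1_score, precision_score,
                             recall_score, roc_auc_score)

def note_level_report(y_test, p_test, thr=1.0):
    # Hard predictions use the tuned threshold; ROC-AUC uses the raw scores.
    y_pred = (p_test >= thr).astype(int)
    return {
        "precision": precision_score(y_test, y_pred),
        "recall": recall_score(y_test, y_pred),
        "f1": f1_score(y_test, y_pred),
        "accuracy": accuracy_score(y_test, y_pred),
        "roc_auc": roc_auc_score(y_test, p_test),
    }
```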


## 🧠 Inference Example

```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import numpy as np
import torch

repo = "vhdm/clinicalbert-ms-autoimmune-neuro"
tok = AutoTokenizer.from_pretrained(repo)
model = AutoModelForSequenceClassification.from_pretrained(repo)
model.eval()

# Each entry is one chunk of the same note; chunks are scored together.
texts = [
    "Patient reports numbness in lower limbs...",
    "MRI shows demyelination consistent with MS.",
]
inputs = tok(texts, padding=True, truncation=True, return_tensors="pt")

with torch.no_grad():
    logits = model(**inputs).logits
probs = torch.softmax(logits, dim=-1).numpy()[:, 1]  # per-chunk P(autoimmune)

# logit_topk aggregation (k = 3): average the logits of the top-k chunks.
probs = np.clip(probs, 1e-6, 1 - 1e-6)
chunk_logits = np.log(probs) - np.log(1 - probs)
k = 3
top = np.sort(chunk_logits)[-min(k, len(chunk_logits)):]
mean_logit = top.mean()

# Temperature scaling (T = 1.372) calibrates the aggregated note score.
T = 1.372
note_score_cal = 1.0 / (1.0 + np.exp(-mean_logit / T))

thr = 1.0  # threshold tuned on validation (see above)
pred = int(note_score_cal >= thr)
print({"score": float(note_score_cal), "prediction": pred})
```

## 🧪 Reproducibility

```python
from transformers import TrainingArguments

TrainingArguments(
    output_dir="./runs/clinicalbert_ms",
    learning_rate=2e-5,
    per_device_train_batch_size=24,
    per_device_eval_batch_size=48,
    num_train_epochs=5,
    weight_decay=0.01,
    bf16=True,
    optim="adamw_torch",
    warmup_ratio=0.1,
    seed=42,
    evaluation_strategy="steps",
    save_strategy="steps",
    logging_steps=50,
    eval_steps=200,
    save_steps=200,
    save_total_limit=3,
    load_best_model_at_end=True,
    metric_for_best_model="recall",
    greater_is_better=True,
)
```
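
With `metric_for_best_model="recall"`, the Trainer needs a `compute_metrics` callback that returns a `recall` key; the exact callback used is not published, but a minimal sketch that produces the metrics seen in the training log could look like this:

```python
import numpy as np
from sklearn.metrics import (accuracy_score, f1_score, precision_score,
                             recall_score, roc_auc_score)

def compute_metrics(eval_pred):
    logits, labels = eval_pred
    preds = np.argmax(logits, axis=-1)
    # Positive-class probability for AUC via a numerically stable softmax.
    exp = np.exp(logits - logits.max(axis=-1, keepdims=True))
    probs = exp / exp.sum(axis=-1, keepdims=True)
    return {
        "accuracy": accuracy_score(labels, preds),
        "precision": precision_score(labels, preds),
        "recall": recall_score(labels, preds),  # drives best-checkpoint selection
        "f1": f1_score(labels, preds),
        "auc": roc_auc_score(labels, probs[:, 1]),
    }
```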