---
license: mit
---

# Instruction Tuning Large Language Models to Understand Electronic Health Records

**Authors:** Zhenbang Wu, Anant Dadu, Michael Nalls, Faraz Faghri, Jimeng Sun

**Published at:** NeurIPS 2024 Datasets and Benchmarks Track (Spotlight)

[[📄 Paper](https://openreview.net/pdf?id=Dgy5WVgPd2)] [[🔗 Code](https://github.com/zzachw/Llemr)]

This repository contains the model weights for Llemr, a large language model (LLM) capable of processing and interpreting electronic health records (EHR) with complex data structures.

## Model Description

Llemr is trained on MIMIC-Instr, a dataset comprising 350K schema-alignment examples and 100K clinical-reasoning examples generated from the MIMIC-IV EHR database. The model generates relevant, context-aware responses to patient-related queries by leveraging:

- BiomedBERT as the event encoder.
- Vicuna as the backbone language model.

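Conceptually, the encoder and the backbone are bridged by projecting each encoded event into the LLM's embedding space so the events can be attended to like ordinary tokens. The sketch below is only an illustration: the 1027/4096 dimensions and the single linear projector are assumptions in the style of LLaVA-like designs, not details stated in this card.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical dimensions: 1027-dim event embeddings and 4096-dim
# Vicuna-7B hidden states. The linear projector is an assumption,
# not a detail documented in this README.
EVENT_DIM, LLM_DIM = 1027, 4096
W = rng.normal(scale=0.02, size=(EVENT_DIM, LLM_DIM))  # illustrative projector weights

def project_events(event_embs: np.ndarray) -> np.ndarray:
    """Map encoder-side event embeddings into the LLM's embedding space."""
    return event_embs @ W

# 10 encoded events become 10 "soft tokens" the LLM can attend to.
soft_tokens = project_events(rng.normal(size=(10, EVENT_DIM)))
assert soft_tokens.shape == (10, LLM_DIM)
```

The actual fusion mechanism is defined in the [code repository](https://github.com/zzachw/Llemr).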
## How to Load Weights

Follow the steps below to load the pre-trained weights:

1. Clone the repository:

   ```bash
   git clone https://huggingface.co/zzachw/llemr-v1
   cd llemr-v1
   ```

2. Load the weights in Python:

   ```python
   from peft import PeftModel

   from src.model.init_llemr import init_llemr

   # Define paths for the base model and the LoRA weights
   llm_pretrained_model_name_or_path = "lmsys/vicuna-7b-v1.5"
   lora_name_or_path = "zzachw12/llemr-v1"

   # Initialize the base model and tokenizer
   model, tokenizer = init_llemr(llm_pretrained_model_name_or_path, hidden_size=1027)

   # Integrate the LoRA weights into the model
   model = PeftModel.from_pretrained(model, lora_name_or_path)
   ```

**Note:** This model requires pre-computed event embeddings generated by BiomedBERT. Refer to the [GitHub repository](https://github.com/zzachw/Llemr) for detailed instructions on data preprocessing and event-embedding preparation.

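Those pre-computed event embeddings come from encoding each event's text with BiomedBERT and pooling the token vectors. The snippet below sketches the pooling step with dummy arrays standing in for BiomedBERT outputs; mean pooling and the 1024 hidden size are assumptions here, and the authoritative preprocessing script lives in the GitHub repository.

```python
import numpy as np

def mean_pool(token_embs: np.ndarray, attention_mask: np.ndarray) -> np.ndarray:
    """Average token embeddings, ignoring padding positions."""
    mask = attention_mask[:, None].astype(float)
    return (token_embs * mask).sum(axis=0) / mask.sum()

# Dummy stand-ins for BiomedBERT-large outputs: 4 tokens with hidden
# size 1024, where the last token is padding.
token_embs = np.ones((4, 1024))
attention_mask = np.array([1, 1, 1, 0])
event_emb = mean_pool(token_embs, attention_mask)
assert event_emb.shape == (1024,)
```
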
## Notes on Model Enhancements

Llemr incorporates several minor improvements over the original implementation described in the paper:

1. **Enhanced event encoder:** Replaced ClinicalBERT (`emilyalsentzer/Bio_ClinicalBERT`) with BiomedBERT-large (`microsoft/BiomedNLP-BiomedBERT-large-uncased-abstract`), improving the quality of the event embeddings.
2. **Improved event embeddings:** Concatenated event timestamps and numeric values (where available) to the final event embeddings, giving a better representation of time-sensitive and quantitative data.
3. **Expanded dataset:** Doubled the clinical-reasoning subset from 50K to 100K examples for more comprehensive coverage.
4. **Unified training:** Adopted a single-step training process that uses the schema-alignment and clinical-reasoning subsets together, streamlining the training pipeline.

Together, these changes improve the model's ability to interpret and reason over EHR data relative to the version described in the paper.

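The timestamp/value concatenation described above can be pictured as appending two scalar features to each text embedding. This is a minimal illustration only: the exact feature layout, scaling, and missing-value handling (and how they relate to the `hidden_size=1027` passed to `init_llemr` in the loading example) are defined in the Llemr codebase, not here.

```python
import numpy as np

def build_event_embedding(text_emb, timestamp, value=None):
    """Append an event's timestamp and numeric value (NaN when absent)
    to its text embedding. Illustrative layout; see the Llemr repo for
    the actual feature construction."""
    extras = np.array([timestamp, value if value is not None else np.nan])
    return np.concatenate([text_emb, extras])

# A 1024-dim BiomedBERT-large embedding grows by the two scalar features.
final_emb = build_event_embedding(np.zeros(1024), timestamp=3.5, value=7.2)
assert final_emb.shape == (1026,)
```
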
## Citation

If you utilize this work in your research or projects, please consider citing us:

```bibtex
@inproceedings{wu2024instruction,
  title={Instruction Tuning Large Language Models to Understand Electronic Health Records},
  author={Zhenbang Wu and Anant Dadu and Michael Nalls and Faraz Faghri and Jimeng Sun},
  booktitle={The Thirty-eighth Conference on Neural Information Processing Systems Datasets and Benchmarks Track},
  year={2024},
  url={https://openreview.net/forum?id=Dgy5WVgPd2}
}
```