bradmontierth committed
Commit 8cd7f61 · 0 Parent(s)

Initial commit
.gitattributes ADDED
@@ -0,0 +1,35 @@
+ *.7z filter=lfs diff=lfs merge=lfs -text
+ *.arrow filter=lfs diff=lfs merge=lfs -text
+ *.bin filter=lfs diff=lfs merge=lfs -text
+ *.bz2 filter=lfs diff=lfs merge=lfs -text
+ *.ckpt filter=lfs diff=lfs merge=lfs -text
+ *.ftz filter=lfs diff=lfs merge=lfs -text
+ *.gz filter=lfs diff=lfs merge=lfs -text
+ *.h5 filter=lfs diff=lfs merge=lfs -text
+ *.joblib filter=lfs diff=lfs merge=lfs -text
+ *.lfs.* filter=lfs diff=lfs merge=lfs -text
+ *.mlmodel filter=lfs diff=lfs merge=lfs -text
+ *.model filter=lfs diff=lfs merge=lfs -text
+ *.msgpack filter=lfs diff=lfs merge=lfs -text
+ *.npy filter=lfs diff=lfs merge=lfs -text
+ *.npz filter=lfs diff=lfs merge=lfs -text
+ *.onnx filter=lfs diff=lfs merge=lfs -text
+ *.ot filter=lfs diff=lfs merge=lfs -text
+ *.parquet filter=lfs diff=lfs merge=lfs -text
+ *.pb filter=lfs diff=lfs merge=lfs -text
+ *.pickle filter=lfs diff=lfs merge=lfs -text
+ *.pkl filter=lfs diff=lfs merge=lfs -text
+ *.pt filter=lfs diff=lfs merge=lfs -text
+ *.pth filter=lfs diff=lfs merge=lfs -text
+ *.rar filter=lfs diff=lfs merge=lfs -text
+ *.safetensors filter=lfs diff=lfs merge=lfs -text
+ saved_model/**/* filter=lfs diff=lfs merge=lfs -text
+ *.tar.* filter=lfs diff=lfs merge=lfs -text
+ *.tar filter=lfs diff=lfs merge=lfs -text
+ *.tflite filter=lfs diff=lfs merge=lfs -text
+ *.tgz filter=lfs diff=lfs merge=lfs -text
+ *.wasm filter=lfs diff=lfs merge=lfs -text
+ *.xz filter=lfs diff=lfs merge=lfs -text
+ *.zip filter=lfs diff=lfs merge=lfs -text
+ *.zst filter=lfs diff=lfs merge=lfs -text
+ *tfevents* filter=lfs diff=lfs merge=lfs -text
README.md ADDED
@@ -0,0 +1,205 @@
+ ---
+ license: mit
+ tags:
+ - healthcare
+ - medicare
+ - xgboost
+ - logistic-regression
+ - length-of-stay
+ - readmission
+ - discharge-prediction
+ - classification
+ - regression
+ datasets:
+ - cms-lds
+
+ ---
+
+ # Medicare Inpatient Outcome Prediction Models
+
+ **Model ID:** Medicare LDS 2023 Inpatient Model Bundle
+ **Model Types:** XGBoost Regressor, Calibrated Binary Logistic Regression & Calibrated Multinomial Logistic Regression
+ **Dataset:** 2023 CMS Limited Data Set (LDS)
+ **Target Level:** Inpatient Encounter
+
+ ---
+
+ ## What the Model Predicts
+
+ This bundle contains three distinct models that predict key outcomes for a given inpatient hospital stay:
+
+ - **Length of Stay (Regression):** Predicts the total number of days for the inpatient stay.
+ - **Readmission Probability (Binary Classification):** Predicts the probability (from 0.0 to 1.0) that the patient will be readmitted to a hospital within 30 days of discharge.
+ - **Discharge Location Probability (Multiclass Classification):** Predicts the probability of each possible discharge location (e.g., Home, Skilled Nursing Facility, Hospice).
+
+ ---
+
+ ## Intended Use
+
+ This model bundle is designed to support a variety of clinical and operational workflows:
+
+ - **Discharge Planning & Care Management:** Identify high-risk patients who may need additional support or a specific type of post-acute care to prevent readmission.
+ - **Resource Planning:** Forecast bed-days and resource utilization based on predicted length of stay.
+ - **Actuarial Analysis:** Inform risk stratification and cost estimation models.
+ - **Benchmarking:** Compare observed outcomes against predicted risks for a given patient population.
+ - **Healthcare Research:** Analyze drivers of inpatient outcomes.
+
+ > **Note on Prediction Type:** The models are trained for **concurrent prediction** — they use clinical and demographic data available during the inpatient stay to predict outcomes related to that same stay.
+
+ ---
+
+ ## Model Performance
+
+ > These metrics reflect performance on a **20% test set** held out from the 2023 CMS LDS data. All values were calculated on unseen data and represent model generalization performance.
+
+ ### Model 1: Length of Stay (XGBoost Regressor)
+
+ | Target | R² | MAE (days) |
+ |--------------------|------|------------|
+ | `length_of_stay` | 0.25 | 2.72 |
+
+ ### Model 2: Readmission Probability (Calibrated Logistic Regression)
+
+ | Target | AUC ROC | Brier Score |
+ |--------------------------|---------|-------------|
+ | `readmission_probability`| 0.7483 | 0.1176 |
+
+ - **AUC ROC:** Measures the model's ability to distinguish between patients who will and will not be readmitted (higher is better).
+ - **Brier Score:** Measures the accuracy of the predicted probabilities (lower is better).
+
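+ For reference, both metrics can be reproduced with scikit-learn. A minimal sketch on toy data (the arrays below are illustrative, not model output):
+
+ ```python
+ import numpy as np
+ from sklearn.metrics import roc_auc_score, brier_score_loss
+
+ # Illustrative labels: 1 = readmitted within 30 days, 0 = not readmitted.
+ y_true = np.array([0, 0, 1, 0, 1, 1, 0, 0])
+ # Calibrated predicted probabilities, as produced by the readmission model.
+ y_pred_proba = np.array([0.05, 0.50, 0.65, 0.10, 0.45, 0.80, 0.30, 0.15])
+
+ print(f"AUC ROC: {roc_auc_score(y_true, y_pred_proba):.4f}")         # discrimination
+ print(f"Brier score: {brier_score_loss(y_true, y_pred_proba):.4f}")  # calibration
+ ```
+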
+ ### Model 3: Discharge Location (Calibrated Logistic Regression)
+
+ | Target | Accuracy | Brier Score (Macro Avg) |
+ |----------------------|----------|--------------------------|
+ | `discharge_location` | 0.5216 | 0.0771 |
+
+ - **Accuracy:** The overall percentage of encounters for which the model predicted the correct discharge location.
+ - **Brier Score (Macro Avg):** The average Brier Score across all possible discharge location classes (lower is better).
+
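+ The macro-averaged Brier score treats each discharge class as its own one-vs-rest problem and averages the per-class scores, mirroring the per-class evaluation in the training notebook. A minimal sketch (toy probabilities, not model output):
+
+ ```python
+ import numpy as np
+ from sklearn.metrics import brier_score_loss
+
+ classes = ["home", "snf", "hospice"]   # illustrative subset of discharge classes
+ y_true = np.array([0, 0, 1, 2, 0, 1])  # encoded true discharge locations
+ y_pred_proba = np.array([               # one row per encounter; rows sum to 1.0
+     [0.7, 0.2, 0.1],
+     [0.6, 0.3, 0.1],
+     [0.2, 0.7, 0.1],
+     [0.3, 0.2, 0.5],
+     [0.8, 0.1, 0.1],
+     [0.4, 0.5, 0.1],
+ ])
+
+ # One-vs-rest Brier score per class, then the macro average.
+ per_class = [
+     brier_score_loss((y_true == i).astype(int), y_pred_proba[:, i])
+     for i in range(len(classes))
+ ]
+ print(dict(zip(classes, np.round(per_class, 4))))
+ print(f"Macro-avg Brier: {np.mean(per_class):.4f}")
+ ```
+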
+ ---
+
+ ## Files Included
+
+ - `inpatient_models_bundle_medicare_lds_2023_fs.pkl.gz` — A gzip-compressed pickle file containing the trained models, feature lists, and encoders. The filename may vary based on training parameters (e.g., `_fs` indicates feature selection was used). The bundle includes (see the loading sketch after this list):
+   - `los_model` (XGBoost)
+   - `readmission_model` (Calibrated Logistic Regression)
+   - `discharge_model` (Calibrated Logistic Regression)
+   - `feature_columns_*` (specific feature lists for each model)
+   - `le_discharge` (label encoder for discharge location)
+ - `Train Tuva Concurrent Inpatient Models.ipynb` — The notebook used for training, feature selection, and evaluation on Snowflake.
+ - `predict inpatient.ipynb` — An example prediction notebook for running the bundle on new data.
+ - `feature_fill_rate_inpatient.csv` — A diagnostic file detailing the prevalence of each feature in the training dataset.
+ - `inpatient_feature_importance.csv` — A file containing the calculated importance of each feature for each of the three models.
+
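+ A minimal sketch of loading the bundle and scoring new encounters (`df` is a placeholder for your one-hot-encoded feature frame; the `reindex` pattern mirrors the example prediction notebook):
+
+ ```python
+ import gzip
+ import pickle
+
+ import pandas as pd
+
+ # Load the gzip-compressed bundle (assumed to be in the working directory).
+ with gzip.open("inpatient_models_bundle_medicare_lds_2023_fs.pkl.gz", "rb") as f:
+     bundle = pickle.load(f)
+
+ df = pd.DataFrame()  # placeholder: your one-hot-encoded inpatient encounters
+
+ # Align input columns to each model's feature list; missing features are zero-filled.
+ los_pred = bundle["los_model"].predict(
+     df.reindex(columns=bundle["feature_columns_los"], fill_value=0))
+ readmit_proba = bundle["readmission_model"].predict_proba(
+     df.reindex(columns=bundle["feature_columns_readmission"], fill_value=0))[:, 1]
+ discharge_proba = bundle["discharge_model"].predict_proba(
+     df.reindex(columns=bundle["feature_columns_discharge"], fill_value=0))
+ discharge_classes = bundle["le_discharge"].classes_  # column order of discharge_proba
+ ```
+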
+ ---
+
+ ## Understanding Model Artifacts
+
+ This repository includes two key CSV files that provide insight into the model's training data and internal logic. These are generated by the training notebook, which also populates corresponding tables in Snowflake for easier querying (`FEATURE_FREQUENCY_STATS_INPATIENT` and `MODEL_FEATURE_IMPORTANCE_INPATIENT`).
+
+ ### Feature Fill Rates (`feature_fill_rate_inpatient.csv`)
+
+ This file is a diagnostic tool for understanding the input data used to train the models. It helps you check for data drift or data quality issues.
+
+ | Column | Description |
+ |---|---|
+ | `FEATURE_NAME` | The name of the input feature (e.g., `age_at_admit`, `cond_hypertension`). |
+ | `POSITIVE_COUNT` | The number of records in the training set where this feature was present (non-zero). |
+ | `TOTAL_ROWS` | The total number of records in the training set. |
+ | `POSITIVE_RATE_PERCENT` | The prevalence or "fill rate" of the feature, as a percentage (`POSITIVE_COUNT` / `TOTAL_ROWS` × 100). |
+
+ **How to Use:** Compare the `POSITIVE_RATE_PERCENT` from this file with the rates from your own prediction input data, as sketched below. Significant discrepancies can point to data pipeline issues and may explain poor model performance.
+
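+ A minimal sketch of that comparison (`df_input` is a placeholder for your one-hot-encoded prediction inputs; the 5-point threshold is illustrative):
+
+ ```python
+ import pandas as pd
+
+ train = pd.read_csv("feature_fill_rate_inpatient.csv")
+ train["TRAINING_FILL_RATE"] = train["POSITIVE_RATE_PERCENT"] / 100.0
+
+ df_input = pd.DataFrame()  # placeholder: your one-hot-encoded prediction inputs
+ # NOTE: align feature-name casing with the CSV first; the example notebook upper-cases both.
+ input_rates = df_input.mean(numeric_only=True).rename("INPUT_FILL_RATE")
+
+ comparison = train.set_index("FEATURE_NAME").join(input_rates, how="outer")
+ comparison["DIFFERENCE"] = comparison["INPUT_FILL_RATE"] - comparison["TRAINING_FILL_RATE"]
+
+ # Flag features whose prevalence shifted more than 5 percentage points.
+ print(comparison[comparison["DIFFERENCE"].abs() > 0.05].sort_values("DIFFERENCE"))
+ ```
+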
+ ### Feature Importances (`inpatient_feature_importance.csv`)
+
+ This file provides model explainability by showing which features are most influential for each of the three models.
+
+ | Column | Description |
+ |---|---|
+ | `MODEL_NAME` | Identifies the model (e.g., `Inpatient_LOS_FeatureSelected`). |
+ | `FEATURE_NAME` | The name of the input feature. |
+ | `IMPORTANCE_VALUE` | A numeric score indicating the feature's influence. Higher is more important. |
+ | `IMPORTANCE_RANK` | The rank of the feature's importance for that specific model (1 is most important). |
+
+ **How to Use:** Use this file to understand the key drivers behind the model's predictions. For example, you can filter by `MODEL_NAME` for the readmission model and sort by `IMPORTANCE_RANK` to see what most influences readmission risk. This is useful for clinical validation and debugging.
+
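+ For example, a quick look at the top drivers of readmission risk (the `MODEL_NAME` filter is illustrative; check the CSV for the exact names logged in your run):
+
+ ```python
+ import pandas as pd
+
+ imp = pd.read_csv("inpatient_feature_importance.csv")
+
+ top10 = (
+     imp[imp["MODEL_NAME"].str.contains("Readmission", case=False)]
+     .sort_values("IMPORTANCE_RANK")
+     .head(10)
+ )
+ print(top10[["FEATURE_NAME", "IMPORTANCE_VALUE", "IMPORTANCE_RANK"]])
+ ```
+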
+ ---
+
+ ## Quick Start: End-to-End Workflow
+
+ This section provides high-level instructions for running a model with the Tuva Project. The workflow involves preparing benchmark data using dbt, running a Python prediction script, and optionally ingesting the results back into dbt for analysis.
+
+ ### 1. Configure Your dbt Project
+
+ You need to enable the correct variables in your `dbt_project.yml` file to control the workflow.
+
+ #### A. Enable Benchmark Marts
+
+ These two variables control which parts of the Tuva Project are active. Both are `false` by default.
+
+ ```yaml
+ # in dbt_project.yml
+ vars:
+   benchmarks_train: true
+   benchmarks_already_created: true
+ ```
+
+ - `benchmarks_train`: Set to `true` to build the datasets that the ML models will use for making predictions.
+ - `benchmarks_already_created`: Set to `true` to ingest model predictions back into the project as a new dbt source.
+
+ #### B. (Optional) Set Prediction Source Locations
+
+ If you plan to bring predictions back into dbt for analysis, you must define where dbt can find the prediction data.
+
+ ```yaml
+ # in dbt_project.yml
+ vars:
+   predictions_person_year: "{{ source('benchmark_output', 'person_year') }}"
+   predictions_inpatient: "{{ source('benchmark_output', 'inpatient') }}"
+ ```
+
+ #### C. Configure `sources.yml`
+
+ Ensure your `sources.yml` file includes a definition for the source you referenced above (e.g., `benchmark_output`) that points to the database and schema where your model's prediction outputs are stored, as in the sketch below.
+
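+ A minimal `sources.yml` sketch for the `benchmark_output` source referenced above (the database and schema names are placeholders to adapt to your warehouse):
+
+ ```yaml
+ # in sources.yml
+ version: 2
+
+ sources:
+   - name: benchmark_output
+     database: your_database  # placeholder: where the prediction outputs land
+     schema: your_schema      # placeholder
+     tables:
+       - name: person_year
+       - name: inpatient
+ ```
+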
+ ---
+
+ ### 2. The 3-Step Run Process
+
+ This workflow can be managed by any orchestration tool (e.g., Airflow, Prefect, Fabric Notebooks) or run manually from the command line.
+
+ #### Step 1: Generate the Training & Benchmarking Data
+
+ Run the Tuva Project with `benchmarks_train` enabled. This creates the input data required by the ML model.
+
+ ```bash
+ dbt build --vars '{benchmarks_train: true}'
+ ```
+
+ To run only the benchmark mart:
+
+ ```bash
+ dbt build --select tag:benchmarks_train --vars '{benchmarks_train: true}'
+ ```
+
+ #### Step 2: Run the Prediction Python Code
+
+ Execute the prediction notebook (`predict inpatient.ipynb`) to generate predictions. It reads the data created in Step 1 and writes the prediction outputs to a persistent location (e.g., a table in your data warehouse).
+
+ *Each model's repository includes the example Snowflake Notebook code that was used in Tuva's environment.*
+
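+ If you need to run the notebook headlessly outside Snowflake, one option is papermill (an assumption about your tooling, not a project requirement):
+
+ ```bash
+ pip install papermill
+ papermill "predict inpatient.ipynb" "predict inpatient.output.ipynb"
+ ```
+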
+ #### Step 3: (Optional) Bring Predictions Back into the Tuva Project
+
+ To bring the predictions back into the Tuva Project for analysis, run dbt again with `benchmarks_already_created` enabled. This populates the analytics marts.
+
+ ```bash
+ dbt build --vars '{benchmarks_already_created: true, benchmarks_train: false}'
+ ```
+
+ To run only the analysis models:
+
+ ```bash
+ dbt build --select tag:benchmarks_analysis --vars '{benchmarks_already_created: true, benchmarks_train: false}'
+ ```
+
+ ---
Train Tuva Concurrent Inpatient Models.ipynb ADDED
@@ -0,0 +1,31 @@
+ {
+ "metadata": {
+ "kernelspec": {
+ "display_name": "Streamlit Notebook",
+ "name": "streamlit"
+ },
+ "lastEditStatus": {
+ "notebookId": "6rovstl42ft2p5id6gwo",
+ "authorId": "374530764978",
+ "authorName": "BRAD",
+ "authorEmail": "[email protected]",
+ "sessionId": "65561efa-4d18-4072-8f4d-10240cb902ba",
+ "lastEditTime": 1750870004305
+ }
+ },
+ "nbformat_minor": 5,
+ "nbformat": 4,
+ "cells": [
+ {
+ "cell_type": "code",
+ "id": "3775908f-ca36-4846-8f38-5adca39217f2",
+ "metadata": {
+ "language": "python",
+ "name": "cell1"
+ },
+ "source": "0]}\")\n \n base_readmit_model = LogisticRegression(class_weight='balanced', random_state=42, max_iter=1000, solver='liblinear')\n \n # Determine the feature set to use.\n if FAST_MODE:\n print(\"\\n[FAST MODE] Skipping feature selection. Using all available features.\")\n best_readmission_features = X_train_base.columns.tolist()\n else:\n best_readmission_features = find_best_feature_subset(\n model=base_readmit_model, X_train=X_train_base, y_train=y_train_base, X_val=X_calib_read, y_val=y_calib_read,\n scoring_func=roc_auc_score, higher_is_better=True, model_name=\"Readmission (Logistic Regression)\"\n )\n\n print(f\"\\nTraining final Readmission model pipeline using {len(best_readmission_features)} features...\")\n base_model_for_calib = clone(base_readmit_model)\n base_model_for_calib.fit(X_train_base[best_readmission_features], y_train_base)\n \n # Log feature importances from the base (uncalibrated) model.\n uncal_read_model_name = f\"Inpatient_Readmission_Base_Uncalibrated_{MODEL_NAME_SUFFIX}{EXCLUSION_SUFFIX}\"\n log_feature_importances_to_snowflake(session, base_model_for_calib, best_readmission_features, MODEL_RUN_ID, uncal_read_model_name, TARGET_READMISSION, FEATURE_IMPORTANCE_TABLE_NAME)\n \n # Evaluate and log metrics for the uncalibrated model for comparison.\n y_pred_proba_uncal = base_model_for_calib.predict_proba(X_test_read[best_readmission_features])[:, 1]\n uncalibrated_metrics = calculate_binary_classification_proba_metrics(y_test_read, y_pred_proba_uncal)\n log_model_metrics_to_snowflake(session, MODEL_RUN_ID, uncal_read_model_name, TARGET_READMISSION + \"_Probability\", uncalibrated_metrics, \"Binary_Uncalibrated\", METRICS_TABLE_NAME, MODEL_SOURCE_TAG, MODEL_YEAR_TAG)\n \n # Calibrate the model on the held-out calibration set.\n calibrated_readmission_model = CalibratedClassifierCV(base_model_for_calib, method='isotonic', cv='prefit')\n calibrated_readmission_model.fit(X_calib_read[best_readmission_features], y_calib_read)\n y_pred_proba_cal = calibrated_readmission_model.predict_proba(X_test_read[best_readmission_features])[:, 1]\n\n print(\"\\nCalibrated Readmission Model - Test Set Evaluation:\")\n calibrated_proba_metrics = calculate_binary_classification_proba_metrics(y_test_read, y_pred_proba_cal)\n for k, v in calibrated_proba_metrics.items(): print(f\" {k}: {v:.4f}\")\n \n cal_read_model_name = f\"Inpatient_Readmission_Calibrated_{MODEL_NAME_SUFFIX}{EXCLUSION_SUFFIX}\"\n log_model_metrics_to_snowflake(session, MODEL_RUN_ID, cal_read_model_name, TARGET_READMISSION + \"_Probability\", calibrated_proba_metrics, \"Binary_Calibrated\", METRICS_TABLE_NAME, MODEL_SOURCE_TAG, MODEL_YEAR_TAG)\n\n# --- 4.3 Model 3: Predicting Discharge Location (Multiclass Classification) ---\nprint(\"\\n\" + \"=\"*80)\nprint(\"--- Training Model 3: Calibrated Discharge Location ---\")\nprint(\"=\"*80)\nTARGET_DISCHARGE = 'discharge_location'\ncalibrated_discharge_model, le_discharge, best_discharge_features = None, None, None\n\nif TARGET_DISCHARGE not in df_pd.columns:\n print(f\"Error: Target column '{TARGET_DISCHARGE}' not found. Skipping Discharge Location model.\")\nelse:\n le_discharge = LabelEncoder()\n y_discharge_encoded = le_discharge.fit_transform(df_pd[TARGET_DISCHARGE])\n num_classes_discharge = len(le_discharge.classes_)\n print(f\"Discharge Location: {num_classes_discharge} classes found: {le_discharge.classes_}\")\n \n # Split data: 60% base train, 20% calibration, 20% test\n stratify_discharge = y_discharge_encoded if num_classes_discharge > 1 else None\n X_train_full_disc, X_test_disc, y_train_full_disc_enc, y_test_disc_enc = train_test_split(X, y_discharge_encoded, test_size=0.2, random_state=42, stratify=stratify_discharge)\n X_train_base_disc, X_calib_disc, y_train_base_disc_enc, y_calib_disc_enc = train_test_split(X_train_full_disc, y_train_full_disc_enc, test_size=0.25, random_state=42, stratify=y_train_full_disc_enc if num_classes_discharge > 1 else None)\n print(f\"Data split for discharge: Base train: {X_train_base_disc.shape[0]}, Calibration: {X_calib_disc.shape[0]}, Test: {X_test_disc.shape[0]}\")\n \n base_discharge_model = LogisticRegression(random_state=42, max_iter=1000, solver='lbfgs', multi_class='multinomial', class_weight='balanced')\n \n # Determine the feature set to use.\n if FAST_MODE:\n print(\"\\n[FAST MODE] Skipping feature selection. Using all available features.\")\n best_discharge_features = X_train_base_disc.columns.tolist()\n else:\n best_discharge_features = find_best_feature_subset(\n model=base_discharge_model, X_train=X_train_base_disc, y_train=y_train_base_disc_enc, X_val=X_calib_disc, y_val=y_calib_disc_enc,\n scoring_func=log_loss, higher_is_better=False, model_name=\"Discharge Location (Multinomial Regression)\"\n )\n\n print(f\"\\nTraining final Discharge Location model pipeline using {len(best_discharge_features)} features...\")\n base_model_for_calib_disc = clone(base_discharge_model)\n base_model_for_calib_disc.fit(X_train_base_disc[best_discharge_features], y_train_base_disc_enc)\n \n discharge_model_name = f\"Inpatient_Discharge_Cal_Overall_{MODEL_NAME_SUFFIX}{EXCLUSION_SUFFIX}\"\n log_feature_importances_to_snowflake(session, base_model_for_calib_disc, best_discharge_features, MODEL_RUN_ID, discharge_model_name, TARGET_DISCHARGE, FEATURE_IMPORTANCE_TABLE_NAME)\n \n # Calibrate the model. 'sigmoid' is used for one-vs-rest calibration, suitable for multiclass.\n calibrated_discharge_model = CalibratedClassifierCV(base_model_for_calib_disc, method='sigmoid', cv='prefit')\n calibrated_discharge_model.fit(X_calib_disc[best_discharge_features], y_calib_disc_enc)\n y_pred_proba_discharge_calibrated = calibrated_discharge_model.predict_proba(X_test_disc[best_discharge_features])\n y_pred_labels_discharge_calibrated = calibrated_discharge_model.predict(X_test_disc[best_discharge_features])\n \n print(\"\\nCalibrated Discharge Model - Test Set Evaluation:\")\n calibrated_disc_metrics = calculate_multiclass_classification_metrics(y_test_disc_enc, y_pred_labels_discharge_calibrated, y_pred_proba_discharge_calibrated, le_discharge.classes_)\n \n # Log the overall multiclass metrics.\n overall_cal_metrics_to_log = {k: v for k, v in calibrated_disc_metrics.items() if k != 'per_class_details'}\n overall_cal_metrics_to_log['BRIER_SCORE'] = calibrated_disc_metrics.get('BRIER_SCORE_MACRO_AVG')\n log_model_metrics_to_snowflake(session, MODEL_RUN_ID, discharge_model_name, TARGET_DISCHARGE, overall_cal_metrics_to_log, \"Multiclass_Cal_Overall\", METRICS_TABLE_NAME, MODEL_SOURCE_TAG, MODEL_YEAR_TAG)\n \n # --- FIX: Log the per-class metrics by mapping keys correctly ---\n discharge_class_model_name = f\"Inpatient_Discharge_Cal_Class_{MODEL_NAME_SUFFIX}{EXCLUSION_SUFFIX}\"\n for class_detail in calibrated_disc_metrics.get('per_class_details', []):\n # Create a new dict with keys the logging function expects.\n per_class_metrics_to_log = {\n 'BRIER_SCORE': class_detail.get('brier_score'),\n 'AVG_Y_PRED': class_detail.get('avg_pred_proba'),\n 'AVG_Y_TRUE': class_detail.get('true_proportion'),\n 'PRED_RATIO': class_detail.get('proba_ratio'),\n }\n log_model_metrics_to_snowflake(\n session, MODEL_RUN_ID, discharge_class_model_name,\n f\"{TARGET_DISCHARGE}_Class_{class_detail['class_name']}\",\n per_class_metrics_to_log, # Use the correctly mapped dictionary\n \"Multiclass_Cal_ClassDetail\", METRICS_TABLE_NAME, MODEL_SOURCE_TAG, MODEL_YEAR_TAG\n )\n\n print(\"\\nCalibrated Classification Report:\\n\", classification_report(y_test_disc_enc, y_pred_labels_discharge_calibrated, target_names=le_discharge.classes_.astype(str), zero_division=0, digits=4))\n\n\n# =============================================================================\n# 5. MODEL SAVING\n# =============================================================================\nprint(\"\\n\" + \"=\"*80)\nprint(\"--- Saving Models and Artifacts ---\")\nprint(\"=\"*80)\n\n# Bundle all necessary objects for deployment into a single dictionary.\ninpatient_models_bundle = {\n 'los_model': los_model,\n 'readmission_model': calibrated_readmission_model,\n 'discharge_model': calibrated_discharge_model,\n 'feature_columns_los': best_los_features,\n 'feature_columns_readmission': best_readmission_features,\n 'feature_columns_discharge': best_discharge_features,\n 'le_discharge': le_discharge,\n 'model_run_id': MODEL_RUN_ID,\n 'fast_mode': FAST_MODE,\n 'excluded_feature_prefixes': EXCLUDE_FEATURE_PREFIXES\n}\n\n# Create a descriptive file name for the bundle.\nBUNDLE_SUFFIX = \"fast\" if FAST_MODE else \"fs\"\nEXCLUSION_FILE_TAG = f\"_excl_{'-'.join([p.strip('_').lower() for p in EXCLUDE_FEATURE_PREFIXES])}\" if EXCLUDE_FEATURE_PREFIXES else \"\"\nBUNDLE_FILE_NAME = f'inpatient_models_bundle_{MODEL_SOURCE_TAG}_{MODEL_YEAR_TAG}_{BUNDLE_SUFFIX}{EXCLUSION_FILE_TAG}.pkl'\n\n# Save the bundle locally using pickle.\nwith open(BUNDLE_FILE_NAME, 'wb') as f:\n pickle.dump(inpatient_models_bundle, f)\nprint(f\"Models bundled and saved locally to: {BUNDLE_FILE_NAME}\")\n\n# Upload the local bundle file to the specified Snowflake stage.\nput_result = session.file.put(BUNDLE_FILE_NAME, SNOWFLAKE_STAGE_NAME, overwrite=True)\nif put_result[0].status == 'UPLOADED':\n print(f\"Model bundle successfully uploaded to Snowflake stage: {SNOWFLAKE_STAGE_NAME}\")\nelse:\n print(f\"Error uploading model bundle. Status: {put_result[0].status}, Message: {put_result[0].message}\")\n\nfile_size_mb = os.path.getsize(BUNDLE_FILE_NAME) / (1024 * 1024)\nprint(f\"Saved local bundle file size: {file_size_mb:.2f} MB\")\n\nprint(f\"\\n✅ Script finished ({'FAST MODE' if FAST_MODE else 'FULL MODE'}).\")",
+ "execution_count": null,
+ "outputs": []
+ }
+ ]
+ }
feature_fill_rate_inpatient.csv ADDED
The diff for this file is too large to render. See raw diff
 
inpatient_feature_importance.csv ADDED
The diff for this file is too large to render. See raw diff
 
inpatient_models_bundle_medicare_lds_2023_fs.pkl.gz ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:afb9b33cac24e91958adfc4f52ed3a1fd617a51c12bbf9d8be2c37eab4eb979c
+ size 1808823
inpatient_models_eval_metrics.csv ADDED
@@ -0,0 +1,13 @@
+ MODEL_RUN_ID,MODEL_NAME,TARGET_NAME,R2,MAE,MSE,PRED_RATIO,MAE_PERCENT,AUC_ROC,AUC_PR,LOG_LOSS,BRIER_SCORE,ACCURACY,AVG_Y_PRED,AVG_Y_TRUE,MODEL_SOURCE,MODEL_TYPE,MODEL_YEAR,EVAL_TS
+ 03daf6f5-4a7a-44b9-a670-e0520ec6772f,Inpatient_Discharge_Cal_Class_FeatureSelected,discharge_location_Class_transfer/other facility,,,,,,,,,,,,,medicare_lds,Multiclass_Cal_ClassDetail,2023,2025-06-18 21:12:05.239
+ 03daf6f5-4a7a-44b9-a670-e0520ec6772f,Inpatient_Discharge_Cal_Class_FeatureSelected,discharge_location_Class_snf,,,,,,,,,,,,,medicare_lds,Multiclass_Cal_ClassDetail,2023,2025-06-18 21:12:01.767
+ 03daf6f5-4a7a-44b9-a670-e0520ec6772f,Inpatient_Discharge_Cal_Class_FeatureSelected,discharge_location_Class_other,,,,,,,,,,,,,medicare_lds,Multiclass_Cal_ClassDetail,2023,2025-06-18 21:11:58.290
+ 03daf6f5-4a7a-44b9-a670-e0520ec6772f,Inpatient_Discharge_Cal_Class_FeatureSelected,discharge_location_Class_ipt rehab,,,,,,,,,,,,,medicare_lds,Multiclass_Cal_ClassDetail,2023,2025-06-18 21:11:55.442
+ 03daf6f5-4a7a-44b9-a670-e0520ec6772f,Inpatient_Discharge_Cal_Class_FeatureSelected,discharge_location_Class_hospice,,,,,,,,,,,,,medicare_lds,Multiclass_Cal_ClassDetail,2023,2025-06-18 21:11:52.653
+ 03daf6f5-4a7a-44b9-a670-e0520ec6772f,Inpatient_Discharge_Cal_Class_FeatureSelected,discharge_location_Class_home health,,,,,,,,,,,,,medicare_lds,Multiclass_Cal_ClassDetail,2023,2025-06-18 21:11:49.518
+ 03daf6f5-4a7a-44b9-a670-e0520ec6772f,Inpatient_Discharge_Cal_Class_FeatureSelected,discharge_location_Class_home,,,,,,,,,,,,,medicare_lds,Multiclass_Cal_ClassDetail,2023,2025-06-18 21:11:46.588
+ 03daf6f5-4a7a-44b9-a670-e0520ec6772f,Inpatient_Discharge_Cal_Class_FeatureSelected,discharge_location_Class_expired,,,,,,,,,,,,,medicare_lds,Multiclass_Cal_ClassDetail,2023,2025-06-18 21:11:43.545
+ 03daf6f5-4a7a-44b9-a670-e0520ec6772f,Inpatient_Discharge_Cal_Overall_FeatureSelected,discharge_location,,,,,,,,1.302877,0.077095,0.521647,,,medicare_lds,Multiclass_Cal_Overall,2023,2025-06-18 21:11:40.194
+ 03daf6f5-4a7a-44b9-a670-e0520ec6772f,Inpatient_Readmission_Calibrated_FeatureSelected,readmission_numerator_Probability,,,,,,0.74825,0.334959,0.380948,0.117589,,0.157001,0.15629,medicare_lds,Binary_Classification_Probability_Calibrated,2023,2025-06-18 20:48:33.051
+ 03daf6f5-4a7a-44b9-a670-e0520ec6772f,Inpatient_Readmission_Base_Uncalibrated_FeatureSelected,readmission_numerator_Probability,,,,,,0.748529,0.341185,0.594258,0.20216,,0.428988,0.15629,medicare_lds,Binary_Classification_Probability_Uncalibrated,2023,2025-06-18 20:48:29.072
+ 03daf6f5-4a7a-44b9-a670-e0520ec6772f,Inpatient_LOS_FeatureSelected,length_of_stay,0.245662,2.724616,23.875849,0.970435,54.179359,,,,,,4.880204201,5.028881,medicare_lds,Regression,2023,2025-06-18 20:46:30.898
predict inpatient.ipynb ADDED
@@ -0,0 +1,385 @@
+ {
+ "cells": [
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "3775908f-ca36-4846-8f38-5adca39217f2",
+ "metadata": {
+ "language": "python",
+ "name": "cell1"
+ },
+ "outputs": [],
+ "source": [
+ "\"\"\"\n",
+ "Snowflake Inpatient Prediction and Evaluation Script (Self-Contained)\n",
+ "\n",
+ "This script is designed to run entirely within a Snowflake environment with NO\n",
+ "external network access. It performs the following operations:\n",
+ "\n",
+ "1. Connects to an active Snowflake session.\n",
+ "2. Loads a pre-trained model bundle and reference tables from a Snowflake stage.\n",
+ "3. Loads new inpatient data from a Snowflake table for prediction.\n",
+ "4. Calculates and compares the feature fill rates of the input data against the training data.\n",
+ "5. Performs data preprocessing, including one-hot encoding.\n",
+ "6. Generates predictions for Length of Stay, Readmission, and Discharge Location.\n",
+ "7. Calculates evaluation metrics by comparing predictions to actual outcomes.\n",
+ "8. Saves predictions, metrics, and diagnostic tables back to Snowflake.\n",
+ "\"\"\"\n",
+ "\n",
+ "import os\n",
+ "import gzip\n",
+ "import pickle\n",
+ "import uuid\n",
+ "from datetime import datetime\n",
+ "\n",
+ "import numpy as np\n",
+ "import pandas as pd\n",
+ "from snowflake.snowpark.context import get_active_session\n",
+ "from snowflake.snowpark.session import Session\n",
+ "from snowflake.snowpark.exceptions import SnowparkClientException\n",
+ "\n",
+ "from sklearn.metrics import (\n",
+ " r2_score, mean_absolute_error, mean_squared_error, accuracy_score,\n",
+ " roc_auc_score, log_loss, brier_score_loss, average_precision_score\n",
+ ")\n",
+ "from snowflake.snowpark.types import (\n",
+ " StructType, StructField, StringType, TimestampType, FloatType, LongType\n",
+ ")\n",
+ "\n",
+ "# =============================================================================\n",
+ "# 0. CONFIGURATION\n",
+ "# =============================================================================\n",
+ "\n",
+ "# --- Snowflake Environment Settings ---\n",
+ "DATABASE = \"CMS_SYNTHETIC\"\n",
+ "SCHEMA = \"BENCHMARKS\"\n",
+ "STAGE_DATABASE = \"CMS_SYNTHETIC\"\n",
+ "\n",
+ "# --- Input & Output Table Names ---\n",
+ "INPUT_TABLE_NAME = \"BENCHMARKS_INPATIENT_INPUT\"\n",
+ "OUTPUT_PREDICTIONS_TABLE_NAME = \"INPATIENT_PREDICTIONS\"\n",
+ "OUTPUT_METRICS_TABLE_NAME = \"INPATIENT_EVALUATION_METRICS\"\n",
+ "OUTPUT_FILL_RATE_COMPARISON_TABLE_NAME = \"INPATIENT_FILL_RATE_COMPARISON\"\n",
+ "\n",
+ "# --- Model & Artifact Loading Configuration ---\n",
+ "MODEL_STAGE_NAME = f\"@{STAGE_DATABASE}.{SCHEMA}.BENCHMARK_STAGE\"\n",
+ "MODEL_FILE_NAME_IN_STAGE = \"inpatient_models_bundle_medicare_lds_2023_fs.pkl.gz\"\n",
+ "FILL_RATE_FILE_NAME_IN_STAGE = \"feature_fill_rate_inpatient.csv.gz\"\n",
+ "\n",
+ "# --- Local Directory for Artifacts ---\n",
+ "# This directory is used within the Snowflake virtual environment to store downloaded files.\n",
+ "LOCAL_ARTIFACT_DIR = \"/tmp/appRoot\"\n",
+ "\n",
+ "# --- Run Configuration ---\n",
+ "ROW_LIMIT = None # Set to an integer to limit input rows for testing, or None for all rows.\n",
+ "RUN_ID = str(uuid.uuid4())\n",
+ "\n",
+ "# --- Feature Configuration ---\n",
+ "# Set to True to calculate and save a comparison of feature fill rates\n",
+ "# between the input data and the original training data. This requires\n",
+ "# the fill rate file to be available in the stage. Set to False to skip.\n",
+ "ENABLE_FILL_RATE_COMPARISON = True\n",
+ "\n",
+ "# --- Construct Full Object Names ---\n",
+ "FULL_INPUT_TABLE = f\"{INPUT_TABLE_NAME}\"\n",
+ "FULL_OUTPUT_PREDICTIONS_TABLE = f\"{OUTPUT_PREDICTIONS_TABLE_NAME}\"\n",
+ "FULL_METRICS_TABLE = f\"{OUTPUT_METRICS_TABLE_NAME}\"\n",
+ "FULL_OUTPUT_FILL_RATE_COMPARISON_TABLE = f\"{OUTPUT_FILL_RATE_COMPARISON_TABLE_NAME}\"\n",
+ "MODEL_STAGE_PATH = f\"{MODEL_STAGE_NAME}/{MODEL_FILE_NAME_IN_STAGE}\"\n",
+ "FILL_RATE_STAGE_PATH = f\"{MODEL_STAGE_NAME}/{FILL_RATE_FILE_NAME_IN_STAGE}\"\n",
+ "\n",
+ "\n",
+ "# =============================================================================\n",
+ "# 1. UTILITY FUNCTIONS\n",
+ "# =============================================================================\n",
+ "\n",
+ "def calculate_regression_metrics(y_true, y_pred):\n",
+ " \"\"\"Calculates a dictionary of standard regression metrics.\"\"\"\n",
+ " y_true_np, y_pred_np = np.array(y_true), np.array(y_pred)\n",
+ " sum_y_true, mean_y_true = np.sum(y_true_np), np.mean(y_true_np)\n",
+ " pred_ratio = np.sum(y_pred_np) / sum_y_true if sum_y_true != 0 else np.nan\n",
+ " mae_percent = (mean_absolute_error(y_true_np, y_pred_np) / mean_y_true) * 100 if mean_y_true != 0 else np.nan\n",
+ " return {\n",
+ " 'R2': r2_score(y_true_np, y_pred_np), 'MAE': mean_absolute_error(y_true_np, y_pred_np),\n",
+ " 'MSE': mean_squared_error(y_true_np, y_pred_np), 'PRED_RATIO': pred_ratio, 'MAE_PERCENT': mae_percent,\n",
+ " 'AVG_Y_PRED': np.mean(y_pred_np), 'AVG_Y_TRUE': mean_y_true\n",
+ " }\n",
+ "\n",
+ "def calculate_binary_classification_proba_metrics(y_true, y_pred_proba):\n",
+ " \"\"\"Calculates a dictionary of standard binary classification metrics from probabilities.\"\"\"\n",
+ " y_true_np, y_pred_proba_np = np.array(y_true), np.array(y_pred_proba)\n",
+ " is_multiclass = len(np.unique(y_true_np)) > 1\n",
+ " auc_roc = roc_auc_score(y_true_np, y_pred_proba_np) if is_multiclass else np.nan\n",
+ " auc_pr = average_precision_score(y_true_np, y_pred_proba_np) if is_multiclass else np.nan\n",
+ " return {\n",
+ " 'AUC_ROC': auc_roc, 'AUC_PR': auc_pr, 'LOG_LOSS': log_loss(y_true_np, y_pred_proba_np),\n",
+ " 'BRIER_SCORE': brier_score_loss(y_true_np, y_pred_proba_np),\n",
+ " 'AVG_Y_PRED_PROBA': np.mean(y_pred_proba_np), 'AVG_Y_TRUE': np.mean(y_true_np)\n",
+ " }\n",
+ "\n",
+ "def calculate_multiclass_classification_metrics(y_true_encoded, y_pred_labels, y_pred_proba, le_classes):\n",
+ " \"\"\"Calculates a dictionary of standard multi-class classification metrics.\"\"\"\n",
+ " num_samples, num_classes = len(y_true_encoded), len(le_classes)\n",
+ " metrics = {\n",
+ " 'ACCURACY': accuracy_score(y_true_encoded, y_pred_labels),\n",
+ " 'LOG_LOSS': log_loss(y_true_encoded, y_pred_proba, labels=np.arange(num_classes))\n",
+ " }\n",
+ " per_class_details, all_brier_scores = [], []\n",
+ " if num_samples > 0 and num_classes > 0:\n",
+ " for i in range(num_classes):\n",
+ " class_name, true_class_binary = le_classes[i], (y_true_encoded == i).astype(int)\n",
+ " pred_proba_for_class, true_proportion_class = y_pred_proba[:, i], np.mean(true_class_binary)\n",
+ " brier_score_class = brier_score_loss(true_class_binary, pred_proba_for_class) if len(np.unique(true_class_binary)) > 1 else np.nan\n",
+ " all_brier_scores.append(brier_score_class)\n",
+ " per_class_details.append({\n",
+ " \"class_name\": class_name, \"avg_pred_proba\": np.mean(pred_proba_for_class),\n",
+ " \"true_proportion\": true_proportion_class,\n",
+ " \"proba_ratio\": np.mean(pred_proba_for_class) / true_proportion_class if true_proportion_class > 0 else np.nan,\n",
+ " \"brier_score\": brier_score_class\n",
+ " })\n",
+ " metrics['per_class_details'] = per_class_details\n",
+ " valid_brier_scores = [s for s in all_brier_scores if not np.isnan(s)]\n",
+ " metrics['BRIER_SCORE_MACRO_AVG'] = np.mean(valid_brier_scores) if valid_brier_scores else np.nan\n",
+ " return metrics\n",
+ "\n",
+ "def log_metrics_to_snowflake(session, run_id, model_id, data_source, year_nbr, model_target, metrics_dict, table_name):\n",
+ " \"\"\"Constructs a DataFrame from a metrics dictionary and appends it to a Snowflake table.\"\"\"\n",
+ " metrics_schema = StructType([\n",
+ " StructField(\"RUN_ID\", StringType(), nullable=False), StructField(\"MODEL_ID\", StringType(), nullable=True),\n",
+ " StructField(\"DATA_SOURCE\", StringType(), nullable=True), StructField(\"YEAR_NBR\", LongType(), nullable=True),\n",
+ " StructField(\"MODEL_TARGET\", StringType(), nullable=True), StructField(\"EVAL_TIMESTAMP\", TimestampType(), nullable=False),\n",
+ " StructField(\"N_SAMPLES\", LongType(), nullable=True), StructField(\"R2\", FloatType(), nullable=True),\n",
+ " StructField(\"MAE\", FloatType(), nullable=True), StructField(\"MSE\", FloatType(), nullable=True),\n",
+ " StructField(\"PRED_RATIO\", FloatType(), nullable=True), StructField(\"MAE_PERCENT\", FloatType(), nullable=True),\n",
+ " StructField(\"AUC_ROC\", FloatType(), nullable=True), StructField(\"AUC_PR\", FloatType(), nullable=True),\n",
+ " StructField(\"LOG_LOSS\", FloatType(), nullable=True), StructField(\"BRIER_SCORE\", FloatType(), nullable=True),\n",
+ " StructField(\"ACCURACY\", FloatType(), nullable=True), StructField(\"AVG_Y_PRED\", FloatType(), nullable=True),\n",
+ " StructField(\"AVG_Y_TRUE\", FloatType(), nullable=True)\n",
+ " ])\n",
+ " avg_y_pred = metrics_dict.get('AVG_Y_PRED', metrics_dict.get('AVG_Y_PRED_PROBA'))\n",
+ " payload = {\n",
+ " \"RUN_ID\": run_id, \"MODEL_ID\": model_id, \"DATA_SOURCE\": data_source,\n",
+ " \"YEAR_NBR\": int(year_nbr) if pd.notna(year_nbr) else None, \"MODEL_TARGET\": model_target,\n",
+ " \"EVAL_TIMESTAMP\": datetime.utcnow(), \"N_SAMPLES\": metrics_dict.get('n_samples'), \"R2\": metrics_dict.get('R2'),\n",
+ " \"MAE\": metrics_dict.get('MAE'), \"MSE\": metrics_dict.get('MSE'), \"PRED_RATIO\": metrics_dict.get('PRED_RATIO'),\n",
+ " \"MAE_PERCENT\": metrics_dict.get('MAE_PERCENT'), \"AUC_ROC\": metrics_dict.get('AUC_ROC'),\n",
+ " \"AUC_PR\": metrics_dict.get('AUC_PR'), \"LOG_LOSS\": metrics_dict.get('LOG_LOSS'),\n",
+ " \"BRIER_SCORE\": metrics_dict.get('BRIER_SCORE', metrics_dict.get('BRIER_SCORE_MACRO_AVG')),\n",
+ " \"ACCURACY\": metrics_dict.get('ACCURACY'), \"AVG_Y_PRED\": avg_y_pred, \"AVG_Y_TRUE\": metrics_dict.get('AVG_Y_TRUE')\n",
+ " }\n",
+ " dfm = pd.DataFrame([payload]).replace({np.nan: None, pd.NaT: None})\n",
+ " try:\n",
+ " column_order = [field.name for field in metrics_schema.fields]\n",
+ " dfm_reordered = dfm[column_order]\n",
+ " snowpark_df = session.create_dataframe(dfm_reordered.values.tolist(), schema=metrics_schema)\n",
+ " snowpark_df.write.mode(\"append\").save_as_table(table_name)\n",
+ " print(f\"✅ Logged metrics for {model_target} ({data_source}, {year_nbr}) to {table_name}.\")\n",
+ " except Exception as e:\n",
+ " print(f\"Error logging metrics for {model_target}: {e}\\nPayload: {dfm.to_dict('records')}\")\n",
+ "\n",
+ "# =============================================================================\n",
+ "# 2. MAIN EXECUTION\n",
+ "# =============================================================================\n",
+ "def main(session: Session):\n",
+ " print(f\"--- Starting Inpatient Prediction Pipeline ---\\nRun ID: {RUN_ID}\")\n",
+ " session.use_database(DATABASE)\n",
+ " session.use_schema(SCHEMA)\n",
+ " print(f\"Session context set to DATABASE: {DATABASE}, SCHEMA: {SCHEMA}\")\n",
+ "\n",
+ " # --- Stage 1: Load Model & Artifacts ---\n",
+ " print(f\"\\n--- Stage 1: Loading artifacts from stage to {LOCAL_ARTIFACT_DIR} ---\")\n",
+ " os.makedirs(LOCAL_ARTIFACT_DIR, exist_ok=True)\n",
+ " df_training_fill_rates = pd.DataFrame() # Initialize as empty DataFrame\n",
+ "\n",
+ " try:\n",
+ " # The model bundle is a required artifact for the script to run.\n",
+ " session.file.get(MODEL_STAGE_PATH, LOCAL_ARTIFACT_DIR)\n",
+ " local_model_path = os.path.join(LOCAL_ARTIFACT_DIR, MODEL_FILE_NAME_IN_STAGE)\n",
+ " with gzip.open(local_model_path, \"rb\") as f:\n",
+ " models_bundle = pickle.load(f)\n",
+ " print(\"✅ Model bundle loaded successfully.\")\n",
+ "\n",
+ " # The fill rate file is optional. It is loaded only if the feature is enabled.\n",
+ " if ENABLE_FILL_RATE_COMPARISON:\n",
+ " try:\n",
+ " session.file.get(FILL_RATE_STAGE_PATH, LOCAL_ARTIFACT_DIR)\n",
+ " local_fr_path = os.path.join(LOCAL_ARTIFACT_DIR, FILL_RATE_FILE_NAME_IN_STAGE)\n",
+ " with gzip.open(local_fr_path, \"rb\") as f:\n",
+ " df_training_fill_rates = pd.read_csv(f)\n",
+ " print(\"✅ Training fill rates loaded successfully.\")\n",
+ " except Exception as e:\n",
+ " print(f\"WARNING: Could not load fill rate file from '{FILL_RATE_STAGE_PATH}'. \"\n",
+ " f\"Fill rate comparison will be skipped. Error: {e}\")\n",
+ "\n",
+ " except Exception as e:\n",
+ " print(f\"CRITICAL ERROR: Failed to load the required model bundle from stage. Error: {e}\")\n",
+ " return\n",
+ "\n",
+ " los_model, readmission_model, discharge_model = models_bundle['los_model'], models_bundle['readmission_model'], models_bundle['discharge_model']\n",
+ " los_features, readmission_features, discharge_features = models_bundle['feature_columns_los'], models_bundle['feature_columns_readmission'], models_bundle['feature_columns_discharge']\n",
+ " le_discharge = models_bundle.get('le_discharge')\n",
+ " model_version = models_bundle.get('model_run_id', MODEL_FILE_NAME_IN_STAGE)\n",
+ " print(f\" - Using Model Version (from training run): {model_version}\")\n",
+ "\n",
+ " # --- Stage 2: Load New Data for Prediction ---\n",
+ " print(f\"\\n--- Stage 2: Loading data from table: {FULL_INPUT_TABLE} ---\")\n",
+ " try:\n",
+ " query = session.table(FULL_INPUT_TABLE)\n",
+ " if ROW_LIMIT:\n",
+ " query = query.limit(ROW_LIMIT)\n",
+ " df_new_data_pd = query.to_pandas()\n",
+ " print(f\"Loaded {len(df_new_data_pd)} rows for prediction.\")\n",
+ " if df_new_data_pd.empty:\n",
+ " print(\"WARNING: Input data is empty. Exiting script.\")\n",
+ " return\n",
+ " df_new_data_pd.columns = [col.upper() for col in df_new_data_pd.columns]\n",
+ " except Exception as e:\n",
+ " print(f\"CRITICAL ERROR: Failed to load data from {FULL_INPUT_TABLE}. Error: {e}\")\n",
+ " return\n",
+ " \n",
+ " # --- Stage 3: Preprocess New Data ---\n",
+ " print(\"\\n--- Stage 3: Preprocessing data ---\")\n",
+ " df_new_data_pd_lower = df_new_data_pd.copy()\n",
+ " df_new_data_pd_lower.columns = df_new_data_pd_lower.columns.str.lower()\n",
+ " categorical_cols = ['sex', 'state', 'race', 'ms_drg_code', 'ccsr_cat']\n",
+ " cols_to_encode = [col for col in categorical_cols if col in df_new_data_pd_lower.columns]\n",
+ " if cols_to_encode:\n",
+ " df_new_data_encoded = pd.get_dummies(df_new_data_pd_lower, columns=cols_to_encode, dummy_na=False)\n",
+ " else:\n",
+ " df_new_data_encoded = df_new_data_pd_lower.copy()\n",
+ " print(f\"Data preprocessed. Total features after encoding: {len(df_new_data_encoded.columns)}\")\n",
+ "\n",
+ " # --- Stage 4: Calculate & Compare Feature Fill Rates ---\n",
+ " print(\"\\n--- Stage 4: Comparing input data fill rate to training data fill rate ---\")\n",
+ " if ENABLE_FILL_RATE_COMPARISON:\n",
+ " # This block executes only if the feature is enabled in the configuration.\n",
+ " if not df_training_fill_rates.empty:\n",
+ " # Calculate the non-null rate for all columns.\n",
+ " input_fill_rate_series = (df_new_data_encoded.notna().sum() / len(df_new_data_encoded))\n",
+ "\n",
+ " # Identify binary/categorical features to calculate positive rate instead of non-null rate.\n",
+ " binary_prefixes = ('hcc_', 'cms_', 'cond_') + tuple(f'{col}_' for col in cols_to_encode)\n",
+ " binary_feature_names = [\n",
+ " col for col in df_new_data_encoded.columns \n",
+ " if col.lower().startswith(binary_prefixes)\n",
+ " ]\n",
+ " print(f\"Identified {len(binary_feature_names)} binary-like features for positive rate calculation.\")\n",
+ " \n",
+ " # For these binary columns, calculate the positive rate (mean) and update the series.\n",
+ " if binary_feature_names:\n",
+ " existing_binary_features = [c for c in binary_feature_names if c in df_new_data_encoded.columns]\n",
+ " if existing_binary_features:\n",
+ " positive_rates = df_new_data_encoded[existing_binary_features].mean()\n",
+ " input_fill_rate_series.update(positive_rates)\n",
+ "\n",
+ " df_input_fill_rates = input_fill_rate_series.reset_index()\n",
+ " df_input_fill_rates.columns = ['FEATURE', 'INPUT_FILL_RATE']\n",
+ "\n",
+ " # Prepare training data fill rates for comparison.\n",
+ " df_training_fill_rates.columns = df_training_fill_rates.columns.str.upper()\n",
+ " df_training_fill_rates = df_training_fill_rates.rename(columns={'FEATURE_NAME': 'FEATURE'})\n",
+ " if 'POSITIVE_RATE_PERCENT' in df_training_fill_rates.columns:\n",
+ " df_training_fill_rates['TRAINING_FILL_RATE'] = df_training_fill_rates['POSITIVE_RATE_PERCENT'] / 100.0\n",
+ "\n",
+ " # Merge input and training fill rates on the feature name.\n",
+ " df_training_fill_rates['FEATURE'] = df_training_fill_rates['FEATURE'].str.upper()\n",
+ " df_input_fill_rates['FEATURE'] = df_input_fill_rates['FEATURE'].str.upper()\n",
+ "\n",
+ " df_comparison = pd.merge(\n",
+ " df_training_fill_rates[['FEATURE', 'TRAINING_FILL_RATE']],\n",
+ " df_input_fill_rates, on='FEATURE', how='outer'\n",
+ " )\n",
+ " df_comparison['RUN_ID'] = RUN_ID\n",
+ " df_comparison['LAST_RUN'] = pd.Timestamp(datetime.utcnow())\n",
+ " df_comparison['FILL_RATE_DIFFERENCE'] = df_comparison['INPUT_FILL_RATE'] - df_comparison['TRAINING_FILL_RATE']\n",
+ " df_comparison = df_comparison[['RUN_ID', 'FEATURE', 'TRAINING_FILL_RATE', 'INPUT_FILL_RATE', 'FILL_RATE_DIFFERENCE', 'LAST_RUN']]\n",
+ " \n",
+ " # Save the comparison table to Snowflake.\n",
+ " session.write_pandas(df_comparison, FULL_OUTPUT_FILL_RATE_COMPARISON_TABLE, auto_create_table=True, overwrite=True)\n",
+ " print(f\"✅ Successfully saved fill rate comparison to {FULL_OUTPUT_FILL_RATE_COMPARISON_TABLE}\")\n",
+ " else:\n",
+ " print(\"WARNING: Training fill rate data not available. Skipping comparison.\")\n",
+ " else:\n",
+ " print(\"Fill rate comparison is disabled by configuration. Skipping.\")\n",
+ " \n",
+ " # --- Stage 5: Generate Predictions & Save ---\n",
+ " print(\"\\n--- Stage 5: Generating and saving predictions ---\")\n",
+ " predictions_df = pd.DataFrame({\n",
+ " \"ENCOUNTER_ID\": df_new_data_encoded['encounter_id'],\n",
+ " \"LENGTH_OF_STAY_PRED\": los_model.predict(df_new_data_encoded.reindex(columns=los_features, fill_value=0)),\n",
+ " \"READMISSION_PRED\": readmission_model.predict_proba(df_new_data_encoded.reindex(columns=readmission_features, fill_value=0))[:, 1],\n",
+ " \"LAST_RUN\": datetime.utcnow()\n",
+ " })\n",
+ " if le_discharge is not None:\n",
+ " discharge_probas = discharge_model.predict_proba(df_new_data_encoded.reindex(columns=discharge_features, fill_value=0))\n",
+ " for i, class_label in enumerate(le_discharge.classes_):\n",
+ " col_name = f\"DISCHARGE_PRED_PROBA_{class_label.upper().replace(' ', '_').replace('/', '_')}\"\n",
+ " predictions_df[col_name] = discharge_probas[:, i]\n",
+ " session.write_pandas(predictions_df, FULL_OUTPUT_PREDICTIONS_TABLE, auto_create_table=True, overwrite=True)\n",
+ " print(f\"✅ Successfully saved {len(predictions_df)} predictions to {FULL_OUTPUT_PREDICTIONS_TABLE}\")\n",
+ "\n",
+ " # --- Stage 6: Calculate and Save Evaluation Metrics ---\n",
+ " print(\"\\n--- Stage 6: Calculating and saving evaluation metrics ---\")\n",
+ " actual_cols_to_get = ['ENCOUNTER_ID', 'DATA_SOURCE', 'YEAR_NBR', 'LENGTH_OF_STAY', 'READMISSION_NUMERATOR', 'READMISSION_DENOMINATOR', 'DISCHARGE_LOCATION']\n",
+ " actual_cols = [col for col in actual_cols_to_get if col in df_new_data_pd.columns]\n",
+ " eval_df = pd.merge(predictions_df, df_new_data_pd[actual_cols], on=\"ENCOUNTER_ID\", how=\"left\")\n",
+ " eval_df.columns = eval_df.columns.str.lower()\n",
+ " \n",
+ " ACTUAL_LOS_COL_LOWER, ACTUAL_READMISSION_NUM_LOWER, ACTUAL_READMISSION_DENOM_LOWER, ACTUAL_DISCHARGE_COL_LOWER = \"length_of_stay\", \"readmission_numerator\", \"readmission_denominator\", \"discharge_location\"\n",
+ " groups_to_iterate = eval_df.groupby(['data_source', 'year_nbr'], dropna=False) if 'data_source' in eval_df.columns and 'year_nbr' in eval_df.columns else [((\"Overall\", -1), eval_df)]\n",
+ "\n",
+ " for group_key, group_df in groups_to_iterate:\n",
+ " data_source_val, year_nbr_val = group_key\n",
+ " print(f\"\\n--- Calculating metrics for group: {data_source_val}, {year_nbr_val} ---\")\n",
+ " if ACTUAL_LOS_COL_LOWER in group_df.columns and not group_df[ACTUAL_LOS_COL_LOWER].isnull().all():\n",
+ " metrics = calculate_regression_metrics(group_df[ACTUAL_LOS_COL_LOWER], group_df[\"length_of_stay_pred\"])\n",
+ " metrics['n_samples'] = len(group_df)\n",
+ " log_metrics_to_snowflake(session, RUN_ID, model_version, data_source_val, year_nbr_val, \"LENGTH_OF_STAY\", metrics, FULL_METRICS_TABLE)\n",
+ " if ACTUAL_READMISSION_NUM_LOWER in group_df.columns and ACTUAL_READMISSION_DENOM_LOWER in group_df.columns:\n",
+ " readmission_eval_df = group_df[group_df[ACTUAL_READMISSION_DENOM_LOWER] == 1].copy()\n",
+ " if not readmission_eval_df.empty and not readmission_eval_df[ACTUAL_READMISSION_NUM_LOWER].isnull().all():\n",
+ " metrics = calculate_binary_classification_proba_metrics(readmission_eval_df[ACTUAL_READMISSION_NUM_LOWER], readmission_eval_df[\"readmission_pred\"])\n",
+ " metrics['n_samples'] = len(readmission_eval_df)\n",
+ " log_metrics_to_snowflake(session, RUN_ID, model_version, data_source_val, year_nbr_val, \"READMISSION\", metrics, FULL_METRICS_TABLE)\n",
+ " if ACTUAL_DISCHARGE_COL_LOWER in group_df.columns and le_discharge is not None:\n",
+ " group_input = df_new_data_encoded.loc[group_df.index]\n",
+ " y_true_labels = group_df[ACTUAL_DISCHARGE_COL_LOWER]\n",
+ " known_mask = y_true_labels.isin(le_discharge.classes_)\n",
+ " y_true_enc = le_discharge.transform(y_true_labels[known_mask])\n",
+ " group_probas = discharge_model.predict_proba(group_input[known_mask].reindex(columns=discharge_features, fill_value=0))\n",
+ " group_preds_enc = discharge_model.predict(group_input[known_mask].reindex(columns=discharge_features, fill_value=0))\n",
+ " metrics = calculate_multiclass_classification_metrics(y_true_enc, group_preds_enc, group_probas, le_discharge.classes_)\n",
+ " metrics['n_samples'] = len(y_true_enc)\n",
+ " log_metrics_to_snowflake(session, RUN_ID, model_version, data_source_val, year_nbr_val, \"DISCHARGE_LOCATION_OVERALL\", metrics, FULL_METRICS_TABLE)\n",
+ " for detail in metrics.get('per_class_details', []):\n",
+ " log_metrics_to_snowflake(session, RUN_ID, model_version, data_source_val, year_nbr_val, f\"DISCHARGE_LOCATION_Class_{detail['class_name']}\", detail, FULL_METRICS_TABLE)\n",
+ " print(\"\\n✅ Script finished.\")\n",
+ "\n",
+ "if __name__ == \"__main__\":\n",
+ " try:\n",
+ " snowpark_session = get_active_session()\n",
+ " print(\"Successfully retrieved active Snowpark session.\")\n",
+ " except SnowparkClientException:\n",
+ " print(\"No active session. Creating a new session from local credentials...\")\n",
+ " snowpark_session = Session.builder.create()\n",
+ " main(snowpark_session)"
+ ]
+ }
+ ],
+ "metadata": {
+ "kernelspec": {
+ "display_name": "Streamlit Notebook",
+ "name": "streamlit"
+ },
+ "lastEditStatus": {
+ "authorEmail": "[email protected]",
+ "authorId": "374530764978",
+ "authorName": "BRAD",
+ "lastEditTime": 1750882358404,
+ "notebookId": "7e7pzs6ti4k6chxub4nt",
+ "sessionId": "56467df4-1029-4269-83ed-9a238cb180f6"
+ }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 5
+ }