---
license: mit
tags:
  - healthcare
  - medicare
  - xgboost
  - regression
  - pmpm
  - cost-prediction
  - utilization-prediction
datasets:
  - cms-lds
---

# Medicare PMPM and Encounter Prediction Models

**Model ID:** Medicare LDS 2023 Concurrent PMPM Model Bundle
**Model Types:** XGBoost Multi-Output Regressor Bundle
**Dataset:** 2023 CMS Limited Data Set (LDS) 5% Sample
**Target Level:** Member month

---

## What the Models Predict

This repository contains a bundle of multi-output regression models that predict healthcare cost and utilization for a given person over a one-year period. The models are trained on demographic and clinical data (e.g., chronic conditions) from the beginning of the year to predict outcomes for that same year.

The models predict outcomes at three levels of granularity:
1.  **Total PMPM:** A single model predicting the total PMPM (Per Member Per Month) cost.
2.  **Cost & Utilization by Encounter Group:** Multi-output models that simultaneously predict PMPM cost and per member counts of utilization for broad encounter groups (Inpatient, Outpatient, Office-Based, Other).
3.  **Cost & Utilization by Encounter Type:** Multi-output models that predict PMPM and per member counts for more granular service types (e.g., Emergency Department, Home Health, Skilled Nursing, Dialysis).

> **Note on Prediction Type:** The models are trained for **concurrent prediction** — they use data available at the start of a year to predict cost and utilization outcomes for that entire year. The core models predict a *rate* (PMPM/per member counts), which is then used to calculate annual estimates.

---

## Intended Use

This model bundle is designed to support a variety of financial and operational workflows:

-   **Actuarial Analysis & Risk Stratification:** Forecast future costs for a population to inform premium pricing and risk adjustment.
-   **Population Health Management:** Identify members who are predicted to have high costs or utilization for proactive care management interventions.
-   **Resource Planning:** Forecast demand for different types of services (e.g., inpatient vs. outpatient) to aid in network management and capacity planning.
-   **Benchmarking:** Compare observed costs against predicted costs for a given patient population to identify efficiency opportunities.
-   **Healthcare Research:** Analyze the drivers of cost and utilization in the Medicare population.

---

## Model Performance

> These metrics reflect performance on a **20% test set** held out from the 2023 CMS LDS data. All values represent model generalization performance on unseen data. The models predict a rate (PMPM/per member counts), but for easier interpretation, the metrics below are calculated on the **annualized** predictions (e.g., PMPM * 12).

### Total Annual Cost Prediction (`total_paid_fs_pmpm_scaled`)

| Target       | R²     | MAE %  |
|--------------|--------|--------|
| paid_amount  | 0.4877 | 65.05% |

### Annual Cost by Encounter Group (`group_paid_fs_pmpm_scaled`)

| Target                    | R²     | MAE %   |
|---------------------------|--------|---------|
| inpatient_paid_amount     | 0.5765 | 85.31%  |
| outpatient_paid_amount    | 0.1627 | 90.64%  |
| office_based_paid_amount  | 0.0717 | 92.76%  |
| other_paid_amount         | 0.0005 | 114.60% |

### Annual Encounters by Encounter Group (`group_count_fs_pmpc_scaled`)

| Target                   | R²     | MAE %  |
|--------------------------|--------|--------|
| inpatient_count          | 0.5341 | 84.84% |
| outpatient_count         | 0.5414 | 67.64% |
| office_based_count       | 0.3389 | 60.43% |
| other_count              | 0.2792 | 79.60% |


See the included CSV file `encounter_eval_metrics.csv` for the complete list of evaluation metrics for each encounter type.

---

## Files Included

-   `2023_concurrent_model_with_all_features.pkl.gz` — A compressed pickle file containing the bundle of trained XGBoost models, scalers, and feature lists.
-   `Train PMPM Encounters Container.ipynb` — The Snowflake Notebook used for data preparation, feature selection, training, and evaluation.
-   `predict pmpm and encounters.ipynb` — An example Snowflake Notebook for loading the model bundle and running predictions on new data within a Snowflake environment.
-   `encounters_feature_importance.csv` — A file containing the calculated importance of each feature for each of the models in the bundle.
-   `feature_fill_rates_encounters.csv` — A diagnostic file detailing the prevalence (fill rate) of each feature in the training dataset.

---

## Understanding Model Artifacts

This repository includes two key CSV files that provide insight into the models' training data and internal logic. Both are generated by the `Train PMPM Encounters Container.ipynb` notebook.

### Feature Fill Rates (`feature_fill_rates_encounters.csv`)

This file is a diagnostic tool for understanding the input data used to train the models. It is crucial for monitoring data drift and diagnosing data quality issues when running predictions.

| Column                  | Description                                                                              |
| ----------------------- | ---------------------------------------------------------------------------------------- |
| `FEATURE_NAME`          | The name of the input feature (e.g., `age_at_year_start`, `cond_hypertension`).            |
| `POSITIVE_COUNT`        | The number of records in the training set where this feature was present (value > 0).      |
| `TOTAL_ROWS`            | The total number of records in the training set.                                         |
| `POSITIVE_RATE_PERCENT` | The prevalence or "fill rate" of the feature (`POSITIVE_COUNT` / `TOTAL_ROWS` * 100).      |

**How to Use:** Compare the `POSITIVE_RATE_PERCENT` from this file with the rates calculated from your own prediction input data. Significant discrepancies can point to data pipeline issues, changes in the population, or data drift, which may explain unexpected model performance.
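
For example, a minimal pandas sketch of that comparison, assuming your prediction inputs are already loaded into a DataFrame (called `scoring_df` here purely for illustration) with the same numeric feature columns used in training:

```python
import pandas as pd

# Fill rates from training (shipped with the model bundle)
ref = pd.read_csv("feature_fill_rates_encounters.csv")

# scoring_df is a hypothetical DataFrame holding your prediction inputs,
# with the same feature columns (indicators/counts) used in training.
common = [c for c in ref["FEATURE_NAME"] if c in scoring_df.columns]
observed = ((scoring_df[common] > 0).mean() * 100).reset_index()
observed.columns = ["FEATURE_NAME", "OBSERVED_RATE_PERCENT"]

# Largest absolute gaps are the first place to look for drift or pipeline issues.
drift = ref.merge(observed, on="FEATURE_NAME", how="left")
drift["RATE_DIFF"] = drift["OBSERVED_RATE_PERCENT"] - drift["POSITIVE_RATE_PERCENT"]
print(drift.sort_values("RATE_DIFF", key=abs, ascending=False).head(20))
```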

### Feature Importances (`encounters_feature_importance.csv`)

This file provides model explainability by showing which features are most influential for each of the models.

| Column             | Description                                                                    |
| ------------------ | ------------------------------------------------------------------------------ |
| `MODEL_NAME`       | Identifies the specific model (e.g., `total_paid_fs_pmpm_scaled`).             |
| `TARGET_NAME`      | The specific outcome this importance list is for (can be a single target or `multi_` for multi-output models). |
| `FEATURE_NAME`     | The name of the input feature.                                                 |
| `IMPORTANCE_VALUE` | A numeric score indicating the feature's influence. Higher is more important.  |
| `IMPORTANCE_RANK`  | The rank of the feature's importance for that model (1 is most important).     |

**How to Use:** Use this file to understand the key drivers behind the model's predictions. For example, you can filter by `MODEL_NAME` for the total cost model and sort by `IMPORTANCE_RANK` to see which conditions or demographic factors most influence predicted spending. This is useful for clinical validation, stakeholder communication, and debugging.
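
A small pandas sketch of that workflow, using the column names listed above (the file path and DataFrame handling are illustrative):

```python
import pandas as pd

fi = pd.read_csv("encounters_feature_importance.csv")

# Top 20 drivers of the total cost model, most important first
top_total = (
    fi[fi["MODEL_NAME"] == "total_paid_fs_pmpm_scaled"]
    .sort_values("IMPORTANCE_RANK")
    .head(20)
)
print(top_total[["FEATURE_NAME", "IMPORTANCE_VALUE", "IMPORTANCE_RANK"]])
```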

---

## Loading and Running Predictions in Snowflake

This model bundle is designed to be run within a Snowflake environment using the provided `predict pmpm and encounters.ipynb` notebook.
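
The notebook contains the authoritative loading and scoring logic. For orientation, the sketch below shows one plausible way to load the bundle and score a DataFrame outside Snowflake; the dictionary keys (`model`, `features`, `scaler`) and the input DataFrame `scoring_df` are assumptions to verify against the notebook, not a documented API.

```python
import gzip
import pickle

# Load the compressed model bundle (XGBoost models, scalers, and feature lists).
with gzip.open("2023_concurrent_model_with_all_features.pkl.gz", "rb") as f:
    bundle = pickle.load(f)

# Inspect the structure before relying on any particular layout.
print(type(bundle))
if isinstance(bundle, dict):
    print(list(bundle.keys()))

# Hypothetical usage for the total-cost model, assuming each entry holds a
# fitted XGBoost regressor, its feature list, and an optional target scaler.
entry = bundle["total_paid_fs_pmpm_scaled"]
model, features, scaler = entry["model"], entry["features"], entry.get("scaler")

# scoring_df is a hypothetical member-level DataFrame with the training features.
X = scoring_df[features]
pred = model.predict(X)
if scaler is not None:
    pred = scaler.inverse_transform(pred.reshape(-1, 1)).ravel()

# The models predict a rate (PMPM / per-member counts); annualize for yearly estimates.
annual_paid = pred * 12
```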


---

## Quick Start: End-to-End Workflow

This section provides high-level instructions for running a model with the Tuva Project. The workflow involves preparing benchmark data using dbt, running a Python prediction script, and optionally ingesting the results back into dbt for analysis.

### 1. Configure Your dbt Project

You need to enable the correct variables in your `dbt_project.yml` file to control the workflow.

#### A. Enable Benchmark Marts

These two variables control which parts of the Tuva Project are active. They are `false` by default.

```yaml
# in dbt_project.yml
vars:
  benchmarks_train: true
  benchmarks_already_created: true
```

- `benchmarks_train`: Set to `true` to build the datasets that the ML models will use for making predictions.  
- `benchmarks_already_created`: Set to `true` to ingest model predictions back into the project as a new dbt source.

#### B. (Optional) Set Prediction Source Locations

If you plan to bring predictions back into dbt for analysis, you must define where dbt can find the prediction data.

```yaml
# in dbt_project.yml
vars:
  predictions_person_year: "{{ source('benchmark_output', 'person_year') }}"
  predictions_inpatient: "{{ source('benchmark_output', 'inpatient') }}"
```

#### C. Configure `sources.yml`

Ensure your `sources.yml` file includes a definition for the source you referenced above (e.g., `benchmark_output`) that points to the database and schema where your model's prediction outputs are stored.
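
A minimal `sources.yml` sketch, assuming the prediction outputs land in tables named to match the source references above (replace the database and schema placeholders with your own):

```yaml
# in sources.yml
version: 2

sources:
  - name: benchmark_output
    database: your_database   # replace with the database holding prediction outputs
    schema: your_schema       # replace with the schema holding prediction outputs
    tables:
      - name: person_year
      - name: inpatient
```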

---

### 2. The 3-Step Run Process

This workflow can be managed by any orchestration tool (e.g., Airflow, Prefect, Fabric Notebooks) or run manually from the command line.

#### Step 1: Generate the Training & Benchmarking Data

Run the Tuva Project with `benchmarks_train` enabled. This creates the input data required by the ML model.

```bash
dbt build --vars '{benchmarks_train: true}'
```

To run only the benchmark mart:

```bash
dbt build --select tag:benchmarks_train --vars '{benchmarks_train: true}'
```

#### Step 2: Run the Prediction Python Code

Execute the `predict pmpm and encounters.ipynb` notebook to generate predictions. It reads the data created in Step 1 and writes the prediction outputs to a persistent location (e.g., a table in your data warehouse).

*Example Snowflake Notebook code, as used in Tuva's environment, is provided in each model's repository.*

#### Step 3: (Optional) Bring Predictions Back into the Tuva Project

To bring the predictions back into the Tuva Project for analysis, run dbt again with `benchmarks_already_created` enabled. This populates the analytics marts.

```bash
dbt build --vars '{benchmarks_already_created: true, benchmarks_train: false}'
```

To run only the analysis models:

```bash
dbt build --select tag:benchmarks_analysis --vars '{benchmarks_already_created: true, benchmarks_train: false}'
```

---