End-to-end example - Standard Model Bio

We provide a script, quickstart-pipeline.py, that demonstrates an end-to-end workflow with the Standard Model, going from raw data to clinical insight. It performs three key steps:

Simulate patient data: Generate a synthetic MEDS (Medical Event Data Standard) cohort with distinct clinical phenotypes (Lung Cancer vs. Pneumonia), including longitudinal codes, labs, and medications.
Represent data as embeddings: Convert the tabular MEDS data into a token stream with smb-biopan-utils, then run the Standard Model on that stream to produce high-dimensional patient-level embeddings that capture causal history.
Train clinical predictors: Use the embeddings to train downstream task heads for Readmission Risk, Disease Phenotyping, and Survival Analysis.
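
To make the first step more concrete, here is a minimal sketch of what a synthetic MEDS-style cohort could look like. It is not the generator used in quickstart-pipeline.py: the function simulate_cohort, the code labels (DX//..., LAB//..., MED//...), and the value ranges are all made up for illustration. The only thing assumed from MEDS itself is the long event layout with subject_id, time, code, and numeric_value columns.

```python
import random
from datetime import datetime, timedelta

# Illustrative code vocabulary per phenotype (hypothetical labels, not real
# medical codes and not the ones used by quickstart-pipeline.py).
PHENOTYPE_CODES = {
    "lung_cancer": ["DX//LUNG_CANCER", "LAB//CEA", "MED//CHEMOTHERAPY"],
    "pneumonia": ["DX//PNEUMONIA", "LAB//WBC", "MED//ANTIBIOTIC"],
}


def simulate_cohort(n_patients, seed=0):
    """Return (events, labels): MEDS-style event rows and one phenotype per subject."""
    rng = random.Random(seed)
    events, labels = [], {}
    for subject_id in range(n_patients):
        phenotype = rng.choice(sorted(PHENOTYPE_CODES))
        labels[subject_id] = phenotype
        # Each subject gets a short longitudinal history starting in 2020.
        time = datetime(2020, 1, 1) + timedelta(days=rng.randrange(365))
        for _ in range(rng.randint(3, 8)):
            code = rng.choice(PHENOTYPE_CODES[phenotype])
            # Only lab events carry a numeric value; other codes leave it empty.
            is_lab = code.startswith("LAB//")
            value = round(rng.uniform(1.0, 20.0), 2) if is_lab else None
            events.append(
                {
                    "subject_id": subject_id,
                    "time": time,
                    "code": code,
                    "numeric_value": value,
                }
            )
            time += timedelta(days=rng.randint(1, 30))
    return events, labels


events, labels = simulate_cohort(n_patients=5)
print(len(events), "events for", len(labels), "subjects")
```

Each row is one clinical event, so a patient's history is simply the time-ordered rows that share a subject_id. That ordered sequence is what a tokenizer would turn into a token stream in the second step.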

Don’t forget to activate your virtual environment!

source standard_model/bin/activate

Pull the script from GitHub:

curl -fsSL https://raw.githubusercontent.com/standardmodelbio/quickstart/main/quickstart-pipeline.py -o quickstart-pipeline.py

Run the script:

python quickstart-pipeline.py

If the script runs successfully, you should see output similar to the following. Model loading is logged as its own stage, so the three steps above appear as four stages:

[1/4] Simulating patient data for N=200...
   -> Generated 1274 total clinical events (MEDS format).
   -> Class Balance: 144 Cancer / 56 Pneumonia.

[2/4] Loading Standard Model (SMB-v1-1.7B)...
model.safetensors: 100%|█████████████████| 3.66G/3.66G [00:12<00:00, 287MB/s]
generation_config.json: 100%|██████████████| 127/127 [00:00<00:00, 939kB/s]

[3/4] Generating embeddings for 200 patients...
   -> Strategy: Causal Inference (Last Token Pooling)
   -> Processed 50/200 patients...
   -> Processed 100/200 patients...
   -> Processed 150/200 patients...
   -> Processed 200/200 patients...
   -> Inference complete.

[4/4] Training Clinical Task Heads...
   -> Split: 160 Train / 40 Test examples.

   --- Task A: Binary Classification (Readmission Risk) ---
   -> ROC-AUC: 1.000

   --- Task B: Multiclass Classification (Phenotype Stage) ---
   -> Accuracy: 1.000

   --- Task C: Regression (Overall Survival Time) ---
   -> MAE: 4.02 months

   --- Task D: Survival Analysis (Cox Proportional Hazards) ---
   -> Projecting embeddings to 10D PCA for stability...
   -> C-Index: 0.827
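
The C-Index reported for Task D is a ranking metric: among pairs of patients whose outcomes can be compared, it is the fraction where the patient with the higher predicted risk has the earlier event. A value of 0.5 corresponds to random ranking and 1.0 to perfect ranking. The sketch below is a plain-Python version of Harrell's concordance index, written only to illustrate the definition. How quickstart-pipeline.py actually computes the metric is not shown in this guide, and the function name and toy data here are assumptions.

```python
def concordance_index(times, events, risk_scores):
    """Harrell's C: share of comparable pairs ranked correctly by risk score.

    times       -- observed follow-up time per subject
    events      -- 1 if the event was observed, 0 if the subject was censored
    risk_scores -- predicted risk; higher means an earlier expected event
    """
    concordant, comparable = 0.0, 0
    n = len(times)
    for i in range(n):
        for j in range(n):
            # A pair is comparable when subject i has an observed event
            # strictly before subject j's follow-up time.
            if events[i] and times[i] < times[j]:
                comparable += 1
                if risk_scores[i] > risk_scores[j]:
                    concordant += 1.0
                elif risk_scores[i] == risk_scores[j]:
                    concordant += 0.5  # ties in risk count as half
    if comparable == 0:
        raise ValueError("no comparable pairs")
    return concordant / comparable


# Toy data: four subjects, the third one is censored at month 6.
times = [2, 4, 6, 8]
events = [1, 1, 0, 1]
risk_scores = [0.9, 0.4, 0.5, 0.1]
print(concordance_index(times, events, risk_scores))  # 0.8
```

In the toy data, five pairs are comparable and four of them are ranked correctly, which gives 0.8. The one miss is the second subject, who had an event at month 4 but was scored as lower risk than the censored subject still under follow-up at month 6.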