Models: Multiple sizes from 97M to 600M parameters
Task: Medical image encoding, feature extraction, risk assessment
SMB-Vision is a family of vision encoders trained on medical imaging data including radiology, pathology, and CT scans. These models serve as the visual backbone for multimodal biomedical AI applications.

Available Models

Model                           Parameters  Specialty               HuggingFace
smb-vision-v0-risk              0.6B        Risk assessment         Link
smb-vision-v0-mim               0.6B        Masked image modeling   Link
smb-vision-large                0.3B        General encoder         Link
smb-vision-base                 97M         General encoder         Link
smb-vision-ct-base-0519         97M         CT-specific             Link
smb-vision-vjepa2-vitl-384-256  0.3B        V-JEPA2 architecture    Link

Environment Activation

source standard_model/bin/activate
See the Quickstart Guide for environment creation and usage.

Usage

from transformers import AutoModel, AutoProcessor
from PIL import Image
import torch

# Load model and processor
model = AutoModel.from_pretrained("standardmodelbio/smb-vision-base")
processor = AutoProcessor.from_pretrained("standardmodelbio/smb-vision-base")

# Move to GPU
device = "cuda" if torch.cuda.is_available() else "cpu"
model = model.to(device)
model.eval()

# Load and process the image (convert to RGB, since many medical
# images are grayscale and the processor expects 3 channels)
image = Image.open("chest_xray.png").convert("RGB")
inputs = processor(images=image, return_tensors="pt").to(device)

# Extract features
with torch.no_grad():
    outputs = model(**inputs)
    features = outputs.last_hidden_state

print(f"Feature shape: {features.shape}")

Model Selection Guide

General encoding: use smb-vision-base (97M) for a lightweight encoder, or smb-vision-large (0.3B) for higher capacity.

model = AutoModel.from_pretrained("standardmodelbio/smb-vision-base")

CT imaging: use smb-vision-ct-base-0519, trained specifically on CT imaging data.

model = AutoModel.from_pretrained("standardmodelbio/smb-vision-ct-base-0519")

Risk assessment: use smb-vision-v0-risk, optimized for risk assessment tasks.

model = AutoModel.from_pretrained("standardmodelbio/smb-vision-v0-risk")

Transfer learning: use smb-vision-v0-mim, trained with masked image modeling.

model = AutoModel.from_pretrained("standardmodelbio/smb-vision-v0-mim")

Extracting Embeddings

Single Image

from transformers import AutoModel, AutoProcessor
from PIL import Image
import torch

model = AutoModel.from_pretrained("standardmodelbio/smb-vision-base")
processor = AutoProcessor.from_pretrained("standardmodelbio/smb-vision-base")

device = "cuda" if torch.cuda.is_available() else "cpu"
model = model.to(device)
model.eval()

# Load image (convert to RGB for the processor)
image = Image.open("medical_image.png").convert("RGB")
inputs = processor(images=image, return_tensors="pt").to(device)

with torch.no_grad():
    outputs = model(**inputs)
    
    # Get CLS token embedding (global image representation)
    cls_embedding = outputs.last_hidden_state[:, 0, :]
    
    # Or pool all patch embeddings
    pooled_embedding = outputs.last_hidden_state.mean(dim=1)

print(f"CLS embedding shape: {cls_embedding.shape}")
print(f"Pooled embedding shape: {pooled_embedding.shape}")

Batch Processing

from pathlib import Path

# Load multiple images (sorted for a deterministic order)
image_paths = sorted(Path("images/").glob("*.png"))
images = [Image.open(p).convert("RGB") for p in image_paths]

# Process batch; the processor resizes every image to a fixed
# resolution, so no padding argument is needed
inputs = processor(images=images, return_tensors="pt").to(device)

with torch.no_grad():
    outputs = model(**inputs)
    batch_embeddings = outputs.last_hidden_state[:, 0, :]  # CLS tokens

print(f"Batch embeddings shape: {batch_embeddings.shape}")  # [N, hidden_dim]

Working with CT Volumes

For 3D CT volumes, process slice by slice and aggregate:
import numpy as np

def encode_ct_volume(volume, model, processor, device):
    """
    Encode a 3D CT volume by processing slices.
    
    Args:
        volume: numpy array of shape (D, H, W) or (D, H, W, C)
    
    Returns:
        Aggregated embedding for the volume
    """
    slice_embeddings = []
    
    for i in range(volume.shape[0]):
        # Scale each slice to [0, 255] before the uint8 conversion;
        # raw CT data is in Hounsfield units, so a direct cast would
        # clip it (apply windowing appropriate to your data first)
        slice_data = volume[i].astype(np.float32)
        value_range = slice_data.max() - slice_data.min()
        slice_data = (slice_data - slice_data.min()) / (value_range + 1e-8) * 255.0
        slice_img = Image.fromarray(slice_data.astype(np.uint8)).convert("RGB")
        
        inputs = processor(images=slice_img, return_tensors="pt").to(device)
        
        with torch.no_grad():
            outputs = model(**inputs)
            embedding = outputs.last_hidden_state[:, 0, :]
            slice_embeddings.append(embedding)
    
    # Aggregate slice embeddings
    volume_embedding = torch.stack(slice_embeddings).mean(dim=0)
    
    return volume_embedding

Use Cases

Image Classification

Classify medical images by training a linear probe on embeddings.
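A minimal sketch of a linear probe, using synthetic vectors in place of real SMB-Vision embeddings (the 768-dim size, the two classes, and the closed-form ridge-regression classifier are illustrative assumptions; a scikit-learn LogisticRegression would serve equally well):

```python
import numpy as np

rng = np.random.default_rng(0)
hidden_dim = 768

# Stand-ins for CLS embeddings extracted by the frozen encoder
# (e.g. outputs.last_hidden_state[:, 0, :] stacked over a dataset)
class_a = rng.normal(0.0, 1.0, size=(100, hidden_dim))
class_b = rng.normal(0.5, 1.0, size=(100, hidden_dim))
X = np.vstack([class_a, class_b])
y = np.array([0] * 100 + [1] * 100)

# Closed-form ridge-regression probe: the encoder stays frozen,
# only this linear layer is fit
Y = np.eye(2)[y]  # one-hot targets
W = np.linalg.solve(X.T @ X + 1e-3 * np.eye(hidden_dim), X.T @ Y)
pred = (X @ W).argmax(axis=1)
acc = (pred == y).mean()
print(f"Train accuracy: {acc:.2f}")
```

Because the encoder is never updated, probing is cheap to run across many downstream label sets.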

Similarity Search

Find similar cases using embedding cosine similarity.
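A sketch of embedding-based retrieval with plain NumPy; the database here is random stand-in vectors, with one entry deliberately made a scaled copy of the query so it must rank first:

```python
import numpy as np

def cosine_top_k(query, database, k=3):
    """Return indices of the k most similar rows by cosine similarity."""
    q = query / np.linalg.norm(query)
    db = database / np.linalg.norm(database, axis=1, keepdims=True)
    scores = db @ q
    return np.argsort(scores)[::-1][:k], scores

# Toy database of 5 case embeddings; index 2 is a scaled copy of
# the query, so its cosine similarity is exactly 1
rng = np.random.default_rng(1)
database = rng.normal(size=(5, 768))
query = 2.0 * database[2]

top, scores = cosine_top_k(query, database, k=3)
print(f"Most similar case: index {top[0]}")  # index 2
```

For large case banks, the same normalize-then-dot-product pattern maps directly onto approximate nearest-neighbor libraries.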

Multimodal Fusion

Combine with text/EHR embeddings for multimodal models.
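One simple fusion strategy is late fusion by concatenation, sketched below with synthetic vectors; the 768-dim image embedding and 512-dim text/EHR embedding sizes are illustrative assumptions:

```python
import numpy as np

# Stand-ins for an image CLS embedding and a text/EHR embedding
image_embedding = np.random.default_rng(2).normal(size=(1, 768))
text_embedding = np.random.default_rng(3).normal(size=(1, 512))

def l2norm(x):
    # Normalize each modality so neither dominates by raw scale
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

fused = np.concatenate([l2norm(image_embedding), l2norm(text_embedding)], axis=-1)
print(fused.shape)  # (1, 1280)
```

The fused vector can then feed a downstream classifier; cross-attention fusion is a heavier alternative when modalities need to interact earlier.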

Anomaly Detection

Detect unusual findings via embedding distance from normal cases.
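A sketch of distance-based anomaly scoring over synthetic embeddings: build a centroid from known-normal cases, calibrate a threshold from their spread, and flag embeddings far from the centroid (the 3-sigma threshold and dimensions are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(4)
hidden_dim = 768

# Stand-ins for embeddings of known-normal cases
normal_bank = rng.normal(0.0, 1.0, size=(200, hidden_dim))
centroid = normal_bank.mean(axis=0)

# Calibrate a threshold from distances within the normal bank
train_dist = np.linalg.norm(normal_bank - centroid, axis=1)
threshold = train_dist.mean() + 3 * train_dist.std()

def is_anomalous(embedding):
    return np.linalg.norm(embedding - centroid) > threshold

outlier = rng.normal(3.0, 1.0, size=hidden_dim)
print(is_anomalous(outlier))
```

More robust variants score against k-nearest normal neighbors rather than a single centroid, which handles multi-modal "normal" populations better.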

Memory Requirements

Model               Parameters  GPU Memory (fp32)  GPU Memory (fp16)
smb-vision-base     97M         4 GB               2 GB
smb-vision-large    0.3B        8 GB               4 GB
smb-vision-v0-risk  0.6B        12 GB              6 GB
smb-vision-v0-mim   0.6B        12 GB              6 GB

Float16 Inference

model = AutoModel.from_pretrained(
    "standardmodelbio/smb-vision-base",
    torch_dtype=torch.float16,
    device_map="auto"  # requires the accelerate package
)

Research

Advancing High Resolution Vision-Language Models in Biomedicine

Read the paper on our vision-language approach.