Models: Multiple sizes from 97M to 600M parameters
Task: Medical image encoding, feature extraction, risk assessment
SMB-Vision is a family of vision encoders trained on medical imaging data including radiology, pathology, and CT scans. These models serve as the visual backbone for multimodal biomedical AI applications.
Available Models
Model                            Parameters  Specialty               HuggingFace
smb-vision-v0-risk               0.6B        Risk assessment         Link
smb-vision-v0-mim                0.6B        Masked image modeling   Link
smb-vision-large                 0.3B        General encoder         Link
smb-vision-base                  97M         General encoder         Link
smb-vision-ct-base-0519          97M         CT-specific             Link
smb-vision-vjepa2-vitl-384-256   0.3B        V-JEPA2 architecture    Link
Environment Activation
source standard_model/bin/activate
Usage
from transformers import AutoModel, AutoProcessor
from PIL import Image
import torch

# Load model and processor
model = AutoModel.from_pretrained("standardmodelbio/smb-vision-base")
processor = AutoProcessor.from_pretrained("standardmodelbio/smb-vision-base")

# Move to GPU if available
device = "cuda" if torch.cuda.is_available() else "cpu"
model = model.to(device)
model.eval()

# Load and process image
image = Image.open("chest_xray.png")
inputs = processor(images=image, return_tensors="pt").to(device)

# Extract features
with torch.no_grad():
    outputs = model(**inputs)
features = outputs.last_hidden_state
print(f"Feature shape: {features.shape}")
Model Selection Guide
General-purpose encoding
Use smb-vision-base (97M) for a lightweight encoder or smb-vision-large (0.3B) for higher capacity.
model = AutoModel.from_pretrained("standardmodelbio/smb-vision-base")

CT imaging
Use smb-vision-ct-base-0519, which is specifically trained on CT imaging data.
model = AutoModel.from_pretrained("standardmodelbio/smb-vision-ct-base-0519")

Risk stratification from images
Use smb-vision-v0-risk, which is optimized for risk assessment tasks.
model = AutoModel.from_pretrained("standardmodelbio/smb-vision-v0-risk")

Self-supervised pretraining
Use smb-vision-v0-mim, trained with masked image modeling for transfer learning.
model = AutoModel.from_pretrained("standardmodelbio/smb-vision-v0-mim")
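The guide above can be captured in a small task-to-checkpoint lookup. This is only a convenience sketch: the task keys are illustrative names I chose, not an official API; the checkpoint names come from the table above.

```python
CHECKPOINTS = {
    "general": "standardmodelbio/smb-vision-base",
    "general-large": "standardmodelbio/smb-vision-large",
    "ct": "standardmodelbio/smb-vision-ct-base-0519",
    "risk": "standardmodelbio/smb-vision-v0-risk",
    "pretraining": "standardmodelbio/smb-vision-v0-mim",
}

def load_encoder(task: str):
    """Load the SMB-Vision checkpoint recommended for a given task."""
    if task not in CHECKPOINTS:
        raise ValueError(f"Unknown task {task!r}; choose from {sorted(CHECKPOINTS)}")
    # Deferred import so the mapping is usable without transformers installed
    from transformers import AutoModel
    return AutoModel.from_pretrained(CHECKPOINTS[task])
```

For example, `load_encoder("ct")` picks the CT-specific checkpoint without hard-coding the repository name at each call site.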
Single Image
from transformers import AutoModel, AutoProcessor
from PIL import Image
import torch

model = AutoModel.from_pretrained("standardmodelbio/smb-vision-base")
processor = AutoProcessor.from_pretrained("standardmodelbio/smb-vision-base")

device = "cuda" if torch.cuda.is_available() else "cpu"
model = model.to(device)
model.eval()

# Load image
image = Image.open("medical_image.png")
inputs = processor(images=image, return_tensors="pt").to(device)

with torch.no_grad():
    outputs = model(**inputs)

# Get CLS token embedding (global image representation)
cls_embedding = outputs.last_hidden_state[:, 0, :]

# Or pool all patch embeddings
pooled_embedding = outputs.last_hidden_state.mean(dim=1)

print(f"CLS embedding shape: {cls_embedding.shape}")
print(f"Pooled embedding shape: {pooled_embedding.shape}")
Batch Processing
from pathlib import Path

# Load multiple images
image_paths = list(Path("images/").glob("*.png"))
images = [Image.open(p) for p in image_paths]

# Process the batch; the processor resizes every image to the model's
# fixed input size, so no padding argument is needed
inputs = processor(images=images, return_tensors="pt").to(device)

with torch.no_grad():
    outputs = model(**inputs)

batch_embeddings = outputs.last_hidden_state[:, 0, :]  # CLS tokens
print(f"Batch embeddings shape: {batch_embeddings.shape}")  # [N, hidden_dim]
Working with CT Volumes
For 3D CT volumes, process slice by slice and aggregate:
import numpy as np

def encode_ct_volume(volume, model, processor, device):
    """
    Encode a 3D CT volume by processing slices.

    Args:
        volume: numpy array of shape (D, H, W) or (D, H, W, C)

    Returns:
        Aggregated embedding for the volume
    """
    slice_embeddings = []
    for i in range(volume.shape[0]):
        # Convert slice to PIL Image
        slice_img = Image.fromarray(volume[i].astype(np.uint8))
        inputs = processor(images=slice_img, return_tensors="pt").to(device)
        with torch.no_grad():
            outputs = model(**inputs)
        embedding = outputs.last_hidden_state[:, 0, :]
        slice_embeddings.append(embedding)

    # Aggregate slice embeddings by averaging over the slice axis
    volume_embedding = torch.stack(slice_embeddings).mean(dim=0)
    return volume_embedding
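Note that `astype(np.uint8)` assumes each slice is already scaled to 0-255. Raw CT data is usually in Hounsfield units (roughly -1024 to 3071), where a direct uint8 cast would wrap around, so it is common to window and rescale each slice first. A minimal sketch; the window bounds below are illustrative soft-tissue values, not values from this project:

```python
import numpy as np

def window_ct_slice(slice_hu, lo=-1000.0, hi=400.0):
    """Clip a CT slice in Hounsfield units to [lo, hi], then rescale to uint8 0-255."""
    clipped = np.clip(slice_hu.astype(np.float32), lo, hi)
    scaled = (clipped - lo) / (hi - lo) * 255.0
    return scaled.astype(np.uint8)

# Example: a small synthetic slice spanning the full HU range
slice_hu = np.linspace(-1024, 3071, 16).reshape(4, 4)
win = window_ct_slice(slice_hu)
print(win.min(), win.max())  # 0 255
```

The windowed slice can then be passed to `Image.fromarray` inside `encode_ct_volume` in place of the raw cast.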
Use Cases
Image Classification: Classify medical images by training a linear probe on embeddings.
Similarity Search: Find similar cases using embedding cosine similarity.
Multimodal Fusion: Combine with text/EHR embeddings for multimodal models.
Anomaly Detection: Detect unusual findings via embedding distance from normal cases.
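As a sketch of the similarity-search use case, cosine similarity over a matrix of stored embeddings can be computed with plain NumPy. The embeddings below are random placeholders standing in for stored CLS vectors:

```python
import numpy as np

def top_k_similar(query, database, k=3):
    """Return indices of the k database rows most cosine-similar to the query."""
    q = query / np.linalg.norm(query)
    db = database / np.linalg.norm(database, axis=1, keepdims=True)
    sims = db @ q                      # cosine similarity per row
    return np.argsort(sims)[::-1][:k]  # highest similarity first

rng = np.random.default_rng(0)
database = rng.normal(size=(100, 768))              # e.g. stored CLS embeddings
query = database[42] + 0.01 * rng.normal(size=768)  # near-duplicate of case 42
print(top_k_similar(query, database, k=3))          # case 42 ranks first
```

For large databases the normalized matrix would typically be precomputed once, or handed to a vector index instead of brute-force search.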
Memory Requirements
Model               Parameters  GPU Memory (fp32)  GPU Memory (fp16)
smb-vision-base     97M         4 GB               2 GB
smb-vision-large    0.3B        8 GB               4 GB
smb-vision-v0-risk  0.6B        12 GB              6 GB
smb-vision-v0-mim   0.6B        12 GB              6 GB
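The figures above include activations and framework overhead, not just weights; the weights alone are roughly parameter count times bytes per value. A back-of-the-envelope check (my own arithmetic, not taken from the table):

```python
def weight_memory_gb(n_params, bytes_per_param):
    """Approximate memory for model weights alone, in GB."""
    return n_params * bytes_per_param / 1e9

# 0.6B parameters in fp32 (4 bytes/param) vs fp16 (2 bytes/param)
print(round(weight_memory_gb(600e6, 4), 2))  # 2.4 GB of weights in fp32
print(round(weight_memory_gb(600e6, 2), 2))  # 1.2 GB in fp16
```

The gap between the ~2.4 GB of fp32 weights and the 12 GB budgeted in the table is headroom for activations, gradients-free inference buffers, and CUDA context.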
Float16 Inference
model = AutoModel.from_pretrained(
    "standardmodelbio/smb-vision-base",
    torch_dtype=torch.float16,
    device_map="auto"
)
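With a float16 model, note that processors return float32 pixel tensors by default, and a dtype mismatch at the first convolution is a common failure. One way to handle it is to cast the floating-point inputs to the model's dtype before the forward pass; a minimal sketch, where the random tensor stands in for real processor output:

```python
import torch

# Stand-in for processor output such as {"pixel_values": ...} (float32 by default)
inputs = {"pixel_values": torch.randn(1, 3, 224, 224)}

# Cast floating-point tensors to float16 to match the model; leave others (e.g. masks) alone
inputs = {k: v.to(dtype=torch.float16) if v.is_floating_point() else v
          for k, v in inputs.items()}
print(inputs["pixel_values"].dtype)  # torch.float16
```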
Research
Advancing High Resolution Vision-Language Models in Biomedicine
Read the paper on our vision-language approach.