Children's Hospital Los Angeles — Office of Enterprise Data
March 2026

Clinical LLM Platform — Architecture & Cost

Fabric + Azure AI Foundry/ML · Qwen3-32B Fine-tuning · 6 Clinical Extraction Use Cases
Model Selection
Text-only projects (5 of 6)
Qwen3-32B
  • Central Venous Catheters (Retrospective)
  • NICCU CVC Dashboard (QI)
  • Agitation Prediction (Retrospective)
  • eBikes / eScooters (Retrospective)
  • Dietitian / Nutrition (Retrospective)
Multimodal project (1 of 6)
Qwen2.5-VL-32B or GPT-4o
  • X-ray Object Interpretation (image + notes)
  • Start with GPT-4o (no GPU needed, zero-shot)
  • Fine-tune Qwen2.5-VL if accuracy insufficient
  • LLaVA-Med as fallback (bring-your-own-weights)
Architecture — Data Flow & Compute
Each stage below lists: Platform (Microsoft Fabric and/or Azure ML / Foundry) · Component · Details · Compute / Infra.
1. Data Storage — Clinical notes & structured data
   Platform: Fabric · Lakehouse / ADLS Gen2
   Details: Notes as files or Delta table rows. Structured data (MRNs, codes) in Warehouse.
   Compute: No GPU. Fabric F-SKU capacity (existing). West US 3.
2. Cohort Query — Filter patients by coded criteria
   Platform: Fabric · Warehouse SQL endpoint
   Details: Filter by dept, date range, procedure codes. Returns MRN list.
   Compute: No GPU. CPU-only Fabric compute.
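As a sketch of stage 2, the query below shows the shape of a cohort pull against the Warehouse SQL endpoint. The table and column names (`clinical_encounters`, `dept_code`, `proc_code`) are hypothetical placeholders, not the actual CHLA schema, and real code should use parameterized queries rather than string interpolation:

```python
# Build a cohort query returning an MRN list from coded criteria.
# Schema names are illustrative assumptions, not the real Warehouse schema.

def build_cohort_query(dept: str, start: str, end: str, proc_codes: list[str]) -> str:
    """Return SQL selecting distinct MRNs matching dept, date range, and procedure codes."""
    codes = ", ".join(f"'{c}'" for c in proc_codes)
    return (
        "SELECT DISTINCT mrn FROM clinical_encounters "
        f"WHERE dept_code = '{dept}' "
        f"AND encounter_date BETWEEN '{start}' AND '{end}' "
        f"AND proc_code IN ({codes})"
    )

sql = build_cohort_query("NICCU", "2024-01-01", "2024-12-31", ["36555", "36556"])
# In a Fabric notebook this would run via the Warehouse endpoint, e.g.:
# mrn_df = spark.sql(sql)
```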
3. Training Data Prep — Format labeled examples for fine-tuning
   Platform: Fabric · Spark Notebook — extract notes + gold-standard labels, format as JSONL (OpenAI chat format), export to Blob.
   Platform: Foundry · Upload JSONL to Data + Indexes.
   Compute: No GPU. Fabric Spark cluster (CPU). Egress ~$0.08/GB West US 3 → East US 2.
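A minimal sketch of the JSONL formatting step: each labeled note becomes one OpenAI chat-format record. The system prompt and label fields here are illustrative assumptions, not the project's actual prompts:

```python
import json

# Format one labeled note as an OpenAI chat-format JSONL line (stage 3).
# Prompt text and label schema are illustrative placeholders.

def to_chat_example(note_text: str, gold_label: dict) -> str:
    record = {
        "messages": [
            {"role": "system", "content": "Extract CVC details from the clinical note as JSON."},
            {"role": "user", "content": note_text},
            {"role": "assistant", "content": json.dumps(gold_label)},
        ]
    }
    return json.dumps(record)

line = to_chat_example(
    "PICC placed in right basilic vein...",
    {"line_type": "PICC", "site": "right basilic"},
)
# Each note becomes one line of the JSONL file exported to Blob:
# with open("train.jsonl", "w") as f: f.write(line + "\n")
```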
4. Fine-tuning Job — Supervised fine-tuning on clinical data
   Platform: Foundry · Azure ML fine-tuning job
   Details: QLoRA SFT on 100–500 labeled examples per task. One-time or quarterly refresh. Training type: Regional (PHI).
   Compute: GPU required. ND A100 v4-series (4× A100 80GB), $32/hr; ~2–4 hrs/run. Spot option: ~$6/hr.
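The job parameters stage 4 implies can be sketched as below. The field names are illustrative, not the exact Azure ML schema; actual submission would go through the azure-ai-ml SDK or the Foundry fine-tuning UI:

```python
# Illustrative fine-tuning job spec (field names are assumptions, not the
# azure-ai-ml schema). Hyperparameter values are placeholders.

job_spec = {
    "base_model": "Qwen3-32B",
    "method": "qlora_sft",
    "training_data": "azureml://datastores/workspaceblobstore/paths/train.jsonl",
    "compute": {"sku": "ND A100 v4 (4x A100 80GB)", "spot": True},
    "hyperparameters": {"epochs": 3, "lora_rank": 16, "learning_rate": 2e-4},
    "training_type": "Regional",  # keeps PHI in-region, per stage 4
}

# Per-run cost check from the table's figures (~3 hrs/run):
run_cost = 3 * 32   # on-demand: $96/run
spot_cost = 3 * 6   # spot: ~$18/run
```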
5. Model Storage — Fine-tuned checkpoint saved
   Platform: Foundry · Azure Blob Storage (East US 2)
   Details: ~20 GB adapter weights + base model reference.
   Compute: Storage only. ~$0.40/month (Azure Blob LRS, East US 2).
6. Inference Endpoint — Deploy fine-tuned model as REST API
   Platform: Foundry · Managed Online Endpoint
   Details: Scale-to-zero when idle. OpenAI-compatible API. Private VNet for PHI. Regional deployment (East US 2).
   Compute: GPU required. Standard_NC24ads_A100_v4 (1× A100 80GB), $3.67/hr active, $0/hr idle (scale-to-zero); ~2–5 min cold start.
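A sketch of a client call against the endpoint's OpenAI-compatible chat API. The endpoint URL, model name, and prompt are placeholders; the request body follows the standard chat-completions shape:

```python
import json
import urllib.request

# Placeholder endpoint URL; the real value comes from the Foundry deployment.
ENDPOINT = "https://<endpoint-name>.eastus2.inference.ml.azure.com/v1/chat/completions"

def build_request(note_text: str, model: str = "qwen3-32b-ft") -> dict:
    """Chat-completions payload for one extraction call (prompt text is illustrative)."""
    return {
        "model": model,
        "messages": [
            {"role": "system", "content": "Extract the requested fields as JSON."},
            {"role": "user", "content": note_text},
        ],
        "temperature": 0.0,  # deterministic extraction
    }

def call_endpoint(note_text: str, api_key: str) -> dict:
    req = urllib.request.Request(
        ENDPOINT,
        data=json.dumps(build_request(note_text)).encode(),
        headers={"Authorization": f"Bearer {api_key}", "Content-Type": "application/json"},
    )
    # First call after idle may wait out the 2-5 min scale-from-zero cold start.
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())
```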
7. Batch Inference — Process notes at scale
   Platform: Fabric · Spark Notebook — loop MRN list, retrieve notes, call endpoint, collect structured JSON responses.
   Platform: Foundry · Fine-tuned Qwen3-32B API.
   Compute: No extra cost. Fabric Spark (CPU); network egress negligible (<$1/run at ~500 MB).
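Stage 7's loop can be sketched with note retrieval and the endpoint call passed in as callables, so the control flow is shown independently of Fabric and Azure specifics:

```python
import json

# Batch loop over the cohort MRN list (stage 7). get_notes and call_endpoint
# are injected; real versions would read the Lakehouse and POST to the endpoint.

def batch_extract(mrns, get_notes, call_endpoint):
    """Run every note for each MRN through the model; collect parsed JSON rows."""
    results = []
    for mrn in mrns:
        for note in get_notes(mrn):
            raw = call_endpoint(note)  # fine-tuned Qwen3-32B, OpenAI-compatible API
            row = json.loads(raw)      # model is fine-tuned to emit structured JSON
            row["mrn"] = mrn
            results.append(row)
    return results

# Stub example with hypothetical data:
rows = batch_extract(
    ["001"],
    get_notes=lambda mrn: ["PICC note text"],
    call_endpoint=lambda note: '{"line_type": "PICC"}',
)
# rows → [{"line_type": "PICC", "mrn": "001"}]
```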
8. Results & Reporting — Store outputs, serve dashboards
   Platform: Fabric · Lakehouse → Power BI
   Details: Write extraction results to Delta table. QI dashboard (NICCU CVC). Research exports for retrospective projects.
   Compute: No GPU. Power BI Premium / Fabric capacity (existing licensing).
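Before the Delta write, the per-note JSON results are shaped into flat rows. Column names below are illustrative assumptions; in a Fabric notebook the final write would be a standard Spark Delta save:

```python
# Flatten extraction results into tabular rows for the Delta table (stage 8).
# Column names are placeholders. In a Fabric notebook the write would be:
#   spark.createDataFrame(rows).write.format("delta").mode("append") \
#        .saveAsTable("cvc_extractions")

def to_table_rows(extractions: list[dict], run_date: str) -> list[dict]:
    rows = []
    for ex in extractions:
        rows.append({
            "mrn": ex.get("mrn"),
            "line_type": ex.get("line_type"),
            "site": ex.get("site"),
            "run_date": run_date,
        })
    return rows

rows = to_table_rows(
    [{"mrn": "001", "line_type": "PICC", "site": "right basilic"}],
    "2026-03-01",
)
```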
Cost Estimate (Monthly)
Component | Resource | When Active | Unit Cost | Monthly Estimate
Fabric platform | Existing F-SKU capacity | Always | Existing license | $0 incremental
Inference endpoint | NC24ads_A100_v4 (1× A100 80GB) | Business hours only (~150 active hrs/month) | $3.67/hr | ~$550
Fine-tuning job | ND A100 v4 (4× A100 80GB) | Once/quarter (~3 hrs/run) | $32/hr (or ~$6/hr spot) | ~$96/quarter ≈ $32/mo (spot: ~$18/quarter ≈ $6/mo)
Model storage | Azure Blob LRS, East US 2 | Always | $0.018/GB/month | ~$0.40
Training data storage | Azure Blob (JSONL files, ~1 GB) | Always | $0.018/GB/month | ~$0.02
Cross-region data transfer | West US 3 → East US 2 egress | Per batch run (~500 MB) | $0.08/GB | <$1
GPT-4o (X-ray project) | Serverless, pay per token | On demand | ~$5/1M tokens | ~$10–50 (volume dependent)
Total estimated monthly cost | | | | ~$560–660 / month
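A back-of-envelope check of the total, using the table's unit costs. The ~150 active endpoint hours/month is an assumption that reconciles the $3.67/hr rate with the ~$550 line item; fine-tuning is amortized across the quarter:

```python
# Monthly cost check from the table's unit costs and assumed utilization.
endpoint = 150 * 3.67            # ~$550: inference endpoint, ~150 active hrs/month
fine_tune = (3 * 32) / 3         # $96/quarter amortized to ~$32/month (on-demand)
storage = 0.40 + 0.02            # model + training-data blobs
egress = 1.00                    # cross-region transfer, upper bound per month
gpt4o_low, gpt4o_high = 10, 50   # X-ray project, volume dependent

low = endpoint + fine_tune + storage + egress + gpt4o_low
high = endpoint + fine_tune + storage + egress + gpt4o_high
# low ≈ $594, high ≈ $634 — inside the ~$560–660/month band
```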