Project 05 · AI Eval Harness

LLM-as-judge · Scalable oversight · Power BI · GIS / Leaflet · Azure Container Apps · SARIMA + Prophet

Healthcare Dashboard Ops

“Scalable oversight for production BI.”

LLM-as-judge platform that gates a Medicaid Rx delivery pipeline — embedded Power BI dashboard, GIS choropleth, and a 12-month forecast — behind a 16-evaluator quality harness. Pick the surface to explore.

Open the dashboard → · Explore the map ↗ · Source on GitHub

Who this is for

06 personas

▲ State Medicaid agencies
Spend-driver attribution & rebate forecasting
Forecast · cohort
● Health plans (Humana, BCBS, Centene)
Formulary benchmarking vs the public baseline
Brand–generic substitution
◆ Pharma commercial ops (Gilead, AbbVie, Novo Nordisk)
Brand performance by state × payer mix
Share of class · YoY
✦ Public health & policy researchers
GLP-1 surge tracking, opioid stewardship
Reproducible briefs
⬢ AI evaluation engineers
Reference implementation of scalable oversight
Anthropic-style oversight
⟡ BI engineering teams in regulated industries
Automated quality gates for dashboard delivery
CI / governance

Data & compliance: CMS · Public data · No PHI · Power BI Pro / FTL · WCAG 2.1 AA target · Azure Blob · DuckDB · Anthropic Claude

Datasets

What powers the pipeline

04 sources · public

CMS Medicaid State Drug Utilization Data ↗

Federal prescription utilization data, aggregated by state, drug (NDC), and quarter, covering every state and every quarter since 1991. The canonical run ingests 2020–2025: 31M rows and $480B reimbursed.

31M rows · 52 states + territories · Refreshed quarterly · Public domain

US Census ACS5 (American Community Survey, 5-yr) ↗

State + county population denominators for per-capita normalization — "$/person" makes Wyoming and California comparable.

Vintage 2024 · Block-group resolution
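A minimal sketch of the per-capita normalization described above. The function name and dictionary layout are hypothetical, and the figures are illustrative placeholders rather than real CMS or Census values.

```python
def per_capita_spend(reimbursed_by_state, population_by_state):
    """Return dollars reimbursed per resident for each state with a usable denominator."""
    result = {}
    for state, dollars in reimbursed_by_state.items():
        population = population_by_state.get(state)
        if not population:  # skip states with a missing or zero population
            continue
        result[state] = dollars / population
    return result


# Placeholder numbers chosen for easy arithmetic, not real data.
spend = {"WY": 50_000_000.0, "CA": 4_000_000_000.0}
population = {"WY": 500_000, "CA": 40_000_000}
rates = per_capita_spend(spend, population)
```

With these placeholder inputs both states land on the same "$/person" figure even though their raw totals differ by a factor of 80, which is the comparison the normalization is meant to enable.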

US Census TIGER state boundaries ↗

10m-simplified TopoJSON for the GIS choropleth. ~115 KB, no tile basemap dependency.

TopoJSON · us-atlas 3.0

FDA Orange Book (Approved Drug Products) ↗

Brand–generic mappings, therapeutic equivalence ratings, and approval dates to drive the substitution-opportunity panels.

Brand ↔ generic · TE codes

Use cases · detail

How each persona uses the pipeline

06 personas

▲ State Medicaid agencies

Spend-driver attribution & rebate forecasting

Where is the next $1B coming from? The per-state breakout by drug class shows that GLP-1 spend grew faster than total Rx volume in 31 states. The 12-month SARIMA + Prophet ensemble flags states whose budget forecasts misjudge GLP-1 exposure.

Forecast · cohort
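A rough sketch of how an equal-weight blend and a 6-month holdout backtest (both named under "How it works") might fit together. The two input forecasts stand in for SARIMA and Prophet output, the helper names are hypothetical, and MAPE is an assumed error metric that the page does not specify.

```python
def split_holdout(series, holdout=6):
    """Split a monthly series into a training part and the final `holdout` months."""
    if holdout <= 0 or holdout >= len(series):
        raise ValueError("holdout must be positive and shorter than the series")
    return series[:-holdout], series[-holdout:]


def blend_equal_weight(forecast_a, forecast_b):
    """Average two same-length forecasts point by point."""
    if len(forecast_a) != len(forecast_b):
        raise ValueError("forecasts must cover the same horizon")
    return [(a + b) / 2 for a, b in zip(forecast_a, forecast_b)]


def mape(actual, predicted):
    """Mean absolute percentage error, ignoring months where the actual is zero."""
    pairs = [(a, p) for a, p in zip(actual, predicted) if a != 0]
    if not pairs:
        raise ValueError("no non-zero actuals to score against")
    return sum(abs(a - p) / abs(a) for a, p in pairs) / len(pairs)


# Placeholder series: 18 flat months, so the last 6 form the holdout.
history = [100.0] * 18
train, holdout_actual = split_holdout(history, holdout=6)

# Stand-ins for the two model forecasts over the holdout window.
model_a_forecast = [90.0] * 6
model_b_forecast = [120.0] * 6

blended = blend_equal_weight(model_a_forecast, model_b_forecast)
blend_error = mape(holdout_actual, blended)
```

In this toy case the blend (105 per month) is closer to the actuals than either input, which is the usual motivation for averaging two models whose errors point in opposite directions.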

● Health plans (Humana, BCBS, Centene)

Formulary benchmarking vs the public baseline

Plan analysts overlay their internal spend mix on top of the public Medicaid baseline to quantify negotiated-rate effectiveness by drug class — and identify substitution opportunity where a plan's branded usage runs above public utilization.

Brand–generic substitution

◆ Pharma commercial ops (Gilead, AbbVie, Novo Nordisk)

Brand performance by state × payer mix

BIKTARVY, OZEMPIC, and HUMIRA each show distinct regional patterns. The pipeline surfaces per-state share-of-class trajectories so brand teams can see uptake plateaus and competitive switching with each quarterly refresh.

Share of class · YoY

✦ Public health & policy researchers

GLP-1 surge tracking, opioid stewardship

Time-series cohorts on prescription classes (GLP-1 agonists, opioid analgesics, biologics) feed into Brookings / KFF / Commonwealth-style policy briefs — with reproducible AUDIT.md provenance for every chart.

Reproducible briefs

⬢ AI evaluation engineers

Reference implementation of scalable oversight

Three-tier judge hierarchy (deterministic → LLM → paired auditor) applied to a non-trivial production artifact (a 5-page BI dashboard + GIS layer + forecast). Drop-in pattern for any team building automated quality gates around generated content.

Anthropic-style oversight
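One way the three-tier hierarchy could be wired up, as a sketch only. Every name here is hypothetical, the LLM judge is a stub that makes no model call, and the tiers run sequentially for readability even though the harness may schedule them differently.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

Result = Tuple[float, str]  # (score between 0 and 1, short note)


@dataclass
class Finding:
    evaluator: str
    tier: str  # "deterministic", "llm", or "auditor"
    score: float
    note: str


def run_hierarchy(
    artifact: dict,
    deterministic: Dict[str, Callable[[dict], Result]],
    llm_judges: Dict[str, Callable[[dict], Result]],
    auditors: Dict[str, Callable[[dict, List[Finding]], Result]],
) -> List[Finding]:
    """Run deterministic checks, then LLM judges, then auditors over the deterministic findings."""
    findings: List[Finding] = []
    for name, check in deterministic.items():
        score, note = check(artifact)
        findings.append(Finding(name, "deterministic", score, note))
    for name, judge in llm_judges.items():
        score, note = judge(artifact)
        findings.append(Finding(name, "llm", score, note))
    deterministic_findings = [f for f in findings if f.tier == "deterministic"]
    for name, audit in auditors.items():
        score, note = audit(artifact, deterministic_findings)
        findings.append(Finding(name, "auditor", score, note))
    return findings


# Hypothetical stand-ins for real evaluators.
def measure_count_check(artifact: dict) -> Result:
    n = len(artifact.get("measures", []))
    return (1.0 if n > 0 else 0.0, f"{n} measures found")


def stub_llm_judge(artifact: dict) -> Result:
    return (0.9, "stubbed judgment, no model call made")


def notes_present_auditor(artifact: dict, deterministic_findings: List[Finding]) -> Result:
    ok = all(f.note for f in deterministic_findings)
    return (1.0 if ok else 0.0, "notes present" if ok else "a finding lacks a note")


findings = run_hierarchy(
    {"measures": ["Total Spend", "Rx Count"]},
    {"measure_count": measure_count_check},
    {"stub_judge": stub_llm_judge},
    {"notes_present": notes_present_auditor},
)
```

Passing the deterministic findings into the auditor tier is what lets a second reviewer re-inspect the first tier's output rather than re-evaluating the artifact from scratch.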

⟡ BI engineering teams in regulated industries

Automated quality gates for dashboard delivery

Every PR that touches dashboard_spec.yml triggers the 16-evaluator harness in CI. Reviewers see severity-banded AUDIT.md before they merge — no more "works on my machine" Power BI surprises in production.

CI / governance

Sample audit

What ships with every release

AUDIT.md · run #142 · medicaid_sdud_2026

Composite

0.91

Judges passed

16/16

Verdict

SHIP

[OK] dax_syntax · 1.00 · 0 errors across 39 measures

[OK] phi_leakage · 1.00 · no patient identifiers detected

[WARN] viz_choice · 0.78 · Pareto chart on Page 1 could be a bar

[OK] forecast_methodology · 0.92 · ensemble blend justified

[OK] domain_relevance · 0.95 · Medicaid spend framing on-target

[OK] star_schema_design · 0.97 · 4 relationships, no fact-to-fact joins

[OK] accessibility_wcag · 0.88 · contrast 4.7:1 on all text

[OK] governance_rls · 1.00 · state_code RLS role enforced
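For anyone scripting against the report, the finding lines in the sample follow a regular "[STATUS] evaluator · score · note" shape. The parser below is a sketch inferred from that sample only; the real AUDIT.md grammar may differ, and the names are hypothetical.

```python
import re
from typing import NamedTuple, Optional


class AuditLine(NamedTuple):
    status: str
    evaluator: str
    score: float
    note: str


# Accepts an optional space after the status tag, since both spellings may appear.
_LINE = re.compile(
    r"^\[(?P<status>[A-Z]+)\]\s*(?P<evaluator>\w+)\s*·\s*"
    r"(?P<score>\d+(?:\.\d+)?)\s*·\s*(?P<note>.+)$"
)


def parse_audit_line(line: str) -> Optional[AuditLine]:
    """Parse one finding line; return None for anything that is not a finding."""
    match = _LINE.match(line.strip())
    if match is None:
        return None
    return AuditLine(
        status=match["status"],
        evaluator=match["evaluator"],
        score=float(match["score"]),
        note=match["note"].strip(),
    )


warn = parse_audit_line("[WARN] viz_choice · 0.78 · Pareto chart on Page 1 could be a bar")
not_a_finding = parse_audit_line("Composite")
```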

How it works

From spec to shipped artifact

01
Spec submitted
dashboard_spec.yml declares the domain, data sources, audience, KPIs, RLS roles, and forecast methodology. Single source of truth.
02
Bronze ingest
Year-partitioned CSVs pulled from data.medicaid.gov to Azure Blob (stasiprod1eus2/healthcare/bronze). ~600 MB compressed, refreshed quarterly.
03
Silver model
DuckDB star schema: fact_sdud + dim_state + dim_drug + dim_date. Drug-class taxonomy applied (brand-name aware so HUMIRA → Autoimmune, OZEMPIC → GLP-1).
04
Forecast
12-month SARIMA + Prophet ensemble per state × class. Equal-weight blend; 6-month holdout for backtest.
05
Generate
DAX measures + Tabular Object Model + page-layout JSON + narrative. Published to Power BI via Fabric REST.
06
Audit
16 evaluators run in parallel: 8 deterministic Python + 5 Claude judges + 3 paired auditors re-inspecting deterministic findings.
07
Verdict
Severity-banded AUDIT.md + weighted composite scorecard. ≥ 0.85 = Ship · 0.70 to < 0.85 = Tighten · < 0.70 or any MISS = Re-work.
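The verdict bands in step 07 can be sketched as a small function. This assumes a composite of exactly 0.85 ships and that a single MISS overrides any score; the evaluator weights are hypothetical, since the page does not publish them.

```python
def weighted_composite(scores, weights):
    """Weighted mean of evaluator scores; `weights` must cover every scored evaluator."""
    total_weight = sum(weights[name] for name in scores)
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive number")
    return sum(scores[name] * weights[name] for name in scores) / total_weight


def verdict(composite, statuses):
    """Map a composite score and per-evaluator statuses to Ship / Tighten / Re-work."""
    if composite < 0.70 or "MISS" in statuses:
        return "Re-work"
    if composite >= 0.85:
        return "Ship"
    return "Tighten"


# Hypothetical two-evaluator example with equal weights.
example_composite = weighted_composite(
    {"dax_syntax": 1.0, "viz_choice": 0.5},
    {"dax_syntax": 1.0, "viz_choice": 1.0},
)
example_verdict = verdict(example_composite, ["OK", "WARN"])
```

Checking for a MISS before the score bands means a high composite cannot mask a hard failure in one evaluator.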

System architecture

Pipeline at a glance

Spec · dashboard_spec.yml

❯

Bronze · Azure Blob CSV

❯

Silver · DuckDB star schema

❯

Forecast · SARIMA + Prophet

❯

Generate · TOM + DAX + JSON

❯

Audit · 16 evaluators

❯

Verdict · Ship / Tighten / Re-work

Eval framework

8 deterministic Python · 5 Claude LLM · 3 paired auditors · severity-banded AUDIT.md

Compute

Azure Container Apps Jobs (quarterly cron) · DuckDB in-process · Power BI Fabric REST

Reproducibility

Every artifact regenerable from one dashboard_spec.yml · domain-neutral by design
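Because everything is regenerated from one spec, a pre-flight check that the spec is complete is a natural first gate. The sketch below assumes the spec has already been parsed into a dictionary; the key names are hypothetical, chosen to mirror the fields listed in step 01, and are not taken from the real dashboard_spec.yml schema.

```python
# Hypothetical key names mirroring the fields step 01 says the spec declares.
REQUIRED_KEYS = (
    "domain",
    "data_sources",
    "audience",
    "kpis",
    "rls_roles",
    "forecast_methodology",
)


def missing_spec_keys(spec):
    """Return the required keys that are absent or empty in a parsed spec."""
    empty_values = (None, "", [], {})
    return [key for key in REQUIRED_KEYS if spec.get(key) in empty_values]


# Placeholder spec with two gaps: no RLS roles and an empty KPI list.
draft_spec = {
    "domain": "medicaid_rx",
    "data_sources": ["cms_sdud"],
    "audience": "state agencies",
    "kpis": [],
    "forecast_methodology": "sarima_prophet_ensemble",
}
gaps = missing_spec_keys(draft_spec)
```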

Pick a surface

Three ways to read the same $100B

Power BI dashboard →

5 pages · 17 visuals · live embed

GIS choropleth →

state-level · click for detail

GitHub source ↗

harness · spec · runbooks