Project 05 · AI Eval Harness
LLM-as-judge · Scalable oversight · Power BI · GIS / Leaflet · Azure Container Apps · SARIMA + Prophet

Healthcare Dashboard Ops

“Scalable oversight for production BI.”

An LLM-as-judge platform that gates a Medicaid Rx delivery pipeline (embedded Power BI dashboard, GIS choropleth, and a 12-month forecast) behind a 16-evaluator quality harness.

Who this is for

06 personas
  • State Medicaid agencies

    Spend-driver attribution & rebate forecasting

  • Health plans (Humana, BCBS, Centene)

    Formulary benchmarking vs the public baseline

  • Pharma commercial ops (Gilead, AbbVie, Novo Nordisk)

    Brand performance by state × payer mix

  • Public health & policy researchers

    GLP-1 surge tracking, opioid stewardship

  • AI evaluation engineers

    Reference implementation of scalable oversight

  • BI engineering teams in regulated industries

    Automated quality gates for dashboard delivery

Data & compliance

CMS · Public data · No PHI · Power BI Pro / FTL · WCAG 2.1 AA target · Azure Blob · DuckDB · Anthropic Claude

Datasets

What powers the pipeline

Use cases · detail

How each persona uses the pipeline

State Medicaid agencies

Spend-driver attribution & rebate forecasting

Where is the next $1B coming from? A per-state breakout by drug class shows that GLP-1 spend grew faster than total Rx volume in 31 states. The 12-month SARIMA + Prophet ensemble flags states whose budgets misforecast GLP-1 exposure.

Forecast · cohort

Health plans (Humana, BCBS, Centene)

Formulary benchmarking vs the public baseline

Plan analysts overlay their internal spend mix on the public Medicaid baseline to quantify negotiated-rate effectiveness by drug class, and to identify substitution opportunities where a plan's branded usage runs above public utilization.

Brand–generic substitution

Pharma commercial ops (Gilead, AbbVie, Novo Nordisk)

Brand performance by state × payer mix

BIKTARVY, OZEMPIC, and HUMIRA each show distinct regional patterns. The pipeline surfaces per-state share-of-class trajectories so brand teams see uptake plateaus and competitive switching with each quarterly refresh.

Share of class · YoY

Public health & policy researchers

GLP-1 surge tracking, opioid stewardship

Time-series cohorts on prescription classes (GLP-1 agonists, opioid analgesics, biologics) feed into Brookings / KFF / Commonwealth-style policy briefs — with reproducible AUDIT.md provenance for every chart.

Reproducible briefs

AI evaluation engineers

Reference implementation of scalable oversight

Three-tier judge hierarchy (deterministic → LLM → paired auditor) applied to a non-trivial production artifact (a 5-page BI dashboard + GIS layer + forecast). Drop-in pattern for any team building automated quality gates around generated content.

Anthropic-style oversight

BI engineering teams in regulated industries

Automated quality gates for dashboard delivery

Every PR that touches dashboard_spec.yml triggers the 16-evaluator harness in CI. Reviewers see severity-banded AUDIT.md before they merge — no more "works on my machine" Power BI surprises in production.

CI / governance

Sample audit

What ships with every release

AUDIT.md · run #142 · medicaid_sdud_2026

Composite

0.91

Judges passed

16/16

Verdict

SHIP

[OK] dax_syntax · 1.00 · 0 errors across 39 measures
[OK] phi_leakage · 1.00 · no patient identifiers detected
[WARN] viz_choice · 0.78 · Pareto chart on Page 1 could be a bar
[OK] forecast_methodology · 0.92 · ensemble blend justified
[OK] domain_relevance · 0.95 · Medicaid spend framing on-target
[OK] star_schema_design · 0.97 · 4 relationships, no fact-to-fact joins
[OK] accessibility_wcag · 0.88 · contrast 4.7:1 on all text
[OK] governance_rls · 1.00 · state_code RLS role enforced

How it works

From spec to shipped artifact

  1. Spec submitted

    dashboard_spec.yml declares domain, data sources, audience, KPIs, RLS roles, and forecast methodology. Single source of truth.

  2. Bronze ingest

    Year-partitioned CSVs pulled from data.medicaid.gov to Azure Blob (stasiprod1eus2/healthcare/bronze). ~600 MB compressed, refreshed quarterly.

  3. Silver model

    DuckDB star schema: fact_sdud + dim_state + dim_drug + dim_date. Drug-class taxonomy applied (brand-name aware, so HUMIRA → Autoimmune and OZEMPIC → GLP-1).

  4. Forecast

    12-month SARIMA + Prophet ensemble per state × class. Equal-weight blend; 6-month holdout for backtest.

  5. Generate

    DAX measures + Tabular Object Model + page-layout JSON + narrative. Published to Power BI via Fabric REST.

  6. Audit

    16 evaluators run in parallel: 8 deterministic Python + 5 Claude judges + 3 paired auditors that re-inspect the deterministic findings.

  7. Verdict

    Severity-banded AUDIT.md + weighted composite scorecard. ≥ 0.85 = Ship · 0.70 to < 0.85 = Tighten · < 0.70 or any MISS = Re-work.
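
The verdict rule in the last step can be sketched in a few lines of Python. This is a minimal illustration, not the harness's actual code: the function names and the per-evaluator weights are hypothetical (the page does not publish the weights), and a composite of exactly 0.85 is treated as Ship because the rule reads "≥ 0.85".

```python
from typing import Mapping

SHIP_THRESHOLD = 0.85     # from the verdict rule: >= 0.85 ships
TIGHTEN_THRESHOLD = 0.70  # from the verdict rule: below 0.70 is re-work


def composite_score(scores: Mapping[str, float],
                    weights: Mapping[str, float]) -> float:
    """Weighted mean of evaluator scores.

    Assumption: evaluators without an explicit weight count as 1.0.
    """
    total_weight = sum(weights.get(name, 1.0) for name in scores)
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive value")
    weighted = sum(score * weights.get(name, 1.0)
                   for name, score in scores.items())
    return weighted / total_weight


def verdict(composite: float, any_miss: bool) -> str:
    """Map a composite score to Ship / Tighten / Re-work."""
    # A single MISS forces re-work regardless of the composite.
    if any_miss or composite < TIGHTEN_THRESHOLD:
        return "Re-work"
    if composite >= SHIP_THRESHOLD:
        return "Ship"
    return "Tighten"
```

For example, `verdict(0.91, any_miss=False)` returns `"Ship"`, while the same composite with `any_miss=True` returns `"Re-work"`.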

System architecture

Pipeline at a glance

Spec · dashboard_spec.yml
Bronze · Azure Blob CSV
Silver · DuckDB star schema
Forecast · SARIMA + Prophet
Generate · TOM + DAX + JSON
Audit · 16 evaluators
Verdict · Ship / Tighten / Re-work
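
For the Forecast stage, the equal-weight blend and holdout backtest can be sketched as follows. This is an illustration under stated assumptions: it takes the SARIMA and Prophet forecasts as already-computed lists rather than fitting either model, and it scores the holdout with MAPE, which is an assumed metric (the page does not name one). Function names are hypothetical.

```python
from typing import Sequence


def blend_equal_weight(sarima: Sequence[float],
                       prophet: Sequence[float]) -> list[float]:
    """Average two forecasts point by point (equal-weight ensemble)."""
    if len(sarima) != len(prophet):
        raise ValueError("forecasts must cover the same horizon")
    return [(s + p) / 2.0 for s, p in zip(sarima, prophet)]


def mape(actual: Sequence[float], forecast: Sequence[float]) -> float:
    """Mean absolute percentage error over a holdout window.

    Assumption: actual values are non-zero (e.g. monthly spend totals).
    """
    if len(actual) != len(forecast) or not actual:
        raise ValueError("need equal-length, non-empty series")
    return sum(abs(a - f) / abs(a) for a, f in zip(actual, forecast)) / len(actual)
```

In a backtest, the last 6 months of a state × class series would be withheld, both models forecast over that window, and `mape(holdout, blend_equal_weight(sarima_fc, prophet_fc))` compared against each model on its own.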

Eval framework

8 deterministic Python · 5 Claude LLM · 3 paired auditors · severity-banded AUDIT.md
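
A rough sketch of how the three tiers could be wired together in Python is shown below. Everything here is illustrative: the `Finding` type, the stand-in evaluators, and the two-pass ordering (deterministic and LLM judges first, then paired auditors over the deterministic findings) are assumptions, and the LLM judge is a stub that makes no model call.

```python
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Finding:
    evaluator: str
    tier: str   # "deterministic", "llm", or "auditor"
    score: float
    note: str


Evaluator = Callable[[dict], Finding]


def dax_syntax(artifact: dict) -> Finding:
    # Deterministic stand-in: every measure needs a non-empty expression.
    measures = artifact.get("measures", {})
    bad = [name for name, expr in measures.items() if not expr.strip()]
    note = f"{len(bad)} errors across {len(measures)} measures"
    return Finding("dax_syntax", "deterministic", 0.0 if bad else 1.0, note)


def viz_choice_stub(artifact: dict) -> Finding:
    # Placeholder for an LLM judge; a real one would call a model here.
    return Finding("viz_choice", "llm", 0.78, "stubbed judgment")


def audit_finding(finding: Finding) -> Finding:
    # Paired-auditor stand-in: re-checks that the score is a valid fraction.
    ok = 0.0 <= finding.score <= 1.0
    note = "score in range" if ok else "score out of range"
    return Finding(f"audit:{finding.evaluator}", "auditor", 1.0 if ok else 0.0, note)


def run_harness(artifact: dict, evaluators: list[Evaluator],
                max_workers: int = 8) -> list[Finding]:
    """Run evaluators concurrently, then audit the deterministic findings."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        first_pass = list(pool.map(lambda ev: ev(artifact), evaluators))
        deterministic = [f for f in first_pass if f.tier == "deterministic"]
        audits = list(pool.map(audit_finding, deterministic))
    return first_pass + audits
```

Threads suit this shape because LLM judges spend most of their time waiting on network calls; CPU-heavy deterministic checks might instead warrant a process pool.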

Compute

Azure Container Apps Jobs (quarterly cron) · DuckDB in-process · Power BI Fabric REST

Reproducibility

Every artifact regenerable from one dashboard_spec.yml · domain-neutral by design
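
Since everything is regenerated from the spec, a pre-flight check that the spec is complete is a natural first gate. The sketch below assumes the spec has already been parsed from YAML into a dict; the key names are hypothetical, chosen to mirror the fields the spec is said to declare (domain, data sources, audience, KPIs, RLS roles, forecast methodology), and the example values are placeholders.

```python
# Hypothetical key names; the real dashboard_spec.yml schema is not shown here.
REQUIRED_KEYS = (
    "domain",
    "data_sources",
    "audience",
    "kpis",
    "rls_roles",
    "forecast_methodology",
)


def missing_spec_keys(spec: dict) -> list[str]:
    """Return required keys that are absent or empty, in declaration order."""
    return [key for key in REQUIRED_KEYS if not spec.get(key)]


# Placeholder values for illustration only.
example_spec = {
    "domain": "medicaid_rx",
    "data_sources": ["data.medicaid.gov"],
    "audience": ["state_medicaid_agency"],
    "kpis": ["total_spend"],
    "rls_roles": ["state_code"],
    "forecast_methodology": "sarima_prophet_equal_weight",
}
```

A CI step could fail fast when `missing_spec_keys(spec)` is non-empty, before any ingest or generation work starts.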