How the program evolves — every version, what changed and why, who contributed, and what we address next. Plain English first, a little tech underneath.
The version presented to the Kelly/MCG AI group, delivered as one declarative package: the governance-led motion, the proof that it is built, the meter (cost and efficiency), the ecosystem due diligence, and the live pipeline, shown honestly as reference leads.
Sean ran his own open-source toolset research and brought back a strong, focused stack. v0.9 merges it into ours. His list covered exactly the areas where we were thinnest: automatic policy enforcement, compliance evidence for certifications, and a full AI audit trail. A large piece of v0.9 is his thinking and delivery.
Six additions from Sean's research — each here for a reason:
An automatic rule-checker. It decides, in real time, who and what is allowed to use a given AI model, dataset, or system — and blocks the rest.
Why we merged it: v0.8 could detect problems but couldn't enforce rules. OPA (Open Policy Agent) is the enforcement layer we were missing.
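A little tech underneath: a minimal sketch of the "allow what is listed, block the rest" decision described above. It is written in plain Python for illustration, not in OPA's own policy language (Rego), and the role names, resource names, and the `is_allowed` helper are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Request:
    role: str      # who is asking (hypothetical role name)
    resource: str  # which AI model, dataset, or system
    action: str    # what they want to do with it


# Hypothetical allow-list. Anything not listed here is blocked.
ALLOWED = {
    ("data-scientist", "model:support-bot", "invoke"),
    ("auditor", "dataset:claims", "read"),
}


def is_allowed(req: Request) -> bool:
    """Default deny: permit only requests that match an explicit rule."""
    return (req.role, req.resource, req.action) in ALLOWED


allowed = is_allowed(Request("auditor", "dataset:claims", "read"))
blocked = is_allowed(Request("intern", "dataset:claims", "read"))
```

The point of the shape is that the default answer is "no": a new role or a new dataset gets access only when someone writes a rule for it.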
Turns "are we compliant?" from a manual scramble into a button. It tracks every security control and the proof for it, in a format auditors accept.
Why we merged it: certifications were our weakest stage. OSCAL (Open Security Controls Assessment Language) fixes it directly.
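A little tech underneath: a rough sketch of "every control and the proof for it" as a data structure, with a check for controls that still lack proof. This is not the OSCAL schema. The `Control` class, the control IDs, and the evidence file name are hypothetical and only show the idea.

```python
from dataclasses import dataclass, field


@dataclass
class Control:
    control_id: str   # hypothetical identifier, not a real catalog ID
    description: str
    evidence: list[str] = field(default_factory=list)  # links or file names


def missing_evidence(controls: list[Control]) -> list[str]:
    """Return the IDs of controls that have no proof attached yet."""
    return [c.control_id for c in controls if not c.evidence]


controls = [
    Control("ctl-001", "Access to AI models is policy-controlled",
            ["policy-decision-log.json"]),
    Control("ctl-002", "AI interactions are logged"),
]

gaps = missing_evidence(controls)
```

Here `gaps` lists only `ctl-002`, the one control with nothing attached. That list is the "button" view: what an auditor would still ask for.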
A flight recorder for the AI — every prompt, answer, cost, and who did what. The trail you show when someone asks "what did the AI do, and when?"
Why we merged it: a genuine miss — we had monitoring, but not a clean audit trail. This is it.
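A little tech underneath: a minimal sketch of what one audit-trail entry might hold and how the trail could answer "who did what, and what did it cost?". It is an in-memory illustration only. The `AuditEvent` and `AuditTrail` names and fields are hypothetical and are not the API of Langfuse or any other tool.

```python
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class AuditEvent:
    timestamp: datetime  # when it happened (UTC)
    user: str            # who did it
    prompt: str          # what was asked
    answer: str          # what the AI returned
    cost_usd: float      # what it cost


class AuditTrail:
    """Append-only list of events, queryable by user."""

    def __init__(self) -> None:
        self._events: list[AuditEvent] = []

    def record(self, user: str, prompt: str, answer: str,
               cost_usd: float) -> AuditEvent:
        event = AuditEvent(datetime.now(timezone.utc), user, prompt,
                           answer, cost_usd)
        self._events.append(event)
        return event

    def by_user(self, user: str) -> list[AuditEvent]:
        return [e for e in self._events if e.user == user]

    def total_cost(self) -> float:
        return sum(e.cost_usd for e in self._events)


trail = AuditTrail()
trail.record("alice", "Summarize claim 17", "Summary text", 0.25)
trail.record("bob", "Draft a reply", "Reply text", 0.5)
```

Monitoring tells you that something happened. A trail like this tells you which user, which prompt, and which answer, which is the difference the merge was meant to close.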
The common plumbing that carries all the monitoring data to one place — so security and compliance dashboards actually get fed.
Why we merged it: it is the open backbone we implied but never named. v0.9 names it.
Tracks where data came from, where it went, and who owns it — so you can prove your AI was trained and run on the right data.
Why we merged it: rounds out data governance for audit, alongside OpenMetadata / DataHub.
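A little tech underneath: lineage is, at its simplest, a graph you can walk backwards. The sketch below assumes a hypothetical mapping from each dataset to the datasets it was derived from, plus a hypothetical owner table. The dataset names, team names, and the `trace_sources` helper are made up for illustration and do not reflect any specific tool's data model.

```python
from collections import deque

# Hypothetical lineage: each dataset maps to the datasets it was derived from.
UPSTREAM = {
    "training-set": ["cleaned-claims"],
    "cleaned-claims": ["raw-claims"],
    "raw-claims": [],
}

# Hypothetical ownership table.
OWNER = {
    "training-set": "ml-team",
    "cleaned-claims": "data-eng",
    "raw-claims": "claims-ops",
}


def trace_sources(dataset: str, upstream: dict[str, list[str]]) -> list[str]:
    """Walk upstream and return every ancestor of `dataset`, nearest first."""
    seen: list[str] = []
    queue = deque(upstream.get(dataset, []))
    while queue:
        current = queue.popleft()
        if current in seen:
            continue  # already visited; avoids loops in the graph
        seen.append(current)
        queue.extend(upstream.get(current, []))
    return seen


sources = trace_sources("training-set", UPSTREAM)
owners = [OWNER[name] for name in sources]
```

Starting from `training-set`, the walk returns `cleaned-claims` and then `raw-claims`, with an owner for each. That is the evidence behind "trained and run on the right data".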
The step-by-step recipe for security-testing the apps around the AI — so every engagement tests the same way, every time.
Why we merged it: we had the testing tools but not the repeatable method. This is the rinse-and-repeat part.
The confidence signal: both teams searched independently and landed on the same 13-tool core — garak, PyRIT, promptfoo, OWASP ZAP, Arize Phoenix, AIF360, Fairlearn, SHAP, LIME, OpenMetadata, DataHub, Datasheets, + OWASP LLM Top 10. When two independent searches converge, those are the non-negotiables.
Sean's doc was intentionally open-source only. We kept what his scope didn't cover, because a Fortune-10 program needs it:
Corrected in v0.9: Google Model Card Toolkit was on Sean's list as current — it's been archived (read-only) since Sep 2024. We use Hugging Face model cards / CycloneDX instead. (No fault — these move fast; it's exactly why we cross-check.)
One repeatable process for every engagement. Sean's additions notably strengthen Implementation (OPA), Certifications (OSCAL), and Maintenance (Langfuse, OTel).
Inventory + lineage, threat-model (ATLAS/OWASP), garak sweep, fairness baseline, ISO 42001 gap.
SoR + registry, guardrails, OPA, PII, supply-chain, Copilot/SDLC, OTel.
garak→PyRIT→promptfoo, ART, ZAP/Burp via OWASP WSTG, fairness.
Prompt standards + CoE, model cards + datasheets, OWASP LLM Top 10 curriculum.
ISO 42001, OSCAL evidence, AI-BOM, Langfuse audit trail.
Observability (Langfuse/Phoenix/OTel), CI red-team, drift, OPA + OSCAL refresh.
Open gap: Training (stage 04) is thin in both toolsets — it's process + people, not tooling. A real build-out item before v1.0.
The first time everything was pulled together, documented, and committed for the team.
v1.0 shipped. The queue now is what makes it client-ready at full strength.