BodhiBench
Latest deploy-gate eval scores across the six BodhiBench benchmarks. The methodology is open and documented in `docs/eval/`.
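| Benchmark | Status | Score | Threshold | Direction |
| --- | --- | --- | --- | --- |
| BodhiBench-Curriculum | PASS | 0.781 | 0.750 | higher is better |
| BodhiBench-Socratic | PASS | 0.823 | 0.800 | higher is better |
| BodhiBench-CodeSwitch | PASS | 0.722 | 0.700 | higher is better |
| BodhiBench-LowResource | PASS | 0.510 | 0.500 | higher is better |
| BodhiBench-Hallucination | PASS | 0.061 | 0.080 | lower is better |
| BodhiBench-Bias | PASS | 0.039 | 0.050 | lower is better |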
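The pass/fail logic implied by the scores above can be sketched roughly as follows. This is an illustrative sketch, not the project's actual gate code: the `Gate` class, `failing_gates`, and `exit_code` are hypothetical names. The direction of each gate is inferred from the published PASS statuses, and treating a score exactly at the threshold as a pass is an assumption.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Gate:
    """One benchmark gate (hypothetical structure, for illustration only)."""
    name: str
    score: float
    threshold: float
    higher_is_better: bool

    def passes(self) -> bool:
        # Assumption: a score exactly at the threshold counts as a pass.
        if self.higher_is_better:
            return self.score >= self.threshold
        return self.score <= self.threshold


# Scores and thresholds copied from the dashboard; directions are inferred.
GATES = [
    Gate("BodhiBench-Curriculum", 0.781, 0.750, True),
    Gate("BodhiBench-Socratic", 0.823, 0.800, True),
    Gate("BodhiBench-CodeSwitch", 0.722, 0.700, True),
    Gate("BodhiBench-LowResource", 0.510, 0.500, True),
    Gate("BodhiBench-Hallucination", 0.061, 0.080, False),
    Gate("BodhiBench-Bias", 0.039, 0.050, False),
]


def failing_gates(gates):
    """Return the names of all gates that do not meet their threshold."""
    return [g.name for g in gates if not g.passes()]


def exit_code(gates) -> int:
    """Return 0 when every gate passes and 1 otherwise."""
    return 1 if failing_gates(gates) else 0
```

A deploy hook could call `sys.exit(exit_code(GATES))` so that any failing gate produces a non-zero exit status.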
Methodology pillars
- Curriculum, Socratic, CodeSwitch, LowResource, Hallucination, and Bias gates published in `docs/eval/`.
- Inter-annotator agreement target Cohen's κ ≥ 0.7. Pre-registered on OSF.
- Contamination audit vs. Sarvam / Llama / Qwen training corpora.
- Per-subgroup fairness slices with confidence intervals.
- Argo CD pre-sync hook blocks deploy on any threshold regression.
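For readers unfamiliar with the inter-annotator agreement target, Cohen's κ for two annotators is the observed agreement corrected for the agreement expected by chance. The function below is a minimal sketch of the standard two-rater formula. The name `cohens_kappa` and the example labels are hypothetical, and this is not necessarily how the BodhiBench annotation pipeline computes it.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labelling the same items."""
    if len(labels_a) != len(labels_b) or len(labels_a) == 0:
        raise ValueError("label lists must be non-empty and of equal length")

    n = len(labels_a)
    # Observed agreement: fraction of items where both annotators agree.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n

    # Chance agreement: product of each annotator's marginal label rates.
    count_a = Counter(labels_a)
    count_b = Counter(labels_b)
    expected = sum(count_a[c] * count_b[c] for c in count_a) / (n * n)

    if expected == 1.0:
        # Both annotators used a single identical label; kappa is undefined
        # here, and this sketch chooses to report full agreement.
        return 1.0
    return (observed - expected) / (1.0 - expected)


# Hypothetical labels from two annotators on four items.
annotator_1 = [1, 1, 0, 0]
annotator_2 = [1, 1, 0, 1]
kappa = cohens_kappa(annotator_1, annotator_2)  # 0.5, below the 0.7 target
```

In this toy example the annotators agree on 3 of 4 items (0.75 observed) against 0.5 expected by chance, giving κ = 0.5, which would miss a κ ≥ 0.7 target.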