
BodhiBench

Latest deploy-gate eval scores across the six BodhiBench benchmarks. The open methodology is documented in `docs/eval/`.

BodhiBench-Curriculum

PASS
0.781
threshold 0.750

BodhiBench-Socratic

PASS
0.823
threshold 0.800

BodhiBench-CodeSwitch

PASS
0.722
threshold 0.700

BodhiBench-LowResource

PASS
0.510
threshold 0.500

BodhiBench-Hallucination

PASS
0.061
threshold 0.080

BodhiBench-Bias

PASS
0.039
threshold 0.050
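
The pass/fail logic behind the cards above can be sketched as follows. This is a minimal illustration, not the project's actual gate code. The `Gate` class and the `failing_gates` and `exit_code` helpers are hypothetical names. The comparison direction is an assumption inferred from the reported results: the first four benchmarks appear to pass when the score is at or above the threshold, while Hallucination and Bias appear to pass when the score is at or below it. Whether the boundary itself counts as a pass is also an assumption.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Gate:
    """One deploy gate: a score compared against a fixed threshold."""
    name: str
    score: float
    threshold: float
    # Assumption: direction inferred from the PASS results shown on the page.
    higher_is_better: bool

    def passes(self) -> bool:
        # Assumption: a score exactly equal to the threshold counts as a pass.
        if self.higher_is_better:
            return self.score >= self.threshold
        return self.score <= self.threshold


# Scores and thresholds copied from the cards above.
GATES = [
    Gate("BodhiBench-Curriculum", 0.781, 0.750, True),
    Gate("BodhiBench-Socratic", 0.823, 0.800, True),
    Gate("BodhiBench-CodeSwitch", 0.722, 0.700, True),
    Gate("BodhiBench-LowResource", 0.510, 0.500, True),
    Gate("BodhiBench-Hallucination", 0.061, 0.080, False),
    Gate("BodhiBench-Bias", 0.039, 0.050, False),
]


def failing_gates(gates):
    """Return the names of all gates whose score misses its threshold."""
    return [g.name for g in gates if not g.passes()]


def exit_code(gates) -> int:
    """Non-zero when any gate fails, so a calling hook could block a deploy."""
    return 1 if failing_gates(gates) else 0


if __name__ == "__main__":
    failed = failing_gates(GATES)
    print("FAIL: " + ", ".join(failed) if failed else "all gates PASS")
    raise SystemExit(exit_code(GATES))
```

Modelling the direction as an explicit field keeps the lower-is-better gates from being silently compared the wrong way round.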

Methodology pillars

  • Curriculum, Socratic, CodeSwitch, LowResource, Hallucination, and Bias gates published in `docs/eval/`.
  • Inter-annotator agreement target of Cohen's κ ≥ 0.7, pre-registered on OSF.
  • Contamination audit vs. Sarvam / Llama / Qwen training corpora.
  • Per-subgroup fairness slices with confidence intervals.
  • Argo CD pre-sync hook blocks deploy on any threshold regression.
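
For readers unfamiliar with the agreement statistic named in the pillars above, here is a minimal sketch of Cohen's κ for two annotators. It assumes each annotator assigns one categorical label per item. The function name `cohens_kappa` and the sample labels are illustrative only and are not taken from the BodhiBench annotation data or tooling.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labelling the same items.

    kappa = (p_o - p_e) / (1 - p_e), where p_o is the observed agreement
    and p_e is the agreement expected by chance from each annotator's
    label frequencies.
    """
    if len(labels_a) != len(labels_b) or len(labels_a) == 0:
        raise ValueError("need two non-empty label lists of equal length")

    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n

    count_a = Counter(labels_a)
    count_b = Counter(labels_b)
    # Counter returns 0 for labels the second annotator never used.
    expected = sum(count_a[k] * count_b[k] for k in count_a) / (n * n)

    if expected == 1.0:
        # Both annotators used one identical label throughout; kappa is
        # undefined here, and this sketch treats it as perfect agreement.
        return 1.0
    return (observed - expected) / (1.0 - expected)


# Illustrative labels only: the annotators disagree on 1 of 10 items.
annotator_1 = [1, 1, 0, 0, 1, 0, 1, 1, 0, 0]
annotator_2 = [1, 1, 0, 0, 1, 0, 1, 0, 0, 0]
kappa = cohens_kappa(annotator_1, annotator_2)
meets_target = kappa >= 0.7  # the target stated in the pillars above
```

In this toy case the observed agreement is 0.9 and the chance agreement is 0.5, so κ works out to 0.8 and the 0.7 target is met.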