Module: Raif::Evals::EvalSets::LlmJudgeExpectations
- Included in: Raif::Evals::EvalSet
- Defined in: lib/raif/evals/eval_sets/llm_judge_expectations.rb
Instance Method Summary
- #expect_llm_judge_passes(content, criteria:, examples: [], strict: false, llm_judge_model_key: nil, additional_context: nil, result_metadata: {}) ⇒ ExpectationResult
  Uses an LLM judge to evaluate whether content meets specific criteria with a binary pass/fail result.
- #expect_llm_judge_prefers(content_to_judge, over:, criteria:, allow_ties: true, llm_judge_model_key: nil, additional_context: nil, result_metadata: {}) ⇒ ExpectationResult
  Uses an LLM judge to compare two pieces of content and determine which better meets specified criteria.
- #expect_llm_judge_score(output, scoring_rubric:, min_passing_score:, llm_judge_model_key: nil, additional_context: nil, result_metadata: {}) ⇒ ExpectationResult
  Uses an LLM judge to evaluate content with a numerical score based on a detailed rubric.
Instance Method Details
#expect_llm_judge_passes(content, criteria:, examples: [], strict: false, llm_judge_model_key: nil, additional_context: nil, result_metadata: {}) ⇒ ExpectationResult
Uses an LLM judge to evaluate whether content meets specific criteria with a binary pass/fail result.
This method leverages the Binary LLM judge to assess content against the provided criteria, returning a pass or fail judgment with reasoning and a confidence score.
The judge result includes metadata accessible via expectation_result.metadata:
- :passes - Boolean result
- :reasoning - Detailed explanation
- :confidence - Confidence score (0.0-1.0)
# File 'lib/raif/evals/eval_sets/llm_judge_expectations.rb', line 45

def expect_llm_judge_passes(content, criteria:, examples: [], strict: false, llm_judge_model_key: nil, additional_context: nil, result_metadata: {})
  judge_task = LlmJudges::Binary.run(
    content_to_judge: content,
    criteria: criteria,
    examples: examples,
    strict_mode: strict,
    llm_model_key: llm_judge_model_key,
    additional_context: additional_context
  )

  if judge_task.low_confidence? && output.respond_to?(:puts)
    output.puts Raif::Utils::Colors.yellow("  ⚠ Low confidence: #{judge_task.judgment_confidence}")
  end

  if Raif.config.evals_verbose_output && output.respond_to?(:puts)
    output.puts "  #{judge_task.judgment_reasoning}"
  end

  judge_metadata = {
    passes: judge_task.passes?,
    reasoning: judge_task.judgment_reasoning,
    confidence: judge_task.judgment_confidence,
  }.compact

  # Merge user metadata with judge metadata
  merged_metadata = result_metadata.merge(judge_metadata)

  expectation_result = expect "LLM judge: #{criteria}", result_metadata: merged_metadata do
    judge_task.passes?
  end

  if expectation_result && judge_task.errors.any?
    expectation_result.error_message = judge_task.errors.full_messages.join(", ")
  end

  expectation_result
end
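A minimal usage sketch, assuming this module is mixed into your eval set (e.g. via Raif::Evals::EvalSet). The surrounding eval block, the sample strings, and the criteria text are illustrative assumptions, not part of the module's API:

# Illustrative only: the eval block and sample content are hypothetical.
eval "summary avoids unsupported claims" do
  source  = "Acme reported revenue of $10M in Q3, up 12% year over year."
  summary = "Acme's Q3 revenue grew 12% year over year, to $10M."

  result = expect_llm_judge_passes summary,
    criteria: "The summary makes no claims that are not supported by the source text",
    strict: true,
    additional_context: "Source text: #{source}"

  # Judge details are exposed on the returned ExpectationResult:
  result.metadata[:passes]     # true/false verdict
  result.metadata[:reasoning]  # judge's explanation
  result.metadata[:confidence] # 0.0-1.0
end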
#expect_llm_judge_prefers(content_to_judge, over:, criteria:, allow_ties: true, llm_judge_model_key: nil, additional_context: nil, result_metadata: {}) ⇒ ExpectationResult
Uses an LLM judge to compare two pieces of content and determine which better meets specified criteria.
This method leverages the Comparative LLM judge to perform A/B testing between two pieces of content. Content placement is randomized to avoid position bias, and the judge determines which content better satisfies the comparison criteria.
The expectation passes if the judge correctly identifies the expected winner. Because of the randomization, content_to_judge may be assigned to either position A or B, and the judge’s choice is validated against the expected winner.
The judge result includes metadata accessible via expectation_result.metadata:
- :winner - Which content won (“A”, “B”, or “tie”)
- :reasoning - Detailed explanation of the choice
- :confidence - Confidence score (0.0-1.0)
# File 'lib/raif/evals/eval_sets/llm_judge_expectations.rb', line 216

def expect_llm_judge_prefers(content_to_judge, over:, criteria:, allow_ties: true, llm_judge_model_key: nil, additional_context: nil, result_metadata: {})
  judge_task = LlmJudges::Comparative.run(
    content_to_judge: content_to_judge,
    over_content: over,
    comparison_criteria: criteria,
    allow_ties: allow_ties,
    llm_model_key: llm_judge_model_key,
    additional_context: additional_context
  )

  if output.respond_to?(:puts)
    output.puts "  Winner: #{judge_task.winner}"
    output.puts "  #{judge_task.judgment_reasoning}" if Raif.config.evals_verbose_output
  end

  judge_metadata = {
    winner: judge_task.winner,
    reasoning: judge_task.judgment_reasoning,
    confidence: judge_task.judgment_confidence,
  }.compact

  # Merge user metadata with judge metadata
  merged_metadata = result_metadata.merge(judge_metadata)

  expectation_result = expect "LLM judge prefers A over B: #{criteria}", result_metadata: merged_metadata do
    judge_task.completed? && judge_task.correct_expected_winner?
  end

  if expectation_result && judge_task.errors.any?
    expectation_result.error_message = judge_task.errors.full_messages.join(", ")
  end

  expectation_result
end
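A minimal comparison sketch, useful for A/B-style checks such as comparing output from two prompt variants. The eval block and sample answers below are illustrative assumptions:

# Illustrative only: the eval block and sample answers are hypothetical.
eval "revised prompt output is preferred over the baseline" do
  revised_answer  = "Paris is the capital of France."
  baseline_answer = "The capital of France is probably Paris, but it might be Lyon."

  expect_llm_judge_prefers revised_answer,
    over: baseline_answer,
    criteria: "Answers the question accurately and without hedging",
    allow_ties: false
end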
#expect_llm_judge_score(output, scoring_rubric:, min_passing_score:, llm_judge_model_key: nil, additional_context: nil, result_metadata: {}) ⇒ ExpectationResult
Uses an LLM judge to evaluate content with a numerical score based on a detailed rubric.
This method leverages the Scored LLM judge to assess content against a scoring rubric, producing a numerical score with detailed reasoning and determining pass/fail based on the minimum passing score threshold.
The judge result includes metadata accessible via expectation_result.metadata:
- :score - Numerical score given
- :reasoning - Detailed explanation
- :confidence - Confidence score (0.0-1.0)
# File 'lib/raif/evals/eval_sets/llm_judge_expectations.rb', line 132

def expect_llm_judge_score(output, scoring_rubric:, min_passing_score:, llm_judge_model_key: nil, additional_context: nil, result_metadata: {})
  scoring_rubric_obj = scoring_rubric

  judge_task = LlmJudges::Scored.run(
    content_to_judge: output,
    scoring_rubric: scoring_rubric_obj,
    llm_model_key: llm_judge_model_key,
    additional_context: additional_context
  )

  rubric_name = scoring_rubric_obj.respond_to?(:name) ? scoring_rubric_obj.name : "custom"

  if output.respond_to?(:puts)
    output.puts "  Score: #{judge_task.judgment_score}"
    output.puts "  #{judge_task.judgment_reasoning}" if Raif.config.evals_verbose_output
  end

  judge_metadata = {
    score: judge_task.judgment_score,
    reasoning: judge_task.judgment_reasoning,
    confidence: judge_task.judgment_confidence,
  }.compact

  # Merge user metadata with judge metadata
  merged_metadata = result_metadata.merge(judge_metadata)

  expectation_result = expect "LLM judge score (#{rubric_name}): >= #{min_passing_score}", result_metadata: merged_metadata do
    judge_task.completed? && judge_task.judgment_score && judge_task.judgment_score >= min_passing_score
  end

  if expectation_result && judge_task.errors.any?
    expectation_result.error_message = judge_task.errors.full_messages.join(", ")
  end

  expectation_result
end
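A minimal scoring sketch. The rubric is passed here as a plain string, in which case the expectation description uses "custom" as the rubric name; an object responding to #name may be passed instead. The eval block, rubric text, and sample reply are illustrative assumptions:

# Illustrative only: the eval block, rubric, and sample reply are hypothetical.
eval "support reply meets the quality bar" do
  reply = "Thanks for reaching out! You can reset your password under Settings > Security."

  rubric = <<~RUBRIC
    5 - Accurate, actionable, and polite
    3 - Accurate but vague or curt
    1 - Inaccurate or unhelpful
  RUBRIC

  result = expect_llm_judge_score reply,
    scoring_rubric: rubric,
    min_passing_score: 4

  result.metadata[:score] # numerical score assigned by the judge
end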