Evaluate Matching

The evaluate-matching maintenance tool checks the quality of existing matches between highlights and a tagging or cluster rule.

It loads highlights through the rule's ai_highlights_search query, resolves the active assignments for that rule, and then evaluates only the highlights that are currently linked to each tag or cluster. This tool is focused on match quality, not missed coverage.

How It Runs

The tool has two modes:

  • Default mode reuses the current rule-relevant assignments already stored on the fetched highlights.
  • --rematch reruns matching first, then evaluates those fresh assignments instead of the stored ones.

In both modes, highlights are fetched through the same semantic-search path. The difference is only whether assignment data is reused or recomputed.
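
A minimal sketch of that mode split, using plain dicts for highlights; the field names and helper functions here are illustrative, not the tool's actual code:

```python
def resolve_assignments(highlights: list[dict], rematch: bool) -> dict[str, str]:
    """Map highlight ID -> assigned category for evaluation.

    Default mode reuses the rule-relevant assignment already stored on
    each fetched highlight; --rematch recomputes assignments first.
    """
    if not rematch:
        # Evaluate-only: trust the stored assignment data.
        return {h["id"]: h["assignment"] for h in highlights if h.get("assignment")}
    # Rematch: take a deterministic sample, then rerun matching on it
    # (stubbed here; the real tool runs the assign-categories graph).
    sample = highlights[:2000]  # stand-in for the ID-hash sample sketched below
    return {h["id"]: rerun_matching(h) for h in sample}

def rerun_matching(highlight: dict) -> str:
    """Stub standing in for the assign-categories graph."""
    return highlight.get("assignment", "uncategorized")

highlights = [{"id": "hl-1", "assignment": "pricing"}, {"id": "hl-2"}]
print(resolve_assignments(highlights, rematch=False))  # {'hl-1': 'pricing'}
```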

--rematch does not rerun matching on every fetched highlight. It first takes a deterministic sample of up to 2,000 highlights using a stable ID-hash strategy, then runs the assign-categories graph on that sample. This makes rematch runs repeatable for the same input set.

After assignments are resolved, the tool evaluates matched highlights only. For each category, it checks a deterministic sample of assigned highlights, up to 450 per category.
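
A minimal sketch of that kind of stable ID-hash sampling; the hash choice and ordering are assumptions, but the key property is that the same input IDs always produce the same sample:

```python
import hashlib

def stable_sample(ids: list[str], cap: int) -> list[str]:
    """Deterministically keep up to `cap` IDs by ranking on an ID hash."""
    if len(ids) <= cap:
        return list(ids)
    # Rank by the hex digest of each ID so the sample is a pure function
    # of the input set, with no RNG state involved.
    ranked = sorted(ids, key=lambda i: hashlib.sha256(i.encode()).hexdigest())
    return ranked[:cap]

ids = [f"hl-{n}" for n in range(5000)]
rematch_sample = stable_sample(ids, cap=2000)       # rematch-wide cap
per_category = stable_sample(ids[:600], cap=450)    # per-category cap
```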

The evaluation samples are deterministic, not random. Within each category, the tool processes judge batches with bounded parallelism to improve throughput while keeping report contents and pass/fail rules unchanged.
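
A sketch of bounded parallelism over judge batches, assuming an async judge call; the concurrency limit, batch shapes, and function names are illustrative:

```python
import asyncio

async def judge_batch(batch: list[str]) -> list[str]:
    """Stand-in for one judge call; returns one verdict per highlight."""
    await asyncio.sleep(0.1)  # placeholder for the real model call
    return ["correct"] * len(batch)

async def judge_category(batches: list[list[str]], max_parallel: int = 4) -> list[str]:
    """Run judge batches concurrently, never more than `max_parallel` in flight.

    gather() preserves batch order, so report contents do not depend on
    completion timing.
    """
    sem = asyncio.Semaphore(max_parallel)

    async def run(batch: list[str]) -> list[str]:
        async with sem:
            return await judge_batch(batch)

    results = await asyncio.gather(*(run(b) for b in batches))
    return [v for batch in results for v in batch]

verdicts = asyncio.run(judge_category([["hl-1", "hl-2"], ["hl-3"]]))
print(verdicts)  # ['correct', 'correct', 'correct']
```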

What It Reports

The tool writes a matched-only report bundle with:

  • *.matched.summary.csv
  • *.matched.highlights.csv
  • *.matched.methods.csv
  • *.judge-failures.jsonl when judge failures occur
  • a machine-readable summary.json

The CSVs contain per-category rollups plus highlight-level verdicts and evidence. The matched-methods report breaks matched results down by assignment method (llm, similarity, or keyword) when that metadata exists.
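
For a sense of what that breakdown contains, here is a sketch that rolls highlight-level verdicts up by assignment method; the column names are stand-ins, not the report's actual headers:

```python
from collections import Counter

# Illustrative highlight-level rows, shaped like *.matched.highlights.csv.
rows = [
    {"method": "llm", "verdict": "correct"},
    {"method": "similarity", "verdict": "incorrect"},
    {"method": "keyword", "verdict": "correct"},
    {"method": "llm", "verdict": "uncertain"},
]

# Roll matched verdicts up by assignment method.
by_method = Counter((r["method"], r["verdict"]) for r in rows)
for (method, verdict), count in sorted(by_method.items()):
    print(f"{method}\t{verdict}\t{count}")
```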

The optional judge-failures.jsonl file is a debug artifact for cases that later appear as missing_verdict in the highlight CSVs. Each line records the matched evaluation phase, category, failure type, and a short message. Failure types include missing insights arrays, omitted highlight rows, invalid structured rows, duplicate rows, and unknown returned IDs.
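
A quick way to triage those debug lines, as a sketch; the `failure_type` field name and the file prefix are assumptions about the line schema:

```python
import json
from collections import Counter

# Tally failure types across a run's judge-failures.jsonl.
failure_types = Counter()
with open("run.judge-failures.jsonl") as f:
    for line in f:
        record = json.loads(line)
        failure_types[record.get("failure_type", "unknown")] += 1

print(failure_types.most_common())
```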

The JSON summary is intended for monitoring and automation (see the consumer sketch after this list). It includes:

  • rule metadata and run mode (evaluate-only or rematch)
  • total highlights evaluated
  • matched pass/fail counts
  • decisive, incorrect, and uncertain judgment totals
  • the top failing categories for matched evaluation
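
A sketch of the kind of automation hook this enables; every key name below is an assumption about summary.json's shape, not a documented schema:

```python
import json
import sys

with open("summary.json") as f:
    summary = json.load(f)

# Gate a pipeline step on the run outcome.
failed = summary.get("matched_fail_count", 0)
if failed:
    print(f"{failed} failing categories: {summary.get('top_failing_categories')}")
    sys.exit(1)
```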

Pass and Fail Rules

A category passes only when all of the following are true:

  • there is at least one decisive judgment
  • the decisive error rate is 12.5% or lower
  • the uncertainty rate is 12.5% or lower

Categories with no decisive evidence, or categories where the reviewer model stays too uncertain, are treated as failing.
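
These thresholds translate directly into a predicate. The sketch below assumes the decisive error rate is taken over decisive judgments and the uncertainty rate over all evaluated highlights in the category:

```python
def category_passes(decisive: int, incorrect: int, uncertain: int, evaluated: int) -> bool:
    """Pass rules: at least one decisive judgment, decisive error rate
    at most 12.5%, uncertainty rate at most 12.5%."""
    if decisive == 0:
        return False  # no decisive evidence at all fails outright
    if incorrect / decisive > 0.125:
        return False  # too many decisive judgments were wrong
    if uncertain / evaluated > 0.125:
        return False  # the reviewer model stayed too uncertain
    return True

# 40 decisive judgments, 4 incorrect (10%), 5 uncertain of 45 (~11%) -> passes
print(category_passes(decisive=40, incorrect=4, uncertain=5, evaluated=45))  # True
```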

Category Context

Category evaluation uses the category title and description when both exist. If the description is empty, the evaluator falls back to the title-only context instead of treating that as an error. If both title and description are empty, the fallback label is Unknown.
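
The fallback chain is simple enough to state as code; the combined title-and-description format is an assumption:

```python
def category_context(title: str, description: str) -> str:
    """Documented fallbacks: title plus description when both exist,
    title alone when the description is empty, "Unknown" when both are."""
    title, description = title.strip(), description.strip()
    if title and description:
        return f"{title}: {description}"  # combined format is an assumption
    if title:
        return title  # empty description falls back to title-only context
    return "Unknown"

print(category_context("Pricing objections", ""))  # -> Pricing objections
```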

For matched evaluation, a highlight counts as a correct match when it explicitly fits the category in a meaningful, non-negligible way. The category does not need to be the dominant theme, so multi-topic highlights can still pass when the category is clearly supported. Vague, incidental, negligible, loosely related, or contradicted overlap is treated as incorrect, and genuinely incomplete or ambiguous cases remain uncertain.

For missed-match analysis on unmatched highlights, use evaluate-matching-coverage.