
Evaluate Matching Reports

The evaluate-matching maintenance tool checks how well a tagging or cluster rule is working on highlights loaded from the rule's ai_highlights_search query.

How It Runs

The tool has two modes:

  • Default mode reuses the current rule-relevant assignments already stored on the fetched highlights.
  • --rematch reruns matching first, then evaluates those fresh assignments instead of the stored ones.

In both modes, highlights are fetched through the same semantic-search path. The difference is only whether assignment data is reused or recomputed.

By default, the tool evaluates both matched and unmatched highlights. Pass --skipUnmatchedEvaluation to skip the unmatched evaluation step and omit the unmatched CSV reports.

--rematch does not rerun matching on every fetched highlight. It first takes a deterministic sample of up to 2,000 highlights using a stable ID-hash strategy, then runs the assign-categories graph on that sample. This makes rematch runs repeatable for the same input set.
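A stable ID-hash sample can be sketched as follows. This is a minimal illustration of the idea, not the tool's actual implementation: the function name `stable_sample`, the SHA-256 choice, and the sort-by-hash strategy are all assumptions.

```python
import hashlib

def stable_sample(ids: list[str], limit: int = 2000) -> list[str]:
    """Pick up to `limit` IDs deterministically via a stable ID hash."""
    def key(hid: str) -> str:
        # Hashing the ID gives an ordering that is independent of input order.
        return hashlib.sha256(hid.encode("utf-8")).hexdigest()
    return sorted(ids, key=key)[:limit]
```

Because the selection depends only on the IDs themselves, rerunning over the same input set yields the same sample regardless of fetch order.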

After assignments are resolved, the tool runs two evaluations (unless --skipUnmatchedEvaluation disables the second):

  • Matched evaluation: for each category, it checks a deterministic sample of assigned highlights, up to 450 per category.
  • Unmatched evaluation: for each category, it checks a deterministic sample of currently unassigned highlights, up to 450 per category, to look for missed matches.

The evaluation samples are deterministic, not random. Within each category, the tool processes judge batches with bounded parallelism to improve throughput while keeping report contents and pass/fail rules unchanged.
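Bounded parallelism over judge batches can be sketched with a semaphore. The helper names, batch size, and concurrency limit below are illustrative assumptions; only the shape of the technique (capped in-flight batches, order-preserving collection) reflects the description above.

```python
import asyncio

async def judge_batch(batch: list[str]) -> list[str]:
    await asyncio.sleep(0)  # stand-in for a model call
    return [f"verdict:{h}" for h in batch]

async def run_batches(highlights: list[str], batch_size: int = 25,
                      max_concurrency: int = 4) -> list[str]:
    sem = asyncio.Semaphore(max_concurrency)
    batches = [highlights[i:i + batch_size]
               for i in range(0, len(highlights), batch_size)]

    async def bounded(batch: list[str]) -> list[str]:
        async with sem:  # at most `max_concurrency` batches in flight
            return await judge_batch(batch)

    # gather preserves batch order, so report contents stay stable
    # even though batches run concurrently.
    results = await asyncio.gather(*(bounded(b) for b in batches))
    return [v for chunk in results for v in chunk]
```

The key property is that concurrency only changes throughput: results come back in a fixed order, so the reports and pass/fail rules are unaffected.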

What It Reports

The tool writes a report bundle with:

  • *.matched.summary.csv
  • *.matched.highlights.csv
  • *.matched.methods.csv
  • *.unmatched.summary.csv
  • *.unmatched.highlights.csv
  • *.judge-failures.jsonl when judge failures occur
  • a machine-readable summary.json

The CSVs contain per-category rollups plus highlight-level verdicts and evidence. The matched-methods report breaks matched results down by assignment method (llm, similarity, or keyword) when that metadata exists.
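A per-method rollup of this kind can be sketched with a counter. The row shape and field names (`method`, `verdict`) are assumptions about the CSV contents, not the tool's exact schema.

```python
from collections import Counter

def method_rollup(rows: list[dict]) -> dict[str, Counter]:
    """Count verdicts per assignment method (llm / similarity / keyword)."""
    out: dict[str, Counter] = {}
    for row in rows:
        # Rows without method metadata are grouped under a fallback bucket.
        method = row.get("method") or "unknown"
        out.setdefault(method, Counter())[row["verdict"]] += 1
    return out
```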

The optional judge-failures.jsonl file is a debug artifact for cases that later appear as missing_verdict in the highlight CSVs. Each line records a durable failure event, including the evaluation phase, category, failure type, and a short message. Failure types include missing insights arrays, omitted highlight rows, invalid structured rows, duplicate rows, and unknown returned IDs.
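A failure event of this shape can be serialized as one JSON line. The key names below mirror the fields described above (phase, category, failure type, message) but are hypothetical; the file's exact schema is not documented here.

```python
import json

def failure_line(phase: str, category: str,
                 failure_type: str, message: str) -> str:
    """Render one judge-failure event as a JSONL line."""
    event = {
        "phase": phase,                # e.g. "matched" or "unmatched"
        "category": category,
        "failureType": failure_type,   # e.g. "duplicate_row", "unknown_id"
        "message": message,            # short human-readable detail
    }
    return json.dumps(event)
```

One event per line keeps the file appendable and easy to scan with line-oriented tools when chasing down `missing_verdict` rows.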

This file is only written when judge failures occur. If you run with --skipUnmatchedEvaluation, unmatched-related judge failures are not collected.

The JSON summary is intended for monitoring and automation. It includes:

  • rule metadata and run mode (evaluate-only or rematch)
  • total highlights evaluated
  • matched and unmatched pass/fail counts
  • decisive, incorrect, missed, and uncertain judgment totals
  • the top failing categories across both matched and unmatched evaluation
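The fields above suggest a summary shape roughly like the following. This is an illustrative sketch only; the key names and nesting are assumptions, not the tool's exact schema.

```python
# Hypothetical shape of summary.json, shown as a Python dict.
summary = {
    "rule": {"id": "rule-123", "name": "Example rule"},   # rule metadata
    "mode": "rematch",                                    # or "evaluate-only"
    "totalHighlightsEvaluated": 1800,
    "matched": {"passed": 10, "failed": 2},
    "unmatched": {"passed": 9, "failed": 3},
    "judgments": {"decisive": 1500, "incorrect": 90,
                  "missed": 40, "uncertain": 170},
    "topFailingCategories": ["Billing", "Onboarding"],
}
```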

Pass and Fail Rules

A category passes only when all of the following are true:

  • there is at least one decisive judgment
  • the decisive error rate is 12.5% or lower
  • the uncertainty rate is 12.5% or lower

These rules apply to both matched and unmatched evaluation. This means categories with no decisive evidence, or categories where the reviewer model stays too uncertain, are treated as failing.
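The rule can be sketched as a small predicate. How "errors" and the two rates are counted is an assumption here: the sketch treats the error rate as errors over decisive judgments and the uncertainty rate as uncertain judgments over all judgments.

```python
def category_passes(decisive: int, errors: int, uncertain: int) -> bool:
    """Pass requires decisive evidence plus both rates at or below 12.5%."""
    total = decisive + uncertain
    if decisive == 0:
        return False                       # no decisive evidence -> fail
    error_rate = errors / decisive         # assumed: errors among decisive
    uncertainty_rate = uncertain / total   # assumed: uncertain among all
    return error_rate <= 0.125 and uncertainty_rate <= 0.125
```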

Category Context

Category evaluation uses the category title and description when both exist. If the description is empty, the evaluators fall back to the title-only context instead of treating that as an error. If both title and description are empty, the fallback label is Unknown.