# Evaluate Matching Coverage
The `evaluate-matching-coverage` maintenance tool checks coverage gaps for a tagging or cluster rule.
It loads highlights through the rule's `ai_highlights_search` query, finds the highlights that are still unmatched for the rule, and evaluates whether those unmatched highlights should have been linked to any of the rule's categories. This tool is focused on missed opportunities and coverage loss, not the quality of existing matches.
## How It Runs
The tool has two modes:
- Default mode reuses the current rule-relevant assignments already stored on the fetched highlights.
- `--rematch` reruns matching first, then evaluates coverage against those fresh assignments instead of the stored ones.
In both modes, highlights are fetched through the same semantic-search path. The difference is only whether assignment data is reused or recomputed.
`--rematch` does not rerun matching on every fetched highlight. It first takes a deterministic sample of up to 2,000 highlights using a stable ID-hash strategy, then runs the `assign-categories` graph on that sample. This makes rematch runs repeatable for the same input set.
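The exact hash function is not specified in this document; the sketch below shows one stable ID-hash strategy that produces the same sample for the same input set regardless of input order or process seed. The function name `stable_sample` is illustrative, not the tool's actual API.

```python
import hashlib


def stable_sample(ids, limit=2000):
    """Deterministically sample up to `limit` IDs by ranking each ID on a
    stable hash of its value. The ranking depends only on the ID bytes, so
    reruns over the same input set pick the same highlights."""
    def rank(highlight_id):
        # md5 of the ID string yields an identical ordering on every run
        return hashlib.md5(str(highlight_id).encode("utf-8")).hexdigest()

    return sorted(ids, key=rank)[:limit]


sample = stable_sample([f"hl-{i}" for i in range(5000)])
print(len(sample))  # 2000
```

Because the ordering is derived from the IDs themselves, shuffling the fetched highlights does not change which ones are rematched.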
After assignments are resolved, the tool evaluates unmatched highlights only. For each category, it checks a deterministic sample of currently unassigned highlights, up to 450 per category, to determine whether the matcher missed a reasonable assignment opportunity.
The evaluation samples are deterministic, not random. Within each category, the tool processes judge batches with bounded parallelism to improve throughput while keeping report contents and pass/fail rules unchanged.
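The bounded-parallelism claim can be sketched with a thread pool: batches for one category run concurrently up to a worker cap, but results are collected in batch order, so report contents are unaffected. The names `judge_batches` and `judge_fn` are hypothetical stand-ins for the tool's internals.

```python
from concurrent.futures import ThreadPoolExecutor


def judge_batches(batches, judge_fn, max_workers=4):
    """Run judge calls for one category's batches with bounded parallelism.

    ThreadPoolExecutor.map preserves input order regardless of which batch
    finishes first, so downstream report rows stay deterministic."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(judge_fn, batches))


verdicts = judge_batches([[1, 2], [3], [4, 5]], judge_fn=lambda b: [x * 10 for x in b])
print(verdicts)  # [[10, 20], [30], [40, 50]]
```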
## What It Reports
The tool writes a coverage-focused report bundle with:
- `*.coverage.summary.csv`
- `*.coverage.highlights.csv`
- `*.judge-failures.jsonl` when judge failures occur
- a machine-readable `summary.json`
The CSVs contain per-category coverage rollups plus highlight-level verdicts and evidence for missed-match judgments.
The optional `judge-failures.jsonl` file is a debug artifact for cases that later appear as incomplete or missing verdicts in the highlight CSVs. Each line records the unmatched evaluation phase, category, failure type, and a short message. Failure types include missing insights arrays, omitted highlight rows, invalid structured rows, duplicate rows, and unknown returned IDs.
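Since each line is a standalone JSON object, the artifact is easy to inspect with a few lines of Python. The field name `failure_type` below is an assumption about the schema, inferred from the description above rather than confirmed.

```python
import json
from collections import Counter


def load_judge_failures(path):
    """Read a judge-failures.jsonl debug artifact: one JSON object per
    non-empty line, each describing a failed or incomplete verdict."""
    failures = []
    with open(path, encoding="utf-8") as fh:
        for line in fh:
            line = line.strip()
            if line:
                failures.append(json.loads(line))
    return failures


def failure_type_counts(failures):
    # Tally by failure type (field name assumed) to spot the dominant issue
    return Counter(f.get("failure_type", "unknown") for f in failures)
```

A quick `failure_type_counts(load_judge_failures(path))` makes it obvious whether missing verdicts come from, say, duplicate rows versus unknown returned IDs.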
The JSON summary is intended for monitoring and automation. It includes:
- rule metadata and run mode (`evaluate-only` or `rematch`)
- total highlights evaluated
- coverage pass/fail counts
- decisive, missed, correctly-unmatched, and uncertain judgment totals
- the top failing categories for coverage evaluation
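For monitoring, the summary can be folded into simple alert rules. The exact field names in `summary.json` are not documented here, so the keys used below (`coverage`, `failed`, `top_failing_categories`) are assumptions for illustration only.

```python
def coverage_alerts(summary):
    """Turn a parsed summary.json dict into human-readable alert strings.
    Field names are assumed, not confirmed by the tool's docs."""
    alerts = []
    coverage = summary.get("coverage", {})
    if coverage.get("failed", 0) > 0:
        alerts.append(f"{coverage['failed']} categories failed coverage")
    for category in summary.get("top_failing_categories", []):
        alerts.append(f"failing category: {category}")
    return alerts


example = {
    "mode": "rematch",
    "coverage": {"passed": 8, "failed": 2},
    "top_failing_categories": ["billing", "onboarding"],
}
print(coverage_alerts(example))
```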
## Pass and Fail Rules
A category passes only when all of the following are true:
- there is at least one decisive judgment
- the decisive missed rate is 12.5% or lower
- the uncertainty rate is 12.5% or lower
Categories with no decisive evidence, or categories where the reviewer model stays too uncertain, are treated as failing.
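The three rules compose into a single predicate. The denominators below are assumptions: the missed rate is taken over decisive judgments only, and the uncertainty rate over all judged highlights, which matches the rules above but is not spelled out in this document.

```python
def category_passes(decisive_missed, decisive_unmatched, uncertain, threshold=0.125):
    """Apply the coverage pass rules to one category's judgment counts:
    at least one decisive judgment, decisive missed rate <= 12.5%, and
    uncertainty rate <= 12.5%. Denominator choices are assumptions."""
    decisive = decisive_missed + decisive_unmatched
    if decisive == 0:
        return False  # no decisive evidence -> fail

    total_judged = decisive + uncertain
    missed_rate = decisive_missed / decisive
    uncertainty_rate = uncertain / total_judged
    return missed_rate <= threshold and uncertainty_rate <= threshold
```

For example, 1 missed out of 16 decisive judgments with 1 uncertain passes (both rates under 12.5%), while 3 missed out of 16 fails on the missed rate alone.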
## Category Context
Category evaluation uses the category title and description when both exist. If the description is empty, the evaluator falls back to the title-only context instead of treating that as an error. If both title and description are empty, the fallback label is Unknown.
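The fallback chain is straightforward to express in code. The concrete `"title: description"` joining format is an assumption for illustration; only the fallback order and the `Unknown` label come from the description above.

```python
def category_context(title, description):
    """Build evaluator context for a category, falling back as described:
    title + description, then title only, then the label "Unknown".
    The join format is an assumed detail."""
    title = (title or "").strip()
    description = (description or "").strip()
    if title and description:
        return f"{title}: {description}"
    if title:
        return title
    return "Unknown"
```

So an empty description is a normal, non-error case: the evaluator simply judges against the title alone.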
For coverage evaluation, a highlight is counted as a missed match when it meaningfully fits the category and should reasonably have been assigned to it. Highlights that are truly out of scope for the category remain correctly unmatched, and ambiguous cases remain uncertain.
For assigned-highlight quality checks, use `evaluate-matching`.