# Evaluate Matching Coverage
The `evaluate-matching-coverage` maintenance tool checks coverage gaps for a tagging or cluster rule.
It loads highlights through the rule's `ai_highlights_search` query, finds the highlights that are still unmatched for the rule, and evaluates whether those unmatched highlights should have been linked to any of the rule's categories. This tool is focused on missed opportunities and coverage loss, not the quality of existing matches.
## How It Runs
The tool has two modes:
- Default mode reuses the current rule-relevant assignments already stored on the fetched highlights.
- `--rematch` reruns matching first, then evaluates coverage against those fresh assignments instead of the stored ones.
In both modes, highlights are fetched through the same semantic-search path. The difference is only whether assignment data is reused or recomputed.
`--rematch` does not rerun matching on every fetched highlight. It first takes a deterministic sample of up to 2,000 highlights using a stable ID-hash strategy, then runs the `assign-categories` graph on that sample. This makes rematch runs repeatable for the same input set.
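The exact hash function is not specified in this document; the sketch below shows one stable ID-hash strategy that produces the same sample for the same input set regardless of input order or process seed. The function name `stable_sample` is illustrative, not the tool's actual API.

```python
import hashlib


def stable_sample(ids, limit=2000):
    """Deterministically sample up to `limit` IDs by ranking each ID on a
    stable hash of its value. The ranking depends only on the ID bytes, so
    reruns over the same input set pick the same highlights."""
    def rank(highlight_id):
        # md5 of the ID string yields an identical ordering on every run
        return hashlib.md5(str(highlight_id).encode("utf-8")).hexdigest()

    return sorted(ids, key=rank)[:limit]


sample = stable_sample([f"hl-{i}" for i in range(5000)])
print(len(sample))  # 2000
```

Because the ordering is derived from the IDs themselves, shuffling the fetched highlights does not change which ones are rematched.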
After assignments are resolved, the tool evaluates unmatched highlights only. For each category, it checks a deterministic sample of currently unassigned highlights, up to 450 per category, to determine whether the matcher missed a reasonable assignment opportunity.
The evaluation samples are deterministic, not random. Within each category, the tool processes judge batches with bounded parallelism to improve throughput while keeping report contents and pass/fail rules unchanged.
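The bounded-parallelism claim can be sketched with a thread pool: batches for one category run concurrently up to a worker cap, but results are collected in batch order, so report contents are unaffected. The names `judge_batches` and `judge_fn` are hypothetical stand-ins for the tool's internals.

```python
from concurrent.futures import ThreadPoolExecutor


def judge_batches(batches, judge_fn, max_workers=4):
    """Run judge calls for one category's batches with bounded parallelism.

    ThreadPoolExecutor.map preserves input order regardless of which batch
    finishes first, so downstream report rows stay deterministic."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(judge_fn, batches))


verdicts = judge_batches([[1, 2], [3], [4, 5]], judge_fn=lambda b: [x * 10 for x in b])
print(verdicts)  # [[10, 20], [30], [40, 50]]
```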
## What It Reports
The tool writes a coverage-focused report bundle with:
- `*.coverage.summary.csv`
- `*.coverage.highlights.csv`
- `*.judge-failures.jsonl` when judge failures occur
- a machine-readable `summary.json`
The CSVs contain per-category coverage rollups plus highlight-level verdicts and evidence for missed-match judgments.
The optional `judge-failures.jsonl` file is a debug artifact for cases that later appear as incomplete or missing verdicts in the highlight CSVs. Each line records the unmatched evaluation phase, category, failure type, and a short message. Failure types include missing insights arrays, omitted highlight rows, invalid structured rows, duplicate rows, and unknown returned IDs.
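Since each line is a standalone JSON object, the artifact is easy to inspect with a few lines of Python. The field name `failure_type` below is an assumption about the schema, inferred from the description above rather than confirmed.

```python
import json
from collections import Counter


def load_judge_failures(path):
    """Read a judge-failures.jsonl debug artifact: one JSON object per
    non-empty line, each describing a failed or incomplete verdict."""
    failures = []
    with open(path, encoding="utf-8") as fh:
        for line in fh:
            line = line.strip()
            if line:
                failures.append(json.loads(line))
    return failures


def failure_type_counts(failures):
    # Tally by failure type (field name assumed) to spot the dominant issue
    return Counter(f.get("failure_type", "unknown") for f in failures)
```

A quick `failure_type_counts(load_judge_failures(path))` makes it obvious whether missing verdicts come from, say, duplicate rows versus unknown returned IDs.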
The JSON summary is intended for monitoring and automation. It includes:
- rule metadata and run mode (`evaluate-only` or `rematch`)
- total highlights evaluated
- coverage pass/fail counts
- decisive, missed, correctly-unmatched, and uncertain judgment totals
- the top failing categories for coverage evaluation
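For monitoring, the summary can be folded into simple alert rules. The exact field names in `summary.json` are not documented here, so the keys used below (`coverage`, `failed`, `top_failing_categories`) are assumptions for illustration only.

```python
def coverage_alerts(summary):
    """Turn a parsed summary.json dict into human-readable alert strings.
    Field names are assumed, not confirmed by the tool's docs."""
    alerts = []
    coverage = summary.get("coverage", {})
    if coverage.get("failed", 0) > 0:
        alerts.append(f"{coverage['failed']} categories failed coverage")
    for category in summary.get("top_failing_categories", []):
        alerts.append(f"failing category: {category}")
    return alerts


example = {
    "mode": "rematch",
    "coverage": {"passed": 8, "failed": 2},
    "top_failing_categories": ["billing", "onboarding"],
}
print(coverage_alerts(example))
```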
## Pass and Fail Rules
A category passes only when all of the following are true:
- there is at least one decisive judgment
- the decisive missed rate is 12.5% or lower
- the uncertainty rate is 12.5% or lower
Categories with no decisive evidence, or categories where the reviewer model stays too uncertain, are treated as failing.
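The three rules compose into a single predicate. The denominators below are assumptions: the missed rate is taken over decisive judgments only, and the uncertainty rate over all judged highlights, which matches the rules above but is not spelled out in this document.

```python
def category_passes(decisive_missed, decisive_unmatched, uncertain, threshold=0.125):
    """Apply the coverage pass rules to one category's judgment counts:
    at least one decisive judgment, decisive missed rate <= 12.5%, and
    uncertainty rate <= 12.5%. Denominator choices are assumptions."""
    decisive = decisive_missed + decisive_unmatched
    if decisive == 0:
        return False  # no decisive evidence -> fail

    total_judged = decisive + uncertain
    missed_rate = decisive_missed / decisive
    uncertainty_rate = uncertain / total_judged
    return missed_rate <= threshold and uncertainty_rate <= threshold
```

For example, 1 missed out of 16 decisive judgments with 1 uncertain passes (both rates under 12.5%), while 3 missed out of 16 fails on the missed rate alone.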
## Category Context
Category evaluation uses the category title and description when both exist. If the description is empty, the evaluator falls back to the title-only context instead of treating that as an error. If both title and description are empty, the fallback label is Unknown.
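The fallback chain is straightforward to express in code. The concrete `"title: description"` joining format is an assumption for illustration; only the fallback order and the `Unknown` label come from the description above.

```python
def category_context(title, description):
    """Build evaluator context for a category, falling back as described:
    title + description, then title only, then the label "Unknown".
    The join format is an assumed detail."""
    title = (title or "").strip()
    description = (description or "").strip()
    if title and description:
        return f"{title}: {description}"
    if title:
        return title
    return "Unknown"
```

So an empty description is a normal, non-error case: the evaluator simply judges against the title alone.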
For coverage evaluation, a highlight is counted as a missed match when it meaningfully fits the category and should reasonably have been assigned to it. Highlights that are truly out of scope for the category remain correctly unmatched, and ambiguous cases remain uncertain.
For assigned-highlight quality checks, use `evaluate-matching`.