
Evaluate Matching Reports

The evaluate-matching maintenance tool checks how well a tagging or cluster rule is working on highlights loaded from the rule's ai_highlights_search query.

How It Runs

The tool has two modes:

  • Default mode reuses the current rule-relevant assignments already stored on the fetched highlights.
  • --rematch reruns matching first, then evaluates those fresh assignments instead of the stored ones.

In both modes, highlights are fetched through the same semantic-search path. The difference is only whether assignment data is reused or recomputed.

By default, the tool evaluates both matched and unmatched highlights. Pass --skipUnmatchedEvaluation to skip the unmatched evaluation step and omit the unmatched CSV reports.

--rematch does not rerun matching on every fetched highlight. It first takes a deterministic sample of up to 2,000 highlights using a stable ID-hash strategy, then runs the assign-categories graph on that sample. This makes rematch runs repeatable for the same input set.
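A stable ID-hash sample can be sketched as follows. This is a minimal illustration of the idea, not the tool's actual implementation: the function name `stable_sample`, the SHA-256 choice, and the sort-by-hash strategy are all assumptions.

```python
import hashlib

def stable_sample(ids: list[str], limit: int = 2000) -> list[str]:
    """Pick up to `limit` IDs deterministically via a stable ID hash."""
    def key(hid: str) -> str:
        # Hashing the ID gives an ordering that is independent of input order.
        return hashlib.sha256(hid.encode("utf-8")).hexdigest()
    return sorted(ids, key=key)[:limit]
```

Because the selection depends only on the IDs themselves, rerunning over the same input set yields the same sample regardless of fetch order.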

After assignments are resolved, the tool runs two evaluations (unless --skipUnmatchedEvaluation disables the second):

  • Matched evaluation: for each category, it checks a deterministic sample of assigned highlights, up to 450 per category.
  • Unmatched evaluation: for each category, it checks a deterministic sample of currently unassigned highlights, up to 450 per category, to look for missed matches.

The evaluation samples are deterministic, not random. Within each category, the tool processes judge batches with bounded parallelism to improve throughput while keeping report contents and pass/fail rules unchanged.
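Bounded parallelism over judge batches can be sketched with a semaphore. The helper names, batch size, and concurrency limit below are illustrative assumptions; only the shape of the technique (capped in-flight batches, order-preserving collection) reflects the description above.

```python
import asyncio

async def judge_batch(batch: list[str]) -> list[str]:
    await asyncio.sleep(0)  # stand-in for a model call
    return [f"verdict:{h}" for h in batch]

async def run_batches(highlights: list[str], batch_size: int = 25,
                      max_concurrency: int = 4) -> list[str]:
    sem = asyncio.Semaphore(max_concurrency)
    batches = [highlights[i:i + batch_size]
               for i in range(0, len(highlights), batch_size)]

    async def bounded(batch: list[str]) -> list[str]:
        async with sem:  # at most `max_concurrency` batches in flight
            return await judge_batch(batch)

    # gather preserves batch order, so report contents stay stable
    # even though batches run concurrently.
    results = await asyncio.gather(*(bounded(b) for b in batches))
    return [v for chunk in results for v in chunk]
```

The key property is that concurrency only changes throughput: results come back in a fixed order, so the reports and pass/fail rules are unaffected.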

What It Reports

The tool writes a report bundle with:

  • *.matched.summary.csv
  • *.matched.highlights.csv
  • *.matched.methods.csv
  • *.unmatched.summary.csv
  • *.unmatched.highlights.csv
  • *.judge-failures.jsonl when judge failures occur
  • a machine-readable summary.json

The CSVs contain per-category rollups plus highlight-level verdicts and evidence. The matched-methods report breaks matched results down by assignment method (llm, similarity, or keyword) when that metadata exists.
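A per-method rollup of this kind can be sketched with a counter. The row shape and field names (`method`, `verdict`) are assumptions about the CSV contents, not the tool's exact schema.

```python
from collections import Counter

def method_rollup(rows: list[dict]) -> dict[str, Counter]:
    """Count verdicts per assignment method (llm / similarity / keyword)."""
    out: dict[str, Counter] = {}
    for row in rows:
        # Rows without method metadata are grouped under a fallback bucket.
        method = row.get("method") or "unknown"
        out.setdefault(method, Counter())[row["verdict"]] += 1
    return out
```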

The optional judge-failures.jsonl file is a debug artifact for cases that later appear as missing_verdict in the highlight CSVs. Each line records a durable failure event, including the evaluation phase, category, failure type, and a short message. Failure types include missing insights arrays, omitted highlight rows, invalid structured rows, duplicate rows, and unknown returned IDs.
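A failure event of this shape can be serialized as one JSON line. The key names below mirror the fields described above (phase, category, failure type, message) but are hypothetical; the file's exact schema is not documented here.

```python
import json

def failure_line(phase: str, category: str,
                 failure_type: str, message: str) -> str:
    """Render one judge-failure event as a JSONL line."""
    event = {
        "phase": phase,                # e.g. "matched" or "unmatched"
        "category": category,
        "failureType": failure_type,   # e.g. "duplicate_row", "unknown_id"
        "message": message,            # short human-readable detail
    }
    return json.dumps(event)
```

One event per line keeps the file appendable and easy to scan with line-oriented tools when chasing down `missing_verdict` rows.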

This file is only written when judge failures occur. If you run with --skipUnmatchedEvaluation, unmatched-related judge failures are not collected.

The JSON summary is intended for monitoring and automation. It includes:

  • rule metadata and run mode (evaluate-only or rematch)
  • total highlights evaluated
  • matched and unmatched pass/fail counts
  • decisive, incorrect, missed, and uncertain judgment totals
  • the top failing categories across both matched and unmatched evaluation
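The fields above suggest a summary shape roughly like the following. This is an illustrative sketch only; the key names and nesting are assumptions, not the tool's exact schema.

```python
# Hypothetical shape of summary.json, shown as a Python dict.
summary = {
    "rule": {"id": "rule-123", "name": "Example rule"},   # rule metadata
    "mode": "rematch",                                    # or "evaluate-only"
    "totalHighlightsEvaluated": 1800,
    "matched": {"passed": 10, "failed": 2},
    "unmatched": {"passed": 9, "failed": 3},
    "judgments": {"decisive": 1500, "incorrect": 90,
                  "missed": 40, "uncertain": 170},
    "topFailingCategories": ["Billing", "Onboarding"],
}
```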

Pass and Fail Rules

A category passes only when all of the following are true:

  • there is at least one decisive judgment
  • the decisive error rate is 12.5% or lower
  • the uncertainty rate is 12.5% or lower

These rules apply to both matched and unmatched evaluation. This means categories with no decisive evidence, or categories where the reviewer model stays too uncertain, are treated as failing.
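The rule can be sketched as a small predicate. How "errors" and the two rates are counted is an assumption here: the sketch treats the error rate as errors over decisive judgments and the uncertainty rate as uncertain judgments over all judgments.

```python
def category_passes(decisive: int, errors: int, uncertain: int) -> bool:
    """Pass requires decisive evidence plus both rates at or below 12.5%."""
    total = decisive + uncertain
    if decisive == 0:
        return False                       # no decisive evidence -> fail
    error_rate = errors / decisive         # assumed: errors among decisive
    uncertainty_rate = uncertain / total   # assumed: uncertain among all
    return error_rate <= 0.125 and uncertainty_rate <= 0.125
```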

Category Context

Category evaluation uses the category title and description when both exist. If the description is empty, the evaluators fall back to the title-only context instead of treating that as an error. If both title and description are empty, the fallback label is Unknown.