Scale & inputs
Typical input CSV: roughly 652 event rows and 31 columns.
district (NCES LEA id) + event_year form the cache key, yielding about 485 unique keys (duplicates
come from separate E/M/H-level rows, etc.). Search consumes the cleaned lea_name, state_location, and event_type
(e.g. fixed vs. opening); the remaining research columns pass through to the output untouched.
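The key-building step above can be sketched as follows. This is a minimal illustration, not the pipeline's code: the column names district and event_year are taken from the summary, and the sample rows are invented.

```python
import csv
import io

def cache_keys(rows):
    """Collect unique (district, event_year) cache keys in first-seen order.

    Duplicate event rows (e.g. separate E/M/H school-level rows for one
    event) collapse to a single key.
    """
    seen = []
    for row in rows:
        key = (row["district"], row["event_year"])
        if key not in seen:
            seen.append(key)
    return seen

sample = io.StringIO(
    "district,event_year,lea_name\n"
    "0601234,2015,Example USD\n"
    "0601234,2015,Example USD\n"   # duplicate row (another school level)
    "0605678,2018,Other USD\n"
)
keys = cache_keys(csv.DictReader(sample))
# keys == [("0601234", "2015"), ("0605678", "2018")]
```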
Output shape
Final table: one row per accepted article. All original columns are preserved in order, plus event-level fields
(e.g. link_count, stop_reason, incomplete / uncertain-date flags) and article-level fields
(article_index, url, source, title, published_date,
source_query, coverage_summary). Appended text fields are in English. Zero-hit events
still produce one row with empty article fields. The authoritative column count (~42) is defined in the collaborator's
spec.md.
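The row-expansion rule above can be sketched like this. A hedged illustration: the article field names come from the summary, while the function name and dict shapes are assumptions.

```python
# Article-level field names taken from the output-shape summary.
ARTICLE_FIELDS = ["article_index", "url", "source", "title",
                  "published_date", "source_query", "coverage_summary"]

def output_rows(event_row, event_fields, articles):
    """One output row per accepted article; a zero-hit event still yields
    one row with empty article fields.

    event_row: original CSV columns; event_fields: link_count, stop_reason,
    flags, etc.; articles: list of dicts with ARTICLE_FIELDS keys.
    """
    base = {**event_row, **event_fields}
    if not articles:
        return [{**base, **{f: "" for f in ARTICLE_FIELDS}}]
    return [{**base, **article} for article in articles]
```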
Query matrix (summary)
For short name D, state S, year Y, and event type T, the subagent runs a fixed
sequence of web queries (redistricting, attendance zone, rezoning, school boundary change, …). When T indicates an
opening-style event, an extra "new school" query is included. After the base queries, a small set of extension
queries may run, but only when intermediate link counts and marginal yield justify chasing the long tail; the exact gating is in the spec.
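A sketch of the base-matrix construction, under stated assumptions: the term list is abbreviated, and the query template and the "open" substring check are illustrative, not the spec's wording.

```python
# Illustrative subset; the full, fixed term sequence lives in the spec.
BASE_TERMS = ["redistricting", "attendance zone", "rezoning",
              "school boundary change"]

def build_queries(short_name, state, year, event_type):
    """Fixed base query sequence for one (district, year) key.

    Opening-style events get one extra "new school" query appended
    after the base matrix.
    """
    queries = [f'"{short_name}" {state} {term} {year}' for term in BASE_TERMS]
    if "open" in event_type.lower():
        queries.append(f'"{short_name}" {state} new school {year}')
    return queries
```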
Filter pipeline (summary)
URLs are normalized and deduplicated across queries; only published_date ≤ 2022-12-31 is kept, matching the research window;
relevance requires the district name plus boundary-related terms in the title or snippet. Uncertain dates are counted, not silently
dropped. In batch mode, relevance is judged primarily from search titles and snippets, avoiding a full fetch per URL.
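The filter chain can be sketched as below. Assumptions are flagged inline: the normalization rules, the boundary-term list, and the hit-dict shape are all illustrative; the real rules are in the spec. Uncertain (missing) dates pass through here rather than being dropped, matching the counting behavior described above.

```python
from urllib.parse import urlsplit, urlunsplit

CUTOFF = "2022-12-31"   # research window; ISO dates compare lexicographically
BOUNDARY_TERMS = ("boundary", "rezoning", "redistricting", "attendance zone")

def normalize(url):
    """Lowercase the host and strip fragment + trailing slash so the same
    page found by different queries collapses to one entry."""
    s = urlsplit(url)
    return urlunsplit((s.scheme, s.netloc.lower(), s.path.rstrip("/"), s.query, ""))

def keep(hit, district_name, seen):
    """Dedupe, date-window, and title/snippet relevance checks (a sketch;
    exact gating is in the spec). A missing published_date is let through
    to be flagged downstream, not silently dropped."""
    url = normalize(hit["url"])
    if url in seen:
        return False
    seen.add(url)
    date = hit.get("published_date")
    if date and date > CUTOFF:
        return False
    text = (hit.get("title", "") + " " + hit.get("snippet", "")).lower()
    return district_name.lower() in text and any(t in text for t in BOUNDARY_TERMS)
```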
Stop reasons & quality bar
Subagents must emit a valid stop_reason from a small enumerated set. The design targets up to ~10
accepted links per key; 4 is a completeness floor, not an excuse to skip query types. HARD
RULES in the subagent prompt enforce "run the full base matrix" and forbid stopping early on link count alone.
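One way to enforce that quality bar at merge time is a small validator like the sketch below. The stop_reason values here are hypothetical placeholders; the real enumeration is defined in the subagent prompt.

```python
# Hypothetical enumeration; the actual set is defined in the subagent prompt.
STOP_REASONS = {"matrix_exhausted", "link_target_reached",
                "extension_gated_off", "no_results"}
MAX_LINKS = 10   # design target per key

def validate_result(stop_reason, link_count, base_queries_run, base_total):
    """Reject results with an unknown stop_reason, an incomplete base
    matrix (the HARD RULE), or an out-of-range link count."""
    if stop_reason not in STOP_REASONS:
        return False
    if base_queries_run < base_total:
        return False
    return 0 <= link_count <= MAX_LINKS
```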
Artifacts & scripts
The runtime writes state/results.jsonl, state/completed_keys.txt, and state/progress.log; batch
steps may stage state/partial/*.json before the merge. A stdlib Python script joins the JSONL back to the source CSV to produce the
final wide file.
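The join step can be sketched with the stdlib alone, as below. Hedged heavily: the key column names, the JSONL record shape, and the left-join semantics are assumptions from this summary, not the actual script.

```python
import csv
import io
import json

def join_results(csv_text, jsonl_text):
    """Left-join results.jsonl records onto the source CSV rows by the
    (district, event_year) cache key.

    Events with no matching record keep only their original columns
    (filling in empty article fields is left to the final writer). Key
    types must agree: here both sides carry years as strings.
    """
    results = {}
    for line in jsonl_text.splitlines():
        rec = json.loads(line)
        key = (rec["district"], rec["event_year"])
        results.setdefault(key, []).append(rec)
    out = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        key = (row["district"], row["event_year"])
        for rec in results.get(key, [{}]):
            extras = {k: v for k, v in rec.items()
                      if k not in ("district", "event_year")}
            out.append({**row, **extras})
    return out
```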