Zhiyuan Song
← Home · Projects

Redistricting News Harvester

Built around April 2026 with a PhD student in the economics of education: a resumable pipeline that collects public news URLs and basic metadata on redistricting, attendance-zone, and related events for U.S. K–12 districts (no full-article scraping, no policy evaluation) and merges them into the student's event table for empirical work. The logic lives mainly in Markdown prompts; the Python stdlib handles the JSONL merge and CSV pivot. Inside Claude Code, an orchestrator session dispatches per–cache-key subagents for batched web search and structured JSON output.

Highlights

  • Research-shaped I/O: input is a poolable events table (NCES district id, LEA name, state, event year, event type, …); output is a wide, analysis-ready CSV with one row per accepted article, not a loose bag of links.
  • Cost / redundancy control: Web search runs once per (district, event_year) cache key; multiple E/M/H rows share the same link set.
  • Resumable batching: append-only JSONL, completed-key log, progress log; optional partial merge on interrupt—suited to long Claude Code sessions and multi-day runs.
  • Rules in prompts: query matrix, filter pipeline, stop conditions, and HARD RULES live in prompts/subagent.md to curb satisficing (e.g. stopping before all base queries run).
  • English downstream fields: each link carries publisher source, title, date, and a one-sentence English coverage_summary—easy for mixed-language teams and manuscripts.
  • Adoption: after pilot use, the collaborator rolled the same workflow out to their USC research group for similar harvest-to-CSV tasks with strong feedback.

Technical details

Scale & inputs

Typical input CSV: 652 event rows and 31 columns; district (NCES LEA id) + event_year form the cache key—about 485 unique keys (duplicate rows for E/M/H school levels, etc.). Search consumes cleaned lea_name, state_location, and event_type (e.g. fixed vs opening); the other research columns pass through to the output.
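The cache-key dedupe above can be sketched as follows. Column names (`nces_leaid`, `event_year`, `school_level`) are assumptions for illustration; the collaborator's spec.md defines the real schema.

```python
import csv
import io

# Hypothetical miniature of the input CSV: E/M/H rows share a district-year.
SAMPLE = """nces_leaid,lea_name,state_location,event_year,event_type,school_level
0600001,Alpha USD,CA,2015,fixed,E
0600001,Alpha USD,CA,2015,fixed,M
0600001,Alpha USD,CA,2015,fixed,H
0600002,Beta SD,TX,2018,opening,E
"""

def cache_keys(rows):
    """One (district, event_year) key per web search, however many
    E/M/H rows share it; order of first appearance is preserved."""
    seen, keys = set(), []
    for r in rows:
        key = (r["nces_leaid"], r["event_year"])
        if key not in seen:
            seen.add(key)
            keys.append(key)
    return keys

rows = list(csv.DictReader(io.StringIO(SAMPLE)))
print(cache_keys(rows))  # 4 input rows collapse to 2 search keys
```

The same collapse takes the real table's 652 rows down to roughly 485 keys, so each web search is paid for once and its link set is shared across level-duplicated rows.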

Output shape

Final table: one row per accepted article—all original columns preserved in order, plus event-level fields (e.g. link_count, stop_reason, incomplete / uncertain-date flags) and article-level fields (article_index, url, source, title, published_date, source_query, coverage_summary). Appended text fields are English. Zero-hit events still produce one row with empty article fields. Authoritative column count (~42) is defined in the collaborator’s spec.md.

Query matrix (summary)

For short name D, state S, year Y, and type T, the subagent runs a fixed sequence of web queries (redistricting, attendance zone, rezoning, school boundary change, …). When T indicates an opening-style event, an extra “new school” query is included. After the base queries, a small set of extension queries may run only when intermediate link counts and marginal yield justify the long tail; the exact gating is specified in spec.md.
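A minimal sketch of that matrix, assuming the base terms listed above and an `opening` label for opening-style events (the real term list, quoting, and extension gating live in prompts/subagent.md and spec.md):

```python
BASE_TERMS = ["redistricting", "attendance zone", "rezoning",
              "school boundary change"]  # assumed base matrix

def build_queries(district, state, year, event_type):
    """Fixed base-query sequence for one (district, year) cache key;
    opening-style events get one extra 'new school' query."""
    queries = [f'"{district}" {state} {term} {year}' for term in BASE_TERMS]
    if event_type == "opening":  # hypothetical event_type label
        queries.append(f'"{district}" {state} new school {year}')
    return queries

print(build_queries("Alpha USD", "CA", 2015, "fixed"))
```

Because the sequence is fixed rather than model-improvised, the HARD RULES can demand "run the full base matrix" and a validator can count queries mechanically.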

Filter pipeline (summary)

URL normalize and dedupe across queries; keep published_date ≤ 2022-12-31 to match the research window; relevance requires district name plus boundary-related terms in title or snippet. Uncertain dates are counted, not silently dropped. In batch mode, relevance is judged primarily from search titles/snippets to avoid per-URL full fetch.
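The pipeline above can be sketched in stdlib Python. The hit-dict field names (`url`, `title`, `snippet`, `published_date`) and the boundary-term list are assumptions; the authoritative filter rules are in spec.md.

```python
from urllib.parse import urlsplit, urlunsplit
from datetime import date

CUTOFF = date(2022, 12, 31)  # research window from the spec

def normalize(url):
    """Drop query string, fragment, trailing slash, and netloc case so the
    same article surfacing under several queries collides to one URL."""
    parts = urlsplit(url)
    path = parts.path.rstrip("/")
    return urlunsplit((parts.scheme, parts.netloc.lower(), path, "", ""))

def accept(hit, district, seen):
    """Title/snippet-level relevance check, as in batch mode (no full fetch).
    Returns an accepted-link dict or None; uncertain dates are flagged."""
    url = normalize(hit["url"])
    if url in seen:
        return None                      # cross-query dedupe
    seen.add(url)
    if hit.get("published_date") and hit["published_date"] > CUTOFF:
        return None                      # outside the research window
    text = (hit.get("title", "") + " " + hit.get("snippet", "")).lower()
    boundary_terms = ("redistricting", "attendance zone", "rezon", "boundary")
    if district.lower() not in text or not any(t in text for t in boundary_terms):
        return None                      # relevance: district + boundary term
    return {"url": url, "uncertain_date": hit.get("published_date") is None}

seen = set()
hit = {"url": "https://Example.com/news/story/?utm=1",
       "title": "Alpha USD approves redistricting plan",
       "snippet": "New attendance zone maps take effect next fall.",
       "published_date": date(2021, 5, 3)}
print(accept(hit, "Alpha USD", seen))
```

Note that a missing date yields `uncertain_date: True` rather than rejection, matching the "counted, not silently dropped" rule.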

Stop reasons & quality bar

Subagents must emit a valid stop_reason from a small enumerated set. Design targets up to ~10 accepted links per key; 4 is a completeness floor, not an excuse to skip query types. HARD RULES in the subagent prompt enforce “run the full base matrix” and forbid early stop on link count alone.
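A sketch of how that contract can be checked mechanically on merge. The enum values and caps here are hypothetical stand-ins; the real set lives in prompts/subagent.md and spec.md.

```python
# Hypothetical stop_reason enum and caps for illustration only.
STOP_REASONS = {"queries_exhausted", "link_cap_reached",
                "no_results", "extension_not_justified"}
MAX_LINKS, FLOOR = 10, 4

def check_record(rec):
    """Return a list of contract violations for one subagent JSON record
    (empty list means the record passes)."""
    errors = []
    if rec.get("stop_reason") not in STOP_REASONS:
        errors.append(f"invalid stop_reason: {rec.get('stop_reason')!r}")
    n = len(rec.get("links", []))
    if n > MAX_LINKS:
        errors.append(f"{n} links exceeds cap of {MAX_LINKS}")
    if rec.get("stop_reason") == "link_cap_reached" and n < FLOOR:
        errors.append("link_cap_reached claimed with fewer than floor links")
    return errors

print(check_record({"stop_reason": "no_results", "links": []}))  # passes: []
```

Checks like these let the orchestrator reject a satisficing subagent output before it reaches the JSONL, without a separate validator agent.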

Artifacts & scripts

Runtime writes state/results.jsonl, state/completed_keys.txt, state/progress.log; batch steps may stage state/partial/*.json before merge. A stdlib Python script joins JSONL back to the source CSV for the final wide file.
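The join step can be sketched as below. Field names are illustrative; the authoritative ~42-column layout is in the collaborator's spec.md.

```python
import json

def merge_rows(source_rows, jsonl_lines):
    """Join per-key JSONL results back onto the source CSV rows:
    one output row per accepted article, and one row with empty
    article fields for zero-hit keys."""
    results = {}
    for line in jsonl_lines:
        rec = json.loads(line)
        results[(rec["nces_leaid"], rec["event_year"])] = rec
    out = []
    for row in source_rows:
        rec = results.get((row["nces_leaid"], row["event_year"]), {})
        articles = rec.get("articles") or [{}]  # zero-hit: one empty stub
        for i, art in enumerate(articles, 1):
            out.append({**row,
                        "link_count": rec.get("link_count", 0),
                        "article_index": i if art else "",
                        "url": art.get("url", ""),
                        "coverage_summary": art.get("coverage_summary", "")})
    return out

src = [{"nces_leaid": "0600001", "event_year": "2015"}]
line = json.dumps({"nces_leaid": "0600001", "event_year": "2015",
                   "link_count": 1,
                   "articles": [{"url": "https://example.com/a",
                                 "coverage_summary": "District approved new zones."}]})
print(merge_rows(src, [line]))
```

Keeping this join in a stdlib-only script means the final wide file is reproducible from `state/results.jsonl` alone, with no model call in the loop.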

Architecture & phases

Orchestrator (main Claude Code session, entry: GUIDE): an optional single-event test mode runs one synchronous subagent with a verbose trace for a human sanity check. Batch mode then skips completed keys, groups rows by cache key, dispatches parallel subagents per batch, appends JSONL, merges partials, logs progress, and runs the CSV conversion at the end.
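The skip-completed-keys and batching step can be sketched as a small pure helper (the file layout matches `state/completed_keys.txt` above; the batch size of 10 is an assumption):

```python
from pathlib import Path

def load_completed(state_dir):
    """Keys already harvested; the orchestrator never re-dispatches these,
    which is what makes a multi-day run resumable."""
    path = Path(state_dir) / "completed_keys.txt"
    return set(path.read_text().splitlines()) if path.exists() else set()

def pending_batches(all_keys, completed, batch_size=10):
    """Group the remaining cache keys into dispatch-sized batches,
    preserving input order."""
    todo = [k for k in all_keys if k not in completed]
    return [todo[i:i + batch_size] for i in range(0, len(todo), batch_size)]

keys = [f"k{i}" for i in range(25)]
batches = pending_batches(keys, completed={"k0", "k1"})
print([len(b) for b in batches])  # 23 pending keys -> batches of 10, 10, 3
```

Because `completed_keys.txt` is append-only and re-read on startup, an interrupted session resumes by recomputing this list rather than by trusting in-memory state.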

No separate “validator agent”: compliance is enforced via subagent HARD RULES and a strict JSON schema. CSV writing is centralized after batch merge.

Deliverable & roles

Self-contained tree: GUIDE.md (resume procedure), spec.md, prompts/orchestrator.md, prompts/subagent.md, dispatch templates, scripts/jsonl_to_csv.py (+ merge helpers), sample inputs, state/ and outputs/ conventions. Requires Claude Code + Python 3.9+.

My focus: prompt system & HARD RULES, batch dispatch templates, cache-key dedupe strategy, JSON/CSV contract alignment, and test→batch handoff. Collaborator owns research questions, event table, and domain acceptance boundaries.

Adoption

After validating on their project, the collaborator rolled the same workflow out to their USC research group; members reuse it for similar harvest-to-CSV work with strong feedback. This page is a public capability summary—no private data paths or undisclosed samples.

Resume bullets

Co-designed a Claude Code–driven research pipeline for an economics-of-education PhD project: an orchestrator dispatches parallel WebSearch subagents per (district, year) cache key with a fixed query matrix and filter pipeline; append-only JSONL state for resume; Python stdlib merges to a one-row-per-article CSV with English metadata and coverage summaries. Rolled out to the collaborator’s USC research group for similar harvest workflows.

Key points (translated from Chinese)

Economics-of-education collaboration; Claude Code orchestrator and subagents; cache-key dedupe; resumable JSONL runs; query matrix and hard rules; English summary fields; rollout within the USC group.

Note

Full code and data remain with the collaborator; no public repo link here. If the spec evolves, their spec.md is authoritative.