Three AI agents that make RL and SFT data curation feel like working with a knowledgeable teammate — not running pipelines.
Building a frontier training corpus means solving hard problems that don't fit neatly into any existing toolchain.
Thousands of datasets land on HuggingFace every week. Manually evaluating each one against your corpus design guidelines is impossible at scale. Most teams rely on reputation and gut feel.
Even after ingestion, answering "what domains are we weak in?" or "how does our Hindi data quality compare to frontier?" requires custom scripts and hours of analysis. There's no interactive way to explore the corpus.
Transforming data for training — applying chat templates, deduplicating, regenerating responses, creating ablation mixtures — is all bespoke scripting. Running an ablation takes days of setup, not hours.
Three specialized agents sharing a single catalog. Each has a different kind of authority. They work asynchronously and bring everything to you via Slack and a live dashboard.
Scout finds and evaluates new datasets. It checks them against your corpus design guidelines and recommends candidates with a full dossier.
Cartographer is an open-ended analyst over the entire corpus. It answers questions, runs deep analyses, identifies gaps, and drafts mixture specs for ablations.
Processor is the only agent with write authority over training artifacts. It executes transformation specs and produces the curated data your training pipeline consumes.
Scout polls HuggingFace, applies hard gates (gated, private, stale synthetic), streams 100 representative samples, computes quality signals (FineWeb-Edu, dedup rate, language profile), and checks the dataset against your corpus design guidelines YAML. It writes a structured dossier to the shared catalog.
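As a rough illustration of the hard-gate step, here is a minimal sketch. The `DatasetMeta` fields, the `hard_gate` function, and the one-year staleness threshold are all assumptions made for the example, not Scout's actual schema or cutoff.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass
class DatasetMeta:
    """Hypothetical subset of the metadata a gate would need."""
    repo_id: str
    gated: bool
    private: bool
    synthetic: bool
    last_modified: date


# Assumed threshold: the source does not say what counts as "stale".
STALE_AFTER_DAYS = 365


def hard_gate(meta: DatasetMeta, today: date) -> Optional[str]:
    """Return a rejection reason, or None if the dataset passes every gate."""
    if meta.gated:
        return "gated"
    if meta.private:
        return "private"
    age_days = (today - meta.last_modified).days
    if meta.synthetic and age_days > STALE_AFTER_DAYS:
        return "stale synthetic"
    return None
```

Returning the reason rather than a bare boolean keeps the rejection explainable, which matters if the result ends up in a dossier.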
Scout posts a Slack card with the dossier, decision rationale, and prescribed transformation methods. You reply yes / no / explore. "Explore" opens the dataset's card in the dashboard. Both surfaces are synced in real time.
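The yes / no / explore reply could be mapped to a decision along these lines. This is a sketch only: the `Decision` enum and `parse_reply` are hypothetical names, and the rule of reading just the first word of the reply is an assumption.

```python
from enum import Enum
from typing import Optional


class Decision(Enum):
    APPROVE = "yes"
    REJECT = "no"
    EXPLORE = "explore"


def parse_reply(text: str) -> Optional[Decision]:
    """Map the first word of a reply to a Decision; None if unrecognized."""
    words = text.strip().lower().split()
    if not words:
        return None
    first = words[0].strip(".,!?")
    try:
        return Decision(first)
    except ValueError:
        return None
```

Anything that does not start with one of the three keywords is left undecided instead of being guessed at.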
The dashboard shows a live UMAP of all samples colored by semantic cluster, a language × domain coverage heatmap, and the Cartographer chat sidebar. Ask any open-ended question — "what capabilities are we missing?", "design a 3-arm ablation to test Indic reasoning" — and get a grounded answer.
On approval, Processor runs the prescribed operations in order — dedup, quality filter, chat template, response regeneration — and writes the output artifact to the corpus with a full execution record. Each sample carries its provenance into training, enabling data-aware ablations later.
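To make the ordered execution and the execution record concrete, here is a minimal sketch. The operation bodies are toy stand-ins, and `run_spec`, the record fields, and the `provenance` layout are assumptions for illustration rather than Processor's real format.

```python
from typing import Callable, Dict, List, Tuple

Sample = Dict[str, object]
Op = Callable[[List[Sample]], List[Sample]]


def dedup(samples: List[Sample]) -> List[Sample]:
    """Toy exact-match dedup on the 'text' field."""
    seen, kept = set(), []
    for s in samples:
        if s["text"] not in seen:
            seen.add(s["text"])
            kept.append(s)
    return kept


def quality_filter(samples: List[Sample]) -> List[Sample]:
    """Toy filter: drop very short texts (the 10-char cutoff is arbitrary)."""
    return [s for s in samples if len(str(s["text"])) >= 10]


def run_spec(
    samples: List[Sample], ops: List[Tuple[str, Op]], source: str
) -> Tuple[List[Sample], List[Dict[str, object]]]:
    """Apply ops in order, logging row counts, then stamp provenance."""
    record = []
    for name, op in ops:
        rows_in = len(samples)
        samples = op(samples)
        record.append({"op": name, "rows_in": rows_in, "rows_out": len(samples)})
    applied = [name for name, _ in ops]
    out = [{**s, "provenance": {"source": source, "ops": applied}} for s in samples]
    return out, record
```

Because each surviving sample carries the list of operations applied to it, a later ablation can filter or group by that field.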
Everything shown below is real and running today against our actual training corpus data.
350+ samples embedded with all-MiniLM-L6-v2 and projected with UMAP. Each dot is one user prompt, colored by named cluster. Clusters: LeetCode algorithms, competitive programming, creative writing, historical newspaper articles — named by Claude, not by hand. Color by data type, language, PT/SFT stage, reasoning traces, and more.
Ask any question about the corpus. Claude uses 8 tools, including fetching samples, classifying by domain, comparing to frontier, designing ablation mixes, and finding capability gaps, and streams the answer progressively as it reasons. Multi-model: Claude (Vertex) for code/reasoning, Seed (Ark) for multilingual analysis.
Language × category token heatmap shows exactly where the corpus is sparse. Six metric cards surface PT/SFT readiness, turn count distribution, token length percentiles, % reasoning traces, and context window fit. All live from the catalog.
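The aggregation behind a language × category token heatmap can be sketched as a simple pivot. The row fields (`language`, `category`, `tokens`) and the `min_tokens` sparsity threshold are assumptions for the example; the live dashboard reads from the catalog and may compute this differently.

```python
from collections import defaultdict
from typing import Dict, List, Tuple


def token_heatmap(rows: List[dict]) -> Dict[str, Dict[str, int]]:
    """Sum token counts into a language -> category -> tokens grid."""
    grid = defaultdict(lambda: defaultdict(int))
    for r in rows:
        grid[r["language"]][r["category"]] += r["tokens"]
    return {lang: dict(cats) for lang, cats in grid.items()}


def sparse_cells(
    grid: Dict[str, Dict[str, int]],
    languages: List[str],
    categories: List[str],
    min_tokens: int,
) -> List[Tuple[str, str]]:
    """List (language, category) cells below the token threshold."""
    return [
        (lang, cat)
        for lang in languages
        for cat in categories
        if grid.get(lang, {}).get(cat, 0) < min_tokens
    ]
```

Passing the full language and category lists explicitly means empty cells show up as sparse, instead of silently going missing from the grid.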
Browse the actual curated training data with free-text search, dataset filter, and PT/SFT stage filter. See the raw question, ground truth, source, difficulty, and token count for every sample. Click to expand.
Every curated dataset shows its full Curator pipeline: which filters ran, how many rows were dropped at each step, which tokenizer was used, and the final PT/SFT token counts. Built with Sarvam's in-house tokenizer (sarvam2-tokenizer-in22_un6).
Scout and Processor post Block Kit approval cards. Reply yes / no / explore in thread. The dashboard's pending-approvals page syncs bidirectionally — approve from either surface. Full feedback log for rejected candidates feeds into V2 guideline updates.
The MVP gets the loop working. V2 closes it — the system learns from your team's decisions and integrates with your existing data pipelines.
"synthetic-distil-gpt4-v3: GPT-4 outputs, not allowlisted. Add rule: reject any synthetic dataset where generator_model is not in our allowlist."
[...] generator_model_allowlist: [claude-opus-4-7, sarvam-2-9b] to the corpus guidelines YAML, surfaces it for human commit.
[...] web_search domain, this rule fires automatically: Scout prescribes the chain, Processor invokes regenerate_via_web_search_pipeline as a registered handler, and the upstream responses are replaced with outputs from our retriever.
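One way to picture a registered handler is a name-to-function registry that the prescribing rule refers to by name. Only the handler name `regenerate_via_web_search_pipeline` and the `web_search` domain come from the text above. The `register` decorator, the `RULES` table, `prescribe`, `execute`, and the handler body (which only marks samples for regeneration instead of calling a retriever) are hypothetical.

```python
from typing import Callable, Dict, List

Handler = Callable[[List[dict]], List[dict]]
HANDLERS: Dict[str, Handler] = {}


def register(name: str) -> Callable[[Handler], Handler]:
    """Decorator that adds a function to the handler registry under `name`."""
    def wrap(fn: Handler) -> Handler:
        HANDLERS[name] = fn
        return fn
    return wrap


@register("regenerate_via_web_search_pipeline")
def regenerate_via_web_search_pipeline(samples: List[dict]) -> List[dict]:
    # Stand-in body: a real handler would replace responses with retriever
    # output. Here the upstream response is cleared and the sample flagged.
    return [{**s, "response": None, "needs_regeneration": True} for s in samples]


# Assumed rule table: domain -> ordered chain of handler names.
RULES: Dict[str, List[str]] = {"web_search": ["regenerate_via_web_search_pipeline"]}


def prescribe(domain: str) -> List[str]:
    """Return the handler chain for a domain (empty if no rule matches)."""
    return RULES.get(domain, [])


def execute(chain: List[str], samples: List[dict]) -> List[dict]:
    """Run each named handler in order; fail loudly on unknown names."""
    for name in chain:
        if name not in HANDLERS:
            raise KeyError(f"no handler registered for {name!r}")
        samples = HANDLERS[name](samples)
    return samples
```

Keeping rules as data and handlers as code means a new rule can be added to the guidelines without touching the executor, as long as the handler it names is already registered.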