Cowork for the Data Team

The Problem

Three pain points every PT data engineer knows

Building a frontier training corpus means solving hard problems that don't fit neatly into any existing toolchain.

🔍

Needle in a haystack

Thousands of datasets land on HuggingFace every week. Manually evaluating each one against your corpus design guidelines is impossible at scale. Most teams rely on reputation and gut feel.

🗺️

The corpus is a black box

Even after ingestion, answering "what domains are we weak in?" or "how does our Hindi data quality compare to frontier?" requires custom scripts and hours of analysis. There's no interactive way to explore the corpus.

⚙️

Painful, manual pipelines

Transforming data for training — applying chat templates, deduplicating, regenerating responses, creating ablation mixtures — is all bespoke scripting. Running an ablation takes days of setup, not hours.

The Solution

An agentic data team that works alongside you

Three specialized agents sharing a single catalog. Each has a different kind of authority. They work asynchronously and bring everything to you via Slack and a live dashboard.

One catalog. Three agents. Human in the loop at every decision.

Information accretes through the pipeline — nothing is overwritten. Scout's dossier, Cartographer's analysis, Processor's execution records — all queryable, forever.

Agent 1

Scout

Finds and evaluates new datasets. Checks them against your corpus design guidelines and recommends candidates with a full dossier.

Polls HuggingFace automatically every N days
Runs targeted searches on a brief ("find Tamil science writing")
Accepts direct dataset submissions
Posts approval requests to Slack with full rationale

Agent 2

Cartographer

Open-ended analyst over the entire corpus. Answers questions, runs deep analyses, identifies gaps, and drafts mixture specs for ablations.

Semantic UMAP corpus map with named clusters
Coverage heatmap across language × domain
Frontier quality comparison (LLM-as-judge)
Ablation mix designer with dataset proportions
Capability gap analysis against frontier model requirements

Agent 3

Processor

The only agent with write authority over training artifacts. Executes transformation specs and produces the curated data your training pipeline consumes.

Applies chat templates, dedup, quality filters
Regenerates responses with allowlisted models
Translates to Indic languages via IndicTrans
Produces per-sample provenance that travels into training

How it works

From raw dataset to training-ready artifact

HuggingFace / partner

→

Scout evaluates

→

Slack approval

→

Processor transforms

→

Corpus artifact

Scout discovers and profiles

Scout polls HuggingFace, applies hard gates (gated, private, stale synthetic), streams 100 representative samples, computes quality signals (FineWeb-Edu, dedup rate, language profile), and checks the dataset against your corpus design guidelines YAML. It writes a structured dossier to the shared catalog.
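
The hard-gate step described above can be sketched as a small predicate over dataset metadata. This is a minimal illustration, not Scout's actual code: the `DatasetMeta` fields, the `hard_gate` function, and the 365-day staleness threshold are all assumptions made for the example.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass
class DatasetMeta:
    """Hypothetical metadata record; the field names are assumptions."""
    dataset_id: str
    gated: bool
    private: bool
    synthetic: bool
    last_modified: date


def hard_gate(meta: DatasetMeta, today: date,
              max_synthetic_age_days: int = 365) -> Optional[str]:
    """Return a rejection reason, or None if the dataset passes every gate."""
    if meta.gated:
        return "gated"
    if meta.private:
        return "private"
    age_days = (today - meta.last_modified).days
    # Only synthetic datasets are rejected for age (assumed threshold).
    if meta.synthetic and age_days > max_synthetic_age_days:
        return "stale synthetic"
    return None
```

A dataset that returns None here would move on to sampling and quality signals; any other return value would be recorded as the rejection reason.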

Human approves in Slack or the dashboard

Scout posts a Slack card with the dossier, decision rationale, and prescribed transformation methods. You reply yes / no / explore. "Explore" opens the dataset's card in the dashboard. Both surfaces are synced in real time.
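
The yes / no / explore reply can be modeled as a three-value decision. The sketch below assumes the reply arrives as plain text; the `Decision` enum and `parse_reply` function are hypothetical names, and no Slack API is involved.

```python
from enum import Enum


class Decision(Enum):
    APPROVE = "yes"
    REJECT = "no"
    EXPLORE = "explore"


def parse_reply(text: str) -> Decision:
    """Map a thread reply to a Decision; any other text raises ValueError."""
    normalized = text.strip().lower()
    for decision in Decision:
        if normalized == decision.value:
            return decision
    raise ValueError(f"unrecognized reply: {text!r}")
```

Raising on anything other than the three accepted words keeps an ambiguous reply from being silently treated as an approval.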

Cartographer tracks and analyzes the corpus

The dashboard shows a live UMAP of all samples colored by semantic cluster, a language × domain coverage heatmap, and the Cartographer chat sidebar. Ask any open-ended question — "what capabilities are we missing?", "design a 3-arm ablation to test Indic reasoning" — and get a grounded answer.

Processor executes the transformation chain

On approval, Processor runs the prescribed operations in order (for example dedup, quality filter, response regeneration, chat template) and writes the output artifact to the corpus with a full execution record. Each sample carries its provenance into training, enabling data-aware ablations later.
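
A transformation chain with an execution record can be sketched as follows. This is an illustration under stated assumptions: the two toy steps, the `run_chain` function, and the shape of the record and provenance dictionaries are invented for the example and are not Processor's real handlers or schema.

```python
from typing import Callable, Dict, List, Tuple

Sample = Dict[str, object]
Step = Tuple[str, Callable[[List[Sample]], List[Sample]]]


def exact_dedup(samples: List[Sample]) -> List[Sample]:
    """Keep the first occurrence of each 'text' value (exact match only)."""
    seen, kept = set(), []
    for s in samples:
        if s["text"] not in seen:
            seen.add(s["text"])
            kept.append(s)
    return kept


def min_length_filter(samples: List[Sample], min_chars: int = 5) -> List[Sample]:
    """Toy quality filter: drop samples shorter than min_chars characters."""
    return [s for s in samples if len(str(s["text"])) >= min_chars]


def run_chain(samples: List[Sample], steps: List[Step], source: str):
    """Run steps in order; return output samples and a per-step record."""
    record = []
    for name, fn in steps:
        rows_in = len(samples)
        samples = fn(samples)
        record.append({"step": name, "rows_in": rows_in, "rows_out": len(samples)})
    applied = [name for name, _ in steps]
    # Attach provenance to every surviving sample.
    out = [{**s, "provenance": {"source": source, "steps": applied}} for s in samples]
    return out, record
```

The per-step row counts are what would let a reader see how many rows each filter dropped.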

What we built

Live demo surfaces

Everything shown below is real and running today against our actual training corpus data.

🗺️

Semantic corpus map

350+ samples embedded with all-MiniLM-L6-v2 and projected with UMAP. Each dot is one user prompt, colored by named cluster. The clusters (LeetCode algorithms, competitive programming, creative writing, historical newspaper articles) are named by Claude, not by hand. The map can also be colored by data type, language, PT/SFT stage, reasoning traces, and more.

💬

Cartographer chat (streaming)

Ask any question about the corpus. Claude uses 8 tools — fetch samples, classify by domain, compare to frontier, design ablation mixes, find capability gaps — and streams the answer progressively as it reasons. Multi-model: Claude (Vertex) for code/reasoning, Seed (Ark) for multilingual analysis.

📊

Coverage heatmap + metrics

Language × category token heatmap shows exactly where the corpus is sparse. Six metric cards surface PT/SFT readiness, turn count distribution, token length percentiles, % reasoning traces, and context window fit. All live from the catalog.
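
The heatmap boils down to summing tokens per language × category cell and flagging the cells that fall under a threshold. A minimal sketch, assuming each sample is a dictionary with `language`, `category`, and `tokens` keys (these key names and both function names are assumptions):

```python
from collections import defaultdict
from typing import Dict, List, Tuple

Cell = Tuple[str, str]


def coverage_matrix(samples: List[dict]) -> Dict[Cell, int]:
    """Sum token counts per (language, category) cell."""
    cells: Dict[Cell, int] = defaultdict(int)
    for s in samples:
        cells[(s["language"], s["category"])] += s["tokens"]
    return dict(cells)


def sparse_cells(cells: Dict[Cell, int], languages: List[str],
                 categories: List[str], min_tokens: int) -> List[Cell]:
    """List every cell below min_tokens, including cells with no data at all."""
    return [
        (lang, cat)
        for lang in languages
        for cat in categories
        if cells.get((lang, cat), 0) < min_tokens
    ]
```

Passing the full list of languages and categories matters: a cell with zero samples never appears in the aggregated counts, so it has to be enumerated explicitly to show up as sparse.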

🔬

Sample viewer with filtering

Browse the actual curated training data with free-text search, dataset filter, and PT/SFT stage filter. See the raw question, ground truth, source, difficulty, and token count for every sample. Click to expand.

⚙️

Curator pipeline provenance

Every curated dataset shows its full Curator pipeline: which filters ran, how many rows were dropped at each step, which tokenizer was used, and the final PT/SFT token counts. Token counts use Sarvam's in-house tokenizer (sarvam2-tokenizer-in22_un6).

🔔

Slack approval flow

Scout and Processor post Block Kit approval cards. Reply yes / no / explore in thread. The dashboard's pending-approvals page syncs bidirectionally, so you can approve from either surface. A full feedback log of rejected candidates feeds into V2 guideline updates.

Design principles

Three rules we didn't break

📚

Information accretes, nothing is overwritten

Scout's dossier, Cartographer's analysis, Processor's execution record: all attached to every dataset, queryable forever. Full provenance from discovery to training.
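
The accretion rule can be pictured as an append-only log per dataset. The sketch below is an in-memory illustration only; the `Record` and `Catalog` classes and their fields are hypothetical and say nothing about how the real catalog is stored.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass(frozen=True)
class Record:
    """One immutable entry written by an agent (hypothetical shape)."""
    agent: str
    kind: str
    payload: dict


@dataclass
class Catalog:
    """Append-only store: records can be added and read, never replaced."""
    _entries: Dict[str, List[Record]] = field(default_factory=dict)

    def append(self, dataset_id: str, record: Record) -> None:
        self._entries.setdefault(dataset_id, []).append(record)

    def history(self, dataset_id: str) -> List[Record]:
        # Return a copy so callers cannot mutate the stored history.
        return list(self._entries.get(dataset_id, []))
```

There is deliberately no update or delete method, which is the whole point of the design principle.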

⚖️

Each agent has one kind of authority

Scout: inbound decision. Cartographer: analytical read-only. Processor: write authority on artifacts. Mixing these creates routing ambiguity and failure modes that are harder to diagnose.

🧑‍💻

Human in the loop at every consequential decision

Agents don't auto-commit data. Every acceptance, rejection, and transformation chain requires explicit human approval before anything is written to the training corpus.

What's next

V2: Learning and extensibility

The MVP gets the loop working. V2 closes it — the system learns from your team's decisions and integrates with your existing data pipelines.

🔁

Human feedback loop

Every time a human rejects a Scout recommendation or overrides a transformation suggestion, that signal goes into a structured feedback log. In V2, Cartographer reads this log and proposes updates to the corpus design guidelines — which a human reviews and commits from the dashboard. Over time, Scout's acceptance criteria and transformation prescriptions drift toward your team's actual taste, not a static YAML.

Example rejection signal

"Rejected synthetic-distil-gpt4-v3 — GPT-4 outputs, not allowlisted. Add rule: reject any synthetic dataset where generator_model is not in our allowlist."

Cartographer's proposed guideline update

Adds generator_model_allowlist: [claude-opus-4-7, sarvam-2-9b] to the corpus guidelines YAML, surfaces it for human commit.
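
Once committed, a rule like this could be checked against a dossier roughly as follows. This is a sketch under assumptions: the guidelines are taken to be already loaded into a dictionary, and the `synthetic` and `generator_model` dossier keys and the function name are hypothetical.

```python
def violates_generator_allowlist(dossier: dict, guidelines: dict) -> bool:
    """True if a synthetic dataset's generator model is not on the allowlist."""
    if not dossier.get("synthetic", False):
        return False  # the rule only applies to synthetic datasets
    allowlist = guidelines.get("generator_model_allowlist", [])
    return dossier.get("generator_model") not in allowlist
```

In this sketch a synthetic dataset with no recorded generator model is also treated as a violation, since an unknown generator cannot be shown to be allowlisted.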

🔌

Internal pipeline registry

Today Processor has a fixed set of transformation handlers. In V2, any internal data pipeline can register itself as a transformation handler by implementing a simple interface. The corpus design guidelines can then prescribe any registered handler by ID — no code changes required to add a new pipeline.
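
One way such a registry could look is a dictionary of handlers keyed by ID, filled in by a decorator. The V2 interface is not specified here, so everything in this sketch is an assumption: the `Handler` signature, `register_handler`, `resolve`, and the toy handler body.

```python
from typing import Callable, Dict, List

Handler = Callable[[List[dict]], List[dict]]
_REGISTRY: Dict[str, Handler] = {}


def register_handler(handler_id: str):
    """Decorator that registers a pipeline under an ID (hypothetical API)."""
    def decorator(fn: Handler) -> Handler:
        if handler_id in _REGISTRY:
            raise ValueError(f"handler already registered: {handler_id!r}")
        _REGISTRY[handler_id] = fn
        return fn
    return decorator


def resolve(handler_id: str) -> Handler:
    """Look up a handler by the ID used in the guidelines."""
    try:
        return _REGISTRY[handler_id]
    except KeyError:
        raise KeyError(f"no handler registered for {handler_id!r}") from None


@register_handler("extract_prompts_only")
def extract_prompts_only(samples: List[dict]) -> List[dict]:
    """Toy handler: keep only the prompt field of each sample."""
    return [{"prompt": s["prompt"]} for s in samples]
```

With this shape, adding a pipeline means decorating one function. The code that walks a prescription list only ever calls `resolve`.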

Example: web search pipeline

Our corpus guidelines require that any dataset containing web-search-style prompts must be processed through our web search regeneration pipeline before SFT use — because the value of this data is our retriever's behaviour, not the upstream model's response. The guidelines express this once:

          domain_rules:
            - domain: web_search
              training_stage: sft
              transformation_prescriptions:
                - id: extract_prompts_only
                - id: regenerate_via_web_search_pipeline
                - id: apply_chat_template
              rationale: |
                Web search SFT data MUST go through our retrieval
                pipeline. Upstream responses are discarded.

When Scout finds any dataset that Cartographer classifies as web_search domain, this rule fires automatically — Scout prescribes the chain, Processor invokes regenerate_via_web_search_pipeline as a registered handler, and the upstream responses are replaced with outputs from our retriever.
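
The lookup that makes the rule "fire" can be sketched as a match on domain and training stage. To stay dependency-free, the example mirrors the YAML above as a Python dictionary instead of parsing it; the `prescribe` function is a hypothetical name.

```python
from typing import List

# Mirrors the guidelines rule shown above, already loaded into a dict.
GUIDELINES = {
    "domain_rules": [
        {
            "domain": "web_search",
            "training_stage": "sft",
            "transformation_prescriptions": [
                {"id": "extract_prompts_only"},
                {"id": "regenerate_via_web_search_pipeline"},
                {"id": "apply_chat_template"},
            ],
        }
    ]
}


def prescribe(guidelines: dict, domain: str, training_stage: str) -> List[str]:
    """Return the ordered handler IDs for the first matching rule, else []."""
    for rule in guidelines.get("domain_rules", []):
        if rule["domain"] == domain and rule["training_stage"] == training_stage:
            return [p["id"] for p in rule["transformation_prescriptions"]]
    return []
```

The returned list preserves the order written in the guidelines, which is the order the chain would run in.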

🌐

Cross-dataset dedup at pipeline scale

MinHash-LSH across the full corpus (not just within a dataset). Any new candidate is checked against everything already curated before Processor runs.
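
For readers unfamiliar with the technique, here is a small standard-library sketch of MinHash signatures with LSH banding. It is illustrative only and not the production design: the 3-word shingles, 64 hash functions, 16 bands, and all names are assumptions, and a pipeline-scale system would more likely use a dedicated library.

```python
import hashlib
from collections import defaultdict
from typing import Set, Tuple


def shingles(text: str, k: int = 3) -> Set[str]:
    """Set of k-word shingles; texts shorter than k words yield one shingle."""
    words = text.lower().split()
    if len(words) < k:
        return {" ".join(words)}
    return {" ".join(words[i:i + k]) for i in range(len(words) - k + 1)}


def _hash(shingle: str, seed: int) -> int:
    # A salted BLAKE2b digest stands in for one random hash function per seed.
    digest = hashlib.blake2b(shingle.encode("utf-8"), digest_size=8,
                             salt=seed.to_bytes(8, "little")).digest()
    return int.from_bytes(digest, "big")


def minhash(text: str, num_perm: int = 64) -> Tuple[int, ...]:
    """Signature: the minimum hash of the shingle set under each seed."""
    sh = shingles(text)
    return tuple(min(_hash(s, seed) for s in sh) for seed in range(num_perm))


class LSHIndex:
    """Band the signature; documents sharing any full band become candidates."""

    def __init__(self, num_perm: int = 64, bands: int = 16):
        if num_perm % bands != 0:
            raise ValueError("num_perm must be divisible by bands")
        self.num_perm, self.bands = num_perm, bands
        self.rows = num_perm // bands
        self.buckets = defaultdict(set)

    def _keys(self, sig):
        for b in range(self.bands):
            yield (b, sig[b * self.rows:(b + 1) * self.rows])

    def add(self, doc_id: str, text: str) -> None:
        for key in self._keys(minhash(text, self.num_perm)):
            self.buckets[key].add(doc_id)

    def candidates(self, text: str) -> Set[str]:
        found: Set[str] = set()
        for key in self._keys(minhash(text, self.num_perm)):
            found |= self.buckets.get(key, set())
        return found
```

A new candidate would be queried with `candidates` before being added. The result is a set of likely near-duplicates to verify, not a final verdict, because LSH trades some false positives and negatives for speed.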

🧪

Ablation execution (not just planning)

Cartographer's mixture specs become actual training runs. Processor prepares each arm, and results feed back into the catalog so future ablations learn from past ones.
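
Preparing an arm means turning dataset proportions into concrete sample counts. A minimal sketch using largest-remainder rounding, assuming a mixture spec is a mapping from dataset name to proportion (the spec shape and the `allocate` function are assumptions):

```python
import math
from typing import Dict


def allocate(proportions: Dict[str, float], total: int) -> Dict[str, int]:
    """Split `total` samples across datasets so the counts sum exactly to total."""
    if not math.isclose(sum(proportions.values()), 1.0, abs_tol=1e-9):
        raise ValueError("proportions must sum to 1")
    raw = {name: p * total for name, p in proportions.items()}
    counts = {name: int(value) for name, value in raw.items()}  # floor
    leftover = total - sum(counts.values())
    # Hand the remaining samples to the largest fractional remainders.
    by_remainder = sorted(raw, key=lambda n: raw[n] - counts[n], reverse=True)
    for name in by_remainder[:leftover]:
        counts[name] += 1
    return counts
```

Rounding each proportion independently can leave an arm a few samples over or under budget; distributing the leftover explicitly avoids that.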

📈

Automatic guideline refinement

After each RL/SFT training run, Cartographer correlates data mix against eval results and proposes targeted guideline updates — closing the loop from training outcome to data policy.
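
As a toy version of "correlates data mix against eval results", the sketch below computes a Pearson correlation between one dataset's share of the mix and an eval score across runs. The function name and the framing are assumptions. With only a handful of runs a correlation like this is a hint for a human to investigate, not evidence of cause.

```python
import math
from typing import Sequence


def pearson(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Pearson correlation coefficient between two equal-length series."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two series of equal length >= 2")
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant series")
    return cov / math.sqrt(var_x * var_y)
```

For example, `xs` could hold the proportion of one domain in each run's mix and `ys` the matching eval score for those runs.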
