Curating Training Data for LLMs

A practical guide to building training datasets for each stage of LLM development.


I published a reference guide on training data curation for LLMs. I wanted a quick reference for myself because the space changes so rapidly as new approaches to RL get published. I don't think I can even keep up with it without an AI agent to help.

The guide covers the full pipeline: pre-training corpora, fine-tuning pairs, preference rankings, RL trajectories, and safety data.

data-guide.usagentix.com

Details on the guide:

The guide organizes training data along three axes:

  • Stage: what phase of training the data is for (pre-training, fine-tuning, preference learning, RL, safety)
  • Format: the shape of each example (document chunks, prompt-response pairs, ranked answers, action trajectories)
  • Behavior: what the model should learn (knowledge, helpfulness, safety, tool use, planning, self-correction)
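
As a rough illustration of how the three axes could be used to tag and look up datasets, here is a minimal Python sketch. The class names, the `find_tags` helper, and the example dataset names are all hypothetical and are not taken from the guide; only the axis values come from the list above.

```python
from dataclasses import dataclass
from enum import Enum


class Stage(Enum):
    PRE_TRAINING = "pre-training"
    FINE_TUNING = "fine-tuning"
    PREFERENCE = "preference learning"
    RL = "RL"
    SAFETY = "safety"


class Format(Enum):
    DOCUMENT_CHUNK = "document chunk"
    PROMPT_RESPONSE = "prompt-response pair"
    RANKED_ANSWERS = "ranked answers"
    ACTION_TRAJECTORY = "action trajectory"


class Behavior(Enum):
    KNOWLEDGE = "knowledge"
    HELPFULNESS = "helpfulness"
    SAFETY = "safety"
    TOOL_USE = "tool use"
    PLANNING = "planning"
    SELF_CORRECTION = "self-correction"


@dataclass(frozen=True)
class DatasetTag:
    """One dataset described along the three axes (hypothetical schema)."""
    name: str
    stage: Stage
    fmt: Format
    behaviors: frozenset  # a dataset can target more than one behavior


def find_tags(tags, stage=None, behavior=None):
    """Return the tags matching the given stage and/or behavior."""
    return [
        t for t in tags
        if (stage is None or t.stage is stage)
        and (behavior is None or behavior in t.behaviors)
    ]


# Made-up dataset names, for illustration only.
EXAMPLES = [
    DatasetTag("web-crawl-subset", Stage.PRE_TRAINING, Format.DOCUMENT_CHUNK,
               frozenset({Behavior.KNOWLEDGE})),
    DatasetTag("support-chat-pairs", Stage.FINE_TUNING, Format.PROMPT_RESPONSE,
               frozenset({Behavior.HELPFULNESS})),
    DatasetTag("ranked-refusals", Stage.PREFERENCE, Format.RANKED_ANSWERS,
               frozenset({Behavior.SAFETY, Behavior.HELPFULNESS})),
    DatasetTag("tool-call-rollouts", Stage.RL, Format.ACTION_TRAJECTORY,
               frozenset({Behavior.TOOL_USE, Behavior.PLANNING})),
]

if __name__ == "__main__":
    for tag in find_tags(EXAMPLES, behavior=Behavior.HELPFULNESS):
        print(tag.name, "|", tag.stage.value, "|", tag.fmt.value)
```

Keeping the axes as separate fields means a single dataset can be found either by the stage it feeds or by the behavior it is meant to teach, which is roughly the kind of lookup a behavior-to-dataset mapping would support.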

What's in it

11 sections covering the full lifecycle:

  • Practical guidance for getting started
  • The training pipeline and how stages connect
  • Deep dives into pre-training, fine-tuning, preference/RL, and safety data
  • A behavior-to-dataset mapping
  • Reference tables