How to certify an internal AI training dataset

An internal AI training dataset is a living part of the product: it contains data, choices, exclusions, cleaning processes, versions, annotations, and often also "distilled data" generated by other models. Documenting it well ensures you know what was used, when, where it came from, and in what form. Before training, retraining, or sharing the model, create an organised package of the dataset and certify its key versions.

1. How it usually happens

In a company, a dataset almost always starts out looking less elegant than the pitch deck describes. Initially, there’s a shared folder, some CRM exports, CSV files generated by an ERP, historical PDFs, support tickets, emails, call transcripts, product catalogues, technical manuals, images, application logs, internal documents, and maybe an Excel sheet named training_good_data_final_v2_really.xlsx. The journey begins there.

To train or improve an AI system, you can use very different materials: raw data, clean data, Q&A examples, classified texts, labelled images, annotated conversations, human-corrected outputs, synthetic data, negative examples, edge cases, benchmarks, prompts, evaluation rubrics, and configuration files. In some SaaS projects, user interaction logs, production feedback, closed tickets, technical documentation, and internal knowledge bases are also used. In the AI world, the dataset is not just "the fuel": it's often also the instruction manual, the logbook, and the drawer where someone tossed the IKEA hex keys.

Then comes the cleaning phase. Duplicates are removed, errors corrected, formats normalised, sensitive information redacted, and data split into training, validation, and test sets. This is where many companies lose their memory: they remember the final file, but can no longer explain how they got there. Who excluded those 3,000 records? Why was one category merged? Where do those labels come from? Who decided certain examples were the "gold standard"?
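One way to keep that memory is to have the cleaning step write down each removal as it happens. The sketch below is only an illustration: the record layout (`id`, `text`), the `Exclusion` fields, and the normalisation rule are assumptions, not a prescribed format.

```python
import hashlib
import json
from dataclasses import dataclass, asdict


@dataclass
class Exclusion:
    record_id: str
    reason: str       # here always "duplicate_of:<id of the record that was kept>"
    decided_by: str   # person or script responsible (hypothetical field)


def deduplicate(records, decided_by):
    """Drop records whose normalised text was already seen and log each removal."""
    seen = {}         # text fingerprint -> id of the first record with that text
    kept, log = [], []
    for rec in records:
        # Assumed normalisation: lower-case and collapse whitespace.
        normalised = " ".join(rec["text"].lower().split())
        key = hashlib.sha256(normalised.encode("utf-8")).hexdigest()
        if key in seen:
            log.append(Exclusion(rec["id"], f"duplicate_of:{seen[key]}", decided_by))
        else:
            seen[key] = rec["id"]
            kept.append(rec)
    return kept, log


# Made-up records, for illustration only.
records = [
    {"id": "t-001", "text": "Printer does not start"},
    {"id": "t-002", "text": "printer  does not START"},
    {"id": "t-003", "text": "Invoice is missing"},
]
kept, log = deduplicate(records, decided_by="dedup.py")

# One JSON object per excluded record, ready to be saved next to the clean data.
exclusions_jsonl = "\n".join(json.dumps(asdict(e)) for e in log)
```

Stored alongside the clean dataset, a log like this answers "who excluded those records, and why?" without relying on anyone's recollection.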

With language models and generative systems, "distillates" often enter the scene. In practice, outputs, explanations, classifications, or examples produced by a larger model, an older system, or an automated pipeline are used to train, refine, or evaluate a smaller or more specialised model. It's like having a brilliant student take notes and then using those notes to teach the whole class. It works, but you need to know who wrote them, with which model, with which prompts, on which inputs, with which filters, and with what human oversight.

Distilled data deserves special attention because it may look "new" while actually deriving from earlier sources and processes. If an internal dataset contains synthetic answers, summaries, automatic classifications, exported embeddings, LLM-generated examples, automatic translations, or rankings produced by another system, each of these must be clearly documented. You must distinguish between original data, transformed data, human-annotated data, and model-generated or enriched data. Otherwise, six months down the line, no one will know whether that perfect answer was written by an internal expert, an external model, a highly motivated intern, or an Excel macro gone rogue.
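A simple way to keep the four kinds of data apart is to tag every record with its origin and refuse records that carry no tag. The sketch below assumes each record is a dictionary with an `origin` key; the key name and the label strings are illustrative choices, not a standard.

```python
from collections import Counter
from enum import Enum


class Origin(str, Enum):
    ORIGINAL = "original"
    TRANSFORMED = "transformed"
    HUMAN_ANNOTATED = "human_annotated"
    MODEL_GENERATED = "model_generated"


def origin_report(records):
    """Count records per origin.

    Raises KeyError if a record has no "origin" key and ValueError if the
    label is not one of the four known values.
    """
    counts = Counter()
    for rec in records:
        counts[Origin(rec["origin"])] += 1
    return {origin.value: counts[origin] for origin in Origin}
```

Running a report like this before each snapshot gives a quick figure for how much of the dataset is model-generated, which is useful to include in the README.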

There is also a less obvious angle: the dataset documents the company's operational culture. Support tickets show how the company talks to clients; manuals show what it considers important; operator corrections show which mistakes are tolerated and which aren't. Training a model on these materials means transferring part of the organisation's way of working. Certifying a dataset version is therefore not only useful in disputes: it also helps you govern how the product evolves.

2. What you need to prove

The point is to prove the existence of a certain dataset version, with specific content, at a given time. It is not enough to say "we used our internal data": you must be able to reconstruct which data, in what format, with what transformations, and how it links to the trained model.

It can be useful to prove:

  • the existence of the dataset in a specific version;
  • the content of the files used for training, validation, and testing;
  • the internal or external origin of the sources;
  • the transformations applied to the raw data;
  • the presence of synthetic or distilled data;
  • the model, prompts, or pipeline used to generate any distillates;
  • human annotations and revisions carried out;
  • the documentation of exclusions and cleaning;
  • the version linked to a specific training or release;
  • internal communications approving the use of the dataset;
  • the dataset's state before a modification, dispute, or retraining.

The practical goal is to accurately answer questions like: "which dataset produced this version of the model?", "were these examples already present before?", "was this data added later?", "where did the distilled data come from?".
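Questions such as "were these examples already present before?" become easier to answer if each version is stored with per-record fingerprints. The following is a minimal sketch, assuming records are JSON-serialisable dictionaries; the canonical form used here (sorted keys, compact separators, UTF-8) is one reasonable choice among several, and the function names are hypothetical.

```python
import hashlib
import json


def record_fingerprint(record):
    """SHA-256 of a canonical JSON form, so that key order does not matter."""
    canonical = json.dumps(record, sort_keys=True, separators=(",", ":"), ensure_ascii=False)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def fingerprint_index(records):
    """Set of fingerprints for one dataset version."""
    return {record_fingerprint(r) for r in records}


def added_since(old_records, new_records):
    """Records in the new version whose fingerprint is absent from the old one."""
    old = fingerprint_index(old_records)
    return [r for r in new_records if record_fingerprint(r) not in old]
```

The comparison is only as trustworthy as the older version it relies on, which is why the index (or the files it was built from) should be part of the certified package.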

3. What to collect

Prepare a documentation package that records the dataset's history in a technically verifiable way. Don't limit yourself to the final file: also keep context, sources, steps, and criteria.

Collect:

  • raw dataset, if preservable;
  • clean dataset in the version used for training;
  • training, validation, and test splits;
  • original CSV, JSONL, Parquet, TXT files, images, audio, or documents;
  • data card or descriptive dataset sheet;
  • README detailing scope, structure, fields, and format;
  • cleaning, normalisation, and deduplication scripts;
  • notebooks used for analysis or preparation;
  • data processing pipeline logs;
  • list of internal and external sources;
  • inclusion and exclusion criteria;
  • anonymisation or minimisation policies;
  • annotation files and guidelines for annotators;
  • quality control examples;
  • reports on duplicates, errors, known biases, or imbalanced categories;
  • prompts used to generate synthetic or distilled data;
  • outputs generated by models and their corresponding inputs;
  • name and version of models used for any distillates;
  • training configurations linked to the dataset;
  • emails, chats, or internal approvals regarding data usage;
  • contracts, licences, or permissions linked to sources;
  • screenshots of dashboards, repositories, or data management systems;
  • short videos showing the package structure and content, when useful.

For distilled data, create a separate section. Clearly indicate which files are generated, what process they derive from, what inputs they used, which model produced them, whether there was human review, and how they were filtered. A distillate without context is like an unlabelled sauce in the fridge: it might be delicious, but no one wants to take responsibility for it during an important dinner.
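The context listed above can be captured in a small provenance entry kept next to the generated files, with a check that nothing was left blank. This is a sketch under assumptions: the field names, file paths, and model name are placeholders, not an established schema.

```python
# Assumed field names; adapt them to whatever the team already records.
REQUIRED_FIELDS = (
    "generated_files",    # which files are generated
    "generator_model",    # which model produced them
    "generator_version",
    "prompt_file",        # the process they derive from
    "input_files",        # what inputs were used
    "filters",            # how outputs were filtered
    "human_review",       # whether a person reviewed them
)


def missing_provenance(entry):
    """Return the required fields that are absent or empty in a provenance entry."""
    return [f for f in REQUIRED_FIELDS if entry.get(f) in (None, "", [], {})]


# Placeholder values, for illustration only.
entry = {
    "generated_files": ["distilled/answers_v1.jsonl"],
    "generator_model": "teacher-model",
    "generator_version": "2026-04",
    "prompt_file": "prompts/answer_generation.txt",
    "input_files": ["clean/tickets.jsonl"],
    "filters": ["dropped answers shorter than 20 characters"],
    "human_review": False,   # an explicit "no" still counts as an answer
}
```

Note that `human_review: False` passes the check: the goal is to force an explicit statement, not to require that review took place.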

4. How to proceed

Start by creating an organised snapshot of the dataset before training. Assign a clear version name, for example Customer_Support_AI_Dataset_v1.2_2026-05-01. Inside that folder, you must have the data, documentation, scripts, and an explanation readable even by those who didn't write the pipeline.
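A naming convention is easier to keep if a script builds and checks the names. The sketch below assumes the pattern suggested by the example above, that is `<Base>_v<major>.<minor>_<YYYY-MM-DD>`; the helper names are hypothetical.

```python
import re
from datetime import date

# Assumed convention: <Base>_v<major>.<minor>_<YYYY-MM-DD>
VERSION_NAME = re.compile(r"^[A-Za-z0-9]+(?:_[A-Za-z0-9]+)*_v\d+\.\d+_\d{4}-\d{2}-\d{2}$")


def build_version_name(base, major, minor, snapshot_date):
    """Assemble a version name from its parts."""
    return f"{base}_v{major}.{minor}_{snapshot_date.isoformat()}"


def is_valid_version_name(name):
    """True if the name follows the convention and ends with a real calendar date."""
    if not VERSION_NAME.match(name):
        return False
    try:
        date.fromisoformat(name.rsplit("_", 1)[1])
    except ValueError:
        return False
    return True


name = build_version_name("Customer_Support_AI_Dataset", 1, 2, date(2026, 5, 1))
```

Here `name` is `Customer_Support_AI_Dataset_v1.2_2026-05-01`, while a name such as `dataset_ok.zip` is rejected by `is_valid_version_name`.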

Write a simple README: what the dataset contains, where the data comes from, what it's for, what fields it includes, what transformations were applied, what parts are generated or distilled, and what known limitations exist. This file becomes the treasure map. Without a map, after three sprints, even the team that created the dataset starts looking at it as if a lost civilisation produced it.

Practical procedure:

  • identify the dataset that will be used for training;
  • separate raw data, clean data, annotated data, and distilled data;
  • assign a consistent version name;
  • prepare a descriptive README;
  • keep pipeline scripts, prompts, configurations, and logs;
  • document sources, licences, authorisations, and limitations;
  • indicate which data was excluded and why;
  • create a ZIP archive of the dataset version;
  • create a second package for technical documentation, if the dataset is very large;
  • certify the main files and reference packages;
  • link the dataset version to the trained model version;
  • store secure copies in controlled environments.
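The archiving steps in this procedure can be sketched with the Python standard library: compute a SHA-256 manifest of the snapshot folder, then write the manifest and a ZIP archive side by side. The function names and the manifest layout are assumptions for illustration; the certification step itself is outside the scope of this sketch.

```python
import hashlib
import json
import zipfile
from pathlib import Path


def sha256_file(path, chunk_size=1 << 20):
    """Hash a file in chunks so large datasets do not need to fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(root):
    """Map each file's POSIX-style relative path to its size and SHA-256."""
    root = Path(root)
    manifest = {}
    for path in sorted(p for p in root.rglob("*") if p.is_file()):
        rel = path.relative_to(root).as_posix()
        manifest[rel] = {"bytes": path.stat().st_size, "sha256": sha256_file(path)}
    return manifest


def package_version(root, out_dir):
    """Write <name>.manifest.json and <name>.zip for the snapshot folder `root`.

    `out_dir` should live outside `root` so that later runs do not archive
    earlier outputs.
    """
    root, out_dir = Path(root), Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    manifest = build_manifest(root)
    manifest_path = out_dir / f"{root.name}.manifest.json"
    manifest_path.write_text(json.dumps(manifest, indent=2, sort_keys=True), encoding="utf-8")
    zip_path = out_dir / f"{root.name}.zip"
    with zipfile.ZipFile(zip_path, "w", zipfile.ZIP_DEFLATED) as zf:
        for rel in manifest:
            zf.write(root / rel, arcname=rel)
    return manifest_path, zip_path
```

For very large datasets, the manifest alone is a compact stand-in for the full archive: it is small enough to certify directly and still pins down the content of every file.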

When the dataset is updated, repeat the process. Avoid overwriting the previous version. In machine learning, "we just added a few examples" can change metrics, behaviour, and operational liabilities more than you might think.
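To see exactly what "a few examples" changed, two versions can be compared file by file. This sketch assumes each version has a manifest shaped as a dictionary from relative path to an entry containing a `sha256` value; that layout is an assumption, not a fixed format.

```python
def diff_manifests(old, new):
    """Compare two {relative_path: {"sha256": ...}} manifests."""
    old_paths, new_paths = set(old), set(new)
    return {
        "added": sorted(new_paths - old_paths),
        "removed": sorted(old_paths - new_paths),
        "changed": sorted(
            p for p in old_paths & new_paths
            if old[p]["sha256"] != new[p]["sha256"]
        ),
    }
```

Saving the result next to the new version gives a change log that does not depend on commit messages or on memory.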

5. Mistakes to avoid

The most common mistake is treating the dataset as temporary material. In reality, it is a core component of the AI system. If it is lost, modified, or mixed up with other versions, explaining what was actually used becomes very difficult.

Common mistakes:

  • certifying only the final model and forgetting the dataset;
  • storing the dataset without a README;
  • mixing raw, clean, synthetic, and distilled data in the same folder;
  • failing to indicate the prompts used to generate synthetic data;
  • failing to record which model produced the distillates;
  • modifying already certified files;
  • using vague names like dataset_ok.zip or training_new_final.csv;
  • forgetting scripts, configurations, and logs;
  • not documenting exclusions, filters, and deduplications;
  • ignoring licences, authorisations, and internal policies;
  • including personal or sensitive data without adequate assessment;
  • failing to link dataset, model, benchmarks, and release.
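A few of these mistakes can be caught automatically before a snapshot is certified. The sketch below checks a list of relative paths for a missing README, vague file names, and data files placed outside dedicated folders. The folder layout, the list of vague words, and the file suffixes are all assumptions to adapt to the project.

```python
import re
from pathlib import PurePosixPath

# Assumed conventions, for illustration only.
VAGUE_WORDS = re.compile(
    r"(?:^|[_\-.])(?:ok|new|final|old|really|maybe|tmp)(?=[_\-.]|$)", re.IGNORECASE
)
DATA_FOLDERS = {"raw", "clean", "annotated", "distilled"}
DATA_SUFFIXES = {".csv", ".jsonl", ".parquet"}


def lint_snapshot(relative_paths):
    """Return human-readable warnings for a list of POSIX-style relative paths."""
    paths = [PurePosixPath(p) for p in relative_paths]
    warnings = []
    has_readme = any(
        len(p.parts) == 1 and p.name.lower().startswith("readme") for p in paths
    )
    if not has_readme:
        warnings.append("no README at the top level")
    for p in paths:
        if VAGUE_WORDS.search(p.name):
            warnings.append(f"vague name: {p}")
        if p.suffix.lower() in DATA_SUFFIXES and p.parts[0] not in DATA_FOLDERS:
            warnings.append(f"data file outside raw/clean/annotated/distilled: {p}")
    return warnings
```

A check like this cannot judge licences or personal data, so it complements the human review rather than replacing it.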

Besides technical certification, good governance practices are needed: access control, versioning, change logs, source review, licence management, data minimisation, and periodic audits of datasets used in production. Free certification is useful because it allows you to quickly lock down a major dataset version without slowing down every training cycle.

6. After the documentation

After documenting the dataset, formally link it to the model's lifecycle. Every significant training run should indicate which dataset it used, with which configuration, what metrics it produced, and which model version was released.
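One lightweight way to make that link is an append-only log with one JSON line per training run. The sketch below is an assumption about what such a record could contain; the field names are hypothetical and the values passed to it in practice would come from the team's own pipeline.

```python
import json
from dataclasses import dataclass, asdict, field


@dataclass(frozen=True)
class TrainingRunRecord:
    run_id: str
    dataset_version: str            # e.g. the snapshot folder name
    dataset_manifest_sha256: str    # hash of the certified manifest file
    config_file: str
    model_version: str
    metrics: dict = field(default_factory=dict)


def to_json_line(record):
    """Serialise a run record as one line for an append-only log file."""
    return json.dumps(asdict(record), sort_keys=True)


def dataset_for_model(json_lines, model_version):
    """Return the dataset version recorded for a model version, or None."""
    for line in json_lines:
        entry = json.loads(line)
        if entry["model_version"] == model_version:
            return entry["dataset_version"]
    return None
```

With such a log, "which dataset produced this version of the model?" becomes a lookup instead of an investigation.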

Involve the relevant stakeholders: AI team, product managers, security, compliance, data governance, and management, plus external consultants when the project has commercial, contractual, or regulatory impacts. For datasets involving external sources, personal data, user-generated content, or distillates from third-party models, a review carried out in advance is often much cheaper than reconstructing things after a problem arises.

If the dataset has already been used and documentation is missing, create a retrospective package anyway: gather what exists, describe the gaps, separate verified materials from reconstructed ones, and lock down the current state. In the AI world, perfect memory is rare; organised documentation, even started today, is vastly better than a folder named old_training_maybe_important.