How to protect a clean AI dataset before sharing it with third parties

A clean AI dataset is often more valuable than the raw dataset: it contains selections, corrections, normalisations, exclusions, annotations, and operational decisions. Before handing it over to a client, you should document exactly what you are sharing, in which version, and with what agreed usage limits. Prepare an organised package, secure it before it leaves your hands, and keep an identical copy.

1. How it usually happens

Usually, the client commissions "an AI-ready dataset" as if it were a file you export with a button click. In practice, that file is the result of many micro-decisions: removing duplicates, correcting formats, harmonising fields, excluding inconsistent records, anonymisation, category normalisation, adding labels, quality control, enrichment, and perhaps producing synthetic examples.

The raw dataset is often a digital attic: historical CSVs, converted PDFs, CRM exports, tickets, catalogues, logs, tables with mysterious columns, and a file named clients_new_bis_ok.xlsx that no one dares to open without coffee. The clean dataset, however, is the result of the cleanup. That's where the value is concentrated: not just in the data, but in the criteria by which they were selected, arranged, and made usable.

When the dataset is commissioned by a third party, the issue becomes delicate. The client may have provided the raw sources, while the supplier handled the cleaning, structuring, annotation, or enrichment. Or the supplier may have integrated their own taxonomies, scripts, pipelines, prompts, controls, and validation methods. After delivery, the classic question can arise: "what was included in the delivered dataset?" Right after comes the thornier question: "who may use what, and for what purposes?"

An unusual perspective: the clean dataset is also a snapshot of the supplier's professional choices. Two teams can start from the same raw data and produce very different datasets. One deletes borderline records, another flags them as edge cases; one merges categories, another separates them; one leaves free text, another turns it into structured fields. It’s a bit like seeing two chefs receive the same ingredients: in the end, both will say "pasta with tomato sauce", but one pot will taste like a corporate canteen and the other like an Italian Sunday.

Furthermore, in AI projects, you must clearly distinguish between the delivered dataset and the tools used to produce it. The client may receive the final result, while internal scripts, prompts, pipelines, checklists, and quality criteria remain separate assets, unless agreed otherwise. This distinction must be clarified before sharing, because once the file is sent it starts travelling: it gets copied, uploaded to the cloud, imported into notebooks, turned into embeddings, split into training sets, and maybe renamed by someone to dataset_final_final_this_time.csv.

2. What you need to prove

The point to prove is that, prior to sharing, a specific version of the clean dataset existed, with a certain content, structure, and accompanying documentation. You also need to reconstruct the delivery context: what was sent, to whom, when, and under what operational or contractual conditions.

It can be useful to prove:

  • the existence of the clean dataset in the delivered version;
  • the exact content of the shared files;
  • the structure of fields, columns, labels, or classes;
  • the transformations applied compared to raw data;
  • the exclusions made and the main criteria used;
  • the presence of annotated, synthetic, enriched, or derived data;
  • the technical documentation attached to the delivery;
  • the content of communications with the client;
  • the agreed terms of use, delivery, or confidentiality;
  • the distinction between the delivered dataset and internal tools used to create it;
  • the version of the package prior to sending it to third parties.

In practical terms, you must be able to state, with evidence: "this is what was prepared and shared, in this format, before it left our perimeter".
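One common way to pin down "the exact content of the shared files" is a cryptographic fingerprint. The sketch below is only an illustration, not a prescribed method: the function name sha256_of_file is hypothetical, and a hash on its own shows that content is unchanged, not when it existed, so it still needs to be paired with dated evidence such as the delivery email or a certification.

```python
import hashlib
from pathlib import Path


def sha256_of_file(path: Path, chunk_size: int = 1024 * 1024) -> str:
    """Return the SHA-256 hex digest of a file (hypothetical helper name)."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        # Read in fixed-size chunks so large datasets need not fit in memory.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()
```

Recording this value in the README or in the delivery email lets anyone later check whether a file they hold is byte-for-byte the one that was shared.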

3. What to collect

Collect both the dataset and the framework that explains it. A file without context risks looking like a simple export; a well-documented package instead tells the story of the work, choices, and boundaries of the delivery.

Collect:

  • clean dataset in the final delivered format;
  • any raw dataset received from the client, if preservable;
  • comparison files or transformation reports;
  • README describing the content;
  • data dictionary with fields, formats, labels, and meanings;
  • notes on cleaning, deduplication, and normalisation;
  • inclusion and exclusion criteria;
  • quality reports, checks, and known anomalies;
  • before/after record examples, if shareable;
  • pipeline scripts or logs, at least in documentary form;
  • prompts or procedures used for AI enrichments, when relevant;
  • list of files included in the delivery package;
  • delivery emails and read receipts;
  • exported chats with instructions, requests, or approvals;
  • contract, purchase order, quote, specifications, and annexes;
  • any usage instructions or operational limitations;
  • screenshots of the delivery folder or transfer system;
  • upload receipts, generated links, or transfer reports;
  • a ZIP copy identical to the one delivered.

If the dataset contains synthetic or AI-generated parts, separate or describe them clearly. The client must be able to understand which data comes from original sources, which was transformed, and which was generated or enriched.
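If each record carries a provenance label, a short summary of how many rows are original, transformed, or generated can go straight into the README. The following is a minimal sketch under stated assumptions: the column name "provenance" and the label values are hypothetical, and your dataset may record origin differently (separate files, a flag column, or a sidecar table).

```python
import csv
from collections import Counter
from pathlib import Path


def provenance_summary(csv_path: Path, column: str = "provenance") -> Counter:
    """Count records per provenance label in a CSV file.

    The default column name "provenance" is an assumption for this example.
    """
    counts: Counter = Counter()
    with open(csv_path, newline="", encoding="utf-8") as handle:
        reader = csv.DictReader(handle)
        if reader.fieldnames is None or column not in reader.fieldnames:
            raise ValueError(f"column {column!r} not found in {csv_path}")
        for row in reader:
            # Missing or empty labels are counted explicitly so they do not go unnoticed.
            label = (row[column] or "").strip() or "unlabelled"
            counts[label] += 1
    return counts
```

A non-zero "unlabelled" count is a useful warning sign before delivery: those are the rows nobody will be able to explain later.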

4. How to proceed

Before sharing the dataset, create a closed delivery version. Choose a clear name, with date and version number, for example Clean_Dataset_Project_X_v1.0_2026-05-01. The package must contain the final files and the essential documentation. The practical rule is simple: an external person must be able to open the folder and understand what they are looking at.

Prepare a brief but useful README: data source, dataset purpose, format, number of records, main fields, applied transformations, relevant exclusions, any synthetic or annotated data, known limitations, and reference to the specific project/order. Add a list of files contained in the package. If there are elements excluded from the delivery, such as internal scripts, proprietary prompts, or processing tools, indicate this in the related commercial or technical documentation.
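The list of files in the package can be generated rather than typed by hand. The sketch below is one possible approach, with hypothetical function names (build_manifest, manifest_as_text): it walks the package folder and records each file's relative path, size, and SHA-256 digest.

```python
import hashlib
from pathlib import Path


def build_manifest(package_dir: Path) -> list[dict]:
    """List every file in the package with its relative path, size, and SHA-256."""
    files = [p for p in package_dir.rglob("*") if p.is_file()]
    # Sort by relative path so the manifest is stable between runs.
    files.sort(key=lambda p: p.relative_to(package_dir).as_posix())
    return [
        {
            "path": p.relative_to(package_dir).as_posix(),
            "bytes": p.stat().st_size,
            # read_bytes() loads the whole file; fine for a sketch, chunk it for huge files.
            "sha256": hashlib.sha256(p.read_bytes()).hexdigest(),
        }
        for p in files
    ]


def manifest_as_text(entries: list[dict]) -> str:
    """Render the manifest as plain lines suitable for pasting into a README."""
    return "\n".join(f"{e['sha256']}  {e['bytes']:>10}  {e['path']}" for e in entries)
```

Generate the manifest before adding it to the package; otherwise the manifest file would have to list its own digest.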

Practical procedure:

  • close the dataset version to be delivered;
  • verify that the files are the correct ones;
  • separate dataset, documentation, and internal materials;
  • create a README, data dictionary, and quality report;
  • check authorisations, licences, and confidentiality constraints;
  • create a ZIP archive of the delivery package;
  • certify the dataset or complete package before sending;
  • keep an identical copy in the internal archive;
  • send the package via a trackable channel;
  • save emails, receipts, confirmations, and delivery messages;
  • internally log the date, version, recipient, and sent content.
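The archive and identical-copy steps above can be sketched as follows. This is an illustration under assumptions, not a required tool: the function names are hypothetical, and the ZIP path is assumed to sit outside the package folder so the archive does not end up containing itself.

```python
import hashlib
import shutil
import zipfile
from pathlib import Path


def create_delivery_zip(package_dir: Path, zip_path: Path) -> str:
    """Zip the package folder and return the SHA-256 of the resulting archive.

    Assumes zip_path is located outside package_dir.
    """
    files = sorted(
        (p for p in package_dir.rglob("*") if p.is_file()),
        key=lambda p: p.relative_to(package_dir).as_posix(),
    )
    with zipfile.ZipFile(zip_path, "w", compression=zipfile.ZIP_DEFLATED) as archive:
        for path in files:
            archive.write(path, arcname=path.relative_to(package_dir).as_posix())
    return hashlib.sha256(zip_path.read_bytes()).hexdigest()


def archive_identical_copy(zip_path: Path, archive_dir: Path) -> Path:
    """Copy the exact delivered ZIP into the internal archive and verify the bytes match."""
    archive_dir.mkdir(parents=True, exist_ok=True)
    copy_path = archive_dir / zip_path.name
    shutil.copy2(zip_path, copy_path)
    if copy_path.read_bytes() != zip_path.read_bytes():
        raise RuntimeError("archived copy differs from the delivered ZIP")
    return copy_path
```

Copying the delivered ZIP is safer than zipping the folder a second time: a ZIP stores file timestamps and tool-specific compression output, so a re-created archive may not reproduce the same bytes even when the contents are unchanged.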

After sending, avoid silent modifications to the same package. If the client asks for corrections or additions, create a new version. In the dataset world, "I just changed two columns" is a phrase that has caused more confusion than many 90-minute meetings.
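When a new version is created, a small structural comparison makes the "two columns" visible in the release notes. The sketch below assumes both versions are CSV files with a header row; the function names are hypothetical, and it compares only column names and row counts, not cell values.

```python
import csv
from pathlib import Path


def read_header_and_count(csv_path: Path) -> tuple[list[str], int]:
    """Return the header row and the number of non-empty data rows."""
    with open(csv_path, newline="", encoding="utf-8") as handle:
        reader = csv.reader(handle)
        header = next(reader, [])
        row_count = sum(1 for row in reader if row)
    return header, row_count


def compare_versions(old_csv: Path, new_csv: Path) -> dict:
    """Summarise structural differences between two dataset versions."""
    old_header, old_rows = read_header_and_count(old_csv)
    new_header, new_rows = read_header_and_count(new_csv)
    return {
        "added_columns": [c for c in new_header if c not in old_header],
        "removed_columns": [c for c in old_header if c not in new_header],
        "row_count_change": new_rows - old_rows,
    }
```

An empty result (no added or removed columns, zero row change) does not mean the versions are identical, since values may still differ; a file hash is the right check for that.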

5. Mistakes to avoid

The most common mistake is delivering the clean dataset as a simple attachment, without securing a version and without explaining what it contains. At that moment, the work seems done; six weeks later, when someone asks why certain rows are missing or where certain labels came from, the archaeology begins.

Common mistakes:

  • sending editable files without keeping a closed copy;
  • delivering only the dataset, without a README or data dictionary;
  • mixing the final dataset and internal processing tools;
  • failing to clarify what is excluded from the delivery;
  • not documenting transformations, deduplications, and filters;
  • using updatable cloud links without tracking the sent version;
  • sending new versions with the same file name;
  • forgetting authorisations, licences, or usage limits;
  • failing to save emails, chats, and transfer receipts;
  • underestimating synthetic, enriched, or derived data;
  • not verifying that the package received by the client is the correct one.

Besides technical certification, you also need contractual and organisational precautions: clear delivery agreements, controlled access, trackable channels, internal policies, a version log, and a process for handling subsequent requests. A free certification step is useful because it lets you lock the package before sharing, without turning every delivery into a heavy procedure.

6. After the documentation

After sharing, update the internal project log: delivered version, recipient, date, channel used, attached documents, and referenced conditions. Keep the certified copy alongside the delivery communications and contractual materials.
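The project log can be as simple as an append-only text file with one JSON record per delivery. The sketch below is one possible shape, not a standard: the function names, the field names, and the JSON Lines layout are all assumptions chosen for the example.

```python
import json
from datetime import datetime, timezone
from pathlib import Path


def append_delivery_record(log_path: Path, version: str, recipient: str,
                           channel: str, package_sha256: str,
                           documents: list[str], conditions: str) -> dict:
    """Append one delivery record to a JSON Lines log and return it."""
    record = {
        "logged_at_utc": datetime.now(timezone.utc).isoformat(timespec="seconds"),
        "version": version,
        "recipient": recipient,
        "channel": channel,
        "package_sha256": package_sha256,
        "documents": documents,
        "conditions": conditions,
    }
    # Mode "a" only ever adds lines, so earlier deliveries are never rewritten.
    with open(log_path, "a", encoding="utf-8") as handle:
        handle.write(json.dumps(record, ensure_ascii=False) + "\n")
    return record


def read_delivery_log(log_path: Path) -> list[dict]:
    """Read all delivery records back in the order they were logged."""
    with open(log_path, encoding="utf-8") as handle:
        return [json.loads(line) for line in handle if line.strip()]
```

Each new version delivered to the client becomes a new line, which keeps the delivery history readable without a dedicated tool.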

Notify the people involved in the project: technical lead, commercial contact, and management. When the dataset contains sensitive information, third-party sources, or significant usage restrictions, also involve data governance, security, and external consultants. If the client requests changes, additions, or a new export, manage it as a new version and maintain the delivery history.

If a dispute arises, avoid reconstructing everything from memory. Start from the archived package, communications, and accompanying documentation. A clean dataset delivered well isn't just an organised file: it is a delivery that can be narrated with verifiable dates, contents, and decisions.