How to protect an internally trained open-source AI model

1. How it usually happens

It all starts with an open-source model: downloaded from a repository, tested locally, maybe fine-tuned "just for testing". Then the project grows. You add internal datasets, prompts, post-processing rules, filters, rankings, evaluations. The model improves and suddenly becomes part of the product.

At that point, a grey area emerges: the base model is public, but the version running in the company isn't really public anymore. It has been trained, adapted, filtered, optimised. It's like buying a standard bike and turning it piece by piece into a racing bike: the frame is the same, but the result is entirely different.

In AI projects, this transformation is often poorly tracked. People work in notebooks, scripts, repo branches, cloud environments. They try one dataset, then another. They change the learning rate, add prompts, modify pipelines. A few weeks later, no one remembers exactly which combination produced "the version that actually works".

A typical anecdote: a team fine-tunes an open-source model with internal data. The result is excellent. A few months later, a client reports strange behaviour. The team tries to reconstruct how the original model was trained... and discovers that half the files were overwritten and the notebooks have cells executed in random order. At that point, the problem isn't technical, it's archaeological.

There's also a less obvious aspect: the value is not just in the final model (the weights), but in the process. Datasets, filters, prompts, metrics, corrected errors, discarded versions. In many cases, it is precisely this combination that makes the system competitive. Protecting it means documenting not just "what came out", but also "how you got there".

Finally, open source introduces another layer: licences, terms of use, attribution obligations. Knowing which base model you used, in which version, and how you modified it is part of the protection, because it clarifies what derives from third parties and what is the fruit of your internal work.

2. What you need to prove

The point isn't to prove that the open-source model is yours, but to credibly document your trained version and the path that generated it.

It can be useful to prove:

the existence of a specific version of the trained model;
the weight files (checkpoints) in the version used or released;
the starting open-source model and its version;
the datasets used for training or fine-tuning;
the training configurations (hyperparameters, epochs, batch size, etc.);
the additional prompts, filters, or pipelines;
benchmark and evaluation results;
the technical documentation associated with the model;
internal or external communications about that version;
the sequence between experimental versions and the "stable" version;
any known limitations or behaviours that have been declared.

In practice, you must be able to say: "this is the model version we built on top of an open-source base, with these modifications and this data, at this time".
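
As a minimal sketch, that statement can be captured as one structured record per version. The class name, field names, and example values below are illustrative assumptions, not a prescribed schema; adapt them to what your team actually tracks.

```python
import json
from dataclasses import dataclass, field, asdict
from datetime import datetime, timezone


@dataclass
class ModelVersionRecord:
    """One record per significant model version (field names are illustrative)."""
    version: str                 # e.g. "1.3"
    base_model: str              # name of the open-source starting model
    base_model_revision: str     # exact version or commit of the base model
    datasets: list[str]          # names or references, not the data itself
    training_config: dict        # learning rate, epochs, batch size, ...
    known_limitations: list[str] = field(default_factory=list)
    recorded_at: str = field(
        default_factory=lambda: datetime.now(timezone.utc).isoformat()
    )

    def to_json(self) -> str:
        # sort_keys keeps the output stable, so identical records
        # serialise to identical text
        return json.dumps(asdict(self), indent=2, sort_keys=True)


record = ModelVersionRecord(
    version="1.3",
    base_model="example-open-model",        # placeholder name
    base_model_revision="rev-placeholder",  # placeholder revision
    datasets=["internal-dataset-reference"],
    training_config={"learning_rate": 2e-5, "epochs": 3, "batch_size": 16},
    known_limitations=["not evaluated on non-English input"],
)
print(record.to_json())
```

Saving this JSON next to the weights keeps the "what, on top of what, with which data, and when" in one place.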

3. What to collect

Collect both the model and everything needed to understand and reconstruct it.

Collect:

model weight files (checkpoints);
training configurations and setup files;
name and version of the starting open-source model;
fine-tuning or training scripts;
notebooks used (even if messy, it's better to save them);
datasets, or references to the datasets used;
prompts, rules, and inference pipelines;
training logs and execution outputs;
metrics, benchmarks, and evaluation reports;
technical documentation (README, release notes);
screenshots of dashboards or training tools;
short videos showing the model's behaviour;
emails and chats containing relevant technical decisions;
contracts or agreements if the model is shared with third parties;
licence file of the base open-source model;
changelog of the modifications made.

A useful detail: also keep significant intermediate versions. Not everything, but at least the key steps. Often, that is exactly where you understand what made the difference.
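
One way to make a collection like this verifiable later is to record a SHA-256 digest for every file in it. The sketch below assumes everything has been gathered under a single folder; the function names are hypothetical.

```python
import hashlib
from pathlib import Path


def sha256_of_file(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so large checkpoints need not fit in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(folder: Path) -> dict[str, str]:
    """Map each file's path (relative to folder) to its SHA-256 digest."""
    return {
        path.relative_to(folder).as_posix(): sha256_of_file(path)
        for path in sorted(folder.rglob("*"))
        if path.is_file()
    }
```

If a checkpoint or config is later altered, its digest no longer matches the manifest, which makes silent changes easy to spot.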

4. How to proceed

Treat the model like a versioned product, not like a continuous experiment. Every significant version must have its own clear identity.

Create a folder for the model version, with an explicit name, for example Custom_AI_Model_v1.3_2026-05-01. Put the weights, the configurations, and the minimum documentation needed to understand its contents inside it.
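
A small helper can keep such names consistent. This is a sketch that assumes a name_vVERSION_DATE pattern modelled on the example above; the function name and the version format it accepts are assumptions.

```python
import re
from datetime import date


def version_folder_name(model_name: str, version: str, released: date) -> str:
    """Build a name like Custom_AI_Model_v1.3_2026-05-01."""
    if not re.fullmatch(r"\d+\.\d+(\.\d+)?", version):
        raise ValueError(f"version must look like 1.3 or 1.3.0, got {version!r}")
    # Replace anything that is not a letter, digit, or underscore, so the
    # name is safe on common file systems.
    safe_name = re.sub(r"[^A-Za-z0-9_]+", "_", model_name).strip("_")
    return f"{safe_name}_v{version}_{released.isoformat()}"


print(version_folder_name("Custom AI Model", "1.3", date(2026, 5, 1)))
# prints: Custom_AI_Model_v1.3_2026-05-01
```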

Write a README that explains:

base model used;
type of training performed;
main datasets;
key configurations;
metrics obtained;
known limitations.
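
To stop a README from quietly skipping one of these points, it can be generated from a fixed list of sections. The following is only a sketch: the section keys and the plain-text layout are assumptions, not a required format.

```python
# Section keys are hypothetical; the headings mirror the list above.
README_FIELDS = [
    ("base_model", "Base model used"),
    ("training_type", "Type of training performed"),
    ("datasets", "Main datasets"),
    ("configurations", "Key configurations"),
    ("metrics", "Metrics obtained"),
    ("limitations", "Known limitations"),
]


def render_readme(title: str, values: dict[str, str]) -> str:
    """Render a plain-text README; fail loudly if a section is missing or empty."""
    missing = [key for key, _ in README_FIELDS if not values.get(key, "").strip()]
    if missing:
        raise ValueError(f"README is missing: {', '.join(missing)}")
    lines = [title, "=" * len(title), ""]
    for key, heading in README_FIELDS:
        lines += [heading, values[key].strip(), ""]
    return "\n".join(lines)
```

Raising an error on an empty section is a deliberate choice here: an incomplete README is treated as a failed step rather than a warning.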

Practical procedure:

identify the model version to protect;
collect weights, configurations, and scripts;
document the base model and modifications;
add datasets or documented references;
create a clear README;
organise everything in an orderly structure;
create a ZIP archive of the version;
certify the main package;
keep an unmodified internal copy;
link the model version to any releases or clients.
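
The archiving step of this procedure could look like the sketch below. It assumes the version folder is already complete and returns a SHA-256 digest of the archive. How the package is then certified is not specified here; the digest is just a compact fingerprint you could record or submit with whatever certification method you use.

```python
import hashlib
import zipfile
from pathlib import Path


def archive_version(folder: Path, archive_path: Path) -> str:
    """Zip `folder` into `archive_path` and return the archive's SHA-256 digest."""
    if archive_path.exists():
        raise FileExistsError(f"refusing to overwrite {archive_path}")
    with zipfile.ZipFile(archive_path, "w", compression=zipfile.ZIP_DEFLATED) as bundle:
        for path in sorted(folder.rglob("*")):
            # Skip the archive itself in case it was placed inside the folder.
            if path.is_file() and path.resolve() != archive_path.resolve():
                # Store paths relative to the folder's parent, so the archive
                # unpacks into a single, named version directory.
                bundle.write(path, path.relative_to(folder.parent).as_posix())
    digest = hashlib.sha256()
    with archive_path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()
```

Note that ZIP files embed timestamps, so re-archiving the same folder may give a different digest. The digest identifies this specific archive, which is why the unmodified copy should be kept.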

If the model is updated, create a new version. Avoid overwriting existing files. In the AI space, changing one parameter can completely alter system behaviour.
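
The "new version, never overwrite" rule can be partly enforced in code rather than left to discipline. A minimal sketch, with hypothetical function names, assuming each version lives in its own directory:

```python
import stat
from pathlib import Path


def create_version_dir(root: Path, name: str) -> Path:
    """Create a fresh directory for a new version; fail if it already exists."""
    target = root / name
    target.mkdir(parents=True, exist_ok=False)  # raises FileExistsError on reuse
    return target


def freeze(folder: Path) -> None:
    """Remove write permission from every file under `folder`."""
    for path in folder.rglob("*"):
        if path.is_file():
            mode = path.stat().st_mode
            path.chmod(mode & ~(stat.S_IWUSR | stat.S_IWGRP | stat.S_IWOTH))
```

Read-only permissions are a guard against accidents, not a security control: anyone with sufficient rights can restore write access, so this complements access control rather than replacing it.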

5. Mistakes to avoid

Many problems stem from an overly "artisanal" handling of the model.

Common mistakes:

saving only the final model without context;
forgetting which open-source model was used;
failing to keep configurations and training parameters;
losing track of the datasets used;
overwriting important checkpoints;
not documenting additional prompts or pipelines;
using unclear names like model_final2.pt;
ignoring base model licences;
not separating experiments from official versions;
not preserving logs and metrics.
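
Some of these mistakes can be caught automatically before a version is archived. The check below is a rough sketch: the required file names, the folder-name pattern, and the list of "unclear" words are all assumptions to adapt to your own conventions.

```python
import re
from pathlib import Path

# Hypothetical layout: adjust the file names to whatever your team actually uses.
REQUIRED_FILES = [
    "README.txt",
    "LICENSE_base_model.txt",
    "training_config.json",
    "CHANGELOG.txt",
]
VERSIONED_NAME = re.compile(r".+_v\d+\.\d+(\.\d+)?_\d{4}-\d{2}-\d{2}$")
UNCLEAR_WORDS = re.compile(r"final|new|latest|copy", re.IGNORECASE)


def check_version_folder(folder: Path) -> list[str]:
    """Return human-readable problems; an empty list means nothing was found."""
    problems = []
    if not VERSIONED_NAME.match(folder.name):
        problems.append(f"folder name {folder.name!r} has no version and date")
    for name in REQUIRED_FILES:
        if not (folder / name).is_file():
            problems.append(f"missing {name}")
    for path in sorted(folder.rglob("*")):
        if path.is_file() and UNCLEAR_WORDS.search(path.stem):
            problems.append(f"unclear file name: {path.name}")
    return problems
```

A check like this only covers what can be seen on disk; whether the datasets and the licence are correctly described still needs a human review.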

Besides technical certification, you also need good practices: model versioning, repository management, access control, a minimum documentation standard, and review of open-source dependencies. Free certification is useful because it lets you quickly lock down a major version without slowing down the team's work.

6. After the documentation

After documenting the model, integrate this practice into the development cycle. Every significant version should have its package and documentation.

Share the information with the technical, product, and management teams, so everyone knows which version is in use. If the model is distributed to clients or partners, prepare dedicated, documented versions.

In the event of problems or disputes, having well-archived versions allows you to quickly reconstruct what was done and when. And in the AI world, where everything evolves rapidly, having an organised technical memory is one of the most tangible advantages you can build.