AW Dev Rethought

🕵️ Debugging is like being the detective in a crime movie where you are also the murderer - Filipe Fortes

AI in Production: Why AI Systems Need Versioning Beyond Code


Introduction:

Version control is one of the most fundamental practices in software engineering. Every code change is tracked, every deployment is recorded, and rolling back to a previous state is a routine operation that most engineers can perform without thinking.

AI systems inherit this practice but stop short of where it actually needs to go. Teams version their model training code, commit it to git, and consider the versioning problem solved. But in an AI system, the code is only one of several things that determine how the system behaves. Models, datasets, features, hyperparameters, and environment configurations all contribute to the output — and most teams version none of them with the same rigour they apply to code.

The result is systems that are difficult to reproduce, impossible to audit, and dangerous to roll back when something goes wrong in production.


Code Versioning Alone Does Not Reproduce a Model:

The same training code run twice on different data produces different models. The same training code run on the same data with different hyperparameters produces different models. The same training code run on the same data with the same hyperparameters but a different random seed can produce models with meaningfully different behaviour.

This means that a git commit hash does not uniquely identify a model. It identifies the code that was used to train the model, but without also capturing the dataset, the hyperparameters, the environment, and the random state, you cannot reproduce that model from the commit alone.

Teams that discover a production model is behaving incorrectly and want to roll back to the previous version often find they cannot — because they never captured enough information to recreate it.


Models Are Artefacts That Need Their Own Versioning:

A trained model is a binary artefact with its own lifecycle. It has a creation timestamp, a training dataset, a set of hyperparameters, an evaluation result, and a deployment history. All of these need to be tracked alongside the model itself.

Model registries exist precisely for this purpose — tools like MLflow, Weights and Biases, and SageMaker Model Registry provide structured storage for model artefacts along with the metadata needed to understand what each version is, how it was trained, and how it performed.

Without a model registry, models get stored in S3 buckets with names like model_final_v2_updated.pkl and the institutional knowledge of what each file represents lives in someone's head or a Slack message from six months ago.


Data Versioning Is the Hardest Part:

If code versioning is table stakes and model versioning is underappreciated, data versioning is the part most teams avoid entirely. Training datasets are large, expensive to store in multiple versions, and change in ways that are difficult to track incrementally.

But data is arguably the most important thing to version in an AI system. A model trained on data from January behaves differently from a model trained on data from June — even if the code and hyperparameters are identical. If you cannot identify exactly which data a model was trained on, you cannot explain its behaviour, audit its outputs, or reproduce its results.

Tools like DVC and Delta Lake make data versioning more tractable, but the cultural shift — treating datasets as versioned artefacts rather than mutable files — is harder than the tooling problem.


Feature Definitions Change Silently:

In systems that use feature stores, features are computed from raw data and served to models at both training and inference time. The definition of a feature — how it is computed, what data it draws from, how nulls are handled — can change over time as upstream data sources evolve or as engineers refine the computation logic.

When a feature definition changes without being versioned, models trained on the old definition are suddenly receiving different inputs at serving time without anyone explicitly deciding to change the model's behaviour. The model has not changed. The code has not changed. But the system behaves differently because the inputs are different.

Versioning feature definitions and tracking which model version was trained against which feature version is essential for understanding why model behaviour changes in production — and it is almost universally neglected.


Hyperparameters and Configs Are Part of the Model:

Learning rate, batch size, regularisation strength, architecture depth — these hyperparameters are as much a part of the model as the weights themselves. A model trained with a learning rate of 0.001 is a different model from one trained with 0.01, even if every other input is identical.

Configuration files that control training runs need to be versioned and linked to the model artefacts they produce. This sounds obvious but in practice, hyperparameter configurations are often stored separately from models, managed manually, and not consistently linked to the experiments that used them.

When a model performs unexpectedly in production and you need to understand what changed, the ability to trace back to the exact configuration that produced it is the difference between a one-hour investigation and a three-day one.


Versioning Enables Meaningful Rollback:

In traditional software, rolling back means deploying a previous code version. In AI systems, rolling back means serving a previous model version — and that is only possible if the previous model version still exists, is correctly identified, and can be deployed without retraining.

Teams that discover a newly deployed model is producing worse outputs than its predecessor need to be able to switch back immediately. If models are not versioned and stored with their full metadata, rollback becomes a retraining exercise that takes hours or days rather than a deployment operation that takes minutes.

Versioning is not just about reproducibility and auditability. It is about operational safety — the ability to respond quickly when a deployed model causes harm.


Conclusion:

AI systems are not just software systems with models attached. They are systems where behaviour is determined by the interaction of code, data, models, features, and configuration — and where any of these can change independently and affect outputs in ways that are difficult to detect and harder to explain.

Versioning practices that were designed for code are necessary but not sufficient for AI systems. Teams that extend those practices to cover every artefact that influences model behaviour build systems that are reproducible, auditable, and safe to operate at scale. Teams that do not will eventually face a production incident where they cannot explain what changed, cannot roll back, and cannot reproduce the system state they need to investigate.


If this article helped you, you can support my work on AW Dev Rethought. Buy me a coffee


Rethought Relay:
Link copied!

Enjoyed this post?

Stay in the loop

New posts + weekly digest, straight to your inbox.

or

Create a free account

  • Save posts to your vault
  • Like posts & build history
  • New-post alerts

Comments

Add Your Comment

Comment Added!