ANALYSIS

AI training after the SRB ruling: A practical playbook for engineers who now define compliance

The CJEU's SRB judgment highlights that identifiability is not a theoretical property, but a practical one, meaning compliance is now something that happens in system diagrams, access controls, data flows and model tests.


Contributors:

Roy Kamp

AIGP, CIPP/A, CIPP/E, CIPP/US, CIPM, CIPT, FIP

Legal Director

UKG

Noemie Weinbaum

AIGP, CIPP/A, CIPP/C, CIPP/E, CIPP/US, CIPM, CIPT, CDPO/FR, FIP

Senior Managing Counsel, Privacy and Compliance

UKG

Those who build artificial intelligence systems today — especially in domains like human resources, workforce analytics, health-adjacent services or behavioral platforms — are no longer just writing code. They are shaping how data protection law applies to systems.

That is not because engineers are being asked to become lawyers. It is because the law, particularly after the Court of Justice of the European Union's Single Resolution Board judgment, is starting to catch up with how systems actually work. The EU General Data Protection Regulation no longer treats personal data as a label stuck to a dataset forever. It treats identifiability as something that depends on architecture, access and capability.

In other words, what a system can realistically do determines what the law thinks is happening.

This is most visible when AI training involves special category data. Many teams assume that if such data ever existed in the pipeline, training is either forbidden or requires some heroic legal justification. That assumption is wrong, but neither is the answer "don't worry about it." The right response is to design systems so the legal question has a clear, documented and defensible answer.

The SRB ruling clarifies that there are two legitimate ways to do this. Which pathway applies depends almost entirely on engineering choices.

Let's start with something that often confuses teams. Unlike consent, legitimate interest, vital interests or performance of a contract, pseudonymization is not a separate lawful basis for processing data. Hashing identifiers, tokenizing records or stripping names does not magically authorize new processing. It is a technical measure. It reduces risk. It does not change why the data exists or what the controller is allowed to do with it.

From an engineering perspective, that is actually good news. It means pseudonymization can be applied aggressively without triggering a new legal regime. But it also means pseudonymization alone does not answer the question of "how do we train AI models and systems?" It just sets the stage.

The real question begins once data enters the training environment. At that point, the only thing that matters is whether the system, and those who operate it, can still tie data back to real individuals using means that are realistically available.

This is where engineers need to stop thinking in terms of "is the data anonymous?" and start thinking in terms of "can anyone here identify someone, even indirectly?"

If the answer is no, consistently and demonstrably, and tested regularly over time, engineers are in the first pathway. If the answer is yes, even in edge cases, they are in the second.

The first pathway

The first pathway is the cleanest, but it requires discipline. Here, the training environment is deliberately blind. It never sees direct identifiers. It never sees reversible tokens. It never has access to keys, salts or lookup tables. It does not share infrastructure or secrets with ingestion systems. Engineers working on model training are not technically able to reach back into customer systems or reconstruct identities in some other way.

This is not about trusting people. It is about designing systems where curiosity or mistakes cannot, even inadvertently, result in identification.

In practice, this usually means that all identity-handling happens upstream, before data ever reaches the training pipeline. Identifiers are transformed using one-way techniques, often salted hashing, in a preprocessing layer that the training system cannot access. The salt is stored somewhere the training system will never see, such as a controller-side hardware security module or a separate trust domain, or is deleted outright. Once the transformation is done, raw identifiers are discarded. They simply do not exist in the training world.
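As an illustration, the upstream transformation described above might look like the following minimal Python sketch. All field names, values and the salt handling are hypothetical; in a real deployment the salt would sit in an HSM or a separate trust domain, never in source code.

```python
import hmac
import hashlib

def pseudonymize(identifier: str, salt: bytes) -> str:
    """One-way transform: keyed HMAC-SHA256, so the resulting token
    cannot be reversed without the salt, which the training
    environment never sees."""
    return hmac.new(salt, identifier.encode("utf-8"), hashlib.sha256).hexdigest()

# Upstream preprocessing layer (hypothetical): the salt is held in a
# separate trust domain and locked away or deleted after this runs.
salt = b"held-outside-the-training-domain"
record = {"employee_id": "E-10492", "tenure_months": 38}
record["employee_id"] = pseudonymize(record["employee_id"], salt)
# Only the opaque token reaches the training pipeline; the raw
# identifier is discarded and does not exist in the training world.
```

The keyed construction matters: a bare unsalted hash of a low-entropy identifier such as an employee ID can be reversed by brute force, which is exactly the kind of "reasonably likely" re-identification the analysis turns on.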

At the same time, engineers must pay attention to indirect identifiers. High-cardinality attributes, precise timestamps, rare combinations of features, or free-text fields can all reintroduce identifiability. If a model can learn from patterns, it does not need that level of precision. Coarsening, aggregation and selective suppression are not "nice to haves." They are what makes true non-identifiability credible.
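A coarsening pass of the kind described above could be sketched as follows. The field names, the month-level granularity and the rarity threshold are all illustrative choices, not prescriptions.

```python
from collections import Counter

def coarsen(records, rare_threshold=5):
    """Reduce indirect identifiability: truncate timestamps to month
    granularity, collapse rare job titles into a catch-all bucket,
    and drop free-text fields entirely."""
    title_counts = Counter(r["job_title"] for r in records)
    coarsened = []
    for r in records:
        coarsened.append({
            "job_title": (r["job_title"]
                          if title_counts[r["job_title"]] >= rare_threshold
                          else "OTHER"),
            "month": r["timestamp"][:7],  # "2024-03-15T09:12" -> "2024-03"
            "score": r["score"],
            # the free-text "notes" field is deliberately not copied
        })
    return coarsened
```

A rare combination like a unique job title plus a precise timestamp can single someone out even with no name attached, which is why the suppression step is part of the identifiability analysis rather than an optimization.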

Equally important is access design. If the same engineers can access both raw ingestion systems and training environments, organizationally and technically, it becomes much harder to argue that identifiability is not reasonably possible. Separation of duties is not just a governance slogan here. It is a critical part of the identifiability analysis.

When this pathway is implemented properly, something important happens. From the processor's perspective, the data no longer relates to identifiable individuals. The AI model is learning statistical relationships, not personal histories. Article 9 of the GDPR, which exists to protect people from being singled out or harmed based on sensitive attributes, no longer has practical bite in this context.

That does not mean governance disappears. Purpose limitation, compatibility assessments, onward transfer controls and contractual restrictions still matter, because the same dataset might be personal again for someone else. But for the training activity itself, the core GDPR risk is neutralized by design.

Many teams aim for this pathway, but far fewer achieve it. The most common failure is subtle. Hashing is done, but the salt lives in the same environment as the training code. Or tokenization is used, but the training system can call the token vault. Or auxiliary datasets are available that allow joins "for debugging." Each of these choices reintroduces identifiability in a way regulators will consider reasonably likely.

When that happens, engineers fall into the second pathway.

The second pathway

The second pathway is not a failure mode. It is simply an acknowledgment of reality. In many AI systems, especially complex ones, it is not feasible to eliminate identifiability entirely without destroying utility. In those cases, the data remains personal data for the processor, even after pseudonymization.

Here, the question becomes whether AI training can rely on legitimate interest as a lawful basis, despite the data's origins in special category processing.

The SRB judgment helps explain why the answer can be yes, but only if the system is designed carefully.

From an engineering perspective, legitimate interest lives or dies on three things: necessity, safeguards and impact.

Necessity means engineers must be able to explain, concretely, why the training is needed to achieve a legitimate goal like improving accuracy, reducing bias or increasing robustness. Vague claims about "making the model better" will not survive scrutiny. Engineers need to be able to show why certain features matter, why certain data cannot be fully anonymized, and why alternatives like synthetic data would not deliver the same results. In this context, they must show that no other means achieve the same goal with less intrusive processing, or that any such alternatives are not commercially realistic or feasible.

Safeguards are where most of the work is. Pseudonymization remains essential, even if it is no longer fully decisive. Keys must be separated. Access must be tightly limited. Training pipelines must be isolated from production systems. Models must be designed and tested to avoid memorization and leakage. If models can reproduce training records or allow membership inference, the balance very quickly and decisively tips against legitimate interest.
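One crude but useful memorization check, assuming access to per-example losses on training and holdout data, is to compare their averages: a model that scores its own training records markedly better than unseen records is exactly what membership-inference attacks exploit. All numbers and the threshold below are made up for illustration.

```python
import statistics

def membership_gap(train_losses, holdout_losses):
    """Leakage signal: if the model's per-example loss is
    systematically lower on records it trained on than on unseen
    records, an attacker can guess training-set membership, which
    undermines the safeguards side of the balancing test."""
    return statistics.mean(holdout_losses) - statistics.mean(train_losses)

# Hypothetical per-example losses from a trained model.
train_losses = [0.11, 0.09, 0.14, 0.10]
holdout_losses = [0.35, 0.41, 0.29, 0.38]
gap = membership_gap(train_losses, holdout_losses)

LEAKAGE_BUDGET = 0.05  # illustrative threshold, set per risk assessment
if gap > LEAKAGE_BUDGET:
    print(f"Memorization risk: loss gap {gap:.2f} exceeds budget")
```

Real evaluations use stronger attacks than a mean-loss gap, but even this simple test, run as part of CI, turns "models must not memorize" from a policy statement into a measurable gate.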

Impact is the hardest concept to internalize, but the most important. Legitimate interest does not ask whether data is sensitive in theory. It asks whether individuals are likely to be affected in practice. Engineers influence this directly. Models that operate at aggregate levels, produce probabilistic outputs, and cannot be used to make decisions about specific people present far less risk than models that generate individual-level insights tied to real users, or whose outputs directly affect the individuals whose data is processed.

This is also where privacy-enhancing techniques come into play, but they are not magic. Differential privacy, noise injection, federated learning and secure enclaves only matter if they actually reduce the system's ability to identify, single out or harm individuals. Used well, they can materially support the case for legitimate interest as the lawful basis for processing. Used poorly, they are just buzzwords.
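As one concrete example of what "actually reducing" means, the Laplace mechanism from differential privacy adds noise calibrated to the query's sensitivity, so adding or removing any one individual barely changes the released statistic. The epsilon value and the query below are illustrative; production systems would rely on a vetted DP library rather than hand-rolled sampling.

```python
import math
import random

def dp_count(values, predicate, epsilon=1.0, sensitivity=1.0):
    """Laplace mechanism sketch: release a count with noise of scale
    sensitivity/epsilon, so one individual's presence or absence
    shifts the output distribution by at most a factor of
    e**epsilon."""
    true_count = sum(1 for v in values if predicate(v))
    scale = sensitivity / epsilon
    # Sample Laplace(0, scale) by inverse-CDF from a uniform draw.
    u = random.random() - 0.5
    noise = -scale * math.copysign(1.0, u) * math.log(1 - 2 * abs(u))
    return true_count + noise

# Hypothetical aggregate query: how many records exceed a risk score.
scores = [0.2, 0.9, 0.7, 0.4, 0.8]
noisy = dp_count(scores, lambda s: s > 0.5, epsilon=1.0)
```

The point for the balancing test is that the protection is quantifiable: epsilon is a parameter an engineer sets and can document, not an adjective in a policy document.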

One of the most important realizations for engineers is that legitimate interest is not something lawyers should "apply" after the system is built. It is something the system either supports or undermines by design. If the architecture makes it impossible to limit purpose, access or reuse, no balancing test will be enough.

Finally, engineers need to be aware of role clarity. Training models strictly under a controller's instructions is different from training models for general product improvement, or for any other purpose. The same team can be a processor in one context and a controller in another. What matters is not the label, but whether the system design reflects that distinction. Clear separation of datasets, models and purposes is essential.

An opportunity 

The SRB judgment does not introduce new obligations for engineers. It makes existing ones real. It tells us that identifiability is not a theoretical property, but a practical one. That means compliance is no longer something that happens in documents alone. It happens in system diagrams, access controls, data flows and model tests.

For engineers, this can feel like an extra burden. In reality, it is an opportunity. When identifiability is controlled deliberately, legal uncertainty drops and engineering projects become more focused on the purpose of the processing. Governance discussions become concrete. And AI systems become more robust, trustworthy and defensible.

The law has finally acknowledged what engineers have always known: systems are relational. The challenge now is to build them as if that actually matters.



Tags:

AI and machine learning, Law and regulation, Litigation and case law, Program management
