Where the audit goes that the scan can't

A high-quality open source scan gives you a reasonable view of a codebase's security vulnerabilities and its basic licensing picture. That's most of what you need to know, but the deal-breaking issues tend to live in the part a scan can't reach. A real audit goes further: upstream of the components into chain of custody, into what actually ships, and into whose code it really is. That's ownership and rights territory, and it's where the stories in this blog post live. Each one comes from a real engagement, and each one involves a finding that changed how the deal was negotiated, priced, or closed.


When the chain of custody breaks: unlicensed code hiding in a paid component

The most basic question an audit answers sounds almost too simple: Do you have the right to ship what’s in your product? The answer can turn on whether a license is permissive or copyleft. More often, though, it turns on something upstream of the license itself: whether anyone in the chain of custody actually had the authority to hand the code over.

On one engagement, we found commercial libraries from Rebex.net inside the target’s product. The target wasn’t worried: The libraries had arrived inside a third-party component it had bought in good faith. But the intermediary vendor had never been licensed to redistribute Rebex code in the first place. Every copy the target had shipped was, in effect, unlicensed proprietary software.

You don’t fix this with an attribution file. In an M&A context, a finding like this drives rep-and-warranty language, indemnity scope, and a post-close remediation budget.

When code leaks out the back door

An audit doesn’t just look at what came in. It can also reveal what went out.

Early in one engagement, we flagged heavy overlap between the target’s proprietary code and a public GitHub repository. The target knew nothing about it. A former employee had quietly pushed the code up on his way out the door.

For the buyer, that opened a much larger set of questions: Had trade secret protections been broken? Had anyone downstream forked the code? How defensible was the IP now? The company had a perfectly good open source policy. What it lacked was governance over what individual employees pushed to the public internet—a gap that stays invisible until someone compares the deployed artifacts to what’s out there in the world.

The founder, the friendly email, and the GPL surprise

It’s tempting to assume that internal code is the safe part of the codebase. The team wrote it; the company owns it; what could go wrong? Plenty, it turns out.

In one memorable engagement, the company’s founder had written a useful piece of code years earlier, and in a moment of collegiality, emailed it to an outside friend. That friend later published it on GitHub under the GPL. The code was now, in every practical sense, open source software—on terms the company had never agreed to.

The buyer’s lawyers were left with no clean answers: Did the friend have authority to publish? Could downstream users claim GPL rights? Could the company still assert exclusivity over code anyone could pull off GitHub? Without contribution agreements and publication controls, a single friendly email became an IP problem that only surfaced once the deal clock was ticking.

When MIT isn't really MIT: forks, relabels, and hidden copyleft

Even companies with mature open source programs and scanning tools routinely get the licenses wrong. The labels lie.

In one audit, a target was confidently shipping a component declared as MIT. When we looked closer, it turned out to be a fork of a GPL project—someone along the way had simply swapped the LICENSE file. The code itself showed line-for-line derivation from the original GPL codebase, and the copyleft obligations had quietly come with it.

A good audit looks past the declared license to the code itself, matching it against known sources. For buyers, this isn’t academic: An unexpected GPL obligation can force changes to how a product is distributed, how customers are licensed, and—in the worst cases—what business model is available going forward.
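To make "matching it against known sources" a little more concrete, here is a minimal sketch of the idea, not a description of how any particular audit tool works. It assumes you already have the text of the suspect file and of a candidate upstream file; the function names and the MIN_LINE_LENGTH threshold are hypothetical, chosen only for illustration. Each meaningful line is normalized and hashed, and the score is the share of the suspect file's lines that also appear upstream.

```python
import hashlib

# Hypothetical threshold: ignore very short lines (braces, blank lines)
# that would match almost any codebase and inflate the score.
MIN_LINE_LENGTH = 8


def fingerprint(source: str) -> set[str]:
    """Return a set of hashes, one per non-trivial line of source text."""
    hashes = set()
    for line in source.splitlines():
        # Collapse all whitespace so reindented copies still match.
        normalized = " ".join(line.split())
        if len(normalized) < MIN_LINE_LENGTH:
            continue
        hashes.add(hashlib.sha256(normalized.encode("utf-8")).hexdigest())
    return hashes


def containment(candidate: str, reference: str) -> float:
    """Fraction of the candidate's fingerprinted lines found in the reference."""
    candidate_hashes = fingerprint(candidate)
    if not candidate_hashes:
        return 0.0
    shared = candidate_hashes & fingerprint(reference)
    return len(shared) / len(candidate_hashes)
```

A score near 1.0 between a file labeled MIT and a file from a GPL project is a prompt for a human to look closer, not a legal conclusion. A sketch like this only catches verbatim or reindented copies; renamed identifiers or reordered code would need more robust techniques.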

Tetris in a sewing machine: when nobody knows what's in the binary

There’s almost always a gap between what engineering thinks is in the product and what’s actually loaded onto the device or burned into the firmware. Sometimes the gap is funny. Sometimes, it’s expensive. Often, it’s both.

Our favorite example is a high-end sewing machine. Buried in the shipped firmware was a fully functional open source Tetris clone—an engineer had used it years earlier to test the display, and it had survived every release since.

The Tetris license itself wasn’t especially scary. The fact that it shipped at all was: The company didn’t know what it was distributing. In an embedded product, that’s a flashing red light about build controls and release discipline. This is why we audit the shipped binary, not just the source repo. The repo tells you what engineering meant to ship; the binary tells you what the company is actually on the hook for.
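As a rough illustration of why the binary matters, the sketch below pulls printable strings out of a firmware image and flags any component markers that are not on the declared list. Everything here is an assumption made for the example: the function names, the marker strings, and the idea that a component leaves a recognizable text marker at all. Real binary analysis relies on far more than string matching.

```python
import re


def printable_strings(blob: bytes, min_length: int = 6) -> list[str]:
    """Extract runs of printable ASCII, similar in spirit to the Unix strings tool."""
    pattern = re.compile(rb"[\x20-\x7e]{%d,}" % min_length)
    return [match.decode("ascii") for match in pattern.findall(blob)]


def undeclared_markers(
    blob: bytes, markers: dict[str, str], declared: set[str]
) -> set[str]:
    """Return component names whose marker text appears in the blob
    but which are missing from the declared component list."""
    texts = [s.lower() for s in printable_strings(blob)]
    found = {
        name
        for name, needle in markers.items()
        if any(needle.lower() in text for text in texts)
    }
    return found - declared


# Hypothetical firmware bytes and marker strings, for illustration only.
firmware = b"\x00\x01FWv2.1\x00tetris-clone 0.9 (c) example\x00\xff\x10zlib 1.2.11\x00"
markers = {"tetris-clone": "tetris-clone", "zlib": "zlib 1."}
surprises = undeclared_markers(firmware, markers, declared={"zlib"})
```

In this made-up image, the declared list covers zlib but not the game, so the game is what gets flagged. The point is the comparison itself: what the image contains versus what the company believes it contains.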

AI diligence: the risk isn't the model, it's the training data

As AI features push their way into more products, open source diligence has had to grow up. It’s no longer just about libraries—it’s about the models and the data those models were trained on. And the data is where the surprises live.

In one engagement, the target’s flagship AI feature was trained on WebFace260M—a well-known dataset distributed under a custom license that limits use to research and education, with further restrictions tied to the copyright of the underlying images. The target was using it in a commercial product.

That single fact called the entire AI feature set into question. Could the product even be lawfully sold? Remediation here isn’t swapping a library—it could mean retraining models from scratch on a clean dataset. This is the shape of AI-era diligence: Buyers have to look at the data and the models too, because in an AI product, those are the assets the value depends on.

Conclusion: the role of open source audits in due diligence

Unauthorized commercial code. Proprietary code in a public repo. A founder’s helpful email. A GPL fork relabeled as MIT. A Tetris clone in a sewing machine. A research-only dataset powering a commercial AI feature. None of these are edge cases. They’re the kind of thing a real audit finds because someone outside the company finally went looking. For an acquirer, that’s the point: not a clean bill of health, but a clear-eyed view of what they’re actually buying, and a head start on fixing it.
 

Learn more about Black Duck Audits
