Rebuilding Docker Images? The Bad News: Only ~1 in 40 Are Truly Reproducible

Container images are now a core unit of software delivery—and a prime target for supply-chain attacks. In theory, reproducible container builds offer a clean integrity check: rebuild an image from its Dockerfile and compare hashes. If the hash matches, you have strong evidence that the published image corresponds to the source.

In “It’s Not Just Timestamps: A Study on Docker Reproducibility” (Oreofe Solarin, 2026), the author puts that promise to the test at scale, with a measurement pipeline applied to 2,000 GitHub repositories containing Dockerfiles. The headline result is blunt: reproducible Docker images are extremely rare in the wild, and the reasons go far beyond timestamps.

What the Study Measured

The pipeline:

  • clones each repo,

  • finds Dockerfiles (preferring a root Dockerfile when possible),

  • attempts clean builds,

  • then checks reproducibility under three increasingly strict lenses:

    1. Bitwise reproducibility: do two clean builds produce identical image digests?

    2. Infra-reproducibility: after “hardening” build infrastructure (e.g., normalized timestamps via SOURCE_DATE_EPOCH), do the digests match?

    3. Content-level checks: for non-matching digests, tools like diffoci and diffoscope identify what actually differs inside the images.

Key Findings: Buildability Is a Bigger Problem Than You Think

Before reproducibility, there’s a more basic hurdle: many Dockerfiles don’t even build reliably in an automated pipeline.

  • Out of 2,000 sampled repos, only 56.1% produced any buildable image.

  • The rest failed due to build errors, repository access issues, or timeouts.

So even basic “can we rebuild this?” often fails in practice.

Reproducibility Results: “As-Is” Bitwise Matches Are Nearly Nonexistent

Among the 1,123 buildable Dockerfiles:

  • Only 2.7% were bitwise reproducible as-is (same Docker toolchain, two clean builds).

  • After infrastructure hardening, reproducibility improved by 18.6 percentage points (to 21.3% overall).

  • Yet 78.7% of buildable Dockerfiles remained non-reproducible, even after removing major infrastructure-level nondeterminism.

Even more striking: the study observed zero cases where images were “semantically identical but hash-different.” In other words, every hash mismatch corresponded to real content differences, not just harmless metadata noise.

What Breaks Reproducibility (Spoiler: It’s Often the Dockerfile’s Fault)

The paper’s central claim is in the title: timestamps aren’t the main villain—or at least, not the only one.

After infrastructure fixes, diff analysis showed recurring culprits largely under developer control:

  • File ordering / formatting differences (78.1%)

  • System logs baked into images (43.3%) — e.g., /var/log/apt/*, dpkg.log

  • Caches and on-disk databases (36.8%) — package caches, font caches, linker caches, etc.

  • Compiled artifacts (20.0%) — ELF binaries, bytecode outputs

  • App-specific generated files (13.0%) — reports, downloaded models, generated assets

  • Random/non-deterministic data (9.4%)

  • Package-manager state (5.6%)

The story here is practical: many Dockerfiles accidentally preserve “runtime junk” (logs, caches, machine IDs, bytecode, nondeterministic build outputs) inside the final image. Those files change between builds, so your digest changes too.
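Several of the culprits above (apt lists, package caches, logs) can be avoided with a single-layer cleanup pattern. The snippet below is a minimal sketch, not an example from the paper; the package name is illustrative, and the log paths are the standard Debian/Ubuntu locations mentioned above.

```dockerfile
# Install and clean up in the SAME RUN instruction: files deleted in a
# later layer still persist (and still differ) in the earlier committed
# layer, so a separate cleanup step does not help the digest.
RUN apt-get update \
    && apt-get install -y --no-install-recommends curl \
    && rm -rf /var/lib/apt/lists/* /var/cache/apt/archives \
    && rm -rf /var/log/apt /var/log/dpkg.log
```

Combining install and cleanup in one instruction matters because each Dockerfile instruction commits its own layer, and layers are content-addressed individually.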

A Simple Example: Four Fixes Can Flip the Outcome

The paper shows a small but powerful illustration: a typical Dockerfile using a floating base tag and unpinned dependencies produces different digests across rebuilds. But with a handful of changes—like pinning the base image by digest, pinning package versions, and normalizing build timestamps—the same Dockerfile becomes bit-identical across builds.
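The paper’s exact example isn’t reproduced here, but a sketch of that before/after transformation might look like the following. The digest is an obviously fake placeholder (substitute a real one, e.g. from `docker buildx imagetools inspect`), and the pinned package version is illustrative:

```dockerfile
# Before: floating tag, unpinned install -> digest drifts across rebuilds.
#   FROM python:3.11
#   RUN pip install requests

# After: pin the base image by digest (placeholder shown) and the package
# by version, and suppress bytecode drift.
FROM python:3.11@sha256:0000000000000000000000000000000000000000000000000000000000000000
ENV PYTHONDONTWRITEBYTECODE=1
RUN pip install --no-cache-dir requests==2.32.3
# Build with a fixed timestamp so layer metadata is stable, e.g.:
#   SOURCE_DATE_EPOCH=0 docker buildx build --no-cache .
```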

Practical Dockerfile Guidelines (Actionable Takeaways)

Based on the most common root causes, the study suggests developer-facing best practices that are directly CI-lintable:

  • Pin everything

    • Base images by digest (FROM python:3.11@sha256:…)

    • Packages by version (avoid floating latest-style installs)

  • Clean up aggressively

    • Remove apt lists and caches (/var/lib/apt/lists/*, archives)

    • Remove logs (/var/log/*) when they don’t belong in the final image

    • Avoid leaving build caches (pip/npm caches) behind

  • Reduce non-deterministic outputs

    • Prevent Python bytecode drift (PYTHONDONTWRITEBYTECODE=1 or controlled compilation)

    • Keep generated artifacts out of final layers when possible

  • Use reproducibility-friendly build settings

    • Normalize timestamps with SOURCE_DATE_EPOCH

    • Consider disabling provenance/SBOM if your goal is strict bitwise matching (trade-off: you lose metadata that can be useful for security auditing)
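Assuming a reasonably recent docker buildx (0.10+, which honors `SOURCE_DATE_EPOCH` and supports the provenance/SBOM flags), the last two points plus a bitwise self-check might look like this sketch (image names are arbitrary):

```shell
# Normalize embedded timestamps for BuildKit.
export SOURCE_DATE_EPOCH=0

# Disable provenance/SBOM attestations, which vary per build, then build
# the same Dockerfile twice from scratch.
docker buildx build --no-cache --provenance=false --sbom=false --load -t app:a .
docker buildx build --no-cache --provenance=false --sbom=false --load -t app:b .

# Identical image IDs => bitwise reproducible under this toolchain.
[ "$(docker image inspect -f '{{.Id}}' app:a)" = \
  "$(docker image inspect -f '{{.Id}}' app:b)" ] && echo reproducible
```

This is essentially the study’s first lens (two clean builds, compare digests) applied to your own image; tools like diffoscope or diffoci can then explain any remaining mismatch.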

Why This Matters for Supply-Chain Security

Reproducible builds are often treated as a “simple integrity check.” This paper shows why that check rarely works for containers today: even with hardened infrastructure, most Dockerfiles embed build-to-build drift inside the image.

The implication: if you want hash-based verification to be meaningful in container supply chains, reproducibility has to become a developer habit—and ideally, a CI-enforced standard via linters and automated checks.

In short: it’s not just timestamps. It’s the everyday Dockerfile choices we’ve normalized for years.

source: https://arxiv.org/pdf/2602.17678
