I was reading about the recently introduced “NVIDIA Inference Microservices (NIMs)” and how they can be deployed on Azure Container Apps using “serverless GPUs”. In a tutorial in the official docs, there’s a dedicated section on the importance of enabling what Microsoft calls “artifact streaming”, which sparked my curiosity about how it works.
In very simplified terms, it’s a strategy where only the essential container image layers are pulled first, allowing workloads to initialize faster. The remaining layers are downloaded subsequently (at least in AKS) or only when needed (GKE implementation).
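To make the idea concrete, here is a toy model of the strategy described above. It is only a sketch: the layer names, the sizes, and the `needed_at_startup` flag are made up for illustration, and real implementations decide what is "essential" in their own ways rather than through a flag like this.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Layer:
    name: str
    size_mb: int
    needed_at_startup: bool  # hypothetical flag, for illustration only


def mb_before_start(layers, streaming):
    """Return how many MB must be downloaded before the container can start."""
    if streaming:
        # Streaming pull: only the essential layers block startup.
        return sum(layer.size_mb for layer in layers if layer.needed_at_startup)
    # Classic pull: every layer is downloaded before the container starts.
    return sum(layer.size_mb for layer in layers)


def mb_deferred(layers, streaming):
    """Return how many MB are left to fetch after startup (later or on demand)."""
    total = sum(layer.size_mb for layer in layers)
    return total - mb_before_start(layers, streaming)


# Hypothetical image: names and sizes are invented for this example.
image = [
    Layer("base-os", 80, True),
    Layer("app-entrypoint", 20, True),
    Layer("optional-tools", 900, False),
    Layer("locale-data", 500, False),
]

if __name__ == "__main__":
    for streaming in (False, True):
        print(
            f"streaming={streaming}: "
            f"{mb_before_start(image, streaming)} MB before start, "
            f"{mb_deferred(image, streaming)} MB deferred"
        )
```

The deferred part is where the implementations differ: it can be fetched in the background right after startup, or only when the workload actually touches it.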
The earliest mention I found of this idea was in a 2016 study that brought up an interesting statistic:
“Image download accounts for 76% of container startup time, but on average, only 6.4% of the fetched data is actually needed for the container to start doing useful work.”
In 2021, Google implemented image streaming on GKE and, a year later, Amazon open-sourced a solution to provide this capability to containerd. Microsoft is still catching up: the feature has been in preview in ACR since late 2023.
If you are interested in learning more, here are some links:
“containerd internals” blog series by Samuel Karp, specifically day 4, about container images.