Why Package Installation Is Slow (And How to Fix It)

You know the wait. You type the install command and watch the cursor blink. The package manager fetches its index. Seconds pass. You wonder if something is broken.
This delay has a specific cause: metadata bloat. Most package managers maintain a monolithic index of all available packages, versions, and dependencies. As the ecosystem grows, these indexes grow with it. Conda-forge hosts over 31,000 packages across multiple platforms and architectures. Other ecosystems face challenges of the same scale, with hundreds of thousands of packages.
When package managers use monolithic indexes, your client downloads and parses everything for every operation. You're downloading metadata for packages you'll never use. Compounding the problem: more packages mean larger indexes, slower downloads, higher memory usage, and unpredictable build times.
This is not unique to any single package manager. It's a scaling problem that affects any package ecosystem that offers thousands of packages to millions of users.
Architecture of Package Indexes
Conda-forge, like other package managers, distributes its index as a single file. This design has advantages: the solver gets all the information it needs upfront in a single request, which allows efficient dependency resolution without round-trip delays. When the ecosystem was small, a 5 MB index downloaded in seconds and parsed with little memory.
At scale, the design collapses.
Consider conda-forge, one of the largest community-driven scientific Python channels. Its repodata.json file, which contains metadata for all available packages, exceeds 47 MB compressed (363 MB uncompressed). Every local operation requires parsing this file. If any package in the channel changes – which happens constantly with new builds – the entire file must be downloaded again. One new package version invalidates your entire cache, and users re-download 47+ MB just to pick up a single update.
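The all-or-nothing invalidation can be sketched in a few lines of Python. This is a toy model, not the real repodata format: the dictionary below is a heavily abbreviated stand-in for repodata.json, and the hash plays the role of an HTTP cache validator.

```python
import hashlib
import json

# Hypothetical, heavily abbreviated repodata.json: one file covering
# every package in the channel.
repodata = {
    "packages": {
        "numpy-1.26.4-py312.conda": {"depends": ["python >=3.12"]},
        "scipy-1.13.0-py312.conda": {"depends": ["numpy >=1.22"]},
        # ...tens of thousands more entries in the real file...
    }
}

def index_fingerprint(index: dict) -> str:
    """Hash of the serialized index, standing in for an HTTP cache validator."""
    blob = json.dumps(index, sort_keys=True).encode()
    return hashlib.sha256(blob).hexdigest()

before = index_fingerprint(repodata)

# Publishing one new build changes the single shared file...
repodata["packages"]["numpy-2.0.0-py312.conda"] = {"depends": ["python >=3.12"]}
after = index_fingerprint(repodata)

# ...so the cached copy of *everything* is now stale.
assert before != after
```

One new entry among tens of thousands changes the fingerprint of the whole file, which is exactly why every client has to re-download it.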
The results are measurable: multi-second download times on fast connections, minutes on slow networks, memory spikes from parsing a 363 MB JSON file, and CI pipelines spending more time on dependency resolution than on actual builds.
Sharding: A Different Approach
The solution borrows from database design. Instead of one monolithic index, you split the metadata into many smaller pieces. Each package gets its own “shard” containing only its metadata. Clients download the shards they need and ignore the rest.
This pattern appears throughout distributed systems. Databases shard data across servers. Content delivery networks store content regionally. Search engines distribute indexes across clusters. The principle is the same: when one data structure becomes too large, split it.
Applied to package management, sharding transforms metadata distribution from “download everything, use a little” to “download what you need, use all of it.”
The implementation works as a two-part system, shown in the diagram below. First, a lightweight manifest file, called the shard index, lists all available packages and maps each package name to a hash. Think of a hash as a unique fingerprint derived from the contents of a file: change even one byte and you get a completely different hash.
This hash is computed from the contents of the compressed shard file, so each shard file is uniquely identified by its own hash. The manifest is small – about 500 KB for the conda-forge linux-64 subdirectory, which contains over 12,000 package names – and only needs updating when packages are added or removed. Second, the individual shard files contain the actual package metadata. Each shard holds all versions of a single package name, stored as a separate compressed file.
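The two-part layout can be sketched as follows. This is a simplified model, not the CEP-16 wire format: zlib stands in for the real compression codec, and the metadata entries are invented for illustration.

```python
import hashlib
import json
import zlib

# Hypothetical per-package metadata: one entry per package *name*,
# covering all of that package's versions.
package_metadata = {
    "numpy": {"1.26.4": {"depends": ["python >=3.12"]},
              "2.0.0":  {"depends": ["python >=3.12"]}},
    "scipy": {"1.13.0": {"depends": ["numpy >=1.22"]}},
}

shards = {}       # hash -> compressed shard bytes (the shard files)
shard_index = {}  # package name -> hash (the lightweight manifest)

for name, versions in package_metadata.items():
    # zlib stands in for the real compression format.
    compressed = zlib.compress(json.dumps(versions, sort_keys=True).encode())
    digest = hashlib.sha256(compressed).hexdigest()
    shards[digest] = compressed
    shard_index[name] = digest

# A client resolving numpy downloads only the manifest plus one shard.
numpy_shard = json.loads(zlib.decompress(shards[shard_index["numpy"]]))
assert "2.0.0" in numpy_shard
```

Note the separation of concerns: the manifest answers "what exists and where is it?", while each shard answers "what are this package's versions and dependencies?".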
The key insight is content-addressable storage. Each shard file is named after the hash of its compressed content. If a package doesn't change, its shard content stays the same, so the hash doesn't change. That means clients can cache shards forever without checking for updates – no round trip to the server required.
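Here is a minimal sketch of why content-addressing makes shards cacheable forever. The file-name scheme and the `fetch` helper are illustrative assumptions, not the real client code.

```python
import hashlib

def shard_name(content: bytes) -> str:
    # File name derived from the file's own bytes (extension is assumed).
    return hashlib.sha256(content).hexdigest() + ".zst"

cache = {}

def fetch(content: bytes) -> bytes:
    """Serve from the local cache when possible; a real client would hit the network."""
    name = shard_name(content)
    if name not in cache:            # a given shard is only ever downloaded once
        cache[name] = content
    return cache[name]

blob = b"numpy shard, unchanged"
fetch(blob)
fetch(blob)                          # second call is a pure cache hit
assert len(cache) == 1

fetch(b"numpy shard, new build")     # changed content -> new name -> one new download
assert len(cache) == 2
```

Because a shard's name is a function of its bytes, a cached shard can never be stale: if the content changed, the name would have changed too, and the client would simply fetch a different file.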
When you request a package, the client performs the dependency walk shown in the diagram below. It fetches the shard index to look up the package name and find the associated hash, then uses that hash to download the specific shard file. Each shard contains dependency information, which the client uses to download the next set of required shards in parallel.
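The dependency walk amounts to a breadth-first traversal over shards. The sketch below uses an in-memory dictionary of invented dependencies in place of real network fetches; a real shard also carries versions, builds, and checksums.

```python
from collections import deque

# Hypothetical shard contents: package name -> direct dependency names.
SHARDS = {
    "pandas": ["numpy", "python-dateutil"],
    "numpy": [],
    "python-dateutil": ["six"],
    "six": [],
}

def resolve(root: str) -> set[str]:
    """Fetch only the shards reachable from the requested package."""
    fetched, frontier = set(), deque([root])
    while frontier:
        name = frontier.popleft()
        if name in fetched:
            continue
        fetched.add(name)                # in practice each level's shards
        frontier.extend(SHARDS[name])    # are downloaded in parallel
    return fetched

assert resolve("pandas") == {"pandas", "numpy", "python-dateutil", "six"}
```

Requesting pandas touches four shards here; the thousands of other packages in the channel are never downloaded at all.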

This process fetches only the packages that might be needed – typically 35 to 678 packages for an installation – rather than downloading metadata for every package on every platform in the channel. Your conda client downloads only the metadata it needs to update your environment.
Measuring Impact
The conda ecosystem recently implemented sharded repodata with CEP-16, a community specification developed jointly by developers at prefix.dev, Anaconda, and Quansight. The rollout began with conda-forge, a volunteer-maintained channel that hosts more than 31,000 community-built packages without backing from any single company. This makes it an ideal proving ground for infrastructure changes that benefit the wider ecosystem.
The benchmarks tell a clear story.
For metadata fetching and parsing, sharded repodata delivers a roughly 10x speedup. Cold-cache operations that previously took 18 seconds complete in under 2 seconds. Network transfer drops sharply: installing Python previously required downloading 47+ MB of metadata, while with sharding you download about 2 MB. Peak memory usage drops by 15 to 17x, from over 1.4 GB to under 100 MB.
Cache behavior changes as well. With monolithic indexes, any channel update invalidates your entire cache. With sharding, only the shard of the affected package needs updating. This means better cache hit rates and fewer unnecessary downloads over time.
Design Tradeoffs
Sharding introduces complexity. Clients need logic to decide which shards to download. Servers need infrastructure to generate and serve thousands of small files instead of one large file. Cache invalidation becomes more granular, but also more involved.
The CEP-16 specification addresses this tradeoff with a two-stage approach. A compact manifest file lists all available shards and their checksums. Clients download this manifest first, then fetch only the shards their packages need for resolution. HTTP caching handles the rest: unchanged files return 304 responses, while changed files are downloaded fresh.
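The manifest revalidation step can be sketched with ordinary HTTP conditional-request semantics. The `Server` class, its method names, and the tag format below are illustrative assumptions, not the actual CEP-16 protocol.

```python
import hashlib
import json

class Server:
    """Toy server holding one manifest, served with a validator tag."""

    def __init__(self, manifest: dict):
        self.manifest = manifest

    def etag(self) -> str:
        blob = json.dumps(self.manifest, sort_keys=True).encode()
        return hashlib.sha256(blob).hexdigest()

    def get_manifest(self, if_none_match=None):
        """Return (status, body, tag), mimicking an HTTP conditional GET."""
        tag = self.etag()
        if if_none_match == tag:
            return 304, None, tag        # client's cached copy is still fresh
        return 200, self.manifest, tag   # full download on a cold cache

server = Server({"numpy": "abc123", "scipy": "def456"})

status, body, tag = server.get_manifest(None)  # cold cache: full download
assert status == 200

status, body, _ = server.get_manifest(tag)     # warm cache: revalidation only
assert status == 304
```

Only the small manifest ever needs revalidating; the content-addressed shards it points to are immutable, so they can be cached without any conditional requests at all.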
This design keeps client logic simple while moving complexity to the server, where it can be built once and benefit all users. For conda-forge, the Anaconda infrastructure team handles this server-side work, meaning 31,000+ package maintainers and millions of users benefit without changing their workflows.
Broader Applications
The pattern extends beyond conda-forge. Any package manager using monolithic indexes faces the same scaling challenges. The key insight is separating the discovery layer (which packages exist) from the resolution layer (which metadata your specific dependencies need).
Different ecosystems have taken different approaches to this problem. Some use per-package APIs where each package's metadata is downloaded separately – this avoids downloading everything, but can lead to many sequential HTTP requests during dependency resolution. Sharded repodata offers a middle ground: you download only the packages you need, and you can fetch the shards for related dependencies in parallel, reducing both bandwidth and request overhead.
For teams building internal package registries, the lesson is architectural: design your metadata layer to scale with your package count. Whether you choose per-package APIs, sharded indexes, or another approach, the alternative is watching your build times grow with every package you add.
Try It Yourself
Pixi already supports sharded repodata via the conda-forge channel, which is enabled by default. Just use pixi normally and you're already benefiting from it.
If you use conda with conda-forge, you can enable sharded repodata support:
conda install --name base 'conda-libmamba-solver>=25.11.0'
conda config --set plugins.use_sharded_repodata true
The feature is in beta, and the conda maintainers are gathering feedback before general availability. If you run into problems, the conda-libmamba-solver repository on GitHub is the place to report them.
For everyone else, the takeaway is simple: if your tool feels slow, look at the metadata layer. The packages themselves may not be the bottleneck. The index usually is.
Disclosure: Towards Data Science's owner, Insight Partners, is also an investor in Anaconda.



