When Meta trained Llama 3 on a 16,384 H100 cluster, the cluster failed roughly every three hours. Over a 54-day run, 419 unexpected component failures interrupted training. Half of them traced back to faulty GPUs and HBM3 memory.[1] That isn’t a Meta problem. That’s a public dataset on hyperscale GPU reliability, and it points at something the industry doesn’t talk about enough. We frame the AI compute shortage as a fab problem. Not enough wafers, not enough HBM, not enough capacity to meet demand. All true. Sitting underneath the supply story, though, is a quieter problem: the chips already deployed are failing fast, and the aftermarket infrastructure to keep them running hasn’t scaled with the buildout. If we can’t build AI chips fast enough, we’d better get a lot better at keeping the ones we already have alive.

How Constrained AI Hardware Supply Actually Is

The bottleneck in 2026 isn’t GPU silicon. NVIDIA and TSMC have ramped production. The squeeze is in High Bandwidth Memory, the stacked DRAM that AI accelerators depend on. Three companies make HBM (SK Hynix, Samsung, and Micron), using specialized processes that can’t be scaled quickly. HBM demand has grown roughly 5x between 2023 and 2026, and HBM now consumes around 23% of total DRAM wafer capacity.[2][3]

That has knock-on effects in the secondary market. NVIDIA’s H100, which everyone expected to depreciate quickly once Blackwell shipped, hasn’t depreciated the way analysts predicted. Secondhand H100s sell for roughly $12,000 to $18,000 in early 2026, down from the $40,000 peak but still holding value in a way that’s unusual a full generation after launch.[4] Rental prices in some markets have actually surged 40%.[5] The reason is simple: H100s are still excellent inference hardware. As frontier training migrates to Blackwell, every H100 freed from training duty becomes inference capacity for some other team, some other product, some other lab. Every working chip is wanted by someone. So it’s worth asking how many of them are actually working.

The GPU Failure Curve Nobody’s Talking About

Meta’s Llama 3 paper is the most detailed public look we have at hyperscale GPU reliability. The numbers from that 16,384 H100 cluster:[1]

  • One unexpected failure every three hours over 54 days
  • 30.1% of failures traced to GPU faults
  • 17.2% to HBM3 memory
  • An implied annualized failure rate around 9% in year one (derivation sketched below)
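
That 9% figure follows directly from the cluster stats above. A minimal back-of-envelope sketch in Python; the inputs are from the Llama 3 paper, and the annualization is simple linear extrapolation:

```python
# Derive the implied annual GPU/HBM failure rate from Meta's published numbers.
gpus = 16_384
days = 54
failures = 419
gpu_share = 0.301 + 0.172  # GPU faults plus HBM3 faults, per the breakdown above

annualized_gpu_failures = failures * gpu_share * (365 / days)
rate = annualized_gpu_failures / gpus
print(f"Implied annual GPU/HBM failure rate: {rate:.1%}")  # ~8.2%, i.e. roughly 9%
```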

A datacenter architect at Google, quoted in trade press, has put expected datacenter GPU service life at one to three years.[6] Compounding a 9% annual failure rate gets you to roughly 25% cumulative failure by year three, which lines up with that estimate.[7] Apply the math to a generic 10,000-GPU AI cluster: at a 9% annual failure rate, that’s 900 GPUs per year that need to be triaged, repaired, or replaced. At today’s H100 secondary value, you’re looking at roughly $11 to $16 million of equipment per cluster, per year, sitting in the queue between “broken” and “back online.”

Multiply across the global fleet of training clusters and the industry’s reliability problem starts to look like an inventory problem. When chips are scarce, every failed unit is a compute cliff. You don’t get to swap it for a fresh one off the shelf. You get to wait, you get to repair, or you get to push your job to a slower box.
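
Both the compounding curve and the fleet-level dollar figure are a few more lines of the same arithmetic. A sketch, assuming independent failure years and using the early-2026 secondary prices quoted earlier:

```python
rate = 0.09                  # annual failure rate, rounded from the derivation above

# Cumulative share of a fleet failed after n years: 1 - (1 - rate)^n.
for year in (1, 2, 3):
    print(f"Year {year}: {1 - (1 - rate)**year:.1%} cumulative failures")
# Year 3 comes out to ~24.6%, roughly a quarter of the fleet.

# Annual repair-queue exposure for a generic 10,000-GPU cluster,
# valued at early-2026 secondary H100 prices ($12K to $18K per unit).
fleet = 10_000
failed_per_year = fleet * rate
low = failed_per_year * 12_000 / 1e6
high = failed_per_year * 18_000 / 1e6
print(f"{failed_per_year:.0f} GPUs/year in the queue, ${low:.1f}M-${high:.1f}M of equipment")
```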


Repair and Refurbishment Are Part of Compute Infrastructure

This is the part the industry has been slow to internalize. Repair and refurbishment aren’t just a sustainability story or a cost-recovery story when your hardware is on a 12-month lead time. They’re capacity. A repaired GPU is the same compute as a new GPU, available faster, at a fraction of the unit cost. Three things matter when you start treating the aftermarket as infrastructure.

Speed of return-to-service. RMA lifecycles built around consumer electronics return windows weren’t designed for $20K accelerators that earn revenue every minute they’re online. An H100 sitting in a 30-day RMA pipeline isn’t a warranty case, it’s stranded compute. Hyperscalers know this and have started building internal repair capability. The long tail of AI buyers (national labs, sovereign cloud providers, mid-market AI companies, enterprises running on-prem inference) does not have that option.

Component-level repair, not module swap. SXM modules, HBM stacks, NVLink connectors, and liquid cooling assemblies aren’t the kind of thing you fix with a board swap. The repair industry that grew up around laptops and cell phones does not, by default, have the bench skills, ESD environments, or BGA rework capability that AI hardware demands. That capability has to be built deliberately, with technicians, equipment, and process discipline that take years to mature.

Refurbishment as a real secondary market. Last-generation accelerators are not e-waste. They’re the inference capacity that smaller labs, sovereign deployments, and on-prem enterprises will run on for the next half-decade. A working refurbishment channel keeps that capacity in circulation rather than letting it accumulate in decommissioned racks. The industry needs the same thing for GPUs that’s long existed for enterprise servers: certified-refurbished tiers, documented test procedures, warranty-backed resale.
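
The turnaround-time point is worth quantifying. Little’s law says the average number of units stuck in a repair pipeline equals the failure arrival rate times the average turnaround. A sketch for the 10,000-GPU cluster from earlier; the two turnaround times and the $2.50/GPU-hour rental rate are illustrative assumptions, not market data:

```python
# Little's law: units in the pipeline = arrival rate x average turnaround time.
arrivals_per_day = 900 / 365      # ~2.5 failed GPUs/day in a 10,000-GPU cluster

for turnaround_days in (5, 30):   # hypothetical fast depot repair vs. slow RMA cycle
    in_pipeline = arrivals_per_day * turnaround_days
    # Revenue those stranded units would earn if online, at an assumed $2.50/GPU-hour.
    forgone_per_year = in_pipeline * 2.50 * 24 * 365
    print(f"{turnaround_days:>2}-day turnaround: {in_pipeline:5.1f} GPUs offline "
          f"on average, ~${forgone_per_year/1e6:.2f}M/yr in forgone compute")
```

On these assumptions, the difference between a five-day bench and a thirty-day RMA loop is about sixty GPUs offline at any given moment.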

The Sovereignty Layer

There’s a layer to this that’s specific to the current moment. Governments are funding sovereign AI compute as a strategic capability. Canada announced its Sovereign AI Compute Strategy in 2024, with up to $700 million for ecosystem investment, $1 billion for public supercomputing, and $300 million for an AI Compute Access Fund. In January 2026, Ottawa opened a call for proposals for sovereign data centres exceeding 100 megawatts. Canada’s total IT capacity now exceeds 10 GW across operational, under-construction, and committed projects.[8][9]

You can’t have sovereign AI compute without a sovereign aftermarket. The whole rationale for keeping training and inference inside national borders falls apart if every failed accelerator has to be flown to a third country for repair, with the data sanitization and chain-of-custody complications that creates. The same logic that justifies domestic datacenter investment justifies domestic repair, refurbishment, and ITAD capacity. That logic isn’t unique to Canada. Every country building sovereign compute has the same downstream problem.

What This Implies for the Next Few Years

A few things follow if the industry takes the repair gap seriously.

OEMs and hyperscalers will keep insourcing first-line repair for their own fleets, but the long tail of AI buyers will need certified third-party depot repair partners with real GPU-class capability. That capability is rare today, and it doesn’t appear overnight.

Sustainability reporting on AI infrastructure will start to track not just energy and water but extended hardware life. Chips that get repaired and refurbished into a second deployment cycle dramatically improve the embodied-carbon math on AI training. Expect ESG frameworks to catch up.

The secondary market for GPUs will get more formalized. Certified refurbished programs for accelerators will follow the pattern OEMs already run for laptops and servers. Buyers running inference on H100s and A100s for the next several years will demand it.

And anyone running a meaningful AI cluster will start to ask their service partners a question that wasn’t standard a year ago: what’s your repair turnaround on an SXM module, and how many do you have on the bench right now? If the answer is “we don’t really do that,” that’s a procurement gap. The compute shortage isn’t only about how fast we can build new chips. It’s also about how fast we can fix the ones we have.


Sources

  1. Tom’s Hardware. “Faulty Nvidia H100 GPUs and HBM3 memory caused half of failures during Llama 3 training.”
  2. GPUnex. “GPU Shortage 2026: The HBM Memory Crisis Explained.”
  3. Clarifai. “GPU Shortages: How the AI Compute Crunch Is Reshaping Infrastructure.”
  4. Silicon Analysts. “NVIDIA GPU Prices 2026: B200 at $40K, H100 Dropping to $20K as Supply Eases.”
  5. Kavout. “Why are NVIDIA H100 GPU rental prices surging by 40%.”
  6. Tom’s Hardware. “Datacenter GPU service life can be surprisingly short, only one to three years is expected.”
  7. Jason A. Hoffman. “GPU Failure Rates and the Vocabulary Problem.”
  8. Innovation, Science and Economic Development Canada. “Canadian Sovereign AI Compute Strategy.”
  9. Data Center Knowledge. “Canada Emerges as Global Data Center Powerhouse.”

Microland Technical Services Inc. is a Markham, Ontario-based provider of depot repair, reverse logistics, refurbishment, ITAD, and lifecycle services for OEMs, distributors, and channel partners across Canada. Founded in 1994.
