News

Microsoft Unveils Maia 200 Inference Chip as Hyperscalers Seek Alternatives to Nvidia

Microsoft on Monday introduced Maia 200, a custom chip designed to run artificial intelligence models efficiently at scale, as big cloud providers look for ways to manage rising inference costs and reduce reliance on Nvidia's graphics processors.

The new accelerator targets inference, the stage where trained models generate text, images, and other outputs for users. As AI services move from experimentation to production, the cost of generating tokens has become a growing share of overall spending. Microsoft framed Maia 200 as an attempt to improve those economics through a combination of low-precision compute, higher-bandwidth memory, and networking designed for large clusters.

"Today, we're proud to introduce Maia 200, a breakthrough inference accelerator engineered to dramatically improve the economics of AI token generation," Scott Guthrie, Microsoft's executive vice president for Cloud and AI, wrote in a blog post announcing the chip.

Maia 200 is built on TSMC's 3-nanometer process and is designed around lower-precision math used in modern inference workloads. Microsoft said each chip contains more than 140 billion transistors and delivers more than 10 petaFLOPS in 4-bit precision (FP4), and more than 5 petaFLOPS in 8-bit precision (FP8), within a 750-watt thermal envelope. The chip includes 216 gigabytes of HBM3e memory with 7 terabytes per second of bandwidth, 272 megabytes of on-chip SRAM, and data movement engines to reduce bottlenecks that can limit real-world throughput even when raw compute is high.
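Microsoft's announcement does not include code, but as a rough, generic sketch of why low-precision arithmetic changes inference economics, the snippet below quantizes a weight matrix to 8-bit integers and uses it in a matrix multiply with PyTorch. The shapes and the simple per-tensor scaling scheme are illustrative assumptions, not details of how Maia 200 actually works.

```python
# Illustrative only: generic 8-bit weight quantization, not Maia-specific code.
import torch

def quantize_int8(w: torch.Tensor):
    # Symmetric per-tensor quantization: FP32 weights become int8 values plus one FP scale.
    scale = w.abs().max() / 127.0
    q = torch.clamp((w / scale).round(), -127, 127).to(torch.int8)
    return q, scale

def int8_matmul(x: torch.Tensor, q_w: torch.Tensor, scale: torch.Tensor):
    # Dequantize on the fly here for clarity; dedicated accelerators keep the math
    # in low precision and rescale the accumulator, which is where the savings come from.
    return x @ (q_w.to(torch.float32) * scale)

x = torch.randn(4, 256)        # hypothetical activation batch
w = torch.randn(256, 1024)     # hypothetical FP32 weight matrix
q_w, scale = quantize_int8(w)

print("max abs error:", (x @ w - int8_matmul(x, q_w, scale)).abs().max().item())
print("weight bytes: fp32 =", w.numel() * 4, " int8 =", q_w.numel())
```

The storage comparison at the end is the key point: lower-precision formats such as FP8 and FP4 shrink both the compute and the memory traffic per token, which is why the same silicon budget can advertise far higher peak throughput at 4-bit than at 8-bit precision.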

"Crucially, FLOPS aren't the only ingredient for faster AI," Guthrie wrote. "Feeding data is equally important."

The launch comes as Microsoft, Google, and Amazon invest heavily in custom silicon alongside Nvidia GPUs. Google's TPU family and Amazon's Trainium chips offer alternatives within their cloud services, and Microsoft has long signaled that it wants greater control over costs and capacity in its AI infrastructure. Maia 200 follows Maia 100, introduced in 2023, and the company is positioning the new chip as an inference-focused workhorse for its AI products.

Microsoft said Maia 200 will support multiple models, including "the latest GPT-5.2 models from OpenAI," and will be used to deliver a performance-per-dollar advantage to Microsoft Foundry and Microsoft 365 Copilot. The company also said its Microsoft Superintelligence team plans to use Maia 200 for synthetic data generation and reinforcement learning as it develops in-house models. Guthrie wrote that, for synthetic data pipelines, Maia 200's design can accelerate the generation and filtering of "high-quality, domain-specific data."

The chip is also an effort to compete on headline performance with hyperscaler rivals. Guthrie wrote that Maia 200 is "the most performant, first-party silicon from any hyperscaler," adding that it offers "three times the FP4 performance of the third generation Amazon Trainium" and "FP8 performance above Google's seventh generation TPU." Such comparisons typically hinge on vendor-provided benchmarks, and Microsoft did not provide full test configurations for those claims in its post.

At the systems level, Microsoft said Maia 200 uses a two-tier scale-up network design built on standard Ethernet, rather than proprietary interconnects. The company said each accelerator provides 2.8 terabytes per second of dedicated, bidirectional scale-up bandwidth and supports collective operations across clusters of up to 6,144 accelerators. Within a tray, four accelerators are connected with direct, non-switched links to keep high-bandwidth communication local, while a unified transport protocol is used across trays, racks, and clusters to simplify scaling.
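The announcement gives only the tray size, the maximum cluster size, and the per-chip scale-up bandwidth; the back-of-envelope calculation below simply combines those published figures. The tray-to-rack grouping above four accelerators, and the naive summing of per-chip bandwidth, are assumptions for illustration only.

```python
# Back-of-envelope figures taken from Microsoft's announcement; anything beyond
# four-accelerator trays is not spelled out there and is not modeled here.
ACCELERATORS_PER_TRAY = 4           # directly linked, non-switched (per the post)
MAX_CLUSTER_ACCELERATORS = 6_144    # maximum scale-up domain (per the post)
SCALE_UP_TBPS_PER_CHIP = 2.8        # dedicated bidirectional bandwidth (per the post)

trays = MAX_CLUSTER_ACCELERATORS // ACCELERATORS_PER_TRAY
naive_summed_tbps = MAX_CLUSTER_ACCELERATORS * SCALE_UP_TBPS_PER_CHIP

print(f"trays in a maximum cluster: {trays}")                       # 1,536 trays
print(f"naively summed scale-up bandwidth: {naive_summed_tbps:,.1f} TB/s")
```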

Deployment is underway inside Microsoft's own data centers. The company said Maia 200 is already running in its U.S. Central region near Des Moines, Iowa, with the U.S. West 3 region near Phoenix, Arizona, coming next and additional regions planned later. Microsoft also emphasized tighter integration with Azure's control plane, including chip- and rack-level security, telemetry, diagnostics, and management.

To build a developer ecosystem, Microsoft said it is previewing a Maia software development kit that includes PyTorch integration, a Triton compiler, optimized kernel libraries, and access to a low-level programming language. Guthrie wrote that the SDK also includes a simulator and cost calculator intended to help developers optimize earlier in the development cycle.
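The Maia SDK is only in preview and the announcement shows no API details, so the following is a standard OpenAI Triton vector-add kernel, included purely to illustrate the style of kernel authoring a Triton compiler path enables. Nothing in it is Maia-specific, and as written it targets the GPU backends Triton supports today.

```python
# Generic OpenAI Triton kernel; shown only to illustrate the programming model
# a Triton compiler path builds on. It is not Maia SDK code.
import torch
import triton
import triton.language as tl

@triton.jit
def add_kernel(x_ptr, y_ptr, out_ptr, n_elements, BLOCK_SIZE: tl.constexpr):
    pid = tl.program_id(axis=0)                     # one program instance per block
    offsets = pid * BLOCK_SIZE + tl.arange(0, BLOCK_SIZE)
    mask = offsets < n_elements                     # guard the tail of the tensor
    x = tl.load(x_ptr + offsets, mask=mask)
    y = tl.load(y_ptr + offsets, mask=mask)
    tl.store(out_ptr + offsets, x + y, mask=mask)

def add(x: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
    out = torch.empty_like(x)
    n = out.numel()
    grid = (triton.cdiv(n, 1024),)                  # enough blocks to cover all elements
    add_kernel[grid](x, y, out, n, BLOCK_SIZE=1024)
    return out
```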

The company highlighted process changes aimed at speeding deployment from silicon arrival to data center availability. Guthrie wrote that Microsoft used a "sophisticated pre-silicon environment" to model the computation and communication patterns of AI models, and that this approach helped cut the time from first silicon to first rack deployment to "less than half" that of comparable programs.

Investors are also watching how internal chips could affect Microsoft's spending and margins as it scales AI services. Microsoft shares rose about 1% in early U.S. trading on Tuesday, ahead of the company's fiscal second-quarter earnings report due Wednesday; a market note linked the move to attention on Azure growth, AI capacity constraints, and the Maia 200 rollout. Microsoft has previously flagged AI capacity constraints that it expects to last through June, and the earnings report is likely to draw scrutiny of capital expenditures, supply, and the pace at which AI demand translates into durable revenue.

Microsoft's broader bet is that custom chips plus software tooling can reduce costs and improve control over AI infrastructure without sacrificing compatibility. Guthrie described Maia 200 as "the most efficient inference system Microsoft has ever deployed," and wrote that it delivers "30% better performance per dollar than the latest generation hardware" in Microsoft's fleet.

For customers, the more immediate question is less about peak petaFLOPS and more about availability, developer support, and whether Maia 200 meaningfully reduces the cost of serving AI features. Microsoft is betting that inference optimization, including memory bandwidth, networking, and software integration, will become as strategically important as training as AI systems move into everyday products.

About the Author

John K. Waters is the editor in chief of a number of Converge360.com sites, with a focus on high-end development, AI, and future tech. He's been writing about cutting-edge technologies and the culture of Silicon Valley for more than two decades, and he's written more than a dozen books. He also co-scripted the documentary film Silicon Valley: A 100 Year Renaissance, which aired on PBS. He can be reached at [email protected].