
Huawei Launches CloudMatrix 384 AI System

Michael Willson

Huawei has unveiled the CloudMatrix 384, its most advanced AI computing system to date. It combines 384 Ascend 910C chips into a single cluster that rivals, and in some ways exceeds, Nvidia’s latest AI platform. The launch took place at the 2025 World AI Conference in Shanghai and positions Huawei as a key player in large-scale AI infrastructure—especially as the US tightens restrictions on Nvidia’s exports to China.

This article breaks down what CloudMatrix 384 is, how it performs, what makes it different from competitors, and why it matters in the global AI race.


What Is CloudMatrix 384?

CloudMatrix 384 is a high-performance AI cluster built by Huawei. It connects 384 of Huawei’s own Ascend 910C chips using an all-optical interconnect. The system is designed to support the most demanding AI workloads, including large language models and multi-modal inference.

Unlike traditional GPU clusters that chase raw per-chip speed, Huawei's approach bets on system-level performance. By tightly integrating chips, memory, and interconnect, the platform delivers higher throughput and bandwidth for massive AI workloads.

Key Features and System Design

The system uses a unified bus architecture, which allows direct communication between chips at high speed. This is paired with 192 Kunpeng CPUs and 48 terabytes of HBM memory, making it suitable for training and deploying large-scale foundation models.

CloudMatrix 384 supports advanced parallelism strategies like expert parallel (EP320) and uses token optimization for inference acceleration. The onboard software, called CloudMatrix-Infer, handles peer-to-peer token dispatch and memory-efficient model serving.
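
To make the expert-parallel serving idea concrete, here is a minimal sketch of top-k gated token dispatch, the general mechanism behind an EP320-style deployment. Everything here is illustrative: CloudMatrix-Infer's actual internals are not public, and all names and values below are assumptions.

```python
import numpy as np

# Minimal sketch of expert-parallel token dispatch (illustrative only;
# not the CloudMatrix-Infer API). Each token is routed to its top-k
# experts, which in an EP320-style deployment would live on different NPUs.

NUM_EXPERTS = 320   # matches the EP320 degree mentioned above
TOP_K = 2           # experts consulted per token (assumed value)
HIDDEN = 64         # toy hidden size for the demo

rng = np.random.default_rng(0)
tokens = rng.standard_normal((8, HIDDEN))           # batch of 8 token vectors
gate_w = rng.standard_normal((HIDDEN, NUM_EXPERTS)) # toy gating weights

def dispatch(tokens, gate_w, top_k=TOP_K):
    """Return {expert_id: [token indices]}: the routing table a
    peer-to-peer dispatcher would use to ship tokens between chips."""
    logits = tokens @ gate_w                         # (batch, num_experts)
    chosen = np.argsort(-logits, axis=1)[:, :top_k]  # top-k experts per token
    routing = {}
    for tok_idx, experts in enumerate(chosen):
        for e in experts:
            routing.setdefault(int(e), []).append(tok_idx)
    return routing

routing = dispatch(tokens, gate_w)
print(f"{len(routing)} experts receive work for this batch")
```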

Core Specs and Capabilities of CloudMatrix 384

| Component | Specification | Impact |
| --- | --- | --- |
| Accelerators | 384 Ascend 910C NPUs | High compute power for training and inference |
| Memory | 48 TB HBM | Supports large context windows in LLMs |
| CPUs | 192 Kunpeng cores | Coordinates scheduling and task management |
| Interconnect | All-optical supernode | Low latency, high bandwidth between chips |
| Peak compute (BF16) | Up to 300 PFLOPS | Exceeds Nvidia GB200 NVL72 (180 PFLOPS) |

How It Compares to Nvidia

Huawei openly admits that a single Ascend 910C is not as powerful as Nvidia’s best chips. But CloudMatrix makes up for that with scale. By integrating more chips with better system design, Huawei claims to deliver superior performance at the cluster level.

CloudMatrix 384 offers 3.6 times the memory capacity and more than twice the memory bandwidth of Nvidia's GB200 NVL72. Its peak BF16 compute is roughly 67 percent higher. That scale comes at a cost, however: the system draws around 559 kilowatts, roughly 2.3 times more power per unit of compute than Nvidia's rack.
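
Those headline ratios follow directly from the published figures. Here is a quick back-of-envelope check in Python, using the numbers from the comparison table below (the ~13.5 TB and ~145 kW NVL72 figures are commonly cited estimates, not Huawei's):

```python
# Back-of-envelope comparison from the figures in the table below.
cm384 = {"pflops_bf16": 300, "hbm_tb": 48,   "power_kw": 559}
nvl72 = {"pflops_bf16": 180, "hbm_tb": 13.5, "power_kw": 145}  # estimates

print(f"Compute advantage: {cm384['pflops_bf16'] / nvl72['pflops_bf16']:.2f}x")
print(f"Memory advantage:  {cm384['hbm_tb'] / nvl72['hbm_tb']:.1f}x")

# The trade-off: performance per watt favors Nvidia by roughly 2.3x.
for name, system in (("CloudMatrix 384", cm384), ("GB200 NVL72", nvl72)):
    print(f"{name}: {system['pflops_bf16'] / system['power_kw']:.2f} PFLOPS/kW")
```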

CloudMatrix 384 vs Nvidia GB200 NVL72

| Feature | Huawei CloudMatrix 384 | Nvidia GB200 NVL72 | Winner |
| --- | --- | --- | --- |
| Number of AI chips | 384 Ascend 910C | 72 Blackwell GPUs | Huawei (more chips) |
| Memory | 48 TB HBM | ~13.5 TB HBM3e | Huawei |
| Peak compute (BF16) | 300 PFLOPS | 180 PFLOPS | Huawei |
| Power consumption | ~559 kW | ~145 kW | Nvidia (more efficient) |
| Interconnect | All-optical supernode | NVLink switch | Huawei (lower latency) |

Strategic Value for China

CloudMatrix 384 is more than a technical achievement. It represents China’s push to build domestic AI hardware and reduce reliance on US-based companies like Nvidia. With export controls limiting access to advanced chips, Huawei’s system gives China a homegrown alternative.

Huawei invests around ¥180 billion per year in R&D. CloudMatrix shows that this focus has shifted from single-chip performance to full-stack ecosystem control: hardware, compilers, interconnects, and training software, all built in-house.

Use Cases and Workload Targets

Huawei built CloudMatrix 384 for:

  • Training foundation models
  • Real-time inference at scale
  • Multi-modal systems using vision and language
  • Expert parallel models for higher throughput

The system sustains inference speeds of over 6,600 tokens per second per chip for prefill and close to 2,000 tokens per second for decoding, while keeping per-token decode latency under 50 milliseconds. Under a tighter 15 ms latency cap, it still delivers 538 tokens per second per chip using INT8 quantization. A rough scaling of these figures to the full cluster follows below.
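
To put those per-chip numbers in cluster terms, here is a rough, illustrative Python calculation. It idealizes heavily (it assumes all 384 NPUs serve concurrently with zero scheduling overhead) and uses the rounded figures quoted above:

```python
# Rough, idealized scaling of the per-chip figures quoted above.
# Assumes all 384 NPUs serve concurrently with no overhead.
CHIPS = 384
PREFILL_TPS = 6_600   # tokens/s per chip, prefill (article figure)
DECODE_TPS = 2_000    # tokens/s per chip, decode (rounded, per article)

print(f"Cluster-wide prefill: {CHIPS * PREFILL_TPS:,} tokens/s")
print(f"Cluster-wide decode:  {CHIPS * DECODE_TPS:,} tokens/s")

# A 15 ms per-token latency cap limits any single sequence to about
# 1 / 0.015 ≈ 67 tokens/s, so reaching 538 tokens/s per chip under
# that cap implies roughly 538 / 67 ≈ 8 sequences decoding in parallel.
seq_tps = 1 / 0.015
print(f"Implied concurrent sequences per chip: {538 / seq_tps:.1f}")
```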

Who Should Care

This launch is significant for researchers, enterprises, and governments focused on sovereign AI infrastructure. If you work in AI deployment, edge computing, or national computing policy, CloudMatrix 384 is a case study in scaling up with constraints.

To understand how large systems like this are designed and optimized, professionals should explore programs like the AI Certification. For engineers working with model performance and system bottlenecks, the Data Science Certification offers practical insights. For strategy, enterprise use, and commercialization, the Marketing and Business Certification is ideal.

Final Takeaway

Huawei’s CloudMatrix 384 is a bold response to AI chip restrictions and growing demand for domestic compute infrastructure. By focusing on system architecture and tight integration, Huawei has built an AI cluster that rivals Nvidia’s best—at least at the data center level.

The real test will be adoption. If Chinese tech firms, universities, and government agencies switch to CloudMatrix for large-scale AI tasks, it could change the global balance in AI hardware.

Huawei isn’t just building chips—it’s building control over the AI future, one cluster at a time.
