Running 671B Models Locally? This $20K Chinese LLM Box Does It with 1 GPU, 1TB of RAM, and 20 Tokens/Sec

The landscape of local large language model (LLM) inference is often defined by the limitations of GPU VRAM. Enthusiasts meticulously plan multi-GPU setups, hunt for deals on used high-VRAM cards, and carefully select quantization levels to squeeze models onto their available hardware. However, a recent development out of China, while not directly accessible to most Western enthusiasts yet, presents an intriguing alternative architecture worth examining: the “Brown Ant” (褐蚁) HY90 appliance from Xingyun Integrated Circuits.

Priced at approximately ¥149,000 (around $20,000 USD at the time of writing), the HY90 is positioned as an enterprise solution, but its underlying design philosophy – leveraging massive system memory capacity and bandwidth – offers valuable insights for DIY builders pushing the limits of local LLM inference. The core claim is the ability to run large models like the full DeepSeek R1/V3 (671 billion parameters) effectively, bypassing the traditional VRAM bottleneck.

Hardware Architecture: Dual EPYC and a Sea of High-Bandwidth DDR5

At the heart of the HY90 lies a dual-socket configuration featuring AMD EPYC 9355 processors. While the specific motherboard model remains undisclosed – a common trait in pre-built systems – the choice of dual EPYC is pivotal. These server-grade CPUs provide access to a significantly higher number of memory channels compared to consumer platforms. The HY90 exploits this by populating the system with a staggering 24 DIMMs of 48GB DDR5-6400 memory.

This configuration yields a total system RAM capacity of 1152 GB (24 DIMMs × 48 GB each, one module per channel). Perhaps more critically for LLM inference, where the token generation (decode) stage is largely memory-bandwidth bound, this 24-channel DDR5-6400 setup delivers an aggregate theoretical memory bandwidth of roughly 1.2 TB/s. This figure substantially surpasses the bandwidth available even on high-end desktop platforms (typically dual or quad channel, topping out around 100-200 GB/s) and begins to approach the territory of lower-end HBM-equipped accelerator cards, albeit with much higher latency.
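Those two headline numbers fall straight out of the configuration: 24 modules of 48 GB, with each DDR5 channel moving 8 bytes per transfer at 6400 MT/s. A minimal sketch of the arithmetic, using only the figures from the announcement:

```python
# Back-of-envelope check of the HY90's advertised memory figures.
dimms = 24                # one DIMM per channel across both EPYC sockets
dimm_size_gb = 48         # 48 GB modules
transfer_rate_mts = 6400  # DDR5-6400, mega-transfers per second
bytes_per_transfer = 8    # each DDR5 channel is 64 bits (8 bytes) wide

capacity_gb = dimms * dimm_size_gb
per_channel_gbps = transfer_rate_mts * bytes_per_transfer / 1000  # GB/s
aggregate_gbps = per_channel_gbps * dimms

print(f"Capacity: {capacity_gb} GB")                              # 1152 GB
print(f"Per-channel bandwidth: {per_channel_gbps:.1f} GB/s")      # 51.2 GB/s
print(f"Aggregate bandwidth: {aggregate_gbps / 1000:.2f} TB/s")   # ~1.23 TB/s
```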

| Model | HY90 (Performance) | HY70 (Balanced) | HY50 (Low Cost) |
|---|---|---|---|
| Form Factor | Tower / Rack-mounted | Tower / Rack-mounted | Tower |
| Parameters | 671B in FP8 | 671B in FP8 | 671B in INT4 |
| Main Configuration | Dual AMD EPYC 9355, 24 × 48 GB DDR5-6400, RTX 5090D | To be released soon | To be released soon |
| Inference Speed | FP8: 21+ t/s; INT4: 28+ t/s | FP8: 20+ t/s; INT4: 24+ t/s | INT4: 20+ t/s |
| Context Length | Up to 128K context (FP8: 21.5+ t/s at 1K tokens, 20+ t/s at 2K, 19+ t/s at 16K) | Up to 128K context | To be released soon |
| First Token Latency | ≤40 s for 8K context; ≤80 s for 16K context | | |
| RAG | Supported | | |
| LLM Support | Open-source LLMs | To be released soon | To be released soon |
| Value-added Services | Custom backend software | | |
| Delivery Cycle | Within 1 month | | |
| Warranty | 2 years | | |
| Price | $20,500 (149,000 RMB) | Pending | Pending |

Xingyun’s core strategy appears to be treating this vast, high-bandwidth system RAM pool as the primary storage for the LLM weights, rather than relying solely, or even primarily, on GPU VRAM. This directly addresses the VRAM capacity wall that plagues enthusiasts trying to run 70B, 100B, or even larger models locally. The manufacturer suggests this architecture could theoretically support models with up to 1.5 trillion parameters, although at FP8 that many weights would already exceed the 1152 GB of installed RAM, so the claim presumably assumes lower-precision quantization; practical performance at that scale would also need independent verification.

Complementing the CPU and RAM is an NVIDIA RTX 5090D GPU. In a system where model weights primarily reside in system RAM, the GPU’s role likely shifts towards accelerating specific computational bottlenecks. This could involve handling the intensive matrix multiplications during the prompt processing (prefill) stage or offloading certain layers during token generation to leverage its superior parallel processing power and potentially higher internal bandwidth for those specific tasks, even if its VRAM capacity isn’t sufficient for the entire model. The specific brands for the system memory, motherboard, and storage subsystems were not detailed in the announcement.

Performance Claims and Inference Optimization

Xingyun claims specific performance figures for the HY90 running the full DeepSeek models, presumably using their own optimized inference engine. They state this engine incorporates improvements such as reducing Mixture-of-Experts (MoE) layer token latency from over 30 ms to 18 ms under INT4 precision.

For the Decode stage (token generation speed after prompt processing):

  • FP8 Precision: Stable performance exceeding 20 tokens per second (TPS). Notably, they claim it maintains 15 TPS even with a long context length of 128K tokens, suggesting the memory bandwidth is sufficient to keep the pipeline fed despite the larger KV cache demands.
  • INT4 Precision: Performance increases to 28 TPS, tested with a 1K context length. This aligns with the expectation that lower precision formats reduce memory bandwidth requirements and often increase computational throughput.

For the Prefill stage (time to first token):

  • FP8 Precision: The delay for the first token with a 16K context length is around 80 seconds. While not instantaneous, processing a 16K prompt entirely in system RAM within this timeframe, potentially accelerated by the 5090D, is a notable data point for this architecture.

These figures, particularly the stable TPS at high context lengths and the massive RAM capacity, are the key selling points. Many lower-cost enterprise appliances, and indeed many enthusiast builds, struggle with either VRAM capacity for large models or see performance plummet with longer contexts, especially when relying heavily on system RAM offloading with standard desktop memory bandwidth.
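A rough, bandwidth-only sanity check helps put those claims in context. Each decoded token requires streaming the active weights from memory at least once, and DeepSeek R1/V3 activates roughly 37B of its 671B parameters per token, so at FP8 that is on the order of 37 GB per token. The sketch below ignores KV-cache traffic, CPU compute limits, and any GPU offload, so it is only an upper-bound estimate, but the claimed 20-28 TPS sits plausibly under that ceiling:

```python
# Rough, bandwidth-only estimate of decode throughput for DeepSeek R1/V3 on the HY90.
# Ignores KV-cache reads, CPU compute limits, and any work offloaded to the RTX 5090D.
aggregate_bw_gbps = 1228.8   # 24 x DDR5-6400 channels, theoretical peak
active_params = 37e9         # ~37B parameters activated per token (MoE)

for name, bytes_per_param in [("FP8", 1.0), ("INT4", 0.5)]:
    bytes_per_token = active_params * bytes_per_param
    ceiling_tps = aggregate_bw_gbps * 1e9 / bytes_per_token
    print(f"{name}: ~{ceiling_tps:.0f} tokens/s theoretical ceiling")

# Prefill: 16K tokens in ~80 s works out to roughly 200 tokens/s of prompt processing.
print(f"Prefill throughput: ~{16_000 / 80:.0f} tokens/s")
```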

Implications for the Local LLM Enthusiast

While a $20,000 pre-built system is outside the budget of most home users, the HY90’s design offers several takeaways:

System RAM Bandwidth is Key
It highlights the critical importance of memory bandwidth when offloading model layers to system RAM. The jump from typical dual/quad-channel desktop DDR5 to 12 or even 24 channels in server platforms makes a profound difference in inference speed, potentially turning system RAM offloading from a slow fallback into a viable primary strategy for large models.

Viability of Used Server Hardware
This validates the approach some enthusiasts are already exploring: building systems around older, used EPYC (Rome, Milan) or Xeon Scalable platforms. While DDR4 offers less bandwidth than DDR5-6400, the sheer number of channels (often 8 per socket, or 12 on newer generations) can still yield aggregate bandwidth significantly higher than desktop systems, potentially offering a budget path to running larger quantized models effectively. Choosing a board that exposes every memory channel and populating it with affordable, high-density DIMMs becomes a critical build strategy.
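For a sense of scale, here is a hedged comparison of theoretical peak bandwidth across a few platform classes, assuming fully populated channels and the standard 8-byte channel width; sustained real-world figures will be noticeably lower:

```python
# Theoretical peak memory bandwidth for a few platform classes.
# (channels, MT/s); each channel moves 8 bytes per transfer.
platforms = {
    "Desktop, dual-channel DDR5-6000":      (2, 6000),
    "Used EPYC Rome/Milan, 8ch DDR4-3200":  (8, 3200),
    "Dual EPYC Rome/Milan, 16ch DDR4-3200": (16, 3200),
    "HY90, 24ch DDR5-6400":                 (24, 6400),
}

for name, (channels, mts) in platforms.items():
    bw_gbps = channels * mts * 8 / 1000
    print(f"{name:40s} {bw_gbps:7.1f} GB/s")
```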

Shifting Role of the GPU
In a high-bandwidth system RAM setup, the GPU doesn’t necessarily need colossal VRAM to be useful. A card with good compute capabilities and decent bandwidth (like the speculated 5090D, or perhaps more realistically for enthusiasts, used 3090s/4090s or even older compute cards) can act as a powerful accelerator for specific parts of the inference process, even if the bulk of the model resides elsewhere. This opens up different hardware combinations than the standard “stack VRAM” approach.

Software Optimization Matters
Xingyun’s mention of a custom inference engine optimizing MoE latency underscores that hardware alone isn’t the full picture. Efficient software (like llama.cpp, vLLM, TGI) that intelligently manages memory access patterns and computation scheduling across CPU, RAM, and GPU(s) is crucial to realizing the potential of such hardware configurations.
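As a concrete illustration of that pattern on enthusiast hardware, the sketch below uses the llama-cpp-python bindings to keep a quantized model mostly in system RAM while offloading a slice of layers to whatever VRAM is available. The model path, layer count, and thread count are placeholders to adjust for your own hardware:

```python
# Minimal sketch: weights live mainly in system RAM, a slice of layers goes to the GPU.
# Requires a GPU-enabled build of llama-cpp-python; paths and counts are placeholders.
from llama_cpp import Llama

llm = Llama(
    model_path="models/large-moe-model-q4_k_m.gguf",  # hypothetical quantized GGUF file
    n_gpu_layers=20,   # offload only as many layers as VRAM allows; the rest stay in RAM
    n_ctx=8192,        # context window; larger values raise KV-cache memory needs
    n_threads=32,      # roughly match the physical cores feeding the memory channels
)

out = llm("Explain why memory bandwidth limits decode speed.", max_tokens=128)
print(out["choices"][0]["text"])
```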

Future Outlook

Xingyun has indicated the HY90 is the top-tier offering, with plans for more cost-effective HY70 (Balanced) and HY50 (Low Cost) models. These might feature less RAM, slower CPUs, or potentially utilize Xingyun’s own forthcoming AI accelerators instead of high-end NVIDIA GPUs, aiming for better performance-per-dollar in lower budget brackets. They also plan multi-node “Ant Colony” solutions for increased concurrency.

Conclusion

The Xingyun “Brown Ant” HY90, despite its enterprise focus and current geographical limitation, serves as a compelling case study in leveraging high-bandwidth system memory for large-scale local LLM inference. It challenges the VRAM-centric paradigm and suggests that focusing on maximizing system memory channels and speed, potentially using accessible used server hardware, could be a viable path forward for enthusiasts seeking to run ever-larger models without breaking the bank on cutting-edge, high-VRAM GPUs. While we won’t be buying an HY90 off-the-shelf soon, the architectural lessons learned from its design – prioritizing memory bandwidth via multi-channel server platforms – provide valuable food for thought for the next generation of DIY local inference builds.
