Huawei Debuts AI Inference Tech With China UnionPay, Promises 90% Cut in First-Token Latency

UnionPay reported that using UCM cut model inference times for customer-service classification from 600 seconds to under 10 seconds — a more than 60-fold improvement — while boosting classification accuracy from under 10% to 80%.

TMTPOST — Huawei has unveiled a new AI inference technology designed to slash latency, lower costs and boost the commercial viability of large AI models, as demand shifts from training to inference workloads.

The system, called UCM Inference Memory Data Manager, aims to improve the speed and efficiency of AI by caching previously processed results and retrieving them from high-performance shared storage rather than recalculating from scratch. Huawei says the approach can cut “first-token” latency by up to 90%, increase tokens processed per second by as much as 22 times in long-sequence scenarios, and reduce per-token costs — all without major new hardware investments.
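The core idea — store the result of an expensive prefill step and serve repeat requests from a cache instead of recomputing — can be illustrated with a minimal sketch. This is not UCM's actual API (which Huawei has not published); the class and function names, and the use of an in-memory dict standing in for high-performance shared storage, are assumptions for illustration only.

```python
import hashlib

class PrefixKVCache:
    """Toy prefix-reuse cache: store the KV state computed for a prompt
    prefix so an identical prefix skips the prefill pass entirely.
    (Illustrative sketch, not UCM's real interface.)"""

    def __init__(self):
        self._store = {}  # stands in for shared high-performance storage

    def _key(self, prefix_tokens):
        # Content-address the prefix so identical prompts map to one entry.
        return hashlib.sha256(repr(prefix_tokens).encode()).hexdigest()

    def get(self, prefix_tokens):
        return self._store.get(self._key(prefix_tokens))

    def put(self, prefix_tokens, kv_state):
        self._store[self._key(prefix_tokens)] = kv_state


cache = PrefixKVCache()

def prefill(prefix_tokens):
    """Pretend 'prefill': the expensive step a cache hit lets us skip."""
    kv = cache.get(prefix_tokens)
    if kv is not None:
        return kv, True  # hit: first token can be emitted with no recompute
    kv = [t * 2 for t in prefix_tokens]  # placeholder for real KV computation
    cache.put(prefix_tokens, kv)
    return kv, False
```

On a cache hit, the time-to-first-token collapses to a lookup — the mechanism behind the claimed latency reduction, though the real system operates on attention KV tensors across engines and storage tiers rather than Python lists.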

UCM consists of three key modules: connectors that integrate with popular inference engines, an accelerator library for hierarchical KV Cache management, and an adapter that speeds up access to professional shared storage. By coordinating inference frameworks, computing power, and storage, Huawei says the system addresses industry pain points of “slow” and “expensive” inference.
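"Hierarchical KV Cache management" implies placing cache entries across memory tiers of decreasing speed and increasing capacity (e.g. HBM, then DRAM, then shared storage). The sketch below shows that pattern with LRU eviction cascading down the tiers and hot entries promoted back up; the tier names, capacities, and eviction policy are assumptions, since Huawei has not disclosed the accelerator library's internals.

```python
from collections import OrderedDict

class Tier:
    """One cache tier with a capacity cap and LRU eviction (an assumed policy)."""
    def __init__(self, name, capacity):
        self.name, self.capacity = name, capacity
        self.data = OrderedDict()

    def get(self, key):
        if key in self.data:
            self.data.move_to_end(key)  # mark as recently used
            return self.data[key]
        return None

    def put(self, key, value):
        self.data[key] = value
        self.data.move_to_end(key)
        if len(self.data) > self.capacity:
            return self.data.popitem(last=False)  # evicted (key, value)
        return None

class HierarchicalKVCache:
    """Check fast tiers first; evictions cascade to slower, larger tiers."""
    def __init__(self):
        self.tiers = [Tier("HBM", 2), Tier("DRAM", 4), Tier("shared_storage", 100)]

    def get(self, key):
        for i, tier in enumerate(self.tiers):
            value = tier.get(key)
            if value is not None:
                if i > 0:
                    self.put(key, value)  # promote hot entry toward HBM
                return value, tier.name
        return None, None

    def put(self, key, value):
        item = (key, value)
        for tier in self.tiers:
            item = tier.put(*item)
            if item is None:
                break  # no eviction; nothing to push further down
```

The design choice is the familiar memory-hierarchy trade: most lookups hit the small fast tier, while the large shared-storage tier keeps long-sequence context available without recomputation.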

Huawei is piloting UCM with China UnionPay in high-frequency financial scenarios, where response time and accuracy are critical. UnionPay reported that using UCM cut model inference times for customer-service classification from 600 seconds to under 10 seconds — a more than 60-fold improvement — while boosting classification accuracy from under 10% to 80%.

According to Huawei, demand for inference computing power now exceeds training demand, accounting for 58.5% of workloads. But China’s AI sector faces higher latency, slower output speeds and smaller context windows than leading overseas models, partly due to lower infrastructure investment and limited access to advanced chips.

Huawei plans to open source UCM in September, making it compatible with multiple inference engines, storage systems, and hardware vendors. The company says it hopes to rally industry players around common standards for AI inference acceleration.

This article was published on TMTPOST with the authorization of its author, zhangxinyue, and edited by TMTPOST. When republishing, please credit the source, the author, and the original link.
