Parallel Track Transformers: Enabling Fast GPU Inference with Reduced Synchronization

Efficient large-scale inference of transformer-based large language models (LLMs) remains a significant systems challenge, often requiring multiple forms of GPU parallelism to meet tight latency and throughput targets. Conventional tensor parallelism partitions matrix computations across devices but introduces frequent inter-GPU synchronization, leading to communication bottlenecks and limited scaling. We propose the Parallel Track (PT) Transformer, a novel architectural paradigm that reorganizes computation to reduce inter-device dependencies. PT achieves up to a 16x reduction in inter-GPU synchronization relative to standard tensor parallelism, while maintaining competitive model quality in our tests. We integrate PT into two widely used LLM serving stacks—TensorRT-LLM and vLLM—and report consistent improvements in serving efficiency, including a 15-30% reduction in time to first token, a 2-12% reduction in time per output token, and up to 31.90% higher throughput in both settings.
** Work done while at Apple.
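
Below is a minimal, self-contained sketch of the parallel-track idea described in the abstract, assuming each track is an independent stack of transformer blocks over a slice of the hidden state, with cross-track mixing applied only every few blocks instead of after every matrix multiply as in tensor parallelism. The class and parameter names (`TrackBlock`, `ParallelTrackModel`, `sync_every`) are illustrative assumptions, not the paper's API, and the single-process code stands in for per-device tracks in a real multi-GPU deployment.

```python
# Illustrative sketch only: independent "tracks" that synchronize rarely,
# mimicking how PT reduces inter-GPU synchronization points.
import torch
import torch.nn as nn


class TrackBlock(nn.Module):
    """One transformer block operating only on its track's hidden slice."""

    def __init__(self, dim: int, num_heads: int):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        h = self.norm1(x)
        attn_out, _ = self.attn(h, h, h, need_weights=False)
        x = x + attn_out
        return x + self.mlp(self.norm2(x))


class ParallelTrackModel(nn.Module):
    """Splits the hidden state into independent tracks (one per device in a real
    deployment) and exchanges information across tracks only every `sync_every`
    blocks, rather than after every layer."""

    def __init__(self, dim: int, num_tracks: int, num_blocks: int, num_heads: int, sync_every: int):
        super().__init__()
        assert dim % num_tracks == 0 and num_blocks % sync_every == 0
        self.num_tracks = num_tracks
        self.sync_every = sync_every
        track_dim = dim // num_tracks
        self.blocks = nn.ModuleList(
            nn.ModuleList(TrackBlock(track_dim, num_heads) for _ in range(num_tracks))
            for _ in range(num_blocks)
        )
        # Cross-track mixing applied only at synchronization points.
        self.mixers = nn.ModuleList(nn.Linear(dim, dim) for _ in range(num_blocks // sync_every))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        tracks = list(x.chunk(self.num_tracks, dim=-1))  # one slice per track/device
        sync_idx = 0
        for i, layer in enumerate(self.blocks):
            # Independent per-track compute: no inter-device communication here.
            tracks = [block(t) for block, t in zip(layer, tracks)]
            if (i + 1) % self.sync_every == 0:
                # Synchronization point: gather tracks, mix, and re-split.
                mixed = self.mixers[sync_idx](torch.cat(tracks, dim=-1))
                tracks = list(mixed.chunk(self.num_tracks, dim=-1))
                sync_idx += 1
        return torch.cat(tracks, dim=-1)


if __name__ == "__main__":
    model = ParallelTrackModel(dim=256, num_tracks=4, num_blocks=8, num_heads=4, sync_every=4)
    out = model(torch.randn(2, 16, 256))
    print(out.shape)  # torch.Size([2, 16, 256])
```

With `sync_every=4`, the eight blocks above require only two cross-track exchanges, whereas tensor parallelism would typically synchronize after every block; this is the rough intuition behind the reduced synchronization the abstract reports.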


