RAttention: Towards the Minimal Sliding Window Size in Local-Global Attention Models

Local-global attention models have recently emerged as compelling alternatives to standard Transformers, promising improvements in both training and inference efficiency. However, the crucial choice of window size presents a Pareto tradeoff between performance and efficiency. Current models, such as Gemma2 and Mistral, adopt conservative window sizes (e.g., 4096 out of an 8192 pretraining length) to preserve performance. This work investigates strategies for shifting this Pareto frontier, enabling local-global models to achieve efficiency gains even in short-context regimes. Our core motivation is to address an inherent limitation of local attention: its complete disregard for tokens outside the defined window. We explore RAttention, local attention combined with a specialized linear attention mechanism designed to capture information from these out-of-window tokens. Pretraining experiments at the 3B and 12B scales demonstrate that RAttention achieves a superior Pareto tradeoff between performance and efficiency. As a sweet spot, RAttention with a window size of just 512 consistently matches the performance of full-attention models across diverse settings. Furthermore, the recurrent state from RAttention's linear attention component contributes to improved long-context performance, as validated on the RULER benchmark. Crucially, these improvements do not come at the cost of training efficiency: thanks to a specialized kernel implementation and the reduced window size, RAttention maintains training speed comparable to existing state-of-the-art methods.
** Work done while at Apple.
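
To make the mechanism concrete, below is a minimal, single-head sketch of the idea described in the abstract: causal sliding-window softmax attention over in-window tokens, combined with a simple linear-attention recurrence that accumulates tokens once they fall outside the window. The feature map `phi`, the fixed 50/50 mixing of the two outputs, and all function names are illustrative assumptions, not the paper's actual formulation or kernel.

```python
import numpy as np

def rattention_sketch(q, k, v, window=4):
    """q, k, v: (T, d) single-head arrays.
    Causal sliding-window softmax attention plus a linear-attention state
    summarizing tokens that have fallen out of the window (illustrative only)."""
    T, d = q.shape
    out = np.zeros_like(v)
    # Linear-attention state over out-of-window tokens:
    #   S = sum_j phi(k_j) v_j^T,   z = sum_j phi(k_j)
    S = np.zeros((d, d))
    z = np.zeros(d)
    phi = lambda x: np.maximum(x, 0.0) + 1e-6  # assumed positive feature map

    for t in range(T):
        lo = max(0, t - window + 1)
        if t >= window:
            # Token (t - window) just left the sliding window; fold it into the state.
            j = t - window
            S += np.outer(phi(k[j]), v[j])
            z += phi(k[j])

        # Softmax attention over the in-window tokens [lo, t].
        scores = q[t] @ k[lo:t + 1].T / np.sqrt(d)
        w = np.exp(scores - scores.max())
        w /= w.sum()
        local = w @ v[lo:t + 1]

        # Linear-attention readout over the out-of-window prefix.
        if z.sum() > 0:
            lin = (phi(q[t]) @ S) / (phi(q[t]) @ z)
            out[t] = 0.5 * local + 0.5 * lin  # assumed mixing rule; the paper's may differ
        else:
            out[t] = local
    return out

# Toy usage
rng = np.random.default_rng(0)
T, d = 16, 8
o = rattention_sketch(rng.standard_normal((T, d)),
                      rng.standard_normal((T, d)),
                      rng.standard_normal((T, d)), window=4)
print(o.shape)  # (16, 8)
```

The key point the sketch illustrates is that the out-of-window contribution is maintained as a constant-size recurrent state, so each query still sees the full prefix while the softmax computation stays limited to the window.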



