Theory, Analysis, and Best Practices for Sigmoid Self-Attention

* Primary contributors
Attention is a key part of the transformer architecture. It is a sequence-to-sequence mapping that transforms each sequence element into a weighted sum of values. The weights are typically obtained as the softmax of dot products between keys and queries. Recent work has explored alternatives to softmax attention in transformers, such as ReLU and sigmoid activations. In this work, we revisit sigmoid attention and conduct an in-depth theoretical and empirical analysis. Theoretically, we prove that transformers with sigmoid attention are universal function approximators and benefit from improved regularity compared to softmax attention. Through detailed empirical analysis, we identify stabilization of large initial attention norms during the early stages of training as a crucial factor for the successful training of models with sigmoid attention. We also introduce FlashSigmoid, a hardware-aware and memory-efficient implementation of sigmoid attention that yields a 17% inference kernel speed-up over FlashAttention2 on H100 GPUs. Experiments across language, vision, and speech show that properly normalized sigmoid attention matches the strong performance of softmax attention. Our work unifies prior art and establishes best practices for sigmoid attention as a drop-in softmax replacement in transformers.
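To make the contrast between the two weighting schemes concrete, the following is a minimal sketch in plain Python, not the FlashSigmoid kernel. It computes attention weights from scaled query-key dot products, once with a row-wise softmax and once with an element-wise sigmoid. The function names, the 1/sqrt(d) scaling, and the default bias of -log(n) (n being the number of keys) are illustrative assumptions; the text above only says that sigmoid attention must be properly normalized.

```python
import math


def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def scaled_scores(queries, keys):
    """Query-key dot products, scaled by 1/sqrt(d) (an assumed convention)."""
    d = len(keys[0])
    return [[dot(q, k) / math.sqrt(d) for k in keys] for q in queries]


def softmax(row):
    """Numerically stable softmax: weights in a row sum to 1."""
    m = max(row)
    exps = [math.exp(s - m) for s in row]
    total = sum(exps)
    return [e / total for e in exps]


def sigmoid(x):
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def weighted_sum(weights, values):
    """Each output row is the weighted sum of the value vectors."""
    dim = len(values[0])
    return [
        [sum(w * v[j] for w, v in zip(w_row, values)) for j in range(dim)]
        for w_row in weights
    ]


def softmax_attention(queries, keys, values):
    weights = [softmax(row) for row in scaled_scores(queries, keys)]
    return weighted_sum(weights, values), weights


def sigmoid_attention(queries, keys, values, bias=None):
    # Assumption: a constant bias of -log(n) keeps the initial weights small.
    # This choice is illustrative and is not stated in the text above.
    if bias is None:
        bias = -math.log(len(keys))
    weights = [
        [sigmoid(s + bias) for s in row] for row in scaled_scores(queries, keys)
    ]
    return weighted_sum(weights, values), weights


if __name__ == "__main__":
    queries = [[0.0, 0.0]]
    keys = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
    values = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
    print(softmax_attention(queries, keys, values))
    print(sigmoid_attention(queries, keys, values))
```

Unlike softmax, the sigmoid weights in a row are computed independently of one another and need not sum to 1, which is why the sketch exposes a bias term to control their initial magnitude.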



