Cut Your Losses in Large-Vocabulary Language Models

As language models grow ever larger, so do their vocabularies. This has shifted the memory footprint of LLM training disproportionately to one single layer: the cross-entropy in the loss computation. Cross-entropy builds up a logit matrix with an entry for each pair of input token and vocabulary item and, for small models, consumes an order of magnitude more memory than the rest of the LLM combined. We propose Cut Cross-Entropy (CCE), a method that computes the cross-entropy loss without materializing the logits for all tokens in global memory. Instead, CCE computes only the logit for the correct token and evaluates the log-sum-exp over all logits on the fly. We implement a custom kernel that performs the matrix multiplications and the log-sum-exp reduction over the vocabulary in flash memory, making the global memory consumption of the cross-entropy computation negligible. This has a dramatic effect. Taking the Gemma 2 (2B) model as an example, CCE reduces the memory footprint of the loss computation from 24 GB to 1 MB, and the total training-time memory consumption of the classifier head from 28 GB to 1 GB. To improve the throughput of CCE, we leverage the inherent sparsity of softmax and propose to skip the elements of the gradient computation that have a negligible (i.e., below numerical precision) contribution to the gradient. Experiments demonstrate that the dramatic reduction in memory consumption is accomplished without sacrificing training speed or convergence.
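To make the core idea concrete, the following is a minimal PyTorch sketch of the forward computation, under our own assumptions: function and variable names and the chunk size are illustrative, and the Python loop over vocabulary chunks stands in for the paper's fused kernel, which keeps each block in on-chip memory instead of global memory.

```python
import torch

def cut_cross_entropy_reference(embeddings, classifier, targets, chunk_size=4096):
    """Cross-entropy without materializing the full (tokens x vocab) logit matrix.

    embeddings: (N, D) final hidden states, one per input token
    classifier: (V, D) classifier-head weight matrix
    targets:    (N,)   index of the correct token for each position
    """
    # Logit of the correct token only: one dot product per token.
    correct_logit = (embeddings * classifier[targets]).sum(dim=-1)  # (N,)

    # Log-sum-exp over the vocabulary, accumulated one vocabulary chunk at a
    # time, so at most an (N, chunk_size) logit block exists at any moment.
    lse = embeddings.new_full((embeddings.shape[0],), float("-inf"))
    for start in range(0, classifier.shape[0], chunk_size):
        block = embeddings @ classifier[start:start + chunk_size].T
        lse = torch.logaddexp(lse, torch.logsumexp(block, dim=-1))

    # Per-token cross-entropy is log-sum-exp minus the correct-token logit.
    return (lse - correct_logit).mean()
```

This sketch covers only the forward pass; the gradient-sparsity filtering described above applies analogously in the backward pass, where softmax entries below numerical precision are skipped when accumulating gradients.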
