
From Equal Weights to Smart Weights: Understanding OTPO's Key Insight for LLM Alignment

Introduction

LLMs started out as research tools used to code, write, and search. They are now available through smartphone apps and web APIs, putting powerful AI within everyone's reach. These programs have become an integral part of our daily lives. People use AI to seek relationship advice, fact-check ideas (even though it is well known to make mistakes), plan meals, and organize their next holiday.

As ever more powerful models are released, researchers have explored the question of reliability: how to make sure that generated answers are honest and aligned with human values. This is not a new question. Traditionally, models are fine-tuned on human preference data (usually consisting of a prompt, a chosen response, and a rejected response) before being launched in public applications. As model alignment and safety grew into major research areas, many algorithms were developed for alignment training. Among them, Direct Preference Optimization (DPO) is the most popular due to its simplicity and efficiency.

But DPO has a basic limitation. When computing the likelihood of a response, it gives the same weight to every token in the response, even though humans naturally assign different importance to different parts. For example, let's take a look at the following user interaction with an LLM.

User: What is the capital of France?
LLM: The capital of France is Paris, a beautiful city with many attractions.

In this interaction, people care about the factual accuracy of “Paris” rather than the stylistic flourish, but standard DPO gives equal weight to all tokens, allowing the less relevant content to dilute the learning signal.

There have been many attempts to correct DPO's problems; algorithms such as SimPO and SamPO were introduced to deal with different shortcomings. In this post, we will look at another algorithm, published in May 2025: “Optimal Transport-Based Token Weighting scheme for Enhanced Preference Optimization” (OTPO). This post explains the basic idea behind that work and assumes a foundational understanding of LLMs and preference optimization.

Why Equal Token Weighting Fails

To understand why token weighting matters, we need to look at how DPO actually processes tokens. Typically, models are pre-trained, then instruction-tuned, and finally aligned to human preferences using DPO before being released to the public.
DPO works by comparing the log-likelihoods of the chosen and rejected responses at the token level. Each training example consists of a prompt x, a chosen response y_w, and a rejected response y_l. The backbone of DPO lies in its loss formula:

L_DPO(π_θ; π_ref) = −E_{(x, y_w, y_l)∼D}[ log σ( β log(π_θ(y_w | x) / π_ref(y_w | x)) − β log(π_θ(y_l | x) / π_ref(y_l | x)) ) ]

(Equation from the DPO paper)

Here π_θ (pi_theta) is the model being fine-tuned, π_ref (pi_ref) is a frozen reference model, and π(y | x) denotes the probability a model assigns to response y given prompt x.

The log-probability of a response y with tokens [t₁, t₂, ..., tₙ] decomposes token by token:

log π(y | x) = Σᵢ log π(tᵢ | x, t₁ … tᵢ₋₁)
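Combining the loss and the token-level decomposition, a minimal NumPy sketch of the standard DPO objective might look like this (the per-token log-probabilities below are made-up numbers, purely for illustration):

```python
import numpy as np

def dpo_loss(logp_w, logp_ref_w, logp_l, logp_ref_l, beta=0.1):
    """Standard DPO loss for one (chosen, rejected) pair.

    Each argument is an array of per-token log-probabilities; summing
    an array gives the sequence log-probability log pi(y | x).
    """
    margin = beta * (logp_w.sum() - logp_ref_w.sum()) \
           - beta * (logp_l.sum() - logp_ref_l.sum())
    # -log sigmoid(margin), computed stably as log(1 + exp(-margin))
    return np.logaddexp(0.0, -margin)

# Toy per-token log-probs for a 3-token chosen and rejected response.
loss = dpo_loss(
    logp_w=np.array([-0.1, -0.2, -0.1]),
    logp_ref_w=np.array([-0.3, -0.3, -0.2]),
    logp_l=np.array([-0.2, -0.1, -0.4]),
    logp_ref_l=np.array([-0.2, -0.3, -0.3]),
)
print(round(float(loss), 4))  # 0.6783
```

Note that every token enters the sums with the same weight of 1; that uniformity is exactly what OTPO later replaces.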

Each token contributes its log-probability equally, and there is no mechanism to weight important content more heavily than filler. Let's look at an example preference data point.

Prompt: What is the capital of France?
Chosen: The capital of France is Paris.
Rejected: The capital of France is Italy, which is actually incorrect.

DPO simply sums log-probabilities over all tokens, treating each one the same:
Chosen: log P("The") + log P("capital") + log P("of") + log P("France") + log P("is") + log P("Paris") + log P(".")

Rejected: log P("The") + log P("capital") + ... + log P("Italy") + ... + log P("incorrect") + log P(".")

The critical difference lies in “Paris” vs. “Italy”, yet these tokens receive exactly the same weight as every other token in the sum.

The model receives an equal learning signal from the essential token (“Paris”) and from unimportant ones (“which”, “actually”). This also creates a length bias: longer sequences accumulate more log-probability terms through sheer token count, so DPO can mistake verbosity for quality.

When unimportant stylistic tokens outnumber the semantically important ones, the learning signal gets diluted, resulting in suboptimal preference learning. These problems could be solved if we had a better way to give more weight to the relevant tokens when computing likelihoods. That is exactly what OTPO does.
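A toy calculation makes the dilution concrete. In the verbose answer from the earlier example, the one token that settles the question contributes only a small slice of the total log-probability under uniform weighting (the per-token value below is made up; with equal values the share is simply 1/n):

```python
import numpy as np

tokens = ["The", "capital", "of", "France", "is", "Paris", ",", "a",
          "beautiful", "city", "with", "many", "attractions", "."]
# Made-up per-token log-probabilities; uniform for simplicity.
logp = np.full(len(tokens), -0.5)

# Share of the sequence log-probability carried by the decisive token.
share = logp[tokens.index("Paris")] / logp.sum()
print(f"'Paris' carries {share:.1%} of the log-probability")  # 7.1%
```

The other thirteen tokens, mostly style, carry the remaining ~93% of the learning signal.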

Optimal Transport-Based Token Weighting

Now that we understand the uniform-weighting issue, let's see how OTPO solves it through the lens of optimal transport. OTPO views preference learning as a transportation problem: how much effort does it take to transform one response into another?

The key question becomes: how much semantic effort is required to change “The capital of France is Paris” into “The capital of France is Italy”?

OTPO frames this as an optimal transport problem in which the sources are the chosen response's tokens, the targets are the rejected response's tokens, and the transport costs reflect the semantic distance between token pairs. Similar tokens (such as “Paris” and “London”) have a low transport cost, while semantically distant tokens (such as “Paris” and “apple”) have a high cost.

The algorithm computes the optimal transport plan, which tells us how to move probability mass between the two responses at the lowest cost. Token pairs that carry a large share of that transport, especially those requiring a meaningful semantic change, receive higher weights in the final loss. This means OTPO automatically concentrates learning on the tokens that matter most for the preference, solving DPO's uniform-weighting problem.

The Math Behind OTPO

Now let's go into the details of the OTPO formulation. The algorithm has three main steps: constructing the cost matrix, solving the optimal transport problem, and computing the token-weighted loss.

Step 1: Constructing the Cost Matrix

OTPO begins by building a cost matrix C that measures the semantic distance between all token pairs. The cost of matching the i-th token of the chosen response (w) with the j-th token of the rejected response (l) is:

C[i][j] = ‖h_w[i] − h_l[j]‖²

where h_w[i] and h_l[j] are the tokens' last-layer hidden-state representations from the model. This squared Euclidean distance captures semantic similarity: similar tokens like “Paris” and “London” have a low cost, while distant tokens such as “Paris” and “apple” have a high cost.
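As a sketch, here is how the cost matrix could be assembled with NumPy. The hidden states are random stand-ins for the model's actual last-layer representations, so only the shapes and the distance computation are meaningful:

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in hidden states (random): 7 chosen tokens, 9 rejected tokens,
# hidden dimension 16. In OTPO these come from the model itself.
h_w = rng.normal(size=(7, 16))
h_l = rng.normal(size=(9, 16))

# C[i][j] = squared Euclidean distance between h_w[i] and h_l[j],
# computed for all pairs at once via broadcasting.
diff = h_w[:, None, :] - h_l[None, :, :]
C = np.sum(diff ** 2, axis=-1)

print(C.shape)  # (7, 9)
```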

Step 2: Solving the Optimal Transport Problem

OTPO obtains the token weights as the solution of an unbalanced optimal transport problem:

(Image from the OTPO paper)

Here Γ is the transport plan (the quantity being solved for), which aligns tokens between the chosen and rejected responses. Ω controls the strength of the entropy regularization. The KL terms ensure that the marginals of Γ stay close to the uniform weights that standard DPO implicitly uses. Solving this problem yields the plan γ used in the next step.
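To give a feel for how such a plan is computed, here is a sketch of classical entropy-regularized optimal transport solved with Sinkhorn iterations. One hedge: OTPO's actual formulation is unbalanced (its marginal constraints are only soft), whereas this illustration uses the standard balanced version with fixed uniform marginals:

```python
import numpy as np

def sinkhorn(C, reg=0.5, n_iters=500):
    """Entropy-regularized balanced OT between uniform marginals.

    Returns the transport plan Gamma minimizing <Gamma, C> plus an
    entropy penalty of strength `reg`.
    """
    n, m = C.shape
    a = np.full(n, 1.0 / n)          # uniform source marginal
    b = np.full(m, 1.0 / m)          # uniform target marginal
    K = np.exp(-C / reg)             # Gibbs kernel
    u, v = np.ones(n), np.ones(m)
    for _ in range(n_iters):         # alternating marginal projections
        u = a / (K @ v)
        v = b / (K.T @ u)
    return u[:, None] * K * v[None, :]

# Toy 2x3 cost matrix: cheap matches on the near-diagonal.
C = np.array([[0.0, 1.0, 4.0],
              [1.0, 0.0, 1.0]])
gamma = sinkhorn(C)
print(gamma.round(3))
```

Mass concentrates on low-cost pairs while the marginals stay (approximately) uniform; OTPO relaxes those marginal constraints with KL penalties instead.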

Step 3: Computing the Token Weights

From the optimal transport solution, we obtain the token weights by summing over the transport plan:

(Image from the OTPO paper)

Here, γ(i, j) is the optimal transport weight for the token pair (i, j) drawn from the chosen (w) and rejected (l) responses. Finally, these weights are plugged into the DPO loss in place of the uniform weighting: the reward difference becomes a token-weighted sum.

(Image from the OTPO paper)
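To make the final step concrete, here is a hedged sketch of how transport-derived weights could replace DPO's uniform sum. The function and the toy numbers are mine, not the paper's; the key point is that each side's token weights come from summing the plan over the other side's tokens:

```python
import numpy as np

def weighted_reward_margin(gamma, log_ratio_w, log_ratio_l, beta=0.1):
    """Token-weighted DPO reward margin (illustrative sketch).

    gamma       : transport plan over (chosen, rejected) token pairs
    log_ratio_w : per-token log(pi_theta / pi_ref) for the chosen response
    log_ratio_l : same for the rejected response
    """
    w_chosen = gamma.sum(axis=1)     # weight of chosen token i
    w_rejected = gamma.sum(axis=0)   # weight of rejected token j
    return beta * (w_chosen @ log_ratio_w - w_rejected @ log_ratio_l)

# Toy 2x2 plan concentrating mass on the decisive ("Paris", "Italy") pair.
gamma = np.array([[0.05, 0.05],
                  [0.05, 0.85]])
margin = weighted_reward_margin(
    gamma,
    log_ratio_w=np.array([0.1, 0.5]),    # filler token, "Paris"
    log_ratio_l=np.array([0.1, -0.4]),   # filler token, "Italy"
)
print(round(float(margin), 4))  # 0.081
```

With uniform weights both tokens would count equally; here the decisive pair dominates the margin that enters the loss.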

Experimental Results and Limitations

OTPO was evaluated on a variety of tasks, but in controlled settings. When applied to summarization, it demonstrated an 8.5% improvement over alternative methods. When tested in length-bias experiments on the UltraFeedback dataset with small models such as Llama-3-8B, OTPO produced shorter answers. These initial experiments provide evidence that OTPO helps reduce verbosity and improves answer quality in ways people are likely to prefer.

The evaluation was not extensive enough to establish accuracy numbers across domains, and results were mixed on different datasets. OTPO also requires the expensive computation of the cost matrix and the transport plan. In addition, an LLM-as-a-judge was used to score response quality, supplemented by only a small amount of human review. These evaluation methods are useful, but they depend heavily on potentially biased evaluators on particular datasets.

Conclusion

LLM alignment remains a major research topic, and OTPO shows promising results in controlled settings. While the method is not perfect, the introduction of optimal transport-based token weighting lays a foundation for further advances in alignment.

References:

  1. Direct Preference Optimization: Your Language Model is Secretly a Reward Model (DPO).
  2. Optimal Transport-Based Token Weighting scheme for Enhanced Preference Optimization (OTPO).
  3. Eliminating Biased Length Reliance of Direct Preference Optimization via Down-Sampled KL Divergence (SamPO).
