Finding Golden Examples: A Smarter Approach to In-Context Learning

Introduction
When working with large language models (LLMs), in-context learning (ICL), where examples of the task are provided to the LLM to learn from before it handles the actual query, has proven to be one of the most effective ways to steer an LLM in the right direction. Many ICL strategies have been developed: one-shot (provide one example), few-shot (provide several examples), chain-of-thought (reasoning step by step), etc. Let's take a simple example to better understand in-context learning.
When you ask an LLM, "Which animal makes the sound 'moo' and what is its type?", you might expect an answer like "Cow, mammal." But LLMs tend to offer much more information. For example, when asked that exact question, ChatGPT gave the following answer:
User: What animal makes the sound 'moo' and what is its type?
Assistant: The animal that makes the sound "moo" is a cow.
Animal: Cow
Type: Mammal (specifically, a domesticated ungulate belonging to the species Bos taurus)
Other examples of non-mammals include birds (like eagles 🦅), reptiles (like snakes 🐍), fish (like salmon 🐟), and insects (like butterflies 🦋).
Link: https://chatgpt.com/share/688636F-7B48-8000-5477-54405-5440533
Now, how do we teach the LLM to produce results in the format we expect (animal name, type)? Fine-tuning the LLM on the desired format would work, but it requires significant resources. Alternatively, at inference time, we can provide examples before asking the question to teach the LLM our format. That is exactly what in-context learning is: the LLM picks up the pattern from the provided examples and applies it to the task at hand. Here's the same interaction with ChatGPT, this time with in-context examples:
User: What animal makes the 'woof' sound and what is its type?
Assistant: Dog, mammal
User: What animal makes the 'meow' sound and what is its type?
Assistant: Cat, mammal
User: What animal makes the sound 'moo' and what is its type?
This time, the LLM gave the answer in exactly the expected format: Cow, mammal.
Link: https://chatgpt.com/share/688664f0-96f-8-8-8-8-940b24D2773
As we can see, LLMs adapt to the examples given in context (ICL) to produce the desired output. Studies have shown that ICL improves both the efficiency and accuracy of LLMs. But ICL is brittle: it is very sensitive to which examples you choose, their order, and even small formatting changes. ICL behaves more like pattern matching than real learning, so its quality depends heavily on the examples it is given, especially for complex tasks such as code generation or text-to-SQL. The key challenge of ICL is therefore: how do you pick examples that actually help, not just any examples?
In this post, we will look at the research paper AuPair: Golden Example Pairs for Code Repair, published by Google DeepMind, which tackles exactly this problem. AuPair focuses on selecting example pairs for code repair (fixing buggy code). This post aims to explain the core ideas behind their work and build a foundation for understanding how to produce effective ICL examples systematically.
Choosing effective examples
We now understand the first ICL challenge: finding the right set of examples. Before we look at how the AuPair approach tackles this problem, let's look at the traditional way of selecting examples. Usually, for domain-specific problems (such as code generation/repair or text-to-SQL), we hand-pick a few related examples from the dataset and reuse them in every ICL call. An extension of this is to build a pool of examples and use similarity search at runtime to pull the most relevant ones into the context.
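A minimal sketch of that retrieval-based selection is shown below. The bag-of-words "embedding" is a toy stand-in (a real system would use a neural embedding model), and the pool contents and function names are invented for illustration:

```python
from collections import Counter
import math

def embed(text: str) -> Counter:
    # Toy bag-of-words "embedding"; real systems use neural embeddings.
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def retrieve_examples(query: str, pool: list, k: int = 2) -> list:
    """Pull the k pool examples most similar to the query for use as ICL context."""
    scored = sorted(pool,
                    key=lambda ex: cosine(embed(query), embed(ex["problem"])),
                    reverse=True)
    return scored[:k]

# Hypothetical example pool for a code-repair assistant.
pool = [
    {"problem": "fix off-by-one error in loop", "solution": "..."},
    {"problem": "fix null pointer dereference", "solution": "..."},
    {"problem": "fix sql injection in query builder", "solution": "..."},
]
top = retrieve_examples("loop iterates one time too many", pool, k=1)
```

The retrieved examples would then be prepended to the prompt, exactly like the animal-sound examples earlier.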
The problem with this traditional process is that we have no way to measure which examples are actually the most effective at steering the LLM in the right direction. Now let's look at the AuPair method and how it addresses this. Instead of choosing examples at random, AuPair first builds a large dataset of candidate (guess, fix) pairs and then selects the best pairs from it. Let's look at each step.
Section 1: Candidate pair generation
The first step is to create a large collection of candidate pairs. AuPair starts with a dataset of coding problems that have unit tests. For each problem, an LLM is asked to produce an initial solution (a guess). If this guess is partially correct (score strictly between 0 and 1), it is added to the training dataset.
The repair process then takes this broken code and asks the LLM to fix it, using a few-shot prompt in which previously collected pairs are randomly sampled as context. If the generated fix scores better than the original guess, (guess → fix) becomes a candidate pair. The clever part: if the fix is still incomplete, it is added back to the training data as a new "broken" program, so it can be improved further in subsequent iterations. This creates chains of incremental improvements. AuPair repeats this process many times to build a large pool of candidate pairs covering different types of bugs and their fixes.
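The generation loop above can be sketched roughly as follows. This is my reading of the paper, not its actual implementation: `llm_generate` (returns code for a prompt) and `score` (fraction of unit tests passed, 0.0 to 1.0) are hypothetical helpers, and the prompt format is invented:

```python
import random

def generate_candidate_pairs(problems, llm_generate, score, n_rounds=3):
    """Sketch of AuPair's candidate-pair generation phase."""
    broken = []   # pool of partially-correct guesses: (problem, code, score)
    pairs = []    # collected (guess, fix) candidate pairs
    for prob in problems:
        guess = llm_generate(f"Solve:\n{prob['statement']}")
        s = score(guess, prob["tests"])
        if 0.0 < s < 1.0:                      # keep only partial solutions
            broken.append((prob, guess, s))
    for _ in range(n_rounds):
        new_broken = []
        for prob, guess, s in broken:
            # Few-shot prompt built from randomly sampled existing pairs.
            shots = random.sample(pairs, min(2, len(pairs)))
            ctx = "".join(f"Broken:\n{g}\nFixed:\n{f}\n" for g, f in shots)
            fix = llm_generate(f"{ctx}Broken:\n{guess}\nFixed:\n")
            s_fix = score(fix, prob["tests"])
            if s_fix > s:                      # the fix improved the guess
                pairs.append((guess, fix))
                if s_fix < 1.0:                # still imperfect: chain further
                    new_broken.append((prob, fix, s_fix))
        broken = new_broken
    return pairs
```

Note how an imperfect fix re-enters `broken`, which is what produces the chains of incremental improvements.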
Section 2: Golden (Au) pair extraction
Once we have the candidate pair dataset, we need to select the most effective pairs. This happens in two steps. First, we measure how much each candidate pair helps when used as an in-context example for fixing code; second, we greedily select the best pairs.
Let's first look at how each pair's effectiveness is measured.

To rate effectiveness, we start by creating validation data, essentially a collection of broken-code problems. Then, for each problem in the validation data, we take each candidate pair and use it as a 1-shot example alongside the validation problem. The fix the LLM produces is evaluated against the unit test cases, and the resulting score is recorded for that pair-problem combination.
From these scores we build a quality matrix M, where M[i,j] represents how much pair i helps solve problem j. This gives us a complete picture of which pairs help across which problems.
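Building the quality matrix might look like the sketch below, under the same assumptions as before (`llm_generate` and `score` are hypothetical helpers, and the prompt format is invented):

```python
def build_quality_matrix(pairs, validation, llm_generate, score):
    """M[i][j]: test score of the fix produced for validation problem j
    when candidate pair i is supplied as the single in-context example."""
    M = [[0.0] * len(validation) for _ in pairs]
    for i, (guess, fix) in enumerate(pairs):
        shot = f"Broken:\n{guess}\nFixed:\n{fix}\n"   # the 1-shot example
        for j, prob in enumerate(validation):
            attempt = llm_generate(f"{shot}Broken:\n{prob['code']}\nFixed:\n")
            M[i][j] = score(attempt, prob["tests"])
    return M
```

This is the expensive part: it costs one LLM call per (pair, validation problem) combination.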

The next step is to extract the AuPairs using the measured effectiveness. The algorithm selects the pair with the highest total score across all validation problems and adds it to the AuPair list. The crucial follow-up step is to subtract that pair's contribution from all the remaining rows of the matrix. This ensures we do not select redundant pairs but instead keep complementary ones: each new AuPair should solve different problems than the previously selected pairs. The process continues until the marginal improvement falls below a threshold, yielding an ordered list of golden pairs where each one teaches the model something different.
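A simple sketch of this greedy extraction is shown below; the exact contribution update and stopping rule are my simplified reading of the paper, and the `threshold` parameter is an assumption:

```python
def extract_aupairs(M, pairs, threshold=0.05):
    """Greedily pick the pair with the largest remaining total contribution,
    then subtract its per-problem contribution from every other row so the
    next pick must help on *different* problems."""
    remaining = [row[:] for row in M]          # work on a copy of the matrix
    chosen = []
    available = set(range(len(pairs)))
    while available:
        best = max(available, key=lambda i: sum(remaining[i]))
        if sum(remaining[best]) < threshold * len(remaining[best]):
            break                              # marginal improvement too small
        chosen.append(pairs[best])
        available.remove(best)
        for i in available:
            remaining[i] = [max(0.0, remaining[i][j] - remaining[best][j])
                            for j in range(len(remaining[i]))]
    return chosen
```

After the subtraction step, a pair that helps on the same problems as an already-chosen pair has its row driven toward zero, so it will not be picked again.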

Test results
AuPair was evaluated on 7 different coding datasets across 5 different LLM models. It outperformed self-reflection and best-of-N sampling baselines at solving problems. The results also show that AuPair achieved 2-3x better compute efficiency: only 12 AuPairs are needed to reach performance that random pairs require far more examples to match. The results further indicate that AuPairs generated on one dataset remain effective on completely different datasets such as HackerEarth and AtCoder. This shows that once you have built a good set of golden pairs, they generalize well to new problems in the same domain.
Limitations
AuPair shows promising results, but it has a few limitations. First, it requires a large compute budget for the LLM calls needed to generate candidate pairs and measure their effectiveness. Second, it depends heavily on evaluation metrics (such as unit tests) to measure improvement, which may not be available in all domains, and it assumes that complementary examples lead to better performance; while this holds for code, it may not be true everywhere. Finally, AuPair was evaluated on competitive programming problems rather than real-world codebases.
Conclusion
AuPair demonstrates a smarter way to do in-context learning for code repair. Instead of choosing examples at random, it uses a systematic method to find the most effective repair patterns, the ones that actually help the LLM perform better. While it requires a significant upfront compute cost and works best when good evaluation metrics exist, the results suggest the investment pays off, especially since golden pairs stay effective across different datasets. This work opens up opportunities to apply similar techniques to example selection in other areas (e.g. text-to-SQL) where we can generate candidates and measure their effectiveness.



