Should we use LLMs as if they were Swiss Army knives?

In recent years it has been impossible to avoid the hype around AI, especially Generative AI and Agentic AI. As a data scientist working at a consulting company, I have seen many questions about how we can use this new technology to make processes more effective or automated. While this interest can benefit data scientists, it sometimes looks like people expect magic from AI models, as if they could solve every problem effortlessly. On the other hand, while I believe that Generative AI and Agentic AI have changed (and will continue to change) how we work and how we live, they are not the right tool for every problem.
As I am a nerd and I understand how LLMs work, I wanted to test their performance in a logic game: the Spanish version of Wordle (more details can be found here). Specifically, I had the following questions:
- Would my algorithm perform better than the LLM models?
- Does the reasoning capability of LLMs affect their performance?
Creating an LLM-based solution
To find a solution using an LLM model, I built three main prompts. The first one was used to obtain the first guess:
Suppose I am playing Wordle, but in Spanish. It's a game where you have to guess a 5-letter word, and only 5 letters, in 6 attempts. Also, letters can be repeated in the hidden word.
First, let's review the game's rules: every day the game chooses a five-letter word that players try to guess within six attempts. After the player enters the word they think it is, each letter is marked in green, yellow, or gray: green means the letter is correct and in the right position; yellow means the letter is in the hidden word but not in the right position; gray means the letter is not in the hidden word.
But if you place a letter twice and one is green and the other yellow, it means the letter appears twice: once in the green position, and once more in another position.
Example: if the hidden word is "PIZZA" and your first attempt is "PANEL", the answer will look like this: the "P" would be green, the "A" yellow, and the "N", "E", and "L" gray.
Given all this, give me a good starting word: one you think will provide helpful information to guess the hidden word.
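As an aside, the coloring rules in the prompt above (including the duplicate-letter cases) can be made precise with the standard two-pass scoring scheme. This is an illustrative Python sketch of the rules as described to the model, not code from the original experiment:

```python
from collections import Counter

def score_guess(hidden: str, guess: str) -> list:
    """Return per-letter feedback ('green', 'yellow', or 'gray').

    Two passes: greens are fixed first, then the counts of the
    remaining hidden letters decide which duplicates earn a yellow.
    """
    feedback = ["gray"] * len(guess)
    remaining = Counter()
    for i, (h, g) in enumerate(zip(hidden, guess)):
        if h == g:
            feedback[i] = "green"   # right letter, right position
        else:
            remaining[h] += 1       # unmatched hidden letters fund the yellows
    for i, g in enumerate(guess):
        if feedback[i] == "gray" and remaining[g] > 0:
            feedback[i] = "yellow"
            remaining[g] -= 1
    return feedback

# The prompt's own example: hidden "PIZZA", guess "PANEL".
print(score_guess("PIZZA", "PANEL"))
# → ['green', 'yellow', 'gray', 'gray', 'gray']
```

Note how the second pass naturally handles the duplicate rule: a repeated letter in the guess only turns yellow while unmatched copies of it remain in the hidden word; any extra copies stay gray.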
After that, a second prompt was used to explain all the rules of the game (the prompt is not fully shown here for space reasons, but the complete version also included example games and reasoning):
Now the idea is that we update the game strategy. I will be giving you the results of the game. The idea is, given these results, to suggest a new 5-letter word. Remember, too, that there are only 6 attempts. I will give you the result in the following format:
Letter -> Color. For example, if the hidden word is PIZZA and the attempt is PANEL, I will give you the result in this format:
P -> green (first letter of the hidden word)
A -> yellow (in the word, but not in the second position; instead, it is the last one)
N -> gray (not in the word)
E -> gray (not in the word)
L -> gray (not in the word)
Let's remember the rules. If the letter is green, it means it is in the position where you placed it. If it is yellow, it means the letter is in the word, but not in that position. If gray, it means it is not in the word.
If you place a letter twice and one shows gray and the other green, it means the letter appears only once in the word. But if you place a letter twice and one shows green and the other yellow, it means the letter appears twice: once in the green position, and another time in a different position (not the yellow one).
All the details I give you should be used to create your suggestion. At the end of the day, we want to "turn" all the letters green, because that means we guessed the word.
Your final response must contain only the word suggestion, not your reasoning.
The final prompt was used to obtain a new suggestion after sharing the result of each attempt:
Here is the result. Keep in mind that the word must have 5 letters, that you should use the rules and all the game information, and that the objective is to "turn" all the letters green in no more than 6 attempts. Take your time to think about your answer; I don't need a quick response. Don't give me your reasoning, only your final guess.
Something important here: I never tried to correct the LLMs or point out errors in their logic. I wanted a purely LLM-based result, and I didn't want to bias the solution in any way.
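My algorithm itself is not shown in this post, but the core move any rule-following solver makes is the one the prompts describe: discard every candidate word that is inconsistent with the feedback received so far. Here is a hypothetical sketch of that filtering step (the word list and function names are mine, for illustration only):

```python
from collections import Counter

def score_guess(hidden: str, guess: str) -> list:
    """Color a guess against a hidden word: green/yellow/gray per letter."""
    colors = ["gray"] * len(guess)
    remaining = Counter()
    for i, (h, g) in enumerate(zip(hidden, guess)):
        if h == g:
            colors[i] = "green"
        else:
            remaining[h] += 1
    for i, g in enumerate(guess):
        if colors[i] == "gray" and remaining[g] > 0:
            colors[i] = "yellow"
            remaining[g] -= 1
    return colors

def filter_candidates(candidates, guess, observed):
    """Keep only words that would have produced exactly the observed colors."""
    return [w for w in candidates if score_guess(w, guess) == observed]

# Toy word list (illustrative; a real solver would use a full dictionary):
words = ["PIZZA", "PANEL", "PLAZA", "PINTA"]
feedback = ["green", "yellow", "gray", "gray", "gray"]
print(filter_candidates(words, "PANEL", feedback))  # → ['PIZZA']
```

Checking full-feedback consistency in one shot, rather than tracking greens, yellows, and grays as separate constraints, handles the tricky duplicate-letter cases for free.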
First evaluation
Recall my first question: would my algorithm perform better than the LLMs? I thought the AI solutions would do a great job, but after a few days I started seeing answers like the following:
The answer was almost obvious: it only had to change two letters. However, ChatGPT responded with the same guess as before.
After seeing these types of mistakes, I began asking questions at the end of the games, and the LLMs basically admitted their mistakes, but did not give a clear explanation for their replies:

While these are just two examples, this type of behavior was common when running a pure LLM solution, showing clear limitations in the reasoning of base models.
Analysis of the results
In total, I considered 30 days of games. During the first 15 days, I compared my algorithm against 3 LLM models:
- ChatGPT's 4o/5 model (after OpenAI released the GPT-5 model, I could no longer switch between models in the free ChatGPT version)
- Gemini's 2.5 Flash model
- Meta's Llama 4 model
Here, I compared two main metrics: win percentage and a points score (each green letter in the final guess was worth 3 points, and yellow letters 0 points):
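To make the metric concrete, here is a small sketch of how the daily points and win percentage can be computed from final-guess feedback (the game data below is made up for illustration, not the real experiment results):

```python
def final_guess_points(feedback):
    """3 points per green letter in the final guess; yellows and grays score 0."""
    return 3 * sum(c == "green" for c in feedback)

def summarize(games):
    """games: one final-guess feedback tuple per day; a win is all-green."""
    wins = sum(all(c == "green" for c in g) for g in games)
    points = [final_guess_points(g) for g in games]
    return {"win_pct": round(100 * wins / len(games), 1),
            "avg_points": round(sum(points) / len(points), 1)}

# Made-up results for three days:
games = [("green",) * 5,                                 # a win: 15 points
         ("green", "green", "yellow", "gray", "green"),  # 9 points
         ("green", "yellow", "gray", "gray", "gray")]    # 3 points
print(summarize(games))  # → {'win_pct': 33.3, 'avg_points': 9.0}
```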

As can be seen, my algorithm (while specialized for this use case, it took me only a day or two to build) won the daily comparison. Among the LLM models, Gemini delivered the worst performance, while ChatGPT and Meta's Llama produced similar numbers. However, as can be seen in the image on the right, there is great variation within each model's results.
However, these results would not be complete without comparing the reasoning LLM models against my algorithm (and the base LLM models). Therefore, for the next 15 days I compared the following models:
- ChatGPT's 4o/5 model using the reasoning capability
- Gemini's 2.5 Flash model (same model as before)
- Meta's Llama 4 model (same model as before)
Two other important notes here: first, I planned to use Grok as well, but after Grok 4 was released, the reasoning capability of Grok 3 disappeared, making comparisons difficult. Second, I tried to use Gemini's 2.5 Pro, but unlike ChatGPT's reasoning, its free-tier use is not flexible: the model allowed only 5 prompts a day, which was not enough to finish a full game. With this in mind, here are the results of the following 15 days:

The reasoning capability clearly boosted the LLMs in this task, which requires understanding which letters can be used in each position: the reasoning model won almost all the games. Not only are the results remarkable, but the losses were close as well, since in the two lost games only one letter was missed. Despite this improvement, the algorithm I built still performed better, but as I said before, it was built for this specific task. Something interesting: in these 15 games, the base LLM models (Gemini 2.5 Flash and Llama 4) did not win once, which makes one wonder whether their earlier wins were luck or not.
Final thoughts
The purpose of this work was to test the performance of LLMs against a specialized algorithm in a task that requires following strict rules. We have seen that the base models do not perform properly, but that adding reasoning capability to the LLM solution provides an important improvement, producing performance similar to that of the specialized algorithm. One important thing to consider is that while the improvement is real, it comes at a cost: real applications and production systems must take response time and price into account. In this case, according to Azure OpenAI's August 2025 pricing, the general-purpose GPT-4o-mini model costs around $0.15 per million input tokens, while the o4-mini reasoning model is several times more expensive. While I firmly believe that Generative AI will continue to gain adoption, we should not treat LLMs as Swiss Army knives.


