A recent study by Apple suggests that the Large Language Models (LLMs) from OpenAI, Google, and Meta are not as capable at reasoning as they appear.
LLMs from OpenAI, Google, Meta, and others are widely praised for their impressive reasoning skills across multiple domains. Recently, OpenAI even launched its o1 model, codenamed Strawberry.
Still, the new research suggests that their claimed intelligence may be closer to "sophisticated pattern matching" than "true logical reasoning". That verdict applies even to OpenAI's o1 advanced reasoning model.
To probe these flaws, the researchers started from GSM8K, the most common benchmark for mathematical reasoning. Precisely because it is so widely used, GSM8K carries a high risk of data contamination: the models may have effectively seen the test answers during training, so high scores could reflect memorization rather than inherent intelligence.
To get around this, the study introduced a new benchmark called GSM-Symbolic, which keeps the core of each reasoning problem while changing variables such as names and numbers, adjusting the complexity, and adding irrelevant information. The results were striking, exposing real fragility in LLM performance. The test covered more than 20 models, including OpenAI's o1 and GPT-4o, Google's Gemma 2, and Meta's Llama 3.
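To make the idea concrete, here is a minimal sketch (not the study's actual code) of how a GSM-Symbolic-style variant could be generated: the problem is stored as a template, and the names, numbers, and an optional irrelevant clause are swapped in at test time while the underlying logic stays fixed. All function names, variable names, and values below are illustrative assumptions.

```python
import random

# Hypothetical sketch of a GSM-Symbolic-style perturbation: the reasoning
# structure stays the same while surface details (names, numbers) change
# and an irrelevant clause may be appended. Values are illustrative only.
TEMPLATE = (
    "{name} picks {fri} kiwis on Friday and {sat} kiwis on Saturday. "
    "On Sunday, {name} picks double the number of kiwis picked on Friday."
    "{noise} How many kiwis does {name} have in total?"
)

NAMES = ["Oliver", "Sophie", "Liam", "Aisha"]
NOISE_CLAUSES = [
    "",  # clean variant
    " Five of them were a bit smaller than average.",  # irrelevant detail
]

def make_variant(seed: int) -> tuple[str, int]:
    """Return a perturbed problem and its (unchanged) correct answer."""
    rng = random.Random(seed)
    fri = rng.randint(20, 60)
    sat = rng.randint(20, 60)
    question = TEMPLATE.format(
        name=rng.choice(NAMES),
        fri=fri,
        sat=sat,
        noise=rng.choice(NOISE_CLAUSES),
    )
    answer = fri + sat + 2 * fri  # the correct answer never depends on the noise
    return question, answer

if __name__ == "__main__":
    for s in range(3):
        q, a = make_variant(s)
        print(q, "->", a)
```

A model that truly reasons should score the same on every variant; a model that pattern-matches against memorized GSM8K questions will stumble when the surface details change.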
Across every model tested, accuracy dropped by a few percentage points merely because the names and numbers changed. The researchers noted that OpenAI's models performed better than the open-source ones, but even their variance was "non-negligible"; in principle, swapping a name should cause no variance at all. The more interesting shift came when the researchers added "seemingly relevant but ultimately inconsequential statements" to the mix.
That addition tested whether LLMs rely on pattern matching rather than actual reasoning: the researchers appended superfluous phrases to math problems to see how the models would react. A simple example: "Oliver picks 44 kiwis on Friday, and then picks 58 kiwis on Saturday. On Sunday, he picks double the number of kiwis he did on Friday, but five of them were a bit smaller than average. How many kiwis does Oliver have?"
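For reference, the size remark changes nothing about the arithmetic: Oliver has 44 + 58 + (2 × 44) = 44 + 58 + 88 = 190 kiwis, because the five smaller kiwis still count.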
The results showed a significant drop in performance. OpenAI's o1-preview fared the best, with accuracy falling 17.5 percent. That is bad, but nowhere near as bad as Microsoft's Phi 3 model, which dropped by 65 percent. In both cases, the models tended to subtract the five smaller kiwis, failing to see that the detail was irrelevant to the problem.
This suggests that the models convert statements into operations without truly understanding their meaning: they subtract the kiwis because a number is mentioned, not because the size matters. In other words, LLMs look for patterns in reasoning problems rather than genuinely understanding the underlying concepts.
The takeaway is clear: testing models on benchmarks that contain irrelevant information exposes a significant gap in LLMs' capacity to truly grasp mathematical concepts and to identify the information relevant to solving a problem. It is worth noting that the authors are affiliated with Apple, a notable competitor of Google, Meta, and OpenAI, even though Apple collaborates with OpenAI and is developing its own AI models.
Nonetheless, the evident deficiencies in LLMs' formal reasoning abilities cannot be overlooked. The study is a valuable reminder to approach AI advancement with a balanced sense of skepticism, and to look closely at how models are evaluated before taking their benchmark results at face value.