Taelin
@VictorTaelin

I *WAS* WRONG - $10K CLAIMED!

## The Claim

Two days ago, I confidently claimed that "GPTs will NEVER solve the A::B problem". I believed that:

1. GPTs can't truly learn new problems, outside of their training set,
2. GPTs can't perform long-term reasoning, no matter how simple it is.

I argued both of these are necessary to invent new science; after all, some math problems take years to solve. If you can't beat a 15yo in any given intellectual task, you're not going to prove the Riemann Hypothesis. To isolate these issues and make my point, I designed the A::B problem, and posted it here - full definition in the quoted tweet (a code sketch of the rewrite rules follows this post).

## Reception, Clarification and Challenge

Shortly after posting it, some users provided a solution to a specific 7-token example I listed. I quickly pointed out that this wasn't what I meant; this example was merely illustrative, and answering one instance isn't the same as solving a problem (and can be easily cheated by prompt manipulation).

So, to make my statement clear, and to put my money where my mouth is, I offered a $10k prize to whoever could design a prompt that solved the A::B problem for *random* 12-token instances, with a 90%+ success rate. That's still an easy task, one that takes an average of 6 swaps to solve; literally simpler than 3rd grade arithmetic. Yet, I firmly believed no GPT would be able to learn and solve it on-prompt, even for these small instances.

## Solutions and Winner

Hours later, many solutions were submitted. Initially, all failed, barely reaching 10% success rates. I was getting fairly confident, until, later that day, @Peter Schmidt-Nielsen and @Sydney submitted a solution that humbled me. Under their prompt, Claude-3 Opus was able to generalize from a few examples to arbitrary random instances, AND stick to the rules, carrying long computations with almost zero errors. On my run, it achieved a 56% success rate.

Through the day, users @dontoverfit (Opus), @Hubert Yuan (GPT-4), @Jeremy Kritz (Opus), @Parth Thakkar (Opus) and @Peter Schmidt-Nielsen (Opus) reached similar success rates, and @reissbaker made a pretty successful GPT-3.5 fine-tune. But it was only late that night that @Bob posted a tweet claiming to have achieved a near-100% success rate, by prompting alone. And he was right. On my first run, it scored 47/50, granting him the prize, and completing the challenge.

## How it works!?

The secret to his prompt is... going to remain a secret! That's because he kindly agreed to give 25% of the prize to the most efficient solution. This prompt costs $1+ per inference, so, if you think you can improve on that, you have until next Wednesday to submit your solution in the link below, and compete for the remaining $2.5k! Thanks, Bob.

## How do I stand?

Corrected! My initial claim was absolutely WRONG - for which I apologize. I doubted the GPT architecture would be able to solve certain problems which it, with no margin for doubt, solved. Does that prove GPTs will cure Cancer? No. But it does prove me wrong!

Note there is still a small problem with this: it isn't clear whether Opus is based on the original GPT architecture or not. All GPT-4 versions failed. If Opus turns out to be a new architecture... well, this whole thing would have, ironically, just proven my whole point 😅 But, for the sake of the competition, and in all fairness, Opus WAS listed as an option, so, the prize is warranted.

## Who am I and what am I trying to sell?

Wrong! I won't turn this into an ad. But, yes, if you're new here, I AM building some stuff, and, yes, just like today, I constantly validate my claims to make sure I can deliver on my promises. But that's all I'm gonna say, so, if you're curious, you'll have to find out for yourself (:

#### That's all.

Thanks to all who participated, and, again - sorry for being a wrong guy on the internet today! See you.

Gist: gist.github.com/VictorTaelin/8
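For reference, here is a minimal Python sketch of the A::B rewrite system as it is defined in the public Gist. The exact rules and the reduction loop below are my reconstruction of that definition, not the official evaluator, so treat them as an assumption.

```python
# Minimal sketch of the A::B rewrite system, reconstructed from the public
# Gist (gist.github.com/VictorTaelin/8...). These rules are my reading of
# that definition, not the official evaluator.

# Two neighboring tokens whose '#' marks face each other get rewritten:
RULES = {
    ("A#", "#A"): [],            # A# #A ... becomes ... nothing
    ("A#", "#B"): ["#B", "A#"],  # A# #B ... becomes ... #B A#
    ("B#", "#A"): ["#A", "B#"],  # B# #A ... becomes ... #A B#
    ("B#", "#B"): [],            # B# #B ... becomes ... nothing
}

def solve(tokens: list[str]) -> list[str]:
    """Apply rewrites until no adjacent pair matches; return the normal form."""
    tokens = list(tokens)
    changed = True
    while changed:
        changed = False
        for i in range(len(tokens) - 1):
            pair = (tokens[i], tokens[i + 1])
            if pair in RULES:
                tokens[i:i + 2] = RULES[pair]
                changed = True
                break
    return tokens

# Small example: "B# A# #B #A B#" reduces to "B#" in three rewrites.
print(solve("B# A# #B #A B#".split()))  # -> ['B#']
```

On a random 12-token instance this loop terminates after only a handful of rewrites, which is the "average of 6 swaps" figure mentioned above.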

Taelin
@VictorTaelin

A::B Prompting Challenge: $10k to prove me wrong!

# CHALLENGE

Develop an AI prompt that solves random 12-token instances of the A::B problem (defined in the quoted tweet), with a 90%+ success rate.

# RULES

1. The AI will be given a random instance, inside a <problem/> tag.
2. The AI must end its answer with the correct <solution/>.
3. The AI can use up to 32K tokens to work on the problem.
4. You can choose any public model.
5. Any prompting technique is allowed.
6. Keep it fun! No toxicity, spam or harassment.

# EVALUATION

You must submit your system prompt as a reply to this tweet, in a Gist. I'll test each submission on 50 random 12-token instances of the A::B system. The first to get 45 correct solutions wins the prize, plus the invaluable public recognition of proving me wrong 😅 If nobody solves it, I'll repost the top 3 submissions, so we all learn some new prompting techniques :)

# DETAILS ON GIST

gist.github.com/VictorTaelin/8
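To make rules 1 and 2 concrete, here is a hypothetical sketch of how an instance might be generated and wrapped in the <problem/> tag. The sampling scheme (12 tokens drawn uniformly from the four token types) is an assumption on my part; the official generator is described in the Gist.

```python
import random

# The four A::B tokens; see the solver sketch above.
TOKENS = ["A#", "#A", "B#", "#B"]

def random_instance(n: int = 12) -> list[str]:
    """Hypothetical generator: n tokens drawn uniformly at random.
    The official sampling scheme lives in the Gist; uniform draws
    are an assumption here."""
    return [random.choice(TOKENS) for _ in range(n)]

def make_prompt(system_prompt: str, tokens: list[str]) -> str:
    """Rule 1: the instance is given inside a <problem/> tag.
    Rule 2: the model must end its answer with a <solution/> tag."""
    return f"{system_prompt}\n\n<problem>{' '.join(tokens)}</problem>"

# Usage:
# make_prompt("...your prompt...", random_instance())
# -> "...your prompt...\n\n<problem>A# #B B# ...</problem>"
```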
(The winning prompt will be published Wednesday, as well as the source code for the evaluator itself. Its hash is on the Gist.)
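Until that source is published, the following is a rough sketch of the scoring loop implied by the rules above, reusing solve, random_instance and make_prompt from the earlier sketches. ask_model is a placeholder for whatever API call runs the chosen model within its 32K-token budget; it is not a real client, and the <solution>...</solution> parsing is an assumption about the answer format.

```python
import re
from typing import Callable

def extract_solution(answer: str) -> list[str] | None:
    """Take the last <solution>...</solution> block from the model's answer (rule 2)."""
    matches = re.findall(r"<solution>(.*?)</solution>", answer, re.DOTALL)
    return matches[-1].split() if matches else None

def evaluate(system_prompt: str,
             ask_model: Callable[[str], str],
             trials: int = 50,
             needed: int = 45) -> tuple[int, bool]:
    """Score a submission on `trials` random 12-token instances.
    45/50 correct (90%) claims the prize, per the evaluation rules above."""
    correct = 0
    for _ in range(trials):
        tokens = random_instance()
        answer = ask_model(make_prompt(system_prompt, tokens))
        if extract_solution(answer) == solve(tokens):
            correct += 1
    return correct, correct >= needed
```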
Taelin
@VictorTaelin
Founder of @HigherOrderComp. Building the massively parallel future of computing. Reaching AGI to cure all diseases and suffering is all that matters.