Large Language Models are Zero-Shot Reasoners
Pretrained large language models (LLMs) can become decent zero-shot reasoners.
This is AiSupremacy premium.
A paper came out recently that I feel may be a big deal. Let me know what you think in a comment below.
Simply adding “Let’s think step by step” before each answer increases the accuracy on MultiArith from 17.7% to 78.7% and GSM8K from 10.4% to 40.7% with GPT-3.
That is, recent findings with large language models like GPT-3 and InstructGPT have shown that specific prompts can elicit text generation that follows some basic reasoning.
Link to full paper (only 36 pages and an easy read): https://arxiv.org/abs/2205.11916
This is super salient to pre-human level artificial intelligence debates. A research team from the University of Tokyo and Google Brain addresses this deficiency in their new paper Large Language Models are Zero-Shot Reasoners, which demonstrates that LLMs can become decent zero-shot reasoners through the addition of a simple prompt — “Let’s think step by step” — that motivates a step-by-step thinking process before each question is answered.
Although these LLMs are not AGI, they are built with the now-ubiquitous Transformer architecture that future multi-modal proto-AGI might be based upon. There is a fair discussion on LessWrong about this as well.
Allie Miller summarized this in a recent LinkedIn post.
Can artificial intelligence “reason” like a human can?
What happens when we ask it to?
When an AI system’s answer was initialized with the phrase “let’s think step by step,” the model accuracy on reasoning benchmarks improved.
The system was given tasks we might ask an elementary school student, including basic math questions (like the example below), flipping coins, common sense, moving objects, and understanding dates in a calendar year.
With no additional training, the small addition of “let’s think step by step” improved accuracy on two math benchmarks by about 4x.
While this is not proof of real reasoning (and it would be dangerous to conclude that), it’s a reminder that being asked to “show your work” may be a good idea.
Synced covered this on May 30th, 2022.
Takeshi Kojima is from the University of Tokyo.
LLMs can Become Decent Zero-Shot Reasoners
Pretrained large language models (LLMs) are now scaled to more than 100B parameters and have revolutionized the field of natural language processing (NLP) with their excellent few-shot and zero-shot learning capabilities. However, although state-of-the-art LLMs make short work of system-1 tasks, they still struggle on system-2 tasks that require slow and multi-step reasoning.
A research team from the University of Tokyo and Google Brain addresses this deficiency in their new paper Large Language Models are Zero-Shot Reasoners, which demonstrates that LLMs can become decent zero-shot reasoners through the addition of a simple prompt — “Let’s think step by step” — that motivates a step-by-step thinking process before each question is answered. Their resulting Zero-shot-CoT (chain of thought prompting) model achieves huge performance gains compared to the zero-shot baseline.
Why Prompting Matters
The division of human thinking into fast/automatic (system-1) and slow/rational (system-2) processes was proposed in the 2011 bestseller Thinking, Fast and Slow by psychologist Daniel Kahneman and has been widely adopted by machine learning researchers seeking to endow their models with more advanced and humanlike reasoning capabilities.
The proposed Zero-shot-CoT is a zero-shot template-based prompting approach for chain-of-thought reasoning that, unlike conventional methods, does not require human engineering of prompt examples. Zero-shot-CoT uses an initial prompt for reasoning and a second prompt for answer extraction, enabling it to generate a plausible reasoning path in a zero-shot manner and obtain correct answers where standard zero-shot approaches often fail. It is also versatile and task-agnostic, making it applicable in areas ranging from arithmetic and symbolic tasks to common-sense reasoning.
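The two-stage scheme can be sketched in a few lines. This is a minimal illustration, not the authors’ code: `call_model` is a hypothetical stand-in for any LLM completion API, stubbed here with a canned chain of thought so the snippet runs without network access.

```python
def call_model(prompt: str) -> str:
    """Hypothetical LLM call, stubbed with canned completions for illustration."""
    if prompt.endswith("Let's think step by step."):
        return (" There are 16 balls in total. Half are golf balls, so 8."
                " Half of the golf balls are blue, so 4.")
    return " 4"

def zero_shot_cot(question: str) -> str:
    # Stage 1: reasoning extraction -- append the trigger phrase and let
    # the model generate a step-by-step reasoning path.
    reasoning_prompt = f"Q: {question}\nA: Let's think step by step."
    reasoning = call_model(reasoning_prompt)

    # Stage 2: answer extraction -- feed the reasoning back with a second
    # prompt that asks for the final answer in the desired format.
    answer_prompt = (f"{reasoning_prompt}{reasoning}\n"
                     "Therefore, the answer (arabic numerals) is")
    return call_model(answer_prompt).strip()

question = ("A juggler has 16 balls. Half are golf balls, and half of the "
            "golf balls are blue. How many blue golf balls are there?")
print(zero_shot_cot(question))  # -> 4
```

The key design point is that no hand-written examples are needed: the same two templates are reused across tasks, which is what makes the method zero-shot and task-agnostic.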
The evidence really speaks for itself; it’s quite fascinating stuff. It’s a dramatic difference. This is also pretty cutting edge and shows you how fast academics build on each other’s work in a dynamic way.
Chain of Thought Prompting
Multi-step arithmetic and logical reasoning benchmarks have particularly challenged the scaling laws of large language models [Rae et al., 2021]. Chain of thought (CoT) prompting [Wei et al., 2022], an instance of few-shot prompting, proposed a simple solution by modifying the answers in few-shot examples to step-by-step answers, and achieved significant boosts in performance across these difficult benchmarks, especially when combined with very large language models like PaLM [Chowdhery et al., 2022].
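For contrast with the zero-shot variant, here is a sketch of how a few-shot CoT prompt is assembled: the exemplar’s answer is hand-written as a step-by-step rationale rather than a bare number. The exemplar text below is illustrative, modeled on the style used in Wei et al.

```python
# A hand-crafted few-shot exemplar whose answer spells out the reasoning.
cot_exemplar = (
    "Q: Roger has 5 tennis balls. He buys 2 more cans of tennis balls. "
    "Each can has 3 tennis balls. How many tennis balls does he have now?\n"
    "A: Roger started with 5 balls. 2 cans of 3 tennis balls each is "
    "6 tennis balls. 5 + 6 = 11. The answer is 11.\n"
)

def build_cot_prompt(question: str) -> str:
    """Prepend the step-by-step exemplar to the target question."""
    return f"{cot_exemplar}\nQ: {question}\nA:"

prompt = build_cot_prompt(
    "If there are 3 cars and each car has 4 wheels, how many wheels are there?")
print(prompt)
```

The cost of this approach is human engineering: every task needs its own carefully written exemplars, which is exactly the requirement Zero-shot-CoT removes.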
In the evaluations, the proposed Zero-shot-CoT achieved astounding performance improvements compared to the zero-shot baseline — boosting accuracy from 17.7 percent to 78.7 percent on MultiArith and from 10.4 percent to 40.7 percent on GSM8K.
Overall, this work demonstrates the potential of LLMs as zero-shot reasoners, and the team hopes it will encourage further research aimed at fully realizing and exploiting the high-level and multi-task zero-shot capabilities inside such models.
Without chain of thought reasoning, performance stays mostly flat or increases only slowly as the model is scaled up. In contrast, performance increases drastically with chain of thought reasoning as the model size gets bigger.
In this experiment, “Let’s think step by step.” achieves the best results. Interestingly, different templates encourage the model to express its reasoning quite differently, and accuracy varies significantly depending on the sentence used.
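The ablation idea can be sketched as a simple loop over candidate trigger phrases. The templates below are examples of the kind compared in the paper; `evaluate` is a hypothetical benchmark-accuracy function, stubbed here so the snippet runs (the stub’s scores are not the paper’s numbers).

```python
# Candidate trigger phrases to compare as the first-stage prompt.
templates = [
    "Let's think step by step.",
    "First,",
    "Let's think about this logically.",
    "Don't think. Just feel.",
]

def evaluate(template: str) -> float:
    """Hypothetical accuracy of Zero-shot-CoT with this template on a
    benchmark; stubbed for illustration only."""
    return 1.0 if "step by step" in template else 0.5

# Pick the template with the highest (stubbed) benchmark accuracy.
best = max(templates, key=evaluate)
print(best)  # -> Let's think step by step.
```

In a real run, `evaluate` would execute the full two-stage prompting pipeline over a dataset such as MultiArith and score the extracted answers.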
There’s a lot to take in here, and this time it’s actually worth reading the paper if you have the time.
Keep in mind you can’t make serious AGI or even PHLAI claims here. The work is based on prompting methods for large language models. LLMs have been trained on large corpora from various sources on the web and have been shown to capture and amplify biases found in that training data. Prompting is a method that seeks to take advantage of the patterns a language model has captured that are conducive to various tasks, and it therefore shares the same shortcomings.
Thanks for reading!