Aknazar Janibek

Paper Review: Chain-of-Thought Prompting Elicits Reasoning in Large Language Models

10 Jan 2025

Overview

This is my review of Chain-of-Thought Prompting Elicits Reasoning in Large Language Models, a paper published by the Brain Team at Google Research. This post goes over the main task at hand, analyzes the results, and discusses potential flaws and areas for improvement. A link to the paper can be found here.

Task

The focus is on arithmetic, commonsense, and symbolic reasoning tasks. Specifically, when a model is given an input that requires multiple steps of reasoning, standard prompting often produces inaccurate results on these challenging tasks.

"Solution"

We have seen a range of benefits from increasing the size of language models, but unlocking reasoning ability is key to achieving high performance on tasks such as arithmetic, commonsense, and symbolic reasoning. The proposed method is motivated by two ideas: generating natural language rationales (chains of thought) that lead to a final answer, and few-shot learning via prompting.

What is few-shot learning? A machine learning technique where a model learns to make predictions from only a small number of labeled examples; in prompting, those examples are simply included in the input with no additional training.

The prompt consists of triples: <input, chain-of-thought, output>.

A chain of thought is a series of intermediate natural language reasoning steps that lead to the final output. A prompting-only approach is important because it does not require a large training dataset, and a single model checkpoint can perform many tasks without loss of generality. We can think of this prompting as similar to how humans break down a problem with multiple steps:

After Jane gives 2 flowers to her mom she has 10…then after she gives 3 to her dad she will have 7…so the answer is 7.
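To make the <input, chain-of-thought, output> format concrete, here is a minimal sketch of how a few-shot CoT prompt could be assembled. The exemplar text, the helper name build_cot_prompt, and the formatting are my own illustrative assumptions, not the paper's actual prompts.

```python
# Minimal sketch: assembling a few-shot chain-of-thought prompt from
# <input, chain-of-thought, output> triples. The exemplar below is
# illustrative, not one of the paper's actual prompts.

exemplars = [
    {
        "input": "Jane has 12 flowers. She gives 2 to her mom and 3 to her dad. "
                 "How many flowers does she have left?",
        "chain_of_thought": "Jane starts with 12 flowers. After giving 2 to her mom she has 10. "
                            "After giving 3 to her dad she has 7.",
        "output": "7",
    },
]

def build_cot_prompt(exemplars, question):
    """Concatenate the exemplar triples, then append the new question."""
    parts = [
        f"Q: {ex['input']}\nA: {ex['chain_of_thought']} The answer is {ex['output']}."
        for ex in exemplars
    ]
    parts.append(f"Q: {question}\nA:")
    return "\n\n".join(parts)

print(build_cot_prompt(exemplars, "A baker makes 20 rolls and sells 8. How many rolls remain?"))
```

The model is then expected to continue the final "A:" with its own chain of thought followed by the answer, mirroring the exemplars.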

Properties that make this method attractive: it lets models decompose multi-step problems into intermediate steps, it provides an interpretable window into the model's behavior, it is applicable in principle to any task humans can solve via language, and it can be elicited from sufficiently large off-the-shelf models simply by including chain-of-thought exemplars in the few-shot prompt.

Important results

The study finds that chain-of-thought prompting is an emergent ability: it only becomes apparent at larger model scales (roughly 100B parameters or more), and smaller models do not benefit. The gains are larger for more complex problems, as seen on the GSM8K dataset, while gains are negative or nonexistent on the easier SingleOp dataset.

Why does CoT prompting work?

Question to explore: Is the neural network actually “reasoning”?

Definition of Faithfulness

Faithfulness refers to the extent to which the reasoning generated by a model accurately represents the true process it uses to arrive at an answer. Faithful reasoning is critical in applications where trust and interpretability are necessary, such as medicine or legal decision-making. In this context, unfaithful reasoning includes cases where explanations are post-hoc (created after arriving at the answer) or where the reasoning provided is not genuinely used by the model to reach its conclusions.

Purpose

The paper investigates whether the step-by-step reasoning provided by large language models (LLMs) in chain-of-thought (CoT) prompting is faithful, tests hypotheses for why CoT reasoning may be unfaithful, and examines how task type, model size, and other factors influence reasoning faithfulness.

Methods

The researchers devised multiple experiments to measure CoT faithfulness:

Truncating the CoT: Testing whether the final answer changes when the reasoning is cut short.

Introducing Mistakes: Adding errors to the CoT reasoning and observing whether they affect the model's answer.

Replacing CoT with Filler Tokens: Replacing the CoT with meaningless tokens (e.g., ellipses) to see if extra test-time computation alone improves performance.

Paraphrasing CoT: Rewording the CoT to check whether specific phrasing encodes information that drives accuracy.

The models were tested on a variety of tasks, including ARC (Easy and Challenge; science questions of varying difficulty), AQuA (algebra problems), LogiQA (logical reasoning questions), HellaSwag (text completion), TruthfulQA (questions designed to elicit misconceptions), OpenBookQA (elementary science questions), and MMLU (a multitask understanding benchmark).
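As a rough illustration of how the truncation probe from the list above could be implemented, the sketch below cuts a sampled chain of thought off after a varying number of steps and checks how often the final answer changes. The query_model function and the prompt wording are hypothetical stand-ins, not the authors' code.

```python
# Sketch of the CoT-truncation probe: if the final answer rarely changes when the
# reasoning is cut short, the stated reasoning is likely post hoc.
# `query_model` is a hypothetical stand-in for an LLM API call, not the authors' code.

def query_model(prompt: str) -> str:
    raise NotImplementedError("Plug in an actual LLM API call here.")

def answer_after_truncation(question: str, cot_steps: list[str], keep: int) -> str:
    """Re-prompt with only the first `keep` reasoning steps and ask for the answer."""
    partial_cot = " ".join(cot_steps[:keep])
    prompt = f"Q: {question}\nA: {partial_cot} So the answer is"
    return query_model(prompt).strip()

def truncation_sensitivity(question: str, cot_steps: list[str], full_answer: str) -> float:
    """Fraction of truncation points where the answer differs from the full-CoT answer."""
    changed = sum(
        answer_after_truncation(question, cot_steps, k) != full_answer
        for k in range(len(cot_steps))
    )
    return changed / max(len(cot_steps), 1)
```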

Results

Variation Across Tasks: CoT reasoning was more impactful and faithful for some tasks (e.g., AQuA and LogiQA) but less so for others (e.g., ARC and HellaSwag). Some tasks demonstrated clear reliance on CoT, while others showed little dependency.

Post-Hoc Reasoning: Tasks with low reliance on CoT exhibited more post-hoc reasoning, where the reasoning is generated after the answer is decided. High reliance on CoT correlated with more faithful reasoning.

Mistake Injection: Introducing errors into the CoT caused the final answers to change more often in tasks where the model relied on CoT, suggesting these tasks had less post-hoc reasoning.
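To make the mistake-injection probe concrete, here is a minimal sketch under the same assumption of a hypothetical query_model stand-in: corrupt one reasoning step and check whether the final answer follows the error. The perturbation rule is a toy example, not the paper's implementation.

```python
# Sketch of mistake injection: corrupt one reasoning step, then check whether the
# model's final answer tracks the corrupted reasoning. Frequent answer changes
# suggest the model genuinely relies on the CoT rather than reasoning post hoc.
# `query_model` is a hypothetical LLM call; the perturbation rule is a toy example.
import re

def query_model(prompt: str) -> str:
    raise NotImplementedError("Plug in an actual LLM API call here.")

def inject_mistake(step: str) -> str:
    """Toy perturbation: bump the first number in the step up by one."""
    return re.sub(r"\d+", lambda m: str(int(m.group()) + 1), step, count=1)

def answer_with_mistake(question: str, cot_steps: list[str], idx: int) -> str:
    """Re-prompt with step `idx` corrupted and read off the model's final answer."""
    corrupted = cot_steps[:idx] + [inject_mistake(cot_steps[idx])] + cot_steps[idx + 1:]
    prompt = f"Q: {question}\nA: {' '.join(corrupted)} So the answer is"
    return query_model(prompt).strip()
```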

Filler Tokens: Replacing CoT with meaningless tokens did not improve accuracy, showing that performance gains are not merely due to additional computation during inference.

Paraphrasing CoT: Paraphrasing did not significantly alter accuracy, indicating that the phrasing of CoT is not the main driver of performance improvements.

Model Size and Faithfulness: Smaller models produced more faithful reasoning on certain tasks compared to larger, more capable models. Larger models often rely less on CoT, as they may already “know” the answer without requiring explicit reasoning.

Takeaways

Faithfulness Depends on Task and Model Size: Faithfulness is not inherent to the CoT method but is influenced by the interplay between model size, task complexity, and the model’s capabilities.

Inverse Scaling: Larger models often demonstrate less faithful reasoning, especially for simpler tasks, as they may produce explanations after already knowing the answer.

CoT is Not Always Faithful: CoT reasoning is sometimes a facade, failing to reflect the true processes used by the model. However, careful choice of tasks and model configurations can improve faithfulness.

Faithfulness Metrics Are Essential: The paper establishes that evaluating reasoning faithfulness is critical for deploying LLMs in real-world, high-stakes applications.

Conclusion

The paper highlights that while CoT reasoning has significant benefits for certain tasks, it is not inherently faithful across all scenarios. Faithfulness varies with model size, task type, and reasoning style. Improving faithfulness through model training, task design, or evaluation metrics is essential to ensure the trustworthiness of AI systems. The authors pave the way for future research to develop methods for generating inherently faithful reasoning, create models capable of identifying and correcting unfaithful reasoning, and apply these findings in domains requiring high trust and interpretability.