Claude 3.7 Sonnet Review: Anthropic's Hybrid Reasoning AI Tested

Anthropic has launched Claude 3.7 Sonnet, introducing what the company claims is the first “hybrid reasoning model” in the AI industry. This innovative approach allows the AI to switch between quick responses and extended step-by-step thinking within a single system, marking a significant departure from competitors’ strategies.

The model, which launched on Monday, is free to use in its standard mode, while the extended thinking feature requires Claude’s Pro subscription at $20 per month. Anthropic’s philosophy differs fundamentally from other AI companies: “We regard reasoning as simply one of the capabilities a frontier model should have, rather than something to be provided in a separate model,” an Anthropic spokesperson told Business Insider.

Performance Testing Results

Business Insider conducted hands-on testing comparing Claude 3.7’s extended thinking mode against OpenAI’s ChatGPT o1 and xAI’s Grok 3. The tests focused on logical reasoning and creative tasks to evaluate whether additional thinking time improves AI performance.

In a logic riddle test, OpenAI’s ChatGPT o1 delivered the correct answer (“a dream”) in just six seconds. Grok 3’s Think Mode took 32 seconds with detailed step-by-step reasoning. Claude 3.7’s normal mode responded quickly with the correct answer, but its extended thinking mode took nearly a minute, exploring alternatives like “hallucination” and “virtual reality” before arriving at the same conclusion. The model demonstrated remarkably human-like self-correction, questioning its own reasoning throughout the process.

For creative tasks, Claude 3.7’s extended thinking mode showed more impressive results. When asked to write a poem about AI sentience, the extended mode brainstormed seven different metaphors over 45 seconds before selecting one, producing what reviewers considered a more layered and thoughtful poem titled “Emergent.” ChatGPT o1 completed the task in seconds with a more clichéd result, while Grok 3 took 22 seconds for a middle-ground approach.

Technical Capabilities and Benchmarks

Developers using Claude’s API can adjust the “thinking budget”—a cap on how many tokens the model may spend reasoning—to balance speed, cost, and answer quality, which is particularly useful for complex coding problems or agentic tasks. According to Anthropic, Claude 3.7 Sonnet achieved 62.3% accuracy on SWE-bench (a benchmark that evaluates real-world software engineering tasks), significantly outperforming OpenAI’s o3-mini model at 49.3% and DeepSeek’s offerings.

When asked if it ever overthinks, Claude 3.7 acknowledged it can “over-analyze simple questions” and get “caught considering too many edge cases,” adding that the ideal thinking amount is context-dependent.

Key Quotes

“We developed hybrid reasoning with a different philosophy from other reasoning models on the market. We regard reasoning as simply one of the capabilities a frontier model should have, rather than something to be provided in a separate model.”

An Anthropic spokesperson explained the company’s unique approach to Business Insider, distinguishing their strategy from competitors like OpenAI who offer reasoning as a separate model feature. This philosophical difference underpins Claude 3.7’s hybrid architecture.

“Oh, wait - there’s another angle I hadn’t considered. What about ‘darkness’? Actually, there’s another possibility I hadn’t fully considered: ‘your closed eyes.’ I’ve been going back and forth, but based on the complete perceptual dominance suggested by the second clue, I think the answer is more likely to be dreams, sleep, or closed eyes than imagination.”

Claude 3.7’s extended thinking mode demonstrated remarkably human-like self-correction during the riddle test, flagging its own indecision and exploring multiple possibilities before settling on an answer. This transparency in the reasoning process distinguishes it from competitors.

“As with human thinking, Claude sometimes finds itself thinking some incorrect, misleading, or half-baked thoughts along the way. Many users will find this useful; others might find it (and the less characterful content in the thought process) frustrating.”

Anthropic acknowledged in a recent blog post that extended thinking comes with trade-offs, including visible reasoning errors. This honest assessment reflects the company’s transparency about the model’s limitations and the subjective value of seeing AI’s thought process.

Our Take

Claude 3.7 Sonnet’s hybrid reasoning model represents a fascinating experiment in AI architecture that challenges conventional wisdom about how reasoning should be implemented. The testing reveals a critical insight: context matters more than raw processing time. For straightforward logical problems, ChatGPT o1’s speed advantage is clear, but for creative and complex tasks requiring exploration of multiple approaches, Claude’s extended thinking delivers superior results.

What’s particularly intriguing is the model’s self-awareness: it acknowledges when it overthinks and recognizes that different tasks require different cognitive approaches, mirroring human intelligence more closely than previous AI models. The 62.3% SWE-bench score is impressive and suggests real-world utility for developers. However, the $20/month paywall for extended thinking may limit adoption compared to competitors. Anthropic’s approach of integrating reasoning into a unified model rather than creating separate systems could prove more sustainable long-term, offering a glimpse into how next-generation AI assistants might balance speed, depth, and versatility.

Why This Matters

Claude 3.7 Sonnet represents a pivotal shift in AI reasoning architecture, challenging the industry trend of creating separate reasoning models. Anthropic’s hybrid approach could influence how future AI systems are designed, potentially making advanced reasoning capabilities more accessible and cost-effective for everyday users.

The model’s performance reveals an important nuance in AI development: more thinking time doesn’t universally improve results. For logical tasks, speed and accuracy matter most, while creative and complex problems benefit from extended reasoning. This insight has significant implications for businesses implementing AI solutions—they’ll need to match the right reasoning mode to specific use cases.

The superior performance on software engineering benchmarks (62.3% vs. OpenAI’s 49.3%) suggests Claude 3.7 could become a preferred tool for developers, potentially shifting market dynamics in the competitive AI assistant space. As AI models increasingly handle complex coding and agentic tasks, the ability to adjust “thinking budgets” offers developers unprecedented control over performance optimization. This development signals that the AI industry is moving beyond raw speed toward more nuanced, context-aware intelligence that mirrors human cognitive flexibility.

Source: https://www.businessinsider.com/anthropic-claude-3-7-sonnet-test-thinking-grok-chatgpt-comparison-2025-2