Anthropic unveiled Claude 3.7 Sonnet this week, its newest AI model that puts all its capabilities under one roof instead of splitting them across different specialized versions.
The release marks a significant shift in how the company approaches model development, embracing a “do everything well” philosophy rather than creating separate models for different tasks, as OpenAI does.
This isn’t Claude 4.0. Instead, it’s a meaningful but incremental update to Claude 3.5 Sonnet. The naming convention suggests the October release may have been considered Claude 3.6 internally, though Anthropic never labeled it as such publicly.
Enthusiasts and early testers have been pleased with Claude’s coding and agentic capabilities. Some independent tests back Anthropic’s claim that the model outperforms other state-of-the-art LLMs at coding.
However, the pricing structure puts Claude 3.7 Sonnet at a premium compared to market alternatives. API access costs $3 per million input tokens and $15 per million output tokens—substantially higher than competitive offerings from Google, Microsoft, and OpenAI.
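For rough context (using hypothetical request sizes, not measured usage), those rates mean a single call that sends 100,000 input tokens and returns 2,000 output tokens would cost about $0.30 for the input plus $0.03 for the output, or roughly $0.33 per request—costs that add up quickly for document-heavy workloads.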
The model is a much-needed update; however, what Anthropic delivers in capability, it lacks in features.
It cannot browse the web, cannot generate images, and doesn’t have the research features that OpenAI, Grok, and Google Gemini offer in their chatbots.
But life isn’t just about coding. We tested the model across different scenarios—leaning toward the use cases a regular user would have in mind—and compared it against the best models in each field, including creative writing, political bias, math, coding, and more.
Here is how it stacks up and our thoughts about its performance—but TL;DR, we were pleased.
Creative writing: The king is back
Claude 3.7 Sonnet just snatched back the creative writing crown from Grok-3, whose reign at the top lasted barely a week.
In our creative writing tests—designed to measure how well these models craft engaging stories that actually make sense—Claude 3.7 delivered narratives with more human-like language and better overall structure than its competitors.
Think of these tests as measuring how useful these models might be for scriptwriters or novelists working through writer’s block.
While the gap between Grok-3, Claude 3.5, and Claude 3.7 isn’t massive, the difference proved enough to give Anthropic’s new model a subjective edge.
Claude 3.7 Sonnet crafted more immersive language with a better narrative arc throughout most of the story. However, no model seems to have mastered the art of sticking the landing—Claude’s ending felt rushed and somewhat disconnected from the well-crafted buildup.
In fact, some readers may even argue it made little sense given how the story was developing.
Grok-3 actually handled its conclusion slightly better despite falling short in other storytelling elements. This ending problem isn’t unique to Claude—all the models we tested demonstrated a strange ability to frame compelling narratives but then stumbled when wrapping things up.
Curiously, activating Claude’s extended thinking feature (the much-hyped reasoning mode) actually backfired spectacularly for creative writing.
The resulting stories felt like a major step backward, resembling output from earlier models like GPT-3.5—short, rushed, repetitive, and often nonsensical.
So, if you want to role-play, create stories, or write novels, you may want to leave that extended reasoning feature turned off.
You can read our prompt and all the stories in our GitHub repository.
Summarization and information retrieval: It summarizes too much
When it comes to handling lengthy documents, Claude 3.7 Sonnet proves it can tackle the heavy lifting.
We fed it a 47-page IMF document, and it analyzed and summarized the content without making up quotes—which is a major improvement over Claude 3.5.
Claude’s summary was ultra-concise.