Why LLMs Couldn't Count the R's in "Strawberry" — and How Spelling It Out Fixed It

An AI that writes essays and debugs code couldn't count the r's in "strawberry." The reason reveals how LLMs really see text — and the fix is simply spelling it out.

June 11, 2026 · 5 min read · llm, tokenization, ai, machine learning, prompt engineering, chain-of-thought, nlp, chatgpt, strawberry problem, ai limitations

A small robot gazes up at an oversized glowing strawberry, under the caption "The fruit that humbled billion-parameter models."

For a while, one of the most famous “gotchas” on the internet was embarrassingly simple: ask a state-of-the-art language model how many r’s are in the word strawberry, and watch a system capable of writing essays, debugging code, and passing professional exams confidently answer “two.”

How could something so smart fail at something a first-grader can do? The answer reveals something fundamental about how these models actually see text — and the fix turns out to be surprisingly low-tech.

A friendly robot holds a magnifying glass over the word "strawberry," which is split into two jigsaw pieces reading "straw" and "berry" — illustrating how a tokenizer breaks the word into chunks rather than letters.

LLMs Don’t See Letters — They See Tokens

When you type “strawberry” into a chat window, the model never receives the letters s-t-r-a-w-b-e-r-r-y. Before your text reaches the neural network, it passes through a tokenizer, which chops text into chunks called tokens. A token might be a whole word, a fragment of a word, or a piece of punctuation.

So “strawberry” might arrive as something like:

straw + berry, or
str + aw + berry

depending on the tokenizer. The model then works with numeric IDs representing those chunks — not the characters inside them. Asking it to count letters is a bit like asking someone to count the bricks in a wall when all they’ve ever been shown is a photo of the finished house.
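To make that concrete, here is a minimal sketch of the idea in Python. The vocabulary, the IDs, and the greedy longest-match rule are all made up for illustration; real tokenizers use learned vocabularies and different splitting algorithms. The point is only that what comes out is a list of numbers, with no letters left to count.

```python
# Toy illustration only: this vocabulary and these IDs are invented,
# not taken from any real tokenizer.
TOY_VOCAB = {"straw": 101, "berry": 102, "str": 103, "aw": 104}


def toy_tokenize(text: str, vocab: dict[str, int]) -> list[int]:
    """Split `text` into vocabulary IDs using greedy longest-match."""
    ids = []
    pos = 0
    while pos < len(text):
        # Try the longest remaining substring first, then shrink it.
        for end in range(len(text), pos, -1):
            piece = text[pos:end]
            if piece in vocab:
                ids.append(vocab[piece])
                pos = end
                break
        else:
            raise ValueError(f"no token covers {text[pos:]!r}")
    return ids


token_ids = toy_tokenize("strawberry", TOY_VOCAB)
print(token_ids)  # [101, 102]
```

Nothing in `[101, 102]` says how many r's were in the original word. That information would have to be remembered about each ID rather than read off the input.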

Tokenization isn’t a design flaw — it’s a deliberate trade-off. Processing text in chunks instead of individual characters makes models dramatically faster and lets them fit far more text into their context window. Letter-counting is just one of the rare tasks where that trade-off bites.

This is also why LLMs historically struggled with related character-level tasks: reversing a word, finding the third letter, counting syllables, or playing Wordle well. The model knows which characters make up a token only implicitly, from patterns in its training data; nothing in its input spells them out.

The Fix: Just Spell the Word Out

Here’s the interesting part. People noticed early on that if you gave an older model the word already broken into letters (s, t, r, a, w, b, e, r, r, y), it could usually count them correctly. Separated this way, each letter typically became its own token, so the model could finally “see” what it was counting.

Modern models internalized this trick. When you ask a current model to count letters, it typically does something like this in its own response or reasoning:

Spell the word out explicitly: s-t-r-a-w-b-e-r-r-y
Scan the spelled-out version for the target letter
Count the matches and answer

The crucial insight: once the model writes the word letter by letter, those characters typically land in its own context as separate tokens. The model has converted the problem into a form it can perceive. It is the same fix users used to apply manually, now applied automatically.
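The three steps can be mirrored in a few lines of Python. This is an analogy for what the model writes out in text, not a description of what happens inside the network; the function name is hypothetical.

```python
def spell_and_count(word: str, target: str) -> tuple[str, list[int]]:
    """Spell `word` out, then list the 1-based positions of `target`."""
    # Step 1: spell the word out so every character is a separate item.
    letters = list(word.lower())
    spelled = "-".join(letters)
    # Step 2: scan the spelled-out version for the target letter.
    positions = [
        index
        for index, letter in enumerate(letters, start=1)
        if letter == target.lower()
    ]
    return spelled, positions


spelled, positions = spell_and_count("strawberry", "r")
print(spelled)         # s-t-r-a-w-b-e-r-r-y
print(positions)       # [3, 8, 9]
# Step 3: count the matches and answer.
print(len(positions))  # 3
```

Keeping the positions, rather than only the total, matches how a spelled-out answer can be checked letter by letter.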

A before-and-after comparison: on the left, a confused robot squints at "strawberry" as a single tokenized block; on the right, a confident robot gives a thumbs-up beside the word spelled out as individual letter tiles, S-T-R-A-W-B-E-R-R-Y.

Is It a Tool, a Script, or Just… Text?

A common assumption is that models now call a Python script or an external tool to count letters. Some AI systems genuinely can do that — many chat products have code-execution sandboxes, and running word.count("r") is trivially reliable.

But in most everyday cases, no tool is involved at all. The model simply writes the spelling into its visible answer or its internal chain-of-thought, and that act of writing is what makes the letters countable. It’s self-help, not outsourcing:

Tool-based approach: the model generates code, an external interpreter runs it, and the result comes back. Maximum reliability, but requires infrastructure.
Spell-it-out approach: the model decomposes the word in plain text and counts from its own output. No infrastructure needed — just a learned habit of breaking the problem down.
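For contrast, the tool-based path can be sketched roughly like this. The generated code is hard-coded here as a stand-in for what a model might emit, and a plain subprocess stands in for the interpreter; it is not a real sandbox, and actual products wire this up in their own ways.

```python
import subprocess
import sys

# Stand-in for code a model might generate; hard-coded as an assumption.
GENERATED_CODE = 'print("strawberry".count("r"))'


def run_in_interpreter(code: str) -> str:
    """Run `code` in a separate Python process and return its stdout."""
    result = subprocess.run(
        [sys.executable, "-c", code],
        capture_output=True,  # collect stdout/stderr instead of printing
        text=True,            # decode output as str rather than bytes
        check=True,           # raise if the generated code fails
        timeout=10,           # do not wait forever on runaway code
    )
    return result.stdout.strip()


tool_answer = run_in_interpreter(GENERATED_CODE)
print(tool_answer)  # 3
```

The counting itself is a single `str.count` call. Everything else is the infrastructure needed to run it and pass the result back, which is exactly what the spell-it-out approach avoids.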

What This Episode Says About LLMs

The strawberry saga is a small story with a big moral. It shows that:

LLM failures often reveal architecture, not intelligence. The models weren’t “dumb” — they were blind to a level of detail their input format hid from them.
Decomposition is a superpower. Breaking a problem into smaller, explicitly visible steps fixed letter counting, and the same principle underlies chain-of-thought reasoning, which has improved LLM performance on many math, logic, and planning tasks.
Humans and models converged on the same workaround. Users discovered the spell-it-out trick through experimentation; later models learned to do it themselves. The fix migrated from the prompt into the model’s behavior.

There’s something almost charming about it: one of the most advanced technologies of our time was tripped up by a fruit, and the cure was the same thing a patient teacher tells a child — slow down and spell it out.