The Story of UTF-8: How Ken Thompson Saved the Internet from Character Encoding Hell
TL;DR: Before UTF-8, the world was drowning in incompatible character encodings. ASCII only covered English, and the dozens of competing standards for other languages turned international text into garbled nonsense (mojibake). In 1992, Ken Thompson and Rob Pike designed UTF-8 on a placemat at a New Jersey diner, creating an encoding that was backward-compatible with ASCII, self-synchronizing, and capable of representing every character in Unicode. Today, UTF-8 is used by over 98% of all web pages. This is the story of how it happened.
Table of Contents
- The ASCII Era: 128 Characters and Nothing More
- The Tower of Babel: Competing Encodings
- Mojibake: When Encodings Collide
- Unicode: One Character Set to Rule Them All
- The Problem with UCS-2 and UTF-16
- A Diner in New Jersey
- How UTF-8 Actually Works
- Why UTF-8 is Brilliant
- UTF-8 Takes Over the World
- Lessons for Protocol Design
- References
The ASCII Era: 128 Characters and Nothing More
In 1963, the American Standards Association (now ANSI) published ASCII -- the American Standard Code for Information Interchange. In its 1967 revision it settled into the familiar 128 characters using 7 bits: 26 uppercase letters, 26 lowercase letters (added in that revision), 10 digits, 33 control characters, and a handful of punctuation marks and symbols.
ASCII was elegant and efficient for its purpose. The problem was right there in the name: American. It worked perfectly for English and was completely useless for most of the world's languages.
No accented characters for French or Spanish. No umlauts for German. Nothing for Greek, Arabic, Hebrew, Chinese, Japanese, Korean, Hindi, Thai, or the hundreds of other writing systems used by billions of people.
The 8th bit (ASCII only used 7 of the 8 bits in a byte) was left undefined, and that single unused bit became the most contested piece of real estate in computing history.
The Tower of Babel: Competing Encodings
Different regions and vendors rushed to fill the gap by defining their own code pages -- extensions that used values 128-255 (the 8th bit) for additional characters.
The result was chaos:
| Encoding | Region/Language | Characters 128-255 |
|---|---|---|
| ISO 8859-1 (Latin-1) | Western European | French, German, Spanish accents |
| ISO 8859-5 | Cyrillic | Russian, Bulgarian, Serbian |
| ISO 8859-6 | Arabic | Arabic script |
| ISO 8859-7 | Greek | Greek alphabet |
| ISO 8859-15 | Western European | Latin-1 + Euro sign (€) |
| Windows-1252 | Western European (Windows) | Similar to Latin-1, with extras |
| Shift_JIS | Japanese | Kanji, Hiragana, Katakana |
| EUC-KR | Korean | Hangul syllables |
| GB2312 / GBK | Chinese (Simplified) | CJK characters |
| Big5 | Chinese (Traditional) | CJK characters |
| KOI8-R | Russian | Cyrillic |
This is a partial list. There were hundreds of encodings in active use. The same byte value could represent completely different characters depending on which encoding you assumed.
Byte 0xC4 in ISO 8859-1 is Ä (A with diaeresis). In KOI8-R, it's д (Cyrillic de). In ISO 8859-7, it's Δ (Greek Delta). Same byte, three different characters.
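This ambiguity is easy to reproduce with Python's built-in codecs (a quick sketch; the codec names are Python's identifiers for these standards):

```python
# The same byte, 0xC4, decoded under three different legacy encodings.
b = bytes([0xC4])

print(b.decode("latin-1"))    # Ä  (ISO 8859-1)
print(b.decode("koi8_r"))     # д  (KOI8-R)
print(b.decode("iso8859_7"))  # Δ  (ISO 8859-7, Greek)
```

Nothing in the byte itself says which interpretation is correct; the reader has to know, or guess.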
For East Asian languages, the situation was even more complex. Chinese, Japanese, and Korean required thousands of characters, far more than 128 extra slots. Multi-byte encodings like Shift_JIS and Big5 were developed, each with their own rules for when a byte was the start of a multi-byte sequence. They were incompatible with each other and fragile -- if you lost even one byte, the rest of the text became garbled.
Mojibake: When Encodings Collide
When text encoded in one system is displayed using a different encoding, you get mojibake -- the Japanese term (文字化け, literally "character transformation") for garbled text. Everyone who used a computer in the 1990s and 2000s encountered it:
- An email from a German colleague shows `Ã¼` instead of `ü`
- A Japanese website displays as an avalanche of random symbols
- A database stores customer names as `RenÃ©` instead of `René`
- CSV files exported from Excel turn accented characters into question marks
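This kind of corruption is easy to reproduce by round-tripping text through mismatched codecs -- a minimal Python sketch:

```python
# Classic mojibake: text written out as UTF-8 but read back as Latin-1.
original = "René"
garbled = original.encode("utf-8").decode("latin-1")
print(garbled)  # RenÃ©

# The umlaut example: é and ü each become two Latin-1 characters,
# because their UTF-8 encodings are two bytes long.
print("ü".encode("utf-8").decode("latin-1"))  # Ã¼
```

The data isn't lost here -- reversing the two steps recovers the original -- but once the garbled form is saved back to a database, the damage is usually permanent.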
Mojibake wasn't just annoying -- it was a data integrity problem. Names were corrupted in databases. Legal documents became illegible. Software that worked perfectly in one country broke catastrophically in another.
The root cause was simple but devastating: there was no universal agreement on how to map bytes to characters. Every system made assumptions, and those assumptions broke the moment data crossed a border.
Unicode: One Character Set to Rule Them All
In the late 1980s, two parallel efforts set out to solve this problem by creating a single character set that included every writing system on Earth.
The Unicode project, begun in the late 1980s by engineers from Xerox and Apple (the Unicode Consortium was incorporated in 1991), and the ISO 10646 working group both aimed to create a universal character set. They eventually merged their efforts, and today Unicode is the canonical reference.
Unicode assigns a unique code point to every character. Code points are written as U+ followed by a hexadecimal number:
- `U+0041` = A (Latin Capital Letter A)
- `U+00FC` = ü (Latin Small Letter U with Diaeresis)
- `U+4E16` = 世 (CJK Unified Ideograph, meaning "world")
- `U+1F600` = 😀 (Grinning Face emoji)
As of Unicode 16.0, there are over 154,000 characters covering 168 modern and historic scripts, plus thousands of symbols, emojis, and control characters. The maximum code point is U+10FFFF, giving a theoretical space of 1,114,112 characters.
But Unicode is a character set, not an encoding. It tells you that U+00FC is the letter ü, but it doesn't tell you how to represent that code point as bytes in memory or on disk. That's the job of an encoding, and the choice of encoding turned out to be enormously consequential.
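The distinction is visible in any language with Unicode strings. In Python, for instance, ord() and chr() work in code points, while encode() is where an encoding finally enters the picture (a quick illustration):

```python
# Code points are abstract numbers; ord() and chr() convert between
# characters and code points without involving any encoding.
assert ord("A") == 0x0041
assert ord("ü") == 0x00FC
assert chr(0x4E16) == "世"
assert chr(0x1F600) == "😀"

# Only when we serialize to bytes does an encoding matter -- and
# different encodings give different bytes for the same code point:
print("ü".encode("utf-8"))      # b'\xc3\xbc'
print("ü".encode("utf-16-be"))  # b'\x00\xfc'
```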
The Problem with UCS-2 and UTF-16
The earliest Unicode encoding was UCS-2: simply use 2 bytes (16 bits) for every character. This was clean and simple -- every character was exactly 2 bytes, so string indexing and length calculations were trivial.
There were two problems.
First, 16 bits only gives you 65,536 possible values, and Unicode already had more characters than that. UCS-2 evolved into UTF-16, which used "surrogate pairs" (two 16-bit code units) for characters beyond U+FFFF. This broke the "every character is 2 bytes" guarantee and added complexity.
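Surrogate pairs are easy to observe in practice -- a short Python sketch using the built-in utf-16-be codec:

```python
# A character above U+FFFF needs a surrogate pair in UTF-16:
# two 16-bit code units instead of one.
s = "😀"  # U+1F600, well beyond U+FFFF
utf16 = s.encode("utf-16-be")

print(utf16.hex(" "))       # d8 3d de 00 -> high surrogate D83D, low surrogate DE00
print(len(utf16) // 2)      # 2 code units for what users see as 1 character
```

This is exactly the complexity that broke the original "every character is 2 bytes" promise: lengths, indexing, and slicing all have to account for pairs.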
Second, and more critically: UCS-2 and UTF-16 were incompatible with ASCII. The ASCII string "Hello" in UCS-2 looks like:
00 48 00 65 00 6C 00 6C 00 6F
All those null bytes (00) would be interpreted as string terminators by C programs, which use null-terminated strings. Existing software -- decades of it -- would break. File paths, environment variables, configuration files, network protocols: anything that assumed ASCII-compatible byte strings would fail catastrophically.
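A short Python sketch makes the contrast concrete:

```python
# "Hello" in UCS-2/UTF-16 (big-endian) versus UTF-8.
utf16 = "Hello".encode("utf-16-be")
utf8 = "Hello".encode("utf-8")

print(utf16.hex(" "))  # 00 48 00 65 00 6c 00 6c 00 6f -- a null before every letter
print(utf8.hex(" "))   # 48 65 6c 6c 6f                -- byte-identical to ASCII

assert 0x00 in utf16     # a C string function would stop at the first byte
assert 0x00 not in utf8  # UTF-8 passes through null-terminated APIs untouched
```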
This wasn't a theoretical concern. It was a dealbreaker for adoption across the Unix ecosystem. The world needed an encoding that was backward-compatible with ASCII.
A Diner in New Jersey
In early September 1992, Ken Thompson and Rob Pike were developing the Plan 9 operating system at Bell Labs when X/Open circulated a draft proposal for a new multi-byte encoding, FSS-UTF (File System Safe UTF), intended to replace ISO's earlier UTF-1. Dissatisfied with the draft, on the evening of September 2 they went to dinner at a diner in New Jersey.
Over dinner, they sketched out a new encoding on a placemat. By the time they finished eating, they had designed what would become UTF-8.
The story is recounted by Rob Pike:
"We designed it on a placemat at a diner. Ken wrote the first implementation that night. The next day, we converted the Plan 9 system to use it. By the end of the week, we had a working system."
Thompson went home that night and implemented the encoding. The next morning, he had working code. Within days, Plan 9 was running on UTF-8. They sent the design back to X/Open, which adopted it in place of the earlier draft; it was later standardized as UTF-8.
The elegance of what they designed on that placemat is hard to overstate.
How UTF-8 Actually Works
UTF-8 is a variable-length encoding that uses 1 to 4 bytes per character. The number of bytes depends on the code point's value:
| Code Point Range | Bytes | Byte Pattern | Bits for Code Point |
|---|---|---|---|
| U+0000 to U+007F | 1 | 0xxxxxxx | 7 |
| U+0080 to U+07FF | 2 | 110xxxxx 10xxxxxx | 11 |
| U+0800 to U+FFFF | 3 | 1110xxxx 10xxxxxx 10xxxxxx | 16 |
| U+10000 to U+10FFFF | 4 | 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx | 21 |
Let's encode a few characters by hand.
The letter 'A' (U+0041):
Code point 0x41 = binary 1000001. This fits in 7 bits, so it's a single byte: 01000001 = 0x41. Identical to ASCII. Every ASCII character encodes to the exact same byte in UTF-8.
The letter 'ü' (U+00FC):
Code point 0xFC = binary 11111100. This needs 8 bits, so it requires 2 bytes. Padding to 11 bits and splitting 5 + 6 gives 00011 111100; filling the pattern 110xxxxx 10xxxxxx:
00011 111100 → 110 00011   10 111100
                   ↓            ↓
                  0xC3         0xBC
So ü in UTF-8 is the two bytes C3 BC.
The character '世' (U+4E16):
Code point 0x4E16 = binary 0100 111000 010110. This needs 16 bits, so it requires 3 bytes. Using the pattern 1110xxxx 10xxxxxx 10xxxxxx:
0100 111000 010110 → 1110 0100   10 111000   10 010110
                        ↓             ↓            ↓
                       0xE4          0xB8         0x96
So '世' in UTF-8 is E4 B8 96.
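The encoding rules are small enough to implement directly. Here is a teaching sketch in Python (no validation of surrogates or out-of-range values), checked against the built-in codec:

```python
def utf8_encode(cp: int) -> bytes:
    """Encode one code point to UTF-8, following the four-row table.
    Teaching sketch only -- a real encoder would reject surrogates
    (U+D800..U+DFFF) and anything above U+10FFFF."""
    if cp <= 0x7F:                       # 1 byte: plain ASCII
        return bytes([cp])
    if cp <= 0x7FF:                      # 2 bytes: 110xxxxx 10xxxxxx
        return bytes([0xC0 | (cp >> 6),
                      0x80 | (cp & 0x3F)])
    if cp <= 0xFFFF:                     # 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([0xE0 | (cp >> 12),
                      0x80 | ((cp >> 6) & 0x3F),
                      0x80 | (cp & 0x3F)])
    return bytes([0xF0 | (cp >> 18),     # 4 bytes: 11110xxx + 3 continuations
                  0x80 | ((cp >> 12) & 0x3F),
                  0x80 | ((cp >> 6) & 0x3F),
                  0x80 | (cp & 0x3F)])

# Agrees with Python's built-in encoder on the worked examples:
for ch in "Aü世😀":
    assert utf8_encode(ord(ch)) == ch.encode("utf-8")
print(utf8_encode(0x4E16).hex(" "))  # e4 b8 96
```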
Why UTF-8 is Brilliant
The design choices Thompson and Pike made on that placemat were not just clever -- they were deeply informed by decades of systems programming experience.
ASCII compatibility. Every valid ASCII file is automatically a valid UTF-8 file. This meant the entire existing corpus of ASCII text, every configuration file, every script, every protocol -- worked without modification. The migration cost was nearly zero for ASCII-only content.
Self-synchronization. If you land in the middle of a UTF-8 stream, you can find the start of the next character by looking for a byte that doesn't begin with 10. Continuation bytes always start with 10, leading bytes never do. You never need to scan backward to the beginning of the file.
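The boundary scan takes only a few lines -- a minimal sketch, where next_char_start is an illustrative name, not a standard API:

```python
def next_char_start(buf: bytes, i: int) -> int:
    """Starting at index i, skip continuation bytes (0b10xxxxxx)
    until we reach the start of the next character."""
    while i < len(buf) and (buf[i] & 0xC0) == 0x80:
        i += 1
    return i

data = "héllo".encode("utf-8")  # 68 c3 a9 6c 6c 6f
# Index 2 lands in the middle of 'é' (on the continuation byte 0xA9):
print(next_char_start(data, 2))  # 3 -- the start of the first 'l'
```

No backward scanning, no state to carry: two bits of each byte are enough to resynchronize.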
No null bytes. Except for the actual null character U+0000, the UTF-8 encoding of a character never contains a 0x00 byte. C string functions work without modification.
No byte-order issues. Unlike UTF-16 and UTF-32, which need a Byte Order Mark (BOM) or prior agreement to indicate endianness, UTF-8 is a byte-stream encoding. There's no endianness to worry about. The same bytes work on big-endian and little-endian machines.
Sorting order preserved. Byte-level sorting of UTF-8 strings produces the same order as sorting by Unicode code point values. This means existing byte-comparison tools and algorithms work correctly on UTF-8 text.
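A quick Python check of this property (the word list is arbitrary; Python's own string comparison is by code point):

```python
# Byte-wise sort of UTF-8 strings matches sort by code point.
words = ["zebra", "éclair", "Apple", "世界", "abc"]

by_bytes = sorted(words, key=lambda w: w.encode("utf-8"))
by_codepoints = sorted(words)  # str comparison is code-point order

assert by_bytes == by_codepoints
print(by_bytes)
```

Note this is code-point order, not linguistic collation -- "Apple" sorts before "abc" either way -- but the point is that the two orders never disagree.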
Error detection. Invalid byte sequences are immediately detectable. There are no ambiguous encodings, no sequences that could be interpreted as either one character or two. This makes UTF-8 robust against corruption.
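A strict decoder makes this visible immediately -- for example, a leading byte that promises a continuation byte that never arrives:

```python
# 0xC3 announces a 2-byte sequence, but 0x28 ('(') is not a
# continuation byte, so a strict UTF-8 decoder rejects the input.
try:
    b"\xc3\x28".decode("utf-8")
except UnicodeDecodeError as e:
    print("rejected:", e.reason)
```

Legacy multi-byte encodings would often decode such garbage silently into the wrong characters; UTF-8 fails loudly instead.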
Space efficiency for Latin text. English and other ASCII-heavy languages use just 1 byte per character, the same as ASCII. Western European languages use 1-2 bytes. CJK characters use 3 bytes, which is comparable to their existing multi-byte encodings.
UTF-8 Takes Over the World
Despite its elegance, UTF-8's dominance wasn't immediate. In the 1990s, many systems bet on UTF-16 instead:
- Java (1995) used UTF-16 internally for `String` and `char`
- Windows NT (1993) adopted UCS-2, later extended to UTF-16, as its internal encoding (the "W" APIs)
- JavaScript (1995) used UTF-16 for strings
- Python 2 had a muddled approach, with separate `str` (bytes) and `unicode` types
But on the web, in filesystems, in databases, in APIs, and in the broader Unix ecosystem, UTF-8 steadily won:
- Linux adopted UTF-8 as the default locale encoding
- macOS uses UTF-8 for file names at the filesystem interface (HFS+ normalized them to a decomposed form; APFS preserves them as given)
- HTML5 specification recommends UTF-8 and warns against other encodings
- JSON texts exchanged between systems must be encoded in UTF-8 (per RFC 8259)
- TOML, YAML, Rust, Go, Swift all default to or require UTF-8
The numbers tell the story. According to W3Techs, UTF-8 was used by:
- 28% of web pages in 2009
- 60% in 2013
- 80% in 2016
- 98%+ by 2025
The encoding that two engineers designed on a placemat over dinner is now the de facto standard for virtually all text on the internet.
Lessons for Protocol Design
The success of UTF-8 offers timeless lessons for anyone designing protocols, file formats, or data representations:
Backward compatibility wins. UTF-8's ASCII compatibility eliminated the migration barrier. Systems didn't need to be rewritten to handle basic text. The upgrade path was gradual and painless.
Simplicity at the byte level matters. Self-synchronization, no null bytes, no endianness issues -- these properties made UTF-8 easy to implement correctly and hard to implement incorrectly. The number of subtle bugs in UTF-16 surrogate pair handling versus UTF-8 implementations speaks volumes.
Design for the ecosystem, not just the specification. Thompson and Pike understood that the Unix ecosystem was built on byte streams and null-terminated strings. They designed UTF-8 to fit into that world rather than demanding the world change. Meet systems where they are.
The right design at the right time changes everything. UTF-8 wasn't the first Unicode encoding, and it wasn't the most "pure" representation. But it was the most practical, and practicality is what drives adoption.
Ken Thompson and Rob Pike didn't just design a character encoding that night in New Jersey. They solved one of computing's most persistent interoperability problems in a way so clean that the solution has lasted over thirty years and shows no signs of being replaced. The placemat is long gone, but the encoding lives on in every web page you visit, every API you call, and every emoji you send.
References
- Pike, R. & Thompson, K. (1993). "Hello World, or Καλημέρα κόσμε, or こんにちは世界." Proceedings of the USENIX Winter 1993 Technical Conference.
- Yergeau, F. (2003). RFC 3629: UTF-8, a transformation format of ISO 10646. Internet Engineering Task Force.
- The Unicode Consortium. (2024). The Unicode Standard, Version 16.0. https://www.unicode.org/versions/Unicode16.0.0/
- Pike, R. (2003). "UTF-8 History." https://www.cl.cam.ac.uk/~mgk25/ucs/utf-8-history.txt
- Kuhn, M. (2024). "UTF-8 and Unicode FAQ for Unix/Linux." https://www.cl.cam.ac.uk/~mgk25/unicode.html
- Davis, M. (2008). "Moving to Unicode 5.1." Google Official Blog.
- W3Techs. (2025). "Usage statistics of character encodings for websites." https://w3techs.com/technologies/overview/character_encoding