The Story of UTF-8: How Ken Thompson Saved the Internet from Character Encoding Hell

TL;DR: Before UTF-8, the world was drowning in incompatible character encodings. ASCII only covered English, and the dozens of competing standards for other languages turned international text into garbled nonsense (mojibake). In 1992, Ken Thompson and Rob Pike designed UTF-8 on a placemat at a New Jersey diner, creating an encoding that was backward-compatible with ASCII, self-synchronizing, and capable of representing every character in Unicode. Today, UTF-8 is used by over 98% of all web pages. This is the story of how it happened.


Table of Contents

  1. The ASCII Era: 128 Characters and Nothing More
  2. The Tower of Babel: Competing Encodings
  3. Mojibake: When Encodings Collide
  4. Unicode: One Character Set to Rule Them All
  5. The Problem with UCS-2 and UTF-16
  6. A Diner in New Jersey
  7. How UTF-8 Actually Works
  8. Why UTF-8 is Brilliant
  9. UTF-8 Takes Over the World
  10. Lessons for Protocol Design
  11. References

The ASCII Era: 128 Characters and Nothing More

In 1963, the American Standards Association (now ANSI) published ASCII -- the American Standard Code for Information Interchange. In its 1967 revision, the form we know today, it defined 128 characters using 7 bits: 26 uppercase letters, 26 lowercase letters, 10 digits, 33 punctuation marks and symbols (including the space), and 33 control characters.

ASCII was elegant and efficient for its purpose. The problem was right there in the name: American. It worked perfectly for English and was completely useless for most of the world's languages.

No accented characters for French or Spanish. No umlauts for German. Nothing for Greek, Arabic, Hebrew, Chinese, Japanese, Korean, Hindi, Thai, or the hundreds of other writing systems used by billions of people.

The 8th bit (ASCII only used 7 of the 8 bits in a byte) was left undefined, and that single unused bit became the most contested piece of real estate in computing history.


The Tower of Babel: Competing Encodings

Different regions and vendors rushed to fill the gap by defining their own code pages -- extensions that used values 128-255 (the 8th bit) for additional characters.

The result was chaos:

Encoding               Region/Language               Characters 128-255
ISO 8859-1 (Latin-1)   Western European              French, German, Spanish accents
ISO 8859-5             Cyrillic                      Russian, Bulgarian, Serbian
ISO 8859-6             Arabic                        Arabic script
ISO 8859-7             Greek                         Greek alphabet
ISO 8859-15            Western European              Latin-1 + Euro sign (€)
Windows-1252           Western European (Windows)    Similar to Latin-1, with extras
Shift_JIS              Japanese                      Kanji, Hiragana, Katakana
EUC-KR                 Korean                        Hangul syllables
GB2312 / GBK           Chinese (Simplified)          CJK characters
Big5                   Chinese (Traditional)         CJK characters
KOI8-R                 Russian                       Cyrillic

This is a partial list. There were hundreds of encodings in active use. The same byte value could represent completely different characters depending on which encoding you assumed.

Byte 0xC4 in ISO 8859-1 is Ä (A with diaeresis). In ISO 8859-5, it's Ф (Cyrillic capital Ef). In ISO 8859-7, it's Δ (Greek capital Delta). Same byte, three different characters.

For East Asian languages, the situation was even more complex. Chinese, Japanese, and Korean required thousands of characters, far more than 128 extra slots. Multi-byte encodings like Shift_JIS and Big5 were developed, each with its own rules for when a byte was the start of a multi-byte sequence. They were incompatible with each other and fragile -- if you lost even one byte, the rest of the text became garbled.


Mojibake: When Encodings Collide

When text encoded in one system is displayed using a different encoding, you get mojibake -- the Japanese term (文字化け, literally "character transformation") for garbled text. Everyone who used a computer in the 1990s and 2000s encountered it:

  • An email from a German colleague shows Ã¼ instead of ü
  • A Japanese website displays as an avalanche of random symbols
  • A database stores customer names as RenÃ© instead of René
  • CSV files exported from Excel turn accented characters into question marks

Mojibake wasn't just annoying -- it was a data integrity problem. Names were corrupted in databases. Legal documents became illegible. Software that worked perfectly in one country broke catastrophically in another.

The root cause was simple but devastating: there was no universal agreement on how to map bytes to characters. Every system made assumptions, and those assumptions broke the moment data crossed a border.
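
The mechanics are easy to reproduce. Here is a minimal Python sketch (standard library only) of the classic failure mode: UTF-8 bytes interpreted as Latin-1, and Latin-1 bytes interpreted as UTF-8:

    # Write "ü" as UTF-8, then read it back assuming Latin-1: classic mojibake.
    utf8_bytes = "ü".encode("utf-8")          # b'\xc3\xbc'
    print(utf8_bytes.decode("latin-1"))       # Ã¼

    # The reverse direction often fails outright rather than garbling.
    latin1_bytes = "ü".encode("latin-1")      # b'\xfc'
    try:
        latin1_bytes.decode("utf-8")
    except UnicodeDecodeError as e:
        print(e)                              # invalid start byte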


Unicode: One Character Set to Rule Them All

In the late 1980s, two parallel efforts set out to solve this problem by creating a single character set that included every writing system on Earth.

The Unicode project, begun in the late 1980s by engineers from Xerox and Apple (the Unicode Consortium itself was incorporated in 1991), and the ISO/IEC 10646 working group both aimed to create a universal character set. The two efforts were eventually aligned so that their character repertoires stay synchronized, and today Unicode is the canonical reference.

Unicode assigns a unique code point to every character. Code points are written as U+ followed by a hexadecimal number:

  • U+0041 = A (Latin Capital Letter A)
  • U+00FC = ü (Latin Small Letter U with Diaeresis)
  • U+4E16 = 世 (CJK Unified Ideograph, meaning "world")
  • U+1F600 = 😀 (Grinning Face emoji)

As of Unicode 16.0, there are over 154,000 characters covering 168 modern and historic scripts, plus thousands of symbols, emoji, and control characters. The maximum code point is U+10FFFF, giving a theoretical space of 1,114,112 code points.

But Unicode is a character set, not an encoding. It tells you that U+00FC is the letter u, but it doesn't tell you how to represent that code point as bytes in memory or on disk. That's the job of an encoding, and the choice of encoding turned out to be enormously consequential.
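
To make the distinction concrete, here is a small Python example (standard library only) showing one code point turned into bytes by three different encodings:

    ch = "世"
    print(hex(ord(ch)))              # 0x4e16 -- the abstract code point

    # The encoding decides how that number is written as bytes.
    print(ch.encode("utf-8"))        # b'\xe4\xb8\x96'      (3 bytes)
    print(ch.encode("utf-16-be"))    # b'N\x16'             (2 bytes: 0x4E 0x16)
    print(ch.encode("utf-32-be"))    # b'\x00\x00N\x16'     (4 bytes)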


The Problem with UCS-2 and UTF-16

The earliest Unicode encoding was UCS-2: simply use 2 bytes (16 bits) for every character. This was clean and simple -- every character was exactly 2 bytes, so string indexing and length calculations were trivial.

There were two problems.

First, 16 bits only gives you 65,536 possible values, and Unicode already had more characters than that. UCS-2 evolved into UTF-16, which used "surrogate pairs" (two 16-bit code units) for characters beyond U+FFFF. This broke the "every character is 2 bytes" guarantee and added complexity.

Second, and more critically: UCS-2 and UTF-16 were incompatible with ASCII. The ASCII string "Hello" in UCS-2 looks like:

00 48 00 65 00 6C 00 6C 00 6F

All those null bytes (00) would be interpreted as string terminators by C programs, which use null-terminated strings. Existing software -- decades of it -- would break. File paths, environment variables, configuration files, network protocols: anything that assumed ASCII-compatible byte strings would fail catastrophically.

This wasn't a theoretical concern. It was a dealbreaker for adoption across the Unix ecosystem. The world needed an encoding that was backward-compatible with ASCII.
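
The problem is easy to see in a few lines of Python (UTF-16-BE stands in for UCS-2 here, since the two are identical for characters below U+10000):

    ucs2 = "Hello".encode("utf-16-be")
    print(ucs2)                      # b'\x00H\x00e\x00l\x00l\x00o'
    print(0 in ucs2)                 # True -- embedded null bytes break C strings

    utf8 = "Hello".encode("utf-8")
    print(utf8)                      # b'Hello' -- byte-for-byte identical to ASCII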


A Diner in New Jersey

On September 2, 1992, Ken Thompson and Rob Pike were at Bell Labs, where they were making the Plan 9 operating system fully Unicode-capable. X/Open had circulated a draft multi-byte encoding called FSS-UTF (itself a response to ISO's earlier, widely criticized UTF-1) and asked for comments. Dissatisfied with the proposal, chiefly because it was not self-synchronizing, they went to dinner at a diner in New Jersey.

Over dinner, they sketched out a new encoding on a placemat. By the time they finished eating, they had designed what would become UTF-8.

The story is recounted by Rob Pike:

"We designed it on a placemat at a diner. Ken wrote the first implementation that night. The next day, we converted the Plan 9 system to use it. By the end of the week, we had a working system."

Thompson went home that night and implemented the encoding. The next morning, he had working code. Within days, Plan 9 was running on UTF-8. They sent the design back to X/Open, which accepted it, and it was later standardized as UTF-8 in ISO/IEC 10646 and the Unicode Standard, and described in RFC 3629.

The elegance of what they designed on that placemat is hard to overstate.


How UTF-8 Actually Works

UTF-8 is a variable-length encoding that uses 1 to 4 bytes per character. The number of bytes depends on the code point's value:

Code Point Range      Bytes   Byte Pattern                          Bits for Code Point
U+0000 to U+007F      1       0xxxxxxx                              7
U+0080 to U+07FF      2       110xxxxx 10xxxxxx                     11
U+0800 to U+FFFF      3       1110xxxx 10xxxxxx 10xxxxxx            16
U+10000 to U+10FFFF   4       11110xxx 10xxxxxx 10xxxxxx 10xxxxxx   21

Let's encode a few characters by hand.

The letter 'A' (U+0041):

Code point 0x41 = binary 1000001. This fits in 7 bits, so it's a single byte: 01000001 = 0x41. Identical to ASCII. Every ASCII character encodes to the exact same byte in UTF-8.

The letter 'ü' (U+00FC):

Code point 0xFC = binary 11111100. This needs 8 bits, so it requires 2 bytes. Using the pattern 110xxxxx 10xxxxxx, the 8 bits are padded on the left to 11 bits and split into groups of 5 and 6:

11111100   →   110 00011   10 111100
                   0xC3        0xBC

So 'ü' in UTF-8 is the two bytes C3 BC.

The character '世' (U+4E16):

Code point 0x4E16 = binary 0100111000010110. This needs 16 bits, so it requires 3 bytes. Using the pattern 1110xxxx 10xxxxxx 10xxxxxx, the 16 bits split into groups of 4, 6, and 6:

0100 111000 010110   →   1110 0100   10 111000   10 010110
                             0xE4        0xB8        0x96

So '世' in UTF-8 is E4 B8 96.
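
The whole table can be captured in a few lines of code. What follows is a minimal sketch in Python of the encoder logic described above, written for clarity rather than strictness (a conforming encoder would also reject the surrogate range U+D800 to U+DFFF):

    def utf8_encode(cp: int) -> bytes:
        """Encode one Unicode code point using the UTF-8 bit patterns."""
        if cp <= 0x7F:                            # 1 byte:  0xxxxxxx
            return bytes([cp])
        if cp <= 0x7FF:                           # 2 bytes: 110xxxxx 10xxxxxx
            return bytes([0xC0 | (cp >> 6), 0x80 | (cp & 0x3F)])
        if cp <= 0xFFFF:                          # 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
            return bytes([0xE0 | (cp >> 12),
                          0x80 | ((cp >> 6) & 0x3F),
                          0x80 | (cp & 0x3F)])
        if cp <= 0x10FFFF:                        # 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
            return bytes([0xF0 | (cp >> 18),
                          0x80 | ((cp >> 12) & 0x3F),
                          0x80 | ((cp >> 6) & 0x3F),
                          0x80 | (cp & 0x3F)])
        raise ValueError("code point out of range")

    # The worked examples above, checked against Python's built-in encoder:
    assert utf8_encode(0x41) == b"A"
    assert utf8_encode(0xFC) == b"\xc3\xbc" == "ü".encode("utf-8")
    assert utf8_encode(0x4E16) == b"\xe4\xb8\x96" == "世".encode("utf-8")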


Why UTF-8 is Brilliant

The design choices Thompson and Pike made on that placemat were not just clever -- they were deeply informed by decades of systems programming experience.

ASCII compatibility. Every valid ASCII file is automatically a valid UTF-8 file. This meant the entire existing corpus of ASCII text (every configuration file, every script, every protocol) worked without modification. The migration cost was nearly zero for ASCII-only content.

Self-synchronization. If you land in the middle of a UTF-8 stream, you can find the start of the next character by looking for a byte that doesn't begin with 10. Continuation bytes always start with 10, leading bytes never do. You never need to scan backward to the beginning of the file.
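
A rough illustration of the idea in Python (the mask test below is just the "does this byte start with 10?" check described above):

    def next_boundary(data: bytes, pos: int) -> int:
        """Scan forward from an arbitrary offset to the next character boundary."""
        while pos < len(data) and (data[pos] & 0xC0) == 0x80:  # 10xxxxxx = continuation byte
            pos += 1
        return pos

    text = "naïve 世界".encode("utf-8")
    print(next_boundary(text, 3))   # offset 3 lands inside 'ï'; returns 4, the start of 'v'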

No null bytes. Except for the actual null character U+0000, no UTF-8 encoding contains a 0x00 byte. C string functions work without modification.

No byte-order issues. Unlike UTF-16 and UTF-32, which require a Byte Order Mark (BOM) to indicate endianness, UTF-8 is a byte-stream encoding. There's no endianness to worry about. The same bytes work on big-endian and little-endian machines.

Sorting order preserved. Byte-level sorting of UTF-8 strings produces the same order as sorting by Unicode code point values. This means existing byte-comparison tools and algorithms work correctly on UTF-8 text.
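
A quick check of that property in Python (Python compares strings by code point, so the two sort orders should match):

    words = ["世", "é", "B", "a"]
    by_code_point = sorted(words)                                   # code point order
    by_utf8_bytes = sorted(words, key=lambda w: w.encode("utf-8"))  # raw byte order
    print(by_code_point == by_utf8_bytes)   # True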

Error detection. Invalid byte sequences are immediately detectable, and overlong forms (the same character encoded with more bytes than necessary) are explicitly forbidden. There are no ambiguous encodings, no sequences that could be interpreted as either one character or two. This makes UTF-8 robust against corruption.

Space efficiency for Latin text. English and other ASCII-heavy languages use just 1 byte per character, the same as ASCII. Western European languages use 1-2 bytes. CJK characters use 3 bytes, which is comparable to their existing multi-byte encodings.


UTF-8 Takes Over the World

Despite its elegance, UTF-8's dominance wasn't immediate. In the 1990s, many systems bet on UTF-16 instead:

  • Java (1995) used UTF-16 internally for String and char
  • Windows NT (1993) adopted UCS-2, later UTF-16, as its internal encoding (the "W" APIs)
  • JavaScript (1995) used UTF-16 for strings
  • Python 2 had a muddled approach, with separate str (bytes) and unicode types

But on the web, in filesystems, in databases, in APIs, and in the broader Unix ecosystem, UTF-8 steadily won:

  • Linux adopted UTF-8 as the default locale encoding
  • macOS accepts and returns UTF-8 filenames at the filesystem API (HFS+ normalized them to a decomposed form; APFS preserves them as given)
  • HTML5 specification recommends UTF-8 and warns against other encodings
  • JSON requires UTF-8 (per RFC 8259)
  • TOML, YAML, Rust, Go, Swift all default to or require UTF-8

The numbers tell the story. According to W3Techs, UTF-8 was used by:

  • 28% of web pages in 2009
  • 60% in 2013
  • 80% in 2016
  • 98%+ by 2025

The encoding that two engineers designed on a placemat over dinner is now the de facto standard for virtually all text on the internet.


Lessons for Protocol Design

The success of UTF-8 offers timeless lessons for anyone designing protocols, file formats, or data representations:

Backward compatibility wins. UTF-8's ASCII compatibility eliminated the migration barrier. Systems didn't need to be rewritten to handle basic text. The upgrade path was gradual and painless.

Simplicity at the byte level matters. Self-synchronization, no null bytes, no endianness issues -- these properties made UTF-8 easy to implement correctly and hard to implement incorrectly. The long history of subtle bugs in UTF-16 surrogate-pair handling, compared with the relative simplicity of correct UTF-8 decoders, speaks volumes.

Design for the ecosystem, not just the specification. Thompson and Pike understood that the Unix ecosystem was built on byte streams and null-terminated strings. They designed UTF-8 to fit into that world rather than demanding the world change. Meet systems where they are.

The right design at the right time changes everything. UTF-8 wasn't the first Unicode encoding, and it wasn't the most "pure" representation. But it was the most practical, and practicality is what drives adoption.

Ken Thompson and Rob Pike didn't just design a character encoding that night in New Jersey. They solved one of computing's most persistent interoperability problems in a way so clean that the solution has lasted over thirty years and shows no signs of being replaced. The placemat is long gone, but the encoding lives on in every web page you visit, every API you call, and every emoji you send.


References

  1. Pike, R. & Thompson, K. (1993). "Hello World, or Καλημέρα κόσμε, or こんにちは世界." Proceedings of the USENIX Winter 1993 Technical Conference.
  2. Yergeau, F. (2003). RFC 3629: UTF-8, a transformation format of ISO 10646. Internet Engineering Task Force.
  3. The Unicode Consortium. (2024). The Unicode Standard, Version 16.0. https://www.unicode.org/versions/Unicode16.0.0/
  4. Pike, R. (2003). "UTF-8 History." https://www.cl.cam.ac.uk/~mgk25/ucs/utf-8-history.txt
  5. Kuhn, M. (2024). "UTF-8 and Unicode FAQ for Unix/Linux." https://www.cl.cam.ac.uk/~mgk25/unicode.html
  6. Davis, M. (2008). "Moving to Unicode 5.1." Google Official Blog.
  7. W3Techs. (2025). "Usage statistics of character encodings for websites." https://w3techs.com/technologies/overview/character_encoding