Why Every Emoji, Letter, and Symbol You Type Is Basically Just Math

Written by 0x23d11 | Published 2025/04/23
Tech Story Tags: unicode | utf-8 | computer-science | tutorial-for-beginners | programming-for-beginners | ascii-table | how-ascii-works-in-practice | how-unicode-works-in-practice

TL;DR: Explore UTF-8 character encoding, its history, and why it's essential for representing diverse languages and symbols in computer systems efficiently.

Hello! I'm so excited for this article. We're going to learn about an important concept: character encoding. It acts as a bridge between human language and computer language.

Let's say you and I speak different languages. I understand only numbers, and you understand written symbols. How can we talk to each other? This is exactly the challenge people faced when computers were first created.

Fundamentally, computers only understand numbers, specifically patterns of electrical signals that represent 0s (off) and 1s (on). Yet we interact with computers using human-readable symbols, characters, and letters.

Clearly, something acts as a bridge or translator, converting human-understandable symbols to binary numbers and vice-versa.

This process is called character encoding, which acts as the translator.

Ok, wait: before we can understand character encoding, we first need to understand what a character is, so let's do that.

The Foundations of Human Communication

Before explaining encoding, let's first understand what we're encoding.

A "character" is the smallest meaningful unit in a written system. In English, characters include the letters A through Z, digits 0-9, punctuation marks like periods and commas, and special symbols like @ and #.

We need to realize that different writing systems have different characters:

  • Latin script has 26 basic letters
  • Hindi (Devanagari script) has 47 primary characters (33 consonants and 14 vowels) plus various modifiers and conjunct forms
  • Chinese has tens of thousands of unique characters
  • Russian Cyrillic has 33 letters
  • Japanese uses thousands of kanji characters plus hiragana and katakana syllabaries

One more thing to note: we humans make sense of these symbols visually. We see an "A" or any other symbol and immediately recognize it, but that's not the case with computers because they can't "see" or understand these symbols directly.

Ok, now that we know what we mean by a "character," let's focus on the word "encoding." Let's start by understanding why we even need encoding at all.

Why Encoding is Necessary

As we know, computers work using electricity. Specifically, they control the flow of electrical current through millions of tiny switches called transistors. These switches can be in one of two states: on or off, which we represent as 1 or 0 respectively.

Just as we humans use different writing systems like English and Hindi, computers have a system of their own: a binary (two-state) system.

Computers can only store and process patterns or sequences of 0s and 1s. They can't directly store the curved line of an "S" or the three horizontal strokes of an "E." They can only store numbers.

This creates a huge problem: how do we represent human text using only numbers that computers can process? This is exactly what character encoding solves.

Ok, so we've established that character encoding is necessary because computers only understand binary (0s and 1s), while humans communicate with complex visual symbols. We need a translation system between these two languages. Now we can look at the encodings that were invented to solve this problem.

Morse Code: The First Practical Character Encoding

Back in the 1830s and 1840s, people were facing the same challenge when trying to send messages over telegraph wires. The telegraph could only transmit pulses—either the circuit was connected (on) or disconnected (off).

Samuel Morse and Alfred Vail developed Morse code to solve this problem. They created a system where each letter of the alphabet was represented by a unique combination of short pulses (dots) and long pulses (dashes):

  • A: ·− (dot-dash)
  • B: −··· (dash-dot-dot-dot)
  • E: · (a single dot)
  • S: ··· (three dots)

This was essentially a character encoding system—it translated human-readable characters into patterns that could be transmitted electronically and then decoded back into letters by the receiver.

This actually influenced computer encoding systems.
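If you want to see the idea in code, here is a tiny sketch in Python (the language used for the examples in this article). The lookup table below covers only a handful of letters; a real Morse table has all 26 letters plus digits and punctuation:

    # A tiny subset of the Morse table, just enough for a demo.
    MORSE = {"A": ".-", "B": "-...", "E": ".", "O": "---", "S": "..."}

    def to_morse(text):
        """Encode a string one letter at a time, separating letters with spaces."""
        return " ".join(MORSE[letter] for letter in text.upper())

    print(to_morse("SOS"))  # ... --- ...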

ASCII: The First Universal Standard

ASCII (American Standard Code for Information Interchange) was created in 1963 and was revolutionary because it became the first widely adopted standard for character encoding between different computers.

ASCII used 7 bits per character, allowing for $2^7 = 128$ different characters. Let's see some common characters and their respective binary sequences:

  • 'A' = 65 (binary: 1000001)
  • 'B' = 66 (binary: 1000010)
  • 'a' = 97 (binary: 1100001)
  • '1' = 49 (binary: 0110001)
  • Space = 32 (binary: 0100000)

The design of ASCII was quite thoughtful:

  • Control characters were assigned values 0-31
  • Punctuation marks, the space, and the digits 0-9 got values 32-64
  • Uppercase letters ran from 65-90
  • Lowercase letters ran from 97-122

Notice the pattern: the lowercase 'a' (97) is exactly 32 greater than uppercase 'A' (65). This pattern holds for all letters, making it easy to convert between uppercase and lowercase by simply adding or subtracting 32.
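You can check this pattern yourself with a couple of lines of Python, where ord() returns a character's numeric code and chr() turns a code back into a character:

    # ord() gives the numeric code of a character; chr() does the reverse.
    print(ord("A"), ord("a"))      # 65 97
    print(ord("a") - ord("A"))     # 32

    # Convert a lowercase letter to uppercase by subtracting 32.
    print(chr(ord("g") - 32))      # G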

Okay, we understand how ASCII helps us represent characters as binary sequences, but let's take an example to understand the entire flow of how this conversion happens.

How ASCII Works in Practice

When you type the word "Hello" on a keyboard using an ASCII-based system:

  1. The computer registers that you pressed the 'H' key.
  2. It looks up the ASCII code for 'H', which is 72.
  3. It stores the number 72 in binary: 1001000.
  4. It continues for each letter: 'e' (101), 'l' (108), 'l' (108), 'o' (111).
  5. The complete word "Hello" becomes the sequence: 72, 101, 108, 108, 111.
  6. In binary: 1001000, 1100101, 1101100, 1101100, 1101111.

These binary numbers would be stored in computer memory, transmitted over networks, or saved to files. When another computer needs to display this text, it does the reverse process, converting each number back to its corresponding character.
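Here is the same flow reproduced in Python, printing each character of "Hello" as its decimal code and as a 7-bit binary number:

    word = "Hello"

    for ch in word:
        code = ord(ch)                        # e.g. 72 for 'H'
        print(ch, code, format(code, "07b"))

    # H 72 1001000
    # e 101 1100101
    # l 108 1101100
    # l 108 1101100
    # o 111 1101111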

Ok, so we understood that ASCII created the first standardized character encoding, allowing different computers to exchange text information. It assigned a unique number (0-127) to each English character, punctuation mark, and control symbol.

But there's a big problem with ASCII. Can you guess what it is?

ASCII Extended: The First Limitations

ASCII was great, but its 128-character limit became a serious problem as computing spread globally. It couldn't represent characters with accents (like é or ñ), much less non-Latin scripts like Cyrillic, Greek, Arabic, or Asian writing systems. To address this limitation, ASCII Extended was developed.

Since computers were increasingly built to handle data in 8-bit bytes (which can store values from 0-255), it was natural to extend ASCII by using the 8th bit. This allowed for an additional 128 characters (values 128-255).

Because of this, different countries and regions created their own extensions, resulting in a collection of incompatible encoding systems:

  • Code page 437: The original IBM PC character set with box-drawing characters and some European letters
  • ISO 8859-1 (Latin-1): Western European languages
  • ISO 8859-2: Central and Eastern European languages
  • ISO 8859-5: Cyrillic script
  • Windows-1252: Microsoft's slightly modified version of Latin-1

Each of these encodings used the same values (0-127) for standard ASCII characters but assigned different characters to the extended values (128-255).
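You can watch the incompatibility in action with a small Python sketch: the single byte value 233 is valid in all three encodings below, but each one maps it to a completely different character (try it and see what you get):

    # One byte, three different interpretations.
    raw = bytes([233])  # the byte 0xE9

    for encoding in ("latin-1", "cp437", "iso8859-5"):
        print(encoding, "->", raw.decode(encoding))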

This fragmentation created another problem, known as "mojibake."

The Problem of Mojibake

When text encoded in one system was interpreted using another encoding, the result was garbled text known as "mojibake" (a Japanese term meaning "character transformation"). Imagine sending a message in one secret code, but the person receiving it thinks it's in a completely different code!

For example, if the German word "Grüße" (greetings) was encoded using ISO 8859-1 and then viewed on a system using ISO 8859-5 (Cyrillic), the "ü" and "ß" would appear as completely different, likely nonsensical, characters. This is because the same number (in the range 128-255) represents different symbols in each encoding.
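You can reproduce this kind of mojibake yourself in a few lines of Python (the exact garbled output depends on which pair of encodings you pick):

    text = "Grüße"

    # Encode with one legacy encoding...
    stored = text.encode("latin-1")

    # ...then decode with another. Same bytes, wrong interpretation.
    print(stored.decode("iso8859-5"))  # 'ü' and 'ß' come back as unrelated Cyrillic letters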

This was especially problematic for:

  • Email messages sent between different countries
  • Websites viewed on computers with different language settings
  • Documents shared between different operating systems
  • Software used in international contexts

So, Extended ASCII attempted to solve the limitation of representing non-English characters, but it created fragmentation, with different countries adopting incompatible standards. This led to text displaying incorrectly when viewed with the wrong encoding system.

Unicode: A Character Set for All Languages

To fix the issue of Extended ASCII, engineers from Apple and Xerox started to work on a new approach. Their goal was to create a single character encoding standard that could represent every character from every writing system ever used by humans.

What is Unicode

Unicode isn't actually an encoding system; it's a character set that assigns a unique identification number, known as a code point, to every character.

These code points are generally written with a prefix of "U+" followed by a hexadecimal number:

  • Latin capital letter A: U+0041
  • Greek capital letter alpha (Α): U+0391
  • Hebrew letter alef (א): U+05D0
  • Arabic letter alef (ا): U+0627
  • Devanagari letter A (अ): U+0905
  • Chinese character for "person" (人): U+4EBA
  • Emoji grinning face (😀): U+1F600

You can browse the full list of code points in the official Unicode code charts.
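In Python, ord() and chr() work directly with these code points, so you can verify the list above yourself:

    # ord() returns the Unicode code point; hex() shows it the way the charts do.
    for ch in ["A", "Α", "א", "अ", "人", "😀"]:
        print(ch, hex(ord(ch)))   # 0x41, 0x391, 0x5d0, 0x905, 0x4eba, 0x1f600

    # chr() goes the other way: from code point back to character.
    print(chr(0x1F600))  # 😀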

Initially, Unicode used 16 bits per character, which allowed for 65,536 different characters. However, it soon became clear that 16 bits wouldn't be enough for all the world's writing systems.

It has since expanded: code points now range from U+0000 to U+10FFFF, which allows for over 1.1 million possible code points.

Unicode contains about 150,000 characters covering 150 modern and historic writing systems, plus symbols, emojis, and other special characters. So, it's safe to say that we probably won't run out of characters anytime soon!

Unicode solved the fragmentation problem by creating a single universal character set that could represent all human writing systems. It assigns each character a unique code point, regardless of the language or script it comes from.

Unicode Transformation Formats (UTFs)

You need to remember that computers can only store data in 0s and 1s. While Unicode provides a way to represent characters for multiple writing systems, it doesn't explain how we store these code points in computer memory or files.

To solve this, several encoding systems for Unicode were created. These encoding methods define how to convert Unicode code points into binary sequences that computers can store and process.

UTF-32: Simple but Inefficient

The simplest approach to encoding Unicode is UTF-32, which uses exactly 4 bytes (32 bits) for every character. This makes processing simple—each character takes the same amount of space, and you can easily jump to the nth character in a string by multiplying n by 4.

But, UTF-32 is very inefficient for most text. Since the vast majority of commonly used characters have code points that fit in 2 bytes or even 1 byte, using 4 bytes for everything wastes a lot of space. It's like using a huge truck to deliver a single letter!
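You can see the waste directly by asking Python to encode a short ASCII string as UTF-32 (the "utf-32-be" variant is used here so no extra byte-order mark is added):

    text = "Hi"

    encoded = text.encode("utf-32-be")  # 4 bytes per character, big-endian
    print(len(text), "characters ->", len(encoded), "bytes")  # 2 characters -> 8 bytes
    print(encoded.hex(" "))             # 00 00 00 48 00 00 00 69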

UTF-16: A Compromise

UTF-16 tries to balance simplicity and efficiency by using 2 bytes (16 bits) for the most common characters (those in the Basic Multilingual Plane, with code points up to U+FFFF), and 4 bytes for the less common characters.

This was the encoding used by early implementations of Unicode, including Windows NT, Java, and JavaScript. But it still had drawbacks—it wasn't compatible with ASCII, and it had complications related to "byte order" (whether the most significant byte comes first or last).
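A quick sketch shows both behaviors: a character inside the Basic Multilingual Plane takes 2 bytes, an emoji outside it takes 4 (a surrogate pair), and the generic "utf-16" codec prepends a byte-order mark so the decoder knows which byte comes first:

    # A BMP character fits in 2 bytes; an emoji needs a 4-byte surrogate pair.
    print(len("A".encode("utf-16-be")))   # 2
    print(len("😀".encode("utf-16-be")))  # 4

    # The generic "utf-16" codec adds a 2-byte byte-order mark (BOM) up front.
    print(len("A".encode("utf-16")))      # 4 (2-byte BOM + 2-byte character)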

UTF-8: The Elegant Solution

UTF-8, designed by Ken Thompson and Rob Pike in 1992, has become the dominant encoding on the modern web and in many operating systems. Its design is remarkably elegant:

  1. It uses a variable number of bytes per character:

    • 1 byte for code points 0-127 (ASCII characters)

    • 2 bytes for code points 128-2047 (most Latin-script languages, Greek, Cyrillic, Hebrew, Arabic, etc.)

    • 3 bytes for code points 2048-65535 (most Chinese, Japanese, and Korean characters)

    • 4 bytes for code points above 65535 (rare characters, historical scripts, emojis)

  2. The bytes are structured to make error detection and synchronization possible:

    • Single-byte characters start with a 0 bit: 0xxxxxxx
    • The first byte of a multi-byte sequence indicates its length with the number of leading 1 bits, followed by a 0: 110xxxxx for 2 bytes, 1110xxxx for 3 bytes, etc.
    • Continuation bytes always start with the pattern 10xxxxxx

  3. It's backward compatible with ASCII—any ASCII text is already valid UTF-8, without any changes. This is a huge advantage!
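You can verify the variable-length behavior directly, since Python's str.encode defaults to UTF-8:

    # One character from each byte-length bucket.
    for ch in ["A", "é", "人", "😀"]:
        encoded = ch.encode("utf-8")
        print(ch, len(encoded), "byte(s):", encoded.hex(" "))

    # A 1 byte(s): 41
    # é 2 byte(s): c3 a9
    # 人 3 byte(s): e4 ba ba
    # 😀 4 byte(s): f0 9f 98 80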

Let's break down how the character "é" (Latin small letter e with acute accent), with Unicode code point U+00E9, is encoded in UTF-8:

  1. First, we convert the code point to binary: U+00E9 = 11101001

  2. Since this value (233 in decimal) is greater than 127, we need more than one byte. The value fits within the range of 128-2047, so we need 2 bytes.

  3. The 2-byte pattern in UTF-8 is:

     110xxxxx 10xxxxxx
    

    where the x's will be replaced by our actual bits.

  4. We need to fit our code point into the 11 available x positions. Padded out to 11 bits, U+00E9 is 00011101001. Working from right to left:

    • First, we split our binary value: 00011101001 becomes 00011 (the top 5 bits) and 101001 (the bottom 6 bits)
    • The last 6 bits (101001) go into the second byte's xxxxxx positions: 10101001
    • The remaining 5 bits (00011) are placed in the first byte's xxxxx positions: 11000011
  5. So, the UTF-8 encoding of "é" is the two bytes: 11000011 10101001

    • In hexadecimal, that's C3 A9.
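A few lines of Python confirm the hand calculation; the same bit-shifting can be written out explicitly (this sketch handles only the 2-byte case):

    code_point = ord("é")                          # 0xE9 = 233

    # Manual 2-byte UTF-8 encoding: 110xxxxx 10xxxxxx
    first = 0b11000000 | (code_point >> 6)         # top 5 payload bits
    second = 0b10000000 | (code_point & 0b111111)  # bottom 6 payload bits
    print(hex(first), hex(second))                 # 0xc3 0xa9

    # Python's built-in encoder agrees.
    print("é".encode("utf-8").hex(" "))            # c3 a9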

UTF-8 has become the dominant encoding method for Unicode because of its elegant design. It uses a variable number of bytes, is backward compatible with ASCII, and efficiently represents characters from all languages.

How Text Gets from Keyboard to Screen

Let's trace the journey of character encoding through a simple example—typing the letter 'A' on your keyboard and seeing it appear on screen:

  1. Input: You press the 'A' key on your keyboard.
  2. Keyboard controller: Sends a scan code to the computer.
  3. Operating system: Translates the scan code to a character based on your keyboard layout.
  4. Text processing: The application receives this as the character 'A'.
  5. Unicode mapping: The application maps 'A' to Unicode code point U+0041.
  6. Encoding: If the text needs to be stored or transmitted, it's encoded (likely as UTF-8).
  7. Storage: The encoded bytes are written to memory or disk.
  8. Rendering: When displayed, the process reverses—the bytes are read, decoded back to the code point U+0041, and rendered as the glyph 'A' using a font.
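Steps 5 through 8 are easy to mimic in code; here is a minimal sketch of the software side of the journey (the keyboard and font stages are hardware and rendering concerns):

    # Step 5: character -> Unicode code point
    ch = "A"
    code_point = ord(ch)            # 65, i.e. U+0041

    # Steps 6-7: code point -> bytes that can be stored or transmitted
    stored = ch.encode("utf-8")     # b'A' (1 byte, because 'A' is an ASCII character)

    # Step 8: bytes -> character again, ready to be drawn with a font
    print(stored.decode("utf-8"))   # A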

Conclusion: The Evolution of Character Encoding

The journey from ASCII to Unicode and UTF is really fascinating. As technology spread worldwide, the need to represent diverse writing systems became crucial.

ASCII served us well for English text, but its limitations became apparent as computing went global. Extended ASCII attempted to address this but created fragmentation with incompatible standards.

Unicode solved this fragmentation by creating a universal character set that could represent all human writing systems. The UTF encoding formats, particularly UTF-8, provided efficient ways to implement Unicode in actual computer systems.

UTF-8 has become the dominant encoding standard because:

  • It's backward compatible with ASCII
  • It efficiently represents characters from all languages
  • It uses a variable number of bytes, saving storage space
  • It's designed for error detection and synchronization

Today, character encoding continues to evolve as new symbols and writing systems are added to Unicode. The next time you type in any language or use an emoji, remember the complex system of character encoding that makes it possible for computers to understand these human symbols.


Written by 0x23d11 | I love software engineering and I'll share in-depth computer science concepts
Published by HackerNoon on 2025/04/23