Tokenization is the gateway through which raw text transforms into a format usable by large language models (LLMs) like GPT. It acts as the bridge between human-readable content and numerical data that models process. Before a model can understand or generate coherent text, it must break input into smaller units called tokens.
In GPT architectures, tokenization is fundamental to the model's performance and capabilities. Tokens can represent words, subwords, characters, or even special symbols, and they are the basic building blocks the model processes. The way text is tokenized directly impacts how efficiently the model handles data, how much information fits within its context window, and the quality of the responses it generates.
The context window is the maximum number of tokens the model can process in a single operation, including both the input and the generated output. For instance, a model with a 32,000-token context window must fit everything—your input text, system instructions, and the model's response—within this limit. Efficient tokenization reduces the number of tokens required to represent a given text, enabling you to include more content or get longer, richer outputs without exceeding the limit. Poor tokenization, on the other hand, can inflate token counts unnecessarily, wasting valuable space in the context window and limiting the model's usability for longer tasks.
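As a concrete sketch of this budgeting, the snippet below counts a prompt's tokens and estimates how much of an assumed 32,000-token window remains for the output. It uses the tiktoken library discussed later; the window size and encoding name here are illustrative assumptions, since the real limits depend on the model you use.
import tiktoken

# Assumed values for illustration; actual limits depend on the model.
CONTEXT_WINDOW = 32_000
encoding = tiktoken.get_encoding("cl100k_base")

prompt = "Summarize the attached report and list three key risks."
prompt_tokens = len(encoding.encode(prompt))

# Whatever is left after the prompt is the budget for the model's response.
remaining_for_output = CONTEXT_WINDOW - prompt_tokens
print(f"Prompt uses {prompt_tokens} tokens; up to {remaining_for_output} remain for the output.")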
Once text is tokenized, each token is converted into a numerical embedding—a mathematical representation in a high-dimensional space (often hundreds or thousands of dimensions). This embedding captures the meaning and relationships of the token in the context of the entire vocabulary. For example, tokens for similar words like "run" and "running" will be placed closer in this space than unrelated tokens like "run" and "table." These embeddings enable the model to understand the sequence of tokens and predict the most likely next token during text generation. This process is what allows GPT to produce coherent, contextually relevant outputs, whether responding to a query, completing a sentence, or generating creative content.
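To make the idea of "closer in this space" concrete, here is a toy sketch of an embedding lookup. The vocabulary and vectors are invented for illustration and are not GPT's actual embeddings, which are learned during training and have far more dimensions; the point is only that related tokens end up with more similar vectors.
import math

# Toy three-dimensional embeddings (illustrative only).
toy_embeddings = {
    "run":     [0.9, 0.1, 0.3],
    "running": [0.85, 0.15, 0.35],
    "table":   [0.1, 0.8, 0.7],
}

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b))
    return dot / norm

# Related tokens sit closer together than unrelated ones.
print(cosine_similarity(toy_embeddings["run"], toy_embeddings["running"]))  # high
print(cosine_similarity(toy_embeddings["run"], toy_embeddings["table"]))    # lower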
In essence, tokenization is not just a preprocessing step—it is a critical enabler for GPT models to function efficiently and deliver high-quality results.
Tokenization is not a one-size-fits-all process; it varies depending on the predefined rules or algorithms used to break text into manageable units.
Here's a deeper look into how it works:
Splitting
This involves dividing text into smaller units such as words, subwords, or characters. Modern LLMs often rely on subword tokenization because it offers a balance between efficiency and robustness. This balance arises because subword tokenization can handle rare or unknown words by breaking them into smaller, more common components, while still encoding frequent words as single tokens.
Example:
Consider the word unhappiness. Using subword tokenization, it might break into un, happi, and ness. This approach ensures:
- un and ness are reused across different words, reducing vocabulary size.
- unhappiness can still be processed by breaking it into known subcomponents, avoiding out-of-vocabulary issues.
In practical terms, this allows models to generalize better across diverse text inputs without excessively bloating the vocabulary.
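You can inspect such a split with a real tokenizer. The minimal sketch below uses tiktoken's cl100k_base encoding; note that the actual pieces depend on the learned vocabulary and may not match the un / happi / ness illustration exactly.
import tiktoken

encoding = tiktoken.get_encoding("cl100k_base")
token_ids = encoding.encode("unhappiness")

# Decode each token individually to see the subword pieces (vocabulary-dependent).
pieces = [encoding.decode([token_id]) for token_id in token_ids]
print(pieces)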
Encoding
Encoding assigns a unique integer to each token based on a predefined vocabulary—a collection of all possible tokens that a model recognizes. In the context of GPT and similar models, the vocabulary is created during training and represents the set of subwords (or characters) that the model uses to understand and generate text.
For example, in GPT:
- hello might be a single token with an integer representation like 1356.
- micropaleontology might be broken into subword tokens like micro, paleo, and ntology, each with its own integer.
For ChatGPT, this means that when a user inputs text, the tokenizer maps the input string into a sequence of integers based on the model’s vocabulary. This sequence is then processed by the neural network. The vocabulary size impacts the model's memory usage and computational efficiency, striking a balance between handling complex language constructs and keeping the system performant.
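As a small sketch of this mapping, the snippet below encodes a common word and a rarer one and prints the vocabulary size. The cl100k_base encoding is assumed here, and the specific integers you see depend on the encoding you choose.
import tiktoken

encoding = tiktoken.get_encoding("cl100k_base")

print(encoding.encode("hello"))              # likely a single integer for a common word
print(encoding.encode("micropaleontology"))  # several integers for a rarer word
print(encoding.n_vocab)                      # size of this encoding's vocabulary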
Decoding
Decoding is the reverse process: converting a sequence of token integers back into human-readable text. For subword tokenization, this involves reassembling subwords into complete words where possible.
Example:
Suppose the model generates the tokens for un, happi, and ness. Decoding reconstructs this into unhappiness by concatenating the subwords. Proper handling of spaces ensures that un is not treated as a separate word.
This system allows subword-based models to efficiently generate text while maintaining the ability to represent rare or complex terms correctly.
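A quick round-trip sketch makes this concrete. It assumes the cl100k_base encoding; the individual pieces you see will vary by vocabulary, but the decoded sentence always matches the original because spaces are carried inside the tokens.
import tiktoken

encoding = tiktoken.get_encoding("cl100k_base")

token_ids = encoding.encode("Unhappiness is temporary.")
print([encoding.decode([t]) for t in token_ids])  # individual pieces (vocabulary-dependent)
print(encoding.decode(token_ids))                 # the original sentence, reassembled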
While the ChatGPT API handles tokenization automatically, developers use tiktoken directly to gain finer control over their applications. It allows for pre-checking and managing token limits, ensuring that input text and responses fit within the model's constraints. This is especially important for avoiding errors when working with long conversations or documents. Additionally, developers can optimize token usage to reduce API costs by trimming or summarizing inputs.
tiktoken also helps in debugging tokenization issues, providing transparency into how text is tokenized and decoded. For handling long inputs, tiktoken can split text into smaller chunks, enabling the processing of large documents in parts. Lastly, for advanced use cases, such as embedding or token-level manipulations, tiktoken ensures precise control over how tokens are generated and processed.
import openai
import tiktoken

openai.api_key = "your-api-key"

# Initialize tokenizer for GPT-4
encoding = tiktoken.get_encoding("cl100k_base")

# Function to count tokens
def count_tokens(text):
    return len(encoding.encode(text))

# Example input
user_input = "Explain the theory of relativity in detail with examples."
conversation_history = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "What is the theory of relativity?"}
]

# Combine inputs for token counting
conversation_text = "".join([msg["content"] for msg in conversation_history]) + user_input

# Pre-check input token limit (Use Case 1)
token_limit = 4096
if count_tokens(conversation_text) > token_limit:
    print("Trimming conversation to fit within token limit.")
    conversation_history = conversation_history[1:]  # Trim oldest message

# Optimize input by summarizing if too long (Use Case 2)
def summarize_if_needed(text, max_tokens=500):
    if count_tokens(text) > max_tokens:
        print("Input too long, summarizing...")
        response = openai.ChatCompletion.create(
            model="gpt-3.5-turbo",
            messages=[
                {"role": "system", "content": "Summarize the following text."},
                {"role": "user", "content": text}
            ],
            max_tokens=200
        )
        return response.choices[0].message["content"]
    return text

long_text = "A very long text input that exceeds the desired token limit ... (more text)"
optimized_text = summarize_if_needed(long_text, max_tokens=500)

# Debug tokenization (Use Case 3)
tokens = encoding.encode("OpenAI's ChatGPT is amazing!")
print("Tokenized:", tokens)
for token in tokens:
    print(f"Token ID: {token}, Token: '{encoding.decode([token])}'")

# Handle long documents by splitting into chunks (Use Case 4)
def split_into_chunks(text, chunk_size):
    tokens = encoding.encode(text)
    for i in range(0, len(tokens), chunk_size):
        yield encoding.decode(tokens[i:i + chunk_size])

document = "A very long document... (more text)"
chunks = list(split_into_chunks(document, chunk_size=1000))

# Process each chunk separately
responses = []
for chunk in chunks:
    response = openai.ChatCompletion.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": chunk}],
        max_tokens=300
    )
    responses.append(response.choices[0].message["content"])

full_response = " ".join(responses)

# Advanced token manipulation (Use Case 5)
custom_text = "Analyze the sentiment of this text."
tokens = encoding.encode(custom_text)
response = openai.ChatCompletion.create(
    model="gpt-3.5-turbo",
    messages=[{"role": "user", "content": encoding.decode(tokens)}],
    max_tokens=100
)
print("Final Response:", response.choices[0].message["content"])
This approach ensures both optimal performance and cost efficiency when building on the ChatGPT API.
tiktoken: GPT's Tokenizer
OpenAI's tiktoken library is designed to tokenize text efficiently while respecting the constraints of GPT models. Let's explore how it works.
Here's a Python example of how to tokenize text using tiktoken. I like to use https://colab.research.google.com/ for running my Python notebooks.
import tiktoken
# Choose a model-specific tokenizer (o200k_base is the encoding used by GPT-4o models)
encoding = tiktoken.get_encoding("o200k_base")
# Input text
text = "Tokenization is crucial for GPT models."
# Tokenize text
tokens = encoding.encode(text)
print("Tokens:", tokens)
# Decode back to text
decoded_text = encoding.decode(tokens)
print("Decoded Text:", decoded_text)
Tokens: [4421, 2860, 382, 19008, 395, 174803, 7015, 13]
Decoded Text: Tokenization is crucial for GPT models.
When using the ChatGPT APIs, understanding tokenization helps optimize cost, context window usage, and response quality.
Here’s a practical example for calculating token usage when querying the ChatGPT API:
import tiktoken
import openai

openai.api_key = "your-api-key"

def calculate_tokens(api_input, model="gpt-3.5-turbo"):
    encoding = tiktoken.encoding_for_model(model)
    tokens = encoding.encode(api_input)
    return len(tokens)

# Example API call with token usage calculation
api_input = "Explain the significance of tokenization in LLMs."
model = "gpt-4"
token_count = calculate_tokens(api_input, model)
print(f"Token count for input: {token_count}")

response = openai.ChatCompletion.create(
    model=model,
    messages=[{"role": "user", "content": api_input}]
)
print("API Response:", response['choices'][0]['message']['content'])
This code helps monitor token usage, which is crucial for cost and performance optimization.
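For example, a rough cost estimate can be derived directly from a token count. The per-token rate below is a placeholder chosen for illustration, not current OpenAI pricing, which varies by model and changes over time.
# Rough cost estimate from a token count (the rate below is a placeholder,
# not actual OpenAI pricing).
PRICE_PER_1K_INPUT_TOKENS = 0.0005  # hypothetical USD rate for illustration

def estimate_input_cost(token_count, price_per_1k=PRICE_PER_1K_INPUT_TOKENS):
    return token_count / 1000 * price_per_1k

print(f"1,200 input tokens ~= ${estimate_input_cost(1200):.4f}")  # 0.0006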
Understanding tokenization is essential for engineers building AI applications because it directly impacts how textual data is processed by language models. Tokenization involves breaking down raw text into smaller, meaningful units—such as words, subwords, or characters—which are the basic inputs for these models. This process allows developers to precisely manage input sizes, optimize cost by reducing unnecessary token usage, and improve model performance by ensuring that the text is segmented in a way that retains contextual meaning. Moreover, incorporating tokenization directly into client-side code can streamline operations by reducing data transmission overhead and latency, enabling more efficient caching and faster pre-processing. By mastering tokenization, engineers can build AI systems that are both robust and cost-effective, ultimately enhancing the responsiveness and scalability of their applications.
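As one sketch of what client-side tokenization can look like, the snippet below caches token counts so repeated or templated inputs are not re-tokenized before every request. It assumes the same tiktoken setup used earlier and is a minimal illustration rather than a full client.
from functools import lru_cache

import tiktoken

encoding = tiktoken.get_encoding("cl100k_base")

@lru_cache(maxsize=1024)
def cached_token_count(text: str) -> int:
    # Tokenize once per unique input; later calls for the same text hit the cache.
    return len(encoding.encode(text))

print(cached_token_count("What is the theory of relativity?"))
print(cached_token_count("What is the theory of relativity?"))  # served from cache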