Fundamentals and Storage
I have spent more than ten years as a backend developer optimizing MySQL databases, and I have watched many applications run into performance problems because developers did not understand primary keys. In this article, I will explain the fundamentals and internals of MySQL primary keys that determine application performance.
What Are Primary Keys, Really?
The simple definition of a primary key is one or more columns that uniquely identify each row in a table. But a primary key represents much more than that. A primary key in MySQL is:
- A unique identifier for each record
- Automatically indexed
- Never NULL
- Limited to one per table
- The foundation of your table's physical storage structure (this is crucial)
Creating a primary key does more than constrain your data: it determines how MySQL stores and retrieves that data on disk.
The Critical Role of Primary Keys in MySQL
The InnoDB storage engine has been MySQL's default since version 5.5. It uses the primary key for both logical access and physical data layout. This is called "clustered indexing", and it is a fundamental MySQL concept you must understand.
Within InnoDB, your table data is physically ordered on disk by primary key value. The performance implications of this arrangement are significant.
Consider this example:
CREATE TABLE customers
(
customer_id INT NOT NULL AUTO_INCREMENT,
name VARCHAR(100) NOT NULL,
email VARCHAR(100) NOT NULL,
PRIMARY KEY (customer_id)
);
When rows are inserted into this table, MySQL's storage engine places them on disk in customer_id order. This means:
- Records with adjacent primary key values are stored near each other on disk.
- Range queries on the primary key are extremely efficient.
- Secondary indexes store primary key values instead of row pointers.
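For example, a range scan over adjacent customer_id values reads contiguous leaf pages of the clustered index. A quick way to see this (exact plans depend on your data and MySQL version) is:

```sql
-- On a populated customers table, EXPLAIN typically shows
-- type: range and key: PRIMARY for this query, meaning the
-- rows are read in one pass over adjacent clustered-index pages.
EXPLAIN
SELECT customer_id, name
FROM customers
WHERE customer_id BETWEEN 1000 AND 2000;
```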
How InnoDB Stores Primary Keys Under the Hood
Let's dive deeper into how MySQL actually stores data.
B+Tree Storage Structure
All indexes in InnoDB, including the primary key, use a B+Tree data structure. A B+Tree is a balanced tree that supports fast lookups, sequential access, and efficient insertions and deletions.
For a primary key index:
- The internal nodes within the data structure store only the key values.
- The actual row data, together with key values, resides in leaf nodes
- Leaf nodes link together through a doubly-linked list structure, which enables efficient range scanning operations
A primary key lookup lets MySQL locate a row with minimal disk reads, typically O(log n) in the number of rows.
Page Structure
InnoDB stores data through pages, which typically measure 16 KB each. Each page can contain multiple rows, depending on the row size.
Because data is clustered on the primary key:
- Rows with nearby primary key values land in the same page.
- Data pages are stored in primary key order.
- This clustering makes range queries highly efficient.
Understanding this page-based storage is crucial for optimizing large tables. If your rows are 4 KB each, you'll fit about 4 rows per page. If they are only 100 bytes, you can fit around 160 rows per page, dramatically reducing the disk I/O needed to read the same number of rows.
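You can estimate rows per page for an existing table from the statistics in information_schema (these figures are approximations that ANALYZE TABLE refreshes; the 16384 constant assumes the default 16 KB page size):

```sql
-- Approximate rows per page: average row length vs. page size.
SELECT table_name,
       avg_row_length,                                  -- estimated bytes per row
       FLOOR(16384 / NULLIF(avg_row_length, 0)) AS approx_rows_per_page
FROM information_schema.tables
WHERE table_schema = DATABASE()
  AND table_name = 'customers';
```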
Hidden Primary Keys: What Happens When You Don't Specify One
Primary keys may look optional in MySQL, and syntactically they are: you can create a table without one. But for such tables, InnoDB generates a hidden 6-byte key, an internal "Row ID", to cluster the data on.
Consider this table:
CREATE TABLE notes
(
content TEXT,
created_at TIMESTAMP
);
When no primary key is defined, InnoDB silently generates this internal hidden key. That fallback has major disadvantages:
- Your application cannot access the hidden key directly.
- The internal key has no meaning in relation to your data.
- You lose control over the physical arrangement of your data.
Building tables without primary keys is a mistake I have seen many times. Such tables work at first, but they cause problems as they grow. Every table needs an explicitly defined primary key.
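If you are on MySQL 8.0.30 or later, you can ask the server to add a real (but invisible) auto-increment key to tables created without one, instead of relying on the internal row ID. A sketch, assuming the feature is available in your version:

```sql
-- MySQL 8.0.30+: generate an invisible primary key automatically
-- for tables created without an explicit one.
SET sql_generate_invisible_primary_key = ON;

CREATE TABLE notes
(
    content TEXT,
    created_at TIMESTAMP
);

-- SHOW CREATE TABLE notes should now include a column similar to:
--   my_row_id BIGINT UNSIGNED AUTO_INCREMENT INVISIBLE PRIMARY KEY
```

Unlike the internal row ID, this generated key is a real column you can query and reference.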
Auto-Increment Primary Keys: Benefits and Hidden Costs
Auto-increment primary keys appear in almost every MySQL database:
CREATE TABLE orders
(
order_id INT NOT NULL AUTO_INCREMENT,
customer_id INT NOT NULL,
order_date DATETIME NOT NULL,
PRIMARY KEY (order_id)
);
The advantages are clear:
- Automatically generated unique values
- Sequential values make inserts at the end of the table fast
- Small keys occupy less space in secondary indexes
- Avoids fragmentation from random inserts
But auto-increment keys also carry hidden costs that deserve attention.
Insertion Bottlenecks in High-Write Systems
Auto-increment primary keys can become a bottleneck in highly concurrent systems, because new rows are always inserted at the "end" of the table, turning that region into a point of contention. Before MySQL 8.0, the default auto-increment lock mode could hold a table-level lock for the duration of a statement, causing lock contention. MySQL 8.0's default interleaved mode releases the lock immediately after generating the value, yet it does not eliminate all contention.
The "Hot Spot" Problem
Because all new rows are inserted at the rightmost edge of the index, that area absorbs the bulk of the write I/O. Under many concurrent writes, this leads to page lock contention and reduced performance.
UUID Alternatives and Their Trade-offs
Some developers choose UUIDs as a substitute solution:
CREATE TABLE sessions
(
session_id CHAR(36) NOT NULL,
user_id INT NOT NULL,
created_at TIMESTAMP NOT NULL DEFAULT CURRENT_TIMESTAMP,
PRIMARY KEY (session_id)
);
UUIDs spread writes across the index, which reduces contention, but the random insertion points fragment the index and degrade range scan performance. UUIDs also inflate every secondary index, since each one stores a copy of the larger primary key.
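If you must key on UUIDs, MySQL 8.0's UUID_TO_BIN() can both shrink the key from 36 characters to 16 bytes and, with its second argument, rearrange a version-1 UUID's timestamp bits so that inserts become roughly sequential. A sketch using a BINARY(16) variant of the sessions table:

```sql
CREATE TABLE sessions
(
    session_id BINARY(16) NOT NULL,
    user_id INT NOT NULL,
    created_at TIMESTAMP NOT NULL DEFAULT CURRENT_TIMESTAMP,
    PRIMARY KEY (session_id)
);

-- UUID() returns a version-1 UUID; the second argument (1) swaps
-- the time-low and time-high parts so keys sort by creation time.
INSERT INTO sessions (session_id, user_id)
VALUES (UUID_TO_BIN(UUID(), 1), 42);

-- Convert back for display:
SELECT BIN_TO_UUID(session_id, 1) AS session_id, user_id
FROM sessions;
```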
Secondary Indexes and Optimization Strategies
We explored how MySQL's InnoDB storage engine physically organizes data based on primary keys. Now, let's dive deeper into how your primary key choice affects secondary indexes, query performance, and overall database efficiency.
The Crucial Relationship Between Primary Keys and Secondary Indexes
Your primary key design influences every secondary index in a MySQL table, making it a major performance factor. Database performance relies fundamentally on this relationship.
How Secondary Indexes Work in InnoDB
InnoDB stores secondary indexes with a different structure from the primary key index:
- The leaf level of the primary key index holds the actual row data.
- Secondary index leaves store the indexed columns plus the associated primary key value.
- Retrieving a full row through a secondary index therefore takes two lookups:
- First, find the entry in the secondary index.
- Then look up the complete row using the primary key value obtained from that entry.
This two-step process is called a "bookmark lookup", and that forms an essential part of performance optimization. Consider this example:
CREATE TABLE products
(
product_id INT NOT NULL AUTO_INCREMENT,
sku VARCHAR(50) NOT NULL,
name VARCHAR(100) NOT NULL,
price DECIMAL(10, 2) NOT NULL,
PRIMARY KEY (product_id),
INDEX idx_sku (sku)
);
When you query by SKU:
SELECT * FROM products WHERE sku = 'ABC123';
MySQL:
1. Searches idx_sku for 'ABC123'.
2. Retrieves the product_id value stored in that index entry.
3. Uses product_id to locate the entire row in the primary key index.
This is why your primary key choice affects every secondary index:
- The size of your primary key directly determines the size of your secondary indexes.
- The more secondary indexes you have, the more copies of the primary key you store.
- Every secondary index entry must contain the full primary key value.
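One practical consequence: if an index already contains every column a query needs (including the primary key that InnoDB implicitly appends), MySQL can skip the bookmark lookup entirely. A sketch against the products table above:

```sql
-- idx_sku implicitly stores product_id, so this query can be
-- answered from the secondary index alone (a "covering index").
-- EXPLAIN should show "Using index" in the Extra column.
EXPLAIN
SELECT product_id, sku
FROM products
WHERE sku = 'ABC123';

-- Selecting a non-indexed column forces the second lookup
-- into the clustered index:
EXPLAIN
SELECT product_id, sku, price
FROM products
WHERE sku = 'ABC123';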
The Hidden Cost of Large Primary Keys
Let's quantify this with an example. Imagine these two primary key choices:
-- Option 1: INT primary key (4 bytes)
CREATE TABLE orders
(
order_id INT NOT NULL AUTO_INCREMENT,
PRIMARY KEY (order_id),
...
);
-- Option 2: UUID primary key (16 bytes)
CREATE TABLE orders
(
order_id BINARY(16) NOT NULL,
PRIMARY KEY (order_id),
...
);
With 5 secondary indexes and 10 million rows:
- Each secondary index stores one primary key per row, so the 12-byte difference (16 bytes versus 4) adds roughly 120MB per index, or about 600MB across all five indexes.
- Every secondary index lookup must also compare 16-byte keys instead of 4-byte ones.
In high-performance systems, using integer primary keys instead of UUIDs can significantly shrink the database and speed up read operations.
Composite Primary Keys: When and How to Use Them
A composite primary key consists of multiple columns that together uniquely identify each row. These keys are powerful but need careful design.
Anatomy of a Composite Primary Key
CREATE TABLE order_items
(
order_id INT NOT NULL,
line_number INT NOT NULL,
product_id INT NOT NULL,
quantity INT NOT NULL,
price DECIMAL(10, 2) NOT NULL,
PRIMARY KEY (order_id, line_number)
);
In this example, the combination of order_id and line_number serves as the primary key. This has several implications:
- Rows are physically sorted first by order_id, then by line_number.
- All items within a single order are stored together.
- Range queries on order_id are extremely fast.
- Queries that filter only on line_number cannot use the primary key efficiently.
Natural Ordering Benefits
Composite primary keys, designed properly, offer built-in ordering benefits. In time-series data, for example:
CREATE TABLE temperature_readings
(
sensor_id INT NOT NULL,
timestamp DATETIME NOT NULL,
temperature DECIMAL(5, 2) NOT NULL,
PRIMARY KEY (sensor_id, timestamp)
);
This structure means:
- All measurements from a single sensor are stored together.
- Within each sensor's data, rows are physically sorted by time.
- Retrieving time ranges for a specific sensor is highly efficient.
Column Order Matters
The order of columns in a composite primary key is critical. InnoDB follows the "leftmost prefix rule": an index can be used efficiently only when the query filters on a leftmost prefix of its columns.
- Queries that filter on sensor_id can use the index efficiently.
- Queries that filter on both sensor_id and timestamp are highly efficient.
- Queries that filter only on timestamp cannot use the primary key efficiently.
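You can verify the leftmost-prefix behavior with EXPLAIN against the temperature_readings table above (exact plans vary with data distribution and version):

```sql
-- Filters on the leftmost column: can use the primary key.
EXPLAIN
SELECT * FROM temperature_readings
WHERE sensor_id = 7
  AND timestamp >= '2024-01-01';

-- No sensor_id filter: the primary key cannot be used efficiently,
-- so expect a full scan (type: ALL or index).
EXPLAIN
SELECT * FROM temperature_readings
WHERE timestamp >= '2024-01-01';
```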
In one optimization project, reorganizing the primary key alone cut query times from seconds to milliseconds without modifying any other component.
Surrogate vs. Natural Keys: Making the Right Choice
A fundamental design choice in database development is whether to use surrogate keys or natural keys:
- Surrogate keys are artificial identifiers, such as auto-increment IDs, that exist solely to identify records.
- Natural keys are inherent attributes of the data that uniquely identify records.
The Case for Surrogate Keys
CREATE TABLE countries
(
country_id INT NOT NULL AUTO_INCREMENT,
country_code CHAR(2) NOT NULL,
name VARCHAR(100) NOT NULL,
PRIMARY KEY (country_id),
UNIQUE INDEX (country_code)
);
Advantages:
- Compact (usually 4-8 bytes)
- Never need to change
- No business meaning to protect from change
- Consistent format across all tables
- Typically sequential, minimizing fragmentation
The Case for Natural Keys
CREATE TABLE countries
(
country_code CHAR(2) NOT NULL,
name VARCHAR(100) NOT NULL,
PRIMARY KEY (country_code)
);
Advantages:
- No additional storage overhead
- Enforces uniqueness directly on meaningful business data
- More meaningful in query results
When Each Makes Sense
After years of optimizing databases, I follow these guidelines.
Use surrogate keys when:
- The natural key candidates might change (for example, email addresses or phone numbers).
- Natural candidates are impractical (large text fields, for example).
- You need to maintain relationships even if identifying attributes change.
- Tables are frequently joined to many others
Consider natural keys when:
- Natural identifiers are effectively permanent (ISO codes, certain scientific IDs).
- Storage efficiency is paramount
- The natural key is frequently used in queries.
- The data has established industry-standard identifiers (such as ISBN for books).
Real-World Primary Key Optimization Strategies
Through my experience with many MySQL database optimizations, I have established these practical methods for different situations:
High-Write Workloads: Reducing Contention
Standard auto-increment keys create performance bottlenecks within systems that process many insertions. Options to consider:
1. Sharded Auto-Increment Keys
-- Give each application server a disjoint ID sequence using MySQL's
-- auto_increment_increment / auto_increment_offset system variables.
-- Server 1 generates: 1, 101, 201, ...
-- Server 2 generates: 2, 102, 202, ...
SET GLOBAL auto_increment_increment = 100;
SET GLOBAL auto_increment_offset = 1; -- 2 on server 2, and so on

INSERT INTO events (event_data) VALUES ('event data');
This reduces contention while maintaining most benefits of sequential IDs.
2. Time-Based Sorting with Composite Keys
CREATE TABLE events
(
day_partition DATE NOT NULL,
event_id BIGINT NOT NULL AUTO_INCREMENT,
event_data JSON,
PRIMARY KEY (day_partition, event_id),
UNIQUE KEY (event_id)
);
This approach:
- Distributes data physically by date via the leading key column
- Maintains chronological ordering within each day
- Reduces contention at the index's active end
Read-Heavy Analytics: Optimizing for Query Patterns
In data warehousing and analytics, a well-designed composite key yields significant performance improvements:
CREATE TABLE user_actions
(
user_id INT NOT NULL,
action_time DATETIME NOT NULL,
action_type TINYINT NOT NULL,
action_data JSON,
PRIMARY KEY (user_id, action_time)
);
Because all of a user's actions are stored together and presorted by time, user history queries are extremely efficient.
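A typical user-history query against this table reads one contiguous, presorted slice of the clustered index:

```sql
-- All rows for user 1001 are adjacent on disk and already
-- ordered by action_time, so no filesort is needed.
SELECT action_time, action_type, action_data
FROM user_actions
WHERE user_id = 1001
  AND action_time >= NOW() - INTERVAL 30 DAY
ORDER BY action_time;
```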
Time-Series Data: Special Considerations
For IoT sensors, monitoring systems, and other time-series workloads:
CREATE TABLE metrics
(
metric_type TINYINT NOT NULL,
device_id INT NOT NULL,
timestamp DATETIME(3) NOT NULL,
value FLOAT NOT NULL,
PRIMARY KEY (metric_type, device_id, timestamp)
);
This design:
- Groups related metrics together physically
- Orders data chronologically within each group
- Makes queries like "recent data for device X" very fast
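For instance, fetching the latest readings for one metric on one device becomes a short backward range scan over the clustered index:

```sql
-- Walks the (metric_type, device_id, timestamp) key in reverse;
-- only the leaf pages holding this device's data are touched.
SELECT timestamp, value
FROM metrics
WHERE metric_type = 3      -- e.g., temperature
  AND device_id = 42
ORDER BY timestamp DESC
LIMIT 100;
```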
Common Primary Key Anti-Patterns to Avoid
Over the years, I have seen several primary key anti-patterns that consistently cause performance problems.
1. Random UUID Primary Keys Without Proper Indexing
Random UUIDs cause:
- Fragmentation because of random write locations
- Poor range scan performance
When you must use UUIDs, prefer ordered variants such as ULID or UUID v6/v7, which preserve time-based ordering.
2. Multi-Column Natural Keys That Change
One system I encountered used (client_code, year, document_number) as the primary key for invoices. When a client code changed, the cascading updates touched millions of rows across multiple tables.
3. Using VARCHAR Primary Keys When a Numeric Type Would Suffice
String comparisons are slower than numeric comparisons. Converting VARCHAR customer codes to INT surrogate keys can dramatically improve join performance.
4. No Primary Key at All
Every table needs an explicit primary key. Left to its own devices, InnoDB generates a hidden one, which:
- Cannot be referenced by foreign keys
- Leaves the physical storage order outside your control
- Makes secondary indexes less efficient
Advanced Strategies and Refactoring
Primary Keys and Table Partitioning: Critical Considerations
Table partitioning divides large tables into smaller physical pieces that remain one logical table. Your primary key strategy must align exactly with your partitioning strategy.
The Partitioning Key Constraint
MySQL requires every unique key, including the primary key, to contain the partitioning key as one of its components. This is because unique constraints must be verifiable within a single partition.
-- This works because order_date is part of the primary key
CREATE TABLE orders
(
order_id INT NOT NULL,
order_date DATE NOT NULL,
customer_id INT NOT NULL,
amount DECIMAL(10, 2) NOT NULL,
PRIMARY KEY (order_date, order_id)
) PARTITION BY RANGE (TO_DAYS(order_date)) (
PARTITION p_2022_q1 VALUES LESS THAN (TO_DAYS('2022-04-01')),
PARTITION p_2022_q2 VALUES LESS THAN (TO_DAYS('2022-07-01')),
PARTITION p_2022_q3 VALUES LESS THAN (TO_DAYS('2022-10-01')),
PARTITION p_2022_q4 VALUES LESS THAN (TO_DAYS('2023-01-01'))
);
In this design:
- order_date appears in both the primary key and the partitioning expression.
- MySQL can route date-filtered queries directly to the relevant partitions.
- Within each partition, rows remain ordered by the composite key.
Optimizing Partition Pruning
During query execution, MySQL uses partition pruning to skip partitions that cannot contain matching rows. A well-designed primary key supports efficient pruning.
-- This query can use partition pruning
SELECT *
FROM orders
WHERE order_date BETWEEN '2022-04-01' AND '2022-06-30'
AND customer_id = 1001;
-- This query cannot use partition pruning
SELECT *
FROM orders
WHERE order_id = 5000;
The first query prunes three partitions and searches only p_2022_q2. The second must scan all partitions, because order_id alone does not identify which partition holds the row. Throughout my career, I have seen well-aligned partitioning schemes take analytical queries on large tables from minutes to sub-second.
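You can confirm pruning with EXPLAIN, which lists the partitions a query will touch (the partitions column appears by default in MySQL 5.7 and later):

```sql
-- Should list only p_2022_q2 in the "partitions" column:
EXPLAIN
SELECT * FROM orders
WHERE order_date BETWEEN '2022-04-01' AND '2022-06-30'
  AND customer_id = 1001;

-- Should list all four partitions:
EXPLAIN
SELECT * FROM orders
WHERE order_id = 5000;
```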
Refactoring Problematic Primary Key Designs
Database management professionals consider changing primary keys in production to be one of their most difficult tasks. Here are battle-tested approaches:
The Shadow Table Method
This method minimizes downtime while changing a primary key structure.
1. Create a new table with the desired primary key structure:
CREATE TABLE products_new
(
product_id INT NOT NULL AUTO_INCREMENT, -- New surrogate key
sku VARCHAR(50) NOT NULL, -- Old primary key
name VARCHAR(255) NOT NULL,
price DECIMAL(10, 2) NOT NULL,
PRIMARY KEY (product_id),
UNIQUE KEY (sku)
);
2. Populate the new table with existing data:
INSERT INTO products_new (sku, name, price)
SELECT sku, name, price FROM products;
3. Create triggers to keep the tables synchronized during the transition:
DELIMITER //
CREATE TRIGGER products_after_insert
AFTER INSERT
ON products
FOR EACH ROW
BEGIN
INSERT INTO products_new (sku, name, price)
VALUES (NEW.sku, NEW.name, NEW.price);
END //
CREATE TRIGGER products_after_update
AFTER UPDATE
ON products
FOR EACH ROW
BEGIN
UPDATE products_new
SET sku   = NEW.sku, -- handle sku changes too, not just other columns
    name  = NEW.name,
    price = NEW.price
WHERE sku = OLD.sku; -- match on the old key, in case sku changed
END //
CREATE TRIGGER products_after_delete
AFTER DELETE
ON products
FOR EACH ROW
BEGIN
DELETE
FROM products_new
WHERE sku = OLD.sku;
END //
DELIMITER ;
4. Update application code to use the new table structure
5. Once validation is complete, rename tables and drop triggers:
RENAME TABLE products TO products_old, products_new TO products;
DROP TRIGGER products_after_insert;
DROP TRIGGER products_after_update;
DROP TRIGGER products_after_delete;
6. Update dependent foreign keys and adjust application code as needed
This approach enables primary key modifications on active production systems without causing major disruptions.
Conclusion: Principles for Primary Key Design
Having covered MySQL primary keys from fundamentals to advanced techniques, here are the design principles that should guide your decisions:
- Start from your query patterns: the primary key should serve your most common access paths. (The same is true of all indexes.)
- Design your primary key to handle current requirements while preparing for future business expansion.
- Determine the priority between writing and reading performance in your application to make informed key design choices.
- Remember that your primary key dictates InnoDB's physical data layout.
- Test with realistic data volumes: primary key performance characteristics change fundamentally at scale.
The time I have spent carefully designing primary keys has paid off repeatedly as systems grew. A primary key is not just a constraint: it is the foundation that determines performance, scalability, and maintainability. I hope this article gives you the principles and techniques to create primary keys that work well now and accommodate future requirements.