Fundamentals and Storage
I have spent more than ten years as a backend developer optimizing MySQL databases, and I have watched many applications run into performance problems because developers did not understand primary keys. In this article, I will explain the fundamentals and internals of MySQL primary keys that determine application performance.
What Are Primary Keys, Really?
The simple definition of a primary key is one or more columns that uniquely identify each row in a table. But a primary key represents much more than that. A primary key in MySQL is:
- A unique identifier for each record
- Automatically indexed
- Never NULL
- Limited to one per table
- The foundation of your table's physical storage structure (this is crucial)
Creating a primary key does more than constrain your data: it determines how MySQL stores and retrieves that data on disk.
The Critical Role of Primary Keys in MySQL
The InnoDB storage engine has been MySQL's default since version 5.5. It uses the primary key for both logical access and physical data layout. This is called "clustered indexing", and it is a fundamental MySQL concept you must understand.
Within InnoDB, your table data is physically ordered on disk by primary key value. The performance implications of this arrangement are significant.
Consider this example:
CREATE TABLE customers
(
customer_id INT NOT NULL AUTO_INCREMENT,
name VARCHAR(100) NOT NULL,
email VARCHAR(100) NOT NULL,
PRIMARY KEY (customer_id)
);
When rows are inserted into this table, MySQL's storage engine places them on disk in customer_id order. This means:
- Records with adjacent primary key values are stored near each other on disk.
- Range queries on the primary key are extremely efficient.
- Secondary indexes store primary key values instead of row pointers.
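For example, a range scan over adjacent customer_id values reads contiguous leaf pages of the clustered index. A quick way to see this (exact plans depend on your data and MySQL version) is:

```sql
-- On a populated customers table, EXPLAIN typically shows
-- type: range and key: PRIMARY for this query, meaning the
-- rows are read in one pass over adjacent clustered-index pages.
EXPLAIN
SELECT customer_id, name
FROM customers
WHERE customer_id BETWEEN 1000 AND 2000;
```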
How InnoDB Stores Primary Keys Under the Hood
Let's dive deeper into how MySQL actually stores data.
B+Tree Storage Structure
All indexes in InnoDB, including the primary key, use a B+Tree data structure. A B+Tree is a balanced tree that supports fast lookups, sequential access, and efficient insertions and deletions.
For a primary key index:
- The internal nodes within the data structure store only the key values.
- The actual row data, together with key values, resides in leaf nodes
- Leaf nodes link together through a doubly-linked list structure, which enables efficient range scanning operations
A primary key lookup lets MySQL locate a row with minimal disk reads, typically O(log n) in the number of rows.
Page Structure
InnoDB stores data through pages, which typically measure 16 KB each. Each page can contain multiple rows, depending on the row size.
Because data is clustered on the primary key:
- Rows with nearby primary key values land in the same page.
- Data pages are stored in primary key order.
- This clustering makes range queries highly efficient.
Understanding this page-based storage is crucial for optimizing large tables. If your rows are 4 KB each, you'll fit about 4 rows per page. If they are only 100 bytes, you can fit around 160 rows per page, dramatically reducing the disk I/O needed to read the same number of rows.
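You can estimate rows per page for an existing table from the statistics in information_schema (these figures are approximations that ANALYZE TABLE refreshes; the 16384 constant assumes the default 16 KB page size):

```sql
-- Approximate rows per page: average row length vs. page size.
SELECT table_name,
       avg_row_length,                                  -- estimated bytes per row
       FLOOR(16384 / NULLIF(avg_row_length, 0)) AS approx_rows_per_page
FROM information_schema.tables
WHERE table_schema = DATABASE()
  AND table_name = 'customers';
```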
Hidden Primary Keys: What Happens When You Don't Specify One
Primary keys may look optional in MySQL, and syntactically they are: you can create a table without one. But for such tables, InnoDB generates a hidden 6-byte key, an internal "Row ID", to cluster the data on.
Consider this table:
CREATE TABLE notes
(
content TEXT,
created_at TIMESTAMP
);
When no primary key is defined, InnoDB silently generates this internal hidden key. That fallback has major disadvantages:
- Your application cannot access the hidden key directly.
- The internal key has no meaning in relation to your data.
- You lose control over the physical arrangement of your data.
Building tables without primary keys is a mistake I have seen many times. Such tables work at first, but they cause problems as they grow. Every table needs an explicitly defined primary key.
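If you are on MySQL 8.0.30 or later, you can ask the server to add a real (but invisible) auto-increment key to tables created without one, instead of relying on the internal row ID. A sketch, assuming the feature is available in your version:

```sql
-- MySQL 8.0.30+: generate an invisible primary key automatically
-- for tables created without an explicit one.
SET sql_generate_invisible_primary_key = ON;

CREATE TABLE notes
(
    content TEXT,
    created_at TIMESTAMP
);

-- SHOW CREATE TABLE notes should now include a column similar to:
--   my_row_id BIGINT UNSIGNED AUTO_INCREMENT INVISIBLE PRIMARY KEY
```

Unlike the internal row ID, this generated key is a real column you can query and reference.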
Auto-Increment Primary Keys: Benefits and Hidden Costs
Auto-increment primary keys appear in almost every MySQL database:
CREATE TABLE orders
(
order_id INT NOT NULL AUTO_INCREMENT,
customer_id INT NOT NULL,
order_date DATETIME NOT NULL,
PRIMARY KEY (order_id)
);
The advantages are clear:
- Automatically generated unique values
- Sequential values make inserts at the end of the table fast
- Small keys occupy less space in secondary indexes
- Avoids fragmentation from random inserts
But auto-increment keys also carry hidden costs that deserve attention.
Insertion Bottlenecks in High-Write Systems
Auto-increment primary keys can become a bottleneck in highly concurrent systems, because new rows are always inserted at the "end" of the table, turning that region into a point of contention. Before MySQL 8.0, the default auto-increment lock mode could hold a table-level lock for the duration of a statement, causing lock contention. MySQL 8.0's default interleaved mode releases the lock immediately after generating the value, yet it does not eliminate all contention.
The "Hot Spot" Problem
Because all new rows are inserted at the rightmost edge of the index, that area absorbs the bulk of the write I/O. Under many concurrent writes, this leads to page lock contention and reduced performance.
UUID Alternatives and Their Trade-offs
Some developers choose UUIDs as a substitute solution:
CREATE TABLE sessions
(
session_id CHAR(36) NOT NULL,
user_id INT NOT NULL,
created_at TIMESTAMP NOT NULL DEFAULT CURRENT_TIMESTAMP,
PRIMARY KEY (session_id)
);
UUIDs spread writes across the index, which reduces contention, but the random insertion points fragment the index and degrade range scan performance. UUIDs also inflate every secondary index, since each one stores a copy of the larger primary key.
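If you must key on UUIDs, MySQL 8.0's UUID_TO_BIN() can both shrink the key from 36 characters to 16 bytes and, with its second argument, rearrange a version-1 UUID's timestamp bits so that inserts become roughly sequential. A sketch using a BINARY(16) variant of the sessions table:

```sql
CREATE TABLE sessions
(
    session_id BINARY(16) NOT NULL,
    user_id INT NOT NULL,
    created_at TIMESTAMP NOT NULL DEFAULT CURRENT_TIMESTAMP,
    PRIMARY KEY (session_id)
);

-- UUID() returns a version-1 UUID; the second argument (1) swaps
-- the time-low and time-high parts so keys sort by creation time.
INSERT INTO sessions (session_id, user_id)
VALUES (UUID_TO_BIN(UUID(), 1), 42);

-- Convert back for display:
SELECT BIN_TO_UUID(session_id, 1) AS session_id, user_id
FROM sessions;
```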
Secondary Indexes and Optimization Strategies
We explored how MySQL's InnoDB storage engine physically organizes data based on primary keys. Now, let's dive deeper into how your primary key choice affects secondary indexes, query performance, and overall database efficiency.
The Crucial Relationship Between Primary Keys and Secondary Indexes
Your primary key design influences every secondary index in a MySQL table, making it a major performance factor. Database performance relies fundamentally on this relationship.
How Secondary Indexes Work in InnoDB
InnoDB stores secondary indexes with a different structure from the primary key index:
- The leaf level of the primary key index holds the actual row data.
- Secondary index leaves store the indexed columns plus the associated primary key value.
- Retrieving a full row through a secondary index therefore takes two lookups:
- First, find the entry in the secondary index.
- Then look up the complete row using the primary key value obtained from that entry.
This two-step process is called a "bookmark lookup", and that forms an essential part of performance optimization. Consider this example:
CREATE TABLE products
(
product_id INT NOT NULL AUTO_INCREMENT,
sku VARCHAR(50) NOT NULL,
name VARCHAR(100) NOT NULL,
price DECIMAL(10, 2) NOT NULL,
PRIMARY KEY (product_id),
INDEX idx_sku (sku)
);
When you query by SKU:
SELECT * FROM products WHERE sku = 'ABC123';
MySQL:
1. Searches idx_sku for 'ABC123'.
2. Retrieves the product_id value stored in that index entry.
3. Uses product_id to locate the entire row in the primary key index.
This is why your primary key choice affects every secondary index:
- The size of your primary key directly determines the size of your secondary indexes.
- The more secondary indexes you have, the more copies of the primary key you store.
- Every secondary index entry must contain the full primary key value.
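One practical consequence: if an index already contains every column a query needs (including the primary key that InnoDB implicitly appends), MySQL can skip the bookmark lookup entirely. A sketch against the products table above:

```sql
-- idx_sku implicitly stores product_id, so this query can be
-- answered from the secondary index alone (a "covering index").
-- EXPLAIN should show "Using index" in the Extra column.
EXPLAIN
SELECT product_id, sku
FROM products
WHERE sku = 'ABC123';

-- Selecting a non-indexed column forces the second lookup
-- into the clustered index:
EXPLAIN
SELECT product_id, sku, price
FROM products
WHERE sku = 'ABC123';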
The Hidden Cost of Large Primary Keys
Let's quantify this with an example. Imagine these two primary key choices:
-- Option 1: INT primary key (4 bytes)
CREATE TABLE orders
(
order_id INT NOT NULL AUTO_INCREMENT,
PRIMARY KEY (order_id),
...
);
-- Option 2: UUID primary key (16 bytes)
CREATE TABLE orders
(
order_id BINARY(16) NOT NULL,
PRIMARY KEY (order_id),
...
);
With 5 secondary indexes and 10 million rows:
- Each secondary index stores one primary key per row, so the 12-byte difference (16 bytes versus 4) adds roughly 120MB per index, or about 600MB across all five indexes.
- Every secondary index lookup must also compare 16-byte keys instead of 4-byte ones.
In high-performance systems, using integer primary keys instead of UUIDs can significantly shrink the database and speed up read operations.
Composite Primary Keys: When and How to Use Them
A composite primary key consists of multiple columns that together uniquely identify each row. These keys are powerful but need careful design.
Anatomy of a Composite Primary Key
CREATE TABLE order_items
(
order_id INT NOT NULL,
line_number INT NOT NULL,
product_id INT NOT NULL,
quantity INT NOT NULL,
price DECIMAL(10, 2) NOT NULL,
PRIMARY KEY (order_id, line_number)
);
In this example, the combination of order_id and line_number serves as the primary key. This has several implications:
- Rows are physically sorted first by order_id, then by line_number.
- All items within a single order are stored together.
- Range queries on order_id are extremely fast.
- Queries that filter only on line_number cannot use the primary key efficiently.
Natural Ordering Benefits
Composite primary keys, designed properly, offer built-in ordering benefits. In time-series data, for example:
CREATE TABLE temperature_readings
(
sensor_id INT NOT NULL,
timestamp DATETIME NOT NULL,
temperature DECIMAL(5, 2) NOT NULL,
PRIMARY KEY (sensor_id, timestamp)
);
This structure means:
- All measurements from a single sensor are stored together.
- Within each sensor's data, rows are physically sorted by time.
- Retrieving time ranges for a specific sensor is highly efficient.
Column Order Matters
The order of columns in a composite primary key is critical. InnoDB follows the "leftmost prefix rule": an index can be used efficiently only when the query filters on a leftmost prefix of its columns.
- Queries that filter on sensor_id can use the index efficiently.
- Queries that filter on both sensor_id and timestamp are highly efficient.
- Queries that filter only on timestamp cannot use the primary key efficiently.
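You can verify the leftmost-prefix behavior with EXPLAIN against the temperature_readings table above (exact plans vary with data distribution and version):

```sql
-- Filters on the leftmost column: can use the primary key.
EXPLAIN
SELECT * FROM temperature_readings
WHERE sensor_id = 7
  AND timestamp >= '2024-01-01';

-- No sensor_id filter: the primary key cannot be used efficiently,
-- so expect a full scan (type: ALL or index).
EXPLAIN
SELECT * FROM temperature_readings
WHERE timestamp >= '2024-01-01';
```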
In one optimization project, reorganizing the primary key alone cut query times from seconds to milliseconds without modifying any other component.
Surrogate vs. Natural Keys: Making the Right Choice
A fundamental design choice in database development is whether to use surrogate keys or natural keys:
- Surrogate keys are artificial identifiers, such as auto-increment IDs, that exist solely to identify records.
- Natural keys are inherent attributes of the data that uniquely identify records.
The Case for Surrogate Keys
CREATE TABLE countries
(
country_id INT NOT NULL AUTO_INCREMENT,
country_code CHAR(2) NOT NULL,
name VARCHAR(100) NOT NULL,
PRIMARY KEY (country_id),
UNIQUE INDEX (country_code)
);
Advantages:
- Compact (usually 4-8 bytes)
- Never need to change
- No business meaning to protect from change
- Consistent format across all tables
- Typically sequential, minimizing fragmentation
The Case for Natural Keys
CREATE TABLE countries
(
country_code CHAR(2) NOT NULL,
name VARCHAR(100) NOT NULL,
PRIMARY KEY (country_code)
);
Advantages:
- No additional storage overhead
- Enforces uniqueness directly on meaningful business data
- More meaningful in query results
When Each Makes Sense
After years of optimizing databases, I follow these guidelines.
Use surrogate keys when:
- The natural key candidates might change (for example, email addresses or phone numbers).
- Natural candidates are impractical (large text fields, for example).
- You need to maintain relationships even if identifying attributes change.
- Tables are frequently joined to many others
Consider natural keys when:
- Natural identifiers are effectively permanent (ISO codes, certain scientific IDs).
- Storage efficiency is paramount
- The natural key is frequently used in queries.
- The data has established industry-standard identifiers (such as ISBN for books).
Real-World Primary Key Optimization Strategies
Through my experience with many MySQL database optimizations, I have established these practical methods for different situations:
High-Write Workloads: Reducing Contention
Standard auto-increment keys create performance bottlenecks within systems that process many insertions. Options to consider:
1. Sharded Auto-Increment Keys
-- Give each application server a disjoint ID sequence using MySQL's
-- auto_increment_increment / auto_increment_offset system variables.
-- Server 1 generates: 1, 101, 201, ...
-- Server 2 generates: 2, 102, 202, ...
SET GLOBAL auto_increment_increment = 100;
SET GLOBAL auto_increment_offset = 1; -- 2 on server 2, and so on

INSERT INTO events (event_data) VALUES ('event data');
This reduces contention while maintaining most benefits of sequential IDs.
2. Time-Based Sorting with Composite Keys
CREATE TABLE events
(
day_partition DATE NOT NULL,
event_id BIGINT NOT NULL AUTO_INCREMENT,
event_data JSON,
PRIMARY KEY (day_partition, event_id),
UNIQUE KEY (event_id)
);
This approach:
- Distributes data physically by date via the leading key column
- Maintains chronological ordering within each day
- Reduces contention at the index's active end
Read-Heavy Analytics: Optimizing for Query Patterns
In data warehousing and analytics, a well-designed composite key yields significant performance improvements:
CREATE TABLE user_actions
(
user_id INT NOT NULL,
action_time DATETIME NOT NULL,
action_type TINYINT NOT NULL,
action_data JSON,
PRIMARY KEY (user_id, action_time)
);
Because all of a user's actions are stored together and presorted by time, user history queries are extremely efficient.
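A typical user-history query against this table reads one contiguous, presorted slice of the clustered index:

```sql
-- All rows for user 1001 are adjacent on disk and already
-- ordered by action_time, so no filesort is needed.
SELECT action_time, action_type, action_data
FROM user_actions
WHERE user_id = 1001
  AND action_time >= NOW() - INTERVAL 30 DAY
ORDER BY action_time;
```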
Time-Series Data: Special Considerations
For IoT sensors, monitoring systems, and other time-series workloads:
CREATE TABLE metrics
(
metric_type TINYINT NOT NULL,
device_id INT NOT NULL,
timestamp DATETIME(3) NOT NULL,
value FLOAT NOT NULL,
PRIMARY KEY (metric_type, device_id, timestamp)
);
This design:
- Groups related metrics together physically
- Orders data chronologically within each group
- Makes queries like "recent data for device X" very fast
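For instance, fetching the latest readings for one metric on one device becomes a short backward range scan over the clustered index:

```sql
-- Walks the (metric_type, device_id, timestamp) key in reverse;
-- only the leaf pages holding this device's data are touched.
SELECT timestamp, value
FROM metrics
WHERE metric_type = 3      -- e.g., temperature
  AND device_id = 42
ORDER BY timestamp DESC
LIMIT 100;
```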
Common Primary Key Anti-Patterns to Avoid
Over the years, I have seen several primary key anti-patterns that consistently cause performance problems.
1. Random UUID Primary Keys Without Proper Indexing
Random UUIDs cause:
- Fragmentation because of random write locations
- Poor range scan performance
When you must use UUIDs, prefer ordered variants such as ULID or UUID v6/v7, which preserve time-based ordering.
2. Multi-Column Natural Keys That Change
One system I encountered used (client_code, year, document_number) as the primary key for invoices. When a client code changed, the cascading updates touched millions of rows across multiple tables.
3. Using VARCHAR Primary Keys When a Numeric Type Would Suffice
String comparisons are slower than numeric comparisons. Converting VARCHAR customer codes to INT surrogate keys can dramatically improve join performance.
4. No Primary Key at All
Every table needs an explicit primary key. Left to its own devices, InnoDB generates a hidden one, which:
- Cannot be referenced by foreign keys
- Leaves the physical storage order outside your control
- Makes secondary indexes less efficient
Advanced Strategies and Refactoring
Primary Keys and Table Partitioning: Critical Considerations
Table partitioning divides large tables into smaller physical pieces that remain one logical table. Your primary key strategy must align exactly with your partitioning strategy.
The Partitioning Key Constraint
MySQL requires every unique key, including the primary key, to contain the partitioning key as one of its components. This is because unique constraints must be verifiable within a single partition.
-- This works because order_date is part of the primary key
CREATE TABLE orders
(
order_id INT NOT NULL,
order_date DATE NOT NULL,
customer_id INT NOT NULL,
amount DECIMAL(10, 2) NOT NULL,
PRIMARY KEY (order_date, order_id)
) PARTITION BY RANGE (TO_DAYS(order_date)) (
PARTITION p_2022_q1 VALUES LESS THAN (TO_DAYS('2022-04-01')),
PARTITION p_2022_q2 VALUES LESS THAN (TO_DAYS('2022-07-01')),
PARTITION p_2022_q3 VALUES LESS THAN (TO_DAYS('2022-10-01')),
PARTITION p_2022_q4 VALUES LESS THAN (TO_DAYS('2023-01-01'))
);
In this design:
- order_date appears in both the primary key and the partitioning expression.
- MySQL can route date-filtered queries directly to the relevant partitions.
- Within each partition, rows remain ordered by the composite key.
Optimizing Partition Pruning
During query execution, MySQL uses partition pruning to skip partitions that cannot contain matching rows. A well-designed primary key supports efficient pruning.
-- This query can use partition pruning
SELECT *
FROM orders
WHERE order_date BETWEEN '2022-04-01' AND '2022-06-30'
AND customer_id = 1001;
-- This query cannot use partition pruning
SELECT *
FROM orders
WHERE order_id = 5000;
The first query prunes three partitions and searches only p_2022_q2. The second must scan all partitions, because order_id alone does not identify which partition holds the row. Throughout my career, I have seen well-aligned partitioning schemes take analytical queries on large tables from minutes to sub-second.
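You can confirm pruning with EXPLAIN, which lists the partitions a query will touch (the partitions column appears by default in MySQL 5.7 and later):

```sql
-- Should list only p_2022_q2 in the "partitions" column:
EXPLAIN
SELECT * FROM orders
WHERE order_date BETWEEN '2022-04-01' AND '2022-06-30'
  AND customer_id = 1001;

-- Should list all four partitions:
EXPLAIN
SELECT * FROM orders
WHERE order_id = 5000;
```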
Refactoring Problematic Primary Key Designs
Database management professionals consider changing primary keys in production to be one of their most difficult tasks. Here are battle-tested approaches:
The Shadow Table Method
This method minimizes downtime while changing a primary key structure.
1. Create a new table with the desired primary key structure:
CREATE TABLE products_new
(
product_id INT NOT NULL AUTO_INCREMENT, -- New surrogate key
sku VARCHAR(50) NOT NULL, -- Old primary key
name VARCHAR(255) NOT NULL,
price DECIMAL(10, 2) NOT NULL,
PRIMARY KEY (product_id),
UNIQUE KEY (sku)
);
2. Populate the new table with existing data:
INSERT INTO products_new (sku, name, price)
SELECT sku, name, price FROM products;
3. Create triggers to keep the tables synchronized during the transition:
DELIMITER //
CREATE TRIGGER products_after_insert
AFTER INSERT
ON products
FOR EACH ROW
BEGIN
INSERT INTO products_new (sku, name, price)
VALUES (NEW.sku, NEW.name, NEW.price);
END //
CREATE TRIGGER products_after_update
AFTER UPDATE
ON products
FOR EACH ROW
BEGIN
UPDATE products_new
SET sku   = NEW.sku, -- handle sku changes too, not just other columns
    name  = NEW.name,
    price = NEW.price
WHERE sku = OLD.sku; -- match on the old key, in case sku changed
END //
CREATE TRIGGER products_after_delete
AFTER DELETE
ON products
FOR EACH ROW
BEGIN
DELETE
FROM products_new
WHERE sku = OLD.sku;
END //
DELIMITER ;
4. Update application code to use the new table structure
5. Once validation is complete, rename tables and drop triggers:
RENAME TABLE products TO products_old, products_new TO products;
DROP TRIGGER products_after_insert;
DROP TRIGGER products_after_update;
DROP TRIGGER products_after_delete;
6. Update dependent foreign keys and adjust application code as needed
This approach enables primary key modifications on active production systems without causing major disruptions.
Conclusion: Principles for Primary Key Design
Having covered MySQL primary keys from fundamentals to advanced techniques, here are the design principles that should guide your decisions:
- Start from your query patterns: the primary key should serve your most common access paths. (The same is true of all indexes.)
- Design your primary key to handle current requirements while preparing for future business expansion.
- Determine the priority between writing and reading performance in your application to make informed key design choices.
- Remember that your primary key dictates InnoDB's physical data layout.
- Test with realistic data volumes: primary key performance characteristics change fundamentally at scale.
The time I have spent carefully designing primary keys has paid off repeatedly as systems grew. A primary key is not just a constraint: it is the foundation that determines performance, scalability, and maintainability. I hope this article gives you the principles and techniques to create primary keys that work well now and accommodate future requirements.