In SQL, LIMIT and OFFSET allow you to retrieve just a portion of the rows generated by the rest of the query. This is useful when you need to display results page by page in a web application. It sounds good, but there is a major issue: as the number of skipped rows increases, queries become slower.
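Before digging into the problem, a quick refresher on the page arithmetic: for page p with s rows per page, the pattern is LIMIT s OFFSET (p - 1) * s. A minimal sketch (the items table here is just a placeholder):

-- With 10 rows per page, page 2 skips the first 10 rows
SELECT * FROM items ORDER BY id LIMIT 10 OFFSET 10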
Let's say we have a PostgreSQL table called tasks with 100 million rows in it.
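If you want to follow along, a table like the one below is enough to reproduce the experiments. The exact schema is an assumption on my part, since only the table name and row count are given:

-- The primary key creates the tasks_pkey index seen in the plans below
CREATE TABLE tasks (
    id    bigserial PRIMARY KEY,
    title text NOT NULL
);

-- 100 million synthetic rows; this takes a while
INSERT INTO tasks (title)
SELECT 'task #' || n FROM generate_series(1, 100000000) AS n;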
First, let's attempt to retrieve the second page of 10 tasks using the classic OFFSET method:
SELECT * FROM tasks ORDER BY id LIMIT 10 OFFSET 10
Now take a look at the query plan, which is the result of EXPLAIN <query>:
Limit  (cost=1.34..2.11 rows=10 width=628)
  ->  Index Scan using tasks_pkey on tasks  (cost=0.57..5439151.65 rows=100000000 width=628)
Before going forward, let's see what the PostgreSQL documentation has to say:
- Estimated start-up cost. This is the time expended before the output phase can begin, e.g., time to do the sorting in a sort node.
- Estimated total cost. This is stated on the assumption that the plan node is run to completion, i.e., all available rows are retrieved. In practice a node's parent node might stop short of reading all available rows.
- Estimated number of rows output by this plan node. Again, the node is assumed to be run to completion.
- Estimated average width of rows output by this plan node (in bytes).
We can then break down the query plan using this knowledge:
Limit: It returns 10 rows. Its start-up cost is estimated at 1.34 and its total cost at 2.11. Each result row is 628 bytes wide.
Index Scan: It uses an index, which is good in terms of performance. The start-up cost is 0.57, while the total cost of scanning the whole index would be 5,439,151.65. However, since the parent Limit node stops early, the scan never actually reads all 100 million rows, so the real cost stays close to the start-up cost; you can verify this with EXPLAIN ANALYZE, as shown below. The width is the same as in the Limit step.
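To see actual rather than estimated numbers, you can run EXPLAIN ANALYZE; the timings depend on your hardware, so I won't reproduce them here:

EXPLAIN ANALYZE SELECT * FROM tasks ORDER BY id LIMIT 10 OFFSET 10

In its output, the Index Scan node should report roughly 20 actual rows, the 10 skipped plus the 10 returned, rather than 100 million.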
So far, so good. However, let's not jump to conclusions.
Let's try skipping the first million rows, i.e., fetching the 100,001st page of 10 tasks, using the same approach:
SELECT * FROM tasks ORDER BY id LIMIT 10 OFFSET 1000000
Limit  (cost=77270.24..77271.02 rows=10 width=628)
  ->  Index Scan using tasks_pkey on tasks  (cost=0.57..5439162.17 rows=100000000 width=628)
As you can see, the Limit step has become much worse in this case: its start-up cost alone is 77,270.24, which is significant, while the Index Scan has remained almost the same. The more rows we have to skip with OFFSET, the heavier the query becomes. Why is that? The answer is in the documentation:
The rows skipped by an OFFSET clause still have to be computed inside the server; therefore a large OFFSET might be inefficient.
An alternative approach, often called keyset pagination, is to use "WHERE id > N" instead of OFFSET:
SELECT * FROM tasks WHERE id > 1000000 ORDER BY id LIMIT 10
Limit  (cost=0.57..1.36 rows=10 width=628)
  ->  Index Scan using tasks_pkey on tasks  (cost=0.57..5504715.15 rows=100000000 width=628)
        Index Cond: (id > 1000000)
As you can see, we now have an index condition (id > 1000000). Thanks to the B-tree index, the scan can start directly at the first row matching the condition instead of reading and discarding a million rows first, which is why the start-up cost drops back to 0.57. This condition improves the performance of the query by narrowing down the search space.
The main disadvantage of this approach is that you must carry the last ID of a page between requests, and you can no longer jump straight to an arbitrary page. This requirement could be a deciding factor against using this approach.
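To make the statefulness concrete, here is a minimal sketch of how consecutive pages chain together (the literal 10 stands in for whatever the last ID of the previous page happens to be):

-- First page: no cursor yet
SELECT * FROM tasks ORDER BY id LIMIT 10
-- The client remembers the id of the last returned row (say, 10)
-- and sends it back with the next request:
SELECT * FROM tasks WHERE id > 10 ORDER BY id LIMIT 10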
When implementing pagination in SQL, avoid using OFFSET for large data sets, as it can slow down your queries. Instead, use the "WHERE id > N" approach to improve performance. Keep in mind that it requires your application to track the last seen ID, which may mean architectural changes in some cases. Choose the approach that suits you better.