In this article, I’m going to demonstrate the performance differences between two ways of iterating over the records of a MySQL database table with millions of rows. In a high-volume analytics system, tables with millions of records are quite common, and iterating over a full table or a subset of one often becomes necessary, whether to perform computations, run a migration, or create parallelized background jobs over the records. At AirPR, we have many database tables with hundreds of millions of records, and writing efficient iteration code matters because there is often an order-of-magnitude difference between a good and a not-so-good approach.
The standard approach provided natively by ActiveRecord is the find_each method.
For the purposes of this exercise, I created an employees table to which I added about 5 million rows of data¹.
There is also a salaries table that stores the salaries of those employees across different time ranges. This table contains about 3 million records.
Let us measure the performance of iterating through this table using find_each:
DEFAULT_BATCH_SIZE = 1000
time = Benchmark.realtime do
  Employee.select(:emp_no, :first_name, :last_name).find_each(batch_size: DEFAULT_BATCH_SIZE) do |employee|
  end
end
=> 100.6963519999990
The underlying queries that ActiveRecord makes look like this:
Employee Load (2.1ms) SELECT `employees`.`emp_no`, `employees`.`first_name`, `employees`.`last_name` FROM `employees` ORDER BY `employees`.`emp_no` ASC LIMIT 1000
Employee Load (1.9ms) SELECT `employees`.`emp_no`, `employees`.`first_name`, `employees`.`last_name` FROM `employees` WHERE (`employees`.`emp_no` > 11000) ORDER BY `employees`.`emp_no` ASC LIMIT 1000
Employee Load (1.8ms) SELECT `employees`.`emp_no`, `employees`.`first_name`, `employees`.`last_name` FROM `employees` WHERE (`employees`.`emp_no` > 12000) ORDER BY `employees`.`emp_no` ASC LIMIT 1000
...
Employee Load (1.3ms) SELECT `employees`.`emp_no`, `employees`.`first_name`, `employees`.`last_name` FROM `employees` WHERE (`employees`.`emp_no` > 5127997) ORDER BY `employees`.`emp_no` ASC LIMIT 1000
Notice how ActiveRecord keeps track of the primary key (emp_no here) from the previous iteration and uses it in a where condition in the next one. This is called value-based pagination and is generally the preferred approach for pagination (over other methods like offset-based pagination)².
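To see why value-based pagination is preferred, here is a toy model of the two styles over an in-memory sorted id list. This is plain Ruby with no database; it illustrates the access pattern (what the server has to walk past), not real I/O.

```ruby
# Toy model of the two pagination styles over an in-memory, sorted id list.
ids = (1..10_000).to_a
batch_size = 1_000

# Offset-based: "LIMIT 1000 OFFSET 2000". The server must walk past the
# first 2000 rows again on every page, so later pages get progressively slower.
page_by_offset = ids.drop(2 * batch_size).first(batch_size)

# Value-based (what find_each does): "WHERE id > 2000 ORDER BY id LIMIT 1000".
# With an index on id, the server seeks straight to the cursor and reads
# exactly one batch worth of rows.
last_seen_id = 2_000
page_by_value = ids.select { |id| id > last_seen_id }.first(batch_size)

page_by_offset == page_by_value  # => true: same rows, very different cost
```

Both styles return the same page of rows; the difference is that the offset style re-does work proportional to how deep into the table you are, while the value style only needs the cursor from the previous batch.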
I propose we try a different iterating technique now:
time = Benchmark.realtime do
  first_id = Employee.first.id
  last_id = Employee.last.id
  (first_id..last_id).step(DEFAULT_BATCH_SIZE).each do |value|
    Employee.where('employees.emp_no >= ?', value).
      where('employees.emp_no < ?', value + DEFAULT_BATCH_SIZE).
      order('employees.emp_no ASC').
      select(:emp_no, :first_name, :last_name).each do |employee|
    end
  end
end
=> 101.34066200000234
In this method, we divvy up the total number of rows into batches using where conditions on the primary key to iterate through all the records in the table. Notice how the performance is practically the same between the two methods. This is how the underlying queries look:
Employee Load (1.1ms) SELECT `employees`.* FROM `employees` ORDER BY `employees`.`emp_no` ASC LIMIT 1
Employee Load (1.1ms) SELECT `employees`.* FROM `employees` ORDER BY `employees`.`emp_no` DESC LIMIT 1
Employee Load (1.5ms) SELECT `employees`.`emp_no`, `employees`.`first_name`, `employees`.`last_name` FROM `employees` WHERE (employees.emp_no > 10001) AND (employees.emp_no <= 11001)
Employee Load (1.9ms) SELECT `employees`.`emp_no`, `employees`.`first_name`, `employees`.`last_name` FROM `employees` WHERE (employees.emp_no > 11001) AND (employees.emp_no <= 12001)
...
Employee Load (1.8ms) SELECT `employees`.`emp_no`, `employees`.`first_name`, `employees`.`last_name` FROM `employees` WHERE (employees.emp_no >= 5128001) AND (employees.emp_no < 5129001)
This approach works best if the ids are roughly contiguous: large gaps in the id space mean the loop issues queries for ranges that contain few or no records³.
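The batching logic above can be sketched in plain Ruby (no database; the id values here are made up for illustration). Each batch covers the half-open interval [value, value + BATCH_SIZE), so consecutive batches tile the id space with no overlap; but unlike find_each, a query is issued for every range whether or not any rows fall in it.

```ruby
# Sketch of the id-range batching in plain Ruby, with hypothetical ids.
BATCH_SIZE = 1_000
first_id = 10_001
last_id  = 14_500

# One query per range start, regardless of how many rows each range holds.
starts = (first_id..last_id).step(BATCH_SIZE).to_a
# => [10001, 11001, 12001, 13001, 14001]

# With sparse ids (a big gap in the middle), some batches come back empty:
sparse_ids = [10_001, 10_002, 13_999, 14_500]
batches = starts.map do |value|
  sparse_ids.select { |id| id >= value && id < value + BATCH_SIZE }
end
batches.map(&:size)  # => [2, 0, 0, 1, 1]
```

The two empty batches are wasted round trips to the database, which is why this technique pays off most when AUTO_INCREMENT keeps the ids dense.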
Now, let’s compare performance of these two methods when we add some more complexity to the query.
In this new scenario, say, we want to iterate through all employees whose salary was above 80,000 at any point during their employment with the company. The find_each method would look something like this:
time = Benchmark.realtime do
  Employee.select(:emp_no, :first_name, :last_name).joins(:salaries).where('salary > 80000').find_each(batch_size: DEFAULT_BATCH_SIZE) do |employee|
  end
end
=> 1181.770457000006
On the other hand, the id iterator method for performing the same operation results in an order of magnitude improvement in performance.
time = Benchmark.realtime do
  first_id = Employee.first.id
  last_id = Employee.last.id
  (first_id..last_id).step(DEFAULT_BATCH_SIZE).each do |value|
    Employee.where('employees.emp_no >= ?', value).
      where('employees.emp_no < ?', value + DEFAULT_BATCH_SIZE).
      joins(:salaries).
      where('salary > 80000').
      order('employees.emp_no ASC').
      select(:emp_no, :first_name, :last_name).each do |employee|
    end
  end
end
=> 72.75677799998084
The above results indicate that the find_each approach performs much worse⁴: the ID iterator approach is about 15x faster than naive find_each. The reason for this becomes clear when you inspect the queries made by the two approaches.
The find_each method makes this type of query:
SELECT `employees`.`emp_no`, `employees`.`first_name`, `employees`.`last_name` FROM `employees` INNER JOIN `salaries` ON `salaries`.`emp_no` = `employees`.`emp_no` WHERE (salary > 80000) ORDER BY `employees`.`emp_no` ASC LIMIT 1000
An EXPLAIN on this query reveals the following:
1 SIMPLE salaries ALL salary,emp_no NULL NULL NULL 2837536 Using where; Using temporary; Using filesort
1 SIMPLE employees eq_ref PRIMARY PRIMARY 4 employees.salaries.emp_no 1 Using index
which indicates that neither the index on salary nor the index on emp_no is being used to filter the salaries table.
The id iterator method makes this type of query:
SELECT `employees`.`emp_no`, `employees`.`first_name`, `employees`.`last_name` FROM `employees` INNER JOIN `salaries` ON `salaries`.`emp_no` = `employees`.`emp_no` WHERE (employees.emp_no >= 5128001) AND (employees.emp_no < 5129001) AND (salary > 80000) ORDER BY `employees`.`emp_no` ASC
An EXPLAIN on this query shows that the query optimizer uses the index on emp_no in the salaries table:
1 SIMPLE salaries range salary,emp_no emp_no 4 NULL 1 Using index condition; Using where
1 SIMPLE employees eq_ref PRIMARY PRIMARY 4 employees.salaries.emp_no 1 Using index
which reveals why the find_each method is so much slower than the iterator method.
The lesson here is to always use EXPLAIN to understand what the MySQL query optimizer actually does, so that you can write the most optimized queries⁵. Based on the results of the EXPLAIN, you can decide which approach to take for iterating over large tables.
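When scanning EXPLAIN output, a few columns do most of the talking, as the two plans above show. Here is a small, hypothetical plain-Ruby helper (not part of any library) that flags the warning signs; the two rows are hand-copied from the salaries lines of the plans shown earlier.

```ruby
# Hypothetical helper: flag warning signs in a parsed EXPLAIN row.
# Hash keys mirror MySQL's EXPLAIN columns (type, Extra).
def explain_red_flags(row)
  flags = []
  flags << 'full table scan' if row[:type] == 'ALL'
  flags << 'filesort'        if row[:extra].to_s.include?('Using filesort')
  flags << 'temporary table' if row[:extra].to_s.include?('Using temporary')
  flags
end

# The salaries row from the find_each plan: a full scan plus filesort.
bad  = { table: 'salaries', type: 'ALL',
         extra: 'Using where; Using temporary; Using filesort' }
# The salaries row from the id iterator plan: an index range scan.
good = { table: 'salaries', type: 'range',
         extra: 'Using index condition; Using where' }

explain_red_flags(bad)   # => ["full table scan", "filesort", "temporary table"]
explain_red_flags(good)  # => []
```

Anything the helper flags on a multi-million-row table is a cue to restructure the query, as the id iterator did here.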
The easiest way to see the queries that ActiveRecord generates is to turn on DEBUG logging. It is recommended to turn this on in development so you can catch performance issues early.
ActiveRecord::Base.logger = Logger.new(STDOUT)
Alternatively, you can use to_sql on an ActiveRecord::Relation to see beforehand what query it’s going to make:
Employee.where("gender = 'M'").to_sql
¹ I started out from this sample dataset and deleted everything but the employees and salaries tables. I then duplicated records in the employees table to reach 5 million rows.
² This link has a good comparison of value-based vs offset-based pagination.
³ If the AUTO_INCREMENT option is turned on for the primary key, the records are automatically in incrementing order.
⁴ The performance degrades even more on larger tables. When you reach hundreds of millions of rows, understanding the underlying queries becomes even more important, because the difference can reach 100x or 1000x.
⁵ Take the time to read (and master) the official MySQL documentation on the EXPLAIN output format, so it’s clear what’s good and what’s not good.
⁶ This link has a good description of the performance impact of creating indices. It’s important to understand that writes on a table with many indices will be slower, so use them wisely.