Bulk SQL Statements with ActiveRecord - delete_all
New Rails developers tend to fall in love with ActiveRecord. More specifically, they fall in love with the "magic" of ActiveRecord - the ability to use simple methods to represent some unknown SQL. The problem is that they don't pay attention to what ActiveRecord is doing under the covers or what the resulting queries actually look like. This can lead to slow methods that require a large amount of memory.
One instance where you may see this is in deleting large sets of data.
Let's say we have an online bookstore with a Book model that looks something like this:
# == Schema Information
#
# Table name: books
#
#  id         :integer          not null, primary key
#  title      :string(255)
#  isbn       :string(255)
#  author     :string(255)
#  quantity   :integer
#  created_at :datetime
#  updated_at :datetime
#
class Book < ActiveRecord::Base
end
We also have an OrderGenerator class that uses the quantity field to find books we need to re-order from the distributor:
class OrderGenerator
  def self.book_list
    Book.where("quantity < 20").pluck(:isbn)
  end
end
The problem is, there are books that go out of print and are no longer available for reorder. Our current system does not account for this case and our distributor is getting annoyed that we keep requesting the same out of print books on every order.
We decide to create a cron job that will run regularly to remove any Book records that have a quantity of zero and have not been restocked in the last month.
We code up an initial solution that looks something like this:
class Book
  def self.remove_with_destroy
    books = where(quantity: 0).where("updated_at < '#{Time.now - 1.month}'")
    books.each do |book|
      book.destroy
    end
  end
end
This solution will do what we need, but, simple as it is, there are a few issues we should be concerned about.
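As an aside, the cron wiring itself might look something like the sketch below; the task name and schedule here are hypothetical, not part of the original setup.

# lib/tasks/books.rake (hypothetical task name)
namespace :books do
  desc "Remove out-of-print books that have been out of stock for a month"
  task prune: :environment do
    Book.remove_with_destroy
  end
end

# crontab entry, e.g. nightly at 4:30 am:
# 30 4 * * * cd /path/to/app && bin/rake books:prune RAILS_ENV=production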
Setting Up The Tests
Before I begin I want to discuss how I set up the environment for the benchmarks to come.
Inspired by a great tip in Justin Weiss' book, Practicing Rails, I decided that in order to test this properly I should build an actual Rails application based on my example.
Since I work with large, established Rails applications on a daily basis, I tend to forget how easy it is to create a simple, new project for testing a few things out. Between rails new, rails g scaffold, and rake db:seed, my testing environment was up in a matter of minutes.
Creating a Large Number of Records
To emulate having hundreds of thousands of books I used the seeds.rb file to create 300,000 Book records:
# db/seeds.rb
(1..300_000).each do |i|
  Book.create(
    author: "Author #{i}",
    title: "Book #{i}",
    isbn: "#{i}",
    quantity: i % 2
  )
end
The logic that sets the quantity field takes the remainder of the current index divided by 2 (i % 2). This results in half of the books having a quantity of 0 (and the other half 1).
Since I was creating everything at the time of testing, I needed to change the updated_at field to be older than one month ago to represent the idea that we have not been able to restock these books for a long time.
Book.where(quantity: 0).limit(50_000).update_all(updated_at: Time.now - 2.months)
I chose to update 50,000 records. I felt that updating all books with a quantity of 0 would break the idea that we only want to remove a subset of our out-of-stock products. I also thought 50,000 records were enough that we should see substantial run times.
I added this to my seeds file as well so I could easily reproduce the entire setup between benchmarks.
Now that I have an environment where I can easily and reliably create a large number of records, I can write the method I want to test, seed the database, and benchmark the runtime of each solution.
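The benchmarking harness itself isn't shown in this post; a minimal sketch using Ruby's standard Benchmark module (run from the Rails console after reseeding) might look like this:

require "benchmark"

# Reports user/system/total/real times, matching the output shown below.
Benchmark.bm(25) do |x|
  x.report("remove_with_destroy") { Book.remove_with_destroy }
end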
destroy vs. delete
As shown before, our current solution is to loop through each book and use the destroy method:
class Book
  def self.remove_with_destroy
    books = where(quantity: 0).where("updated_at < '#{Time.now - 1.month}'")
    books.each do |book|
      book.destroy
    end
  end
end
destroy
The destroy method will instantiate an instance of the model class (the Book in this case) in order to allow callbacks to be handled.
The ability to make sure callbacks are fired off when deleting an object can be very useful, especially when dealing with deleting associations or handling any custom before_destroy callbacks. However, in our case, we have no associations or callbacks to be concerned with, so this ends up being unnecessary overhead.
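To make the trade-off concrete, here is a hypothetical version of the model with an association and a callback; destroy would fire both, while delete would skip them entirely. Our actual Book has neither, which is why destroy is pure overhead here.

# Hypothetical model -- the reviews association and log_removal
# callback are illustrations, not part of the example schema.
class Book < ActiveRecord::Base
  has_many :reviews, dependent: :destroy # destroy also removes the reviews
  before_destroy :log_removal            # destroy runs this hook first

  private

  def log_removal
    Rails.logger.info "Removing book ##{id} (#{title})"
  end
end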
Another issue with this method is that we end up running a separate query for each deletion. This problem is where the inspiration for this post came from. While it is hard to know that destroy will run callbacks on your objects without looking into the documentation, we can easily watch our logs when running methods to see what queries are being made.
This is a tiny subset of the output:
SQL (0.2ms) DELETE FROM "books" WHERE "books"."id" = ? [["id", 1000593]]
(0.5ms) commit transaction
(0.0ms) begin transaction
SQL (0.2ms) DELETE FROM "books" WHERE "books"."id" = ? [["id", 1000595]]
(0.5ms) commit transaction
(0.0ms) begin transaction
SQL (0.2ms) DELETE FROM "books" WHERE "books"."id" = ? [["id", 1000598]]
(0.5ms) commit transaction
(0.0ms) begin transaction
SQL (0.2ms) DELETE FROM "books" WHERE "books"."id" = ? [["id", 1000599]]
(0.5ms) commit transaction
(0.0ms) begin transaction
SQL (0.2ms) DELETE FROM "books" WHERE "books"."id" = ? [["id", 1000605]]
(0.6ms) commit transaction
(0.0ms) begin transaction
For each of the 50,000 records we are deleting, we will create a transaction and execute a delete query. While each step is quick, most taking less than half a millisecond, this adds up when dealing with 50,000 records. We can see the results of a benchmark of a full run below; the total time was almost 100 seconds.
user system total real
remove_with_destroy 20.090000 21.970000 42.060000 ( 96.173008)
delete
As mentioned before, destroy will instantiate an instance of the object about to be deleted in order to run any callbacks. In our case, we are not worried about deleting associated models and we do not have any callbacks, so instantiating an object and running the callbacks ends up requiring more time and memory than necessary.
The documentation for destroy actually addresses this concern and suggests an alternative, using the delete method:
The object is instantiated first, therefore all callbacks and filters are fired off before the object is deleted. This method is less efficient than ActiveRecord#delete but allows cleanup methods and other actions to be run.
The delete method no longer instantiates an instance of the Book class but rather simply sends a DELETE query for the record in question.
To make this change we simply replace our use of destroy with delete:
class Book
  def self.remove_with_delete
    books = where(quantity: 0).where("updated_at < '#{Time.now - 1.month}'")
    books.each do |book|
      book.delete
    end
  end
end
When we run the benchmarks now we see a significant improvement:
user system total real
remove_with_delete 12.960000 21.140000 34.100000 ( 49.324933)
If we keep an eye on the logs as our method runs we will again see what should be a red flag - we still end up with a query for each record we are deleting.
SQL (0.5ms) DELETE FROM "books" WHERE "books"."id" = 1900739
SQL (0.6ms) DELETE FROM "books" WHERE "books"."id" = 1900740
SQL (0.5ms) DELETE FROM "books" WHERE "books"."id" = 1900742
SQL (0.6ms) DELETE FROM "books" WHERE "books"."id" = 1900746
SQL (0.8ms) DELETE FROM "books" WHERE "books"."id" = 1900748
delete_all
Luckily there is a solution that will stop us from having to instantiate any objects and will handle all of the deletion in a single query. That solution is delete_all.
class Book
  def self.remove_with_delete_all
    where(quantity: 0).where("updated_at < '#{Time.now - 1.month}'").delete_all
  end
end
As promised, this deletes everything in a single query:
SQL (152.4ms) DELETE FROM "books" WHERE "books"."quantity" = 0 AND (updated_at < '2015-03-04 12:30:09 -0500')
The execution time of the query is 152.4ms, or 0.1524 seconds. If we look at the results of our benchmark
user system total real
remove_with_delete_all 0.060000 0.040000 0.100000 ( 0.154572)
we notice that the total execution time is only milliseconds more than the query itself took. This is because there was no additional overhead of creating objects, running callbacks, or dealing with associations.
Conclusion
In conclusion, if you are not worried about callbacks or managing associations, delete_all will result in speedy deletion of large datasets.
user system total real
remove_with_destroy 20.090000 21.970000 42.060000 ( 96.173008)
remove_with_delete 12.960000 21.140000 34.100000 ( 49.324933)
remove_with_delete_all 0.060000 0.040000 0.100000 ( 0.154572)
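That caveat is worth making concrete: because delete_all issues a single raw SQL statement, it skips callbacks and dependent options entirely. A hypothetical illustration, assuming Book had a reviews association (it does not in this example):

# Hypothetical: assumes `has_many :reviews, dependent: :destroy` on Book
Book.where(quantity: 0).destroy_all # runs callbacks, removes dependent reviews
Book.where(quantity: 0).delete_all  # one DELETE; reviews would be orphaned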
More importantly - keep an eye on what your code is doing. As developers we need to own our code and keep it running smoothly. It is difficult to make sure things are running well when we aren't looking for signs that they aren't. When navigating around your site, keep your server logs open or use tools like MiniProfiler or RailsPanel for Chrome. When running methods in the console, be sure to have a query logger active. Only we can prevent poorly performing applications!
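For example, to watch every query ActiveRecord makes during a Rails console session, you can point its logger at standard output:

# In the Rails console: echo all SQL statements to the terminal
ActiveRecord::Base.logger = Logger.new(STDOUT)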