Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

BackfillTransactionSyncBatchJob sql queries are inefficient #247

Open
AdnanTheExcellent opened this issue Oct 23, 2023 · 3 comments
Open

Comments

@AdnanTheExcellent
Copy link
Collaborator

AdnanTheExcellent commented Oct 23, 2023

I have been testing the backfills a bit over the last couple months on production and I have some thoughts on efficiency. Lets take a look at the code for the backfill batch job:

  def perform(transaction_sync_batch)
    complete_orders = ::Spree::Order.complete.where(shipment_state: 'shipped')
    if transaction_sync_batch.start_date
      complete_orders = complete_orders.where("completed_at >= ?", transaction_sync_batch.start_date.beginning_of_day)
    end
    if transaction_sync_batch.end_date
      complete_orders = complete_orders.where("completed_at <= ?", transaction_sync_batch.end_date.end_of_day)
    end
    complete_orders.find_each do |order|
      next if order.taxjar_order_transactions.any?
      SuperGood::SolidusTaxjar::ReportTransactionJob.perform_later(order, transaction_sync_batch)
    end
  end

The first line:

complete_orders = ::Spree::Order.complete.where(shipment_state: 'shipped')

By itself, that's an extremely expensive sql query. If a store has hundreds of thousands, or millions of orders, this could take a VERY long time, or freeze the thread. This has happened to me. There needs to be some lower bound to the date. Either require the start_date or have a date in the admin for when Taxjar started. For instance, we've been using spree / solidus since Dec 2010, but started using Taxjar April 2021. We backfilled to Feb 2021 when we started. A good default date would be Feb 2021 for the taxjar_start_date. That would prevent 10+ years of orders from getting queried.

next if order.taxjar_order_transactions.any?

this is an N+1 query since it fetches taxjar_order_transactions for each order in the loop. This can be resolved with an includes and where

what I propose:

  # put a lower bound on the start_date calendar select = to SuperGood::SolidusTaxjar.configuration.taxjar_start_date
  start_date = if transaction_sync_batch.start_date
                 transaction_sync_batch.start_date.beginning_of_day
               else
                 SuperGood::SolidusTaxjar.configuration.taxjar_start_date
               end
  complete_orders = ::Spree::Order.complete.
    includes(:taxjar_order_transactions).
    where("taxjar_order_transactions.order_id": nil).
    where(shipment_state: 'shipped').
    where("completed_at >= ?", start_date)
  if transaction_sync_batch.end_date
    complete_orders = complete_orders.where("completed_at <= ?", transaction_sync_batch.end_date.end_of_day)
  end
  complete_orders.each do |order|
    SuperGood::SolidusTaxjar::ReportTransactionJob.perform_later(order, transaction_sync_batch)
  end

thoughts?

@benjaminwil
Copy link
Member

Thanks for the detailed report and your proposal. 👍 I agree that there are a bunch of straightforward performance improvements we should make to this.

@senemsoy
Copy link
Collaborator

By itself, that's an extremely expensive sql query. If a store has hundreds of thousands, or millions of orders, this could take a VERY long time, or freeze the thread.

completed_orders collection should not be loaded until we call find_each on it, which will only load batches of 1000 records at a time.

We've confirmed this by testing if completed_orders.loaded? is true in a Rails console, and that wasn't the case. Here's a screenshot showing the result of that check right above the complete_orders.find_each do |order| line:
Screenshot 2023-10-25 at 10 49 32 AM

Either require the start_date or have a date in the admin for when Taxjar started.

The assumption we made is that if the start date is not set when are backfilling transactions, the user wants to backfill all existing orders. For anyone who wants to narrow the set of orders, the expectation is that they will set the start and the end dates for backfilling.

@AdnanTheExcellent
Copy link
Collaborator Author

AdnanTheExcellent commented Oct 26, 2023

Either require the start_date or have a date in the admin for when Taxjar started.

The assumption we made is that if the start date is not set when are backfilling transactions, the user wants to backfill all existing orders. For anyone who wants to narrow the set of orders, the expectation is that they will set the start and the end dates for backfilling.

You're right, if someone gives a start date. However, Taxjar only allows you to backfill up to the current calendar year or a few months for free. After that, you have to pay per year. It could be dangerous to offer an empty date as the default option. I would personally require to put a backfill date and set it as that so the user knows they have to make an explicit choice on the date.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants