Decrease batch size for seeds #35

Closed
@tomreitz

Description

Thanks for your work on this adapter, it's great.

The batch size to use when loading seeds is specified here as 100000.

I'm not sure how the 100k value was determined, but I ran into the following SQLite error when trying to load a seed file of 70k rows and 5 columns (2.37 MB, a dictionary of common words in 14 languages):

17:46:24  7 of 7 ERROR loading seed file main.synth_words ................................ [ERROR in 2.54s]
...
17:46:24  Database Error in seed synth_words (seeds/synth_words.csv)
17:46:24    too many SQL variables

I resolved the error by manually changing the batch size in macros/materializations/seed/seed.sql from 100k down to 10k.

SQLite's limits include SQLITE_MAX_VARIABLE_NUMBER, which defaults to 999 for SQLite versions prior to 3.32.0 (2020-05-22) and to 32766 for 3.32.0 and later.

(I'm running SQLite 3.31.1.) It seems that the number of variables used in the seed query scales with batch size.
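To illustrate the scaling: each row in a batched INSERT contributes one `?` placeholder per column, so a single batch of my seed file needs far more bound variables than either default limit allows. A quick back-of-the-envelope check in plain Python (no dbt involved; the numbers are from my seed file above):

```python
# Rough model of how a batched INSERT consumes SQL variables:
# each row contributes one "?" placeholder per column.
ROWS = 70_000        # rows in my seed file
COLS = 5             # columns in my seed file
BATCH_SIZE = 100_000 # current hard-coded batch size in seed.sql

rows_in_batch = min(ROWS, BATCH_SIZE)  # the whole file fits in one batch
variables = rows_in_batch * COLS

print(variables)           # 350000 bound variables in a single INSERT
print(variables > 999)     # True -> fails on SQLite < 3.32.0
print(variables > 32_766)  # True -> fails even on SQLite >= 3.32.0
```

So even on a recent SQLite, a 100k-row batch of a 5-column seed would blow past the limit; my 10k workaround only fits under the 32766 default, not the 999 one.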

Sidebar: dbt itself specifies a maximum file size for seeds but this is only used to determine whether or not to hash the file. dbt discourages using very large seed files, but does not impose a limit as far as I can tell.

Would it be possible to decrease the batch size from 100k to, say, 10k? And/or expose a configurable seed_batch_size parameter?

My guess is that the optimal value depends on both the number of rows and the number of columns in the seed file, so it may be difficult to justify any hard-coded batch size. But something smaller, while resulting in more queries against the database, should still perform well.
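One sketch of what a column-aware default could look like (a hypothetical helper, not code from this adapter; the 999 default matches the pre-3.32.0 variable limit, and the cap is arbitrary):

```python
def safe_batch_size(num_cols: int, max_vars: int = 999, cap: int = 10_000) -> int:
    """Largest row batch whose placeholder count (rows * cols) stays
    within max_vars, clamped to a cap so generous limits don't
    produce enormous INSERT statements."""
    return max(1, min(cap, max_vars // num_cols))

# My 5-column seed under the old 999-variable default:
print(safe_batch_size(5))          # 199 rows per batch
# The same seed when SQLITE_MAX_VARIABLE_NUMBER is 32766:
print(safe_batch_size(5, 32_766))  # 6553 rows per batch
```

On Python 3.11+, I believe `sqlite3.Connection.getlimit(sqlite3.SQLITE_LIMIT_VARIABLE_NUMBER)` can report the actual compiled-in limit at runtime, which could feed `max_vars` instead of assuming the worst-case 999.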
