
Datasette with many and large databases > Memory use #1880

Open
amitkoth opened this issue Nov 2, 2022 · 4 comments

amitkoth commented Nov 2, 2022

Datasette maintains an in-memory SQLite database with details of the databases, tables and columns for all of the attached databases.

The above is from the docs ^. There are two problems here - the number of datasette "instances" in a single server/VM and the size of the database itself. We want the opposite of in-memory, including what happens on SQLite - documented in https://www.sqlite.org/inmemorydb.html

From the context in #1150 - does it mean datasette is memory-bound to the size of the dataset - which might be a deal-breaker for many large-scale use cases?

In an extreme case - let's say a single server had 100 SQLite databases, which would enable 100 "instances" of datasette to run, one per client (e.g. in a SaaS multi-tenant environment). How could we achieve all these goals:

  1. Allow any one of these 100 databases to grow to, say, 2 TB in size
  2. Have one datasette instance, which connects to 1 of the 100 databases, based on incoming credentials/tenant ID
  3. Minimize memory use entirely - both by datasette and SQLite, such that almost all operations are executed in real time on-disk, with little to no memory consumption per tenant or per database.

Any ideas appreciated - we're looking to use this in a SaaS type of setting - many instances, single server.

@simonw great work on datasette, in general! Possibly related to #1480, but we don't want to use any kind of serverless infra - this is a long-running VM/server.

@amitkoth amitkoth changed the title Datasette with many and large databases Datasette with many and large databases > Memory use Nov 2, 2022
amitkoth (Author) commented Nov 2, 2022

Follow-on question @simonw - is all memory use, for both datasette and SQLite, confined to "query time" itself? i.e. is the memory relevant only to a particular transaction or query, and then subsequently released?

simonw (Owner) commented Nov 11, 2022

I think you may have misunderstood this feature. This is talking about the _internal in-memory database, which maintains a set of tables that list the databases and tables that are attached to Datasette.

They're not a copy of the data itself - just a list of table names, column names and database names.

You can see what that database looks like by signing in as root - running datasette --root and clicking the link. Or you can see an example here:

For the example instance that looks like this:

[screenshot: the _internal database index page]

The two most interesting tables in there are these ones:

[screenshots: two of the _internal tables, listing table names and their columns]

As you can see, it's just the table schema itself and the columns that make up the tables. Even if you have hundreds of databases connected each with hundreds of tables this should still only add up to a few MB of RAM.
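To get a feel for the scale, here's a sketch - the table names and shapes below are illustrative, not Datasette's actual _internal schema - that builds a metadata-only catalog like the one described and measures how big it gets:

```python
import sqlite3

# Hypothetical catalog: just names, never the data itself.
catalog = sqlite3.connect(":memory:")
catalog.execute("CREATE TABLE tables (database_name TEXT, table_name TEXT)")
catalog.execute(
    "CREATE TABLE columns (database_name TEXT, table_name TEXT, column_name TEXT)"
)

# Simulate 100 attached databases, each with 100 tables of 10 columns.
for db in range(100):
    for tbl in range(100):
        catalog.execute(
            "INSERT INTO tables VALUES (?, ?)", (f"db{db}", f"t{tbl}")
        )
        catalog.executemany(
            "INSERT INTO columns VALUES (?, ?, ?)",
            [(f"db{db}", f"t{tbl}", f"c{c}") for c in range(10)],
        )

page_count = catalog.execute("PRAGMA page_count").fetchone()[0]
page_size = catalog.execute("PRAGMA page_size").fetchone()[0]
print(f"{page_count * page_size / 1024 / 1024:.1f} MB")  # a few MB at most
```

Even at 10,000 tables and 100,000 columns tracked, the catalog stays in the low single-digit megabytes, because it stores only names.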

simonw (Owner) commented Nov 11, 2022

The _internal database is intended to help Datasette handle much larger attached databases. Right now Datasette attempts to show every database on the https://latest.datasette.io/ index page and every table on the https://latest.datasette.io/fixtures database index page - but these are not paginated. If you had a database containing 1,000 tables the database index page would get pretty slow.

So I want to be able to paginate (and search) those. But to paginate them it's useful to have them in a database table itself, since then I can paginate using SQL.
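That idea can be sketched with plain sqlite3 - the `tables` catalog here is a stand-in, not _internal's real schema - using keyset pagination so even thousands of table names page cheaply:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE tables (table_name TEXT PRIMARY KEY)")
conn.executemany(
    "INSERT INTO tables VALUES (?)",
    [(f"table_{i:04d}",) for i in range(1000)],
)

def table_page(conn, after=None, page_size=20):
    """Keyset-paginate table names: each page is one cheap indexed query."""
    if after is None:
        rows = conn.execute(
            "SELECT table_name FROM tables ORDER BY table_name LIMIT ?",
            (page_size,),
        )
    else:
        rows = conn.execute(
            "SELECT table_name FROM tables WHERE table_name > ? "
            "ORDER BY table_name LIMIT ?",
            (after, page_size),
        )
    return [r[0] for r in rows]

page1 = table_page(conn)
page2 = table_page(conn, after=page1[-1])
print(page1[0], page2[0])  # → table_0000 table_0020
```

Search works the same way: add a `WHERE table_name LIKE ?` clause and the catalog answers it without touching any attached database.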

My plan for _internal is to use it to implement those advanced browsing features. I've not completed this work yet though. See this issue for more details on that:

amitkoth (Author) commented

I appreciate your response @simonw - thanks!

I'll clarify what we need further - let's imagine we have 2000 SQLite databases (for 2000 tenants), but we want to run only one datasette instance, which each of those tenants uses to query their own database only. This means the "connection" between datasette and the SQLite database would be dynamic, based on the tenant ID that's required on an incoming request.

Are there any specific config options or other considerations in this use case, to minimize memory use on a single, efficient VM and serve queries to all these tenants?

cc @MuAdham
