-
-
Notifications
You must be signed in to change notification settings - Fork 700
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
--dirs option for scanning directories for SQLite databases #672
base: master
Are you sure you want to change the base?
Conversation
One design issue: how to pick neat unique names for database files in a file hierarchy? Here's what I have so far: Lines 231 to 237 in fe6f9e6
For these files:
It made these names:
Maybe this is good enough? Needs some tests. |
This can take a LONG time to run, and at the moment it's blocking and prevents Datasette from starting up. It would be much better if this ran in a thread, or an asyncio task. Probably have to be a thread because there's no easy |
Another problem: if any of the found databases use SpatiaLite then Datasette will fail to start at all. It should skip them instead. The |
I tried running the Makes sense - I had a thread that added an item to that dictionary right while the homepage was attempting to run this code: datasette/datasette/views/index.py Lines 24 to 27 in efa54b4
|
So I need to ensure the Mainly I need to ensure that it is locked during iterations over it, then unlocked at the end. Trickiest part is probably ensuring there is a test that proves this is working - I feel like I got lucky encountering that |
... or maybe I can cheat and wrap the access to |
... cheating like this seems to work:
Python built-in operations are supposedly threadsafe, so in this case I can grab a copy of the list atomically (I think) and then safely iterate over it. Seems to work in my testing. Wish I could prove it with a unit test though. |
Interesting new problem: hitting Ctrl+C no longer terminates the problem provided that https://stackoverflow.com/questions/49992329/the-workers-in-threadpoolexecutor-is-not-really-daemon has clues. The workers are only meant to exit when their worker queues are empty. But... I want to run the worker every 10 seconds. How do I do that without having it loop forever and hence never quit? |
It think the fix is to use an old-fashioned |
I've figured out how to tell if a database is safe to open or not: select sql from sqlite_master where sql like 'CREATE VIRTUAL TABLE%'; This returns the SQL definitions for virtual tables. The bit after Run this against a SpatiaLite database and you get the following: CREATE VIRTUAL TABLE SpatialIndex USING VirtualSpatialIndex()
CREATE VIRTUAL TABLE ElementaryGeometries USING VirtualElementary() Run it against an Apple Photos CREATE VIRTUAL TABLE RidList_VirtualReader using RidList_VirtualReaderModule
CREATE VIRTUAL TABLE Array_VirtualReader using Array_VirtualReaderModule
CREATE VIRTUAL TABLE LiGlobals_VirtualBufferReader using VirtualBufferReaderModule
CREATE VIRTUAL TABLE RKPlace_RTree using rtree (modelId,minLongitude,maxLongitude,minLatitude,maxLatitude) For a database with FTS4 you get: CREATE VIRTUAL TABLE "docs_fts" USING FTS4 (
[title], [content], content="docs"
) FTS5: CREATE VIRTUAL TABLE [FARA_All_Registrants_fts] USING FTS5 (
[Name], [Address_1], [Address_2],
content=[FARA_All_Registrants]
) So I can use this to figure out all of the |
d.path for d in list(self.databases.values()) if d.path is not None | ||
} | ||
for dir in self.dirs: | ||
for filepath in Path(dir).glob("**/*.db"): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Need to support .sqlite
files here too
str(filepath) | ||
.replace("../", "") | ||
.replace("/", "_") | ||
.replace(".db", ""), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also consider .sqlite
- using pathlib .stem
here would be better.
.replace(".db", ""), | ||
Database(self, str(filepath), is_mutable=True), | ||
) | ||
time.sleep(10) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This should be configurable - 10 is an OK default, but cases like simonw/jupyterserverproxy-datasette-demo#2 where it's running in a Binder session would benefit from faster updates.
Here's the new utility function I should be using to verify database files that I find: datasette/datasette/utils/__init__.py Lines 773 to 787 in 6aa516d
|
While running it against a nested directory with a TON of databases I kept seeing errors like this:
And
|
I also eventually get this error:
|
I think I can tell what the current file limit is like so:
So maybe I should have Datasette refuse to open more database files than that number minus 5 (to give me some spare room for opening CSS files etc). |
Refs #417.