Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

--dirs option for scanning directories for SQLite databases #672

Open
wants to merge 5 commits into
base: master
Choose a base branch
from

Conversation

simonw
Copy link
Owner

@simonw simonw commented Feb 14, 2020

Refs #417.

@simonw simonw added the feature label Feb 14, 2020
@simonw
Copy link
Owner Author

simonw commented Feb 14, 2020

One design issue: how to pick neat unique names for database files in a file hierarchy?

Here's what I have so far:

datasette/datasette/app.py

Lines 231 to 237 in fe6f9e6

self.add_database(
str(filepath)
.replace("../", "")
.replace("/", "_")
.replace(".db", ""),
Database(self, filepath, is_mutable=True),
)

For these files:

../travel-old.db
../sf-tree-history/trees.db
../library-of-congress/records-from-df.db

It made these names:

travel-old
sf-tree-history_trees
library-of-congress_records-from-df

Maybe this is good enough? Needs some tests.

@simonw
Copy link
Owner Author

simonw commented Feb 14, 2020

This can take a LONG time to run, and at the moment it's blocking and prevents Datasette from starting up.

It would be much better if this ran in a thread, or an asyncio task. Probably have to be a thread because there's no easy async version of pathlib.Path.glob() that I've seen.

@simonw
Copy link
Owner Author

simonw commented Feb 14, 2020

Another problem: if any of the found databases use SpatiaLite then Datasette will fail to start at all.

It should skip them instead.

The select * from sqlite_master check apparently isn't quite enough to catch this case.

@simonw
Copy link
Owner Author

simonw commented Feb 14, 2020

I tried running the scan_dirs() method in a thread and got an interesting error while trying to load the homepage: RuntimeError: OrderedDict mutated during iteration

Makes sense - I had a thread that added an item to that dictionary right while the homepage was attempting to run this code:

async def get(self, request, as_format):
databases = []
for name, db in self.ds.databases.items():
table_names = await db.table_names()

@simonw
Copy link
Owner Author

simonw commented Feb 14, 2020

So I need to ensure the ds.databases data structure is manipulated in a thread-safe manner.

Mainly I need to ensure that it is locked during iterations over it, then unlocked at the end.

Trickiest part is probably ensuring there is a test that proves this is working - I feel like I got lucky encountering that RuntimeError as early as I did.

@simonw
Copy link
Owner Author

simonw commented Feb 14, 2020

... or maybe I can cheat and wrap the access to self.ds.databases.items() in list(), so I'm iterating over an atomically-created list of those things instead? I'll try that first.

@simonw
Copy link
Owner Author

simonw commented Feb 14, 2020

... cheating like this seems to work:

for name, db in list(self.ds.databases.items()):

Python built-in operations are supposedly threadsafe, so in this case I can grab a copy of the list atomically (I think) and then safely iterate over it.

Seems to work in my testing. Wish I could prove it with a unit test though.

@simonw
Copy link
Owner Author

simonw commented Feb 14, 2020

Interesting new problem: hitting Ctrl+C no longer terminates the problem provided that scan_dirs() thread is still running.

https://stackoverflow.com/questions/49992329/the-workers-in-threadpoolexecutor-is-not-really-daemon has clues. The workers are only meant to exit when their worker queues are empty.

But... I want to run the worker every 10 seconds. How do I do that without having it loop forever and hence never quit?

@simonw
Copy link
Owner Author

simonw commented Feb 14, 2020

@simonw
Copy link
Owner Author

simonw commented Feb 14, 2020

It think the fix is to use an old-fashioned threading module daemon thread directly. That should exit cleanly when the program exits.

@simonw
Copy link
Owner Author

simonw commented Feb 14, 2020

I've figured out how to tell if a database is safe to open or not:

select sql from sqlite_master where sql like 'CREATE VIRTUAL TABLE%';

This returns the SQL definitions for virtual tables. The bit after using tells you what they need.

Run this against a SpatiaLite database and you get the following:

CREATE VIRTUAL TABLE SpatialIndex USING VirtualSpatialIndex()
CREATE VIRTUAL TABLE ElementaryGeometries USING VirtualElementary()

Run it against an Apple Photos photos.db file (found with find ~/Library | grep photos.db) and you get this (partial list):

CREATE VIRTUAL TABLE RidList_VirtualReader using RidList_VirtualReaderModule
CREATE VIRTUAL TABLE Array_VirtualReader using Array_VirtualReaderModule
CREATE VIRTUAL TABLE LiGlobals_VirtualBufferReader using VirtualBufferReaderModule
CREATE VIRTUAL TABLE RKPlace_RTree using rtree (modelId,minLongitude,maxLongitude,minLatitude,maxLatitude)

For a database with FTS4 you get:

CREATE VIRTUAL TABLE "docs_fts" USING FTS4 (
    [title], [content], content="docs"
)

FTS5:

CREATE VIRTUAL TABLE [FARA_All_Registrants_fts] USING FTS5 (
                [Name], [Address_1], [Address_2],
                content=[FARA_All_Registrants]
            )

So I can use this to figure out all of the using pieces and then compare them to a list of known support ones.

d.path for d in list(self.databases.values()) if d.path is not None
}
for dir in self.dirs:
for filepath in Path(dir).glob("**/*.db"):
Copy link
Owner Author

@simonw simonw Mar 26, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Need to support .sqlite files here too

str(filepath)
.replace("../", "")
.replace("/", "_")
.replace(".db", ""),
Copy link
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also consider .sqlite - using pathlib .stem here would be better.

.replace(".db", ""),
Database(self, str(filepath), is_mutable=True),
)
time.sleep(10)
Copy link
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should be configurable - 10 is an OK default, but cases like simonw/jupyterserverproxy-datasette-demo#2 where it's running in a Binder session would benefit from faster updates.

@simonw
Copy link
Owner Author

simonw commented Mar 26, 2020

Here's the new utility function I should be using to verify database files that I find:

def check_connection(conn):
tables = [
r[0]
for r in conn.execute(
"select name from sqlite_master where type='table'"
).fetchall()
]
for table in tables:
try:
conn.execute("PRAGMA table_info({});".format(escape_sqlite(table)),)
except sqlite3.OperationalError as e:
if e.args[0] == "no such module: VirtualSpatialIndex":
raise SpatialiteConnectionProblem(e)
else:
raise ConnectionProblem(e)

@simonw
Copy link
Owner Author

simonw commented Mar 26, 2020

While running it against a nested directory with a TON of databases I kept seeing errors like this:

Traceback (most recent call last):
  File "/Users/simonw/Dropbox/Development/datasette/datasette/utils/asgi.py", line 121, in route_path
    return await view(new_scope, receive, send)
  File "/Users/simonw/Dropbox/Development/datasette/datasette/utils/asgi.py", line 193, in view
    request, **scope["url_route"]["kwargs"]
  File "/Users/simonw/Dropbox/Development/datasette/datasette/views/index.py", line 58, in get
    tables[table]["num_relationships_for_sorting"] = count
KeyError: 'primary-candidates-2018/rep_candidates'

And

Traceback (most recent call last):
  File "/Users/simonw/Dropbox/Development/datasette/datasette/utils/asgi.py", line 121, in route_path
    return await view(new_scope, receive, send)
  File "/Users/simonw/Dropbox/Development/datasette/datasette/utils/asgi.py", line 193, in view
    request, **scope["url_route"]["kwargs"]
  File "/Users/simonw/Dropbox/Development/datasette/datasette/views/index.py", line 58, in get
    tables[table]["num_relationships_for_sorting"] = count
KeyError: 'space_used'

@simonw
Copy link
Owner Author

simonw commented Mar 26, 2020

I also eventually get this error:

Traceback (most recent call last):
  File "/Users/simonw/Dropbox/Development/datasette/datasette/utils/asgi.py", line 121, in route_path
    return await view(new_scope, receive, send)
  File "/Users/simonw/Dropbox/Development/datasette/datasette/utils/asgi.py", line 336, in inner_static
    await asgi_send_file(send, full_path, chunk_size=chunk_size)
  File "/Users/simonw/Dropbox/Development/datasette/datasette/utils/asgi.py", line 303, in asgi_send_file
    async with aiofiles.open(str(filepath), mode="rb") as fp:
  File "/Users/simonw/.local/share/virtualenvs/datasette-oJRYYJuA/lib/python3.7/site-packages/aiofiles/base.py", line 78, in __aenter__
  File "/Users/simonw/.local/share/virtualenvs/datasette-oJRYYJuA/lib/python3.7/site-packages/aiofiles/threadpool/__init__.py", line 35, in _open
  File "/usr/local/Cellar/python/3.7.5/Frameworks/Python.framework/Versions/3.7/lib/python3.7/concurrent/futures/thread.py", line 57, in run
OSError: [Errno 24] Too many open files: '/Users/simonw/Dropbox/Development/datasette/datasette/static/app.css'

@simonw
Copy link
Owner Author

simonw commented Mar 26, 2020

I think I can tell what the current file limit is like so:

In [1]: import resource                                                                                                                                                                   

In [2]: resource.getrlimit(resource.RLIMIT_NOFILE)                                                                                                                                        
Out[2]: (256, 9223372036854775807)

So maybe I should have Datasette refuse to open more database files than that number minus 5 (to give me some spare room for opening CSS files etc).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

1 participant