Description
The src/licensedcode/data directory contains 68K+ files, of which 64K are rule files.
These rule files are barely used at runtime because they are baked into the index in a compressed form, and it is the index that is used at runtime. The same applies to the license files, which are fully included in the index in object form.
These are only needed when the index is rebuilt.
Another issue is that handling so many files makes any filesystem operation (unbearably) slow, both during development and at installation time.
It also creates side issues such as #2427 (comment) and linkedin/shiv#224
I suggest some of these to fix the issue:
- we can halve the number of files by combining the YAML data file and the license/rule text file in a single file with minimal code changes, either as a combined YAML file or as YAML front matter https://jekyllrb.com/docs/front-matter/
- we can split the files in multiple sub-directories to limit the number of files to some sensible number (say under 5K per dir)
- at runtime, and as part of the build, we could replace the many files with a single (larger) file: either a big JSON or YAML file, or a zip with the actual files accessed as a filesystem https://github.com/PyFilesystem/pyfilesystem2/blob/master/fs/zipfs.py or as paths https://github.com/jaraco/zipp
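
As a rough sketch of the zip option in the last bullet, the standard library `zipfile.Path` (the stdlib counterpart of zipp) may be enough to read individual files out of a single archive. The function names `pack_data_dir` and `read_rule` and the archive layout are hypothetical, for illustration only, and are not existing scancode code:

```python
import zipfile
from pathlib import Path


def pack_data_dir(data_dir: Path, archive: Path) -> int:
    """Write every file under data_dir into one zip and return the file count.

    Hypothetical build-time helper; the archive must live outside data_dir.
    """
    count = 0
    with zipfile.ZipFile(archive, "w", compression=zipfile.ZIP_DEFLATED) as zf:
        for path in sorted(data_dir.rglob("*")):
            if path.is_file():
                # Store paths relative to data_dir, with POSIX separators.
                zf.write(path, arcname=path.relative_to(data_dir).as_posix())
                count += 1
    return count


def read_rule(archive: Path, name: str) -> str:
    """Read one member of the archive through a path-like interface."""
    with zipfile.ZipFile(archive) as zf:
        return zipfile.Path(zf, at=name).read_text(encoding="utf-8")
```

With this approach an install would ship one archive instead of tens of thousands of small files, and the loose files would only be needed in a development checkout when the index is rebuilt.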
Combining either all three of these actions or just the last two should make this workable for development, installation and runtime.
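
For the front matter option, a minimal sketch of how a combined file could be written and split back into its two parts is below. It assumes the usual `---` delimiters, only separates the YAML text from the rule text (the YAML text would still go through whatever YAML loader is already in use), and the names `combine_rule` and `split_rule` and the sample field are hypothetical:

```python
from __future__ import annotations

FRONT_MATTER_DELIMITER = "---"


def combine_rule(yaml_text: str, rule_text: str) -> str:
    """Join YAML metadata and rule text into one front matter document."""
    return (
        f"{FRONT_MATTER_DELIMITER}\n"
        f"{yaml_text.strip()}\n"
        f"{FRONT_MATTER_DELIMITER}\n"
        f"{rule_text}"
    )


def split_rule(combined: str) -> tuple[str, str]:
    """Return (yaml_text, rule_text) from a front matter document."""
    lines = combined.splitlines(keepends=True)
    if not lines or lines[0].strip() != FRONT_MATTER_DELIMITER:
        # No front matter: treat the whole file as rule text.
        return "", combined
    for i, line in enumerate(lines[1:], start=1):
        if line.strip() == FRONT_MATTER_DELIMITER:
            # Everything between the delimiters is YAML; the rest is the text.
            return "".join(lines[1:i]).strip(), "".join(lines[i + 1:])
    raise ValueError("unterminated front matter block")


if __name__ == "__main__":
    # Illustrative data only.
    doc = combine_rule("license_expression: mit", "Licensed under the MIT license\n")
    print(split_rule(doc))
```

Only the first closing delimiter ends the metadata block, so a rule text that itself contains a `---` line is preserved unchanged.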