Dila Open Data

Warning: this repository makes it simple to download and use the datasets provided by the Dila (Direction de l'information légale et administrative). However, these datasets have terms of use that you must respect. Please refer to the Dila website for more information.

Example

A typical use is to find all the decisions concerning "CESEDA" (the French Code of Entry and Stay of Foreigners and the Right of Asylum) in the JADE dataset (Jurisprudence administrative de l'État).

# Run this once to download, extract and index (slow)
dilarxiv --tarballs --fond JADE --extract --index
# Run this to search for CESEDA in the JADE dataset (fast)
dilarxiv --query "CESEDA" --save result-list.txt
# Run this to turn the result list into a CSV file with text and metadata
# it creates a `result-list.txt.csv` file
dilarxiv --csv result-list.txt

The query command prints 10 results and saves the full list of results in a file called result-list.txt. Note that the above query is too narrow to return meaningful results; the following command works better:

dilarxiv --query "CESEDA OR \"code de l'entrée et du séjour des étrangers et du droit d'asile\""

This returns all the decisions that mention either CESEDA or the full name of the code, which is the most common way to refer to it in the decisions.
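When several alternative spellings are needed, the query string can be assembled programmatically. The following is a minimal sketch that only mirrors the syntax of the example above (terms joined by OR, multi-word terms wrapped in double quotes); whether the query language accepts anything beyond that is not confirmed here. The helper names `build_or_query` and `dilarxiv_query_args` are hypothetical and not part of the tool.

```python
def build_or_query(terms):
    """Join search terms with OR, quoting any term that contains whitespace.

    Assumption: the query syntax is the one shown in the README example.
    """
    parts = []
    for term in terms:
        term = term.strip()
        if not term:
            continue  # skip empty terms
        if '"' in term:
            raise ValueError(f"term must not contain double quotes: {term!r}")
        if any(ch.isspace() for ch in term):
            parts.append(f'"{term}"')  # multi-word term: quote it
        else:
            parts.append(term)
    return " OR ".join(parts)


def dilarxiv_query_args(terms):
    """Build an argument list using only the --query flag documented above."""
    return ["dilarxiv", "--query", build_or_query(terms)]


if __name__ == "__main__":
    terms = [
        "CESEDA",
        "code de l'entrée et du séjour des étrangers et du droit d'asile",
    ]
    print(build_or_query(terms))
```

Passing the list returned by `dilarxiv_query_args` to `subprocess.run` (without a shell) avoids having to escape the inner quotes by hand.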

Note: once extracted, the JADE dataset is about 8GB. Indexing takes about 10 minutes on a not-so-recent laptop, and the resulting index is about 4GB, so you need more than 12GB of free disk space to run the search. The whole pipeline takes about 20 minutes. After that, searches should feel instantaneous (well under 1 second).
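Since the pipeline needs more than 12GB of free space, it can be worth checking the disk before starting. This is a small sketch using Python's standard `shutil.disk_usage`; the function name `has_enough_space` is hypothetical, and treating 1GB as 10^9 bytes is an assumption.

```python
import shutil


def has_enough_space(path=".", required_gb=12.0):
    """Return True if the filesystem holding `path` has more than `required_gb` free.

    Assumption: 1 GB is counted as 10**9 bytes.
    """
    free_bytes = shutil.disk_usage(path).free
    return free_bytes > required_gb * 10**9


if __name__ == "__main__":
    if has_enough_space("."):
        print("More than 12 GB free: the JADE pipeline should fit.")
    else:
        print("12 GB or less free: extraction or indexing may run out of space.")
```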

Usage

This repository is a wrapper around two very different ways to interact with the Dila datasets. On the one hand, it lets you download and index the datasets yourself. On the other hand, it lets you use the API provided by the Dila to access the datasets. If you only need a few results, the API is the best way to go. If you need a lot of results, it is better to download the datasets yourself.

Use the datasets

To download and index the datasets yourself, start by downloading the tarballs with the following command:

dilarxiv --tarballs

This will download all the datasets provided by the Dila. If you are interested in a specific dataset, you can use the --fond option (any number of times) to specify the datasets you are interested in. For example, to download the CASS dataset, you can use the following command:

dilarxiv --tarballs --fond CASS

Note that the datasets are available on the open data portal of the Dila, so it is possible to download only specific archives rather than whole datasets.

To automatically extract the datasets, you can use the --extract option. This assumes that there is a tarball folder available, for instance because you have just downloaded the datasets using the --tarballs option.

dilarxiv --extract

Now, the extracted content is available in the extracted folder. The content is organized in many subfolders, ultimately containing XML files. To index the datasets, you can use the --index option. This will create an index folder containing the internal structure of the index, which allows for fast searches.

dilarxiv --index

Warning: indexing can take a significant amount of time and CPU.
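Because indexing cost grows with the amount of extracted data, a quick count of the XML files gives a rough sense of scale before or after indexing. The sketch below walks the extracted folder mentioned above; the function name `count_xml_files` is hypothetical, and the lowercase `.xml` extension is an assumption.

```python
from pathlib import Path


def count_xml_files(root="extracted"):
    """Count files ending in .xml anywhere below `root`.

    Returns 0 if the folder does not exist (for example, before --extract).
    """
    root_path = Path(root)
    if not root_path.is_dir():
        return 0
    return sum(1 for path in root_path.rglob("*.xml") if path.is_file())


if __name__ == "__main__":
    print(f"{count_xml_files()} XML files found under 'extracted'")
```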

Now, to search for documents in the index, you can use the --query option. This will perform a full-text search and return the actual paths of the files of interest.

dilarxiv --query "search term"

By default, the answer is just a list of ten results. If you want to build the full list of results, you can use the --save option, which creates a text file with one line per result.

dilarxiv --query "search term" --save result-list.txt
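The saved list is a plain text file with one line per result, so it is easy to load from a script. This is a minimal sketch; the function name `load_result_list` is hypothetical, and UTF-8 encoding of the file is an assumption.

```python
from pathlib import Path


def load_result_list(path="result-list.txt"):
    """Return the non-empty lines of a saved result list, one entry per result.

    Assumption: the file is UTF-8 encoded.
    """
    lines = Path(path).read_text(encoding="utf-8").splitlines()
    return [line.strip() for line in lines if line.strip()]
```

For example, `len(load_result_list("result-list.txt"))` gives the total number of results, which is useful since the command itself prints only ten.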

If you want to turn the result list into a CSV file with text and metadata, you can use the --csv option. This will create a CSV file with columns for the metadata and the textual content of the documents. Note that some columns may contain nulls. The CSV file is created in the same folder as the result list, with the extra .csv extension.

dilarxiv --csv result-list.txt

In this example, the CSV file is named result-list.txt.csv.
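Since some columns may contain nulls, it can help to check how complete each column is before analysing the data. The sketch below makes no assumption about column names and simply reads them from the header. It does assume that the file is comma-separated, UTF-8 encoded, has a header row, and that nulls appear as empty cells; none of this is confirmed by the tool. The function name `summarize_csv` is hypothetical.

```python
import csv


def summarize_csv(path="result-list.txt.csv"):
    """Return (row_count, empty_counts) where empty_counts maps each column
    name to the number of rows in which that column is empty.

    Assumptions: comma-separated, UTF-8, header row, nulls as empty cells.
    """
    # Full decision texts can exceed the csv module's default field size limit.
    csv.field_size_limit(2**31 - 1)
    empty_counts = {}
    row_count = 0
    with open(path, newline="", encoding="utf-8") as handle:
        reader = csv.DictReader(handle)
        for name in reader.fieldnames or []:
            empty_counts[name] = 0
        for row in reader:
            row_count += 1
            for name in empty_counts:
                if not (row.get(name) or "").strip():
                    empty_counts[name] += 1
    return row_count, empty_counts
```

Raising the field size limit matters here because the text column holds whole documents, which can be longer than the default limit of the `csv` module.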

Use the API (testing phase)

To use the API, you need to create an account on the PISTE website, which hosts the APIs. Following the instructions there, you will get an API key, which must be stored in a file called client-secret.txt, and an identifier, which must be stored in a file called client-id.txt.
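A missing or empty credential file is an easy mistake to make, so a quick pre-flight check can save a failed request. This sketch only verifies that the two files named above exist and are non-empty. Looking for them in the current directory is an assumption, since where the tool expects them is not stated here, and the function name `load_credentials` is hypothetical.

```python
from pathlib import Path


def load_credentials(directory="."):
    """Return (client_id, client_secret) read from the two credential files.

    Assumption: both files live in `directory` and are UTF-8 text.
    Raises FileNotFoundError or ValueError if a file is missing or empty.
    """
    base = Path(directory)
    values = []
    for name in ("client-id.txt", "client-secret.txt"):
        file_path = base / name
        if not file_path.is_file():
            raise FileNotFoundError(f"missing credential file: {file_path}")
        content = file_path.read_text(encoding="utf-8").strip()
        if not content:
            raise ValueError(f"credential file is empty: {file_path}")
        values.append(content)
    return values[0], values[1]
```

The function deliberately returns the values without printing them, so that the secret does not end up in terminal logs.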

dilapi --query "ceseda" --start-year 2020 --end-year 2023 --fond "CETAT"

By default, the results are streamed to stdout in JSON format, exactly as returned by the API. If you want to save the results in a file, you can use the --output option, which creates a text file with one line per result. It is also possible to obtain the full contents of a list of results by running the following command:

dilapi --texts results.json

It will create a full-texts folder with one file per result in the results.json file. The files are named <uid>.txt, and contain the full text of the decision/article/document.
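Once the texts are on disk, they can be loaded into a dictionary keyed by identifier for further processing. This is a minimal sketch based on the `<uid>.txt` naming described above; the function name `load_full_texts` is hypothetical, and UTF-8 encoding of the files is an assumption.

```python
from pathlib import Path


def load_full_texts(folder="full-texts"):
    """Map each uid (the file name without .txt) to the text of that file.

    Assumption: the files are UTF-8 encoded.
    """
    texts = {}
    for file_path in sorted(Path(folder).glob("*.txt")):
        texts[file_path.stem] = file_path.read_text(encoding="utf-8")
    return texts
```

Loading everything at once is convenient for a few hundred documents; for much larger result sets, iterating over the files one by one would use less memory.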

How to install

The easiest way to install the software is to download one of the prebuilt binaries from GitHub. If this is not possible, building the software from source should be easy:

1. Clone the repository
2. Run cargo build --release to build the software
3. Run cargo install --path . to install the software

A relatively recent version of Rust is required to build the software.

Useful resources

Description of the filters available in the API (French, xlsx file)
Example queries and usage of the API (French, docx file)
Definition of the terms used in the API (French, docx file)
Terms of use of the API (French, pdf file)
Terms of use of the Legifrance API (French, pdf file)
Open data license (French, pdf file)

Notes

This repository is a proof of concept and will never be maintained or put into production. The coding style is terrible, there is no documentation, no tests, and minimal error handling. Please do not use this unless you understand what you are doing.

AliaumeL/legifrance-rs
