This is a fast parser and CSV file generator for the Kaggle challenge "COVID-19 Open Research Dataset Challenge (CORD-19)". It transforms more than 68,000 JSON files (8 GB) into a single CSV file in one minute on a modern laptop.
For more information, see this blog post.
This is an Accelerator project, meaning that computations are fast, parallel, and reproducible. The Accelerator is an open source project from eBay.
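To give a feel for the transformation, here is a minimal, sequential sketch in plain Python. It is not the project's code, which lives in dev/a_import.py and runs in parallel under the Accelerator. The field names paper_id, metadata.title, abstract and body_text are assumptions about the CORD-19 JSON layout, and the helper names join_text and json_to_csv, as well as the CSV column names, are hypothetical.

```python
import csv
import json
from pathlib import Path


def join_text(sections):
    """Concatenate the "text" fields of a list of section dicts."""
    return " ".join(s.get("text", "") for s in sections or [])


def json_to_csv(input_dir, output_file):
    """Write one CSV row per JSON file found below input_dir; return the row count."""
    rows = 0
    with open(output_file, "w", newline="", encoding="utf-8") as fh:
        writer = csv.writer(fh)
        writer.writerow(["paper_id", "title", "abstract", "body_text"])
        # sorted() makes the row order deterministic between runs
        for path in sorted(Path(input_dir).rglob("*.json")):
            with open(path, encoding="utf-8") as src:
                doc = json.load(src)
            # Field names below are assumed, not taken from the project source
            writer.writerow([
                doc.get("paper_id", ""),
                doc.get("metadata", {}).get("title", ""),
                join_text(doc.get("abstract")),
                join_text(doc.get("body_text")),
            ])
            rows += 1
    return rows
```

Calling json_to_csv("data/CORD-19-research-challenge", "cord.csv") would process the files one at a time, where cord.csv is just an example output name. The Accelerator version instead spreads the work over several processes.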
- Clone this repository. Then, cd into the created Kaggle-CORD19-data-parser directory.
- Download the dataset CORD-19-research-challenge.zip from Kaggle here. Unzip it. The default configuration assumes it is unzipped into a directory named data/CORD-19-research-challenge, so

      mkdir -p data/CORD-19-research-challenge
      unzip CORD-19-research-challenge.zip -d data/CORD-19-research-challenge

- Create a "workdir", where all output will be stored, for example

      mkdir -p workdirs/cord

- Create a "results" directory where results will be linked.

      mkdir results

- Set up a virtual environment and install the Accelerator

      python3 -m venv venv
      source venv/bin/activate
      pip install accelerator

- Read (and perhaps modify) the file accelerator.conf, in particular
  - set the number of slices, i.e. the number of processes to run in parallel (for example 8),
  - set the workdirs path to where output will be stored (for example workdirs/cord), and
  - set the input directory to the location of the unzipped CORD dataset (for example data/CORD-19-research-challenge).

  Make sure that the paths exist and are correct.
- The Accelerator is a client-server application, so use two terminal emulator windows. Make sure to activate the virtual environment in both of them.

  In the "server" terminal, type

      ax server

  In the "client" terminal, type

      ax run

  or

      ax run --fullpath

  The program will now execute. It will print information about the build process and location of files.
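Since a wrong path in accelerator.conf is an easy mistake to make, a small pre-flight check can help before starting the server. The sketch below only tests that directories exist; it does not parse accelerator.conf. The helper name missing_dirs is hypothetical, and the paths are the example values from the steps above, so adjust them to your own configuration.

```python
from pathlib import Path

# Example locations from the setup steps; change them to match accelerator.conf.
EXPECTED_DIRS = [
    "workdirs/cord",
    "results",
    "data/CORD-19-research-challenge",
]


def missing_dirs(paths, base="."):
    """Return the entries of paths that are not existing directories under base."""
    return [p for p in paths if not (Path(base) / p).is_dir()]
```

Run from the repository root, missing_dirs(EXPECTED_DIRS) should return an empty list when everything is in place; any name it returns is a directory that still needs to be created or corrected.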
The source code is found in the build script dev/build.py, which
calls the method dev/a_import.py.
Copyright 2020 Anders Berkeman and Carl Drougge
Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License. You may obtain a copy of the License at
https://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.