Our Data Processing Scripts
As you know, Bibliotools is a tool developed by the Sciences Po Médialab. This repository contains a new application, ScienceScapeS, developed by the Codokans at King's College London, which partly uses some of the old Bibliotools3.0 scripts you can find by clicking the following link:
https://github.com/medialab/bibliotools3.0
As an addendum to the previous set of instructions, here are some quick ways to process your input from the Web of Science.
It is possible to use the refactored version of Bibliotools3.0 without the ScienceScapeS website. To do so, please locate the back-end folder in our repository: https://github.com/wonjoonSeol/ScienceScape/tree/master/bibliotools3
This folder contains everything you need for processing input from the Web of Science.
Log in to WOS with your institution's account, or your personal account. Look up whatever you need, and download the data files in the following format: Tab-delimited and UTF-8.
Please place your .txt files (there can be many) in the data-wos folder of the previously indicated directory. If no such folder exists (i.e. you have taken the scripts directly from the website's repo) then create one using your preferred command or tool. You may now proceed to the next step.
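If you would rather create the folder from Python than from your shell, the following is a minimal sketch. The helper name ensure_data_folder is hypothetical and not part of the repository; it only assumes that the data-wos folder sits directly inside the directory you pass in.

```python
from pathlib import Path


def ensure_data_folder(base_dir):
    """Create base_dir/data-wos if it is missing and list the .txt files in it.

    Hypothetical helper for illustration; it is not one of the Bibliotools scripts.
    """
    data_dir = Path(base_dir) / "data-wos"
    # exist_ok=True makes this safe to call when the folder is already there.
    data_dir.mkdir(parents=True, exist_ok=True)
    txt_files = sorted(p.name for p in data_dir.glob("*.txt"))
    return data_dir, txt_files
```

Calling ensure_data_folder(".") from the bibliotools3 directory returns the folder path and the names of the .txt files currently in it, so an empty list is a quick sign that the Web of Science exports have not been copied over yet.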
In the scripts folder of the directory, you will find a script named graph_gen.py. This is a Python script that can be run directly from your terminal with the python command. Please make sure that you are in the correct directory before attempting to run the script (that is, you have changed the current directory to scripts beforehand).
A critical step before running this script file is to make sure you've downloaded and correctly installed all dependencies. A guide to doing this can be found here.
To use the Bibliotools scripts coherently, you need to define time spans. Time spans let our application separate your references, articles and data according to their date of publication. The time spans may overlap, and you may enter whatever years you please (that is, a time span could be 2010-2033 or 1700-1800).
Provided your time spans are in the correct format, that is StartYear-EndYear, you may now run the following command to start the process:
python graph_gen.py -bound Span1 Span2 ........ SpanN
Example: the following is a valid command:
python graph_gen.py -bound 2000-2005 2007-2019 2002-2018
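To catch a mistyped span before launching the script, the StartYear-EndYear format can be checked in a few lines of Python. This is only a sketch: parse_span and build_command are hypothetical names, and the four-digit-year and start-not-after-end checks are our assumptions, not necessarily what graph_gen.py itself enforces.

```python
import re
import sys

# Assumption: years are written with four digits, as in the examples above.
SPAN_PATTERN = re.compile(r"(\d{4})-(\d{4})")


def parse_span(text):
    """Turn 'StartYear-EndYear' into a (start, end) tuple of integers."""
    match = SPAN_PATTERN.fullmatch(text)
    if match is None:
        raise ValueError(f"expected StartYear-EndYear, got {text!r}")
    start, end = int(match.group(1)), int(match.group(2))
    if start > end:
        # Assumption: a span that ends before it starts is a typo.
        raise ValueError(f"start year is after end year in {text!r}")
    return start, end


def build_command(spans):
    """Validate every span, then return the argument list for graph_gen.py."""
    for span in spans:
        parse_span(span)
    return [sys.executable, "graph_gen.py", "-bound", *spans]
```

The list returned by build_command(["2000-2005", "2007-2019", "2002-2018"]) can be passed to subprocess.run(command, cwd="scripts", check=True) so that the script runs from the scripts folder.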
During the execution of the command, you will see many printed notices on the command line. There are five steps to our process (originally inspired by and partly defined by the Bibliotools3.0 scripts):
1. Merging your corpus files (functionality completed by the refactored script merging_corpus.py): during this step, your files are being merged into one big file (we know how frustrating it is to have a limit on the number of files you download, but sadly this is out of our hands).
After this step, you will observe a file years_distribution.csv created in the reports directory, which contains the current distribution of years. If you realise that this distribution would not work well with your time spans, you can abort the script and redefine them as previously demonstrated.
For information, "corpus" means a collection of texts; the word is common in French but less well known in English.
2. Parsing your files (parsers.py, utility.py and parse_and_group.py): during this step, each element in your data will be separated according to your time span definitions. As in the original Bibliotools3.0 scripts, references, subjects, authors, institutions, keywords and countries will be parsed.
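One way to judge whether the year distribution fits your time spans is to count how many records each span would receive and which years no span covers. The sketch below works on a plain dictionary of year to record count, because the column layout of years_distribution.csv is not documented here; the function names are hypothetical, and treating both span bounds as inclusive is an assumption.

```python
def records_per_span(year_counts, spans):
    """Sum the record counts falling inside each (start, end) span.

    Assumption: both bounds are inclusive. Overlapping spans each count
    the records they share.
    """
    totals = {}
    for start, end in spans:
        totals[(start, end)] = sum(
            count for year, count in year_counts.items() if start <= year <= end
        )
    return totals


def uncovered_years(year_counts, spans):
    """Return the years present in the data that no span covers."""
    return sorted(
        year
        for year in year_counts
        if not any(start <= year <= end for start, end in spans)
    )


# Made-up counts, for illustration only.
example_counts = {1999: 3, 2003: 10, 2006: 4, 2010: 7}
example_spans = [(2000, 2005), (2007, 2019), (2002, 2018)]
print(records_per_span(example_counts, example_spans))
print(uncovered_years(example_counts, example_spans))
```

With these made-up counts, the year 1999 is reported as uncovered, which would be a reason to abort and widen the first span.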
At the end of the process, you will observe many .dat files in the span folders. Feel free to explore the folders and files, to learn more about the directory structure.
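To give an idea of what separating records by time span involves, here is a simplified sketch. It is not the code in parse_and_group.py: group_by_span is a hypothetical function, it assumes the tab-delimited export has a header row containing the Web of Science PY (publication year) field, and it treats span bounds as inclusive.

```python
import csv
import io


def group_by_span(tsv_text, spans):
    """Group tab-delimited records by the spans their publication year falls in.

    A record is added to every span that contains its year, so overlapping
    spans share records. Rows without a numeric PY value are skipped.
    """
    reader = csv.DictReader(
        io.StringIO(tsv_text), delimiter="\t", quoting=csv.QUOTE_NONE
    )
    groups = {span: [] for span in spans}
    for row in reader:
        year_text = (row.get("PY") or "").strip()
        if not year_text.isdigit():
            continue
        year = int(year_text)
        for start, end in spans:
            if start <= year <= end:
                groups[(start, end)].append(row)
    return groups


# Made-up records, for illustration only.
sample = (
    "AU\tTI\tPY\n"
    "Doe, J\tFirst paper\t2003\n"
    "Roe, A\tSecond paper\t2010\n"
    "Poe, E\tUndated paper\t\n"
)
grouped = group_by_span(sample, [(2000, 2005), (2002, 2018)])
for span, rows in grouped.items():
    print(span, [row["TI"] for row in rows])
```

Here the 2003 record lands in both spans because they overlap, while the undated record is left out of every span.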
3. Referencing and graph-making (most of the remaining scripts)
At the end of the execution, you will find some graph files in the output folder. They have the format specified in the config.py file.
By the way, these scripts are entirely customisable, and most of the configuration happens in config.py. You may change the output directories as you wish in this file, and the other files will use these instead.
To visualise these graph files outside of our application, you can use any graph-reading software (Gephi is your best option in our opinion, https://gephi.org).
We hope you enjoyed running our scripts as much as we enjoyed refactoring them. If you encounter any problem with these scripts or you are not happy with the output, please log an issue so somebody can take care of the debugging.
Something wrong with this set of instructions? Log an issue to help us fix it.