pubcrawler
pubcrawler downloads all files of a specified type from an organisation’s website and then extracts metadata from each document using LLMs. The library’s main use case (as demonstrated in the core scripts) is downloading PDFs from think tanks and policy organisations and mapping authorship, publishing output and institutional affiliations.
You will need an OpenAI API key to run all scripts. The key must be stored within the .env file in this directory.
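For example, a minimal .env file could contain a single line like the one below. The variable name OPENAI_API_KEY is an assumption based on the usual OpenAI client convention; check how the scripts load the key if yours differs.

# replace with your own key (variable name assumed)
OPENAI_API_KEY=sk-your-key-here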
To explore the research output for an institution of your choosing, you will need to configure the following parameters. Here is an example of the parameters required to scrape Autonomy’s site for PDFs and then process them with GPT-4:
url = 'https://autonomy.work/' # organisation's main webpage
directory_name = 'autonomy' # choose a name for the folder to save output data to
file_type = 'pdf' # filetype to download
model = 'gpt-4' # openai model for processing text
org_name = 'autonomy' # name of your organisation (used to filter out irrelevant results)
To run the software, you can pass the parameters like so:
python -m pubcrawler.cli.core -cs url='https://autonomy.work/' directory_name='autonomy' file_type='pdf' model='gpt-4' org_name='autonomy'
Alternatively, you can manually set these values in const.ipynb and then run:
nbdev_export
pip install -e .
python -m pubcrawler.cli.core
or call proj.core.run_all() directly from a notebook or Python session.
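A minimal sketch of that Python route is below; the import path pubcrawler.core is an assumption based on the package layout implied above, and run_all() is taken from the call shown here.

from pubcrawler import core  # assumed module path; exposes run_all() as referenced above
core.run_all()  # runs the full download-and-extract pipeline using the values set in const.ipynb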
To list the available commands:

python -m pubcrawler.cli

To execute a command, run its module directly, for example:

python -m pubcrawler.cli.core
You can view the manual for each command by passing -h.
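For example, to see the options for the core command:

python -m pubcrawler.cli.core -h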