Neural Prioritisation for Web Crawling

This repository contains the source code used for the experiments presented in the paper "Neural Prioritisation for Web Crawling" by Francesca Pezzuti, Sean MacAvaney, and Nicola Tonellotto, published at ICTIR 2025.

Usage

Installation

You can install the requirements using pip:

pip install -r requirements.txt
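
Optionally, the requirements can be installed inside a virtual environment to keep them isolated from the system Python; a minimal sketch using Python's built-in venv module:

python -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt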

Supported datasets

Web collection:

  • ClueWeb22-B (eng)

Query sets:

  • msmarco-ws (see Pre-processing below)

Pre-processing

To preprocess the msmarco-ws query set:

  1. make sure that the queries are stored under "/data/queries/msmarco-ws/msmarco-ws-queries.tsv"
  2. make sure that the qrels are stored under "./../data/qrels/msmarco-ws/cleaned-msmarco-ws-qrels.tsv"

Then, run:

python preproc_querysets.py
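
For reference, assuming the repository follows the usual MS MARCO/TREC file conventions (an assumption, not stated in this README), the two input files would have one record per line in the following layouts:

msmarco-ws-queries.tsv:        <qid> \t <query text>
cleaned-msmarco-ws-qrels.tsv:  <qid> <iteration> <docid> <relevance>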

To preprocess ClueWeb22-B run:

python preproc_cw22b.py

Crawling

Breadth-First Search (BFS)

To crawl with BFS, run the following command using the default config file.

python crawl.py --max_pages -1 --verbosity 1 --exp_name v1 --frontier_type bfs

Depth-First Search (DFS)

To crawl with DFS, run the following command using the default config file.

python crawl.py --max_pages -1 --verbosity 1 --exp_name v1 --frontier_type dfs

QFirst

To crawl with QFirst, run the following command using the default config file.

python crawl.py --max_pages -1 --verbosity 1 --exp_name first --frontier_type quality

QOracle

To crawl with QOracle, run the following command using the default config file.

python crawl.py --max_pages -1 --verbosity 1 --exp_name v1 --frontier_type oracle-quality

QMin

To crawl with QMin, run the following command using the default config file.

python crawl.py --max_pages -1 --verbosity 1 --exp_name min --frontier_type quality --updates_enabled 1
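
All five strategies can also be launched in sequence with a small shell loop; this is only a convenience sketch, with experiment names mirroring the examples above:

# run BFS, DFS, QFirst, and QOracle with their example experiment names
for pair in v1:bfs v1:dfs first:quality v1:oracle-quality; do
    python crawl.py --max_pages -1 --verbosity 1 \
        --exp_name "${pair%%:*}" --frontier_type "${pair##*:}"
done
# QMin is the quality frontier with priority updates enabled
python crawl.py --max_pages -1 --verbosity 1 --exp_name min --frontier_type quality --updates_enabled 1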

Ranking with BM25

To index and rank with BM25 the set of documents crawled up to time t = limit by a crawler whose experiment name is 'crawler', use:

python index.py --periodic 1 --limit 2_500_000 --exp_name crawler --benchmark msmarco-ws

By default, this Python script automatically retrieves the top-k scoring documents for the queries in the specified query set and writes the results to the runs directory; this behaviour can be disabled by launching the script with --evaluate False.
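
For example, to build the index without running retrieval:

python index.py --periodic 1 --limit 2_500_000 --exp_name crawler --benchmark msmarco-ws --evaluate False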

Re-ranking with MonoElectra

To re-rank documents with MonoElectra:

CUDA_VISIBLE_DEVICES=0 python rerank.py --exp_name crawler --subexp_name limit_2500000 --benchmark msmarco-ws

Evaluation

To evaluate crawling and retrieval effectiveness, use the Jupyter Notebook plot-metrics.ipynb.
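
Alternatively, runs in TREC format can be scored programmatically with the ir_measures package; a minimal sketch, assuming TREC-formatted qrels and run files (the run path is hypothetical; adapt it to the layout of the runs directory):

import ir_measures
from ir_measures import nDCG, AP

# load TREC-style qrels and a run file produced by index.py / rerank.py
qrels = ir_measures.read_trec_qrels("data/qrels/msmarco-ws/cleaned-msmarco-ws-qrels.tsv")
run = ir_measures.read_trec_run("runs/crawler/limit_2500000/bm25.run")  # hypothetical path

# aggregate nDCG@10 and MAP over all queries in the run
print(ir_measures.calc_aggregate([nDCG@10, AP], qrels, run))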
