This repository contains the tools used to scrape the SpecsQA dataset for the DualGraph project from the Samsung UK store webpage.
Install dependencies with:

```shell
poetry install --with dev
```

The first step is to obtain links to all the product categories. Given those links, the task is then to obtain the links to all individual product variants. This involves iterating over all the product cards (corresponding to product ranges) shown in a given category and enumerating all possible parameter combinations (corresponding to individual products). All of this is done by the get_links.py script, which outputs a JSON file containing the product links along with metadata.
```shell
poetry run python -m scraper.get_links --output_file links.json
```

Given the product links, the corresponding HTML pages have to be scraped according to their respective layouts (we distinguish three types: A, B and C).
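For orientation, the links file can be consumed like the sketch below. The field names (`url`, `category`, `layout`) are assumptions for illustration; the actual schema produced by get_links.py may differ.

```python
import json

# Hypothetical shape of links.json; the real keys may differ.
sample = json.loads("""
[
  {"url": "https://www.samsung.com/uk/product-1/", "category": "tvs", "layout": "A"},
  {"url": "https://www.samsung.com/uk/product-2/", "category": "monitors", "layout": "B"},
  {"url": "https://www.samsung.com/uk/product-3/", "category": "tvs", "layout": "A"}
]
""")

def links_by_layout(entries, layout):
    """Select product URLs whose page uses the given layout type (A, B or C)."""
    return [e["url"] for e in entries if e["layout"] == layout]

a_links = links_by_layout(sample, "A")
```

Grouping by layout up front is convenient because each layout needs its own parsing path in the scraping step.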
```shell
poetry run python -m scraper.async_scrape --links_file links.json --output_dir htmls
```

The scraping code is asynchronous and uses multiple drivers (multiple browser instances). Their number can be adjusted via the NUM_DRIVERS constant in the code. In theory, more drivers yield faster scraping, but care must be taken that the browsers do not fall into an idle state, in which case they fail to load all the required dynamic content. The most reliable way to prevent this seems to be keeping all the browser windows actually visible on the screen. RAM is the other obvious limiting factor.
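The driver-pool idea can be sketched with a plain asyncio semaphore that caps concurrency at NUM_DRIVERS. This is a simplified stand-in, not the repository's implementation: the `scrape_one` body here only simulates a fetch, where the real code would drive a browser instance.

```python
import asyncio

NUM_DRIVERS = 3  # mirrors the NUM_DRIVERS constant mentioned above

async def scrape_one(url: str, sem: asyncio.Semaphore) -> str:
    """Fetch one product page; the real version would use a browser driver."""
    async with sem:  # at most NUM_DRIVERS pages in flight at once
        await asyncio.sleep(0)  # placeholder for page load + dynamic content wait
        return f"<html for {url}>"

async def scrape_all(urls: list[str]) -> list[str]:
    sem = asyncio.Semaphore(NUM_DRIVERS)
    # gather preserves input order, so results line up with urls
    return await asyncio.gather(*(scrape_one(u, sem) for u in urls))

pages = asyncio.run(scrape_all([f"https://example.com/p{i}" for i in range(5)]))
```

The semaphore is what makes "more drivers = faster" a tunable trade-off: raising NUM_DRIVERS widens the bound, at the cost of RAM per browser instance.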
It should be noted that the scraping code is quite fragile with respect to page-layout changes. The version in this repository was used to obtain data on around 3k products in mid-November 2025. The general logic:
- asyncio scraping of categories,
- then product cards within the categories,
- then product variants within product cards,
- then product specification tables within product variant pages
should be fairly robust, but minor CSS changes might require tweaking the relevant constants.