
Releases: metalbobinou/BnFDebatsParlementairesDownloader-python

v1.022 Exception in HTTP response re-re-managed

15 Aug 11:44
632b8a7

An import was missing (and the code path that needs it is triggered so rarely that it was hard to catch...)

v1.021 Exception in HTTP response re-managed

10 Aug 17:54
4e9c0ef

An exception that was happening... still happens.

It should now be corrected (even though the failure remains impossible to reproduce) by catching the http.client.IncompleteRead exception.

Here is the trace:

    data = response.read()
  File "/usr/lib/python3.10/http/client.py", line 460, in read
    return self._read_chunked(amt)
  File "/usr/lib/python3.10/http/client.py", line 598, in _read_chunked
    raise IncompleteRead(b''.join(value))
http.client.IncompleteRead: IncompleteRead(0 bytes read)
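
For illustration, here is a minimal sketch of catching this case, assuming a plain urllib.request fetch; the function name and the choice to return the partial bytes are assumptions, not the script's actual code:

```python
import http.client
import urllib.request

def read_response(url, timeout=30):
    # Hypothetical helper: fetch a URL and survive a truncated chunked response.
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.read()
    except http.client.IncompleteRead as exc:
        # The server closed a chunked response early; keep whatever arrived
        # and let the caller decide whether to retry or stop cleanly.
        print(f"IncompleteRead while fetching {url}: {exc}")
        return exc.partial  # bytes received before the connection dropped
```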

v1.02 Exception in HTTP response

08 Aug 18:24
457501e

When the HTTP response is empty (or another unexpected case occurs), it is now handled correctly: the script stops and saves its last correct state.
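
As a rough sketch of that stop-and-save behaviour (the state-file name and format here are assumptions, not the project's actual recovery file):

```python
import sys

STATE_FILE = "recovery_state.txt"  # hypothetical name, not the script's real file

def save_state_and_exit(last_ok_index):
    # Record where processing last succeeded so a relaunch can resume
    # from the next entry instead of starting over from the beginning.
    with open(STATE_FILE, "w") as f:
        f.write(str(last_ok_index))
    sys.exit(1)
```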

v1.01 PDF downloader corrected + Final message written

28 Jul 20:18
933f35a

v1.01: corrections to v1.0

  • PDF downloader: fixed error handling (a variable from the JPEG downloader was still referenced by mistake, making the script crash completely when an error occurred, with no error recovery)

  • A new message is written when each step finishes correctly: when an error occurs, the usual error message is written, but when the input list (or input parameters) has been fully processed, a new completion message is written

v1.0 - Basic scraping

26 Jul 15:23

A scraper for Gallica (BnF) that generates URLs from dates, tries to resolve them in order to get the Ark ID (the identifier of a document), collects multiple Ark IDs when a date contains multiple documents, and finally downloads all of the JPEGs and PDFs for each Ark ID.
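
As a rough illustration of that resolution step, here is a minimal sketch, assuming the date URL is already built and that the Ark ID can be read from the URL Gallica redirects to; the function name and the regex are assumptions, not the project's code:

```python
import re
import urllib.request

def resolve_ark(date_url, timeout=30):
    # Gallica redirects a date-based URL to the concrete document;
    # the final URL after redirection contains the Ark ID.
    with urllib.request.urlopen(date_url, timeout=timeout) as response:
        final_url = response.geturl()
    # 12148 is the BnF's ARK naming authority; the document ID follows it.
    match = re.search(r"ark:/12148/([a-z0-9]+)", final_url)
    return match.group(1) if match else None
```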

The scraper can recover from errors by recording where it failed in each list.
However, only one instance can be run at a time.
More precisely: only one instance can run in a given folder (you can copy the source code and your lists into several different folders and launch them in parallel), because the error-recovery file always has the same name => TODO: create one temp file per input filename (a sketch of that TODO follows below).
Keep in mind that BnF/Gallica does not handle multiple connections well... the error recovery is there so you can relaunch the script when the BnF server has a problem and closes the connection.
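
A minimal sketch of that TODO, deriving one recovery file per input filename (the suffix is hypothetical):

```python
import os

def recovery_file_for(input_filename):
    # One recovery file per input list, so parallel runs in the same
    # folder would no longer clash on a single shared state file.
    base = os.path.basename(input_filename)
    return base + ".recovery.tmp"  # hypothetical suffix
```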

There is a simple criterion for detecting the end of a processing step: the input filename is appended with "_final.txt".
For the JPEG/PDF downloads, a temporary folder is created ( *_WIP_JPEG / *_WIP_PDF ) and renamed with the date (*JPEG[date]) once the step completes.
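
As a minimal sketch of that folder convention (the helper name and exact string handling are assumptions):

```python
import os

def finish_jpeg_download(prefix, date):
    # The folder is only renamed once every file has been downloaded,
    # so a *_WIP_JPEG folder left behind means the step was interrupted.
    wip = prefix + "_WIP_JPEG"
    done = prefix + "JPEG[" + date + "]"
    os.rename(wip, done)
```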

Several kinds of exceptions are currently handled: network failures (a timeout is even set for when the server does not respond) and a full disk.
When launching the scraper, you must keep a trace of the log!

python src/script.py > logX_Y.log 2>&1

With this, you can't miss what's happening (just use tail, or tail -n 25, on the log to see what failed).