
Releases: metalbobinou/BnFDebatsParlementairesDownloader-python

v1.022 Exception in HTTP response re-re-managed

15 Aug 11:44
632b8a7

An import was missing (and the code path that needs it is triggered so rarely that it was hard to catch...)

v1.021 Exception in HTTP response re-managed

10 Aug 17:54
4e9c0ef

An exception that was happening... still happens.

It should now be corrected (even though the failure remains impossible to reproduce) by catching the http.client.IncompleteRead exception.

Here is the trace:

    data = response.read()
  File "/usr/lib/python3.10/http/client.py", line 460, in read
    return self._read_chunked(amt)
  File "/usr/lib/python3.10/http/client.py", line 598, in _read_chunked
    raise IncompleteRead(b''.join(value))
http.client.IncompleteRead: IncompleteRead(0 bytes read)
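
For illustration, here is a minimal sketch of catching this case, assuming a plain urllib.request fetch; the function name and the choice to return the partial bytes are assumptions, not the script's actual code:

```python
import http.client
import urllib.request

def read_response(url, timeout=30):
    # Hypothetical helper: fetch a URL and survive a truncated chunked response.
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.read()
    except http.client.IncompleteRead as exc:
        # The server closed a chunked response early; keep whatever arrived
        # and let the caller decide whether to retry or stop cleanly.
        print(f"IncompleteRead while fetching {url}: {exc}")
        return exc.partial  # bytes received before the connection dropped
```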

v1.02 Exception in HTTP response

08 Aug 18:24
457501e

When the HTTP response is empty (or another unexpected case occurs), it is now handled correctly: the script stops and saves its last correct state.
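
As a rough sketch of that stop-and-save behaviour (the state-file name and format here are assumptions, not the project's actual recovery file):

```python
import sys

STATE_FILE = "recovery_state.txt"  # hypothetical name, not the script's real file

def save_state_and_exit(last_ok_index):
    # Record where processing last succeeded so a relaunch can resume
    # from the next entry instead of starting over from the beginning.
    with open(STATE_FILE, "w") as f:
        f.write(str(last_ok_index))
    sys.exit(1)
```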

v1.01 PDF downloader corrected + Final message written

28 Jul 20:18
933f35a

v1.01: corrections to v1.0

  • PDF downloader: fixed error handling (a variable from the JPEG downloader was still referenced by mistake, making the script crash completely when an error occurred, with no error recovery)

  • A new message is written when each step finishes correctly: when an error occurs, the usual error message is written, but when the input list (or input parameters) has been fully processed, a new completion message is written

v1.0 - Basic scraping

26 Jul 15:23

A scraper for Gallica (BnF) that generates URLs from dates, tries to resolve them in order to get the Ark ID (the identifier of a document), collects multiple Ark IDs when a date contains multiple documents, and finally downloads all of the JPEGs and PDFs for each Ark ID.
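
As a rough illustration of that resolution step, here is a minimal sketch, assuming the date URL is already built and that the Ark ID can be read from the URL Gallica redirects to; the function name and the regex are assumptions, not the project's code:

```python
import re
import urllib.request

def resolve_ark(date_url, timeout=30):
    # Gallica redirects a date-based URL to the concrete document;
    # the final URL after redirection contains the Ark ID.
    with urllib.request.urlopen(date_url, timeout=timeout) as response:
        final_url = response.geturl()
    # 12148 is the BnF's ARK naming authority; the document ID follows it.
    match = re.search(r"ark:/12148/([a-z0-9]+)", final_url)
    return match.group(1) if match else None
```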

The scraper can recover from errors by recording where it failed in each list.
However, only one instance can be run at a time.
More precisely: only one instance can run in a given folder (you can copy the source code and your lists into several different folders and launch them in parallel), because the error-recovery file always has the same name => TODO: create one temp file per input filename (a sketch of that TODO follows below).
Keep in mind that BnF/Gallica does not handle multiple connections well... the error recovery is there so you can relaunch the script when the BnF server has a problem and closes the connection.
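
A minimal sketch of that TODO, deriving one recovery file per input filename (the suffix is hypothetical):

```python
import os

def recovery_file_for(input_filename):
    # One recovery file per input list, so parallel runs in the same
    # folder would no longer clash on a single shared state file.
    base = os.path.basename(input_filename)
    return base + ".recovery.tmp"  # hypothetical suffix
```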

There is a simple criterion for detecting the end of a processing step: the input filename is appended with "_final.txt".
For the JPEG/PDF downloads, a temporary folder is created ( *_WIP_JPEG / *_WIP_PDF ) and renamed with the date (*JPEG[date]) once the step completes.
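
As a minimal sketch of that folder convention (the helper name and exact string handling are assumptions):

```python
import os

def finish_jpeg_download(prefix, date):
    # The folder is only renamed once every file has been downloaded,
    # so a *_WIP_JPEG folder left behind means the step was interrupted.
    wip = prefix + "_WIP_JPEG"
    done = prefix + "JPEG[" + date + "]"
    os.rename(wip, done)
```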

Several kinds of exceptions are currently handled: network failures (a timeout is even set for when the server does not respond) and a full disk.
When launching the scraper, you must keep a trace of the log!

python src/script.py > logX_Y.log 2>&1

With this, you can't miss what's happening (just use tail, or tail -n 25, on the log to see what failed).