Closed
Labels
enhancement (New feature or request)
Description
Currently, DP+ downloads the resource as follows:
- checks the Content-Length header against MAX_CONTENT_LENGTH to see if the file is too large; this file size check is skipped if PREVIEW_ROWS is true
- downloads the file in chunks of CHUNK_SIZE bytes
- while downloading in chunks, checks that the actual file size stays below MAX_CONTENT_LENGTH (unless PREVIEW_ROWS is true)
- checks the hash of the downloaded file to see if it has changed; if it hasn't, it skips pushing the file into the datastore
- supports only the HTTP, HTTPS and FTP URL schemes
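The current flow described above can be sketched roughly as follows. This is only an illustration, not the actual DP+ code: the function names, the limit value and the choice of SHA-256 are assumptions, and the chunks would normally come from an HTTP or FTP response rather than a list.

```python
import hashlib
import io

MAX_CONTENT_LENGTH = 1024  # assumed limit in bytes, for illustration only


def save_chunks(chunks, out, max_length=MAX_CONTENT_LENGTH, preview=False):
    """Write chunks to `out` and return (size, hex digest).

    The size limit is enforced while downloading, unless `preview`
    (standing in for PREVIEW_ROWS being true) is set.
    """
    digest = hashlib.sha256()  # hash algorithm is an assumption
    size = 0
    for chunk in chunks:
        size += len(chunk)
        if not preview and size > max_length:
            raise ValueError(f"resource exceeds {max_length} bytes")
        out.write(chunk)
        digest.update(chunk)
    return size, digest.hexdigest()


def should_push(new_hash, previous_hash):
    """Skip the datastore push when the file hash has not changed."""
    return new_hash != previous_hash
```

Here `chunks` is any iterable of byte strings, each at most CHUNK_SIZE long, so the same loop works regardless of which URL scheme produced them.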
Improve downloading by:
- adding the SFTP and S3 URL schemes to start with, and then incrementally adding other libcloud providers as required
- SFTP keys will be managed in the Datapusher+ Management Interface
- if PREVIEW_ROWS is true, downloading only the first PREVIEW_ROWS_SAMPLE_SIZE instead of the entire file and checking whether it contains enough PREVIEW_ROWS, then extending the sample by PREVIEW_ROWS_SAMPLE_SIZE divided by 2 until a sample of PREVIEW_ROWS is downloaded
- if PREVIEW_ROWS is false and the Content-Length header is less than MAX_CONTENT_LENGTH, downloading the file using HTTP/2 with brotli, gzip and deflate encoding, in that order of preference, direct to disk instead of streaming it and writing to disk in CHUNK_SIZE bytes
- for resources whose URL does not start with CKAN_SITE_URL (that is, links to third-party sites), using more robust, fault-tolerant downloading: logging broken links and adding the link to the DATAPUSHER_RETRY queue
- adding an In-Progress placeholder resource view for queued resources, with an optional link to the Datastore tab where the Datapusher+ log messages are displayed
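The incremental preview sampling proposed above might look something like the sketch below. It assumes a hypothetical `fetch_range(start, end)` callable that returns the bytes in `[start, end)` (for example via an HTTP Range request) and an empty byte string past the end of the resource. It also assumes the sample size is measured in bytes and that one header line precedes the data rows; neither is stated in the proposal. Newlines inside quoted CSV fields are not handled.

```python
def sample_preview(fetch_range, preview_rows, sample_size):
    """Fetch just enough of a resource to cover `preview_rows` rows.

    Starts with `sample_size` bytes, then grows the sample by half of
    `sample_size` at a time until enough complete lines are present.
    """
    needed = preview_rows + 1  # header line plus data rows (assumption)
    step = max(1, sample_size // 2)
    data = fetch_range(0, sample_size)
    while data.count(b"\n") < needed:
        more = fetch_range(len(data), len(data) + step)
        if not more:  # reached the end of the resource
            break
        data += more
    lines = data.splitlines(keepends=True)
    return b"".join(lines[:needed])
```

Passing the range fetcher in as a parameter keeps the sampling logic independent of the transport, so the same loop could serve HTTP, SFTP or S3 sources.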
These changes will:
- make sure we don't download unnecessarily large files when we only take the first N rows to create a PREVIEW in the datastore
- when we're not doing a PREVIEW, download files in the most efficient way, with HTTP/2 and compression
- allow better cataloging/datapushing of data hosted on third-party sites, with configurable retries
- improve the user experience with the In-Progress placeholder resource view, for both the Data Publisher and Data Users
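As a rough illustration of the third-party handling with configurable retries, the sketch below checks the URL scheme, retries third-party links a configurable number of times, logs each failure, and puts links that stay broken on a retry queue. The `download` callable, the `retry_queue` deque (a stand-in for the DATAPUSHER_RETRY queue), the example site URL and the default retry count are all assumptions for illustration.

```python
import logging
from collections import deque
from urllib.parse import urlparse

CKAN_SITE_URL = "https://ckan.example.org"  # assumed value
SUPPORTED_SCHEMES = {"http", "https", "ftp", "sftp", "s3"}
retry_queue = deque()  # stand-in for the DATAPUSHER_RETRY queue

log = logging.getLogger("download_sketch")


def is_third_party(url, site_url=CKAN_SITE_URL):
    """A resource is third-party if its URL does not start with the site URL."""
    return not url.startswith(site_url)


def fetch_with_retry(url, download, max_retries=3):
    """Call `download(url)`, retrying third-party links up to `max_retries` times.

    Returns the download result, or None if a third-party link stays
    broken, in which case the URL is queued for a later retry.
    """
    if urlparse(url).scheme not in SUPPORTED_SCHEMES:
        raise ValueError(f"unsupported URL scheme: {url}")
    attempts = max_retries if is_third_party(url) else 1
    last_error = None
    for attempt in range(1, attempts + 1):
        try:
            return download(url)
        except OSError as exc:
            last_error = exc
            log.warning("attempt %d/%d failed for %s: %s",
                        attempt, attempts, url, exc)
    if is_third_party(url):
        retry_queue.append(url)  # broken link: hand over to the retry queue
        return None
    raise last_error
```

A real implementation would likely add a delay between attempts; it is left out here to keep the sketch short.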