Better downloading of resources #46

@jqnatividad

Currently, DP+ downloads the resource as follows:

  • checks the Content-Length header against MAX_CONTENT_LENGTH to see whether the file is too large; this file size check is skipped if PREVIEW_ROWS is true
  • does chunked download in CHUNK_SIZE bytes
  • while doing chunked downloads, checks that the actual file size stays below MAX_CONTENT_LENGTH (unless PREVIEW_ROWS is true)
  • checks the hash of the downloaded file to see whether it has changed; if it hasn't, it skips pushing the file into the datastore
  • supports only the HTTP, HTTPS and FTP URL schemes
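For reference, the current flow described above can be sketched roughly as follows. This is a minimal illustration, not the actual DP+ code: `save_chunks`, `needs_push` and `FileTooLargeError` are hypothetical names, the chunks are assumed to arrive as an iterable of `bytes` (CHUNK_SIZE bytes each in the real download), and SHA-256 is an assumed choice since the hash algorithm is not named here.

```python
import hashlib
import io


class FileTooLargeError(Exception):
    """Raised when the downloaded size exceeds the configured limit."""


def save_chunks(chunks, out, max_content_length, preview_rows=False):
    """Write chunks to `out`, checking the running size and hashing as we go.

    Returns (total_bytes, hex_digest). The size check is skipped when
    preview_rows is true, mirroring the behavior described above.
    """
    hasher = hashlib.sha256()  # assumption: the real algorithm may differ
    total = 0
    for chunk in chunks:
        total += len(chunk)
        if not preview_rows and total > max_content_length:
            raise FileTooLargeError(f"{total} bytes exceeds {max_content_length}")
        out.write(chunk)
        hasher.update(chunk)
    return total, hasher.hexdigest()


def needs_push(new_hash, previous_hash):
    """Only push to the datastore when the file content has changed."""
    return new_hash != previous_hash


if __name__ == "__main__":
    buf = io.BytesIO()
    size, digest = save_chunks([b"a,b\n", b"1,2\n"], buf, max_content_length=1024)
    print(size, needs_push(digest, previous_hash=None))
```

The point of the sketch is that the size limit is enforced on bytes actually received, not only on the Content-Length header, which a server may omit or misreport.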

Improve downloading by:

  • adding SFTP and S3 URL schemes to start with, and then incrementally adding other libcloud providers as required
  • managing SFTP keys in the Datapusher+ Management Interface
  • if PREVIEW_ROWS is true, not downloading the entire file: download only the first PREVIEW_ROWS_SAMPLE_SIZE and check whether it contains PREVIEW_ROWS rows; if not, keep extending the sample by PREVIEW_ROWS_SAMPLE_SIZE divided by 2 until it contains PREVIEW_ROWS rows
  • if PREVIEW_ROWS is false and the Content-Length header is less than MAX_CONTENT_LENGTH, downloading the file over HTTP/2 with brotli, gzip and deflate encoding, in that order of preference, directly to disk instead of streaming it and writing it to disk in CHUNK_SIZE-byte chunks
  • for resources whose URL does not start with CKAN_SITE_URL (that is, links to third-party sites), making downloads more robust and fault-tolerant: logging broken links and adding them to the DATAPUSHER_RETRY queue
  • add an In-Progress placeholder resource view for queued resources, with optional link to the Datastore Tab where the Datapusher+ log messages are displayed
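The incremental preview sampling proposed above could look something like the sketch below. Everything here is an assumption for illustration: `sample_preview` and `read_range` are hypothetical names, `read_range(start, length)` stands in for whatever partial-read mechanism is used (for example an HTTP Range request), rows are counted by newline characters (so quoted fields with embedded newlines would be miscounted), and one extra row is allowed for a header.

```python
def sample_preview(read_range, preview_rows, sample_size):
    """Fetch only as many bytes as needed to hold `preview_rows` data rows.

    `read_range(start, length)` must return up to `length` bytes starting
    at offset `start`; a short read signals the end of the file.
    """
    data = read_range(0, sample_size)
    exhausted = len(data) < sample_size
    step = max(sample_size // 2, 1)  # grow by half the initial sample size
    wanted = preview_rows + 1  # assumption: first line is a header row

    while data.count(b"\n") < wanted and not exhausted:
        more = read_range(len(data), step)
        exhausted = len(more) < step
        data += more

    lines = data.splitlines(keepends=True)
    # Drop a trailing partial row unless we reached the end of the file.
    if lines and not exhausted and not lines[-1].endswith(b"\n"):
        lines.pop()
    return b"".join(lines[:wanted])


if __name__ == "__main__":
    source = b"id,name\n" + b"".join(b"%d,row%d\n" % (i, i) for i in range(100))
    preview = sample_preview(lambda s, n: source[s:s + n], preview_rows=5, sample_size=16)
    print(preview.decode())
```

In the demo the in-memory `source` plays the role of the remote file; only the header and the first five rows are returned, and far fewer bytes than the full file are read.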

These changes will:

  • make sure we don't download unnecessarily large files if we're only taking the first N rows to create a PREVIEW in the datastore
  • if we're not doing a PREVIEW, download files in the most efficient way, with HTTP/2 and compression
  • allow better cataloging/datapushing of data hosted on third-party sites with configurable retries
  • improve user experience with the In-Progress placeholder resource view - for both the Data Publisher and Data Users
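As a rough illustration of the third-party handling with configurable retries, the sketch below retries a download a fixed number of times, logs each failure, and queues the URL when all attempts fail. The names `is_third_party`, `fetch_or_queue`, `fetch` and `max_attempts` are hypothetical, a plain `deque` stands in for the DATAPUSHER_RETRY queue, and network failures are assumed to surface as `OSError`.

```python
import logging
from collections import deque

log = logging.getLogger("download_sketch")


def is_third_party(url, ckan_site_url):
    """A resource is third-party when its URL is not under CKAN_SITE_URL."""
    return not url.startswith(ckan_site_url)


def fetch_or_queue(url, fetch, retry_queue, ckan_site_url, max_attempts=3):
    """Return fetch(url), retrying third-party URLs up to max_attempts times.

    If every attempt fails, the broken link is logged and appended to
    retry_queue, and None is returned.
    """
    if not is_third_party(url, ckan_site_url):
        return fetch(url)  # local resources keep the existing behavior

    for attempt in range(1, max_attempts + 1):
        try:
            return fetch(url)
        except OSError as exc:  # assumption: network errors are OSError
            log.warning("attempt %d/%d failed for %s: %s",
                        attempt, max_attempts, url, exc)

    log.error("broken link, queued for retry: %s", url)
    retry_queue.append(url)
    return None


if __name__ == "__main__":
    def always_down(url):
        raise OSError("connection refused")

    queue = deque()
    fetch_or_queue("https://example.org/data.csv", always_down, queue,
                   ckan_site_url="https://ckan.example.com")
    print(list(queue))
```

A real implementation would likely add a delay or backoff between attempts; it is left out here to keep the sketch short.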

Labels

enhancement: New feature or request
