Archival Pipelines #3

Open · 7 tasks
pekasen opened this issue May 13, 2022 · 3 comments
Assignees: pekasen
Labels: enhancement (New feature or request), priority (Prioritize this one.)

pekasen commented May 13, 2022

As of yet, dabapush initializes pipelines solely by the reader's and writer's names; thus, a call like `dabapush run default` looks for a reader named `default` and a writer named `default`. The reader extracts all records, according to its programming, from the files matched by the specified path and glob pattern and passes these records to the writer.
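For illustration, a name-keyed configuration might look roughly like this; the fields shown are a hypothetical sketch, not dabapush's actual schema:

```yaml
# Hypothetical sketch only; field names are illustrative,
# not dabapush's actual configuration schema.
readers:
  default:
    type: NDJSONReader
    path: ./data
    pattern: "*.ndjson"
writers:
  default:
    type: CSVWriter
    output: ./out
```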

This design hinders archival pipelines: in an archival pipeline we want to have a dependency on the outcome of another pipeline, e.g. we want to archive all the files that have been successfully read by dabapush. Therefore, the input to this pipeline would not be a path/glob-pattern pair but rather the logged files of the already finished pipeline.

Giving the reader that functionality seems a bit spaghetti-like: it would overload the class with responsibilities unrelated to its core task of reading files and processing them into records that the writer objects can handle further.

The cleanest solution would be to enhance the pipelines further with a third object type, e.g. named `Attacher`. It would take over the responsibility of discovering and opening files for the reader, and through inheritance we can design multiple different Attachers: e.g. one that reads files from disk by means of a path and glob pattern, one that reads the log and filters for files from specific, already finished pipelines, or even ones that read remote files from S3 or SFTP.

Thus, a pipeline would include at least three objects: an Attacher, which decides which files to open; a reader, which extracts meaningful records from these files; and a writer, which persists these records. Initializing these three-piece pipelines can still be achieved by name only, so no changes to the structure of the configuration file format are necessary, although some fields must be moved from the reader configuration to an attacher configuration.
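To make this concrete, here is a minimal sketch of what the Attacher hierarchy could look like. All class, method, and field names are hypothetical, and the log is assumed to be a JSON-lines file; none of this is existing dabapush code:

```python
# Minimal sketch of the proposed Attacher hierarchy; all names are
# hypothetical and the log format is assumed to be JSON lines.
import json
from abc import ABC, abstractmethod
from pathlib import Path
from typing import IO, Iterator


class Attacher(ABC):
    """Discovers and opens files, handing them to a reader."""

    @abstractmethod
    def attach(self) -> Iterator[IO[str]]:
        ...


class GlobAttacher(Attacher):
    """Opens local files selected by a path and glob pattern."""

    def __init__(self, path: str, pattern: str) -> None:
        self.path = Path(path)
        self.pattern = pattern

    def attach(self) -> Iterator[IO[str]]:
        for file in self.path.glob(self.pattern):
            yield file.open("r")


class LogAttacher(Attacher):
    """Opens files that a finished pipeline has already logged."""

    def __init__(self, pipeline: str, log_path: str = "dabapush.log.jsonl") -> None:
        self.pipeline = pipeline
        self.log_path = log_path

    def attach(self) -> Iterator[IO[str]]:
        with open(self.log_path) as log:
            for line in log:
                entry = json.loads(line)
                # yield only files the given pipeline finished successfully
                if entry["pipeline"] == self.pipeline and entry["completed"]:
                    yield open(entry["file"], "r")
```

An S3 or SFTP Attacher would subclass the same interface, so the reader never needs to know where a file comes from.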

To summarize the new pipeline features:

  • pipelines should be able to read the logged files of another pipeline, e.g. to move already-read files from local storage to S3.
  • a new class, the Attacher, is responsible for file discovery and opening; the reader extracts meaningful records from the opened files.
  • files should only be logged once processing is complete and did not fail.
  • dabapush itself is responsible for ensuring safe processing of files and records and keeps the log, which relieves the Writer classes of this responsibility.
  • failed items should not crash the pipeline but rather be persisted to a special location, e.g. a file like `${date}-${pipeline}-malformed-objects.jsonl` (see the sketch after this list).
  • the failed-items log should be in a format that an Attacher can handle, so that its entries can be reprocessed accordingly.
  • therefore, the log items should be enhanced with a tag indicating which pipeline processed which file.
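To sketch the last three points: failed records could be appended to a JSON-lines file whose entries carry the pipeline tag, so that a log-reading Attacher can pick them up again later. The function name and entry fields below are illustrative, not an existing dabapush API:

```python
# Illustrative sketch: persist a failed record instead of crashing the
# pipeline, tagging it with the pipeline that produced it.
import json
from datetime import date


def persist_malformed(pipeline: str, record: dict, error: str) -> None:
    path = f"{date.today().isoformat()}-{pipeline}-malformed-objects.jsonl"
    with open(path, "a") as sink:
        sink.write(json.dumps({
            "pipeline": pipeline,  # tag: which pipeline processed this record
            "record": record,
            "error": error,
        }) + "\n")
```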
pekasen added the enhancement (New feature or request) and priority (Prioritize this one.) labels May 13, 2022
pekasen self-assigned this May 13, 2022

FlxVctr commented May 16, 2022

Couldn't an archiver be part of the generic writer class and simply switched on/off at instance creation (archiving=True/False)?

Edit: aah, I get it, you need all the information about the raw data from the reader. Right.


FlxVctr commented May 16, 2022

Another idea: why not have a 'Pipeline' class that contains the reader and the writer and therefore all necessary information? It could then have a property indicating whether it archives or not. It would get the info about what to archive from the writer and how to archive from the reader.
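Roughly, assuming hypothetical reader/writer interfaces, that could look like:

```python
# Rough sketch of the suggested Pipeline container; the reader/writer
# interfaces are hypothetical, not dabapush's actual classes.
class Pipeline:
    def __init__(self, reader, writer, archiving: bool = False):
        self.reader = reader
        self.writer = writer
        self.archiving = archiving

    def run(self) -> None:
        for record in self.reader.read():
            self.writer.write(record)
        if self.archiving:
            # what to archive comes from the writer, how to archive
            # from the reader, per the idea above
            self._archive()

    def _archive(self) -> None:
        ...  # archiving strategy deliberately left open in this sketch
```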


FlxVctr commented May 16, 2022

But I think I am a bit lost. A basic architecture diagram of how it works now and how it's supposed to work in your proposal would be helpful.
