As of yet, dabapush initializes pipelines solely by the reader's and writer's names; thus, a call like dabapush run default would look for a reader named 'default' and a writer named 'default'. The reader extracts all records according to its programming from the specified file and glob pattern and passes these records to the writer.
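For illustration, a minimal sketch of this name-based resolution, using hypothetical registries and config classes rather than dabapush's actual code:

```python
# A hypothetical sketch of name-based resolution, not dabapush's actual code:
# the same name selects both a reader and a writer configuration.
from dataclasses import dataclass


@dataclass
class ReaderConfig:
    name: str
    path: str     # base directory to read from
    pattern: str  # glob pattern, e.g. "*.jsonl"


@dataclass
class WriterConfig:
    name: str
    destination: str  # e.g. an output directory


readers = {"default": ReaderConfig("default", "data/", "*.jsonl")}
writers = {"default": WriterConfig("default", "out/")}


def run(pipeline_name: str) -> None:
    """Resolve reader and writer by the same name and run them as a pair."""
    reader = readers[pipeline_name]  # fails if no reader carries that name
    writer = writers[pipeline_name]  # fails if no writer carries that name
    print(f"reading {reader.path}{reader.pattern} -> writing to {writer.destination}")


run("default")
```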
This hinders archival pipelines: in an archival pipeline we want to have a dependency on the outcome of another pipeline, e.g. we want to archive all the files that have been successfully read by dabapush. Therefore, the input to this pipeline would not be a path/glob-pattern pair but rather the logged files of the already finished pipeline.
Giving the reader that functionality seems a bit spaghetti-like: it would overload the class with functionality unrelated to its actual job of reading files and processing them into records that the writer-class objects can process further.
The cleanest solution would be to enhance the pipelines further with a third object type, e.g. named Attacher. It would take over the responsibility of discovering and opening files for the reader, and through inheritance we can design multiple, different Attachers, e.g. for reading files from disk by means of a path and glob pattern, for reading the log and filtering for files from specific, already finished pipelines, or even for reading remote files from S3 or SFTP.
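A minimal sketch of what such an Attacher hierarchy could look like; the class and method names here are assumptions, not existing dabapush API:

```python
# Hypothetical Attacher hierarchy: discovery/opening is separated from reading.
import json
from abc import ABC, abstractmethod
from pathlib import Path
from typing import IO, Iterator


class Attacher(ABC):
    """Discovers and opens files; the reader only ever sees open file handles."""

    @abstractmethod
    def attach(self) -> Iterator[IO[str]]:
        """Yield open file objects for the reader to consume."""


class GlobAttacher(Attacher):
    """Discovers local files via a base path and a glob pattern."""

    def __init__(self, path: str, pattern: str) -> None:
        self.path = Path(path)
        self.pattern = pattern

    def attach(self) -> Iterator[IO[str]]:
        for file in sorted(self.path.glob(self.pattern)):
            with file.open("r", encoding="utf-8") as handle:
                yield handle


class LogAttacher(Attacher):
    """Discovers files by reading the log of an already finished pipeline."""

    def __init__(self, log_path: str, pipeline: str) -> None:
        self.log_path = Path(log_path)
        self.pipeline = pipeline

    def attach(self) -> Iterator[IO[str]]:
        with self.log_path.open("r", encoding="utf-8") as log:
            for line in log:
                entry = json.loads(line)
                # only files the named pipeline finished successfully
                if entry.get("pipeline") == self.pipeline and entry.get("status") == "completed":
                    with Path(entry["file"]).open("r", encoding="utf-8") as handle:
                        yield handle
```

An S3 or SFTP Attacher would follow the same pattern, yielding file-like objects from a remote source instead of the local disk.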
Thus, a pipeline would include at least three objects: an Attacher, which decides which files to open, a reader that extracts meaningful records from these files, and a writer that persists these records. Initializing these three-piece pipelines can still be achieved by name only, so no changes to the structure of the configuration file format are necessary, although some fields must be moved from the reader configuration to an attacher configuration.
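To picture the configuration change, here is a hypothetical before/after written as plain Python dicts (not dabapush's actual config schema): the discovery fields move from the reader section to a new attacher section, while names still tie the pieces together.

```python
# before: the reader carries both parsing options and file discovery
old_config = {
    "readers": {
        "default": {"type": "NDJSON", "path": "data/", "pattern": "*.jsonl"},
    },
    "writers": {
        "default": {"type": "CSV", "destination": "out/"},
    },
}

# after: discovery (path/pattern) lives in the attacher, the reader keeps
# only parsing concerns, and all three pieces are still selected by name
new_config = {
    "attachers": {
        "default": {"type": "Glob", "path": "data/", "pattern": "*.jsonl"},
    },
    "readers": {
        "default": {"type": "NDJSON"},
    },
    "writers": {
        "default": {"type": "CSV", "destination": "out/"},
    },
}
```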
To summarize the new pipeline features:
pipelines should be able to read logged files from another pipeline, e.g. to move already read files from local storage to S3.
another class, the Attacher, is responsible for file discovery and opening; the reader extracts meaningful records from the opened file.
a file should only be logged if its processing is complete and did not fail.
dabapush is responsible for ensuring safe processing of files and records and keeps the log, which relieves the Writer classes of this responsibility.
failed items should not crash the pipeline but rather be persisted into a special location, e.g. a file like ${date}-${pipeline}-malformed-objects.jsonl.
the failed-items log should be in a format that an Attacher is able to handle, so it can process the entries accordingly.
therefore the log items should be enhanced with a tag recording which pipeline processed which file (sketched below).
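A minimal sketch of the logging side of these points, assuming JSONL log entries and the field names shown; none of this is existing dabapush code:

```python
# Hypothetical log helpers: pipeline-tagged success entries and a malformed-objects sink.
import json
from datetime import date
from pathlib import Path


def log_completed_file(log_path: Path, pipeline: str, file: Path) -> None:
    """Append a log entry only after a file was processed completely and without failure."""
    entry = {
        "pipeline": pipeline,  # tag: which pipeline processed the file
        "file": str(file),
        "status": "completed",
        "date": date.today().isoformat(),
    }
    with log_path.open("a", encoding="utf-8") as log:
        log.write(json.dumps(entry) + "\n")


def persist_failed_item(pipeline: str, item: dict, reason: str) -> None:
    """Write a failed record to ${date}-${pipeline}-malformed-objects.jsonl instead of crashing."""
    sink = Path(f"{date.today().isoformat()}-{pipeline}-malformed-objects.jsonl")
    with sink.open("a", encoding="utf-8") as out:
        out.write(json.dumps({"reason": reason, "item": item}) + "\n")
```

Because both files are JSONL, a log-reading Attacher (as sketched earlier) could pick up either the completed-files log or the malformed-objects file and feed it into a follow-up pipeline.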
Another idea: why not have a Pipeline class that contains the reader and the writer and therefore all necessary information? It could then have a property indicating whether it is archiving or not. It gets the info about what to archive from the writer and how to archive from the reader.
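A minimal sketch of that idea; the Reader/Writer stand-ins and the archiving hook are assumptions drawn from the comment above, not existing dabapush code:

```python
# Hypothetical Pipeline class bundling reader and writer plus an archiving flag.
from dataclasses import dataclass
from typing import Iterable, Iterator, List


class Reader:
    """Stand-in reader: yields records and remembers which files it consumed."""

    def __init__(self, path: str) -> None:
        self.path = path
        self.read_files: List[str] = []

    def records(self) -> Iterator[dict]:
        yield from ()  # stand-in for actual parsing


class Writer:
    """Stand-in writer: persists records to a destination."""

    def __init__(self, destination: str) -> None:
        self.destination = destination

    def persist(self, records: Iterable[dict]) -> None:
        for _ in records:
            pass  # stand-in for actual persistence


@dataclass
class Pipeline:
    """Bundles reader and writer, so all information needed for archiving is in one place."""

    reader: Reader
    writer: Writer
    archiving: bool = False  # property: does this pipeline archive after a successful run?

    def run(self) -> None:
        self.writer.persist(self.reader.records())
        if self.archiving:
            # the archive step can draw on both sides: the files the reader touched
            # and the destination the writer knows about
            for file in self.reader.read_files:
                print(f"archiving {file} -> {self.writer.destination}")


# usage: compose and run a non-archiving pipeline
Pipeline(Reader("data/"), Writer("out/")).run()
```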