
Revised Protocol

Aidan Sawyer edited this page Mar 13, 2018 · 2 revisions

Introduction

I would not have made this program if I did not believe it was an improvement over our regular processes. It may not be right for every case, and it may very well be overkill for smaller collections, but using it should lead to a more agile, durable, and configurable long-term collection that can be easily parsed and searched, even within the local file system.

Benefits

Simplicity

Creating the text files is intuitive and simple. Delineations between elements and qualifiers are easy to see at a glance; there are no pages to scroll through, buttons to press, or text forms to tab through; and even the conversion between name formats (firstName lastName -> lastName, firstName) is handled for you. The limited subset of options doesn’t require any prior knowledge of the Dublin Core standard, but a look into the code or a run of the help command brings up (or will bring up) aid when it's needed.
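The name conversion mentioned above can be sketched in a few lines. This is only an illustration, not the program's actual code: the function name `to_sort_name` and the rule of treating the last word as the family name are assumptions.

```python
def to_sort_name(name: str) -> str:
    """Convert 'First [Middle] Last' to 'Last, First [Middle]'.

    Hypothetical helper: assumes the final word is the family name.
    """
    parts = name.split()
    # Single-word names and names that already contain a comma pass through
    # with only their whitespace normalized.
    if len(parts) < 2 or "," in name:
        return " ".join(parts)
    return f"{parts[-1]}, {' '.join(parts[:-1])}"


print(to_sort_name("Aidan Sawyer"))  # prints "Sawyer, Aidan"
```

Names with multi-word family names or suffixes (such as "Jr.") would need extra rules, which is the kind of per-collection tweak the program is meant to allow.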

Speed

When the physical items in question are well-formed enough to include an organized table of contents or title page, with enough detail and a suitable color scheme to produce a quality scan and accurate OCR, a huge amount of typing can be avoided: the text is simply copied and pasted into the text file. This benefit increases as the number of contributors grows.

Flexible Precision

Once all of the available specificity (Sports Editor, Copy Editor, News Editor) is encoded in the text file, the level of precision and the cataloging decisions can be changed at any point: from the direct or closest qualifier-level Dublin Core tags (e.g. dc.contributor.sportsEditor), to a more generic interpretation (dc.contributor.editor), up to element-level specificity (dc.contributor). The program is then simply run again. Because a significant amount of the Dublin Core standard is included and referenced in a hierarchical class structure within the program, it’s very easy to alter exactly which tags you’re using.
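The three levels of precision described above can be illustrated with a small sketch. It uses a plain lookup table as a stand-in for the program's hierarchical class structure; the names `Precision`, `ROLES`, and `tag_for` are hypothetical and are not taken from the program.

```python
from enum import Enum


class Precision(Enum):
    QUALIFIER = 1  # closest qualifier, e.g. dc.contributor.sportsEditor
    GENERIC = 2    # generic qualifier, e.g. dc.contributor.editor
    ELEMENT = 3    # element only, e.g. dc.contributor


# Hypothetical table: role as written in the text file ->
# (specific qualifier, generic qualifier)
ROLES = {
    "Sports Editor": ("sportsEditor", "editor"),
    "Copy Editor": ("copyEditor", "editor"),
    "News Editor": ("newsEditor", "editor"),
}


def tag_for(role: str, precision: Precision, element: str = "contributor") -> str:
    """Return the Dublin Core tag for a role at the requested precision."""
    specific, generic = ROLES[role]
    if precision is Precision.QUALIFIER:
        return f"dc.{element}.{specific}"
    if precision is Precision.GENERIC:
        return f"dc.{element}.{generic}"
    return f"dc.{element}"


for level in Precision:
    print(level.name, tag_for("Sports Editor", level))
```

Because the text file keeps the most specific role, changing the cataloging decision only means choosing a different precision level and rerunning; the input never has to be retyped.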

Control

Wherever it was appropriate and responsible, I tried to abstract decisions out of the program so that they can be customized by the user or supplied by a developer or expert. This affords the user complete control over input and output format and sorting rules.

Decreased Duplication

You’re still going to be performing a highly redundant task when creating the text files, but the number of duplicated fields has been reduced, not only via the shared.csv file but also by the ability to copy/paste from an OCR’ed table of contents or title page and avoid typing at all.

Configurable and Open Source

A main consideration in writing the software was ensuring that it could be configured and tweaked without altering the source code and that, whenever desired, the code would be easy to customize and alter, even on a per-collection basis. In many places, editing a matching rule or determining the metadata qualifier or element should only require altering one or two functions of <10 lines each. This reinforces user control and opens the program up to more users.

Extensible/Portable

The largest benefit of using this program is that it allows (or will allow) multiple forms of output from a single, customizable, plain text format that is platform and programming language independent. This allows the input to be created on almost any machine with any setup, sent quite simply in the body of an HTTP request, and turned into outputs that can be used by any number of alternative tools.

Reusable and Approachable

Keeping these text files has multiple benefits even beyond rerunning the application with different selected outputs for use in new applications. The text files themselves exist as a highly readable way to maintain the metadata of a given item co-located with the item itself.

Database Protection

Because the main function of the program is to create an item that will later be used for a batch import, no direct interaction with the database is required. The user does not need to be logged into the digital archive instance, or even to have any access at all, if the actual import of the outputs into the database is handed off to someone else. Batching and programmatic creation also lead to more consistent and arguably higher-quality inputs that are easier to trace back in time and/or roll back.

Drawbacks

Increased Start-up Overhead

Determining the header format and the level of specificity you’ll use in parsing and sorting contributors or other pieces will take some time and have a significant impact on the process. The resulting schema must then be kept in mind throughout the creation of the text files.

Increased Training Demand

Though the step between schema creation and program execution (creating the text files) is very simple, the other steps can require some prerequisite programming and/or Dublin Core knowledge. Granted, those parts could very well be handled by trained faculty, with only the middle step assigned to student workers.

Increased Need for Error Checking

One of the benefits of the program is that it allows OCR scans to be leveraged for data entry. These scans are rarely perfect, and tend to require fixing formatting or text-recognition errors (such as rn -> m, cl -> d, i -> 1, l -> I, etc.). Since names are also not yet cross-referenced, suggested, or predicted, there is a greater chance of typos as well.
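Some of these OCR misreads leave visible traces that a quick check can surface before the program is run. The sketch below is a heuristic only and is not part of the program; `flag_suspicious` and its patterns are assumptions. It catches a digit next to a letter (i -> 1) and a capital I following a lowercase letter (l -> I). Substitutions such as rn -> m or cl -> d produce ordinary-looking words and would need a name list or dictionary to detect.

```python
import re
from typing import List

# Heuristic patterns that often signal an OCR misread inside a word.
SUSPICIOUS = [
    re.compile(r"[A-Za-z][0-9]|[0-9][A-Za-z]"),  # digit next to a letter, e.g. "Sm1th"
    re.compile(r"[a-z]I"),                       # capital I after lowercase, e.g. "WiIliam"
]


def flag_suspicious(line: str) -> List[str]:
    """Return the whitespace-separated tokens that match any suspicious pattern."""
    return [tok for tok in line.split() if any(p.search(tok) for p in SUSPICIOUS)]


print(flag_suspicious("Sports Editor: WiIliam Sm1th"))  # prints ['WiIliam', 'Sm1th']
```

Legitimate names such as "McIntyre" will also be flagged, so the output is a list of words to review by eye rather than a list of confirmed errors.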

Formatting the Text

It can get tedious and complicated to be constantly reformatting OCRed text into a file, or even just recreating the files from scratch. This is hard to avoid and difficult to fix.

Difficult to Track Progress

Other than looking at file creation, there is no automated online way to track the progress of a worker through a collection.
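Looking at file creation can at least be scripted locally. The following is a rough sketch under stated assumptions, not a feature of the program: it assumes each scanned item is a `.pdf` and that its metadata text file sits next to it with the same base name and a `.txt` extension. The function name `progress` and both extensions are hypothetical.

```python
from pathlib import Path
from typing import Tuple


def progress(collection_dir: str, item_ext: str = ".pdf",
             meta_ext: str = ".txt") -> Tuple[int, int]:
    """Count items that already have a co-located metadata text file.

    Returns (items_with_metadata, total_items).
    """
    root = Path(collection_dir)
    # Walk the collection recursively and collect every item file.
    items = sorted(root.rglob(f"*{item_ext}"))
    # An item counts as done when a same-named text file exists beside it.
    done = [p for p in items if p.with_suffix(meta_ext).exists()]
    return len(done), len(items)


# Example use (the path is a placeholder):
# done, total = progress("path/to/collection")
# print(f"{done} of {total} items have metadata files")
```

This only reports whether a text file exists, not whether it is complete or correct, so it remains a coarse substitute for real progress tracking.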