BibHarvest: configurable workflow with refextract
* Changed OAI Harvest admin interface for post-process
  selection from dropdown to checkbox style.

* Messages/Warnings in the OAI Harvest admin interface
  are now displayed as such.

* Added a check that the bibfilter-script path is valid when
  'filter' (f) step is selected in OAI Harvest admin interface.

* Changed 'extract' post-process to 'extract plots' (p) and
  added two new post-processes: 'extract references' (r) and
  'attach full-text' (t).

* Harvesting post-process modes are now generated and processed
  on-the-fly, based on the user's choice in the OAI Harvest admin
  interface.

* Separated downloading of materials (tarballs, pdf) from the various
  steps in order to download required materials only once per OAI record,
  i.e. the full-text pdf is downloaded only once even if several steps
  require it. Modified the plotextractor_getter slightly to
  allow this behaviour.

* Improved feedback messages from the OAI Harvest daemon; error
  messages now flow all the way to the Bibtask output logs for each
  post-process step.

* Updated OAI Admin Guide to reflect latest changes.
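
The download-once behaviour described in the message above can be sketched roughly as follows. This is a minimal illustration and not the code from this commit: the step-to-material mapping is inferred from the step descriptions in the admin guide, and `REQUIRED_MATERIAL`, `run_postprocess` and the `fetch` callback are hypothetical names introduced only for this example.

```python
# Hypothetical sketch: all names here are illustrative, not part of BibHarvest.

# Which downloaded material each post-process step needs (assumed mapping:
# 'r' and 't' work on the full-text pdf, 'p' works on the source tarball).
REQUIRED_MATERIAL = {
    "r": "pdf",      # extract references
    "t": "pdf",      # attach full-text
    "p": "tarball",  # extract plots
}


def run_postprocess(record_id, selected_steps, fetch):
    """Walk the selected steps for one record, fetching each material once.

    `fetch(record_id, material)` is a caller-supplied download function that
    returns a local path. The result is a list of (step, local_path) pairs
    for the steps that need downloaded material.
    """
    cache = {}  # material name -> local path, valid for this record only
    results = []
    for step in selected_steps:
        material = REQUIRED_MATERIAL.get(step)
        if material is None:
            continue  # steps such as 'c', 'f' and 'u' need no download
        if material not in cache:
            cache[material] = fetch(record_id, material)
        results.append((step, cache[material]))
    return results
```

Keeping the cache local to one call means nothing is shared between OAI records, which matches the "once per OAI record" wording of the message.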
jalavik authored and tiborsimko committed Feb 28, 2011
1 parent a76516a commit 2a32ba8
Showing 6 changed files with 785 additions and 550 deletions.
78 changes: 59 additions & 19 deletions modules/bibharvest/doc/admin/oai-admin-guide.webdoc
@@ -56,13 +56,30 @@ Once defined, run the <code>oaiharvest</code> command-line tool to start periodi
<li><strong>Sets</strong>: upon validation the baseURL is checked for the available sets of the repository. The administrator can choose the desired sets to harvest. If none is checked, all records are harvested. Otherwise selective harvesting for checked sets is done.</li>
<li><strong>Frequency</strong>: how often do you intend to harvest from this repository? (Note: selecting "never" excludes this source from automatic periodical harvesting)</li>
<li><strong>Starting date</strong>: on your first harvesting session, do you intend to harvest the whole repository (from beginning) or only newly added material (from today)? (WARNING: harvesting large collections of material may take very long!)</li>
<li><strong>Postprocess</strong>: how is the harvested material be dealt with after harvesting?
<ul><li>h: harvest only - metadata will be gathered and saved locally<li>h-u: harvest and upload - metadata will be automatically queued for upload (this is useful when harvested metadata is already in marcxml format)</li><li>h-c: harvest and convert - harvested metadata will be converted and saved locally</li><li>h-c-u: harvest, convert and upload - harvested metadata will be converted and queued for upload - this is useful when harvested metadata is in a format different from marcxml (e.g. oai_dc)</li><li>h-c-f-u: harvest, convert, filter, upload - useful when you want to filter harvested data in some non-trivial way before uploading</li></ul></li>
<li><strong>Postprocess</strong>: how should the harvested material be dealt with after harvesting? You can choose among several actions to be taken after harvesting records:
<ul>
<li>c: convert - harvested metadata will be converted (given BibConvert config file) and saved locally - this is useful when harvested metadata is in a format different from marcxml (e.g. oai_dc)</li>
<li>r: extract references - attempt to download full-text pdf for each record and run refextract to add references to resulting MARCXML</li>
<li>p: extract plots - attempt to download material (like tarballs) for each record and extract plots (only from LaTeX sources) to attach with FFT to resulting MARCXML</li>
<li>t: attach full-text - attempt to download full-text for each record and add it as FFT upload in resulting MARCXML</li>
<li>f: filter - useful when you want to filter harvested data in some non-trivial way before uploading</li>
<li>u: upload - metadata will be automatically queued for upload (this is useful when harvested metadata is already in marcxml format)</li>
</ul>Note: Select none of the above to only harvest records
</li>
<li><strong>BibConvert configuration file</strong>: if postprocess involves conversion, the filename of an existing BibConvert configuration file (or its full path if file is not in standard BibConvert config directory) is needed.
This file determines how the metadata conversion will take place (e.g. oai_dc to marcxml).
In order to ensure that the configuration file performs the desired conversion, please test it in advance and follow the instruction contained in the <a href="<CFG_SITE_URL>/help/admin/bibconvert-admin-guide<lang:link/>">BibConvert Admin Guide</a>.
Please note that some conversion parameters previously input at the command line can now be handled by the BibConvert Configuration Language (namely record separator, file header and file footer). Please consult the <a href="<CFG_SITE_URL>/help/admin/bibconvert-admin-guide<lang:link/>">BibConvert Admin Guide</a> for more information on this.
This allows fully automatised harvesting and conversion (and possibly upload) of records provided a valid, functioning, BibConvert template is provided.</li>
<li><strong>BibFilter program file</strong>: if postprocess involves filtering, the full path of an existing BibFilter script is needed.
This file determines how the harvested records should be filtered. It should produce 4 different MARCXML outputs:
<ul>
<li>*.insert.xml: full records will be uploaded with "-i" parameter, new record</li>
<li>*.append.xml: record snippets will be uploaded with "-a" parameter, record additions</li>
<li>*.correct.xml: record snippets will be uploaded with "-c" parameter, record corrections</li>
<li>*.holdingpen.xml: full records will be uploaded with "-o" parameter, to Holding Pen</li>
</ul>
Please refer to <a href="<CFG_SITE_URL>/help/admin/bibupload-admin-guide<lang:link/>">BibUpload Admin Guide</a> for more information on BibUpload parameters. In order to ensure that the program performs the desired actions, please test it in advance.
</ul>
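
The four BibFilter output files listed in the guide text above each map to one BibUpload parameter. A minimal sketch of that mapping is shown here; the suffixes and flags come from the guide text, while `UPLOAD_MODE_BY_SUFFIX` and `upload_mode_for` are hypothetical names used only for illustration and are not part of BibHarvest.

```python
import os

# Output file suffix -> BibUpload parameter, as listed in the admin guide.
UPLOAD_MODE_BY_SUFFIX = {
    ".insert.xml": "-i",      # full records, new record
    ".append.xml": "-a",      # record snippets, record additions
    ".correct.xml": "-c",     # record snippets, record corrections
    ".holdingpen.xml": "-o",  # full records, to Holding Pen
}


def upload_mode_for(path):
    """Return the BibUpload parameter for a BibFilter output file.

    Returns None when the file name matches none of the four suffixes.
    """
    name = os.path.basename(path)
    for suffix, mode in UPLOAD_MODE_BY_SUFFIX.items():
        if name.endswith(suffix):
            return mode
    return None
```

A filter script author could use such a lookup to check that every file the script writes will be picked up with the intended upload mode.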

<b>Edit and delete OAI sources</b><br />
@@ -119,25 +136,48 @@ All have to be done in order to achieve this is selecting appropriate records by
<b>Oaiharvest usage</b>
<blockquote>
<pre>
oaiharvest [options]

Specific options:
-r, --repository=REPOS_ONE,"REPOS TWO" name of the OAI repositories to be harvested (default=all)
-d, --dates=yyyy-mm-dd:yyyy-mm-dd harvest repositories between specified dates (overrides repositories' last updated timestamps)
oaiharvest [options]

Manual single-shot harvesting mode:
-o, --output specify output file
-v, --verb OAI verb to be executed
-m, --method http method (default POST)
-p, --metadataPrefix metadata format
-i, --identifier OAI identifier
-s, --set OAI set(s). Whitespace-separated list
-r, --resuptionToken Resume previous harvest
-f, --from from date (datestamp)
-u, --until until date (datestamp)
-c, --certificate path to public certificate (in case of certificate-based harvesting)
-k, --key path to private key (in case of certificate-based harvesting)
-l, --user username (in case of password-protected harvesting)
-w, --password password (in case of password-protected harvesting)

Automatic periodical harvesting mode:
-r, --repository="repo A"[,"repo B"] which repositories to harvest (default=all)
-d, --dates=yyyy-mm-dd:yyyy-mm-dd reharvest given dates only

Scheduling options:
-u, --user=USER User name under which to submit this task.
-t, --runtime=TIME Time to execute the task. [default=now]
Examples: +15s, 5m, 3h, 2002-10-27 13:57:26.
-s, --sleeptime=SLEEP Sleeping frequency after which to repeat the task.
Examples: 30m, 2h, 1d. [default=no]
-L --limit=LIMIT Time limit when it is allowed to execute the task.
Examples: 22:00-03:00, Sunday 01:00-05:00.
Syntax: [Wee[kday]] [hh[:mm][-hh[:mm]]].
-P, --priority=PRI Task priority (0=default, 1=higher, etc).
-N, --name=NAME Task specific name (advanced option).

General options:
-h, --help Print this help.
-V, --version Print version information.
-v, --verbose=LEVEL Verbose level (0=min, 1=default, 9=max).
--profile=STATS Print profile information. STATS is a comma-separated
list of desired output stats (calls, cumulative,
file, line, module, name, nfl, pcalls, stdname, time).

Scheduling options:
-u, --user=USER user name to store task, password needed
-s, --sleeptime=SLEEP time after which to repeat tasks (no)
e.g.: 1s, 30m, 24h, 7d
-t, --time=TIME moment for the task to be active (now)
e.g.: +15s, 5m, 3h , 2002-10-27 13:57:26

General options:
-h, --help print this help and exit
-V, --version print version and exit
-v, --verbose=LEVEL verbose level (from 0 to 9, default 1)
</pre>
<em>Note that oaiharvest has additional parameters for manual "single-shot" harvesting. Run <code>oaiharvest -h</code> for more details about this mode.</em>
</blockquote>

<code>oaiharvest</code> performs a number of operations on the repositories listed in the database. By default <code>oaiharvest</code> considers all repositories, one by one (this gets overridden when <code>--repository</code> argument is passed).
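
As an illustration of the automatic periodical harvesting options shown in the usage text, the sketch below assembles an `oaiharvest` argument list. Only the long options `--repository`, `--dates` and `--sleeptime` are taken from the help output; the helper name `build_oaiharvest_argv` is hypothetical, and joining repository names with commas without extra quoting is an assumption about how the tool parses the value when no shell is involved.

```python
def build_oaiharvest_argv(repositories=None, dates=None, sleeptime=None):
    """Build an argument list for oaiharvest in periodical harvesting mode.

    repositories: list of repository names (omit to harvest all)
    dates:        (start, end) tuple of 'yyyy-mm-dd' strings
    sleeptime:    repeat interval such as '30m', '2h' or '1d'
    """
    argv = ["oaiharvest"]
    if repositories:
        # Assumption: names are comma-separated inside a single option value.
        argv.append("--repository=" + ",".join(repositories))
    if dates:
        start, end = dates
        argv.append("--dates=%s:%s" % (start, end))
    if sleeptime:
        argv.append("--sleeptime=" + sleeptime)
    return argv
```

The resulting list could be passed to `subprocess.call` on a machine where `oaiharvest` is installed; passing a list rather than a shell string avoids having to quote repository names that contain spaces.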
53 changes: 27 additions & 26 deletions modules/bibharvest/lib/bibharvest_templates.py
@@ -80,11 +80,11 @@ def tmpl_output_numbersources(self, ln, numbersources):
_ = gettext_set_language(ln)
present = _("OAI sources currently present in the database")
notpresent = _("No OAI sources currently present in the database")
if (numbersources>0):
output = """&nbsp;&nbsp;&nbsp;&nbsp;<strong><span class="info">%s %s</span></strong><br /><br />""" % (numbersources, present)
if (numbersources > 0):
output = """&nbsp;&nbsp;&nbsp;&nbsp;<strong><span class="info">%s %s</span></strong><br /><br />""" % (numbersources, present)
return output
else:
output = """&nbsp;&nbsp;&nbsp;&nbsp;<strong><span class="warning">%s</span></strong><br /><br />""" % notpresent
output = """&nbsp;&nbsp;&nbsp;&nbsp;<strong><span class="warning">%s</span></strong><br /><br />""" % notpresent
return output

def tmpl_output_schedule(self, ln, schtime, schstatus):
@@ -94,12 +94,12 @@ def tmpl_output_schedule(self, ln, schtime, schstatus):
msg_cur = _("current status:")
msg_notask = _("No oaiharvest task currently scheduled.")
if schtime and schstatus:
output = """&nbsp;&nbsp;&nbsp;&nbsp;<strong>%s<br />
output = """&nbsp;&nbsp;&nbsp;&nbsp;<strong>%s<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; - %s %s <br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; - %s %s </strong><br /><br />""" % (msg_next, msg_sched, schtime, msg_cur, schstatus)
return output
else:
output = """&nbsp;&nbsp;&nbsp;&nbsp;<strong><span class="warning">%s</span></strong><br /><br />""" % msg_notask
output = """&nbsp;&nbsp;&nbsp;&nbsp;<strong><span class="warning">%s</span></strong><br /><br />""" % msg_notask
return output

def tmpl_admin_w200_text(self, ln, title, name, value, suffix="<br/>"):
@@ -115,21 +115,22 @@ def tmpl_admin_w200_text(self, ln, title, name, value, suffix="<br/>"):
""" % (cgi.escape(name, 1), cgi.escape(value, 1), suffix)
return text

def tmpl_admin_checkboxes(self, ln, title, name, values, labels, states):
def tmpl_admin_checkboxes(self, ln, title, name, values, labels, states, message=""):
"""Draws a list of HTML checkboxes
- 'title' *string* - The name of the list of checkboxes
- 'name' *string* - The name for this group of checkboxes
- 'values' *list* - The values of the checkboxes
- 'labels' *list* - The labels for the checkboxes
- 'states' *list* - The previous state of each checkbox 1|0
- 'message' *string* - Put a message over the checkboxes. Optional.
len(values) == len(labels) == len(states)
"""
_ = gettext_set_language(ln)
text = """<div><div style="float:left;"><span class="adminlabel">%s</span></div>""" % title
text += """<table><tr><td>"""
text += '&nbsp;&nbsp; <small><i>(Leave all unchecked for non-selective' \
' harvesting)</i></small><br/>'
if message:
text += '&nbsp;&nbsp; <small><i>(%s)</i></small><br/>' % (message,)

for i in range(len(values)):
value = values[i]
@@ -181,7 +182,7 @@ def tmpl_print_info(self, ln, infotext):
def tmpl_print_warning(self, ln, warntext):
"""Outputs some info"""
_ = gettext_set_language(ln)
text = """<span class="warning">%s</span>""" % warntext
text = """<br /><span class="warning">%s</span>""" % warntext
return text

def tmpl_print_brs(self, ln, howmany):
@@ -230,7 +231,7 @@ def tmpl_output_table(self, title_row=None, data=None):
data = []
result = "<table><tr>"
for header in title_row:
result += "<td><b>"+ header + "</b></td>"
result += "<td><b>" + header + "</b></td>"
result += "</tr>"
for row in data:
result += "<tr>"
@@ -240,12 +241,12 @@ def tmpl_output_table(self, title_row=None, data=None):
result += "</table>"
return result

def tmpl_table_begin(self, title_row = None):
def tmpl_table_begin(self, title_row=None):
result = "<table class=\"brtable\">"
if title_row != None:
result += "<tr>"
for header in title_row:
result += "<td><b>"+ header + "</b></td>"
result += "<td><b>" + header + "</b></td>"
result += "</tr>"
return result

@@ -272,7 +273,7 @@ def tmpl_table_end(self):
def tmpl_history_day_details_link(self, ln, date, oai_src_id):
"""Return link to detailed history for the day"""
_ = gettext_set_language(ln)
return create_html_link(urlbase = oai_harvest_admin_url + \
return create_html_link(urlbase=oai_harvest_admin_url + \
"/viewhistoryday",
urlargd={'ln':ln,
'oai_src_id': str(oai_src_id),
@@ -282,11 +283,11 @@ def tmpl_history_day_details_link(self, ln, date, oai_src_id):
'start': str(10)},
link_label=_("View next entries..."))

def tmpl_history_table_output_day_cell(self, date, number_of_records, oai_src_id, ln, show_details = False):
def tmpl_history_table_output_day_cell(self, date, number_of_records, oai_src_id, ln, show_details=False):
inner_text = "<b>" + self.format_date(date)
inner_text += " ( "+ str(number_of_records) + " entries ) &nbsp;&nbsp;&nbsp;"
inner_text += " ( " + str(number_of_records) + " entries ) &nbsp;&nbsp;&nbsp;"
if show_details:
inner_text += self.tmpl_history_day_details_link(ln, date, oai_src_id)
inner_text += self.tmpl_history_day_details_link(ln, date, oai_src_id)
inner_text += "</b>"
return self.tmpl_table_output_cell(inner_text, colspan=6)

@@ -312,7 +313,7 @@ def tmpl_output_normal_frame(self, content):
output += "</div>"
return output

def tmpl_output_month_selection_bar(self, oai_src_id, ln, current_year = None, current_month=None):
def tmpl_output_month_selection_bar(self, oai_src_id, ln, current_year=None, current_month=None):
"""constructs the month selection bar"""
_ = gettext_set_language(ln)
if current_month == None or current_year == None:
@@ -427,7 +428,7 @@ def tmpl_output_identifiers(self, identifiers):
return result

def tmpl_output_select_day_button(self, day):
result = """<button class="adminbutton" onClick="return selectDay(""" + str(day)+ """)">Select</button>"""
result = """<button class="adminbutton" onClick="return selectDay(""" + str(day) + """)">Select</button>"""
return result

def tmpl_output_menu(self, ln, oai_src_id, guideurl):
@@ -437,34 +438,34 @@ def tmpl_output_menu(self, ln, oai_src_id, guideurl):
_ = gettext_set_language(ln)
link_default_argd = {'ln': ln}

main_link = create_html_link(urlbase = oai_harvest_admin_url,
main_link = create_html_link(urlbase=oai_harvest_admin_url,
urlargd=link_default_argd,
link_label=_("main Page"))

link_default_argd['oai_src_id'] = str(oai_src_id)

edit_link = create_html_link(urlbase = oai_harvest_admin_url + \
edit_link = create_html_link(urlbase=oai_harvest_admin_url + \
"/editsource",
urlargd=link_default_argd,
link_label=_("edit"))
delete_link = create_html_link(urlbase = oai_harvest_admin_url + \
delete_link = create_html_link(urlbase=oai_harvest_admin_url + \
"/delsource",
urlargd=link_default_argd,
link_label=_("delete"))
test_link = create_html_link(urlbase = oai_harvest_admin_url + \
test_link = create_html_link(urlbase=oai_harvest_admin_url + \
"/testsource",
urlargd=link_default_argd,
link_label=_("test"))
hist_link = create_html_link(urlbase = oai_harvest_admin_url + \
hist_link = create_html_link(urlbase=oai_harvest_admin_url + \
"/viewhistory",
urlargd=link_default_argd,
link_label=_("history"))
harvest_link = create_html_link(urlbase = oai_harvest_admin_url + \
harvest_link = create_html_link(urlbase=oai_harvest_admin_url + \
"/harvest",
urlargd=link_default_argd,
link_label=_("harvest"))
separator = "&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;"
menu_header = self.tmpl_draw_titlebar(ln, title = "Menu" , guideurl = guideurl)
menu_header = self.tmpl_draw_titlebar(ln, title="Menu" , guideurl=guideurl)
# Putting everything together
result = menu_header + main_link
if oai_src_id != None:
@@ -518,7 +519,7 @@ def tmpl_view_holdingpen_body(self, filter, content):
<input type="submit" value="Show"></input>
</form>
</div>
<ul id="holdingpencontainer"> %s </ul>""" %(filter, content, )
<ul id="holdingpencontainer"> %s </ul>""" % (filter, content,)


def tmpl_view_holdingpen_headers(self):
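
The main functional change in this file is the optional `message` argument of `tmpl_admin_checkboxes`, which replaces the previously hard-coded hint about non-selective harvesting. The standalone sketch below shows the idea under stated assumptions: the diff does not show the rest of the method body, so the checkbox markup here is invented for illustration, `render_checkboxes` is a hypothetical name, and `html.escape` is used in place of the `cgi.escape` call seen elsewhere in the original module.

```python
from html import escape


def render_checkboxes(title, name, values, labels, states, message=""):
    """Render a group of HTML checkboxes with an optional hint above them.

    values, labels and states must have the same length; a truthy state
    marks the corresponding checkbox as checked.
    """
    if not (len(values) == len(labels) == len(states)):
        raise ValueError("values, labels and states must have equal length")

    parts = ['<div><span class="adminlabel">%s</span>' % escape(title)]
    if message:
        # Only emit the hint when the caller supplies one.
        parts.append("<small><i>(%s)</i></small><br/>" % escape(message))
    for value, label, state in zip(values, labels, states):
        checked = ' checked="checked"' if state else ""
        parts.append(
            '<input type="checkbox" name="%s" value="%s"%s /> %s<br/>'
            % (escape(name, quote=True), escape(value, quote=True),
               checked, escape(label))
        )
    parts.append("</div>")
    return "\n".join(parts)
```

Making the hint a parameter lets the same template serve both the set selection, where the non-selective harvesting hint applies, and the new post-process checkboxes, where a different hint or none at all is appropriate.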