PriorDesign

Questions needing to be answered are in red.

Ben's implementation design

Diffs to Dave's Design

distinction between "Records to Process" and "Processed Records" among the "Input Records" is no more of a distinction than in any other service (the MST platform keeps tracks of which records it has already processed)
Output Records never contain uplinks to Input Records. They always contain uplinks to Output Record ids. If an Output Record contains an uplink to a record that hasn’t been processed yet, a recordId is set aside for it. That Output Record will be marked as held until the linked record arrives.
Match algorithm isn’t really a record-by-record comparison. Instead it checks the collection of all matchpoint ids the service has already processed. In some sense it is a record-by-record comparison - it's just a different way of looking at it.

First Iteration

Ability to hold all necessary data in memory * no one-off database lookups are necessary
Merge algorithm randomly selects a record (most likely the last in)

Subsequent Iterations

smarter merging either by: * selecting which record to use * actually aggregating records
to support the above feature, it might make sense to not actually store the output xml (because of ids), but generate records on the fly for downstream oai-pmh requests

Plan for Implementation

add tasks to fogbugz and provide estimates (will be more accurate than the total guesses below). The estimates below assume 100% devotion to aggregation service.
enhance documentation * estimate: 3 days
implement first iteration * estimate: 1 week
create tests for each of Dave’s scenarios (pp 3-4 of the ppt) * estimate: 1 weeks
Run against full record set from voyager * determine how many records matched * determine performance specifically of the merge process * estimate: 1 weeks
Run against full record set from voyager + irplus * find the above determinations as well * estimate: 3 days
go back to step #1 for subsequent iteration (more advanced merge algorithm)

From Dave's PPT

I think it's important to go through Dave's implementation line-by-line to make sure the actual implementation doesn't overlook anything Dave has already thoroughly considered. BDA: my notes/diffs in blueWhat do we think about this approach? Is it worth continuing?
PROCESS Record: Take an input (W, E, M, or H) record

search for matching records by comparing it with each (all) of the "processed records" at the same FRBR level, using the match algorithm for the appropriate FRBR level (four match algorithms),
- BDA: search for matching records by checking the matchpoint caches*** for each match, check the successor of that record to see if it is a merged record (if it has multiple predecessors, then it is merged), and include the other predecessors in the found set (no need to compare with those, since they will definitely match). BDA: The matchpoint cache will contain the incoming record ids that match. With those incoming record ids, you can find all the output record ids. From there you can check the "records with multiple predecessors cache" to get any other predecessors. (This is important if the record currently being processed (A) matches record X and X matches Y, but A doesn't match Y. Without doing all these steps, you'd fail.) For any matching processed record (merged or not), mark it's successor output record as deleted, and temporarily keep track of the corresponding predecessor(s) for remerge. NOTE: If the marked-deleted record is in Held Records, just permanently delete it, if it is in Released Records, mark it deleted, and follow DELETE OUTPUT algorithm for that record. BDA: If the record is to be merged, instead of deleting the existing successor records for matches, we'll just overwrite one of them. If there are more than one, all but one will be deleted. This will prevent us from having to update uplinks in child records. There will still be a need to update child records if there were multiple successors (and thus required a delete). To accomplish this, we'd need to reprocess that record individually or just get the output and modify as needed (the latter seems to make more sense). Continue matching until you have done the comparison with all records in the processed records set. Then run the group of matched records through the appropriate merge algorithm (there are four merge algorithms, one for each frbr level), producing one new output record. BDA: For the first implementation we'll just use the current record as the merged record (with modified uplinks and xc:recordIds). Set predecessor links between that record and the matched set of input records BDA: update all caches accordingly (caches are periodically flushed to the db) Uplinks in the merged output record are the set of the uplinks in the matched input records (unique, but we can't check that yet, because two uplinks that look different could point to the same merged entity at a higher level, so we dedup later, once). BDA: uplinks in output records will always be to other output records (which may be non-existent or held). This way the output record content won't need to be updated later on - just switched from HELD to ACTIVE Check if this record can be released: If EVERY uplink points to a parent record that is BOTH in the "processed records" set and has a successor in the Released Records set, then we can proceed to the RELEASE Record steps below. If not, then we add a row to the WAITING FOR RELEASE QUEUE (record id, and parent OAI-ID).**
- BDA: If every parent is non-held (check the record status cache), then the record will be marked as active. Otherwise, it'll be held.*** Move currently processing record into Processed Records set**

RELEASE Record (this algorithm handles one Held Record that is "ready for release" by modifying it's uplinks and then releasing it):

Rules: All uplinks in Held records will point to records in the Processed Records set. All uplinks in Released Records will point to records in the Released Records set.
- BDA: All output records will link to other output records. If an output record points to a record that doesn't yet exist (hasn't come into the agg service yet), then it will be marked as held.*** Modify uplinks in the Held record by following each parent records successor link and then modifying each uplinks to point to there (which will always be a Released Record). BDA: This step isn't necessary (has already been addressed). Deduplicate the modified uplinks (it is important to do that now AFTER we have rewritten the links) - use simple string comparison. BDA: yes - dedupe. Move record from Held Records set into the Released Records set. BDA: Mark record as "active" Check the WAITING FOR RELEASE queue to see if any records are waiting on this record. If so, record the Held-Record-ID from this table, and remove the corresponding row. Now, query the table to see if this Held-Record-ID is waiting on any OTHER parents (some records will be waiting on multiple records). If it is not waiting on any others, run RELEASE Record on this Held-Record-ID (save this to a TO-RELEASE queue until the next step is complete) BDA: Query ORPHANS table using the record-id of the predecessor of the record being released. For each record found, examine child uplinks and rewrite each one that used to point to the predecessor, and rewrite them to point to the newly Released record. Deduplicate uplinks in the child record (DO NOT SKIP THIS STEP).
Continue to RELEASE any records added to the TO-RELEASE queue.**

DELETE OUTPUT Record:

Look for children (in "Released Records" set) that point to this record. They now point to a deleted record (which is okay temporarily, but the deleted record id cannot be reused, because the children will point to this deleted record id for an indeterminate amount of time, and downstream services would be corrupted if reused)
In the ORPHANS table, save the child's record-id also save the ID of the marked-deleted parent records predecessor

Downloads
Installing the Toolkit
- Hardware Requirements
- Installing 3rd Party Tools
- Installing the Metadata Services Toolkit
  - In Windows
  - In Unix
- Configuring
  - Configuring the MST
  - Configuring Tomcat
  - Configuring MySQL
    - MySQL Permissions
    - MySQL Configurations
  - Configuring server
- Starting the MST
  - In Windows
  - In Unix
- Uninstalling and Reinstalling the MST
- Upgrading the MST
- Useful Info
Using the Toolkit
Services
- What is a service?
- What are Configuration 1 and Configuration 2?
- XC MARCXML Normalization
- MARCXML to XC Transformation
- DC to XC Transformation
  - Mappings
  - Example Input and Output Records
- MARC Aggregation
- Multiple Instances of the Same Service
  - How to install multiple instances of the same service
- Harvesting from an MST Service
  - How to harvest from an MST Service
How To Implement a Service
- Quick and Dirty Tutorial
- Details on the process method
- Testing your service
- AdvancedFeatures
- Contribute to a core service
About the XC Schema
MST Frequently Asked Questions
Performance Results
- RecordBreakdown
- MySQLCustomizations
Release Notes
Next Coding Period Summary
Glossary
Developer ScratchPad
- ServerChart
- Transformation 1.0
  - TransformationDocumentationNotes
  - new
    - TransformationDocumentation
  - old
    - AdditionalWorksAndExpressions
    - Transformation Service Documentation
    - TransformationServiceSteps
    - XcRoleTranslationTable
- AggregationServices
  - MarcAggregation
    - MySQL Tuning for MAS
    - Scratch Pad
  - TransformationTwoPointOh
  - old
    - FirstIteration
    - PriorDesign
- PackagingMST
- 1.0 Decisions
  - RepositoryUpdatesDeletes
  - RecordCountProblems
  - UIChanges
  - ServiceUpdates
  - LogsUI
- ReleaseWork
- QuickInstallNotes
- MST Implementation Details
  - OaiIdIndexAlgorithm
  - CacheDetails
  - MessageHandling
  - ServiceTests
  - ProcessingStepsExplained
  - ResumptionToken->completeListSize
  - UpdateDelete
  - OaiPmhImpl
- record counts
  - RecordCountsOnePtTwoPtOne
  - in production
  - how to log and display
  - RecordCountsOnePtZero
  - RecordCountTestRestarted
  - UrRecordCounts
  - RecordCountTesting
  - TransformationWackiness
- OaiImplementation
- Testing
  - randys-30
  - RegressionTests
- QuickRef
- UnicodeNormalization
- LoggingHelp
- CodeFormatPolicy
- SvnBranchingStrategy
- MultipleEclipseWorkspaces
- DeleteReaddServiceForRetest
- FileHarvests
- CharsetEncodingWithEric
- DrupalSolrOptimization
  - WorkPlan
  - MetricsForAssessment
  - IdeasForImprovement
  - RandomNotes
Wiki en español
- Servicios
  - Qué es un Servicio de Metadatos?
  - Servicio de Normalización XC MARCXML
  - Servicio de Transformación MARCXML a Esquema XC
    - Introducción al Servicio de Transformación
    - Modificaciones en el Servicio de Transformación
  - Servicio de Agregación MARC

PriorDesign

Other Documents

Ben's implementation design

From Dave's PPT

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!