OaiIdIndexAlgorithm

To optimize performance, the thing we most want to minimize is one-off database queries for each incoming record (both for harvests and for services). The question "have I seen this record before?" is an important one because the answer determines whether the harvester or service needs to insert a new record (assigning a new id) or update an existing record (looking up the existing id). For an initial run, this question doesn't need to be answered per record because we can assume the answer is no. For ongoing harvests, however, asking the database each time would probably still be too slow.

To answer this question efficiently, we keep a hashtable in memory (input-id -> output-id). For services, this map is easy because we can use our internal MST record ids, which are numeric and unique across the entire system. For harvests, though, the unique part is based on an external oai-id, about which we know very little.
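The in-memory map described above can be sketched roughly as follows. This is only an illustration, not the MST's actual code: the class name `IdCache`, the method `resolve`, and the simple counter used to hand out new ids are all hypothetical.

```python
from itertools import count


class IdCache:
    """Hypothetical in-memory map: input-id -> output-id."""

    def __init__(self, first_id=1):
        self._ids = {}                   # input-id -> output-id
        self._next_id = count(first_id)  # stand-in for the real id generator

    def resolve(self, input_id):
        """Return (output_id, is_new) without querying the database."""
        existing = self._ids.get(input_id)
        if existing is not None:
            return existing, False       # seen before: update the existing record
        new_id = next(self._next_id)
        self._ids[input_id] = new_id
        return new_id, True              # first sighting: insert a new record


cache = IdCache()
first = cache.resolve("oai:extensiblecatalog.info:bib:0")
again = cache.resolve("oai:extensiblecatalog.info:bib:0")
other = cache.resolve("oai:extensiblecatalog.info:bib:1")
```

Here `first` and `other` report new records, while `again` returns the id already assigned to the same oai-id, so the caller knows to update rather than insert.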

Keeping all of the oai-ids for a given repository in memory requires a fair amount of memory when the repository is large. For example, for 10,000,000 records with oai-ids of 50 characters each:

(50 chars * 2 bytes + 38 + 4) * 10,000,000 = 1.4G

(Reference: http://www.javamex.com/tutorials/memory/string_memory_usage.shtml)
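The estimate above can be reproduced with a few lines of arithmetic. The constant names are made up for this sketch, and reading the 38 as per-string overhead and the 4 as the size of the output id is an assumption; the source only gives the bare numbers.

```python
BYTES_PER_CHAR = 2     # from the estimate above
STRING_OVERHEAD = 38   # assumed meaning: fixed per-string overhead
OUTPUT_ID_BYTES = 4    # assumed meaning: size of the mapped output id


def estimate_bytes(chars_per_id, records):
    """Apply the formula above: (chars * 2 + 38 + 4) * records."""
    per_record = chars_per_id * BYTES_PER_CHAR + STRING_OVERHEAD + OUTPUT_ID_BYTES
    return per_record * records


full = estimate_bytes(chars_per_id=50, records=10_000_000)
print(f"{full:,} bytes = {full / 1e9:.2f} GB")  # 1,420,000,000 bytes = 1.42 GB
```

Calling `estimate_bytes` with a different id length or record count shows how the figure scales for a particular repository.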

This is not totally unreasonable, but it also wouldn't be difficult to reduce it by about 75% by stripping redundant data out of each oai-id, so that what remains is a smaller representation that is still unique within the repository. To do this, we let the user enter the redundant portion of the oai-id for each repository. If the user does not configure this, the MST will simply use more memory than it needs to. This should work fine until a repository reaches tens of millions of records (depending, of course, on how much memory is available).

The user configures the redundant section of the oai-id for a repository.
- This will most likely be something like "oai:extensiblecatalog.info:"
- Examples:
  1. oai:extensiblecatalog.info:0 results in 0
  2. oai:extensiblecatalog.info:bib:0 results in bib:0
  3. oai:extensiblecatalog.info:bib/0 results in bib/0
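A minimal sketch of the stripping step, using the examples above. The function name `compact_oai_id` is hypothetical, and treating the redundant section as a prefix is an assumption based on those examples; the MST's real matching rule may differ.

```python
def compact_oai_id(oai_id, redundant_token):
    """Drop the configured redundant token from the front of an oai-id.

    Assumption: the token is matched as a prefix. Ids that do not start
    with it are returned unchanged, so they are simply stored in full.
    """
    if redundant_token and oai_id.startswith(redundant_token):
        return oai_id[len(redundant_token):]
    return oai_id


TOKEN = "oai:extensiblecatalog.info:"
compacted = [
    compact_oai_id("oai:extensiblecatalog.info:0", TOKEN),
    compact_oai_id("oai:extensiblecatalog.info:bib:0", TOKEN),
    compact_oai_id("oai:extensiblecatalog.info:bib/0", TOKEN),
]
unmatched = compact_oai_id("oai:example.org:42", TOKEN)  # made-up id without the token
```

The three configured examples shrink to `0`, `bib:0`, and `bib/0`, while the made-up id that lacks the token comes back in full.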

FAQ

Question: What if, for a particular repository, a user enters "repo.domain.edu" as the redundant section, and the repository then has a few records whose identifiers do not contain that string? How would the MST behave in the harvesting case? In the service processing step?
- Answer: If an oai-id does not contain the redundant token, the oai-id is simply kept in memory in full. The only consequence is that memory is not used as efficiently. This question applies only to harvesting, because services map input-record-ids -> output-record-ids based on our own internal integer ids (unique per MST).
Question: What are some other potential error conditions that we can account for, or at least explain the ramifications of in advance?
- Answer: The main risk is running out of memory. For that to happen, some combination of these 3 things would have to be true:
  1. the user did not enter the redundant portion of the oai-id
  2. there are tens of millions of records in the harvested repository
  3. the system running the MST has less than 2G of RAM

  We could potentially offer another, optional approach that doesn't require a memory cache: a db lookup to map input records to output records. This would significantly slow down the harvest, although that might not be a big deal since it only matters for subsequent harvests. I don't think this is necessary at this time, but it is good to know we could do it.
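The db-lookup alternative mentioned in the answer above might look something like the sketch below. It is only an illustration under stated assumptions: Python's built-in `sqlite3` stands in for the MST's MySQL database, and the table `oai_id_map`, its columns, and the function `resolve_via_db` are all hypothetical.

```python
import sqlite3

# sqlite3 in-memory database as a stand-in for the real MySQL database.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE oai_id_map ("
    "  oai_id TEXT PRIMARY KEY,"      # hypothetical column: external oai-id
    "  record_id INTEGER NOT NULL"    # hypothetical column: internal record id
    ")"
)


def resolve_via_db(conn, oai_id):
    """Return (record_id, is_new) using one query per incoming record."""
    row = conn.execute(
        "SELECT record_id FROM oai_id_map WHERE oai_id = ?", (oai_id,)
    ).fetchone()
    if row is not None:
        return row[0], False          # seen before: update the existing record
    # Naive id assignment for the sketch: next id after the current maximum.
    new_id = conn.execute(
        "SELECT COALESCE(MAX(record_id), 0) + 1 FROM oai_id_map"
    ).fetchone()[0]
    conn.execute(
        "INSERT INTO oai_id_map (oai_id, record_id) VALUES (?, ?)",
        (oai_id, new_id),
    )
    return new_id, True               # first sighting: insert a new record


first_seen = resolve_via_db(conn, "oai:extensiblecatalog.info:bib:0")
seen_again = resolve_via_db(conn, "oai:extensiblecatalog.info:bib:0")
```

Every call costs at least one round trip to the database, which is the per-record cost the memory cache is meant to avoid.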
