OaiIdIndexAlgorithm
What we most want to minimize, in order to optimize performance, is one-off database queries for each incoming record (both for harvests and for services). Answering the question "have I seen this record previously?" is important because it determines whether the harvester/service needs to insert a new record (assigning a new id) or update an existing record (getting the existing id). For an initial run, this question doesn't need to be answered per record because we can assume the answer is no. For ongoing harvests, however, it would probably still be too slow to ask the database each time. To answer this question efficiently, we should keep a hashtable in memory (input-id -> output-id). For services, this map is easy because we can use our internal MST record ids (which are numeric and unique across the entire system). For harvests, though, the unique part is based on an external oai-id about which we know very little.
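The in-memory lookup described above can be sketched roughly as follows. This is a minimal illustration, not the MST's actual code: the class name `OaiIdIndex`, the methods `hasSeen` and `resolve`, and the sequential id counter are all assumptions made for the example.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch: none of these names come from the MST code base.
class OaiIdIndex {
    // input oai-id -> internal output record id
    private final Map<String, Long> outputIdByOaiId = new HashMap<>();
    private long nextId = 1; // assumption: new ids are handed out sequentially

    /** True if this oai-id has already been registered in the index. */
    boolean hasSeen(String oaiId) {
        return outputIdByOaiId.containsKey(oaiId);
    }

    /** Returns the existing id (update case) or assigns a new one (insert case). */
    long resolve(String oaiId) {
        Long existing = outputIdByOaiId.get(oaiId);
        if (existing != null) {
            return existing;
        }
        long assigned = nextId++;
        outputIdByOaiId.put(oaiId, assigned);
        return assigned;
    }

    public static void main(String[] args) {
        OaiIdIndex index = new OaiIdIndex();
        System.out.println(index.hasSeen("oai:extensiblecatalog.info:0")); // false
        long first = index.resolve("oai:extensiblecatalog.info:0");
        long again = index.resolve("oai:extensiblecatalog.info:0");
        System.out.println(first == again); // true
    }
}
```

With a map like this, no database round trip is needed to decide between insert and update for each record.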
Keeping all the oai-ids for a given repository in memory requires a substantial amount of memory when the repository is large. For example, with 50-character ids and 10 million records:
(50 chars * 2 bytes + 38 + 4) bytes * 10,000,000 records = 1.42 GB
REF: [http://www.javamex.com/tutorials/memory/string_memory_usage.shtml]
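The estimate above can be reproduced with a few lines of Java, which also makes it easy to try other id lengths or record counts. The class and method names are hypothetical. The constants (2 bytes per char, plus 38 and 4 bytes of fixed overhead per entry) are taken directly from the formula in the text.

```java
// Hypothetical helper for reproducing the memory estimate; not part of the MST.
class OaiIdMemoryEstimate {
    /** Bytes needed to hold recordCount ids of charsPerId characters each. */
    static long bytesFor(int charsPerId, long recordCount) {
        long perEntry = charsPerId * 2L + 38 + 4; // constants from the formula in the text
        return perEntry * recordCount;
    }

    public static void main(String[] args) {
        System.out.println(bytesFor(50, 10_000_000L)); // 1420000000
    }
}
```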
This is not totally unreasonable, but it also wouldn't be difficult to reduce it by about 75% by stripping out redundant data in the oai-id, so that what remains is a smaller (but still unique per repo) representation of the oai-id. To do this, we let the user enter the redundant portion of the oai-id for each repo. If the user does not configure this, the MST will simply use more memory than it needs to. This should work fine until you get into tens of millions of records (depending, of course, on how much memory you have).
- the user configures a redundant section of the oai-id
  - this will most likely be something like "oai:extensiblecatalog.info:"
- examples:
  - oai:extensiblecatalog.info:0
    - results in 0
  - oai:extensiblecatalog.info:bib:0
    - results in bib:0
  - oai:extensiblecatalog.info:bib/0
    - results in bib/0
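The stripping step in the examples above might look like the following sketch. The names `OaiIdCompressor` and `compress` are hypothetical. The choice to remove the first occurrence of the token anywhere in the id (rather than only a leading prefix) is also an assumption, as is the premise that the stripped forms stay unique within a repository.

```java
// Hypothetical sketch: OaiIdCompressor and compress are illustrative names, not MST code.
class OaiIdCompressor {
    private final String redundantToken;

    OaiIdCompressor(String redundantToken) {
        // A missing configuration is treated as "nothing to strip".
        this.redundantToken = redundantToken == null ? "" : redundantToken;
    }

    /** Removes the first occurrence of the token; ids without it are returned in full. */
    String compress(String oaiId) {
        int at = redundantToken.isEmpty() ? -1 : oaiId.indexOf(redundantToken);
        if (at < 0) {
            return oaiId;
        }
        return oaiId.substring(0, at) + oaiId.substring(at + redundantToken.length());
    }

    public static void main(String[] args) {
        OaiIdCompressor c = new OaiIdCompressor("oai:extensiblecatalog.info:");
        System.out.println(c.compress("oai:extensiblecatalog.info:0"));     // 0
        System.out.println(c.compress("oai:extensiblecatalog.info:bib:0")); // bib:0
        System.out.println(c.compress("oai:extensiblecatalog.info:bib/0")); // bib/0
        System.out.println(c.compress("oai:elsewhere.example:7"));          // unchanged
    }
}
```

The compressed string, rather than the full oai-id, would then be used as the key of the in-memory map.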
Question: What if, for a particular repository, a user enters “repo.domain.edu” as the redundant section, and the repository then has a few records whose identifiers do not contain that string? How would the MST behave in the harvesting case? In the service processing step?
Answer: If an oai-id does not contain the redundantToken, the oai-id is simply kept in memory in full. The only consequence is that memory is not used as efficiently. This question applies only to harvesting, because services map input-record-ids -> output-record-ids based on our own internal integral ids (unique per MST).
Question: What other potential error conditions can we account for, so that we can at least explain the ramifications in advance?
Answer: The main risk is running out of memory. In order to run out of memory, some combination of these 3 things would have to happen:
- user did not enter the redundant portion of the oai-id
- there are 10s of millions of records in the harvested repository
- the system running the MST has less than 2 GB of RAM
We could potentially offer another, optional approach that doesn't require a memory cache: a db lookup to map input-records to output-records. This would significantly slow down the harvest, although that might not be a big deal, since it only matters for subsequent harvests. I don't think this is necessary at this time, but it is good to know we could do it.
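One way to keep both options open is to hide the lookup behind a small interface, so the harvester does not care whether ids come from memory or from the database. The sketch below is purely illustrative: `IdLookupChooser`, `IdLookup`, `inMemory`, and `perRecordQuery` are hypothetical names, and the `Function` argument merely stands in for a real database query.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Function;

// Hypothetical sketch: the names below are illustrative, not MST classes.
class IdLookupChooser {
    /** Answers "which output id, if any, belongs to this input id?" (null = not seen). */
    interface IdLookup {
        Long find(String inputId);
    }

    /** Fast path: the whole input-id -> output-id table is held in memory. */
    static IdLookup inMemory(Map<String, Long> cache) {
        return cache::get;
    }

    /** Slow path: one query per record; dbQuery stands in for real database access. */
    static IdLookup perRecordQuery(Function<String, Long> dbQuery) {
        return dbQuery::apply;
    }

    public static void main(String[] args) {
        Map<String, Long> table = new HashMap<>();
        table.put("bib:0", 42L);
        IdLookup fast = inMemory(table);
        IdLookup slow = perRecordQuery(table::get); // a map plays the database here
        System.out.println(fast.find("bib:0")); // 42
        System.out.println(slow.find("bib:1")); // null
    }
}
```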