Description
A widely used URI normalization algorithm, implemented by h.util.uri.normalize, ensures that certain differences are ignored when comparing or searching by URL in many contexts. These differences include:
- Whether the site was visited over HTTP or HTTPS
- Whether the default port for the scheme (80 for HTTP, 443 for HTTPS) was specified explicitly or omitted
- The value of query string parameters which are known to be irrelevant to the content (e.g. signatures or Google Analytics campaign metadata)
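The three kinds of differences listed above can be illustrated with a minimal sketch. This is not the actual h.util.uri.normalize implementation: the function name normalize_sketch, the IGNORED_PARAMS set (a stand-in for BLACKLISTED_QUERY_PARAMS) and the placeholder scheme "httpx" are assumptions chosen for illustration only.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Hypothetical stand-in for the real BLACKLISTED_QUERY_PARAMS set.
IGNORED_PARAMS = {"utm_source", "utm_medium", "utm_campaign"}
DEFAULT_PORTS = {"http": 80, "https": 443}


def normalize_sketch(uri, ignored_params=IGNORED_PARAMS):
    """Simplified illustration; userinfo and IPv6 hosts are not handled."""
    parts = urlsplit(uri)
    scheme = parts.scheme.lower()
    host = (parts.hostname or "").lower()

    # Keep the port only when it is not the default for the scheme.
    if parts.port is not None and parts.port != DEFAULT_PORTS.get(scheme):
        host = f"{host}:{parts.port}"

    # Collapse HTTP and HTTPS into one placeholder scheme (illustrative choice).
    if scheme in ("http", "https"):
        scheme = "httpx"

    # Drop query parameters that are known not to affect the content.
    kept = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key not in ignored_params
    ]
    return urlunsplit((scheme, host, parts.path, urlencode(kept), parts.fragment))
```

Under these assumptions, http://example.com:80/a?utm_source=x&id=1 and https://example.com:443/a?id=1 normalize to the same string, so a lookup by either URL finds the same records.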
Periodically we find reasons to change this algorithm, for example to add new parameters to the BLACKLISTED_QUERY_PARAMS set. A concrete example is ignoring the token query parameter, a signed token that is added to Canvas file URLs.
However, we can't do this because we have no procedure or tools to help us re-process the existing URIs that are stored in the database and in Elasticsearch. Stored values were computed with the old algorithm, while lookups would use the new one, so changing the normalization algorithm will break lookup of existing annotations unless we update references in the DB/Elasticsearch.
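A small, self-contained sketch of the failure mode follows. The strip_params helper, the in-memory rows and the dictionary index are hypothetical stand-ins for the real normalizer, database and search index, and the sketch assumes the original, un-normalized URI is still available alongside each stored value.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def strip_params(uri, ignored):
    """Hypothetical normalizer: only removes the ignored query parameters."""
    parts = urlsplit(uri)
    kept = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key not in ignored
    ]
    return urlunsplit(parts._replace(query=urlencode(kept)))


OLD_IGNORED = {"utm_source"}
NEW_IGNORED = OLD_IGNORED | {"token"}  # the proposed algorithm change

uri = "https://canvas.example.com/files/1?token=abc"

# A row persisted before the change, normalized with the old rules.
rows = [{"id": "annotation-1", "target_uri": uri,
         "target_uri_normalized": strip_params(uri, OLD_IGNORED)}]
index = {row["target_uri_normalized"]: row["id"] for row in rows}

# After the change, lookups normalize with the new rules and miss the old key.
lookup_key = strip_params(uri, NEW_IGNORED)
found_before = index.get(lookup_key)

# Re-processing: recompute every stored value, then rebuild the index.
for row in rows:
    row["target_uri_normalized"] = strip_params(row["target_uri"], NEW_IGNORED)
index = {row["target_uri_normalized"]: row["id"] for row in rows}
found_after = index.get(lookup_key)
```

Here found_before is None because the stored key still contains the token parameter, and the lookup only succeeds again once every stored value has been recomputed with the new rules.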
Normalized URLs are currently stored in the following database fields:
- annotation.target_uri_normalized
- document.claimant_normalized
- document.uri_normalized
They are also stored in the search index in fields generated by AnnotationSearchIndexPresenter.
This issue covers implementing and documenting tools to facilitate updating the URI normalization algorithm.