Implement a procedure for updating the URI normalization algorithm

The URI normalization algorithm implemented by `h.util.uri.normalize` is widely used to ensure that certain differences are ignored when comparing or searching by URL. These differences include:

- Whether the site was visited over HTTP or HTTPS
- Whether a default port was specified or not (80 for HTTP, 443 for HTTPS)
- The values of query string parameters which are known to be irrelevant to the content (e.g. signatures or Google Analytics campaign metadata)
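
To make the list above concrete, here is a minimal sketch of this kind of normalization. It is not the actual `h.util.uri.normalize` implementation: the function name `sketch_normalize`, the contents of `IGNORED_PARAMS`, and the choice to collapse both schemes to `http` are all assumptions made for illustration.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Hypothetical subset of ignored parameters; the real set lives in h
# (BLACKLISTED_QUERY_PARAMS) and is not reproduced here.
IGNORED_PARAMS = frozenset({"utm_source", "utm_medium", "utm_campaign"})
DEFAULT_PORTS = {"http": 80, "https": 443}


def sketch_normalize(uri, ignored_params=IGNORED_PARAMS):
    """Return a comparison key for `uri` (illustrative only)."""
    parts = urlsplit(uri)
    scheme = parts.scheme.lower()
    if scheme not in DEFAULT_PORTS:
        # Leave non-HTTP(S) URIs untouched in this sketch.
        return uri

    # Drop the port when it is the default for the scheme.
    # Note: userinfo is discarded and IPv6 literals are not handled here.
    host = (parts.hostname or "").lower()
    if parts.port is not None and parts.port != DEFAULT_PORTS[scheme]:
        host = f"{host}:{parts.port}"

    # Remove query parameters that do not affect the content.
    kept = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key not in ignored_params
    ]

    # Assumption: collapse HTTP and HTTPS to one placeholder scheme.
    return urlunsplit(("http", host, parts.path, urlencode(kept), parts.fragment))
```

With this sketch, `https://example.com:443/a?utm_source=x&id=1` and `http://example.com/a?id=1` produce the same key, while a `token` parameter is still treated as significant until it is added to the ignored set.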

Periodically we find reasons to change this algorithm, for example to add new parameters to the `BLACKLISTED_QUERY_PARAMS` set. A concrete example is ignoring the signed `token` parameter that is added to Canvas file URLs.

However, we can't do this because we have no procedure or tools to help us re-process the existing URIs that are stored in the database and in Elasticsearch. As a result, changing the normalization algorithm will break lookup of existing annotations unless we also update the stored references in the DB and Elasticsearch.
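
The failure mode can be illustrated with a small, self-contained sketch. Everything here is hypothetical: `strip_params` stands in for the real normalizer, a plain dict stands in for the stored normalized values, and the Canvas-style URL is made up.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def strip_params(uri, ignored):
    """Stand-in normalizer: drop the query parameters named in `ignored`."""
    parts = urlsplit(uri)
    kept = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key not in ignored
    ]
    return urlunsplit(parts._replace(query=urlencode(kept)))


OLD_IGNORED = frozenset({"utm_source"})
NEW_IGNORED = OLD_IGNORED | {"token"}

# An annotation was stored while `token` was still significant.
stored_uri = "https://canvas.example/files/1?token=abc"
index = {strip_params(stored_uri, OLD_IGNORED): "annotation-1"}

# After the algorithm changes, lookups use the new rules...
lookup_uri = "https://canvas.example/files/1?token=xyz"
lookup_key = strip_params(lookup_uri, NEW_IGNORED)

# ...but the stored key was produced by the old rules, so nothing matches.
found_before_reprocessing = index.get(lookup_key)

# Re-processing the stored URI with the new rules restores the match.
reprocessed = {strip_params(stored_uri, NEW_IGNORED): "annotation-1"}
found_after_reprocessing = reprocessed.get(lookup_key)
```

The lookup only succeeds once the stored value has been regenerated with the same rules as the query, which is why the algorithm cannot safely change without a re-processing step.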

Normalized URLs are currently stored in the following database fields:

- `annotation.target_uri_normalized`
- `document.claimant_normalized`
- `document.uri_normalized`

They are also stored in the search index in fields generated by `AnnotationSearchIndexPresenter`.
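
One possible shape for a re-processing tool is sketched below. It assumes that the original, un-normalized URI is available alongside each stored normalized value, and it works on plain `(row_id, raw_uri, stored_normalized)` tuples rather than on h's models; `stale_rows` and `batched` are hypothetical helper names. Affected annotations would additionally need to be reindexed in Elasticsearch, which this sketch does not cover.

```python
from itertools import islice


def stale_rows(rows, normalize):
    """Yield (row_id, new_normalized) for rows whose stored value is out of date.

    `rows` is an iterable of (row_id, raw_uri, stored_normalized) tuples and
    `normalize` is the updated normalization function.
    """
    for row_id, raw_uri, stored_normalized in rows:
        fresh = normalize(raw_uri)
        if fresh != stored_normalized:
            yield row_id, fresh


def batched(iterable, size):
    """Group items into lists of at most `size`, so updates can be committed in chunks."""
    iterator = iter(iterable)
    while batch := list(islice(iterator, size)):
        yield batch


def plan_updates(rows, normalize, batch_size=1000):
    """Return an iterator of update batches for one normalized column."""
    return batched(stale_rows(rows, normalize), batch_size)
```

Because only rows whose value actually changes are yielded, re-running the plan after a partial failure skips rows that were already updated.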

This issue covers implementing and documenting tools to facilitate updating the URI normalization algorithm.

Implement a procedure for updating the URI normalization algorithm #6552