Clarify encoding of colon between scheme and type #361

Open
wants to merge 7 commits into
base: master
60 changes: 24 additions & 36 deletions PURL-SPECIFICATION.rst
@@ -55,7 +55,7 @@ sometimes look like a ``host`` but its interpretation is specific to a ``type``.


Some ``purl`` examples
~~~~~~~~~~~~~~~~~~~~
~~~~~~~~~~~~~~~~~~~~~~

::

@@ -72,7 +72,7 @@ Some ``purl`` examples


A ``purl`` is a URL
~~~~~~~~~~~~~~~~~
~~~~~~~~~~~~~~~~~~~

- A ``purl`` is a valid URL and URI that conforms to the URL definitions or
specifications at:
@@ -110,7 +110,7 @@ A ``purl`` is a URL


Rules for each ``purl`` component
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

A ``purl`` string is an ASCII URL string composed of seven components.

@@ -122,27 +122,11 @@ The rules for each component are:

- **scheme**:

- The ``scheme`` is a constant with the value "pkg"
- Since a ``purl`` never contains a URL Authority, its ``scheme`` must not be
suffixed with double slash as in 'pkg://' and should use instead
'pkg:'. Otherwise this would be an invalid URI per rfc3986 at
https://tools.ietf.org/html/rfc3986#section-3.3::

If a URI does not contain an authority component, then the path
cannot begin with two slash characters ("//").

It is therefore incorrect to use such '://' scheme suffix as the URL would
no longer be valid otherwise. In its canonical form, a ``purl`` must
NOT use such '://' ``scheme`` suffix but only ':' as a ``scheme`` suffix.
- ``purl`` parsers must accept URLs such as 'pkg://' and must ignore the '//'.
- ``purl`` builders must not create invalid URLs with such double slash '//'.
- The ``scheme`` is followed by a ':' separator
- For example these two purls are strictly equivalent and the first is in
canonical form. The second ``purl`` with a '//' is an acceptable ``purl`` but is
an invalid URI/URL per rfc3986::

pkg:gem/ruby-advisory-db-check@0.12.4
pkg://gem/ruby-advisory-db-check@0.12.4
- The ``scheme`` is a constant with the value "pkg".
- The ``scheme`` and ``type`` MUST be separated by a colon ':'.
Member Author

Suggested change
- The ``scheme`` and ``type`` MUST be separated by a colon ':'.
- The ``scheme`` MUST be followed by an unencoded colon ':'.

- ``purl`` parsers MUST accept URLs in which the ``scheme`` and colon ':' are
Member Author

Suggested change
- ``purl`` parsers MUST accept URLs in which the ``scheme`` and colon ':' are
- ``purl`` parsers MUST accept URLs where the ``scheme`` and colon ':' are

followed by one or more slash '/' characters, such as 'pkg://', and MUST
ignore -- i.e., normalize by removing -- all such '/' characters.
Member Author

Suggested change
ignore -- i.e., normalize by removing -- all such '/' characters.
ignore and remove all such '/' characters.
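
A minimal Python sketch of the parser-side normalization this rule describes. The function name ``normalize_scheme_separator`` is hypothetical and not part of the specification. It only handles the ``scheme`` prefix and assumes the rest of the string is already well formed.

```python
def normalize_scheme_separator(purl: str) -> str:
    """Drop every '/' that directly follows the 'pkg:' scheme prefix."""
    # Split once on the first ':'; an encoded colon ('%3A') yields no separator.
    scheme, sep, remainder = purl.partition(":")
    if not sep or scheme.lower() != "pkg":
        raise ValueError(f"not a purl: {purl!r}")
    # 'pkg:/', 'pkg://', 'pkg:///' and so on all collapse to 'pkg:'.
    return "pkg:" + remainder.lstrip("/")


print(normalize_scheme_separator("pkg://gem/ruby-advisory-db-check@0.12.4"))
# pkg:gem/ruby-advisory-db-check@0.12.4
```

A builder would never need this step, since it is expected to emit 'pkg:' without any slash in the first place.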



- **type**:
@@ -234,21 +218,24 @@ The rules for each component are:
Character encoding
~~~~~~~~~~~~~~~~~~

For clarity and simplicity a ``purl`` is always an ASCII string. To ensure that
there is no ambiguity when parsing a ``purl``, separator characters and non-ASCII
characters must be UTF-encoded and then percent-encoded as defined at::
For clarity and simplicity, a ``purl`` is always an ASCII string. To ensure that
Member Author

We need to refactor this section separately. @mprpic volunteered to make this PR.
Let's remove all changes to this Character encoding section from this PR.

there is no ambiguity when parsing a ``purl``, unless otherwise provided in
this specification, separator characters and non-ASCII characters MUST be
UTF-8-encoded and then percent-encoded as defined at::

https://en.wikipedia.org/wiki/Percent-encoding

Use these rules for percent-encoding and decoding ``purl`` components:

- the ``type`` must NOT be encoded and must NOT contain separators

- the '#', '?', '@' and ':' characters must NOT be encoded when used as
separators. They may need to be encoded elsewhere
- the '#', '?', '@' and ':' characters MUST remain unencoded and displayed
as-is when used as separators. They may need to be encoded elsewhere.

- the ':' ``scheme`` and ``type`` separator does not need to and must NOT be encoded.
It is unambiguous unencoded everywhere
- the colon ':' separator between ``scheme`` and ``type`` MUST remain unencoded.
For example, in the PURL snippet ``pkg:npm`` the colon ':' MUST remain
unencoded and displayed as-is, i.e., ``pkg:npm``, and the PURL snippet
``pkg%3Anpm`` is invalid.
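
One way to check this rule, sketched in Python. The helper name ``has_valid_scheme_separator`` and the case-insensitive match on ``pkg`` are assumptions for illustration. The specification itself only states that the colon must remain unencoded.

```python
import re

# Assumed pattern: the literal scheme "pkg" (any case) followed directly
# by an unencoded ':'. An encoded colon such as '%3A' will not match.
_SCHEME_PREFIX = re.compile(r"^pkg:", re.IGNORECASE)


def has_valid_scheme_separator(purl: str) -> bool:
    """Return True when the string starts with 'pkg' and a literal ':'."""
    return _SCHEME_PREFIX.match(purl) is not None


print(has_valid_scheme_separator("pkg:npm/foo"))    # True
print(has_valid_scheme_separator("pkg%3Anpm/foo"))  # False
```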

Member

@gernot-h @pombredanne Consider adding at the top of the file, perhaps as a new one-line paragraph following the current first paragraph, something along the lines of the following:

This specification uses RFC 2119 (https://datatracker.ietf.org/doc/html/rfc2119), as clarified in RFC 8174 (https://datatracker.ietf.org/doc/html/rfc8174), for the interpretation of certain terms, e.g., MUST NOT.

Or perhaps a slight modification to the example provided by RFC 2119:

The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this document are to be interpreted as described in RFC 2119, as clarified in RFC 8174.

(Note that the core spec currently contains a great deal of language that will need to be modified to implement RFC 2119/8174.)

- the '/' used as ``type``/``namespace``/``name`` and ``subpath`` segments separator
does not need to and must NOT be percent-encoded. It is unambiguous unencoded
@@ -259,7 +246,7 @@ Use these rules for percent-encoding and decoding ``purl`` components:
- the '=' ``qualifiers`` key/value separator must NOT be encoded
- the '#' ``subpath`` separator must be encoded as ``%23`` elsewhere

- All non-ASCII characters must be encoded as UTF-8 and then percent-encoded
- All non-ASCII characters MUST be encoded as UTF-8 and then percent-encoded.
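
A short Python sketch of "encode as UTF-8, then percent-encode" using the standard library. The helper names ``encode_component`` and ``decode_component`` are hypothetical. The sketch applies to the value of a single component, not to separators, and it does not implement the per-component exceptions listed above.

```python
from urllib.parse import quote, unquote


def encode_component(value: str) -> str:
    """Percent-encode one component value (assumed helper, not from the spec)."""
    # quote() encodes the string as UTF-8 by default and then percent-encodes
    # every byte that is not an unreserved character. safe="" means that
    # '/', '#', '?', '@' and ':' inside the value are encoded as well.
    return quote(value, safe="")


def decode_component(value: str) -> str:
    """Reverse encode_component(): percent-decode, then decode as UTF-8."""
    return unquote(value)


print(encode_component("café"))       # caf%C3%A9
print(encode_component("a#b"))        # a%23b
print(decode_component("caf%C3%A9"))  # café
```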

It is OK to percent-encode ``purl`` components otherwise except for the ``type``.
Parsers and builders must always percent-decode and percent-encode ``purl``
@@ -268,7 +255,7 @@ build" sections.


How to build ``purl`` string from its components
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Building a ``purl`` ASCII string works from left to right, from ``type`` to
``subpath``.
@@ -343,7 +330,7 @@ To build a ``purl`` string from its components:


How to parse a ``purl`` string in its components
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Parsing a ``purl`` ASCII string into its components works from right to left,
from ``subpath`` to ``type``.
@@ -386,7 +373,8 @@ To parse a ``purl`` string in its components:
- The left side lowercased is the ``scheme``
- The right side is the ``remainder``

- Strip the ``remainder`` from leading and trailing '/'
- Strip all leading and trailing '/' characters (e.g., '/', '//', '///' and
so on) from the ``remainder``

- Split this once from left on '/'
- The left side lowercased is the ``type``
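
The ``scheme`` and ``type`` steps listed above can be sketched in Python as follows. The function name ``split_scheme_and_type`` is hypothetical. Because parsing works from right to left, the sketch assumes that ``subpath``, ``qualifiers`` and ``version`` have already been split off the input.

```python
from typing import Tuple


def split_scheme_and_type(purl: str) -> Tuple[str, str, str]:
    """Return (scheme, type, rest) for a purl without version, qualifiers or subpath."""
    # Split once from the left on ':'; the left side lowercased is the scheme.
    scheme, sep, remainder = purl.partition(":")
    if not sep:
        raise ValueError("missing ':' after the scheme")
    scheme = scheme.lower()
    # Strip all leading and trailing '/' characters from the remainder.
    remainder = remainder.strip("/")
    # Split once from the left on '/'; the left side lowercased is the type.
    type_, _, rest = remainder.partition("/")
    return scheme, type_.lower(), rest


print(split_scheme_and_type("pkg://gem/ruby-advisory-db-check"))
# ('pkg', 'gem', 'ruby-advisory-db-check')
```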
@@ -424,7 +412,7 @@ There are several known ``purl`` package type definitions tracked in the
separate `<PURL-TYPES.rst>`_ document.

Known ``qualifiers`` key/value pairs
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Note: Do not abuse ``qualifiers``: it can be tempting to use many qualifier
keys but their usage should be limited to the bare minimum for proper package
93 changes: 93 additions & 0 deletions faq.rst
@@ -0,0 +1,93 @@
Frequently Asked Questions
==========================

The following FAQs are organized into

- a "Components" section that includes each of the seven PURL components
(``scheme``, ``type``, ``namespace``, ``name``, ``version``, ``qualifiers``
and ``subpath``), and

- a "General" section containing a mix of questions and answers that don't fit
neatly into a component-focused category.

If you have a question about the PURL specification and don't find an answer
below, you can open an issue `here <https://github.com/package-url/purl-spec/issues/new?template=Blank+issue>`_.

Components
~~~~~~~~~~

Scheme
------

**QUESTION**: Can the ``scheme`` component be followed by a colon and two slashes, like a URI?

No. Since a Package-URL, or PURL, never contains a URL Authority, its ``scheme`` should not be suffixed with a double slash as in 'pkg://' and should use 'pkg:' instead. Otherwise this would be an invalid URI per RFC 3986 at https://tools.ietf.org/html/rfc3986#section-3.3::

If a URI does not contain an authority component, then the path
cannot begin with two slash characters ("//").

This rule applies to all slash '/' characters between the ``scheme``'s colon separator and the ``type`` component, e.g., ':/', '://' and ':///'.

In its canonical form, a PURL must not use any such ':/' ``scheme`` suffix and may only use ':' as a ``scheme`` suffix. This means that:

- PURL parsers must accept URLs such as 'pkg://' and must ignore -- i.e., normalize by deleting -- all such '/' characters.
- PURL builders should not create invalid URLs with one or more slash '/' characters between 'pkg:' and the ``type`` component.

For example, although these two PURLs are strictly equivalent, the first is in canonical form, while the second -- with a '//' between 'pkg:' and the ``type`` 'gem' -- is an acceptable PURL but is an invalid URI/URL per RFC 3986::

pkg:gem/ruby-advisory-db-check@0.12.4

pkg://gem/ruby-advisory-db-check@0.12.4

**QUESTION**: Is the colon between ``scheme`` and ``type`` encoded? Can it be encoded? If yes, how?

There are two sections of the core specification that address this question:

- The "Rules for each ``purl`` component" section provides that "[t]he ``scheme`` and ``type`` MUST be separated by a colon ':'".
- The "Character encoding" section provides that

the '#', '?', '@' and ':' characters MUST remain unencoded and displayed as-is when used as separators. . . . [T]he colon ':' separator between ``scheme`` and ``type`` MUST remain unencoded. For example, in the PURL snippet ``pkg:npm`` the colon ':' MUST remain unencoded and displayed as-is, i.e., ``pkg:npm``, and the PURL snippet ``pkg%3Anpm`` is invalid.

In this case, the colon ':' between ``scheme`` and ``type`` is being used as a separator, and consequently should be used as-is, never encoded and never requiring any decoding. Moreover, it should be a parsing error if the colon ':' does not come directly after 'pkg'. Tools are welcome to recover from this error to help with damaged PURLs, but that's not a requirement.

----

Type
----

[to come]

----

Namespace
---------

[to come]

----

Name
----

[to come]

----

Version
-------

[to come]

----

Qualifiers
----------

[to come]

----

Subpath
-------

[to come]
12 changes: 12 additions & 0 deletions test-suite-data.json
@@ -622,5 +622,17 @@
"qualifiers": null,
"subpath": null,
"is_invalid": false
},
{
"description": "invalid encoded colon : between scheme and type",
"purl": "pkg%3Amaven/org.apache.commons/io",
"canonical_purl": null,
"type": "maven",
"namespace": "org.apache.commons",
"name": "io",
"version": null,
"qualifiers": null,
"subpath": null,
"is_invalid": true
}
]
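
A sketch of how a test runner might consume the new entry, in Python. The JSON below is a trimmed copy of the case added in this PR. The ``is_invalid`` function is a hypothetical stand-in that checks only the unencoded-colon rule, not full PURL validity.

```python
import json

# Trimmed copy of the test case added to test-suite-data.json in this PR.
CASES = json.loads("""
[
  {
    "description": "invalid encoded colon : between scheme and type",
    "purl": "pkg%3Amaven/org.apache.commons/io",
    "canonical_purl": null,
    "is_invalid": true
  }
]
""")


def is_invalid(purl: str) -> bool:
    """Assumed check: a purl is invalid unless it starts with 'pkg' and a literal ':'."""
    return not purl.lower().startswith("pkg:")


for case in CASES:
    # Each case states whether parsing the purl is expected to fail.
    assert is_invalid(case["purl"]) == case["is_invalid"], case["description"]
```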