Skip to content

perf: improve URL parsing performance #9694

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
wants to merge 1 commit into from
Closed
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
1 change: 1 addition & 0 deletions packages/opentelemetry/src/utils/parseSpanDescription.ts
Original file line number Diff line number Diff line change
Expand Up @@ -140,6 +140,7 @@ export function getSanitizedUrl(
// This is the normalized route name - may not always be available!
const httpRoute = attributes[SemanticAttributes.HTTP_ROUTE];

// TODO(@anonrig): Use WHATWG URL API when we drop Node.js v10 support.
const parsedUrl = typeof httpUrl === 'string' ? parseUrl(httpUrl) : undefined;
const url = parsedUrl ? getSanitizedUrlString(parsedUrl) : undefined;
const query = parsedUrl && parsedUrl.search ? parsedUrl.search : undefined;
Expand Down
86 changes: 75 additions & 11 deletions packages/utils/src/url.ts
Original file line number Diff line number Diff line change
@@ -1,11 +1,14 @@
type PartialURL = {
host?: string;
path?: string;
protocol?: string;
relative?: string;
search?: string;
hash?: string;
};
type PartialURL = Partial<{
host: string;
path: string;
protocol: string;
relative: string;
search: string;
hash: string;
urlInstance: URL;
}>;

const urlRegex = /^(([^:/?#]+):)?(\/\/([^/?#]*))?([^?#]*)(\?([^#]*))?(#(.*))?$/;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there an external reference to the source of this regexp that would make sense to add here? I'm asking because these look daunting (at least to me), but I find that they can usually be a good way to educate, for example maybe by linking to the standard that this is supposed to support.


/**
* Parses string form of URL into an object
Expand All @@ -19,7 +22,32 @@ export function parseUrl(url: string): PartialURL {
return {};
}

const match = url.match(/^(([^:/?#]+):)?(\/\/([^/?#]*))?([^?#]*)(\?([^#]*))?(#(.*))?$/);
// Node.js v10 and above supports WHATWG URL API. We can use it when available.
// TODO(@anonrig): Remove this check when we drop support for Node v10.
if (typeof URL !== undefined) {
try {
const parsed = new URL(url);
const pathname = parsed.pathname;

return {
host: parsed.host,
// WHATWG URL API includes the leading slash in the pathname
// Example: Returns `/` for `https://sentry.io`
path: pathname.length === 1 ? '' : pathname,
// WHATWG URL API includes the trailing colon in the protocol
// Example: Returns `https:` for `https://sentry.io`
protocol: parsed.protocol.slice(0, -1),
search: parsed.search,
hash: parsed.hash,
relative: parsed.pathname + parsed.search + parsed.hash,
urlInstance: parsed,
};
} catch {
// If URL is invalid, fallback to regex parsing to support URLs without protocols.
}
}

const match = url.match(urlRegex);

Check failure

Code scanning / CodeQL

Polynomial regular expression used on uncontrolled data

This [regular expression](1) that depends on [library input](2) may run slow on strings with many repetitions of '"'. This [regular expression](1) that depends on [library input](3) may run slow on strings with many repetitions of '"'. This [regular expression](1) that depends on [library input](4) may run slow on strings with many repetitions of '"'.

if (!match) {
return {};
Expand Down Expand Up @@ -62,15 +90,51 @@ export function getNumberOfUrlSegments(url: string): number {
* see: https://develop.sentry.dev/sdk/data-handling/#structuring-data
*/
export function getSanitizedUrlString(url: PartialURL): string {
const { protocol, host, path } = url;
const { protocol, host, path, urlInstance } = url;

// This means that the environment supports WHATWG URL API.
// This case will not be executed if URL does not have a protocol
// since WHATWG URL specification requires protocol to be present.
if (urlInstance !== undefined) {
const { port, username, password, hostname, pathname, protocol } = urlInstance;
const hasAuthority = username.length > 0 || password.length > 0;
let output = `${protocol}//`;

if (hasAuthority) {
if (username) {
output += '[filtered]';
}

output += ':';

if (password) {
output += '[filtered]';
}

output += '@';
}

output += hostname;

if (port && port !== '80' && port !== '443') {
output += `:${port}`;
}

// Do not append pathname if it is empty.
// For example: Pathname is `/` for `https://sentry.io`
if (pathname.length > 1) {
output += pathname;
}

return output;
}

const filteredHost =
(host &&
host
// Always filter out authority
.replace(/^.*@/, '[filtered]:[filtered]@')
// Don't show standard :80 (http) and :443 (https) ports to reduce the noise
// TODO: Use new URL global if it exists
.replace(/(:80)$/, '')
.replace(/(:443)$/, '')) ||
'';
Expand Down