Skip to content

Improve performance by not copying Tokens as much #1558

Open
@alamb

Description

@alamb

Part of #1557

While looking at the flamegraph posted to that ticket, one obvious source of improvement is to stop copying each token so much.

Functions like peek_token(), and expect_token() copy the Token (which often includes a string)

impl Parser {
    // returns an OWNED TokenWithLocation
    pub fn peek_token(&self) -> TokenWithLocation {
   ...
    }
}

For example
https://github.com/apache/datafusion-sqlparser-rs/blob/2e90e105a74bf9f50f2bad6c22992759ddb06880/src/parser/mod.rs#L3314-L3334

In the above code, the non_whitespace.cloned() call actually copies the token (and string) even when many callsites only need to check what it is and could get by with a &

Suggestions for improvements

The biggest suggestion is to stop copying each token unless it actually needed a clone. For example instead of this

impl Parser {
    // returns an OWNED TokenWithLocation
    pub fn peek_token(&self) -> TokenWithLocation {
   ...
    }
}

Make it like this

impl Parser {
    // returns an reference to TokenWithLocation
    // (the & means no copying!)
    pub fn peek_token_ref(&self) -> &TokenWithLocation {
   ...
    }
}

I think we could do this without massive breaking changes by adding new functions like peek_token_ref() that returned a reference to the token rather than a clone

Ideas

I played around with it a bit and here is what I came up with. I think a similar pattern could be applied to other places

impl Parser {
...
    /// Return the first non-whitespace token that has not yet been processed
    /// (or None if reached end-of-file) and mark it as processed. OK to call
    /// repeatedly after reaching EOF.
    pub fn next_token(&mut self) -> TokenWithLocation {
        self.next_token_ref().clone()
    }

    /// Return the first non-whitespace token that has not yet been processed
    /// (or None if reached end-of-file) and mark it as processed. OK to call
    /// repeatedly after reaching EOF.
    pub fn next_token_ref(&mut self) -> &TokenWithLocation {
        loop {
            self.index += 1;
            // skip whitespace
            if let Some(TokenWithLocation {
                token: Token::Whitespace(_),
                location: _,
            }) = self.tokens.get(self.index - 1) {
                continue
            }
            break;
        }
        if (self.index - 1) < self.tokens.len() {
            &self.tokens[self.index - 1]
        }
        else {
            eof_token()
        }
    }
...
}

static EOF_TOKEN: OnceLock<TokenWithLocation> = OnceLock::new();
fn eof_token() -> &'static TokenWithLocation {
    EOF_TOKEN.get_or_init(|| {
        TokenWithLocation::wrap(Token::EOF)
    })
}

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions