Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Regex#each_name(&) and Regex#names #9760

Open
wants to merge 2 commits into
base: master
Choose a base branch
from

Conversation

Sija
Copy link
Contributor

@Sija Sija commented Sep 18, 2020

fwiw Regex#names exists in Ruby.

Regex#each_name was added as a helper and a way to access group names without additional allocations.

@Sija
Copy link
Contributor Author

Sija commented Dec 3, 2020

Could this get a 2nd approval?

@asterite
Copy link
Member

asterite commented Dec 3, 2020

These were never discussed before, sorry. I don't understand their usefulness so I won't approve them.

@Sija
Copy link
Contributor Author

Sija commented Jan 7, 2021

2nd approval anyone? /cc @bcardiff @jhass

@straight-shoota
Copy link
Member

This needs a rebase

@Sija Sija force-pushed the regex-each-name-and-names branch from c2f3c1c to 878094a Compare January 7, 2021 01:34
@Sija
Copy link
Contributor Author

Sija commented Jan 7, 2021

@straight-shoota done.

@asterite
Copy link
Member

asterite commented Jan 7, 2021

Can you provide a use case?

Also, UInt16 should be replace with Int32.

Then, I don't understand why names is useful for, and why there needs to be an unsorted and sorted variants. This makes the API bigger but I can't see how it can be used for anything useful.

@Sija
Copy link
Contributor Author

Sija commented Mar 15, 2021

Can you provide a use case?

@asterite These new methods are documented and I consider that enough for explaining the use cases.

Also, UInt16 should be replace with Int32.

That would be probably better, yes.

Then, I don't understand why names is useful for, and why there needs to be an unsorted and sorted variants. This makes the API bigger but I can't see how it can be used for anything useful.

Agreed, although unsorted version is there for efficiency reasons, yet if deemed unnecessary I can leave only sorted version.

@Sija
Copy link
Contributor Author

Sija commented May 7, 2021

@asterite ping 👋

@asterite
Copy link
Member

asterite commented May 7, 2021

Sorry, I don't have much more time to review or decide things.

Copy link
Member

@straight-shoota straight-shoota left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd prefer to keep the API as small as possible. Regex is inefficient by itself, so I don't see much reason to provide a super-efficient overload for unsorted names, which seems a rather obscure use-case.

src/regex.cr Outdated
Comment on lines 574 to 580
def name_table : Hash(UInt16, String)
Hash(UInt16, String).new.tap do |name_table|
each_name do |name, capture_number|
name_table[capture_number] = name
end
end
end
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The key type should most certainly not be UInt16 but Int32.

I'm also wondering if an array would be a better data structure. With many unnamed groups it would occupy more memory, but lookup is more performant. And looking up items by index is a classic job for Indexable.

Copy link
Contributor Author

@Sija Sija May 7, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IIRC you can't use the Array since the index keys might not be correspondent the to array indexes.

Copy link
Member

@straight-shoota straight-shoota May 7, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I just realized that this method already exists in the current API with this return type. So we shouldn't break it.
But nevertheless, it's bad. We shouldn't expose UInt16. That's just an implementation detail.
So I'd suggest to deprecate that method and add another one with a different name and correct return type.

Yes, in an array the indices of unnamed capture groups must be nil values.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Example:

/(.)(?<foo>.)(.)(?<bar>.)(.)/.name_table # => {4 => "bar", 2 => "foo"}
/(.)(?<foo>.)(.)(?<bar>.)(.)/.name_array # => [nil, "foo", nil, "bar"]

@Sija Sija force-pushed the regex-each-name-and-names branch from 878094a to 0799228 Compare May 9, 2021 00:59
@Sija Sija requested a review from straight-shoota May 9, 2021 01:08
@straight-shoota
Copy link
Member

We can't just change the return type of #name_table from Hash(UInt16, String) to Hash(Int32, String). That would break the API. For now, we need to keep that method.
There could be a different method returning Hash(Int32, String), though.

I've also some doubts if #names is a good addition. I see some risk that the behaviour cause confusion: The absence of unnamed capture groups might be missed. So I'm leaning towards not having this method.
Is there a good use case for that, besides "Ruby has it"? I don't think it's much relevant and can easily be implemented when needed.

@Sija
Copy link
Contributor Author

Sija commented May 9, 2021

We can't just change the return type of #name_table from Hash(UInt16, String) to Hash(Int32, String). That would break the API. For now, we need to keep that method.
There could be a different method returning Hash(Int32, String), though.

https://github.com/search?q=name_table+language%3ACrystal&type=code

There's literally no single case there which would necessitate adding a new method.

I've also some doubts if #names is a good addition. I see some risk that the behaviour cause confusion: The absence of unnamed capture groups might be missed. So I'm leaning towards not having this method.

I don't get the point. If you call names you'd expect to see named captures, what else could you expect?

Is there a good use case for that, besides "Ruby has it"? I don't think it's much relevant and can easily be implemented when needed.

There's no other way to get the list of named captures, so that's enough for the use case for me, and this kind of - already accessible - feature, shouldn't force you to monkey-patch stdlib.

@straight-shoota
Copy link
Member

There's literally no single case there which would necessitate adding a new method.

That just means no single use published at GitHub. And it doesn't matter. It's a breaking change.

I don't get the point. If you call names you'd expect to see named captures, what else could you expect?

You could expect the named captures to be indexed by their capture group index.

There's no other way to get the list of named captures, so that's enough for the use case for me,

That's not a use case at all. There's no point in adding a method unless there is a real use case.
And sure, you can get an array of names from name_table.values (unsorted) or name_table.to_a.sort!.map(&.[1]) (sorted).

@beta-ziliani
Copy link
Member

Hi @Sija , thanks for taking the time to write the PR. But we need to ask you a little bit extra:

Can you provide a use case?

@asterite These new methods are documented and I consider that enough for explaining the use cases.

The documentation explains the use of the method, but does not provide enough evidence that the change is necessary. What we're asking is what makes you want to do this change. What couldn't be done before, or how inefficient or ugly was to do it, and what is better with your proposal? Ideally, you should first open an issue stating the problem, and perhaps drafting a solution, and then move to coding when we all agree on your proposal. Since we're discussing it here now, can you provide an answer to this question in this PR? Thanks!

@Sija
Copy link
Contributor Author

Sija commented Jul 2, 2021

@beta-ziliani My use case was the replacement/escaping named group values without knowing the group names in advance. Without #names I had to resort to #name_table.to_a.sort!.map(&.[1]) call. And sorry for the delay 🙏

@HertzDevil
Copy link
Contributor

#each_name sounds reasonable, but I agree the method name should be changed. We now have Regex::PCRE2#each_named_capture_group which is pretty much the same idea. Note that you could only skip the Hash allocation, not the Strings themselves; the standard library avoids exposing strings as Bytes publicly in this way, even if it means fewer allocations.

#names is entirely redundant. With #each_name you are free to implement your own #name or #name_array (#9760 (comment)) based on it.

@Sija
Copy link
Contributor Author

Sija commented Apr 5, 2024

#each_name sounds reasonable, but I agree the method name should be changed. We now have Regex::PCRE2#each_named_capture_group which is pretty much the same idea. Note that you could only skip the Hash allocation, not the Strings themselves; the standard library avoids exposing strings as Bytes publicly in this way, even if it means fewer allocations.

Yeah, name is up for the debate.

#names is entirely redundant. With #each_name you are free to implement your own #name or #name_array (#9760 (comment)) based on it.

This I'm not sure I follow. Why is it redundant? You could pose the same argument for String#chars vs String#each_char. Also, Ruby has it, so they didn't see it as redundant either.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants