Replace linkRegex with xurls library #6261

Merged
merged 4 commits into from
Mar 7, 2019

mrsdizzie
Member

Rather than maintaining a complicated regex to match URLs for autolinking, Gitea can use this existing Go library, which takes care of the matching with very little code change to Gitea itself:

https://github.com/mvdan/xurls

After spending a while trying to find the perfect regex for all cases, I found this library still works better, as it is more flexible than a single regex ever will be.

This will also fix the following issues: #5844 #3095 #3381

This passes all current tests and I've added new ones based on URLs mentioned in those issues above.

@codecov-io

codecov-io commented Mar 7, 2019

Codecov Report

❗ No coverage uploaded for pull request base (master@01bd1fc).
The diff coverage is n/a.

@@            Coverage Diff            @@
##             master    #6261   +/-   ##
=========================================
  Coverage          ?   38.81%           
=========================================
  Files             ?      355           
  Lines             ?    50253           
  Branches          ?        0           
=========================================
  Hits              ?    19504           
  Misses            ?    27920           
  Partials          ?     2829
Impacted Files Coverage Δ
modules/markup/html.go 88.09% <ø> (ø)

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 01bd1fc...805a970. Read the comment docs.

@GiteaBot GiteaBot added the lgtm/need 2 This PR needs two approvals by maintainers to be considered for merging. label Mar 7, 2019
@@ -645,7 +642,7 @@ func emailAddressProcessor(ctx *postProcessCtx, node *html.Node) {
 // linkProcessor creates links for any HTTP or HTTPS URL not captured by
 // markdown.
 func linkProcessor(ctx *postProcessCtx, node *html.Node) {
-	m := linkRegex.FindStringIndex(node.Data)
+	m := xurls.Strict().FindStringIndex(node.Data)
Contributor

The result of xurls.Strict() should be cached - it compiles several regexps.
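
To illustrate the caching point, here is a minimal sketch that uses only the standard `regexp` package so it stays self-contained. The pattern `linkPattern` and the helper names are hypothetical stand-ins, not the expression xurls builds and not the code that was merged. The idea is simply to compile once at package level and reuse the result, instead of rebuilding the regexp on every call:

```go
package main

import (
	"fmt"
	"regexp"
)

// linkPattern is a deliberately simplified, hypothetical stand-in for a URL
// pattern; it is NOT the expression that xurls generates.
const linkPattern = `https?://[^\s<>"]+`

// cachedLinkRegex is compiled once, at package initialisation, and reused by
// every call. A *regexp.Regexp is safe for concurrent use by multiple goroutines.
var cachedLinkRegex = regexp.MustCompile(linkPattern)

// findLinkUncached recompiles the pattern on every call, which mirrors
// calling a regexp-building function inside the processing loop.
func findLinkUncached(s string) []int {
	return regexp.MustCompile(linkPattern).FindStringIndex(s)
}

// findLinkCached reuses the package-level regexp and does no compilation.
func findLinkCached(s string) []int {
	return cachedLinkRegex.FindStringIndex(s)
}

func main() {
	text := "see https://example.com/docs for details"
	if m := findLinkCached(text); m != nil {
		fmt.Println(text[m[0]:m[1]])
	}
}
```

Both helpers return the same indices; the difference is only where the compilation cost is paid.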

@techknowlogick techknowlogick added this to the 1.8.0 milestone Mar 7, 2019
@techknowlogick
Member

Tagging this as a bugfix as it solves a bug, so that we can get it in the 1.8.0 release.

This is much faster and we only care about https? links to preserve
existing behavior.
@GiteaBot GiteaBot added lgtm/need 1 This PR needs approval from one additional maintainer to be merged. and removed lgtm/need 2 This PR needs two approvals by maintainers to be considered for merging. labels Mar 7, 2019
@GiteaBot GiteaBot added lgtm/done This PR has enough approvals to get merged. There are no important open reservations anymore. and removed lgtm/need 1 This PR needs approval from one additional maintainer to be merged. labels Mar 7, 2019
@mrsdizzie
Member Author

Thanks very much for the feedback! That is exactly right, thanks for catching it.

Also here is a tiny test program to compare how long this takes vs the current implementation:

https://gist.github.com/mrsdizzie/edfcbf36a5355d1db5f5d7218543a7a4

The results I got were:

Running each test on 10,000 random lines
Starting current linkRegex test:
52.756294ms
Starting modifiedLinkRegex test:
84.711437ms
Starting xurl test:
92.17964ms
Starting xurl with all schemes test:
1.871560519s

modifiedLinkRegex is a new regex that tries to match all of the URLs mentioned in the issues above. So xurls is roughly on par with any regex we could have added, and it doesn't introduce noticeable slowness in real-world situations with large content.
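
For readers who want to try a comparison of this kind, a timing harness could be sketched as follows. This is not the gist linked above: the two patterns (`narrowRegex`, `broadRegex`), the helper names, and the deterministic sample lines are all assumptions made for illustration, and the timings it prints will differ from the figures quoted here.

```go
package main

import (
	"fmt"
	"regexp"
	"time"
)

// Hypothetical patterns for illustration only; neither is Gitea's linkRegex
// nor the expression built by xurls.
var (
	narrowRegex = regexp.MustCompile(`https?://\S+`)
	broadRegex  = regexp.MustCompile(`[a-z][a-z0-9+.-]*://\S+`)
)

// sampleLines builds n deterministic lines; every third line contains a URL.
// (The original test used random lines; fixed input keeps runs comparable.)
func sampleLines(n int) []string {
	lines := make([]string, n)
	for i := range lines {
		if i%3 == 0 {
			lines[i] = fmt.Sprintf("line %d links to https://example.com/%d", i, i)
		} else {
			lines[i] = fmt.Sprintf("line %d has no link", i)
		}
	}
	return lines
}

// timeMatches runs re over every line and returns how many lines matched
// together with the elapsed wall-clock time.
func timeMatches(re *regexp.Regexp, lines []string) (int, time.Duration) {
	start := time.Now()
	hits := 0
	for _, line := range lines {
		if re.FindStringIndex(line) != nil {
			hits++
		}
	}
	return hits, time.Since(start)
}

func main() {
	lines := sampleLines(10000)
	for _, c := range []struct {
		name string
		re   *regexp.Regexp
	}{
		{"narrow (https? only)", narrowRegex},
		{"broad (any scheme)", broadRegex},
	} {
		hits, elapsed := timeMatches(c.re, lines)
		fmt.Printf("%s: %d matching lines in %s\n", c.name, hits, elapsed)
	}
}
```

A single wall-clock measurement like this is noisy; for anything beyond a rough comparison, Go's `testing` package benchmarks are the more reliable tool.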

@techknowlogick techknowlogick merged commit f2de5dc into go-gitea:master Mar 7, 2019
@go-gitea go-gitea locked and limited conversation to collaborators Nov 24, 2020
Labels
lgtm/done This PR has enough approvals to get merged. There are no important open reservations anymore. type/bug