Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Allow code search by filename #32210

Merged
merged 5 commits into from
Oct 11, 2024
Merged

Conversation

bsofiato
Copy link
Contributor

@bsofiato bsofiato commented Oct 7, 2024

This is a large and complex PR, so let me explain in detail its changes.

First, I had to create new index mappings for Bleve and ElasticSerach as the current ones do not support search by filename. This requires Gitea to recreate the code search indexes (I do not know if this is a breaking change, but I feel it deserves a heads-up).

I've used this approach to model the filename index. It allows us to efficiently search for both the full path and the name of a file. Bleve, however, does not support this out-of-box, so I had to code a brand new token filter to generate the search terms.

I also did an overhaul in the indexer_test.go file. It now asserts the order of the expected results (this is important since matches based on the name of a file are more relevant than those based on its content).
I've added new test scenarios that deal with searching by filename. They use a new repo included in the Gitea fixture.

The screenshot below depicts how Gitea shows the search results. It shows results based on content in the same way as the current version does. In matches based on the filename, the first seven lines of the file contents are shown (BTW, this is how GitHub does it).

image

Resolves #32096

@GiteaBot GiteaBot added the lgtm/need 2 This PR needs two approvals by maintainers to be considered for merging. label Oct 7, 2024
@pull-request-size pull-request-size bot added the size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. label Oct 7, 2024
@github-actions github-actions bot added the modifies/go Pull requests that update Go code label Oct 7, 2024
@bsofiato bsofiato force-pushed the feature/search_by_path branch 6 times, most recently from 9c77450 to 05a5090 Compare October 7, 2024 19:16
@bsofiato bsofiato force-pushed the feature/search_by_path branch 2 times, most recently from 89f67c6 to 12c542d Compare October 7, 2024 21:40
@yp05327
Copy link
Contributor

yp05327 commented Oct 8, 2024

The year of copyright in new files should be 2024.

@bsofiato
Copy link
Contributor Author

bsofiato commented Oct 8, 2024

The year of copyright in new files should be 2024.

Done

Signed-off-by: Bruno Sofiato <bruno.sofiato@gmail.com>
@lafriks
Copy link
Member

lafriks commented Oct 9, 2024

Please do not force push, it's hard to follow changes this way

@bsofiato
Copy link
Contributor Author

bsofiato commented Oct 9, 2024

Please do not force push, it's hard to follow changes this way

Sure. My bad 😔

@lunny lunny added this to the 1.24.0 milestone Oct 9, 2024
@GiteaBot GiteaBot added lgtm/need 1 This PR needs approval from one additional maintainer to be merged. and removed lgtm/need 2 This PR needs two approvals by maintainers to be considered for merging. labels Oct 11, 2024
@delvh
Copy link
Member

delvh commented Oct 11, 2024

This requires Gitea to recreate the code search indexes

Ah, right - for this to happen, we need to increase issueIndexerLatestVersion in the affected indexers

@lunny
Copy link
Member

lunny commented Oct 11, 2024

Can you add a search test with filename as keyword?

@bsofiato
Copy link
Contributor Author

Can you add a search test with filename as keyword?

I have included some., I'll provide some comments on the test cases to better describe their underlying scenarios. What do you think ?

@bsofiato
Copy link
Contributor Author

Can you add a search test with filename as keyword?

I have included some., I'll provide some comments on the test cases to better describe their underlying scenarios. What do you think ?

Hey @lunny, I've added some comments to the test cases to better describe their scenarios :)

@GiteaBot GiteaBot added lgtm/done This PR has enough approvals to get merged. There are no important open reservations anymore. and removed lgtm/need 1 This PR needs approval from one additional maintainer to be merged. labels Oct 11, 2024
@lunny lunny added the reviewed/wait-merge This pull request is part of the merge queue. It will be merged soon. label Oct 11, 2024
@lunny lunny modified the milestones: 1.24.0, 1.23.0 Oct 11, 2024
@lunny lunny enabled auto-merge (squash) October 11, 2024 23:06
@lunny lunny merged commit 900ac62 into go-gitea:main Oct 11, 2024
26 checks passed
@GiteaBot GiteaBot removed the reviewed/wait-merge This pull request is part of the merge queue. It will be merged soon. label Oct 11, 2024
matera-bs pushed a commit to matera-ar/gitea that referenced this pull request Oct 15, 2024
go-gitea#32210)

This is a large and complex PR, so let me explain in detail its changes.

First, I had to create new index mappings for Bleve and ElasticSerach as
the current ones do not support search by filename. This requires Gitea
to recreate the code search indexes (I do not know if this is a breaking
change, but I feel it deserves a heads-up).

I've used [this
approach](https://www.elastic.co/guide/en/elasticsearch/reference/7.17/analysis-pathhierarchy-tokenizer.html)
to model the filename index. It allows us to efficiently search for both
the full path and the name of a file. Bleve, however, does not support
this out-of-box, so I had to code a brand new [token
filter](https://blevesearch.com/docs/Token-Filters/) to generate the
search terms.

I also did an overhaul in the `indexer_test.go` file. It now asserts the
order of the expected results (this is important since matches based on
the name of a file are more relevant than those based on its content).
I've added new test scenarios that deal with searching by filename. They
use a new repo included in the Gitea fixture.

The screenshot below depicts how Gitea shows the search results. It
shows results based on content in the same way as the current version
does. In matches based on the filename, the first seven lines of the
file contents are shown (BTW, this is how GitHub does it).

![image](https://github.com/user-attachments/assets/9d938d86-1a8d-4f89-8644-1921a473e858)

Resolves go-gitea#32096

---------

Signed-off-by: Bruno Sofiato <bruno.sofiato@gmail.com>
matera-bs pushed a commit to matera-ar/gitea that referenced this pull request Oct 15, 2024
go-gitea#32210)

This is a large and complex PR, so let me explain in detail its changes.

First, I had to create new index mappings for Bleve and ElasticSerach as
the current ones do not support search by filename. This requires Gitea
to recreate the code search indexes (I do not know if this is a breaking
change, but I feel it deserves a heads-up).

I've used [this
approach](https://www.elastic.co/guide/en/elasticsearch/reference/7.17/analysis-pathhierarchy-tokenizer.html)
to model the filename index. It allows us to efficiently search for both
the full path and the name of a file. Bleve, however, does not support
this out-of-box, so I had to code a brand new [token
filter](https://blevesearch.com/docs/Token-Filters/) to generate the
search terms.

I also did an overhaul in the `indexer_test.go` file. It now asserts the
order of the expected results (this is important since matches based on
the name of a file are more relevant than those based on its content).
I've added new test scenarios that deal with searching by filename. They
use a new repo included in the Gitea fixture.

The screenshot below depicts how Gitea shows the search results. It
shows results based on content in the same way as the current version
does. In matches based on the filename, the first seven lines of the
file contents are shown (BTW, this is how GitHub does it).

![image](https://github.com/user-attachments/assets/9d938d86-1a8d-4f89-8644-1921a473e858)

Resolves go-gitea#32096

---------

Signed-off-by: Bruno Sofiato <bruno.sofiato@gmail.com>
zjjhot added a commit to zjjhot/gitea that referenced this pull request Oct 16, 2024
* giteaofficial/main:
  Make `owner/repo/pulls` handlers use "PR reader" permission (go-gitea#32254)
  make `show stats` work when only one file changed (go-gitea#32244)
  Update scheduled tasks even if changes are pushed by "ActionsUser" (go-gitea#32246)
  Support migrating GitHub/GitLab PR draft status (go-gitea#32242)
  Only rename a user when they should receive a different name (go-gitea#32247)
  Fix dropdown content overflow (go-gitea#31610)
  Make git push options accept short name (go-gitea#32245)
  Allow code search by filename (go-gitea#32210)
  Allow maintainers to view and edit files of private repos when "Allow maintainers to edit" is enabled (go-gitea#32215)
  Use per package global lock for container uploads instead of memory lock (go-gitea#31860)
matera-bs pushed a commit to matera-ar/gitea that referenced this pull request Oct 29, 2024
go-gitea#32210)

This is a large and complex PR, so let me explain in detail its changes.

First, I had to create new index mappings for Bleve and ElasticSerach as
the current ones do not support search by filename. This requires Gitea
to recreate the code search indexes (I do not know if this is a breaking
change, but I feel it deserves a heads-up).

I've used [this
approach](https://www.elastic.co/guide/en/elasticsearch/reference/7.17/analysis-pathhierarchy-tokenizer.html)
to model the filename index. It allows us to efficiently search for both
the full path and the name of a file. Bleve, however, does not support
this out-of-box, so I had to code a brand new [token
filter](https://blevesearch.com/docs/Token-Filters/) to generate the
search terms.

I also did an overhaul in the `indexer_test.go` file. It now asserts the
order of the expected results (this is important since matches based on
the name of a file are more relevant than those based on its content).
I've added new test scenarios that deal with searching by filename. They
use a new repo included in the Gitea fixture.

The screenshot below depicts how Gitea shows the search results. It
shows results based on content in the same way as the current version
does. In matches based on the filename, the first seven lines of the
file contents are shown (BTW, this is how GitHub does it).

![image](https://github.com/user-attachments/assets/9d938d86-1a8d-4f89-8644-1921a473e858)

Resolves go-gitea#32096

---------

Signed-off-by: Bruno Sofiato <bruno.sofiato@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
lgtm/done This PR has enough approvals to get merged. There are no important open reservations anymore. modifies/go Pull requests that update Go code size/XL Denotes a PR that changes 500-999 lines, ignoring generated files.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Code search should look for the filename as well
6 participants