Skip to content

Commit

Permalink
FEATURE: Roll out new search optimisations (discourse#20364)
Browse files Browse the repository at this point in the history
- Reduce duplication of terms in post index from unlimited to 6. This will
result in reduced index size and reduced weighting for posts containing
a huge amount of duplicate terms. (Eg: a post containing "sam sam sam sam
sam sam sam sam", will index as "sam sam sam sam sam sam", only including
the word up to 6 times.) This corrects a flaw where title weighting could
be ignored.

- Prioritize exact matches of words in titles. Our search always performs
a prefix match. However we want to give special weight to exact title matches
meaning that a search for "sum" will find topics such as "the sum of us" vs
"summer in spring".

- Pick up fixes to our search algorithm which are missing from old indexes.
Specifically pick up the fix that indexes URLs properly. (`https://happy.com`
was stemmed to `happi` in keywords and then was not searchable)

see also:

https://meta.discourse.org/t/refinements-to-search-being-tested-on-meta/254158

Indexing will take a while and work in batches, in the background.
  • Loading branch information
SamSaffron authored Feb 20, 2023
1 parent 3c57db5 commit cd247d5
Show file tree
Hide file tree
Showing 5 changed files with 12 additions and 7 deletions.
9 changes: 6 additions & 3 deletions app/services/search_indexer.rb
Original file line number Diff line number Diff line change
@@ -1,12 +1,15 @@
# frozen_string_literal: true

class SearchIndexer
POST_INDEX_VERSION = 4
MIN_POST_REINDEX_VERSION = 3
TOPIC_INDEX_VERSION = 3
MIN_POST_BLURB_INDEX_VERSION = 4

POST_INDEX_VERSION = 5
TOPIC_INDEX_VERSION = 4
CATEGORY_INDEX_VERSION = 3
USER_INDEX_VERSION = 3
TAG_INDEX_VERSION = 3

# version to apply when issuing a background reindex
REINDEX_VERSION = 0
TS_VECTOR_PARSE_REGEX = /('([^']*|'')*'\:)(([0-9]+[A-D]?,?)+)/

Expand Down
4 changes: 2 additions & 2 deletions config/site_settings.yml
Original file line number Diff line number Diff line change
Expand Up @@ -2191,10 +2191,10 @@ search:
default: false
hidden: true
prioritize_exact_search_title_match:
default: false
default: true
hidden: true
max_duplicate_search_index_terms:
default: -1
default: 6
hidden: true
use_pg_headlines_for_excerpt:
default: false
Expand Down
2 changes: 1 addition & 1 deletion lib/search/grouped_search_results.rb
Original file line number Diff line number Diff line change
Expand Up @@ -71,7 +71,7 @@ def find_user_data(guardian)
def blurb(post)
opts = { term: @blurb_term, blurb_length: @blurb_length }

if post.post_search_data.version > SearchIndexer::MIN_POST_REINDEX_VERSION &&
if post.post_search_data.version >= SearchIndexer::MIN_POST_BLURB_INDEX_VERSION &&
!Search.segment_chinese? && !Search.segment_japanese?
if SiteSetting.use_pg_headlines_for_excerpt
scrubbed_headline = post.headline.gsub(SCRUB_HEADLINE_REGEXP, '\1')
Expand Down
2 changes: 1 addition & 1 deletion spec/jobs/reindex_search_spec.rb
Original file line number Diff line number Diff line change
Expand Up @@ -70,7 +70,7 @@ def self.get_posts
end

it "should not reindex posts with a developmental version" do
post = Fabricate(:post, version: SearchIndexer::MIN_POST_REINDEX_VERSION + 1)
Fabricate(:post, version: SearchIndexer::POST_INDEX_VERSION + 1)

subject.rebuild_posts(indexer: FakeIndexer)

Expand Down
2 changes: 2 additions & 0 deletions spec/lib/search_spec.rb
Original file line number Diff line number Diff line change
Expand Up @@ -122,6 +122,8 @@

before do
SearchIndexer.enable
SiteSetting.max_duplicate_search_index_terms = -1
SiteSetting.prioritize_exact_search_title_match = false
[post1, post2].each { |post| SearchIndexer.index(post, force: true) }
end

Expand Down

0 comments on commit cd247d5

Please sign in to comment.