Upgrade Elasticsearch to version 6.x #4211

safwanrahman · 2018-06-08T19:59:54Z

~~Currently, only project indexing is working. Investigating and trying to index the pages/files in a cleaner way.~~ (Fixed)
Things that need to be done:

Implement a cleaner way to index File
Search for project
Search for File
Add highlighter for result description
Implement facet searching in a cleaner way
Configure travis to run the tests
Fixup the tests in order to pass

This fixes #4183

ericholscher

Tested this locally and it's working well w/ projects & faceting. The next big thing is definitely getting the File search working. 👍

ericholscher · 2018-06-12T12:28:29Z

.travis.yml

@@ -42,3 +42,4 @@ notifications:
 branches:
  only:
  - master
+  - search_upgrade


Interesting :)

ericholscher · 2018-06-12T14:31:37Z

readthedocs/search/indexes.py

@@ -143,7 +142,7 @@ def bulk_index(self, data, index=None, chunk_size=500, parent=None,
            docs.append(doc)

        # TODO: This doesn't work with the new ES setup.
-        bulk_index(self.es, docs, chunk_size=chunk_size)
+        # bulk_index(self.es, docs, chunk_size=chunk_size)

    def index_document(self, data, index=None, parent=None, routing=None):
        doc = self.extract_document(data)


Guessing this entire file and other related code should be deleted?

Yes, it needs to be deleted. I will delete it once I implement the file searching functionality!

ericholscher · 2018-06-12T14:32:15Z

readthedocs/search/tests/conftest.py



 @pytest.fixture
-def all_projects():
+def all_projects(es_index):


Where does this get passed in? Is it automatically calling the above fixture based on name?

Actually, it's pytest's dependency injection. If you have a fixture named foo and you accept it as a parameter in def bar(foo), the foo fixture will be passed to the bar fixture.
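
For reference, the fixture chaining described above can be sketched roughly as follows. This is an illustration only, not the project's actual conftest.py: the fixture bodies, the `build_index` helper, and the project names are hypothetical, and only the `es_index` and `all_projects` fixture names come from the diff.

```python
import pytest


def build_index():
    # Hypothetical stand-in for creating a real Elasticsearch test index.
    return {"name": "test_index", "docs": []}


@pytest.fixture
def es_index():
    # Set up the index, hand it to whoever requested it, then clean up.
    index = build_index()
    yield index
    index["docs"].clear()


@pytest.fixture
def all_projects(es_index):
    # pytest sees the parameter name `es_index`, finds the fixture with
    # that name, runs it first, and passes its value in here.
    es_index["docs"].extend(["project-a", "project-b"])
    return list(es_index["docs"])


def test_projects_are_indexed(all_projects):
    # Requesting `all_projects` implicitly sets up `es_index` as well.
    assert "project-a" in all_projects
```

Nothing calls `es_index` explicitly; naming it as a parameter is what makes pytest resolve and run it before `all_projects`.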

ericholscher · 2018-06-12T14:33:01Z

readthedocs/search/views.py

+                                                            language=user_input.language)
+            response = project_search.execute()
+            results = response.hits
+            facets = response.facets


Is this used?

Yes. It's used for showing the facet (language) in project search results.

ericholscher · 2018-06-12T14:36:40Z

readthedocs/search/documents.py

+            'doc_types': [cls],
+            'model': cls._doc_type.model,
+            'query': query
+        }


Is this logic required? It seems a bit heavy/complex.

I think we can keep this logic to stay aligned with the search method. Maybe it's not needed now, but it will be useful to keep them aligned.

ericholscher

I think this change is too complicated. We should be able to do this without much of the code changes here with just a filter on the manager.

ericholscher · 2018-06-13T11:38:02Z

readthedocs/projects/models.py

@@ -902,6 +903,7 @@ class ImportedFile(models.Model):
    path = models.CharField(_('Path'), max_length=255)
    md5 = models.CharField(_('MD5 checksum'), max_length=255)
    commit = models.CharField(_('Commit'), max_length=255)
+    is_html = models.BooleanField(default=False)


I don't think we need this on the model, the queryset manager can just do the filter, no?

Yeah, it can do the filtering, but I thought it would be much slower to filter in the queryset manager. I am OK with removing it.

ericholscher · 2018-06-13T11:38:27Z

readthedocs/projects/tasks.py

+            if fnmatch.fnmatch(filename, '*.html'):
+                model_class = HTMLFile
+            else:
+                model_class = ImportedFile


I don't believe this is needed, since it's all the same model in the database.

The problem is actually with the signal manager. I have opened django-es/django-elasticsearch-dsl#111 about this.

Until it is fixed, I believe we need a proxy model for this purpose.
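
As a rough sketch of the dispatch in the diff above: the filename pattern only decides which Python class handles the file, while a Django proxy model shares its parent's database table. The two classes below are plain hypothetical stand-ins, not the real Django models, and `pick_model_class` is an illustrative helper name that does not appear in the PR.

```python
import fnmatch


class ImportedFile:
    """Hypothetical stand-in for the Django model of the same name."""


class HTMLFile(ImportedFile):
    """Hypothetical stand-in for the proxy model used for HTML files."""


def pick_model_class(filename):
    # Same check as in the diff: shell-style pattern match on the name.
    if fnmatch.fnmatch(filename, '*.html'):
        return HTMLFile
    return ImportedFile
```

Note that `fnmatch.fnmatch` normalizes case according to the operating system, so `fnmatch.fnmatchcase` may be preferable if the match has to behave identically everywhere.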

ericholscher · 2018-06-13T11:39:01Z

readthedocs/projects/managers.py

+class HTMLFileManager(models.Manager):
+
+    def get_queryset(self):
+        return super(HTMLFileManager, self).get_queryset().filter(is_html=True)


Can't this just do filter(filename__endswith='html') instead of adding additional state to the model?

Yeah, it can be done. I thought it would be slower, so I added another piece of state.
But I think performance is not an issue here, so I am fine with filtering by name.

TODO: integrate it with view and template

safwanrahman · 2018-06-14T05:40:13Z

@ericholscher I think you can take another look at this!
I will integrate the view and template later today!

ericholscher

Good changes. I look forward to testing it locally once it's hooked up in the templates & views :)

ericholscher · 2018-06-14T09:44:45Z

readthedocs/projects/managers.py

+
+class ImportedFileManager(models.Manager):
+
+    def get_queryset(self):


I don't think we should exclude them here. This will change the logic in other downstream code which is using ImportedFile, which we don't want to do.

ericholscher · 2018-06-14T09:46:44Z

readthedocs/projects/models.py

+                                                           version_slug=self.version.slug,
+                                                           include_file=False)
+
+        file_path = find_file(basename=basename, pattern='*.fjson', path=full_path)


I don't think we need to do all this I/O. Shouldn't it simply be:

os.path.join(full_path, self.path.replace('.html', '.fjson')) or similar? It should be in the same path structure as the existing path.
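
A minimal sketch of that suggestion, assuming the JSON output mirrors the HTML path layout as the comment states. The helper name `json_path_for` and the example paths are hypothetical. It uses `os.path.splitext` rather than `str.replace`, a small variation so that a `.html` occurring earlier in the path is left untouched.

```python
import os.path


def json_path_for(full_path, html_path):
    # Swap only the trailing extension; no directory walk or globbing.
    root, ext = os.path.splitext(html_path)
    if ext != '.html':
        raise ValueError('expected an .html path, got %r' % html_path)
    return os.path.join(full_path, root + '.fjson')
```

For example, `json_path_for('/builds/json', 'guide/install.html')` yields the `guide/install.fjson` path under `/builds/json` without touching the filesystem.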

ericholscher · 2018-06-14T09:47:40Z

readthedocs/projects/models.py

+        file_path = find_file(basename=basename, pattern='*.fjson', path=full_path)
+        return file_path
+
+    @cached_property


How long is this cached for? Will we really see value in caching it vs. the tradeoff of keeping a bunch of JSON in memory?

As per the documentation, the cached result persists as long as the instance does.
Since it's called multiple times for the same instance while indexing (e.g. path, content, header), I think the cache helps a lot in making indexing fast.
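
To illustrate the lifetime being discussed, here is a small sketch using the standard library's `functools.cached_property` (Python 3.8+) as a stand-in; the decorator in the diff is presumably Django's, which is an assumption here. Both store the computed value on the instance. The class name, the `loads` counter, and the returned dictionary are hypothetical.

```python
from functools import cached_property


class HTMLFileSketch:
    def __init__(self, path):
        self.path = path
        self.loads = 0  # counts how often the expensive step runs

    @cached_property
    def processed_json(self):
        # Stand-in for reading and parsing the JSON file from disk.
        self.loads += 1
        return {'path': self.path, 'title': 'stub'}
```

Reading `processed_json` several times on one instance runs the body once. A new instance starts with an empty cache, and the cached data is released together with the instance.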

ericholscher · 2018-06-14T09:50:57Z

readthedocs/search/documents.py

+        return queryset
+
+    def update(self, thing, **kwargs):


When does this update actually get called? I believe it's a signal attached to the saving of the ImportedFile objects?

It's called every time the object is created, updated, or deleted.
It's also called whenever we run the management command (the call comes from the registry).

ericholscher · 2018-06-14T09:51:49Z

readthedocs/search/faceted_search.py

+    facets = {
+        'project': TermsFacet(field='project'),
+        'version': TermsFacet(field='version')
+    }


Love how simple this is.

Yes, me too! Using OOP concepts makes this so simple!

safwanrahman · 2018-06-14T20:18:06Z

@ericholscher I have fixed the integration in the view and template. ~~But currently the description with highlighting is not showing in the file search result. I will look into it later.~~ (Fixed)
Can you run it locally and let me know your feedback?

ericholscher

Looks good! Once we get the tests working, I think we can merge this into the search_upgrade branch, and continue to work on improvements as additional PRs, so as not to complicate this one.

safwanrahman · 2018-06-19T00:16:46Z

@ericholscher I have fixed the tests and they are now passing. 🎾
I could not mock the processed_json property, so I added another method, get_processed_json, which is called from processed_json, and mocked get_processed_json instead.

Without mocking, get_processed_json would run in every test for all the files, which would make the tests slower. I will add other tests for the model methods in the future.
Is it possible to merge?
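
The workaround described above can be sketched as follows. Only the names `processed_json` and `get_processed_json` come from the comment; the class, the fake return value, and the `read_with_mock` helper are hypothetical, and a plain `property` is used here for brevity.

```python
from unittest import mock


class HTMLFileSketch:
    def __init__(self, path):
        self.path = path

    def get_processed_json(self):
        # Stand-in for the expensive step that reads and parses a file.
        raise RuntimeError('would hit the filesystem')

    @property
    def processed_json(self):
        # The property only delegates, so tests can patch the method.
        return self.get_processed_json()


def fake_processed_json(self):
    return {'path': self.path, 'content': 'stub'}


def read_with_mock(path):
    # Patch the method on the class; the property picks up the fake.
    with mock.patch.object(HTMLFileSketch, 'get_processed_json',
                           fake_processed_json):
        return HTMLFileSketch(path).processed_json
```

Inside the `with` block every instance returns the stub data, and the original method is restored on exit.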

ericholscher

Looks good. I'll get it merged once we rename the migration.

ericholscher · 2018-06-19T08:54:01Z

readthedocs/projects/migrations/0026_auto_20180618_1645.py

@@ -0,0 +1,54 @@
+# -*- coding: utf-8 -*-


This migration should have a name, saying what it does.

I also worry that this migration will become out of date on a long-running branch alongside master. I'm not sure of the best path to take -- we will likely just need to re-create it before we merge it into master.

Yes! We can absolutely do this. I will keep this in mind.

safwanrahman · 2018-06-19T10:17:37Z

@ericholscher Renamed the migration. Ready to merge!

ericholscher · 2018-06-19T10:19:13Z

Looks good. 👍

Upgrade Elasticsearch to version 6.x

safwanrahman added 2 commits June 9, 2018 01:42

first phase to elasticsearch 6.2.x

3c41b42

adding requirements

6410495

agjohnson changed the title ~~[Fix #4183] Search Proof of Concept~~ Search Proof of Concept Jun 8, 2018

agjohnson added the PR: work in progress Pull request is not ready for full review label Jun 8, 2018

safwanrahman added 4 commits June 9, 2018 08:25

implementing project search, test and travis fix

272b50a

fixing travis

b8f1a06

fixing search install plugin

6c430e5

fixing up tests

035c312

safwanrahman force-pushed the search branch from 9e71e38 to 035c312 Compare June 9, 2018 03:19

fixing lint

746b378

ericholscher reviewed Jun 12, 2018

first phase file search

de47978

ericholscher reviewed Jun 13, 2018

safwanrahman added 2 commits June 14, 2018 09:51

indexing the file objects

ab6fffb

File searching basic backend task has been implemented

3523fab

TODO: integrate it with view and template

ericholscher reviewed Jun 14, 2018

integrate the new search with view and template

9a5b0ed

safwanrahman removed the PR: work in progress Pull request is not ready for full review label Jun 14, 2018

safwanrahman self-assigned this Jun 14, 2018

fixing highlighting

e9b1c03

ericholscher approved these changes Jun 15, 2018

safwanrahman added 2 commits June 19, 2018 03:44

fixing up tests

37f6936

adding migration

f730556

safwanrahman changed the title ~~Search Proof of Concept~~ Upgrade Elasticsearch to 6.x and rewrite using elasticsearch-dsl Jun 18, 2018

safwanrahman changed the title ~~Upgrade Elasticsearch to 6.x and rewrite using elasticsearch-dsl~~ Upgrade to Elasticsearch 6.x Jun 18, 2018

lint fix

05f5e05

safwanrahman force-pushed the search branch from 8645edc to 05f5e05 Compare June 19, 2018 00:14

safwanrahman changed the title ~~Upgrade to Elasticsearch 6.x~~ Upgrade Elasticsearch to version 6.x Jun 19, 2018

This was referenced Jun 19, 2018

Adding Test for new search prototype #4264

Closed

Integrate ICU Analysis plugin into upgraded search #4266

Closed

ericholscher approved these changes Jun 19, 2018

renameing

0965a94

ericholscher merged commit fd75aa3 into readthedocs:search_upgrade Jun 19, 2018

This was referenced Jun 20, 2018

Search Proof of Concept #4183

Closed

[Fix #2328 #2013] Refresh search index and test for case insensitive search #4277

Merged

safwanrahman deleted the search branch July 7, 2018 19:24

safwanrahman pushed a commit to safwanrahman/readthedocs.org that referenced this pull request Jul 16, 2018

Merge pull request readthedocs#4211 from safwanrahman/search

8d7942b

Upgrade Elasticsearch to version 6.x

safwanrahman pushed a commit to safwanrahman/readthedocs.org that referenced this pull request Jul 16, 2018

Merge pull request readthedocs#4211 from safwanrahman/search

d4f6708

Upgrade Elasticsearch to version 6.x

safwanrahman mentioned this pull request Jul 19, 2018

[Fix #4407] Port Project Search for Elasticsearch 6.x #4408

Merged

