-
-
Notifications
You must be signed in to change notification settings - Fork 3.6k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Upgrade Elasticsearch to version 6.x #4211
Upgrade Elasticsearch to version 6.x #4211
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Tested this locally and it's working well w/ projects & faceting. The next big thing is definitely getting the File search working. 👍
@@ -42,3 +42,4 @@ notifications: | |||
branches: | |||
only: | |||
- master | |||
- search_upgrade |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Interesting :)
@@ -143,7 +142,7 @@ def bulk_index(self, data, index=None, chunk_size=500, parent=None, | |||
docs.append(doc) | |||
|
|||
# TODO: This doesn't work with the new ES setup. | |||
bulk_index(self.es, docs, chunk_size=chunk_size) | |||
# bulk_index(self.es, docs, chunk_size=chunk_size) | |||
|
|||
def index_document(self, data, index=None, parent=None, routing=None): | |||
doc = self.extract_document(data) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Guessing this entire file and other related code should be deleted?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes. it need to be deleted. I will delete once I implement the file searching functionality!
readthedocs/search/tests/conftest.py
Outdated
|
||
|
||
@pytest.fixture | ||
def all_projects(): | ||
def all_projects(es_index): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Where does this get passed in? Is it automatically callign the above fixture based on name?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actually, its pytest's dependeny enjection. So if you have a fixture name foo
and you accept this fixture in def bar(foo)
, the foo
fixture will be passed to the bar
fixture.
readthedocs/search/views.py
Outdated
language=user_input.language) | ||
response = project_search.execute() | ||
results = response.hits | ||
facets = response.facets |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this used?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes. Its used for showing facet (language) in project search results.
'doc_types': [cls], | ||
'model': cls._doc_type.model, | ||
'query': query | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this logic required? It seems a bit heavy/complex.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think, to keep alligned with the search
method, we can keep this logic. maybe its not needed now, but it will be useful to keep it alligned.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this change is too complicated. We should be able to do this without much of the code changes here with just a filter on the manager.
readthedocs/projects/models.py
Outdated
@@ -902,6 +903,7 @@ class ImportedFile(models.Model): | |||
path = models.CharField(_('Path'), max_length=255) | |||
md5 = models.CharField(_('MD5 checksum'), max_length=255) | |||
commit = models.CharField(_('Commit'), max_length=255) | |||
is_html = models.BooleanField(default=False) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think we need this on the model, the queryset manager can just do the filter, no?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah. it can do the filtering. but I thought it would be much slow to filter in the queryset manager. I am ok to remove it.
if fnmatch.fnmatch(filename, '*.html'): | ||
model_class = HTMLFile | ||
else: | ||
model_class = ImportedFile |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't believe this is needed, since it's all the same model in the database.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The problem is with actually signal manager. I have opened django-es/django-elasticsearch-dsl#111 about this.
Untill it has been fixed, we need to have a proxy model for the purpose, I believe.
readthedocs/projects/managers.py
Outdated
class HTMLFileManager(models.Manager): | ||
|
||
def get_queryset(self): | ||
return super(HTMLFileManager, self).get_queryset().filter(is_html=True) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can't this just do filter(filename__endswith='html')
instead of adding additional state to the model?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah. It can be done. I thought it would be slower, so I added another state.
But I think performance is not a issue here. So I am good to filter by name.
TODO: integrate it with view and template
@ericholscher I think you can take another look into this! |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good changes. I look forward to testing it locally once it's hooked up in the templates & views :)
readthedocs/projects/managers.py
Outdated
|
||
class ImportedFileManager(models.Manager): | ||
|
||
def get_queryset(self): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think we should exclude them here. This will change the logic in other downstream code which is using ImportedFile
, which we don't want to do.
readthedocs/projects/models.py
Outdated
version_slug=self.version.slug, | ||
include_file=False) | ||
|
||
file_path = find_file(basename=basename, pattern='*.fjson', path=full_path) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think we need to do all this I/O. Shouldn't it simply be:
os.path.join(full_path, self.path.replace('.html', '.fjson')
or similar? It should be in the same path structure as the existing path
.
readthedocs/projects/models.py
Outdated
file_path = find_file(basename=basename, pattern='*.fjson', path=full_path) | ||
return file_path | ||
|
||
@cached_property |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How long is this cached for? Will we really see value in caching it vs. the tradeoff of keeping a bunch of JSON in memory?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As per documentation cached result will persist as long as the instance does.
As its called multiple time for same instance while indexing, (eg path
, content
, header
), I think the cache helps much in indexing fast.
readthedocs/search/documents.py
Outdated
return queryset | ||
|
||
def update(self, thing, **kwargs): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
When does this update
actually get called? I believe it's a signal attached to the saving of the ImportedFile
objects?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Its called everytime the object is created, updated, or deleted.
Also whenever we run the management command.
call from the registry
facets = { | ||
'project': TermsFacet(field='project'), | ||
'version': TermsFacet(field='version') | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Love how simple this is.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes. me too! the thing is made so simple by using OOP concepts!
@ericholscher I have fixed the integration in view and template. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good! Once we get the tests working, I think we can merge this into the search_upgrade
branch, and continue to work on improvements as additional PR's, so as not to complicate this one.
@ericholscher I have fixed the tests and now it is passing. 🎾 Without mocking, the |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good. I'll get it merged once we rename the migration.
@@ -0,0 +1,54 @@ | |||
# -*- coding: utf-8 -*- |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This migration should have a name, saying what it does.
I also worry that this migration will become out of date with a long-running branch beside the master
. I'm not sure the best path to take -- we will likely just need to re-create it before we merge it into master
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes! We can absolutely do this. I will keep this in mind.
@ericholscher Renamed the migration. ready to merge! |
Looks good. 👍 |
Upgrade Elasticsearch to version 6.x
Upgrade Elasticsearch to version 6.x
Currently, only project indexing working. Investigating and trying to index the pages/files in a cleaner way.(Fixed)Things need to be done
File
facet
searching in a cleaner wayThis fixes #4183