Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Upgrade Elasticsearch to version 6.x #4211

Merged
merged 16 commits into from
Jun 19, 2018

Conversation

safwanrahman
Copy link
Member

@safwanrahman safwanrahman commented Jun 8, 2018

Currently, only project indexing working. Investigating and trying to index the pages/files in a cleaner way. (Fixed)
Things need to be done

  • Implement a cleaner way to index File
  • Search for project
  • Search for File
  • Add highlighter for result description
  • Implement facet searching in a cleaner way
  • Configure travis to run the tests
  • Fixup the tests in order to pass

This fixes #4183

@agjohnson agjohnson changed the title [Fix #4183] Search Proof of Concept Search Proof of Concept Jun 8, 2018
@agjohnson agjohnson added the PR: work in progress Pull request is not ready for full review label Jun 8, 2018
Copy link
Member

@ericholscher ericholscher left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Tested this locally and it's working well w/ projects & faceting. The next big thing is definitely getting the File search working. 👍

@@ -42,3 +42,4 @@ notifications:
branches:
only:
- master
- search_upgrade
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Interesting :)

@@ -143,7 +142,7 @@ def bulk_index(self, data, index=None, chunk_size=500, parent=None,
docs.append(doc)

# TODO: This doesn't work with the new ES setup.
bulk_index(self.es, docs, chunk_size=chunk_size)
# bulk_index(self.es, docs, chunk_size=chunk_size)

def index_document(self, data, index=None, parent=None, routing=None):
doc = self.extract_document(data)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Guessing this entire file and other related code should be deleted?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes. it need to be deleted. I will delete once I implement the file searching functionality!



@pytest.fixture
def all_projects():
def all_projects(es_index):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Where does this get passed in? Is it automatically callign the above fixture based on name?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually, its pytest's dependeny enjection. So if you have a fixture name foo and you accept this fixture in def bar(foo), the foo fixture will be passed to the bar fixture.

language=user_input.language)
response = project_search.execute()
results = response.hits
facets = response.facets
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this used?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes. Its used for showing facet (language) in project search results.

'doc_types': [cls],
'model': cls._doc_type.model,
'query': query
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this logic required? It seems a bit heavy/complex.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think, to keep alligned with the search method, we can keep this logic. maybe its not needed now, but it will be useful to keep it alligned.

Copy link
Member

@ericholscher ericholscher left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this change is too complicated. We should be able to do this without much of the code changes here with just a filter on the manager.

@@ -902,6 +903,7 @@ class ImportedFile(models.Model):
path = models.CharField(_('Path'), max_length=255)
md5 = models.CharField(_('MD5 checksum'), max_length=255)
commit = models.CharField(_('Commit'), max_length=255)
is_html = models.BooleanField(default=False)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think we need this on the model, the queryset manager can just do the filter, no?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah. it can do the filtering. but I thought it would be much slow to filter in the queryset manager. I am ok to remove it.

if fnmatch.fnmatch(filename, '*.html'):
model_class = HTMLFile
else:
model_class = ImportedFile
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't believe this is needed, since it's all the same model in the database.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The problem is with actually signal manager. I have opened django-es/django-elasticsearch-dsl#111 about this.

Untill it has been fixed, we need to have a proxy model for the purpose, I believe.

class HTMLFileManager(models.Manager):

def get_queryset(self):
return super(HTMLFileManager, self).get_queryset().filter(is_html=True)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can't this just do filter(filename__endswith='html') instead of adding additional state to the model?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah. It can be done. I thought it would be slower, so I added another state.
But I think performance is not a issue here. So I am good to filter by name.

@safwanrahman
Copy link
Member Author

@ericholscher I think you can take another look into this!
I will integrate the view and templated later today!

Copy link
Member

@ericholscher ericholscher left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good changes. I look forward to testing it locally once it's hooked up in the templates & views :)


class ImportedFileManager(models.Manager):

def get_queryset(self):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think we should exclude them here. This will change the logic in other downstream code which is using ImportedFile, which we don't want to do.

version_slug=self.version.slug,
include_file=False)

file_path = find_file(basename=basename, pattern='*.fjson', path=full_path)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think we need to do all this I/O. Shouldn't it simply be:

os.path.join(full_path, self.path.replace('.html', '.fjson') or similar? It should be in the same path structure as the existing path.

file_path = find_file(basename=basename, pattern='*.fjson', path=full_path)
return file_path

@cached_property
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How long is this cached for? Will we really see value in caching it vs. the tradeoff of keeping a bunch of JSON in memory?

Copy link
Member Author

@safwanrahman safwanrahman Jun 14, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As per documentation cached result will persist as long as the instance does.
As its called multiple time for same instance while indexing, (eg path, content, header), I think the cache helps much in indexing fast.

return queryset

def update(self, thing, **kwargs):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When does this update actually get called? I believe it's a signal attached to the saving of the ImportedFile objects?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Its called everytime the object is created, updated, or deleted.
Also whenever we run the management command.
call from the registry

facets = {
'project': TermsFacet(field='project'),
'version': TermsFacet(field='version')
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Love how simple this is.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes. me too! the thing is made so simple by using OOP concepts!

@safwanrahman
Copy link
Member Author

safwanrahman commented Jun 14, 2018

@ericholscher I have fixed the integration in view and template. But currently the description with highlighting is not showing in the file search result. I will look over into it later. (Fixed)
Can you run it locally and let me know about your feedback?

@safwanrahman safwanrahman removed the PR: work in progress Pull request is not ready for full review label Jun 14, 2018
@safwanrahman safwanrahman self-assigned this Jun 14, 2018
Copy link
Member

@ericholscher ericholscher left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good! Once we get the tests working, I think we can merge this into the search_upgrade branch, and continue to work on improvements as additional PR's, so as not to complicate this one.

@safwanrahman safwanrahman changed the title Search Proof of Concept Upgrade Elasticsearch to 6.x and rewrite using elasticsearch-dsl Jun 18, 2018
@safwanrahman safwanrahman changed the title Upgrade Elasticsearch to 6.x and rewrite using elasticsearch-dsl Upgrade to Elasticsearch 6.x Jun 18, 2018
@safwanrahman
Copy link
Member Author

safwanrahman commented Jun 19, 2018

@ericholscher I have fixed the tests and now it is passing. 🎾
Could not mock the processed_json property, so added another method get_processed_json that will be called from processed_json and mock get_processed_json instead.

Without mocking, the get_processed_json would run in every test for all the files, which will make the test slower. I will add other tests for the model methods in future.
possible to merge?

@safwanrahman safwanrahman changed the title Upgrade to Elasticsearch 6.x Upgrade Elasticsearch to version 6.x Jun 19, 2018
Copy link
Member

@ericholscher ericholscher left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good. I'll get it merged once we rename the migration.

@@ -0,0 +1,54 @@
# -*- coding: utf-8 -*-
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This migration should have a name, saying what it does.

I also worry that this migration will become out of date with a long-running branch beside the master. I'm not sure the best path to take -- we will likely just need to re-create it before we merge it into master.

Copy link
Member Author

@safwanrahman safwanrahman Jun 19, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes! We can absolutely do this. I will keep this in mind.

@safwanrahman
Copy link
Member Author

@ericholscher Renamed the migration. ready to merge! :shipit:

@ericholscher
Copy link
Member

Looks good. 👍

@ericholscher ericholscher merged commit fd75aa3 into readthedocs:search_upgrade Jun 19, 2018
@safwanrahman safwanrahman deleted the search branch July 7, 2018 19:24
safwanrahman pushed a commit to safwanrahman/readthedocs.org that referenced this pull request Jul 16, 2018
safwanrahman pushed a commit to safwanrahman/readthedocs.org that referenced this pull request Jul 16, 2018
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants