-
Notifications
You must be signed in to change notification settings - Fork 5.9k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
search engine: first draft of code and also (slightly desynchronized) chapter text! #33
Conversation
* Eliminating duplicate postings earlier on to speed up indexing and simplify the code a little; inadvertently fixed an undetected bug that was allowing duplicate postings through. (Probably the same posting ended up in different segments previously and then didn't get dup-squashed when merged.) Another surprising effect of this was that my example 117MB dataset no longer required any merging, allowing me to index it in 6m10s instead of the previous 20m. Now I need a bigger example dataset to test on so the merge code gets exercised. I'm fixing this by reducing maximum segment size. * Removed comments that are now covered by the text. * Switched to using a Path class for almost all pathnames in the code. It provides a little syntactic sugar, that's all. * Eliminated some unused variables (or rather made them be _) * Factored out tuple-file reading and writing functionality. * Added some directory metadata indexing code in preparation for incremental index updates. * Added line numbers to the grep output format so you can use it in Emacs grep-mode.
I added a section explaining query evaluation. It could benefit from being more concrete. The code probably needs to be changed quite a bit still; it's about 37 lines.
Thanks to Darius Bacon for several suggestions for improvement.
Previously filenames with spaces would create corrupt indices. Now we urllib.escape the filenames (and everything else) before writing them to tuple files. This guarantees that we can read and write arbitrary strings, at the cost of minimal storage overhead. I haven't measured the speed overhead but I doubt it's large.
2.5 is the version of Jython that ships with Debian. I was able to get most of the way there by removing `print_function` (I wasn't using it), `itertools.chain.from_iterable` (I didn't really need its laziness), and adding a `from __future__ import with_statement`. But Jython 2.5's `GzipFile` isn't a context manager, so you can't use the `with` statement to close it automatically. Rather than write more code to handle that, maybe I'll give up for now.
Sticking the I/O code into the sequential-access section.
BTW, I think "jgensler8" owns "user@localhost.localdomain" or something according to GitHub, thus the spurious "jgensler8" commits. |
Now README.md has the extra comments needed to extract an outdated and somewhat incomplete version of index.py from it. All that is left is to update the code in the file and write a build script, and then we can eliminate the duplication.
Now the code extracted from the chapter text isn't missing any code, although it's in a different order than index.py is, and a few comments are missing. I haven't tested it to make sure that it works, but hopefully it does.
open_gzipped = lambda self, *args: gzip.GzipFile(self.name, *args) | ||
basename = lambda self: os.path.basename(self.name) | ||
parent = lambda self: Path(os.path.dirname(self.name)) | ||
abspath = lambda self: os.path.abspath(self.name) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've not seen this way of writing a class definition before. Would you consider this "acceptable python", or do you think of this as a way to implement useful functionality in a minimal number of lines?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I wrote it this way in order to better explore the parallelism within groups of definitions, to make the differences within each group (the real content of the definitions) stand out visually; this could be improved with further vertical alignment. I do consider it "acceptable Python", but it's also clearly "nontraditional Python". I think traditional Python would be slightly less readable to me (needlessly obscuring the unavoidable parallelism between the definitions), but readability is subjective, or at least context-dependent — what one group of people finds more readable, another might find less readable.
If the overall opinion of reviewers is that the readability benefit of tabular formatting here is minimal or even negative, I wouldn't be unhappy to format this in the traditional way.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's definitely readable, but equally unusual. I'd suggest including something to this effect in the chaper.
I read the program and attempted to understand the file data |
I started with the chapter, and didn't find any concepts that weren't explained or any deep knowledge required. This is quite elegant, and I appreciate that it's built out of many, small functions. |
Thanks to @djmitche and @akuchling for the great commentary -- @kragen, I think once you respond to those comments, you should be in pretty good shape and we can begin copy-editing. |
I'm really flattered and gratified! All this commentary gives me some good directions to go in to improve the thing. |
Conflicts: README.md
This incorporates some of the reviewers' feedback. Also this is the first version of index.py that is built from README.md, and therefore arguably it shouldn't be checked in at all, but checking it in enables other people to run the search engine without rebuilding it from the document. Like a `configure` script.
Improving these areas reviewers identified: - "skip file" name is confusingly similar to skip lists - unorthodox formatting `Path` class not explained (or fully exploited) - "none" in comment sounds like "None". - not clear why "seg_merged" won't collide. - how do we know "line" ends with a newline?
With my sample dataset of the Linux kernel, | ||
4096 postings is about 20K gzipped (5 bytes per posting), | ||
or about 150K uncompressed, | ||
representing about 150K of original text, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Are these repetitions of 150K a typo? Are you saying the index is 150K uncompressed, and that your original dataset is also 150K of raw content?
@kragen I've given this a quick read now, and I'm going to give it another closer read because there is a lot of really awesome stuff in here. My first impression is that your chapter could benefit from an early 'map' to readers about the order you'll be explaining your system components. Right now, it feels like the ordering of your chapter sections is a bit arbitrary, and I found myself asking at the beginning of each section: 'did this section follow the previous for a specific reason? What was the reason?' I personally would like to see index generation handled and explained before anything, with querying to follow. I think it would also be simpler just to state that you're not handling re-indexing / incremental indexing at all in this chapter, and then never mention it again. There's enough good content in here that I don't think the chapter would suffer from its omission. I'll have more commentary to follow after a follow-up read. |
Merging for archival posterity. |
search engine: first draft of code and also (slightly desynchronized) chapter text!
No description provided.