
search engine: first draft of code and also (slightly desynchronized) chapter text! #33

Merged 67 commits into aosabook:master on Feb 3, 2016

Conversation

@kragen (Contributor) commented Feb 24, 2014

No description provided.

user and others added 26 commits February 24, 2014 17:26
* Eliminating duplicate postings earlier on to speed up indexing and
  simplify the code a little; inadvertently fixed an undetected bug
  that was allowing duplicate postings through.  (Probably the same
  posting ended up in different segments previously and then didn't
  get dup-squashed when merged.)  Another surprising effect of this
  was that my example 117MB dataset no longer required any merging,
  allowing me to index it in 6m10s instead of the previous 20m.  Now I
  need a bigger example dataset to test on so the merge code gets
  exercised.  I'm fixing this by reducing maximum segment size.
* Removed comments that are now covered by the text.
* Switched to using a Path class for almost all pathnames in the code.
  It provides a little syntactic sugar, that's all.
* Eliminated some unused variables (or rather, renamed them to _)
* Factored out tuple-file reading and writing functionality.
* Added some directory metadata indexing code in preparation for
  incremental index updates.
* Added line numbers to the grep output format so you can use it in
  Emacs grep-mode.
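The dedup-before-write idea in the first bullet can be sketched as follows (a minimal illustration, not the chapter's actual index.py; the function name is invented here). Because postings are written in sorted order, duplicates are adjacent and can be squashed in a single pass:

```python
def unique_postings(postings):
    """Yield postings with adjacent duplicates removed.

    `postings` is an iterable of (term, doc_id) tuples, assumed sorted,
    so any duplicates arrive next to each other.
    """
    last = object()  # sentinel that never equals a real posting
    for posting in postings:
        if posting != last:
            yield posting
            last = posting

# list(unique_postings([('a', 1), ('a', 1), ('b', 2)]))
# -> [('a', 1), ('b', 2)]
```

Squashing duplicates before they reach a segment means a later merge never has to re-discover them, which is consistent with the speedup described above.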
I added a section explaining query evaluation.  It could benefit from
being more concrete.  The code probably needs to be changed quite a
bit still; it's about 37 lines.
Thanks to Darius Bacon for several suggestions for improvement.
Previously filenames with spaces would create corrupt indices.  Now we
urllib.escape the filenames (and everything else) before writing them
to tuple files.  This guarantees that we can read and write arbitrary
strings, at the cost of minimal storage overhead.  I haven't measured
the speed overhead but I doubt it's large.
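The escaping scheme described above can be illustrated with the standard library (Python 3 spelling shown; the chapter's Python 2 code would use `urllib.quote`/`urllib.unquote`, and the helper names here are illustrative, not the chapter's):

```python
from urllib.parse import quote, unquote

def write_tuple(fields):
    # %-escape every field so spaces and newlines can't corrupt the file
    return ' '.join(quote(f, safe='') for f in fields)

def read_tuple(line):
    return tuple(unquote(f) for f in line.split())

line = write_tuple(['my file.txt', '42'])   # -> 'my%20file.txt 42'
assert read_tuple(line) == ('my file.txt', '42')
```

Since the escaped fields can never contain the space separator, arbitrary strings round-trip safely, at a small size cost for fields containing reserved characters.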
2.5 is the version of Jython that ships with Debian.  I was able to
get most of the way there by removing `print_function` (I wasn't using
it), `itertools.chain.from_iterable` (I didn't really need its
laziness), and adding a `from __future__ import with_statement`.  But
Jython 2.5's `GzipFile` isn't a context manager, so you can't use the
`with` statement to close it automatically.  Rather than write more
code to handle that, maybe I'll give up for now.
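One standard-library workaround worth noting (not necessarily the route taken here): `contextlib.closing`, available since Python 2.5, wraps any object with a `close()` method so it can be used in a `with` statement even when, as with Jython 2.5's `GzipFile`, the object is not itself a context manager:

```python
import contextlib
import gzip
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), 'example.gz')

# closing() supplies the __enter__/__exit__ that GzipFile lacks on Jython 2.5
with contextlib.closing(gzip.GzipFile(path, 'wb')) as f:
    f.write(b'hello')

with contextlib.closing(gzip.GzipFile(path, 'rb')) as f:
    data = f.read()

assert data == b'hello'
```

On Jython 2.5 this would additionally need `from __future__ import with_statement` at the top of the module.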
Sticking the I/O code into the sequential-access section.
@kragen (Contributor, Author) commented Feb 28, 2014

BTW, I think "jgensler8" owns "user@localhost.localdomain" or something according to GitHub, thus the spurious "jgensler8" commits.

kragen added 4 commits March 30, 2014 03:06
Now README.md has the extra comments needed to extract an outdated and
somewhat incomplete version of index.py from it.  All that is left is
to update the code in the file and write a build script, and then we
can eliminate the duplication.
Now the code extracted from the chapter text isn't missing any code,
although it's in a different order than index.py is, and a few
comments are missing.  I haven't tested it to make sure that it works,
but hopefully it does.
import gzip, os

class Path(object):  # wraps a pathname string stored in self.name
    def __init__(self, name): self.name = name
    open_gzipped = lambda self, *args: gzip.GzipFile(self.name, *args)
    basename = lambda self: os.path.basename(self.name)
    parent = lambda self: Path(os.path.dirname(self.name))
    abspath = lambda self: os.path.abspath(self.name)
A contributor commented:
I've not seen this way of writing a class definition before. Would you consider this "acceptable python", or do you think of this as a way to implement useful functionality in a minimal number of lines?

kragen (Contributor, Author) replied:

I wrote it this way in order to better explore the parallelism within groups of definitions, to make the differences within each group (the real content of the definitions) stand out visually; this could be improved with further vertical alignment. I do consider it "acceptable Python", but it's also clearly "nontraditional Python". I think traditional Python would be slightly less readable to me (needlessly obscuring the unavoidable parallelism between the definitions), but readability is subjective, or at least context-dependent — what one group of people finds more readable, another might find less readable.

If the overall opinion of reviewers is that the readability benefit of tabular formatting here is minimal or even negative, I wouldn't be unhappy to format this in the traditional way.

The contributor replied:

It's definitely readable, but equally unusual. I'd suggest including something to this effect in the chapter.

@akuchling commented:
I read the program and attempted to understand the file data
structure without reading the chapter. (We were pairing up, and my partner
was reading the prose.) It took some effort, but once I understood the
data structure, I found the code really elegant; good work!

@djmitche (Contributor) commented:
I started with the chapter, and didn't find any concepts that weren't explained or any deep knowledge required. This is quite elegant, and I appreciate that it's built out of many small functions.

@MichaelDiBernardo (Collaborator) commented:
Thanks to @djmitche and @akuchling for the great commentary -- @kragen, I think once you respond to those comments, you should be in pretty good shape and we can begin copy-editing.

@kragen (Contributor, Author) commented Jul 24, 2014

I'm really flattered and gratified! All this commentary gives me some good directions to go in to improve the thing.

default and others added 10 commits July 26, 2014 23:22
Conflicts:
	README.md
This incorporates some of the reviewers' feedback.  Also this is the
first version of index.py that is built from README.md, and therefore
arguably it shouldn't be checked in at all, but checking it in enables
other people to run the search engine without rebuilding it from the
document.  Like a `configure` script.
Improving these areas reviewers identified:

- "skip file" name is confusingly similar to skip lists
- unorthodoxly formatted `Path` class not explained (or fully exploited)
- "none" in comment sounds like "None".
- not clear why "seg_merged" won't collide.
- how do we know "line" ends with a newline?
With my sample dataset of the Linux kernel,
4096 postings is about 20K gzipped (5 bytes per posting),
or about 150K uncompressed,
representing about 150K of original text.
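A quick check of the bytes-per-posting figure above (taking 20K as 20 × 1024 bytes, an assumption about the units intended):

```python
postings = 4096
gzipped_bytes = 20 * 1024      # ~20K gzipped segment
print(gzipped_bytes / postings)  # 5.0 bytes per posting, matching the estimate
```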
A collaborator commented:
Are these repetitions of 150K a typo? Are you saying the index is 150K uncompressed, and that your original dataset is also 150K of raw content?

@MichaelDiBernardo (Collaborator) commented:
@kragen I've given this a quick read now, and I'm going to give it another closer read because there is a lot of really awesome stuff in here.

My first impression is that your chapter could benefit from an early 'map' to readers about the order you'll be explaining your system components. Right now, it feels like the ordering of your chapter sections is a bit arbitrary, and I found myself asking at the beginning of each section: 'did this section follow the previous for a specific reason? What was the reason?'

I personally would like to see index generation handled and explained before anything, with querying to follow. I think it would also be simpler just to state that you're not handling re-indexing / incremental indexing at all in this chapter, and then never mention it again. There's enough good content in here that I don't think the chapter would suffer from its omission.

I'll have more commentary to follow after a follow-up read.

MichaelDiBernardo pushed a commit that referenced this pull request Apr 28, 2015
@MichaelDiBernardo (Collaborator) commented:
Merging for archival posterity.

MichaelDiBernardo added a commit that referenced this pull request Feb 3, 2016
search engine: first draft of code and also (slightly desynchronized) chapter text!
@MichaelDiBernardo MichaelDiBernardo merged commit 902a270 into aosabook:master Feb 3, 2016
5 participants