search engine: first draft of code and also (slightly desynchronized) chapter text! #33

kragen · 2014-02-24T20:29:20Z

No description provided.

* Eliminating duplicate postings earlier on to speed up indexing and simplify the code a little; inadvertently fixed an undetected bug that was allowing duplicate postings through. (Probably the same posting ended up in different segments previously and then didn't get dup-squashed when merged.) Another surprising effect of this was that my example 117MB dataset no longer required any merging, allowing me to index it in 6m10s instead of the previous 20m. Now I need a bigger example dataset to test on so the merge code gets exercised. I'm fixing this by reducing maximum segment size. * Removed comments that are now covered by the text. * Switched to using a Path class for almost all pathnames in the code. It provides a little syntactic sugar, that's all. * Eliminated some unused variables (or rather made them be _) * Factored out tuple-file reading and writing functionality. * Added some directory metadata indexing code in preparation for incremental index updates. * Added line numbers to the grep output format so you can use it in Emacs grep-mode.

I added a section explaining query evaluation. It could benefit from being more concrete. The code probably needs to be changed quite a bit still; it's about 37 lines.

Thanks to Darius Bacon for several suggestions for improvement.

Previously filenames with spaces would create corrupt indices. Now we urllib.escape the filenames (and everything else) before writing them to tuple files. This guarantees that we can read and write arbitrary strings, at the cost of minimal storage overhead. I haven't measured the speed overhead but I doubt it's large.

2.5 is the version of Jython that ships with Debian. I was able to get most of the way there by removing `print_function` (I wasn't using it), `itertools.chain.from_iterable` (I didn't really need its laziness), and adding a `from __future__ import with_statement`. But Jython 2.5's `GzipFile` isn't a context manager, so you can't use the `with` statement to close it automatically. Rather than write more code to handle that, maybe I'll give up for now.

Sticking the I/O code into the sequential-access section.

kragen · 2014-02-28T19:42:56Z

BTW, I think "jgensler8" owns "user@localhost.localdomain" or something according to GitHub, thus the spurious "jgensler8" commits.

Now README.md has the extra comments needed to extract an outdated and somewhat incomplete version of index.py from it. All that is left is to update the code in the file and write a build script, and then we can eliminate the duplication.

Now the code extracted from the chapter text isn't missing any code, although it's in a different order than index.py is, and a few comments are missing. I haven't tested it to make sure that it works, but hopefully it does.

djmitche · 2014-04-13T18:53:19Z

search-engine/index.py

+    open_gzipped = lambda self, *args: gzip.GzipFile(self.name, *args)
+    basename     = lambda self: os.path.basename(self.name)
+    parent       = lambda self: Path(os.path.dirname(self.name))
+    abspath      = lambda self: os.path.abspath(self.name)


I've not seen this way of writing a class definition before. Would you consider this "acceptable python", or do you think of this as a way to implement useful functionality in a minimal number of lines?

I wrote it this way in order to better explore the parallelism within groups of definitions, to make the differences within each group (the real content of the definitions) stand out visually; this could be improved with further vertical alignment. I do consider it "acceptable Python", but it's also clearly "nontraditional Python". I think traditional Python would be slightly less readable to me (needlessly obscuring the unavoidable parallelism between the definitions), but readability is subjective, or at least context-dependent — what one group of people finds more readable, another might find less readable.

If the overall opinion of reviewers is that the readability benefit of tabular formatting here is minimal or even negative, I wouldn't be unhappy to format this in the traditional way.

It's definitely readable, but equally unusual. I'd suggest including something to this effect in the chaper.

akuchling · 2014-04-13T19:05:08Z

I read the program and attempted to understand the file data
structure without reading the chapter. (We were pairing up and my partner
was reading the prose.) It took some effort, but once I understood the data structure,
the code is really elegant; good work!

djmitche · 2014-04-13T19:07:19Z

I started with the chapter, and didn't find any concepts that weren't explained or any deep knowledge required. This is quite elegant, and I appreciate that it's built out of many, small functions.

MichaelDiBernardo · 2014-04-17T10:29:33Z

Thanks to @djmitche and @akuchling for the great commentary -- @kragen, I think once you respond to those comments, you should be in pretty good shape and we can begin copy-editing.

kragen · 2014-07-24T01:26:49Z

I'm really flattered and gratified! All this commentary gives me some good directions to go in to improve the thing.

Conflicts: README.md

This incorporates some of the reviewers' feedback. Also this is the first version of index.py that is built from README.md, and therefore arguably it shouldn't be checked in at all, but checking it in enables other people to run the search engine without rebuilding it from the document. Like a `configure` script.

Improving these areas reviewers identified: - "skip file" name is confusingly similar to skip lists - unorthodox formatting `Path` class not explained (or fully exploited) - "none" in comment sounds like "None". - not clear why "seg_merged" won't collide. - how do we know "line" ends with a newline?

MichaelDiBernardo · 2014-08-29T07:18:27Z

search-engine/README.md

+With my sample dataset of the Linux kernel,
+4096 postings is about 20K gzipped (5 bytes per posting),
+or about 150K uncompressed,
+representing about 150K of original text,


Are these repetitions of 150K a typo? Are you saying the index is 150K uncompressed, and that your original dataset is also 150K of raw content?

MichaelDiBernardo · 2014-08-29T07:34:16Z

@kragen I've given this a quick read now, and I'm going to give it another closer read because there is a lot of really awesome stuff in here.

My first impression is that your chapter could benefit from an early 'map' to readers about the order you'll be explaining your system components. Right now, it feels like the ordering of your chapter sections is a bit arbitrary, and I found myself asking at the beginning of each section: 'did this section follow the previous for a specific reason? What was the reason?'

I personally would like to see index generation handled and explained before anything, with querying to follow. I think it would also be simpler just to state that you're not handling re-indexing / incremental indexing at all in this chapter, and then never mention it again. There's enough good content in here that I don't think the chapter would suffer from its omission.

I'll have more commentary to follow after a follow-up read.

Some small changes

MichaelDiBernardo · 2016-02-03T12:14:36Z

Merging for archival posterity.

search engine: first draft of code and also (slightly desynchronized) chapter text!

user and others added 26 commits February 24, 2014 17:26

adding my name

e6daaaf

some initial search engine rethoughts

1817e9a

a little more search engine

3fd499d

more writing about search engines

5cc4dd0

implemented most of indexing

b6a4ecf

most of the search engine is now working

a040a2a

made querying the search engine work

4e24010

updating search-engine text

0d7b145

search engine quibble comment

a621ddc

documenting merge strategy and index structure

7a40518

search engine: more discussion of merge policy

1472b46

renamed indexdir vars to index_dir

8c85365

made index size sim more forgiving

6d84782

more updates to search-engine text

04c3a14

I added a section explaining query evaluation. It could benefit from being more concrete. The code probably needs to be changed quite a bit still; it's about 37 lines.

added search-engine todo

5c2431d

more search-engine notes

646324d

fixed my contributor row

39a162a

simplified search engine

bba0290

Thanks to Darius Bacon for several suggestions for improvement.

updated search engine todo

af29885

updated search-engine todo

82ef17f

search-engine: more text

f156979

Sticking the I/O code into the sequential-access section.

search-engine: explained generating indices

1f39e58

search-engine: removed todo item

ec97985

kragen added 3 commits February 28, 2014 17:27

search-engine: measured performance

7364dfd

search-engine: some readability/correctness updates

1f4c5df

search-engine: updated TODO

33d823a

kragen added 4 commits March 30, 2014 03:06

search engine: explaining context of handaxeweb

4f5a6bc

search engine: handaxewebifying README.md

d6a936c

Now README.md has the extra comments needed to extract an outdated and somewhat incomplete version of index.py from it. All that is left is to update the code in the file and write a build script, and then we can eliminate the duplication.

search engine: updated code inside chapter text

7c463f2

Now the code extracted from the chapter text isn't missing any code, although it's in a different order than index.py is, and a few comments are missing. I haven't tested it to make sure that it works, but hopefully it does.

search engine: gitignoring temp files from handaxeweb

6df4409

djmitche reviewed Apr 13, 2014
View reviewed changes

default and others added 10 commits July 26, 2014 23:22

search-engine: added Makefile

2228111

Merge github.com:aosabook/500lines

ecfd922

Conflicts: README.md

updating search-engine/TODO.md

d43d1df

search-engine: added diagrams

5da83b3

search-engine: refactored diagrams

3aa120f

search-engine: refactored diagrams more

8bb480e

search-engine: add xmlpi on diagrams

20fa9e7

search-engine: refactored and docstringed diagrams

4f9ab85

MichaelDiBernardo reviewed Aug 29, 2014
View reviewed changes

MichaelDiBernardo pushed a commit that referenced this pull request Apr 28, 2015

Merge pull request #33 from spderosso/model

813ca4d

Some small changes

MichaelDiBernardo added a commit that referenced this pull request Feb 3, 2016

Merge pull request #33 from kragen/master

902a270

search engine: first draft of code and also (slightly desynchronized) chapter text!

MichaelDiBernardo merged commit 902a270 into aosabook:master Feb 3, 2016

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

search engine: first draft of code and also (slightly desynchronized) chapter text! #33

search engine: first draft of code and also (slightly desynchronized) chapter text! #33

kragen commented Feb 24, 2014

kragen commented Feb 28, 2014

djmitche Apr 13, 2014

kragen Jul 24, 2014

djmitche Jul 24, 2014

akuchling commented Apr 13, 2014

djmitche commented Apr 13, 2014

MichaelDiBernardo commented Apr 17, 2014

kragen commented Jul 24, 2014

MichaelDiBernardo Aug 29, 2014

MichaelDiBernardo commented Aug 29, 2014

MichaelDiBernardo commented Feb 3, 2016

search engine: first draft of code and also (slightly desynchronized) chapter text! #33

search engine: first draft of code and also (slightly desynchronized) chapter text! #33

Conversation

kragen commented Feb 24, 2014

kragen commented Feb 28, 2014

djmitche Apr 13, 2014

Choose a reason for hiding this comment

kragen Jul 24, 2014

Choose a reason for hiding this comment

djmitche Jul 24, 2014

Choose a reason for hiding this comment

akuchling commented Apr 13, 2014

djmitche commented Apr 13, 2014

MichaelDiBernardo commented Apr 17, 2014

kragen commented Jul 24, 2014

MichaelDiBernardo Aug 29, 2014

Choose a reason for hiding this comment

MichaelDiBernardo commented Aug 29, 2014

MichaelDiBernardo commented Feb 3, 2016