search engine: first draft of code and also (slightly desynchronized) chapter text! #33

Merged
merged 67 commits on Feb 3, 2016
Changes from 1 commit
67 commits
e6daaaf
adding my name
Feb 24, 2014
1817e9a
some initial search engine rethoughts
Feb 25, 2014
3fd499d
a little more search engine
Feb 25, 2014
5cc4dd0
more writing about search engines
Feb 25, 2014
b6a4ecf
implemented most of indexing
Feb 25, 2014
a040a2a
most of the search engine is now working
Feb 25, 2014
4e24010
made querying the search engine work
Feb 26, 2014
0d7b145
updating search-engine text
Feb 26, 2014
a621ddc
search engine quibble comment
kragen Feb 26, 2014
7a40518
documenting merge strategy and index structure
kragen Feb 26, 2014
1472b46
search engine: more discussion of merge policy
kragen Feb 26, 2014
8c85365
renamed indexdir vars to index_dir
kragen Feb 26, 2014
6d84782
made index size sim more forgiving
kragen Feb 26, 2014
3c483fe
total overhaul of search engine
kragen Feb 27, 2014
04c3a14
more updates to search-engine text
kragen Feb 27, 2014
5c2431d
added search-engine todo
kragen Feb 27, 2014
646324d
more search-engine notes
kragen Feb 27, 2014
39a162a
fixed my contributor row
kragen Feb 27, 2014
bba0290
simplified search engine
kragen Feb 27, 2014
0b97e99
fixed persistence to cope with spaces
kragen Feb 27, 2014
af29885
updated search engine todo
kragen Feb 27, 2014
d856016
failed attempt to port search-engine to Jython 2.5
kragen Feb 27, 2014
82ef17f
updated search-engine todo
kragen Feb 27, 2014
f156979
search-engine: more text
kragen Feb 27, 2014
1f39e58
search-engine: explained generating indices
kragen Feb 28, 2014
ec97985
search-engine: removed todo item
Feb 28, 2014
7364dfd
search-engine: measured performance
kragen Feb 28, 2014
1f4c5df
search-engine: some readability/correctness updates
kragen Feb 28, 2014
33d823a
search-engine: updated TODO
kragen Feb 28, 2014
c5669fd
Merge branch 'master' of github.com:kragen/500lines
kragen Feb 28, 2014
9e9ab14
search-engine text updates
Feb 28, 2014
d50802d
search-engine: exit on ^C
Feb 28, 2014
eb799c7
search-engine: updated TODO
Feb 28, 2014
ca0ea1c
search-engine: more text updates
Feb 28, 2014
443901c
search-engine: more cleanup and simplification
Feb 28, 2014
482c64d
search-engine: a little more on performance
Feb 28, 2014
8f58e51
search-engine: adding postings filters
Feb 28, 2014
82b5e60
search-engine: more text on performance
Feb 28, 2014
a7bab13
search-engine: adding crude ranking
Mar 1, 2014
f5e1aa5
search-engine: documenting very crude ranking
Mar 1, 2014
7e929d1
search-engine: added stopwords
Mar 1, 2014
10af720
search-engine: fixed quotes and apostrophes
kragen Mar 1, 2014
2f416cc
search-engine: added distinct tokenizers
kragen Mar 1, 2014
3463c6e
search-engine: recording file metadata
kragen Mar 1, 2014
101deb3
search-engine: a couple of tiny shortenings
kragen Mar 1, 2014
0b67acb
a couple more items for TODO
kragen Mar 1, 2014
0abf6ec
search-index: more updates to the text
kragen Mar 1, 2014
4c7abfb
search-engine: a couple of text fixes
kragen Mar 1, 2014
defcef7
search-engine: properly handling relative paths
kragen Mar 2, 2014
34637de
Merge branch 'master' of github.com:kragen/500lines
kragen Mar 2, 2014
fa49cd7
search-engine: incorporated most of ayust's suggestions
kragen Mar 3, 2014
458cec4
search-engine: integrated ayust's other comment
kragen Mar 4, 2014
3e2924d
search engine: adding litprog system
kragen Mar 30, 2014
4f5a6bc
search engine: explaining context of handaxeweb
kragen Mar 30, 2014
d6a936c
search engine: handaxewebifying README.md
kragen Mar 30, 2014
7c463f2
search engine: updated code inside chapter text
kragen Mar 30, 2014
6df4409
search engine: gitignoring temp files from handaxeweb
kragen Mar 30, 2014
2228111
search-engine: added Makefile
Jul 27, 2014
ecfd922
Merge github.com:aosabook/500lines
Jul 27, 2014
d43d1df
updating search-engine/TODO.md
kragen Aug 7, 2014
84f74dd
updating search-engine in response to feedback
kragen Aug 7, 2014
5da83b3
search-engine: added diagrams
kragen Aug 13, 2014
3aa120f
search-engine: refactored diagrams
kragen Aug 13, 2014
8bb480e
search-engine: refactored diagrams more
kragen Aug 13, 2014
20fa9e7
search-engine: add xmlpi on diagrams
kragen Aug 13, 2014
4f9ab85
search-engine: refactored and docstringed diagrams
kragen Aug 13, 2014
9ee37cc
search-engine: more fixes
kragen Aug 15, 2014
a little more search engine
user committed Feb 25, 2014
commit 3fd499d08838b0a86571050d5acc3ea4ebcc84c1
57 changes: 55 additions & 2 deletions search-engine/README.md
@@ -27,12 +27,19 @@ a posting-list-based search engine,
modeled after Lucene in some ways,
but highly simplified;
it can perform full-text searches
of directory trees in your filesystem,
like `grep -r` but much faster.
It's tuned to perform acceptably
even on electromechanical hard disks
coated with spinning rust.

<!-- Originally I said:
on XML dumps from StackOverflow.com
or other StackExchange sites,
thus providing instant help
for all common technical problems.

<!-- One problem with this: in the context of StackExchange dumps,
One problem with this: in the context of StackExchange dumps,
it’s difficult to motivate the need for incremental index updates, but
most of the time, incremental index updates are not optional, and they
can complicate a lot of things. So I’d like to show that they can be
@@ -53,6 +60,9 @@ through the 41706 files present, totaling 609MiB.

In that case, maybe mtime is the best thing to rank by?

The metadata indexing may turn out to be hairier than I expect... I
may still abandon this.

-->

The posting list
@@ -62,4 +72,47 @@ Since this search engine doesn't do ranking,
it basically comes down to maintaining a posting list
on disk
and querying it.
The simplest
We just need to be able to quickly find all the files
that contain a given word.
The simplest data structure
that supports this task
would be something like a sorted text file
with a word and a filename on each line;
for example:

get_ds ./sh/include/asm/segment.h
get_ds ./x86/include/asm/uaccess.h
get_eilvt ./x86/kernel/cpu/perf_event_amd_ibs.c
get_event ./x86/kernel/apm_32.c
get_event_constraints ./x86/kernel/cpu/perf_event.c
get_event_constraints ./x86/kernel/cpu/perf_event.h
get_event_constraints ./x86/kernel/cpu/perf_event_p4.c
get_exit_info ./x86/include/asm/kvm_host.h

You could binary-search this file
to find all the lines that begin with a given word;
if you have a billion lines in the file,
this might take as many as 60 probes into the file.
On an electromechanical hard disk,
this could take more than half a second,
and it will have to be repeated for each search term.
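
Here's a rough sketch of that binary search,
just to make the probe count concrete;
the function name and the details are mine, not this chapter's code,
and it assumes a well-formed postings file with no blank lines:

    def grep_sorted_file(f, word):
        """Yield lines of the sorted 'word filename' file f that start with word."""
        f.seek(0, 2)                     # learn the file size
        lo, hi = 0, f.tell()
        while lo < hi:                   # each iteration is one probe into the file
            mid = (lo + hi) // 2
            f.seek(mid)
            f.readline()                 # discard the partial line we landed inside
            line = f.readline()
            if not line or line.split(' ', 1)[0] >= word:
                hi = mid
            else:
                lo = mid + 1
        f.seek(lo)
        if lo:
            f.readline()                 # skip the line just before the matches
        for line in f:
            if not line.startswith(word + ' '):
                break
            yield line.rstrip('\n')

Each probe is a seek plus a couple of short reads,
so on a disk with seek times around ten milliseconds,
a few dozen probes per search term adds up quickly.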

A somewhat simpler,
although less optimal,
approach
is to break the file up into chunks
with a compact chunk index
which tells you what each chunk contains.
Then you can read only the chunks
that might contain the terms you need.
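
As a sketch of the chunked approach
(again, the names here are illustrative, not this project's code):
keep a small sorted list of (first word in chunk, chunk filename) pairs,
and use it to pick out the few compressed chunks
that could possibly contain a term:

    import bisect
    import gzip

    def candidate_chunks(term, chunk_index):
        """chunk_index: sorted list of (first_word_in_chunk, chunk_path) pairs."""
        keys = [first_word for first_word, _ in chunk_index]
        # The last chunk that starts before the term may contain it,
        # as may any later chunk whose first word is exactly the term.
        start = max(bisect.bisect_left(keys, term) - 1, 0)
        for first_word, path in chunk_index[start:]:
            if first_word > term:
                break
            yield path

    def postings_for(term, chunk_index):
        for path in candidate_chunks(term, chunk_index):
            with gzip.open(path, 'rt') as chunk:   # each chunk is gzipped text
                for line in chunk:
                    word, _, filename = line.partition(' ')
                    if word == term:
                        yield filename.rstrip('\n')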

Industrial-strength search engines
avoid duplicating the vocabulary list
and the list of document filenames
(or URLs, or other identifiers)
by identifying each item with an integer.
This allows for delta compression.
Instead, we simply rely on gzip,
which typically makes our index
about 15% <!-- XXX check this -->
of the size of the original text.
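
To make the delta-compression point concrete
(purely an illustration; this project stores filenames directly
and leans on gzip instead):
with integer document IDs, a sorted posting list
can be stored as the gaps between consecutive IDs,
which are small numbers that encode in very few bytes:

    def delta_encode(doc_ids):
        """Sorted doc IDs -> gaps, e.g. [100005, 100009, 100012] -> [100005, 4, 3]."""
        prev = 0
        for doc_id in doc_ids:
            yield doc_id - prev
            prev = doc_id

    def delta_decode(gaps):
        doc_id = 0
        for gap in gaps:
            doc_id += gap
            yield doc_id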
19 changes: 19 additions & 0 deletions search-engine/index.py
@@ -7,6 +7,7 @@
import sys

def postings_from_dir(dirname):
    # XXX re.compile?
    for dirpath, dirnames, filenames in os.walk(dirname):
        for filename in filenames:
            pathname = os.path.join(dirpath, filename)
@@ -28,6 +29,24 @@ def sorted_uniq_inplace(lst):
# 15% would be better. 5K per filesystem file is probably also
# suboptimal. split -l 10000 gives us instead 31 files, which gzip to
# about 50K each, totaling 1.5M (15% of original size).

# Indexing the whole arch/ subdirectory (118M) gives:
# real 9m36.842s
# user 6m27.896s
# sys 0m10.569s
# and an index file which is also 118M, consisting of nine sorted chunks.

# Splitting it into 8192-line chunks yielded 383 files, which
# compressed to 16M.

# A simple Python program is able to parse about 150 000 lines per
# second looking for a search term, which is some 5× slower than gzip
# is able to decompress; this suggests that the optimal chunk size for
# query speed is perhaps closer to 1500 lines than 8192 lines. Going
# to 4096 should get most of the benefit (27ms per chunk parsed)
# without hurting compression too much, and will work better on faster
# machines like the ones in the future. Ha ha.

def sorted_uniq_chunks(iterator, max_chunk_size=1000*1000):
    chunk = []
    for item in iterator: