search engine: first draft of code and also (slightly desynchronized) chapter text! #33

Merged
merged 67 commits on Feb 3, 2016
Changes from 1 commit
67 commits
e6daaaf
adding my name
Feb 24, 2014
1817e9a
some initial search engine rethoughts
Feb 25, 2014
3fd499d
a little more search engine
Feb 25, 2014
5cc4dd0
more writing about search engines
Feb 25, 2014
b6a4ecf
implemented most of indexing
Feb 25, 2014
a040a2a
most of the search engine is now working
Feb 25, 2014
4e24010
made querying the search engine work
Feb 26, 2014
0d7b145
updating search-engine text
Feb 26, 2014
a621ddc
search engine quibble comment
kragen Feb 26, 2014
7a40518
documenting merge strategy and index structure
kragen Feb 26, 2014
1472b46
search engine: more discussion of merge policy
kragen Feb 26, 2014
8c85365
renamed indexdir vars to index_dir
kragen Feb 26, 2014
6d84782
made index size sim more forgiving
kragen Feb 26, 2014
3c483fe
total overhaul of search engine
kragen Feb 27, 2014
04c3a14
more updates to search-engine text
kragen Feb 27, 2014
5c2431d
added search-engine todo
kragen Feb 27, 2014
646324d
more search-engine notes
kragen Feb 27, 2014
39a162a
fixed my contributor row
kragen Feb 27, 2014
bba0290
simplified search engine
kragen Feb 27, 2014
0b97e99
fixed persistence to cope with spaces
kragen Feb 27, 2014
af29885
updated search engine todo
kragen Feb 27, 2014
d856016
failed attempt to port search-engine to Jython 2.5
kragen Feb 27, 2014
82ef17f
updated search-engine todo
kragen Feb 27, 2014
f156979
search-engine: more text
kragen Feb 27, 2014
1f39e58
search-engine: explained generating indices
kragen Feb 28, 2014
ec97985
search-engine: removed todo item
Feb 28, 2014
7364dfd
search-engine: measured performance
kragen Feb 28, 2014
1f4c5df
search-engine: some readability/correctness updates
kragen Feb 28, 2014
33d823a
search-engine: updated TODO
kragen Feb 28, 2014
c5669fd
Merge branch 'master' of github.com:kragen/500lines
kragen Feb 28, 2014
9e9ab14
search-engine text updates
Feb 28, 2014
d50802d
search-engine: exit on ^C
Feb 28, 2014
eb799c7
search-engine: updated TODO
Feb 28, 2014
ca0ea1c
search-engine: more text updates
Feb 28, 2014
443901c
search-engine: more cleanup and simplification
Feb 28, 2014
482c64d
search-engine: a little more on performance
Feb 28, 2014
8f58e51
search-engine: adding postings filters
Feb 28, 2014
82b5e60
search-engine: more text on performance
Feb 28, 2014
a7bab13
search-engine: adding crude ranking
Mar 1, 2014
f5e1aa5
search-engine: documenting very crude ranking
Mar 1, 2014
7e929d1
search-engine: added stopwords
Mar 1, 2014
10af720
search-engine: fixed quotes and apostrophes
kragen Mar 1, 2014
2f416cc
search-engine: added distinct tokenizers
kragen Mar 1, 2014
3463c6e
search-engine: recording file metadata
kragen Mar 1, 2014
101deb3
search-engine: a couple of tiny shortenings
kragen Mar 1, 2014
0b67acb
a couple more items for TODO
kragen Mar 1, 2014
0abf6ec
search-index: more updates to the text
kragen Mar 1, 2014
4c7abfb
search-engine: a couple of text fixes
kragen Mar 1, 2014
defcef7
search-engine: properly handling relative paths
kragen Mar 2, 2014
34637de
Merge branch 'master' of github.com:kragen/500lines
kragen Mar 2, 2014
fa49cd7
search-engine: incorporated most of ayust's suggestions
kragen Mar 3, 2014
458cec4
search-engine: integrated ayust's other comment
kragen Mar 4, 2014
3e2924d
search engine: adding litprog system
kragen Mar 30, 2014
4f5a6bc
search engine: explaining context of handaxeweb
kragen Mar 30, 2014
d6a936c
search engine: handaxewebifying README.md
kragen Mar 30, 2014
7c463f2
search engine: updated code inside chapter text
kragen Mar 30, 2014
6df4409
search engine: gitignoring temp files from handaxeweb
kragen Mar 30, 2014
2228111
search-engine: added Makefile
Jul 27, 2014
ecfd922
Merge github.com:aosabook/500lines
Jul 27, 2014
d43d1df
updating search-engine/TODO.md
kragen Aug 7, 2014
84f74dd
updating search-engine in response to feedback
kragen Aug 7, 2014
5da83b3
search-engine: added diagrams
kragen Aug 13, 2014
3aa120f
search-engine: refactored diagrams
kragen Aug 13, 2014
8bb480e
search-engine: refactored diagrams more
kragen Aug 13, 2014
20fa9e7
search-engine: add xmlpi on diagrams
kragen Aug 13, 2014
4f9ab85
search-engine: refactored and docstringed diagrams
kragen Aug 13, 2014
9ee37cc
search-engine: more fixes
kragen Aug 15, 2014
a little more search engine
user committed Feb 25, 2014
commit 3fd499d08838b0a86571050d5acc3ea4ebcc84c1
57 changes: 55 additions & 2 deletions search-engine/README.md
@@ -27,12 +27,19 @@ a posting-list-based search engine,
modeled after Lucene in some ways,
but highly simplified;
it can perform full-text searches
of directory trees in your filesystem,
like `grep -r` but much faster.
It's tuned to perform acceptably
even on electromechanical hard disks
coated with spinning rust.

<!-- Originally I said:
on XML dumps from StackOverflow.com
or other StackExchange sites,
thus providing instant help
for all common technical problems.

<!-- One problem with this: in the context of StackExchange dumps,
One problem with this: in the context of StackExchange dumps,
it’s difficult to motivate the need for incremental index updates, but
most of the time, incremental index updates are not optional, and they
can complicate a lot of things. So I’d like to show that they can be
@@ -53,6 +60,9 @@ through the 41706 files present, totaling 609MiB.

In that case, maybe mtime is the best thing to rank by?

The metadata indexing may turn out to be hairier than I expect... I
may still abandon this.

-->

The posting list
@@ -62,4 +72,47 @@ Since this search engine doesn't do ranking,
it basically comes down to maintaining a posting list
on disk
and querying it.
The simplest
We just need to be able to quickly find all the files
that contain a given word.
The simplest data structure
that supports this task
would be something like a sorted text file
with a word and a filename on each line;
for example:

get_ds ./sh/include/asm/segment.h
get_ds ./x86/include/asm/uaccess.h
get_eilvt ./x86/kernel/cpu/perf_event_amd_ibs.c
get_event ./x86/kernel/apm_32.c
get_event_constraints ./x86/kernel/cpu/perf_event.c
get_event_constraints ./x86/kernel/cpu/perf_event.h
get_event_constraints ./x86/kernel/cpu/perf_event_p4.c
get_exit_info ./x86/include/asm/kvm_host.h

You could binary-search this file
to find all the lines that begin with a given word;
if you have a billion lines in the file,
this might take as many as 60 probes into the file.
On an electromechanical hard disk,
this could take more than half a second,
and it will have to be repeated for each search term.
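
Here's a rough sketch of that binary search,
just to make the probe count concrete;
the function name and the details are mine, not this chapter's code,
and it assumes a well-formed postings file with no blank lines:

    def grep_sorted_file(f, word):
        """Yield lines of the sorted 'word filename' file f that start with word."""
        f.seek(0, 2)                     # learn the file size
        lo, hi = 0, f.tell()
        while lo < hi:                   # each iteration is one probe into the file
            mid = (lo + hi) // 2
            f.seek(mid)
            f.readline()                 # discard the partial line we landed inside
            line = f.readline()
            if not line or line.split(' ', 1)[0] >= word:
                hi = mid
            else:
                lo = mid + 1
        f.seek(lo)
        if lo:
            f.readline()                 # skip the line just before the matches
        for line in f:
            if not line.startswith(word + ' '):
                break
            yield line.rstrip('\n')

Each probe is a seek plus a couple of short reads,
so on a disk with seek times around ten milliseconds,
a few dozen probes per search term adds up quickly.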

A somewhat simpler,
although less optimal,
approach
is to break the file up into chunks
with a compact chunk index
which tells you what each chunk contains.
Then you can read only the chunks
that might contain the terms you need.
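
As a sketch of the chunked approach
(again, the names here are illustrative, not this project's code):
keep a small sorted list of (first word in chunk, chunk filename) pairs,
and use it to pick out the few compressed chunks
that could possibly contain a term:

    import bisect
    import gzip

    def candidate_chunks(term, chunk_index):
        """chunk_index: sorted list of (first_word_in_chunk, chunk_path) pairs."""
        keys = [first_word for first_word, _ in chunk_index]
        # The last chunk that starts before the term may contain it,
        # as may any later chunk whose first word is exactly the term.
        start = max(bisect.bisect_left(keys, term) - 1, 0)
        for first_word, path in chunk_index[start:]:
            if first_word > term:
                break
            yield path

    def postings_for(term, chunk_index):
        for path in candidate_chunks(term, chunk_index):
            with gzip.open(path, 'rt') as chunk:   # each chunk is gzipped text
                for line in chunk:
                    word, _, filename = line.partition(' ')
                    if word == term:
                        yield filename.rstrip('\n')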

Industrial-strength search engines
avoid duplicating the vocabulary list
and the list of document filenames
(or URLs, or other identifiers)
by identifying each item with an integer.
This allows for delta compression.
Instead, we simply rely on gzip,
which typically makes our index
about 15% <!-- XXX check this -->
of the size of the original text.
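
To make the delta-compression point concrete
(purely an illustration; this project stores filenames directly
and leans on gzip instead):
with integer document IDs, a sorted posting list
can be stored as the gaps between consecutive IDs,
which are small numbers that encode in very few bytes:

    def delta_encode(doc_ids):
        """Sorted doc IDs -> gaps, e.g. [100005, 100009, 100012] -> [100005, 4, 3]."""
        prev = 0
        for doc_id in doc_ids:
            yield doc_id - prev
            prev = doc_id

    def delta_decode(gaps):
        doc_id = 0
        for gap in gaps:
            doc_id += gap
            yield doc_id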
19 changes: 19 additions & 0 deletions search-engine/index.py
@@ -7,6 +7,7 @@
import sys

def postings_from_dir(dirname):
    # XXX re.compile?
    for dirpath, dirnames, filenames in os.walk(dirname):
        for filename in filenames:
            pathname = os.path.join(dirpath, filename)
@@ -28,6 +29,24 @@ def sorted_uniq_inplace(lst):
# 15% would be better. 5K per filesystem file is probably also
# suboptimal. split -l 10000 gives us instead 31 files, which gzip to
# about 50K each, totaling 1.5M (15% of original size).

# Indexing the whole arch/ subdirectory (118M) gives:
# real 9m36.842s
# user 6m27.896s
# sys 0m10.569s
# and an index file which is also 118M, consisting of nine sorted chunks.

# Splitting it into 8192-line chunks yielded 383 files, which
# compressed to 16M.

# A simple Python program is able to parse about 150 000 lines per
# second looking for a search term, which is some 5× slower than gzip
# is able to decompress; this suggests that the optimal chunk size for
# query speed is perhaps closer to 1500 lines than 8192 lines. Going
# to 4096 should get most of the benefit (27ms per chunk parsed)
# without hurting compression too much, and will work better on faster
# machines like the ones in the future. Ha ha.

def sorted_uniq_chunks(iterator, max_chunk_size=1000*1000):
    chunk = []
    for item in iterator: