I recently tried translating Norvig's spellchecker into Julia. The following example shows that Julia's string performance needs a lot of work.
To get started, download the file http://norvig.com/ipython/big.txt for tokenization.
We'll tokenize it in Julia first:
function tokenize()
    BIG = readall("big.txt")
    tokens(text::String) =
        [m.match for m in matchall(r"[a-z]+", lowercase(text))]
    @elapsed t = tokens(BIG)
end
tokenize()
This takes 10 seconds on my machine.
In contrast, the following Python code is simpler (re.findall returns the matched strings directly, so there is no need to pull a .match field out of each result) and more than 20x faster.
import re
import time
BIG = file('big.txt').read()
def tokens(text):
    return re.findall('[a-z]+', text.lower())
s = time.time()
t = tokens(BIG)
e = time.time()
e - s
This takes 0.4 seconds on my machine.
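For anyone reproducing the Python number on Python 3, here is a minimal sketch of the same measurement that repeats the run and keeps the fastest time, which should make the figure less sensitive to noise. The names `TOKEN_RE`, `best_time`, and `sample` are my own for illustration and are not part of the code above; the short sample string only stands in for big.txt so that the snippet runs on its own.

```python
import re
import time

# Same pattern as above, compiled once up front.
TOKEN_RE = re.compile(r"[a-z]+")


def tokens(text):
    """Return every run of lowercase ASCII letters in the lowercased text."""
    return TOKEN_RE.findall(text.lower())


def best_time(func, arg, repeats=5):
    """Call func(arg) `repeats` times; return (last result, fastest wall-clock seconds)."""
    best = float("inf")
    result = None
    for _ in range(repeats):
        start = time.perf_counter()
        result = func(arg)
        best = min(best, time.perf_counter() - start)
    return result, best


# Hypothetical stand-in for the contents of big.txt.
sample = "The quick brown Fox; the LAZY dog's bone."
words, seconds = best_time(tokens, sample)
print(words)
print("best of 5: %.6f s" % seconds)
```

To time the real file, read it with `open('big.txt').read()` and pass that string in place of `sample`.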