matchall is very slow

I recently tried translating Norvig's spellchecker into Julia. The following example shows that Julia's string performance needs a lot of work.

To get started, download http://norvig.com/ipython/big.txt, the text file we will tokenize.

We'll tokenize it in Julia first:

```
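function tokenize()
    # Read the whole corpus into memory as one string.
    BIG = readall("big.txt")
    # Lowercase the text, then collect every run of the letters a-z.
    tokens(text::String) =
      [m.match for m in matchall(r"[a-z]+", lowercase(text))]
    # Return the time in seconds spent tokenizing.
    @elapsed t = tokens(BIG)
end
tokenize()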
```

This takes 10 seconds on my machine.

In contrast, the following Python code is simpler (because `re.findall` returns the matched strings directly, so there is no `.match` field to extract from each result) and about 25x faster.

```
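import re
import time

# Read the whole corpus into memory as one string.
BIG = open('big.txt').read()

def tokens(text):
    # Lowercase the text, then collect every run of the letters a-z.
    return re.findall('[a-z]+', text.lower())

s = time.time()
t = tokens(BIG)
e = time.time()
# Elapsed wall-clock time in seconds.
print(e - s)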
```

This takes 0.4 seconds on my machine.
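For anyone who wants to check the tokenizer's behavior without downloading `big.txt`, here is a minimal sketch that runs the same regex on a small in-memory string and takes the best of a few timed runs. The `time_tokens` helper, the `repeats` parameter, and the sample sentence are assumptions made for illustration; they are not part of the original benchmark, and the timings they produce are not comparable to the numbers above.

```python
import re
import time


def tokens(text):
    """Lowercase the text and return every run of the letters a-z."""
    return re.findall(r"[a-z]+", text.lower())


def time_tokens(text, repeats=3):
    """Hypothetical helper: best wall-clock time in seconds over `repeats` runs."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        tokens(text)
        best = min(best, time.perf_counter() - start)
    return best


# Illustrative sample, not taken from big.txt.
sample = "The quick brown fox can't jump 42 times."
print(tokens(sample))
# ['the', 'quick', 'brown', 'fox', 'can', 't', 'jump', 'times']
print(time_tokens(sample * 1000))
```

Note that the pattern splits on anything outside a-z, so `can't` becomes two tokens and digits are dropped. Taking the minimum over several runs is one common way to reduce noise from other processes; the original measurement above is a single run.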



matchall is very slow #3719
