Skip to content

matchall is very slow #3719

Closed
Closed
@johnmyleswhite

Description

@johnmyleswhite

I recently tried translating Norvig's spellchecker into Julia. The following example shows that Julia's string performance needs a lot of work.

To get started, download the file http://norvig.com/ipython/big.txt for tokenization.

We'll tokenize it in Julia first:

function tokenize()
    BIG = readall("big.txt");
    tokens(text::String) =
      [m.match for m in matchall(r"[a-z]+", lowercase(text))]
    @elapsed t = tokens(BIG)
end
tokenize()

This takes 10 seconds on my machine.

In contrast, the following Python code is simpler (because there's no notion that matchall won't return strings directly) and 20x faster.

import re
import time
BIG = file('big.txt').read()
def tokens(text):
    return re.findall('[a-z]+', text.lower())

s = time.time()
t = tokens(BIG)
e = time.time()
e - s

This takes 0.4 seconds on my machine.

Metadata

Metadata

Assignees

No one assigned

    Labels

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions