word-count: 'javascript!!&@$%^&' should be not accepted as a word #177

ghajba · 2015-05-18T17:59:51Z

I did this exercise today while commuting to work and I think this solution was a bad-bad thing so I altered the test case and the example to handle this issue. If you do not think alike I understand.
Cheers, Gabor

…t the best the test case should not preserve puntuation because 'javascript!!&@$%^&' is not a word.

I missed previously to alter the name of the test to represent what it does

kytrinyx · 2015-05-19T02:30:11Z

/cc @exercism/track-maintainers What do you think? Should word count take a naive approach and just break on whitespace? Or should it discard interesting characters?

etrepum · 2015-05-19T04:19:14Z

I like having it require the split condition be anything that's not alphanumeric, it requires a little more digging than a whitespace split which is usually some sort of built-in method or function in most languages.

In Haskell we use the unicode definition of alphanumeric. This Python implementation looks like it's just ASCII. In Python that would be tricky (but possible) to do with a regex because the \w class includes underscore but the spec for this exercise does not.

import re

def word_count(text):
    d = {}
    # use a negative lookahead assertion to remove underscore from \w
    for word in re.findall(r'(?:(?!_)\w)+(?u)', text):
        d[word] = d.get(word, 0) + 1
    return d

Simpler solution would do two passes to handle the underscore case:

import re

def word_count(text):
    d = {}
    for word in re.findall(r'\w+', text.replace('_', ' ')):
        d[word] = d.get(word, 0) + 1
    return d

re.split is more awkward to use because then you have to deal with empty strings.

ghajba · 2015-05-19T05:45:43Z

If I may leave another comment: in the Java version of the same exercise special characters (like colon or this awkward javascript-extension) aren't permitted in the solution.

And I have to agree with @etrepum the current solution does not handle non-ascii characters. However the solution does not necessary have to be done with regular expressions, does it?

etrepum · 2015-05-19T06:14:33Z

Regex is certainly not a necessity, I was just modifying the existing proposed solution to make it work with the full spec that the Haskell version implements. Here's one way to do it in a fairly low-level fashion:

def word_count(text):
    i = 0
    end = len(text)
    d = {}
    while i < end:
        while i < end and not text[i].isalnum():
            i = i + 1
        j = i
        while j < end and text[j].isalnum():
            j = j + 1
        if j > i:
            word = text[i:j]
            d[word] = d.get(word, 0) + 1
        i = j
    return d

Dog · 2015-06-05T23:16:04Z

I'm in favor of doing it the Haskell way, and performing a split if its not alphanumeric.

@ghajba Would you like to try to implement test cases for unicode? Otherwise, I'll merge and update the exercise.

behrtam · 2015-11-20T11:53:19Z

The proposed implementation does not normalize (Expect case normalized words #207) and does not support unicode.
I would suggest that we create a different pr for the unicode test, because it might need some more work to support Python 2 & 3.
As we switch to the haskell way of splitting we should also add their test case for symbols.

testCase "symbols are separators" $
    fromList [("hey", 1), ("my", 1), ("spacebar", 1),
              ("is", 1), ("broken", 1)] @=?
    wordCount "hey,my_spacebar_is_broken."

kytrinyx · 2015-11-20T17:20:02Z

+1 for adding the haskell test case.

Let's do as you suggest and put the unicode test in a different PR.

behrtam · 2015-11-20T18:15:16Z

As @ghajba seems to have no time or interest in finishing this here and the example code doesn't work at all for the new way of splitting. Should we close this and reopen a new pr or merge this to a branch and refactor/fix it there?

kytrinyx · 2015-11-20T18:44:54Z

Yeah, I think that's fine. Thanks, @behrtam.

Change the separator as discussed in exercism#177 to non alphamumeric as it is also done in the haskell track.

Change the separator as discussed in exercism#177 to non alphanumeric as it is also done in the haskell track.

See exercism#177

ghajba added 2 commits May 18, 2015 19:46

word-count: the naive approach can be good but in this case it was no…

0e3375a

…t the best the test case should not preserve puntuation because 'javascript!!&@$%^&' is not a word.

word-count: correct test name

f37bfee

I missed previously to alter the name of the test to represent what it does

kytrinyx closed this Nov 20, 2015

behrtam added a commit to behrtam/python that referenced this pull request Nov 20, 2015

word-count: Change separator from whitespace to non alphanumeric

d7b4367

Change the separator as discussed in exercism#177 to non alphamumeric as it is also done in the haskell track.

behrtam added a commit to behrtam/python that referenced this pull request Nov 20, 2015

word-count: Change separator from whitespace to non alphanumeric

4771b97

Change the separator as discussed in exercism#177 to non alphanumeric as it is also done in the haskell track.

behrtam mentioned this pull request Nov 20, 2015

word-count: Change separator from whitespace to non alphanumeric #251

Merged

behrtam added a commit to behrtam/python that referenced this pull request Nov 20, 2015

word-count: Add test case for unicode support

bd18112

See exercism#177

behrtam added a commit to behrtam/python that referenced this pull request Nov 20, 2015

word-count: Add test case for unicode support

4f4d22d

See exercism#177

behrtam mentioned this pull request Nov 20, 2015

word-count: Add test case for unicode support #252

Merged

behrtam mentioned this pull request Mar 8, 2017

word-count: Removed test case exercism/haskell#501

Closed

This was referenced Mar 16, 2017

word-count: Remove unicode test case #427

Merged

Rework word-count exercise to be consistent #445

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Uh oh!

word-count: 'javascript!!&@$%^&' should be not accepted as a word #177

word-count: 'javascript!!&@$%^&' should be not accepted as a word #177

Uh oh!

ghajba commented May 18, 2015

Uh oh!

kytrinyx commented May 19, 2015

Uh oh!

etrepum commented May 19, 2015

Uh oh!

ghajba commented May 19, 2015

Uh oh!

etrepum commented May 19, 2015

Uh oh!

Dog commented Jun 5, 2015

Uh oh!

behrtam commented Nov 20, 2015

Uh oh!

kytrinyx commented Nov 20, 2015

Uh oh!

behrtam commented Nov 20, 2015

Uh oh!

kytrinyx commented Nov 20, 2015

Uh oh!

Uh oh!

Uh oh!

word-count: 'javascript!!&@$%^&' should be not accepted as a word #177

word-count: 'javascript!!&@$%^&' should be not accepted as a word #177

Uh oh!

Conversation

ghajba commented May 18, 2015

Uh oh!

kytrinyx commented May 19, 2015

Uh oh!

etrepum commented May 19, 2015

Uh oh!

ghajba commented May 19, 2015

Uh oh!

etrepum commented May 19, 2015

Uh oh!

Dog commented Jun 5, 2015

Uh oh!

behrtam commented Nov 20, 2015

Uh oh!

kytrinyx commented Nov 20, 2015

Uh oh!

behrtam commented Nov 20, 2015

Uh oh!

kytrinyx commented Nov 20, 2015

Uh oh!

Uh oh!