Skip to content

word-count: 'javascript!!&@$%^&' should be not accepted as a word #177

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
wants to merge 2 commits into from

Conversation

ghajba
Copy link

@ghajba ghajba commented May 18, 2015

I did this exercise today while commuting to work and I think this solution was a bad-bad thing so I altered the test case and the example to handle this issue. If you do not think alike I understand.
Cheers, Gabor

ghajba added 2 commits May 18, 2015 19:46
…t the best

the test case should not preserve puntuation because 'javascript!!&@$%^&' is not a word.
I missed previously to alter the name of the test to represent what it does
@kytrinyx
Copy link
Member

/cc @exercism/track-maintainers What do you think? Should word count take a naive approach and just break on whitespace? Or should it discard interesting characters?

@etrepum
Copy link
Contributor

etrepum commented May 19, 2015

I like having it require the split condition be anything that's not alphanumeric, it requires a little more digging than a whitespace split which is usually some sort of built-in method or function in most languages.

In Haskell we use the unicode definition of alphanumeric. This Python implementation looks like it's just ASCII. In Python that would be tricky (but possible) to do with a regex because the \w class includes underscore but the spec for this exercise does not.

import re

def word_count(text):
    d = {}
    # use a negative lookahead assertion to remove underscore from \w
    for word in re.findall(r'(?:(?!_)\w)+(?u)', text):
        d[word] = d.get(word, 0) + 1
    return d

Simpler solution would do two passes to handle the underscore case:

import re

def word_count(text):
    d = {}
    for word in re.findall(r'\w+', text.replace('_', ' ')):
        d[word] = d.get(word, 0) + 1
    return d

re.split is more awkward to use because then you have to deal with empty strings.

@ghajba
Copy link
Author

ghajba commented May 19, 2015

If I may leave another comment: in the Java version of the same exercise special characters (like colon or this awkward javascript-extension) aren't permitted in the solution.

And I have to agree with @etrepum the current solution does not handle non-ascii characters. However the solution does not necessary have to be done with regular expressions, does it?

@etrepum
Copy link
Contributor

etrepum commented May 19, 2015

Regex is certainly not a necessity, I was just modifying the existing proposed solution to make it work with the full spec that the Haskell version implements. Here's one way to do it in a fairly low-level fashion:

def word_count(text):
    i = 0
    end = len(text)
    d = {}
    while i < end:
        while i < end and not text[i].isalnum():
            i = i + 1
        j = i
        while j < end and text[j].isalnum():
            j = j + 1
        if j > i:
            word = text[i:j]
            d[word] = d.get(word, 0) + 1
        i = j
    return d

@Dog
Copy link
Contributor

Dog commented Jun 5, 2015

I'm in favor of doing it the Haskell way, and performing a split if its not alphanumeric.

@ghajba Would you like to try to implement test cases for unicode? Otherwise, I'll merge and update the exercise.

@behrtam
Copy link
Contributor

behrtam commented Nov 20, 2015

  • The proposed implementation does not normalize (Expect case normalized words #207) and does not support unicode.
  • I would suggest that we create a different pr for the unicode test, because it might need some more work to support Python 2 & 3.
  • As we switch to the haskell way of splitting we should also add their test case for symbols.
testCase "symbols are separators" $
    fromList [("hey", 1), ("my", 1), ("spacebar", 1),
              ("is", 1), ("broken", 1)] @=?
    wordCount "hey,my_spacebar_is_broken."

@kytrinyx
Copy link
Member

+1 for adding the haskell test case.

Let's do as you suggest and put the unicode test in a different PR.

@behrtam
Copy link
Contributor

behrtam commented Nov 20, 2015

As @ghajba seems to have no time or interest in finishing this here and the example code doesn't work at all for the new way of splitting. Should we close this and reopen a new pr or merge this to a branch and refactor/fix it there?

@kytrinyx
Copy link
Member

Yeah, I think that's fine. Thanks, @behrtam.

@kytrinyx kytrinyx closed this Nov 20, 2015
behrtam added a commit to behrtam/python that referenced this pull request Nov 20, 2015
Change the separator as discussed in exercism#177 to non alphamumeric
as it is also done in the haskell track.
behrtam added a commit to behrtam/python that referenced this pull request Nov 20, 2015
Change the separator as discussed in exercism#177 to non alphanumeric
as it is also done in the haskell track.
behrtam added a commit to behrtam/python that referenced this pull request Nov 20, 2015
behrtam added a commit to behrtam/python that referenced this pull request Nov 20, 2015
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants