-
-
Notifications
You must be signed in to change notification settings - Fork 1.4k
word-count: 'javascript!!&@$%^&' should be not accepted as a word #177
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
…t the best the test case should not preserve puntuation because 'javascript!!&@$%^&' is not a word.
I missed previously to alter the name of the test to represent what it does
/cc @exercism/track-maintainers What do you think? Should word count take a naive approach and just break on whitespace? Or should it discard interesting characters? |
I like having it require the split condition be anything that's not alphanumeric, it requires a little more digging than a whitespace split which is usually some sort of built-in method or function in most languages. In Haskell we use the unicode definition of alphanumeric. This Python implementation looks like it's just ASCII. In Python that would be tricky (but possible) to do with a regex because the import re
def word_count(text):
d = {}
# use a negative lookahead assertion to remove underscore from \w
for word in re.findall(r'(?:(?!_)\w)+(?u)', text):
d[word] = d.get(word, 0) + 1
return d Simpler solution would do two passes to handle the underscore case: import re
def word_count(text):
d = {}
for word in re.findall(r'\w+', text.replace('_', ' ')):
d[word] = d.get(word, 0) + 1
return d
|
If I may leave another comment: in the Java version of the same exercise special characters (like colon or this awkward javascript-extension) aren't permitted in the solution. And I have to agree with @etrepum the current solution does not handle non-ascii characters. However the solution does not necessary have to be done with regular expressions, does it? |
Regex is certainly not a necessity, I was just modifying the existing proposed solution to make it work with the full spec that the Haskell version implements. Here's one way to do it in a fairly low-level fashion: def word_count(text):
i = 0
end = len(text)
d = {}
while i < end:
while i < end and not text[i].isalnum():
i = i + 1
j = i
while j < end and text[j].isalnum():
j = j + 1
if j > i:
word = text[i:j]
d[word] = d.get(word, 0) + 1
i = j
return d |
I'm in favor of doing it the Haskell way, and performing a split if its not alphanumeric. @ghajba Would you like to try to implement test cases for unicode? Otherwise, I'll merge and update the exercise. |
testCase "symbols are separators" $
fromList [("hey", 1), ("my", 1), ("spacebar", 1),
("is", 1), ("broken", 1)] @=?
wordCount "hey,my_spacebar_is_broken." |
+1 for adding the haskell test case. Let's do as you suggest and put the unicode test in a different PR. |
As @ghajba seems to have no time or interest in finishing this here and the example code doesn't work at all for the new way of splitting. Should we close this and reopen a new pr or merge this to a branch and refactor/fix it there? |
Yeah, I think that's fine. Thanks, @behrtam. |
Change the separator as discussed in exercism#177 to non alphamumeric as it is also done in the haskell track.
Change the separator as discussed in exercism#177 to non alphanumeric as it is also done in the haskell track.
I did this exercise today while commuting to work and I think this solution was a bad-bad thing so I altered the test case and the example to handle this issue. If you do not think alike I understand.
Cheers, Gabor