-
Notifications
You must be signed in to change notification settings - Fork 25
Restructure code for ordinal support #35
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
number_parser/parser.py
Outdated
def _is_cardinal_token(token, lang_dict): | ||
if token in lang_dict.all_numbers: | ||
return token | ||
return None | ||
|
||
|
||
def _is_ordinal_token(token, lang_dict): | ||
token = _apply_cardinal_conversion(token, lang_dict) | ||
return _is_cardinal_token(token, lang_dict) | ||
|
||
|
||
def _is_number_token(token, lang_dict): | ||
return _is_cardinal_token(token, lang_dict) or _is_ordinal_token(token, lang_dict) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I prefer to put the language
as an argument instead of the lang_dict
to make it easier to use it.
For _is_number_token()
I think we can do it public: is_number_token()
, as my intention is to use it when integrating this library with dateparser
.
@@ -153,41 +153,61 @@ def _normalize_tokens(token_list): | |||
return [_strip_accents(token.lower()) for token in token_list] | |||
|
|||
|
|||
def _normalize_dict(lang_dict): | |||
def _normalize_dict(lang_data): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
renaming the parameter to lang_data
everywhere from lang_dict
number_parser/parser.py
Outdated
|
||
|
||
def _apply_cardinal_conversion(input_string, language): | ||
# this will be coming from the LanguageData | ||
def _apply_cardinal_conversion(token, lang_data): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The function is updated so that now it handles each token rather than an entire string. Now when it is used by parse
with words ending in th
that are not numbers like with , goldsmith
these are not changed.
if _is_cardinal_token(token_cardinal_form_2, lang_data) is not None: | ||
return token_cardinal_form_2 | ||
|
||
return token | ||
|
||
|
||
def parse_ordinal(input_string, language='en'): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
parse_ordinal
acts as the third interface after parse , parse_number
. All these interfaces have the language
parameter while the rest use the lang_data
.
This currently converts all tokens to their cardinal form and then gives it to parse_number
, which means that using parse_ordinal
with only cardinal , cardinal/ordinal mixture and multiple ordinals all work , which is not completely accurate. (added these cases in tests)
@@ -226,6 +246,8 @@ def parse(input_string, language='en'): | |||
|
|||
for token in tokens: | |||
compare_token = _strip_accents(token.lower()) | |||
ordinal_number = _is_ordinal_token(compare_token, lang_data) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This parameter is added and used to break the processing when we encounter an ordinal number as it indicates the end of current processing. This is to ensure cases like twentieth seventh fiftieth third
work
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I can add a similar sort of logic to parse_ordinal
to break on encountering an ordinal number. This would then reject mixed cardinal/ordinal like twentieth three
Hi @noviluni , so I have made some changes in the functions to handle ordinal numbers correctly while using I would appreciate some inputs with regards to the behavior of Next step is the automatic language detection (in a new PR). Structurally, the three public interfaces |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've been playing around and it seems to work quite well, so I would go ahead and work on the dateparser
integration.
Not sure about cases like fifty-fifth
, fiftieth five
, etc. but I think it's not a big issue to leave it as-is for now.
Before merging this I think that you should add a reference to the new parse_ordinal()
function in the README (mentioning that it only supports English) and a mention regarding the fact that now parse()
parses ordinal numbers too.
I think we could add an optional parameter like convert_ordinals=True
to the parse()
function (defaults to True
) to allow to enable or disable this behavior, but for now it's not necessary (when merging this, could you create an issue in Github with this?)
Good job making this working 💪
|
||
|
||
def _apply_cardinal_conversion(token, lang_data): | ||
"""Converts ordinal tokens to cardinal while leaving other tokens unchanged.""" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could you add a comment like # Note: For now it supports only English
?
updated readme to add info about ordinal numbers and , also fixed #33 |
Hi @noviluni thanks for the clarification I was looking into the restructuring with the requisite changes.
asked here. #31 (comment)
So currently I have just created the working
parse_ordinal
function and added tests.I think I am still a bit unclear how we can make parse use only parse_number, parse_ordinal.
Go through each token check and check whether it is a number_token (using
_is_number_token
, I also planned to updateis_cardinal_token
to return a tuple boolean, valid_token. So for fifty it will give False, 50 while for fiftieth it will give True , 50)Then use parser_number and parse_ordinal to return the number corresponding to this token / group of tokens ?
The issues that I see is basically deciding when to call
parse_number
, currently we keep on taking all the possible tokens for as long as possible. Say forone hundred and ten seven seventeen
, all these would be taken then_build_number
returns a list of values [110,7,17] , which then is used to build the output string "110 7 17". The decision logic to select when to break and all is a part of _build_number.Another thing would be as I mentioned earlier using
parse
to handle strings multiple consecutive ordinals would need to be fixed.twentieth seventh fiftieth third
.I can make some small changes in the
parse
to support ordinal numbers. Something along the lines, but not sure how exactly parse_number / ordinal will do this.