Skip to content

Restructure code for ordinal support #35

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 8 commits into from
Aug 14, 2020

Conversation

arnavkapoor
Copy link
Collaborator

@arnavkapoor arnavkapoor commented Aug 7, 2020

Hi @noviluni thanks for the clarification I was looking into the restructuring with the requisite changes.
asked here. #31 (comment)

So currently I have just created the working parse_ordinal function and added tests.
I think I am still a bit unclear how we can make parse use only parse_number, parse_ordinal.

  • The flow that is wanted for parse function is follows from what i get
    1. Go through each token check and check whether it is a number_token (using _is_number_token , I also planned to update is_cardinal_token to return a tuple boolean, valid_token. So for fifty it will give False, 50 while for fiftieth it will give True , 50)

    2. Then use parser_number and parse_ordinal to return the number corresponding to this token / group of tokens ?

The issues that I see is basically deciding when to call parse_number , currently we keep on taking all the possible tokens for as long as possible. Say for one hundred and ten seven seventeen , all these would be taken then _build_number returns a list of values [110,7,17] , which then is used to build the output string "110 7 17". The decision logic to select when to break and all is a part of _build_number.
Another thing would be as I mentioned earlier using parse to handle strings multiple consecutive ordinals would need to be fixed. twentieth seventh fiftieth third.

I can make some small changes in the parse to support ordinal numbers. Something along the lines, but not sure how exactly parse_number / ordinal will do this.

if is_ordinal_token(token):  
      tokens_taken.append( _apply_cardinal_conversion(token,language) )
    _build_number(tokens_taken, lang_data)

Comment on lines 161 to 173
def _is_cardinal_token(token, lang_dict):
if token in lang_dict.all_numbers:
return token
return None


def _is_ordinal_token(token, lang_dict):
token = _apply_cardinal_conversion(token, lang_dict)
return _is_cardinal_token(token, lang_dict)


def _is_number_token(token, lang_dict):
return _is_cardinal_token(token, lang_dict) or _is_ordinal_token(token, lang_dict)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I prefer to put the language as an argument instead of the lang_dict to make it easier to use it.

For _is_number_token() I think we can do it public: is_number_token(), as my intention is to use it when integrating this library with dateparser.

@@ -153,41 +153,61 @@ def _normalize_tokens(token_list):
return [_strip_accents(token.lower()) for token in token_list]


def _normalize_dict(lang_dict):
def _normalize_dict(lang_data):
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

renaming the parameter to lang_data everywhere from lang_dict



def _apply_cardinal_conversion(input_string, language):
# this will be coming from the LanguageData
def _apply_cardinal_conversion(token, lang_data):
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The function is updated so that now it handles each token rather than an entire string. Now when it is used by parse with words ending in th that are not numbers like with , goldsmith these are not changed.

if _is_cardinal_token(token_cardinal_form_2, lang_data) is not None:
return token_cardinal_form_2

return token


def parse_ordinal(input_string, language='en'):
Copy link
Collaborator Author

@arnavkapoor arnavkapoor Aug 12, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

parse_ordinal acts as the third interface after parse , parse_number . All these interfaces have the language parameter while the rest use the lang_data.
This currently converts all tokens to their cardinal form and then gives it to parse_number , which means that using parse_ordinal with only cardinal , cardinal/ordinal mixture and multiple ordinals all work , which is not completely accurate. (added these cases in tests)

@@ -226,6 +246,8 @@ def parse(input_string, language='en'):

for token in tokens:
compare_token = _strip_accents(token.lower())
ordinal_number = _is_ordinal_token(compare_token, lang_data)
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This parameter is added and used to break the processing when we encounter an ordinal number as it indicates the end of current processing. This is to ensure cases like twentieth seventh fiftieth third work

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I can add a similar sort of logic to parse_ordinal to break on encountering an ordinal number. This would then reject mixed cardinal/ordinal like twentieth three

@arnavkapoor
Copy link
Collaborator Author

arnavkapoor commented Aug 12, 2020

Hi @noviluni , so I have made some changes in the functions to handle ordinal numbers correctly while using parse. I also added doc strings in the code to elaborate about each function. Additionally, as and when more languages / different logic is used only _apply_cardinal_conversion would need to be modified.

I would appreciate some inputs with regards to the behavior of parse_ordinal , how it should handle mixed data , purely cardinal data etc.

Next step is the automatic language detection (in a new PR). Structurally, the three public interfaces parse , parse_number , parse_ordinal , would call a _get_best_language(input_string) if language parameter is missing. Then they will populate the lang_data with this best language.

Copy link
Contributor

@noviluni noviluni left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've been playing around and it seems to work quite well, so I would go ahead and work on the dateparser integration.

Not sure about cases like fifty-fifth, fiftieth five, etc. but I think it's not a big issue to leave it as-is for now.

Before merging this I think that you should add a reference to the new parse_ordinal() function in the README (mentioning that it only supports English) and a mention regarding the fact that now parse() parses ordinal numbers too.

I think we could add an optional parameter like convert_ordinals=True to the parse() function (defaults to True) to allow to enable or disable this behavior, but for now it's not necessary (when merging this, could you create an issue in Github with this?)

Good job making this working 💪



def _apply_cardinal_conversion(token, lang_data):
"""Converts ordinal tokens to cardinal while leaving other tokens unchanged."""
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you add a comment like # Note: For now it supports only English?

@arnavkapoor arnavkapoor marked this pull request as ready for review August 14, 2020 07:34
@arnavkapoor
Copy link
Collaborator Author

updated readme to add info about ordinal numbers and , also fixed #33

@arnavkapoor arnavkapoor reopened this Aug 14, 2020
@arnavkapoor arnavkapoor merged commit d25621e into master Aug 14, 2020
@noviluni noviluni mentioned this pull request Aug 14, 2020
@arnavkapoor arnavkapoor mentioned this pull request Aug 26, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants