-
Notifications
You must be signed in to change notification settings - Fork 488
Adding after and next keywords with basic functionality. #638
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
Codecov Report
@@ Coverage Diff @@
## master #638 +/- ##
==========================================
+ Coverage 95.21% 95.24% +0.03%
==========================================
Files 302 302
Lines 2507 2523 +16
==========================================
+ Hits 2387 2403 +16
Misses 120 120
Continue to review full report at Codecov.
|
|
I'm curious why the CLDR libraries - which include translations for 'next Tuesday' for example aren't utilized to their fullest.... like this one for Russian: https://github.com/unicode-cldr/cldr-dates-full/blob/master/main/ru/dateFields.json These are all the variations given for relative terms related to Tuesday: |
|
Sorry if I'm missing something (and making a lot of noise in the comments) ... I downloaded all the changes above and tested:
So, 'after 15 days' worked fine and next week is still working - but 'next Tuesday' isn't. |
|
@jgtimestuff yeah, I am still trying to find a way for |
I can't figure out how one would incorporate the day of week in the freshness script, but... this works manually: TODAY = datetime.date.today() and if today happened to be Tuesday, it would cause a problem because you'd have to add a day... I suppose you could do that by turning TODAY into tomorrow... |
|
@jgtimestuff @Gallaecio, please review |
I think we should look into how “this/next ” is meant to be used in English, go with what we think is the most common interpretation by default, and provide settings to allow customizing that behavior to support all approaches. That said, it may be better to stick to the most common interpretation in this patch, and leave additional settings for a later patch or patches. That way we can get this merged sooner. |
|
In date processing, there is a week number of 1-52, isn't there? Could next Tuesday, use that to determine that it is not the Tuesday of this week? |
Grammatically ,
'next monday' is correctly applied here, we can try adding the above phrases |
|
Yes, next Monday is currently perfect - but next Tuesday is showing tomorrow, instead of next Tuesday. If we are looking at the weeknumber (I think it is currently 13), so Next Monday and Tuesday would be in week 14. |
|
Oh, sorry I misread your grammar post... I'm not sure anyone would follow the rule that with today being Sunday - next Monday would be tomorrow... However, in order to get this pushed along - I'm willing to accept this as a reasonable definition because there is quite a bit of ambiguity in how humans would actually apply the context. Next Wednesday or Thursday seems in my mind to fit with the immediate next description, but the tomorrow and the day after are fuzzy. Best to accept what you have. Thanks for listening! |
|
Morning all, I've been arguing with myself (something I tend to do often enough). I'm now leaning toward suggesting that we incorporate the week number in the logic to distinguish between This and Next for weekdays. If we document that 'This' followed by a weekday means they exist in the same week of the year, and 'Next' pushes the weekday into the next calendar week by default... So, next will always mean the weekday that occurs after the following Sunday. I.e. It's Saturday - Next Sunday or Monday will still be the next couple of days, but.. I think this better represents how people would expect the result to look into the future. Thoughts? |
|
I have no strong opinion either way. As soon as someone complains, we can implement an option to allow changing the interpretation 🙂 |
I think we can first try to add the next keyword to all the languages such as next tuesday works in all languages when parsed
Manually adding the next keyword translation is not the right approach there must be some better and faster way. Thoughts? After adding the next and this keyword we can do what you are suggesting @jgtimestuff Btw I think this PR is ready @Gallaecio |
|
I've been experimenting offline with: results are as follows (curiously, Monday is the first day of the week - instead of what I expected - Sunday), but it works well for switching to 'next' meaning the following week of the year if you decide to go that route:
|
|
Yeah, you are correct. @Gallaecio Thoughts? Should we add this and next keyword? |
|
I would not add “this” support in this pull request, just to try and get it merged as soon as possible. You can add that in a follow-up pull request, or let someone else do it. As to how to implement “next”, again, no strong opinion. I think it makes sense to eventually have a setting or settings to allow for any possible interpretation. |
Yeah I fixed that you can check now |
|
Excellent, my latest [offline] output - provides next week as the same result as next tuesday.
|
|
I think the logic for next weekday - whether using immediate-next or next-next aside, works well for English now. The problem of extending this to all languages though is problematic and perhaps another issue to address separately. Again, I wonder about all the available relative terms in CLDR's json compared to the dataparser.py's library and whether we can tap into those somehow with the code provided on this ticket once the terms for next tuesday are translated from all languages to English. There are several different ways to say 'next' in Russian depending on he gender of the noun representing the day of the week. |
|
yeah, on a seperate PR we can add the |
|
Hmm… Then we might need to make it so that more things can be implemented through YAML. I need to have a closer look… |
|
I’m looking at the script that you are supposed to run after modifying the YAML file, See for example how the YAML file defines |
|
Modifying the JSON file is bad? YAML files only contain different abbreviations of skip, pertain, year, month, week, hour, minutes, seconds, days, ago, in, relative-type, simplifications. |
Yeah, that is why I thought of modifying the JSON file. |
|
JSON files are not meant to be modified by Dateparser contributors, they come from CLDR and are meant to only contain their data. When we want to extend their data, we use the YAML files to do that. The JSON files should only be modified when updating data from the CLDR. |
|
So what should I do now? Extend all the YAML files same as JSON files add all the missing data? |
|
Closed by mistake 😅 |
|
Reverting the change in the JSON files, applying it to the YAML files (same change, YAML syntax) and running the script I mentioned should generate the same Python file changes, and everything should work the same. |
|
Does that need to be done for all the JSON files? Afterwards, JSON files present in |
|
This is where I was quite confused about the CLDR data... They have next Tuesday in multiple languages under relative-time-types, and I was expected we would use those terms to run across languages, but didn't see how those terms would make it through to the generated .py language files. The write_all_data or get_cldr scripts are selective - I think something needs to change in them to pull the relative-type-types out and into the .py versions? |
|
https://github.com/scrapinghub/dateparser/blob/master/scripts/get_cldr_data.py is supposed to fetch the yaml files from this path: Of course, the unicode-cldr repository has language specific datefield.json versions that could be copied over in full from - https://github.com/unicode-cldr/cldr-dates-full/blob/master/main/en/dateFields.json The dateFields versions if brought over 'might' replace the current versions in dateparser and if the write_all_data is run... I think it should update all .py files with the current data from those jsons? Oddly, I noticed something particularly weird about the various versions: unicode-cldr/cldr-dates-full/blob/master/main/ru/dateFields.json "tue": { dateparser_data/cldr_language_data dateparser_data/supplementary_language_data/date_translation_data/ru.yaml only has wed, fri, sat, sun? So, the ru.yaml in dateparser has the least amount of data to the point of missing a few days of the week. |
|
I took all the dateFields.json files from the cldr-dates-full-master\cldr-dates-full-master\main directories for each language and copied them to my local copy of dateparser in the subdirectories: dateparser_data\cldr_language_data\date_translation_data. Problem is that I've done this without switching or creating a new branch on git locally... so it's showing on master here. I'm assuming, and of course I could be wrong, that having these files available to execute the write_all_data.py would result in all the relative-time data being updated in all the .py language files. Then it would be necessary to re-work the logic for next Tuesday etc... to incorporate the relative-type--1, relative-type-0, relative-type-1 variants of each weekday. Am I correct in this assumption and if so, I then assume the write_all_data would have to be run locally then upload all the changes to the [language].py files that are created afterward. 2020-04-01 07:04 AM 27,034 af-NA.json |
|
Apologies again... Looking over write_complete_data.py, it seems that it depends on the utils.py and order_languages.py to do some work with cldr - but they need a raw_data directory somewhere in the process. So, creating the json files in the date_translation_data directory doesn't help because it still wants to go back and fetch them from git or this 'raw_data' directory somewhere. |
|
Okay, offline I've managed to get all the cldr dateFields.json files rewritten over to language.py files using write_complete_data.py by forcing the open statements to read binary - was getting lots of cp1252 can't decode errors prior to that. So, for example my ru.py has these options for Sunday: which matches the English en.py file: So, theoretically... we could tap into all those variations for past, present, future - right? |
|
I see the structure I ended up with doesn't match the original .py sets at all... sigh. |
We can extend the current structure in a way that all works well with dateFields.json. Thoughts? |
|
Maybe open a separate pull request just for updating the CLDR data, and making any necessary changes to the corresponding scripts? |
|
Going back in time a bit here, but what's the difference between Tuesday and Next Tuesday? If we accept that a text talking about a given weekday has a 50/50 chance of talking about past or future because present wouldn't use the weekday. On Tuesday, I will or I did - do something are really the only choices. In other words, do we really need to add code specifically for 'next Tuesday' if it would most likely always return the same date as 'Tuesday' by itself? The recent code changes could still be used to force the Tuesday from this week into the future in the output... where dateparser.parse('tuesday') when compared to a current date of Wed through Sunday) would NOT return the date for Tuesday of this week. i.e. if current.weekday() is >= _weekday then we want the future and not the one from the current week. We wouldn't normally say on Wednesday, when it is Wednesday - or on Tuesday when it's Thursday and be referring to the Tuesday that has passed. |
|
How do these look for results given today is Thursday 2020-04-09? import dateparser I put in a rule that 'next' isn't needed for days of the week that have already passed in the current week or is the current day - With today being Thursday, all the days of Monday-Thursday are automatically 'next week' - so whether next is present or not... it'll return the appropriate date for the upcoming weekday. For days equal to or greater than today's date it will only apply 'next' if the difference of days of the week is greater than 2... so Friday, and Saturday without 'next' provide the dates for the near future while next Friday and next Saturday jump a week. Thoughts? |
This may be true for natural language, but I’m not so sure about computing systems. I think I’ve seen websites using the name of the day of the week to refer to past events within the last 6 days (then they use ‘1 week ago’ and similar for older events). |
|
I tried a couple more additions like making 'Saturday' recognize 'weekend' and noticed that the 'first day of' a month is quirky because I wrote a sentence to parse that included the line... but I suppose that should indeed be another PR as well. search_dates("this weekend will mark the 1st of may") search_dates("this weekend will mark the first of may") |
Changing the Following code "relative-type-regex": {
"in \\1 year": [
"in {0} year",
"in {0} years",
"in {0} yr"
]Into Yaml results the following relative-type-regex:
in \\1 year:
- string: 'in {0} year'
- string: 'in {0} years'
- string: 'in {0} yr'But write_complete_data.py script raises an Error. So to fix this can I edit the script? And After this is done what is the use of JSON files present in cldr_language_data ? |
Yes. Although I would not include the script change as part of this pull request, I would create a separate one for that.
They come from Unicode CLDR, as described in https://dateparser.readthedocs.io/en/latest/contributing.html#guidelines-for-editing-translation-data We update them automatically from Unicode CLDR. |
|
Hi, guys, is there some progress on this? Maybe it's hard but it's very very valuable for us and I would appreciate very much if this is solved. |
resolves #635
resolves #573
Before
Now