@harsh9200

@harsh9200 harsh9200 commented Mar 19, 2020

resolves #635
resolves #573

Before

>>> dateparser.parse('after 15 days').strftime('%a %Y-%m-%d')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
AttributeError: 'NoneType' object has no attribute 'strftime'
>>> dateparser.parse('next tuesday').strftime('%a %Y-%m-%d')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
AttributeError: 'NoneType' object has no attribute 'strftime'

Now

>>> dateparser.parse('now').strftime('%a %Y-%m-%d')
'Sun 2020-03-22'

>>> dateparser.parse('after 15 days').strftime('%a %Y-%m-%d')
'Mon 2020-04-06'

>>> dateparser.parse('next sunday').strftime('%a %Y-%m-%d')
'Sun 2020-03-29'

>>> dateparser.parse('next tuesday').strftime('%a %Y-%m-%d')
'Tue 2020-03-24'

@codecov

codecov bot commented Mar 19, 2020

Codecov Report

Merging #638 into master will increase coverage by 0.03%.
The diff coverage is 100.00%.


@@            Coverage Diff             @@
##           master     #638      +/-   ##
==========================================
+ Coverage   95.21%   95.24%   +0.03%     
==========================================
  Files         302      302              
  Lines        2507     2523      +16     
==========================================
+ Hits         2387     2403      +16     
  Misses        120      120              
Impacted Files                                 Coverage Δ
dateparser/data/date_translation_data/en.py    100.00% <ø> (ø)
dateparser/languages/validation.py             93.43% <ø> (ø)
dateparser/freshness_date_parser.py            99.09% <100.00%> (+0.14%) ⬆️
dateparser/languages/dictionary.py             99.33% <100.00%> (ø)
dateparser/languages/locale.py                 98.66% <100.00%> (+<0.01%) ⬆️
dateparser/search/search.py                    99.35% <100.00%> (ø)
dateparser/date.py                             97.96% <0.00%> (-0.01%) ⬇️


Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update fb188b6...1a5227b. Read the comment docs.

@jgtimestuff

I'm curious why the CLDR data, which includes translations for phrases such as 'next Tuesday', isn't used to its fullest. Take this file for Russian, for example: https://github.com/unicode-cldr/cldr-dates-full/blob/master/main/ru/dateFields.json

These are all the variations given for relative terms related to Tuesday:
"tue": {
    "relative-type--1": "в прошлый вторник",
    "relative-type-0": "в этот вторник",
    "relative-type-1": "в следующий вторник",
    "relativeTime-type-future": {
        "relativeTimePattern-count-one": "через {0} вторник",
        "relativeTimePattern-count-few": "через {0} вторника",
        "relativeTimePattern-count-many": "через {0} вторников",
        "relativeTimePattern-count-other": "через {0} вторника"
    },
    "relativeTime-type-past": {
        "relativeTimePattern-count-one": "{0} вторник назад",
        "relativeTimePattern-count-few": "{0} вторника назад",
        "relativeTimePattern-count-many": "{0} вторников назад",
        "relativeTimePattern-count-other": "{0} вторника назад"
    }
},
"tue-short": {
    "relative-type--1": "в прош. вт.",
    "relative-type-0": "в этот вт.",
    "relative-type-1": "в след. вт.",
    "relativeTime-type-future": {
        "relativeTimePattern-count-one": "через {0} вт.",
        "relativeTimePattern-count-few": "через {0} вт.",
        "relativeTimePattern-count-many": "через {0} вт.",
        "relativeTimePattern-count-other": "через {0} вт."
    },
    "relativeTime-type-past": {
        "relativeTimePattern-count-one": "{0} вт. назад",
        "relativeTimePattern-count-few": "{0} вт. назад",
        "relativeTimePattern-count-many": "{0} вт. назад",
        "relativeTimePattern-count-other": "{0} вт. назад"
    }
},
"tue-narrow": {
    "relative-type--1": "в прош. вт.",
    "relative-type-0": "в этот вт.",
    "relative-type-1": "в след. вт.",
    "relativeTime-type-future": {
        "relativeTimePattern-count-one": "+{0} вт.",
        "relativeTimePattern-count-few": "+{0} вт.",
        "relativeTimePattern-count-many": "+{0} вт.",
        "relativeTimePattern-count-other": "+{0} вт."
    },
    "relativeTime-type-past": {
        "relativeTimePattern-count-one": "-{0} вт.",
        "relativeTimePattern-count-few": "-{0} вт.",
        "relativeTimePattern-count-many": "-{0} вт.",
        "relativeTimePattern-count-other": "-{0} вт."
    }
},

@jgtimestuff

Sorry if I'm missing something (and making a lot of noise in the comments) ... I downloaded all the changes above and tested:

dateparser.parse('after 15 days')
relativedelta(days=+15)
datetime.datetime(2020, 4, 4, 8, 33, 19, 830552)
dateparser.parse('next week')
relativedelta(days=+7)
datetime.datetime(2020, 3, 27, 8, 33, 31, 913679)
dateparser.parse('next Tuesday')

So, 'after 15 days' worked fine and next week is still working - but 'next Tuesday' isn't.

@harsh9200
Author

@jgtimestuff yeah, I am still trying to find a way for next tuesday. If you have any suggestion please let me know

@jgtimestuff

> @jgtimestuff yeah, I am still trying to find a way for next tuesday. If you have any suggestion please let me know

I can't figure out how one would incorporate the day of week in the freshness script, but... this works manually:

import calendar
import datetime
import dateutil.relativedelta as rld
TODAY = datetime.date.today()
TODAY + rld.relativedelta(weekday=calendar.TUESDAY)
datetime.date(2020, 3, 24)

and if today happened to be Tuesday, it would cause a problem because you'd have to add a day... I suppose you could do that by turning TODAY into tomorrow...
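
The workaround described above (start from tomorrow so that a Tuesday never resolves to itself) can be sketched with the standard library alone. The helper name `next_weekday` is hypothetical and is not part of dateparser or dateutil.

```python
import calendar
import datetime


def next_weekday(today, weekday):
    """Return the first date strictly after `today` that falls on `weekday`.

    `weekday` uses the calendar module's numbering (Monday is 0).
    """
    tomorrow = today + datetime.timedelta(days=1)
    # Days from tomorrow until the requested weekday, in the range 0..6.
    days_ahead = (weekday - tomorrow.weekday()) % 7
    return tomorrow + datetime.timedelta(days=days_ahead)


# Sunday 2020-03-22: the upcoming Tuesday is two days away.
print(next_weekday(datetime.date(2020, 3, 22), calendar.TUESDAY))  # 2020-03-24
# Tuesday 2020-03-24: today is skipped, so the result is a week later.
print(next_weekday(datetime.date(2020, 3, 24), calendar.TUESDAY))  # 2020-03-31
```

Starting from tomorrow avoids a special case for "today is already the requested weekday".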

@harsh9200
Author

@jgtimestuff @Gallaecio, please review

@Gallaecio
Member

One of the difficulties with 'next' and a weekday I suppose is determining the probability of which of the upcoming days can be classed as 'next'. One day isn't normally referred to as 'next', so it might be wise to look for the next occurrence of the given day as long as it is greater than say 3 days away?

I think we should look into how “this/next ” is meant to be used in English, go with what we think is the most common interpretation by default, and provide settings to allow customizing that behavior to support all approaches.

That said, it may be better to stick to the most common interpretation in this patch, and leave additional settings for a later patch or patches. That way we can get this merged sooner.

@jgtimestuff

In date processing, there is a week number of 1-52, isn't there? Could next Tuesday, use that to determine that it is not the Tuesday of this week?

@harsh9200
Author

harsh9200 commented Mar 23, 2020

> In date processing, there is a week number of 1-52, isn't there? Could next Tuesday, use that to determine that it is not the Tuesday of this week?

Grammatically,

  • 'next Monday' -
    means the immediate next Monday

  • 'the Monday after next' -
    If today is Sunday, March 1st, this refers to Monday, March 9th. If today is Tuesday, March 3rd, it refers to Monday, March 16th.

  • 'the Monday after next week' -
    If today is Sunday, March 1st, this refers to Monday, March 16th. If today is Tuesday, March 3rd, it refers to the same day, Monday, March 16th.

'next monday' is correctly applied here, we can try adding the above phrases
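
The three readings listed above can be checked against the March 2020 examples with a small standard-library sketch. The function names are hypothetical, and the Sunday week start in `weekday_after_next_week` is an assumption inferred from the dates given in the list, not something dateparser defines.

```python
import datetime

MONDAY = 0  # datetime.date.weekday() numbering: Monday is 0, Sunday is 6


def nth_upcoming(today, weekday, n=1):
    """Return the n-th occurrence of `weekday` strictly after `today`.

    n=1 models 'next Monday', n=2 models 'the Monday after next'.
    """
    days_ahead = (weekday - today.weekday() - 1) % 7 + 1  # 1..7
    return today + datetime.timedelta(days=days_ahead + 7 * (n - 1))


def weekday_after_next_week(today, weekday):
    """Return `weekday` in the week that follows next week.

    Assumes weeks start on Sunday, which is what the examples imply.
    """
    days_since_sunday = (today.weekday() + 1) % 7
    this_week_start = today - datetime.timedelta(days=days_since_sunday)
    target_week_start = this_week_start + datetime.timedelta(weeks=2)
    offset_from_sunday = (weekday + 1) % 7
    return target_week_start + datetime.timedelta(days=offset_from_sunday)


sunday = datetime.date(2020, 3, 1)
tuesday = datetime.date(2020, 3, 3)
print(nth_upcoming(sunday, MONDAY, n=2))            # 2020-03-09
print(nth_upcoming(tuesday, MONDAY, n=2))           # 2020-03-16
print(weekday_after_next_week(sunday, MONDAY))      # 2020-03-16
print(weekday_after_next_week(tuesday, MONDAY))     # 2020-03-16
```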

@jgtimestuff

Yes, next Monday is currently perfect - but next Tuesday is showing tomorrow, instead of next Tuesday. If we are looking at the weeknumber (I think it is currently 13), so Next Monday and Tuesday would be in week 14.
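
For reference, the week number mentioned here is available from the standard library as the ISO week, which starts on Monday and runs from 1 to 53 rather than 1 to 52. A minimal sketch follows; the helper names are hypothetical. Comparing the (year, week) pair instead of the bare week number keeps the comparison valid across a year boundary.

```python
import datetime


def iso_year_week(day):
    """Return (ISO year, ISO week number) for `day`."""
    iso = day.isocalendar()
    return iso[0], iso[1]


def in_later_week(candidate, today):
    """True if `candidate` falls in a later ISO week than `today`."""
    return iso_year_week(candidate) > iso_year_week(today)


print(iso_year_week(datetime.date(2020, 3, 24)))   # (2020, 13)
print(iso_year_week(datetime.date(2020, 3, 31)))   # (2020, 14)
print(iso_year_week(datetime.date(2020, 12, 31)))  # (2020, 53)
```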

@jgtimestuff

Oh, sorry I misread your grammar post... I'm not sure anyone would follow the rule that with today being Sunday - next Monday would be tomorrow... However, in order to get this pushed along - I'm willing to accept this as a reasonable definition because there is quite a bit of ambiguity in how humans would actually apply the context. Next Wednesday or Thursday seems in my mind to fit with the immediate next description, but the tomorrow and the day after are fuzzy. Best to accept what you have. Thanks for listening!

@jgtimestuff

Morning all, I've been arguing with myself (something I tend to do often enough). I'm now leaning toward suggesting that we incorporate the week number in the logic to distinguish between This and Next for weekdays. If we document that 'This' followed by a weekday means they exist in the same week of the year, and 'Next' pushes the weekday into the next calendar week by default... So, next will always mean the weekday that occurs after the following Sunday.

I.e. It's Saturday - Next Sunday or Monday will still be the next couple of days, but..
if it's Monday, then Next Tuesday will jump to the following week instead of being 'tomorrow'.

I think this better represents how people would expect the result to look into the future. Thoughts?
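
A minimal sketch of the calendar-week reading proposed above, where 'next' always lands in the following calendar week. The function name and the `first_weekday` parameter are hypothetical. The default of Sunday is an assumption chosen because it reproduces the Saturday and Monday examples in this comment.

```python
import datetime

SUNDAY = 6  # datetime.date.weekday() numbering: Monday is 0, Sunday is 6


def weekday_in_next_week(today, weekday, first_weekday=SUNDAY):
    """Return `weekday` within the calendar week after the one containing `today`."""
    days_into_week = (today.weekday() - first_weekday) % 7
    this_week_start = today - datetime.timedelta(days=days_into_week)
    next_week_start = this_week_start + datetime.timedelta(weeks=1)
    offset = (weekday - first_weekday) % 7
    return next_week_start + datetime.timedelta(days=offset)


saturday = datetime.date(2020, 3, 28)
monday = datetime.date(2020, 3, 23)
print(weekday_in_next_week(saturday, 6))  # next Sunday: 2020-03-29
print(weekday_in_next_week(saturday, 0))  # next Monday: 2020-03-30
print(weekday_in_next_week(monday, 1))    # next Tuesday: 2020-03-31
```

Passing `first_weekday=0` switches to Monday-based weeks, which changes the answer for Sundays.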

@Gallaecio
Member

I have no strong opinion either way. As soon as someone complains, we can implement an option to allow changing the interpretation 🙂

@harsh9200
Author

> I think this better represents how people would expect the result to look into the future. Thoughts?

I think we can first try to add the 'next' keyword to all the languages, so that 'next tuesday' works in every language when parsed

>>> dateparser.parse("next tuesday").strftime('%a %Y-%m-%d')
'Tue 2020-03-31'

next tuesday in hindi is अगले मंगलवार

>>> dateparser.parse("मंगलवार").strftime('%a %Y-%m-%d')
'Tue 2020-03-24'

>>> dateparser.parse("अगले मंगलवार").strftime('%a %Y-%m-%d')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
AttributeError: 'NoneType' object has no attribute 'strftime'

Manually adding the translation of the 'next' keyword is not the right approach; there must be some better and faster way. Thoughts?

After adding the next and this keyword we can do what you are suggesting @jgtimestuff

Btw I think this PR is ready @Gallaecio

@jgtimestuff

I've been experimenting offline with:
if _weekday:
    newday = self.now.today() + relativedelta(weeks=+1)
    self.now = newday
    dat = getattr(cal, _weekday.upper())
    day_ahead = dat - self.now.weekday()
    td = relativedelta(days=day_ahead)

results are as follows (curiously, Monday is the first day of the week - instead of what I expected - Sunday), but it works well for switching to 'next' meaning the following week of the year if you decide to go that route:

dateparser.parse('next tuesday')
datetime.datetime(2020, 3, 31, 8, 12, 18, 326030)
dateparser.parse('next monday')
datetime.datetime(2020, 3, 30, 8, 12, 29, 217894)
dateparser.parse('next wednexday')
dateparser.parse('next wednesday')
datetime.datetime(2020, 4, 1, 8, 12, 50, 163365)
dateparser.parse('next thursday')
datetime.datetime(2020, 4, 2, 8, 13, 1, 818499)
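
On the weekday numbering noted above: in the Python standard library, `date.weekday()` and the `calendar` constants count from Monday as 0. A short illustration follows; `sunday_based_index` is a hypothetical helper for anyone who wants Sunday-first numbering instead.

```python
import calendar
import datetime

day = datetime.date(2020, 3, 30)  # a Monday

print(day.weekday())                     # 0 (Monday is 0, Sunday is 6)
print(day.isoweekday())                  # 1 (Monday is 1, Sunday is 7)
print(calendar.MONDAY, calendar.SUNDAY)  # 0 6


def sunday_based_index(day):
    """Return 0 for Sunday through 6 for Saturday."""
    return (day.weekday() + 1) % 7


print(sunday_based_index(day))  # 1
```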

@harsh9200
Author

Yeah, you are correct.

@Gallaecio Thoughts? Should we add this and next keyword?

@Gallaecio
Member

I would not add “this” support in this pull request, just to try and get it merged as soon as possible. You can add that in a follow-up pull request, or let someone else do it.

As to how to implement “next”, again, no strong opinion. I think it makes sense to eventually have a setting or settings to allow for any possible interpretation.

@harsh9200
Author

> I'm wondering if we shouldn't use something that specifically looks for next ...day as a complete string so next by anything other than a weekday is treated differently than next followed by a weekday?
>
> next week, next month, next year are all different than next specific day of the week?
>
> pattern = r'next [a-zA-Z]{3,6}day'

Yeah I fixed that you can check now
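
A quick check of the quoted pattern, as a sketch only. It separates weekdays from 'next week', 'next month' and 'next year', but it also accepts any other word ending in "day". The stricter `STRICT` pattern below is a hypothetical alternative, not what the patch uses.

```python
import re

WEEKDAYS = (
    'monday', 'tuesday', 'wednesday', 'thursday',
    'friday', 'saturday', 'sunday',
)

# Pattern quoted in the comment above.
LOOSE = re.compile(r'next [a-zA-Z]{3,6}day')

# Hypothetical stricter alternative: only the seven English weekday names.
STRICT = re.compile(r'\bnext (' + '|'.join(WEEKDAYS) + r')\b', re.IGNORECASE)

for phrase in ('next tuesday', 'next week', 'next holiday'):
    print(phrase, bool(LOOSE.search(phrase)), bool(STRICT.search(phrase)))
# next tuesday True True
# next week False False
# next holiday True False
```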

@jgtimestuff

Excellent, my latest [offline] output - provides next week as the same result as next tuesday.

import dateparser
dateparser.parse('next wednesday')
datetime.datetime(2020, 4, 1, 10, 7, 21, 284191)
dateparser.parse('next monday')
datetime.datetime(2020, 3, 30, 10, 7, 38, 86311)
dateparser.parse('next thursday')
datetime.datetime(2020, 4, 2, 10, 7, 46, 797134)
dateparser.parse('next sunday')
datetime.datetime(2020, 4, 5, 10, 8, 12, 963336)
dateparser.parse('next week')
datetime.datetime(2020, 3, 31, 10, 11, 3, 51521)

@jgtimestuff

I think the logic for next weekday, leaving aside whether it uses immediate-next or next-next, works well for English now. Extending this to all languages is problematic, though, and perhaps another issue to address separately. Again, I wonder about all the available relative terms in CLDR's JSON compared to dateparser's own data, and whether we can tap into those somehow with the code provided on this ticket once the terms for next tuesday are translated from all languages to English. There are several different ways to say 'next' in Russian depending on the gender of the noun representing the day of the week.

@harsh9200
Author

yeah, on a separate PR we can add the next keyword in different languages.

@Gallaecio
Member

Hmm… Then we might need to make it so that more things can be implemented through YAML. I need to have a closer look…

@Gallaecio
Member

I’m looking at the script that you are supposed to run after modifying the YAML file, write_complete_data.py, and it looks like you should be able to write the same type of data in the YAML file, with the same structure as the JSON file (but in YAML syntax), and the script will merge both into the corresponding Python file.

See for example how the YAML file defines september: - sept and the JSON file defines "september": ["september", "sep"], and the Python file has all three.
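
The merge behaviour described above can be illustrated with plain dictionaries. This is a sketch of the idea only, assuming list-valued keys. It is not the logic of write_complete_data.py, and `merge_translations` is a hypothetical name.

```python
def merge_translations(cldr_data, supplementary_data):
    """Combine CLDR entries with supplementary ones, skipping duplicates."""
    # Copy the lists so the CLDR input is left untouched.
    merged = {key: list(values) for key, values in cldr_data.items()}
    for key, extra_values in supplementary_data.items():
        target = merged.setdefault(key, [])
        for value in extra_values:
            if value not in target:
                target.append(value)
    return merged


cldr = {"september": ["september", "sep"]}   # as in the JSON file
supplementary = {"september": ["sept"]}      # as in the YAML file
print(merge_translations(cldr, supplementary))
# {'september': ['september', 'sep', 'sept']}
```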

@harsh9200
Author

harsh9200 commented Mar 30, 2020

Modifying the JSON file is bad?

YAML files only contain different abbreviations of skip, pertain, year, month, week, hour, minutes, seconds, days, ago, in, relative-type, simplifications.

@harsh9200
Author

> See for example how the YAML file defines september: - sept and the JSON file defines "september": ["september", "sep"], and the Python file has all three.

Yeah, that is why I thought of modifying the JSON file.

@Gallaecio
Member

JSON files are not meant to be modified by Dateparser contributors, they come from CLDR and are meant to only contain their data. When we want to extend their data, we use the YAML files to do that.

The JSON files should only be modified when updating data from the CLDR.

@harsh9200
Author

So what should I do now?

Extend all the YAML files the same way as the JSON files, adding all the missing data?

@harsh9200 harsh9200 closed this Mar 30, 2020
@harsh9200 harsh9200 reopened this Mar 30, 2020
@harsh9200
Author

Closed by mistake 😅

@Gallaecio
Member

Reverting the change in the JSON files, applying it to the YAML files (same change, YAML syntax) and running the script I mentioned should generate the same Python file changes, and everything should work the same.

@harsh9200
Author

Does that need to be done for all the JSON files? Afterwards, JSON files present in cldr_language_data would be of no use?

@jgtimestuff

This is where I was quite confused about the CLDR data... They have next Tuesday in multiple languages under the relative-type keys, and I was expecting we would use those terms across languages, but I didn't see how those terms would make it through to the generated .py language files. The write_all_data or get_cldr scripts are selective - I think something needs to change in them to pull the relative-type entries out and into the .py versions?

@jgtimestuff

https://github.com/scrapinghub/dateparser/blob/master/scripts/get_cldr_data.py is supposed to fetch the yaml files from this path:
cldr_dates_full_dir = "../raw_data/cldr_dates_full/main/"
which doesn't exist in dateparser's structure... but I think it was meant to be a copy of the actual cldr repository at https://github.com/unicode-cldr/cldr-dates-full/tree/master/main

Of course, the unicode-cldr repository has language-specific dateFields.json versions that could be copied over in full from - https://github.com/unicode-cldr/cldr-dates-full/blob/master/main/en/dateFields.json

The dateFields versions if brought over 'might' replace the current versions in dateparser and if the write_all_data is run... I think it should update all .py files with the current data from those jsons?

Oddly, I noticed something particularly weird about the various versions:

unicode-cldr/cldr-dates-full/blob/master/main/ru/dateFields.json

"tue": {
    "relative-type--1": "в прошлый вторник",
    "relative-type-0": "в этот вторник",
    "relative-type-1": "в следующий вторник",
    "relativeTime-type-future": {
        "relativeTimePattern-count-many": "через {0} вторников",
        "relativeTimePattern-count-other": "через {0} вторника"
    }
},

dateparser_data/cldr_language_data
"tuesday": [
"вторник",
"вт"
],

dateparser_data/supplementary_language_data/date_translation_data/ru.yaml

only has wed, fri, sat, sun?

So, the ru.yaml in dateparser has the least amount of data to the point of missing a few days of the week.

@jgtimestuff

jgtimestuff commented Apr 1, 2020

I took all the dateFields.json files from the cldr-dates-full-master\cldr-dates-full-master\main directories for each language and copied them to my local copy of dateparser in the subdirectories: dateparser_data\cldr_language_data\date_translation_data. Problem is that I've done this without switching or creating a new branch on git locally... so it's showing on master here.

I'm assuming, and of course I could be wrong, that having these files available to execute the write_all_data.py would result in all the relative-time data being updated in all the .py language files. Then it would be necessary to re-work the logic for next Tuesday etc... to incorporate the relative-type--1, relative-type-0, relative-type-1 variants of each weekday.

Am I correct in this assumption and if so, I then assume the write_all_data would have to be run locally then upload all the changes to the [language].py files that are created afterward.

2020-04-01 07:04 AM 27,034 af-NA.json
2020-04-01 07:04 AM 27,004 af.json
2020-04-01 07:04 AM 20,645 agq.json
2020-04-01 07:04 AM 20,535 ak.json
2020-04-01 07:04 AM 31,259 am.json
2020-04-01 07:04 AM 57,237 ar-AE.json
2020-04-01 07:04 AM 57,261 ar-BH.json
2020-04-01 07:04 AM 57,261 ar-DJ.json
2020-04-01 07:04 AM 57,261 ar-DZ.json
2020-04-01 07:04 AM 57,261 ar-EG.json
2020-04-01 07:04 AM 57,261 ar-EH.json
2020-04-01 07:04 AM 57,261 ar-ER.json
2020-04-01 07:04 AM 57,261 ar-IL.json, etc...

@jgtimestuff

Apologies again... Looking over write_complete_data.py, it seems that it depends on the utils.py and order_languages.py to do some work with cldr - but they need a raw_data directory somewhere in the process. So, creating the json files in the date_translation_data directory doesn't help because it still wants to go back and fetch them from git or this 'raw_data' directory somewhere.

@jgtimestuff

Okay, offline I've managed to get all the cldr dateFields.json files rewritten over to language.py files using write_complete_data.py by forcing the open statements to read binary - was getting lots of cp1252 can't decode errors prior to that.
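
As an aside on the cp1252 errors: they usually appear when `open()` falls back to the platform's default encoding, as it does on Windows. Passing the encoding explicitly is an alternative to reading in binary mode. A minimal sketch, assuming the CLDR files are UTF-8; `load_json_utf8` is a hypothetical helper and not part of the dateparser scripts.

```python
import json
import os
import tempfile


def load_json_utf8(path):
    """Read a JSON file as UTF-8 regardless of the platform default."""
    with open(path, encoding="utf-8") as handle:
        return json.load(handle)


sample = {"tue": {"relative-type-1": "в следующий вторник"}}

# Round-trip through a temporary file to show the Cyrillic text survives.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "dateFields.json")
    with open(path, "w", encoding="utf-8") as handle:
        json.dump(sample, handle, ensure_ascii=False)
    loaded = load_json_utf8(path)

print(loaded == sample)  # True
```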

So, for example my ru.py has these options for Sunday:
"sun": {
    "relative-type--1": "в прошлое воскресенье",
    "relative-type-0": "в это воскресенье",
    "relative-type-1": "в следующее воскресенье",
    "relativeTime-type-future": {
        "relativeTimePattern-count-one": "через {0} воскресенье",
        "relativeTimePattern-count-few": "через {0} воскресенья",
        "relativeTimePattern-count-many": "через {0} воскресений",
        "relativeTimePattern-count-other": "через {0} воскресенья"
    },
    "relativeTime-type-past": {
        "relativeTimePattern-count-one": "{0} воскресенье назад",
        "relativeTimePattern-count-few": "{0} воскресенья назад",
        "relativeTimePattern-count-many": "{0} воскресений назад",
        "relativeTimePattern-count-other": "{0} воскресенья назад"
    }
},
"sun-short": {
    "relative-type--1": "в прош. вс.",
    "relative-type-0": "в это вс.",
    "relative-type-1": "в след. вс.",
    "relativeTime-type-future": {
        "relativeTimePattern-count-one": "через {0} вс.",
        "relativeTimePattern-count-few": "через {0} вс.",
        "relativeTimePattern-count-many": "через {0} вс.",
        "relativeTimePattern-count-other": "через {0} вс."
    },
    "relativeTime-type-past": {
        "relativeTimePattern-count-one": "{0} вс. назад",
        "relativeTimePattern-count-few": "{0} вс. назад",
        "relativeTimePattern-count-many": "{0} вс. назад",
        "relativeTimePattern-count-other": "{0} вс. назад"
    }
},
"sun-narrow": {
    "relative-type--1": "в прош. вс.",
    "relative-type-0": "в это вс.",
    "relative-type-1": "в след. вс.",
    "relativeTime-type-future": {
        "relativeTimePattern-count-one": "+{0} вс.",
        "relativeTimePattern-count-few": "+{0} вс.",
        "relativeTimePattern-count-many": "+{0} вс.",
        "relativeTimePattern-count-other": "+{0} вс."
    },
    "relativeTime-type-past": {
        "relativeTimePattern-count-one": "-{0} вс.",
        "relativeTimePattern-count-few": "-{0} вс.",
        "relativeTimePattern-count-many": "-{0} вс.",
        "relativeTimePattern-count-other": "-{0} вс."
    }
},

which matches the English en.py file:

                "sun": {
                    "relative-type--1": "last Sunday",
                    "relative-type-0": "this Sunday",
                    "relative-type-1": "next Sunday",
                    "relativeTime-type-future": {
                        "relativeTimePattern-count-one": "in {0} Sunday",
                        "relativeTimePattern-count-other": "in {0} Sundays"
                    },
                    "relativeTime-type-past": {
                        "relativeTimePattern-count-one": "{0} Sunday ago",
                        "relativeTimePattern-count-other": "{0} Sundays ago"
                    }
                },
                "sun-short": {
                    "relative-type--1": "last Sun.",
                    "relative-type-0": "this Sun.",
                    "relative-type-1": "next Sun.",
                    "relativeTime-type-future": {
                        "relativeTimePattern-count-one": "in {0} Sun.",
                        "relativeTimePattern-count-other": "in {0} Sun."
                    },
                    "relativeTime-type-past": {
                        "relativeTimePattern-count-one": "{0} Sun. ago",
                        "relativeTimePattern-count-other": "{0} Sun. ago"
                    }
                },
                "sun-narrow": {
                    "relative-type--1": "last Su",
                    "relative-type-0": "this Su",
                    "relative-type-1": "next Su",
                    "relativeTime-type-future": {
                        "relativeTimePattern-count-one": "in {0} Su",
                        "relativeTimePattern-count-other": "in {0} Su"
                    },
                    "relativeTime-type-past": {
                        "relativeTimePattern-count-one": "{0} Su ago",
                        "relativeTimePattern-count-other": "{0} Su ago"
                    }
                },

So, theoretically... we could tap into all those variations for past, present, future - right?
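
One way to picture "tapping into" these variations is to turn the `relative-type-N` entries into a phrase lookup. The sketch below uses a trimmed copy of the English data shown above. `build_relative_lookup` is a hypothetical helper and does not reflect how dateparser's generated files are actually consumed.

```python
import re

# Trimmed copy of the "sun" entries from the en.py excerpt above.
EN_FIELDS = {
    "sun": {
        "relative-type--1": "last Sunday",
        "relative-type-0": "this Sunday",
        "relative-type-1": "next Sunday",
        "relativeTime-type-future": {
            "relativeTimePattern-count-one": "in {0} Sunday",
        },
    },
    "sun-short": {
        "relative-type--1": "last Sun.",
        "relative-type-0": "this Sun.",
        "relative-type-1": "next Sun.",
    },
}

RELATIVE_KEY = re.compile(r'^relative-type-(-?\d+)$')


def build_relative_lookup(fields):
    """Map each lower-cased phrase to (weekday field, week offset)."""
    lookup = {}
    for field, entries in fields.items():
        base_field = field.split("-")[0]  # "sun-short" becomes "sun"
        for key, phrase in entries.items():
            match = RELATIVE_KEY.match(key)
            if match and isinstance(phrase, str):
                lookup[phrase.lower()] = (base_field, int(match.group(1)))
    return lookup


lookup = build_relative_lookup(EN_FIELDS)
print(lookup["next sunday"])  # ('sun', 1)
print(lookup["last sun."])    # ('sun', -1)
```

The same function would accept the Russian entries unchanged, since only the keys are inspected.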

@jgtimestuff

I see the structure I ended up with doesn't match the original .py sets at all... sigh.

@harsh9200
Author

> I see the structure I ended up with doesn't match the original .py sets at all... sigh.

We can extend the current structure in a way that all works well with dateFields.json.

Thoughts?

@Gallaecio
Member

Maybe open a separate pull request just for updating the CLDR data, and making any necessary changes to the corresponding scripts?

@jgtimestuff

Going back in time a bit here, but what's the difference between Tuesday and Next Tuesday? If we accept that a text talking about a given weekday has a 50/50 chance of talking about past or future because present wouldn't use the weekday. On Tuesday, I will or I did - do something are really the only choices.
If we then consider that past tense is more likely to say 'last Tuesday' than future tense to say 'next Tuesday', is it safer to assume that Tuesday without a last or next refers to future more often than not?

In other words, do we really need to add code specifically for 'next Tuesday' if it would most likely always return the same date as 'Tuesday' by itself?

The recent code changes could still be used to force the Tuesday from this week into the future in the output, so that dateparser.parse('tuesday'), when the current day is Wednesday through Sunday, would NOT return the date for Tuesday of this week.

i.e. if current.weekday() is >= _weekday then we want the future and not the one from the current week. We wouldn't normally say on Wednesday, when it is Wednesday - or on Tuesday when it's Thursday and be referring to the Tuesday that has passed.

@jgtimestuff

How do these look for results given today is Thursday 2020-04-09?

import dateparser
dateparser.parse('next friday')
datetime.datetime(2020, 4, 17, 9, 47, 24, 953166)
dateparser.parse('next saturday')
datetime.datetime(2020, 4, 18, 9, 47, 34, 107875)
dateparser.parse('next sunday')
datetime.datetime(2020, 4, 12, 9, 47, 41, 457723)
dateparser.parse('next monday')
datetime.datetime(2020, 4, 13, 9, 47, 47, 542797)
dateparser.parse('next tuesday')
datetime.datetime(2020, 4, 14, 9, 47, 56, 140951)
dateparser.parse('next wednesday')
datetime.datetime(2020, 4, 15, 9, 48, 3, 290301)
dateparser.parse('next thursday')
datetime.datetime(2020, 4, 16, 9, 48, 10, 112532)
dateparser.parse('thursday')
datetime.datetime(2020, 4, 16, 9, 48, 26, 74853)
dateparser.parse('friday')
datetime.datetime(2020, 4, 10, 9, 48, 33, 994293)
dateparser.parse('saturday')
datetime.datetime(2020, 4, 11, 9, 48, 41, 72212)
dateparser.parse('sunday')
datetime.datetime(2020, 4, 12, 9, 48, 47, 835633)
dateparser.parse('monday')
datetime.datetime(2020, 4, 13, 9, 48, 55, 657067)

I put in a rule that 'next' isn't needed for days of the week that have already passed in the current week or that are the current day. With today being Thursday, Monday through Thursday automatically fall in next week, so whether 'next' is present or not, it returns the date of the upcoming weekday.

For days later in the week than today, 'next' only pushes the date a week further when the weekday is fewer than 3 days away, so Friday and Saturday without 'next' give the near-future dates, while next Friday and next Saturday jump a week. Sunday is 3 days away, so next Sunday and Sunday return the same date.

# Assumed module-level imports (not shown in the original snippet):
# import re, import calendar as cal,
# from dateutil.relativedelta import relativedelta
def _parse_date(self, date_string, prefer_dates_from):
    _weekday = self.get_weekday_data(date_string)

    if not self._are_all_words_units(date_string) and not _weekday:
        return None, None

    kwargs = self.get_kwargs(date_string)

    if not kwargs and not _weekday:
        return None, None

    period = 'day'
    if _weekday:
        newday = self.now.today()
        # calendar numbering: Monday is 0, Sunday is 6
        dat = getattr(cal, _weekday.upper())
        if dat <= newday.weekday():
            # weekday is today or has passed this week: use next week's
            nd = newday.weekday() - dat
            td = relativedelta(weeks=+1, days=-nd)
        else:
            # weekday is still ahead in the current week
            td = relativedelta(weekday=dat)
    else:
        if 'days' not in kwargs:
            for k in ['weeks', 'months', 'years']:
                if k in kwargs:
                    period = k[:-1]
                    break

        td = relativedelta(**kwargs)

    if (
        re.search(r'\bin\b', date_string) or
        re.search(r'\bnext\b', date_string) or
        re.search(r'\bafter\b', date_string) or
        ('future' in prefer_dates_from and
         not re.search(r'\bago\b', date_string))
    ):
        # dat and newday only exist when a weekday was found, so guard
        # on _weekday to avoid a NameError for inputs like 'in 2 days'
        if _weekday and dat > newday.weekday() and dat - newday.weekday() < 3:
            nd = dat - newday.weekday()
            td = relativedelta(weeks=+1, days=+nd)
        date = self.now + td
    else:
        date = self.now + td
    return date, period

Thoughts?
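
For anyone who wants to check the rule against the results listed above without patching dateparser, here is a standard-library restatement. It is a sketch of the rule as described in this comment, not the patch itself. `resolve_weekday` is a hypothetical name, and it only understands English phrases of the form "[next] <weekday>".

```python
import datetime

WEEKDAYS = ['monday', 'tuesday', 'wednesday', 'thursday',
            'friday', 'saturday', 'sunday']


def resolve_weekday(today, phrase):
    """Resolve '<weekday>' or 'next <weekday>' relative to `today`."""
    words = phrase.lower().split()
    has_next = words[0] == 'next'
    target = WEEKDAYS.index(words[-1])
    diff = target - today.weekday()
    if diff <= 0:
        # Today or already passed this week: always next week's occurrence.
        days_ahead = diff + 7
    else:
        days_ahead = diff
        # 'next' only adds a week when the weekday is 1 or 2 days away.
        if has_next and diff < 3:
            days_ahead += 7
    return today + datetime.timedelta(days=days_ahead)


thursday = datetime.date(2020, 4, 9)
for phrase in ('next friday', 'next sunday', 'thursday', 'friday'):
    print(phrase, resolve_weekday(thursday, phrase))
# next friday 2020-04-17
# next sunday 2020-04-12
# thursday 2020-04-16
# friday 2020-04-10
```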

@Gallaecio
Member

> If we then consider that past tense is more likely to say 'last Tuesday' than future tense to say 'next Tuesday', is it safer to assume that Tuesday without a last or next refers to future more often than not?

This may be true for natural language, but I’m not so sure about computing systems. I think I’ve seen websites using the name of the day of the week to refer to past events within the last 6 days (then they use ‘1 week ago’ and similar for older events).

@jgtimestuff

I tried a couple more additions like making 'Saturday' recognize 'weekend' and noticed that the 'first day of' a month is quirky because I wrote a sentence to parse that included the line... but I suppose that should indeed be another PR as well.

search_dates("this weekend will mark the 1st of may")
1 may <dateparser.conf.Settings object at 0x03E83F90>
1 may <dateparser.conf.Settings object at 0x03E83F90>
[('weekend', datetime.datetime(2020, 4, 18, 10, 15, 41, 404986)), ('the 1st of may', datetime.datetime(2020, 5, 1, 0, 0))] - it works for “1st” but…

search_dates("this weekend will mark the first of may")
may <dateparser.conf.Settings object at 0x03E83F90>
may <dateparser.conf.Settings object at 0x03E83F90>
[('weekend', datetime.datetime(2020, 4, 18, 10, 17, 11, 190944)), ('of may', datetime.datetime(2020, 5, 18, 0, 0))] provides the weekend date (i.e. the upcoming 18th) and forwards that by one month!

@harsh9200
Author

harsh9200 commented Apr 28, 2020

> Reverting the change in the JSON files, applying it to the YAML files (same change, YAML syntax) and running the script I mentioned should generate the same Python file changes, and everything should work the same.

Changing the following code

"relative-type-regex": {
        "in \\1 year": [
            "in {0} year",
            "in {0} years",
            "in {0} yr"
        ]

into YAML results in the following

relative-type-regex:
    in \\1 year: 
        - string: 'in {0} year'
        - string: 'in {0} years'
        - string: 'in {0} yr'

But the write_complete_data.py script raises an error.

So to fix this, can I edit the script? And after this is done, what is the use of the JSON files present in cldr_language_data?

@Gallaecio
Member

> But write_complete_data.py script raises an Error.
>
> So to fix this can I edit the script?

Yes. Although I would not include the script change as part of this pull request, I would create a separate one for that.

> And After this is done what is the use of JSON files present in cldr_language_data ?

They come from Unicode CLDR, as described in https://dateparser.readthedocs.io/en/latest/contributing.html#guidelines-for-editing-translation-data

We update them automatically from Unicode CLDR.

@bmilovanovic

Hi, guys, is there some progress on this? Maybe it's hard but it's very very valuable for us and I would appreciate very much if this is solved.


Development

Successfully merging this pull request may close these issues.

Not supporting Last Sunday and after 15 days
dateparser not able to parse things like next tuesday.

4 participants