Update for NYT / User Agent change

Over the past few weeks the NYT scraper has stopped working. It appears that NYT is now blocking requests that carry the default Python user agent (grab_url uses urllib2, whose default is "Python-urllib/x.y").

This can be fixed by editing ~/parsers/baseparser.py and adding the code below to the grab_url function. It is set up to pick one of 10 different user agents at random each time a new opener is created. The section between `opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(cj))` and `retry = False` is what has been added. Of course, you can put whatever user agents you want in the list.

```python
def grab_url(url, max_depth=5, opener=None):
    if opener is None:
        cj = cookielib.CookieJar()
        opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(cj))
        # NOTE: this addition needs `import random` at the top of baseparser.py.
        # Candidate User-Agent strings; edit this list as you see fit.
        user_agent_list = [
            'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3987.149 Safari/537.36',
            'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/86.0.4240.198 Safari/537.36 OPR/72.0.3815.400',
            'Mozilla/5.0 (Macintosh; Intel Mac OS X 10.14; rv:83.0) Gecko/20100101 Firefox/83.0',
            'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_2) AppleWebKit/601.3.9 (KHTML, like Gecko) Version/9.0.2 Safari/601.3.9',
            'Googlebot/2.1 (+http://www.google.com/bot.html)',
            'UCWEB/2.0 (compatible; Googlebot/2.1; +google.com/bot.html)',
            'facebookexternalhit/1.1 (+http://www.facebook.com/externalhit_uatext.php)',
            'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko; Google Page Speed Insights) Chrome/41.0.2272.118 Safari/537.36',
            'Mozilla/5.0 (Macintosh; Intel Mac OS X 10.15; rv:77.0) Gecko/20100101 Firefox/77.0',
            'Mozilla/5.0 (X11; Linux x86_64)',
        ]
        # A single random.choice is enough; looping over it only keeps the last pick.
        user_agent = random.choice(user_agent_list)
        # Replace the opener's default headers so the default user agent is not sent.
        opener.addheaders = [('User-Agent', user_agent)]
    retry = False
    try:
        text = opener.open(url, timeout=5).read()
        if '<title>NY Times Advertisement</title>' in text:
            retry = True
    except socket.timeout:
        retry = True
    if retry:
        if max_depth == 0:
            raise Exception('Too many attempts to download %s' % url)
        time.sleep(0.5)
        return grab_url(url, max_depth-1, opener)
    return text
```
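
To check the header selection on its own, without making any request to NYT, here is a minimal sketch of just the opener setup. The names `make_opener` and `USER_AGENTS` are made up for illustration and are not part of baseparser.py, and the `try`/`except` import is only there so the snippet also runs on Python 3, which the scraper itself does not assume.

```python
import random

try:  # Python 2, which is what baseparser.py is written for
    import cookielib
    import urllib2
except ImportError:  # Python 3 equivalents, aliased so the rest reads the same
    import http.cookiejar as cookielib
    import urllib.request as urllib2

# Hypothetical short list for illustration; use whatever agents you like.
USER_AGENTS = [
    'Mozilla/5.0 (Macintosh; Intel Mac OS X 10.14; rv:83.0) Gecko/20100101 Firefox/83.0',
    'Mozilla/5.0 (Macintosh; Intel Mac OS X 10.15; rv:77.0) Gecko/20100101 Firefox/77.0',
]


def make_opener(user_agents=USER_AGENTS):
    """Build a cookie-aware opener that sends one randomly chosen User-Agent."""
    cj = cookielib.CookieJar()
    opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(cj))
    # Assigning addheaders replaces the opener's default header list.
    opener.addheaders = [('User-Agent', random.choice(user_agents))]
    return opener


opener = make_opener()
print(opener.addheaders)  # shows which agent was picked for this opener
```

An opener built this way could be passed as the `opener` argument of `grab_url`, in which case the `if opener is None` branch is skipped and the chosen agent is reused across retries.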

I've updated this on my fork, but the fork also has several other changes that are somewhat specific and that others may or may not want, so I was hesitant to submit a pull request. I still wanted to document the fix here in case others were wondering why the system isn't catching new NYT articles.

Thanks,
Vishnu
