In the recent few weeks, the NYT scraper no longer works as it appears that they are now blocking requests with the default user agent of "python-requests"
This can be changed by editing ~/parsers/baseparser.py and adding code as follows to the grab_url section. This is setup here to randomly rotate among 10 different user agents. The section between opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(cj)) and retry = False is what has been added. Of course, you can add whatever user agents you want to here.
def grab_url(url, max_depth=5, opener=None):
if opener is None:
cj = cookielib.CookieJar()
opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(cj))
user_agent_list = [
'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3987.149 Safari/537.36',
'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/86.0.4240.198 Safari/537.36 OPR/72.0.3815.400',
'Mozilla/5.0 (Macintosh; Intel Mac OS X 10.14; rv:83.0) Gecko/20100101 Firefox/83.0',
'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_2) AppleWebKit/601.3.9 (KHTML, like Gecko) Version/9.0.2 Safari/601.3.9',
'Googlebot/2.1 (+http://www.google.com/bot.html)',
'UCWEB/2.0 (compatible; Googlebot/2.1; +google.com/bot.html)',
'facebookexternalhit/1.1 (+http://www.facebook.com/externalhit_uatext.php)',
'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko; Google Page Speed Insights) Chrome/41.0.2272.118 Safari/537.36',
'Mozilla/5.0 (Macintosh; Intel Mac OS X 10.15; rv:77.0) Gecko/20100101 Firefox/77.0',
'Mozilla/5.0 (X11; Linux x86_64)',
]
for i in range(1,10):
user_agent = random.choice(user_agent_list)
opener.addheaders = [('User-Agent', user_agent)]
retry = False
try:
text = opener.open(url, timeout=5).read()
if '<title>NY Times Advertisement</title>' in text:
retry = True
except socket.timeout:
retry = True
if retry:
if max_depth == 0:
raise Exception('Too many attempts to download %s' % url)
time.sleep(0.5)
return grab_url(url, max_depth-1, opener)
return text
I've updated this on my fork, but I have several other updates that are somewhat specific that others may or may not want, so I was hesitant to submit the pull request. I wanted to document it here though in case others were wondering why the system isn't catching new articles for NYT.
Thanks,
Vishnu
In the recent few weeks, the NYT scraper no longer works as it appears that they are now blocking requests with the default user agent of "python-requests"
This can be changed by editing ~/parsers/baseparser.py and adding code as follows to the grab_url section. This is setup here to randomly rotate among 10 different user agents. The section between
opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(cj))andretry = Falseis what has been added. Of course, you can add whatever user agents you want to here.I've updated this on my fork, but I have several other updates that are somewhat specific that others may or may not want, so I was hesitant to submit the pull request. I wanted to document it here though in case others were wondering why the system isn't catching new articles for NYT.
Thanks,
Vishnu