Question re: cloudfront.net #526

carj · 2022-06-01T09:17:30Z

I'm trying to crawl a company website using the seed www.example.com, but Heritrix either generates no WARC file or produces only empty WARCs.
The seed list is a single line containing the company website.

If I do get something back in the WARC file, it looks like the record below. Is there something I should add to the beans file to make the crawl work? I'm using the default beans file from the latest release.

Why does the crawl just return the DNS records?

Thanks for any assistance.

WARC/1.0
WARC-Type: response
WARC-Target-URI: dns:www.example.com
WARC-Date: 2022-05-27T16:38:57Z
WARC-IP-Address: 172.30.0.2
WARC-Record-ID: urn:uuid:e2196c71-7dec-4163-94d4-bb64934888a6
Content-Type: text/dns
Content-Length: 224

20220527163857
d1a4lrim8ynrpr.cloudfront.net. 60 IN A 13.249.39.29
d1a4lrim8ynrpr.cloudfront.net. 60 IN A 13.249.39.116
d1a4lrim8ynrpr.cloudfront.net. 60 IN A 13.249.39.81
d1a4lrim8ynrpr.cloudfront.net. 60 IN A 13.249.39.52
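
The record above has a `dns:` target URI and a `text/dns` content type, so it is a record of the hostname lookup rather than of an HTTP response. A WARC that contains only records like this suggests that the page fetch itself failed or was never attempted. Below is a minimal Python sketch for checking which kind of record a WARC entry is. The helper names `parse_warc_headers` and `target_scheme` are hypothetical, and the parser is deliberately simplified: it assumes one uncompressed record passed in as text.

```python
from urllib.parse import urlsplit


def parse_warc_headers(record: str) -> dict:
    """Return the named header fields of a single WARC record as a dict."""
    headers = {}
    lines = record.strip().splitlines()
    # lines[0] is the version line ("WARC/1.0"); headers end at the first blank line.
    for line in lines[1:]:
        if not line.strip():
            break
        # Split on the first colon only, so values such as
        # "dns:www.example.com" or timestamps keep their own colons.
        name, _, value = line.partition(":")
        headers[name.strip()] = value.strip()
    return headers


def target_scheme(record: str) -> str:
    """Return the URI scheme of the record's WARC-Target-URI ('' if absent)."""
    target = parse_warc_headers(record).get("WARC-Target-URI", "")
    return urlsplit(target).scheme


# The record quoted in the question, shortened to one answer line.
SAMPLE = """WARC/1.0
WARC-Type: response
WARC-Target-URI: dns:www.example.com
WARC-Date: 2022-05-27T16:38:57Z
WARC-IP-Address: 172.30.0.2
Content-Type: text/dns
Content-Length: 224

20220527163857
d1a4lrim8ynrpr.cloudfront.net. 60 IN A 13.249.39.29
"""

print(target_scheme(SAMPLE))  # dns
```

If every record in the file reports `dns` here and none reports `http` or `https`, the next place to look is the crawl log, which should show what happened to the seed URL after the lookup.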

Answered by NGTmeaty


NGTmeaty · 2022-06-06T04:19:20Z

I think some more information is required. What does your crawl log look like when it's running? Could the site have a restrictive robots.txt? Can you connect with something like curl?
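
The robots.txt and connectivity checks can also be done from Python's standard library. Here is a rough sketch. The user-agent token `heritrix` and the sample rule sets are assumptions for illustration, so substitute the user agent configured in your own beans file. `fetch_robots_txt` is a hypothetical helper that plays the same role as a manual `curl` of `/robots.txt`.

```python
from urllib.parse import urljoin
from urllib.request import urlopen
from urllib.robotparser import RobotFileParser


def fetch_robots_txt(base_url: str, timeout: float = 10.0) -> str:
    """Download /robots.txt for a site; raises on connection or HTTP errors."""
    with urlopen(urljoin(base_url, "/robots.txt"), timeout=timeout) as resp:
        return resp.read().decode("utf-8", errors="replace")


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Report whether the given robots.txt text lets user_agent fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# Illustrative rule sets, not taken from the site in the question.
RESTRICTIVE = "User-agent: *\nDisallow: /\n"
PERMISSIVE = "User-agent: *\nDisallow: /private/\n"

seed = "https://www.example.com/"
print(is_allowed(RESTRICTIVE, "heritrix", seed))  # False
print(is_allowed(PERMISSIVE, "heritrix", seed))   # True
```

To test a live site, pass the result of `fetch_robots_txt("https://www.example.com")` as the first argument to `is_allowed`. If that call raises an error, the connection problem is a more likely cause of the empty WARCs than robots.txt.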
