I'm trying to crawl a company website using the seed www.example.com, and Heritrix either generates no WARC file or only empty WARCs. When I do get something back in the WARC file, it looks like the excerpt at the end of this post. Is there something I should add to the beans file to make the crawl work? I'm using the default beans file from the latest release. Why does the crawl return only the DNS records? Thanks for any assistance.
WARC/1.0 20220527163857
Replies: 1 comment
I think some more information is required. What does your crawl log look like when it's running? Could the site have a restrictive robots.txt? Can you connect with something like curl?
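If it helps, the robots.txt question can be checked by hand. This is only a rough sketch using Python's standard `urllib.robotparser`, not anything Heritrix provides: the helper name `is_allowed` and the `heritrix` user-agent token are assumptions made for illustration, so substitute whatever user agent your crawl job is actually configured to send.

```python
import urllib.request
from urllib.robotparser import RobotFileParser


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if robots_txt permits user_agent to fetch url.

    Hypothetical helper for illustration; it only applies the
    standard-library robots.txt rules to text you pass in.
    """
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    # Assumed values: replace with your real seed and user-agent token.
    seed = "http://www.example.com/"
    agent = "heritrix"
    # Fetch the site's robots.txt (raises an error if the site is unreachable
    # or has no robots.txt, which is itself useful to know).
    with urllib.request.urlopen(seed + "robots.txt", timeout=10) as resp:
        body = resp.read().decode("utf-8", errors="replace")
    print("allowed" if is_allowed(body, agent, seed) else "blocked by robots.txt")
```

If this reports that the seed is blocked, a restrictive robots.txt would be one plausible reason the WARC contains only the DNS record. If the request itself fails, the problem is more likely connectivity than crawl configuration.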