Data URIs seem to trip up the Extractor module. Excerpt from log follows with truncated URI:
Jun 18, 2021 6:47:08 PM org.archive.modules.extractor.ExtractorSitemap recordOutlink
WARNING: URIException when recording outlink http://lindaskoli.is/--18K truncated-- (in thread 'ToeThread #54: http://lindaskoli.is/post-sitemap.xml'; in processor 'extractorSitemap')
org.apache.commons.httpclient.URIException: URI length > 2083: http://lindaskoli.is/-- 18K truncated again--
at org.archive.url.UsableURIFactory.fixup(UsableURIFactory.java:357)
at org.archive.url.UsableURIFactory.create(UsableURIFactory.java:301)
at org.archive.net.UURIFactory.getInstance(UURIFactory.java:55)
at org.archive.modules.extractor.Extractor.addRelativeToBase(Extractor.java:190)
at org.archive.modules.extractor.ExtractorSitemap.recordOutlink(ExtractorSitemap.java:163)
at org.archive.modules.extractor.ExtractorSitemap.innerExtract(ExtractorSitemap.java:105)
at org.archive.modules.extractor.ContentExtractor.extract(ContentExtractor.java:37)
at org.archive.modules.extractor.Extractor.innerProcess(Extractor.java:102)
at org.archive.modules.Processor.innerProcessResult(Processor.java:175)
at org.archive.modules.Processor.process(Processor.java:142)
at org.archive.modules.ProcessorChain.process(ProcessorChain.java:131)
at org.archive.crawler.framework.ToeThread.run(ToeThread.java:148)
These URIs can be quite large and, as seen above, each one is written to the log twice: once in the warning and once in the exception message. Whatever other remedies are adopted, it would be best to modify the logging code to truncate the offending URI and avoid this log spam. The example above was only about 18K; the log of my last large-scale crawl contains much larger ones.
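A minimal sketch of the kind of truncation meant here. The class name, method name, and the 200-character cap are all hypothetical and not part of Heritrix; the exception message would need the same treatment, since it repeats the URI.

```java
final class LogUriTruncator {
    // Hypothetical cap on how much of a URI is written to the log.
    static final int MAX_LOGGED_URI_LENGTH = 200;

    /** Returns a short URI unchanged, otherwise its head plus a note on how much was cut. */
    static String truncateForLog(CharSequence uri) {
        if (uri.length() <= MAX_LOGGED_URI_LENGTH) {
            return uri.toString();
        }
        int omitted = uri.length() - MAX_LOGGED_URI_LENGTH;
        // Only the first MAX_LOGGED_URI_LENGTH characters are copied into the message.
        return uri.subSequence(0, MAX_LOGGED_URI_LENGTH).toString()
                + "... [" + omitted + " more chars truncated]";
    }

    public static void main(String[] args) {
        // Build an oversized URI similar in size to the one in the report.
        StringBuilder big = new StringBuilder("http://example.com/data:image/png;base64,");
        for (int i = 0; i < 18000; i++) {
            big.append('A');
        }
        System.out.println("WARNING: URIException when recording outlink "
                + truncateForLog(big));
    }
}
```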
More useful still would be for the Extractor class to detect data URIs and either ignore them or pass them to a service that decodes them and performs link extraction if the MIME type warrants it. That said, I think data URIs are rarely (if ever) used for MIME types that might contain further URIs; my only experience of them has been with images.
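A rough sketch of what such detection could look like, assuming nothing about the existing Extractor code. `DataUriFilter` and its methods are hypothetical names, and the list of link-bearing media types is only an illustrative guess. The `text/plain` default for an omitted media type follows RFC 2397.

```java
import java.util.Locale;

final class DataUriFilter {
    private static final String PREFIX = "data:";

    private static int skipWhitespace(CharSequence s) {
        int i = 0;
        while (i < s.length() && Character.isWhitespace(s.charAt(i))) {
            i++;
        }
        return i;
    }

    /** True if the link starts with the "data:" scheme (case-insensitive, leading whitespace ignored). */
    static boolean isDataUri(CharSequence link) {
        int start = skipWhitespace(link);
        if (link.length() - start < PREFIX.length()) {
            return false;
        }
        for (int j = 0; j < PREFIX.length(); j++) {
            if (Character.toLowerCase(link.charAt(start + j)) != PREFIX.charAt(j)) {
                return false;
            }
        }
        return true;
    }

    /** Reads only the header before the first ';' or ',' so the payload is never copied. */
    static String mediaType(CharSequence dataUri) {
        if (!isDataUri(dataUri)) {
            throw new IllegalArgumentException("not a data URI");
        }
        int start = skipWhitespace(dataUri) + PREFIX.length();
        int end = start;
        while (end < dataUri.length()) {
            char c = dataUri.charAt(end);
            if (c == ';' || c == ',') {
                break;
            }
            end++;
        }
        if (end == start) {
            return "text/plain"; // default when the media type is omitted
        }
        return dataUri.subSequence(start, end).toString().toLowerCase(Locale.ROOT);
    }

    /** Illustrative policy only: media types that could plausibly embed further URIs. */
    static boolean mightContainLinks(String mediaType) {
        return mediaType.equals("text/html")
                || mediaType.equals("text/css")
                || mediaType.equals("image/svg+xml");
    }
}
```

With a check like this, an image payload such as `data:image/png;base64,...` would simply be dropped, and only the few text-like types would be candidates for decoding.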
Some of this logic may belong in the underlying URL libraries in webarchive-commons.
We probably need to make all the extractors use the outlink helper methods in the Extractor base classes consistently, as a number of them call curi.getOutlinks().add(link) directly. Then we can change the helper methods to ignore data URIs.
It might also be nice to change the outlink helpers to take a CharSequence instead of a String, so that where possible they can filter out large URIs before they get copied to the heap as Strings.
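To illustrate the CharSequence idea, here is a small standalone sketch. `OutlinkPrefilter` and `offer` are made-up names, not the existing helper methods; the 2083 limit is taken from the exception message in the log above. The point is that the length check runs on the CharSequence view, and `toString()` is only called for candidates that pass.

```java
import java.nio.CharBuffer;
import java.util.ArrayList;
import java.util.List;

final class OutlinkPrefilter {
    // Limit reported in the log: "URI length > 2083".
    static final int MAX_URI_LENGTH = 2083;

    private final List<String> accepted = new ArrayList<>();

    /** Accepts a candidate link, copying it to a String only if it is short enough. */
    boolean offer(CharSequence candidate) {
        if (candidate.length() > MAX_URI_LENGTH) {
            return false; // rejected without ever materializing a String
        }
        accepted.add(candidate.toString());
        return true;
    }

    List<String> accepted() {
        return accepted;
    }

    public static void main(String[] args) {
        // A large character buffer standing in for fetched content.
        char[] content = new char[20000];
        java.util.Arrays.fill(content, 'A');

        OutlinkPrefilter filter = new OutlinkPrefilter();
        // CharBuffer.wrap gives a view over the array; no characters are copied here.
        boolean kept = filter.offer(CharBuffer.wrap(content, 0, 18000));
        System.out.println("oversized candidate kept: " + kept);
        System.out.println("accepted so far: " + filter.accepted().size());
    }
}
```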