Data URIs seem to trip up the Extractor module. An excerpt from the log follows, with the data URI truncated:
Jun 18, 2021 6:47:08 PM org.archive.modules.extractor.ExtractorSitemap recordOutlink
WARNING: URIException when recording outlink http://lindaskoli.is/data:image/jpeg;base64,/9j/4AAQ--18K truncated-- (in thread 'ToeThread #54: http://lindaskoli.is/post-sitemap.xml'; in processor 'extractorSitemap')
org.apache.commons.httpclient.URIException: URI length > 2083: http://lindaskoli.is/data:image/jpeg;base64,/9j/4AAQ-- 18K truncated again--
at org.archive.url.UsableURIFactory.fixup(UsableURIFactory.java:357)
at org.archive.url.UsableURIFactory.create(UsableURIFactory.java:301)
at org.archive.net.UURIFactory.getInstance(UURIFactory.java:55)
at org.archive.modules.extractor.Extractor.addRelativeToBase(Extractor.java:190)
at org.archive.modules.extractor.ExtractorSitemap.recordOutlink(ExtractorSitemap.java:163)
at org.archive.modules.extractor.ExtractorSitemap.innerExtract(ExtractorSitemap.java:105)
at org.archive.modules.extractor.ContentExtractor.extract(ContentExtractor.java:37)
at org.archive.modules.extractor.Extractor.innerProcess(Extractor.java:102)
at org.archive.modules.Processor.innerProcessResult(Processor.java:175)
at org.archive.modules.Processor.process(Processor.java:142)
at org.archive.modules.ProcessorChain.process(ProcessorChain.java:131)
at org.archive.crawler.framework.ToeThread.run(ToeThread.java:148)
These can be quite large and, as seen above, are written to the log twice. In addition to any other remedies, it might be best to modify the logging code to truncate the offending URI so it doesn't spam the log. The example above was only about 18K; the log of my last large-scale crawl contains much larger examples.
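A minimal sketch of the kind of truncation the logging code could apply before a URI reaches the log. The helper name and the 200-character cap are illustrative assumptions, not existing Heritrix API:

```java
// Illustrative helper: cap a URI's length before it is interpolated into
// a log message. The class/method names and the 200-char limit are
// assumptions for the sketch, not part of any existing API.
public class LogTruncation {
    static final int MAX_LOGGED_URI = 200;

    static String truncateForLog(String uri) {
        if (uri.length() <= MAX_LOGGED_URI) {
            return uri;
        }
        return uri.substring(0, MAX_LOGGED_URI)
                + "... (" + (uri.length() - MAX_LOGGED_URI) + " chars truncated)";
    }
}
```

Applied at both logging sites, an 18K data URI would contribute roughly 200 characters per message instead of 18K twice.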
More specifically, it would be useful for the Extractor class to detect data URIs and either ignore them or pass them to a service that decodes them and performs link extraction if the media type warrants it. That said, I think data URIs are rarely (if ever) used for media types that might contain further URIs; my only experience of them has been images.
Some of this logic may belong in the underlying URL libraries in webarchive-commons.
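A sketch of what such detection-and-decode logic could look like. All names and the media-type heuristic are hypothetical; the point is that the raw link text (e.g. `data:image/jpeg;base64,...`) can be recognized before it is resolved against the base URL, and only payloads whose media type could plausibly contain links need decoding:

```java
import java.util.Base64;

// Hypothetical filter: class, method names, and the mayContainLinks()
// heuristic are illustrative only, not existing Heritrix/webarchive-commons API.
public class DataUriFilter {

    // Check the raw link text before resolving it against a base URL,
    // which is what produces the broken outlinks seen in the log.
    static boolean isDataUri(CharSequence link) {
        if (link.length() < 5) return false;
        return link.subSequence(0, 5).toString().equalsIgnoreCase("data:");
    }

    // Media types that could plausibly contain further URIs; images never do.
    static boolean mayContainLinks(String mediaType) {
        return mediaType.startsWith("text/")
                || mediaType.endsWith("/html")
                || mediaType.endsWith("+xml");
    }

    // Decode the payload of a data URI so a link extractor could run over it;
    // returns null when the URI is not worth decoding (e.g. an image).
    static byte[] decodeIfExtractable(String dataUri) {
        if (!isDataUri(dataUri)) return null;
        int comma = dataUri.indexOf(',');
        if (comma < 0) return null;
        String meta = dataUri.substring(5, comma);   // e.g. "image/jpeg;base64"
        int semi = meta.indexOf(';');
        String mediaType = semi >= 0 ? meta.substring(0, semi) : meta;
        if (mediaType.isEmpty()) mediaType = "text/plain"; // RFC 2397 default
        if (!mayContainLinks(mediaType)) return null;
        String payload = dataUri.substring(comma + 1);
        return meta.toLowerCase().endsWith(";base64")
                ? Base64.getDecoder().decode(payload)
                : payload.getBytes();
    }
}
```

With this shape, the JPEG case from the log is rejected at `mayContainLinks()` and never reaches URI construction at all.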
We probably need to make all the extractors use the outlink helper methods in the Extractor base classes consistently, as a number of them call curi.getOutlinks().add(link) directly. Then we can change the helper methods to ignore data URIs.
It might also be nice to change the outlink helpers to take a CharSequence instead of a String; that way, when possible, they can filter out large URIs before they get copied to the heap as Strings.
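A sketch of how such a CharSequence-based helper might look. The class and method names are illustrative, not the actual Heritrix base-class API; the 2083 limit mirrors the exception in the log above:

```java
// Illustrative outlink helper taking CharSequence, so oversized or data:
// links can be rejected before ever being materialized as heap Strings.
// Names are assumptions; the 2083 cap matches the URIException in the log.
public class OutlinkHelper {
    static final int MAX_URI_LENGTH = 2083;

    // Returns true when the candidate link should be recorded as an outlink.
    static boolean shouldRecord(CharSequence link) {
        if (link.length() > MAX_URI_LENGTH) {
            return false;   // too long to be a usable URI
        }
        if (link.length() >= 5
                && link.subSequence(0, 5).toString().equalsIgnoreCase("data:")) {
            return false;   // inline data, not a crawlable outlink
        }
        return true;
    }

    // Only pay for the full String copy once the link has passed the checks.
    static String materialize(CharSequence link) {
        return shouldRecord(link) ? link.toString() : null;
    }
}
```

An extractor scanning a buffer can hand the helper a view over the buffer (e.g. a CharBuffer slice) and never allocate a String for an 18K data URI at all.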