Is the "data" variable from CrawlURI thread-safe? #545

cgr71ii · 2023-02-07T20:29:16Z

Hi!

I'm using CLD2 to detect the language of the downloaded files, and I was wondering whether it is possible to avoid running CLD2 multiple times for a CrawlURI instance whose "via" URI has already been processed with CLD2. For this purpose, I've thought of using the "data" map, which is provided for storing data, but I'm not sure whether it's thread-safe. If it's not, I won't be able to use it the way I had in mind, because the "data" variable might be overwritten or corrupted by other threads when crawling with multiple threads.

The methods I had thought of using are (from http://builds.archive.org/javadoc/heritrix-3.2.0/org/archive/modules/CrawlURI.html):

getData
containsDataKey
getDataList

Something similar to:

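// Look up any language already stored for this CrawlURI
List<String> langs = curi.getDataList("langs");
String lang = "";

if (!langs.isEmpty()) {
  // Reuse the stored result instead of running CLD2 again
  lang = langs.get(0);
}
else {
  // Apply CLD2
  // ...
  lang = lang_from_cld2; // placeholder for the CLD2 result
  langs.add(lang);
}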
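
To make the idea easier to try outside a crawl, here is a minimal self-contained sketch of the same caching pattern. It does not use Heritrix at all: a plain HashMap stands in for the "data" map, the getDataList helper is my own stand-in (whether the real method creates the list on first access is an assumption), and detectLanguage is a hypothetical placeholder for the CLD2 call.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class LangCacheSketch {
    // Counts detector calls so the effect of caching is observable.
    static int detectorCalls = 0;

    // Hypothetical placeholder for CLD2; the real detector is not called here.
    static String detectLanguage(String text) {
        detectorCalls++;
        return text.contains("hola") ? "es" : "en";
    }

    // Stand-in for curi.getDataList(key): returns the list stored under key,
    // creating an empty one on first access (assumed behaviour).
    @SuppressWarnings("unchecked")
    static List<String> getDataList(Map<String, Object> data, String key) {
        return (List<String>) data.computeIfAbsent(key, k -> new ArrayList<String>());
    }

    // Returns the cached language if present, otherwise detects and stores it.
    static String languageFor(Map<String, Object> data, String text) {
        List<String> langs = getDataList(data, "langs");
        if (!langs.isEmpty()) {
            return langs.get(0);
        }
        String lang = detectLanguage(text);
        langs.add(lang);
        return lang;
    }

    public static void main(String[] args) {
        Map<String, Object> data = new HashMap<>();
        System.out.println(languageFor(data, "hola mundo"));
        System.out.println(languageFor(data, "hola mundo"));
        System.out.println("detector calls: " + detectorCalls);
    }
}
```

The second call returns the stored value without running the detector again. The sketch is single-threaded and a plain HashMap is not safe for concurrent modification, so it says nothing about whether the real "data" map can be shared across threads, which is what I'm asking about.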

Thank you!
