"RIS already open for ToeThread..." exception during https pages crawl over proxy #191

Closed
WI-IT opened this issue Oct 30, 2017 · 2 comments

@WI-IT
WI-IT commented Oct 30, 2017

When I try to crawl HTTPS pages over a proxy with Heritrix 3, I get the following exception:

java.io.IOException: RIS already open for ToeThread #5: https://www.XXX/robots.txt
    at org.archive.io.RecordingInputStream.open(RecordingInputStream.java:84)
    at org.archive.util.Recorder.inputWrap(Recorder.java:185)
    at org.archive.modules.fetcher.FetchHTTPRequest$RecordingHttpClientConnection.getSocketInputStream(FetchHTTPRequest.java:648)
    at org.apache.http.impl.BHttpConnectionBase.ensureOpen(BHttpConnectionBase.java:131)
    at org.apache.http.impl.DefaultBHttpClientConnection.sendRequestHeader(DefaultBHttpClientConnection.java:140)
    at org.apache.http.protocol.HttpRequestExecutor.doSendRequest(HttpRequestExecutor.java:203)
    at org.apache.http.protocol.HttpRequestExecutor.execute(HttpRequestExecutor.java:121)
    at org.apache.http.impl.execchain.MainClientExec.execute(MainClientExec.java:254)
    at org.apache.http.impl.execchain.ProtocolExec.execute(ProtocolExec.java:195)
    at org.apache.http.impl.execchain.RetryExec.execute(RetryExec.java:86)
    at org.apache.http.impl.client.InternalHttpClient.doExecute(InternalHttpClient.java:184)
    at org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:72)
    at org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:57)
    at org.archive.modules.fetcher.FetchHTTPRequest.execute(FetchHTTPRequest.java:751)
    at org.archive.modules.fetcher.FetchHTTP.innerProcess(FetchHTTP.java:658)
    at org.archive.modules.Processor.innerProcessResult(Processor.java:175)
    at org.archive.modules.Processor.process(Processor.java:142)
    at org.archive.modules.ProcessorChain.process(ProcessorChain.java:138)
    at org.archive.crawler.framework.ToeThread.run(ToeThread.java:148)

@WI-IT changed the title from "RIS already open for ToeThread exception during https pages crawl over proxy" to ""RIS already open for ToeThread..." exception during https pages crawl over proxy" on Oct 30, 2017
@ato ato added the bug label Aug 2, 2018
@marhop
marhop commented Jan 31, 2019

I can confirm this. The exception is thrown only for HTTPS hosts; plain HTTP works fine with a proxy. What's worse, as soon as Heritrix encounters an HTTPS URL it runs into a -404 "Empty HTTP response interpreted as a 404" error. (This may be a coincidence, but the correlation looks suspicious enough to me.)

This could be related to iipc/webarchive-commons#64 where @kris-sigur hinted at a possible cause:

First thought is that when crawling HTTPS via proxy, Heritrix fails to properly close the RecordingInputStream

Looking at the source code I have to admit though that I have no idea where this happens (or if this is in fact the cause of this behaviour), so I cannot offer you a bugfix ... Would be great if someone else can! 😄

Thanks,
Martin

@danielbicho

Any update on this? I am facing the same problem.

I notice several problems here:

CONNECT command problem

Heritrix/HttpClient sends a malformed CONNECT request, and some proxies reject it. I tried Warcprox and Charles proxy, and both complain about it.
Heritrix sends something like CONNECT sobre.arquivo.pt HTTP/1.0, which is wrong because the request should specify the port number: CONNECT sobre.arquivo.pt:443 HTTP/1.0. (This is my reading of the specification; can someone confirm it?)

Changing the ROUTE_PLANNER in FetchHTTPRequest to specify the HttpHost port instead of passing -1 solves this problem: the CONNECT request is then sent correctly.
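
To illustrate the port handling described above, here is a minimal, self-contained sketch. It is not the Heritrix or HttpClient code: the class ConnectTarget and its methods are hypothetical names introduced only for this example, and it assumes that an unset port is represented by a non-positive value such as -1 and should fall back to the scheme's default (443 for https, 80 for http).

```java
import java.util.Locale;

// Hypothetical helper, not part of Heritrix or HttpClient.
final class ConnectTarget {

    /** Returns the well-known default port for a scheme, or -1 if unknown. */
    static int defaultPort(String scheme) {
        switch (scheme.toLowerCase(Locale.ROOT)) {
            case "https":
                return 443;
            case "http":
                return 80;
            default:
                return -1;
        }
    }

    /** Builds the host:port authority, resolving an unset port (<= 0). */
    static String authority(String host, int port, String scheme) {
        int resolved = port > 0 ? port : defaultPort(scheme);
        if (resolved <= 0) {
            throw new IllegalArgumentException("no port known for scheme " + scheme);
        }
        return host + ":" + resolved;
    }

    /** Builds the CONNECT request line a proxy would receive. */
    static String requestLine(String host, int port, String scheme) {
        return "CONNECT " + authority(host, port, scheme) + " HTTP/1.0";
    }

    public static void main(String[] args) {
        // prints: CONNECT sobre.arquivo.pt:443 HTTP/1.0
        System.out.println(requestLine("sobre.arquivo.pt", -1, "https"));
    }
}
```

The point of the sketch is only that the port is resolved before the request line is built, so the proxy never sees a bare host name.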

The RIS already open problem.

My conclusion is that while opening a tunnel for HTTPS, HttpClient calls getSocketInputStream() twice, first wrapping a java.net.SocketInputStream and then wrapping a sun.security.ssl.AppInputStream. Heritrix has no way of knowing about this behaviour, since it delegates the connection operations to HttpClient.

Also, if I try to properly close the java.net.SocketInputStream before wrapping the sun.security.ssl.AppInputStream, it then complains that the socket is closed when it tries to write.
The solution in iipc/webarchive-commons#64 seems sufficient, and I agree that there is no need to throw an exception if the RecordingInputStream is already wrapping a stream.
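
The difference between the two behaviours can be sketched as follows. This is a simplified stand-in, not the real org.archive.io.RecordingInputStream and not the actual change in iipc/webarchive-commons#64: the class LenientRecordingStream, its strict flag, and the helper secondOpenFails are all hypothetical. In strict mode a second open() throws, as in the stack trace above; in lenient mode it simply switches to the new stream without closing the old one, since closing it would close the underlying socket.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;

// Hypothetical stand-in for a recording stream; not the webarchive-commons class.
final class LenientRecordingStream extends InputStream {
    private final boolean strict;       // true: throw on a second open()
    private final ByteArrayOutputStream recorded = new ByteArrayOutputStream();
    private InputStream in;             // stream currently being recorded

    LenientRecordingStream(boolean strict) {
        this.strict = strict;
    }

    /** Starts recording from the given stream. */
    void open(InputStream wrapped) throws IOException {
        if (in != null && strict) {
            throw new IOException("RIS already open");
        }
        // Lenient mode: re-wrap without closing the previous stream,
        // because closing it would also close the shared socket.
        in = wrapped;
    }

    @Override
    public int read() throws IOException {
        if (in == null) {
            throw new IOException("stream not open");
        }
        int b = in.read();
        if (b >= 0) {
            recorded.write(b);          // keep a copy of every byte read
        }
        return b;
    }

    byte[] recordedBytes() {
        return recorded.toByteArray();
    }

    /** Simulates the two wraps seen during tunnel setup. */
    static boolean secondOpenFails(boolean strict) {
        LenientRecordingStream ris = new LenientRecordingStream(strict);
        try {
            ris.open(new ByteArrayInputStream(new byte[] {1})); // plain socket stream
            ris.open(new ByteArrayInputStream(new byte[] {2})); // TLS application stream
            return false;
        } catch (IOException e) {
            return true;
        }
    }

    public static void main(String[] args) {
        System.out.println("strict fails:  " + secondOpenFails(true));
        System.out.println("lenient fails: " + secondOpenFails(false));
    }
}
```

Whether the bytes recorded before the re-wrap should be kept or discarded is left open here; the sketch keeps them.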
