Function to get the status response code and headers of a warc response? #198
Comments
I am using this function at the moment:
If you tell me how you prefer to refactor this code to fit your library, I can make a pull request if you want.
However, there is an error parsing the following archive, generated by [...]. Do you have any idea about this?
@dportabella I was going to look at producing something to resolve this issue today. Are you still interested in providing a pull request?
With @lintool and @ianmilligan1's okay, I would say that this function should be in the matchbox (https://github.com/archivesunleashed/aut/tree/master/src/main/scala/io/archivesunleashed/matchbox) as [...].
An alternative is to include access to the response code in the ArchiveRecord trait and ArchiveRecordImpl as .getHttpResponse so that it is included in all ArchiveRecords by default. This seems the more user-friendly approach, but we would want to test the effect the function has on overall run time. (It should be nil due to laziness, but as a rule we should check run time any time we change the ArchiveRecord trait.) I think I prefer the second option.
If you are not able to prep a PR at this time, I can take a crack at it. Either way, I am not sure how, but we would want to be sure you are given appropriate credit for this idea and proposed implementation. Maybe @ruebot or @ianmilligan1 knows the best way to make this happen.
Thanks so much for your help!
Ryan
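To illustrate the laziness point, here is a minimal sketch of a lazily evaluated status field. `DemoRecord`, `statusLine`, `httpStatus` and `parseCount` are hypothetical names made up for this example; they do not mirror the actual ArchiveRecord trait, and whether aut uses a `lazy val` for this is an assumption.

```scala
// Hypothetical stand-in for a record type; NOT aut's ArchiveRecord.
final class DemoRecord(val statusLine: String) {
  // Counts how many times the status line was actually parsed (demo only).
  var parseCount = 0

  // Evaluated on first access only, then cached for later accesses.
  lazy val httpStatus: Option[Int] = {
    parseCount += 1
    statusLine.split(" ") match {
      // Second token of "HTTP/1.1 200 OK" is the three-digit status code.
      case Array(_, code, _*) if code.length == 3 && code.forall(_.isDigit) =>
        Some(code.toInt)
      case _ => None
    }
  }
}
```

A script that never touches `httpStatus` never runs the parsing code, which is why the extra field should not affect run time for existing jobs.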
Branch issue-198 covers the HTTP response status code, but not the full headers, as I could not get the full header details at this stage.
The following shows the run-time differences, using the same WARC collection:
17.0
17.1 (same script)
17.1 (add statusHeader) (note: the network script includes an additional map compared to the above)
17.1 (add fileName) (note: the network script includes an additional map compared to the above)
tl;dr: there is no effect on the ArchiveRecord / RecordLoader when not using .getHttpStatus or .getFilename, and there is minimal effect when using them.
This is the code I used to produce the above.
Original code:
Add .getHttpStatus:
produces: links: Array[((String, String, String), Int)] = Array(((200,nanaimodailynews.com,nanaimodailynews.com),445785), ((200,nanaimodailynews.com,blackpress.ca),188676), ((200,nanaimodailynews.com,bclocalnews.com),111400), ((200,nanaimodailynews.com,drivewaycanada.ca),53796), ((200,nanaimodailynews.com,facebook.com),53500), ((200,nanaimodailynews.com,bcclassified.com),52922), ((200,nanaimodailynews.com,usednanaimo.com),27067), ((200,nanaimodailynews.com,iservices.blackpress.ca),26953), ((200,nanaimodailynews.com,localworkbc.ca),26546), ((200,nanaimodailynews.com,twitter.com),24853))
produces
Cool, thanks!
Is there a function to get the status response code and headers of a WARC response?
Such as...
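For illustration, a minimal sketch of what such a function could look like, assuming the HTTP portion of the WARC response record is already available as a string. `HttpResponseParser`, `HttpHead` and `parse` are hypothetical names invented for this example and are not part of aut; the sketch also does not handle folded (multi-line) headers.

```scala
// Hypothetical helper, NOT part of aut: parses the HTTP status line and
// headers from the raw payload of a WARC response record.
object HttpResponseParser {
  // Matches a status line such as "HTTP/1.1 200 OK" and captures the code.
  private val StatusLine = """^HTTP/\d(?:\.\d)?\s+(\d{3})(?:\s.*)?$""".r

  final case class HttpHead(status: Option[Int], headers: Map[String, String])

  def parse(payload: String): HttpHead = {
    // The header block ends at the first blank line; the body is ignored.
    val head = payload.split("\r?\n\r?\n", 2).head
    val lines = head.split("\r?\n").toList

    // First line is the status line; None if it is malformed.
    val status = lines.headOption.flatMap {
      case StatusLine(code) => Some(code.toInt)
      case _ => None
    }

    // Remaining lines are "Name: value" pairs. Names are lower-cased, and a
    // repeated header name keeps only its last value.
    val headers = lines.drop(1).flatMap { line =>
      line.split(":", 2) match {
        case Array(name, value) => Some(name.trim.toLowerCase -> value.trim)
        case _ => None
      }
    }.toMap

    HttpHead(status, headers)
  }
}
```

Returning `Option[Int]` rather than a bare `Int` keeps records with a missing or malformed status line from throwing in the middle of a job.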