
Function to get the status response code and headers of a warc response? #198

Closed

dportabella opened this issue Apr 20, 2018 · 6 comments

@dportabella
Contributor

Is there a function to get the status response code and headers of a WARC response?

such as...

RecordLoader.loadArchives(warcFile, sc)
  .filter(_.warcResponse.statusCode == 200)
  .filter(_.warcResponse.headers.get("Server").contains("Apache/2.4.6"))
@dportabella
Contributor Author

dportabella commented Apr 20, 2018

I am using this function at the moment:

import java.io.ByteArrayInputStream
import io.archivesunleashed.spark.archive.io.ArchiveRecord
import org.apache.commons.httpclient.{Header, HttpParser, StatusLine}
import org.apache.commons.io.IOUtils

case class Response(archiveRecord: ArchiveRecord, statusLine: StatusLine, headers: List[Header], content: Array[Byte])

object WarcUtils {
  def parseResponse(r: ArchiveRecord): Response = {
    val response = new ByteArrayInputStream(r.getContentBytes)
    // The first line of the HTTP payload is the status line, e.g. "HTTP/1.1 200 OK".
    val line = HttpParser.readRawLine(response)
    val statusLine = new StatusLine(new String(line, "US-ASCII").trim)
    // parseHeaders consumes everything up to and including the blank line that ends the headers.
    val headers = HttpParser.parseHeaders(response, "US-ASCII").toList
    // Whatever remains in the stream is the response body.
    val responseContent: Array[Byte] = IOUtils.toByteArray(response)
    Response(r, statusLine, headers, responseContent)
  }
}

import scala.util.Try

RecordLoader.loadArchives(warcFile, sc)
  .filter(FilterArchive.isHTML)
  // Drop records whose payload is not a parsable HTTP response.
  .flatMap(r => Try(WarcUtils.parseResponse(r)).toOption)
  .filter(_.statusLine.getStatusCode == 200)
  .filter(_.headers.collectFirst { case h if h.getName == "Server" => h.getValue }.contains("Apache/2.4.6"))

If you tell me how you prefer to refactor this code to fit your library, I can make a pull request if you want.

@dportabella
Contributor Author

However, there is an error parsing the following archive, generated by

$ wget --warc-file=test https://www.linkedin.com/

Do you know if this 1000 (just before the HTML content) is part of the response header? It looks quite strange to me; my parseResponse function detects the 1000 as part of the HTML content.

Any idea about this?

WARC/1.0
WARC-Type: response
WARC-Record-ID: <urn:uuid:5D495122-1DD9-472B-B619-A7EDADF03070>
WARC-Warcinfo-ID: <urn:uuid:65D43BDC-AC23-4515-A951-6DE201680871>
WARC-Concurrent-To: <urn:uuid:CB15E601-0128-4D95-9436-BDF271F23CBF>
WARC-Target-URI: <https://www.linkedin.com/>
WARC-Date: 2018-04-20T07:57:54Z
WARC-IP-Address: 185.63.145.1
WARC-Block-Digest: sha1:7OCQ4D4PIAWT2CDJ3K7P6HZ4QGBFAYMP
WARC-Payload-Digest: sha1:ANGMHLF4ZVXSWXMWJQLSCNUIRJEWBZLS
Content-Type: application/http;msgtype=response
Content-Length: 46408

HTTP/1.1 200 OK
Date: Fri, 20 Apr 2018 07:57:54 GMT
Content-Type: text/html; charset=utf-8
...
Set-Cookie: lidc="b=VGST04:g=802:u=1:i=1524211008:t=1524297408:s=AQHTwVE0xQkI0A3-ifWSgmSS3EeyGTjx"; Expires=Sat, 21 Apr 2018 07:56:48 GMT; domain=.linkedin.com; Path=/

1000
<!DOCTYPE html>
<!--[if lt IE 7]> <html lang="en" class="ie ie6 lte9 lte8 lte7 os-win"> <![endif]-->
...
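
(Side note, not confirmed in this thread: the 1000 line is consistent with HTTP/1.1 chunked transfer encoding. If the elided headers include Transfer-Encoding: chunked, the body is a series of chunks, each prefixed by its size in hexadecimal (0x1000 is 4096 bytes), and wget records the raw payload with that framing intact, so a parser that stops after the headers sees the size lines as body content. A minimal dechunking sketch for that case:

import java.io.{ByteArrayOutputStream, InputStream}
import org.apache.commons.httpclient.HttpParser

// Sketch only: decode a well-formed chunked body. A real fix would first
// check the parsed headers for "Transfer-Encoding: chunked".
def dechunk(in: InputStream): Array[Byte] = {
  // A chunk-size line is hex, optionally followed by ";extensions".
  def chunkSize(): Int = {
    val line = new String(HttpParser.readRawLine(in), "US-ASCII").trim
    Integer.parseInt(line.split(";")(0).trim, 16)
  }
  val out = new ByteArrayOutputStream()
  var size = chunkSize()
  while (size > 0) {
    val chunk = new Array[Byte](size)
    var read = 0
    while (read < size) {
      val n = in.read(chunk, read, size - read)
      require(n > 0, "truncated chunk")
      read += n
    }
    out.write(chunk)
    HttpParser.readRawLine(in) // consume the CRLF that ends the chunk data
    size = chunkSize()
  }
  out.toByteArray
}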

@greebie
Contributor

greebie commented Nov 20, 2018

@dportabella I was going to look at producing something to resolve this issue today. Are you still interested in providing a pull request?

With @lintool's and @ianmilligan1's okay, I would say this function should go in the matchbox (https://github.com/archivesunleashed/aut/tree/master/src/main/scala/io/archivesunleashed/matchbox) as ExtractHttpResponse or something similar, perhaps with .parseResponse as the apply function.

An alternative is to include access to the response code in the ArchiveRecord trait and ArchiveRecordImpl as .getHttpResponse so that it is included in all ArchiveRecords by default. This seems the more user-friendly approach, but we would want to test the effect the function has on overall run time. (It should be nil due to laziness, but as a rule we should check run time anytime we change the ArchiveRecord trait).
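
A rough sketch of that second option (names and structure are illustrative, not the actual aut trait):

import java.io.ByteArrayInputStream
import scala.util.Try
import org.apache.commons.httpclient.{HttpParser, StatusLine}

// Hypothetical mixin, assuming the record exposes its raw payload bytes.
// The lazy val means records that never call getHttpStatus pay nothing.
trait HttpStatusSupport {
  def getContentBytes: Array[Byte]

  lazy val getHttpStatus: String =
    Try {
      val in = new ByteArrayInputStream(getContentBytes)
      new StatusLine(new String(HttpParser.readRawLine(in), "US-ASCII").trim)
        .getStatusCode.toString
    }.getOrElse("000") // non-HTTP records (e.g. DNS) fall back to a sentinel
}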

I think I prefer the second option. If you are not able to prep a PR at this time, I can take a crack at it.

Either way, we would want to be sure you are given appropriate credit for this idea and proposed implementation. I am not sure how, but maybe @ruebot or @ianmilligan1 knows the best way to make this happen.

Thanks so much for your help!

Ryan

@greebie
Contributor

greebie commented Nov 21, 2018

Branch issue-198 covers the response status code, but not the full headers, as I could not get the full header details at this stage. The timings below show the run-time differences.

Using the same WARC collection:

17.0

Text   Network   Domain
4222   163173    unknown
 292   167488    113007
 297   164422    114284

17.1 (same script)

Text   Network   Domain
2569   177328    120834
 237   160474    112580
 227   168222    112792

17.1 (add statusHeader)

(note: the network script includes an additional map compared to above)

Text   Network   Domain
 229   188917    160654
 341   202115    124252
 247   175594    116306

17.1 (add fileName)

(note: the network script includes an additional map compared to above)

Text   Network   Domain
 230   213184     18282
 249   165582    113008
 239   180437    123404

tl;dr: there is no effect on the ArchiveRecord / RecordLoader when not using .getHttpStatus or .getFilename, and minimal effect when using them.

@greebie
Contributor

greebie commented Nov 21, 2018

This is the code I used to produce the above.
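
(timed is a local helper not shown in the thread; presumably something along these lines:)

// Hypothetical stand-in for the timed helper used below: run a block and
// print the elapsed wall-clock time in milliseconds.
def timed[T](block: => T): T = {
  val start = System.currentTimeMillis()
  val result = block
  println(s"Elapsed: ${System.currentTimeMillis() - start} ms")
  result
}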

Original code

timed {
  import io.archivesunleashed._
  import io.archivesunleashed.matchbox._

  RecordLoader.loadArchives("/Users/ryandeschamps/warcs/*.gz", sc)
    .keepValidPages()
    .map(r => (r.getCrawlDate, r.getDomain, r.getUrl, RemoveHTML(r.getContentString)))
    .take(10)
}

timed {
  import io.archivesunleashed._
  import io.archivesunleashed.matchbox._
  import io.archivesunleashed.util._

  val links = RecordLoader.loadArchives("/Users/ryandeschamps/warcs/*gz", sc)
    .keepValidPages()
    .keepContent(Set("apple".r))
    .flatMap(r => ExtractLinks(r.getUrl, r.getContentString))
    .map(r => (ExtractDomain(r._1).removePrefixWWW(), ExtractDomain(r._2).removePrefixWWW()))
    .filter(r => r._1 != "" && r._2 != "")
    .countItems()
    .filter(r => r._2 > 5)
    .take(10)
}

timed {
  import io.archivesunleashed._
  import io.archivesunleashed.matchbox._

  val r = RecordLoader.loadArchives("/Users/ryandeschamps/warcs/*gz", sc)
    .keepValidPages()
    .map(r => ExtractDomain(r.getUrl))
    .countItems()
    .take(10)
}

add .getHttpStatus

import io.archivesunleashed._
import io.archivesunleashed.matchbox._
import io.archivesunleashed.util._

val links = RecordLoader.loadArchives("/Users/ryandeschamps/warcs/*gz", sc)
  .keepValidPages()
  .keepContent(Set("apple".r))
  .map(r => (r.getHttpStatus, ExtractLinks(r.getUrl, r.getContentString)))
  .flatMap(r => r._2.map(f => (r._1, ExtractDomain(f._1).replaceAll("^\\s*www\\.", ""),
    ExtractDomain(f._2).replaceAll("^\\s*www\\.", ""))))
  .filter(r => r._2 != "" && r._3 != "")
  .countItems()
  .filter(r => r._2 > 5)
  .take(10)

produces:

links: Array[((String, String, String), Int)] = Array(((200,nanaimodailynews.com,nanaimodailynews.com),445785),
((200,nanaimodailynews.com,blackpress.ca),188676),
((200,nanaimodailynews.com,bclocalnews.com),111400),
((200,nanaimodailynews.com,drivewaycanada.ca),53796),
((200,nanaimodailynews.com,facebook.com),53500),
((200,nanaimodailynews.com,bcclassified.com),52922),
((200,nanaimodailynews.com,usednanaimo.com),27067),
((200,nanaimodailynews.com,iservices.blackpress.ca),26953),
((200,nanaimodailynews.com,localworkbc.ca),26546),
((200,nanaimodailynews.com,twitter.com),24853))

add .getFilename

timed {
  import io.archivesunleashed._
  import io.archivesunleashed.matchbox._
  import io.archivesunleashed.util._

  val links = RecordLoader.loadArchives("/Users/ryandeschamps/warcs/*gz", sc)
    .keepValidPages()
    .keepContent(Set("apple".r))
    .map(r => (r.getFilename, ExtractLinks(r.getUrl, r.getContentString)))
    .flatMap(r => r._2.map(f => (r._1, ExtractDomain(f._1).replaceAll("^\\s*www\\.", ""),
      ExtractDomain(f._2).replaceAll("^\\s*www\\.", ""))))
    .filter(r => r._2 != "" && r._3 != "")
    .countItems()
    .filter(r => r._2 > 5)
    .take(10)
}

produces:

links: Array[((String, String, String), Int)] = Array(((file:/Users/ryandeschamps/warcs/ARCHIVEIT-4656-CRAWL_SELECTED_SEEDS-JOB193391-20160127222913427-00000.warc.gz,nanaimodailynews.com,nanaimodailynews.com),439503), 
((file:/Users/ryandeschamps/warcs/ARCHIVEIT-4656-CRAWL_SELECTED_SEEDS-JOB193391-20160127222913427-00000.warc.gz,nanaimodailynews.com,blackpress.ca),186028), 
((file:/Users/ryandeschamps/warcs/ARCHIVEIT-4656-CRAWL_SELECTED_SEEDS-JOB193391-20160127222913427-00000.warc.gz,nanaimodailynews.com,bclocalnews.com),106107), 
((file:/Users/ryandeschamps/warcs/ARCHIVEIT-4656-CRAWL_SELECTED_SEEDS-JOB193391-20160127222913427-00000.warc.gz,nanaimodailynews.com,drivewaycanada.ca),53040),
...

@dportabella
Contributor Author

Cool, thanks!

@ruebot ruebot closed this as completed in 7731b6d Nov 28, 2018