Function to get the status response code and headers of a warc response? #198

dportabella · 2018-04-20T09:14:15Z

Is there a function to get the status code and headers of a WARC response?

Something like:

RecordLoader.loadArchives(warcFile, sc)
  .filter(_.warcResponse.statusCode == 200)
  .filter(_.warcResponse.headers.get("Server") == "Apache/2.4.6")


dportabella · 2018-04-20T14:44:43Z

I am using this function at the moment:

import java.io.ByteArrayInputStream
import io.archivesunleashed.spark.archive.io.ArchiveRecord
import org.apache.commons.httpclient.{Header, HttpParser, StatusLine}
import org.apache.commons.io.IOUtils
import scala.util.Try

case class Response(archiveRecord: ArchiveRecord, statusLine: StatusLine, headers: List[Header], content: Array[Byte])

object WarcUtils {
  def parseResponse(r: ArchiveRecord): Response = {
    val response = new ByteArrayInputStream(r.getContentBytes)
    // The first line of the record content is the HTTP status line.
    val line = HttpParser.readRawLine(response)
    val statusLine = new StatusLine(new String(line))
    // Header lines follow, up to the first empty line.
    val headers = HttpParser.parseHeaders(response, "US-ASCII").toList
    // Whatever remains in the stream is the response body.
    val responseContent: Array[Byte] = IOUtils.toByteArray(response)
    Response(r, statusLine, headers, responseContent)
  }
}

RecordLoader.loadArchives(warcFile, sc)
  .filter(FilterArchive.isHTML)
  // Records whose HTTP response cannot be parsed are dropped.
  .flatMap(r => Try(WarcUtils.parseResponse(r)).toOption)
  .filter(_.statusLine.getStatusCode == 200)
  .filter(_.headers.collectFirst { case h if h.getName == "Server" => h.getValue }.contains("Apache/2.4.6"))

If you tell me how you would prefer to refactor this code to fit your library, I can open a pull request.
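One caveat on the Server filter: HTTP header field names are case-insensitive, so an exact match on "Server" could miss a response that sends "server". A minimal sketch of a case-insensitive lookup, using a hypothetical `HeaderLookup.headerValue` helper over plain (name, value) pairs rather than commons-httpclient `Header` objects:

```scala
// Hypothetical helper for illustration; not part of aut or commons-httpclient.
object HeaderLookup {
  /** Returns the first value whose header name matches `name`, ignoring case. */
  def headerValue(headers: Seq[(String, String)], name: String): Option[String] =
    headers.collectFirst { case (n, v) if n.equalsIgnoreCase(name) => v.trim }
}

// Usage: the lookup succeeds even though the header name is lower case.
val sampleHeaders = Seq("Content-Type" -> "text/html", "server" -> "Apache/2.4.6")
val isApache = HeaderLookup.headerValue(sampleHeaders, "Server").contains("Apache/2.4.6")
```

With `Header` objects, the same idea would be `h.getName.equalsIgnoreCase("Server")` in the `collectFirst` guard.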

dportabella · 2018-04-20T15:04:01Z

However, parsing fails on the following archive, generated by:
$ wget --warc-file=test https://www.linkedin.com/

Do you know whether the 1000 line (just before the HTML content) is part of the response headers? It looks quite strange to me. My parseResponse function treats this 1000 as part of the HTML content.

Any idea about this?

WARC/1.0
WARC-Type: response
WARC-Record-ID: <urn:uuid:5D495122-1DD9-472B-B619-A7EDADF03070>
WARC-Warcinfo-ID: <urn:uuid:65D43BDC-AC23-4515-A951-6DE201680871>
WARC-Concurrent-To: <urn:uuid:CB15E601-0128-4D95-9436-BDF271F23CBF>
WARC-Target-URI: <https://www.linkedin.com/>
WARC-Date: 2018-04-20T07:57:54Z
WARC-IP-Address: 185.63.145.1
WARC-Block-Digest: sha1:7OCQ4D4PIAWT2CDJ3K7P6HZ4QGBFAYMP
WARC-Payload-Digest: sha1:ANGMHLF4ZVXSWXMWJQLSCNUIRJEWBZLS
Content-Type: application/http;msgtype=response
Content-Length: 46408

HTTP/1.1 200 OK
Date: Fri, 20 Apr 2018 07:57:54 GMT
Content-Type: text/html; charset=utf-8
...
Set-Cookie: lidc="b=VGST04:g=802:u=1:i=1524211008:t=1524297408:s=AQHTwVE0xQkI0A3-ifWSgmSS3EeyGTjx"; Expires=Sat, 21 Apr 2018 07:56:48 GMT; domain=.linkedin.com; Path=/

1000
<!DOCTYPE html>
<!--[if lt IE 7]> <html lang="en" class="ie ie6 lte9 lte8 lte7 os-win"> <![endif]-->
...
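The 1000 line is consistent with HTTP/1.1 chunked transfer encoding, in which each chunk of the body is preceded by its size in hexadecimal (1000 hex is 4096 bytes). This assumes the elided headers include `Transfer-Encoding: chunked`, which the excerpt does not show. Under that assumption the line belongs to the body framing, not to the headers. A minimal sketch of stripping the framing with a hypothetical `Dechunk.decode` helper; it ignores trailers and discards chunk extensions:

```scala
import java.io.{ByteArrayInputStream, ByteArrayOutputStream, EOFException, InputStream}

// Hypothetical helper for illustration; not part of aut.
object Dechunk {
  // Reads one line as ASCII, dropping the CR/LF terminator.
  private def readLine(in: InputStream): String = {
    val sb = new StringBuilder
    var b = in.read()
    while (b != -1 && b != '\n') {
      if (b != '\r') sb.append(b.toChar)
      b = in.read()
    }
    sb.toString
  }

  /** Removes chunk-size lines from a chunked body and returns the joined payload. */
  def decode(body: Array[Byte]): Array[Byte] = {
    val in = new ByteArrayInputStream(body)
    val out = new ByteArrayOutputStream()
    var done = false
    while (!done) {
      // Chunk-size line: hex digits, optionally followed by ";extension".
      val sizeLine = readLine(in).takeWhile(_ != ';').trim
      val size = if (sizeLine.isEmpty) 0 else Integer.parseInt(sizeLine, 16)
      if (size == 0) {
        done = true // a zero-size chunk (or end of input) terminates the body
      } else {
        val chunk = new Array[Byte](size)
        var read = 0
        while (read < size) {
          val n = in.read(chunk, read, size - read)
          if (n == -1) throw new EOFException("truncated chunk")
          read += n
        }
        out.write(chunk, 0, size)
        readLine(in) // consume the CRLF that follows the chunk data
      }
    }
    out.toByteArray
  }
}
```

In parseResponse, this would apply to responseContent only when the parsed headers actually declare chunked encoding.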

greebie · 2018-11-20T14:56:13Z

@dportabella I was going to look at producing something to resolve this issue today. Are you still interested in providing a pull request?

With @lintool & @ianmilligan1 's okay, I would say that this function should be in the matchbox (https://github.com/archivesunleashed/aut/tree/master/src/main/scala/io/archivesunleashed/matchbox) as ExtractHttpResponse or something similar and maybe make .parseResponse be the apply function.

An alternative is to include access to the response code in the ArchiveRecord trait and ArchiveRecordImpl as .getHttpResponse so that it is included in all ArchiveRecords by default. This seems the more user-friendly approach, but we would want to test the effect the function has on overall run time. (It should be nil due to laziness, but as a rule we should check run time anytime we change the ArchiveRecord trait).
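One way to get the "no cost unless used" property is a `lazy val`. A minimal sketch, assuming that is what laziness means here; `Record` and `RecordImpl` are hypothetical stand-ins rather than aut's ArchiveRecord and ArchiveRecordImpl, and the "000" fallback is a placeholder:

```scala
// Hypothetical stand-ins for illustration only.
trait Record {
  def getContentBytes: Array[Byte]
  def getHttpStatus: String
}

class RecordImpl(bytes: Array[Byte]) extends Record {
  var parseCount = 0 // counts how often the status line is actually parsed

  val getContentBytes: Array[Byte] = bytes

  // Evaluated on first access only; records that never ask for it pay nothing.
  lazy val getHttpStatus: String = {
    parseCount += 1
    val firstLine = new String(bytes, "US-ASCII").takeWhile(c => c != '\r' && c != '\n')
    firstLine.split(" ") match {
      case Array(_, code, _*) => code // e.g. "HTTP/1.1 200 OK" yields "200"
      case _                  => "000" // placeholder when no status line is found
    }
  }
}
```

Constructing a `RecordImpl` leaves `parseCount` at 0; the first call to `getHttpStatus` parses the status line and later calls reuse the cached value.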

I think I prefer the second option. If you are not able to prep a PR at this time, I can take a crack at it.

Either way, I am not sure how, but we would want to be sure you were given appropriate credit for this idea and proposed implementation. Maybe @ruebot or @ianmilligan1 knows the best way to make this happen.

Thanks so much for your help!

Ryan.

greebie · 2018-11-21T18:17:01Z

Branch issue-198 covers the HTTP response status code, but not the full headers, as I could not get the full header details at this stage. The timings below show the run-time differences.

Using the same warc collection

17.0

Text	Network	Domain
4222	163173	unknown
292	167488	113007
297	164422	114284

17.1 (same script)

Text	Network	Domain
2569	177328	120834
237	160474	112580
227	168222	112792

17.1 (add statusHeader)

(note network script includes an additional map compared to above)

Text	Network	Domain
229	188917	160654
341	202115	124252
247	175594	116306

17.1 (add fileName)

(note network script includes an additional map compared to above)

Text	Network	Domain
230	213184	18282
249	165582	113008
239	180437	123404

tl;dr - there is no effect on the ArchiveRecord / RecordLoader when .getHttpStatus or .getFilename is not used, and only a minimal effect when they are used.

greebie · 2018-11-21T20:51:44Z

This is the code I used to produce the above.

Original code

timed {
  import io.archivesunleashed._
  import io.archivesunleashed.matchbox._

  RecordLoader.loadArchives("/Users/ryandeschamps/warcs/*.gz", sc)
    .keepValidPages()
    .map(r => (r.getCrawlDate, r.getDomain, r.getUrl, RemoveHTML(r.getContentString)))
    .take(10)
}

timed {
  import io.archivesunleashed._
  import io.archivesunleashed.matchbox._
  import io.archivesunleashed.util._

  val links = RecordLoader.loadArchives("/Users/ryandeschamps/warcs/*gz", sc)
    .keepValidPages()
    .keepContent(Set("apple".r))
    .flatMap(r => ExtractLinks(r.getUrl, r.getContentString))
    .map(r => (ExtractDomain(r._1).removePrefixWWW(), ExtractDomain(r._2).removePrefixWWW()))
    .filter(r => r._1 != "" && r._2 != "")
    .countItems()
    .filter(r => r._2 > 5).take(10)
}

timed {
  import io.archivesunleashed._
  import io.archivesunleashed.matchbox._

  val r = RecordLoader.loadArchives("/Users/ryandeschamps/warcs/*gz", sc)
    .keepValidPages()
    .map(r => ExtractDomain(r.getUrl))
    .countItems()
    .take(10)
}

add .getHttpStatus

  import io.archivesunleashed._
  import io.archivesunleashed.matchbox._
  import io.archivesunleashed.util._

  val links = RecordLoader.loadArchives("/Users/ryandeschamps/warcs/*gz", sc)
    .keepValidPages()
    .keepContent(Set("apple".r))
    .map(r => (r.getHttpStatus, (ExtractLinks(r.getUrl, r.getContentString))))
    .flatMap(r => r._2.map(f => (r._1, ExtractDomain(f._1).replaceAll("^\\s*www\\.", ""), 
      ExtractDomain(f._2).replaceAll("^\\s*www\\.", ""))))
    .filter(r => r._2 != "" && r._3 != "")
    .countItems()
    .filter(r => r._2 > 5).take(10)

produces: links: Array[((String, String, String), Int)] = Array(((200,nanaimodailynews.com,nanaimodailynews.com),445785), ((200,nanaimodailynews.com,blackpress.ca),188676), ((200,nanaimodailynews.com,bclocalnews.com),111400), ((200,nanaimodailynews.com,drivewaycanada.ca),53796), ((200,nanaimodailynews.com,facebook.com),53500), ((200,nanaimodailynews.com,bcclassified.com),52922), ((200,nanaimodailynews.com,usednanaimo.com),27067), ((200,nanaimodailynews.com,iservices.blackpress.ca),26953), ((200,nanaimodailynews.com,localworkbc.ca),26546), ((200,nanaimodailynews.com,twitter.com),24853))

add .getFilename

  import io.archivesunleashed._
  import io.archivesunleashed.matchbox._
  import io.archivesunleashed.util._

  val links = RecordLoader.loadArchives("/Users/ryandeschamps/warcs/*gz", sc)
    .keepValidPages()
    .keepContent(Set("apple".r))
    .map(r => (r.getFilename, (ExtractLinks(r.getUrl, r.getContentString))))
    .flatMap(r => r._2.map(f => (r._1, ExtractDomain(f._1).replaceAll("^\\s*www\\.", ""),
      ExtractDomain(f._2).replaceAll("^\\s*www\\.", ""))))
    .filter(r => r._2 != "" && r._3 != "")
    .countItems()
    .filter(r => r._2 > 5).take(10)

produces

links: Array[((String, String, String), Int)] = Array(((file:/Users/ryandeschamps/warcs/ARCHIVEIT-4656-CRAWL_SELECTED_SEEDS-JOB193391-20160127222913427-00000.warc.gz,nanaimodailynews.com,nanaimodailynews.com),439503), 
((file:/Users/ryandeschamps/warcs/ARCHIVEIT-4656-CRAWL_SELECTED_SEEDS-JOB193391-20160127222913427-00000.warc.gz,nanaimodailynews.com,blackpress.ca),186028), 
((file:/Users/ryandeschamps/warcs/ARCHIVEIT-4656-CRAWL_SELECTED_SEEDS-JOB193391-20160127222913427-00000.warc.gz,nanaimodailynews.com,bclocalnews.com),106107), 
((file:/Users/ryandeschamps/warcs/ARCHIVEIT-4656-CRAWL_SELECTED_SEEDS-JOB193391-20160127222913427-00000.warc.gz,nanaimodailynews.com,drivewaycanada.ca),53040),
...
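The timed wrapper used in the original scripts is not defined in this thread. A minimal sketch of such a helper, assuming it measures wall-clock time around a block; the `Timing` object, the returned tuple, and the millisecond unit are all assumptions, and the timing tables above do not state their unit:

```scala
// Hypothetical helper for illustration; the thread does not show the real one.
object Timing {
  /** Runs `block` once and returns its result with the elapsed wall-clock milliseconds. */
  def timed[A](block: => A): (A, Long) = {
    val start = System.nanoTime()
    val result = block
    val elapsedMs = (System.nanoTime() - start) / 1000000L
    (result, elapsedMs)
  }
}

// Usage: time any expression, including a Spark action such as .take(10).
val (sum, ms) = Timing.timed { (1 to 1000).sum }
```

Because Spark transformations are lazy, only a block that ends in an action (such as take or count) measures the actual job.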

dportabella · 2018-11-23T08:45:09Z

Cool, thanks!

ruebot added the question label Aug 20, 2018

greebie mentioned this issue Nov 22, 2018

Add .getHttpStatus and .getFilename to ArchiveRecordImpl class #198 & #164 #292

Merged

ruebot closed this as completed in 7731b6d Nov 28, 2018
