Last modified headers #547

Merged
merged 5 commits into from
Nov 7, 2022

Conversation

ruebot
Member

@ruebot ruebot commented Nov 2, 2022

GitHub issue(s): #546

What does this Pull Request do?

Extracts the last_modified_date of a resource from its Last-Modified header, where available.

Example:

import io.archivesunleashed._
val data = "/home/nruest/Projects/au/sample-data/geocities/GEOCITIES-20091027143300-00114-ia400112.us.archive.org.warc.gz"
RecordLoader.loadArchives(data, sc).all().select($"crawl_date", $"last_modified_date").show(20, false)

// Exiting paste mode, now interpreting.

[2022-11-02T16:05:35.325Z - 00000 - HdfsIO] Opening file file:/home/nruest/Projects/au/sample-data/geocities/GEOCITIES-20091027143300-00114-ia400112.us.archive.org.warc.gz (Offset: 0, length: 0, decompress: false, strategy: BlockWise [dynamic])
+--------------+------------------+                                             
|crawl_date    |last_modified_date|
+--------------+------------------+
|20091027143300|                  |
|20091027143259|20000923233454    |
|20091027143259|20020913163029    |
|20091027143300|20020211154553    |
|20091027143259|19980919164703    |
|20091027143259|20080125150303    |
|20091027143300|20010921224658    |
|20091027143258|20081009015203    |
|20091027143300|                  |
|20091027143259|20020416145103    |
|20091027143300|20090223022835    |
|20091027143300|20030928090558    |
|20091027143300|20091027143300    |
|20091027143300|20021203212451    |
|20091027143300|                  |
|20091027143300|20040530033010    |
|20091027143300|                  |
|20091027143259|20090223022352    |
|20091027143300|                  |
|20091027143300|20010608202736    |
+--------------+------------------+
only showing top 20 rows

import io.archivesunleashed._
data: String = /home/nruest/Projects/au/sample-data/geocities/GEOCITIES-20091027143300-00114-ia400112.us.archive.org.warc.gz

How should this be tested?

  • Tests should take care of it.
  • I'm also going to test this at scale with the GeoCities dataset.

Additional Notes:

This is going to require A LOT of documentation updates.

Interested parties

@digitalshawn 👋

@codecov

codecov bot commented Nov 2, 2022

Codecov Report

Merging #547 (c3af611) into main (8a4bf54) will increase coverage by 0.00%.
The diff coverage is 95.23%.

Additional details and impacted files
@@            Coverage Diff            @@
##               main     #547   +/-   ##
=========================================
  Coverage     93.87%   93.88%           
- Complexity       48       49    +1     
=========================================
  Files            44       45    +1     
  Lines           980     1030   +50     
  Branches         52       55    +3     
=========================================
+ Hits            920      967   +47     
- Misses           36       38    +2     
- Partials         24       25    +1     

@ruebot ruebot marked this pull request as ready for review November 2, 2022 16:12
@ruebot
Member Author

ruebot commented Nov 2, 2022

Oh, that blew up at scale real quick 😅

java.time.format.DateTimeParseException: Text 'Fri, 23 Oct 2009 20:0830 GMT' could not be parsed: Invalid value for MinuteOfHour (valid values 0 - 59): 830

java.time.format.DateTimeParseException: Text 'Sat, 24 Oct 2009 14:3224 GMT' could not be parsed: Invalid value for MinuteOfHour (valid values 0 - 59): 3224

java.time.format.DateTimeParseException: Text 'Tue Aug 22 17:18:37 2000 GMT' could not be parsed at index 0

java.time.format.DateTimeParseException: Text 'Thu, 22 Oct 2009 23:1407 GMT' could not be parsed: Invalid value for MinuteOfHour (valid values 0 - 59): 1407

java.time.format.DateTimeParseException: Text 'Sun, 25 Oct 2009 00:3030 GMT' could not be parsed: Invalid value for MinuteOfHour (valid values 0 - 59): 3030

It's what I feared: these dates aren't going to conform to RFC 1123, and some of them are pretty wild.
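For illustration, a minimal sketch of what strict RFC 1123 parsing looks like with `java.time`, and why the headers above are rejected. The `StrictLastModified` object and its `parse` method are hypothetical names, not the code in this PR; the sketch simply turns a parse failure into `None` instead of letting the exception propagate.

```scala
import java.time.ZonedDateTime
import java.time.format.DateTimeFormatter
import scala.util.Try

// Hypothetical helper (not the PR's implementation): strictly parse an
// RFC 1123 date and render it as yyyyMMddHHmmss, or None on failure.
object StrictLastModified {
  private val outputFormat = DateTimeFormatter.ofPattern("yyyyMMddHHmmss")

  def parse(header: String): Option[String] =
    // Try captures the DateTimeParseException thrown for malformed input.
    Try(ZonedDateTime.parse(header.trim, DateTimeFormatter.RFC_1123_DATE_TIME))
      .toOption
      .map(_.format(outputFormat))
}
```

With this, a well-formed value such as `Fri, 23 Oct 2009 20:08:30 GMT` yields a timestamp, while `Fri, 23 Oct 2009 20:0830 GMT` and `Tue Aug 22 17:18:37 2000 GMT` both come back as `None`.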

@helgeho
Contributor

helgeho commented Nov 3, 2022

Here's a fuzzier approach to converting these dates (RFC 1123 or similar) into yyyyMMddHHmmss:

// Map each lowercase month abbreviation to its zero-padded number, e.g. ("jan", "01").
val months = Seq("jan", "feb", "mar", "apr", "may", "jun", "jul", "aug", "sep", "oct", "nov", "dec").zipWithIndex.map{case (s,d) => (s, ("0" + (d + 1)).takeRight(2))}
"""
Fri, 23 Oct 2009 20:0830 GMT
Sat, 24 Oct 2009 14:3224 GMT
Tue Aug 22 17:18:37 2000 GMT
Thu, 22 Oct 2009 23:1407 GMT
Sun, 25 Oct 2009 00:3030 GMT
""".split("\n").map(_.trim).filter(_.nonEmpty).flatMap { str =>
    val lc = str.toLowerCase
    // Find the first month abbreviation that occurs anywhere in the string.
    months.find(m => lc.contains(m._1)).map(_._2).flatMap { m =>
        // Strip colons, drop the leading weekday, and key the remaining tokens by length:
        // 4 = year, 2 = day, 6 = HHmmss.
        val d = str.replace(":", "").split(' ').drop(1).map(d => (d.length, d)).toMap
        for (y <- d.get(4); n <- d.get(2); t <- d.get(6)) yield y + m + n + t
    }
}

output:

Array(
  "20091023200830",
  "20091024143224",
  "20000822171837",
  "20091022231407",
  "20091025003030"
)
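A sketch of how the fuzzy approach above might be packaged as a reusable function. The `FuzzyLastModified` object and `parse` method are hypothetical names, not what was merged. It follows the same idea (identify tokens by length after stripping colons) but, as a small departure from the snippet, also checks that the year, day, and time tokens are all digits. Like the original, it assumes a leading weekday token and a two-digit day.

```scala
// Hypothetical wrapper around the fuzzy approach; names are illustrative
// and not part of the PR.
object FuzzyLastModified {
  // Month abbreviation -> zero-padded month number ("jan" -> "01").
  private val months: Seq[(String, String)] =
    Seq("jan", "feb", "mar", "apr", "may", "jun",
        "jul", "aug", "sep", "oct", "nov", "dec")
      .zipWithIndex
      .map { case (name, i) => (name, f"${i + 1}%02d") }

  private def digits(token: String, length: Int): Boolean =
    token.length == length && token.forall(_.isDigit)

  def parse(header: String): Option[String] = {
    // Drop colons and the leading weekday token, then split on whitespace.
    val tokens = header.trim.replace(":", "").split("\\s+").drop(1)
    for {
      month <- months.collectFirst {
                 case (name, num) if tokens.exists(_.toLowerCase.startsWith(name)) => num
               }
      year  <- tokens.find(digits(_, 4)) // yyyy
      day   <- tokens.find(digits(_, 2)) // dd
      time  <- tokens.find(digits(_, 6)) // HHmmss
    } yield year + month + day + time
  }
}
```

Returning `Option[String]` lets a caller fall back to an empty value when nothing usable is found, which is how missing dates appear in the sample output in this thread.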

@ruebot
Member Author

ruebot commented Nov 4, 2022

The at-scale test with GeoCities was successful.

import io.archivesunleashed._
val data = "/tuna1/scratch/nruest/geocites/warcs"
val test = RecordLoader.loadArchives(data, sc).all().select($"crawl_date", $"last_modified_date", $"mime_type_web_server", $"mime_type_tika")
test.write.option("timestampFormat", "yyyy/MM/dd HH:mm:ss ZZ").format("csv").option("escape", "\"").option("encoding", "utf-8").save("/tuna1/scratch/nruest/aut-547-test")
$ wc -l aut-547-test.csv 
317151386 aut-547-test.csv
$ head -n25 aut-547-test.csv 
crawl_date,last_modified_date,mime_type_web_server,mime_type_tika
20091025200809,20090223021853,text/html,text/html
20091025200809,20010725152023,text/html,text/html
20091025200809,19980216015402,image/gif,image/gif
20091025200809,20090223022127,text/html,text/html
20091025200809,20021211215621,image/gif,image/gif
20091025200809,20050419153723,text/html,text/html
20091025200809,20090223023317,text/html,text/html
20091025200809,20090218194412,text/html,text/html
20091025200809,20060508201301,text/html,text/html
20091025200809,20090223023317,text/html,text/html
20091025200809,19980308054339,image/gif,image/gif
20091025200809,20090223023317,text/html,text/html
20091025200809,20001212220521,image/jpeg,image/jpeg
20091025200809,20071102161255,text/html,text/html
20091025200809,20090218194412,text/html,text/html
20091025200809,20090223021623,text/html,text/html
20091025200809,20090223021623,text/html,text/html
20091025200809,20001211234323,text/html,text/html
20091025200809,20090218194412,text/html,text/html
20091025200809,19990102025342,image/jpeg,image/jpeg
20091025200809,20090223022127,text/html,text/html
20091025200809,20090223022127,text/html,text/html
20091025200809,"",text/html,text/plain
20091025200809,"",text/html,text/plain
20091025200809,20000917151720,image/gif,image/gif

If you're good with this @ianmilligan1, let me know and I'll squash this all down and merge. Then I'll work on documentation updates and a release next week, as well as getting it pulled into ARCH.

Member

@ianmilligan1 ianmilligan1 left a comment

This is great @ruebot, and the GeoCities test is illuminating. It would be great to do a little study down the line to check the reliability of this approach.
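One possible starting point for such a check, sketched in plain Scala under the assumption that both columns are `yyyyMMddHHmmss` strings as in the output above. The `LastModifiedCheck` object, its verdict names, and the choice of categories are all hypothetical. It only flags values that are missing or that claim a modification time later than the crawl itself.

```scala
// Hypothetical reliability check; names and categories are illustrative.
object LastModifiedCheck {
  sealed trait Verdict
  case object Missing extends Verdict    // empty or not a 14-digit timestamp
  case object Plausible extends Verdict  // modified at or before the crawl
  case object AfterCrawl extends Verdict // claims a time later than the crawl

  // Equal-length digit strings in yyyyMMddHHmmss form sort chronologically,
  // so plain string comparison is enough here.
  def classify(crawlDate: String, lastModified: String): Verdict =
    if (lastModified.length != 14 || !lastModified.forall(_.isDigit)) Missing
    else if (lastModified <= crawlDate) Plausible
    else AfterCrawl

  // Count how many (crawl_date, last_modified_date) pairs fall in each category.
  def summarize(rows: Seq[(String, String)]): Map[Verdict, Int] =
    rows
      .groupBy { case (crawl, modified) => classify(crawl, modified) }
      .map { case (verdict, group) => verdict -> group.size }
}
```

The same classification could presumably be expressed as a column expression over the DataFrame; it is kept in plain Scala here so it can be tried without a cluster.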

@ruebot
Member Author

ruebot commented Nov 7, 2022

@ianmilligan1 yeah, as I was watching all the dates come in from GeoCities, my brain was swarmed with a whole bunch of new research questions to ask the dataset! I guess I should update this dataset again too 😃

@ruebot ruebot merged commit cdf8e76 into main Nov 7, 2022
@ruebot ruebot deleted the last-modified-headers branch November 7, 2022 17:43
ruebot added a commit to archivesunleashed/aut-docs that referenced this pull request Nov 9, 2022
- Update to include `last_modified_date` where applicable
- Rewrite text-analysis (extraction) documentation
ruebot added a commit to archivesunleashed/aut-docs that referenced this pull request Nov 16, 2022
- Update to include `last_modified_date` where applicable
- Rewrite text-analysis (extraction) documentation
Successfully merging this pull request may close these issues.

Include last modified date for a resource
3 participants