Include last modified date for a resource #546

Closed
ruebot opened this issue Oct 31, 2022 · 2 comments · Fixed by #547

Comments

@ruebot

ruebot commented Oct 31, 2022

I'm working on better date support for a given resource in the last-modified-headers branch. It requires an upstream update to Sparkling.

This is what it looks like now:

scala> RecordLoader.loadArchives(data, sc).all().select($"crawl_date", $"last_modified").show(10, false)
[2022-10-07T18:40:00.043Z - 00002 - HdfsIO] Opening file file:/home/nruest/Projects/au/sample-data/geocities/GEOCITIES-20091027143300-00114-ia400112.us.archive.org.warc.gz (Offset: 0, length: 0, decompress: false, strategy: BlockWise [dynamic])
+--------------+-----------------------------+
|crawl_date    |last_modified                |
+--------------+-----------------------------+
|20091027143300|                             |
|20091027143259|Sat, 23 Sep 2000 23:34:54 GMT|
|20091027143259|Fri, 13 Sep 2002 16:30:29 GMT|
|20091027143300|Mon, 11 Feb 2002 15:45:53 GMT|
|20091027143259|Sat, 19 Sep 1998 16:47:03 GMT|
|20091027143259|Fri, 25 Jan 2008 15:03:03 GMT|
|20091027143300|Fri, 21 Sep 2001 22:46:58 GMT|
|20091027143258|Thu, 09 Oct 2008 01:52:03 GMT|
|20091027143300|                             |
|20091027143259|Tue, 16 Apr 2002 14:51:03 GMT|
+--------------+-----------------------------+
only showing top 10 rows

The big question before implementing this: do we want to keep the original date format from the headers, or convert it to YYYYMMDDHHMMSS? The only reason I haven't moved forward is the potential for locale-specific formats in the header values. According to the spec, the timezone should always be GMT, but I think the day and month names could appear in any language, so things could get tricky there. So, should we convert to YYYYMMDDHHMMSS, or leave it up to the researcher to modify the date format?
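To make the conversion question concrete, here is a minimal sketch of turning a `Last-Modified` header value into the same `yyyyMMddHHmmss` shape as `crawl_date`. It uses only `java.time` from the JDK. The object and method names (`LastModifiedSketch`, `toCrawlDateFormat`) are hypothetical and are not part of aut or Sparkling. `DateTimeFormatter.RFC_1123_DATE_TIME` only accepts English day and month names, so in this sketch a header written in another language, or a missing header, simply becomes an empty string.

```scala
import java.time.{ZoneOffset, ZonedDateTime}
import java.time.format.{DateTimeFormatter, DateTimeParseException}

// Hypothetical helper for illustration; not the aut implementation.
object LastModifiedSketch {
  // Same fixed-width layout as the crawl_date column.
  private val outputFormat = DateTimeFormatter.ofPattern("yyyyMMddHHmmss")

  // Returns the header as yyyyMMddHHmmss in UTC, or "" when the header
  // is missing or cannot be parsed as an RFC 1123 date.
  def toCrawlDateFormat(header: String): String = {
    if (header == null || header.trim.isEmpty) {
      ""
    } else {
      try {
        ZonedDateTime
          .parse(header.trim, DateTimeFormatter.RFC_1123_DATE_TIME)
          .withZoneSameInstant(ZoneOffset.UTC)
          .format(outputFormat)
      } catch {
        // Assumption: unparseable values are treated like a missing header.
        case _: DateTimeParseException => ""
      }
    }
  }
}
```

For example, `LastModifiedSketch.toCrawlDateFormat("Sat, 23 Sep 2000 23:34:54 GMT")` gives `20000923233454`. Returning an empty string on failure is one possible choice; keeping the raw header value for unparseable dates would be another.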

@ianmilligan1

I have to admit I don’t fully understand the diversity of things that we could find in the headers! My gut tells me that we should standardize/convert to YYYYMMDDHHMMSS. But that is not a strongly held conviction. What do you think @ruebot ?

@ruebot

ruebot commented Nov 2, 2022

If we're trying to help researchers, and I'm thinking of folks that have gone through our cohorts, we should format it the same as the crawl_date. So, I created a new matchbox utility that just formats the date. This is what it looks like now (I haven't pushed up this work yet):

import io.archivesunleashed._
val data = "/home/nruest/Projects/au/sample-data/geocities/GEOCITIES-20091027143300-00114-ia400112.us.archive.org.warc.gz"
RecordLoader.loadArchives(data, sc).all().select($"crawl_date", $"last_modified").show(20, false)

// Exiting paste mode, now interpreting.

[2022-11-02T04:38:50.287Z - 00000 - HdfsIO] Opening file file:/home/nruest/Projects/au/sample-data/geocities/GEOCITIES-20091027143300-00114-ia400112.us.archive.org.warc.gz (Offset: 0, length: 0, decompress: false, strategy: BlockWise [dynamic])
+--------------+--------------+                                                 
|crawl_date    |last_modified |
+--------------+--------------+
|20091027143300|              |
|20091027143259|20000923233454|
|20091027143259|20020913163029|
|20091027143300|20020211154553|
|20091027143259|19980919164703|
|20091027143259|20080125150303|
|20091027143300|20010921224658|
|20091027143258|20081009015203|
|20091027143300|              |
|20091027143259|20020416145103|
|20091027143300|20090223022835|
|20091027143300|20030928090558|
|20091027143300|20091027143300|
|20091027143300|20021203212451|
|20091027143300|              |
|20091027143300|20040530033010|
|20091027143300|              |
|20091027143259|20090223022352|
|20091027143300|              |
|20091027143300|20010608202736|
+--------------+--------------+
only showing top 20 rows

import io.archivesunleashed._
data: String = /home/nruest/Projects/au/sample-data/geocities/GEOCITIES-20091027143300-00114-ia400112.us.archive.org.warc.gz
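One practical effect of giving `last_modified` the same format as `crawl_date` is that the two columns can be compared directly. The sketch below uses plain Scala and `java.time` on a few rows copied from the output above to compute how old a resource was at crawl time. The names `CompareDatesSketch` and `ageInDays` are hypothetical, and this is not something aut provides.

```scala
import java.time.LocalDateTime
import java.time.format.DateTimeFormatter
import java.time.temporal.ChronoUnit

// Hypothetical helper for illustration; not part of aut.
object CompareDatesSketch {
  private val fmt = DateTimeFormatter.ofPattern("yyyyMMddHHmmss")

  // A few (crawl_date, last_modified) pairs taken from the table above.
  val rows: Seq[(String, String)] = Seq(
    ("20091027143300", ""),
    ("20091027143259", "20000923233454"),
    ("20091027143300", "20091027143300")
  )

  // Whole days between last_modified and crawl_date,
  // or None when last_modified is empty.
  def ageInDays(crawlDate: String, lastModified: String): Option[Long] = {
    if (lastModified.isEmpty) {
      None
    } else {
      val crawled = LocalDateTime.parse(crawlDate, fmt)
      val modified = LocalDateTime.parse(lastModified, fmt)
      Some(ChronoUnit.DAYS.between(modified, crawled))
    }
  }

  def main(args: Array[String]): Unit = {
    rows.foreach { case (crawl, modified) =>
      println(s"$crawl $modified ${ageInDays(crawl, modified)}")
    }
  }
}
```

Because both values are fixed-width digit strings, simple string comparison (`lastModified <= crawlDate`) also orders them chronologically, which is convenient for filtering.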

@digitalshawn you were basically looking for this, right? If so, I'm sorry. I completely missed that I could grab this date from the headers, and it would have really helped out your team.

ruebot added a commit that referenced this issue Nov 7, 2022
…ble. (#547)

* Adds `getLastModified` for `SparklingArchiveRecord`
* Adds `CovertLastModifiedDate` to convert RFC 1123 dates to `yyyyMMddHHmmss`
  * See: https://developer.mozilla.org/en-US/docs/Web/HTTP/Headers/Last-Modified
* Implement `last_modified_date` column for
  * `.all()`
  * `.webpages()`
  * `.images()`
  * `.pdfs()`
  * `.audio()`
  * `.videos()`
  * `.spreadsheets()`
  * `.presentationProgramFiles()`
  * `.wordProcessorFiles()`
  * `.css()`
  * `.html()`
  * `.js()`
  * `.json()`
  * `.plainText()`
  * `.xml()`
* Update tests
* Resolves #546