Last modified headers #547
Codecov Report
Additional details and impacted files

    @@           Coverage Diff           @@
    ##             main    #547   +/-   ##
    =====================================
      Coverage     93.87%  93.88%
    - Complexity       48      49    +1
    =====================================
      Files            44      45    +1
      Lines           980    1030   +50
      Branches         52      55    +3
    =====================================
    + Hits            920     967   +47
    - Misses           36      38    +2
    - Partials         24      25    +1
Oh, that blew up at scale real quick 😅
It's what I feared: these dates are not going to conform to RFC 1123, and they'll be pretty wild.
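To illustrate the concern, here is a minimal sketch of what a strict RFC 1123 parse does with such values, using only `java.time` from the JDK. The helper name `strictRfc1123` is hypothetical and is not part of this PR; the first input is a well-formed variant written for illustration, while the other two are taken from the malformed samples quoted in this thread.

```scala
import java.time.ZonedDateTime
import java.time.format.DateTimeFormatter
import scala.util.Try

// Target format used throughout this PR.
val outFmt = DateTimeFormatter.ofPattern("yyyyMMddHHmmss")

// Hypothetical helper: Some(yyyyMMddHHmmss) only when the input is
// a well-formed RFC 1123 date, None otherwise.
def strictRfc1123(s: String): Option[String] =
  Try(ZonedDateTime.parse(s.trim, DateTimeFormatter.RFC_1123_DATE_TIME))
    .toOption
    .map(_.format(outFmt))

strictRfc1123("Fri, 23 Oct 2009 20:08:30 GMT") // well-formed: Some("20091023200830")
strictRfc1123("Fri, 23 Oct 2009 20:0830 GMT")  // missing colon: None
strictRfc1123("Tue Aug 22 17:18:37 2000 GMT")  // asctime-like ordering: None
```

A strict parser silently drops every value that deviates from the format, which is why a more tolerant conversion is worth considering for crawled data.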
Here's a fuzzier approach to convert these dates (RFC 1123 or similar) into `yyyyMMddHHmmss`:

    // Map each lowercase month abbreviation to its zero-padded number ("jan" -> "01").
    val months = Seq("jan", "feb", "mar", "apr", "may", "jun", "jul", "aug", "sep", "oct", "nov", "dec").zipWithIndex.map { case (s, d) => (s, ("0" + (d + 1)).takeRight(2)) }

    """
    Fri, 23 Oct 2009 20:0830 GMT
    Sat, 24 Oct 2009 14:3224 GMT
    Tue Aug 22 17:18:37 2000 GMT
    Thu, 22 Oct 2009 23:1407 GMT
    Sun, 25 Oct 2009 00:3030 GMT
    """.split("\n").map(_.trim).filter(_.nonEmpty).flatMap { str =>
      val lc = str.toLowerCase
      months.find(m => lc.contains(m._1)).map(_._2).flatMap { m =>
        // Drop the weekday, strip colons, then pick fields by length:
        // 4 characters = year, 2 = day, 6 = time (HHmmss).
        val d = str.replace(":", "").split(' ').drop(1).map(d => (d.length, d)).toMap
        for (y <- d.get(4); n <- d.get(2); t <- d.get(6)) yield y + m + n + t
      }
    }

Output:

    Array(
      "20091023200830",
      "20091024143224",
      "20000822171837",
      "20091022231407",
      "20091025003030"
    )
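One possible way to make the fuzzy conversion reusable is to wrap it in a function and check that the assembled string is a real date-time before returning it. This is only a sketch under stated assumptions: `fuzzyTimestamp` and `monthNumbers` are hypothetical names, and the final validation step with `java.time` is an addition for illustration, not something this PR is known to do.

```scala
import java.time.LocalDateTime
import java.time.format.{DateTimeFormatter, ResolverStyle}
import scala.util.Try

// Lowercase month abbreviation -> zero-padded month number.
val monthNumbers: Seq[(String, String)] =
  Seq("jan", "feb", "mar", "apr", "may", "jun",
      "jul", "aug", "sep", "oct", "nov", "dec")
    .zipWithIndex
    .map { case (name, i) => (name, f"${i + 1}%02d") }

// Strict resolver so impossible values (hour 99, 31 February) are rejected.
val stamp = DateTimeFormatter
  .ofPattern("uuuuMMddHHmmss")
  .withResolverStyle(ResolverStyle.STRICT)

// Hypothetical helper: same length-based field picking as the snippet above,
// plus a validity check (an assumption added for this sketch).
def fuzzyTimestamp(raw: String): Option[String] = {
  val lc = raw.toLowerCase
  for {
    month <- monthNumbers.collectFirst { case (name, num) if lc.contains(name) => num }
    tokens = raw.replace(":", "").split("\\s+").drop(1).map(t => t.length -> t).toMap
    year  <- tokens.get(4) // four characters: year
    day   <- tokens.get(2) // two characters: day of month
    time  <- tokens.get(6) // six characters: HHmmss
    candidate = year + month + day + time
    _ <- Try(LocalDateTime.parse(candidate, stamp)).toOption
  } yield candidate
}

fuzzyTimestamp("Fri, 23 Oct 2009 20:0830 GMT") // Some("20091023200830")
fuzzyTimestamp("Tue Aug 22 17:18:37 2000 GMT") // Some("20000822171837")
fuzzyTimestamp("Fri, 23 Oct 2009 99:9999 GMT") // None: not a valid time
```

Returning `Option[String]` keeps unparseable headers distinguishable from parsed ones. Month detection here is still a plain substring match, so it inherits the fuzziness of the original approach.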
The at-scale test with GeoCities was successful.

    import io.archivesunleashed._

    val data = "/tuna1/scratch/nruest/geocites/warcs"
    val test = RecordLoader.loadArchives(data, sc).all().select($"crawl_date", $"last_modified_date", $"mime_type_web_server", $"mime_type_tika")
    test.write.option("timestampFormat", "yyyy/MM/dd HH:mm:ss ZZ").format("csv").option("escape", "\"").option("encoding", "utf-8").save("/tuna1/scratch/nruest/aut-547-test")

If you're good with this @ianmilligan1, let me know and I'll squash this all down and merge. Then I'll work on documentation updates and a release next week, as well as getting it pulled into ARCH.
This is great @ruebot, and the GeoCities test is illuminating. Down the line, it would be great to do a little study to check the reliability of this approach.
@ianmilligan1 yeah, as I was looking at all the dates coming in from GeoCities, my brain was swarmed with a whole bunch of new research questions to ask the dataset! I guess I should update this dataset again too 😃
- Update to include `last_modified_date` where applicable
- Rewrite text-analysis (extraction) documentation
GitHub issue(s): #546
What does this Pull Request do?
Implements extracting `last_modified_date` of a resource where available.

- `getLastModified` for `SparklingArchiveRecord`
- `CovertLastModifiedDate` to convert RFC 1123 dates to `yyyyMMddHHmmss`
- `last_modified_date` column for
  - `.all()`
  - `.webpages()`
  - `.images()`
  - `.pdfs()`
  - `.audio()`
  - `.videos()`
  - `.spreadsheets()`
  - `.presentationProgramFiles()`
  - `.wordProcessorFiles()`
  - `.css()`
  - `.html()`
  - `.js()`
  - `.json()`
  - `.plainText()`
  - `.xml()`
Example:
How should this be tested?
Additional Notes:
This is going to require A LOT of documentation updates.
Interested parties
@digitalshawn 👋