Change crawl_date format to YYYYMMDDHHMMSS, update hasDate filter. #526

ruebot · 2022-01-20T18:09:16Z

GitHub issue(s): #525

What does this Pull Request do?

Change crawl_date format to YYYYMMDDHHMMSS, update hasDate filter.

- Update hasDate filter to match patterns since it only matched literals previously
- Resolves #525 (Include timestamp in crawl date)
- Update tests as required
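
As a rough illustration of the literal-versus-pattern distinction, here is a plain Scala sketch with no Spark dependency. `CrawlDateMatch` and `matchesAny` are hypothetical names, and whole-string regex matching is an assumption about how the filter behaves, not a copy of the aut implementation.

```scala
// Hypothetical helper (not part of aut): a crawl date is kept when it
// matches at least one pattern. Assumes whole-string regex semantics.
object CrawlDateMatch {
  def matchesAny(crawlDate: String, patterns: Seq[String]): Boolean =
    patterns.exists(pattern => crawlDate.matches(pattern))

  def main(args: Array[String]): Unit = {
    val crawlDate = "20201124212851"
    println(matchesAny(crawlDate, Seq("20201124212851"))) // full literal
    println(matchesAny(crawlDate, Seq("2020.*")))         // year pattern
    println(matchesAny(crawlDate, Seq("2020")))           // bare prefix
  }
}
```

Under that assumption, a bare prefix such as `"2020"` would not match a 14-digit value on its own, which is why prefixes would need a trailing `.*`.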

How should this be tested?

Build system
Tested locally:

Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 3.1.1
      /_/
         
Using Scala version 2.12.10 (OpenJDK 64-Bit Server VM, Java 11.0.13)
Type in expressions to have them evaluated.
Type :help for more information.

scala> :paste
// Entering paste mode (ctrl-D to finish)

import io.archivesunleashed._
import io.archivesunleashed.udfs._

val test = RecordLoader.loadArchives("/home/nruest/Projects/au/sample-data/ars-cloud/in/14462/arcs",sc)
  .webpages()
  .select($"url", $"crawl_date")

// Exiting paste mode, now interpreting.

import io.archivesunleashed._
import io.archivesunleashed.udfs._
test: org.apache.spark.sql.DataFrame = [url: string, crawl_date: string]

scala> test.cache()
res0: test.type = [url: string, crawl_date: string]

scala> test.count()
res1: Long = 2811                                                              

scala> test.show(25)
+--------------------+--------------+
|                 url|    crawl_date|
+--------------------+--------------+
|https://www.youtu...|20201124212851|
|https://www.youtu...|20201124212903|
|https://www.youtu...|20201124212918|
|https://archivesu...|20201124212920|
|https://www.youtu...|20201124212930|
|https://archivesu...|20201224212956|
|https://www.ianmi...|20201224212643|
| https://schema.org/|20201224212809|
|https://www.ianmi...|20201224213124|
|https://www.ianmi...|20201224213146|
|https://www.ianmi...|20201224213212|
|https://www.ianmi...|20201224213230|
|https://www.ianmi...|20201224213300|
|https://www.ianmi...|20201224213319|
|https://www.ianmi...|20201224213335|
|https://www.ianmi...|20201224213353|
|https://www.ianmi...|20201224213425|
|https://www.ianmi...|20201224213443|
|https://www.youtu...|20201224213456|
|https://m.youtube...|20201224213516|
|https://www.ianmi...|20201224213546|
|https://www.ianmi...|20201224213627|
|https://www.ianmi...|20201224213701|
|https://www.ianmi...|20201224213727|
|https://www.ianmi...|20201224213757|
+--------------------+--------------+
only showing top 25 rows


scala> val date = Array("20201124212851")

scala> test.filter(hasDate($"crawl_date", lit(date))).count()
res4: Long = 1

scala> test.filter(!hasDate($"crawl_date", lit(date))).count()
res6: Long = 2810

scala> test.filter(hasDate($"crawl_date", lit(Array("20201224212.*")))).count()
res22: Long = 6

scala> test.filter(!hasDate($"crawl_date", lit(Array("20201224212.*")))).count()
res26: Long = 2805

scala> test.filter(hasDate($"crawl_date", lit(Array("2020.*")))).count()
res27: Long = 2625

scala> test.filter(!hasDate($"crawl_date", lit(Array("2020.*")))).count()
res28: Long = 186
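
For anyone consuming the new 14-digit values downstream, a minimal sketch of parsing them with `java.time` might look like the following. `CrawlDateParse` is a hypothetical helper, not something this PR adds, and no time zone is assumed.

```scala
import java.time.LocalDateTime
import java.time.format.DateTimeFormatter

// Hypothetical helper (not part of aut): parse a YYYYMMDDHHMMSS string.
// LocalDateTime carries no time zone, so none is implied here.
object CrawlDateParse {
  private val formatter = DateTimeFormatter.ofPattern("yyyyMMddHHmmss")

  def parse(crawlDate: String): LocalDateTime =
    LocalDateTime.parse(crawlDate, formatter)

  def main(args: Array[String]): Unit = {
    // First crawl_date value from the test.show(25) output.
    val parsed = parse("20201124212851")
    println(parsed) // 2020-11-24T21:28:51
  }
}
```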

Additional Notes:

I can cut a release if y'all want, and I have updated documentation ready to push up once this is merged.

Δ docs/filters-df.md

───────────────────────────────────────────────────┐
38: WebArchive(sc, sqlContext, "/path/to/warcs") \ │
───────────────────────────────────────────────────┘
│ 38 │                                                               │ 38 │
│ 39 │## Has Dates                                                   │ 39 │## Has Dates
│ 40 │                                                               │ 40 │
│ 41 │Filters or keeps all data that does or does not match the date→│ 41 │Filters or keeps all data that does or does not match the time→
│ 42 │                                                               │ 42 │
│ 43 │### Scala DF                                                   │ 43 │### Scala DF
│ 44 │                                                               │ 44 │

─────────────────────────────────────────────────────────────────────────────────┐
46: Filters or keeps all data that does or does not match the date(s) specified. │
─────────────────────────────────────────────────────────────────────────────────┘
│ 46 │import io.archivesunleashed._                                  │ 46 │import io.archivesunleashed._
│ 47 │import io.archivesunleashed.udfs._                             │ 47 │import io.archivesunleashed.udfs._
│ 48 │                                                               │ 48 │
│ 49 │val dates = Array("2008", "200908", "20070502")                │ 49 │val dates = Array("2008.*", "200908.*", "20070502231159")
│ 50 │                                                               │ 50 │
│ 51 │RecordLoader.loadArchives("/path/to/warcs",sc)                 │ 51 │RecordLoader.loadArchives("/path/to/warcs",sc)
│ 52 │  .all()                                                       │ 52 │  .all()

───────────────────────────────────────────────────┐
60: RecordLoader.loadArchives("/path/to/warcs",sc) │
───────────────────────────────────────────────────┘
│ 60 │from aut import *                                              │ 60 │from aut import *
│ 61 │from pyspark.sql.functions import col                          │ 61 │from pyspark.sql.functions import col
│ 62 │                                                               │ 62 │
│ 63 │dates = ["2008", "200908", "20070502"]                         │ 63 │dates = ["2008.*", "200908.*", "20070502231159"]
│ 64 │                                                               │ 64 │
│ 65 │WebArchive(sc, sqlContext, "/path/to/warcs") \                 │ 65 │WebArchive(sc, sqlContext, "/path/to/warcs") \
│ 66 │  .all() \                                                     │ 66 │  .all() \
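
The documentation change above swaps prefix dates for patterns. For existing scripts that still pass prefix-style dates, one possible migration is sketched below. `DatePatternMigration` and `toPattern` are hypothetical names that do not exist in aut, and the approach assumes that appending `.*` to a short all-digit date preserves its earlier prefix meaning.

```scala
// Hypothetical helper (not part of aut): turn pre-change prefix dates into
// patterns by appending ".*" to any all-digit value shorter than 14 digits.
object DatePatternMigration {
  private val FullLength = 14 // length of YYYYMMDDHHMMSS

  def toPattern(date: String): String =
    if (date.length < FullLength && date.forall(_.isDigit)) date + ".*"
    else date // full timestamps and existing patterns pass through unchanged

  def main(args: Array[String]): Unit = {
    val oldDates = Array("2008", "200908", "20070502")
    println(oldDates.map(toPattern).mkString(", "))
  }
}
```

Note that the updated docs replace `"20070502"` with a full timestamp, `"20070502231159"`, whereas this sketch would produce `"20070502.*"` to keep day-level matching.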

- Update hasDate filter to match patterns since it only matched literals previously
- Resolves #525
- Update tests as required

codecov · 2022-01-20T18:25:04Z

Codecov Report

Merging #526 (9f5a46b) into main (8104a65) will increase coverage by 0.07%.
The diff coverage is 87.09%.

@@             Coverage Diff              @@
##               main     #526      +/-   ##
============================================
+ Coverage     88.83%   88.91%   +0.07%     
  Complexity       57       57              
============================================
  Files            43       43              
  Lines          1012     1046      +34     
  Branches         85       86       +1     
============================================
+ Hits            899      930      +31     
- Misses           74       75       +1     
- Partials         39       41       +2

ruebot · 2022-01-20T18:47:14Z

@ianmilligan1 you can ignore the codecov/patch check.

ianmilligan1

Builds nicely locally and tested it out. 👍

ruebot requested a review from ianmilligan1 January 20, 2022 18:09

Change crawl_date format to YYYYMMDDHHMMSS, update hasDate filter.

9f5a46b

ruebot force-pushed the issue-525 branch from f44ae38 to 9f5a46b Compare January 20, 2022 18:11

ianmilligan1 approved these changes Jan 20, 2022

ianmilligan1 merged commit 73354e8 into main Jan 20, 2022

ianmilligan1 deleted the issue-525 branch January 20, 2022 19:16

ruebot added a commit to archivesunleashed/aut-docs that referenced this pull request Jan 20, 2022

Documentation update for archivesunleashed/aut#526

7137e6b
