
Refactor xpath scraper code. Add fixed and map #616

Merged 20 commits into stashapp:develop on Jul 21, 2020

Conversation

WithoutPants
Collaborator

Refactored the xpath code so that it uses structs instead of generic maps.

Changed the post-processing format so that actions can be performed in any order. Backward compatibility is maintained with the existing post-processing actions (parseDate, replace, subScraper), but these should be considered deprecated. New post-processing actions should only be implemented in the postProcess field.

Example of the old vs new structure:

# old
Birthdate: 
  selector: //span[@itemprop="birthDate"]
  parseDate: Jan 2, 2006

# new
Birthdate: 
  selector: //span[@itemprop="birthDate"]
  postProcess:
    - parseDate: Jan 2, 2006
    - # add other post-process actions as needed
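
To make the ordering concrete, here is a minimal Go sketch of how an ordered postProcess list might be modelled and applied; the names (postProcessAction, postProcessParseDate, applyPostProcess) are illustrative assumptions, not necessarily the identifiers used in this PR.

// Illustrative sketch only: one way to model postProcess as an ordered list
// of actions. Names are assumptions, not this PR's actual identifiers.
package scraper

import "time"

// postProcessAction is a single post-process step applied to a scraped value.
type postProcessAction interface {
    Apply(value string) string
}

// postProcessParseDate holds a Go date layout, e.g. "Jan 2, 2006".
type postProcessParseDate string

func (p postProcessParseDate) Apply(value string) string {
    t, err := time.Parse(string(p), value)
    if err != nil {
        return value // leave the value untouched if it doesn't parse
    }
    return t.Format("2006-01-02")
}

// applyPostProcess runs the actions in the order they appear in the YAML list.
func applyPostProcess(value string, actions []postProcessAction) string {
    for _, a := range actions {
        value = a.Apply(value)
    }
    return value
}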

Adds a Fixed field. This field is used in place of a selector and resolves to the given string value. For example:

Gender:
  Fixed: Female

Adds a map post-processing action. This is only allowed in the new postProcess field. It maps values to other values. Example:

Gender:
  selector: //div[@class="example element"]
  postProcess:
    - map:
        F: Female
        M: Male

The above config maps F to Female and M to Male, eliminating the need for multiple regex replace actions.
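
As a rough illustration of how the fixed field and the map action could look on the Go side, here is a hedged sketch; the type names (fixedAttrConfig, postProcessMap) are hypothetical and not taken from this PR.

// Illustrative sketch only: hypothetical shapes for the fixed field and the
// map post-process action described above.
package scraper

// fixedAttrConfig is used in place of a selector and resolves to a constant.
type fixedAttrConfig struct {
    Fixed string `yaml:"fixed"`
}

func (c fixedAttrConfig) resolve() string {
    return c.Fixed
}

// postProcessMap maps scraped values to replacements, e.g. F -> Female;
// values with no mapping pass through unchanged.
type postProcessMap map[string]string

func (m postProcessMap) Apply(value string) string {
    if mapped, ok := m[value]; ok {
        return mapped
    }
    return value
}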

Resolves #457

@WithoutPants WithoutPants added the improvement Something needed tweaking. label Jun 17, 2020
@WithoutPants WithoutPants added this to the Version 0.3.0 milestone Jun 17, 2020
@WithoutPants
Collaborator Author

Going to mark as draft until #579 is merged.

@WithoutPants WithoutPants marked this pull request as draft June 17, 2020 00:10
@WithoutPants
Collaborator Author

Merged and adjusted for #579 changes.

@WithoutPants WithoutPants marked this pull request as ready for review June 18, 2020 04:13
@bnkai
Collaborator

bnkai commented Jun 27, 2020

I finished doing some regression tests using this build and the community scrapers (tested ~150 URLs).
One issue I found is evident with this scraper: https://github.com/stashapp/CommunityScrapers/blob/master/scrapers/Xartxxx.yml
With this build, the Details field gets an extra space at the start for that site (compared to dev). I think the post-processing after the concat is missing or something; I went through the code but didn't notice anything suspicious. Maybe something went wrong when converting to the new format?
Sample URLs that show the issue:

https://www.xart.xxx/anny-aurora-and-niki-surprise-sex-for-three/
https://www.xart.xxx/izzy-hot-russian-fucking/
https://www.xart.xxx/mary-kalisy-summertime-sex/
https://www.xart.xxx/rebecca-and-viktoria-cum-in-for-an-orgy/
https://www.xart.xxx/susie-sexy-movies-cum-inside/
https://www.xart.xxx/tiffany-and-amaris-sex-and-fashion-a-threeway-project/
https://www.xart.xxx/veronica-when-you-least-expect-it/
https://www.xart.xxx/zazie-hot-office-sex/

@bnkai
Collaborator

bnkai commented Jun 27, 2020

The example for the Fixed field should be:

Gender:
  fixed: Female

And here are a couple of the scrapers requested in stashapp/CommunityScrapers#73, written in the new format:

name: "hucows"
sceneByURL:
  - action: scrapeXPath
    url:
      - hucows.com
    scraper: sceneScraper
xPathScrapers:
  sceneScraper:
    scene:
      Title: //h1[@class="entry-title"]
      Details: //div[@class="entry-content"]/p
      Date:
        selector: //div[@itemprop="datePublished"]//text()
        concat: " "
        postProcess:
          - replace:
              - regex: ^\s+
                with:
          - parseDate: 2 Jan 2006
      Image: //center/a[1]/img[@class="lightboxhover"]/@src
      Studio:
        Name:
          fixed: Hucows
      Tags: 
        Name: //section[@id="categories-2"]//a
# Last Updated June 27, 2020

name: "xhamster"
sceneByURL:
  - action: scrapeXPath
    url:
      - xhamster.com
    scraper: sceneScraper
xPathScrapers:
  sceneScraper:
    common: 
      $player: //div[@class="width-wrap with-player-container"]
    scene:
      Title: $player/h1
      Details: //div[@class="ab-info controls-info__item xh-helper-hidden"]/p
      Date:
        selector: //div[@class="entity-info-container__date tooltip-nocache"]/@data-tooltip
        postProcess:
          - replace:
              - regex: (\d{4}-\d{2}-\d{2})\s.+
                with: $1
          - parseDate: 2006-01-02
      Image: //link[@as="image"]/@href
      Studio:
        Name:
          selector: $player/ul/li/a[contains(@href,"/channels/")]
      Tags: 
        Name: 
          selector: $player/ul/li/a[contains(@href,"/categories/")]
      Performers:
        Name:
          selector: $player/ul/li/a[contains(@href,"/pornstars/")]

# Last Updated June 27, 2020

EDIT
Adding an invalid entry in the postProcess field causes stash to panic; I think that with the current implementation, anything that is not valid just gets ignored. Examples (a defensive-parsing sketch follows them):

Birthdate: 
  selector: //span[@itemprop="birthDate"]
  postProcess:
    - parseDate: Jan 2, 2006
    - anything:

Birthdate: 
  selector: //span[@itemprop="birthDate"]
  postProcess:
    - concat: " "
    - parseDate: Jan 2, 2006    
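
For the panic, a defensive-parsing sketch along these lines would reject unknown actions with an error instead of crashing; the function and type names below are hypothetical, not stash's actual code.

// Illustrative sketch only: reject unknown postProcess keys with an error
// instead of panicking. Names are hypothetical, not stash's actual code.
package scraper

import "fmt"

// action is a post-process step applied to a scraped string value.
type action func(value string) string

// buildAction converts one decoded postProcess list entry into an action,
// returning an error for any key it does not recognise (e.g. "anything").
func buildAction(entry map[string]interface{}) (action, error) {
    for key := range entry {
        switch key {
        case "parseDate", "replace", "subScraper", "map":
            // real conversion logic elided for brevity
            return func(v string) string { return v }, nil
        default:
            return nil, fmt.Errorf("invalid postProcess action %q", key)
        }
    }
    return nil, fmt.Errorf("empty postProcess entry")
}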

@WithoutPants
Collaborator Author

The commits aren't showing in the feed properly, but I did another significant refactor. I've converted it to use the same cache-style system that the plugin system PR uses. I've moved a lot of generic code into mapped.go, which I plan to reuse for an upcoming json scraping change.
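
A guess at what the shared abstraction in mapped.go might look like, so the same mapped config could later drive a JSON backend as well; the interface and names below are assumptions, not the file's actual contents.

// Illustrative sketch only: a guess at the kind of abstraction that would let
// mapped config be shared between xpath and a future json scraper.
package scraper

// mappedQuery abstracts "run this selector against a document and return the
// matching string values", regardless of whether the backend is xpath over
// HTML or a path query over JSON.
type mappedQuery interface {
    runQuery(selector string) []string
}

// mappedScraper holds per-field selectors and can scrape via any backend.
type mappedScraper struct {
    fields map[string]string // field name -> selector
}

func (s mappedScraper) scrape(q mappedQuery) map[string][]string {
    ret := make(map[string][]string)
    for field, selector := range s.fields {
        ret[field] = q.runQuery(selector)
    }
    return ret
}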

I've rethought the whitespace removal code. It now removes leading and trailing whitespace, and currently will only remove newlines when it first parses the node text. I'll need another round of regression tests, and I'll also consider disabling the newline removal when a specific flag is set on the attribute, but I'll need an example of when it is needed.
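
A minimal sketch of the described trimming behaviour (strip leading/trailing whitespace; remove newlines when the node text is first parsed); exactly how newlines are replaced is an assumption.

// Illustrative sketch only: clean node text when it is first parsed, as
// described above. Replacing newlines with a single space is an assumption.
package scraper

import (
    "regexp"
    "strings"
)

var newlineRE = regexp.MustCompile(`\s*\n\s*`)

// cleanNodeText is applied once, when the text of a matched node is read.
func cleanNodeText(s string) string {
    s = newlineRE.ReplaceAllString(s, " ") // remove newlines and surrounding runs of space
    return strings.TrimSpace(s)            // strip leading/trailing whitespace
}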

@bnkai
Collaborator

bnkai commented Jul 13, 2020

Just finished the regression tests. I haven't gone through the code yet.

Most differences I found were extra leading or trailing whitespace in some scrapers.
URLs tested:
scenes - https://pastebin.com/yq26r9RX
performers - https://pastebin.com/3YLSdCdG

Scrapers that had differences in results:
SCENES
21naturals -> https://pastebin.com/EcuTDu4Z
The only difference is for a single URL: https://www.21naturals.com/en/video/Under-The-Coat/144299

103c103
<     Details: Kinky Lana Roy is excited to meet up with her hot boyfriend Raul Costa. Lana surprises him by wearing lingerie under her new coat. She wraps her hands around his throbbing cock and deep throats it. Lana then presents her round ass for Raul to fuck. She takes every inch of his cock in her tight asshole and moans in pleasure before getting a hot blast of cum all over her pussy and ass.
---
>     Details: 'Kinky Lana Roy is excited to meet up with her hot boyfriend Raul Costa. Lana surprises him by wearing lingerie under her new coat. She wraps her hands around his throbbing cock and deep throats it. Lana then presents her round ass for Raul to fuck. She takes every inch of his cock in her tight asshole and moans in pleasure before getting a hot blast of cum all over her pussy and ass. '

Notice the trailing whitespace in the details field for the refactor build.

cherrypimps -> https://pastebin.com/NgKABVrS
Again, a single trailing whitespace for the refactor build, in this URL: https://cherrypimps.com/trailers/16588-kaneziereeves-vinasky.html

22c22
<     Title: Wild Girls Kenzie Reeves And Vina Sky
---
>     Title: 'Wild Girls Kenzie Reeves And Vina Sky '

mypervmom -> https://pastebin.com/upbRP0Mb
All scene details for the tested scenes get an extra leading whitespace with the refactor build.
diff file -> https://pastebin.com/GQafkRmX

PERFORMERS
freeones
All performers tested get an extra leading whitespace in their birthdate field (failing the parseDate as a result).
diff file -> https://pastebin.com/kGFNgwnR

babepedia -> https://pastebin.com/pqbfxb4v
All performers tested get an extra leading or trailing whitespace, either in aliases or height.
diff file -> https://pastebin.com/v2HTDdsu

EDIT
For the newline removal, I have to get back to you after I go over the code changes. The only useful case where it is needed is the description/details field; in that case it would be helpful to keep the newlines, even from the first node parse. (This was not easy to do with the original code prior to this refactor.) I'll look for some examples after looking over the latest refactor code.

@WithoutPants
Collaborator Author

Ideally, I'd leave it as is and change the scrapers - the replace regex should include the space to remove. However, I obviously don't want to break existing behaviour, so I've committed a change to trim the whitespace after replace operations. It shouldn't cause any problems down the track, but it's the sort of magic behaviour I'd like to avoid.
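
A tiny sketch of that compatibility shim, assuming a hypothetical replaceThenTrim helper; the real change may differ.

// Illustrative sketch only: trim whitespace after a replace operation so that
// existing scrapers keep working. Names are hypothetical.
package scraper

import (
    "regexp"
    "strings"
)

func replaceThenTrim(value, pattern, with string) string {
    re := regexp.MustCompile(pattern)
    value = re.ReplaceAllString(value, with)
    return strings.TrimSpace(value) // trim whitespace after the replace operation
}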

@bnkai
Collaborator

bnkai commented Jul 14, 2020

> Ideally, I'd leave it as is and change the scrapers - the replace regex should include the space to remove. However, I obviously don't want to break existing behaviour, so I've committed a change to trim the whitespace after replace operations. It shouldn't cause any problems down the track, but it's the sort of magic behaviour I'd like to avoid.

If it wasn't for the babepedia and freeones scrapers, I would opt for removing the unpredictable "magic" behaviour, even if it means changing/fixing the rest of the scene scrapers. Can we revisit this issue (the space removal) after 0.3?

@bnkai
Collaborator

bnkai commented Jul 18, 2020

With the latest commit it seems to be OK (backwards compatible) with the existing scrapers, at least with the tested URLs.

@WithoutPants WithoutPants merged commit 2b92157 into stashapp:develop Jul 21, 2020
Tweeticoats pushed a commit to Tweeticoats/stash that referenced this pull request Feb 1, 2021
* Refactor xpath scraper code
* Make post-process a list
* Add map post-process action
* Add fixed xpath values
* Refactor scrapers into cache
* Refactor into mapped config
* Trim test html
@WithoutPants WithoutPants deleted the refactor-xpath branch February 4, 2021 03:01