
Refactor xpath scraper code. Add fixed and map #616

Merged 20 commits into stashapp:develop on Jul 21, 2020

Conversation

WithoutPants
Collaborator

Refactored the xpath code so that it uses structs instead of generic maps.

Changed the post-processing format so that actions can be performed in any order. Backward compatibility is maintained with the existing post-processing actions (parseDate, replace, subScraper), but these should be considered deprecated. New post-processing actions should only be implemented in the postProcess field.

Example of the old vs new structure:

# old
Birthdate: 
  selector: //span[@itemprop="birthDate"]
  parseDate: Jan 2, 2006

# new
Birthdate: 
  selector: //span[@itemprop="birthDate"]
  postProcess:
    - parseDate: Jan 2, 2006
    - # add other post-process actions as needed
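
To make the ordering concrete, here is a minimal Go sketch of how an ordered postProcess list might be modelled and applied; the names (postProcessAction, postProcessParseDate, applyPostProcess) are illustrative assumptions, not necessarily the identifiers used in this PR.

// Illustrative sketch only: one way to model postProcess as an ordered list
// of actions. Names are assumptions, not this PR's actual identifiers.
package scraper

import "time"

// postProcessAction is a single post-process step applied to a scraped value.
type postProcessAction interface {
    Apply(value string) string
}

// postProcessParseDate holds a Go date layout, e.g. "Jan 2, 2006".
type postProcessParseDate string

func (p postProcessParseDate) Apply(value string) string {
    t, err := time.Parse(string(p), value)
    if err != nil {
        return value // leave the value untouched if it doesn't parse
    }
    return t.Format("2006-01-02")
}

// applyPostProcess runs the actions in the order they appear in the YAML list.
func applyPostProcess(value string, actions []postProcessAction) string {
    for _, a := range actions {
        value = a.Apply(value)
    }
    return value
}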

Adds a Fixed field. This field is used in place of a selector and resolves to the given string value. For example:

Gender:
  Fixed: Female

Adds a map post-processing action. This is only allowed in the new postProcess field. It maps values to other values. Example:

Gender:
  selector: //div[@class="example element"]
  postProcess:
    - map:
        F: Female
        M: Male

The above config maps F to Female and M to Male, eliminating the need for multiple regex replace actions.
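
As a rough illustration of how the fixed field and the map action could look on the Go side, here is a hedged sketch; the type names (fixedAttrConfig, postProcessMap) are hypothetical and not taken from this PR.

// Illustrative sketch only: hypothetical shapes for the fixed field and the
// map post-process action described above.
package scraper

// fixedAttrConfig is used in place of a selector and resolves to a constant.
type fixedAttrConfig struct {
    Fixed string `yaml:"fixed"`
}

func (c fixedAttrConfig) resolve() string {
    return c.Fixed
}

// postProcessMap maps scraped values to replacements, e.g. F -> Female;
// values with no mapping pass through unchanged.
type postProcessMap map[string]string

func (m postProcessMap) Apply(value string) string {
    if mapped, ok := m[value]; ok {
        return mapped
    }
    return value
}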

Resolves #457

@WithoutPants WithoutPants added the improvement Something needed tweaking. label Jun 17, 2020
@WithoutPants WithoutPants added this to the Version 0.3.0 milestone Jun 17, 2020
@WithoutPants
Collaborator Author

Going to mark as draft until #579 is merged.

@WithoutPants WithoutPants marked this pull request as draft June 17, 2020 00:10
@WithoutPants
Collaborator Author

Merged and adjusted for #579 changes.

@WithoutPants WithoutPants marked this pull request as ready for review June 18, 2020 04:13
@bnkai
Collaborator

bnkai commented Jun 27, 2020

I finished doing some regression tests using this build and the community scrapers (tested ~150 URLs).
One issue I found is evident with this scraper: https://github.com/stashapp/CommunityScrapers/blob/master/scrapers/Xartxxx.yml
With this build, the Details field gets an extra space at the start for that site (compared to dev). I think the post-processing after the concat is missing or something; I went through the code but didn't notice anything suspicious. Maybe something went wrong when converting to the new format?
Sample URLs that show the issue:

https://www.xart.xxx/anny-aurora-and-niki-surprise-sex-for-three/
https://www.xart.xxx/izzy-hot-russian-fucking/
https://www.xart.xxx/mary-kalisy-summertime-sex/
https://www.xart.xxx/rebecca-and-viktoria-cum-in-for-an-orgy/
https://www.xart.xxx/susie-sexy-movies-cum-inside/
https://www.xart.xxx/tiffany-and-amaris-sex-and-fashion-a-threeway-project/
https://www.xart.xxx/veronica-when-you-least-expect-it/
https://www.xart.xxx/zazie-hot-office-sex/

@bnkai
Collaborator

bnkai commented Jun 27, 2020

The example for the Fixed field should be:

Gender:
  fixed: Female

And here are a couple of the scrapers requested in stashapp/CommunityScrapers#73, written in the new format:

name: "hucows"
sceneByURL:
  - action: scrapeXPath
    url:
      - hucows.com
    scraper: sceneScraper
xPathScrapers:
  sceneScraper:
    scene:
      Title: //h1[@class="entry-title"]
      Details: //div[@class="entry-content"]/p
      Date:
        selector: //div[@itemprop="datePublished"]//text()
        concat: " "
        postProcess:
          - replace:
              - regex: ^\s+
                with:
          - parseDate: 2 Jan 2006
      Image: //center/a[1]/img[@class="lightboxhover"]/@src
      Studio:
        Name:
          fixed: Hucows
      Tags: 
        Name: //section[@id="categories-2"]//a
# Last Updated June 27, 2020

name: "xhamster"
sceneByURL:
  - action: scrapeXPath
    url:
      - xhamster.com
    scraper: sceneScraper
xPathScrapers:
  sceneScraper:
    common: 
      $player: //div[@class="width-wrap with-player-container"]
    scene:
      Title: $player/h1
      Details: //div[@class="ab-info controls-info__item xh-helper-hidden"]/p
      Date:
        selector: //div[@class="entity-info-container__date tooltip-nocache"]/@data-tooltip
        postProcess:
          - replace:
              - regex: (\d{4}-\d{2}-\d{2})\s.+
                with: $1
          - parseDate: 2006-01-02
      Image: //link[@as="image"]/@href
      Studio:
        Name:
          selector: $player/ul/li/a[contains(@href,"/channels/")]
      Tags: 
        Name: 
          selector: $player/ul/li/a[contains(@href,"/categories/")]
      Performers:
        Name:
          selector: $player/ul/li/a[contains(@href,"/pornstars/")]

# Last Updated June 27, 2020

EDIT
Adding an invalid entry in the postProcess field causes stash to panic; I think that with the current implementation, anything that is not valid just gets ignored. Examples (a defensive-parsing sketch follows them):

Birthdate: 
  selector: //span[@itemprop="birthDate"]
  postProcess:
    - parseDate: Jan 2, 2006
    - anything:

Birthdate: 
  selector: //span[@itemprop="birthDate"]
  postProcess:
    - concat: " "
    - parseDate: Jan 2, 2006    
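
For the panic, a defensive-parsing sketch along these lines would reject unknown actions with an error instead of crashing; the function and type names below are hypothetical, not stash's actual code.

// Illustrative sketch only: reject unknown postProcess keys with an error
// instead of panicking. Names are hypothetical, not stash's actual code.
package scraper

import "fmt"

// action is a post-process step applied to a scraped string value.
type action func(value string) string

// buildAction converts one decoded postProcess list entry into an action,
// returning an error for any key it does not recognise (e.g. "anything").
func buildAction(entry map[string]interface{}) (action, error) {
    for key := range entry {
        switch key {
        case "parseDate", "replace", "subScraper", "map":
            // real conversion logic elided for brevity
            return func(v string) string { return v }, nil
        default:
            return nil, fmt.Errorf("invalid postProcess action %q", key)
        }
    }
    return nil, fmt.Errorf("empty postProcess entry")
}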

@WithoutPants
Collaborator Author

The commits aren't showing in the feed properly, but I did another significant refactor. I've converted it to use the same cache-style system that the plugin system PR uses. I've moved a lot of generic code into mapped.go, which I plan to reuse for an upcoming json scraping change.
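
A guess at what the shared abstraction in mapped.go might look like, so the same mapped config could later drive a JSON backend as well; the interface and names below are assumptions, not the file's actual contents.

// Illustrative sketch only: a guess at the kind of abstraction that would let
// mapped config be shared between xpath and a future json scraper.
package scraper

// mappedQuery abstracts "run this selector against a document and return the
// matching string values", regardless of whether the backend is xpath over
// HTML or a path query over JSON.
type mappedQuery interface {
    runQuery(selector string) []string
}

// mappedScraper holds per-field selectors and can scrape via any backend.
type mappedScraper struct {
    fields map[string]string // field name -> selector
}

func (s mappedScraper) scrape(q mappedQuery) map[string][]string {
    ret := make(map[string][]string)
    for field, selector := range s.fields {
        ret[field] = q.runQuery(selector)
    }
    return ret
}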

I've rethought the whitespace removal code. It now removes leading and trailing whitespace, and currently will only remove newlines when it first parses the node text. I'll need another round of regression tests, and I'll also consider disabling the newline removal when a specific flag is set on the attribute, but I'll need an example of when it is needed.
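
A minimal sketch of the described trimming behaviour (strip leading/trailing whitespace; remove newlines when the node text is first parsed); exactly how newlines are replaced is an assumption.

// Illustrative sketch only: clean node text when it is first parsed, as
// described above. Replacing newlines with a single space is an assumption.
package scraper

import (
    "regexp"
    "strings"
)

var newlineRE = regexp.MustCompile(`\s*\n\s*`)

// cleanNodeText is applied once, when the text of a matched node is read.
func cleanNodeText(s string) string {
    s = newlineRE.ReplaceAllString(s, " ") // remove newlines and surrounding runs of space
    return strings.TrimSpace(s)            // strip leading/trailing whitespace
}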

@bnkai
Collaborator

bnkai commented Jul 13, 2020

Just finished the regression tests. I haven't gone through the code yet.

Most differences I found were extra leading or trailing whitespace in some scrapers.
URLs tested:
scenes - https://pastebin.com/yq26r9RX
performers - https://pastebin.com/3YLSdCdG

Scrapers that had differences in results:
SCENES
21naturals -> https://pastebin.com/EcuTDu4Z
The only difference is for a single URL: https://www.21naturals.com/en/video/Under-The-Coat/144299

103c103
<     Details: Kinky Lana Roy is excited to meet up with her hot boyfriend Raul Costa. Lana surprises him by wearing lingerie under her new coat. She wraps her hands around his throbbing cock and deep throats it. Lana then presents her round ass for Raul to fuck. She takes every inch of his cock in her tight asshole and moans in pleasure before getting a hot blast of cum all over her pussy and ass.
---
>     Details: 'Kinky Lana Roy is excited to meet up with her hot boyfriend Raul Costa. Lana surprises him by wearing lingerie under her new coat. She wraps her hands around his throbbing cock and deep throats it. Lana then presents her round ass for Raul to fuck. She takes every inch of his cock in her tight asshole and moans in pleasure before getting a hot blast of cum all over her pussy and ass. '

Notice the trailing whitespace in the details field for the refactor build.

cherrypimps -> https://pastebin.com/NgKABVrS
Again, a single trailing whitespace for the refactor build, in this URL: https://cherrypimps.com/trailers/16588-kaneziereeves-vinasky.html

22c22
<     Title: Wild Girls Kenzie Reeves And Vina Sky
---
>     Title: 'Wild Girls Kenzie Reeves And Vina Sky '

mypervmom -> https://pastebin.com/upbRP0Mb
All scene details for the tested scenes get an extra leading whitespace with the refactor build.
diff file -> https://pastebin.com/GQafkRmX

PERFORMERS
freeones
All performers tested get an extra leading whitespace in their birthdate field (failing the parseDate as a result).
diff file -> https://pastebin.com/kGFNgwnR

babepedia -> https://pastebin.com/pqbfxb4v
All performers tested get an extra leading or trailing whitespace, either in aliases or height.
diff file -> https://pastebin.com/v2HTDdsu

EDIT
For the newline removal, I have to get back to you after I go over the code changes. The only useful case where it is needed is the description/details field; in that case it would be helpful to keep the newlines, even from the first node parse. (This was not easy to do with the original code prior to this refactor.) I'll look for some examples after looking over the latest refactor code.

@WithoutPants
Collaborator Author

Ideally, I'd leave it as is and change the scrapers - the replace regex should include the space to remove. However, I obviously don't want to break existing behaviour, so I've committed a change to trim the whitespace after replace operations. It shouldn't cause any problems down the track, but it's the sort of magic behaviour I'd like to avoid.
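
A tiny sketch of that compatibility shim, assuming a hypothetical replaceThenTrim helper; the real change may differ.

// Illustrative sketch only: trim whitespace after a replace operation so that
// existing scrapers keep working. Names are hypothetical.
package scraper

import (
    "regexp"
    "strings"
)

func replaceThenTrim(value, pattern, with string) string {
    re := regexp.MustCompile(pattern)
    value = re.ReplaceAllString(value, with)
    return strings.TrimSpace(value) // trim whitespace after the replace operation
}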

@bnkai
Collaborator

bnkai commented Jul 14, 2020

> Ideally, I'd leave it as is and change the scrapers - the replace regex should include the space to remove. However, I obviously don't want to break existing behaviour, so I've committed a change to trim the whitespace after replace operations. It shouldn't cause any problems down the track, but it's the sort of magic behaviour I'd like to avoid.

If it wasn't for the babepedia and freeones scrapers, I would opt for removing the unpredictable "magic" behaviour, even if it means changing/fixing the rest of the scene scrapers. Can we revisit this issue (the space removal) after 0.3?

@bnkai
Collaborator

bnkai commented Jul 18, 2020

With the latest commit it seems to be OK (backwards compatible) with the existing scrapers, at least with the tested URLs.

@WithoutPants WithoutPants merged commit 2b92157 into stashapp:develop Jul 21, 2020
Tweeticoats pushed a commit to Tweeticoats/stash that referenced this pull request Feb 1, 2021
* Refactor xpath scraper code
* Make post-process a list
* Add map post-process action
* Add fixed xpath values
* Refactor scrapers into cache
* Refactor into mapped config
* Trim test html
@WithoutPants WithoutPants deleted the refactor-xpath branch February 4, 2021 03:01