Refactor xpath scraper code. Add fixed and map #616
Going to mark as draft until #579 is merged.

Merged and adjusted for #579 changes.

I finished doing some regression tests using this build and the community scrapers (tested ~150 URLs).
The example for the Gender:

    fixed: Female

And a couple of scrapers that were requested here stashapp/CommunityScrapers#73

name: "hucows"
sceneByURL:
  - action: scrapeXPath
    url:
      - hucows.com
    scraper: sceneScraper
xPathScrapers:
  sceneScraper:
    scene:
      Title: //h1[@class="entry-title"]
      Details: //div[@class="entry-content"]/p
      Date:
        selector: //div[@itemprop="datePublished"]//text()
        concat: " "
        postProcess:
          - replace:
              - regex: ^\s+
                with:
          - parseDate: 2 Jan 2006
      Image: //center/a[1]/img[@class="lightboxhover"]/@src
      Studio:
        Name:
          fixed: Hucows
      Tags:
        Name: //section[@id="categories-2"]//a
# Last Updated June 27, 2020

name: "xhamster"
sceneByURL:
  - action: scrapeXPath
    url:
      - xhamster.com
    scraper: sceneScraper
xPathScrapers:
  sceneScraper:
    common:
      $player: //div[@class="width-wrap with-player-container"]
    scene:
      Title: $player/h1
      Details: //div[@class="ab-info controls-info__item xh-helper-hidden"]/p
      Date:
        selector: //div[@class="entity-info-container__date tooltip-nocache"]/@data-tooltip
        postProcess:
          - replace:
              - regex: (\d{4}-\d{2}-\d{2})\s.+
                with: $1
          - parseDate: 2006-01-02
      Image: //link[@as="image"]/@href
      Studio:
        Name:
          selector: $player/ul/li/a[contains(@href,"/channels/")]
      Tags:
        Name:
          selector: $player/ul/li/a[contains(@href,"/categories/")]
      Performers:
        Name:
          selector: $player/ul/li/a[contains(@href,"/pornstars/")]
# Last Updated June 27, 2020

EDIT

Birthdate:
  selector: //span[@itemprop="birthDate"]
  postProcess:
    - parseDate: Jan 2, 2006
    - anything:

Birthdate:
  selector: //span[@itemprop="birthDate"]
  postProcess:
    - concat: " "
    - parseDate: Jan 2, 2006

The commits aren't showing in the feed properly, but I did another significant refactor. I've converted it to use the same cache-style system that the plugin system PR uses. I've moved a lot of generic code into

I've rethought the whitespace removal code. It now removes leading and trailing whitespace, and currently only removes newlines when it first parses the node text.

I'll need another round of regression tests. I'll also consider disabling the newline removal when a specific flag is set on the attribute, but I'll need an example of when that is needed.

Just finished the regression tests. I haven't gone through the code yet. Most differences I found were extra leading or trailing whitespace on some scrapers.

Scrapers that had differences in results:

103c103
< Details: Kinky Lana Roy is excited to meet up with her hot boyfriend Raul Costa. Lana surprises him by wearing lingerie under her new coat. She wraps her hands around his throbbing cock and deep throats it. Lana then presents her round ass for Raul to fuck. She takes every inch of his cock in her tight asshole and moans in pleasure before getting a hot blast of cum all over her pussy and ass.
---
> Details: 'Kinky Lana Roy is excited to meet up with her hot boyfriend Raul Costa. Lana surprises him by wearing lingerie under her new coat. She wraps her hands around his throbbing cock and deep throats it. Lana then presents her round ass for Raul to fuck. She takes every inch of his cock in her tight asshole and moans in pleasure before getting a hot blast of cum all over her pussy and ass. '

Notice the trailing whitespace in the details field for the refactor build.

cherrypimps -> https://pastebin.com/NgKABVrS

22c22
< Title: Wild Girls Kenzie Reeves And Vina Sky
---
> Title: 'Wild Girls Kenzie Reeves And Vina Sky '

mypervmom -> https://pastebin.com/upbRP0Mb

PERFORMERS

babepedia -> https://pastebin.com/pqbfxb4v

EDIT

Ideally, I'd leave it as is and change the scrapers: the replace regex should include the space to remove. However, I obviously don't want to break existing behaviour, so I've committed a change to trim the whitespace after replace operations. It shouldn't cause any problems down the track, but it's the sort of magic behaviour I'd like to avoid.
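
As a rough illustration of "the replace regex should include the space to remove", a scraper field could match the whitespace in front of the unwanted text as part of the pattern, so nothing is left for the engine to trim. This is only a sketch: the selector and the `Example Site` suffix are hypothetical and not taken from any scraper tested in this thread.

```yaml
Title:
  selector: //h1[@class="title"]        # hypothetical XPath
  postProcess:
    - replace:
        # Assumed suffix; \s+ consumes the space before it as well,
        # so the result has no trailing whitespace.
        - regex: \s+Example Site$
          with: ""
```

With a pattern like this the scraper output would not depend on the trimming that now happens after replace operations.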

If it weren't for the babepedia and freeones scrapers, I would opt for removing the unpredictable "magic" behaviour, even if it means changing or fixing the rest of the scene scrapers. Can we revisit this issue (the space removal) after 0.3?

With the latest commit it seems to be OK (backwards compatible) with the existing scrapers, at least with the tested URLs.

* Refactor xpath scraper code
* Make post-process a list
* Add map post-process action
* Add fixed xpath values
* Refactor scrapers into cache
* Refactor into mapped config
* Trim test html

Refactored the xpath code so that it uses structs instead of generic maps.
Changed post-processing format so that actions can be performed in any order. Maintains backward compatibility with existing post-processing actions: parseDate, replace, subScraper; but should be considered deprecated. New post-processing actions should only be implemented in the postProcess field.

Example of the old vs new structure:
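
The original example is not preserved here, so the following is only a sketch of how the two layouts might compare. It is pieced together from the description above and the scraper configs posted in this thread. The selector is hypothetical, and the assumption that postProcess items are applied in the order listed is inferred from the description rather than confirmed.

```yaml
# Old structure (assumed): each action is a key on the field itself.
# Still accepted for backward compatibility, but deprecated.
Date:
  selector: //div[@class="date"]        # hypothetical XPath
  replace:
    - regex: ^\s+
      with: ""
  parseDate: 2 Jan 2006
---
# New structure: actions are list items under postProcess,
# presumably applied in the order they are listed.
Date:
  selector: //div[@class="date"]        # hypothetical XPath
  postProcess:
    - replace:
        - regex: ^\s+
          with: ""
    - parseDate: 2 Jan 2006
```
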
Adds a Fixed field. This field replaces Selector in a field and resolves to the string value. For example:

Adds a map post-processing action. This is only allowed in the new postProcess field. It maps values to other values. Example:

The above config changes F to Female, M to Male. Eliminates the need for multiple regex actions.

Resolves #457
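
The examples for the Fixed field and the map action did not survive in this copy of the description. As a hedged sketch of what they might look like: the `fixed: Hucows` value is taken from the hucows scraper in the conversation, and the F/M mapping follows the description above. The scraper names and the Gender selector are hypothetical.

```yaml
xPathScrapers:
  sceneScraper:
    scene:
      Studio:
        Name:
          # fixed takes the place of selector; the field resolves to this literal string
          fixed: Hucows
  performerScraper:                      # hypothetical scraper name
    performer:
      Gender:
        selector: //span[@class="gender"]   # hypothetical XPath
        postProcess:
          # map is only valid inside postProcess; it swaps the scraped value
          # for the mapped one (F becomes Female, M becomes Male)
          - map:
              F: Female
              M: Male
```

A single map entry like this would stand in for one replace regex per value.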