31 Jan 01:58

D4Vinci

db068f4

v0.2.93 Latest

This is an essential update for everyone who wants to fully enjoy Scrapling as intended.

What's changed

The return type is now consistent across the whole parser engine: you will always get one of Adaptor, Adaptors, TextHandler, TextHandlers, or None, plus a list when the results are mixed, as with a combined CSS selector. This gives a better coding experience with minimal manual type checking, makes the library more stable, and makes chaining methods always possible.
Most of the parser engine, especially the Adaptor class, was refactored into a cleaner and, most importantly, faster version. Almost all methods/properties, especially the searching methods, are now 5-40% faster. Some methods got bigger boosts: find_by_regex is ~60% faster! The automatch feature got a small ~5% speed boost.
Fixed logic bugs in the find_all/find methods that applied the passed filters sometimes in OR fashion and other times in AND fashion. Now every returned element must fulfill all the filters you pass.
All regex-related methods now return TextHandler/TextHandlers for easier method chaining.
Added a new below_elements property that returns an Adaptors object of all elements under the current element in the DOM tree.
All methods/properties that returned the HTML source as a string now return it as a TextHandler, so you can easily run regex on it, etc.
StealthyFetcher is now a bit faster and more stealthy. It's also now possible to click Captchas in iframes, like Cloudflare Turnstile.
The auto-completion and type hints improved a lot in nearly half the library, especially for Adaptor, TextHandler, and TextHandlers.
Slicing a TextHandler, accessing it by index, or using the split method now returns another TextHandler instead of a standard Python string. Almost all standard string operations/methods now return another TextHandler instead of a standard string, so chaining methods/functions is always possible.
Fixed some small bugs and typos. For example, the Fetcher async_put was sending a POST request instead of a PUT request 😶‍🌫️
Improved the README a bit till I finish the documentation website.
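
To illustrate why returning a TextHandler from slicing, split, and similar operations keeps chaining possible, here is a minimal sketch of the general technique: a str subclass that re-wraps its results. ChainableText and its re_first helper are hypothetical names made up for this example; this is not Scrapling's actual implementation.

```python
import re


class ChainableText(str):
    """Hypothetical str subclass that keeps its own type after common operations."""

    def __getitem__(self, key):
        # Slicing or indexing a plain str returns str; re-wrap the result.
        return ChainableText(super().__getitem__(key))

    def split(self, sep=None, maxsplit=-1):
        # Re-wrap every piece so each one still has the custom methods.
        return [ChainableText(part) for part in super().split(sep, maxsplit)]

    def replace(self, old, new, count=-1):
        # Same idea for replace: return the subclass, not a plain str.
        return ChainableText(super().replace(old, new, count))

    def re_first(self, pattern, default=None):
        # Illustrative regex helper: first match as ChainableText, else default.
        match = re.search(pattern, self)
        return ChainableText(match.group(0)) if match else default


text = ChainableText("Price: 42 USD, Stock: 7")
# Each step returns ChainableText, so the custom helper is still available at the end.
first_number = text.split(",")[0].replace("Price: ", "").re_first(r"\d+")
print(first_number, type(first_number).__name__)
```

With a plain str, the chain would break at the last step because str has no re_first method.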

This was supposed to be a small update until version 0.3, but I decided to make it better.

Thanks for all your support!

Shoutout to our biggest Sponsor: Scrapeless


26 Dec 18:05

D4Vinci

v0.2.92

32d9660

v0.2.92

What's changed

The response returned by browser-based fetchers now uses more reliable data sources when the loaded page uses many iframes.
Installing Scrapling is now even easier: install it with pip, then run scrapling install in the terminal, and you are ready!
Fixed an inaccurate type hint in the parser.

Note

A friendly reminder that maintaining and improving Scrapling takes a lot of time and effort which I have been happily doing for months even though it's becoming harder. So, if you like Scrapling and want it to keep improving, you can help by supporting me through the Sponsor button.


19 Dec 11:52

D4Vinci

v0.2.91

ee59914

v0.2.91

What's changed

Fixed a bug where the fetch logging sentence was shown for the first request only.
By default, the Playwright API returns the first response that fulfills the load state given to the goto method ("load", "domcontentloaded", or "networkidle"). So if a website has a wait page, like Cloudflare's, that redirects you to the real website afterward, Playwright returns the first status code, which in this case would be something like 403. This update solves the issue for both PlayWrightFetcher and StealthyFetcher, as both use the Playwright API, so the result no longer depends on Playwright's default behavior.
Added support for SOCKS proxies in the Fetcher class.
Fixed the type hint for the wait_selector_state argument, so auto-completion now shows the accurate values you should use.


16 Dec 12:43

D4Vinci

v0.2.9

60df72c

v0.2.9

What's changed

New features

Introducing the long-awaited async support for Scrapling! Now you have the AsyncFetcher class version of Fetcher, and both StealthyFetcher and PlayWrightFetcher have a new method called async_fetch with the same options.

>>> import asyncio
>>> from scrapling import StealthyFetcher
>>> # async_fetch is the async version of fetch; asyncio.run drives it from synchronous code
>>> page = asyncio.run(StealthyFetcher().async_fetch('https://www.browserscan.net/bot-detection'))
>>> page.status == 200
True
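
The main benefit of async support is fetching several pages concurrently. The sketch below shows the general asyncio pattern only: fake_async_fetch is a hypothetical stand-in that sleeps instead of launching a browser, so the example runs anywhere. In real code you would await the fetcher's async method in its place.

```python
import asyncio


async def fake_async_fetch(url: str) -> dict:
    # Hypothetical stand-in for an async fetch call; sleeps instead of doing I/O.
    await asyncio.sleep(0.01)
    return {"url": url, "status": 200}


async def fetch_all(urls):
    # Schedule every coroutine at once and wait for all of them together.
    # gather returns the results in the same order as the inputs.
    return await asyncio.gather(*(fake_async_fetch(u) for u in urls))


urls = ["https://example.com/a", "https://example.com/b"]
pages = asyncio.run(fetch_all(urls))
print([p["status"] for p in pages])
```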

The StealthyFetcher class now has the geoip argument in its fetch methods. When enabled, it makes the class automatically use the IP's longitude, latitude, timezone, country, and locale, then spoof the WebRTC IP address. It also calculates and spoofs the browser's language based on the distribution of language speakers in the target region.
Added the retries argument to Fetcher/AsyncFetcher classes so now you can set the number of retries of each request done by httpx.
Added the url_join method to Adaptor and Fetchers which takes a relative URL and joins it with the current URL to generate an absolute full URL!
Added the keep_cdata method to Adaptor and Fetchers to stop the parser from removing cdata when needed.
The Adaptor/Response body method now returns the raw HTML response when possible (without processing it in the library).
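
To show what joining a relative URL with the current URL means in practice, here is a small sketch using the standard library's urllib.parse.urljoin. It only illustrates the URL-resolution rules; whether url_join behaves identically in every edge case is an assumption.

```python
from urllib.parse import urljoin

current_url = "https://books.toscrape.com/catalogue/page-2.html"

# A relative path resolves against the directory of the current URL.
next_page = urljoin(current_url, "page-3.html")
# A root-relative path replaces the whole path.
home = urljoin(current_url, "/index.html")
# An already absolute URL is returned unchanged.
external = urljoin(current_url, "https://example.com/x")

print(next_page)
print(home)
print(external)
```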

Added logging for the Response class, so when you use the Fetchers you now get a log line with info about the response you got.
Example:

>>> from scrapling.defaults import Fetcher
>>> Fetcher.get('https://books.toscrape.com/index.html')
[2024-12-16 13:33:36] INFO: Fetched (200) <GET https://books.toscrape.com/index.html> (referer: https://www.google.com/search?q=toscrape)

Now using all standard string methods on a TextHandler like .replace() will result in another TextHandler. It was returning the standard string before.
Big improvements to speed across the library and improvements to stealth in Fetchers classes overall.
Added dummy functions like extract_first and extract that return the same result as the parent. These functions exist only to make it easy to copy code from Scrapy/Parsel to Scrapling when needed, as they are used there!
Thanks to refactoring a lot of the code and caching in the right places, doing requests in bulk is now much faster.

Breaking changes

Support for Python 3.8 has been dropped (mainly because Playwright stopped supporting it, but it was a problematic version anyway).
The debug argument has been removed from the whole library. If you want to set the library to debugging, do this after importing it:
```
>>> import logging
>>> logging.getLogger("scrapling").setLevel(logging.DEBUG)
```
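
Because everything goes through the logger named scrapling, you can also attach your own handler to that logger without touching the root logger. The sketch below uses only the standard logging module; ListHandler is a hypothetical helper, and the debug messages are simulated here rather than emitted by the library.

```python
import logging


class ListHandler(logging.Handler):
    """Collects formatted messages in a list so they can be inspected."""

    def __init__(self):
        super().__init__()
        self.messages = []

    def emit(self, record):
        self.messages.append(f"{record.levelname}: {record.getMessage()}")


scrapling_logger = logging.getLogger("scrapling")
scrapling_logger.setLevel(logging.DEBUG)
handler = ListHandler()
scrapling_logger.addHandler(handler)

# Simulated messages; the real library would emit its own records.
scrapling_logger.debug("simulated debug message")
# Records from other loggers never reach the handler attached to "scrapling".
logging.getLogger("another_library").debug("not captured here")

print(handler.messages)
```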

Bugs Squashed

Now WebGL is enabled by default as a lot of protections are checking if it's enabled now.
Fixed some mistakes and typos in the docs/README.

Quality of life changes

All logging is now unified under the logger name scrapling for easier and cleaner control. We were using the root logger before.
Restructured the tests folder into a cleaner structure and added tests for the new features. All the tests were rewritten to a cleaner version and more tests were added for higher coverage.
Refactored a big part of the code to be cleaner and easier to maintain.

All of these changes were originally planned for 0.3, but I decided to ship them here because it will be some time until the next version. The next step is to finish the detailed documentation website and then work on version 0.3.


30 Nov 16:16

D4Vinci

v0.2.8

012820c

v0.2.8

What's changed

This is a small update that includes some must-have quality-of-life changes to the code and fixes a typo in the main README file (#20).


26 Nov 21:11

D4Vinci

v0.2.7

26aebba

v0.2.7

What's changed

New features

If you use the wait_selector argument with the StealthyFetcher and PlayWrightFetcher classes, Scrapling now waits again afterward for the JS to fully load and execute as normal. If you also use the network_idle argument, Scrapling waits for it again too after all of that. If the states are already fulfilled, no waiting happens, of course.
You can now turn ad blocking on or off in StealthyFetcher with the disable_ads argument. It is enabled by default and installs the uBlock Origin add-on.
Now you can set the locale used by PlayWrightFetcher with the locale argument. The default value is still en-US.
Now the basic requests done through Fetcher can accept proxies in this format http://username:password@localhost:8030.
The stealth mode improved a bit for PlayWrightFetcher.
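
For reference, the proxy format mentioned above packs the scheme, credentials, host, and port into one URL. This small standard-library sketch shows how such a string breaks down; the credentials are the placeholder values from the example, and the percent-encoding note is general URL behavior rather than something specific to Scrapling.

```python
from urllib.parse import quote, urlsplit

proxy = "http://username:password@localhost:8030"
parts = urlsplit(proxy)

# The userinfo part before "@" carries the credentials.
print(parts.scheme, parts.username, parts.password, parts.hostname, parts.port)

# Characters such as "@" or ":" inside a password must be percent-encoded,
# otherwise they are read as URL delimiters.
safe_password = quote("p@ss:word", safe="")
print(f"http://username:{safe_password}@localhost:8030")
```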

Bugs Squashed/Improvements

Now enabling proxies on the PlayWrightFetcher class is not tied to the stealth mode being on or off (Thanks to @AbdullahY36 for pointing that out)
ResponseEncoding now tests whether the encoding returned from the response can be used with the page. If the returned encoding triggers an error, Scrapling defaults to utf-8.
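
The fallback idea can be sketched as follows. This is a minimal illustration under my own assumptions, not the library's code: decode_with_fallback is a hypothetical helper that tries the declared encoding and falls back to utf-8 when the codec is unknown or the bytes do not fit it.

```python
def decode_with_fallback(body: bytes, declared, default: str = "utf-8"):
    """Decode body with the declared encoding, falling back to default on error."""
    try:
        return body.decode(declared), declared
    except (LookupError, UnicodeDecodeError, TypeError):
        # LookupError: unknown codec name.
        # UnicodeDecodeError: the bytes are not valid in the declared encoding.
        # TypeError: no encoding was declared at all (None).
        return body.decode(default, errors="replace"), default


page_bytes = "café".encode("utf-8")
print(decode_with_fallback(page_bytes, "not-a-codec"))
print(decode_with_fallback(page_bytes, "ascii"))
print(decode_with_fallback("café".encode("latin-1"), "latin-1"))
```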


24 Nov 13:37

D4Vinci

v0.2.6

bbbc97a

v0.2.6

What's changed

New features

PlayWrightFetcher can now use a real browser directly through the real_chrome argument passed to the PlayWrightFetcher.fetch function, which requires the Chrome browser to be installed. Scrapling launches an instance of your Chrome browser, and you can use most of the options as normal. (Previously, you only had the cdp_url argument to do so.)
Bumped up the version of the headers generated for real browsers.

Bugs Squashed

It turns out the format of the browser headers generated by BrowserForge was outdated, which got Scrapling detected by some protections, so BrowserForge is now used only to generate a real user agent.
Now the hide_canvas argument is turned off by default as it's being detected by Google's ReCaptcha.


23 Nov 15:56

D4Vinci

v0.2.5

e94c503

v0.2.5

What's changed

Bugs Squashed

Handled an error that happens with the 'wait_selector' argument if it resolved to more than 1 element. This affects the StealthyFetcher and the PlayWrightFetcher classes.
Fixed the encoding type in cases where the content_type header has a value with parameters such as charset (thanks to @andyfcx for #12).
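
As an illustration of the header shape behind this fix, a Content-Type value can carry parameters after the media type. The sketch below splits the two with the standard library's email.message.Message; parse_content_type is a hypothetical helper, and this is not necessarily how Scrapling parses the header.

```python
from email.message import Message


def parse_content_type(header_value: str):
    """Split a Content-Type header into (media type, charset or None)."""
    msg = Message()
    msg["Content-Type"] = header_value
    # get_content_charset returns the charset parameter lowercased, or None.
    return msg.get_content_type(), msg.get_content_charset()


print(parse_content_type("text/html; charset=ISO-8859-1"))
print(parse_content_type("application/json"))
```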

Quality of life

Added more tests to cover new parts of the code and made tests run in threads.
Updated the docstrings so they render correctly in Sphinx's apidoc or similar tools.

New Contributors

@andyfcx made their first contribution in #13

Contributors

andyfcx


20 Nov 11:35

D4Vinci

v0.2.4

e9b0102

v0.2.4

What's changed

Bugs Squashed

Fixed a bug when retrieving response bytes after using the network_idle argument in both the StealthyFetcher and PlayWrightFetcher classes.
That was causing the following error message:

Response.body: Protocol error (Network.getResponseBody): No resource with given identifier found

The PlayWright API sometimes returns empty status text with responses, so now Scrapling will calculate it manually if that happens. This affects both the StealthyFetcher and PlayWrightFetcher classes.
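
One way to derive a status text from a status code is the standard library's http.HTTPStatus table. The sketch below is an assumption about the general approach, not the library's actual code; status_text is a hypothetical helper.

```python
from http import HTTPStatus


def status_text(code: int, reported: str = "") -> str:
    """Return the reported status text, or look it up from the code if empty."""
    if reported:
        return reported
    try:
        return HTTPStatus(code).phrase
    except ValueError:
        # Non-standard status codes have no entry in the table.
        return ""


print(status_text(200))
print(status_text(404))
print(status_text(403, "Forbidden"))
```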


19 Nov 23:49

D4Vinci

v0.2.3

1473803

v0.2.3

What's changed

Bugs Squashed

Fixed a bug with pip installation that prevented the stealth mode on PlayWrightFetcher from working entirely.



Releases: D4Vinci/Scrapling
