Commit bbc2de3

Add notes on Scrapy
1 parent 6fc9d9f commit bbc2de3

2 files changed: 32 additions, 0 deletions

.gitignore

Lines changed: 2 additions & 0 deletions
@@ -0,0 +1,2 @@
*/.venv/*
*/__pycache__

README.md

Lines changed: 30 additions & 0 deletions
@@ -2805,3 +2805,33 @@ Using tags includes newline characters, so you need to remember to skip over them
`soup.body.contents[1].next_sibling.next_sibling`

The searching methods, such as `find_next_sibling`, skip newline characters.

### Web Scraping with Scrapy

Scrapy is a Python framework for web crawling and web scraping, used to crawl websites and extract structured data from their pages.

Like BeautifulSoup, it parses HTML to extract data, but it uses a different syntax: elements are selected with CSS or XPath expressions on the response, and Scrapy also downloads the pages and follows links itself.

```python
import scrapy


class BookSpider(scrapy.Spider):
    name = "bookspider"
    start_urls = ["https://books.toscrape.com"]

    def parse(self, response):
        # Each book on the page is an <article class="product_pod"> element.
        for article in response.css("article.product_pod"):
            yield {
                "price": article.css(".price_color::text").get(),
                "title": article.css("h3 > a::attr(title)").get(),
            }
        # Follow the "next" pagination link, if the page has one.
        next_page = response.css(".next > a::attr(href)").get()
        if next_page:
            yield response.follow(next_page, self.parse)
```
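
The pagination step works because `response.follow` accepts a relative `href` and resolves it against the page it was found on. The sketch below illustrates that resolution with the standard library's `urljoin` rather than Scrapy itself. The `href` values are assumed examples, and `resolve_next` is a hypothetical helper, not part of Scrapy.

```python
from urllib.parse import urljoin


def resolve_next(page_url, href):
    """Resolve a relative pagination link against the page it was found on."""
    return urljoin(page_url, href)


# Assumed example hrefs, for illustration only.
first = resolve_next("https://books.toscrape.com", "catalogue/page-2.html")
second = resolve_next(first, "page-3.html")

print(first)   # https://books.toscrape.com/catalogue/page-2.html
print(second)  # https://books.toscrape.com/catalogue/page-3.html
```

Each link is resolved relative to the current page, so the spider can pass the raw `href` straight to `response.follow` without building absolute URLs by hand.
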
To run the spider, call `scrapy runspider`, passing it the Python file that defines the spider and, with `-o`, the destination file to save the scraped data to. Scrapy infers the output format from the file extension.

```
scrapy runspider -o books.csv book_scraper.py
```
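
Once the run finishes, the exported file can be inspected with the standard library. This is a minimal sketch, assuming the CSV has the `price` and `title` columns yielded by the spider and that prices carry a `£` prefix. The rows are made-up sample data, and `load_books` is a hypothetical helper.

```python
import csv
import io

# Made-up sample standing in for the contents of books.csv.
SAMPLE_CSV = "price,title\n£10.00,Example Book One\n£25.50,Example Book Two\n"


def load_books(csv_file):
    """Return one dict per book, with the price converted to a float."""
    books = []
    for row in csv.DictReader(csv_file):
        # Strip the currency symbol before converting (assumes a "£" prefix).
        price = float(row["price"].lstrip("£"))
        books.append({"title": row["title"], "price": price})
    return books


books = load_books(io.StringIO(SAMPLE_CSV))
print(books)
```

For the real output, the same function could be given a file opened with `open("books.csv", newline="", encoding="utf-8")`.
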
