Scrapy
Scrapy (/ˈskreɪpaɪ/) is an
application framework for crawling websites and extracting structured data,
which can be used for a wide range of applications such as data mining,
information processing, or historical archival.
In Scrapy, requests are
scheduled and processed asynchronously.
While this enables very fast crawls (sending multiple
concurrent requests at the same time, in a fault-tolerant way), Scrapy also
gives you control over the politeness of the crawl through a few settings. You
can set a download delay between requests, limit the
number of concurrent requests per domain or per IP, and even use an
auto-throttling extension that tries to work out these values automatically.
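As a rough illustration of those politeness controls, the following sketch shows how they might look in a project's settings.py. The setting names are Scrapy's own, but every value is an arbitrary assumption chosen for the example, not a recommendation. Check the documentation for your Scrapy version as well, since the per-IP limit may be deprecated in recent releases.

```python
# settings.py (sketch): all values are illustrative assumptions.

# Seconds to wait between consecutive requests to the same site.
DOWNLOAD_DELAY = 1.0

# Upper bound on simultaneous requests to any single domain.
CONCURRENT_REQUESTS_PER_DOMAIN = 4

# Upper bound on simultaneous requests to any single IP.
# 0 means the per-IP limit is not applied and the per-domain limit is used.
CONCURRENT_REQUESTS_PER_IP = 0

# Let the AutoThrottle extension adapt the delay to observed latency.
AUTOTHROTTLE_ENABLED = True
AUTOTHROTTLE_START_DELAY = 1.0          # initial delay, in seconds
AUTOTHROTTLE_MAX_DELAY = 30.0           # ceiling for the delay, in seconds
AUTOTHROTTLE_TARGET_CONCURRENCY = 2.0   # average parallel requests per site
```

With AutoThrottle enabled, DOWNLOAD_DELAY acts as a lower bound on the delay the extension will choose.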
Our first Spider:
Spiders are classes that you define and that Scrapy uses to scrape
information from a website (or a group of websites). They must subclass Spider
and define the initial requests to make, optionally how to follow links in the
pages, and how to parse the downloaded page content to extract data.
This is the code for our first Spider. Save it in a file named
quotes_spider.py under the tutorial/spiders directory in your project:
import scrapy


class QuotesSpider(scrapy.Spider):
    name = "quotes"

    def start_requests(self):
        urls = [
            'https://quotes.toscrape.com/page/1/',
            'https://quotes.toscrape.com/page/2/',
        ]
        for url in urls:
            yield scrapy.Request(url=url, callback=self.parse)

    def parse(self, response):
        page = response.url.split("/")[-2]
        filename = f'quotes-{page}.html'
        with open(filename, 'wb') as f:
            f.write(response.body)
        self.log(f'Saved file {filename}')
As you can see, our Spider subclasses scrapy.Spider and defines
some attributes and methods:
name: identifies the Spider. It must be unique
within a project, that is, you can’t set the same name for different Spiders.
start_requests(): must return an iterable of Requests
(you can return a list of requests or write a generator function) which the
Spider will begin to crawl from. Subsequent requests will be generated
successively from these initial requests.
parse(): a method that will be called to handle the
response downloaded for each of the requests made. The response parameter is an
instance of TextResponse that holds the page content and has further helpful
methods to handle it.
The parse() method usually parses the response, extracting the
scraped data as dicts and also finding new URLs to follow and creating new
requests (Request) from them.
Scraping using XPath selectors:
Back in your web browser, right-click the span tag,
select Copy > XPath, and
paste the copied path into the Scrapy shell.
With this knowledge we can refine our XPath: instead of a
path to follow, we’ll simply select all span tags with class="text" by
using the has-class XPath extension function.


Quote Text: //blockquote[@class='quoteBody']/text()
Quote Author: //span[@class='quoteAuthor']/text()
Quote Tags: //div[@class='quoteTags']/child::a/text()
//*[@id="shell"]/div[7]/div[1]/article[4]/div/div[2]/div/blockquote
How to start a project:
1. scrapy startproject project_name
2. cd project_name
3. scrapy genspider -t crawl spidername web_url (without https://)
XPaths for GitHub repos:
GitHub repo download button XPath:
//li[@class='Box-row Box-row--hover-gray p-3 mt-0']/a[@class='d-flex flex-items-center color-fg-default text-bold no-underline']/@href
Next page:
//a[@class='next_page']/@href