Scrapy

Scrapy (/ˈskreɪpaɪ/) is an application framework for crawling websites and extracting structured data, which can be used for a wide range of applications such as data mining, information processing, or historical archival.

In Scrapy, requests are scheduled and processed asynchronously. This means Scrapy does not need to wait for one request to finish before sending another.

While this enables very fast crawls (sending multiple concurrent requests in a fault-tolerant way), Scrapy also gives you control over the politeness of the crawl through a few settings. You can set a download delay between requests, limit the number of concurrent requests per domain or per IP, and even use an auto-throttling extension that tries to work out these values automatically.
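
As a rough sketch, the politeness settings mentioned above could look like this in a project's settings.py. The setting names are Scrapy's own; the values are illustrative assumptions, not recommendations, and should be tuned for the site being crawled.

```python
# settings.py (excerpt) - all values below are illustrative assumptions.

# Wait roughly this many seconds between requests to the same site.
DOWNLOAD_DELAY = 1.0

# Cap the number of requests in flight to any single domain.
CONCURRENT_REQUESTS_PER_DOMAIN = 4

# Let the AutoThrottle extension adjust the delay based on observed latency.
AUTOTHROTTLE_ENABLED = True
AUTOTHROTTLE_START_DELAY = 1.0         # initial delay, in seconds
AUTOTHROTTLE_MAX_DELAY = 30.0          # upper bound when the site is slow
AUTOTHROTTLE_TARGET_CONCURRENCY = 2.0  # average parallel requests to aim for
```

With AutoThrottle enabled, DOWNLOAD_DELAY acts as the minimum delay the extension will use.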

Our first Spider:

Spiders are classes that you define and that Scrapy uses to scrape information from a website (or a group of websites). They must subclass Spider and define the initial requests to make, optionally how to follow links in the pages, and how to parse the downloaded page content to extract data.

This is the code for our first Spider. Save it in a file named quotes_spider.py under the tutorial/spiders directory in your project:

import scrapy


class QuotesSpider(scrapy.Spider):
    name = "quotes"

    def start_requests(self):
        # URLs the crawl starts from.
        urls = [
            'https://quotes.toscrape.com/page/1/',
            'https://quotes.toscrape.com/page/2/',
        ]
        for url in urls:
            yield scrapy.Request(url=url, callback=self.parse)

    def parse(self, response):
        # The URL ends in ".../page/N/", so the second-to-last segment is N.
        page = response.url.split("/")[-2]
        filename = f'quotes-{page}.html'
        # response.body is bytes, so the file is opened in binary mode.
        with open(filename, 'wb') as f:
            f.write(response.body)
        self.log(f'Saved file {filename}')

As you can see, our Spider subclasses scrapy.Spider and defines some attributes and methods:

name: identifies the Spider. It must be unique within a project, that is, you can’t set the same name for different Spiders.

start_requests(): must return an iterable of Requests (you can return a list of requests or write a generator function) which the Spider will begin to crawl from. Subsequent requests will be generated successively from these initial requests.

parse(): a method that will be called to handle the response downloaded for each of the requests made. The response parameter is an instance of TextResponse that holds the page content and has further helpful methods to handle it.

The parse() method usually parses the response, extracting the scraped data as dicts and also finding new URLs to follow and creating new requests (Request) from them.

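Scraping using XPath selectors:

To get a starting XPath, open the page in your web browser's developer tools, right-click on the span tag, select Copy > XPath, and paste the result into the Scrapy shell.

A copied XPath is an absolute path through the page, so it matches only one element and breaks easily when the layout changes. We can refine it: instead of a path to follow, we'll simply select all span tags with class="text" by using the has-class extension function, for example //span[has-class("text")]/text().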

 

Quote Text: //blockquote[@class='quoteBody']/text()

Quote Author: //span[@class='quoteAuthor']/text()

Quote Tags: //div[@class='quoteTags']/child::a/text()

//*[@id="shell"]/div[7]/div[1]/article[4]/div/div[2]/div/blockquote

 

How to start a project:

1. scrapy startproject project_name

2. cd project_name

3. scrapy genspider -t crawl spidername web_url (without https://)

 

XPaths for GitHub repos:

GitHub repo download button XPath:

//li[@class='Box-row Box-row--hover-gray p-3 mt-0']/a[@class='d-flex flex-items-center color-fg-default text-bold no-underline']/@href

Next page

//a[@class='next_page']/@href

 
