Web scraping (Scrapy)

Sunilkumar Prajapati
4 min read · May 18, 2022


Scrapy is an open-source framework for web crawling and web scraping that is used to crawl websites and extract structured data from their pages. In addition to data mining, it can also be used for monitoring and automated testing.

Scrapy extracts data from different websites and can export what it collects in formats such as CSV and JSON, which makes it easy to build your own datasets. It is a tool for data scraping, also called web scraping; both terms mean the same thing.

The question that comes to mind is: why Scrapy, and not Beautiful Soup?

Scrapy can scrape almost any kind of content, such as text, audio, video, and email addresses, and it is built to extract data in bulk. Beautiful Soup, by contrast, only parses HTML pages that you have already downloaded (it is typically paired with the requests Python library), so it is fine for simple HTML pages but not suitable for large and complex projects.
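To make the difference concrete, here is roughly what the Beautiful Soup approach looks like. This is only a sketch: quotes.toscrape.com is a public practice site used as an example, and the span.text selector assumes that site's markup.

import requests
from bs4 import BeautifulSoup

# With Beautiful Soup we have to download each page ourselves (via requests)
html = requests.get("https://quotes.toscrape.com/").text
soup = BeautifulSoup(html, "html.parser")

# Extract the quote text from this single page
quotes = [q.get_text() for q in soup.select("span.text")]
print(quotes[:3])

Everything beyond this single page, such as following links, retrying failed requests, throttling, and exporting the results, you would have to build yourself; Scrapy provides those pieces out of the box.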

If you want to install Scrapy on your system, visit the Scrapy website at https://scrapy.org/; the GitHub repository is at https://github.com/scrapy/scrapy.

I am not going to walk through the installation here, because the Scrapy site has all the information you need.
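That said, in most environments the installation comes down to a single pip command (assuming Python and pip are already set up):

pip install scrapy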

After installing the Scrapy library, you can check which version is installed in your Python environment by typing scrapy in the command prompt.
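The version subcommand prints only the version number (the number shown here is just illustrative):

scrapy version
Scrapy 2.6.1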

We can also check from inside Python itself: type “python” in the command prompt, press Enter, and then run “import scrapy”.

If no error occurs, the Scrapy library has been installed successfully.
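For example, in the interactive Python shell (the exact version string depends on what you installed):

python
>>> import scrapy
>>> scrapy.__version__
'2.6.1'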

Let’s discuss the components of Scrapy.

Scrapy components

1] Spiders

a] Spider (the basic scrapy.Spider)

b] Crawl Spider

c] XML Feed Spider

d] CSV Feed Spider

e] Sitemap Spider

2] Pipelines

3] Middleware

4] Engine

5] Scheduler

In total there are five main components, which work in the background to handle data collection, data preprocessing, and data storage.

1] Spiders: Spiders define what you want to extract from a website: which pages to crawl and what kind of data to pull from each page. (A minimal spider sketch appears after this list.)

2] Pipelines: Pipelines process the scraped data, for example cleaning it, removing duplicates, and storing it in a database. Without a pipeline we could not do this cleaning, deduplication, and storage. (The sketch after this list also shows a simple pipeline.)

3] Middleware: Middleware sits between the engine and the other components; it processes the requests that go out to the target website and the responses that come back.

4] Engine: The engine is responsible for coordinating all the other components and taking care that everything goes according to plan.

5] Scheduler: The scheduler is responsible for preserving the order of operations. It receives requests from the engine, queues them, and feeds them back in order, so that requests and responses are processed in a controlled sequence.
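As mentioned above, here is a minimal sketch of a spider and a pipeline working together. It targets quotes.toscrape.com, a public practice site, and the class names (QuotesSpider, DropDuplicatesPipeline) and selectors are only illustrative.

import scrapy
from scrapy.exceptions import DropItem

class QuotesSpider(scrapy.Spider):
    # The spider decides which pages to crawl and what data to extract
    name = "quotes"
    start_urls = ["https://quotes.toscrape.com/"]

    def parse(self, response):
        for quote in response.css("div.quote"):
            yield {
                "text": quote.css("span.text::text").get(),
                "author": quote.css("small.author::text").get(),
            }

class DropDuplicatesPipeline:
    # The pipeline processes each scraped item, here by dropping duplicates
    def __init__(self):
        self.seen = set()

    def process_item(self, item, spider):
        if item["text"] in self.seen:
            raise DropItem("duplicate quote")
        self.seen.add(item["text"])
        return item

In a real project the pipeline would be enabled through the ITEM_PIPELINES setting in settings.py; the engine, scheduler, and middleware then move the requests, responses, and items between these pieces for you.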

Scrapy Commands

In this part, I will walk through the Scrapy commands. Open the Anaconda Prompt (Anaconda3), type scrapy, and press Enter.

The output shows the current Scrapy version and whether a project is active for that installation. Below that, a Usage section explains how commands and options are combined, followed by the list of available commands. These are the most frequently used commands in Scrapy.
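Typing scrapy with no arguments prints something along these lines (abbreviated here; the version number and command list vary between releases):

Scrapy 2.6.1 - no active project

Usage:
  scrapy <command> [options] [args]

Available commands:
  bench         Run quick benchmark test
  fetch         Fetch a URL using the Scrapy downloader
  genspider     Generate new spider using pre-defined templates
  runspider     Run a self-contained spider (without creating a project)
  startproject  Create new project
  version       Print Scrapy version
  ...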

Let's go through these commands one by one.


a] bench: The bench command runs a quick benchmark on the local system. It performs a simple local crawl and reports how Scrapy is performing, for example how many pages per minute it can fetch, which gives a rough idea of the request/response throughput and of how the configured middleware behaves on your machine.

b] fetch: The fetch command downloads a given URL using the Scrapy downloader and writes the response to standard output.


c] genspider: The genspider command generates a new spider from a pre-defined template.

d] runspider: Once we have written or generated a spider, the runspider command runs it directly from its Python file, without needing a full project.

e] startproject: Let me say that this is the most important command; it creates a new Scrapy project with the standard directory layout. A typical sequence using these commands is sketched below.
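In this sketch the project name, spider name, and file names are just placeholders, and the generated spider only produces output once its parse method has been filled in:

scrapy bench                                   # quick local benchmark
scrapy fetch "https://quotes.toscrape.com/" > page.html   # save one page via the Scrapy downloader
scrapy startproject myproject                  # create a new project
cd myproject
scrapy genspider quotes quotes.toscrape.com    # generate a spider named "quotes"
scrapy runspider myproject/spiders/quotes.py -o quotes.json   # run the spider file and export its items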
