Preparing a delicious dish takes a lot of groundwork: collecting the raw ingredients, cleaning them, and making them fit for use. Only then do we add spices to bring out the flavors of the raw materials.
The same analogy applies to data. Most of the time we can get data from ready-made sources such as Kaggle, but some scenarios call for customized data. For these we turn to web scraping, i.e. getting data from websites either through the APIs they provide or through Python and its libraries.
Once we have the data, we need to clean and handle it so that information can be extracted from it. Finally, we infuse flavor into it, i.e. derive the features of the data that are used for information extraction.
Web Scraping using Python is the first in a series of articles covering all the data preparation steps, which are as follows:
- Data Collection – Web Scraping
- Data Cleaning
- Data Handling and Feature Extraction.
In this first article, we will learn about web scraping. The points covered are given below:
- What is Web Scraping?
- Different ways opted for Web Scraping.
- Libraries used for Web Scraping.
- Practical Implementation of Web Scraping.
- Through BeautifulSoup
- Through Scrapy
What is Web Scraping?
Web scraping is a way to extract information from web pages, where it is present in HTML format. This data is unstructured, so web scraping lets us both retrieve it and convert it into a structured format.
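As a minimal sketch of what "unstructured to structured" can mean, the snippet below turns a small HTML table into a list of dictionaries with BeautifulSoup. The HTML fragment and its values are made up for illustration and do not come from a real site.

```python
from bs4 import BeautifulSoup

# Hypothetical HTML fragment standing in for a real web page.
html = """
<table>
  <tr><th>City</th><th>Temp</th></tr>
  <tr><td>San Francisco</td><td>61</td></tr>
  <tr><td>Oakland</td><td>64</td></tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")
rows = soup.find_all("tr")

# The first row holds the column names; the remaining rows hold the values.
headers = [th.get_text(strip=True) for th in rows[0].find_all("th")]
records = [
    dict(zip(headers, (td.get_text(strip=True) for td in row.find_all("td"))))
    for row in rows[1:]
]

print(records)
```

Each record is now a plain Python dictionary, which is easy to filter, store, or load into a dataframe later.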
Different ways opted for Web Scraping.
There are numerous ways to scrape the web. Some of them are as follows:
- Using APIs: Many websites and organizations provide an API for extracting data from their site, but the kind of data that can be extracted this way is often limited.
- Through Python libraries and frameworks: Python is home to many libraries for different tasks, and web scraping can be done with them as well. There are also frameworks that facilitate the process.
Libraries used for Web Scraping.
It’s time to have a look at the libraries which are used for web scraping.
- BeautifulSoup: This library parses HTML and extracts information from the tags of a webpage, such as table and paragraph tags.
- Urllib: This module of the Python standard library fetches webpages from their URLs.
- Requests: This is another library for downloading webpages; it sends an HTTP request to a URL and returns the response.
- Scrapy: Scrapy is an open-source, collaboratively developed Python framework used for large-scale web scraping.
Practical Implementation of Web Scraping
Web scraping through Beautiful Soup
Here we will scrape the web with the Beautiful Soup library, using a weather forecast website as the target. We will scrape the weather forecast data for San Francisco. So let’s start this journey!
First, we import BeautifulSoup from the bs4 package, along with the requests library, which is used to download the page at a given URL.
Using the requests library, we download the desired web page. The request consists of the URL of the web page together with the latitude and longitude of the city, i.e. San Francisco.
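The import and download steps might look like the sketch below. The article does not state the URL or the coordinates, so the forecast.weather.gov address and the lat/lon values here are assumptions; the helper name download_forecast_page is also made up for illustration.

```python
import requests
from bs4 import BeautifulSoup  # used in the parsing steps that follow

# Assumed values: the article does not give the URL or the coordinates.
FORECAST_URL = "https://forecast.weather.gov/MapClick.php"
SF_LAT, SF_LON = "37.7772", "-122.4168"


def download_forecast_page(lat=SF_LAT, lon=SF_LON, timeout=10):
    """Download the forecast page for a location and return its HTML text."""
    response = requests.get(
        FORECAST_URL, params={"lat": lat, "lon": lon}, timeout=timeout
    )
    response.raise_for_status()  # fail loudly on 4xx/5xx responses
    return response.text


# Preparing the request without sending it shows the final URL that
# requests builds from the base URL and the query parameters.
prepared = requests.Request(
    "GET", FORECAST_URL, params={"lat": SF_LAT, "lon": SF_LON}
).prepare()
print(prepared.url)
```

Calling download_forecast_page() performs the actual network request; the prepared request is only there to show how the latitude and longitude end up in the query string.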
After downloading the page, we will be parsing the HTML content using BeautifulSoup.
To get the forecast from the web page, we have to inspect the page, identify the ‘id’ attribute value of the element that holds the forecast, and assign that element to a variable.
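A minimal sketch of the parsing and lookup steps follows. To keep it runnable offline it parses a small hand-written HTML string instead of the downloaded page, and the id value "seven-day-forecast" is an assumption about the page, not something stated in this article.

```python
from bs4 import BeautifulSoup

# Hypothetical, trimmed-down stand-in for the downloaded page.
html = """
<html><body>
  <div id="seven-day-forecast">
    <div class="tombstone-container"><p class="period-name">Tonight</p></div>
    <div class="tombstone-container"><p class="period-name">Thursday</p></div>
  </div>
  <div id="footer">Other content</div>
</body></html>
"""

# Parse the HTML text into a navigable tree.
soup = BeautifulSoup(html, "html.parser")

# Look up the single element whose id attribute matches (assumed id value).
seven_day = soup.find(id="seven-day-forecast")

print(seven_day.get("id"))
print(len(seven_day.find_all("div")))  # nested <div> elements inside it
```

With a real page, the string passed to BeautifulSoup would be the text of the downloaded response.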
Our next task is to find the elements inside that ‘id’ element by their class attribute, and we assign them to the forecast_items variable. Here the find_all() function returns all elements that have the given class.
We then print the HTML code, using the prettify() function to display it in a structured, indented manner.
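The class lookup and the pretty-printing can be sketched as follows, again on a hand-written HTML string. The id value is an assumption; the ‘tombstone-container’ class is the one named later in this article.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for the forecast section of the page.
html = (
    '<div id="seven-day-forecast">'
    '<div class="tombstone-container"><p class="period-name">Tonight</p></div>'
    '<div class="tombstone-container"><p class="period-name">Thursday</p></div>'
    "</div>"
)

soup = BeautifulSoup(html, "html.parser")
seven_day = soup.find(id="seven-day-forecast")  # assumed id value

# find_all() returns every descendant element that carries the given class.
forecast_items = seven_day.find_all(class_="tombstone-container")
tonight = forecast_items[0]

# prettify() returns the element's HTML as an indented, multi-line string.
pretty = tonight.prettify()
print(pretty)
```

Note that class_ has a trailing underscore because class is a reserved word in Python.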
The next task is to identify the class attributes that let us extract more information. The three variables used are period, short description and temperature.
To obtain the title of a forecast, we use the ‘title’ attribute of the ‘img’ tag. We can use prettify() to view the tag in a structured form, and then print the title.
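The sketch below extracts the three values and the image title from a single forecast item. The class names "period-name", "short-desc" and "temp", as well as the sample text, are assumptions made for illustration.

```python
from bs4 import BeautifulSoup

# Hypothetical markup for one forecast item.
html = """
<div class="tombstone-container">
  <p class="period-name">Tonight</p>
  <p><img src="night.png" title="Tonight: Mostly clear, with a low around 49."></p>
  <p class="short-desc">Mostly Clear</p>
  <p class="temp temp-low">Low: 49 F</p>
</div>
"""

soup = BeautifulSoup(html, "html.parser")
tonight = soup.find(class_="tombstone-container")

# get_text() returns the text inside the matched tag.
period = tonight.find(class_="period-name").get_text()
short_desc = tonight.find(class_="short-desc").get_text()
temp = tonight.find(class_="temp").get_text()

# Attributes of a tag are read with dictionary-style access.
img = tonight.find("img")
title = img["title"]

print(period, short_desc, temp, sep=" | ")
print(title)
```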
To get the period names of the following days, we have to iterate over the period tags. Here we use a list comprehension for this.
Furthermore, using the ‘tombstone-container’ class, we extract the short description, temperature, and description for the different days of the forecast.
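One way to collect the values for all days at once is with CSS selectors and list comprehensions, as sketched below. Apart from ‘tombstone-container’, the id and class names and the sample data are assumptions.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for the forecast section with two periods.
html = """
<div id="seven-day-forecast">
  <div class="tombstone-container">
    <p class="period-name">Tonight</p>
    <p><img src="a.png" title="Tonight: Mostly clear."></p>
    <p class="short-desc">Mostly Clear</p>
    <p class="temp">Low: 49 F</p>
  </div>
  <div class="tombstone-container">
    <p class="period-name">Thursday</p>
    <p><img src="b.png" title="Thursday: Sunny."></p>
    <p class="short-desc">Sunny</p>
    <p class="temp">High: 63 F</p>
  </div>
</div>
"""

soup = BeautifulSoup(html, "html.parser")
seven_day = soup.find(id="seven-day-forecast")

# select() takes a CSS selector; each list comprehension pulls one field
# out of every forecast item.
period_tags = seven_day.select(".tombstone-container .period-name")
periods = [tag.get_text() for tag in period_tags]
short_descs = [t.get_text() for t in seven_day.select(".tombstone-container .short-desc")]
temps = [t.get_text() for t in seven_day.select(".tombstone-container .temp")]
descs = [img["title"] for img in seven_day.select(".tombstone-container img")]

print(periods, short_descs, temps, descs, sep="\n")
```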
For a clearer representation of the data, we map the values to a dataframe.
This dataframe is the final result of the web scraping.
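Assuming the dataframe is a pandas DataFrame, the mapping can be sketched like this. The lists are hard-coded sample values that stand in for the scraped results, and the column names are illustrative.

```python
import pandas as pd

# Sample values standing in for the lists scraped from the page.
periods = ["Tonight", "Thursday"]
short_descs = ["Mostly Clear", "Sunny"]
temps = ["Low: 49 F", "High: 63 F"]
descs = ["Tonight: Mostly clear.", "Thursday: Sunny."]

# Each list becomes one column; all lists must have the same length.
weather = pd.DataFrame(
    {
        "period": periods,
        "short_desc": short_descs,
        "temp": temps,
        "desc": descs,
    }
)

print(weather)
```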
Web scraping through Scrapy
Coming to another way of scraping, we will use the Scrapy framework. In Scrapy we write classes called spiders that define how a website will be scraped, by providing the starting URLs and what to do on each crawled page.
Using Scrapy, we will scrape the URLs of images of headphones from amazon.com and save them in a text file.
To install Scrapy, run:
pip install scrapy
Let’s start scraping with scrapy.
First, we have to open a command prompt and then create a new project by using the ‘startproject’ command, i.e. scrapy startproject headphones
Here the ‘startproject’ command creates a new folder named ‘headphones’. This folder contains four pre-generated files, items.py, settings.py, middlewares.py and pipelines.py, which are required for creating a spider. These files can be customized if required.
Now we will create a spider with the ‘genspider’ command, where we specify the name of the spider and the domain it will crawl.
NOTE: Keep in mind that the project name and spider name should be different.
We start with the most basic scraper class, a Python class that inherits from scrapy.Spider, the spider class provided by Scrapy. Here we set the name of the spider and then define the __init__() function.
In the start_requests() function we specify the URLs to be crawled. We then iterate over each URL and yield a request for it using Scrapy’s Request() function.
In the parse() function, we extract the URLs of the images through the ‘
Continuing the parse() function, we use a try and except block. The try part gets the next link, which is present in a ‘span’ tag, and yields a request that follows it.
Lastly, the except block handles the case where no more links are available. There we create a text file containing the URLs of the images, converting the URLs into strings as we write them. The code for this article can be found here.