Scraping the Top 15 Smartphones from Amazon Using R
What is Web Scraping?
Web scraping is a technique for converting data that is present on the web in an unstructured format (HTML tags) into a structured format that can easily be accessed and used.
Almost all major programming languages provide ways to perform web scraping. In this article, we’ll use R to scrape data on the most popular smartphones of 2019 from the Amazon website.
We’ll get a number of features for each of the 15 popular smartphones released in 2019. We’ll also look at the most common problems you might face while scraping data from the internet because of inconsistencies in website code, and at how to solve them. If you are more comfortable using Python, I recommend going through this website to get started with web scraping in Python.
Ways to scrape data
There are several ways of scraping data from the web. Some of the popular ways are:
- Human Copy-Paste: This is a slow, manual way of scraping data from the web. It involves people reading through the pages themselves and copying the data to local storage.
- Text pattern matching: Another simple yet powerful approach to extracting information from the web is to use the regular expression matching facilities of programming languages. You can learn more about regular expressions on this website.
- API Interface: Many websites like Facebook, Twitter, LinkedIn, etc. provide public and/or private APIs which can be called using standard code to retrieve the data in a prescribed format.
- DOM Parsing: By using web browsers, programs can retrieve the dynamic content generated by client-side scripts. It is also possible to parse web pages into a DOM tree, based on which programs can retrieve parts of these pages.
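As a small illustration of the text pattern matching approach listed above, here is a minimal sketch in base R. The HTML fragment and the "price" class are made up for the example and are not taken from Amazon; on real pages a pattern like this tends to be brittle, which is one reason the rest of this article uses DOM parsing instead.

```r
# A tiny HTML fragment standing in for a downloaded page (made-up data).
html <- '<span class="price">12,999</span><span class="price">8,499</span>'

# Pattern: the text that follows an opening price tag, up to the next "<".
# The look-behind needs perl = TRUE.
pattern <- '(?<=<span class="price">)[^<]+'

# gregexpr() locates every match; regmatches() pulls the matched text out.
prices <- regmatches(html, gregexpr(pattern, html, perl = TRUE))[[1]]
print(prices)
```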
For each smartphone, we will scrape the following features:
- Title: Name, storage, and color of the smartphone.
- Price: Price of the smartphone.
- Rating: Rating of the smartphone.
Steps for scraping the Amazon webpage using R
Now, let’s get started with scraping the Amazon website for the 15 most popular smartphones released in 2019. You can access them on this website.
rvest:- Hadley Wickham authored the rvest package for web scraping in R. rvest is useful for extracting the information you need from web pages. To install it in RStudio:
- Open the Tools menu and select Install Packages.
- Type rvest in the Packages field, then click the Install button.
rvest contains the basic web scraping functions, which are quite effective. Using the following functions, we will try to extract the data from web sites.
- read_html(url): reads the HTML content from a given URL
- html_nodes(): identifies HTML wrappers
- html_nodes(".class"): selects nodes based on CSS class
- html_nodes("#id"): selects nodes based on element id
- html_nodes(xpath = "xpath"): selects nodes based on an XPath expression
- html_attrs(): identifies attributes (useful for debugging)
- html_table(): turns HTML tables into data frames
- html_text(): strips the HTML tags and extracts only the text
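The functions listed above can be tried out without touching a real website. The following is a minimal sketch that parses a small, made-up HTML string (the page content, class name, and id are assumptions chosen for illustration) and applies each function to it.

```r
library(rvest)

# A small, made-up page used only for illustration.
page <- read_html('
<html><body>
  <h1 id="heading">Top phones</h1>
  <span class="name">Phone A</span>
  <span class="name">Phone B</span>
  <table>
    <tr><th>Model</th><th>Price</th></tr>
    <tr><td>Phone A</td><td>100</td></tr>
    <tr><td>Phone B</td><td>200</td></tr>
  </table>
</body></html>')

# Select by CSS class and keep only the text.
phone_names <- html_text(html_nodes(page, ".name"))

# Select by element id.
heading <- html_text(html_nodes(page, "#heading"))

# Select the same spans with an XPath expression instead of CSS.
xpath_names <- html_text(html_nodes(page, xpath = "//span[@class='name']"))

# List the attributes of the selected node (handy when debugging selectors).
heading_attrs <- html_attrs(html_nodes(page, "#heading"))

# Turn the first HTML table into a data frame.
phones <- html_table(html_nodes(page, "table"))[[1]]

print(phone_names)
print(phones)
```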
Let’s implement it and see how it works. We will scrape the Amazon website to compare the top 15 smartphones.
Loading the packages we need
#loading the package:
library(rvest)
Reading the HTML content from Amazon
#Specifying the URL for the desired website to be scraped
url <- "https://www.amazon.in/s?k=top+smartphones&dc&crid=3JN9QKV0R5211&sprefix=top+smart%2Caps%2C376&ref=a9_sc_1"
#Reading the html content from Amazon
webpage <- read_html(url)
In this code, we read the HTML content from the given URL and assign that HTML to the webpage variable.
Scrape product details from Amazon
Now, as the next step, we will extract the following information from the website:
Title: The title of the product.
Price: The price of the product.
Rating: The user rating of the product.
Size: The size of the product.
Color: The color of the product.
Next, we will use Inspect Element to find the HTML tags that hold the data we want, such as the product title and price. To find out the class of an HTML tag, use the following steps:
=> open the Chrome browser => go to this URL => right-click the element => choose Inspect
NOTE: If you are not using the Chrome browser, check out this article.
Based on CSS selectors such as class and id, we will scrape the data from the HTML. To find the CSS class for the product title, we need to right-click on the title and select “Inspect” or “Inspect Element”.
As you can see below, I extracted the title of the product with the help of html_nodes, to which I passed webpage, which stores the HTML content, and the CSS class of the title, .a-size-medium. I then got the title text using html_text and printed the first few titles with the help of the head() function.
#scrape title of the product
title_html <- html_nodes(webpage, '.a-size-medium')
title <- html_text(title_html)
head(title)
The titles come back padded with spaces and \n characters.
The next step is to remove the new lines and the surrounding spaces with the help of the gsub function.
# remove newlines and surrounding spaces
title <- trimws(gsub("\n", "", title))
Price of the product:
price_html <- html_nodes(webpage, '.a-price-whole')
price <- html_text(price_html)
head(price)
Rating of the products:
rating_html <- html_nodes(webpage, '.a-icon-alt')
rating <- html_text(rating_html)
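The price and rating values scraped this way are still character strings. If you want to sort or compare them, a minimal sketch of converting them to numbers could look like the following. The sample strings are made up, and their formats (a comma as thousands separator, and ratings worded as "x out of 5 stars") are assumptions about what the page returns, so check them against your own output first.

```r
# Made-up strings in the shape assumed for the scraped price and rating text.
sample_price  <- c("12,999", "8,499", "21,990")
sample_rating <- c("4.3 out of 5 stars", "4.1 out of 5 stars", "4.5 out of 5 stars")

# Drop the thousands separators, then convert to numbers.
price_num <- as.numeric(gsub(",", "", sample_price))

# Keep only the leading number of each rating string.
rating_num <- as.numeric(sub(" out of.*", "", sample_rating))

print(price_num)
print(rating_num)
```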
Now we have successfully scraped the title, price, and rating for the 15 most popular smartphones from Amazon released in 2019. Let’s combine them to create a data frame and inspect its structure.
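A minimal sketch of that last step is shown here with made-up vectors in place of the scraped ones, so the names and values are illustrative only. It assumes each vector holds exactly one entry per phone. On a real results page a selector can match extra elements, in which case the lengths differ and the vectors need to be trimmed or aligned before they can be combined.

```r
# Made-up values standing in for the scraped title, price, and rating vectors.
sample_title  <- c("Phone A (Blue, 64GB)", "Phone B (Black, 128GB)")
sample_price  <- c("12,999", "8,499")
sample_rating <- c("4.3 out of 5 stars", "4.1 out of 5 stars")

# data.frame() needs columns of equal length, so check that first.
stopifnot(length(sample_title) == length(sample_price),
          length(sample_price) == length(sample_rating))

# Combine the vectors into one data frame, one row per phone.
smartphones <- data.frame(Title = sample_title,
                          Price = sample_price,
                          Rating = sample_rating,
                          stringsAsFactors = FALSE)

# Inspect the structure of the result.
str(smartphones)
```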
I hope this article has given you a good understanding of web scraping in R. You should now also have a fair idea of the problems you might come across and how to work around them. As most of the data on the web is present in an unstructured format, web scraping is a really handy skill for any data scientist.
Did you enjoy reading this article? Do share your views with me. If you have any doubts or questions, feel free to drop them in the comments below.