To show why web scraping matters, let me share something that happened to me recently. One day a Chinese friend called to discuss our university's MoU norms. I was not at home, so my mother took the call, but she could not understand his language, and as a result I was not able to help him. He misunderstood me, and that is when I realized how necessary a translator can be.
Had I been at home, I could have translated what he said in Chinese, interpreted it, and clarified his concerns. Web scraping plays a similar role in today's online, social, and digital world.
Websites and other online sources hold information about all kinds of commodities, but often we cannot use that information directly for analysis or draw valuable insights from the raw data. This is where web scraping comes to the rescue. It reads the ungrouped data into a grouped format, so that we can build graphical views from it and make comparisons across different websites.
Now let us discuss what web scraping is. Web scraping is a technique for converting data that sits on the web in an unstructured format, wrapped in HTML tags, into a structured format that can easily be accessed, used, and analyzed.
Why do we need Web Scraping?
Collecting information in person is often not practical, whereas it can be gathered easily through internet surveys or social media. That makes it necessary to access data from the internet.
Some applications of web scraping in real life:
- Comparing different movies, medicines, etc. to analyze the data
- Scraping images from different websites to train image classification models
- Scraping data from social media sites such as Facebook and Twitter for sentiment analysis
- Scraping user reviews and feedback from Flipkart, Amazon, etc. for e-commerce purposes
- Comparing reviews to find out which laptop is better among HP, Dell, etc.
- Reviewing which bus is better for transportation or which hotel is cheapest for a tourist
Different Types of Web Scraping:
Human Copy-Paste: This is a slow and inefficient way of scraping data from the web. It involves people copying and pasting data from different websites by hand.
Text pattern matching: Another simple yet powerful approach is to extract information from the web using the regular-expression matching facilities of programming languages.
API Interface: Many websites such as Facebook, Twitter, and LinkedIn provide public and/or private APIs, which can be called with standard code to retrieve data in a prescribed format.
DOM Parsing: By using a web browser, programs can retrieve the dynamic content generated by client-side scripts.
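As a rough illustration of the text pattern matching approach listed above, the sketch below pulls prices out of a sentence using only base R regular expressions. The sample text and the "Rs." price format are made up for this example and do not come from any real page.

```r
# A made-up line of raw text, standing in for text found in a page's source.
raw_text <- "Redmi 5A costs Rs. 5,999 and Honor 7A costs Rs. 8,999 today."

# Find every occurrence of "Rs. " followed by digits and commas.
# gregexpr() locates the matches; regmatches() returns the matched substrings.
matches <- regmatches(raw_text, gregexpr("Rs\\. [0-9,]+", raw_text))[[1]]

# Remove everything that is not a digit, then convert to numbers.
prices <- as.numeric(gsub("[^0-9]", "", matches))
print(prices)
```

Pattern matching like this works well on short, regular pieces of text, but it becomes fragile on full HTML pages, which is why the example later in this article selects elements with CSS selectors instead.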
There are many software tools for web scraping. Here we will discuss how to scrape data from a website using R and RStudio.
Suppose we search for “mobile” on flipkart.com and get a page of search results. We want to collect the name, price, and rating of each mobile on that page. First we copy the link of the webpage we want to scrape, and then run the following R code, which uses the rvest package:
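# rvest provides read_html(), html_nodes() and html_text()
library(rvest)
# Search-results page to scrape
url <- 'https://www.flipkart.com/search?q=Mobile&marketplace=FLIPKART&otracker=start&as-show=on&as=off'
webpage <- read_html(url)
# Product names
name_html <- html_nodes(webpage, '._3wU53n')
names <- html_text(name_html)
# Prices: strip the rupee sign (\u20b9) and the thousands separators, then convert to numbers
price_html <- html_nodes(webpage, '._2rQ-NK')
price <- html_text(price_html)
price <- as.numeric(gsub(pattern = "[\u20b9,]", replacement = "", x = price))
# Ratings: keep the first 24 entries, matching the 24 products, and drop the trailing star
rating_html <- html_nodes(webpage, '._2beYZw')
rating <- html_text(rating_html)[1:24]
rating <- as.numeric(sub(pattern = " ★", replacement = "", x = rating))
# Combine the three vectors into one data frame
data <- data.frame(Product.description = names, Price = price, Rating = rating)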
Running the above code, we get the following output:
    Product.description                         Price  Rating
 1  Redmi Note 5 Pro (Black, 64 GB)             14999  4.5
 2  Asus Zenfone Max Pro M1 (Black, 32 GB)      10999  4.2
 3  Redmi Note 5 Pro (Gold, 64 GB)              14999  4.5
 4  Samsung Galaxy J6 (Black, 32 GB)            12990  4.4
 5  Asus Zenfone Max Pro M1 (Black, 64 GB)      12999  4.3
 6  Infinix HOT 6 Pro (Sandstone Black, 32 GB)   7999  4.3
 7  Honor 7A (Black, 32 GB)                      8999  4.2
 8  Honor 7A (Blue, 32 GB)                       8999  4.2
 9  Honor 7A (Gold, 32 GB)                       8999  4.2
10  Redmi Y1 (Grey, 32 GB)                       8999  4.2
11  Redmi 5A (Rose Gold, 16 GB)                  5999  4.4
12  Redmi 5A (Grey, 16 GB)                       5999  4.4
13  Redmi 5A (Gold, 16 GB)                       5999  4.4
14  Samsung Galaxy On6 (Blue, 64 GB)            14490  4.3
15  Redmi 5A (Gold, 32 GB)                       6999  4.4
16  Redmi 5A (Blue, 16 GB)                       5999  4.4
17  Redmi 5A (Rose Gold, 32 GB)                  6999  4.4
18  Redmi 5A (Blue, 32 GB)                       6999  4.4
19  Redmi 5A (Grey, 32 GB)                       6999  4.4
20  Samsung Galaxy J8 (Blue, 64 GB)             18990  4.4
21  Asus Zenfone Max Pro M1 (Grey, 64 GB)       14999  4.5
22  Redmi Note 5 (Gold, 32 GB)                   9999  4.4
23  Asus Zenfone Max Pro M1 (Grey, 32 GB)       10999  4.2
24  Samsung Galaxy On6 (Black, 64 GB)           14490  4.3
A natural question is what the second argument of the function html_nodes() is. It is the CSS selector of the elements we want. You can think of it as the address of the elements you want to select on the webpage. You can find these CSS selectors by the following method:
- Install the “SelectorGadget” extension in Google Chrome.
- Go to the webpage.
- Click on the extension.
- Select the element you are interested in.
- The CSS selector appears in a bar at the bottom of the page.
You can visit https://selectorgadget.com/ to understand how SelectorGadget works.
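To make the idea of a selector as an "address" more concrete, here is a minimal sketch that runs html_nodes() on a tiny hand-written page instead of a live website. The HTML and the class names "title" and "cost" are invented for illustration; a real page will have its own class names, which is what SelectorGadget helps you find.

```r
library(rvest)

# A tiny hand-written page. The class names are hypothetical.
page <- read_html('
  <div class="product"><span class="title">Phone A</span><span class="cost">9,999</span></div>
  <div class="product"><span class="title">Phone B</span><span class="cost">14,999</span></div>
')

# ".title" selects every element whose class is "title";
# html_text() then returns the text inside each selected element.
titles <- html_text(html_nodes(page, ".title"))
costs  <- html_text(html_nodes(page, ".cost"))

print(titles)
print(costs)
```

A leading dot in a selector means "class", so '._3wU53n' in the Flipkart example is simply the class name that the site happened to assign to product titles when the page was scraped.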
Another question that may arise is why I applied replacements to the variables “price” and “rating”. When we scrape data from a webpage, it may not arrive in the form we need: here the price text carries a currency symbol and commas, and the rating text carries a trailing star. So we have to modify it to serve our purpose.
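The cleaning step can be tried on its own, without touching the website. The sketch below assumes the scraped strings look like "₹14,999" and "4.5 ★", which is what the replacements in the code above imply; the sample values are typed in by hand.

```r
# Hand-typed sample strings in the assumed scraped format.
# \u20b9 is the rupee sign and \u2605 is the star character.
price_raw  <- c("\u20b914,999", "\u20b910,999", "\u20b97,999")
rating_raw <- c("4.5 \u2605", "4.2 \u2605", "4.3 \u2605")

# Keep only digits and the decimal point, then convert to numbers.
price  <- as.numeric(gsub("[^0-9.]", "", price_raw))
rating <- as.numeric(gsub("[^0-9.]", "", rating_raw))

print(price)
print(rating)
```

Removing everything except digits and the decimal point is a slightly more general choice than naming each unwanted character, since it keeps working if the site adds other symbols around the number.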
So, this is how we can easily scrape data from a webpage using R. Once we have the data, we can analyze it as needed.
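As one possible example of such an analysis, the sketch below computes the average price per brand. It uses a handful of rows copied from the output shown earlier, and it assumes that the first word of each product description is the brand, which holds for these rows but may not hold in general.

```r
# A few rows copied from the scraped output above.
phones <- data.frame(
  Product.description = c("Redmi Note 5 Pro (Black, 64 GB)",
                          "Asus Zenfone Max Pro M1 (Black, 32 GB)",
                          "Samsung Galaxy J6 (Black, 32 GB)",
                          "Redmi 5A (Gold, 16 GB)",
                          "Samsung Galaxy J8 (Blue, 64 GB)"),
  Price  = c(14999, 10999, 12990, 5999, 18990),
  Rating = c(4.5, 4.2, 4.4, 4.4, 4.4),
  stringsAsFactors = FALSE
)

# Assumption: the brand is the first word of the description.
phones$Brand <- sub(" .*", "", phones$Product.description)

# Average price for each brand.
avg_price <- aggregate(Price ~ Brand, data = phones, FUN = mean)
print(avg_price)
```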