Understanding Natural Language Processing

In this article, we will discuss Natural Language Processing (NLP) in detail. Human civilization has a rich and vivid history of evolution, and that evolution includes the art of communication, which human beings have mastered over the years.

Language is an integral part of how we communicate, and with the advent of Artificial Intelligence and Machine Learning, we are now trying to give machines this capability as well.

This article will help you understand Natural Language Processing, a collection of principles and techniques used to perform a wide range of language-related tasks.

Through this article, we will look at the following topics:

  1. What is Natural Language Processing?
  2. Tasks performed using NLP.
  3. Essential NLP Libraries
  4. Fundamentals of NLP
    • Data Retrieval through Web Scraping.
    • Preprocessing the text.
    • Feature Engineering on text data.
    • Modeling the data and/or Making predictions.

What is Natural Language Processing?
Natural Language Processing is a field created by combining computer science and artificial intelligence. It is concerned with the interactions between computers and human (natural) languages.

NLP encompasses a variety of tasks built around systematic processes for analyzing, understanding, and deriving useful knowledge from the text data available across numerous sectors.

NLP is especially useful when we have unstructured data such as free text. It provides the tools, techniques, and algorithms needed to process this natural-language data.

Essential libraries used for NLP in Python
With Python, we have the advantage of handling complex tasks through libraries, and there are several libraries that facilitate the various operations in NLP.

Let’s have a look at the following libraries:

Natural Language Toolkit (NLTK)
NLTK is the foremost platform for building Python programs that work with linguistic data. It is an open-source, community-driven project.

Scikit-learn
This library is not limited to NLP; it is applied widely across machine learning.

TextBlob
TextBlob provides a simple API for common NLP tasks that is easy for users to work with.

spaCy
spaCy lets us implement NLP pipelines efficiently; it is written in Python and Cython and has excellent capabilities for named entity recognition.

Gensim
Gensim is used primarily for topic modelling when we deal with text documents.

Stanford CoreNLP
Stanford University has contributed a great deal to the development of NLP, and the Stanford NLP Group has built numerous NLP packages and services, including Stanford CoreNLP.

Fundamentals of NLP
Getting data from various sources
Data is the primary requirement for any kind of NLP work. We have two options: use ready-made datasets from various sources and domains, or scrape data from a website, which is known as web scraping.

Preprocessing the text
The data obtained from different sources is usually full of noise, i.e. irrelevant or erroneous content that will affect the quality of the results.

So it is always recommended to clean and standardize the text. This makes the text noise-free and ready for further analysis.

There are numerous methods for preprocessing text. Some of them are as follows:

Noise Removal. Any text in a document that is irrelevant to the context of the data and to the final output can be termed noise.

For example, stopwords such as is, am, the, of, and in appear extensively in our language but usually carry little significance for analysis. Likewise, URLs, links, social media handles, and punctuation often add no value.

Text Normalization. We often encounter a single word in several different surface forms, which is another kind of noise. For example, sing, singer, singing, sings, and sang are all variations of the word sing.

To deal with this, we use two methods which are Stemming and Lemmatization.

Feature Engineering on text data.
After preprocessing the data, we perform feature engineering to turn the text into inputs that support prediction. The choice of model, if we build one, depends on the features we use.

Again there are various ways through which we can extract features from the data. Some of those methods are as follows:

Syntactic Parsing: In parsing, we analyse the words in a sentence for grammar and deduce the relationships among them. Dependency parsing and part-of-speech tagging are the preferred methods for analysing syntax.

  • Dependency Trees: Sentences are formed by joining words together, and the relationships the words have with one another are depicted through dependency grammar.
  • Part-of-Speech Tagging: Every word in a sentence carries a part-of-speech (POS) tag (noun, verb, adjective, adverb). Through POS tags we can understand how a word is being used (a small tagging sketch follows this list).
  • Entity Extraction: Entities are of paramount importance in any sentence; they are typically noun phrases, verb phrases, or both. Algorithms for entity extraction rely on techniques such as rule-based parsing, dictionary lookups, POS tagging, and dependency parsing.
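As a quick, hedged illustration of POS tagging, here is a minimal sketch using NLTK; the example sentence and the downloaded resource names are only assumptions for illustration.

    import nltk

    # One-time downloads of the tokenizer and tagger models (resource names may
    # vary slightly between NLTK versions)
    nltk.download('punkt')
    nltk.download('averaged_perceptron_tagger')

    sentence = "The quick brown fox jumps over the lazy dog"
    tokens = nltk.word_tokenize(sentence)

    # Each token is paired with a tag such as DT (determiner), JJ (adjective),
    # NN (noun) or VBZ (verb)
    print(nltk.pos_tag(tokens))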

There are many approaches for this purpose, but Topic Modelling and Named Entity Recognition are the most widely used.

Named Entity Recognition: Through named entity recognition we detect named entities such as people, locations, companies, and organizations in sentences. For example:

Sentence: Tim Cook, the CEO of Apple Inc., was present at the launch of the new iPhone in New York.

Named Entities: (“person”: “Tim Cook”), (“org”: “Apple Inc.”), (“location”: “New York”)
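As a minimal sketch, named entity recognition on this sentence can be performed with spaCy (assuming the small English model en_core_web_sm has been installed):

    import spacy

    # Install the model first with: python -m spacy download en_core_web_sm
    nlp = spacy.load("en_core_web_sm")

    doc = nlp("Tim Cook, the CEO of Apple Inc., was present at the launch "
              "of the new iPhone in New York.")

    # Print each detected entity with its label (PERSON, ORG, GPE, ...)
    for ent in doc.ents:
        print(ent.text, ent.label_)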

Topic Modelling: In topic modelling, the main aim is to identify the topics present in a text document. Topic modelling is an unsupervised method that uncovers hidden patterns among the words of a corpus.

Here a topic is defined as a recurring pattern of related terms that occur together in the corpus.

Statistical Features. The methods mentioned above help structure the text so that algorithms can work with it, but there are also ways to quantify the text directly as numbers.

One such method is Term Frequency – Inverse Document Frequency (TF-IDF).

This is a weighting scheme commonly used in information retrieval. With it, we convert text into vectors based on the occurrence of words in documents.

Term Frequency (TF) – the number of times a term occurs in a document.

Inverse Document Frequency (IDF) – IDF assigns more weight to words that appear in few documents but carry greater significance. Using TF-IDF, we can down-weight words that occur frequently across the corpus but add little value.
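As a minimal sketch, TF-IDF vectors can be computed with scikit-learn's TfidfVectorizer; the three-sentence corpus below is only for illustration.

    from sklearn.feature_extraction.text import TfidfVectorizer

    corpus = [
        "the sky is blue",
        "the sun is bright",
        "the sun in the sky is bright",
    ]

    vectorizer = TfidfVectorizer()
    tfidf_matrix = vectorizer.fit_transform(corpus)   # sparse document-term matrix

    print(vectorizer.get_feature_names_out())  # learned vocabulary
    print(tfidf_matrix.toarray())              # TF-IDF weight of each word per document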

Tasks performed using NLP.
Once we have completed preprocessing and feature engineering on the text data, we can run many different experiments and obtain a variety of results.

As already mentioned, NLP comprises a range of tasks through which we have made big strides in linguistic interaction with machines.

Let’s have a look at the following tasks:
Speech Recognition
This application of NLP enables actual spoken communication between us and machines. Voice-operated virtual assistants such as Apple’s Siri, Amazon’s Alexa, and Microsoft’s Cortana are practical implementations of speech recognition.

Automatic Summarization
Advances in NLP have made it possible to summarize large texts into short ones, which helps in getting the gist of information available in bulk. Inshorts, for example, is a popular application that delivers each news story in 60 words or fewer.

Machine Translation
Using this, we can convert text from one language into another desired language. Google Translate is one of the most popular examples, and many other apps use the same concept to help visitors understand a local language.

Named Entity Recognition
Any sentence refers to real-world objects such as people, places, and organizations, which are denoted by proper names. Identifying these entities in a sentence helps in extracting information and then classifying the entities into predefined classes.

Sentiment Analysis
One of the most popular applications of NLP is analysing the sentiment of large and varied datasets, from surveys to reviews. Through sentiment analysis we aim to understand the views expressed in the text.

These views are classified as positive, negative, or neutral, and the classification is done by quantifying a polarity score.

The main requirement for sentiment analysis is subjective text, i.e. sentences or phrases expressing emotions, opinions, or views; purely objective text will not yield the desired results. This application is used extensively for social media analysis, gauging public response during elections, measuring reactions to a movie, and so on.

Topic Segmentation/Topic Modelling
Every document or article comprises various topics, and the process of identifying the topics in a text corpus is known as topic modelling or topic segmentation.

Different algorithms are used to derive topics, most of which work by looking at the frequencies of the words.

Practical Implementation of Natural Language Processing

We have already discussed most of the fundamentals of NLP. Now it’s time to use them to implement a real-world scenario.

Here we will scrape news articles from Inshorts, a website built to provide short, 60-word news articles on a wide range of topics.

In this article, we will work with text data from news articles on sports, science, and technology.

Step: 1 – Scraping the data from news articles.
First, we have to create a web scraper that extracts the data from the news articles. For this, let's start by importing the necessary libraries.

For this data extraction we require the requests and BeautifulSoup libraries.

Now we use the requests library to fetch the HTML content of the web pages for the three news categories, and then BeautifulSoup to parse and extract the news headline and article content for each category.

Through the build_dataset() function we fetch the news headline, article text, and category. We then build a dataframe in which each row represents one news article by invoking the function and creating the dataset.
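A rough sketch of such a scraper is shown below; the Inshorts URLs and the itemprop attributes used to locate headlines and article bodies are assumptions and may need adjusting to the site's current markup.

    import requests
    import pandas as pd
    from bs4 import BeautifulSoup

    # Assumed category pages on Inshorts
    seed_urls = {
        'sports': 'https://inshorts.com/en/read/sports',
        'science': 'https://inshorts.com/en/read/science',
        'technology': 'https://inshorts.com/en/read/technology',
    }

    def build_dataset(seed_urls):
        """Fetch each category page and collect headline, article text and category."""
        news_data = []
        for category, url in seed_urls.items():
            page = requests.get(url)
            soup = BeautifulSoup(page.content, 'html.parser')
            # The attribute values below are assumptions about the page markup
            headlines = [h.get_text() for h in soup.find_all('span', attrs={'itemprop': 'headline'})]
            articles = [a.get_text() for a in soup.find_all('div', attrs={'itemprop': 'articleBody'})]
            for headline, article in zip(headlines, articles):
                news_data.append({'news_headline': headline,
                                  'news_article': article,
                                  'news_category': category})
        return pd.DataFrame(news_data)

    news_df = build_dataset(seed_urls)
    print(news_df.head())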

We can now inspect the dataset of news articles, and to check the number of articles fetched we can use the code below.
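Continuing with the news_df dataframe from the sketch above, a one-line check reports how many articles were fetched per category:

    # Number of scraped articles per news category
    print(news_df['news_category'].value_counts())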

It can be seen that 25 articles are extracted for each of the three categories.

Step: 2 – Preprocessing the text
Now, to make the text data fit for analysis, we have to perform a number of preprocessing steps. For this preprocessing we need the following libraries, so let's import them.
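The setup below is a minimal sketch of those imports and of the tokenizer and stopword configuration; the specific choice of the ToktokTokenizer is an assumption.

    import re
    import nltk
    import pandas as pd
    from bs4 import BeautifulSoup
    from nltk.tokenize.toktok import ToktokTokenizer

    nltk.download('stopwords')

    # Tokenizer used to split text into word tokens
    tokenizer = ToktokTokenizer()

    # Standard English stopword list, keeping 'no' and 'not' because they
    # carry sentiment information
    stopword_list = nltk.corpus.stopwords.words('english')
    stopword_list.remove('no')
    stopword_list.remove('not')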

Here the text is tokenized, i.e. the words are converted into tokens. Moreover, the stopwords ‘no’ and ‘not’ are removed from the stopword list, as they carry significance for sentiment analysis.

Removing HTML Tags: Text data obtained through web scraping contains noise, especially HTML tags, which add no value to our understanding of the data. Thus, we should remove them.
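A minimal version of this HTML-stripping step, using BeautifulSoup, could look like this:

    from bs4 import BeautifulSoup

    def strip_html_tags(text):
        """Parse the text as HTML and return only the visible text content."""
        soup = BeautifulSoup(text, 'html.parser')
        return soup.get_text()

    print(strip_html_tags('<p>Some <b>important</b> text</p>'))
    # -> 'Some important text'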

The code above removes the unwanted HTML tags and extracts the plain text from the document.

Expansion of Contracted Words: In our day-to-day spoken language, we use a huge number of contracted word forms. These shortened words often do not make sense for our analysis, so we have to expand them into their original representation.

To understand the various contractions, there is a contraction.py file in my repository which is used here.

For example: we’ll becomes we will.
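The full contraction map lives in the contraction.py file mentioned above; the sketch below uses a tiny stand-in dictionary just to illustrate the expansion step.

    import re

    # Tiny stand-in for the full CONTRACTION_MAP in contraction.py
    CONTRACTION_MAP = {
        "we'll": "we will",
        "don't": "do not",
        "can't": "cannot",
        "it's": "it is",
    }

    def expand_contractions(text, contraction_map=CONTRACTION_MAP):
        """Replace every contraction found in the map with its expanded form."""
        pattern = re.compile('({})'.format('|'.join(re.escape(k) for k in contraction_map)),
                             flags=re.IGNORECASE)

        def replace(match):
            return contraction_map.get(match.group(0).lower(), match.group(0))

        return pattern.sub(replace, text)

    print(expand_contractions("We'll go, but don't wait and it's late."))
    # -> 'we will go, but do not wait and it is late.'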

Removal of Special Characters: All non-alphanumeric characters are considered special characters. On many occasions we even remove numeric characters from the text, but that depends on the problem at hand.

To remove special characters we use regular expressions, which can strip out these symbols easily.
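A minimal regular-expression version of this step (with a flag to optionally drop digits as well) might look like:

    import re

    def remove_special_characters(text, remove_digits=False):
        """Drop everything that is not a letter, digit or whitespace."""
        pattern = r'[^a-zA-Z0-9\s]' if not remove_digits else r'[^a-zA-Z\s]'
        return re.sub(pattern, '', text)

    print(remove_special_characters("Well this was fun! What do you think? 123#@!"))
    # -> 'Well this was fun What do you think 123'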

Lemmatization
Lemmatization is used to reduce a word to its root form (lemma), which removes the ambiguity between the various forms of the same word. Here we use nltk for lemmatization.
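A minimal sketch with nltk's WordNetLemmatizer is shown below; note that it needs a part-of-speech hint to lemmatize verbs correctly.

    import nltk
    from nltk.stem import WordNetLemmatizer

    nltk.download('wordnet')

    lemmatizer = WordNetLemmatizer()

    # Passing the part of speech ('v' for verb) gives better lemmas
    print(lemmatizer.lemmatize('singing', pos='v'))  # -> 'sing'
    print(lemmatizer.lemmatize('cars'))              # -> 'car' (defaults to noun)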

Stemming: Stemming is similar to lemmatization, but it works on the word stem rather than the dictionary root: the main aim is to remove suffixes (such as “ing” or “ed”) attached to a word.

Note that stemming often produces a word that is not lexicographically correct, but it still helps standardize the text.

Again, the nltk library has built-in stemmers, which we use here as well.
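A minimal stemming sketch with nltk's PorterStemmer:

    from nltk.stem import PorterStemmer

    stemmer = PorterStemmer()

    def simple_stemmer(text):
        """Stem every whitespace-separated token in the text."""
        return ' '.join(stemmer.stem(word) for word in text.split())

    print(simple_stemmer("My system keeps crashing his crashed yesterday ours crashes daily"))
    # Stems such as 'crash' or 'daili' are not always valid dictionary words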

Removal of Stopwords: The most frequent words in our communication, such as a, an, the, and and, carry little significance. In general, articles, conjunctions, and prepositions make up the stopword list.

It is therefore recommended to remove these stopwords, because they convey little information and outnumber the useful words. The nltk library provides predefined stopword lists, which we use here; we also have the option of creating our own.
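A minimal sketch of the stopword-removal step, reusing the tokenizer and the adjusted stopword list set up earlier:

    def remove_stopwords(text):
        """Drop tokens that appear in the stopword list (case-insensitive)."""
        tokens = [token.strip() for token in tokenizer.tokenize(text)]
        filtered = [token for token in tokens if token.lower() not in stopword_list]
        return ' '.join(filtered)

    print(remove_stopwords("The, and, if are stopwords, computer is not"))
    # 'not' is kept because it was removed from the stopword list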

Text Normalizer: Depending on the dataset, many more preprocessing steps are possible, but we will stop here. Now all the techniques above are combined into one function and applied to our dataset.
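A minimal sketch of such a combined normalizer, assuming the helper functions and the lemmatizer sketched above are available:

    def normalize_corpus(corpus, html_stripping=True, contraction_expansion=True,
                         text_lower_case=True, special_char_removal=True,
                         text_lemmatization=True, stopword_removal=True):
        """Apply the preprocessing steps sketched above to every document."""
        normalized = []
        for doc in corpus:
            if html_stripping:
                doc = strip_html_tags(doc)
            if contraction_expansion:
                doc = expand_contractions(doc)
            if text_lower_case:
                doc = doc.lower()
            if special_char_removal:
                doc = remove_special_characters(doc)
            if text_lemmatization:
                doc = ' '.join(lemmatizer.lemmatize(word, pos='v') for word in doc.split())
            if stopword_removal:
                doc = remove_stopwords(doc)
            normalized.append(doc)
        return normalized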

Now it's time to use this text normalizer, but before that we combine the news headline and the news article text into a single document. One such document is created for every piece of news.
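Continuing with news_df from the scraping step and the normalize_corpus function above, the combining and cleaning could be sketched as:

    # Combine headline and article body into one document per news item
    news_df['full_text'] = news_df['news_headline'] + '. ' + news_df['news_article']

    # Clean every document with the normalizer sketched above
    news_df['clean_text'] = normalize_corpus(news_df['full_text'])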

This is how the text normalizer cleans the data: the clean text and the full text differ clearly because of the preprocessing. We can now store this pre-processed data for further use as a .csv file.
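Saving the pre-processed dataframe is then a single pandas call:

    # Persist the cleaned dataset for later reuse
    news_df.to_csv('news.csv', index=False, encoding='utf-8')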

Step: 3 – Sentiment Analysis: 
For sentiment analysis of the news articles we have to use an unsupervised technique, because our dataset is not labelled.

To handle this unlabelled data we will use lexicons. A lexicon is a dictionary, vocabulary, or book of words; a sentiment lexicon maps words to sentiment scores, and that is what we use here.

There are different sentiment lexicons, and one of them is the AFINN lexicon, which is among the simplest and most widely used. To apply it, we use the afinn Python library. We also create a corpus that will be used for this sentiment analysis.

The corpus created above is used to generate sentiment scores, which are then used to classify the articles into the three sentiment categories.
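A minimal sketch of the AFINN-based scoring and labelling, assuming the afinn package is installed (pip install afinn) and the clean_text column built earlier:

    from afinn import Afinn

    afn = Afinn(emoticons=True)

    # The corpus is simply the list of cleaned articles
    corpus = news_df['clean_text'].tolist()

    # Sentiment score per article, then a categorical label derived from it
    sentiment_scores = [afn.score(article) for article in corpus]
    sentiment_labels = ['positive' if score > 0
                        else 'negative' if score < 0
                        else 'neutral'
                        for score in sentiment_scores]

    news_df['sentiment_score'] = sentiment_scores
    news_df['sentiment_label'] = sentiment_labels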

Let's look at what the sentiment scores of the three news categories convey, using the pandas library.
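One way to summarize the scores per category, continuing with the columns added above:

    # Descriptive statistics of the AFINN scores, grouped by news category
    print(news_df.groupby('news_category')['sentiment_score'].describe())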

We can deduce from the results that the average sentiment of sports news articles is positive, whereas the average for technology news articles indicates negative sentiment.

Now we will create visualizations of these scores and then analyse the results.

Here we create strip plots and box plots using the Seaborn library.
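A minimal plotting sketch, continuing with news_df:

    import matplotlib.pyplot as plt
    import seaborn as sns

    fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(14, 4))

    # Individual article scores per category
    sns.stripplot(x='news_category', y='sentiment_score', data=news_df, ax=ax1)

    # Distribution of scores per category
    sns.boxplot(x='news_category', y='sentiment_score', data=news_df, ax=ax2)

    plt.show()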

Both visualizations make the results easier to read. We can see that technology and science have news articles spread across the negative and positive range.

We can also depict the count of sentiment labels for each category of news articles. The code for that follows.
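A hedged sketch of that count plot, using Seaborn's countplot on the sentiment labels:

    import matplotlib.pyplot as plt
    import seaborn as sns

    # Number of positive/negative/neutral articles per news category
    sns.countplot(x='news_category', hue='sentiment_label', data=news_df)
    plt.show()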

This chart corroborates that technology has the largest number of articles with negative sentiment, science has mostly neutral articles, and most sports news articles lean positive.


Sentiment Analysis with TextBlob:
Another library which is open-source and extensively used for NLP tasks is TextBlob. Now we’ll perform sentiment analysis with this library.

Again we generate sentiment polarity scores and then classify them into the three categories.
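A minimal sketch of the TextBlob-based scoring, assuming the textblob package is installed and reusing the clean_text column:

    from textblob import TextBlob

    # TextBlob polarity lies in [-1, 1]; we bucket it into three labels
    tb_scores = [TextBlob(article).sentiment.polarity for article in news_df['clean_text']]
    tb_labels = ['positive' if score > 0
                 else 'negative' if score < 0
                 else 'neutral'
                 for score in tb_scores]

    news_df['tb_sentiment_score'] = tb_scores
    news_df['tb_sentiment_label'] = tb_labels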

We’ll repeat the process of analysing the sentiments.
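As before, a quick pandas summary of the TextBlob scores per category:

    print(news_df.groupby('news_category')['tb_sentiment_score'].describe())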

Here we can see that the average sentiment score is positive for all three categories, with sports news articles having the highest average. Now we visualize the results using the Seaborn library.
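A small sketch of that visualization, again with a Seaborn count plot over the TextBlob labels:

    import matplotlib.pyplot as plt
    import seaborn as sns

    sns.countplot(x='news_category', hue='tb_sentiment_label', data=news_df)
    plt.show()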

Here technology has the largest number of positive-sentiment news articles, whereas science has the largest number of negative-sentiment articles.

The Jupyter notebook consisting of the code and dataset for this article can be found here.
