stringr Package in R for Handling Strings

stringr is a very handy package in R for handling string data and manipulating it into the form our models require.


The functions in this package perform four main kinds of string manipulation:

  1. Character manipulation: functions that manipulate individual characters within the strings of a character vector.
  2. Whitespace tools to add, remove, and manipulate whitespace.
  3. Locale-sensitive operations, whose results vary from locale to locale.
  4. Pattern-matching functions, which recognize four engines of pattern description.

Installation:
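A minimal sketch of a typical installation, assuming the package is fetched from CRAN (the check with requireNamespace() is only a convenience to skip the download when the package is already present):

```r
# Install stringr from CRAN only if it is not already available,
# then attach it for the current session.
if (!requireNamespace("stringr", quietly = TRUE)) {
  install.packages("stringr")
}
library(stringr)
```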

This package contains a large number of functions. A few of the important ones are discussed below.

Case functions

The following three functions convert the case of strings.

  1. str_to_upper() – This converts the input string to upper case. The syntax is:
    str_to_upper(string, locale = "en")

Here “en” stands for English, which is the default locale.

We immediately see that all the characters in the strings are converted to upper case.
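As a small illustration of the behaviour described above, here is a sketch using a made-up input string (the string itself is an assumption, not taken from the original example):

```r
library(stringr)

# Hypothetical input string, chosen only for illustration
sentence <- "stringr is handy"

upper <- str_to_upper(sentence)
print(upper)
# [1] "STRINGR IS HANDY"
```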

  2. str_to_lower() – This converts the input string to lower case. The syntax is:

str_to_lower(string, locale = "en")

Here “en” stands for English, which is the default locale.

  3. str_to_title() – This converts the input string to title case, that is, the first character of each word is capitalized and the rest are in lower case. The syntax is:

str_to_title(string, locale = "en")

Here “en” stands for English, which is the default locale.
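The lower-case and title-case conversions can be sketched in the same way. The mixed-case input below is an assumption made up for this illustration:

```r
library(stringr)

# Hypothetical mixed-case input, chosen only for illustration
mixed <- "StringR IS Handy"

lower <- str_to_lower(mixed)
title <- str_to_title(mixed)

print(lower)
# [1] "stringr is handy"
print(title)
# [1] "Stringr Is Handy"
```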

str_c()

This is used to concatenate multiple strings into a single string. The syntax is:

str_c(..., sep = "")

Here “sep” stands for the separator. It is used when we want to concatenate the strings with a separator between each term.

Here the output is produced, but it is hard to read because there is no space between the words. To make it more readable, we include a separator.

Now, this looks good.
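The two cases described above, without and with a separator, might look like the following sketch. The three words are placeholder inputs, not the strings from the original example:

```r
library(stringr)

# Default separator is the empty string, so the words run together
joined_tight <- str_c("stringr", "is", "handy")
print(joined_tight)
# [1] "stringrishandy"

# A single space as the separator makes the result readable
joined_spaced <- str_c("stringr", "is", "handy", sep = " ")
print(joined_spaced)
# [1] "stringr is handy"
```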

str_length()

This is used to find the length of the input string, in other words, the total number of characters in the string. Note that the spaces between words are also counted. The syntax is:

str_length(string)

This works as expected: the string has 16 characters, including the spaces.
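For illustration, here is a sketch with a made-up 16-character string (an assumption, since the original input is not shown): 14 letters plus 2 spaces.

```r
library(stringr)

# Hypothetical input: 7 + 1 + 2 + 1 + 5 = 16 characters, spaces included
sentence <- "stringr is handy"

n_chars <- str_length(sentence)
print(n_chars)
# [1] 16
```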

str_count()

This is used to count the number of occurrences of the specified pattern in the given string. The syntax is:

str_count(string, pattern = "")

The pattern can be anything: letters, digits, or special characters.

Here we have entered the names of four fruits and asked for the number of times “p” occurs in each name. The output shows that “p” occurs 2 times in “apple”, 1 time in “pears”, and not at all in the remaining names, which is correct.
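A sketch of this count follows. The text names “apple”, “pears”, and (later) “orange”; the fourth fruit, “banana”, is an assumption chosen to be consistent with the outputs described in this article.

```r
library(stringr)

# "banana" is an assumed fourth fruit; the other three appear in the text
fruits <- c("apple", "pears", "orange", "banana")

p_counts <- str_count(fruits, pattern = "p")
print(p_counts)
# [1] 2 1 0 0
```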

str_detect()

This function is similar to the one above. The only difference is that it returns a logical (boolean) value: it tells us whether each given string contains the given pattern or not, so the output is TRUE or FALSE. The syntax is:

str_detect(string, pattern)

The syntax is the same as for the previous function, so we keep the same example and compare the two results.

Here the output is TRUE or FALSE, i.e., the first two strings contain the pattern and the rest do not.
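Using the same kind of fruit vector, the comparison can be sketched like this (the fourth fruit, “banana”, is again an assumption):

```r
library(stringr)

# "banana" is an assumed fourth fruit; the other three appear in the text
fruits <- c("apple", "pears", "orange", "banana")

has_p <- str_detect(fruits, pattern = "p")
print(has_p)
# [1]  TRUE  TRUE FALSE FALSE
```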

str_split()

This does the opposite of str_c(): it splits the given string at the given separator. The syntax is:

str_split(string, pattern, n = Inf, simplify = FALSE)

Here we split the string on spaces, so we get all the individual words.
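A sketch of such a split, using a made-up sentence as input. Note that with the default simplify = FALSE the result is a list with one character vector per input string:

```r
library(stringr)

# Hypothetical input string, chosen only for illustration
sentence <- "stringr is handy"

# Split on a single space; the result is a list of length 1
words <- str_split(sentence, pattern = " ")
print(words[[1]])
# [1] "stringr" "is"      "handy"
```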

Replace functions

  1. str_replace() – This replaces the first occurrence of the pattern with the given replacement. All other occurrences remain as they are. The syntax is:

str_replace(string, pattern, replacement)

We see that only the first occurrence of “p” is replaced by “b”; the next one remains.
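For instance, assuming the input is the single word “apple” (the original input is not shown), a sketch would be:

```r
library(stringr)

# Only the first "p" is replaced; the second one is left untouched
first_only <- str_replace("apple", pattern = "p", replacement = "b")
print(first_only)
# [1] "abple"
```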

  2. str_replace_all() – This replaces all occurrences of the pattern with the given replacement. The syntax is:

str_replace_all(string, pattern, replacement)

So we see that both occurrences of “p” are replaced by “b”.
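With the same assumed input, “apple”, the difference from str_replace() can be sketched as:

```r
library(stringr)

# Every "p" is replaced, not just the first one
all_replaced <- str_replace_all("apple", pattern = "p", replacement = "b")
print(all_replaced)
# [1] "abble"
```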

str_order()

This returns the order of the elements of the given character vector, either increasing or decreasing. The result consists of index numbers, arranged in the order in which the elements would appear after sorting. The syntax is:

str_order(x, decreasing = FALSE, na_last = TRUE, locale = "", ...)

Here decreasing = FALSE means that the order should not be decreasing, i.e., it should be increasing.

Because the inputs are strings, they are arranged alphabetically. With decreasing set to TRUE, “pears” comes first (because “p” is last in the alphabet among the initial letters), then “orange”, and so on. Accordingly, the 2nd element comes first, followed by the 3rd element, then the 4th element, and lastly the 1st element. The output shows these index positions.
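The ordering described above can be sketched as follows. The fourth fruit, “banana”, is an assumption; any name that sorts between “apple” and “orange” would give the same indices.

```r
library(stringr)

# "banana" is an assumed fourth fruit; the other three appear in the text
fruits <- c("apple", "pears", "orange", "banana")

# Indices of the elements in decreasing alphabetical order
idx <- str_order(fruits, decreasing = TRUE)
print(idx)
# [1] 2 3 4 1

# Indexing with the result gives the sorted strings themselves
print(fruits[idx])
# [1] "pears"  "orange" "banana" "apple"
```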

str_sort()

It does the same work as str_order(), but the difference is that the strings themselves are returned instead of their indices. This is more convenient, as it would be difficult to keep track of all the indices when the input set is large. Here the names themselves are returned in sorted form.
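A sketch with the same assumed fruit vector (“banana” is not from the original), using the default increasing order:

```r
library(stringr)

# "banana" is an assumed fourth fruit; the other three appear in the text
fruits <- c("apple", "pears", "orange", "banana")

sorted <- str_sort(fruits)
print(sorted)
# [1] "apple"  "banana" "orange" "pears"
```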

str_pad()

It pads the string with the given character, up to the given width, on the given side of the string. The syntax is:

str_pad(string, width, side = c("left", "right", "both"), pad = " ")

Here “width” is the minimum total width of the padded string (not the number of pad characters added), “side” is the side on which the padding is added, and “pad” is the single character used for padding.

So here we see that 5 spaces are added to the left of abc, which was desired.
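A sketch of left padding that adds 5 spaces to “abc”. The width of 8 is an assumption: since width is the total width of the result, 3 characters plus 5 pad characters require width = 8.

```r
library(stringr)

# width is the total width of the result: 3 letters + 5 spaces = 8
padded <- str_pad("abc", width = 8, side = "left", pad = " ")
print(padded)
# [1] "     abc"
print(str_length(padded))
# [1] 8
```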

str_trim()

It does the opposite of str_pad(): this function removes the leading and trailing whitespace around the string. The syntax is:

str_trim(string, side = c("both", "left", "right"))

So here we remove all the extra spaces that were present on both the sides of abc, which was desired.
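For illustration, here is a sketch with a made-up input that has three spaces on each side (the exact input of the original example is not shown):

```r
library(stringr)

# Hypothetical input with extra spaces on both sides
messy <- "   abc   "

trimmed <- str_trim(messy, side = "both")
print(trimmed)
# [1] "abc"
```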

Case Study

Now let us import a dataset and apply these string operations to its fields.

Download: Dataset

This dataset contains personal information on 50 customers across 5 fields. One by one, I apply the functions described above to this dataset.

1. Suppose we want to keep the State names in all capital letters.

2. Suppose we want to concatenate the Education and Employment fields into a single field, separated by “-”.

3. Suppose we want to count the length of the individual state names.

4. Suppose we want to match the pattern indicating that the corresponding customers are male.

All the “1”s indicate that the customer is male; “0” indicates not male.

5. Suppose we want to replace all occurrences of “W” in the customer IDs with “Y”.

6. Suppose we want to sort the customers in ascending order.

7. We can see that there are many unwanted extra spaces in the EmploymentStatus column. We remove them with str_trim().
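The seven steps above can be sketched on a small stand-in data frame. Everything about this data frame is an assumption made so the example runs on its own: the column names (Customer, State, Education, EmploymentStatus, Gender), the four rows, and the “M”/“F” coding of Gender are hypothetical and may differ from the actual dataset.

```r
library(stringr)

# Hypothetical stand-in for the downloaded dataset; names and values are assumed
customers <- data.frame(
  Customer         = c("BW79786", "QZ44356", "AW49188", "WW63253"),
  State            = c("Washington", "Arizona", "Nevada", "California"),
  Education        = c("Bachelor", "College", "Master", "Bachelor"),
  EmploymentStatus = c("  Employed ", "Unemployed  ", " Employed", "Retired"),
  Gender           = c("F", "M", "F", "M"),
  stringsAsFactors = FALSE
)

# 1. State names in capital letters
state_upper <- str_to_upper(customers$State)

# 2. Education and employment status joined with "-"
edu_emp <- str_c(customers$Education, customers$EmploymentStatus, sep = "-")

# 3. Number of characters in each state name
state_len <- str_length(customers$State)

# 4. 1 where Gender contains "M", 0 otherwise
is_male <- str_count(customers$Gender, "M")

# 5. Replace every "W" in the customer IDs with "Y"
ids_y <- str_replace_all(customers$Customer, "W", "Y")

# 6. Customer IDs sorted in ascending order
ids_sorted <- str_sort(customers$Customer)

# 7. Remove leading and trailing spaces from the employment status
emp_clean <- str_trim(customers$EmploymentStatus)

print(state_upper)
print(is_male)
print(emp_clean)
```

Each result is an ordinary vector, so it can be assigned back to a column (for example, customers$State <- state_upper) if the change should be kept in the data frame.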

Hence we can see how easy it is to manipulate strings in R using the stringr package.
