“Without data you are just another person with an opinion.”
Be it a presentation at work, a report submitted to your boss, or something as simple as asking your parents for permission to go on that trip you’ve always wanted to take, it all boils down to communication. Humans are built to communicate; we thrive on it and use it as a handy tool as we go about our daily lives.
The way you communicate an idea and put it across the table largely determines the outcome of the situation.
Similarly, communication is an integral part of Data Science. Conveying what the data actually says, and what it means, in a way that even a layperson can understand is one of the things Data Scientists aim for. Graphics provide exactly that: an excellent approach to data exploration and data presentation.
Even though data visualization and graphics have been used in statistics and for data representation for ages, a conceptual framework and a body of formal theory still seem to be missing.
When we talk about depicting complex ideas, unexpected fluctuations, upward and downward trends, undulating structures and so on with clarity, precision, and orderly coherence, we are indirectly talking about the visual processing center: the part of the cerebral cortex that processes visual information and helps us interpret it.
Clarity, precision, and coherence are exactly the qualities that graphical structures, or statistical graphs as we call them, should have.
A graphical display of data gives us a clear picture, presents many ideas and numbers in a small space, helps us avoid distorting what the data has to say, and ultimately serves a reasonably clear purpose: description, exploration, tabulation, or decoration.
Statistical graphics, just like the calculations, are only as good as what goes into them. A badly defined graph, an ill-fitting model, or even an undernourished data set cannot be rescued by presentation, no matter how fancy or attractive it may be.
So, let’s have a look at the table of contents for this article:
- What visualization is and what role it plays in Data Science
- An introduction to visualization in R and Tableau, and how the two can work together
- Things that are difficult or impossible in R but easy in Tableau
- A first look at Tableau, leading into the next article
We need to understand, learn and adopt the practice of graphical excellence. In other words, the motivation should be to use the right graph in the right place at the right time, so that quantitative ideas are communicated as clearly as possible.
Understanding Visualization with R and Tableau
Two of the most popular tools for data analytics and visualization these days are Tableau and R. In this section we will see for ourselves how both of them fare as far as visualization is concerned.
This is not a comparison between R and Tableau in any way, just a representation of how they deal with very basic visualizations and the kind of outputs they produce.
For a graphic analysis with both of these tools, we will use one standard dataset: the Cable TV subscription dataset. This dataset comprises 300 rows and 7 variables.
The variables are age, gender, income, kids, ownHome, subscribe and Segment.
The variable description is as follows:
- age – the age of the TV subscriber
- gender – the gender of the TV subscriber
- income – the income of the TV subscriber
- kids – the number of kids the TV subscriber has
- ownHome – whether the TV subscriber owns their home or not
- subscribe – whether they have subscribed to the TV services or not
- Segment – the segment the TV subscriber belongs to (Travelers, Urban Hip, Moving Up or Suburb Mix)
You can access the dataset and read more about it on the link below.
Now let us start our visualization with R. We will use the lattice package to build the plots, and we will also carry out some pre-visualization analysis to understand the data a little better. Tableau, on the other hand, does not provide a direct way to run this kind of basic statistical analysis (although we can still get there by making graphs and interpreting them).
Reading the Data into RStudio and Viewing It
tvdata <- read.csv(file.choose()) # importing data; file.choose() works on every OS, choose.files() only on Windows
head(tvdata) # viewing the first 6 rows (the default)
str(tvdata) # displaying the internal structure of the object
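If you do not have the CSV at hand, the same inspection steps can still be tried on a stand-in. The sketch below is only an illustration under stated assumptions: tvdata_demo is a hypothetical data frame whose values are simulated, and only the seven column names and the 300-row shape are taken from the description above.

```r
# Hypothetical stand-in for the Cable TV data: every value is simulated.
# Only the column names and the 300 x 7 shape follow the article.
set.seed(42)
n <- 300
seg_levels <- c("Moving Up", "Suburb Mix", "Travelers", "Urban Hip")

tvdata_demo <- data.frame(
  age       = round(rnorm(n, mean = 40, sd = 12)),
  gender    = factor(sample(c("Female", "Male"), n, replace = TRUE)),
  income    = round(rnorm(n, mean = 50000, sd = 15000)),
  kids      = rpois(n, lambda = 1.2),
  ownHome   = factor(sample(c("ownNo", "ownYes"), n, replace = TRUE),
                     levels = c("ownNo", "ownYes")),
  subscribe = factor(sample(c("subNo", "subYes"), n, replace = TRUE,
                            prob = c(0.85, 0.15)),
                     levels = c("subNo", "subYes")),
  Segment   = factor(sample(seg_levels, n, replace = TRUE),
                     levels = seg_levels)
)

head(tvdata_demo, n = 5)  # first 5 rows (head() alone would show 6)
str(tvdata_demo)          # column types plus a preview of the values
dim(tvdata_demo)          # number of rows and columns
```

Swapping tvdata_demo for the real tvdata gives the same three views of the actual data.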
Finding overall descriptives of the dataset
The summary() function in base R and the describe() function in the psych package report statistics such as the mean, median, minimum and maximum (summary() also gives the quartiles), which gives us a general idea about the dataset we are going to work with.
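As a rough sketch of what these descriptives look like in practice, the block below runs them on a small simulated data frame. The data frame demo and its values are hypothetical. The psych call is guarded because that package may not be installed, and the last step shows a base R approximation of the same idea.

```r
# Simulated numeric columns; names follow the article, values are made up.
set.seed(1)
demo <- data.frame(
  age    = round(rnorm(300, mean = 40, sd = 12)),
  income = round(rnorm(300, mean = 50000, sd = 15000)),
  kids   = rpois(300, lambda = 1.2)
)

# Base R: min, quartiles, median, mean and max for each column
summary(demo)

# psych::describe() adds sd, skew, kurtosis and more, if psych is installed
if (requireNamespace("psych", quietly = TRUE)) {
  print(psych::describe(demo))
}

# A base R approximation: one column of statistics per variable
desc <- sapply(demo, function(x) {
  c(mean = mean(x), sd = sd(x), median = median(x), min = min(x), max = max(x))
})
round(desc, 1)
```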
Now that we have carried out the basic descriptive analysis and we know certain things about the data, let us go forward and see what kind of visualizations we can build.
Visualizations using R
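library(lattice) # provides histogram() and barchart()
histogram(~subscribe | Segment, data=tvdata, type="count") # type="count" is assumed from the description below (counts on the Y-axis)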
This is a histogram with the subscribe variable on the X-axis and the count (number of people) on the Y-axis. subscribe takes two values, subYes and subNo. The plot is broken down by the four segments, one per quadrant.
Let us look at the segment-wise analysis:
- Travelers segment – Around 70 people have not taken the subscription and around 10 have.
- Urban Hip segment – 40 people do not have the subscription, whereas 10 people do.
- Moving Up segment – Around 58 people do not have the subscription and 15 do.
- Suburb Mix segment – Here we see a stark difference between the count of people who have the subscription, i.e. 90, and the count of those who don’t, i.e. 10.
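Counts read off a plot are only approximate, so it can help to confirm them with a cross-tabulation. The following is a minimal sketch on simulated data: demo is a hypothetical stand-in, so the numbers it prints will not match the figures quoted above. With the real data, replacing demo with tvdata would give the exact counts per segment.

```r
# Simulated stand-in: Segment and subscribe values are made up.
set.seed(7)
seg_levels <- c("Moving Up", "Suburb Mix", "Travelers", "Urban Hip")
demo <- data.frame(
  Segment   = factor(sample(seg_levels, 300, replace = TRUE),
                     levels = seg_levels),
  subscribe = factor(sample(c("subNo", "subYes"), 300, replace = TRUE,
                            prob = c(0.85, 0.15)),
                     levels = c("subNo", "subYes"))
)

# Rows are segments, columns are subscription status
counts <- xtabs(~ Segment + subscribe, data = demo)
counts

addmargins(counts)                        # same table with row and column totals
round(prop.table(counts, margin = 1), 2)  # share of subYes/subNo within each segment
```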
histogram(~subscribe | Segment + ownHome, data=tvdata, type="count") # type="count" is assumed from the description below (counts of people)
These histograms represent the same idea as the graph above, subscription status versus count of people, now broken down by both segment and home ownership: ownYes for people who own a home and ownNo for those who don’t.
For example, the Moving Up segment is now split into two panels, ownYes and ownNo. Within each panel we can see the count of people who have taken the TV subscription and of those who have not.
In the Moving Up segment, among the people who own a home, around 20 have not taken the subscription while 5 have.
In the same segment, among the people who do not own a home, 38 have not taken the subscription and 10 have.
We can carry out a similar analysis for the other segments.
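The same panel-by-panel reading can be done numerically with a three-way table. This is a sketch on simulated data (demo is hypothetical and its counts are not those of the real dataset); it shows one way to pull out a single panel, such as Moving Up home owners.

```r
# Simulated stand-in with the three variables used in the panels.
set.seed(21)
seg_levels <- c("Moving Up", "Suburb Mix", "Travelers", "Urban Hip")
demo <- data.frame(
  Segment   = factor(sample(seg_levels, 300, replace = TRUE),
                     levels = seg_levels),
  ownHome   = factor(sample(c("ownNo", "ownYes"), 300, replace = TRUE),
                     levels = c("ownNo", "ownYes")),
  subscribe = factor(sample(c("subNo", "subYes"), 300, replace = TRUE,
                            prob = c(0.85, 0.15)),
                     levels = c("subNo", "subYes"))
)

# Three-way contingency table: Segment x ownHome x subscribe
three_way <- xtabs(~ Segment + ownHome + subscribe, data = demo)

# Flattened view: one row per Segment/ownHome panel
ftable(three_way, row.vars = c("Segment", "ownHome"))

# Counts for a single panel: Moving Up, home owners
moving_up_owners <- three_way["Moving Up", "ownYes", ]
moving_up_owners
```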
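seg.agg <- aggregate(income ~ Segment + ownHome, data=tvdata, mean) # mean income per Segment/ownHome pair
barchart(income ~ Segment, data=seg.agg,
groups=ownHome, auto.key=TRUE, # assumed: needed for the ownYes/ownNo split described below
par.settings = simpleTheme(col=c("gray95", "gray50")))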
This bar chart shows the mean income on the Y-axis against the segment on the X-axis, with each segment split by the ownHome variable, i.e. whether people own a home or not. I calculated the mean income for each segment separately for the ownYes and ownNo values of ownHome.
The table below shows the mean income for each segment under the ownYes and ownNo situations.
Coming back to the bar chart, we can easily see that the average income in the Moving Up segment is around 50,000 for people who own a home (ownYes, shown in dark grey) and around 54,000 for those who do not (ownNo, shown in light grey).
Similarly, we can carry out the analysis for the other segments.
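To make the aggregation step concrete, here is a self-contained sketch of a grouped bar chart of mean income. It runs on simulated data, so demo and every number it produces are hypothetical. The groups and auto.key arguments are one way to obtain the ownYes/ownNo split in lattice, not necessarily the exact call behind the figure.

```r
library(lattice)

# Simulated stand-in: incomes are random, not the real dataset.
set.seed(3)
seg_levels <- c("Moving Up", "Suburb Mix", "Travelers", "Urban Hip")
demo <- data.frame(
  Segment = factor(sample(seg_levels, 300, replace = TRUE),
                   levels = seg_levels),
  ownHome = factor(sample(c("ownNo", "ownYes"), 300, replace = TRUE),
                   levels = c("ownNo", "ownYes")),
  income  = round(rnorm(300, mean = 50000, sd = 15000))
)

# Long format: one row per Segment/ownHome combination
seg_agg <- aggregate(income ~ Segment + ownHome, data = demo, FUN = mean)
seg_agg

# Wide format: segments in rows, ownNo/ownYes in columns
mean_table <- with(demo, tapply(income, list(Segment, ownHome), mean))
round(mean_table)

# Grouped bars: ownNo in light grey, ownYes in dark grey
p <- barchart(income ~ Segment, data = seg_agg,
              groups = ownHome, auto.key = TRUE,
              ylab = "Mean income",
              par.settings = simpleTheme(col = c("gray95", "gray50")))
print(p)
```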
Now that we have gone through some descriptive and visual analysis using R, we will move to Tableau and build plots that tell the same story as the plots above. This will give a good idea of how Tableau handles the same situations. We will also focus on Tableau’s ability to make more interesting and intuitive plots with very little effort.
Note: Tableau visualizations are dynamic and reveal options and information when you hover over the plot and its data points, so they are best viewed through the links given alongside the visuals. The links will take you to my Tableau Public profile, where these visuals are saved.
Visualizations using Tableau
This is a packed bubble chart, which displays data as a cluster of circles. Dimensions define the individual bubbles, and measures define the size and color of the individual circles.
Here the size of each bubble represents a person’s income; orange bubbles represent people who own a home and blue bubbles represent people who do not.
The general hypothesis was that people with lower incomes would not own a home, but the chart does not support it: there are big blue circles as well as small orange ones. This suggests that income on its own is at best weakly related to whether a person owns a home, although a chart like this cannot establish that by itself.
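A claim about correlation is easier to trust when it is checked numerically rather than judged from bubble sizes. The sketch below shows one way to do that in R, assuming the columns income and ownHome. The data frame demo is simulated with income independent of ownHome, so its output says nothing about the real dataset; it only illustrates the checks.

```r
# Simulated stand-in: income is generated independently of ownHome.
set.seed(11)
demo <- data.frame(
  income  = round(rnorm(300, mean = 50000, sd = 15000)),
  ownHome = factor(sample(c("ownNo", "ownYes"), 300, replace = TRUE),
                   levels = c("ownNo", "ownYes"))
)

# Mean income for owners and non-owners
aggregate(income ~ ownHome, data = demo, FUN = mean)

# Point-biserial correlation: Pearson correlation with a 0/1 indicator
owns <- as.numeric(demo$ownHome == "ownYes")
r_pb <- cor(demo$income, owns)
r_pb

# Welch two-sample t-test for a difference in mean income
tt <- t.test(income ~ ownHome, data = demo)
tt$p.value
```

A correlation near zero together with a large p-value would be consistent with the reading of the bubble chart, while a clearly non-zero correlation would contradict it.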
This collection of packed bubbles, broken down into the four segments, again shows people who own a home as orange bubbles and those who don’t as blue bubbles, with bubble size representing income.
This plot is significant as it gives us an idea about the distribution of the incomes and the house owners among the various segments.
We can observe that most Travelers own a home, more so than people in the other segments, and that Travelers also have the highest incomes. The Urban Hip segment, by contrast, mostly does not own a home and also has the lowest incomes.
This plot shows individual bars for each segment category on the X-axis, with the height of each bar representing income on the Y-axis.
The segments are also color-coded: blue for Moving Up, orange for Suburb Mix, red for Travelers and turquoise for Urban Hip.
The income for each segment is broken down by gender, subscribe and ownHome. The rows are split by gender. The columns are split first by subscribe (subNo and subYes, i.e. whether people have subscribed or not), and each of those is then subdivided by ownHome (ownNo and ownYes, i.e. whether people own a home or not).
Consider the top-left cell, (1,1), treating the graph as a 2-by-4 matrix. Cell (1,1) represents the income of the females in all four segments who have not subscribed to cable TV and who do not own a home.
As another example, take the bottom-right cell, (2,4). It represents the income of the males in all four segments who have subscribed to cable TV and who own a home.
R vs Tableau – A General Overview
Strictly speaking, R and Tableau cannot be compared like for like. R is mainly used for exploratory analysis, whereas Tableau functions as a visualization tool whose forte is attractive dashboards.
R and Tableau in many situations work in tandem; Tableau makes it faster and easier to identify patterns and then the practical models can be built using R.
With a variety of libraries in R, we can make almost any kind of chart, but the features Tableau provides for disseminating visual information are outstanding. Moreover, creating a dissemination flow in R is time-consuming.
Most importantly, R is a language whereas Tableau is a software tool. That essentially means that R requires writing scripts, whereas Tableau is basically menu-driven. But the fact that R requires scripts makes it much more customizable and more flexible in terms of the type of visualizations you can generate.
As far as the outcome-to-effort ratio is concerned, Tableau comes out clearly ahead of R: it takes minimal effort to create phenomenal visualizations.
Considering out-of-the-box connection capabilities, Tableau seems to have the upper hand: it can immediately consume multiple file types, can connect to multiple databases and has pre-built connections to various services.
Tableau’s geospatial data handling is hard to beat. Longitudes and latitudes are classified automatically, and a simple drag and drop creates a map. Various R packages provide the capability to perform cartographic or geospatial analysis and make such visualizations, but Tableau easily beats R in overall experience and ease of use.
R is open source, and new additions and libraries can be found for almost every utilitarian function, whereas the advanced versions of Tableau are paid.
As a bonus for Tableau users, Tableau Desktop can now connect to R through calculated fields and take advantage of R functions, libraries, packages and even saved models.
Understanding Tableau Better
Now that we have seen what Tableau is about, how it compares to R, and the kind of basic visualizations we can make with it, let’s dive in completely.
Now the question arises: why choose Tableau over all the other data visualization tools available? What does it have that others don’t?
Here’s the answer – https://www.youtube.com/watch?v=37Mx3uZRwBE