# Data Visualization in Statistics and Data Science

In this article on data visualization, we cover the basics and the main ways of presenting data. Data collected in raw form is not easy to interpret, so it first needs to be condensed, organized, and analyzed. Even when the resulting insights are sound, they still need to be presented in a format that readers can readily interpret.

Decision makers are accustomed to seeing analytics in visual form because it helps them grasp the underlying principles and concepts, so an appropriate presentation of data serves this purpose well. The basic motive behind presenting data in different formats is to use visual arrangement to communicate the obtained information effectively while avoiding inaccuracy in the findings.

The method of presentation should be chosen in line with the format of the data, the analysis technique to be used, and the information that needs to be conveyed. Poorly presented data fails to communicate information to its readers, so the presentation strategy should depend on which part of the data needs more emphasis than the rest.

Even though this is a common topic for anyone who works with data, people still tend to choose the wrong way of presenting their findings. This article therefore gives a concise description of the commonly used methods and their respective applications.

**Textual Representation:** This method is employed by most official agencies. It appeals particularly to readers with a literary bent of mind, who prefer points of special importance spelled out rather than left to monotonous tables or to the trends a graph might reveal. Emphasizing the critical points through text is like adding colors to a black-and-white sketch.

For example: “*In 2018, a total of 130 students took admission to the M.Sc. Operational Research course offered by the University of Delhi, comprising 90 students from North Campus and 40 students from South Campus. Of the 90 students in North Campus, 20 are enrolled in Hindu College, 25 in Kirori Mal College, 5 in St. Stephen’s College, 20 in Hansraj College, 10 in Ramjas College, and the remaining 10 in Indraprastha College for Women.*”

**Tabular Representation:** Representing data in tabular form is usually the first step towards analyzing or interpreting it, and it is the most common technique of data presentation. In data science work, tables are typically used for analysis or comparison. They are relatively easy to construct and fairly readable, bringing the essential features of the dataset into clearer perspective for the person working on it.

Source: Fictitious data, for illustration purposes only
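The North Campus figures from the textual example above can be arranged as a table. The following is a minimal sketch in R; the `enrollment` data frame and its `percent` column are illustrative names chosen here, not a reproduction of the original table.

```r
# North Campus enrollment figures, taken from the textual example above.
# The data frame and column names are illustrative assumptions.
enrollment <- data.frame(
  college = c("Hindu College", "Kirori Mal College", "St. Stephen's College",
              "Hansraj College", "Ramjas College",
              "Indraprastha College for Women"),
  students = c(20, 25, 5, 20, 10, 10)
)

# Each college's share of the North Campus total, as a percentage
enrollment$percent <- round(100 * enrollment$students / sum(enrollment$students), 1)

print(enrollment)
```

Laid out this way, the same numbers can be compared row by row, which is harder to do when they are embedded in a sentence.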

**Bar Graph:** It is used to compare and indicate values in a discrete dataset. The bars may be drawn horizontally or vertically, depending on the complexity of the categories. Bar graphs summarize categorical data using rectangles of equal width separated by gaps, with the length of each bar representing the value for its category.

Comparing the lengths of the bars makes it easy to identify the categories with the smallest and largest values. Bar graphs are among the most commonly used methods of presenting data in statistics because they are so straightforward to apply.

**Code to add Bar graph in R (using the mtcars dataset)**

```r
# Grouped Bar Plot
counts <- table(mtcars$vs, mtcars$gear)
barplot(counts, main="Car Distribution by Gears and VS",
  xlab="Number of Gears", col=c("darkblue","red"),
  legend = rownames(counts), beside=TRUE)
```

**Histogram:** It presents and summarizes statistical information measured on an interval scale. It consists of a series of adjacent blocks, with the class intervals plotted on the horizontal axis and the respective frequencies plotted on the vertical axis. It indicates the number of values lying within each range.

Histograms are useful in exploratory data analysis. While a bar graph compares discrete variables, a histogram represents the frequency distribution of a continuous variable.

**Code to add Histogram in R**

```r
# Simple Histogram
hist(mtcars$mpg)
```
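To see the class intervals and frequencies that a histogram is built from, the same `hist()` call can be made without drawing anything. This is a small sketch using base R's `plot = FALSE` argument; the variable name `h` is just a placeholder.

```r
# Compute the histogram of mpg without drawing it
h <- hist(mtcars$mpg, plot = FALSE)

# Class-interval boundaries (plotted on the horizontal axis)
print(h$breaks)

# Number of cars falling in each interval (plotted on the vertical axis)
print(h$counts)
```

There is always one more boundary than there are intervals, and the counts add up to the number of observations.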

**Pie Chart:** It is used to visualize how data is distributed across different categories. It is most extensively applied in office statistics, where percentages and proportions form an integral part of most presentations. A pie chart is a circle divided into segments, where each segment depicts a particular category and the area of a segment is proportional to the number of observations in that category.

Pie charts make the proportions in the data quick to read, but they are effective only when the number of categories is not too large.

**Code to add a simple Pie Chart in R**

```r
# Pie Chart with Percentages
slices <- c(10, 12, 4, 16, 8)
lbls <- c("US", "UK", "Australia", "Germany", "France")
pct <- round(slices/sum(slices)*100)
lbls <- paste(lbls, pct) # add percents to labels
lbls <- paste(lbls,"%",sep="") # add % to labels
pie(slices,labels = lbls, col=rainbow(length(lbls)),
   main="Pie Chart of Countries")
```

**Line Graph:** Line graphs are used to depict the trend of a variable over a time period or a continuous interval. They are particularly useful for observing patterns in data such as climatic changes, rainfall statistics, and unemployment rates.

The horizontal axis represents the continuous variable, and the vertical axis gives the measurement scale or quantitative value of the data. The graph usually consists of a series of upward and downward slopes, which indicate increasing and decreasing trends respectively.

**Code to add Line Graphs in R**

```r
x <- c(1:5); y <- x # create some data
par(pch=22, col="red") # plotting symbol and color
par(mfrow=c(2,4)) # all plots on one page
opts = c("p","l","o","b","c","s","S","h")
for(i in 1:length(opts)){
  heading = paste("type=",opts[i])
  plot(x, y, type="n", main=heading)
  lines(x, y, type=opts[i])
}
```
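The snippet above demonstrates the available line types on toy data. To show a trend over time, as described earlier, here is a minimal sketch using `Nile`, a time series that ships with base R (annual flow of the river Nile, 1871 to 1970). The choice of dataset, title, and axis labels is an assumption made for illustration.

```r
# Nile is a built-in time series: one flow measurement per year
par(mfrow = c(1, 1))  # reset to a single plot per page

# Time goes on the horizontal axis, the measured value on the vertical axis
plot(Nile, type = "l", col = "blue",
     main = "Annual Flow of the Nile",
     xlab = "Year", ylab = "Annual flow")
```

The upward and downward slopes of the line make year-to-year increases and decreases easy to spot.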

**Scatter Plots:** Scatter plots use coordinates to plot the values of two variables on the respective axes. With their help, we can infer whether the two variables are correlated. If the value of one variable increases as the other increases, a positive correlation is said to exist between the two. Similarly, if an increase in the value of one variable goes with a decrease in the value of the other, the two are said to be negatively correlated.

There can also be cases of no correlation, or of a non-linear (for example, exponential) relationship. How closely the points cluster around a trend line indicates the strength of the correlation between the variables. Scatter plots are useful when we have a paired data set and need to see how changes in one variable relate to the values of the other.

**Code to add Scatter Plots in R**

```r
# Simple Scatterplot
attach(mtcars)
plot(wt, mpg, main="Scatterplot Example",
   xlab="Car Weight ", ylab="Miles Per Gallon ", pch=19)

# Add fit lines
abline(lm(mpg~wt), col="red") # regression line (y~x)
lines(lowess(wt,mpg), col="blue") # lowess line (x,y)
```
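The direction and strength of the relationship seen in a scatter plot can also be summarized numerically. As a small sketch, base R's `cor()` computes the Pearson correlation coefficient for the same two variables; the variable name `r` is just a placeholder.

```r
# Pearson correlation between car weight and fuel efficiency.
# Values near -1 or +1 indicate a strong linear relationship; near 0, a weak one.
r <- cor(mtcars$wt, mtcars$mpg)
print(r)
```

For these two variables the coefficient is negative, which matches the downward slope of the fitted line: heavier cars tend to travel fewer miles per gallon.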

**Box and Whisker Chart:** This chart conveniently displays the key measures of a data set by making use of quartiles. The lines extending from the box, termed ‘whiskers’, indicate the spread of the data beyond the upper and lower quartiles, and each whisker ends in a short dash marking the extreme value at that end. Outliers are shown as individual points.

The line inside the box marks the median. A box plot gives an idea of whether the data is symmetrical or skewed and, if skewed, in which direction. Box and whisker charts therefore have the advantage of conveying a lot about a data set without cluttering up space the way other displays can.

**Code to add Box and Whiskers in R**

```r
# Boxplot of MPG by Car Cylinders
boxplot(mpg~cyl,data=mtcars, main="Car Mileage Data",
   xlab="Number of Cylinders", ylab="Miles Per Gallon")
```
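The quantities a box plot draws can also be inspected directly. The following is a sketch using base R's `fivenum()` and `boxplot.stats()` on the same grouping; the names `summaries` and `outliers` are illustrative.

```r
# Five-number summary (minimum, lower hinge, median, upper hinge, maximum)
# of mpg for each cylinder group
summaries <- tapply(mtcars$mpg, mtcars$cyl, fivenum)
print(summaries)

# Values that a box plot would draw as individual outlier points, per group
outliers <- tapply(mtcars$mpg, mtcars$cyl, function(v) boxplot.stats(v)$out)
print(outliers)
```

Comparing the distance from the median to each hinge gives a quick numerical check on the skewness that the box plot shows visually.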