The post Web Scraping Tutorial Using Python – Part 1 appeared first on StepUp Analytics.
The above analogy applies to the ubiquitous data too. Most of the time we can get data from sources such as Kaggle, but there are scenarios where we need customized data. For this, we turn to web scraping, i.e. getting the data from websites either through the APIs they provide or through Python and its libraries.
Once we have the data, we need to clean and handle it, which makes it suitable for extracting information. At last, we infuse flavors into it, i.e. we derive the features of the data for information extraction.
This article, Web Scraping using Python, opens a series of articles covering all the data preparation steps, which are as follows:
In this first article, we will be learning about Web Scraping. The points covered in this article are given below:
What is Web Scraping?
Web scraping is a way to extract information that is present in web pages in HTML format. This data is unstructured, so web scraping helps us both to collect it and to convert it into a structured format.
Different ways of Web Scraping
There are numerous ways through which we can scrape the web. Some of them are as follows:
It’s time to have a look at the libraries which are used for web scraping.
Web scraping through Beautiful Soup
Here we will be scraping the web through the Beautiful Soup library. For scraping purposes, we are using a weather forecast website. We will be scraping the weather forecast data of San Francisco. So let’s start this journey!
First, we import BeautifulSoup from the bs4 library, along with the requests library, which is used for downloading the page behind a URL.
Using the requests library, we download the desired web page. The request URL consists of the address of the web page plus the latitude and longitude of the city, i.e. San Francisco.
After downloading the page, we will be parsing the HTML content using BeautifulSoup.
To get the forecast from the web page, we have to inspect the page, find the ‘id’ attribute of the tag that holds the forecast, look that tag up and assign it to a variable.
Our next task is to collect the elements inside that tag which carry the forecast class attribute, and we assign them to the forecast_items variable. Here the find_all() function returns every element that matches the given class, not just the first one.
We print the HTML code using the prettify() function so that it appears in a structured, indented form.
The next task is to identify the class attributes needed for extracting more information. The three variables used are period, short description and temperature.
To obtain the title of a forecast, we use the ‘title’ attribute of the ‘img’ tag. After obtaining the data, we use prettify() to get it in a structured form and then print the title.
To get the period names of the following days, we iterate over the period tags. Here we have used a list comprehension for this.
Furthermore, using the ‘tombstone-container’ class we extract the short description, temperature, and description for the different days of the forecast.
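The steps above can be sketched as follows. This is only a minimal illustration, not the original listing: apart from ‘tombstone-container’, the id and class names (`seven-day-forecast`, `period-name`, `short-desc`, `temp`) and the inline sample markup are assumptions made up for the example, and the page is parsed from a string instead of being downloaded with requests.

```python
from bs4 import BeautifulSoup

# Hypothetical, simplified markup that mirrors the structure described above.
# On the real site the HTML would come from requests.get(url).content instead.
SAMPLE_HTML = """
<div id="seven-day-forecast">
  <div class="tombstone-container">
    <p class="period-name">Tonight</p>
    <p><img src="night.png" title="Tonight: Mostly clear, with a low around 49."></p>
    <p class="short-desc">Mostly Clear</p>
    <p class="temp">Low: 49 F</p>
  </div>
  <div class="tombstone-container">
    <p class="period-name">Thursday</p>
    <p><img src="day.png" title="Thursday: Sunny, with a high near 63."></p>
    <p class="short-desc">Sunny</p>
    <p class="temp">High: 63 F</p>
  </div>
</div>
"""


def parse_forecast(html):
    """Return one dict per forecast item found in the given HTML."""
    soup = BeautifulSoup(html, "html.parser")
    # Locate the forecast container by its id (assumed name).
    seven_day = soup.find(id="seven-day-forecast")
    # Collect every forecast item inside the container.
    forecast_items = seven_day.find_all(class_="tombstone-container")
    # List comprehension: one row per forecast item.
    return [
        {
            "period": item.find(class_="period-name").get_text(),
            "short_desc": item.find(class_="short-desc").get_text(),
            "temp": item.find(class_="temp").get_text(),
            "desc": item.find("img")["title"],
        }
        for item in forecast_items
    ]


rows = parse_forecast(SAMPLE_HTML)
for row in rows:
    print(row)
```

A list of dicts such as `rows` can be handed directly to `pandas.DataFrame(rows)` if a tabular view is wanted.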
For a better and clearer representation of the data, we map the values to a dataframe.
This dataframe is the final result obtained through this web scraping.
Coming to another way of scraping, we will be using the Scrapy framework. In Scrapy we write classes called spiders, which define how a website will be scraped by providing the starting URLs and what to do on each crawled page.
Using Scrapy, we will scrape the URLs of the images of headphones from amazon.com and save them in a text file.
To install Scrapy:
pip install scrapy
Let’s start scraping with scrapy.
First, we have to open a command prompt and then create a new project by using: scrapy startproject headphones
Here the ‘startproject’ command creates a new folder named ‘headphones’. This folder will contain 4 already created files, which are items.py, settings.py, middlewares.py and pipelines.py, and these are required for creating a spider. These files can be customized if required.
Now we will create a spider with the ‘genspider’ command, where we specify the name of the spider and the domain to be crawled.
NOTE: Keep in mind that the project name and spider name should be different.
Output:
We start with the most basic scraper class in Python, which subclasses scrapy.Spider, the spider class provided by Scrapy. Here we set the name of the spider and then define the __init__() function.
In the start_requests() function we specify the URLs which are to be crawled. Then we iterate over the URLs and yield a request for each of them using the Request() function of Scrapy.
In this parse() function, we are extracting the URLs of the images through the ‘
Continuing this parse() function, we use a try and except block. The try part gets the next link, which is present in a ‘span’ tag, and yields the links to be followed.
Lastly, the except block handles the case where no more links are available. There we create a text file which contains the URLs of the images, converting the URLs into strings before writing them. The code for this article can be found here.
The post 5 Hints Each Blogger Must Take After When Composing Visitor Web journals appeared first on StepUp Analytics.
Guest blogging is a popular technique used by bloggers to increase the traffic to their site. There are various guest blogging sites where bloggers can submit their content to pitch their product or help in writing a thesis statement to the desired audience. However, like everything, guest blogging has its own set of rules and regulations. Following these rules will ensure that your guest blogging effort ends up successful.
In this article, you will read about the tips every blogger must follow when writing guest blogs.
Many novice bloggers mistakenly ignore the guidelines of the site they are writing for, thinking that their creativity can’t be bound by a particular set of rules. This leads to rejection of the article submitted by the blogger.
It is therefore important to read the guidelines given by the site and shape your content accordingly. Keep in mind the guidelines and the target audience of the site you are writing for. Doing otherwise will only ruin your relationship with the host blogger.
Good blogging is all about making an impact on the users. Whatever content you write in guest blogs reflects your brand directly. You need to make a careful choice of words, topic, and keywords to ensure it leaves a positive impact on the reader.
Try to be unique in what you cover in your article. A clear storyline that makes some meaningful points is more likely to get read and will help you build a loyal audience.
Guest blogging gives you a chance to get more traffic to your website, blog, social media profile or wherever you want, with the help of links. However, as mentioned earlier, always check the site’s guidelines on links; some sites (like Advanced Donut) won’t allow you to link to pages that include a data capture form. These links can be used to take the reader to a product page, Facebook ad, ebook, video and so on.
The aim is to take your audience to a specific place where you can achieve the desired outcome while also offering some value to your reader. Avoid inserting random links that leave the reader confused.
Most bloggers are eager to get a huge amount of traffic and subscribers by publishing random posts. However, if you want to build a consistent brand image, you have to be patient. It demands close monitoring and analysis of facts and figures to ensure the article returns long-term results.
Go for a guest post with a long-term strategy and not just 10 days of buzz. For instance, with solid keyword research you will be able to boost your reputation and credibility in that particular niche. This also helps the guest post rank well on Google.
We all know the importance of getting comments on guest posts. It plainly means people are reading the post and finding it valuable enough to leave a comment. It’s through comments that you can develop long-term relationships with your loyal readers.
Likewise, the comments section is where you can enhance your image as a brand or an expert. So make sure you respond to every comment or question in the comments section in a timely manner. You can also take up an idea from the comments for your next post and announce it right away.
Do You Have Any Guest Posting Strategies?
These were some of the guest posting strategies that have really worked for me and my colleagues. If you have another strategy or even a silly idea, feel free to share it in the comments below.
The post Game Theory For Competitive Programming appeared first on StepUp Analytics.
Very few Competitive Programmers are aware of Game Theory. The reason is the lack of good resources on the internet about it. But don’t worry, this blog will clear all your doubts related to Game Theory.
This is more of an intuitive topic, and I shall try my best to develop your intuition for it.
Combinatorial games are two-person games with perfect information and no chance moves (no randomization, such as a coin toss, is involved that can affect the game). These games have a win-or-lose or tie outcome and are determined by a set of positions, including an initial position, and the player whose turn it is to move.
Play moves from one position to another, with the players usually alternating moves, until a terminal position is reached. A terminal position is one from which no moves are possible. Then one of the players is declared the winner and the other the loser, or there is a tie (depending on the rules of the combinatorial game, the game could end in a tie).
The one requirement of a combinatorial game is that it should end at some point and not get stuck in a loop. Chess is an example of a game that could otherwise loop.
To prevent such a looping situation in chess (consider the case of both players just moving their queens to and fro from one place to the other), there is a “50-move rule”, according to which the game is considered drawn if the last 50 moves by each player have been completed without the movement of any pawn and without any capture. Source: Stackexchange.
The coding part of Combinatorial Game Theory (CGT) in particular is relatively small and easy. The key to Game Theory problems is the hidden observation, which can sometimes be very hard to find.
Some of the games that come under the category of Combinatorial Game Theory are:
I know that you are well aware of the first and second games, but you may be wondering about the third one.
What is this game?
How to play this game?
Don’t worry, I will clear all your doubts later in this blog. Let us leave that for now and move forward. We can divide combinatorial games into two categories as shown below:
Impartial Games:
In impartial Games, the possible moves from any position of the game are the same for the players.
Partisan Games:
In Partisan Games the possible moves from any position of the game are not the same for the players.
Let’s understand these Games (Impartial and Partisan) with an Example one by one.
1. Given a number of piles in which each pile contains some number of stones/coins. In each turn, the player chooses one pile and removes any number of stones (at least one) from that pile. The player who cannot move is considered to lose the game (i.e., the one who takes the last stone is the winner).
As can be clearly seen from the rules of the above game, the moves are the same for both players. There is no restriction on one player over the other. Such a game is considered to be an Impartial Game.
The above-mentioned game is famous by the name Game of Nim, which will be discussed in detail later in this blog.
2. Let us take the example of chess. In this game, one player can only move the black pieces and the other one can only move the white ones. Thus, there is a restriction on both players. Their sets of moves are different, and hence such a game is classified under the category of Partisan Games.
Partisan Games are much harder to analyze than Impartial Games, as in such games we can’t use the Sprague-Grundy Theorem (explained later in this blog).
We already know what the Game of Nim is (it was described in the previous section).
Here, I will explain how to solve Game of Nim problems in Competitive Programming.
As an example, consider two players, Alice and Bob, and three piles of coins that initially hold 3, 4 and 5 coins, as shown below. We assume that the first move is made by A. See the figure below for a clear understanding of the whole gameplay.
Both Alice and Bob are experts in this game; they will not make any mistakes during the game.
We will consider both scenarios: Alice makes the first move, or Bob makes the first move.
Alice makes the first move:
Here, Alice means A and Bob means B
Bob makes the first move:
Here, Alice means A and Bob means B
After seeing both figures, it must be clear that the game depends on one important factor – Who starts the game first?
Here, one question may come to your mind: will the player who starts first win every time?
Let us again play the game, starting with Alice, and this time with a different initial configuration of piles.
The piles have 1, 4, 5 coins initially.
Will Alice win again, as she has started first? Let us see.
Here, we can see in the figure that Alice has lost. But how? We know that this game depends heavily on which player starts first. Thus, there must be another factor which dominates the result of this simple-yet-interesting game. That factor is the initial configuration of the stones/piles. This time the initial configuration was different from the previous one.
So, we can conclude that this game depends on two factors:
1. The player who starts first.
2. The initial configuration of the piles.
But wait. How do we solve this problem and find the winner of this game when it comes up in Competitive Programming?
In fact, we can predict the winner of the game before even playing the game! This helps the Competitive Programmer to solve this problem.
To solve this problem, we need to calculate the Nim sum.
Nim sum: The cumulative XOR value of the number of coins/stones in each pile/heap at any point of the game is called the Nim-Sum at that point.
“If both Alice and Bob play optimally (i.e. they don’t make any mistakes), then the player starting first is guaranteed to win if the Nim-Sum at the beginning of the game is non-zero. Otherwise, if the Nim-Sum evaluates to zero, then the player starting first will definitely lose.”
For the proof of the above theorem, see: Wikipedia
Let us apply the above theorem to the games played above. In the first game, Alice started first and the Nim-Sum at the beginning of the game was 3 XOR 4 XOR 5 = 2, which is a non-zero value, and hence Alice won. In the second game, the initial configuration of the piles was 1, 4 and 5 and Alice again started first. Here the Nim-Sum is 1 XOR 4 XOR 5 = 0, so by the above theorem Alice loses the game.
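As a quick check of the arithmetic above, here is a minimal Python sketch of the theorem. The function names `nim_sum` and `first_player_wins` are chosen for this illustration only.

```python
from functools import reduce
from operator import xor


def nim_sum(piles):
    """Cumulative XOR of the pile sizes."""
    return reduce(xor, piles, 0)


def first_player_wins(piles):
    """Under optimal play, the first player wins exactly when the Nim-Sum is non-zero."""
    return nim_sum(piles) != 0


print(nim_sum([3, 4, 5]), first_player_wins([3, 4, 5]))  # 2 True
print(nim_sum([1, 4, 5]), first_player_wins([1, 4, 5]))  # 0 False
```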
C++ implementation of the above Theorem:
But Competitive Programming is not a sport for kids. In good programming contests, you will not find Game Theory problems as simple as the one above. To solve good problems, I will cover some important topics in Game Theory below.
Grundy Number is a number that defines a state of a game. We can define any impartial game (example: nim game) in terms of Grundy Number.
Grundy Numbers or Nimbers determine how any Impartial Game (not only the Game of Nim) can be solved once we have calculated the Grundy Numbers associated with that game using Sprague-Grundy Theorem (will explain later in this blog).
But before calculating Grundy Numbers, we need to learn about another term- Mex.
What is Mex?
‘Minimum excludant’ also known as ‘Mex’ of a set of numbers is the smallest non-negative number not present in the set.
The Grundy Number (Nimber) is equal to 0 for a game that is lost immediately by the first player, and is equal to the Mex of the Grundy Numbers of all possible next positions for any other game.
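The two definitions above translate almost directly into code. The following is a minimal Python sketch: `mex`, `make_grundy` and the toy move rule (take 1 or 2 from n) are names and assumptions introduced for this illustration and do not come from the original programs.

```python
from functools import lru_cache


def mex(values):
    """Smallest non-negative integer not present in values."""
    present = set(values)
    m = 0
    while m in present:
        m += 1
    return m


def make_grundy(next_positions):
    """Build a memoized Grundy function from a rule mapping a position to its successors."""
    @lru_cache(maxsize=None)
    def grundy(position):
        # A position with no moves has no successors, and the Mex of an empty set is 0.
        return mex(grundy(p) for p in next_positions(position))
    return grundy


# Hypothetical toy rule, for illustration only: from n, move to n-1 or n-2 (never below 0).
grundy = make_grundy(lambda n: [n - k for k in (1, 2) if n - k >= 0])

print(mex([0, 1, 3]))                 # 2
print([grundy(n) for n in range(7)])  # [0, 1, 2, 0, 1, 2, 0]
```

Memoization matters here: without it the same positions would be re-evaluated many times by the recursion.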
Below are three example games and programs to calculate the Grundy Number and Mex for each of them. The calculation of Grundy Numbers is done by a recursive function called calculate_Grundy(), which uses the calculate_Mex() function as its sub-routine.
Through these examples, you will see how Grundy Numbers and Mex help in solving such problems.
Example 1
The game starts with a pile of n stones, and the player to move may take any positive number of stones. Calculate the Grundy Numbers for this game. The last player to move wins. Which player wins the game?
Answer:
If the first player has 0 stones (n = 0), he loses immediately, so Grundy (0) = 0
If a player has 1 stone, then he can take all the stones and win. So the next possible position of the game (for the other player) is (0) stones.
Hence, Grundy (1) = Mex (0) = 1 [According to the definition of Mex]
Similarly, if a player has 2 stones, then he can take only 1 stone or he can take all the stones and win. So the next possible position of the game (for the other player) is (1, 0) stones respectively.
Hence, Grundy (2) = Mex (0, 1) = 2 [According to the definition of Mex]
Similarly, if a player has ‘n’ stones, then he can take 1 stone, or 2 stones, and so on, or he can take all the stones and win. So the next possible positions of the game (for the other player) are (n-1, n-2, …, 1, 0) stones respectively.
Hence, Grundy(n) = Mex (0, 1, 2, …. n-1) = n [According to the definition of Mex]
We summarize the Grundy values for n from 0 to 10 in the table below:
n:          0  1  2  3  4  5  6  7  8  9  10
Grundy(n):  0  1  2  3  4  5  6  7  8  9  10
Optimized Dynamic Programming Code in (C++):
Example 2:
The game starts with a pile of n stones, and the player to move may take any positive number of stones up to 3 only. The last player to move wins. Which player wins the game? This game is a one-pile version of Nim with a limit on the number of stones taken.
Answer:
Since if the first player has 0 stones, he will lose immediately, so Grundy (0) = 0
If a player has 1 stone, then he can take all the stones and win. So the next possible position of the game (for the other player) is (0) stone
Hence, Grundy (1) = Mex (0) = 1 [According to the definition of Mex]
Similarly, if a player has 2 stones, then he can take only 1 stone or he can take 2 stones and win. So the next possible position of the game (for the other player) is (1, 0) stones respectively.
Hence, Grundy (2) = Mex (0, 1) = 2 [According to the definition of Mex]
Similarly, Grundy (3) = Mex (0, 1, 2) = 3 [According to the definition of Mex]
But what about 4 stones?
If a player has 4 stones, then he can take 1 stone or he can take 2 stones or 3 stones, but he can’t take 4 stones (see the constraints of the game). So the next possible position of the game (for the other player) is (3, 2, 1) stones respectively.
Hence, Grundy (4) = Mex (1, 2, 3) = 0 [According to the definition of Mex]
So we can define the Grundy Number of any n >= 4 recursively as:
Grundy(n) = Mex [Grundy (n-1), Grundy (n-2), Grundy (n-3)]
We summarize the Grundy values for n from 0 to 10 in the table below:
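The recurrence can be evaluated bottom-up. The sketch below is a minimal Python illustration of that dynamic programming idea rather than the original C++ listing; `grundy_table` and its `max_take` parameter are names chosen here.

```python
def mex(values):
    """Smallest non-negative integer not present in values."""
    present = set(values)
    m = 0
    while m in present:
        m += 1
    return m


def grundy_table(n_max, max_take=3):
    """Grundy numbers for piles of 0..n_max stones when a move removes 1..max_take stones."""
    table = [0] * (n_max + 1)  # table[0] = 0: no move is available
    for n in range(1, n_max + 1):
        # Grundy numbers of every pile size reachable in one move.
        reachable = [table[n - k] for k in range(1, max_take + 1) if n - k >= 0]
        table[n] = mex(reachable)
    return table


table = grundy_table(10)
print(table)  # [0, 1, 2, 3, 0, 1, 2, 3, 0, 1, 2]
```

For n = 0 to 10 this gives 0, 1, 2, 3, 0, 1, 2, 3, 0, 1, 2, so the values repeat with period 4 and the first player loses exactly when n is a multiple of 4.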
Optimized Dynamic Programming Code in (C++):
Example 3:
The game starts with a number ‘n’, and the player to move divides ‘n’ by 2, 3 or 6 and then takes the floor. If the integer becomes 0, it is removed. The last player to move wins. Which player wins the game?
Answer:
Suppose we take n = 7. The first player can divide n by 2, 3 or 6.
If the first player divides n by 2, n = floor(7/2) = 3
If the first player divides n by 3, n = floor(7/3) = 2
If the first player divides n by 6, n = floor(7/6) = 1
So for the second player n could be 3, 2 or 1.
So Grundy (7) = Mex (Grundy (3), Grundy (2), Grundy (1)) = Mex (2, 2, 1) = 0 [According to the definition of Mex]
We summarize the Grundy values for n from 0 to 10 in the table below:
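The same bottom-up idea works for this division game. The following is a minimal Python sketch, not the original C++ listing; `grundy_division_table` is a name introduced for this illustration.

```python
def mex(values):
    """Smallest non-negative integer not present in values."""
    present = set(values)
    m = 0
    while m in present:
        m += 1
    return m


def grundy_division_table(n_max, divisors=(2, 3, 6)):
    """Grundy numbers for 0..n_max when a move replaces n by floor(n / d) for d in divisors."""
    table = [0] * (n_max + 1)  # table[0] = 0: the number has been removed
    for n in range(1, n_max + 1):
        # n // d is always smaller than n, so its Grundy number is already known.
        reachable = {table[n // d] for d in divisors}
        table[n] = mex(reachable)
    return table


division_table = grundy_division_table(10)
print(division_table)  # [0, 1, 2, 2, 3, 3, 0, 0, 0, 0, 0]
```

For n = 0 to 10 the values are 0, 1, 2, 2, 3, 3, 0, 0, 0, 0, 0. In particular Grundy(7) = 0, so with n = 7 the first player loses under optimal play.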
Optimized Dynamic Programming Code in (C++):
Above we have learned how to find Grundy Numbers through the examples. For solving tougher problems, we have to learn the Sprague-Grundy Theorem.
Suppose there is a composite game made up of N sub-games and two players, Alice and Bob. The Sprague-Grundy Theorem says that if both Alice and Bob play optimally (i.e., they don’t make any mistakes), then the player starting first is guaranteed to win if the XOR of the Grundy Numbers of the positions in each sub-game at the beginning of the game is non-zero. Otherwise, if the XOR evaluates to zero, then the player starting first will definitely lose, no matter what.
How to apply Sprague Grundy Theorem?
We can apply the Sprague-Grundy Theorem in any impartial game and solve it. The basic steps are listed as follows:
Now, we take an example to understand how to apply the Sprague-Grundy Theorem to find the winner, following each of the four steps one by one.
Example:
The game starts with 3 piles having 3, 4 and 5 stones, and the player to move may take any positive number of stones up to 3 only from any one of the piles (provided that the pile has that many stones). The last player to move wins. Which player wins the game assuming that both players play optimally?
Answer: we will follow each step.
First Step: Each pile can be considered as a sub-game.
Second Step: The Grundy Numbers of this game were calculated earlier in this blog (Example 2), which gives:
Grundy(3)=3
Grundy(4)=0
Grundy(5)=1
Third Step: The XOR of the Grundy Numbers 3, 0, 1 = 2.
Fourth Step: Since the XOR is a non-zero number, we can say that the first player will win.
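The four steps can be sketched in Python as follows. This is a minimal illustration under the rules of this example, not the original C++ program; `grundy_numbers` and `analyse` are names chosen here.

```python
from functools import reduce
from operator import xor


def mex(values):
    """Smallest non-negative integer not present in values."""
    present = set(values)
    m = 0
    while m in present:
        m += 1
    return m


def grundy_numbers(n_max, max_take=3):
    """Grundy numbers for a single pile of 0..n_max stones, taking 1..max_take per move."""
    table = [0] * (n_max + 1)
    for n in range(1, n_max + 1):
        table[n] = mex(table[n - k] for k in range(1, max_take + 1) if n - k >= 0)
    return table


def analyse(piles, max_take=3):
    """Return (Grundy number per pile, their XOR, whether the first player wins)."""
    table = grundy_numbers(max(piles), max_take)
    per_pile = [table[p] for p in piles]   # steps 1 and 2: one sub-game per pile
    total = reduce(xor, per_pile, 0)       # step 3: XOR of the Grundy numbers
    return per_pile, total, total != 0     # step 4: non-zero means the first player wins


print(analyse([3, 4, 5]))  # ([3, 0, 1], 2, True)
```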
C++ program that implements above all four steps:
References: Wikipedia
I will now explain one more very good problem based on the Game of Nim, which will seriously boost your skill in Game Theory.
Example (composite game):
There is an N x N chessboard with K knights on it. Unlike a knight in a traditional game of chess, these can move only as shown in the picture below (so the sum of coordinates is decreased in every move). There can be more than one knight on the same square at the same time. Two players take turns moving, and when it is a player’s turn, he chooses one of the knights and moves it. A player who is not able to make a move is declared the loser.
Answer:
This is the same as if we had K chessboards with exactly one knight on every chessboard. This is the ordinary sum of K games and it can be solved by using the Grundy numbers. We assign a Grundy number to every subgame according to the size of the pile in the Game of Nim that it is equivalent to. Once we know how to play Nim, we will be able to play this game as well.
Here is the pseudocode for generating the Grundy numbers for each position on the chessboard:
    int grundy_Number(position pos) {
        moves[] = possible positions to which I can move from pos
        set s;
        for (all x in moves)
            insert into s grundy_Number(x);
        // return the smallest non-negative integer not in the set s
        int ret = 0;
        while (s.contains(ret)) ret++;
        return ret;
    }
How to find the Grundy numbers in this game?
We use the same concept to find the Grundy Number as we did in the Game of Nim.
Grundy number for each position on the chessboard:
Suppose you are calculating for the position x.
G(x) = Mex(G(x1), G(x2), …, G(xm))
where m is the number of possible moves of the knight from position x.
G(x1), G(x2), G(x3), … and G(xm) are the Grundy Numbers of all the positions x1, x2, …, xm to which a knight can move from x. These Grundy numbers have already been calculated.
The following table shows Grundy numbers for an 8 x 8 board:
A better approach is to compute the Grundy numbers for an N x N chessboard in O(N^2) time and then XOR these K values (one for every knight). If their XOR is 0 then we are in a losing position; otherwise, we are in a winning position.
Why is the pile of Nim equivalent to the subgame if its size is equal to the Grundy number of that subgame?
Other composite games:
It doesn’t happen often, but you can occasionally encounter games with a slightly different set of rules. For example, you might see the following changes:
Q. What if, when it is a player’s move, he can choose several of the knights (at least one) and must move all the chosen ones?
Solution: You are in a losing position if and only if every knight is in a losing position on its own chessboard (so the Grundy number of every square where a knight stands is 0).
Problems for Practice:
GAME3 — Yet Another Fancy Game [SPOJ]
GAME31 — The game of 31 [SPOJ]
For more advanced details on competitive programming, check my GitHub repo:
Awesome-competitive-programming
Happy coding
The post A Refresher on Regression Analysis appeared first on StepUp Analytics.
However, if we are given the optimum combinations of these predictor variables, we can build a model for the crop yield which can be used to predict the required crop yield (model builds data).
While examining a patient, the dosage is set keeping in mind his other illnesses and previous medical records such as blood sugar level, cholesterol, eyesight, etc. Here dosage can be considered as some dependent variable and his other illnesses, medical records are considered as independent variables.
Such a relationship, in which a dependent variable needs to be measured in light of all the other independent variables, is expressed through terms like correlation and regression.
In simple terms, regression helps us to predict or analyze relationships between two or more variables. The factor being predicted is known as a dependent variable and the factors that are used to predict the values of the dependent variable are called independent variables.
Regression analysis is used to do the same. For example, you might guess that there is a connection between how much you eat and how much you weigh; regression analysis can help you quantify that. Regression analysis gives us an equation for a graph so that we can make predictions about our data.
In statistics, some random numbers lying in a table make little sense to us. To make sense out of it, we can use regression and obtain some inferences about the future performance of the given random variable.
Suppose you’re a sales manager trying to predict next month’s numbers. You know that dozens, perhaps even hundreds of factors, from the weather to a competitor’s promotion to the rumor of a new and improved model, can impact the number.
Perhaps people in your organization even have a theory about what will have the biggest effect on sales. “Trust me. The more rain we have, the more we sell.” “Six weeks after the competitor’s promotion, sales jump.”
Regression analysis is a way of mathematically sorting out which of those variables does indeed have an impact. It answers the questions: Which factors matter most? Which can we ignore? How do those factors interact with each other? And, perhaps most importantly, how certain are we about all of these factors?
The best way to understand linear regression is to relive an experience of childhood. If you ask a fifth-grade child to arrange the people in his class in increasing order of weight, without asking them their weights, the child would likely look at (visually analyze) the height and build of the classmates and arrange them using a combination of these visible parameters.
The child has actually figured out that height and build are related to weight by a linear relationship. This is linear regression in real life!
In simple terms, simple linear regression is predicting the value of a variable Y (the dependent variable) based on some variable X (the independent variable), provided there is a linear relationship between the variables X and Y.
If there is more than one independent variable, then we can predict the value of Y using Multiple Linear Regression. For example, when we predict rent based on square feet alone, we can use simple linear regression, but when we predict the rent based on square feet and the age of the building, we use multiple linear regression.
The linear relationship between the two variables can be represented by a straight line, called the regression line.
To determine whether there is a linear relationship between two variables, we can simply draw the scatter plot (plotting the coordinates (x, y) on a graph) of variable Y against variable X. If the plotted points are randomly scattered, then it can be inferred that the variables are not related.
There is a linear relationship between the variables.
If the points lie along a straight line, then there exists a linear relationship between the variables.
After drawing a straight line through the plotted points, we will find that not all the points lie on the line. This happens because the line that we have drawn may not be the best fit and because the points plotted are probabilistic, i.e., our observations are approximate.
But when there exists a linear relationship between X and Y, we can draw more than one line through these points. How do we know which one is the best fit?
To help us choose the line of best fit, we use the method of least squares.
Least Squares
This is the mathematical relationship between the variables X and Y:
$$Y = b_0 + b_1 X + e$$
where
X is the independent variable
Y is the dependent variable
$b_0$ is the intercept of the regression line
$b_1$ is the slope of the regression line
e is the error, i.e. the deviation of the observed value of Y from the value calculated from the line
Here, $e_i = y_i - (b_0 + b_1 x_i)$ is the difference between the ith observed value and the ith calculated value. This error could be positive or negative. We have to minimize this error to get the line of best fit. On minimizing the error sum of squares, $\sum_{i=1}^{n} e_i^2$, we obtain the values of $b_0$ and $b_1$ using the two normal equations
$$\sum_{i=1}^{n} y_i = n b_0 + b_1 \sum_{i=1}^{n} x_i$$
and
$$\sum_{i=1}^{n} x_i y_i = b_0 \sum_{i=1}^{n} x_i + b_1 \sum_{i=1}^{n} x_i^2$$
Then, we find the values of $y_i$ for the given values of $x_i$ and plot the line of best fit.
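To make the least squares step concrete, here is a minimal Python sketch that solves the two normal equations for the intercept and slope. It is an illustration only: `fit_line` is a name chosen here, and the height and weight values are the ones used in the R example of this article.

```python
def fit_line(xs, ys):
    """Solve the two normal equations of simple linear regression for (b0, b1)."""
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    sum_xx = sum(x * x for x in xs)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    # Eliminating b0 from the two normal equations gives the slope.
    b1 = (n * sum_xy - sum_x * sum_y) / (n * sum_xx - sum_x ** 2)
    # The first normal equation then gives the intercept.
    b0 = (sum_y - b1 * sum_x) / n
    return b0, b1


heights = [151, 174, 138, 186, 128, 136, 179, 163, 152, 131]
weights = [63, 81, 56, 91, 47, 57, 76, 72, 62, 48]

b0, b1 = fit_line(heights, weights)
print(round(b0, 4), round(b1, 4))   # intercept about -38.4551, slope about 0.6746
print(round(b0 + b1 * 170, 2))      # fitted weight at height 170, about 76.23
```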
R Code for Simple Linear Regression
We use the lm() function to create a relationship model (between the predictor and the response variable). The basic syntax for the lm() function is:
lm(formula,data)
formula: the symbolic description of the relationship between x and y
data: the data frame containing the variables to which the formula will be applied
For example:
    x <- c(151, 174, 138, 186, 128, 136, 179, 163, 152, 131)
    y <- c(63, 81, 56, 91, 47, 57, 76, 72, 62, 48)

    # Apply the lm() function.
    relation <- lm(y~x)
    print(relation)
So, the final code becomes:
    # Load the train and test datasets.
    # Identify feature and response variable(s); values must be numeric.
    # x_train and x_test are assumed to be data frames, so that cbind() also returns a data frame.
    x_train <- input_variables_values_training_datasets
    y_train <- target_variables_values_training_datasets
    x_test <- input_variables_values_test_datasets
    x <- cbind(x_train, y_train)

    # Train the model using the training set and check the fit.
    linear <- lm(y_train ~ ., data = x)
    summary(linear)

    # Predict output.
    predicted <- predict(linear, x_test)
For example:
    # The predictor vector.
    x <- c(151, 174, 138, 186, 128, 136, 179, 163, 152, 131)

    # The response vector.
    y <- c(63, 81, 56, 91, 47, 57, 76, 72, 62, 48)

    # Apply the lm() function.
    relation <- lm(y~x)

    # Find weight of a person with height 170.
    a <- data.frame(x = 170)
    result <- predict(relation, a)
    print(result)
Multiple regression analysis is almost the same as simple linear regression. The only difference between simple linear regression and multiple linear regression is in the number of independent (or predictor) variables used in the regression.
For example, if we want to find out whether the weight, height, and age of people explain the variance in their cholesterol levels, then multiple regression will come to our rescue. We may take weight, height, and age as the independent variables $x_1$, $x_2$ and $x_3$, and cholesterol as our dependent variable.
Assumptions:
There are three major uses of multiple regression analysis. First, it can be used to forecast the effects or impacts of changes, i.e., it helps us understand how much the dependent variable will change when we change the independent variables. For example, we can use multiple regression to find how much GPA is expected to increase (or decrease) for every one point increase (or decrease) in IQ.
Second, it can be used to identify the strength of the effect that the independent variables have on the dependent variable.
Lastly, multiple linear regression analysis can be used to predict future values, i.e. to get point estimates. For example, to know what factors affect the crop yield the most, multiple regression analysis can be used.
First, we draw scatter plots of each independent variable against the dependent variable. These scatter plots help us understand the direction and strength of the relationship between the variables.
In the first plot, we see a positive correlation between the dependent and the independent variable.
Whereas, in the second plot, we see an arch-like curve. This indicates that a regression line might not be the best way to explain the data, even if the correlation between them is positive.
The second step of multiple linear regression is to formulate the model, i.e. to state that the variables X1, X2 and X3 have a causal influence on variable Y and that their relationship is linear.
The last step is to fit the regression line.
The multiple linear regression equation is given as:
$Y = b_0 + b_1 X_1 + b_2 X_2 + \dots + b_m X_m + e$
Proceeding in the same way as above, we find the constants 𝑏0, 𝑏1, …, 𝑏𝑚 and then obtain the values of 𝑦𝑖 for the given values of 𝑥𝑖.
Then, we plot the corresponding coordinates and draw the lines of best fit for each combination of independent and dependent variables.
Here, 𝑏0 is the intercept and 𝑏1, …, 𝑏𝑚 are the regression coefficients. They can be interpreted the same way as a slope. Thus, if 𝑏𝑖 = 2.5, Y will increase by 2.5 units if Xi increases by 1 unit while the other predictors are held constant.
The larger the magnitude of 𝑏𝑖, the more Y changes per unit change in Xi. Because this depends on the units of Xi, coefficients are only directly comparable when the predictors are on the same scale.
R Code for Multiple Linear Regression
The code for multiple regression is similar to that of simple linear regression. We consider the following example to understand the code.
Consider the data set “mtcars” available in the R environment. It compares different car models in terms of miles per gallon (“mpg”), cylinder displacement (“disp”), horsepower (“hp”), weight of the car (“wt”) and some more parameters.
The goal of the model is to establish the relationship between “mpg” as the response variable and “disp”, “hp” and “wt” as predictor variables. We create a subset of these variables from the mtcars dataset for this purpose.
input <- mtcars[, c("mpg", "disp", "hp", "wt")]
# Create the relationship model.
model <- lm(mpg ~ disp + hp + wt, data = input)
# Show the model.
print(model)
# Get the intercept and coefficients as vector elements.
cat("# # # # The Coefficient Values # # # ", "\n")
a <- coef(model)[1]
print(a)
Xdisp <- coef(model)[2]
Xhp <- coef(model)[3]
Xwt <- coef(model)[4]
print(Xdisp)
print(Xhp)
print(Xwt)
Using the coefficient values, we create the mathematical equation
Y = a + Xdisp*x1 + Xhp*x2 + Xwt*x3
We can use the regression equation created above to predict the mileage when a new set of values for displacement, horsepower and weight is provided.
For a car with disp = 221, hp = 102 and wt = 2.91, the predicted mileage is:
Y = 37.15+(-0.000937)*221+(-0.0311)*102+(-3.8008)*2.91 = 22.7104
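The arithmetic above can be checked with a few lines of Python. This is only a sketch: the coefficients are the rounded values quoted in the text, and the helper name predict_mpg is an illustrative assumption.

```python
# Rounded coefficients as quoted in the text (not re-estimated here).
intercept = 37.15
coefficients = {"disp": -0.000937, "hp": -0.0311, "wt": -3.8008}
new_car = {"disp": 221, "hp": 102, "wt": 2.91}


def predict_mpg(intercept, coefficients, car):
    """Evaluate Y = a + sum(b_i * x_i) for one car."""
    return intercept + sum(coefficients[name] * car[name] for name in coefficients)


mpg = predict_mpg(intercept, coefficients, new_car)
print(round(mpg, 4))
```

Swapping in other values for disp, hp and wt gives the predicted mileage for a different car under the same fitted equation.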
The post A Refresher on Regression Analysis appeared first on StepUp Analytics.
]]>The post Competitive Programming: Algorithms and Data Structure appeared first on StepUp Analytics.
]]>This blog is the continuation of How to Start with Competitive Programming.
In each part, I will introduce you to important concepts used in competitive programming (without going into detail) and point you to a good reference where you can read about each topic in depth.
If you want to be a serious competitive programmer, you should know some mathematical concepts and have a good command of number theory.
Number theory
Number theory has many concepts. Let me introduce them one by one; knowing them will save you a lot of time and effort while programming in contests.
1. Modular arithmetic
When one number is divided by another, the modulo operation finds the remainder. It is denoted by the % symbol.
Example: Assume that you have two numbers, 9 and 2. 9 % 2 is 1 because when 9 is divided by 2, the remainder is 1.
More details visit this: Modular arithmetic
2. Modular exponentiation
Exponentiation is a mathematical operation that is written as x^n and computed as x^n = x*x*…*x (n times). In modular exponentiation, given three numbers x, y, and p, we compute (x^y) % p.
Example:
Input: x = 2, y = 3, p = 5
Output: 3
Explanation: 2^3 % 5 = 8 % 5 = 3.
More details visit this: Modular exponentiation
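A common way to compute (x^y) % p without building the huge intermediate value x^y is repeated squaring. The sketch below assumes y >= 0 and p > 0; the name mod_pow is illustrative (Python's built-in pow(x, y, p) does the same job).

```python
def mod_pow(x, y, p):
    """Compute (x ** y) % p by repeated squaring, assuming y >= 0 and p > 0."""
    result = 1 % p          # handles p == 1, where every result is 0
    x %= p
    while y > 0:
        if y & 1:           # lowest bit of the exponent is set
            result = (result * x) % p
        x = (x * x) % p     # square the base for the next bit
        y >>= 1
    return result


print(mod_pow(2, 3, 5))  # 3, as in the example above
```

Because the exponent is halved at every step, the loop runs about log2(y) times instead of y times.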
3. Greatest Common Divisor (GCD)
The GCD of two or more numbers is the largest positive number that divides all the numbers that are considered.
Example:
The GCD of 20 and 12 is 4 because it is the largest positive number that can divide both 20 and 12.
More details visit this: GCD
4. Euclidean algorithm
The idea behind this algorithm is GCD(A, B) = GCD(B, A % B). It recurses until A % B = 0, at which point B is the GCD.
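The recurrence can be written as a short loop. This is a minimal sketch for non-negative integers; the function name gcd is chosen for illustration (the standard library already offers math.gcd).

```python
def gcd(a, b):
    """Euclidean algorithm: replace (a, b) by (b, a % b) until b is 0."""
    while b:
        a, b = b, a % b
    return a


print(gcd(20, 12))  # 4
```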
5. Extended Euclidean algorithm
This algorithm is an extended form of Euclid’s algorithm. GCD(A, B) has a special property: it can always be represented in the form of an equation, i.e. Ax + By = GCD(A, B), where x and y are integers.
The coefficients (x and y) of this equation will be used to find the modular multiplicative inverse. The coefficients can be zero, positive or negative in value. This algorithm takes two inputs as A and B and returns GCD(A, B) and coefficients of the above equation as output.
Example If A=30 and B=20,
then 30∗(1)+20∗(−1)=10 where 10 is the GCD of 20 and 30.
More details visit this: Extended Euclidean algorithm
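A recursive sketch of the extended algorithm is shown below. The function name extended_gcd is an illustrative choice; it returns the GCD together with one valid pair of coefficients.

```python
def extended_gcd(a, b):
    """Return (g, x, y) such that a*x + b*y == g == gcd(a, b)."""
    if b == 0:
        return a, 1, 0
    # Solve the smaller problem for (b, a % b), then back-substitute.
    g, x1, y1 = extended_gcd(b, a % b)
    return g, y1, x1 - (a // b) * y1


g, x, y = extended_gcd(30, 20)
print(g, x, y)  # 10 1 -1, matching 30*(1) + 20*(-1) = 10
```

The coefficients are not unique; other pairs (x, y) also satisfy the equation, and this routine simply returns the one produced by the back-substitution.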
6. Modular multiplicative inverse
What is a multiplicative inverse? Given A, you are required to find B such that A.B = 1. The solution is simple: B = 1/A. Here, B is the multiplicative inverse of A.
What is a modular multiplicative inverse? If you have two numbers A and M, you are required to find B such that it satisfies the following equation: (A.B) % M = 1. Here B is the modular multiplicative inverse of A under modulo M. Such a B exists only when A and M are coprime, i.e. GCD(A, M) = 1.
More details visit this: Modular multiplicative inverse
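One way to find the inverse is to run the extended Euclidean algorithm and keep only the coefficient of A. The sketch below is iterative and assumes M > 1; the name mod_inverse is illustrative, and it returns None when no inverse exists.

```python
def mod_inverse(a, m):
    """Return b in [0, m) with (a * b) % m == 1, or None if gcd(a, m) != 1.

    Assumes m > 1. Invariant: old_r == old_x * a (mod m) at every step.
    """
    old_r, r = a % m, m
    old_x, x = 1, 0
    while r:
        q = old_r // r
        old_r, r = r, old_r - q * r
        old_x, x = x, old_x - q * x
    if old_r != 1:          # a and m are not coprime
        return None
    return old_x % m


print(mod_inverse(3, 11))  # 4, because (3 * 4) % 11 == 1
```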
7. Sieve of Eratosthenes
Given a number n, print all primes smaller than or equal to n. It is also given that n is a small number. The sieve of Eratosthenes is one of the most efficient ways to find all primes smaller than n when n is smaller than 10 million.
Example:
Input : n = 20
Output: 2 3 5 7 11 13 17 19
More details visit this: Sieve of Eratosthenes
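A compact sketch of the sieve follows. The function name primes_up_to is an illustrative choice.

```python
def primes_up_to(n):
    """Return all primes <= n using the sieve of Eratosthenes."""
    if n < 2:
        return []
    is_prime = [True] * (n + 1)
    is_prime[0] = is_prime[1] = False
    p = 2
    while p * p <= n:
        if is_prime[p]:
            # Smaller multiples of p were already crossed out by smaller primes.
            for multiple in range(p * p, n + 1, p):
                is_prime[multiple] = False
        p += 1
    return [i for i, flag in enumerate(is_prime) if flag]


print(primes_up_to(20))  # [2, 3, 5, 7, 11, 13, 17, 19]
```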
8. Euler’s Totient Function
Euler’s Totient function fun(n) for an input n is a count of numbers in {1, 2, 3, …, n} that are relatively prime to n, i.e., the numbers whose GCD (Greatest Common Divisor) with n is 1.
Example:
fun(6) = 2
gcd(1, 6) is 1 and gcd(5, 6) is 1.
More details visit this: Euler’s Totient Function
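The totient can be computed from the prime factors of n using the product formula n * (1 - 1/p) over the distinct primes p dividing n. The sketch below uses trial division, which is adequate for small n; the name phi is illustrative.

```python
def phi(n):
    """Euler's totient: count of k in 1..n with gcd(k, n) == 1."""
    result = n
    p = 2
    while p * p <= n:
        if n % p == 0:
            while n % p == 0:   # strip every copy of the prime factor p
                n //= p
            result -= result // p
        p += 1
    if n > 1:                   # one prime factor larger than sqrt(n) remains
        result -= result // n
    return result


print(phi(6))  # 2
```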
9. Convex Hull
Given a set of points in the plane, the convex hull of the set is the smallest convex polygon that contains all of its points.
More details visit this: Convex Hull
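There are several convex hull algorithms; the text does not name one, so the sketch below uses Andrew's monotone chain as one common choice. Points are (x, y) tuples, and the function names are illustrative.

```python
def cross(o, a, b):
    """Cross product of vectors OA and OB; > 0 means a counter-clockwise turn."""
    return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])


def convex_hull(points):
    """Return hull vertices in counter-clockwise order (monotone chain)."""
    pts = sorted(set(points))
    if len(pts) <= 2:
        return pts
    lower = []
    for p in pts:
        while len(lower) >= 2 and cross(lower[-2], lower[-1], p) <= 0:
            lower.pop()
        lower.append(p)
    upper = []
    for p in reversed(pts):
        while len(upper) >= 2 and cross(upper[-2], upper[-1], p) <= 0:
            upper.pop()
        upper.append(p)
    # The last point of each chain is the first point of the other one.
    return lower[:-1] + upper[:-1]


square_with_centre = [(0, 0), (2, 0), (2, 2), (0, 2), (1, 1)]
print(convex_hull(square_with_centre))  # [(0, 0), (2, 0), (2, 2), (0, 2)]
```

The interior point (1, 1) is dropped because it lies inside the polygon formed by the four corners.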
Which data structure you use depends on the problem you are trying to solve. If a problem is mapped to the most efficient data structure, one that captures the essence of that problem, it leads to an elegant solution.
The “right” choice of data structure depends not only on the representation of the inputs but also on the query it is supposed to be optimal for. E.g. if you are asked to find a number in a list of numbers efficiently, then a BST (Binary Search Tree) is a choice that represents the input data effectively for point search queries.
If the query is for a range of numbers, and not just a single number, then a BST is no longer the optimal choice and the data structure to choose may be a B+ Tree.
Here I will categorize the important data structures by competitive programming skill level.
Beginner:
1. Linked List
2. Stack
3. Queue
4. Binary Search Tree
Intermediate:
1. Heap
2. Priority Queue
3. Huffman Tree
4. Union-Find
5. Trie
6. Hash Table
7. TreeMap
Proficient:
1. Segment Tree
2. Binary Indexed Tree
3. Suffix Array
4. Sparse Table
5. Lowest Common Ancestor
6. Range Tree
Expert:
1. Suffix Automaton
2. Suffix Tree
3. Heavy Light Decomposition
4. Treap
5. Aho-Corasick Algorithm
6. K Dimensional Tree
7. Link-Cut Tree
8. Splay Tree
9. Palindromic Tree
10. Ropes Data Structure
11. Dancing Links
12. Radix tree aka Prefix tree
13. Dynamic Suffix Array
I have seen all of the listed data structures being used in various programming contests.
Many of them are available in language libraries, but it is very important to understand how they work internally. Otherwise, understanding related higher-level structures will be difficult, if possible at all.
You may find some higher-level data structures easier than lower-level ones (it happened to me).
Programmers who use C++ for competitive programming can use the following containers from the STL.
1. Vector
2. List
3. Deque
4. Queue
5. Priority_queue
6. Stack
7. Set
8. Multiset
9. Map
10. Multimap
To be a good competitive programmer you must have a good understanding of algorithms and of how your code works. The best algorithms are the ones that are small (fewer lines of code) and efficient.
You can get better at designing algorithms by reading code and by practicing writing code.
Here I will introduce you to some of the standard algorithms that we use in competitive programming.
Searching algorithms
Sorting algorithms
Greedy Algorithms
A greedy algorithm is an algorithm that always makes a choice that seems best “right now”, without considering the future implications of this choice.
As the name implies, a greedy algorithm is greedy at each step of the process: it picks the best available option at that step (a maximum or a minimum, known as the local optimum) in the hope of ending up with the best solution for the whole problem (known as the global optimum). This works only for problems where locally optimal choices are known to lead to a global optimum.
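As an illustration, here is a sketch of activity selection, a classic problem where the greedy choice is known to be optimal: always take the interval that finishes earliest and does not clash with what has been taken so far. The problem choice and the name select_activities are illustrative and not taken from the article's own list.

```python
def select_activities(intervals):
    """Pick a maximum-size set of non-overlapping (start, end) intervals."""
    chosen = []
    last_end = float("-inf")
    # Greedy choice: consider intervals in order of earliest finishing time.
    for start, end in sorted(intervals, key=lambda iv: iv[1]):
        if start >= last_end:
            chosen.append((start, end))
            last_end = end
    return chosen


meetings = [(1, 4), (3, 5), (0, 6), (5, 7), (8, 11), (2, 14), (12, 16)]
print(select_activities(meetings))  # [(1, 4), (5, 7), (8, 11), (12, 16)]
```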
Here are some algorithms where the Greedy approach is used:
Pattern Searching Algorithms
In pattern searching, we look for a pattern that occurs one or more times in a sequence or string. Here I will introduce you to some efficient pattern searching algorithms that find a pattern in a given sequence or string and count its occurrences in optimal time.
Graph Algorithms
Some of the most famous graph algorithms are given below:
Introduction to DFS and BFS:
Minimum Spanning Tree:
Shortest Paths:
Connectivity:
Maximum Flow:
Dynamic Programming
In Dynamic Programming, a problem is divided into sub-problems and the solutions of these sub-problems are combined together to reach an overall solution for the main problem. When using approaches like Divide-and-Conquer, a sub-problem may be solved multiple times. Divide-and-Conquer methods may have to perform more work in these cases.
Dynamic Programming solves each of these sub-problems just once and saves the result, so it avoids recomputing the solution when the same sub-problem is needed again at a later stage.
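The effect of saving sub-problem results can be seen on Fibonacci numbers. The sketch below counts how many calls a plain recursive version makes compared with a memoized one; the counter dictionary and function names are illustrative.

```python
from functools import lru_cache

calls = {"naive": 0, "memo": 0}


def fib_naive(n):
    """Plain recursion: the same sub-problems are solved again and again."""
    calls["naive"] += 1
    return n if n < 2 else fib_naive(n - 1) + fib_naive(n - 2)


@lru_cache(maxsize=None)
def fib_memo(n):
    """Memoized recursion: each sub-problem is solved once and cached."""
    calls["memo"] += 1
    return n if n < 2 else fib_memo(n - 1) + fib_memo(n - 2)


print(fib_naive(20), calls["naive"])  # 6765 21891
print(fib_memo(20), calls["memo"])    # 6765 21
```

Both versions return the same value; the memoized one does it with one call per distinct sub-problem.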
Here are some of the most famous problems where dynamic programming is used.
Backtracking Algorithms
Backtracking = {try a candidate solution and return it if it works; otherwise go back and try the next one}.
In backtracking, we start with one possible move out of the many available moves and try to solve the problem. If we are able to solve the problem with the selected move, we print the solution; otherwise we backtrack, select some other move and try again. If none of the moves works out, we conclude that there is no solution to the problem.
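A small sketch of this idea on the N-Queens puzzle, chosen here only as a familiar illustration: place one queen per row, and undo the last placement whenever the remaining rows cannot be completed. The function name solve_n_queens is illustrative.

```python
def solve_n_queens(n):
    """Return one solution as a list of column indices (one per row), or None."""
    cols = []  # cols[r] is the column of the queen placed in row r

    def safe(row, col):
        for r, c in enumerate(cols):
            if c == col or abs(c - col) == row - r:  # same column or diagonal
                return False
        return True

    def place(row):
        if row == n:
            return True
        for col in range(n):
            if safe(row, col):
                cols.append(col)       # try this move
                if place(row + 1):
                    return True
                cols.pop()             # backtrack and try the next column
        return False

    return cols if place(0) else None


print(solve_n_queens(4))  # [1, 3, 0, 2]
print(solve_n_queens(3))  # None, no solution exists
```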
Here are some of the most famous problems where the backtracking approach is used.
For advanced details on competitive programming, check my GitHub repo: Awesome-competitive-programming
The post Competitive Programming: Algorithms and Data Structure appeared first on StepUp Analytics.
]]>The post Introduction to Competitive Programming appeared first on StepUp Analytics.
]]>Good programmers write code that humans can understand.
Before getting into how to start competitive programming, let’s first understand what competitive programming is and what its benefits are.
Competitive programming is a mind sport for computer programmers that is held over the internet or a local network. In a programming competition, programmers have to write computer programs for the given problems in a language they are comfortable with (such as C, C++, Java or Python).
C++ is a widely used programming language in competitive programming. You may be wondering why C++ is so widely used. Don’t worry, I will talk about it later in this blog.
A programming competition generally involves a set of logical, mathematical and algorithm-based problems (around 3 to 10 problems in each competition). Programmers solve the problems one by one. How many problems you solve depends on your competitive programming skills.
You need to choose a problem, read it, analyze it and crack the logic behind it, and then write the code in your favorite programming language (C, C++, Python, Java, etc.).
Once you have cracked the problem and written the code, make sure that your code is correct and that the required testing is done within the given time period. Then submit the code on the contest hosting website (such as CodeChef, Codeforces or HackerRank). After a successful submission, you get points, and these determine your rank in the competition.
These questions will surely come to your mind:
Don’t worry, I will explain why competitive programming is needed.
Competitive programming is a strong foundation for a good software developer or software engineer. It will make you very good at writing efficient programs quickly. If you get really serious about competitive programming, it will make you an expert in data structures and algorithms as well.
It helps software developers create optimized algorithms that make software robust and fast.
We all know internet giants such as Google and Facebook, which are among the most valuable companies in the world. Every year these companies host competitive programming challenges that can help you make a direct entry into these companies and land a dream job.
Google:
Google Code Jam, Kick Start.
Facebook:
Facebook Hacker Cup.
Now, I think you have a decent awareness of competitive programming and its benefits. I know you are very excited to start competitive programming. Let’s start!
Before starting with competitive programming you should have knowledge of at least one programming language. Any programming language will do. But most problems are set with C/C++ and Java programmers in mind, so knowing any one of them will be really helpful.
You don’t need to know really advanced concepts, like classes or generics/templates. You should just know if/else, loops, arrays, functions and have some familiarity with the standard library, like math functions, string/array operations, and input/output. For ‘C’, <string.h>, <stdio.h>, <math.h>, <stdlib.h> will generally be sufficient to start.
If you know only C, you can easily start. But at some point in time (especially when you reach advanced stages), you’ll need features which most languages have but C does not. Learning C++ is very easy if you know C. I’ll suggest that you start out with C and learn C++ in parallel with competitive programming.
Even if you are not confident of your skills in a programming language, you can (and should) still start. Competitive programming is also a good way to practice a new language you have learned.
Some good resources to learn programming languages:
For C++: geeksforgeeks
For Java: geeksforgeeks
For Python: geeksforgeeks
There are many coding platforms (Online judges) on the internet.
Most websites will give you a challenge and will ask you to write a program implementing that challenge. You will then have to submit your code. Your program will be automatically compiled and run and you’ll be told whether it ran correctly or not. Such websites are known as online judges. You will find many online judges on the internet. Here are some popular ones:
But I recommend that you stick to just one (or maybe two) online judges when you start competitive programming.
Most online judges (like CodeChef and Codeforces) have problems categorized by difficulty level. Within each difficulty level, easier problems generally have more submissions, so you can sort problems by the number of submissions to find the easiest ones.
For beginners, I recommend CodeChef. If you have never solved problems on an online judge, you can begin with the beginner problems on CodeChef; that will give you the confidence to solve harder problems.
An IDE is an environment that helps programmers write, compile and run code. There are many offline and online IDEs available.
Many offline IDEs are available for all types of operating systems:
Windows users:
If you are a Windows user, you might want to use an IDE. Code::Blocks and Dev-C++ are good for C and C++. For Java, NetBeans is a popular IDE, and for Python, PyCharm.
Linux or Mac users:
I advise you to use:
Here are some popular online IDEs:
Ideone is widely used as an online compiler. Another benefit of Ideone is that programmers can create an account and save their code in the cloud.
CodeChef online IDE.
There is only one mantra for going from a beginner to an advanced competitive programmer: practice, practice and practice.
You need at least 4 to 5 hours of practice every day, and you should try to solve at least 3 to 4 problems a day. You can choose any online judge for solving problems, but at the start I recommend CodeChef because it has four sets of problems: beginner, easy, medium and hard.
It is helpful if you stay in touch with people who do competitive programming regularly. This will keep you motivated.
Often while practicing, you will not be able to solve some problems. Do not give up easily! Keep trying! But sometimes even after trying for hours, we are not able to solve it. In those cases, it is advisable to look at the editorials. Editorials are step-by-step explanations on how to solve a problem. Often you’ll find new innovative ways of solving problems on reading them. So sometimes you should read editorials even if you have been able to solve a problem.
Sometimes reading editorials is not enough to understand how to solve a problem. This is usually the case when you know how to solve it but are not able to express your ideas as code easily. When that happens, you should try looking at other people’s code. Some online judges make other people’s code public (like CodeChef) while some don’t (like SPOJ).
Once you solve 20 to 25 problems, you should occasionally take part in programming contests. Many websites(online judges) host contests regularly.
First, let’s understand the types of programming contests.
There are two types of contest that we usually see:
Long contest: a competition lasting around 1 to 10 days that contains around 5 to 10 problems of various difficulties (easy, medium and hard). Usually, rank in this contest is decided by the number of problems solved during the competition days (the programmer who solves the most problems gets the first rank).
CodeChef hosts a long competitive programming challenge every month called the CodeChef Long Challenge. This contest has two divisions, div1 and div2. Click here to know more details about these divisions.
Short contest: a competition lasting around 2 to 5 hours that contains around 2 to 10 problems of various difficulties (easy, medium and hard). Usually, rank in this contest is decided by the number of correct submissions as well as by how fast you submit correct code during the competition hours.
Some popular short contests:
This is the second programming challenge hosted by CodeChef each month. It is a 2 hour 30 minute contest.
This is the last contest hosted by CodeChef each month. It is also a short competitive programming challenge, lasting 3 hours.
Codeforces is one of the most popular online judges in the world. Codeforces hosts four types of contests: div1, div2, div3 and educational rounds. These are 2-hour contests.
To know about the rules of these challenges: click here.
The ACM ICPC is an annual programming challenge.
The ACM ICPC is considered the “Olympics of Programming Competitions”. It is, quite simply, the oldest, largest, and most prestigious programming contest in the world. It consists of various short programming contests. It is a team challenge, with teams of three members.
For beginners, I recommend starting with the CodeChef Long Challenge because it is a 10-day challenge, so you have enough time to solve the problems.
You should not be disheartened if you are able to solve only one or two questions. This is natural when starting out. As you get better, you’ll be able to solve more and more. If you are not able to solve any question, you should contact a senior and he/she will help you.
When you have solved more than 75 problems, you should also start solving problems on Codeforces and taking part in Codeforces contests. This is one of the sites where the most serious programmers of the world can be found.
With regular practice, you should become pretty good.
NOTE: Never cheat in live contests; it will only harm you.
C++ is widely used in competitive programming because it has a large standard library (the STL) that helps competitive programmers write code fast and in few lines.
Most programming languages offer rich standard libraries. After gaining sufficient experience with a programming language, it is advisable to go through its standard libraries to see what they offer.
Let’s Learn the standard libraries.
For C++ users, good reference sites are:
cppreference has the advantage of having an offline version. However, cplusplus.com’s reference is better when you are unfamiliar with the things you are reading about (which will generally happen with relatively advanced concepts like iterators and STL containers). cplusplus.com has better explanations for these topics.
For Java users:
For Python users:
For advanced details on competitive programming, visit:
Awesome-competitive-programming
Happy coding!
The post Introduction to Competitive Programming appeared first on StepUp Analytics.
]]>The post Anaplan – Cloud Based Business Planning Platform appeared first on StepUp Analytics.
]]>Anaplan is a cloud-based business planning platform that works as a single hub where we can create and use business planning models. It is a new platform that challenges the existing tools in the market for corporate performance management. As a cloud-based business modelling and planning platform, it is an alternative to ERP systems and systems-integrator solutions.
The main aim of Anaplan is to provide a fully cloud-based platform for financial planning. Beyond finance, it also provides solutions for Workday and Salesforce so that as much of the available data as possible can be held in one place.
Anaplan solutions are built on an in-memory, 64-bit, multicore platform and delivered through the cloud. Anaplan already provides its services to large and small enterprises such as McAfee, Aviva, Kimberly-Clark and many similar firms.
Because of its flexibility, Anaplan can be used for financial, commercial and operational planning models. These models integrate easily with each other. Anaplan models can be built from scratch or customized from pre-delivered models like:
Anaplan can adapt to changes in the market while trying to reduce cost and risk and increase efficiency. This is possible because Anaplan brings S&OP, supply planning and demand management together and enables decision making throughout the supply chain. Anaplan can connect to other models, such as sales and finance, across the network. It provides features that improve visibility and collaboration in our business with ease.
With the help of the Anaplan cloud-based platform, the following tasks become easy to implement:
Compared with the old days, the supply chain is now in better shape. Years ago, many companies used ERP for business tracking and collaboration. Later, companies used a combination of ERP and planning spreadsheets. With spreadsheets, many issues arise over time as planning becomes messy. To overcome these problems, we can turn to an effective supply chain management tool, i.e. Anaplan.
Supply chain management covers the end-to-end flow of goods and services, from the initial stage of raw material to the final stage of finished products or services. It deals with activities such as planning, executing, monitoring and controlling operations. In supply chain management, one of the best features of Anaplan is that it provides total visibility among the suppliers, distributors and production team. Because of this, the purchase of products or services can be accurate and risk can be avoided. Anaplan can also reduce the purchase cost of products, prevent product shortages, and run prepared test scenarios reliably. Anaplan’s cloud-based platform connects manufacturers, suppliers and distributors.
Supply chain management is crucial for any business that wants to provide services that satisfy its users. With Anaplan, we can improve the following services:
Supply chain management makes sure that you deliver the appropriate products to the customer while working with an array of products in varying quantities. It helps deliver products to the right place at the right time. Through supply chain management, we can reduce the operating cost of a product, which includes purchasing cost, production cost and total supply chain cost. This reduces the value of fixed assets and increases the cash inflow of the business, resulting in higher profit leverage.
Supply chain management plays a significant role in society as well, e.g. in medical missions and natural disaster relief activities. When an unexpected natural disaster occurs, the SCM team comes into action and takes the necessary steps to deliver products and services on time.
Present-day markets need an agile view for monitoring the business line. The Anaplan platform gives the business team this kind of view for tracking the product lifecycle, the effects of new product introductions, product end of life, rates and more. With the Anaplan cloud-based platform, enterprises can efficiently optimize their business line to achieve their organizational goals, maximizing profits and increasing their market share.
The Anaplan platform collects all the cross-functional business information. Using this information, planners create or develop new products and predict product outcomes based on performance and revenue, and they remove the products that show poor results. Through demand planning, enterprises can prepare for unexpected issues such as natural disasters, labor problems and others.
With the Anaplan platform, you can easily connect planning, the business line, product innovation and trade partners. It also helps you synchronize planning models (recalibrated planning, test scenarios, operational planning) with a fluctuating business.
Every successful supply chain management effort requires the correct foundation. With Anaplan’s strategic management capabilities, the business can create policies for multiple groups of end users and products. Anaplan connects the network, inventory, products and customer segmentation. This helps you make rules and policies that support supply chain management.
For the next generations of supply chain management, Anaplan provides better models that make the product delivery system transparent, which improves customers’ trust in the organization. The cloud-based platform and the agile view are currently the features most used and most appreciated by organizations that have already started using Anaplan for supply chain management.
The post Anaplan – Cloud Based Business Planning Platform appeared first on StepUp Analytics.
]]>The post Randomized Block Design (RBD) and Its Application appeared first on StepUp Analytics.
]]>Suppose we want to compare the effects of t treatments, each with r replicates. Then we need n = rt experimental units. First, we divide the experimental units into r homogeneous blocks (groups). Then we assign the treatments at random to the units within each block.
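The model we consider here is the usual additive model,
$y_{ij} = \mu + \tau_i + \beta_j + e_{ij}, \quad i = 1, \dots, t, \; j = 1, \dots, r,$
where $y_{ij}$ is the observation on treatment i in block j, $\mu$ is the general mean, $\tau_i$ is the effect of the ith treatment, $\beta_j$ is the effect of the jth block and $e_{ij}$ is the random error.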
To carry out the analysis in Excel: select “Data Analysis” from the “Data” tab > select “Anova: Two-Factor Without Replication” and click “OK” > in “Input Range” select the data only. You can set “Alpha” to your desired level of significance. Choose the “Output options” and click “OK” > you get the ANOVA table.
library(reshape)
# Loading data: one row per block, one column per treatment
data <- data.frame(
  block = c("b1", "b2", "b3"),
  T1 = c(41, 65, 76),
  T2 = c(45, 67, 72),
  T3 = c(44, 66, 76),
  T4 = c(45, 66, 77),
  T5 = c(46, 76, 64))
expmnt <- melt(data, id.vars = "block")                  # reshaping to long format
names(expmnt) <- c("block", "treatment", "yield")       # renaming
summary(aov(yield ~ block + treatment, data = expmnt))  # result of ANOVA
OUTPUT:
> summary(aov(yield ~ block + treatment, data = expmnt))  # result of ANOVA
            Df Sum Sq Mean Sq F value   Pr(>F)
block        2 2368.1  1184.1  46.013 4.09e-05 ***
treatment    4    6.9     1.7   0.067     0.99
Residuals    8  205.9    25.7
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
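The sums of squares in this table can be reproduced by hand. The Python sketch below applies the standard RBD formulas to the same yields; the function name rbd_anova and the dictionary layout are illustrative assumptions, and p-values are left out because they need the F distribution.

```python
# Yields from the example: one list per treatment, ordered by block b1, b2, b3.
yields = {
    "T1": [41, 65, 76],
    "T2": [45, 67, 72],
    "T3": [44, 66, 76],
    "T4": [45, 66, 77],
    "T5": [46, 76, 64],
}


def rbd_anova(table):
    """Return {source: (df, sum_sq, mean_sq, f_value)} for an RBD layout."""
    names = list(table)
    t = len(names)                       # number of treatments
    r = len(table[names[0]])             # number of blocks
    values = [v for name in names for v in table[name]]
    correction = sum(values) ** 2 / (r * t)
    total_ss = sum(v * v for v in values) - correction
    treat_ss = sum(sum(table[name]) ** 2 for name in names) / r - correction
    block_totals = [sum(table[name][j] for name in names) for j in range(r)]
    block_ss = sum(b * b for b in block_totals) / t - correction
    error_ss = total_ss - treat_ss - block_ss
    df_block, df_treat = r - 1, t - 1
    df_error = df_block * df_treat
    ms_error = error_ss / df_error
    ms_block, ms_treat = block_ss / df_block, treat_ss / df_treat
    return {
        "block": (df_block, block_ss, ms_block, ms_block / ms_error),
        "treatment": (df_treat, treat_ss, ms_treat, ms_treat / ms_error),
        "residual": (df_error, error_ss, ms_error, None),
    }


result = rbd_anova(yields)
for source, (df, ss, ms, f) in result.items():
    print(source, df, round(ss, 1), round(ms, 1), None if f is None else round(f, 3))
```

The degrees of freedom, sums of squares and F values should match the R output above up to rounding.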
Sometimes, in an experimental design, some of the observations in the layout of the experiment are missing. In a field experiment, this may happen because of an attack of pests or the negligence of the observer. Sometimes observations are so suspicious that it is better to treat them as missing. In such a situation, the convention is to estimate the missing observations in terms of the available yields in order to have a complete layout. Once we have the complete layout, we carry out the usual analysis. Several techniques have been proposed by different statisticians, of which the procedure proposed by Yates is the most popular; it is referred to as the Yates Missing Plot Technique.
Suppose there are k missing observations x1, x2, …, xk in the layout of an experimental design consisting of t treatments, where we are to test whether the effects of the treatments are equal or not. First, we express the error sum of squares in terms of the available yields and the missing observations. Then we estimate the missing observations by minimizing the error SS, E(x1, x2, …, xk), with respect to x1, x2, …, xk. The estimates are obtained from the following system of equations,
$\frac{\partial E}{\partial x_i} = 0, \quad i = 1, 2, \dots, k.$
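For the special case of a single missing value (k = 1) in a randomized block design with r blocks and t treatments, minimizing the error SS gives the closed form x = (rB + tT − G) / ((r − 1)(t − 1)), where B, T and G are the block, treatment and grand totals of the available yields. The sketch below applies it to a hypothetical scenario in which one value of the RBD example data is treated as missing; the function name is illustrative.

```python
def yates_single_missing(block_total, treatment_total, grand_total, r, t):
    """Estimate one missing yield in an RBD with r blocks and t treatments.

    All three totals are taken over the available (non-missing) yields only.
    """
    return (r * block_total + t * treatment_total - grand_total) / ((r - 1) * (t - 1))


# Hypothetical scenario: pretend the T1 yield in block b1 (41 in the data) is missing.
block_b1 = [45, 44, 45, 46]   # remaining b1 yields (T2 to T5)
treat_t1 = [65, 76]           # remaining T1 yields (blocks b2 and b3)
grand = 926 - 41              # total of the 14 available yields
estimate = yates_single_missing(sum(block_b1), sum(treat_t1), grand, r=3, t=5)
print(estimate)  # 45.0
```

After substituting the estimate, the usual analysis is carried out with one degree of freedom subtracted from the error for the estimated value.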
The post Randomized Block Design (RBD) and Its Application appeared first on StepUp Analytics.
]]>The post Design Of Experiment: Completely Randomized Design appeared first on StepUp Analytics.
]]>Although sample surveys and the design of experiments are both concerned with data collection, we use them for different purposes. In a sample survey, we derive methods for collecting representative samples from a population so that we can interpret the characteristics of that population.
In the design of experiments, no such population exists. Here we have to define the experimental units that are used to perform the experiment. So in a designed experiment the experimenter has control over the experiment, but in a sample survey there is no such control: sample observations occur in nature and cannot be subjected to any experimental control. Suppose we want to estimate the average adult height in a city.
This is a sample survey problem. Here we decide on the sampling technique, collect the data and draw inferences about the height of that population. Now suppose we want to know which of five given varieties of rice is expected to give the maximum yield in the long run. In that case we have to conduct an experiment.
Before discussing the principles of design, it is useful to explain the terminology used in this context. The commonly used terms are experiment, treatment, experimental unit, experimental error and precision.
Experiment – An experiment is a procedure for obtaining an answer to the question that the experimenter has in mind. In planning an experiment, we clearly state our objectives and formulate the hypotheses we want to test.
Experimental unit – An experimental unit is the material to which the treatments are applied and on which the variable under study is measured.
Treatment – The procedures/objects under comparison in an experiment are the treatments.
Experimental error – There is always some variation in an experiment. Part of this variation can be controlled; the remaining part, which is random, is called experimental error.
Precision – It is measured by the reciprocal of the variance of a mean, i.e. $1/\mathrm{Var}(\bar{y}) = n/\sigma^2$, where $\sigma^2$ is the variance of a single observation.
As n (the number of observations) increases, precision increases.
There are three basic principles of design –
(i) Randomization (ii) Replication (iii) Local control.
Randomization: It is essential for getting a valid estimate of random experimental error. It minimizes the bias in the experiment.
Replication: As the number of replications (observations per treatment) increases, the error variance of a treatment mean decreases and, as a result, precision increases.
Local control: The third principle is local control, or error control. Randomization and replication alone do not remove all of the experimental error; local control reduces it further by grouping homogeneous experimental units into blocks.
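To make the first two principles concrete, here is a minimal sketch of how treatments might be assigned to experimental units completely at random in R. The treatment labels, the number of replications and the seed are illustrative assumptions, not values prescribed by the article.

```r
set.seed(1)  # assumed seed, only so the layout is reproducible

treatments <- c("A", "B", "C")  # t = 3 treatments (illustrative labels)
r <- 6                          # replications per treatment (assumed)
n <- length(treatments) * r     # total number of experimental units

# Replication: each treatment appears r times.
# Randomization: sample() shuffles the treatments over the n units.
layout <- data.frame(
  unit      = seq_len(n),
  treatment = sample(rep(treatments, each = r))
)

head(layout)
table(layout$treatment)  # every treatment is still replicated r times
```

Because `sample()` only permutes the labels, the design stays balanced while the assignment of treatments to units is left to chance.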
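CRD is the simplest design in which replication and randomization are used. Suppose we have t levels of a factor, the i-th level having ri replications, i = 1,2,…,t. The total number of experimental units is $n = \sum_{i=1}^{t} r_i$ (for simplicity we will take all ri = r). Here we allocate the treatments to the n experimental units completely at random. We can look at this design as a one-way ANOVA model.
The model we consider here is,
$$y_{ij} = \mu + \tau_i + e_{ij}, \quad i = 1, 2, \ldots, t, \; j = 1, 2, \ldots, r_i,$$
where $y_{ij}$ is the j-th observation on the i-th level, $\mu$ is the general mean, $\tau_i$ is the effect of the i-th level and $e_{ij}$ is the random error.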
Let us consider an example:
A factor has three levels, A, B and C. We want to test whether they differ significantly. Let us take a sample of six observations for each level, as listed in the code below.
By using R programming we can easily test the significance of the levels:
A <- c(22, 42, 44, 52, 45, 37)  # Observations of level A
B <- c(52, 33, 8, 47, 43, 32)   # Observations of level B
C <- c(16, 24, 19, 18, 34, 39)  # Observations of level C

x <- data.frame(A, B, C)
r <- c(t(as.matrix(x)))         # Responses read row by row: A1, B1, C1, A2, ...
f <- c("Item1", "Item2", "Item3")
k <- 3                          # Number of levels
n <- 6                          # Number of observations in each level
levels <- gl(k, 1, n * k, factor(f))  # Matching treatments to the responses

a <- aov(r ~ levels)
summary(a)
OUTPUT:
A <- c(22,42,44,52,45,37) #Observations of level A
B <- c(52,33,8,47,43,32) #Observations of level B
C <- c(16,24,19,18,34,39) #Observations of level C
x <- data.frame(A,B,C);
   A  B  C
1 22 52 16
2 42 33 24
3 44  8 19
4 52 47 18
5 45 43 34
6 37 32 39
r <- c(t(as.matrix(x)))
f <- c("Item1", "Item2", "Item3")
k <- 3 #number of levels
n <- 6 #number of observations in each level
levels <- gl(k, 1, n*k, factor(f)) #Matching treatments
a <- aov(r~levels)
Call:
   aov(formula = r ~ levels)

Terms:
                  levels Residuals
Sum of Squares  745.4444 2200.1667
Deg. of Freedom        2        15

Residual standard error: 12.11106
Estimated effects may be unbalanced
summary(a)
            Df Sum Sq Mean Sq F value Pr(>F)
levels       2  745.4   372.7   2.541  0.112
Residuals   15 2200.2   146.7
The p-value is 0.112, which is greater than 0.05, the chosen level of significance.
So we fail to reject the null hypothesis. That means there is no significant difference between the three levels A, B and C.
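For readers who want to see where the numbers in the ANOVA table come from, the following sketch recomputes the sums of squares, the F statistic and the p-value by hand for the same three samples. The variable names (`obs`, `grp`, `ss_treat` and so on) are chosen here for illustration only.

```r
A <- c(22, 42, 44, 52, 45, 37)
B <- c(52, 33, 8, 47, 43, 32)
C <- c(16, 24, 19, 18, 34, 39)

obs <- c(A, B, C)
grp <- factor(rep(c("A", "B", "C"), each = 6))

grp_mean <- ave(obs, grp)  # mean of its own level, repeated for each observation

# Between-level (treatment) and within-level (error) sums of squares
ss_treat <- sum((grp_mean - mean(obs))^2)
ss_error <- sum((obs - grp_mean)^2)

df_treat <- nlevels(grp) - 1
df_error <- length(obs) - nlevels(grp)

# F statistic: ratio of the two mean squares
f_stat  <- (ss_treat / df_treat) / (ss_error / df_error)
p_value <- pf(f_stat, df_treat, df_error, lower.tail = FALSE)

c(SS_treatment = ss_treat, SS_error = ss_error, F = f_stat, p = p_value)
```

These quantities should match the `levels` and `Residuals` rows printed by `summary(a)`.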
]]>The post What Is Heteroscedasticity in Regression Analysis appeared first on StepUp Analytics.
]]>Let’s come to the 3rd point. We all know what a residual is: the difference between the actual value and the predicted value of the dependent variable. We assume that the residuals have equal variance. When this assumption is violated, that is, when the variance of the residuals is not constant, the problem is called heteroscedasticity.
There are various causes for the presence of heteroscedasticity in our regression model. Some of them are:
It is customary to check for heteroscedasticity of the residuals once you have built a linear regression model. We can use either a graphical method or statistical tests to detect heteroscedasticity in our model. First, we will discuss the graphical method. I am going to use the R programming language and environment (RStudio) for the detection.
In the graphical method, we fit a model to our data set and plot the residuals. If the plot looks random, there is no evidence of heteroscedasticity. But if there is a specific, systematic pattern (like a fan shape), then heteroscedasticity is present in our model. It’s very simple. Okay! Let’s take the very popular “cars” data set and fit a model.
R-code:
model_1 <- lm(dist ~ speed, data = cars)

# Residuals plotted against the observation number
plot(1:length(cars$dist), model_1$residuals,
     main = "Residual plot",
     xlab = "no of observation",
     ylab = "residuals")
abline(h = mean(model_1$residuals))  # horizontal line at the mean residual
Let’s have a look at the plot that we have created. The points are scattered randomly, so we can say that no heteroscedasticity is present. That’s great. But suppose heteroscedasticity is present in our data. What would the plot look like then? Let’s see.
n <- rep(1:150, 2)
a <- 0
b <- 1
sigma2 <- n^1.3                                         # error variance grows with n
eps <- rnorm(length(n), mean = 0, sd = sqrt(sigma2))   # heteroscedastic errors
y <- a + b * n + eps
mod <- lm(y ~ n)
plot(n, y)
Statistical tests:
Next, we will discuss some more formal approaches for detecting heteroscedasticity. There are several statistical tests for this purpose. I will discuss two of them.
The Breusch-Pagan test is a pretty simple but powerful test. It can detect heteroscedasticity that depends on one or more of the independent variables. There are five steps to the Breusch-Pagan test.
Step 1: First, we run the regular regression model and collect the residuals.
Step 2: Then we estimate the variance of the residuals.
Step 3: We compute the squares of the standardized residuals, that is, each squared residual divided by the estimated variance.
Step 4: We fit another regression on all our independent variables, taking the squared standardized residuals as the dependent variable.
Step 5: We calculate the explained (regression) sum of squares of this auxiliary regression, divide it by 2 and compare the result with the critical value of the χ2 table for the appropriate degrees of freedom (the number of independent variables), or we use the p-value. If the p-value is less than the significance level, we reject the null hypothesis that the variances of the residuals are equal.
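The five steps can be sketched by hand in base R for the cars model. This is only an illustration: the variable names are assumptions, the classical statistic is taken as half the explained sum of squares of the auxiliary regression, and the studentized (Koenker) variant is computed as n times the R² from regressing the squared residuals on the regressors.

```r
# Step 1: fit the regression and collect the residuals
model_1 <- lm(dist ~ speed, data = cars)
e <- resid(model_1)
n <- length(e)

# Step 2: estimate the residual variance (RSS / n)
sigma2_hat <- sum(e^2) / n

# Step 3: squared standardized residuals
p <- e^2 / sigma2_hat

# Step 4: auxiliary regression of p on the independent variable(s)
aux <- lm(p ~ speed, data = cars)

# Step 5: half the explained sum of squares, compared with chi-square (df = 1 regressor)
ess <- sum((fitted(aux) - mean(p))^2)
bp_classic <- ess / 2
p_classic  <- pchisq(bp_classic, df = 1, lower.tail = FALSE)

# Studentized (Koenker) variant: n * R^2 of the regression of e^2 on speed
aux_sq <- lm(e^2 ~ speed, data = cars)
bp_student <- n * summary(aux_sq)$r.squared
p_student  <- pchisq(bp_student, df = 1, lower.tail = FALSE)

c(classic = bp_classic, studentized = bp_student)
```

If the lmtest package is installed, `bptest(model_1, studentize = FALSE)` should reproduce the classical value, while the default call `bptest(model_1)` reports the studentized one.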
Now in R, the task is very simple. Let’s have a look.
R-code:
> library(lmtest)
> bptest(model_1)

	studentized Breusch-Pagan test

data:  model_1
BP = 3.2149, df = 1, p-value = 0.07297
The p-value is 0.07297, which is greater than 0.05. So we can’t reject the null hypothesis. That is, we can conclude that there is no evidence of heteroscedasticity.
The second test is the Goldfeld-Quandt test, which proceeds as follows.
Step 1. First, we arrange the data in ascending order of the independent variable Xj.
Step 2. We omit the middle observations (approx. 20%) of the sorted data and fit two separate regressions, one for small values of Xj and one for large values of Xj, and record the residual sum of squares (RSS) for each regression, say RSS1 for small values of Xj and RSS2 for large values of Xj.
Step 3. Then we calculate the ratio F = RSS2/RSS1, which follows an F distribution with d.f. = [n – d – 2(k+1)]/2 both in the numerator and the denominator, where d is the number of omitted observations, n is the total number of observations, and k is the number of explanatory variables.
Step 4. We reject H0: the residuals’ variances are equal, if F > Fα,[n-d-2(k+1)]/2, or we can use the p-value to check.
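The steps above can be sketched as a small helper in base R. The function name `gq_stat` and its arguments are hypothetical, introduced only for this illustration. The statistic is computed as the ratio of the two residual mean squares, which equals RSS2/RSS1 when both halves have the same degrees of freedom.

```r
# Hypothetical helper: Goldfeld-Quandt statistic for a linear model
gq_stat <- function(formula, data, order_by, frac_omit = 0.2) {
  # Step 1: sort by the chosen independent variable
  data <- data[order(data[[order_by]]), ]
  n <- nrow(data)

  # Step 2: drop the middle d observations and fit the two regressions
  d  <- round(frac_omit * n)
  n1 <- floor((n - d) / 2)
  low  <- data[seq_len(n1), ]
  high <- data[(n - n1 + 1):n, ]
  fit1 <- lm(formula, data = low)
  fit2 <- lm(formula, data = high)

  # Step 3: ratio of residual mean squares (large-X half over small-X half)
  df1 <- df.residual(fit1)
  df2 <- df.residual(fit2)
  f <- (sum(resid(fit2)^2) / df2) / (sum(resid(fit1)^2) / df1)

  # Step 4: upper-tail p-value from the F distribution
  list(statistic = f, df = c(df2, df1),
       p.value = pf(f, df2, df1, lower.tail = FALSE))
}

gq_20 <- gq_stat(dist ~ speed, cars, order_by = "speed", frac_omit = 0.2)
gq_0  <- gq_stat(dist ~ speed, cars, order_by = "speed", frac_omit = 0)
gq_0$statistic
```

With `frac_omit = 0` the split is into two halves of 25 observations, giving (50 − 0 − 2·2)/2 = 23 degrees of freedom for each half. As far as I know, `gqtest()` in lmtest also omits no central observations by default, so this setting is the one to compare with its output.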
Now in R, we will write:
> library(lmtest)
> gqtest(model_1)

	Goldfeld-Quandt test

data:  model_1
GQ = 1.5512, df1 = 23, df2 = 23, p-value = 0.1498
alternative hypothesis: variance increases from segment 1 to 2
Again the p-value, 0.1498, is greater than 0.05, so there is no evidence of heteroscedasticity. So by the graphical method and by the statistical tests, we can conclude that our model is homoscedastic.
Once you find heteroscedasticity in your model, it’s mandatory to fix the issue. We can use a weighted least squares (WLS) regression model or a transformation of the dependent variable.
In weighted regression, we assign a weight to each data point based on the variance of its fitted value. We give small weights to observations associated with higher variances, so that their squared residuals count for less. Weighted regression then minimizes the sum of the weighted squared residuals. If we use the correct weights, heteroscedasticity is replaced by homoscedasticity. It’s a very good approach for removing heteroscedasticity.
Transforming the data is harder than WLS because it involves more manipulation. The result is also more difficult to interpret because the original units of the data are lost. What we do is transform our original data into different values that produce better residuals with similar variances.
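A minimal sketch of weighted least squares in R, using simulated data of the same kind as the heteroscedastic example above. The seed is an assumption, and so is the idea that the true error variance (here x^1.3) is known. In practice the weights usually have to be estimated.

```r
set.seed(42)  # assumed seed, for reproducibility only

x <- rep(1:150, 2)
sigma2 <- x^1.3                                   # error variance grows with x
y <- 0 + 1 * x + rnorm(length(x), mean = 0, sd = sqrt(sigma2))

ols <- lm(y ~ x)                                  # ordinary least squares
wls <- lm(y ~ x, weights = 1 / sigma2)            # small weights for noisy points

# Weighted residuals should have roughly constant spread
w_resid <- resid(wls) * sqrt(1 / sigma2)
spread_small_x <- sd(w_resid[x <= 75])
spread_large_x <- sd(w_resid[x > 75])

rbind(OLS = coef(ols), WLS = coef(wls))
c(small_x = spread_small_x, large_x = spread_large_x)
```

Both fits estimate a slope close to the true value of 1. The difference is that the weighted residuals of the WLS fit have a similar spread for small and large x, which is the homoscedasticity the weights are meant to restore.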
Author: Kuntal Roy Chowdhury
]]>