The post Neural Networks using H2o Package in R appeared first on StepUp Analytics.

An artificial neural network (or neural network for short) is a predictive model motivated by the way the brain operates. Think of the brain as a collection of neurons wired together. Each neuron looks at the outputs of the other neurons that feed into it, does a calculation, and then either fires or doesn’t.

Accordingly, artificial neural networks consist of artificial neurons, which perform similar calculations over their inputs. Neural networks can solve a wide variety of problems like handwriting recognition and face detection, and they are used heavily in deep learning, one of the trendiest subfields of data science. However, most neural networks are “**Black Boxes**”—inspecting their details doesn’t give you much understanding of how they’re solving a problem. And large neural networks can be difficult to train. For most problems you’ll encounter as a budding data scientist, they’re probably not the right choice. Graphically, a neural network looks like this:

Here the input layer consists of the independent variables, and the output layer is the variable of interest, i.e., the effect to be measured or checked. The hidden layers are user-defined and play a crucial role in the accuracy of the model. The greater the number of layers, the more complex our neural network model becomes; networks with multiple hidden layers are called deep neural networks. Deciding the number of layers, or the number of nodes in a layer, is quite a difficult task.

Unfortunately, there is no reliable rule to determine the number of neurons in the hidden layer. The appropriate number depends on the number of input nodes, the amount of training data, the amount of noisy data, and the complexity of the learning task, among many other factors. It is on the user to decide these parameters.

It has been proven that a neural network with at least one hidden layer of sufficient neurons is a universal function approximator. This means that neural networks can be used to approximate any continuous function to an arbitrary precision over a finite interval.

*Step by Step Analysis*

There are a couple of different ways to build neural networks in R, but the main focus of this blog will be using the H2o package for our analysis. We’ll be using the Concrete data throughout, which is easily available from the UCI Machine Learning Repository. We’ll follow a sequence of steps that differs a little from the standard workflow:

- Initialise the H2o package
- Read in the data
- Data pre-processing if required
- Convert data into H2o readable format
- Split the data into training and testing sets
- Check for the accuracy of the model on the test data

First, let’s read in the data from the CSV file and check its structure:

concrete <- read.csv("concrete.csv")

str(concrete)

So our data here has 1030 observations with 9 variables. Our variable of interest is the strength of concrete. We aim to perform a regression task, where we try to predict the strength of the concrete on the basis of the remaining variables.

Before we proceed to the analysis, we make some changes to the data to ensure robustness in the final model. We can clearly see that the data contains values of very different magnitudes, so we first need to normalize it:

normalise <- function(x){

return((x-min(x))/(max(x)-min(x)))

}

data <- as.data.frame(lapply(concrete,normalise))

str(data)

This now looks much better and cleaner. Moving on to the steps defined above, we begin our analysis using the H2o package.

install.packages("h2o")

library(h2o)

h2o.init(max_mem_size = "2G", nthreads = 2, ip = "localhost", port = 54321)

This way we initialize the H2o package. Next we convert the data into a format h2o can read. The conversion might take a few seconds, depending on your machine’s configuration; wait until it shows 100%. To check the first few rows you can use the head function.

d.hex <- as.h2o(data, destination_frame = "d.hex")

head(d.hex)

To split the dataset into training and test sets, we use h2o.splitFrame:

set.seed(99)

split <-h2o.splitFrame(data=d.hex,ratios = 0.75)

train <- split[[1]]

test <- split[[2]]

Now comes the main part of our analysis:

model_nn <- h2o.deeplearning(x = 1:8, y = "strength", training_frame = train, hidden = 5, model_id = "model_nn")

This code says that our independent variables are the first 8 columns and the dependent variable is “strength”. We use a single hidden layer with 5 nodes and assign “model_nn” as the model ID.

To check the performance of the model on the test data:

perf <- h2o.performance(model_nn,test)

perf

We get a few performance measures but we stick to only RMSE at this point. The next step is to make predictions and to check how these predictions match with the test data.

pred<- as.data.frame(h2o.predict(model_nn, test))

test1 <- as.data.frame(test)

cor(pred,test1$strength)

In order to find the correlation between the predicted values and the strength variable in the test data, we need to transform both into data frames first. **The correlation comes out to be 82%, which is quite good**, though it is lower than what the standard “neuralnet” method in R gives. You can always add more nodes and layers to check how accuracy changes and select a model that gives a better correlation value.
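Since the paragraph above suggests experimenting with more nodes and layers, here is a hedged sketch of a deeper variant. The layer sizes and the number of epochs are illustrative choices, not from the article, and the code assumes the running h2o cluster and the train/test frames created earlier:

```r
# Illustrative deeper model: two hidden layers of 10 nodes each.
# 'hidden = c(10, 10)' and 'epochs = 50' are assumed values for demonstration.
model_deep <- h2o.deeplearning(x = 1:8, y = "strength",
                               training_frame = train,
                               hidden = c(10, 10),
                               epochs = 50,
                               model_id = "model_deep")
h2o.performance(model_deep, test)
```

You can then compare its RMSE against the single-layer model’s to decide whether the extra capacity helps.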

This is how you can use the H2o package to build deep learning models in R.


The post Extending Stochastic Models for Autonomous Vehicle Systems appeared first on StepUp Analytics.

*The word “stochastic” comes from the Greek stokhazesthai, which means to aim at or to guess. A stochastic process, also called a random process, is one whose outcomes are uncertain. Randomness and the probability distribution of each outcome are the key elements of a stochastic process.*

*For example, suppose a cancer survey identifies 50 possible metastatic locations in the human body. By applying stochastic modelling to this random input data and increasing the number of trials, we can draw conclusions about how the cancer spreads, which organ is likely to be affected next, which organs remain unaffected, and ultimately what kind of treatment should be given to the patient.*

*Stochastic models help us solve complex problems that are random by nature.*

*A stochastic process has the following key building blocks:*

*A **random walk** is a process whose steps are random and unpredictable; it does not use any past data to predict the future.*

*The term “random walk” was coined by the mathematician Karl Pearson (1857–1936) in 1905. Random walks are very useful in algorithmic trading and have also been applied to air-traffic collision detection and safety.*

***Brownian motion** (or the Wiener process) was discovered by the biologist Robert Brown (1773–1858) while studying particle movements under a microscope, and was described mathematically by Norbert Wiener (1894–1964). It is a continuous-time stochastic process and a subset of the Lévy processes (stochastic processes with independent, stationary increments). As the number of trials grows, the apparent randomness settles into a pattern, so outcomes become predictable in distribution. It is widely used in finance, biochemistry, AI, etc.*

*The **Poisson process** is a stochastic process that counts the number of events occurring in a given interval of time, with inter-arrival times that are independent of one another. It is a continuous-time process and helps us compute the probability of a certain number of events happening in a fixed interval. It is named after Siméon Denis Poisson, who described it in 1838. For example, it can help an autonomous vehicle model the number of occurrences of traffic signs along a road and, together with the other stochastic tools below, predict what the next sign might be.*
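As a small, self-contained illustration (the rate of 3 signs per kilometre is an assumed number, not from the article), base R’s Poisson functions give these probabilities directly:

```r
# If traffic signs appear at an average rate of 3 per kilometre,
# the count seen over a 2 km stretch is Poisson with mean 6.
lambda <- 3 * 2

# Probability of seeing exactly 4 signs in that stretch:
dpois(4, lambda)

# Probability of seeing at most 4 signs:
ppois(4, lambda)
```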

*A **Markov chain** is a stochastic process that moves from one state to another, where the new state depends only on the current state and not on the historical states. This is known as the memoryless (Markov) property of a stochastic process. The transitions can be represented by a transition matrix. The Wiener and Poisson processes are special cases of the Markov process (continuous-time Markov chains). The Markov chain is named after the Russian mathematician Andrey Markov (1856–1922). Applying Markov-chain reasoning to cancer progression in the body could give us the following information:*

- *Which organ will be affected by cancer next?*
- *Which organs won’t be affected by cancer?*
- *What is the patient’s probability of surviving the cancer?*
- *What kind of treatment should be given to the patient?*

*This helps doctors give the correct treatment to patients.*

*This is why Markov chains suit experimentation with autonomous vehicle systems (AVS), where scenarios are randomized and behaviour changes with that randomness.*

*The **Markov decision process** (an extension of the Markov chain) is an example of reinforcement learning: given an input x and a reinforcement signal z, the system learns a function f that produces y = f(x), and the observed outputs y in turn help refine f.*

*Decisions at every step depend only on the current state.*

*I will explain the Markov decision process with the following example. Imagine the table below is a grid of places in a city, and an AVS needs to reach the Goal state from the Start state by executing actions. If you hit a boundary, you stay where you are: at (1,1), trying to go Up keeps you in place, as does trying to go Left, while going Right moves you to the next cell.*

*Actions: Up, Down, Left, Right*

*Question: What is the shortest sequence of actions from Start to Goal?*

*Answer 1: Up, Up, Right, Right, Right*

*Answer 2: Right, Right, Up, Up, Right*

*Both answers solve the problem, so we take the first one.*

*The Markov property: to know the near future (say, at time t+1), only the present information at time t matters. Given a sequence X1, X2, …, Xt:*

*The first-order Markov property says P(Xt | Xt−1, …, X1) = P(Xt | Xt−1); that is, Xt depends only on Xt−1, and therefore Xt+1 depends only on Xt.*

*The second-order Markov property says P(Xt | Xt−1, …, X1) = P(Xt | Xt−1, Xt−2); that is, Xt depends only on Xt−1 and Xt−2.*

*Key assumptions: only the present matters, and the process is stationary (the rules do not change).*

*A Markov decision process consists of:*

- *STATES: S*
- *MODEL: T(s, a, s’) ~ P(s’|s, a)*
- *ACTIONS: A(s), A*
- *REWARDS: R(s), R(s, a), R(s, a, s’)*
- *POLICY: π(s) → a, with π\* the optimal policy*

*STATES: S. In the grid example above there are 12 states. States are feature representations of data collected from the environment; they can be either discrete or continuous.*

*MODEL: T(s, a, s’) ~ P(s’|s, a). The model, or transition model, encodes the rules of the environment for reaching the goal state. It is a function of three variables, the current state s, the action a, and the new state s’, and gives the probability of landing in state s’ given that the agent takes action a in state s. It tells you what will happen if you do something in a particular place.*

*In a deterministic environment, the probability of landing in any state other than the determined one is zero.*

*For example, in a deterministic environment, if you take the action Up, you will certainly perform that action with probability 1.*

*Stochastic environment: if you take the same action, Up, there is a certain probability, say 0.8, of actually performing it, and a probability of 0.1 each of moving perpendicular to it (Left or Right). Here, for state s and action Up, the transition model gives T(s, Up, s’) = P(s’|s, Up) = 0.8.*

*It follows the first-order Markov property. We can also say that an autonomous vehicle operates in a stochastic environment: the AVS is composed of decision-making processes whose states are defined by position, speed, and other attributes, and each decision changes those states and so changes the behaviour of the AVS.*

*ACTIONS: A(s), A. An action can be performed in a particular state, e.g. A = {Up, Down, Left, Right}; the action space can be either discrete or continuous.*

*It can be treated as a function of the state, a = A(s): depending on the state, the function decides which actions are possible.*

*REWARDS: R(s), R(s, a), R(s, a, s’). The reward of a state quantifies the usefulness of entering it. There are three different but equivalent forms: R(s), R(s, a), and R(s, a, s’). Domain knowledge plays an important role in assigning rewards, since minor changes in the reward can change the optimal solution to an MDP problem. The reward is a scalar value.*

*Example of R(s): the goal state gets reward 1 and every non-goal state gets reward −1, i.e. R(Goal) = 1 and R(Not Goal) = −1.*

*POLICY: π(s) → a. The policy is a function that takes a state as input and outputs the action to take. It is the command the agent has to obey.*

*It is a guide telling which action to take in a given state. π\* is the optimal policy, the one that maximizes the expected reward.*
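To make these pieces concrete, here is a hedged sketch (not from the article) of value iteration on a toy four-state MDP, showing how an optimal policy π\* emerges from T(s, a, s’) and R(s). The states, rewards, and discount factor are all invented for illustration:

```r
# Toy MDP: states 1..4 on a line, state 4 is an absorbing goal.
n_states <- 4
actions  <- c("Left", "Right")
R  <- c(-1, -1, -1, 10)                        # reward R(s)
Tr <- array(0, dim = c(n_states, 2, n_states)) # Tr[s, a, s'] = P(s'|s, a)
for (s in 1:3) {
  Tr[s, 1, max(s - 1, 1)] <- 1                 # Left (bounded at state 1)
  Tr[s, 2, s + 1]         <- 1                 # Right
}
Tr[4, , 4] <- 1                                # goal state is absorbing
disc <- 0.9                                    # discount factor
V <- rep(0, n_states)
for (i in 1:100) {                             # value iteration
  Q <- sapply(1:2, function(a) R + disc * as.vector(Tr[, a, ] %*% V))
  V <- apply(Q, 1, max)
}
policy <- actions[apply(Q, 1, which.max)]      # greedy policy = pi*
policy[1:3]                                    # "Right" for every non-goal state
```

The learned policy moves Right toward the goal from every non-goal state, exactly the behaviour π\* should encode.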

*From these MDP properties, you get an idea of how to apply an MDP (reinforcement learning) to an autonomous vehicle. For example, an autonomous vehicle operates in an idealized grid-like city, with roads going north–south and east–west. The traffic signs are the states (S). The model, or transition model (T), tells the AV which signs are likely to appear next on the road, helping it make decisions.*

*If the AV detects a stop sign, it takes an action (A): release the throttle, decrease speed, and stop the vehicle. The AV gets rewarded (R) for each successful completion of a trip. From the rewards and penalties, the AV learns the optimal policy (π\*) for driving on city roads, obeying traffic rules correctly, and trying to reach the destination within a goal time.*

*What needs to be done to achieve AVS Level 4?*

- *Level 0: No Automation*
- *Level 1: Driver Assistance Required*
- *Level 2: Partial Automation Options Available*
- *Level 3: Conditional Automation*
- *Level 4: High Automation*
- *Level 5: Full Automation*

*Currently we’re at Level 3. Audi claims that the new A8 is the first production car to achieve Level 3 autonomy—not Level 4, as Motor Trend claims. The Audi AI traffic-jam pilot can take over the tedious job of creeping through highway traffic jams at speeds below 37 mph.*

*Compared with Level 3, Level 4 is highly automated. In Level 3, if the driver goes far above 37 mph, the car’s autonomy is ruled out and the driver needs to take responsibility for the car. At Level 4, autonomous vehicles will be able to handle most “dynamic driving tasks”, to use SAE International’s terminology.*

*This means a Level 4 car can handle most normal driving tasks on its own, but we still need driver intervention from time to time, for instance in poor weather. The following things need to be done to achieve AVS Level 4.*

*They are:*

- *A Level 3 autonomous car should understand SAE International’s terminology to move to Level 4.*
- *A Level 4 AVS is capable of performing all driving functions under certain conditions, so we need to introduce more real-world driving problems, with conditions, for the AV to solve in order to move from Level 3 to Level 4.*
- *Upgrade Level 3 AVS to Level 4 by introducing reinforcement learning: if the AVS already knows which traffic signs will appear next, it can anticipate and make decisions on its own, so the driver and other people in the car are effectively treated as cargo.*
- *In Level 3, in specific situations and environments where nothing interrupts the AV (like a highway), the human driver is free to do as they wish. In Level 4 this must be generalized so the AV can drive itself independently in most environments, with some exceptions for weather or unusual surroundings; a human may still need to take over at times. Introducing radar, LIDAR, GPS, digital cameras, and processors helps upgrade a Level 3 AV to Level 4.*

*Demonstration of how a Markov chain helps achieve AVS Level 4, considering one scenario: road-sign detection and action.*

*An autonomous vehicle detects traffic signs on the road using its camera and classifies them with the kNN algorithm; it then uses a Markov chain to predict the next traffic sign from the current one. The Markov transition matrix below is based on the Trichy–Madurai national highway in Tamil Nadu, India.*

*Taking high powers of the transition matrix gives the limiting distribution.*
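As a small self-contained illustration of that claim (using a 3-state chain invented here, not the 10-sign matrix below), repeated multiplication makes every row of the matrix power converge to the same limiting distribution:

```r
# A toy 3-state transition matrix (rows sum to 1).
P3 <- matrix(c(0.5, 0.3, 0.2,
               0.2, 0.6, 0.2,
               0.3, 0.3, 0.4), nrow = 3, byrow = TRUE)

# Compute P3^100 by repeated multiplication.
Pn <- diag(3)
for (i in 1:100) Pn <- Pn %*% P3

round(Pn, 4)  # all rows are (nearly) identical: the limiting distribution
```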

*The following R code simulates the autonomous vehicle over 100 kilometres. Over that distance the AVS detects traffic signs and estimates the probability of each sign appearing after the current one. We then increase the sample size to get more accurate probabilities, comparing the 100-step simulation with a million-step simulation.*

markov <- function(init,mat,n,labels) {

if (missing(labels)) labels <- 1:length(init)

simlist <- numeric(n+1)

states <- 1:length(init)

simlist[1] <- sample(states,1,prob=init)

for (i in 2:(n+1))

{ simlist[i] <- sample(states,1,prob=mat[simlist[i-1],]) }

labels[simlist]

}

P <- matrix(c(0,0.8,0.17,0.28,0.28,0,0.02,0.01,0.08,0.08,

0.08,0.02,0.1,0.22,0.18,0.02,0.3,0,0.08,0,

0.02,0.2,0,0.3,0.38,0,0.08,0,0.02,0,

0.22,0.4,0.02,0,0.08,0.09,0,0,0.11,0.08,

0.28,0,0.1,0.1,0.01,0.09,0.1,0.12,0.08,0.12,

0.4,0,0.2,0,0,0.04,0,0.3,0,0.06,

0,0.3,0.01,0,0.07,0.3,0,0.2,0.1,0.02,

0,0,0.2,0,0,0.45,0.25,0,0,0.1,

0,0,0.1,0.05,0,0,0.25,0,0.2,0.4,

0,0,0.1,0.05,0,0.01,0,0.37,0.33,0.14), nrow=10, byrow=TRUE)

lab <- c("Stop", "Speed_Limit", "Speed_Break", "Pedestrian", "School_Ahead", "Man_At_Work", "Narrow_Road", "Road_Wideness", "Hospital", "Petrol")

rownames(P) <- lab

colnames(P) <- lab

init <- c(1/10,1/10,1/10,1/10,1/10,1/10,1/10,1/10,1/10,1/10) # initial distribution

states <- c("St", "Sl", "Sb", "Pt", "Sch", "M", "Nr", "Rw", "H", "P")

# simulate chain for 100 steps

simlist <- markov(init,P,100,states)

simlist

table(simlist)/100

steps <- 1000000

simlist <- markov(init,P,steps,states)

table(simlist)/steps

*Output:*

simlist <- markov(init,P,100,states)

simlist

[1] “M” “St” “P” “Rw” “Sb” “H” “P” “H” “Nr” “M” “St” “Sb” “Sch” “St” “Sch” “St” “Sl” “Pt” “H”

[20] “Nr” “Sl” “Sch” “St” “Sl” “Sb” “Sch” “H” “P” “Rw” “M” “St” “Nr” “Sl” “M” “Rw” “M” “P” “P”

[39] “Rw” “M” “Rw” “M” “Rw” “M” “Sb” “Sl” “H” “H” “P” “Rw” “M” “Rw” “Nr” “M” “St” “Sb” “Sch”

[58] “M” “Sb” “Pt” “H” “P” “Rw” “Nr” “Rw” “Nr” “M” “St” “Sl” “Nr” “Sl” “Sch” “Nr” “Sl” “Pt” “St”

[77] “Sch” “Nr” “M” “St” “Sl” “Sb” “Sch” “St” “Sch” “P” “P” “H” “P” “H” “Sb” “Nr” “Sl” “Pt” “Sl”

[96] “Nr” “Sl” “Nr” “Rw” “P” “Rw”

table(simlist)/100

simlist

H M Nr P Pt Rw Sb Sch Sl St

0.09 0.13 0.12 0.11 0.04 0.12 0.08 0.09 0.12 0.11

steps <- 1000000

simlist <- markov(init,P,steps,states)

table(simlist)/steps

simlist

*This methodology helps the autonomous vehicle make decisions early by predicting upcoming traffic signs on the road; it is easier to act on information identified in advance.*

*How can the AVS also recognize road-sign images on the way to Level 4?*

*The autonomous vehicle uses cameras and OpenCV to recognize road signs with the help of machine learning algorithms. Here we use the k-nearest neighbours (kNN) algorithm in R to predict the traffic signs. The following steps explain how kNN helps the AV recognize road-sign images.*

**Recognizing a road sign with kNN**

After several trips with a human behind the wheel, it is time for the Autonomous Vehicle to attempt the test course alone. As it begins to drive away, its camera captures the following image:

Can you apply a kNN classifier to help the car recognize this sign?

**R Code:**

# Load the ‘class’ package

library(class)

library(tidyverse)

traffic_signs <- read_csv("E:/internship in stepup analytics/extened stochastic model for autonomous vehicle system/traffic_sign3.csv")

signtype<-traffic_signs$sign_type

nextsign <- traffic_signs[c(343),c(3:50)]

**# Classify the next sign observed**

knn(train = traffic_signs[-c(1:2)], test = nextsign, cl = signtype)

**Output:**


[1] stop

Levels: hospital man_at_work narrow_road pedestrian petrol road_wideness school speed speed_break stop

*We’ve trained our first nearest-neighbour classifier! The AV successfully identified the sign and stopped safely at the intersection. So how did the knn() function correctly classify the stop sign? The sign was, numerically, similar to another stop sign. kNN isn’t really learning anything; it simply looks for the most similar example.*

*Exploring the traffic sign dataset*

*To better understand how the knn() function was able to classify the stop sign, it may help to examine the training dataset it used. Each previously observed street sign was divided into a 4×4 grid, and the red, green, and blue levels of the 16 centre pixels were recorded, as illustrated here.*

*The result is a dataset that records the sign_type as well as 16 x 3 = 48 color properties of each sign.*

*R Code:*

*# Examine the structure of the signs dataset*

str(traffic_signs)

*# Count the number of signs of each type*

table(traffic_signs$sign_type)

*# Check r10’s average red level by sign type*

aggregate(r10 ~ sign_type, data = traffic_signs, mean)

*Output:*

aggregate(r10 ~ sign_type, data = traffic_signs, mean)

*As you might have expected, stop signs tend to have a higher average red value. This is how kNN identifies similar signs.*

*Classifying a collection of road signs*

*Now that the autonomous vehicle has successfully stopped on its own, we feel confident allowing the car to continue the test course. The test course includes 59 additional road signs divided into ten types:*

*At the conclusion of the trial, you are asked to measure the car’s overall performance at recognizing these signs.*

*R code:*

*# Use kNN to identify the test road signs*

signtypes2<-traffic_signs[c(traffic_signs$sample == “train”),c(2)]

signtypes<-t(signtypes2)

testsigns<-traffic_signs[c(traffic_signs$sample == “test”),c(3:50)]

trainsigns<-traffic_signs[c(traffic_signs$sample == “train”),c(3:50)]

signspred <- knn(train = trainsigns, test = testsigns, cl = signtypes)

*# Create a confusion matrix of the actual versus predicted values*

testsigns2<-traffic_signs[c(traffic_signs$sample == “test”),c(2:50)]

signsactual <- testsigns2$sign_type

table(signspred,signsactual)

*# Compute the accuracy*

mean(signspred == signsactual)

*Output:*

table(signspred,signsactual)

mean(signspred == signsactual)

[1] 0.9464286

*That Autonomous Vehicle is really coming along! The confusion matrix lets you look for patterns in the classifier’s errors.*

*There is a complex relationship between k and classification accuracy; bigger is not always better. With smaller neighbourhoods, kNN can identify subtler patterns in the data. So what is a valid reason for keeping k as small as possible (but no smaller)? A smaller k can pick up those subtler patterns.*

*Testing other ‘k’ values*

*By default, the knn() function in the class package uses only the single nearest neighbour. Setting the k parameter allows the algorithm to consider additional nearby neighbours, enlarging the collection of neighbours that vote on the predicted class.*

*Compare k values of 1, 7, and 15 to examine the impact on traffic sign classification accuracy.*

*R Code:*

*# Compute the accuracy of the baseline model (default k = 1)*

k_1 <- knn(train = trainsigns, test = testsigns, cl = signtypes)

mean(signsactual == k_1)

*# Modify the above to set k = 7*

k_7 <- knn(train = trainsigns, test = testsigns, cl = signtypes, k = 7)

mean(signsactual == k_7)

*# Set k = 15 and compare to the above*

k_15 <- knn(train = trainsigns, test = testsigns, cl = signtypes, k = 15)

mean(signsactual == k_15)

*Output:*

k_1 <- knn(train = trainsigns, test = testsigns, cl = signtypes)

mean(signsactual == k_1)

[1] 0.9464286

k_7 <- knn(train = trainsigns, test = testsigns, cl = signtypes, k = 7)

mean(signsactual == k_7)

[1] 0.9375

k_15 <- knn(train = trainsigns, test = testsigns, cl = signtypes, k = 15)

mean(signsactual == k_15)

[1] 0.7410714

*Which value of k gave the highest accuracy? k = 1, with an accuracy of 0.9464286.*

*Seeing how the neighbors voted*

*When multiple nearest neighbours hold a vote, it can be useful to examine whether the voters were unanimous or widely split.*

*For example, knowing the voters’ confidence in the classification could allow an autonomous vehicle to use caution whenever there is any chance at all that a stop sign is ahead.*

*Here, we will learn how to obtain the voting results from the knn() function.*

*R Code:*

*# Use the prob parameter to get the proportion of votes for the winning class*

signpred <- knn(train = trainsigns, test = testsigns, cl = signtypes, k = 7, prob = TRUE)

*# Get the “prob” attribute from the predicted classes*

signprob <- attr(signpred, "prob")

*# Examine the first several predictions*

head(signpred)

*# Examine the proportion of votes for the winning class*

head(signprob)

*Output:*

head(signpred)

[1] stop stop stop stop stop stop

Levels: hospital man_at_work narrow_road pedestrian petrol road_wideness school speed speed_break stop

head(signprob)

[1] 1 1 1 1 1 1

*Now you can get an idea of how certain our kNN learner is about its classifications.*

*Before applying kNN to a classification task, it is common practice to rescale the data using a technique like min-max normalization. The purpose is to ensure that all features contribute equally to the distance calculation; rescaling reduces the influence of extreme values on kNN’s distance function.*
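A minimal sketch of that rescaling step (the column names and values here are illustrative, not from the real signs dataset):

```r
# Min-max normalization: map each column to the [0, 1] range.
normalise <- function(x) (x - min(x)) / (max(x) - min(x))

# Toy colour-level columns standing in for the real r10/b10 features.
toy <- data.frame(r10 = c(10, 120, 250), b10 = c(5, 60, 200))
scaled <- as.data.frame(lapply(toy, normalise))
scaled  # every column now lies in [0, 1]
```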

*Conclusion*

*The Gartner Hype Cycle 2018 for autonomous vehicles shows that AV Level 4 is still under development and may take more than ten years. Currently, the Audi A8 introduced AV Level 3 in 2018. Through this article I have suggested one way of applying a stochastic model to an autonomous vehicle; it can be applied in various ways depending on the problem. The idea presented here is predicting the next traffic sign to appear on the road, and in future the AV can be developed to make decisions in such situations.*

*References*

1. Gartner Inc.
2. Brett Lantz, Data Scientist at the University of Michigan (Datacamp.com)
3. Introduction to Stochastic Processes with R by Robert P. Dobrow
4. https://en.wikipedia.org/wiki/List_of_stochastic_processes_topics
5. https://www.datasciencesociety.net/stochastic-processes-and-applications


The post Object Detection With CNN, RCNN And Fast RCNN appeared first on StepUp Analytics.

An image classification algorithm takes the entire image as input, identifies the object in the image, and outputs the class it belongs to. An object detection algorithm, by contrast, tells you which objects are present in the image and where they are: it outputs bounding boxes (x, y, width, height) indicating each object’s location. Object detection is used widely in face detection and vehicle detection; the simplest example is how Facebook detects faces in an image.

It can be used in stores to count the number of people each day and maintain some statistics on how crowded the store is on each day, which is the most crowded day and which is the least crowded day. Object detection is basically used to find out objects that belong to a particular class (vehicle, human being, cat, dog, etc) in an image.

In this article, we will see the overview of object detection using CNN and detailed explanation of RCNN and fast RCNN.

As seen in the above image, classification refers to assigning the object a class label; here the image contains a single object and the main goal is to name its class. In localization, the region containing the object is found and bounded by a box. Object detection is the combination of classification and localization. In real life, images contain several objects belonging to different classes, so a single label is not enough; hence object detection methods are used.

This is the basic structure of a convolutional neural network (CNN) used for image classification: the input image passes through various convolution and pooling layers, followed by fully connected layers, and the output is the predicted class along with its confidence.

**How to use CNN for object detection?**

Divide the input image into separate regions. Each of these regions is then treated as a separate image, fed as input to the CNN, and classified. Finally, combine all these regions to get the original image with the detected regions.

The advantage of this method is that it is quick: it rapidly divides the image into several regions. However, the content of the image is not taken into account when forming the regions; the image is simply cut into fixed-size pieces. Since objects can have different aspect ratios and spatial locations, a very large number of regions is required, which makes the approach computationally intensive.
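A toy sketch of the fixed-size split described above (the 8×8 "image" matrix and the 4×4 region size are invented for illustration); each extracted region would then be classified separately:

```r
# An 8x8 stand-in for an image, cut into four fixed-size 4x4 regions.
img <- matrix(1:64, nrow = 8)
region_size <- 4

regions <- list()
for (i in seq(1, nrow(img), by = region_size)) {
  for (j in seq(1, ncol(img), by = region_size)) {
    regions[[length(regions) + 1]] <-
      img[i:(i + region_size - 1), j:(j + region_size - 1)]
  }
}

length(regions)  # 4 regions, each of fixed size 4x4
```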

As seen in the plain CNN approach, far too many regions are required for object detection. To solve this, a region-proposal algorithm is first used to find the most promising regions in the image. The selective search algorithm is the most commonly used one.

Selective search algorithm.

Selective search is a fast region-proposal algorithm. It groups similar regions based on colour, texture, size, and shape using a graph-based segmentation method. The output image below shows the segments, but these segments cannot be used directly as region proposals because:

1. As seen in the image, some objects in the input image may span two or more segments.

2. Region proposals cannot be created for objects that are partly covered by other objects, e.g. a cup filled with coffee.

Thus, it uses an over-segmentation method: the objects segmented from the background are further segmented into sub-components.

Now the selective search algorithm adds bounding boxes to the region proposals and groups adjacent segments. This step is repeated over iterations so that smaller segments are combined into larger ones. This is known as hierarchical segmentation.

This algorithm extracts about 2000 regions per image. Thus, instead of a huge number of candidate windows, we work with just 2000 region proposals.

- Take the input image.
- Find the regions of interest (RoI) using the selective search algorithm.
- Reshape these regions to the fixed size required by the CNN; they act as input to a pre-trained CNN (e.g. AlexNet).
- In the final layer, SVMs detect whether an object is present and, if so, classify it into the appropriate class.
- Apply bounding-box regression.

In the SVM layer above, each region proposal is classified into a class. To get a tighter bounding box around the object, linear regression is then applied to the region proposal.

Problems in RCNN

- Since every image has 2000 region proposals, the CNN has to extract features from each region proposal.
- There are 3 models in this algorithm:
- CNN for feature extraction
- Linear SVM classifier for identifying objects
- Regression model for refining the bounding boxes

This makes RCNN very slow; it cannot be used on real-world datasets in real time.

**What makes RCNN slow?**

Running the CNN 2000 times per image makes RCNN computationally intensive. Fast RCNN removes this bottleneck: it passes the whole input image through the CNN once to get a convolutional feature map, and the region proposals are projected onto this feature map. These region proposals are then pooled (usually max pooling); this pooling layer is called RoI (Region of Interest) pooling. Thus, 2000 CNN passes per image are reduced to just one.

**Steps**

- Take the input image.
- Pass it through a ConvNet, which generates the RoIs.
- Apply RoI pooling to each region; this reshapes each region into a fixed size.
- Pass the regions through fully connected layers, which classify them and return the bounding boxes. A softmax layer and a linear regression model are used simultaneously for this.

**Problems with Fast RCNN**

- Even though only one image is passed through the CNN instead of 2000 regions per image, Fast RCNN still uses the slow selective search approach to find the RoIs.
- Although it is better than RCNN, it is still too slow to use on real-world datasets in real time.

The post Object Detection With CNN, RCNN And Fast RCNN appeared first on StepUp Analytics.

The post Detailed Introduction to Recurrent Neural Networks appeared first on StepUp Analytics.

For more details on Neural Networks, read our previous article on **Neural Networks**.

Now it’s time to start our journey, where we will look in depth at Recurrent Neural Networks.

**In this article we’ll cover:**

- What is a Recurrent Neural Network
- Working of the Recurrent Neural Network
- Backpropagation in RNNs
- Disadvantages of RNNs – the Vanishing and Exploding Gradient Problems
- Introduction to LSTM Networks
- Variants of Recurrent Neural Networks
- Implementation of an RNN
- Applications of Recurrent Neural Networks

**What is a Recurrent Neural Network?**

We as human beings have the ability to remember things, and using that ability we can make decisions in our daily lives. Recurrent Neural Networks have this same ability.

Recurrent Neural Networks are one of the most powerful and robust types of neural networks. What makes RNNs distinctive is their internal memory.

*This figure shows an example of a Recurrent Neural Network, where the output is fed back into the input*

You might assume that Recurrent Neural Networks are a recent invention, but this is not true: they are quite old, having first been introduced in the 1980s. Their real potential has only gained noteworthy attention in the recent past, thanks to the rise in computing power and advances in network design.

As mentioned earlier, RNNs have an internal memory which makes them capable of remembering things about the inputs they receive. Through this they can predict what comes next, and they can gain a profound understanding of a sequence and its context.

For all these reasons, RNNs easily handle sequential data like time series, speech, text, audio/video and much more.

Before we understand the working of RNNs, let’s discuss what sequential data is all about. Sequential data is ordered data in which related things follow each other to form a series or sequence. Examples of sequential data are DNA sequences, audio/video data and time series data (in which we have a series of data points ordered in time), among many other kinds.

*Figure 2: Example of Time Series Data which is a type of sequential data and input to RNN*

In an RNN the information loops continuously. The network combines the present input with the previous output to produce the new output.

From the above diagram, it is evident that in an RNN the output generated is fed back as part of the next input. This unique ability of RNNs to remember exactly what they have previously seen helps in the prediction of various kinds of sequential data.

At the same time, Feed-Forward Neural Networks do not have any kind of memory of the inputs obtained earlier, because of which they are poor at dealing with data ordered in time. The only thing they remember is the training performed on them.

In an RNN, we apply weights to the two inputs, i.e. the current input and the recent past. Along with this, the change in weights required to reduce the error during training is made through gradient descent and Backpropagation Through Time.
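The "two inputs, two sets of weights" idea can be sketched in NumPy as a single recurrent step (the sizes, names and weight initialization here are illustrative, not from the article's model):

```python
import numpy as np

# Minimal sketch of one recurrent step: the new hidden state mixes the
# current input x_t with the previous hidden state h_prev through two
# separate weight matrices, Wx and Wh.
rng = np.random.default_rng(0)
n_in, n_hidden = 3, 4
Wx = rng.standard_normal((n_hidden, n_in)) * 0.1      # input -> hidden weights
Wh = rng.standard_normal((n_hidden, n_hidden)) * 0.1  # hidden -> hidden weights
b = np.zeros(n_hidden)

def rnn_step(x_t, h_prev):
    # tanh squashes the combination of current input and past state.
    return np.tanh(Wx @ x_t + Wh @ h_prev + b)

h = np.zeros(n_hidden)  # the memory starts empty
for x_t in [np.ones(n_in), np.zeros(n_in), np.ones(n_in)]:
    h = rnn_step(x_t, h)  # h carries information forward between steps
print(h.shape)  # (4,)
```

The same `Wx` and `Wh` are reused at every timestep; only `h` changes, which is exactly the internal memory described above.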

Having learned the way recurrent neural networks work, it’s time to look at how RNNs are trained on the data provided to them. While training an RNN we have to answer questions such as: how do we decide the weights for each connection? How do we initialize the weights for the hidden units?

To answer these and many more such questions, we use backpropagation of error and gradient descent. But here comes the catch: we cannot use the backpropagation of error used for Feed-Forward Neural Networks.

*Figure 3: Unrolled Recurrent Neural Network consisting of sequences of neural networks*

The reason why we cannot use traditional backpropagation is that RNNs are cyclic graphs, whereas feed-forward networks are directed acyclic graphs. Because of this structure of feed-forward networks, we are able to calculate the error from the layer above, but as RNNs have a different structure, we cannot calculate the error with the same method.

Now, to perform backpropagation in an RNN, we must unroll it so that we can view it as a sequence of neural networks. In the above diagram, we can see that the right side consists of a series of neural networks where the error at the present timestep depends on the previous timestep.

Therefore, Backpropagation Through Time (BPTT) allows the error to be backpropagated from the last to the first timestep while we unroll the RNN. Using this method, we can calculate the error for each timestep, and through this we update the weights and reduce the error with each epoch.

The disadvantage of BPTT is that it can get computationally expensive when the number of timesteps is high.

The two disadvantages of RNN are as follows:

**1. Exploding Gradient**

We are aware of the term gradient from traditional neural networks. In simple terms, the gradient tells us how much the output of a function changes when we change its inputs.

In RNNs, the algorithm can assign an absurdly large importance to the weights without any justification: because the same weights are multiplied in at every timestep, the gradients can grow exponentially as they are propagated back. To counter this problem, we can truncate or squash (clip) the gradients.
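The truncating/squashing fix mentioned above is usually implemented as gradient clipping; here is a minimal sketch (the threshold of 5.0 is chosen arbitrarily for illustration):

```python
import numpy as np

# Sketch of gradient clipping by norm: if the gradient's norm exceeds a
# threshold, rescale the gradient so its norm equals that threshold.
def clip_gradient(grad, max_norm=5.0):
    norm = np.linalg.norm(grad)
    if norm > max_norm:
        grad = grad * (max_norm / norm)
    return grad

g = np.array([30.0, 40.0])               # norm 50, an "exploded" gradient
print(np.linalg.norm(clip_gradient(g)))  # 5.0 after clipping
```

Small gradients pass through unchanged; only exploding ones are rescaled, so the update direction is preserved while its magnitude is bounded.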

**2. Vanishing Gradient**

A vanishing gradient occurs when the values of the gradient become too small: the model’s learning halts, or the time taken for learning becomes far too long.

For solving this issue, the idea of LSTM (Long Short-Term Memory) is used which was proposed by Sepp Hochreiter and Juergen Schmidhuber.

**Introduction to LSTM Network**

LSTM, or the Long Short-Term Memory Network, is an extension of the Recurrent Neural Network. We are well aware that RNNs are good to use when we have sequential data, but we have also looked at the disadvantages they face. To overcome these issues, we have the LSTM network.

LSTM resolves both issues and is capable of training over long sequences, which is not feasible in plain recurrent neural networks. LSTM units are used as building blocks of an RNN, resulting in an LSTM network. LSTM networks are analogous to computers, as they can read from, write to and delete from their stored memory.
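The read/write/delete analogy can be made concrete with a toy single LSTM step (illustrative sizes and random weights; this is our sketch of the standard gate equations, not the article's Keras model):

```python
import numpy as np

# Illustrative single LSTM step: the forget, input and output gates decide
# what the cell "deletes", "writes" and "reads" from its memory c.
def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(1)
n = 4  # hidden/cell size; the input size is kept equal for brevity
W = {gate: rng.standard_normal((n, 2 * n)) * 0.1 for gate in "fiog"}

def lstm_step(x, h, c):
    z = np.concatenate([x, h])   # current input and previous hidden state
    f = sigmoid(W["f"] @ z)      # forget gate: what to erase from memory
    i = sigmoid(W["i"] @ z)      # input gate: what to write
    o = sigmoid(W["o"] @ z)      # output gate: what to read out
    g = np.tanh(W["g"] @ z)      # candidate memory content
    c = f * c + i * g            # update the cell memory
    h = o * np.tanh(c)           # expose part of the memory as output
    return h, c

h = c = np.zeros(n)
h, c = lstm_step(np.ones(n), h, c)
print(h.shape, c.shape)
```

Because the memory `c` is updated additively (gated by `f` and `i`) rather than repeatedly squashed, gradients flow over many more timesteps than in a plain RNN.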

Now it’s time to dig deep and try to implement a recurrent neural network using Keras.

**Keras** is a high-level API written in Python for implementing neural networks, with the ability to run on top of TensorFlow or Theano.

**For Installing Keras you can opt for the following code:**

Before installing Keras, we have to install TensorFlow. This is because Keras acts as a wrapper around TensorFlow/Theano for building much more complicated Deep Learning models, and it provides an easy interface to interact with.

pip install tensorflow

**Now installing Keras with the following code:**

pip install keras

It’s time to commence the implementation of our very own recurrent neural network model.

**Step 1:**

import numpy as np
from keras.models import Sequential
from keras.layers import Dense
from keras.layers import Dropout
from keras.layers import LSTM
from keras.utils import np_utils

Here we begin by importing the required libraries. We have imported the numpy library for performing mathematical operations and for structuring the input, output and data labels.

After this, we have imported some specific functions of Keras used for building our RNN. The specific role of different functions will be discussed in later steps.

**Step 2:**
For the actual code implementation, we will require input data. Here the input data is a monologue from Othello. You can get the text file from here. Remember to save the text file in the same directory where the Python/Jupyter notebook is kept.

#Reading the data and converting it into lower case
data = open("Othello.txt").read().lower()
#Now we will sort the data which is obtained in the form of a list
chars = sorted(list(set(data)))
#Now we are counting the total number of characters
totalChars = len(data)
#Number of unique chars
numberOfUniqueChars = len(chars)

The input data available is plain text, and we want to convert it into a form compatible with Keras. For this reason, we convert the text into lowercase, which is a form of normalization.

After this, we create a sorted() list of the characters in the text, and store the total number of characters in the dataset in ‘totalChars’. Lastly, we store the number of unique characters.

**Step 3:**
To represent each character as a number, we will use dictionaries.

#For better results we are assigning an Id to each character
CharsForids = {char:Id for Id, char in enumerate(chars)}
#This is the opposite of the above
idsForChars = {Id:char for Id, char in enumerate(chars)}
#Here we are deciding the number of characters learned, i.e. the timestep
numberOfCharsToLearn = 100

First, we create a dictionary where each character is a key represented by a number. In the next line of code, we do the opposite of the previous line. Lastly, we decide the number of characters to be trained in one timestep, i.e. one training example.

**Step 4:**
The following counter determines how many 100-character windows we can extract; we stop 100 characters before the end so the loop does not run out of range.

#Since our timestep sequence represents a process for every 100 chars, we stop
#100 chars before the end, or there will be an index-out-of-range error
counter = totalChars - numberOfCharsToLearn

Now we create empty lists for storing the formatted data: the input in ‘charX’ and the output in ‘y’.

#Here we are storing the input data
charX = []
#Here we have stored the output data
y = []
#This loops through all the characters in the data, leaving out the last 100
for i in range(0, counter, 1):
    #This one goes from 0-100, so it gets 100 values starting from 0 and stops
    #just before the 100th value
    theInputChars = data[i:i+numberOfCharsToLearn]
    #If we do not use ':' we start with 0, and so we get the actual 100th value
    #Essentially, the output chars is the next char in line for those 100 chars in X
    theOutputChars = data[i + numberOfCharsToLearn]
    #Through this we are appending 100 chars as Ids to the list charX
    charX.append([CharsForids[char] for char in theInputChars])
    #For every 100 values there is one y value, which is the output
    y.append(CharsForids[theOutputChars])

‘theInputChars’ stores the first 100 input characters; the loop then slides forward one character at a time, taking the next window of 100 input characters, and this continues over the rest of the input.

‘theOutputChars’ stores only one character: the next character after the final character in ‘theInputChars’.

Lastly, in the ‘charX’ list we append the 100 integers processed in the iteration; these integers represent the IDs of the input characters. Along with this, the integer ID of the single character in ‘theOutputChars’ is appended to the ‘y’ list as the output.

**Step 5:**
After the above steps, we want to put the data in the correct form for Keras.

First, we reshape the input array, where the three dimensions represent ‘samples’, ‘time-steps’ and ‘features’; this form is required by Keras.

#len(charX) represents how many of those time steps we have
#Our features are set to 1 because in the output we are only predicting 1 char
#Finally, numberOfCharsToLearn is how many characters we process
X = np.reshape(charX, (len(charX), numberOfCharsToLearn, 1))

For efficient and effective results we are normalizing the data.

#For normalizing
X = X/float(numberOfUniqueChars)
#This sets it up for us so we can have a categorical (#feature) output format
y = np_utils.to_categorical(y)
print(y)

**Output:**
This is the categorical form of the output.

Now we transform ‘y’ into a one-hot vector. A **one-hot vector** is an array of 0s and 1s, in which a 1 occurs only at the position of the true ID. The same process is followed for every example, each vector having a length of ‘numberOfUniqueChars’.
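As a small illustration of the one-hot encoding that `np_utils.to_categorical` performs, here is a plain NumPy equivalent with a made-up toy ID list:

```python
import numpy as np

# Each ID becomes a vector of zeros with a single 1 at the ID's position.
ids = [2, 0, 1]      # toy character IDs
num_classes = 3      # stands in for numberOfUniqueChars
one_hot = np.zeros((len(ids), num_classes))
one_hot[np.arange(len(ids)), ids] = 1  # set the 1 at each ID's column
print(one_hot)
```

Row 0 has its 1 in column 2, row 1 in column 0, and so on, which is exactly the categorical output format the model's softmax layer predicts over.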

**Step 6:**
Finally, it’s time to build our Recurrent Neural Network model.

model = Sequential()
#Since we know the shape of our data we can input the timestep and feature data
#The number of timestep sequences are dealt with in the fit function
model.add(LSTM(256, input_shape=(X.shape[1], X.shape[2])))
model.add(Dropout(0.2))
#Number of features on the output
model.add(Dense(y.shape[1], activation='softmax'))
model.compile(loss='categorical_crossentropy', optimizer='adam')
model.fit(X, y, epochs=100, batch_size=128)
model.save_weights("Othello.hdf5")
# model.load_weights("Othello.hdf5")

In the above code snippet, **Line 1** uses Sequential(), which is imported from Keras. It creates an empty template model used for building the RNN.

In the next line, we add the first layer to the empty template model. This layer is an LSTM layer containing 256 units, with ‘input_shape’ as one of its parameters.

The ‘Dropout’ import ensures that the overfitting which occurs frequently in RNNs is kept to a minimum. To restrict overfitting, this function randomly selects neurons and ignores them during training. Here ‘Dropout’ is given the parameter ‘0.2’, which means 20% of the neurons will be dropped.

‘Dense’ is used for getting the output in the form of a layer of any neural network/recurrent neural network.

Using the ‘add’ function, this layer acts as the output layer. Here the activation is applied to the dot product of the weights and inputs, along with a bias.

Now in the configuration settings, we set the loss function to ‘categorical_crossentropy’ and the optimizer to ‘adam’.

With the fit() function, we run the training algorithm. Here ‘epochs’ specifies the number of times the batches are to be evaluated. In this tutorial the number of epochs is taken as 100; if you want, you can change the number of epochs and see what different results you get. With the batch size, we specify the number of input examples we want to evaluate at a time. Here it is set to 128, i.e. the first 128 examples go in as input, then the next 128, and this continues for the whole dataset.

At last, the training is completed and we can save the weights. Moreover, we can also load the previously trained weights.

**Output:**
The following is the output expected when we start the training process; it will take some time. You can see that each of the 100 epochs is executed and that the error, i.e. the loss value, continuously decreases, which is a good sign.

Once this computing is finished, the code will now jump to the code for predicting the text. So let’s have a look at it.

**Step 7:**
Initially, in **Lines 1** and **2**, we pick a random index and use it to select a random starting sequence from ‘charX’.

With **Line 3** we initiate a loop that runs 500 times; this value can be changed to see different results. The value 500 means we are generating 500 characters through this loop.

Using **Line 4**, we reshape the current sequence into a data example used for predicting the next character. After normalizing it in **Line 5**, we supply it to the prediction model in **Line 6**.

randomVal = np.random.randint(0, len(charX)-1)
randomStart = charX[randomVal]
for i in range(500):
    x = np.reshape(randomStart, (1, len(randomStart), 1))
    x = x/float(numberOfUniqueChars)
    pred = model.predict(x)
    index = np.argmax(pred)
    randomStart.append(index)
    randomStart = randomStart[1: len(randomStart)]
print("".join([idsForChars[value] for value in randomStart]))

In **Line 7** we get the index of the next predicted character. In **Lines 8** and **9** we append the predicted character to the current sequence, which gives 101 characters, so we omit the first character to keep the latest 100 characters.

Lastly, we loop until 500 characters have been generated and print out the generated text by converting the IDs back to their designated characters.

**Output:**

This is the output text predicted by the model; it will vary with the changes you make to the model. The generated prediction is not very convincing, but as our model is very basic, this result is to be expected.

**Natural Language Processing**
There have been numerous models built that can represent a language model. These models have been capable of generating poems after being trained on large inputs that were themselves poems.

**Language Modelling and Generating Text**
Prediction of texts and words is done using input sequences of text/sentences.

**Language Translation**
Many famous applications like Google Translate and Duolingo use this technique to translate one language into another.

**Image/Video Tagging**
Here we identify the different objects present in each frame of an image/video. For its implementation, an RNN is combined with a CNN to get the desired output, but this application is still in its early days.

I hope you have enjoyed and learned a lot from this article.

The post Detailed Introduction to Recurrent Neural Networks appeared first on StepUp Analytics.

The post Application of Reinforcement Learning appeared first on StepUp Analytics.

Once we have an understanding of how the world works, we can use our knowledge to accomplish specific goals. This learning from interaction is known as reinforcement learning.

*Reinforcement learning is an area of machine learning concerned with how software agents ought to take actions in an environment so as to maximize some notion of cumulative reward.* (Source: Wikipedia)

In the field of reinforcement learning, we refer to the learner or decision maker as the agent. The set of conditions it is provided with is referred to as the environment, and the response given to the learner is termed the reward.

The agent performs certain actions in the environment, which can have a positive or negative effect, as decided by the interpreter. The interpreter, based on the efficiency of the action, provides a positive or negative reward to the agent and takes it to the next stage. The goal of the agent is to maximize the total positive reward. It remembers its past actions and acts accordingly so as to maximize the total reward.
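The agent-environment-reward loop described above can be sketched with a made-up toy environment (the `step` function, state and action names are purely illustrative):

```python
import random

random.seed(0)

# Toy environment: action "good" earns +1, "bad" earns -1.
# A real environment would also change its state in response to the action.
def step(state, action):
    reward = 1 if action == "good" else -1
    return state, reward

total_reward = 0
state = "start"
for _ in range(10):
    action = random.choice(["good", "bad"])  # the agent picks an action
    state, reward = step(state, action)      # the environment responds
    total_reward += reward                   # the agent accumulates reward
print(total_reward)
```

Here the agent acts purely at random; a learning agent would instead use the accumulated rewards to prefer "good" over "bad" in future steps.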

Let’s think of the agent as a small puppy born into the world without any understanding of how anything works. Now say that the owner communicates to the puppy how he would like it to behave. The puppy hears the command and, based on that observation, is expected to choose how to respond. Of course, it has a vested interest in responding appropriately to get a treat (which is the reward).

But it doesn’t know what any of the actions do yet, or what effect they will have on the world, so it has to try them out and see what happens. At this point it has no reason to favor any action in response to the owner’s command, so it chooses an action at random. After taking the action, it waits for a response: feedback from its owner.

If it does what it was commanded to do, it receives a reward in the form of a treat. But if it doesn’t, it receives a negative reward in the form of a scolding. In general, its aim is to get the maximum number of treats from the owner. It may take the puppy some time to get an idea of what is happening, but it should be able to figure it out eventually. The same situation applies to a reinforcement learning agent.

It interacts with the environment and eventually figures out how to gain maximum rewards. The agent explores all potential hypotheses to choose actions for maximizing the rewards, rather than exploiting limited knowledge about what is already known and should work well.

The applications of reinforcement learning are numerous and diverse, ranging from self-driving cars to board games. One of the major breakthroughs in machine learning in the 90s was TD-Gammon, an algorithm that used RL to play backgammon. More recently, the RL-trained agent AlphaGo was able to beat professionals at Go, another very complicated game.

Jumping to a completely different domain, RL is also used in robotics. For instance, it is used to teach robots to walk. RL is successfully used in self-driving cars, ships, and airplanes. It is even used in finance, biology, telecommunication, and various other businesses.

**MDPs and One-Step Dynamics**
Markov Decision Processes are used to rigorously define an RL problem.

- The **state space** S is the set of all (*non-terminal*) states.
- In episodic tasks, we use S+ to refer to the set of all states, including terminal states.
- The **action space** A is the set of possible actions. (Alternatively, A(*s*) refers to the set of possible actions available in state *s* ∈ S.)
- The **return** at time step *t* is *Gt* = *Rt*+1 + *Rt*+2 + *Rt*+3 + …
- The agent selects actions with the goal of maximizing the expected (discounted) return.
- The **one-step dynamics** of the environment determine how the environment decides the state and reward at every time step.

A **(finite) Markov Decision Process (MDP)** is defined by:

- a (finite) set of states S (or S+, in the case of an episodic task)
- a (finite) set of actions A
- a set of rewards R
- the one-step dynamics of the environment
- the discount rate *γ* ∈ [0,1]

The discounted return at time step *t* is *Gt* = *Rt*+1 + *γRt*+2 + *γ*²*Rt*+3 + ….

The discount rate *γ* is something that you set to refine the goal that you have for the agent. It must satisfy 0 ≤ *γ* ≤ 1. If *γ* = 0, the agent only cares about the most immediate reward. If *γ* = 1, the return is not discounted. For larger values of *γ*, the agent cares more about the distant future. Smaller values of *γ* result in more extreme discounting, where – in the most extreme case – the agent only cares about the most immediate reward.
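A quick numeric check of the discounted-return formula on a made-up reward sequence shows the effect of the extreme values of γ:

```python
# G_t = R_{t+1} + γ R_{t+2} + γ² R_{t+3} + ...
def discounted_return(rewards, gamma):
    return sum((gamma ** k) * r for k, r in enumerate(rewards))

rewards = [1, 1, 1, 1]                    # four future rewards of 1
print(discounted_return(rewards, 1.0))    # 4: undiscounted, all rewards count
print(discounted_return(rewards, 0.0))    # 1: only the immediate reward counts
print(discounted_return(rewards, 0.5))    # 1 + 0.5 + 0.25 + 0.125 = 1.875
```

As γ moves from 0 toward 1, the later rewards contribute more and more to the return, exactly as described above.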

Let’s understand MDPs with help of an example. Consider a recycling robot that runs on battery. It picks up cans scattered all around the room when it has sufficient battery. Whenever the battery is low, the robot is supposed to go to its docking station to recharge itself. There are various states and actions possible for the robot.

If it has a high battery, it can clean the room or wait in a situation of already cleaned room. If it has a low battery, it has to recharge itself. If it keeps functioning in spite of a low battery, it can halt at any point of time and human intervention would be required to take it to its docking station. Based on these, we can define an MDP for this robot in the following fashion:

**States:** {HIGH, LOW}
**Actions:** {SEARCH, RECHARGE, WAIT}

Reward:

The agent looks for the best policies to maximize the reward.

A **deterministic policy** is a mapping *π*:S→A. For each state *s *∈ S, it yields the action* a *∈ A that the agent will choose while in state s.

A **stochastic policy** is a mapping π:S×A→[0,1]. For each state* s *∈ S and action *a *∈ A, it yields the probability *π*(*a*∣*s*) that the agent chooses action *a* while in state *s*.

As in the example of the recharging robot, a deterministic policy would say exactly which action the robot takes in each state: recharge when in the LOW state, and search or wait when in the HIGH state. A stochastic policy, instead, gives the probability that the robot will recharge when in the LOW or HIGH state, and the probability that it will search in the LOW/HIGH state.
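The deterministic/stochastic distinction for the recycling robot can be sketched as follows (the probability numbers are made up for illustration; states and actions follow the MDP above):

```python
import random

random.seed(0)

# A deterministic policy maps each state to exactly one action.
deterministic_policy = {"HIGH": "SEARCH", "LOW": "RECHARGE"}

# A stochastic policy gives a probability for each action in each state.
stochastic_policy = {
    "HIGH": {"SEARCH": 0.8, "WAIT": 0.2},
    "LOW": {"RECHARGE": 0.7, "SEARCH": 0.2, "WAIT": 0.1},
}

def act(policy, state):
    # Sample an action according to the policy's probabilities.
    actions, probs = zip(*policy[state].items())
    return random.choices(actions, weights=probs)[0]

print(deterministic_policy["LOW"])    # always RECHARGE
print(act(stochastic_policy, "LOW"))  # RECHARGE most of the time
```

The deterministic policy is just the special case of a stochastic one where each state puts probability 1 on a single action.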

The state-value function for a policy *π* is denoted *vπ* . For each state* s *∈ S, it yields the expected return if the agent starts in state *s* and then uses the policy to choose its actions for all time steps.

That is,* vπ* (*s*)≐E*π* [*Gt* ∣*St* =*s*]. We refer to* vπ* (*s*) as the value of state *s* under policy** ***π*. E*π* [⋅] is defined as the expected value of a random variable, given that the agent follows policy *π*.

A policy *π*′ is defined to be better than or equal to a policy *π* if and only if* vπ*′ (*s*)≥*vπ* (*s*) for all* s *∈ S. An **optimal policy ***π*∗ satisfies *π*∗ ≥*π* for all policies *π*. An optimal policy is guaranteed to exist but may not be unique. All optimal policies have the same state-value function *v*∗, called the optimal state-value function.

The action-value function for a policy *π* is denoted* qπ*. For each state* s *∈ S and action *a *∈ A, it yields the expected return if the agent starts in state *s*, takes action *a*, and then follows the policy for all future time steps. We refer to* qπ* (*s*, *a*) as the value of taking action *a* in state *s* under a policy *π*. All optimal policies have the same action-value function, called the optimal action-value function.

For example, suppose a person has to reach to point B from a point A. There can be several paths between these two points. But there is only one path that takes the shortest time to reach. When the user takes minimum time to travel between the points, we can say that he has used an optimal policy by choosing the best path possible.

Mainly two families of algorithms are used to solve RL problems, namely Monte Carlo methods and Temporal-Difference methods.

Algorithms that solve the **prediction problem** determine the value function *vπ* (or* qπ* ) corresponding to a policy *π*. Methods that evaluate a policy *π* from interaction with the environment fall under one of two categories:

- **On-policy** methods have the agent interact with the environment by following the same policy *π* that it seeks to evaluate (or improve).
- **Off-policy** methods have the agent interact with the environment by following a policy *b* (where *b* ≠ *π*) that is different from the policy that it seeks to evaluate (or improve).

Each occurrence of state *s *∈ S in an episode is called a **visit to** *s*.

There are two types of Monte Carlo (MC) prediction methods (for estimating * vπ* ):

**First-visit MC** estimates *vπ*(*s*) as the average of the returns following *only first* visits to *s* (that is, it ignores returns that are associated with later visits).

**Every-visit MC** estimates* vπ* (*s*) as the average of the returns following *all* visits to *s*.

There are two types of MC prediction methods for estimating* qπ* :

- **First-visit MC** estimates *qπ*(*s*, *a*) as the average of the returns following *only first* visits to the pair (*s*, *a*) (that is, it ignores returns that are associated with later visits).
- **Every-visit MC** estimates *qπ*(*s*, *a*) as the average of the returns following *all* visits to the pair (*s*, *a*).

Algorithms designed to solve the **control problem** determine the optimal policy *π*∗ from interaction with the environment. **Generalized policy iteration (GPI)** refers to the general method of using alternating rounds of policy evaluation and improvement in the search for an optimal policy.

A policy is **greedy** with respect to an action-value function estimate *Q* if, for every state *s* ∈ S, it is guaranteed to select an action *a* ∈ A(*s*) satisfying *a* = argmax over *a* ∈ A(*s*) of *Q*(*s*, *a*). It is common to refer to the selected action as the **greedy action**. A policy is **ϵ-greedy** with respect to an action-value function estimate *Q* if, for every state *s* ∈ S,

with probability 1-*ϵ*, the agent selects the greedy action, and

with probability *ϵ*, the agent selects an action (uniformly) at random.
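ϵ-greedy selection can be sketched for a single state's action-value estimates (the `Q_s` values and action names are made up for illustration):

```python
import random

random.seed(42)

# Greedy action with probability 1-ε, uniformly random action with probability ε.
def epsilon_greedy(q_values, epsilon):
    if random.random() < epsilon:
        return random.choice(list(q_values))   # explore: any action
    return max(q_values, key=q_values.get)     # exploit: the greedy action

Q_s = {"left": 0.1, "right": 0.9, "stay": 0.4}
print(epsilon_greedy(Q_s, epsilon=0.0))  # always the greedy action: right
```

With ϵ = 0 the policy is purely greedy; with ϵ = 1 it is purely random. Gradually decreasing ϵ toward 0 while still visiting every pair is exactly what the GLIE conditions below formalize.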

In order for MC control to converge to the optimal policy, the **Greedy in the Limit with Infinite Exploration (GLIE)** conditions must be met:

- every state-action pair is visited infinitely many times, and
- The policy converges to a policy that is greedy with respect to the action-value function estimate
*Q*.

Whereas Monte Carlo (MC) prediction methods must wait until the end of an episode to update the value function estimate, temporal-difference (TD) methods update the value function after every time step.

For any fixed policy, **one-step TD** (or **TD(0)**) is guaranteed to converge to the true state-value function, as long as the step-size parameter *α* is sufficiently small.

In practice, TD prediction converges faster than MC prediction.
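A one-step TD update can be sketched as follows (toy state names and values; α and γ are chosen arbitrarily for illustration):

```python
# TD(0): after every step, nudge the value of the current state toward
# the target reward + γ·V(next state), instead of waiting for the episode
# to end as Monte Carlo methods do.
def td0_update(V, s, r, s_next, alpha=0.1, gamma=0.9):
    target = r + gamma * V[s_next]
    V[s] = V[s] + alpha * (target - V[s])
    return V

V = {"A": 0.0, "B": 1.0}                    # current value estimates
V = td0_update(V, s="A", r=0.5, s_next="B")  # one observed transition A -> B
print(V["A"])  # 0 + 0.1 * (0.5 + 0.9 * 1.0 - 0) = 0.14
```

The step-size parameter α controls how far each estimate moves toward its target, which is why convergence requires α to be sufficiently small.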

**Sarsa(0)** (or **Sarsa**) is an on-policy TD control method. It is guaranteed to converge to the optimal action-value function *q*∗, as long as the step-size parameter *α* is sufficiently small and *ϵ* is chosen to satisfy the **Greedy in the Limit with Infinite Exploration (GLIE)** conditions.

- On-policy TD control methods have better online performance than off-policy TD control methods (like Q-learning).
- Expected Sarsa generally achieves better performance than Sarsa.

Classical reinforcement learning methods work with discrete spaces. But in practical life, we have continuous spaces to deal with, and this is where tabular Markov Decision Process methods fail. To overcome this shortcoming, Deep Reinforcement Learning is gaining momentum.

Deep reinforcement learning (DRL) is an exciting area of AI research, with potential applicability to a variety of problem areas. Some see DRL as a path to artificial general intelligence, or AGI, because of how it mirrors human learning by exploring and receiving feedback from environments. (Source: VentureBeat)

*It is basically RL applied with neural networks. *

Let’s take a look at the image below. A computer is being taught to play the famous Mario game. Consider a particular instant of time, as in the image. What is the required action at this point: to jump or to run? The current frame is fed into a convolutional neural network (CNN), which provides the model with information about similar instances in the past and the corresponding rewards. The model then chooses the action that will yield the maximum reward.

DRL overcomes the limitation of discrete spaces by making reinforcement learning algorithms work in continuous spaces. It is a relatively new concept, and research in this area is ongoing.

The post Application of Reinforcement Learning appeared first on StepUp Analytics.

The post Introduction To Keras: Tutorial and Implementation appeared first on StepUp Analytics.

Keras is a high-level neural networks API, written in Python and capable of running on top of **TensorFlow**, **CNTK**, or **Theano**. It was developed with a focus on enabling fast experimentation. Being able to go from idea to result with the least possible delay is key to doing good research.

It runs on Python 2.7 or 3.5 and can seamlessly execute on GPUs and CPUs, given the underlying frameworks.

**User friendliness.** Keras is an API designed for human beings, not machines. It offers consistent & simple APIs, it minimizes the number of user actions required for common use cases, and it provides clear and actionable feedback upon user error.

**Modularity.** A model is understood as a sequence or a graph of standalone, fully-configurable modules that can be plugged together with as few restrictions as possible. In particular, neural layers, cost functions, optimizers, initialization schemes, activation functions, regularization schemes are all standalone modules that you can combine to create new models.

**Easy extensibility.** To be able to easily create new modules allows for total expressiveness, making Keras suitable for advanced research.

**Work with Python**. No separate model configuration files in a declarative format. Models are described in Python code, which is compact, easier to debug, and allows for ease of extensibility.

We see that Deep Learning frameworks operate at 2 levels of abstraction:

- **Lower Level:** This is where frameworks like TensorFlow, MXNet, Theano, and PyTorch sit. This is the level where mathematical operations like generalized matrix-matrix multiplication and neural network primitives like convolution operations are implemented.
- **Higher Level:** This is where frameworks like Keras sit. At this level, the lower-level primitives are used to implement neural network abstractions like layers and models. Generally, other helpful APIs like model saving and model training are also implemented at this level.

**You cannot compare Keras and Tensorflow because they sit on different levels of abstraction. I also want to take this opportunity to share my experience of using Keras:**

- **I do not agree that Keras is only useful for basic Deep Learning work.** Keras is a beautifully written API. The functional nature of the API helps you completely and gets out of your way for more exotic applications. Keras does not block access to lower-level frameworks.
- Keras results in much more readable and succinct code.
- Keras model serialization/deserialization APIs, callbacks, and data streaming using Python generators are very mature.
- Keras has been declared the official high-level abstraction for Tensorflow.

Before installing Keras, please install one of its backend engines: TensorFlow, Theano, or CNTK. We recommend the TensorFlow backend.

- TensorFlow installation instructions.
- Theano installation instructions.
- CNTK installation instructions.

You may also consider installing the following optional dependencies:

- cuDNN (recommended if you plan on running Keras on GPU).
- HDF5 and h5py (required if you plan on saving Keras models to disk).
- graphviz and pydot (used by visualization utilities to plot model graphs).

Then, you can install Keras itself. There are two ways to install Keras:

**Install Keras from PyPI (Recommended)**

sudo pip install keras

if you are using a virtual env, you may want to avoid using sudo:

pip install keras

**Alternatively: install Keras from the GitHub source:**

First, clone Keras using git:

git clone https://github.com/keras-team/keras.git

Then, cd to the Keras folder and run the install command:

cd keras
sudo python setup.py install

In this section, we will learn how to use the Keras **Sequential** model for building deep learning models.

The **Sequential** model is a linear stack of layers. You can create a **Sequential** model by passing a list of layer instances to the constructor:

from keras.models import Sequential
from keras.layers import Dense, Activation

model = Sequential([
    Dense(32, input_shape=(784,)),
    Activation('relu'),
    Dense(10),
    Activation('softmax'),
])

You can also simply add layers via the **.add()** method:

model = Sequential()
model.add(Dense(32, input_dim=784))
model.add(Activation('relu'))

The model needs to know what input shape it should expect. For this reason, the first layer in a **Sequential** model (and only the first, because following layers can do automatic shape inference) needs to receive information about its input shape. There are several possible ways to do this:

- Pass an **input_shape** argument to the first layer. This is a shape tuple (a tuple of integers or **None** entries, where **None** indicates that any positive integer may be expected). In **input_shape**, the batch dimension is not included.
- Some 2D layers, such as **Dense**, support the specification of their input shape via the argument **input_dim**, and some 3D temporal layers support the arguments **input_dim** and **input_length**.
- If you ever need to specify a fixed batch size for your inputs (this is useful for stateful recurrent networks), you can pass a **batch_size** argument to a layer. If you pass both **batch_size=32** and **input_shape=(6, 8)** to a layer, it will then expect every batch of inputs to have the batch shape **(32, 6, 8)**.

As such, the following snippets are strictly equivalent:

model = Sequential()
model.add(Dense(32, input_shape=(784,)))

model = Sequential()
model.add(Dense(32, input_dim=784))

Before training a model, you need to configure the learning process, which is done via the **compile** method. It receives three arguments:

- An optimizer. This could be the string identifier of an existing optimizer (such as **rmsprop** or **adagrad**) or an instance of the **Optimizer** class.
- A loss function. This is the objective that the model will try to minimize. It can be the string identifier of an existing loss function (such as **categorical_crossentropy** or **mse**), or it can be an objective function.
- A list of metrics. For any classification problem, you will want to set this to **metrics=['accuracy']**. A metric could be the string identifier of an existing metric or a custom metric function.

# For a multi-class classification problem
model.compile(optimizer='rmsprop',
              loss='categorical_crossentropy',
              metrics=['accuracy'])

# For a binary classification problem
model.compile(optimizer='rmsprop',
              loss='binary_crossentropy',
              metrics=['accuracy'])

# For a mean squared error regression problem
model.compile(optimizer='rmsprop',
              loss='mse')

# For custom metrics
import keras.backend as K

def mean_pred(y_true, y_pred):
    return K.mean(y_pred)

model.compile(optimizer='rmsprop',
              loss='binary_crossentropy',
              metrics=['accuracy', mean_pred])

Keras models are trained on Numpy arrays of input data and labels. For training a model, you will typically use the **fit** function.

# For a single-input model with 2 classes (binary classification):
model = Sequential()
model.add(Dense(32, activation='relu', input_dim=100))
model.add(Dense(1, activation='sigmoid'))
model.compile(optimizer='rmsprop',
              loss='binary_crossentropy',
              metrics=['accuracy'])

# Generate dummy data
import numpy as np
data = np.random.random((1000, 100))
labels = np.random.randint(2, size=(1000, 1))

# Train the model, iterating on the data in batches of 32 samples
model.fit(data, labels, epochs=10, batch_size=32)

# For a single-input model with 10 classes (categorical classification):
model = Sequential()
model.add(Dense(32, activation='relu', input_dim=100))
model.add(Dense(10, activation='softmax'))
model.compile(optimizer='rmsprop',
              loss='categorical_crossentropy',
              metrics=['accuracy'])

# Generate dummy data
import numpy as np
import keras
data = np.random.random((1000, 100))
labels = np.random.randint(10, size=(1000, 1))

# Convert labels to one-hot encoding, which categorical_crossentropy expects
one_hot_labels = keras.utils.to_categorical(labels, num_classes=10)

# Train the model, iterating on the data in batches of 32 samples
model.fit(data, one_hot_labels, epochs=10, batch_size=32)

To understand how the **Keras** model actually works in deep learning, I am taking the **Pima Indians diabetes dataset**. This is a simple neural network implementation using **Keras**.

You will find a complete **Jupyter notebook** on the **Pima Indians diabetes dataset**.

**Please visit my Github link: Code-neverends**

The post Introduction To Keras: Tutorial and Implementation appeared first on StepUp Analytics.

The post Recurrent Convolutional Neural Network: RCNN appeared first on StepUp Analytics.

Selective Search performs the function of generating 2000 different regions that have the highest probability of containing an object. After we’ve come up with a set of region proposals, these proposals are then “warped” into an image size that can be fed into a trained CNN (AlexNet in this case) that extracts a feature vector for each region. This vector is then used as the input to a set of linear SVMs that are trained for each class and output a classification. The vector also gets fed into a bounding box regressor to obtain the most accurate coordinates.

Non-maxima suppression is then used to suppress bounding boxes that have a significant overlap with each other.
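The suppression step described above can be sketched in plain Python: keep the highest-scoring box, drop every box that overlaps it too much, and repeat. This is a minimal illustration, not the exact implementation used in R-CNN; the boxes, scores, and threshold below are made-up values.

```python
def iou(a, b):
    """Intersection-over-union of two boxes given as (x1, y1, x2, y2)."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, x2 - x1) * max(0, y2 - y1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / float(area_a + area_b - inter)

def non_max_suppression(boxes, scores, iou_threshold=0.5):
    """Greedily keep the highest-scoring box, drop boxes that overlap it."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    while order:
        best = order.pop(0)
        keep.append(best)
        order = [i for i in order if iou(boxes[best], boxes[i]) < iou_threshold]
    return keep

boxes = [(10, 10, 50, 50), (12, 12, 52, 52), (100, 100, 140, 140)]
scores = [0.9, 0.8, 0.7]
print(non_max_suppression(boxes, scores))  # the second box overlaps the first and is suppressed
```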

If you wish to know more on Neural Network, Go through this awesome article on **Neural Network Workflow**

Improvements were made to the original model because of 3 main problems:

- Training took multiple stages (ConvNets to SVMs to bounding box regressors),
- Was computationally expensive, and
- Was extremely slow (RCNN took 53 seconds per image).

Fast R-CNN was able to solve the problem of speed by basically sharing computation of the conv layers between different proposals and swapping the order of generating region proposals and running the CNN.

In this model, the image is first fed through a ConvNet, features of the region proposals are obtained from the last feature map of the ConvNet and lastly, we have our fully connected layers as well as our regression and classification heads.

Faster R-CNN works to combat the somewhat complex training pipeline that both R-CNN and Fast R-CNN exhibited. The authors insert a region proposal network (RPN) after the last convolutional layer. This network is able to just look at the last convolutional feature map and produce region proposals from that. From that stage, the same pipeline as R-CNN is used (ROI pooling, FC, and the classification and regression heads).

The post Recurrent Convolutional Neural Network: RCNN appeared first on StepUp Analytics.

The post Neural Networks Workflow appeared first on StepUp Analytics.

But to operate with anything, there is always a *rulebook* which one must follow. And in the case of neural networks, the set of rules is simple: follow the basic workflow of the network! This post features the neural networks’ workflow and also some of the mathematics associated with each step of the flow of control, so let us get started.

If you are not familiar with what neural networks are and what they are doing here, then I’d suggest you follow our previous articles and come back soon. Though I’ll try to keep it as basic as I can, as a prerequisite I assume that you all know Deep Learning [link to the previous blog] and have a gist of neural networks.

As stated in the last post as well, a neural network is an interconnection of several neurons, or nodes, which have weights [W/Ө], biases [b], and activations [f(x, y)] associated with them. This interconnection enables the neurons to exchange data in the form of real numbers.

Each neuron is a function which takes some input and returns an output, calculated by an activation function residing inside. Now say that there is a network with 100 neurons; just imagine how many inputs and outputs will be in this network!

(I have already explained the terms *weight, bias, *and *activation function *in the last post here.)

The pipeline associated with the architecture of a neural network goes as follows –

- Forward Propagation
- Cost Computation
- Optimization Algorithm to minimize this cost function
- Backward Propagation

Let us see each in detail to understand how it works!

This is the very first step of the neural architecture pipeline: it takes in the input, decides some values for the learnable parameters, i.e. the weights and biases, and generates some output.

- Starting with the weight and bias units, each is randomized by the programmer; mostly the bias is set to zero and the weight to a random number. This setting up of parameters is done for each neuron in the input layer. [Learn about the input and other types of layers in the previous post.]
- After randomly initializing the learnable parameters, the input is fed into the input layer of the network. So the input layer is a vector having *n* nodes, where *n* is the number of features distinguished from the dataset.
  - From the last post, say there are three features: a categorical column describing the problem statement in one word, another for the ratings of the UI from the user, and a third categorical column describing the background of the engaged crowd in a single word.
  - Hence the number of nodes in the input layer would be three, one for each feature.

**Don’t confuse the above picture with a neural network doing the stated task. The figure was made just for reference, while what we are talking about right now is a neural network having 3 nodes in the input layer and 1 in the output, while the hidden units depend upon the choice of the programmer.**

- As the input is fed and the parameters are initialized, the next task is to apply a choice of activation function for each layer. Using this function, an approximate mapping of the input to the output is done. (Approximate because, if you recall, the parameters are randomly initialized in the first step, so an exact mapping can’t be achieved randomly.) However, as the learning proceeds, i.e. as the flow goes into deeper layers, the learnable parameters start changing and moving towards the values which are desired.
- For example, take the function *Y = Ө·X + b*. After setting up random values for the parameters, the input values are multiplied with the weights of that layer and then the bias parameter is added to this term. This generates a numerical output which serves as the input of the next layer, so for the next layer the value of *X* is decided by this function.
- As the activation function is applied to the input layer, it generates some output. Now this output acts as an input for the first hidden layer, and the process goes on for deeper layers until the final *output layer* is reached. This output layer too has an activation function, which generates the final output of the network, used to check the performance of the model.

When a programmer deploys a neural network, he/she has some expectations of the model: what will be the output? Will it appropriately solve the problem statement? If yes, how accurate will it be?

Machine learning problems often come with original results against which the model’s performance is judged (supervised learning), or otherwise, as stated before, there are some expectations associated. And that is how the above questions can be answered: a comparison is done between the values the model outputs (the predictions) and the desired outputs. This is exactly what the **Cost Function** does.

A cost function is used to **evaluate the performance** of the model. It outputs a single real value which denotes accuracy of the model (and takes in the original result and model-generated output in case of supervised learning). So now the next pipeline’s objective is to minimize this cost function so as to gain a higher accuracy score.

One such cost function is the Root Mean Square Error: RMSE = sqrt((1/n) · Σ (Pᵢ − Qᵢ)²). Here, P and Q denote the predictions of the model and the actual results, and n is the total number of instances (or training examples) available.
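As a quick sketch, the RMSE formula above translates directly into Python; the sample predictions and actuals below are made-up numbers.

```python
import math

def rmse(predictions, actuals):
    """Root Mean Square Error: square root of the mean squared difference."""
    n = len(predictions)
    return math.sqrt(sum((p - q) ** 2 for p, q in zip(predictions, actuals)) / n)

# Three training examples: differences are 1, 0 and -2
print(rmse([2.0, 4.0, 6.0], [1.0, 4.0, 8.0]))  # sqrt(5/3), about 1.29
```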

I’d like to take an analogy to explain this “optimizing the cost function” pipeline, which I really connect to and hope that you too will.

The machine learning model is often compared to a kid who is left in the world to learn from his mistakes and is sometimes corrected by his guardian when he makes one. When a model is just deployed for training, it is like a kid made to roam around a park on his own, whose objective is to reach an ice cream truck standing in the corner. And whenever he deviates from the path which leads to the truck (i.e. the accurate output in the model’s case), a guardian corrects him, pointing him back again and again whenever he strays, until he finally reaches the goal.

Now in a neural network’s case, this optimization algorithm acts as a guardian for the model, as it does exactly the same thing of correcting it whenever it goes wrong. The aim of the optimization algorithms is to minimize the cost function and thus increase the accuracy of the model significantly. The most basic and simplest algorithm used for neural nets is **Gradient Descent**.

However, there are various other techniques which could be equipped for your baby model, like Adagrad, RMSprop, Adam, and Adadelta, as well as various variants of gradient descent [check out the post here].
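A minimal sketch of the gradient descent idea, minimising a one-parameter toy cost; the cost function, learning rate, and step count below are illustrative assumptions, not part of the post.

```python
def gradient_descent(grad, start, learning_rate=0.1, steps=100):
    """Repeatedly step against the gradient to minimise the cost."""
    theta = start
    for _ in range(steps):
        theta = theta - learning_rate * grad(theta)
    return theta

# Toy cost J(theta) = (theta - 3)^2 has gradient 2 * (theta - 3); minimum at theta = 3.
theta = gradient_descent(lambda t: 2 * (t - 3), start=0.0)
print(round(theta, 4))  # converges to 3.0
```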

After the forward pass, computation of cost function, etc. is done, then comes the part of updating the learnable parameters (remember we initialized them randomly in the forward pass?). This update is done using a technique known as *Backward Propagation* (or simple Backprop).

In the optimization pipeline, the **derivative of the cost function is computed with respect to each parameter involved in the model** (i.e. every weight parameter, every bias parameter, and every activation function associated). This derivative tells us how much change must be brought in a particular variable so as to increase the accuracy of the model, or in other words, decrease the cost function. Hence the *parameter which interferes more with the cost function is penalized more, due to this derivative approach*.

As the name suggests, the backward pass operates in the opposite direction of forward propagation: in the very first step the derivative of the cost function is computed with respect to the last layer’s activation, then its weights and bias. Then the control shifts to the previous layer, does the same for that one, and moves on similarly until it finally reaches the very first layer. This way each parameter is updated, and then the cycle reruns from the forward pass again, but this time with *updated* parameters!
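A minimal backprop sketch for a single sigmoid neuron, applying the chain rule exactly as described: compute the derivative of the cost with respect to the output, then the weight and bias, and update. The training example, learning rate, and iteration count are made-up values for illustration.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# One sigmoid neuron, one training example, squared-error cost.
x, y_true = 1.5, 1.0
w, b, lr = 0.0, 0.0, 0.5

for _ in range(200):
    # Forward pass
    z = w * x + b
    y = sigmoid(z)
    # Backward pass: chain rule, from the cost back to each parameter
    d_cost_dy = 2 * (y - y_true)        # d/dy of (y - y_true)^2
    dy_dz = y * (1 - y)                 # derivative of the sigmoid
    d_cost_dw = d_cost_dy * dy_dz * x   # dz/dw = x
    d_cost_db = d_cost_dy * dy_dz       # dz/db = 1
    # Update: a gradient descent step with the freshly computed derivatives
    w -= lr * d_cost_dw
    b -= lr * d_cost_db

print(round(sigmoid(w * x + b), 3))  # close to the target 1.0
```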


Deep learning is obviously an iterative process, as quoted by Andrew Ng, but you cannot afford to struggle with just the basic workflow. Now there you go, try your hands on it: make a neural network from scratch! It may be a hectic task, but it will really help you understand the idea even better. Meanwhile, stay connected, for I’ll be covering how to make and deploy your own neural network in just 20 lines of code.

The post Neural Networks Workflow appeared first on StepUp Analytics.

The post A Beginner’s Guide To Convolutional Neural Network appeared first on StepUp Analytics.

Ever since then, a host of companies have been using deep learning at the core of their services. Facebook uses neural nets for their automatic tagging algorithms, Google for their photo search, Amazon for their product recommendations, Pinterest for their home feed personalization, and Instagram for their search infrastructure.

However, the classic, and arguably most popular, use case of these networks is for image processing. Within image processing, let’s take a look at how to use these CNNs for image classification.

Image classification is the task of taking an input image and outputting a class (a cat, dog, etc) or a probability of classes that best describes the image. For humans, this task of recognition is one of the first skills we learn from the moment we are born and is one that comes naturally and effortlessly as adults. Without even thinking twice, we’re able to quickly and seamlessly identify the environment we are in as well as the objects that surround us.

When we see an image or just when we look at the world around us, most of the time we are able to immediately characterize the scene and give each object a label, all without even consciously noticing. These skills of being able to quickly recognize patterns, generalize from prior knowledge, and adapt to different image environments are ones that we do not share with our fellow machines.

When a computer sees an image (takes an image as input), it sees an array of pixel values. Depending on the resolution and size of the image, it might see, say, a 32 x 32 x 3 array of numbers (the 3 refers to the RGB channels). Just to drive home the point, let’s say we have a color image in JPG form and its size is 480 x 480. The representative array will be 480 x 480 x 3.

Each of these numbers has a value from 0 to 255 which describes the pixel intensity at that point. These numbers, while meaningless to us when we perform image classification, are the only inputs available to the computer. The idea is that you give the computer this array of numbers and it will output numbers that describe the probability of the image being a certain class (.80 for a cat, .15 for a dog, .05 for a bird, etc).
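As a toy sketch of this input/output contract: a small nested array stands in for the pixel values, and a softmax turns raw class scores into probabilities that sum to 1. The pixel values and scores are made up, and softmax is just one common way to produce class probabilities; it is not the only choice.

```python
import math

# A toy "image": 2 x 2 pixels x 3 channels (RGB), values 0-255
image = [[[255, 0, 0], [0, 255, 0]],
         [[0, 0, 255], [128, 128, 128]]]
print(len(image), len(image[0]), len(image[0][0]))  # height x width x channels

def softmax(scores):
    """Turn raw class scores into probabilities that sum to 1."""
    exps = [math.exp(s) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# Pretend the network produced these raw scores for (cat, dog, bird)
probs = softmax([2.0, 0.5, -1.0])
print([round(p, 2) for p in probs])  # cat gets the largest probability
```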

Now that we know the problem as well as the inputs and outputs, let’s think about how to approach this. What we want the computer to do is to be able to differentiate between all the images it’s given and figure out the unique features that make a dog a dog or that make a cat a cat. This is the process that goes on in our minds subconsciously as well.

When we look at a picture of a dog, we can classify it as such if the picture has identifiable features such as paws or 4 legs. In a similar way, the computer is able to perform image classification by looking for low-level features such as edges and curves and then building up to more abstract concepts through a series of convolutional layers. This is a general overview of what a CNN does. Let’s get into the specifics.

But first, a little background. When you first heard of the term convolutional neural networks, you may have thought of something related to neuroscience or biology, and you would be right. Sort of. CNN’s do take a biological inspiration from the visual cortex. The visual cortex has small regions of cells that are sensitive to specific regions of the visual field.

Back to the specifics. A more detailed overview of what CNNs do would be that you take the image, pass it through a series of convolutional, nonlinear, pooling (downsampling), and fully connected layers, and get an output. As we said earlier, the output can be a single class or a probability of classes that best describes the image. Now, the hard part is understanding what each of these layers do. So let’s get into the most important one.

The first layer in a CNN is always a **Convolutional Layer**. First thing to make sure you remember is what the input to this conv (I’ll be using that abbreviation a lot) layer is. Like we mentioned before, the input is a 32 x 32 x 3 array of pixel values. Now, the best way to explain a conv layer is to imagine a flashlight that is shining over the top left of the image.

Let’s say that the light this flashlight shines covers a 5 x 5 area. And now, let’s imagine this flashlight sliding across all the areas of the input image. In machine learning terms, this flashlight is called a **filter**(or sometimes referred to as a **neuron **or a **kernel**) and the region that it is shining over is called the **receptive field**. Now this filter is also an array of numbers (the numbers are called **weights** or **parameters**). A very important note is that the depth of this filter has to be the same as the depth of the input (this makes sure that the math works out), so the dimensions of this filter is 5 x 5 x 3.

Now, let’s take the first position the filter is in for example. It would be the top left corner. As the filter is sliding, or **convolving**, around the input image, it is multiplying the values in the filter with the original pixel values of the image (aka computing **element wise multiplications**). These multiplications are all summed up (mathematically speaking, this would be 75 multiplications in total). So now you have a single number.

Remember, this number is just representative of when the filter is at the top left of the image. Now, we repeat this process for every location on the input volume. (Next step would be moving the filter to the right by 1 unit, then right again by 1, and so on). Every unique location on the input volume produces a number. After sliding the filter over all the locations, you will find out that what you’re left with is a 28 x 28 x 1 array of numbers, which we call an **activation map** or **feature map**. The reason you get a 28 x 28 array is that there are 784 different locations that a 5 x 5 filter can fit on a 32 x 32 input image. These 784 numbers are mapped to a 28 x 28 array.
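The sliding-filter arithmetic above can be sketched in plain Python for a single channel. This is a toy image of ones rather than a real picture, and real CNN libraries implement convolution far more efficiently; the point is only to reproduce the 32 − 5 + 1 = 28 geometry from the text.

```python
def convolve2d(image, kernel):
    """Slide the kernel over the image (stride 1, no padding) and sum
    the element-wise products at each location, producing a feature map."""
    n, k = len(image), len(kernel)
    out_size = n - k + 1
    feature_map = []
    for i in range(out_size):
        row = []
        for j in range(out_size):
            total = sum(image[i + di][j + dj] * kernel[di][dj]
                        for di in range(k) for dj in range(k))
            row.append(total)
        feature_map.append(row)
    return feature_map

image = [[1] * 32 for _ in range(32)]   # one 32 x 32 channel, all ones
kernel = [[1] * 5 for _ in range(5)]    # a 5 x 5 filter of ones
fmap = convolve2d(image, kernel)
print(len(fmap), len(fmap[0]))          # 28 28, as derived above
print(fmap[0][0])                       # 25: each of the 5 x 5 products is 1
```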

Let’s say now we use two 5 x 5 x 3 filters instead of one. Then our output volume would be 28 x 28 x 2. By using more filters, we are able to preserve the spatial dimensions better. Mathematically, this is what’s going on in a convolutional layer.

However, let’s talk about what this convolution is actually doing from a high level. Each of these filters can be thought of as **feature identifiers**. When I say features, I’m talking about things like straight edges, simple colors, and curves. Think about the simplest characteristics that all images have in common with each other.

Let’s say our first filter is 7 x 7 x 3 and is going to be a curve detector. (In this section, let’s ignore the fact that the filter is 3 units deep and only consider the top depth slice of the filter and the image, for simplicity.) As a curve detector, the filter will have a pixel structure in which there will be higher numerical values along the area that is the shape of a curve (remember, these filters that we’re talking about are just numbers!).

Now, let’s go back to visualizing this mathematically. When we have this filter at the top left corner of the input volume, it is computing multiplications between the filter and pixel values at that region. Now let’s take an example of an image that we want to classify, and let’s put our filter at the top left corner.

Remember, what we have to do is multiply the values in the filter with the original pixel values of the image.

Basically, in the input image, if there is a shape that generally resembles the curve that this filter is representing, then all of the multiplications summed together will result in a large value! Now let’s see what happens when we move our filter.

The value is much lower! This is because there wasn’t anything in the image section that responded to the curve detector filter. Remember, the output of this conv layer is an activation map. So, in the simple case of a one-filter convolution (and if that filter is a curve detector), the activation map will show the areas in which there are most likely to be curves in the picture.

In this example, the top left value of our 26 x 26 x 1 activation map (26 because of the 7×7 filter instead of 5×5) will be 6600. This high value means that it is likely that there is some sort of curve in the input volume that caused the filter to activate. The top right value in our activation map will be 0 because there wasn’t anything in the input volume that caused the filter to activate (or more simply said, there wasn’t a curve in that region of the original image).

Remember, this is just for one filter. This is just a filter that is going to detect lines that curve outward and to the right. We can have other filters for lines that curve to the left or for straight edges. The more filters, the greater the depth of the activation map, and the more information we have about the input volume.

The post A Beginner’s Guide To Convolutional Neural Network appeared first on StepUp Analytics.

The post Introduction To Deep Learning and Neural Nets appeared first on StepUp Analytics.

Deep Learning is here to *stay* for at least another couple of years, if not the decade. In this world of changing technologies, the newly found domain is that of Machine Learning and AI, to which Deep Learning contributes extensively. So let us get a little more familiar with what it is and how it has gained so much attention.

**Table of Contents**

- What is Deep Learning?
- How does it work?
- Terminologies associated with DL with the help of an example.
- How Deep Learning is better than the “Traditional Learning Approaches”?
- Introduction to Neural Networks

Download the Slides for this article

The term Deep Learning was actually derived from Machine Learning in the year 1986, and the ideas behind it were thought of in the 1980s; yeah, you read that right!

Before leaping to the former, let us first see a definition of Machine Learning just so you get a gist if you don’t know it already.

**“Machine learning is an application of Artificial Intelligence (AI) that provides systems the ability to implicitly learn and improve from experience without being explicitly programmed to do so.”**

And how will the systems learn? Of course from the *data* that is explicitly supplied to them as input.

Now coming to Deep Learning; it is a subfield of ML (as stated already) and it concerns the algorithms which are inspired by the structure and function of the brain. Deep Learning extensively uses *Neural Networks*, which imitate a human brain to process the input data and learn patterns for decision making in the future.

Some key points to remember about Deep Learning are:

- It learns by looking at the examples, just like Machine Learning.
- It can directly operate over the inputted images, texts, or even sounds!
- It typically uses various architectures of Neural Networks (it sure might be fiddling your brain up, thinking what *is a Neural Net*, so we’ll come to that soon)

As already discussed, a Deep Learning Algorithm or model directly processes on pictures, videos, texts or sounds. The programmer need not manually extract features and supply it to the model but the algorithm learns patterns directly from the data inputted.

A DL model consists of several neurons where each neuron (or perceptron) has some input on the basis of which it generates an output. The interconnection of all these neurons forms a network which is known as a **Neural Network,** and this interconnection enables the neurons to communicate the data with each other.

**The figure shows the similarity between a Biological and an Artifical Neuron (Perceptron)**

Neural Networks use neurons which are artificially generated keeping in mind the biological ones, hence the network is also called an **Artificial Neural Network **(or simply **ANN**).

As an input is given to the neuron, it computes its output based on a mapping function. This mapping function is known as an **Activation Function**. Basically, the goal of this activation function is to map the relation between the input and the output values. Also, with each input parameter is associated one more parameter which is known as a **Weight(s)**. The final optimized value of *weight(s)* is learned by the model by iterating several times through the data inputted. Weight is just a real number which is multiplied with the scalar inputs and the output result is passed on to the next node via the connection.

*Weight tells how much impact that input parameter has on the output*. For instance, take the problem statement to find whether a newly launched app will be a success or a failure!

The adjacent figure shows possible parameters of concern, i.e., the problem statement and aim of the application, its UI, and lastly, the background of the crowd engaged with it.

So each parameter will have a different intensity with which it impacts the output. Suppose the problem statement is not excellent, but its UI is great and the crowd it is marketed to is of just the right domain; then the app will obviously be a success.

Sometimes another parameter comes into play: a **Bias Unit**. It is denoted by the letter **b**, and it shifts the activation function's threshold, making the model more flexible, adaptive to different situations, and hence more robust!
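Putting the pieces together, a single neuron's output is its activation function applied to the weighted sum of the inputs plus the bias. The sketch below uses a sigmoid activation and made-up input, weight, and bias values for the hypothetical app example; none of these numbers come from the post:

```python
import math

def neuron_output(inputs, weights, bias):
    # Weighted sum of the inputs plus the bias, passed through a sigmoid
    z = sum(x * w for x, w in zip(inputs, weights)) + bias
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical scores: problem statement, UI quality, audience fit
inputs = [0.4, 0.9, 0.8]
weights = [0.5, 1.2, 1.0]  # illustrative: UI weighted most heavily
bias = -1.0                # shifts the activation threshold

print(round(neuron_output(inputs, weights, bias), 3))
```

An output close to 1 would suggest the app is likely to succeed; in a real model the weights and bias would be learned from data rather than set by hand.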

OK, so Deep Learning is another technology in the market, but why is the world pivoting to Deep Learning from the "traditional" learning algorithms?

The answer to this question becomes much clearer by looking at the adjacent figure.

The graph was provided by one of the leading scientists in the domain of AI and a co-founder of Google Brain, *Andrew Yan-Tak Ng*.

In this era of Big Data, researchers have realized that data plays a key role in this data-driven world. And when this huge amount of data is provided to a Deep Learning algorithm, the algorithm not only keeps improving its performance but also enhances the accuracy of the system as a whole.

This wasn't seen with the older learning algorithms (as in the graph), whose performance saturates after reaching a point. Because of this continued improvement with more data, Deep Learning is preferred over the older learning algorithms.

Moreover, the Deep Learning approach can be applied to classification as well as regression tasks, which demonstrates its versatility and makes it handy across the field of AI.

What you've seen until now is just a single neuron and how it works, but Deep Learning is not a single neuron; it is a network of neurons. This network of neurons, when plotted, looks something like the figure below.

In the above figure/graph, each node depicts a *neuron*, and several neurons placed vertically make up a **layer**. Furthermore, a neural network with more than one hidden layer is what is typically used in deep learning architectures.

The very first layer, which passes on the input variables/parameters, is known as the **Input Layer**; after it come a number of **Hidden Layers**. Hidden layers got their name because their values are not directly observed; they are generally not shown when presenting the architecture to a client. The decision on how many hidden layers to use is entirely up to the programmer, based on the requirements.

The number of nodes (or neurons) per layer also depends on the programmer and the use case of the model. Lastly, there is the **Output Layer**, which is dedicated to serving the output of the model. The number of nodes in the input and output layers is determined by the problem statement; for example, a classifier distinguishing three classes (say cat, fish, and dog) will have three output nodes, one per class. Likewise, the number of nodes in the input layer equals the number of input parameters (refer to the example of the success/failure of the new application).
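For the three-class example above, the layer sizes can be sketched as follows; the single hidden layer of four neurons is an arbitrary choice for illustration, and the weights are just randomly initialized placeholders:

```python
import random

# Layer sizes: 3 input features, 4 hidden neurons (arbitrary), 3 output classes
layer_sizes = [3, 4, 3]

# One weight matrix per connection between consecutive layers:
# each row holds the incoming weights of one neuron in the next layer
weights = [
    [[random.uniform(-1, 1) for _ in range(n_in)] for _ in range(n_out)]
    for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:])
]

# The first matrix maps 3 inputs to 4 hidden neurons; the second maps 4 to 3
print([(len(w), len(w[0])) for w in weights])  # [(4, 3), (3, 4)]
```

Changing the entries of `layer_sizes` is all it takes to describe a deeper or wider network in this scheme.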

What happens in a neural network is this: the input is fed to the input layer. This layer multiplies the scalar input values by the weights (which are initialized randomly at first), applies the activation function, and passes the result on to the first hidden layer via the connections.

Each subsequent layer does the same thing until the process reaches the output layer, where the result is compared with the target via a cost function; the weights are then adjusted and the process is repeated until the cost function is minimized.
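The forward pass described above can be sketched in a few lines. This is a toy 2-3-1 network with randomly initialized weights and zero biases, and the mean-squared-error cost shown is just one common choice of cost function; the weight update (backpropagation) step that would actually minimize the cost is omitted:

```python
import math
import random

random.seed(0)  # make the random weights reproducible

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def layer_forward(inputs, weights, biases):
    # Each neuron: sigmoid of (weighted sum of inputs + bias)
    return [sigmoid(sum(x * w for x, w in zip(inputs, ws)) + b)
            for ws, b in zip(weights, biases)]

def forward(x, network):
    # Propagate the input through every layer in turn
    for weights, biases in network:
        x = layer_forward(x, weights, biases)
    return x

def mse(output, target):
    # Mean squared error between network output and target
    return sum((o - t) ** 2 for o, t in zip(output, target)) / len(output)

# Tiny 2-3-1 network: 2 inputs -> 3 hidden neurons -> 1 output
network = [
    ([[random.uniform(-1, 1) for _ in range(2)] for _ in range(3)], [0.0] * 3),
    ([[random.uniform(-1, 1) for _ in range(3)]], [0.0]),
]

output = forward([0.5, 0.8], network)
print(mse(output, [1.0]))  # training would adjust the weights to reduce this
```

In a full implementation, the gradient of this cost with respect to every weight would be computed and the weights nudged accordingly, over and over, until the cost stops shrinking.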

The reason why Deep Learning is used should be quite clear from this post; it can also be customized heavily to the programmer's needs. It is one of the state-of-the-art techniques in use today, and it gives rise to CNNs, RNNs, LSTMs, etc., which are indeed used almost everywhere.

If you wish to know more about the mathematical side of neural networks or how they work, leave us a comment. Happy Learning!

The post Introduction To Deep Learning and Neural Nets appeared first on StepUp Analytics.
