The post Testing Of Hypothesis: Parametric Tests appeared first on StepUp Analytics.

*Note: In this article, we assume that the normality and no outliers assumption holds true where required and do not test it separately.*

*You can refer to the R Code and data files used in this article here:*

**Q. Check if mean birth weight of newborn babies is 3 kgs.**

A. We have data of birth weight of 25 newborn babies. To check if the mean birth weight of newborn babies is 3kgs, we use one sample t-test. As nothing is mentioned about the level of significance, we consider the default value of 5%.

*To test,*

**H**_{0}**: Mean birth weight of newborn babies is 3 kgs v/s H**_{1}**: Mean birth weight of newborn babies is not 3 kgs**

As p-value (0.3195) > 0.05, do not
reject H_{0} and conclude that the mean birth weight of newborn babies
is 3 kgs.
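For readers who prefer Python, the same one-sample t-test can be sketched with `scipy.stats`. The weights below are made up for illustration; the article's actual data gives p = 0.3195.

```python
from scipy import stats
import numpy as np

# Hypothetical birth weights (kg) of 25 newborns -- not the article's data
rng = np.random.default_rng(42)
weights = rng.normal(loc=3.1, scale=0.4, size=25)

# H0: mean = 3 v/s H1: mean != 3 (two-sided by default)
t_stat, p_value = stats.ttest_1samp(weights, popmean=3)
print(f"t = {t_stat:.4f}, p-value = {p_value:.4f}")

# Decision at the default 5% level of significance
if p_value > 0.05:
    print("Do not reject H0")
else:
    print("Reject H0")
```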

Alternatively, this question can be solved by calculating the value of the statistic using the formula step by step. We can make a function to give the value of the statistic for any dataset.

First, let us understand how a function works in R.

*In maths, if we write a function, y =
f(x) = x^2, then it’ll give different values of y for different values of x. If
x = 3, then y will be 9. *

*Similarly, in R, a function of one or more arguments (in this case x is an argument) performs the given statements (in this case y = x^2) to give the required output for the desired value (in this case the output is 9, for the given input of 3). The above-discussed function can be named as try1.*
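As a quick sketch of the same idea (shown here in Python; the R version is `try1 <- function(x) x^2`):

```python
def try1(x):
    """Square the input, i.e. y = f(x) = x^2."""
    return x ** 2

print(try1(3))  # 9
```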

**Q. Test if there is a difference in time taken (in minutes) by 2 teams in writing an article.**

A. We have data of 14 writers per team. We’ll use independent sample t-test for this problem.

First, we’ll check the equality-of-variances assumption for the 2 groups.

*To test,*

**H**_{0}**: Variances of the two groups are equal v/s H**_{1}**: Variances of the two groups are not equal**

As p-value (0.5799) > 0.05, do not
reject H_{0}. We conclude that the variances are equal, hence perform
the t-test.

*To test,*

**H**_{0}**: No significant difference in mean time taken by 2 teams in writing an article v/s H**_{1}**: Significant difference in mean time taken by 2 teams in writing an article**

As p-value (0.6447) > 0.05, do not reject H_{0} and conclude that there is no significant difference in time taken by the 2 teams to complete the article.
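The two-step procedure (variance check, then t-test) can be sketched in Python as well. The times below are invented for illustration, and `scipy` offers Levene's test rather than R's `var.test` F-test:

```python
from scipy import stats
import numpy as np

# Hypothetical writing times (minutes) for 14 writers per team -- not the article's data
rng = np.random.default_rng(0)
team_a = rng.normal(60, 8, size=14)
team_b = rng.normal(62, 8, size=14)

# Step 1: check the equality-of-variances assumption
_, p_var = stats.levene(team_a, team_b)

# Step 2: independent two-sample t-test, pooling variances only if step 1 passed
t_stat, p_value = stats.ttest_ind(team_a, team_b, equal_var=(p_var > 0.05))
print(f"p (equal variances) = {p_var:.4f}, p (t-test) = {p_value:.4f}")
```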

Alternatively, this question can be solved by calculating the value of the statistic using the formula step by step. We can make a function to give the value of the statistic for any dataset.

**Q. Test if there is a reduction in weight post diet plan implementation.**

A. As the data corresponds to weight values pre and post diet plan implementation, it is a case of a paired t-test.

*To test,*

**H**_{0}**: Mean difference of weights pre and post implementation of diet plan is 0 v/s H**_{1}**: Mean difference of weights (pre minus post) is greater than 0**

As p-value (2.688e-07) < 0.05, reject H_{0}
and conclude that there is a reduction in weight post diet plan implementation.

Alternatively, this question can be solved by calculating the value of the statistic using the formula step by step. We can make a function to give the value of the statistic for any dataset.
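A paired t-test in Python uses `scipy.stats.ttest_rel`; the weights below are made up for illustration, with the one-sided alternative matching the question:

```python
from scipy import stats
import numpy as np

# Hypothetical pre/post weights (kg) -- not the article's data
pre  = np.array([82.0, 91.5, 78.2, 88.0, 95.3, 73.4, 85.1, 90.2])
post = np.array([79.5, 88.0, 76.1, 85.2, 91.0, 72.0, 82.3, 87.5])

# Paired t-test on the differences; H1: pre > post (a reduction in weight)
t_stat, p_value = stats.ttest_rel(pre, post, alternative='greater')
print(f"t = {t_stat:.4f}, one-sided p-value = {p_value:.4f}")
```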

**Q. Determine whether the standard deviation of the heights of 12 year old children is equal to 4cm, based on a random sample of 5 heights in cm.**

A. This question can be solved by calculating the value of the statistic using the formula step by step. We can make a function to give the value of the statistic for any dataset.

*To test,*

**H**_{0}**: Standard deviation of the heights of 12-year-old children is equal to 4 cm v/s H**_{1}**: Standard deviation of the heights of 12-year-old children is not equal to 4 cm**

As p-value (0.3694) > 0.05, do not reject H_{0} and conclude that the standard deviation of the heights of 12-year-old children is 4cm.
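The chi-square test for a single variance can be coded step by step. The five heights below are made up for illustration (so the p-value will differ from the article's 0.3694):

```python
import numpy as np
from scipy import stats

# Hypothetical sample of 5 heights (cm) -- not the article's data
heights = np.array([148.2, 152.5, 149.8, 155.1, 151.0])
sigma0 = 4.0                          # hypothesised standard deviation
n = len(heights)

# Test statistic: (n - 1) * s^2 / sigma0^2 ~ chi-square with n - 1 d.f. under H0
s2 = np.var(heights, ddof=1)          # sample variance
chi2_stat = (n - 1) * s2 / sigma0**2

# Two-sided p-value: double the smaller tail probability
cdf = stats.chi2.cdf(chi2_stat, df=n - 1)
p_value = 2 * min(cdf, 1 - cdf)
print(f"chi-square = {chi2_stat:.4f}, p-value = {p_value:.4f}")
```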

**Q. In a one-year mortality investigation, 4 of the 25 ninety-year-olds present at the start of the investigation died before the end of the year. Assuming that the number of deaths has a Binomial(25, p) distribution, test whether this result is consistent with a mortality rate of p = 0.2 for this age.**

A. To test,

**H**_{0}**: p = 0.2 v/s H**_{1}**: p ≠ 0.2**

As p-value
(0.804) > 0.05, do not reject H_{0} and conclude that the true
mortality rate for this age is 0.2.
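This exact binomial test is available in Python as `scipy.stats.binomtest`, using the counts from the question; it reproduces the article's p-value:

```python
from scipy.stats import binomtest

# 4 deaths observed out of 25 lives; H0: p = 0.2 (two-sided by default)
result = binomtest(4, n=25, p=0.2)
print(f"p-value = {result.pvalue:.3f}")  # approx. 0.804, matching the article
```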

**Q. In a one-year mortality investigation, 25 of the 100 ninety-year-old males and 20 of the 150 ninety-year-old females present at the start of the investigation died before the end of the year. Assuming that the numbers of deaths follow binomial distributions, test whether the male and female mortality rates are the same.**

A. This question can be solved by calculating the value of the statistic using the formula step by step. We can make a function to give the value of the statistic for any dataset.

*To test,*

**H**_{0}**: Male and Female mortality rates are the same v/s H**_{1}**: Male and Female mortality rates are different**

As the p-value (0.01866) < 0.05, reject
H_{0} and conclude that male and female mortality rates are different.
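The underlying calculation is a pooled two-proportion z-test, which can be written out step by step with the counts from the question; it reproduces the article's p-value:

```python
import math
from scipy.stats import norm

# Deaths: 25 of 100 males, 20 of 150 females
x1, n1 = 25, 100
x2, n2 = 20, 150
p1, p2 = x1 / n1, x2 / n2

# Pooled estimate of the common mortality rate under H0
p_pool = (x1 + x2) / (n1 + n2)
se = math.sqrt(p_pool * (1 - p_pool) * (1 / n1 + 1 / n2))

# Two-sided z-test for equality of proportions
z = (p1 - p2) / se
p_value = 2 * norm.sf(abs(z))
print(f"z = {z:.4f}, p-value = {p_value:.5f}")  # approx. 0.0187, matching the article
```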

**Q. In a one-year investigation of claim frequencies for a particular category of motorists, test whether the true claim frequency is less than 0.175.**

A. *To test,*

**H**_{0}**: Average claim frequency λ is equal to 0.175 v/s H**_{1}**: Average claim frequency λ is less than 0.175**

As p-value (0.005388) < 0.05, reject H_{0} and conclude that the true claim frequency is less than 0.175.
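The question's counts are not reproduced above, but a test of a Poisson mean can be sketched via the normal approximation. The claim count and portfolio size below are assumed purely for illustration, with the same one-sided alternative as the article:

```python
import math
from scipy.stats import norm

# Hypothetical data: total claims and number of motorist-years -- assumed for illustration
total_claims = 120
n = 1000
lam0 = 0.175                     # hypothesised claim frequency

# Under H0, the sample mean claim frequency has approximate variance lam0 / n
lam_hat = total_claims / n
z = (lam_hat - lam0) / math.sqrt(lam0 / n)

# One-sided test: H1 is lambda < 0.175, so use the lower tail
p_value = norm.cdf(z)
print(f"z = {z:.4f}, p-value = {p_value:.5f}")
```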

**Q. In a one-year investigation of claim frequencies for a particular category of motorists, test whether the claim frequency is the same for drivers under age 25 and over age 25.**

A. *To test,*

**H**_{0}**: Claim frequency is the same for drivers under age 25 and over age 25 v/s H**_{1}**: Claim frequency is different for drivers under age 25 and over age 25**

On comparing with the normal distribution tables, it is evident that the statistic value is too high. Therefore, reject H_{0} and conclude that the claim frequencies are different for younger and older drivers.

**Download** the data and R codes used in this article.

**Statistics for Data Science.**


The post Stepwise Regression appeared first on StepUp Analytics.

So our objective is to build regression models which are as complete and realistic as possible. On the one hand, we want every variable that is even remotely related to the dependent variable to be included; on the other, we want to include as few variables as possible.

Theory and experience give us a certain direction as to which variables should be included in the regression model. Moreover, manually filtering through and comparing regression models can be tedious.

Luckily, several approaches exist for automatically performing feature selection or variable selection, that is, for identifying those variables that result in superior regression results. This **traditional approach of determining the subset of actual predictor variables** is called **variable selection**.

The** three **main approaches of variable selection are as follows:

**Forward Selection**

The forward selection method begins with no candidate variables in the model. Then at each step, we select the most significant variable and add it to the model. This process continues until no remaining variable produces a statistically significant improvement in fit.

Forward selection is mostly used when a large group of variables exists. For example, suppose there are more than fifty variables in a data set. A reasonable approach would be to obtain the best “n” number of variables and then apply the all-possible algorithm in the subset. This procedure is also a good choice when multicollinearity is a problem.

**Backward Selection**

The backward selection method or one might say backward elimination method begins with all possible variables that are believed to be potentially significant. Then at each step, we attempt to eliminate the variable that is most insignificant. This process continues until no insignificant variables remain and no further variables can be deleted without a statistically significant loss of fit.

**Stepwise Selection**

Stepwise regression is a combination of forward and backward selection. The method starts with no predictor variables and then sequentially adds the new variable that gives the greatest reduction of the sum of squared errors. At the same time, predictors already in the model that have become insignificant as a result of the inclusion of additional variables are removed at later stages.

The process carries on until an equilibrium point is reached where no significant reduction in the sum squared residuals is to be gained by adding variables in the regression and where a significant increase in the sum squared residuals would arise if a variable were removed from regression.
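The article's examples use R; as a rough Python analogue (assuming scikit-learn is available), `SequentialFeatureSelector` performs forward or backward selection by cross-validated score. Note it only greedily adds or removes, without revisiting earlier choices the way true stepwise selection does:

```python
from sklearn.datasets import make_regression
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LinearRegression

# Synthetic data: 10 candidate predictors, only 5 actually informative
X, y = make_regression(n_samples=200, n_features=10, n_informative=5,
                       noise=10.0, random_state=0)

# Forward selection of 5 predictors by cross-validated score
selector = SequentialFeatureSelector(LinearRegression(),
                                     n_features_to_select=5,
                                     direction='forward', cv=5)
selector.fit(X, y)
print("Selected columns:", selector.get_support(indices=True))
```

Changing `direction='backward'` gives backward elimination instead.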

There are various packages to do the stepwise regression. Here I have shown **two** methods. Loading the necessary packages:

library (Metrics)

library (caret)

library (leaps)

library (MASS)

**1 ^{ST} METHOD**

The first method uses the **stepAIC()** function present in the **MASS** package. It chooses the best model by AIC (**Wiki**). We use the option **direction = "both"** for stepwise regression. We can also perform forward and backward selection by choosing **"forward"** and **"backward"** respectively. Fitting the model with all the predictor variables and saving it in **model_lm**:

**model_lm <- lm (medv ~ ., data = train)**

This is the multiple linear regression model. The summary given below shows that **model_lm** contains all the predictor variables in the data set.

Checking the summary of the model

**summary (model_lm)**

Fitting the stepwise regression model and saving it in **model_stepAIC**:

**model_stepAIC <- stepAIC (model_lm, direction = "both", trace = FALSE)**

From the summary of the stepwise regression model, it is clear that only the variables which are highly significant are considered in the updated model.

**summary (model_stepAIC)**

Now we predict the value of **medv**, or House Price, and save it in **Pred_model_stepAIC**:

**Pred_model_stepAIC <- predict (model_stepAIC, test[, -14])**

Calculating and saving the value of MAPE in **mape_stepAIC**:

**mape_stepAIC <- mape (test$medv, Pred_model_stepAIC)**
**mape_stepAIC**

In the second method, I have used the **caret** package, and we use the option **method = "leapSeq"** to carry out the stepwise regression. The other methods, to fit forward and backward selection, are given below:

**"leapForward"**

**"leapBackward"**

We use **10-fold cross-validation** to estimate statistical error measures such as the RMSE and MAE. These estimates are used to compare the models and to automatically choose the best one.

We also specify the tuning parameter **nvmax**, which is the maximum number of predictors that can be present in the model. Here **nvmax** ranges from 1 to 10.

**set.seed (123)**
**train.control <- trainControl (method = "cv", number = 10)**

Saving the model in** model_caret**

**model_caret <- train (medv ~ ., data = train, method = "leapSeq", tuneGrid = data.frame (nvmax = 1:10), trControl = train.control)**
**summary (model_caret)**

**model_caret$results**

**plot (model_caret)**

**model_caret$bestTune**

So the best number of predictor variables that minimizes the RMSE is **4**.

Now, predicting the values of House Price with the caret model and saving it in **Pred_model_caret**:

**Pred_model_caret <- predict (model_caret, test[, -14])**

Calculating and
saving the value of mape in **mape_caret**

**mape_caret <- mape (test$medv, Pred_model_caret)**
**mape_caret**

**ADVANTAGES AND DISADVANTAGES**

**Advantages of Stepwise Regression:**

- It is faster than most other automatic model-selection methods.
- It has the ability to manage the predictor variables in the regression model by eliminating or inserting them into the model according to their significance.

**Disadvantages of Stepwise Regression:**

- Smaller datasets may result in higher sum squared residuals.
- The method adds or removes variables in a certain order, so we end up with one particular combination of predictors. That combination may not be the one closest to reality.

To know more about the other methods see the links given below:


The post Actuarial Science PPD under New Curriculum appeared first on StepUp Analytics.

PPD is the replacement programme for Work-based Skills (WBS), and all student members who joined after 1 September 2017 are required to complete PPD. But if you joined the IFoA before 1 September 2017, then you are subject to a transition from the previous scheme you were on and need to have moved onto PPD by 1 September 2018 or your next anniversary of joining the IFoA. For more information about the transition (i.e., WBS to PPD), you can **click here**.

PPD is an annual requirement for all student members. **When can you start recording your PPD?** Whenever you start working and most of your time is spent on activities and tasks which are actuarially relevant, you can start recording your PPD.

Before this article as a student member, I had a lot of confusion regarding the PPD. So, through this article, I will try to sort some of your queries regarding PPD.

There are three elements to complete your PPD:

- Satisfying a number of core competencies under three key objectives:
- Effective Communication
- Problem Solving and Decision Making
- Professionalism

- Maintaining a record of formal learning activities undertaken
- You need to record at least two hours of Formal Learning and Development activities each year within your PPD Records. These must be unconnected with the actuarial exams and activities you have recorded to meet the requirement of the CPD Scheme.

- Demonstrating the learning gained from completing these activities.
- The description should be a concise explanation and overview of the activity you are referencing.

The requirement for Associate and Fellow is:-

**For Associate**
**For Fellow**

So, your PPD year will run for twelve months from the month you joined the IFoA, and you are required to submit your PPD each year. During this year you have to demonstrate how you have applied in practice the knowledge and skills you have gained through the IFoA examinations, with at least two hours of formal learning.

All your PPD activities are recorded in your online member account, **My PPD**, and can be completed on an ongoing basis. You are advised to discuss your activities with your line manager or supervisor while recording them, as they may be contacted to verify that the information is accurate.

If you have CPD requirements under the CPD scheme, then you will be required to complete these as well as the PPD. This means **The Professional Skills Training** you have to complete under the CPD scheme cannot be used to meet the requirements of PPD.

If you complete the exams part way through your PPD year, then a PPD submission will be required for each full month of that year. **For example**, if you pass your final exam in July and your PPD year runs February to January, a submission covering February to June will be required.

If you are on parental leave, perhaps taking a career break, or not working in an actuarially relevant role, you should contact the IFoA to inform them of your situation and log a break. You cannot log a break yourself, and it is your responsibility to inform the IFoA. If you are unable to meet the annual PPD requirement once your PPD deadline has passed, you will be penalized:

- A penalty of £200 to submit any additional logs once your deadline has passed.
- If you continue to fail to meet your PPD requirements 3 months after your initial PPD deadline has passed, then you will be suspended from the IFoA examinations.
- If, even 12 months after the initial deadline, you have been unable to meet the annual requirements for 2 years, then your IFoA membership may be revoked.

To check out more about PPD, **click here**.

This is all from my side. I hope, I was able to sort some of your confusion regarding PPD. For further queries or doubts, you can comment down below.

Read more about **Actuarial Science New Curriculum**

**I wish you all the best!!**


The post Information Value (IV) and Weight of Evidence (WOE) appeared first on StepUp Analytics.

Information Value (IV) and Weight of Evidence (WOE) were developed for the credit and financial industries, primarily to build better models to predict the risk of loan defaults (credit risk models). We know that there are many factors, such as age, education, income of the person, previous credit history, loan amount, etc., which determine the probability (risk) of loan default, as expressed in credit scores. With the help of the Information Value and WOE, we can calculate the predictive power of these predictor variables.

**So, we can say that WOE and IV play two distinct roles when analyzing data:**

- WOE describes the relationship between a predictive variable and a binary target variable.
- IV measures the strength of that relationship.

They are also used in marketing analytics projects such as customer attrition models, campaign response models, etc. But here I have used a credit risk dataset to explain the importance of both the concepts.

The Weight of Evidence measures how well a set of groups or bins separates events from non-events. It is based on a simple ratio:

**(Distribution of Goods) / (Distribution of Bads)**

**Bad customers **refer to the customers who were the loan defaulters and **good customers** refer to the customers who paid back the loans.** Distribution of goods** is the percentage of good customers in a group and the **distribution of bads** is the percentage of bad customers in a group.

If the Distribution of Bads > Distribution of Goods in a group, the odds ratio will be less than 1, and if the Distribution of Bads < Distribution of Goods, the odds ratio will be more than 1.

Now WOE is calculated by taking the natural logarithm of the ratio of percentage of non-events to the percentage of events.

So, the WOE will be a negative number if the odds ratio is less than 1 and it will be positive if the odds ratio is more than 1.

For a **continuous variable**, we create bins (categories/groups) for a continuous independent variable and then combine the categories with similar WOE values and replace categories with WOE values.

For **categorical independent variables**, we directly combine categories with similar WOE and then replace the categories with continuous WOE values. This is done because the categories with similar WOE have almost the same proportion of events and non-events. In other words, the behavior of both the categories is the same.

Now, in general, 10 or 20 bins are taken, and each bin should have at least 5% of the observations and a non-zero count of both bad and good customers. The number of bins determines the amount of smoothing: the fewer the bins, the more the smoothing, so fewer bins capture the important patterns in the data while leaving out the noise. The WOE should be distinct for each category, and it should be either increasing or decreasing. If a group has zero events or non-events (so that the WOE is undefined), we add 0.5 to the number of events and non-events in that group.
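The WOE and IV calculations described above can be sketched in Python with pandas. The data below is a toy example ("good" = non-event, "bad" = event, following the article's convention):

```python
import numpy as np
import pandas as pd

# Toy data: one binned predictor and a binary default flag (1 = bad/event)
df = pd.DataFrame({
    'income_bin': ['low', 'low', 'mid', 'mid', 'mid', 'high', 'high', 'high',
                   'low', 'mid', 'high', 'low'],
    'default':    [1, 1, 0, 1, 0, 0, 0, 0, 1, 0, 1, 0],
})

# Per-bin counts of bads (events) and goods (non-events)
grp = df.groupby('income_bin')['default'].agg(bads='sum', total='count')
grp['goods'] = grp['total'] - grp['bads']

# Distributions, WOE = ln(% goods / % bads), and each bin's IV contribution
grp['pct_good'] = grp['goods'] / grp['goods'].sum()
grp['pct_bad'] = grp['bads'] / grp['bads'].sum()
grp['woe'] = np.log(grp['pct_good'] / grp['pct_bad'])
grp['iv'] = (grp['pct_good'] - grp['pct_bad']) * grp['woe']

iv_total = grp['iv'].sum()
print(grp[['woe', 'iv']])
print(f"Information Value = {iv_total:.4f}")
```

Note that every bin here has at least one good and one bad, so no 0.5 adjustment is needed.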

Benefits of the WOE transformation:

- Handles missing values
- Handles outliers
- The transformation is based on the logarithmic value of distributions. This is well suited for Logistic Regression.
- There is no need for dummy variables
- By using the proper binning technique, it can establish a monotonic relationship (either increase or decrease) between the independent and dependent variable
- IV value can be used to select variables quickly.

The Information Value (IV) of a predictor is related to the sum of the values for WoE over all groups. Thus, it expresses the amount of information of a predictor variable for separating the Goods from the Bads. It ranks the variables on the basis of their importance or the amount of information it carries. Information value increases as bins/groups increases for an independent variable.

Moreover, Information value should not be used in the classification model other than logistic regression (for eg. random forest or SVM) as it is designed for binary logistic regression model only. It is one of the most useful techniques to select important variables in a predictive model.

The formula for Information Value is: IV = Σ (% of non-events − % of events) × WOE, summed over all groups.

**install.packages("Information")**
**library (Information)**

The dataset is available on **Kaggle**.

**data <- read.csv ("creditcard.csv")**

We should make sure that all independent categorical variables are stored as a factor and the binary dependent variable has to be numeric before running IV and WOE.

**str(data)**

This creates WOE tables and IVs for all variables in the input data frame:

**IV <- create_infotables (data = data, y = "Class", bins = 10, parallel = FALSE)**

We can extract the IV values of the variables

**IV$Summary**

We save the IV values of each of the independent variables in a data frame.

**IV_Value <- data.frame(IV$Summary)**

We can also plot the WOE for various variables to see their trend. For example

**plot_infotables (IV, “Amount”)**

There are some other packages which can also be used to get the Information Values of the variables, such as the package **InformationValue**.

Details on Package **Click**

The functions used here are :

**WOE(X, Y)**
**WOETable(X, Y)**
**IV(X, Y)**

Here X is the categorical variable for which the IV is computed, and Y is the binary response variable which represents Good or Bad customers.

- Considers each variable’s independent contribution to the outcome.
- Detect linear and non-linear relationships.
- Rank variables in terms of “univariate” predictive strength.
- Visualize the correlations between the predictive variables and the binary outcome.
- Seamlessly compare the strength of continuous and categorical variables without creating dummy variables.
- Seamlessly handle missing values without imputation.
- Assess the predictive power of missing values.

To learn more on Statistics for Data Science: **Click**


The post Actuarial Science CM1 An Introductory Brief appeared first on StepUp Analytics.

Well, these are some of the basic calculations which actuaries do! And the subject which introduces you to such techniques is **CM1 – Actuarial Mathematics**.

CM1 – Actuarial Mathematics deals with the principles of modeling and mathematical techniques applied in actuarial work. It focuses particularly on deterministic models which can be used to model past and future cash flows, especially those which depend on death, survival, or other uncertain risks.

This is one of the most important subjects of Core Principles’ series which you must study with absolute perfection as its knowledge is a prerequisite for almost all other actuarial subjects.

Let’s go through its contents and see how to crack it!

CM1 is divided into two parts – CM1A and CM1B. CM1A is a 3 hours 15 minutes written examination taken at the examination center and CM1B is a 1 hour 45 minutes problem-based assessment using Excel which can be taken from any location as per the student’s convenience. Both the examinations will have to be taken in the same sitting.

While CM1A assesses your theoretical knowledge and understanding of the concepts, CM1B will assess the application of the theoretical knowledge using real sets of data. The weights assigned to CM1A and CM1B are 70% and 30% respectively. However, the outcome will be based on the single evaluation of CM1 by adding the respective weights of the two papers.

Before going further, let me tell you the variety of skills that can be tested in this exam – The approximate split of assessment of these skills is **20% Knowledge **(the detailed knowledge and understanding of the concept)**, 65% Application **(the practical application of the principles underlying the topic) and **15% Higher Order **(the ability to do higher-level analysis, performing the complex calculations, decision making and giving judgments on various situations).

**Data and basics of modeling (10%) –**Deals with data analysis and modeling (uses, benefits, limitations, and how and why the models are used). We will also study the general cash flow models and financial instruments.

**Theory of Interest rates (20%) –**Deals with the knowledge and understanding of the interest rates for different time periods and cash flows. It also describes the time value of money and various actuarial symbols for annuities. You will also study the concepts of duration, convexity, and immunization. This section is very basic but very important.

**The Equation of Value and its Applications (15%) –**The equation of value is one of the most important tools for solving various practical problems involving calculations of the price of financial instruments, premiums & reserves, and preparation of loan schedules.

**Single Decrement Models (10%) –**Here we gain an understanding of various assurance and annuity contracts and other types of insurance products. We will also study the various life table functions and their use.

**Multiple Decrement Models (10%) –**Here we study the insurance contracts involving multiple lives. We will also do the valuation of cash flows involving uncertainty.

**Pricing and Reserving (35%) –**It is one of the most scoring sections of CM1. It involves the calculation of premium and reserves under various scenarios and gives us a clear picture of why companies set reserves.

The study material for CM1B is not separate from CM1A, but the concepts given in CM1A will be applied in Excel based on real data. So, the main topics of this section are Loan Schedules, Project Appraisal (NPV, Payback period, IRR, etc.), Premium and Reserve calculation, and Profit testing, to name a few. However, other topics can also be tested in the CM1B exam. So you should make sure that you are thoroughly prepared for the assessment.
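To give a small taste of the equation-of-value style of calculation (a hypothetical example with assumed figures, not from the syllabus): the price of a financial instrument is the present value of its future cash flows. For a 3-year bond paying annual coupons of 5 per 100 nominal, redeemable at par, at an effective rate of interest i = 4%:

```python
# Equation of value for a hypothetical 3-year bond: price = sum of discounted cash flows
i = 0.04                  # effective annual interest rate (assumed)
v = 1 / (1 + i)           # discount factor per year

coupons = [5, 5, 5]       # annual coupon of 5 per 100 nominal (assumed)
redemption = 100          # redeemed at par after 3 years

price = sum(c * v ** (t + 1) for t, c in enumerate(coupons)) + redemption * v ** 3
print(f"Price per 100 nominal = {price:.2f}")  # about 102.78
```

The price exceeds 100 because the 5% coupon rate is higher than the 4% yield.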

*Subjects which should be studied before CM1:*

- CS1 – Actuarial Statistics

*Subjects which would require the knowledge of CM1:*

- CM2 – Financial Engineering and Loss Reserving
- CB1 – Business Finance
- CP1 – Actuarial Practice
- CP2 – Modelling Practice
- SP1 – Health and Care Principles
- SP2 – Life Insurance Principles
- SP4 – Pensions and Other Benefits Principles

So now if you have made up your mind to go for it, let’s see the overall study plan which you can follow:

- Start with the core reading, understand every concept carefully and think of how it can be applied in real life.
- For all the sections from which questions are expected to come in the CM1B exam, practice the given illustrations in Excel rather than doing them on paper.
- Make sure you attempt the questions in Excel as well when reading the course notes.
- The recommended study hours for CM1 is 250 hours. However, it depends on how committed you are towards the subject. Finish the core reading as soon as possible so that you may have ample time for revision.
- The course material does not usually contain practice questions. For this purpose, you can check some Revision Materials on the Actuarial Education Company website.
- Although there are no past papers of CM1, here are some practice modules which you can refer to.

This subject is very interesting and no doubt, there is a lot to learn in it. Also, the concepts introduced are highly important for your actuarial career; so instead of learning it by rote, understand the practical application of it by studying intensively.

This is all from my side.** I wish you the very best! Happy Studying!**


The post Logistic Regression in Python Using Pandas appeared first on StepUp Analytics.

I thought of starting a series in which I will implement various Machine Learning techniques using Python.

To start with today we will look at Logistic Regression in Python and I have used iPython Notebook.

Here is the data set used as part of this demo: **Download**

We will import the following libraries in Python

import pandas as pd  # for handling datasets
import statsmodels.api as sm  # for statistical modeling
import pylab as pl  # for plotting
import numpy as np  # for numerical computation

dfTrain = pd.read_csv('E:\\Sajid_Mac\\DataSciencePython\\Logistic Regression with StatsModels\\test.csv')
dfTest = pd.read_csv('E:\\Sajid_Mac\\DataSciencePython\\Logistic Regression with StatsModels\\train.csv')

print(dfTrain.head())

print(dfTest.tail())

# summarize the data
dfTrain.describe()

# take a look at the standard deviation of each column
dfTrain.std()

# frequency table cross-tabulating prestige and whether or not someone was admitted
pd.crosstab(dfTest['admit'], dfTest['prestige'], rownames=['admit'])

# explore data
dfTest.groupby('admit').mean()

# plot one column
dfTest['gpa'].hist()
pl.title('Histogram of GPA')
pl.xlabel('GPA')
pl.ylabel('Frequency')
pl.show()

# barplot of gre score grouped by admission status (True or False)
pd.crosstab(dfTest.gre, dfTest.admit.astype(bool)).plot(kind='bar')
pl.title('GRE score by Admission Status')
pl.xlabel('GRE score')
pl.ylabel('Frequency')
pl.show()

# dummify prestige
dummy_ranks = pd.get_dummies(dfTest['prestige'], prefix='prestige')
print(dummy_ranks.head())

# create a clean data frame for the regression (.loc replaces the deprecated .ix)
cols_to_keep = ['admit', 'gre', 'gpa']
data = dfTest[cols_to_keep].join(dummy_ranks.loc[:, 'prestige_good':])
print(data.head())

# manually add the intercept
data['intercept'] = 1.0

train_cols = data.columns[1:]
print(data.columns[1:])

# Logistic Regression
logit = sm.Logit(data['admit'], data[train_cols])

# fit the model
result = logit.fit()
print(result.summary())

Result will look something like this:

Optimization terminated successfully.
         Current function value: 0.567976
         Iterations 6
                           Logit Regression Results
==============================================================================
Dep. Variable:                  admit   No. Observations:                  300
Model:                          Logit   Df Residuals:                      295
Method:                           MLE   Df Model:                            4
Date:                Tue, 11 Apr 2017   Pseudo R-squ.:                 0.07857
Time:                        10:20:59   Log-Likelihood:                -170.39
converged:                       True   LL-Null:                       -184.92
                                        LLR p-value:                 7.604e-06
=====================================================================================
                        coef    std err          z      P>|z|      [95.0% Conf. Int.]
-------------------------------------------------------------------------------------
gre                   0.0020      0.001      1.649      0.099      -0.000       0.004
gpa                  -0.2227      0.229     -0.974      0.330      -0.671       0.225
prestige_good        -1.7587      0.394     -4.460      0.000      -2.532      -0.986
prestige_ok          -2.0917      0.463     -4.517      0.000      -2.999      -1.184
prestige_veryGood    -1.0220      0.352     -2.903      0.004      -1.712      -0.332
=====================================================================================

# recreate the dummy variables
dummy_ranks_test = pd.get_dummies(dfTest['prestige'], prefix='prestige')
print(dummy_ranks_test)

# create intercept column
dfTest['intercept'] = 1.0

# keep only what we need for making predictions (.loc replaces the deprecated .ix)
cols_to_keep = ['gre', 'gpa', 'prestige', 'intercept']
dfTest = dfTest[cols_to_keep].join(dummy_ranks_test.loc[:, 'prestige_good':])
dfTest.head()

# make predictions on the enumerated dataset
dfTest['admit_pred'] = result.predict(dfTest[train_cols])

# see probabilities
print(dfTest.head())

# convert probabilities to 'yes' / 'no'
dfTest['admit_yn'] = np.where(dfTest['admit_pred'] > 0.5, 'yes', 'no')
print(dfTest.head())

cols = ['gre', 'gpa', 'admit_yn']
dfTest[cols].groupby('admit_yn').mean()

                 gre       gpa
admit_yn
no        573.229572  3.328288
yes       696.279070  3.732558


dfTest.to_csv('E:\\Sajid_Mac\\DataSciencePython\\Logistic Regression with StatsModels\\output.csv', sep=',')

Download the entire iPython Notebook Here

