# How to do cox(PH) regression modelling using R?

**Definition:** **Cox regression** (or proportional hazards regression) is a method for investigatingthe effects of several variable upon the time a specified event takes to happen. In the context of an outcome such as death this is known as Cox regression for survival analysis. The Cox model hazard function calculates the hazard at time t of a subject, adjusted for possible explanatory variables. Cox regression does not assume any particular “survival model” but it is not truly non-parametric.

#### Why it is not truly non-parametric?

Because cox regression assume that the effects of the predictor variables upon survival are constant over time and are additive in one scale.

The formula is expressed as the product of the baseline hazard function of time and an exponential function of covariates. The baseline hazard is an unspecified form of the Cox model and the distribution of the outcome (survival

time) is unknown. This makes the Cox (PH) regression a semi-parametric model. The semi-parametric property of the Cox (PH) model makes it a robust model which can closely approximate parametric models

If the below assumptions of Cox regression are met, this function will provide better estimates of survival probabilities and cumulative hazard than those provided by the Kaplan-Meier function.

#### Assumptions of cox regression model:

- Hazard and Hazard-ratios.
- Time-dependent and covariates.
- Model analysis and deviance.
- Survival and commulative hazard rates.

##### Hazard- ratios(HR):

The Cox PH assumption states that the Hazard-ratios(HR) for any two individuals in the same study is constant over time.

In other words, the hazard for a subject is proportional to the hazard for another subject in the same study where the proportionality constant, say b is independent of time.

When the HR vary with time, for example where hazards cross or when time varying confounding variables are present the PH assumption maybe violated, making it inappropriate to use the Cox PH model. Where the Cox PH assumption is not met, variations of the Cox model can be used, for example the extended Cox regression or the stratified Cox regression depending on the context.

##### Time-dependent and Fixed covariates:

When individuals are followed over time, the values of covariates may change with time.

Covariates can thus be divided into fixed and time-dependent.

A covariate is time dependent if the difference between its values for two different subjects changes with time(e.g. blood pressure,cholestrol,smoking).

A covariate is fixed if its values can not change with time(e.g. gender or ethnicity).

##### Model Analysis and deviance:

What do you mean by “model analysis”? Yes, you guess it right, its the total analysis of your model; e.g. statistical significance( it covers all the hypothesis testing like t-test, chi-square, fisher, at all). Here the likelihood of chi-square model is calculated by comparing the deviance.

Deviance is minus twice the log of the likelihood ratio for models fitted by maximumlikelihood (-2*log(likelihood)).

The value of adding a parameter to a Cox model is tested by subtracting the deviance of the model with the new parameter from the deviance of the model without the new parameter, the difference is then tested against a chi-square distribution with degrees of freedom equal to the difference between the degrees of freedom of the old and new models.

##### Survival and cumulative hazards rate:

The survival function and the cumulative hazard function are calculated relative to the baseline (lowest value of covariates) at each time point.

1 2 3 |
<strong>If you have binary/dichotomous predictors in your model you are given the option to calculate survival and cumulative hazards for each variable separately.</strong> |

#### Lets discuss only the assumptions of Kaplan-Meier functions:

**Censored(*)**individuals have the same prospect of survival as those who continue to be followed. This can not be tested for and can lead to a**bias(**)**that artificially reduces S.- Survival prospects are the same for early as for late recruits to the study (can be tested for).
- The event studied (e.g. death) happens at the specified time. Late recording of the event studied will cause artificial inflation of S.

* Censorship in survival-time (time-to-event, failure-time) studies refers to incomplete data.

** Bias is a systematic error that leads to an incorrect estimate of effect or association.

#### I will explain the code which is a inbuilt in R

#### R-code for cox(ph) regression model:

first install or import package “mgcv”. Get the description of mgcv package.

1 2 3 4 5 |
library(mgcv) library(survival) ## to load the data |

‘**mgcv**‘ is an package, which has **gam**(Generalised Additive Model), This is similar to **glm. **For more details on ‘mgcv’ package, please follow the steps

1 2 |
open R -> type ??mgcv |

This, the above command will open help page from R documents.

We are using the survival data from ‘**survival**‘ package, which is an inbuilt in R. Now Move on

1 2 3 4 |
col1 <- colon[colon$etype==1,] ## concentrate on single event col1$differ <- as.factor(col1$differ) col1$sex <- as.factor(col1$sex) |

Changing the characteristics of differ and sex variable for our smoothness.

1 2 3 |
model <- gam(time~s(age,by=sex)+sex+s(nodes)+perfor+rx+obstruct+adhere, family=cox.ph(),data=col1,weights=status) |

Here(above) is the model formula for gam

1 2 |
summary(model) |

Summary of the above model looks like this

Let’s take a look at the plot effects of model

1 2 |
plot(model,pages=1,all.terms=TRUE) |

Here is the plot effect of the given data

Now, till now you got the idea about data’s behaviour or what kind of data is what we have.

Now lets’ work for survival plot for particular object here in this data set object is patient and we are calculating the survival plot for patient ‘**j**‘:

Below is the function for survival analysis.

1 2 3 4 5 6 7 8 9 10 |
numP <- 300;j <- 6 newd <- data.frame(time=seq(0,3000,length=numP)) dname <- names(col1) for (n in dname) newd[[n]] <- rep(col1[[n]][j],numP) newd$time <- seq(0,3000,length=numP) fv <- predict(model,newdata=newd,type="response",se=TRUE) plot(newd$time,fv$fit,type="l",ylim=c(0,1),xlab="time",ylab="survival") lines(newd$time,fv$fit+2*fv$se.fit,col=2) lines(newd$time,fv$fit-2*fv$se.fit,col=2) |

Crude plot of baseline survival

1 2 3 4 5 6 |
plot(model$family$data$tr,exp(-b$family$data$h),type="l",ylim=c(0,1), xlab="time",ylab="survival") lines(model$family$data$tr,exp(-model$family$data$h + 2*model$family$data$q^.5),col=2) lines(model$family$data$tr,exp(-model$family$data$h - 2*model$family$data$q^.5),col=2) lines(model$family$data$tr,exp(-model$family$data$km),lty=2) |

Here is the plot for above code

If you want me to explain the graphics please shoot me a mail to irrfankhann29@gmail.com. I will get back to you.

Please give your feedback in comments or shoot me a mail

Source:Click Here.