Introduction to survival analysis and the Kaplan-Meier estimator



This article was published as part of the Data Science Blogathon.


Survival analysis

Survival analysis is an important branch of statistics used to answer questions about how long it takes for an event of interest to occur.

A survival analysis study must define the time frame over which it is conducted, since the period in which the event can occur may or may not be the same for every subject. Survival analysis models time-to-event data, so the study must define both "time" and the "event" in its own context.

Survival analysis can be carried out in several ways, depending for example on how the groups are defined. Common tools include Kaplan-Meier curves, Cox regression models, the hazard function, and the survival function.

To compare the survival of two different groups, we perform the log-rank test.

To describe the effect of categorical and quantitative variables on survival, we use Cox proportional hazards regression, parametric survival models, etc.

Before proceeding, we need to define certain terms used in survival analysis, such as event, time, censoring, and survival function.

Event is the activity that is happening or will happen in the survival analysis study, such as the death of a person from a particular disease, the time to obtain a cure after a medical diagnosis, the time to heal with vaccines, the time until a machine fails on the shop floor, the time to disease onset, etc.


Time, in a survival analysis study, runs from the start of observation of the subject to the moment the event occurs. For a mechanical machine that may fail, for example, we need to know:

(a) the time at which the machine starts up,
(b) the time at which the machine fails,
(c) the time at which the machine is lost to the study or shut down.
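
As a minimal sketch of how those three pieces of information become survival data, the snippet below turns a start time, a failure time, and an exit time into a duration and an event flag. The function name `to_duration` and the example values are hypothetical and chosen only for illustration.

```python
from typing import Optional, Tuple


def to_duration(start: float,
                failure: Optional[float],
                exit_time: Optional[float]) -> Tuple[float, int]:
    """Return (duration, event_observed) for one machine.

    start     -- time at which observation of the machine begins
    failure   -- time of failure, or None if no failure was observed
    exit_time -- time the machine left the study (shutdown or loss),
                 or None if it never left before failing
    """
    # The failure counts as an observed event only if it happened
    # while the machine was still under observation.
    if failure is not None and (exit_time is None or failure <= exit_time):
        return failure - start, 1
    if exit_time is None:
        raise ValueError("need either a failure time or an exit time")
    # Otherwise we only know the machine survived until it left the study.
    return exit_time - start, 0


# Hypothetical machines: (start, failure, exit), all in hours.
machines = [(0, 120, None), (10, None, 200), (5, 80, 300)]
records = [to_duration(*m) for m in machines]
print(records)
```

An event flag of 1 means the failure was observed; 0 means the machine left the study before failing.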

Censoring / Censored observation

A subject is described as censored if the event defined for the study is not observed for that subject during the observation period. A censored subject may or may not experience the event after the end of the observation. The subject is called censored in the sense that nothing is observed or known about it after the censoring time.

There are three types of censoring:

1. Right censoring

Right censoring is the type encountered in many problems. It occurs when we are not sure what happened to subjects after a certain point in time.

It occurs when the true event time t is greater than the censoring time c, that is, c < t. This happens when some subjects cannot be followed for the entire study because they died, were lost to follow-up, or dropped out of the study.

2. Left censoring

Left censoring is when we are not sure what happened to subjects before a certain point in time. It is the opposite of right censoring: the true event time t is less than the censoring time c, that is, c > t.

3. Interval censoring

Interval censoring is when we know that the event happened within an interval (not before its start time and not after its end time) but we don't know exactly when within the interval.

Interval censoring combines left and right censoring: the event is known to have occurred between two time points.
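
One common way to hold all three types in a single representation is to store, for each subject, the lower and upper bound between which the event time is known to lie. The sketch below assumes time starts at 0 and uses infinity for "no upper bound"; the function name `censoring_type` and the sample observations are hypothetical.

```python
import math


def censoring_type(lower: float, upper: float) -> str:
    """Classify an observation whose event time lies in [lower, upper]."""
    if lower > upper:
        raise ValueError("lower bound must not exceed upper bound")
    if lower == upper:
        return "exact"              # event time observed exactly
    if math.isinf(upper):
        return "right-censored"     # event happens some time after `lower`
    if lower == 0:
        return "left-censored"      # event happened some time before `upper`
    return "interval-censored"      # event happened between the two bounds


# Hypothetical observations as (lower, upper) bounds on the event time.
observations = [(5.0, 5.0), (8.0, math.inf), (0.0, 3.0), (2.0, 6.0)]
labels = [censoring_type(lo, hi) for lo, hi in observations]
for obs, label in zip(observations, labels):
    print(obs, "->", label)
```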

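Survival function S(t)

The survival function S(t) = P(T > t) gives the probability that a subject survives beyond time t, where T is the time of the event. Here, we will discuss how to estimate it with the Kaplan-Meier estimator.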

Kaplan-Meier estimator

The Kaplan-Meier estimator is used to estimate the survival function from lifetime data. It is a non-parametric statistical technique, also known as the product-limit estimator. The idea is to estimate the probability of surviving beyond a given time, where the end of survival is marked by a significant event such as a major medical event, death, or machine failure.

There are many examples, such as:

1. The failure of machine parts after several hours of operation.

2. How long a COVID-19 vaccine takes to cure a patient.

3. How long it takes to obtain a cure after a medical diagnosis.

4. How many employees will leave the company within a specified period of time.

5. How many patients with lung cancer will be cured.

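To compute the Kaplan-Meier estimate, we estimate the survival function S(t) as a product over the observed event times t_i:

\hat{S}(t) = \prod_{i:\, t_i \le t} \left(1 - \frac{d_i}{n_i}\right)

where d_i is the number of death events at time t_i and n_i is the number of subjects still at risk (alive and not yet censored) just before t_i.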
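
The product-limit calculation can be sketched in plain Python: at each event time, the running survival estimate is multiplied by (1 - d/n), with d deaths among the n subjects still at risk. The function name `kaplan_meier` and the toy durations below are illustrative assumptions, not data from the study.

```python
from collections import Counter


def kaplan_meier(durations, events):
    """Return [(t, S(t))] at each distinct event time.

    durations -- observed time for each subject
    events    -- 1 if the event was observed at that time, 0 if censored
    """
    deaths = Counter(t for t, e in zip(durations, events) if e == 1)
    leaving = Counter(durations)   # every subject leaves the risk set at its own time
    at_risk = len(durations)
    survival = 1.0
    curve = []
    for t in sorted(leaving):
        d = deaths.get(t, 0)
        if d > 0:
            survival *= 1 - d / at_risk   # one factor of the product
            curve.append((t, survival))
        at_risk -= leaving[t]             # deaths and censored subjects both leave
    return curve


# Toy data: six subjects, two of them censored (event flag 0).
durations = [6, 7, 10, 15, 19, 25]
events = [1, 0, 1, 1, 0, 1]
curve = kaplan_meier(durations, events)
for t, s in curve:
    print(f"t={t:>2}  S(t)={s:.3f}")
```

Censored subjects never cause a drop in the curve, but they do shrink the risk set, which makes every later drop larger.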

Kaplan-Meier survival assumptions

In real-life cases, we do not know the true survival function. The Kaplan-Meier estimator therefore approximates the true survival function from the study data. It relies on three assumptions:

1) Survival probabilities are the same for all subjects, whether they joined the study late or early. Nothing that could affect survival is assumed to change over the course of the study.

2) Each event occurs at the time specified.

3) Censoring does not depend on the outcome: whether a subject is censored is unrelated to the outcome of interest.

To interpret a Kaplan-Meier plot: the Y-axis shows the probability that a subject has not yet experienced the event, and the X-axis shows the time survived so far. Every drop in the survival function (approximated by the Kaplan-Meier estimator) is caused by the event of interest occurring for at least one observation.

The plot is usually accompanied by confidence intervals that describe the uncertainty around the point estimates. Wider confidence intervals indicate higher uncertainty; this happens when only a few participants remain, which occurs as observations either die or are censored.
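
The article does not say how the intervals are computed. One standard choice is Greenwood's formula for the variance of the estimate, combined with a normal approximation; the sketch below assumes that choice, and statistical packages may use a transformed variant instead. The function name `km_with_greenwood` and the toy data are hypothetical.

```python
import math


def km_with_greenwood(durations, events, z=1.96):
    """Return [(t, S, lower, upper)] at each distinct event time.

    Uses Greenwood's formula, Var[S(t)] = S(t)^2 * sum d_i / (n_i * (n_i - d_i)),
    and a plain normal-approximation interval clipped to [0, 1].
    """
    at_risk = len(durations)
    survival, var_sum = 1.0, 0.0
    rows = []
    for t in sorted(set(durations)):
        d = sum(1 for u, e in zip(durations, events) if u == t and e == 1)
        leaving = sum(1 for u in durations if u == t)
        if d > 0:
            survival *= 1 - d / at_risk
            if at_risk > d:
                var_sum += d / (at_risk * (at_risk - d))
                se = survival * math.sqrt(var_sum)
            else:
                se = 0.0   # everyone at risk died: S(t) = 0, term undefined
            rows.append((t,
                         survival,
                         max(0.0, survival - z * se),
                         min(1.0, survival + z * se)))
        at_risk -= leaving
    return rows


# Toy data: six subjects, two of them censored (event flag 0).
durations = [6, 7, 10, 15, 19, 25]
events = [1, 0, 1, 1, 0, 1]
rows = km_with_greenwood(durations, events)
for t, s, lo, hi in rows:
    print(f"t={t:>2}  S={s:.3f}  95% CI=({lo:.3f}, {hi:.3f})")
```

Each term added to the Greenwood sum grows as the number at risk shrinks, which is why the intervals widen toward the right of the plot.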

Important aspects to take into account when analyzing with the Kaplan-Meier estimator

1) We need to perform the log-rank test to make any kind of inference.

2) Kaplan-Meier results can easily be biased, because Kaplan-Meier is a univariate approach to the problem.

3) Removing censored data changes the shape of the curve and biases the fit of the model.

4) Statistical tests and observations become misleading if a continuous variable is dichotomized.

5) When dichotomizing, we use a statistical measure such as the median to create groups, but this can lead to problems in the dataset.
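
To make the first point concrete, here is a minimal plain-Python sketch of the two-group log-rank test. At each event time it compares the deaths observed in group A with the number expected if both groups shared the same survival, then refers the resulting statistic to a chi-square distribution with one degree of freedom. The function name `logrank_chi2` and the toy groups are hypothetical; in practice a library routine such as `lifelines.statistics.logrank_test` would normally be used.

```python
import math


def logrank_chi2(durations_a, events_a, durations_b, events_b):
    """Two-group log-rank test; returns (chi-square statistic, p-value)."""
    all_durations = list(durations_a) + list(durations_b)
    all_events = list(events_a) + list(events_b)
    event_times = sorted({t for t, e in zip(all_durations, all_events) if e == 1})

    obs_minus_exp = 0.0
    variance = 0.0
    for t in event_times:
        n_a = sum(1 for u in durations_a if u >= t)   # at risk in A
        n_b = sum(1 for u in durations_b if u >= t)   # at risk in B
        d_a = sum(1 for u, e in zip(durations_a, events_a) if u == t and e == 1)
        d_b = sum(1 for u, e in zip(durations_b, events_b) if u == t and e == 1)
        n, d = n_a + n_b, d_a + d_b
        obs_minus_exp += d_a - d * n_a / n            # observed minus expected in A
        if n > 1:
            variance += n_a * n_b * d * (n - d) / (n * n * (n - 1))

    chi2 = obs_minus_exp ** 2 / variance
    p_value = math.erfc(math.sqrt(chi2 / 2))          # chi-square tail, 1 df
    return chi2, p_value


# Toy groups for illustration only; event flag 1 = event, 0 = censored.
a_durations, a_events = [6, 7, 10, 15, 19, 25], [1, 0, 1, 1, 0, 1]
b_durations, b_events = [20, 28, 32, 40, 45, 50], [1, 1, 0, 1, 1, 0]

chi2, p_value = logrank_chi2(a_durations, a_events, b_durations, b_events)
print(f"chi2={chi2:.3f}  p={p_value:.4f}")
```

A small p-value suggests the two survival curves differ by more than chance alone would explain.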

Let's walk through an example in Python.

Link to the notebook: (

Let's import the libraries needed to work in Python.


First, we import the Python libraries we need. Here we use a lung cancer dataset. After importing the libraries, we read the data using the pandas library. The dataset contains the following information:

Treatment: 1 = standard, 2 = test
Cell type: 1 = squamous, 2 = small cell, 3 = adeno, 4 = large
Survival in days
Status: 1 = dead, 0 = censored
Karnofsky score (a measure of overall performance, 100 = best)
Months from diagnosis
Age in years
Prior therapy: 0 = no, 10 = yes

Here we look at the head and the tail of the dataset.


Next, we import the Python class that implements the Kaplan-Meier estimator.


Here, we perform the analysis on the Karnofsky score. The x-axis represents the timeline and the y-axis shows the estimated survival probability, which ranges from 1 (best) down to 0 (worst).

We then repeat the Kaplan-Meier analysis for survival time, prior therapy, and treatment.

To do so, we create kmf1 = KaplanMeierFitter() and fit it to the different variables of the lung cancer data.


After running the code, the Kaplan-Meier plot compares the survival curves of the standard treatment and the test treatment.


In this article, my main goal was to explain survival analysis with the Kaplan-Meier estimator, the concepts related to it, and how it applies to real-life problems.

Advantages and disadvantages of the Kaplan-Meier estimator

Advantages

1) It doesn't require many features; only the time to the event is needed.

2) It provides an average overview related to the event.

Disadvantages

1) Multiple variables cannot be related to survival and controlled for simultaneously.

2) If censored data are removed, the fitted model will be biased.

3) It cannot adequately estimate the magnitude of the change in the event.
