Data Science Solution Using Titanic Dataset

Overview

The sinking of the Titanic is one of the most infamous shipwrecks in history. On 15 April 1912, during her maiden voyage, the Titanic sank after colliding with an iceberg. About 1,500 people were killed in the disaster. One of the reasons the shipwreck led to such loss of life was that there were not enough lifeboats for the passengers and crew. Although there was some element of luck involved in surviving the sinking, some groups of people were more likely to survive than others, such as women, children, and the upper class.

Workflow

The basic workflow to solve any data science problem is as follows:

  1. Identify the problem
  2. Acquire test and training data
  3. Clean the data
  4. Analyze the data
  5. Model, predict, and solve the problem
  6. Visualize the data and come up with a solution

Here, however, our goal is to get a generalized prediction as fast as possible. That does not mean we skip exploratory data analysis (EDA) altogether.

Before you begin, I recommend reading about the Random Forest algorithm first, since that is the only algorithm used in this tutorial. You can also implement the solution with other algorithms, such as logistic regression or decision trees.

Python Code
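A baseline model can be set up roughly as follows. This is a minimal sketch assuming the Kaggle Titanic train.csv, a small hand-picked feature set, median imputation, and cross-validated ROC AUC (the C-statistic) as the score; the exact preprocessing and evaluation behind the numbers in this post may differ.

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Load the Kaggle Titanic training data (file name/path is an assumption)
train = pd.read_csv("train.csv")

# Minimal preprocessing: encode Sex numerically, fill missing values with the median
train["Sex"] = train["Sex"].map({"male": 0, "female": 1})
train["Age"] = train["Age"].fillna(train["Age"].median())
train["Fare"] = train["Fare"].fillna(train["Fare"].median())

features = ["Pclass", "Sex", "Age", "Fare", "SibSp", "Parch"]
X = train[features]
y = train["Survived"]

# Baseline random forest, scored with ROC AUC (the C-statistic)
model = RandomForestClassifier(n_estimators=10, random_state=42)
scores = cross_val_score(model, X, y, cv=5, scoring="roc_auc")
print("Baseline C-stat:", scores.mean())
```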

A higher value for n_estimators increases the number of trees in the forest, which generally improves the prediction score.

Now we have a benchmark that can be improved further.

Parameter Tuning

The following parameters can be tuned to improve the model (a configuration example follows the list):

  1. n_estimators = number of trees in the forest
  2. max_features = number of features considered for the best split
  3. min_samples_leaf = minimum number of samples in newly created leaves
  4. n_jobs = number of processors used to train and test the model in parallel
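All of these are constructor arguments of scikit-learn's RandomForestClassifier. For reference, a configuration might look like this (the values shown are illustrative, not the tuned ones):

```python
from sklearn.ensemble import RandomForestClassifier

# Illustrative configuration; the tuned values are found in the sections below
model = RandomForestClassifier(
    n_estimators=100,     # number of trees in the forest
    max_features="sqrt",  # features considered when looking for the best split
    min_samples_leaf=1,   # minimum number of samples in a leaf node
    n_jobs=-1,            # use all available processors
)
```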

Finding The Optimal Number Of Trees
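One way to obtain the numbers below is to loop over candidate tree counts and report the cross-validated ROC AUC (C-statistic) for each. A minimal sketch, assuming the same evaluation setup as above (the exact setup used for the reported figures is an assumption):

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# X, y as prepared above
for n in [10, 20, 50, 100, 150, 200]:
    model = RandomForestClassifier(n_estimators=n, n_jobs=-1, random_state=42)
    score = cross_val_score(model, X, y, cv=5, scoring="roc_auc").mean()
    print(n, "trees, C-stat:", score)
```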

  Trees    C-stat
  10       0.8274933691240853
  20       0.8562218387498801
  50       0.8620644659615038
  100      0.8641256298000618
  150      0.8635770513107297
  200      0.8650230616005709
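Training time is also worth checking. The two timings below are in the format of IPython's %timeit output; they could come from a comparison such as n_jobs=1 versus n_jobs=-1 (which settings actually produced these numbers is an assumption). A sketch of how such a comparison could be run in a notebook:

```python
from sklearn.ensemble import RandomForestClassifier

# Run in a Jupyter/IPython notebook; X, y as prepared above
model_single = RandomForestClassifier(n_estimators=200, n_jobs=1, random_state=42)
%timeit model_single.fit(X, y)

model_parallel = RandomForestClassifier(n_estimators=200, n_jobs=-1, random_state=42)
%timeit model_parallel.fit(X, y)
```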

661 ms ± 87.4 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

920 ms ± 188 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
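Finding The Optimal max_features

The max_features parameter can be tested in the same way. A sketch of the sweep, using the same assumed cross-validated ROC AUC evaluation (note that the string "auto" is only accepted by older scikit-learn versions):

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# X, y as prepared above; 'auto' requires an older scikit-learn release
for option in ["auto", "sqrt", "log2", None, 0.2, 0.9]:
    model = RandomForestClassifier(
        n_estimators=200, max_features=option, n_jobs=-1, random_state=42
    )
    score = cross_val_score(model, X, y, cv=5, scoring="roc_auc").mean()
    print(option, "option, C-stat:", score)
```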

  max_features    C-stat
  auto            0.8650230616005709
  sqrt            0.8665516249640495
  log2            0.8673851447075491
  None            0.8650230616005709
  0.2             0.8639365566314086
  0.9             0.8643573110067215
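Finding The Optimal min_samples_leaf

Finally, min_samples_leaf can be swept from 1 to 10. A sketch, assuming the best max_features found above is kept fixed during this sweep (that choice is an assumption):

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# X, y as prepared above; max_features fixed to the best value found so far
for leaf in range(1, 11):
    model = RandomForestClassifier(
        n_estimators=200, max_features="log2",
        min_samples_leaf=leaf, n_jobs=-1, random_state=42,
    )
    score = cross_val_score(model, X, y, cv=5, scoring="roc_auc").mean()
    print(leaf, "min_samples_leaf, C-stat:", score)
```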

  min_samples_leaf    C-stat
  1                   0.8673851447075491
  2                   0.8620831069781316
  3                   0.8419135269868661
  4                   0.8385048839463566
  5                   0.8371147967063985
  6                   0.8389096603074169
  7                   0.8339564758891764
  8                   0.8334292014188476
  9                   0.8322201983404169
  10                  0.8326249747014774
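Putting the best-performing settings from the sweeps above together gives the final score reported below. The configuration here (200 trees, max_features="log2", min_samples_leaf=1) is assumed from the best values in the tables, not confirmed from the original code:

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# X, y as prepared above; assumed final configuration
final_model = RandomForestClassifier(
    n_estimators=200, max_features="log2", min_samples_leaf=1,
    n_jobs=-1, random_state=42,
)
score = cross_val_score(final_model, X, y, cv=5, scoring="roc_auc").mean()
print("C-stat :", score)
```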

C-stat : 0.8673851447075491

Conclusion

As you can see, the prediction score improved from 0.73 to 0.86, which is a fair improvement.

You can also evaluate the prediction accuracy of other machine learning algorithms and compare the results to select the best one.
