The dataset I picked was the Titanic dataset from Kaggle. The outcome variable I used was “survived”, a binary variable indicating whether or not a passenger survived. The features I chose for prediction were passenger class, sex, age, and fare. Using these features and a decision tree built in a Jupyter notebook, I looked for patterns in the features that predict whether a passenger survived or died. I chose 15 as the best value for the minimum split, because the smaller the minimum split is, the more complex the tree becomes and the more likely it is to overfit. The goal is for accuracy on both the training set and the validation set to be high, so the tree predicts the outcome variable correctly on data it has not seen.
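The setup above can be sketched with scikit-learn's `DecisionTreeClassifier`, whose `min_samples_split` parameter corresponds to the minimum split described. This is a minimal sketch, not the original notebook: the Kaggle file isn't bundled here, so a small synthetic stand-in with the same columns (`pclass`, `sex`, `age`, `fare`, `survived`) is generated for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the Kaggle Titanic data (illustrative values only).
rng = np.random.default_rng(0)
n = 200
df = pd.DataFrame({
    "pclass": rng.integers(1, 4, n),
    "sex": rng.integers(0, 2, n),          # 0 = male, 1 = female (already encoded)
    "age": rng.uniform(1, 80, n).round(1),
    "fare": rng.uniform(5, 100, n).round(2),
})
# Inject a simple pattern so the tree has signal to find.
df["survived"] = ((df["sex"] == 1) | ((df["age"] < 18) & (df["fare"] > 20))).astype(int)

X = df[["pclass", "sex", "age", "fare"]]
y = df["survived"]
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.25, random_state=0)

# min_samples_split=15: a node must contain at least 15 samples before it may
# split, which limits tree complexity compared with the default of 2.
tree = DecisionTreeClassifier(min_samples_split=15, random_state=0)
tree.fit(X_train, y_train)
print(round(tree.score(X_train, y_train), 3), round(tree.score(X_val, y_val), 3))
```

Comparing the two printed accuracies is how the trade-off in the paragraph shows up in practice: a very small `min_samples_split` tends to push training accuracy up while validation accuracy lags behind.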
A few insights that the decision tree provided were:
- Being female could increase your chance of survival
- Paying more for your fare could increase your chance of survival
- Younger males had a better chance of survival
The nodes with the highest survival probability are nodes 16 and 5, each with a 100% probability of survival. This means the passengers most likely to survive are:
- Females who paid more than $33.85 for their fare
- Males younger than 36.25 who paid between $7.85 and $27.07 for their fare
The nodes with the lowest survival probability are nodes 3, 13, and 10, each with a 0% probability of survival. This means the passengers least likely to survive are:
- Males younger than 36.25 who paid less than $7.85 for their fare
- Males older than 60.5
- Males between the ages of 36.85 and 43
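Leaf-level rules like the ones listed above can be read directly off a fitted tree with scikit-learn's `export_text`, which prints every split threshold and the class reached at each leaf. This sketch uses a tiny made-up dataset (columns `sex`, `age`, `fare`), not the actual Titanic tree, so the printed thresholds will differ from the node values quoted in the text.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier, export_text

# Tiny illustrative dataset: columns are [sex, age, fare]; sex: 0 = male, 1 = female.
X = np.array([
    [1, 29.0, 40.0], [1, 50.0, 80.0], [0, 20.0, 10.0], [0, 20.0, 5.0],
    [0, 65.0, 30.0], [0, 40.0, 15.0], [1, 8.0, 20.0], [0, 12.0, 9.0],
])
y = np.array([1, 1, 1, 0, 0, 0, 1, 1])

tree = DecisionTreeClassifier(random_state=0).fit(X, y)
# Each leaf line shows the predicted class; walking from the root to a leaf
# recovers a rule such as "sex = female AND fare > 33.85 -> survived".
print(export_text(tree, feature_names=["sex", "age", "fare"]))
```

Leaves whose samples all share one class are the 100% and 0% nodes described above; `tree.predict_proba` gives the same per-node class probabilities numerically.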