
Exploring Patterns of Survival from the Titanic Dataset

Introduction

The Titanic shipwreck was a major historical incident that shaped how we view human survival during disasters. Even a century later, this tragic incident still offers valuable insights and lessons.

The RMS Titanic was one of the largest and most luxurious ships of its time. It was nicknamed “The Unsinkable” by its proud makers. On April 10th, 1912, it set out on its first journey from England to New York. The Titanic carried people of all classes, the wealthy and the poor. It was commanded by senior captain Edward John Smith. During its voyage, the Titanic received multiple warnings of ice on the Atlantic, which made it change its course twice. But four days into its voyage, on April 14th, it collided with a huge iceberg, and the slow sinking of this luxurious ship began. The ship sent radio signals to nearby ships for help, but only one of them responded. The captain ordered the passengers to be evacuated. According to the protocol, the women and children were to be evacuated first using the lifeboats available on the ship. But as we will see in our explorations, things did not play out exactly that way. Certain other factors also played a role in determining the survival of the passengers aboard. Some groups of people appear to have been more likely to survive than others, and this is what we will explore in this article.

The sinking of this “Unsinkable” ship caused the death of 1502 out of 2224 of its passengers and crew.

The Project

The Titanic dataset is very beginner-friendly, which is why it is widely used as a starting point in data science learning. Not only does it provide interesting patterns for data analytics, but it also retains its value by combining historical context with real human decision-making under crisis conditions.

In this article, we will do an exploratory data analysis of the Titanic dataset. We will see what the data looks like, which attributes are at play, and how these attributes affected the survival of the passengers. This is a beginner-friendly tutorial that requires a basic understanding of Python fundamentals, importing libraries, and using their functions for data analysis. By combining data storytelling and pattern recognition, this article adds to previous articles and projects on the dataset with insights into how social inequality, evacuation behavior, and family structure influenced survival outcomes.

The Dataset

In this tutorial, we will access the Titanic dataset and use Python pandas, matplotlib, and seaborn to explore how different factors played a role in the survival of the passengers. Let us download and load the data so that it is accessible in our code.

You can get the dataset from the GitHub link.

Loading the Dataset

Once you have the data URL, you can load it as a pandas dataframe. We will have to install and import pandas for this. Pandas is a powerful Python library for data analysis and manipulation. If it is not already installed in your Python environment, install it from the terminal through pip as follows:

pip install pandas

Once the installation is complete, import the library in your Python file by aliasing it as pd:

import pandas as pd

Next, read the data using the pandas read_csv function. Make sure you add the URL as follows:

url = "

df = pd.read_csv(url)

This will load the file as a pandas dataframe in the variable “df”. We will do all the data analysis and exploration using this dataframe. Let us take a first look at the data using the head() function, which returns the first 5 rows of the dataframe by default:

print(df.head())
df.head() (Image by Author)

We can also use the pandas iloc[0] indexer to print the first row of the dataframe. The output lists every column name next to that row's value, which gives us a compact view of all the attributes:

print(df.iloc[0])
df.iloc[0] (Image by Author)

Here we can see the first 5 rows of the dataset, along with the column names. As can be seen in the image above, the dataset has the following attributes:

  1. PassengerId: the id of the passenger, a numerical value that identifies each passenger
  2. Survived: whether the passenger survived the shipwreck (1) or not (0)
  3. Pclass: the ticket class of the passenger (1, 2, or 3)
  4. Name: the name of the passenger, with appropriate titles
  5. Sex: the gender of the passenger
  6. Age: the age of the passenger in years
  7. SibSp: the number of siblings or spouses on board
  8. Parch: the number of parents or children on board
  9. Ticket: the ticket number of the passenger
  10. Fare: the ticket price
  11. Cabin: the cabin number of the passenger
  12. Embarked: the port where the passenger embarked (C = Cherbourg, Q = Queenstown, S = Southampton)
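The attribute list above can also be checked in code. The following is a minimal sketch, assuming a small hand-made dataframe named sample (its rows are invented for illustration and are not real passenger records) with only some of the columns. With the real data, the same two calls would be run on df.

```python
import pandas as pd

# Toy rows invented for illustration; not real passenger records.
sample = pd.DataFrame({
    "PassengerId": [1, 2, 3, 4],
    "Survived": [0, 1, 1, 0],
    "Pclass": [3, 1, 3, 2],
    "Sex": ["male", "female", "female", "male"],
    "Age": [22.0, 38.0, None, 35.0],
    "Fare": [7.25, 71.28, 7.92, 13.0],
    "Cabin": [None, "C85", None, None],
})

# Column names as a plain Python list
columns = sample.columns.tolist()
print(columns)

# Number of missing values in each column
missing = sample.isna().sum()
print(missing)
```

Counting missing values early is useful because columns such as Age and Cabin may be incomplete, which affects how later results should be read.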

As can be seen above, several of these attributes are of interest to us in determining whether a person survived the Titanic or not. Attributes such as the name and ticket number do not seem likely to influence the survival of passengers. To get a clear view, let us do some data analysis to find out how the different attributes influence survival, individually and in combination.

Data Analysis

Before we formally start the data analysis, let us install/import the relevant Python libraries.

The first one is Matplotlib. This library offers visualization features for data. We will plot graphs using this library. The second one is Seaborn. Seaborn is a Python data visualization library based on matplotlib, and allows us to create visuals, plots, and figures based on the data. Let us install and import these into our Python file.

pip install matplotlib
pip install seaborn

Now import these into the main Python file with aliases, just as we did with the pandas library.

import seaborn as sns
import matplotlib.pyplot as plt

Now, let us see how different attributes affected survival:

Describing the Dataset

First, let us get a general overview of the data. We will use the describe() function for this. We have also added a pd.set_option call so that pandas prints every column instead of truncating the output.

pd.set_option('display.max_columns', None)
print(df.describe())
describe() function (Image by Author)

As we can see in the image above, the describe() function gives a statistical summary of the numeric columns of the dataset using metrics like count, mean, standard deviation, etc. The useful information here is:

  • There are a total of 891 passenger entries (from count = 891)
  • The survival rate is 38% (from the mean of Survived = 0.38)
  • Most passengers belonged to 3rd class (the mean of Pclass is 2.3, and its median is 3)
  • Some of the passengers’ age data is missing (the count of Age is lower than the number of entries)
  • The passengers were mostly young (the mean age = 29.7)
  • The youngest passenger was 0.4 years old (less than 6 months), and the oldest was 80 years old
  • The average ticket price was around £32.20 (mean fare)
  • Ticket prices varied enormously (high standard deviation for the fare = 49.69)
  • There was massive economic inequality: the fare for some was 0, and for others as high as £512
  • Age quartiles: 25% were younger than 20, half were younger than 28, and 75% were younger than 38

Now that we have these general data insights, let us dive deeper into a more detailed analysis.
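Two of the points above rest on how describe() treats missing values and quartiles. The following is a small sketch of both, assuming a toy ages series (the values are invented for illustration, not taken from the dataset):

```python
import pandas as pd

# Toy ages invented for illustration; None marks a missing value
ages = pd.Series([4.0, 22.0, None, 35.0, 58.0, None, 28.0], name="Age")

total_rows = len(ages)      # counts every row, missing or not
non_missing = ages.count()  # what describe() reports as "count"
print(total_rows, non_missing, total_rows - non_missing)

# The quartiles that describe() labels 25%, 50%, and 75%
quartiles = ages.quantile([0.25, 0.5, 0.75])
print(quartiles)
```

When the count of a column is lower than the number of rows, the difference is the number of missing entries, and the remaining statistics for that column are computed from the non-missing values only.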

Survival Facts

First, let us do some general survival analysis:

survival_counts = df['Survived'].value_counts()
print(survival_counts)
Survival Facts (Image by Author)
plt.figure(figsize=(6,4))

sns.countplot(
    x='Survived',
    data=df
)

plt.title("Titanic Survival Distribution")
plt.show()
Titanic Survival Distribution (Image by Author)

We looked at the Survived attribute and found a count of 549 for 0 (did not survive) and 342 for 1 (survived). This is a 38% survival rate, matching what we saw earlier from the describe() function. Now, let us move on to the factors that affected this survival.
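The step from counts to a rate can be made explicit in code. Here is a minimal sketch, assuming a toy dataframe with invented 0/1 outcomes rather than the real passenger data:

```python
import pandas as pd

# Toy outcomes invented for illustration (1 = survived, 0 = did not survive)
toy = pd.DataFrame({"Survived": [0, 1, 1, 0, 0, 0, 1, 0]})

# Share of each outcome instead of raw counts
shares = toy["Survived"].value_counts(normalize=True)
print(shares)

# For a 0/1 column, the mean equals the share of 1s
survival_rate = toy["Survived"].mean()
print(survival_rate)
```

This is why the mean of the Survived column can be read directly as the survival rate.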

Survival by Gender

Let us see how this survival rate was influenced by gender. Did one gender have an edge in survival over the other? We know the priorities were women and children, but what exactly does the data show?


gender_survival = pd.crosstab(
    df['Sex'],
    df['Survived'],
    normalize='index'
)

print(gender_survival)


plt.figure(figsize=(6,4))

sns.barplot(
    x='Sex',
    y='Survived',
    data=df
)

plt.title("Survival Rate by Gender")

plt.ylabel("Survival Rate")

plt.show()
Survival Rate by Gender (Image by Author)
Survival Rate by Gender (Image by Author)

As can be seen from both the report and the plot above, the survival rate for men was only about 19%, whereas about 74% of the women survived the shipwreck.
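The crosstab report and the bar plot show the same quantity: the mean of Survived within each group. A short sketch of that equivalence, assuming a toy dataframe with invented rows:

```python
import pandas as pd

# Toy rows invented for illustration; not real passenger records.
toy = pd.DataFrame({
    "Sex": ["male", "male", "male", "male",
            "female", "female", "female", "female"],
    "Survived": [0, 0, 0, 1, 1, 1, 1, 0],
})

# Survival rate per group: the mean of the 0/1 column within each group
by_sex = toy.groupby("Sex")["Survived"].mean()
print(by_sex)

# Row-normalized crosstab: column 1 holds the same rates
table = pd.crosstab(toy["Sex"], toy["Survived"], normalize="index")
print(table)
```

Here normalize='index' divides each row by its own total, so each row of the crosstab sums to 1.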

Survival by Passenger Class

Now, let us analyse how passengers from different classes survived the incident.

class_survival = pd.crosstab(
    df['Pclass'],
    df['Survived'],
    normalize='index'
)
print(class_survival)

plt.figure(figsize=(7,5))

sns.barplot(
    x='Pclass',
    y='Survived',
    data=df
)

plt.title("Survival Rate by Passenger Class")

plt.xlabel("Passenger Class")

plt.ylabel("Survival Rate")

plt.show()
Survival by Passenger Class (Image by Author)
Survival by Passenger Class (Image by Author)

As can be seen from the report and plot above, about 63% of the passengers from the 1st class survived, compared with 47% from the 2nd class and only 24% from the 3rd class. We can infer from this very basic plot that first-class passengers, who paid heavily for the ship’s luxuries, had a higher chance of survival. The data alone cannot tell us why, but it is consistent with first-class passengers being favored over the other two classes.
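Class and gender can also be looked at together. The following is a hedged sketch using pivot_table on a toy dataframe (the rows are invented for illustration, so the resulting rates say nothing about the real dataset):

```python
import pandas as pd

# Toy rows invented for illustration; not real passenger records.
toy = pd.DataFrame({
    "Pclass": [1, 1, 1, 1, 3, 3, 3, 3],
    "Sex": ["female", "female", "male", "male",
            "female", "female", "male", "male"],
    "Survived": [1, 1, 1, 0, 1, 0, 0, 0],
})

# Survival rate for every (class, gender) combination
rates = toy.pivot_table(
    index="Pclass",
    columns="Sex",
    values="Survived",
    aggfunc="mean",
)
print(rates)
```

Running the same call on df would show whether the class effect holds within each gender or is partly explained by the gender mix of each class.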

Survival by Age

Let us see how passengers of different ages survived. Did children have a higher chance of survival?

plt.figure(figsize=(10,6))

sns.histplot(
    data=df,
    x='Age',
    hue='Survived',
    bins=30,
    multiple='stack',
    alpha=0.6
)

plt.title("Age Distribution by Survival")

plt.show()
Age Distribution by Survival (Image by Author)

From this stacked histogram, we can draw several meaningful insights about how age is related to survival on the Titanic.

  • Most passengers on board were young adults between 20 and 30 years old
  • Children younger than 10 show a higher share of survivors, with the orange stack being bigger than the blue one
  • Adult non-survivors dominate the dataset, with the non-survivor portions of the bars between 20 and 40 being the biggest
  • Survival declines in the older age groups; this may be due to elderly passengers facing age-related challenges during evacuation
  • The non-survivor portions of the bars dominate most age ranges, meaning that more passengers died than survived overall, in line with the overall survival rate of approximately 38%

To summarize, survival on the Titanic favored younger passengers, while young adults accounted for the largest number of deaths.
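A stacked histogram shows counts, so survival rates per age group have to be judged by eye. One way to compute them directly is to bin the ages with pd.cut. This is a sketch only: the age bands and the toy rows below are assumptions made for illustration, not choices from the analysis above.

```python
import pandas as pd

# Toy rows invented for illustration; not real passenger records.
toy = pd.DataFrame({
    "Age": [2, 8, 25, 30, 34, 45, 62, 70],
    "Survived": [1, 1, 0, 1, 0, 0, 0, 1],
})

# Right-inclusive bands: (0, 10], (10, 40], (40, 80]
toy["AgeBand"] = pd.cut(
    toy["Age"],
    bins=[0, 10, 40, 80],
    labels=["0-10", "10-40", "40-80"],
)

# Survival rate within each age band
rate_by_band = toy.groupby("AgeBand", observed=True)["Survived"].mean()
print(rate_by_band)
```

Rows with a missing age fall outside every band and are left out of the grouped result.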

Children Priority

Were the children actually prioritized? Let us answer that with some analytics:

df['IsChild'] = df['Age'] < 16
child_survival = pd.crosstab(
    df['IsChild'],
    df['Survived'],
    normalize='index'
)

print(child_survival)

sns.barplot(
    x='IsChild',
    y='Survived',
    data=df
)

plt.title("Child vs Adult Survival")
plt.show()
Child vs Adult Survival (Image by Author)
Child Priority (Image by Author)

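As can be seen above, around 59% of the children (passengers younger than 16) survived, well above the overall rate of 38%, which suggests that children were indeed prioritized. Note that a missing age fails the Age < 16 comparison, so passengers whose age is unknown are counted as adults here.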
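Missing ages deserve some care in this comparison. The sketch below, which assumes a toy dataframe with invented rows, shows how a plain comparison treats missing values and one possible alternative that restricts the comparison to passengers whose age is known:

```python
import pandas as pd

# Toy rows invented for illustration; None marks a missing age
toy = pd.DataFrame({
    "Age": [5.0, 12.0, None, 30.0, None, 41.0],
    "Survived": [1, 1, 1, 0, 0, 0],
})

# Plain comparison: a missing age compares as False,
# so those rows silently land in the "adult" group
naive = toy["Age"] < 16
print(naive.tolist())

# Alternative: keep only rows with a known age before labelling
known = toy.dropna(subset=["Age"]).copy()
known["Group"] = (known["Age"] < 16).map({True: "child", False: "adult"})
rates = known.groupby("Group")["Survived"].mean()
print(rates)
```

Which approach is appropriate depends on the question being asked; dropping rows with unknown ages gives a cleaner child versus adult comparison at the cost of a smaller sample.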

Now let us analyse how family size impacted survival.

Family Size Analysis

The family size attribute depends on two different attributes of the dataset: SibSp and Parch. SibSp is the number of siblings and spouses of the passenger on board, whereas Parch is the number of parents and children of the passenger on board.

Let us see how the family size affected survival:

df['FamilySize'] = (
    df['SibSp'] + df['Parch'] + 1
)
plt.figure(figsize=(10,6))

sns.barplot(
    x='FamilySize',
    y='Survived',
    data=df
)

plt.title("Survival Rate by Family Size")
plt.show()
Survival Rate by Family Size (Image by Author)

The plot above shows how survival probability changed depending on the number of family members traveling together on the Titanic. The code is simple: it adds the number of siblings/spouses and parents/children, plus one for the passenger themselves, to get the family size. The y-axis of the plot represents the survival probability, so each bar shows the share of passengers with a particular family size who survived. We can see from the bar chart above that:

  • Passengers traveling alone had lower survival, possibly because they had less social support, no assistance during evacuation, or lower priority compared to families
  • Small families with sizes of 2, 3, and 4 had the highest survival rates, possibly because they helped each other during evacuation, stayed coordinated, and received priority in lifeboat boarding
  • Very large families with a family size greater than 6 had lower survival rates, possibly due to difficulty in coordinating evacuation and families refusing to separate on lifeboats

As we can see, survival was not linearly related to family size; moderately sized families had the highest survival rates.
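Because some family sizes are rare, it helps to read each rate next to the number of passengers behind it. A minimal sketch, assuming a toy dataframe with invented rows:

```python
import pandas as pd

# Toy rows invented for illustration; not real passenger records.
toy = pd.DataFrame({
    "SibSp": [0, 0, 1, 1, 0, 4, 4, 0],
    "Parch": [0, 0, 0, 1, 2, 2, 2, 0],
    "Survived": [0, 1, 1, 1, 1, 0, 0, 0],
})

# Family size = siblings/spouses + parents/children + the passenger
toy["FamilySize"] = toy["SibSp"] + toy["Parch"] + 1

# Survival rate and group size for each family size
summary = toy.groupby("FamilySize")["Survived"].agg(
    rate="mean",
    passengers="count",
)
print(summary)
```

A rate computed from only a handful of passengers is much less reliable than one computed from hundreds, which is also what the error bars on the seaborn bar plot indicate.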

Survival by Fare Paid

Lastly, let us see how the ticket price affected survival. We can analyse this using a violin plot as below:

plt.figure(figsize=(12,6))

sns.violinplot(
    data=df,
    x='Survived',
    y='Fare',
    inner='quartile'
)

plt.xticks(
    [0,1],
    ['Did Not Survive', 'Survived']
)

plt.title(
    "Ticket Fare Distribution by Survival"
)

plt.ylabel("Fare Paid")

plt.show()
Ticket Fare Distribution (Image by Author)

The violin plot shows a clear relationship between ticket fare and survival on the Titanic. Survivors generally paid higher fares, while most non-survivors were concentrated in lower fare ranges. This suggests that first-class and wealthier passengers had a significant survival advantage, likely due to better cabin locations and easier access to lifeboats. However, the overlap between the two groups also indicates that wealth alone did not determine survival, as factors like gender, age, and evacuation timing also played important roles.
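The visual impression from a violin plot can be backed by a simple numeric summary, such as the median fare in each group. This is a sketch on a toy dataframe with invented fares, so the numbers are illustrative only:

```python
import pandas as pd

# Toy rows invented for illustration; not real passenger records.
toy = pd.DataFrame({
    "Fare": [7.0, 8.0, 13.0, 26.0, 8.0, 30.0, 80.0, 150.0],
    "Survived": [0, 0, 0, 0, 1, 1, 1, 1],
})

# Median fare for non-survivors (0) and survivors (1)
median_fare = toy.groupby("Survived")["Fare"].median()
print(median_fare)
```

The median is used here rather than the mean because fares are heavily skewed, and a few very expensive tickets can pull the mean upward.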

Concluding the Findings

We now know that certain factors, such as being female, being a child, belonging to the first class, and having a moderate family size, played a role in a passenger’s survival. Let us combine these features and look at the resulting survival rate.


# CREATE FEATURES

# Child column
df['IsChild'] = df['Age'] < 16

# Family size column
df['FamilySize'] = (
    df['SibSp'] + df['Parch'] + 1
)

# Moderate family size
df['ModerateFamily'] = (
    (df['FamilySize'] >= 2) &
    (df['FamilySize'] <= 4)
)

# Combine all favorable conditions
combined_condition = (
    (df['Sex'] == 'female') &
    (df['Pclass'] == 1) &
    df['ModerateFamily']
) | df['IsChild']

# Create a new category column
df['HighSurvivalGroup'] = combined_condition


# PLOT SURVIVAL RATE

plt.figure(figsize=(8,5))

sns.barplot(
    data=df,
    x='HighSurvivalGroup',
    y='Survived'
)

plt.xticks(
    [0,1],
    ['Other Passengers', 'High Survival Group']
)

plt.ylabel("Survival Rate")

plt.title(
    "Survival Rate Based on Combined Passenger Factors"
)

plt.show()
Survival Rate based on Combined Preferred Factors

The above code combines the favourable circumstances for survival into one group, made up of first-class women with a moderate family size together with all children, and compares these passengers with everyone else. As can be seen from the graph, the “High Survival Group” had a much higher survival rate.
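To read exact numbers rather than bar heights, the same grouping can be summarized with groupby. The sketch below rebuilds the flags on a toy dataframe (the rows are invented for illustration), so its rates are not the rates of the real dataset:

```python
import pandas as pd

# Toy rows invented for illustration; not real passenger records.
toy = pd.DataFrame({
    "Sex": ["female", "female", "male", "male",
            "male", "female", "male", "female"],
    "Pclass": [1, 1, 3, 3, 2, 3, 1, 1],
    "Age": [30, 45, 8, 25, 40, 14, 50, 22],
    "SibSp": [1, 0, 1, 0, 1, 4, 0, 1],
    "Parch": [0, 0, 2, 0, 0, 2, 0, 1],
    "Survived": [1, 1, 1, 0, 0, 0, 0, 1],
})

toy["IsChild"] = toy["Age"] < 16
toy["FamilySize"] = toy["SibSp"] + toy["Parch"] + 1
toy["ModerateFamily"] = toy["FamilySize"].between(2, 4)

# First-class women with a moderate family size, or any child
toy["HighSurvivalGroup"] = (
    (toy["Sex"] == "female") & (toy["Pclass"] == 1) & toy["ModerateFamily"]
) | toy["IsChild"]

# Survival rate inside and outside the group
rates = toy.groupby("HighSurvivalGroup")["Survived"].mean()
print(rates)
```

Note that the condition is an "or" between two subgroups, so every child is included regardless of gender, class, or family size.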

Conclusion

In this article, we have analyzed the Titanic dataset using pandas, matplotlib, and seaborn. This is an easy and beginner-friendly tutorial for understanding how we can interpret data, plot graphs, and gather insights from them. From the above findings, we can group certain features as being favourable to survival. Moreover, these findings can also help us build a machine learning model for predicting the survival of the Titanic passengers.
