A Titanic Dataset


In data analytics and data science circles, this is a very famous dataset. Many people who are learning to become data professionals work with this dataset. This post is just a short introduction to the Titanic dataset.

Titanic Project

I have a post here called Titanic Logistic Regression. It has a fair amount of exploratory data analysis.

There is a Udemy online course called Python for Data Science and Machine Learning Bootcamp that works with this Titanic dataset. The course includes both a training set and a testing set.

This post is just designed to get us familiar with the dataset by performing some exploratory data analysis (EDA) in Python. For more EDA, see the Titanic Logistic Regression post mentioned above, which uses the Titanic dataset in a logistic regression model.

With this dataset, the task is a prediction. We are trying to predict whether or not a passenger would survive the Titanic disaster based on their features, such as cabin class or the price of their ticket. Presumably, the more money you paid for your ticket, the richer you were and the more likely you were to survive.

Kaggle has a Titanic dataset called Titanic – Machine Learning from Disaster. Click on Data, the second option in the horizontal menu just under the title, and scroll down until you see test.csv and train.csv. Download them to your local machine if you want, and put them in your project folder. I’m using Anaconda Navigator, so on Windows the location is the C drive, Users, and a particular user. The code below reads files named titanic_train.csv and titanic_test.csv, so adjust the file names to match whatever you downloaded. Then you could use the following lines of Python to import the data.

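# Libraries for numerical work, data frames, and plotting
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Read the training and testing sets from the current working directory
df = pd.read_csv("titanic_train.csv")
df_test = pd.read_csv("titanic_test.csv")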
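
If the CSV files are not in the working directory, the folder can be passed in explicitly. What follows is only a minimal sketch under a few assumptions: load_titanic and its data_dir argument are hypothetical names introduced here for illustration, and the default file names are simply the ones used in the code above.

```python
from pathlib import Path

import pandas as pd


def load_titanic(data_dir, train_name="titanic_train.csv", test_name="titanic_test.csv"):
    """Hypothetical helper: read the training and testing CSVs from data_dir."""
    data_dir = Path(data_dir)
    paths = [data_dir / train_name, data_dir / test_name]

    # Fail early with a readable message if either file is missing
    missing = [str(p) for p in paths if not p.is_file()]
    if missing:
        raise FileNotFoundError("CSV file(s) not found: " + ", ".join(missing))

    return pd.read_csv(paths[0]), pd.read_csv(paths[1])
```

For example, df, df_test = load_titanic(Path.home()) would look in the current user's home folder, which on Windows is the folder under C drive, Users. For the Kaggle downloads, pass train_name="train.csv" and test_name="test.csv".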

  • PassengerId – passenger id
  • Survived – 0 = No, 1 = Yes
  • Pclass – Ticket class 1 = 1st, 2 = 2nd, 3 = 3rd
  • Name – name
  • Sex – male or female
  • Age – Age in years
  • SibSp – # of siblings / spouses aboard the Titanic
  • Parch – # of parents / children aboard the Titanic
  • Ticket – Ticket number
  • Fare – Passenger’s fare
  • Cabin – Cabin number
  • Embarked – Port of Embarkation C = Cherbourg, Q = Queenstown, S = Southampton

df.head()
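
Beyond df.head(), a few more calls give a quick feel for the data. The sketch below is an illustration only. The small sample frame is made-up data that borrows a few of the column names listed above so the block runs on its own, and quick_summary is a hypothetical helper name. With the real data, you would pass df instead of sample.

```python
import pandas as pd

# Made-up rows for illustration only; these are NOT real Titanic records.
sample = pd.DataFrame({
    "Survived": [1, 0, 1, 0, 0, 1],
    "Pclass": [1, 3, 1, 3, 2, 2],
    "Sex": ["female", "male", "female", "male", "male", "female"],
    "Age": [29.0, None, 35.0, 22.0, None, 14.0],
    "Fare": [80.0, 7.25, 53.1, 8.05, 13.0, 30.0],
})


def quick_summary(frame):
    """Hypothetical helper: collect a few basic EDA results in a dict."""
    return {
        # (number of rows, number of columns)
        "shape": frame.shape,
        # Count of missing values in each column
        "missing": frame.isnull().sum(),
        # Mean of the 0/1 Survived column per ticket class = survival rate
        "survival_by_class": frame.groupby("Pclass")["Survived"].mean(),
    }


summary = quick_summary(sample)
print(summary["shape"])
print(summary["missing"])
print(summary["survival_by_class"])
```

Because Survived is coded 0 or 1, its mean within each Pclass group is the share of passengers in that class who survived. Running quick_summary(df) on the real training set is one simple way to start checking whether ticket class relates to survival.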
