Pandas is widely used for data science, data analysis and machine learning.
Here I will show some tips on using Pandas to process the data set for the Titanic Kaggle Competition.
The pandas module provides many useful methods and functions.
First, use pandas.read_csv to read a comma-separated values (CSV) file into a DataFrame.
import pandas as pd

train_data = pd.read_csv('/kaggle/input/titanic/train.csv')
test_data = pd.read_csv('/kaggle/input/titanic/test.csv')

When you want to check the contents of the data, use pandas.DataFrame.head, which returns the first 5 rows of the DataFrame.
test_data.head()
   PassengerId  Pclass                                          Name     Sex   Age  SibSp  Parch   Ticket     Fare Cabin Embarked
0          892       3                              Kelly, Mr. James    male  34.5      0      0   330911   7.8292   NaN        Q
1          893       3              Wilkes, Mrs. James (Ellen Needs)  female  47.0      1      0   363272   7.0000   NaN        S
2          894       2                     Myles, Mr. Thomas Francis    male  62.0      0      0   240276   9.6875   NaN        Q
3          895       3                              Wirz, Mr. Albert    male  27.0      0      0   315154   8.6625   NaN        S
4          896       3  Hirvonen, Mrs. Alexander (Helga E Lindqvist)  female  22.0      1      1  3101298  12.2875   NaN        S

With the info() function you can check whether there are missing values. Looking at each column, some columns have a smaller non-null count than the total number of rows.
test_data.info()
RangeIndex: 418 entries, 0 to 417
Data columns (total 11 columns):
 #   Column       Non-Null Count  Dtype
---  ------       --------------  -----
 0   PassengerId  418 non-null    int64
 1   Pclass       418 non-null    int64
 2   Name         418 non-null    object
 3   Sex          418 non-null    object
 4   Age          332 non-null    float64
 5   SibSp        418 non-null    int64
 6   Parch        418 non-null    int64
 7   Ticket       418 non-null    object
 8   Fare         417 non-null    float64
 9   Cabin        91 non-null     object
 10  Embarked     418 non-null    object
dtypes: float64(2), int64(4), object(5)
memory usage: 36.0+ KB
# Count the number of missing values for each column
train_data.isnull().sum()
PassengerId      0
Survived         0
Pclass           0
Name             0
Sex              0
Age            177
SibSp            0
Parch            0
Ticket           0
Fare             0
Cabin          687
Embarked         2
dtype: int64
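If you prefer the share of missing values rather than the raw counts, isnull().mean() gives the fraction of missing entries per column. This is a minimal sketch, not part of the original notebook; the variable name missing_ratio is just illustrative.

# Fraction of missing values per column, largest first
missing_ratio = train_data.isnull().mean().sort_values(ascending=False)
print(missing_ratio.head())  # Cabin and Age have the largest shares in the train data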
# Find data types for each column
train_data.dtypes
PassengerId      int64
Survived       float64
Pclass           int64
Name            object
Sex             object
Age            float64
SibSp            int64
Parch            int64
Ticket          object
Fare           float64
Cabin           object
Embarked        object
dtype: object

Calculate the survival rate over all the passengers in the train data.
print("% of passengers who survived:", train_data.Survived.mean()) # % of passengers who survived: 0.3838383838383838The survival rate for women
women = train_data[train_data.Sex == 'female'].Survived
print("% of women who survived:", women.mean())
# % of women who survived: 0.7420382165605095

The survival rate for men:
men = train_data[train_data.Sex == 'male'].Survived
print("% of men who survived:", men.mean())
# % of men who survived: 0.18890814558058924

Or you can get both at once with groupby:
train_data.Survived.groupby(train_data.Sex).agg(['mean','count'])
            mean  count
Sex
female  0.742038    314
male    0.188908    577
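The same pattern extends to more than one grouping column; for example, grouping by both Sex and Pclass shows how passenger class interacts with the survival rate. This is only a sketch of the idea, not output from the original notebook, and survival_by_group is an illustrative name.

# Survival rate and group size by sex and passenger class
survival_by_group = train_data.groupby(['Sex', 'Pclass']).Survived.agg(['mean', 'count'])
print(survival_by_group)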
Fill missing values
To fill the missing values I used DataFrame.fillna.

DataFrame.fillna(value=None, method=None, axis=None, inplace=False, limit=None, downcast=None)
# Fill missing values of the 'Fare' column with the mean fare
mean_fare = train_data.Fare.mean()
print('Mean fare:', mean_fare)
train_data.fillna({'Fare': mean_fare}, inplace=True)

Note that you should set the inplace parameter to True if you want to overwrite the DataFrame; the default is False.
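The dict passed to fillna can also cover several columns at once. As an illustration only (not applied in the rest of this post, so the Age NaN counts in the binning section below stay as shown), the two missing Embarked values could be filled with the most frequent port; most_common_port and train_data_filled are hypothetical names.

# Illustration only: fill the missing 'Embarked' values with the most frequent port
most_common_port = train_data.Embarked.mode()[0]
train_data_filled = train_data.fillna({'Embarked': most_common_port})  # assign instead of inplace to keep train_data unchanged
print(train_data_filled.Embarked.isnull().sum())  # 0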
Discretization function
The discretization function is helpful for converting a set of continuous numerical values into a set of categorical values.

pandas.cut(x, bins, right=True, labels=None, retbins=False, precision=3, include_lowest=False, duplicates='raise')

Use cut when you need to segment and sort data values into bins. When bins is an integer, each bin is equal-width across the range of the data values.
print('Youngest:', train_data.Age.min())
print('Oldest:', train_data.Age.max())
print('-'*30)

bins = 4  # The number of bins
cut_data = pd.cut(train_data.Age, bins=bins, precision=0)
print(cut_data.value_counts(sort=False, dropna=False))
print('-'*30)
Youngest: 0.42
Oldest: 80.0
------------------------------
(0.0, 20.0]     179
(20.0, 40.0]    385
(40.0, 60.0]    128
(60.0, 80.0]     22
NaN             177
Name: Age, dtype: int64
------------------------------

When you need to bin numeric data into ordinal categorized bins, set the labels parameter to False. The returned bins are then labeled by ordinal numbers. If the data include NaN, those NaN values must be replaced by some other number.
bins = 4  # The number of bins
cut_data = pd.cut(train_data.Age, bins=bins, labels=False, precision=0)
print(cut_data.value_counts(sort=False, dropna=False))
print('-'*30)

cut_data.fillna(bins, inplace=True)
# cut_data = cut_data.astype(int)
print(cut_data.value_counts(sort=False, dropna=False))
print('-'*30)
1.0    385
NaN    177
2.0    128
0.0    179
3.0     22
Name: Age, dtype: int64
------------------------------
1.0    385
4.0    177
2.0    128
0.0    179
3.0     22
Name: Age, dtype: int64
------------------------------
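For comparison, pandas also provides qcut, which splits the values into bins holding roughly equal numbers of observations instead of equal-width ranges. A minimal sketch on the same Age column, not part of the original workflow; NaN values are simply left as NaN here.

# Quantile-based binning: each bin holds roughly the same number of passengers
qcut_data = pd.qcut(train_data.Age, q=4, labels=False)
print(qcut_data.value_counts(sort=False, dropna=False))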