Pandas is widely used for data science, data analysis and machine learning.
Here I will show some tips on using Pandas to process the data set for the Titanic Kaggle Competition.
The pandas module provides many useful methods and functions.
First, use pandas.read_csv to read a comma-separated values (CSV) file into a DataFrame.
import pandas as pd

train_data = pd.read_csv('/kaggle/input/titanic/train.csv')
test_data = pd.read_csv('/kaggle/input/titanic/test.csv')

When you want to check the contents of the data, use pandas.DataFrame.head, which returns the first 5 rows of the DataFrame.
test_data.head()
   PassengerId  Pclass                                          Name     Sex   Age  SibSp  Parch   Ticket     Fare Cabin Embarked
0          892       3                              Kelly, Mr. James    male  34.5      0      0   330911   7.8292   NaN        Q
1          893       3              Wilkes, Mrs. James (Ellen Needs)  female  47.0      1      0   363272   7.0000   NaN        S
2          894       2                     Myles, Mr. Thomas Francis    male  62.0      0      0   240276   9.6875   NaN        Q
3          895       3                              Wirz, Mr. Albert    male  27.0      0      0   315154   8.6625   NaN        S
4          896       3  Hirvonen, Mrs. Alexander (Helga E Lindqvist)  female  22.0      1      1  3101298  12.2875   NaN        S

With the info() function you can check whether there are missing values. Looking at each column, some columns have a smaller non-null count than the total number of rows.
test_data.info()
RangeIndex: 418 entries, 0 to 417
Data columns (total 11 columns):
 #   Column       Non-Null Count  Dtype
---  ------       --------------  -----
 0   PassengerId  418 non-null    int64
 1   Pclass       418 non-null    int64
 2   Name         418 non-null    object
 3   Sex          418 non-null    object
 4   Age          332 non-null    float64
 5   SibSp        418 non-null    int64
 6   Parch        418 non-null    int64
 7   Ticket       418 non-null    object
 8   Fare         417 non-null    float64
 9   Cabin        91 non-null     object
 10  Embarked     418 non-null    object
dtypes: float64(2), int64(4), object(5)
memory usage: 36.0+ KB
# Count the number of missing values for each column
train_data.isnull().sum()
PassengerId      0
Survived         0
Pclass           0
Name             0
Sex              0
Age            177
SibSp            0
Parch            0
Ticket           0
Fare             0
Cabin          687
Embarked         2
dtype: int64
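If you prefer the share of missing values rather than the raw counts, isnull().mean() gives the fraction of missing entries per column. This is a minimal sketch, not part of the original notebook; the variable name missing_ratio is just illustrative.

# Fraction of missing values per column, largest first
missing_ratio = train_data.isnull().mean().sort_values(ascending=False)
print(missing_ratio.head())  # Cabin and Age have the largest shares in the train data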
# Find data types for each column
train_data.dtypes
PassengerId      int64
Survived       float64
Pclass           int64
Name            object
Sex             object
Age            float64
SibSp            int64
Parch            int64
Ticket          object
Fare           float64
Cabin           object
Embarked        object
dtype: object

Calculate the survival rate over all the passengers in the train data.
print("% of passengers who survived:", train_data.Survived.mean()) # % of passengers who survived: 0.3838383838383838The survival rate for women
women = train_data[train_data.Sex == 'female'].Survived
print("% of women who survived:", women.mean())
# % of women who survived: 0.7420382165605095

The survival rate for men:
men = train_data[train_data.Sex == 'male'].Survived
print("% of men who survived:", men.mean())
# % of men who survived: 0.18890814558058924

Or you can get both at once with groupby:
train_data.Survived.groupby(train_data.Sex).agg(['mean','count'])
            mean  count
Sex
female  0.742038    314
male    0.188908    577
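The same pattern extends to more than one grouping column; for example, grouping by both Sex and Pclass shows how passenger class interacts with the survival rate. This is only a sketch of the idea, not output from the original notebook, and survival_by_group is an illustrative name.

# Survival rate and group size by sex and passenger class
survival_by_group = train_data.groupby(['Sex', 'Pclass']).Survived.agg(['mean', 'count'])
print(survival_by_group)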
Fill missing values
To fill the missing values I used DataFrame.fillna.

DataFrame.fillna(value=None, method=None, axis=None, inplace=False, limit=None, downcast=None)
# Fill missing values of the 'Fare' column with the mean fare
mean_fare = train_data.Fare.mean()
print('Mean fare:', mean_fare)
train_data.fillna({'Fare': mean_fare}, inplace=True)

Note that you should set the inplace parameter to True if you want to overwrite the DataFrame; the default is False.
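The dict passed to fillna can also cover several columns at once. As an illustration only (not applied in the rest of this post, so the Age NaN counts in the binning section below stay as shown), the two missing Embarked values could be filled with the most frequent port; most_common_port and train_data_filled are hypothetical names.

# Illustration only: fill the missing 'Embarked' values with the most frequent port
most_common_port = train_data.Embarked.mode()[0]
train_data_filled = train_data.fillna({'Embarked': most_common_port})  # assign instead of inplace to keep train_data unchanged
print(train_data_filled.Embarked.isnull().sum())  # 0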
Discretization function
The discretization function is helpful for converting a set of continuous numerical values into a set of categorical values.

pandas.cut(x, bins, right=True, labels=None, retbins=False, precision=3, include_lowest=False, duplicates='raise')

Use cut when you need to segment and sort data values into bins. When bins is an integer, each bin is equal-width across the range of the data values.
print('Youngest:', train_data.Age.min())
print('Oldest:', train_data.Age.max())
print('-'*30)

bins = 4  # The number of bins
cut_data = pd.cut(train_data.Age, bins=bins, precision=0)
print(cut_data.value_counts(sort=False, dropna=False))
print('-'*30)
Youngest: 0.42
Oldest: 80.0
------------------------------
(0.0, 20.0]     179
(20.0, 40.0]    385
(40.0, 60.0]    128
(60.0, 80.0]     22
NaN             177
Name: Age, dtype: int64
------------------------------

When you need to bin numeric data into ordinal categorized bins, set the labels parameter to False. The returned bins are then labeled by ordinal numbers. If the data include NaN, those NaN values must be replaced by some other number.
bins = 4  # The number of bins
cut_data = pd.cut(train_data.Age, bins=bins, labels=False, precision=0)
print(cut_data.value_counts(sort=False, dropna=False))
print('-'*30)

cut_data.fillna(bins, inplace=True)
# cut_data = cut_data.astype(int)
print(cut_data.value_counts(sort=False, dropna=False))
print('-'*30)
1.0    385
NaN    177
2.0    128
0.0    179
3.0     22
Name: Age, dtype: int64
------------------------------
1.0    385
4.0    177
2.0    128
0.0    179
3.0     22
Name: Age, dtype: int64
------------------------------
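For comparison, pandas also provides qcut, which splits the values into bins holding roughly equal numbers of observations instead of equal-width ranges. A minimal sketch on the same Age column, not part of the original workflow; NaN values are simply left as NaN here.

# Quantile-based binning: each bin holds roughly the same number of passengers
qcut_data = pd.qcut(train_data.Age, q=4, labels=False)
print(qcut_data.value_counts(sort=False, dropna=False))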