This post records an introductory machine-learning exercise: simple binary classification with a decision tree, tested on the classic Titanic competition from Kaggle.
1. Dataset

The data is Kaggle's Titanic dataset, which consists of:

- Training set: train.csv
- Test set: test.csv
- Submission template: gender_submission.csv

Since reaching Kaggle may require a proxy in some regions, the raw dataset has already been downloaded and placed on GitHub.
2. Data Processing

First, load the training set and take a look at the data:
```python
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import cross_val_score, train_test_split, GridSearchCV
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

data = pd.read_csv('/Users/liz/code/jupyter-notebook/sklearn/1-DecisionTree/Titanic_train.csv')
data.head()
```

Out:
|   | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.2500 | NaN | S |
| 1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th… | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
| 2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.9250 | NaN | S |
| 3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1000 | C123 | S |
| 4 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.0500 | NaN | S |
As the output shows, we will treat Survived as the label and the remaining columns as features. The goal: predict the label from the known features. In practical terms, this dataset lets us predict whether a passenger survived.
```python
data.info()
```

Out:

```
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
PassengerId    891 non-null int64
Survived       891 non-null int64
Pclass         891 non-null int64
Name           891 non-null object
Sex            891 non-null object
Age            714 non-null float64
SibSp          891 non-null int64
Parch          891 non-null int64
Ticket         891 non-null object
Fare           891 non-null float64
Cabin          204 non-null object
Embarked       889 non-null object
dtypes: float64(2), int64(5), object(5)
memory usage: 83.7+ KB
```
Data analysis
The output shows 891 rows in total. Features with missing values: Age, Cabin, Embarked. Non-numeric features: Name, Sex, Ticket, Cabin, Embarked.
When predicting survival from the available features, we can drop features that are awkward to process and unlikely to matter. Here, Name and Ticket are dropped: in practice, a passenger's name and ticket number have little bearing on survival. A further reason is that both are non-numeric and would be cumbersome to convert into numbers (a model ultimately consumes data in numeric form).

Cabin has too many missing values (only 204 of 891 present), so it is dropped on the same grounds.

Sex is also a string feature, but in practice gender plausibly affects the chance of survival, so it is kept.

The remaining steps: fill the missing values and convert the non-numeric features to numeric ones.
```python
# Drop the features we decided not to use
data.drop(['Name', 'Cabin', 'Ticket'], inplace=True, axis=1)

# Fill missing ages with the (truncated) mean age
data.loc[:, 'Age'] = data['Age'].fillna(int(data['Age'].mean()))

# Drop the two rows whose Embarked value is missing, then re-index
data = data.dropna()
data = data.reset_index(drop=True)

# Encode Sex as 0/1 and Embarked as a category index
data['Sex'] = (data['Sex'] == 'male').astype(int)
tags = data['Embarked'].unique().tolist()
data['Embarked'] = data['Embarked'].apply(lambda x: tags.index(x))

data.info()
```

Out:

```
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 889 entries, 0 to 888
Data columns (total 9 columns):
PassengerId    889 non-null int64
Survived       889 non-null int64
Pclass         889 non-null int64
Sex            889 non-null int64
Age            889 non-null float64
SibSp          889 non-null int64
Parch          889 non-null int64
Fare           889 non-null float64
Embarked       889 non-null int64
dtypes: float64(2), int64(7)
memory usage: 62.6 KB
```

```python
# Split features and label
x = data.iloc[:, data.columns != 'Survived']
y = data.iloc[:, data.columns == 'Survived']
```
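One caveat: mapping Embarked to the index of each port's first appearance imposes an arbitrary ordering on the three categories. A tree model usually tolerates this, but one-hot encoding avoids it entirely. A minimal sketch (an alternative I am adding, not part of the original code; it would replace the `tags` index mapping above, while Embarked still holds its original 'S'/'C'/'Q' values):

```python
# Sketch: one-hot encode Embarked instead of index-encoding it.
# Each port becomes its own 0/1 column, so no artificial ordering is introduced.
data_onehot = pd.get_dummies(data, columns=['Embarked'])
```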
Model training

The plan: use cross-validation to evaluate the model, and use grid search to find good values for the common decision-tree hyperparameters.
```python
# Hold out part of the data for evaluation. The original imports train_test_split
# but the split itself was not shown; this is one reasonable choice of arguments.
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.3, random_state=30)

# Candidate hyperparameter values for the grid search
parameters = {'splitter': ('best', 'random')
              , 'criterion': ('gini', 'entropy')
              , 'max_depth': [*range(1, 10)]
              , 'min_samples_leaf': [*range(1, 50, 5)]
              , 'min_impurity_decrease': [*np.linspace(0, 0.5, 20)]
              }

clf = DecisionTreeClassifier(random_state=30)
GS = GridSearchCV(clf, parameters, cv=10)   # 10-fold cross-validation
GS = GS.fit(x_train, y_train)

GS.best_params_
```

Out:

```
{'criterion': 'gini',
 'max_depth': 3,
 'min_impurity_decrease': 0.0,
 'min_samples_leaf': 1,
 'splitter': 'best'}
```

```python
GS.best_score_
```
GS.best_score_ reports the mean cross-validated accuracy of the best parameter combination. With the best values for the searched parameters in hand, train the final model:
```python
clf_model = DecisionTreeClassifier(criterion='gini'
                                   , max_depth=3
                                   , min_samples_leaf=1
                                   , min_impurity_decrease=0
                                   , splitter='best')
clf_model = clf_model.fit(x, y)
```
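The write-up names cross-validation as the evaluation strategy, but no scoring call is shown. A minimal sketch using the already-imported cross_val_score (the 10 folds mirror the grid search and are my assumption):

```python
# Estimate generalization accuracy with 10-fold cross-validation;
# ravel() flattens the single-column label DataFrame into a 1-D array
scores = cross_val_score(clf_model, x, y.values.ravel(), cv=10)
print(scores.mean())
```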
Export the model:
```python
# Note: sklearn.externals.joblib was removed in newer scikit-learn releases;
# on those versions use `import joblib` (pip install joblib) instead
from sklearn.externals import joblib
joblib.dump(clf_model, '/Users/liz/Code/jupyter-notebook/sklearn/1-DecisionTree/clf_model.m')
```
Processing the test set:
```python
data_test = pd.read_csv('/Users/liz/code/jupyter-notebook/sklearn/1-DecisionTree/Titanic_test.csv')
data_test.info()
```

Out:

```
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 418 entries, 0 to 417
Data columns (total 11 columns):
PassengerId    418 non-null int64
Pclass         418 non-null int64
Name           418 non-null object
Sex            418 non-null object
Age            332 non-null float64
SibSp          418 non-null int64
Parch          418 non-null int64
Ticket         418 non-null object
Fare           417 non-null float64
Cabin          91 non-null object
Embarked       418 non-null object
dtypes: float64(2), int64(4), object(5)
memory usage: 36.0+ KB
```

```python
# Apply the same preprocessing as on the training set
data_test.drop(['Name', 'Ticket', 'Cabin'], inplace=True, axis=1)
data_test['Age'] = data_test['Age'].fillna(int(data_test['Age'].mean()))
data_test['Fare'] = data_test['Fare'].fillna(int(data_test['Fare'].mean()))
data_test.loc[:, 'Sex'] = (data_test['Sex'] == 'male').astype(int)

# Reuse the `tags` list built from the training set: deriving a fresh list with
# unique() here could map the ports to different indices than during training
data_test['Embarked'] = data_test['Embarked'].apply(lambda x: tags.index(x))
```
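Before predicting, it is worth confirming that the test features line up with the training features; a quick sanity check (my addition, not in the original code):

```python
# The model expects exactly the columns it was trained on, in the same order
assert list(data_test.columns) == list(x.columns)
```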
With the test set preprocessed, load the saved model back and run it on the data:
```python
model = joblib.load('/Users/liz/Code/jupyter-notebook/sklearn/1-DecisionTree/clf_model.m')

Survived = model.predict(data_test)
Survived = pd.DataFrame({'Survived': Survived})
PassengerId = data_test.iloc[:, data_test.columns == 'PassengerId']

# Assemble the two columns into the submission format
gender_submission = pd.concat([PassengerId, Survived], axis=1)
gender_submission.index = np.arange(1, len(gender_submission) + 1)
gender_submission.to_csv('/Users/liz/Code/jupyter-notebook/sklearn/1-DecisionTree/gender_submission.csv', index=False)
```
The exported file:
|   | PassengerId | Survived |
|---|---|---|
| 0 | 892 | 0 |
| 1 | 893 | 1 |
| 2 | 894 | 0 |
| 3 | 895 | 0 |
| 4 | 896 | 1 |
| … | … | … |
| 413 | 1305 | 0 |
| 414 | 1306 | 1 |
| 415 | 1307 | 0 |
| 416 | 1308 | 0 |
| 417 | 1309 | 0 |

418 rows × 2 columns
Submitting the result to Kaggle yields a final score of 0.77990. That is not a high score (the leaderboard even has perfect submissions), but this post is only meant as a first step into machine learning and Kaggle.
The full source code and the Kaggle dataset will be uploaded to my GitHub repository, along with some related notes collected from around the web; the repository will be updated continuously…
Appendix

Download the source code