🤔 Competition Overview
This post is about SPACE TITANIC, one of the competitions hosted on Kaggle.
It is very similar to TITANIC, one of the best-known introductory problems in data analysis and machine learning, but is somewhat more difficult.
I decided to enter because I thought that competing directly would help my studies.
For background information on Space Titanic, please see the following link.
https://2t-hong.tistory.com/26
✔️ Converting Object-type Data
The models we use cannot make predictions from object-type (string) data directly, so each such column has to be converted.
Before doing that, I split the Cabin column into three parts: deck, number, and side.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

train_df_1 = train_df.copy()
test_df_1 = test_df.copy()
# Split Cabin into deck, number, side; 'Z/9999/Z' is a temporary placeholder for NaN
train_df_1['Cabin'] = train_df_1['Cabin'].fillna('Z/9999/Z')
test_df_1['Cabin'] = test_df_1['Cabin'].fillna('Z/9999/Z')
train_df_1['Cabin_deck'] = train_df_1['Cabin'].apply(lambda x: x.split('/')[0])
train_df_1['Cabin_number'] = train_df_1['Cabin'].apply(lambda x: x.split('/')[1]).astype(int)
train_df_1['Cabin_side'] = train_df_1['Cabin'].apply(lambda x: x.split('/')[2])
# New features - test set
test_df_1['Cabin_deck'] = test_df_1['Cabin'].apply(lambda x: x.split('/')[0])
test_df_1['Cabin_number'] = test_df_1['Cabin'].apply(lambda x: x.split('/')[1]).astype(int)
test_df_1['Cabin_side'] = test_df_1['Cabin'].apply(lambda x: x.split('/')[2])
# Put Nan's back in (we will fill these later)
train_df_1.loc[train_df_1['Cabin_deck']=='Z', 'Cabin_deck']=np.nan
train_df_1.loc[train_df_1['Cabin_number']==9999, 'Cabin_number']=np.nan
train_df_1.loc[train_df_1['Cabin_side']=='Z', 'Cabin_side']=np.nan
test_df_1.loc[test_df_1['Cabin_deck']=='Z', 'Cabin_deck']=np.nan
test_df_1.loc[test_df_1['Cabin_number']==9999, 'Cabin_number']=np.nan
test_df_1.loc[test_df_1['Cabin_side']=='Z', 'Cabin_side']=np.nan
# Drop Cabin (we don't need it anymore)
train_df_1.drop('Cabin', axis=1, inplace=True)
test_df_1.drop('Cabin', axis=1, inplace=True)
# Plot distribution of new features
fig=plt.figure(figsize=(10,12))
plt.subplot(3,1,1)
sns.countplot(data=train_df_1, x='Cabin_deck', hue='Transported', order=['A','B','C','D','E','F','G','T'])
plt.title('Cabin deck')
plt.subplot(3,1,2)
sns.histplot(data=train_df_1, x='Cabin_number', hue='Transported',binwidth=20)
# Mark the apparent cabin-number regime boundaries every 300 cabins
for x in range(300, 1801, 300):
    plt.vlines(x, ymin=0, ymax=200, color='black')
plt.title('Cabin number')
plt.xlim([0,2000])
plt.subplot(3,1,3)
sns.countplot(data=train_df_1, x='Cabin_side', hue='Transported')
plt.title('Cabin side')
fig.tight_layout()
Also, to keep the NaN values in the object-type columns from affecting the model's predictions, I replaced them all with -1.
That leaves missing values only in the numeric columns: Age, RoomService, FoodCourt, ShoppingMall, Spa, VRDeck, and Cabin_number.
For Cabin_number I could not find a sensible replacement value, so I dropped that column.
# Replace NaN in the object-type columns with -1
object_cols = [i for i in train_df_1.columns if train_df_1[i].dtype == 'O']
train_df_1[object_cols] = train_df_1[object_cols].fillna(-1)
test_df_1[object_cols] = test_df_1[object_cols].fillna(-1)
print('TRAIN DATA MISSING VALUES')
print(train_df_1.isna().sum())
print('')
print('TEST DATA MISSING VALUES')
print(test_df_1.isna().sum())
TRAIN DATA MISSING VALUES
HomePlanet 0
CryoSleep 0
Destination 0
Age 179
VIP 0
RoomService 181
FoodCourt 183
ShoppingMall 208
Spa 183
VRDeck 188
Name 0
Transported 0
Cabin_deck 0
Cabin_number 199
Cabin_side 0
dtype: int64
TEST DATA MISSING VALUES
HomePlanet 0
CryoSleep 0
Destination 0
Age 91
VIP 0
RoomService 82
FoodCourt 106
ShoppingMall 98
Spa 101
VRDeck 80
Name 0
Cabin_deck 0
Cabin_number 100
Cabin_side 0
dtype: int64
# Convert Transported from True/False to 1/0
train_df_1['Transported'] = train_df_1['Transported'].astype(np.int8)
train_df_1['Transported'].value_counts()
1 4378
0 4315
Name: Transported, dtype: int64
train_df_1.head()
HomePlanet | CryoSleep | Destination | Age | VIP | RoomService | FoodCourt | ShoppingMall | Spa | VRDeck | Name | Transported | Cabin_deck | Cabin_number | Cabin_side
Europa | False | TRAPPIST-1e | 39.0 | False | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | Maham Ofracculy | 0 | B | 0.0 | P |
Earth | False | TRAPPIST-1e | 24.0 | False | 109.0 | 9.0 | 25.0 | 549.0 | 44.0 | Juanna Vines | 1 | F | 0.0 | S |
Europa | False | TRAPPIST-1e | 58.0 | True | 43.0 | 3576.0 | 0.0 | 6715.0 | 49.0 | Altark Susent | 0 | A | 0.0 | S |
Europa | False | TRAPPIST-1e | 33.0 | False | 0.0 | 1283.0 | 371.0 | 3329.0 | 193.0 | Solam Susent | 0 | A | 0.0 | S |
Earth | False | TRAPPIST-1e | 16.0 | False | 303.0 | 70.0 | 151.0 | 565.0 | 2.0 | Willy Santantines | 1 | F | 1.0 | S |
test_df_1.head()
HomePlanet | CryoSleep | Destination | Age | VIP | RoomService | FoodCourt | ShoppingMall | Spa | VRDeck | Name | Cabin_deck | Cabin_number | Cabin_side
Earth | True | TRAPPIST-1e | 27.0 | False | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | Nelly Carsoning | G | 3.0 | S |
Earth | False | TRAPPIST-1e | 19.0 | False | 0.0 | 9.0 | 0.0 | 2823.0 | 0.0 | Lerome Peckers | F | 4.0 | S |
Europa | True | 55 Cancri e | 31.0 | False | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | Sabih Unhearfus | C | 0.0 | S |
Europa | False | TRAPPIST-1e | 38.0 | False | 0.0 | 6652.0 | 0.0 | 181.0 | 585.0 | Meratz Caltilter | C | 1.0 | S |
Earth | False | TRAPPIST-1e | 20.0 | False | 10.0 | 0.0 | 635.0 | 0.0 | 0.0 | Brence Harperez | F | 5.0 | S |
✔️ Handling Missing Values
- HomePlanet: Earth is the most common value, so fill with Earth.
- CryoSleep: drop the rows that have NaN.
- Cabin: split into deck, number, and side and fill. (dropped for now)
- Destination: TRAPPIST-1e is the most common value, so fill with TRAPPIST-1e.
- Age: fill with the most frequent value, 24. (kept)
- VIP: fill with False.
- RoomService, FoodCourt, ShoppingMall, Spa, VRDeck: 0 is by far the most common value, so fill with 0. (kept)
- Name: drop the column entirely.
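The fill strategy above can also be applied in a single pass by giving fillna a column-to-value mapping instead of one call per column. A minimal sketch on a toy frame (the column names mirror the competition data, but the values are made up):

```python
import numpy as np
import pandas as pd

# Toy frame standing in for train_df (illustrative values only)
df = pd.DataFrame({
    'HomePlanet': ['Earth', np.nan, 'Mars'],
    'Destination': [np.nan, 'TRAPPIST-1e', 'TRAPPIST-1e'],
    'VIP': [False, np.nan, False],
    'Age': [24.0, np.nan, 31.0],
    'RoomService': [0.0, 12.0, np.nan],
})

# One fillna call with a column -> value mapping applies every rule at once
fill_values = {
    'HomePlanet': 'Earth',         # most frequent category
    'Destination': 'TRAPPIST-1e',  # most frequent category
    'VIP': False,                  # most frequent value
    'Age': 24.0,                   # mode of Age
    'RoomService': 0.0,            # spending columns default to 0
}
df = df.fillna(fill_values)
print(df.isna().sum().sum())  # 0
```

Columns missing from the mapping are simply left untouched, which makes it easy to keep some columns (like Cabin_number here) for separate handling.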
train_df_2 = train_df_1.copy()
test_df_2 = test_df_1.copy()
print('TRAIN DATA 1 MISSING VALUES')
print(train_df_2.isna().sum())
print('')
print('TEST DATA 1 MISSING VALUES')
print(test_df_2.isna().sum())
TRAIN DATA 1 MISSING VALUES
HomePlanet 0
CryoSleep 0
Destination 0
Age 179
VIP 0
RoomService 181
FoodCourt 183
ShoppingMall 208
Spa 183
VRDeck 188
Name 0
Transported 0
Cabin_deck 0
Cabin_number 199
Cabin_side 0
dtype: int64
TEST DATA 1 MISSING VALUES
HomePlanet 0
CryoSleep 0
Destination 0
Age 91
VIP 0
RoomService 82
FoodCourt 106
ShoppingMall 98
Spa 101
VRDeck 80
Name 0
Cabin_deck 0
Cabin_number 100
Cabin_side 0
dtype: int64
Name
# Drop the Name column entirely
train_df_2.drop('Name', axis=1, inplace=True)
test_df_2.drop('Name', axis=1, inplace=True)
Age
fig=plt.figure(figsize=(12,8))
train_df_2['Age'].hist(bins=20, rwidth=0.8)
train_df_2['Age'].value_counts()
24.0 324
18.0 320
21.0 311
19.0 293
23.0 292
...
72.0 4
78.0 3
79.0 3
76.0 2
77.0 2
Name: Age, Length: 80, dtype: int64
# Alternative: fill Age with the mean
# train_df_2['Age'].fillna(train_df_2['Age'].mean(), inplace=True)
# test_df_2['Age'].fillna(test_df_2['Age'].mean(), inplace=True)
# Fill Age with the mode, 24
train_df_2['Age'] = train_df_2['Age'].fillna(24.0)
test_df_2['Age'] = test_df_2['Age'].fillna(24.0)
print(train_df_2.isna().sum())
print("")
print(test_df_2.isna().sum())
HomePlanet 0
CryoSleep 0
Destination 0
Age 0
VIP 0
RoomService 181
FoodCourt 183
ShoppingMall 208
Spa 183
VRDeck 188
Transported 0
Cabin_deck 0
Cabin_number 199
Cabin_side 0
dtype: int64
HomePlanet 0
CryoSleep 0
Destination 0
Age 0
VIP 0
RoomService 82
FoodCourt 106
ShoppingMall 98
Spa 101
VRDeck 80
Cabin_deck 0
Cabin_number 100
Cabin_side 0
dtype: int64
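Rather than hardcoding 24.0 in the Age fill above, the mode can be computed from the column itself with Series.mode(), which keeps the code correct if the data changes. A small sketch on a toy Age column:

```python
import numpy as np
import pandas as pd

# Toy Age column; 24.0 is deliberately the most frequent value
age = pd.Series([24.0, 24.0, 18.0, np.nan, 31.0, 24.0, np.nan])

mode_age = age.mode()[0]    # most frequent non-null value
age = age.fillna(mode_age)  # fill NaNs with the mode
print(mode_age)  # 24.0
```

mode() returns a Series because several values can tie for the most frequent; taking index 0 picks the smallest of them.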
RS, FC, SM, S, VR
# Fill missing RoomService, FoodCourt, ShoppingMall, Spa, VRDeck values with 0 -> kept
train_df_2['RoomService'] = train_df_2['RoomService'].fillna(0)
test_df_2['RoomService'] = test_df_2['RoomService'].fillna(0)
train_df_2['FoodCourt'] = train_df_2['FoodCourt'].fillna(0)
test_df_2['FoodCourt'] = test_df_2['FoodCourt'].fillna(0)
train_df_2['ShoppingMall'] = train_df_2['ShoppingMall'].fillna(0)
test_df_2['ShoppingMall'] = test_df_2['ShoppingMall'].fillna(0)
train_df_2['Spa'] = train_df_2['Spa'].fillna(0)
test_df_2['Spa'] = test_df_2['Spa'].fillna(0)
train_df_2['VRDeck'] = train_df_2['VRDeck'].fillna(0)
test_df_2['VRDeck'] = test_df_2['VRDeck'].fillna(0)
print(train_df_2.isna().sum())
print("")
print(test_df_2.isna().sum())
HomePlanet 0
CryoSleep 0
Destination 0
Age 0
VIP 0
RoomService 0
FoodCourt 0
ShoppingMall 0
Spa 0
VRDeck 0
Transported 0
Cabin_deck 0
Cabin_number 199
Cabin_side 0
dtype: int64
HomePlanet 0
CryoSleep 0
Destination 0
Age 0
VIP 0
RoomService 0
FoodCourt 0
ShoppingMall 0
Spa 0
VRDeck 0
Cabin_deck 0
Cabin_number 100
Cabin_side 0
dtype: int64
Cabin_number
# Drop Cabin_number for now and move on
train_df_2.drop('Cabin_number', axis=1, inplace=True)
test_df_2.drop('Cabin_number', axis=1, inplace=True)
Handling the object columns
- First, convert the object columns with pd.get_dummies.
train_df_2.head()
HomePlanet | CryoSleep | Destination | Age | VIP | RoomService | FoodCourt | ShoppingMall | Spa | VRDeck | Transported | Cabin_deck | Cabin_side
Europa | False | TRAPPIST-1e | 39.0 | False | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0 | B | P |
Earth | False | TRAPPIST-1e | 24.0 | False | 109.0 | 9.0 | 25.0 | 549.0 | 44.0 | 1 | F | S |
Europa | False | TRAPPIST-1e | 58.0 | True | 43.0 | 3576.0 | 0.0 | 6715.0 | 49.0 | 0 | A | S |
Europa | False | TRAPPIST-1e | 33.0 | False | 0.0 | 1283.0 | 371.0 | 3329.0 | 193.0 | 0 | A | S |
Earth | False | TRAPPIST-1e | 16.0 | False | 303.0 | 70.0 | 151.0 | 565.0 | 2.0 | 1 | F | S |
obj_cols = [i for i in train_df_2.columns if train_df_2[i].dtype == "O"]
print("TRAIN DATA\n")
for i in obj_cols:
print(train_df_2[i].unique())
print("\n\nTEST DATA\n")
for i in obj_cols:
print(test_df_2[i].unique())
TRAIN DATA
['Europa' 'Earth' 'Mars' -1]
[False True -1]
['TRAPPIST-1e' 'PSO J318.5-22' '55 Cancri e' -1]
[False True -1]
['B' 'F' 'A' 'G' -1 'E' 'D' 'C' 'T']
['P' 'S' -1]
TEST DATA
['Earth' 'Europa' 'Mars' -1]
[True False -1]
['TRAPPIST-1e' '55 Cancri e' 'PSO J318.5-22' -1]
[False -1 True]
['G' 'F' 'C' 'B' 'D' 'E' -1 'A' 'T']
['S' 'P' -1]
# One-hot encode each object column with get_dummies
train_df_3 = pd.concat([train_df_2, pd.get_dummies(train_df_2[obj_cols], drop_first=True)], axis=1).drop(obj_cols, axis=1)
test_df_3 = pd.concat([test_df_2, pd.get_dummies(test_df_2[obj_cols], drop_first=True)], axis=1).drop(obj_cols, axis=1)
train_df_3.head()
39.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0 | 0 | 1 | 0 | ... | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 |
24.0 | 109.0 | 9.0 | 25.0 | 549.0 | 44.0 | 1 | 1 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 1 |
58.0 | 43.0 | 3576.0 | 0.0 | 6715.0 | 49.0 | 0 | 0 | 1 | 0 | ... | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
33.0 | 0.0 | 1283.0 | 371.0 | 3329.0 | 193.0 | 0 | 0 | 1 | 0 | ... | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
16.0 | 303.0 | 70.0 | 151.0 | 565.0 | 2.0 | 1 | 1 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 1 |
5 rows × 27 columns
test_df_3.head()
27.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 1 |
19.0 | 0.0 | 9.0 | 0.0 | 2823.0 | 0.0 | 1 | 0 | 0 | 1 | ... | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 1 |
31.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0 | 1 | 0 | 0 | ... | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
38.0 | 0.0 | 6652.0 | 0.0 | 181.0 | 585.0 | 0 | 1 | 0 | 1 | ... | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
20.0 | 10.0 | 0.0 | 635.0 | 0.0 | 0.0 | 1 | 0 | 0 | 1 | ... | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 1 |
5 rows × 26 columns
train_df_3.columns
Index(['Age', 'RoomService', 'FoodCourt', 'ShoppingMall', 'Spa', 'VRDeck',
'Transported', 'HomePlanet_Earth', 'HomePlanet_Europa',
'HomePlanet_Mars', 'CryoSleep_False', 'CryoSleep_True',
'Destination_55 Cancri e', 'Destination_PSO J318.5-22',
'Destination_TRAPPIST-1e', 'VIP_False', 'VIP_True', 'Cabin_deck_A',
'Cabin_deck_B', 'Cabin_deck_C', 'Cabin_deck_D', 'Cabin_deck_E',
'Cabin_deck_F', 'Cabin_deck_G', 'Cabin_deck_T', 'Cabin_side_P',
'Cabin_side_S'],
dtype='object')
test_df_3.columns
Index(['Age', 'RoomService', 'FoodCourt', 'ShoppingMall', 'Spa', 'VRDeck',
'HomePlanet_Earth', 'HomePlanet_Europa', 'HomePlanet_Mars',
'CryoSleep_False', 'CryoSleep_True', 'Destination_55 Cancri e',
'Destination_PSO J318.5-22', 'Destination_TRAPPIST-1e', 'VIP_False',
'VIP_True', 'Cabin_deck_A', 'Cabin_deck_B', 'Cabin_deck_C',
'Cabin_deck_D', 'Cabin_deck_E', 'Cabin_deck_F', 'Cabin_deck_G',
'Cabin_deck_T', 'Cabin_side_P', 'Cabin_side_S'],
dtype='object')
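The train and test dummy columns match here only because both splits happen to contain every category. When a category appears in just one split, pd.get_dummies applied separately produces different columns and downstream predictions break. A defensive sketch using DataFrame.align on toy frames:

```python
import pandas as pd

train = pd.DataFrame({'deck': ['A', 'B', 'T']})  # 'T' appears only in train
test = pd.DataFrame({'deck': ['A', 'B']})

train_d = pd.get_dummies(train)
test_d = pd.get_dummies(test)

# Align on columns, filling categories absent from one side with 0
train_d, test_d = train_d.align(test_d, join='left', axis=1, fill_value=0)
print(list(test_d.columns))  # ['deck_A', 'deck_B', 'deck_T']
```

join='left' keeps exactly the training columns, so the test frame always has the shape the fitted model expects.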
โ๏ธ ๋ชจ๋ธ ์์ฑ ๋ฐ ์์ธก
I trained a range of models provided by sklearn, compared their predictions, and ensembled the three best-performing ones.
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC, LinearSVC
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import Perceptron
from sklearn.linear_model import SGDClassifier
from xgboost import XGBClassifier
from sklearn.metrics import accuracy_score, plot_confusion_matrix
from sklearn.model_selection import train_test_split
X = train_df_3.drop('Transported', axis=1)
Y = train_df_3['Transported']
X_train, X_valid, Y_train, Y_valid = train_test_split(X, Y, test_size=0.1, random_state=42, stratify=Y)
X_train.shape, Y_train.shape, X_valid.shape, Y_valid.shape
((7823, 26), (7823,), (870, 26), (870,))
# Logistic regression
lr = LogisticRegression(max_iter=1000)
lr_model = lr.fit(X_train,Y_train)
Y_pred = lr_model.predict(X_valid)
# Measure accuracy on the validation set
accuracy_log = accuracy_score(Y_pred, Y_valid)
accuracy_log
/usr/local/lib/python3.7/dist-packages/sklearn/linear_model/_logistic.py:818: ConvergenceWarning: lbfgs failed to converge (status=1):
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.
Increase the number of iterations (max_iter) or scale the data as shown in:
https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
extra_warning_msg=_LOGISTIC_SOLVER_CONVERGENCE_MSG,
0.7977011494252874
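The ConvergenceWarning above recommends scaling the data, and that is usually the cleanest fix: putting a StandardScaler in front of the logistic regression keeps lbfgs well-conditioned. A sketch on synthetic data (the competition frames are not reproduced here):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for X_train / Y_train
X, y = make_classification(n_samples=500, n_features=26, random_state=42)

# Scaling first keeps the lbfgs solver well-conditioned, avoiding the warning
clf = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
clf.fit(X, y)
print(round(clf.score(X, y), 3))
```

A Pipeline also guarantees the scaler is fit only on the training fold, so the same object can be passed to cross-validation without leaking validation statistics.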
# Random forest
rf = RandomForestClassifier(n_estimators=100)
rf_model = rf.fit(X_train, Y_train)
Y_pred = rf_model.predict(X_valid)
# Measure accuracy on the validation set
accuracy_random_forest = accuracy_score(Y_pred, Y_valid)
accuracy_random_forest
0.8149425287356322
# Decision tree
dt = DecisionTreeClassifier()
dt_model = dt.fit(X_train, Y_train)
Y_pred = dt_model.predict(X_valid)
# Measure accuracy on the validation set
acc_decision_tree = accuracy_score(Y_pred, Y_valid)
acc_decision_tree
0.7505747126436781
# KNN
knn = KNeighborsClassifier(n_neighbors = 5)
knn_model = knn.fit(X_train, Y_train)
Y_pred = knn_model.predict(X_valid)
# Measure accuracy on the validation set
accuracy_knn = accuracy_score(Y_pred, Y_valid)
accuracy_knn
0.7758620689655172
# Support Vector Machines
svc = SVC()
svc_model = svc.fit(X_train, Y_train)
Y_pred = svc_model.predict(X_valid)
# Measure accuracy on the validation set
accuracy_svc = accuracy_score(Y_pred, Y_valid)
accuracy_svc
0.7954022988505747
# Gaussian Naive Bayes
gaussian = GaussianNB()
gaussian_model = gaussian.fit(X_train, Y_train)
Y_pred = gaussian_model.predict(X_valid)
# Measure accuracy on the validation set
accuracy_gaussian = accuracy_score(Y_pred, Y_valid)
accuracy_gaussian
0.7632183908045977
#Gradient boosting
g_boost = GradientBoostingClassifier()
g_boost_model = g_boost.fit(X_train,Y_train)
Y_pred = g_boost_model.predict(X_valid)
# Measure accuracy on the validation set
accuracy_gboost = accuracy_score(Y_pred, Y_valid)
accuracy_gboost
0.8114942528735632
#XGB Classifier
xgb_classifier = XGBClassifier()
xgb_model = xgb_classifier.fit(X_train,Y_train)
Y_pred = xgb_model.predict(X_valid)
accuracy_xgb = accuracy_score(Y_pred, Y_valid)
accuracy_xgb
0.8103448275862069
models = pd.DataFrame({
'Model': ['Support Vector Machines', 'KNN', 'Logistic Regression',
'Random Forest', 'Naive Bayes', 'Decision Tree', 'Gradient Boosting', 'XGB Classifier'],
'Score': [accuracy_svc, accuracy_knn, accuracy_log,
accuracy_random_forest, accuracy_gaussian, acc_decision_tree, accuracy_gboost, accuracy_xgb]})
models.sort_values(by='Score', ascending=False)
Model | Score
Random Forest | 0.814943 |
Gradient Boosting | 0.811494 |
XGB Classifier | 0.810345 |
Logistic Regression | 0.797701 |
Support Vector Machines | 0.795402 |
KNN | 0.775862 |
Naive Bayes | 0.763218 |
Decision Tree | 0.750575 |
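These scores come from a single 90/10 split, so the ranking is somewhat noisy. cross_val_score averages over several folds and also reports the spread. A sketch on synthetic data standing in for X and Y:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for the training matrix
X, y = make_classification(n_samples=600, n_features=26, random_state=42)

# 5-fold CV returns one score per fold: report mean and spread
scores = cross_val_score(
    RandomForestClassifier(n_estimators=100, random_state=0), X, y, cv=5)
print(scores.mean().round(3), scores.std().round(3))
```

Two models whose mean scores differ by less than the fold-to-fold spread should be treated as tied when picking the ensemble members.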
from sklearn.ensemble import VotingClassifier
rf = RandomForestClassifier(n_estimators=100)
xgb = XGBClassifier()
gbm = GradientBoostingClassifier()
votingC = VotingClassifier(estimators=[('RandomForest', rf), ('XGBoost', xgb), ('GBM', gbm)],
voting='soft',
n_jobs=-1)
# Fit the ensemble on the training data
votingC.fit(X_train,Y_train)
X_test = test_df_3
predictions = votingC.predict(X_test)
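With voting='soft', the ensemble averages each model's predict_proba output and picks the class with the highest mean probability. The same computation by hand, with made-up probabilities for two samples:

```python
import numpy as np

# Per-model class probabilities for 2 samples (made-up numbers)
p_rf  = np.array([[0.6, 0.4], [0.2, 0.8]])
p_xgb = np.array([[0.7, 0.3], [0.4, 0.6]])
p_gbm = np.array([[0.5, 0.5], [0.3, 0.7]])

avg = (p_rf + p_xgb + p_gbm) / 3  # soft voting: mean probability per class
pred = avg.argmax(axis=1)         # class with the highest mean probability
print(pred)  # [0 1]
```

Hard voting would instead take each model's 0/1 prediction and use the majority, discarding how confident each model was.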
test_data = pd.read_csv("/content/drive/MyDrive/2022/충남대학교/MOGAKCO/Space_Titanic/test.csv")
output = pd.DataFrame({'PassengerId': test_data.PassengerId, 'Transported': predictions.astype(bool)})
output.to_csv('submission.csv', index=False)
predictions = rf_model.predict(test_df_3)
test_data = pd.read_csv("/content/drive/MyDrive/2022/충남대학교/MOGAKCO/Space_Titanic/test.csv")
output = pd.DataFrame({'PassengerId': test_data.PassengerId, 'Transported': predictions.astype(bool)})
output.to_csv('Submission_rf.csv', index=False)
print("Your submission was successfully saved!")
Your submission was successfully saved!
predictions = g_boost_model.predict(test_df_3)
test_data = pd.read_csv("/content/drive/MyDrive/2022/충남대학교/MOGAKCO/Space_Titanic/test.csv")
output = pd.DataFrame({'PassengerId': test_data.PassengerId, 'Transported': predictions.astype(bool)})
output.to_csv('Submission_gbost.csv', index=False)
print("Your submission was successfully saved!")
Your submission was successfully saved!
predictions = xgb_model.predict(test_df_3)
test_data = pd.read_csv("/content/drive/MyDrive/2022/충남대학교/MOGAKCO/Space_Titanic/test.csv")
output = pd.DataFrame({'PassengerId': test_data.PassengerId, 'Transported': predictions.astype(bool)})
output.to_csv('Submission_xgb.csv', index=False)
print("Your submission was successfully saved!")
Your submission was successfully saved!
This is the model that achieved the best score when submitted to Kaggle.
It scored 0.80383, ranking 421st.