Random Oversampling
- Duplicates rows of the minority class (as defined by Y) until the classes are balanced
- The repeated rows can lead to overfitting
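The duplication step described above can be sketched with the standard library alone. This is a minimal illustration, not the imblearn implementation: `random_oversample`, its `seed` parameter, and the toy rows are hypothetical names and data introduced only for this example.

```python
import random
from collections import Counter

def random_oversample(rows, labels, seed=0):
    """Hypothetical helper: copy randomly chosen minority-class rows until
    every class has as many rows as the largest class."""
    rng = random.Random(seed)
    counts = Counter(labels)
    target = max(counts.values())
    out_rows, out_labels = list(rows), list(labels)
    for cls, n in counts.items():
        # Positions of the original rows that belong to this class.
        idx = [i for i, lab in enumerate(labels) if lab == cls]
        # Draw (target - n) rows with replacement; this is zero for the majority class.
        for i in rng.choices(idx, k=target - n):
            out_rows.append(rows[i])
            out_labels.append(cls)
    return out_rows, out_labels

# Toy data (made up): one feature per row, four "male" rows and one "female" row.
rows = [[22.0], [38.0], [26.0], [35.0], [54.0]]
labels = ["male", "female", "male", "male", "male"]

new_rows, new_labels = random_oversample(rows, labels)
print(Counter(new_labels))
print(new_rows[len(rows):])  # the appended rows are exact copies of the single "female" row
```

Because every added row is an exact copy of an existing one, no new information enters the data, which is where the overfitting risk comes from.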
import seaborn as sns
#sns.get_dataset_names()
titanic=sns.load_dataset('titanic')
titanic.head()
 | survived | pclass | sex | age | sibsp | parch | fare | embarked | class | who | adult_male | deck | embark_town | alive | alone |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 0 | 3 | male | 22.0 | 1 | 0 | 7.2500 | S | Third | man | True | NaN | Southampton | no | False |
1 | 1 | 1 | female | 38.0 | 1 | 0 | 71.2833 | C | First | woman | False | C | Cherbourg | yes | False |
2 | 1 | 3 | female | 26.0 | 0 | 0 | 7.9250 | S | Third | woman | False | NaN | Southampton | yes | True |
3 | 1 | 1 | female | 35.0 | 1 | 0 | 53.1000 | S | First | woman | False | C | Southampton | yes | False |
4 | 0 | 3 | male | 35.0 | 0 | 0 | 8.0500 | S | Third | man | True | NaN | Southampton | no | True |
titanic.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 15 columns):
 #   Column       Non-Null Count  Dtype
---  ------       --------------  -----
 0   survived     891 non-null    int64
 1   pclass       891 non-null    int64
 2   sex          891 non-null    object
 3   age          714 non-null    float64
 4   sibsp        891 non-null    int64
 5   parch        891 non-null    int64
 6   fare         891 non-null    float64
 7   embarked     889 non-null    object
 8   class        891 non-null    category
 9   who          891 non-null    object
 10  adult_male   891 non-null    bool
 11  deck         203 non-null    category
 12  embark_town  889 non-null    object
 13  alive        891 non-null    object
 14  alone        891 non-null    bool
dtypes: bool(2), category(2), float64(2), int64(4), object(5)
memory usage: 80.7+ KB
titanic.sex.value_counts()
male      577
female    314
Name: sex, dtype: int64
titanic.columns
Index(['survived', 'pclass', 'sex', 'age', 'sibsp', 'parch', 'fare', 'embarked', 'class', 'who', 'adult_male', 'deck', 'embark_town', 'alive', 'alone'], dtype='object')
# X may include categorical columns
# Y may be categorical as well
XX=['survived', 'pclass', 'age', 'sibsp', 'parch', 'fare', 'embarked', 'class', 'who', 'deck']
YY='sex'
ALL=XX+[YY]  # feature columns followed by the target column
titanic_new=titanic[ALL]
titanic_new[XX]
 | survived | pclass | age | sibsp | parch | fare | embarked | class | who | deck |
---|---|---|---|---|---|---|---|---|---|---|
0 | 0 | 3 | 22.0 | 1 | 0 | 7.2500 | S | Third | man | NaN |
1 | 1 | 1 | 38.0 | 1 | 0 | 71.2833 | C | First | woman | C |
2 | 1 | 3 | 26.0 | 0 | 0 | 7.9250 | S | Third | woman | NaN |
3 | 1 | 1 | 35.0 | 1 | 0 | 53.1000 | S | First | woman | C |
4 | 0 | 3 | 35.0 | 0 | 0 | 8.0500 | S | Third | man | NaN |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
886 | 0 | 2 | 27.0 | 0 | 0 | 13.0000 | S | Second | man | NaN |
887 | 1 | 1 | 19.0 | 0 | 0 | 30.0000 | S | First | woman | B |
888 | 0 | 3 | NaN | 1 | 2 | 23.4500 | S | Third | woman | NaN |
889 | 1 | 1 | 26.0 | 0 | 0 | 30.0000 | C | First | man | C |
890 | 0 | 3 | 32.0 | 0 | 0 | 7.7500 | Q | Third | man | NaN |
891 rows × 10 columns
titanic_new[YY].value_counts()
male      577
female    314
Name: sex, dtype: int64
len(titanic_new)
891
from imblearn.over_sampling import RandomOverSampler
x,y=RandomOverSampler().fit_resample(titanic_new[XX],titanic_new[[YY]])
y.value_counts() # both classes of y now have the same count
sex
female    577
male      577
dtype: int64
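The balanced counts follow from simple arithmetic on the original class sizes. The sketch below assumes the sampler raises every class to the size of the largest one, which matches the output shown here. The counts are copied from the `value_counts()` results above.

```python
from collections import Counter

# Class counts copied from the value_counts() output before resampling.
before = Counter({"male": 577, "female": 314})

# Assumption: every class is raised to the size of the largest class.
target = max(before.values())

# Number of duplicated rows each class needs.
added = {cls: target - n for cls, n in before.items()}

# Total number of rows after resampling.
total_after = sum(before.values()) + sum(added.values())

print(added)
print(total_after)
```

Under that assumption, 263 of the 577 "female" rows in the resampled data are copies, and the total agrees with the 1154 rows reported for `x`.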
x
 | survived | pclass | age | sibsp | parch | fare | embarked | class | who | deck |
---|---|---|---|---|---|---|---|---|---|---|
0 | 0 | 3 | 22.00 | 1 | 0 | 7.2500 | S | Third | man | NaN |
1 | 1 | 1 | 38.00 | 1 | 0 | 71.2833 | C | First | woman | C |
2 | 1 | 3 | 26.00 | 0 | 0 | 7.9250 | S | Third | woman | NaN |
3 | 1 | 1 | 35.00 | 1 | 0 | 53.1000 | S | First | woman | C |
4 | 0 | 3 | 35.00 | 0 | 0 | 8.0500 | S | Third | man | NaN |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
1149 | 1 | 2 | 29.00 | 1 | 0 | 26.0000 | S | Second | woman | NaN |
1150 | 1 | 3 | 0.75 | 2 | 1 | 19.2583 | C | Third | child | NaN |
1151 | 1 | 2 | 21.00 | 0 | 0 | 10.5000 | S | Second | woman | NaN |
1152 | 1 | 2 | 5.00 | 1 | 2 | 27.7500 | S | Second | child | NaN |
1153 | 1 | 3 | 2.00 | 0 | 1 | 12.2875 | S | Third | child | NaN |
1154 rows × 10 columns
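One place the overfitting risk shows up in practice is when resampled data is later split into training and test parts, because copies of the same row can end up on both sides. The following is a rough standard-library sketch of that effect under made-up data: `oversample`, `split`, `leaked_ids`, the seeds, and the 60/20 toy class sizes are all hypothetical and are not part of this notebook or of imblearn.

```python
import random

def oversample(items, rng):
    """Append random copies of smaller classes until all classes match the largest."""
    by_label = {}
    for item in items:
        by_label.setdefault(item[1], []).append(item)
    target = max(len(group) for group in by_label.values())
    extra = []
    for group in by_label.values():
        extra += rng.choices(group, k=target - len(group))
    return items + extra

def split(items, rng, test_fraction=0.25):
    """Shuffle a copy of items and cut it into (train, test)."""
    shuffled = list(items)
    rng.shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]

def leaked_ids(train, test):
    """Row ids that occur in both the training part and the test part."""
    return {i for i, _ in train} & {i for i, _ in test}

# Toy data (made up): (row_id, label) pairs, 60 majority and 20 minority rows.
data = [(i, "male") for i in range(60)] + [(i, "female") for i in range(60, 80)]

# Order A: oversample everything, then split. Copies may land on both sides.
train_a, test_a = split(oversample(data, random.Random(0)), random.Random(1))

# Order B: split first, then oversample only the training part.
train_b, test_b = split(data, random.Random(1))
train_b = oversample(train_b, random.Random(0))

print(len(leaked_ids(train_a, test_a)), len(leaked_ids(train_b, test_b)))
```

In order B the overlap is always empty, because the row ids are unique before the split and only the training part is resampled. In order A the overlap depends on the random draw, so no particular number is claimed here.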