Random Oversampling

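- Copies rows of the minority class of the target Y (sampling with replacement) until the classes are balanced
- Can lead to overfitting, because the model sees exact copies of the same minority rows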
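
The two points above can be sketched in plain Python. This is a minimal illustration of the idea, not how imbalanced-learn implements it; the helper name `random_oversample` and the toy rows are hypothetical.

```python
import random
from collections import Counter


def random_oversample(rows, labels, seed=0):
    """Copy randomly chosen rows of each smaller class until every
    class has as many rows as the largest one.

    Hypothetical helper for illustration only.
    """
    rng = random.Random(seed)
    counts = Counter(labels)
    target = max(counts.values())

    out_rows, out_labels = list(rows), list(labels)
    for cls, n in counts.items():
        # Indices of the original rows that belong to this class.
        idx = [i for i, lab in enumerate(labels) if lab == cls]
        # Sample with replacement: the same row may be copied several times.
        # For the majority class k is 0, so nothing is added.
        for i in rng.choices(idx, k=target - n):
            out_rows.append(rows[i])
            out_labels.append(cls)
    return out_rows, out_labels


# Toy data (made-up values): 5 "male" rows and 2 "female" rows.
rows = [[22.0], [35.0], [54.0], [2.0], [27.0], [38.0], [26.0]]
labels = ["male"] * 5 + ["female"] * 2

new_rows, new_labels = random_oversample(rows, labels)
print(Counter(new_labels))
```

Every added row is an exact copy of an existing one, so no new information is created. That is where the overfitting risk comes from.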

import seaborn as sns
#sns.get_dataset_names()
titanic=sns.load_dataset('titanic')

titanic.head()

	survived	pclass	sex	age	sibsp	fare	embarked	class	who	adult_male	deck	embark_town	alive	alone
0	0	3	male	22.0	1	7.2500	S	Third	man	True	NaN	Southampton	no	False
1	1	1	female	38.0	1	71.2833	C	First	woman	False	C	Cherbourg	yes	False
2	1	3	female	26.0	0	7.9250	S	Third	woman	False	NaN	Southampton	yes	True
3	1	1	female	35.0	1	53.1000	S	First	woman	False	C	Southampton	yes	False
4	0	3	male	35.0	0	8.0500	S	Third	man	True	NaN	Southampton	no	True

titanic.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 15 columns):
 #   Column       Non-Null Count  Dtype   
---  ------       --------------  -----   
 0   survived     891 non-null    int64   
 1   pclass       891 non-null    int64   
 2   sex          891 non-null    object  
 3   age          714 non-null    float64 
 4   sibsp        891 non-null    int64   
 5   parch        891 non-null    int64   
 6   fare         891 non-null    float64 
 7   embarked     889 non-null    object  
 8   class        891 non-null    category
 9   who          891 non-null    object  
 10  adult_male   891 non-null    bool    
 11  deck         203 non-null    category
 12  embark_town  889 non-null    object  
 13  alive        891 non-null    object  
 14  alone        891 non-null    bool    
dtypes: bool(2), category(2), float64(2), int64(4), object(5)
memory usage: 80.7+ KB

titanic.sex.value_counts()

male      577
female    314
Name: sex, dtype: int64

titanic.columns

Index(['survived', 'pclass', 'sex', 'age', 'sibsp', 'parch', 'fare',
       'embarked', 'class', 'who', 'adult_male', 'deck', 'embark_town',
       'alive', 'alone'],
      dtype='object')

# X may include categorical columns
# Y may also be categorical
XX=['survived', 'pclass', 'age', 'sibsp', 'parch', 'fare', 'embarked', 'class', 'who', 'deck']
YY='sex' 

ALL=XX.copy()
ALL.append(YY)

titanic_new=titanic[ALL]

titanic_new[XX]

	survived	pclass	age	sibsp	parch	fare	embarked	class	who	deck
0	0	3	22.0	1	0	7.2500	S	Third	man	NaN
1	1	1	38.0	1	0	71.2833	C	First	woman	C
2	1	3	26.0	0	0	7.9250	S	Third	woman	NaN
3	1	1	35.0	1	0	53.1000	S	First	woman	C
4	0	3	35.0	0	0	8.0500	S	Third	man	NaN
...	...	...	...	...	...	...	...	...	...	...
886	0	2	27.0	0	0	13.0000	S	Second	man	NaN
887	1	1	19.0	0	0	30.0000	S	First	woman	B
888	0	3	NaN	1	2	23.4500	S	Third	woman	NaN
889	1	1	26.0	0	0	30.0000	C	First	man	C
890	0	3	32.0	0	0	7.7500	Q	Third	man	NaN

891 rows × 10 columns

titanic_new[YY].value_counts()

male      577
female    314
Name: sex, dtype: int64

len(titanic_new) 

from imblearn.over_sampling import RandomOverSampler
x,y=RandomOverSampler().fit_resample(titanic_new[XX],titanic_new[[YY]])

y.value_counts() # the classes in y are now balanced

sex   
female    577
male      577
dtype: int64

x # resampled features

	survived	pclass	age	sibsp	parch	fare	embarked	class	who	deck
0	0	3	22.00	1	0	7.2500	S	Third	man	NaN
1	1	1	38.00	1	0	71.2833	C	First	woman	C
2	1	3	26.00	0	0	7.9250	S	Third	woman	NaN
3	1	1	35.00	1	0	53.1000	S	First	woman	C
4	0	3	35.00	0	0	8.0500	S	Third	man	NaN
...	...	...	...	...	...	...	...	...	...	...
1149	1	2	29.00	1	0	26.0000	S	Second	woman	NaN
1150	1	3	0.75	2	1	19.2583	C	Third	child	NaN
1151	1	2	21.00	0	0	10.5000	S	Second	woman	NaN
1152	1	2	5.00	1	2	27.7500	S	Second	child	NaN
1153	1	3	2.00	0	1	12.2875	S	Third	child	NaN

1154 rows × 10 columns
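
The row count above can be checked by hand: the minority class is raised to the majority count, so the number of added rows is the difference between the two. A small sketch of that arithmetic, using only the counts printed earlier in this notebook:

```python
# Class counts before resampling, from titanic_new[YY].value_counts() above.
before = {"male": 577, "female": 314}

majority = max(before.values())
# Each class is raised to the majority count; the majority class gets no copies.
added = {cls: majority - n for cls, n in before.items()}
total_after = sum(before.values()) + sum(added.values())

print(added)
print(total_after)
```

In other words, 263 of the 1154 resampled rows are duplicated female passengers.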

SW정리

Sunday, May 7, 2023

Random oversampling example