import pandas as pd

Dataset used: https://github.com/donarts/sourcecode/blob/main/datascience/datasets/Iris.csv

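Normalization

Use this when differences in units, extreme values, and similar issues make comparison difficult or distort the results.

Min-max transformation: Min-Max Scaling to the range 0~1

$\frac{x - \min}{\max - \min}$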

df = pd.read_csv("Iris.csv")
df.head(2)

	SepalLengthCm	SepalWidthCm	PetalLengthCm	PetalWidthCm	Species
0	5.1	3.5	1.4	0.2	Iris-setosa
1	4.9	3.0	1.4	0.2	Iris-setosa

# Keep the min and max so the scaling can be undone later
df_min = df["SepalWidthCm"].min()
df_max = df["SepalWidthCm"].max()
df["SepalWidthCm"] = (df["SepalWidthCm"] - df_min) / (df_max - df_min)
df.head(2)

	SepalLengthCm	SepalWidthCm	PetalLengthCm	PetalWidthCm	Species
0	5.1	0.625000	1.4	0.2	Iris-setosa
1	4.9	0.416667	1.4	0.2	Iris-setosa
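
The arithmetic behind these two values can be checked by hand. The sketch below is a dependency-free illustration, not the pandas code above. The helper name `min_max_scale` and the four sample values are assumptions, chosen so that the minimum is 2.0 and the maximum is 4.4, which is consistent with the outputs shown.

```python
def min_max_scale(values):
    """Scale a list of numbers to 0..1 using (x - min) / (max - min)."""
    lo, hi = min(values), max(values)
    if hi == lo:
        raise ValueError("min-max scaling is undefined when all values are equal")
    return [(v - lo) / (hi - lo) for v in values]


# Hypothetical sample: 3.5 and 3.0 are the first two rows shown above
sample = [2.0, 3.0, 3.5, 4.4]
scaled = min_max_scale(sample)
for raw, s in zip(sample, scaled):
    print(raw, round(s, 6))
```

The smallest value maps to 0 and the largest to 1; everything else falls in between in proportion to its distance from the minimum.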

Inverse operation

df["SepalWidthCm"] = df["SepalWidthCm"]*(df_max-df_min)+df_min

df.head(2)

	SepalLengthCm	SepalWidthCm	PetalLengthCm	PetalWidthCm	Species
0	5.1	3.5	1.4	0.2	Iris-setosa
1	4.9	3.0	1.4	0.2	Iris-setosa
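
The inverse only works if the original minimum and maximum were saved before the column was overwritten, which is why `df_min` and `df_max` are kept. A minimal round-trip sketch, using the hypothetical helper names `scale` and `unscale` and made-up values:

```python
def scale(values, lo, hi):
    """Forward min-max transform with a known min (lo) and max (hi)."""
    return [(v - lo) / (hi - lo) for v in values]


def unscale(scaled, lo, hi):
    """Inverse transform: multiply by the range, then add the min back."""
    return [s * (hi - lo) + lo for s in scaled]


original = [2.0, 3.0, 3.5, 4.4]      # made-up sample values
lo, hi = min(original), max(original)
restored = unscale(scale(original, lo, hi), lo, hi)

# Compare with a tolerance: floating point may not round-trip bit for bit
print(all(abs(a - b) < 1e-12 for a, b in zip(original, restored)))
```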

sklearn functions

from sklearn.preprocessing import MinMaxScaler

scaler = MinMaxScaler().fit(df) # pass the DataFrame as 2-D

scaler.transform(newdf) # pass the DataFrame as 2-D
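
The `fit` / `transform` split matters because the minimum and maximum are learned once and then reused on other data. The class below is a simplified, hypothetical stand-in (`SimpleMinMax` is not part of sklearn) that mimics the column-wise behavior on a list of rows. It also illustrates that new values outside the fitted range land outside 0~1.

```python
class SimpleMinMax:
    """Hypothetical column-wise min-max scaler for a 2-D list of rows."""

    def fit(self, rows):
        columns = list(zip(*rows))  # transpose rows into columns
        self.min_ = [min(c) for c in columns]
        self.max_ = [max(c) for c in columns]
        return self

    def transform(self, rows):
        return [
            [(v - lo) / (hi - lo) for v, lo, hi in zip(row, self.min_, self.max_)]
            for row in rows
        ]


train = [[2.0, 10.0], [3.0, 20.0], [4.0, 30.0]]  # made-up data, 2 columns
scaler = SimpleMinMax().fit(train)
train_scaled = scaler.transform(train)        # each column spans 0..1
new_scaled = scaler.transform([[5.0, 15.0]])  # 5.0 exceeds the fitted max of 4.0
print(train_scaled)
print(new_scaled)
```

This is also why the input is 2-D: each column gets its own minimum and maximum.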

df = pd.read_csv("Iris.csv")
scaler = MinMaxScaler().fit(X=df[["SepalWidthCm"]])
df["SepalWidthCm"]=scaler.transform(df[["SepalWidthCm"]])
df.head(2)

	SepalLengthCm	SepalWidthCm	PetalLengthCm	PetalWidthCm	Species
0	5.1	0.625000	1.4	0.2	Iris-setosa
1	4.9	0.416667	1.4	0.2	Iris-setosa

Inverse transform

df["SepalWidthCm"]=scaler.inverse_transform(df[["SepalWidthCm"]])
df.head(2)

	SepalLengthCm	SepalWidthCm	PetalLengthCm	PetalWidthCm	Species
0	5.1	3.5	1.4	0.2	Iris-setosa
1	4.9	3.0	1.4	0.2	Iris-setosa

Z-score transformation: Standard Z Scaling

$\frac{x - \text{mean}}{\text{std}}$

df = pd.read_csv("Iris.csv")
df_mean = df["SepalWidthCm"].mean()
df_std = df["SepalWidthCm"].std()
print(df_mean,df_std)
df["SepalWidthCm"] = (df["SepalWidthCm"] - df_mean) / df_std
df.head(2)

3.0540000000000003 0.4335943113621737

	SepalLengthCm	SepalWidthCm	PetalLengthCm	PetalWidthCm	Species
0	5.1	1.028611	1.4	0.2	Iris-setosa
1	4.9	-0.124540	1.4	0.2	Iris-setosa
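
As a sanity check, the two z-scores can be recomputed from the printed mean and standard deviation alone. This is only a sketch of the formula; the function name `z_score` is introduced here for illustration.

```python
mean = 3.0540000000000003  # printed by the code above
std = 0.4335943113621737   # pandas .std(), which uses ddof=1 by default


def z_score(x, mean, std):
    """Standardize a single value: (x - mean) / std."""
    return (x - mean) / std


z_first = z_score(3.5, mean, std)   # first row, original value 3.5
z_second = z_score(3.0, mean, std)  # second row, original value 3.0
print(round(z_first, 6), round(z_second, 6))
```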

sklearn functions

from sklearn.preprocessing import StandardScaler

scaler = StandardScaler().fit(df)

scaler.transform(newdf)

df = pd.read_csv("Iris.csv")
scaler = StandardScaler().fit(X=df[["SepalWidthCm"]])
print(scaler.mean_,scaler.var_,scaler.scale_)
df["SepalWidthCm"]=scaler.transform(df[["SepalWidthCm"]])
df.head(2)

[3.054] [0.18675067] [0.43214658]

	SepalLengthCm	SepalWidthCm	PetalLengthCm	PetalWidthCm	Species
0	5.1	1.032057	1.4	0.2	Iris-setosa
1	4.9	-0.124958	1.4	0.2	Iris-setosa
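
The printed attributes are enough to reproduce these numbers: `scale_` is the square root of `var_`, and each value is shifted by `mean_` and divided by `scale_`. The sketch below redoes that arithmetic by hand with the rounded values printed above, so the results agree only to about six decimal places.

```python
import math

mean_ = 3.054      # scaler.mean_ as printed
var_ = 0.18675067  # scaler.var_ as printed (rounded)
scale_ = math.sqrt(var_)

z_first = (3.5 - mean_) / scale_
z_second = (3.0 - mean_) / scale_
print(round(scale_, 6), round(z_first, 6), round(z_second, 6))
```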

Inverse transform

df["SepalWidthCm"] = scaler.inverse_transform(df[["SepalWidthCm"]])
df.head(2)

	SepalLengthCm	SepalWidthCm	PetalLengthCm	PetalWidthCm	Species
0	5.1	3.5	1.4	0.2	Iris-setosa
1	4.9	3.0	1.4	0.2	Iris-setosa

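Why StandardScaler and the Z-score differ

Comparing the results above, the difference comes from the std value, that is, the standard deviation. It differs because the variance and standard deviation depend on the degrees of freedom used when computing them. In statistics, the degrees of freedom (df) are the number of independent observations in a sample that carry information about the population when making a statistical estimate.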
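
Written out, the two conventions differ only in the divisor. With $N$ observations and sample mean $\bar{x}$ (notation introduced here for illustration):

$$
s_{0} = \sqrt{\frac{1}{N}\sum_{i=1}^{N}(x_i-\bar{x})^2}, \qquad
s_{1} = \sqrt{\frac{1}{N-1}\sum_{i=1}^{N}(x_i-\bar{x})^2}, \qquad
s_{0} = s_{1}\sqrt{\frac{N-1}{N}}
$$

Here $s_0$ corresponds to ddof 0 and $s_1$ to ddof 1. The gap shrinks as $N$ grows, which is why the two results agree to about two decimal places on this dataset.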

df = pd.read_csv("Iris.csv")
df_mean = df["SepalWidthCm"].mean()
df_std = df["SepalWidthCm"].std()
print(df_mean,df_std)
N=len(df)
print("ddof 0 std:",((((df["SepalWidthCm"]-df_mean)**2).sum())/N)**0.5)
N=len(df)-1
print("ddof 1 std:",((((df["SepalWidthCm"]-df_mean)**2).sum())/N)**0.5)

3.0540000000000003 0.4335943113621737
ddof 0 std: 0.4321465800705435
ddof 1 std: 0.4335943113621737

scaler = StandardScaler().fit(X=df[["SepalWidthCm"]])
print(scaler.mean_,scaler.var_,scaler.scale_)

[3.054] [0.18675067] [0.43214658]

The pandas DataFrame uses ddof 1, while StandardScaler uses ddof 0.

import numpy as np
print(np.std(df["SepalWidthCm"]))
print(np.std(df["SepalWidthCm"],ddof=1))

0.4321465800705435
0.4335943113621737
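
The same distinction exists in Python's standard library, which can make the two conventions easier to remember: `statistics.pstdev` divides by N and `statistics.stdev` divides by N-1. The sample values below are made up and are not the Iris column.

```python
import math
import statistics

values = [2.0, 3.0, 3.5, 4.4]  # made-up sample, not the Iris column
n = len(values)

std_ddof0 = statistics.pstdev(values)  # divides by N, like np.std by default
std_ddof1 = statistics.stdev(values)   # divides by N-1, like pandas .std() by default
print(std_ddof0, std_ddof1)

# The two are linked by a fixed factor that depends only on N
linked = math.isclose(std_ddof0, std_ddof1 * math.sqrt((n - 1) / n))
print(linked)
```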

Transforming with StandardScaler while accounting for degrees of freedom

# ddof 0 (default)
df = pd.read_csv("Iris.csv")
scaler = StandardScaler().fit(X=df[["SepalWidthCm"]])
print(scaler.mean_,scaler.var_,scaler.scale_)
df["SepalWidthCm"]=scaler.transform(df[["SepalWidthCm"]])
df.head(2)

[3.054] [0.18675067] [0.43214658]

	SepalLengthCm	SepalWidthCm	PetalLengthCm	PetalWidthCm	Species
0	5.1	1.032057	1.4	0.2	Iris-setosa
1	4.9	-0.124958	1.4	0.2	Iris-setosa

# ddof 1 case
df = pd.read_csv("Iris.csv")
scaler = StandardScaler().fit(X=df[["SepalWidthCm"]])
print(scaler.mean_,scaler.var_,scaler.scale_)  # printed before the override, so still ddof 0
scaler.scale_ = np.std(df["SepalWidthCm"],ddof=1) # forcibly override the fitted std value
df["SepalWidthCm"]=scaler.transform(df[["SepalWidthCm"]])
df.head(2)

[3.054] [0.18675067] [0.43214658]

	SepalLengthCm	SepalWidthCm	PetalLengthCm	PetalWidthCm	Species
0	5.1	1.028611	1.4	0.2	Iris-setosa
1	4.9	-0.124540	1.4	0.2	Iris-setosa
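
Overwriting `scale_` leaves `var_` unchanged, so the fitted attributes no longer agree with each other. An alternative, sketched here without sklearn and under the assumption that the column has N = 150 rows (the usual size of the Iris dataset), is to leave the scaler alone and multiply its ddof 0 output by $\sqrt{(N-1)/N}$. The function name `ddof0_to_ddof1` is hypothetical.

```python
import math


def ddof0_to_ddof1(z_ddof0, n):
    """Convert a z-score computed with ddof=0 into its ddof=1 equivalent."""
    return z_ddof0 * math.sqrt((n - 1) / n)


n = 150  # assumed row count of Iris.csv
z_first = ddof0_to_ddof1(1.032057, n)    # StandardScaler output for row 0
z_second = ddof0_to_ddof1(-0.124958, n)  # StandardScaler output for row 1
print(round(z_first, 6), round(z_second, 6))
```

The converted values agree with the pandas-based z-scores to about six decimal places, the precision of the rounded inputs.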

SW정리

Saturday, December 24, 2022

Standard Z Scaling and sklearn StandardScaler: why do the values differ? And degrees of freedom
