dataframe.apply

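apply is one of the things you cannot leave out when working with a DataFrame. Understanding it properly is enough to handle many of the more complicated processing tasks.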

Many apply examples use lambda functions. Lambdas are hard to debug when something goes wrong, so personally I do not recommend them much. For something simple, a lambda is fine.
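As a small illustration of that trade-off, the sketch below computes the same per-column range twice, once with a lambda and once with a named function. The frame `toy` and the function `spread` are made-up names for this example and are not part of the Iris walkthrough.

```python
import pandas as pd

# Hypothetical toy data, only for this illustration.
toy = pd.DataFrame({"a": [1, 2, 3], "b": [10, 20, 30]})

# Lambda version: compact, but it has no inner line to put a print or breakpoint on.
spread_lambda = toy.apply(lambda col: col.max() - col.min())

def spread(col):
    # Named version: each intermediate value can be inspected separately.
    highest = col.max()
    lowest = col.min()
    return highest - lowest

spread_named = toy.apply(spread)

print(spread_named)
print(spread_lambda.equals(spread_named))
```

Both calls give the same Series; the named function simply leaves room for a print or a breakpoint on each step.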

apply takes the items along either the horizontal or the vertical direction and lets you run an operation on them.
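To make the two directions concrete, here is a minimal sketch on a made-up frame (`toy` and `total` are hypothetical names). With the default `axis=0` the function receives one column at a time; with `axis=1` it receives one row at a time.

```python
import pandas as pd

# Hypothetical toy data, not the Iris file.
toy = pd.DataFrame({"a": [1, 2, 3], "b": [10, 20, 30]})

def total(values):
    # values is a Series: one column when axis=0, one row when axis=1
    return values.sum()

per_column = toy.apply(total)        # axis=0 is the default
per_row = toy.apply(total, axis=1)

print(per_column)  # one result per column, indexed by column name
print(per_row)     # one result per row, indexed by row label
```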

Data used: https://github.com/donarts/sourcecode/blob/main/datascience/datasets/Iris.csv

import pandas as pd
df = pd.read_csv("Iris.csv")
df = df.drop("Species",axis=1)
df.head(2)

	SepalLengthCm	SepalWidthCm	PetalLengthCm	PetalWidthCm
0	5.1	3.5	1.4	0.2
1	4.9	3.0	1.4	0.2

def sum_c(XX):
    return XX.sum()
df.apply(sum_c) # by default the function is applied column by column (axis=0)

SepalLengthCm    876.5
SepalWidthCm     458.1
PetalLengthCm    563.8
PetalWidthCm     179.8
dtype: float64

df = pd.read_csv("Iris.csv")
df.head(2)

	SepalLengthCm	SepalWidthCm	PetalLengthCm	PetalWidthCm	Species
0	5.1	3.5	1.4	0.2	Iris-setosa
1	4.9	3.0	1.4	0.2	Iris-setosa

df.Species.unique()

array(['Iris-setosa', 'Iris-versicolor', 'Iris-virginica'], dtype=object)

Now for slightly more complex processing: the result will depend on the Species value. The row is passed in with these positions: 0 is SepalLengthCm, 1 is PetalLengthCm, 2 is PetalWidthCm, and 3 is Species.

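def fun1(xxx):
    # xxx is one row (a Series). .iloc reads by position;
    # plain xxx[3] on a label-indexed row is deprecated in recent pandas.
    species = xxx.iloc[3]
    if species == "Iris-setosa":
        return xxx.iloc[0]  # SepalLengthCm
    if species == "Iris-versicolor":
        return xxx.iloc[1]  # PetalLengthCm
    if species == "Iris-virginica":
        return xxx.iloc[2]  # PetalWidthCm
    return -1  # unknown species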

df[["SepalLengthCm","PetalLengthCm","PetalWidthCm","Species"]].apply(fun1,axis=1)

0      5.1
1      4.9
2      4.7
3      4.6
4      5.0
      ... 
145    2.3
146    1.9
147    2.0
148    2.3
149    1.8
Length: 150, dtype: float64

Putting this result back into the DataFrame as a new column gives the following.

df["result"]=df[["SepalLengthCm","PetalLengthCm","PetalWidthCm","Species"]].apply(fun1,axis=1)
df

	SepalLengthCm	SepalWidthCm	PetalLengthCm	PetalWidthCm	Species	result
0	5.1	3.5	1.4	0.2	Iris-setosa	5.1
1	4.9	3.0	1.4	0.2	Iris-setosa	4.9
2	4.7	3.2	1.3	0.2	Iris-setosa	4.7
3	4.6	3.1	1.5	0.2	Iris-setosa	4.6
4	5.0	3.6	1.4	0.2	Iris-setosa	5.0
...	...	...	...	...	...	...
145	6.7	3.0	5.2	2.3	Iris-virginica	2.3
146	6.3	2.5	5.0	1.9	Iris-virginica	1.9
147	6.5	3.0	5.2	2.0	Iris-virginica	2.0
148	6.2	3.4	5.4	2.3	Iris-virginica	2.3
149	5.9	3.0	5.1	1.8	Iris-virginica	1.8

150 rows × 6 columns
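The same kind of per-row selection can also be written with column labels instead of positions, which avoids having to remember the column order. This is only a sketch: the three sample rows, the `COLUMN_FOR` mapping, and `pick_by_label` are hypothetical names introduced here.

```python
import pandas as pd

# Three hypothetical sample rows shaped like the Iris columns used above.
rows = pd.DataFrame({
    "SepalLengthCm": [5.1, 7.0, 6.3],
    "PetalLengthCm": [1.4, 4.7, 6.0],
    "PetalWidthCm": [0.2, 1.4, 2.5],
    "Species": ["Iris-setosa", "Iris-versicolor", "Iris-virginica"],
})

# Assumed mapping: which column to return for each species.
COLUMN_FOR = {
    "Iris-setosa": "SepalLengthCm",
    "Iris-versicolor": "PetalLengthCm",
    "Iris-virginica": "PetalWidthCm",
}

def pick_by_label(row):
    # Look up the target column by species; -1 marks an unknown species.
    column = COLUMN_FOR.get(row["Species"])
    return row[column] if column is not None else -1

result = rows.apply(pick_by_label, axis=1)
print(result)
```

Because the function reads `row["Species"]` by name, it no longer depends on which columns are selected or in what order.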

With a little practice, apply can handle more complex processing as well.

dataframe.agg

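agg/aggregate lets you call several functions like these in a single call. Calling several functions means that several Series results come back at once. The explanation here is only meant to help you understand the behavior; for detailed usage, see the official documentation: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.agg.html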

Below is the most basic example. It computes the sum and the minimum of each row and shows them together.

df[["SepalLengthCm","PetalLengthCm","PetalWidthCm"]].agg(["sum","min"],axis=1)

	sum	min
0	6.7	0.2
1	6.5	0.2
2	6.2	0.2
3	6.3	0.2
4	6.6	0.2
...	...	...
145	14.2	2.3
146	13.2	1.9
147	13.7	2.0
148	13.9	2.3
149	12.8	1.8

150 rows × 2 columns
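One way to read this output is that each name in the list becomes one column of the result. The sketch below, on a made-up frame (`toy` is a hypothetical name), compares the combined call with the same reductions computed one at a time.

```python
import pandas as pd

# Hypothetical toy data with values that add up exactly in floating point.
toy = pd.DataFrame({"a": [1.0, 2.0], "b": [3.0, 5.0], "c": [0.5, 0.25]})

# One call, two reductions per row.
combined = toy.agg(["sum", "min"], axis=1)

# The same numbers computed separately and placed side by side.
separate = pd.DataFrame({
    "sum": toy.sum(axis=1),
    "min": toy.min(axis=1),
})

print(combined)
print(separate)
```

Both frames have one row per input row and one column per reduction.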

Next, several slightly more complex custom functions are used together.

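def fun1(xxx):
    # same logic as before: pick a value by position depending on the species
    if xxx.iloc[3]=="Iris-setosa":
        return xxx.iloc[0]
    if xxx.iloc[3]=="Iris-versicolor":
        return xxx.iloc[1]
    if xxx.iloc[3]=="Iris-virginica":
        return xxx.iloc[2]
    return -1

def fun2(xxx):
    return xxx.iloc[0]  # SepalLengthCm

def fun3(xxx):
    return xxx.iloc[1]  # PetalLengthCm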

df[["SepalLengthCm","PetalLengthCm","PetalWidthCm","Species"]].agg([fun1,fun2,fun3],axis=1)

	fun1	fun2	fun3
0	5.1	5.1	1.4
1	4.9	4.9	1.4
2	4.7	4.7	1.3
3	4.6	4.6	1.5
4	5.0	5.0	1.4
...	...	...	...
145	2.3	6.7	5.2
146	1.9	6.3	5.0
147	2.0	6.5	5.2
148	2.3	6.2	5.4
149	1.8	5.9	5.1

150 rows × 3 columns

This time the functions return constants.

def fun2(xxx):
    return 2

def fun3(xxx):
    return 3

df[["SepalLengthCm","PetalLengthCm","PetalWidthCm","Species"]].agg([fun2,fun3],axis=1)

		SepalLengthCm	PetalLengthCm	PetalWidthCm	Species
0	fun2	2	2	2	2
0	fun3	3	3	3	3
1	fun2	2	2	2	2
1	fun3	3	3	3	3
2	fun2	2	2	2	2
...	...	...	...	...	...
147	fun3	3	3	3	3
148	fun2	2	2	2	2
148	fun3	3	3	3	3
149	fun2	2	2	2	2
149	fun3	3	3	3	3

300 rows × 4 columns

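Looking at the result, the behavior is odd: instead of one row per input row with a fun2 and a fun3 column, there are 300 rows, one per (row, function) pair, with the constant repeated under every input column. The next call puts fun1 back into the list alongside the constant-returning fun2 and fun3.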

df[["SepalLengthCm","PetalLengthCm","PetalWidthCm","Species"]].agg([fun1,fun2,fun3],axis=1)

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
File ~\AppData\Local\Programs\Python\Python38\lib\site-packages\pandas\core\apply.py:430, in Apply.agg_list_like(self)
    429 try:
--> 430     concatenated = concat(results, keys=keys, axis=1, sort=False)
    431 except TypeError as err:
    432     # we are concatting non-NDFrame objects,
    433     # e.g. a list of scalars

File ~\AppData\Local\Programs\Python\Python38\lib\site-packages\pandas\util\_decorators.py:331, in deprecate_nonkeyword_arguments.<locals>.decorate.<locals>.wrapper(*args, **kwargs)
    326     warnings.warn(
    327         msg.format(arguments=_format_argument_list(allow_args)),
    328         FutureWarning,
    329         stacklevel=find_stack_level(),
    330     )
--> 331 return func(*args, **kwargs)

File ~\AppData\Local\Programs\Python\Python38\lib\site-packages\pandas\core\reshape\concat.py:368, in concat(objs, axis, join, ignore_index, keys, levels, names, verify_integrity, sort, copy)
    159 """
    160 Concatenate pandas objects along a particular axis.
    161 
   (...)
    366 1   3   4
    367 """
--> 368 op = _Concatenator(
    369     objs,
    370     axis=axis,
    371     ignore_index=ignore_index,
    372     join=join,
    373     keys=keys,
    374     levels=levels,
    375     names=names,
    376     verify_integrity=verify_integrity,
    377     copy=copy,
    378     sort=sort,
    379 )
    381 return op.get_result()

File ~\AppData\Local\Programs\Python\Python38\lib\site-packages\pandas\core\reshape\concat.py:458, in _Concatenator.__init__(self, objs, axis, join, keys, levels, names, ignore_index, verify_integrity, copy, sort)
    454     msg = (
    455         f"cannot concatenate object of type '{type(obj)}'; "
    456         "only Series and DataFrame objs are valid"
    457     )
--> 458     raise TypeError(msg)
    460 ndims.add(obj.ndim)

TypeError: cannot concatenate object of type '<class 'float'>'; only Series and DataFrame objs are valid

The above exception was the direct cause of the following exception:

ValueError                                Traceback (most recent call last)
Cell In [90], line 1
----> 1 df[["SepalLengthCm","PetalLengthCm","PetalWidthCm","Species"]].agg([fun1,fun2,fun3],axis=1)

File ~\AppData\Local\Programs\Python\Python38\lib\site-packages\pandas\core\frame.py:9329, in DataFrame.aggregate(self, func, axis, *args, **kwargs)
   9326 relabeling, func, columns, order = reconstruct_func(func, **kwargs)
   9328 op = frame_apply(self, func=func, axis=axis, args=args, kwargs=kwargs)
-> 9329 result = op.agg()
   9331 if relabeling:
   9332     # This is to keep the order to columns occurrence unchanged, and also
   9333     # keep the order of new columns occurrence unchanged
   9334 
   9335     # For the return values of reconstruct_func, if relabeling is
   9336     # False, columns and order will be None.
   9337     assert columns is not None

File ~\AppData\Local\Programs\Python\Python38\lib\site-packages\pandas\core\apply.py:758, in FrameApply.agg(self)
    756 result = None
    757 try:
--> 758     result = super().agg()
    759 except TypeError as err:
    760     exc = TypeError(
    761         "DataFrame constructor called with "
    762         f"incompatible data and dtype: {err}"
    763     )

File ~\AppData\Local\Programs\Python\Python38\lib\site-packages\pandas\core\apply.py:172, in Apply.agg(self)
    169     return self.agg_dict_like()
    170 elif is_list_like(arg):
    171     # we require a list, but not a 'str'
--> 172     return self.agg_list_like()
    174 if callable(arg):
    175     f = com.get_cython_func(arg)

File ~\AppData\Local\Programs\Python\Python38\lib\site-packages\pandas\core\apply.py:383, in Apply.agg_list_like(self)
    376 try:
    377     # Capture and suppress any warnings emitted by us in the call
    378     # to agg below, but pass through any warnings that were
    379     # generated otherwise.
    380     # This is necessary because of https://bugs.python.org/issue29672
    381     # See GH #43741 for more details
    382     with warnings.catch_warnings(record=True) as record:
--> 383         new_res = colg.aggregate(arg)
    384     if len(record) > 0:
    385         match = re.compile(depr_nuisance_columns_msg.format(".*"))

File ~\AppData\Local\Programs\Python\Python38\lib\site-packages\pandas\core\series.py:4605, in Series.aggregate(self, func, axis, *args, **kwargs)
   4602     func = dict(kwargs.items())
   4604 op = SeriesApply(self, func, convert_dtype=False, args=args, kwargs=kwargs)
-> 4605 result = op.agg()
   4606 return result

File ~\AppData\Local\Programs\Python\Python38\lib\site-packages\pandas\core\apply.py:1108, in SeriesApply.agg(self)
   1107 def agg(self):
-> 1108     result = super().agg()
   1109     if result is None:
   1110         f = self.f

File ~\AppData\Local\Programs\Python\Python38\lib\site-packages\pandas\core\apply.py:172, in Apply.agg(self)
    169     return self.agg_dict_like()
    170 elif is_list_like(arg):
    171     # we require a list, but not a 'str'
--> 172     return self.agg_list_like()
    174 if callable(arg):
    175     f = com.get_cython_func(arg)

File ~\AppData\Local\Programs\Python\Python38\lib\site-packages\pandas\core\apply.py:438, in Apply.agg_list_like(self)
    436     result = Series(results, index=keys, name=obj.name)
    437     if is_nested_object(result):
--> 438         raise ValueError(
    439             "cannot combine transform and aggregation operations"
    440         ) from err
    441     return result
    442 else:
    443     # Concat uses the first index to determine the final indexing order.
    444     # The union of a shorter first index with the other indices causes
    445     # the index sorting to be different from the order of the aggregating
    446     # functions. Reindex if this is the case.

ValueError: cannot combine transform and aggregation operations

This time it even raises an error.

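A little debugging shows that a value returned by indexing, such as xxx[2], looks like a float, and printing its actual type does show class float, the same type named in the TypeError above. So the difference seems to come from returning a constant, which appears to follow a slightly different path.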
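One possible explanation, offered as an assumption drawn from the traceback rather than from the documentation: in the pandas version shown above, Series.agg appears to try each function on every element first and only passes the whole row when that attempt raises. A function that indexes its argument fails on a single float and so receives the row, while a function that returns a constant succeeds on every element and yields one value per column. The sketch below imitates that assumed fallback with hypothetical helpers (`takes_row`, `constant`, `call_like_old_agg`); it is not the pandas source, and other pandas versions may behave differently.

```python
import pandas as pd

# A hypothetical row, shaped like one row of the four selected columns.
row = pd.Series(
    [5.1, 1.4, 0.2, "Iris-setosa"],
    index=["SepalLengthCm", "PetalLengthCm", "PetalWidthCm", "Species"],
)

def takes_row(xxx):
    # Needs the whole row: raises if xxx is a single float or string.
    return xxx.iloc[0]

def constant(xxx):
    # Never touches its argument, so it works on a row and on a single value alike.
    return 2

def call_like_old_agg(func, series):
    # Assumed emulation: try every element first, fall back to the whole Series.
    try:
        return series.apply(func)      # element by element
    except (ValueError, AttributeError, TypeError):
        return func(series)            # whole row at once

print(call_like_old_agg(takes_row, row))  # a single value taken from the row
print(call_like_old_agg(constant, row))   # a Series holding one 2 per column
```

Under that assumption, mixing the two kinds of function in one list would produce one scalar and one Series per row, which matches the "cannot combine transform and aggregation operations" message.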

Why the behavior differs is not explained in the official documentation either. It needs a review of the pandas source, so I will leave it as an open item.
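In the meantime, one way to sidestep the question is to use apply with a single function that returns a Series, so that each row always produces one value per named result, constants included. This is a sketch under my own naming (`rows`, `per_row`, and the labels `first`, `two`, `three` are hypothetical), not a replacement the pandas documentation prescribes.

```python
import pandas as pd

# Three hypothetical sample rows shaped like the Iris columns used above.
rows = pd.DataFrame({
    "SepalLengthCm": [5.1, 7.0, 6.3],
    "PetalLengthCm": [1.4, 4.7, 6.0],
    "PetalWidthCm": [0.2, 1.4, 2.5],
    "Species": ["Iris-setosa", "Iris-versicolor", "Iris-virginica"],
})

def per_row(row):
    # Return several named results for one row; the Series index becomes the columns.
    return pd.Series({
        "first": row["SepalLengthCm"],
        "two": 2,      # constant result
        "three": 3,    # constant result
    })

result = rows.apply(per_row, axis=1)
print(result)
```

Because the function is always called with a whole row, a constant return value is treated like any other result.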

SW정리

Tuesday, January 10, 2023

Understanding pandas DataFrame apply / agg
