1. Previously
In https://swlock.blogspot.com/2018/12/feature.html I put some thought into building this, but actual testing showed it was far too slow. It did not seem worth publishing as-is, so I did some further analysis.
2. Preparing the data
First, prepare a large dataset. I generated about 200,000 rows.
Generation code
import numpy as np
import csv

MAX_ITEM = 200000
MAX_COLMUM = 4

title = 'Name Product Date Sellcount'.split()
name = 'micle sally pop kim park don'.split()
product = 'PA PB PC PD PE PF'.split()

f = open('train_long.csv', 'w', encoding='utf-8', newline='')
wr = csv.writer(f, delimiter=',')
wr.writerow(title)
for index in range(0, MAX_ITEM):
    random_data = np.random.rand(MAX_COLMUM).tolist()
    random_data[0] = name[int((random_data[0]*len(name))%len(name))]
    random_data[1] = product[int((random_data[1]*len(product))%len(product))]
    random_data[2] = int((random_data[2]*100)%100)
    random_data[3] = int((random_data[3]*5)%5)
    wr.writerow(random_data)
f.close()
There are four columns (Name, Product, Date, Sellcount). The actual values are stored in random_data[0] through random_data[3] and written to the file by wr.writerow(random_data).
Execution code
import pandas as pd
import time

start = time.time()
df = pd.read_csv("train_long.csv", delimiter=',', header=0)
#print("## All")
#print(df)
#print("## Name")
df_name = df["Name"].drop_duplicates()
#print(df_name)
#print("## Product")
df_product = df["Product"].drop_duplicates()
#print(df_product)
PRIOD_COND=[(0,3),(4,6),(7,12),(13,24),(25,48),(49,96),(97,999)]
table_dict = {}
df_o = pd.DataFrame({})
for name in list(df_name):
    tlist = [name]
    for product in list(df_product):
        for condl,condh in PRIOD_COND:
            df_c = df.copy()
            df_c = df_c.loc[(df_c['Name'] == name)&(df_c['Product'] == product)&(df_c['Date'] >= condl)&(df_c['Date'] <= condh)]
            #print("## select ",name,product,condl,condh)
            #if(len(df_c.index)>0):
            #    print(df_c)
            #    print("sum",df_c["Sellcount"].sum())
            tlist.append(df_c["Sellcount"].sum())
    print(tlist)
    df_o = df_o.append( pd.DataFrame(tlist).T )
print ("## Result1")
print (df_o)
tlist = ["Name"]
for product in list(df_product):
    for condl,condh in PRIOD_COND:
        tlist.append(product+"_"+str(condh))
df_o.columns=(tlist)
print ("## Result2")
print (df_o)
print ("timediff :",time.time()-start,"sec")
This is the code written previously, with (time.time()-start) added at the end to print the elapsed time in seconds.
Execution result
['pop', 470, 353, 634, 1408, 2605, 5256, 296, 419, 297, 729, 1210, 2784, 5314, 360, 456, 336, 652, 1320, 2650, 5174, 379, 433, 315, 614, 1254, 2571, 5340, 315, 430, 312, 646, 1338, 2582, 5305, 374, 452, 379, 699, 1347, 2594, 5417, 319]
['kim', 441, 332, 745, 1272, 2617, 5440, 293, 468, 348, 743, 1261, 2711, 5417, 310, 466, 287, 631, 1335, 2579, 5238, 349, 464, 323, 626, 1225, 2620, 5426, 318, 441, 266, 678, 1279, 2727, 5484, 383, 473, 358, 677, 1320, 2653, 5247, 334]
...omitted...
## Result1
      0    1    2    3     4     5     6    7    8    9  ...    33    34   35   36   37   38    39    40    41   42
0   pop  470  353  634  1408  2605  5256  296  419  297  ...  2582  5305  374  452  379  699  1347  2594  5417  319
...omitted...
[6 rows x 43 columns]
## Result2
   Name  PD_3  PD_6  PD_12  PD_24  PD_48  PD_96  PD_999  PA_3  ...  PC_96  PC_999  PF_3  PF_6  PF_12  PF_24  PF_48  PF_96  PF_999
0   pop   470   353    634   1408   2605   5256     296   419  ...   5305     374   452   379    699   1347   2594   5417     319
...omitted...
[6 rows x 43 columns]
timediff : 21.010799407958984 sec
On my PC this takes 21 seconds. With even larger data this would be hard to ignore.
The time is lost in the following section:
df_c = df_c.loc[(df_c['Name'] == name)&(df_c['Product'] == product)&(df_c['Date'] >= condl)&(df_c['Date'] <= condh)]
Selecting the desired rows out of 200,000 records with this expression on every iteration is what makes it so slow. The improvement will be to select on df_c['Name'] == name first.
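The effect of hoisting the Name filter out of the loop can be seen on a toy frame (the data here is made up for illustration, not train_long.csv):

```python
import pandas as pd

# Toy frame standing in for train_long.csv (names/values are made up).
df = pd.DataFrame({
    "Name": ["kim", "kim", "pop", "kim"],
    "Product": ["PA", "PB", "PA", "PA"],
    "Date": [2, 10, 5, 5],
    "Sellcount": [1, 2, 3, 4],
})

# Original pattern: every loop iteration re-evaluates the Name mask
# over the whole frame.
slow = df.loc[(df["Name"] == "kim") & (df["Product"] == "PA")
              & (df["Date"] >= 4) & (df["Date"] <= 6)]["Sellcount"].sum()

# Improved pattern: filter on Name once, then apply the remaining
# conditions to the much smaller frame.
df_cn = df.loc[df["Name"] == "kim"]
fast = df_cn.loc[(df_cn["Product"] == "PA")
                 & (df_cn["Date"] >= 4) & (df_cn["Date"] <= 6)]["Sellcount"].sum()

print(slow, fast)  # 4 4
```

Both selections return the same rows; the pre-filtered version simply scans far fewer of them per iteration.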
3. Improved code
import pandas as pd
import time

start = time.time()
df = pd.read_csv("train_long.csv", delimiter=',', header=0)
#print("## All")
#print(df)
#print("## Name")
df_name = df["Name"].drop_duplicates()
#print(df_name)
#print("## Product")
df_product = df["Product"].drop_duplicates()
#print(df_product)
PRIOD_COND=[(0,3),(4,6),(7,12),(13,24),(25,48),(49,96),(97,999)]
table_dict = {}
df_o = pd.DataFrame({})
for name in list(df_name):
    tlist = [name]
    df_cn = df.loc[(df['Name'] == name)]
    for product in list(df_product):
        for condl,condh in PRIOD_COND:
            df_cnr = df_cn.loc[(df_cn['Product'] == product)&(df_cn['Date'] >= condl)&(df_cn['Date'] <= condh)]
            tlist.append(df_cnr["Sellcount"].sum())
    print(tlist)
    df_o = df_o.append( pd.DataFrame(tlist).T )
print ("## Result1")
print (df_o)
tlist = ["Name"]
for product in list(df_product):
    for condl,condh in PRIOD_COND:
        tlist.append(product+"_"+str(condh))
df_o.columns=(tlist)
print ("## Result2")
print (df_o)
print ("timediff :",time.time()-start,"sec")
Outside the product for loop, part of the data is selected first to build df_cn.
df_cn = df.loc[(df['Name'] == name)]
Inside the loop, the search then runs without the Name condition:
for product in list(df_product):
    for condl,condh in PRIOD_COND:
        df_cnr = df_cn.loc[(df_cn['Product'] == product)&(df_cn['Date'] >= condl)&(df_cn['Date'] <= condh)]
Result
['pop', 470, 353, 634, 1408, 2605, 5256, 296, 419, 297, 729, 1210, 2784, 5314, 360, 456, 336, 652, 1320, 2650, 5174, 379, 433, 315, 614, 1254, 2571, 5340, 315, 430, 312, 646, 1338, 2582, 5305, 374, 452, 379, 699, 1347, 2594, 5417, 319]
...omitted...
## Result1
...omitted...
[6 rows x 43 columns]
## Result2
   Name  PD_3  PD_6  PD_12  PD_24  PD_48  PD_96  PD_999  PA_3  ...  PC_96  PC_999  PF_3  PF_6  PF_12  PF_24  PF_48  PF_96  PF_999
0   pop   470   353    634   1408   2605   5256     296   419  ...   5305     374   452   379    699   1347   2594   5417     319
...omitted...
[6 rows x 43 columns]
timediff : 2.570505666732788 sec
It now takes about 2.5 seconds instead of 21, roughly a 10x speedup.
4. Further improvements
This time we will use the groupby feature. It groups rows together, and since we also need to bucket the Date values into ranges, we will use the cut feature at the same time. It is easiest to understand by reading the whole code.

import pandas as pd
import time

start = time.time()
df = pd.read_csv("train_long.csv", delimiter=',', header=0)
#print("## All")
#print(df)
PRIOD_COND=[-1,3,6,12,24,48,96,999]
PRIOD_labels = ["3", "6", "12", "24", "48","96","999"]
table_dict = {}
cutdata = pd.cut(df["Date"], PRIOD_COND, labels=PRIOD_labels)
df["Cutdata"]=cutdata
print(df)
print(df.groupby([df["Name"],df["Product"],df["Cutdata"]]).sum()["Sellcount"])
print(df.groupby([df["Name"],df["Product"],df["Cutdata"]]).sum()["Sellcount"].unstack(["Product","Cutdata"]))
df_o = df.groupby([df["Name"],df["Product"],df["Cutdata"]]).sum()["Sellcount"].unstack(["Product","Cutdata"])
print(df_o)
print ("timediff :",time.time()-start,"sec")
Since Date contains only numbers, it is hard to group on directly. We need a function that maps a numeric value to a category, and pd.cut does exactly that. Here it is this part of the code:
cutdata = pd.cut(df["Date"], PRIOD_COND, labels=PRIOD_labels)
The target is df["Date"], the first argument. PRIOD_COND=[-1,3,6,12,24,48,96,999] contains the boundary values that split the range into bins: -1~3, 4~6, 7~12, and so on. A label can be attached to each bin, which is what labels=PRIOD_labels does.
To restate:
PRIOD_COND=[-1,3,6,12,24,48,96,999]
PRIOD_labels = ["3", "6", "12", "24", "48","96","999"]
means -1~3 => "3", 4~6 => "6", and so on.
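As a quick standalone illustration of how these bins and labels line up (the sample Date values are made up):

```python
import pandas as pd

PRIOD_COND = [-1, 3, 6, 12, 24, 48, 96, 999]
PRIOD_labels = ["3", "6", "12", "24", "48", "96", "999"]

# Each value falls into the half-open bin (low, high] and receives
# that bin's label.
dates = pd.Series([0, 3, 4, 6, 7, 100])
cut = pd.cut(dates, PRIOD_COND, labels=PRIOD_labels)
print(list(cut))  # ['3', '3', '6', '6', '12', '999']
```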
The result is stored in a Cutdata column:
df["Cutdata"]=cutdata
To explain unstack in the code above, I prepared the following two lines.
print(df.groupby([df["Name"],df["Product"],df["Cutdata"]]).sum()["Sellcount"])
print(df.groupby([df["Name"],df["Product"],df["Cutdata"]]).sum()["Sellcount"].unstack(["Product","Cutdata"]))
The code before unstack is:
print(df.groupby([df["Name"],df["Product"],df["Cutdata"]]).sum()["Sellcount"])
and its output looks like this:
Name Product Cutdata
don PA 3 380
6 286
12 694
24 1276
48 2833
96 5439
999 318
PB 3 483
6 376
12 727
24 1254
48 2889
96 5296
999 363
...omitted...
That is why code like unstack(["Product","Cutdata"]) is needed, and the result becomes:
Product PA PB
Cutdata 3 6 12 24 48 96 999 3
Name
don 380 286 694 1276 2833 5439 318 483
...omitted...
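A minimal, self-contained illustration of what unstack does to such a grouped Series (toy values, not the real data):

```python
import pandas as pd

# Tiny stand-in for the grouped Sellcount series above.
df = pd.DataFrame({
    "Name": ["don", "don", "kim", "kim"],
    "Product": ["PA", "PB", "PA", "PB"],
    "Sellcount": [1, 2, 3, 4],
})

# groupby on two keys yields a Series with a two-level row index...
s = df.groupby(["Name", "Product"])["Sellcount"].sum()

# ...and unstack pivots the named level into columns.
wide = s.unstack("Product")
print(wide)  # rows: don, kim; columns: PA, PB
```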
It took about 0.7 seconds. The full result is below.
Result
     Name Product  Date  Sellcount Cutdata
0     pop      PD    68          3      96
1     kim      PA     1          2       3
2     don      PA    34          3      48
...omitted...
[200000 rows x 5 columns]
Name   Product  Cutdata
don    PA       3           380
                6           286
                12          694
...omitted...
Name: Sellcount, Length: 252, dtype: int64
Product   PA                                  ...    PF
Cutdata    3    6   12    24    48    96  999 ...     3    6   12    24    48    96  999
Name                                          ...
don      380  286  694  1276  2833  5439  318 ...   469  334  666  1446  2834  5488  333
kim      468  348  743  1261  2711  5417  310 ...   473  358  677  1320  2653  5247  334
...omitted...
[6 rows x 42 columns]
...omitted...
timediff : 0.7497549057006836 sec
5. Using the result
When you use groupby, the resulting columns can come out as a MultiIndex, which is not easy to select from. In that case you can join on the index (in a DataFrame this is merge). Reference code:
import pandas as pd
import time

start = time.time()
df = pd.read_csv("train_long.csv", delimiter=',', header=0)
#print("## All")
#print(df)
PRIOD_COND=[-1,3,6,12,24,48,96,999]
PRIOD_labels = ["3", "6", "12", "24", "48","96","999"]
table_dict = {}
cutdata = pd.cut(df["Date"], PRIOD_COND, labels=PRIOD_labels)
df["Cutdata"]=cutdata
#print(df)
#print(df.groupby([df["Name"],df["Product"],df["Cutdata"]]).sum()["Sellcount"])
#print(df.groupby([df["Name"],df["Product"],df["Cutdata"]]).sum()["Sellcount"].unstack(["Product","Cutdata"]))
df_o = df.groupby([df["Name"],df["Product"],df["Cutdata"]]).sum()["Sellcount"].unstack(["Product","Cutdata"])
print(df_o)
df_o=pd.merge(df,df_o,how="left",on='Name')
print(df_o)
print ("timediff :",time.time()-start,"sec")
Result
Product   PA                                  ...
Cutdata    3    6   12    24    48    96  999 ...
Name                                          ...
don      380  286  694  1276  2833  5439  318 ...
...omitted...
[6 rows x 42 columns]
e:\ProgramData\Anaconda3\lib\site-packages\pandas\core\reshape\merge.py:544: UserWarning: merging between different levels can give an unintended result (1 levels on the left, 2 on the right)
  warnings.warn(msg, UserWarning)
     Name Product  Date  Sellcount Cutdata  ...  (PF, 12)  (PF, 24)  (PF, 48)  (PF, 96)  (PF, 999)
0     pop      PD    68          3      96  ...       699      1347      2594      5417        319
1     kim      PA     1          2       3  ...       677      1320      2653      5247        334
...omitted...
[200000 rows x 47 columns]
timediff : 1.005450963973999 sec
Looking at the column names of the result, you can see they have the form (PF, 12), (PF, 24), (PF, 48), (PF, 96), (PF, 999).
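If those tuple-style names are inconvenient, one possible workaround (not part of the original post; the frame below is a made-up stand-in for df_o) is to join the two column levels into flat strings before merging:

```python
import pandas as pd

# Small frame with the same two-level column shape as df_o above
# (values are made up for illustration).
df_o = pd.DataFrame(
    [[1, 2], [3, 4]],
    index=pd.Index(["don", "kim"], name="Name"),
    columns=pd.MultiIndex.from_tuples(
        [("PF", "12"), ("PF", "24")], names=["Product", "Cutdata"]),
)

# Join the Product and Cutdata levels into flat "PF_12"-style names,
# which also avoids the merge UserWarning about differing levels.
df_o.columns = ["_".join(col) for col in df_o.columns]
print(df_o.columns.tolist())  # ['PF_12', 'PF_24']
```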
6. Working with large data
groupby can sometimes fail with out-of-memory errors. In that case, process the data in parts. In the example above the key is Name, so you can select one person at a time (or half of them at a time), run groupby on each part, and then join the results.
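The split, aggregate, and combine idea described above can be sketched like this (a minimal toy frame; the real data would be the 200,000-row train_long.csv):

```python
import pandas as pd

# Toy frame standing in for the full dataset.
df = pd.DataFrame({
    "Name": ["don", "kim", "don", "pop"],
    "Sellcount": [1, 2, 3, 4],
})

# Select one key at a time, aggregate that part, then combine.
parts = []
for name in df["Name"].drop_duplicates():
    part = df.loc[df["Name"] == name].groupby("Name")["Sellcount"].sum()
    parts.append(part)
combined = pd.concat(parts)

# The combined result matches grouping the whole frame at once.
full = df.groupby("Name")["Sellcount"].sum()
print(combined.sort_index().equals(full.sort_index()))  # True
```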
Here is an example of selecting with query:
import pandas as pd
import time

start = time.time()
df = pd.read_csv("train_long.csv", delimiter=',', header=0)
# PRIOD_COND and PRIOD_labels as defined in section 4.
PRIOD_COND=[-1,3,6,12,24,48,96,999]
PRIOD_labels = ["3", "6", "12", "24", "48","96","999"]
cutdata = pd.cut(df["Date"], PRIOD_COND, labels=PRIOD_labels)
df["Cutdata"]=cutdata
df_s = df.query("(Name < 'h')")
print(df_s)
df_s = df.query("(Name >= 'h') and (Name < 'p')")
print(df_s)
df_s = df.query("(Name >= 'p')")
print(df_s)
print ("timediff :",time.time()-start,"sec")
The sizes of the individual results add up to 200,000 rows, the size of the full loaded dataframe.
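That the three string-range queries partition the Name column without gaps or overlap can be checked on a toy frame (same names, far fewer rows):

```python
import pandas as pd

# Toy names standing in for the real Name column.
df = pd.DataFrame({"Name": ["don", "kim", "micle", "pop", "sally", "park"]})

# The three string-range queries are disjoint and cover every name.
a = df.query("(Name < 'h')")
b = df.query("(Name >= 'h') and (Name < 'p')")
c = df.query("(Name >= 'p')")
print(len(a) + len(b) + len(c) == len(df))  # True
```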
Result
     Name Product  Date  Sellcount Cutdata
2     don      PA    34          3      48
13    don      PB    49          4      96
...omitted...
[33597 rows x 5 columns]
     Name Product  Date  Sellcount Cutdata
1     kim      PA     1          2       3
4   micle      PA    17          4      24
...omitted...
[66849 rows x 5 columns]
     Name Product  Date  Sellcount Cutdata
0     pop      PD    68          3      96
3   sally      PE    38          3      48
...omitted...
[99554 rows x 5 columns]
timediff : 0.6376185417175293 sec