Sunday, March 3, 2019

Reading Naver Finance (stock) data with Python



Saving per-stock data from Naver Finance

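Before saving anything, we need to know which page serves the data. Selecting NAVER as the stock brings up a screen like the one below. Hover the mouse cursor over page 2 of the daily price list at the very bottom, and the link address is shown.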



The address has the following form:
https://finance.naver.com/item/sise_day.nhn?code=035420&page=2
Opening it shows the data below. The code value is the stock code, and page is the page number.
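As a small illustration of how the two query parameters fit together, the sketch below builds the same address from a stock code and a page number. The helper name build_daily_url is made up for this example and does not appear in the code later in the post.

```python
from urllib.parse import urlencode

BASE_URL = "https://finance.naver.com/item/sise_day.nhn"

def build_daily_url(code: str, page: int) -> str:
    """Return the daily-price URL for one stock code and one page number."""
    # The stock code is kept as a string so leading zeros are preserved.
    return f"{BASE_URL}?{urlencode({'code': code, 'page': page})}"

print(build_daily_url("035420", 2))
# https://finance.naver.com/item/sise_day.nhn?code=035420&page=2
```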


HTML code

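Opening the HTML with View Source gives the following. The table header cells are 날짜 (date), 종가 (closing price), 전일비 (change from the previous day), 시가 (opening price), 고가 (high), 저가 (low), and 거래량 (volume).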

<html lang="ko">
<head>
<meta http-equiv="Content-Type" content="text/html; charset=euc-kr">
<title>네이버 금융</title>
<link rel="stylesheet" type="text/css" href="/css/newstock.css?20190228170426">
<link rel="stylesheet" type="text/css" href="/css/common.css?20190228170426">
<link rel="stylesheet" type="text/css" href="/css/layout.css?20190228170426">
<link rel="stylesheet" type="text/css" href="/css/main.css?20190228170426">
<link rel="stylesheet" type="text/css" href="/css/newstock2.css?20190228170426">
<link rel="stylesheet" type="text/css" href="/css/newstock3.css?20190228170426">
<link rel="stylesheet" type="text/css" href="/css/world.css?20190228170426">
</head>
<body>
    <h4 class="tlline2"><strong><span class="red03">일별</span>시세</strong></h4>   
    <table cellspacing="0" class="type2">
    <tr>
    <th>날짜</th>
    <th>종가</th>
    <th>전일비</th>
    <th>시가</th>
    <th>고가</th>
    <th>저가</th>
    <th>거래량</th>
    </tr>
    <tr>
    <td colspan="7" height="8"></td>
    </tr>
     <tr onMouseOver="mouseOver(this)" onMouseOut="mouseOut(this)">
     <td align="center"><span class="tah p10 gray03">2019.02.14</span></td>
     <td class="num"><span class="tah p11">127,500</span></td>
     <td class="num">
    <img src="https://ssl.pstatic.net/imgstock/images/images4/ico_up.gif" width="7" height="6" style="margin-right:4px;" alt="상승"><span class="tah p11 red02">
    2,500
    </span>
   </td>
     <td class="num"><span class="tah p11">125,000</span></td>
     <td class="num"><span class="tah p11">130,500</span></td>
     <td class="num"><span class="tah p11">125,000</span></td>
     <td class="num"><span class="tah p11">700,427</span></td>
     </tr>
     <tr onMouseOver="mouseOver(this)" onMouseOut="mouseOut(this)">
     <td align="center"><span class="tah p10 gray03">2019.02.13</span></td>
     <td class="num"><span class="tah p11">125,000</span></td>
     <td class="num">
    <img src="https://ssl.pstatic.net/imgstock/images/images4/ico_up.gif" width="7" height="6" style="margin-right:4px;" alt="상승"><span class="tah p11 red02">
    1,500
    </span>
   </td>
     <td class="num"><span class="tah p11">123,500</span></td>
     <td class="num"><span class="tah p11">126,500</span></td>
     <td class="num"><span class="tah p11">123,500</span></td>
     <td class="num"><span class="tah p11">373,571</span></td>
     </tr>
     <tr onMouseOver="mouseOver(this)" onMouseOut="mouseOut(this)">
     <td align="center"><span class="tah p10 gray03">2019.02.12</span></td>
     <td class="num"><span class="tah p11">123,500</span></td>
     <td class="num">
    <img src="https://ssl.pstatic.net/imgstock/images/images4/ico_down.gif" width="7" height="6" style="margin-right:4px;" alt="하락"><span class="tah p11 nv01">
    2,000
    </span>
   </td>
     <td class="num"><span class="tah p11">123,500</span></td>
     <td class="num"><span class="tah p11">124,000</span></td>
     <td class="num"><span class="tah p11">121,500</span></td>
     <td class="num"><span class="tah p11">1,160,217</span></td>
     </tr>
     <tr onMouseOver="mouseOver(this)" onMouseOut="mouseOut(this)">
     <td align="center"><span class="tah p10 gray03">2019.02.11</span></td>
     <td class="num"><span class="tah p11">125,500</span></td>
     <td class="num">
    <img src="https://ssl.pstatic.net/imgstock/images/images4/ico_down.gif" width="7" height="6" style="margin-right:4px;" alt="하락"><span class="tah p11 nv01">
    500
    </span>
   </td>
     <td class="num"><span class="tah p11">127,000</span></td>
     <td class="num"><span class="tah p11">128,500</span></td>
     <td class="num"><span class="tah p11">125,000</span></td>
     <td class="num"><span class="tah p11">801,465</span></td>
     </tr>
     <tr onMouseOver="mouseOver(this)" onMouseOut="mouseOut(this)">
     <td align="center"><span class="tah p10 gray03">2019.02.08</span></td>
     <td class="num"><span class="tah p11">126,000</span></td>
     <td class="num">
    <img src="https://ssl.pstatic.net/imgstock/images/images4/ico_down.gif" width="7" height="6" style="margin-right:4px;" alt="하락"><span class="tah p11 nv01">
    3,000
    </span>
   </td>
     <td class="num"><span class="tah p11">126,500</span></td>
     <td class="num"><span class="tah p11">129,000</span></td>
     <td class="num"><span class="tah p11">124,500</span></td>
     <td class="num"><span class="tah p11">704,393</span></td>
     </tr>
    <tr>
    <td colspan="7" height="8"></td>
    </tr>
    <tr>
    <td colspan="7" height="1" bgcolor="#e1e1e1"></td>
    </tr>
    <tr>
    <td colspan="7" height="8"></td>
    </tr>
     <tr onMouseOver="mouseOver(this)" onMouseOut="mouseOut(this)">
     <td align="center"><span class="tah p10 gray03">2019.02.07</span></td>
     <td class="num"><span class="tah p11">129,000</span></td>
     <td class="num">
    <img src="https://ssl.pstatic.net/imgstock/images/images4/ico_down.gif" width="7" height="6" style="margin-right:4px;" alt="하락"><span class="tah p11 nv01">
    4,500
    </span>
   </td>
     <td class="num"><span class="tah p11">132,000</span></td>
     <td class="num"><span class="tah p11">134,000</span></td>
     <td class="num"><span class="tah p11">128,500</span></td>
     <td class="num"><span class="tah p11">737,938</span></td>
     </tr>
     <tr onMouseOver="mouseOver(this)" onMouseOut="mouseOut(this)">
     <td align="center"><span class="tah p10 gray03">2019.02.01</span></td>
     <td class="num"><span class="tah p11">133,500</span></td>
     <td class="num">
    <img src="https://ssl.pstatic.net/imgstock/images/images4/ico_down.gif" width="7" height="6" style="margin-right:4px;" alt="하락"><span class="tah p11 nv01">
    2,500
    </span>
   </td>
     <td class="num"><span class="tah p11">138,000</span></td>
     <td class="num"><span class="tah p11">140,000</span></td>
     <td class="num"><span class="tah p11">133,000</span></td>
     <td class="num"><span class="tah p11">530,284</span></td>
     </tr>
     <tr onMouseOver="mouseOver(this)" onMouseOut="mouseOut(this)">
     <td align="center"><span class="tah p10 gray03">2019.01.31</span></td>
     <td class="num"><span class="tah p11">136,000</span></td>
     <td class="num">
    <span class="tah p11">0</span>
   </td>
     <td class="num"><span class="tah p11">138,000</span></td>
     <td class="num"><span class="tah p11">143,500</span></td>
     <td class="num"><span class="tah p11">136,000</span></td>
     <td class="num"><span class="tah p11">1,054,276</span></td>
     </tr>
     <tr onMouseOver="mouseOver(this)" onMouseOut="mouseOut(this)">
     <td align="center"><span class="tah p10 gray03">2019.01.30</span></td>
     <td class="num"><span class="tah p11">136,000</span></td>
     <td class="num">
    <span class="tah p11">0</span>
   </td>
     <td class="num"><span class="tah p11">139,000</span></td>
     <td class="num"><span class="tah p11">139,500</span></td>
     <td class="num"><span class="tah p11">133,000</span></td>
     <td class="num"><span class="tah p11">462,280</span></td>
     </tr>
     <tr onMouseOver="mouseOver(this)" onMouseOut="mouseOut(this)">
     <td align="center"><span class="tah p10 gray03">2019.01.29</span></td>
     <td class="num"><span class="tah p11">136,000</span></td>
     <td class="num">
    <img src="https://ssl.pstatic.net/imgstock/images/images4/ico_up.gif" width="7" height="6" style="margin-right:4px;" alt="상승"><span class="tah p11 red02">
    4,000
    </span>
   </td>
     <td class="num"><span class="tah p11">130,000</span></td>
     <td class="num"><span class="tah p11">136,500</span></td>
     <td class="num"><span class="tah p11">130,000</span></td>
     <td class="num"><span class="tah p11">411,369</span></td>
     </tr>
    <tr>
    <td colspan="7" height="8"></td>
    </tr>
    </table>
    <!--- 페이지 네비게이션 시작--->
    <table summary="페이지 네비게이션 리스트" class="Nnavi" align="center">
    <caption>페이지 네비게이션</caption>
    <tr>  
   
    <td class="pgLL">
    <a href="/item/sise_day.nhn?code=035420&amp;page=1"  >
    <img src="https://ssl.pstatic.net/static/n/cmn/bu_pgarLL.gif" width="7" height="5" alt="">맨앞
    </a>
    </td>
                
                <td>
    <a href="/item/sise_day.nhn?code=035420&amp;page=1"  >1</a>
    </td>
<td class="on">
    <a href="/item/sise_day.nhn?code=035420&amp;page=2"  >2</a>
    </td>
<td>
    <a href="/item/sise_day.nhn?code=035420&amp;page=3"  >3</a>
    </td>
<td>
    <a href="/item/sise_day.nhn?code=035420&amp;page=4"  >4</a>
    </td>
<td>
    <a href="/item/sise_day.nhn?code=035420&amp;page=5"  >5</a>
    </td>
<td>
    <a href="/item/sise_day.nhn?code=035420&amp;page=6"  >6</a>
    </td>
<td>
    <a href="/item/sise_day.nhn?code=035420&amp;page=7"  >7</a>
    </td>
<td>
    <a href="/item/sise_day.nhn?code=035420&amp;page=8"  >8</a>
    </td>
<td>
    <a href="/item/sise_day.nhn?code=035420&amp;page=9"  >9</a>
    </td>
<td>
    <a href="/item/sise_day.nhn?code=035420&amp;page=10"  >10</a>
    </td>

                <td class="pgR">
    <a href="/item/sise_day.nhn?code=035420&amp;page=11"  >
    다음<img src="https://ssl.pstatic.net/static/n/cmn/bu_pgarR.gif" width="3" height="5" alt="" border="0">
    </a>
    </td>
                <td class="pgRR">
    <a href="/item/sise_day.nhn?code=035420&amp;page=404"  >맨뒤
    <img src="https://ssl.pstatic.net/static/n/cmn/bu_pgarRR.gif" width="8" height="5" alt="" border="0">
    </a>
    </td>
    </tr>
    </table>
    <!--- 페이지 네비게이션 끝--->
</body>

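Of all this, the part we need to read is the HTML below. It is the block for a single day, and the same block repeats for each date.
To read it, we take the text inside each TD tag and check whether it is a date; if it is, we treat the cells that follow as data to read.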
<td align="center"><span class="tah p10 gray03">2019.02.14</span></td>
<td class="num"><span class="tah p11">127,500</span></td>
<td class="num">
<img src="https://ssl.pstatic.net/imgstock/images/images4/ico_up.gif" width="7" height="6" style="margin-right:4px;" alt="상승"><span class="tah p11 red02">
2,500
</span>
</td>
<td class="num"><span class="tah p11">125,000</span></td>
<td class="num"><span class="tah p11">130,500</span></td>
<td class="num"><span class="tah p11">125,000</span></td>
<td class="num"><span class="tah p11">700,427</span></td>
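To see what text each TD yields, here is a minimal sketch that pulls the cell texts out of a shortened copy of the fragment above. It deliberately swaps in the standard-library html.parser instead of BeautifulSoup so that it runs without extra packages; the class name TdTextParser is hypothetical, and the img src is abbreviated.

```python
from html.parser import HTMLParser

class TdTextParser(HTMLParser):
    """Collects the stripped text of every <td> element it sees."""

    def __init__(self):
        super().__init__()
        self.cells = []
        self._buf = None  # None while we are outside a <td>

    def handle_starttag(self, tag, attrs):
        if tag == "td":
            self._buf = []

    def handle_data(self, data):
        if self._buf is not None:
            self._buf.append(data)

    def handle_endtag(self, tag):
        if tag == "td" and self._buf is not None:
            self.cells.append("".join(self._buf).strip())
            self._buf = None

fragment = """
<td align="center"><span class="tah p10 gray03">2019.02.14</span></td>
<td class="num"><span class="tah p11">127,500</span></td>
<td class="num">
<img src="ico_up.gif" alt="상승"><span class="tah p11 red02">
2,500
</span>
</td>
<td class="num"><span class="tah p11">125,000</span></td>
"""

parser = TdTextParser()
parser.feed(fragment)
parser.close()
print(parser.cells)
# ['2019.02.14', '127,500', '2,500', '125,000']
```

The first cell is the date, and the cells after it arrive in the fixed column order of the table, which is what the implementation below relies on.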

Implementation

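This is implemented with BeautifulSoup.
Below is the full code. While PAGE_TEST is True it does not walk every page: endpage is set to 2, and range(1, endpage) stops before 2, so only page 1 is read. The part that saves to the db is not included.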

#getstockdata
import requests
from bs4 import BeautifulSoup
import datetime
# import database  # the author's own db module; it is not included in this post

PAGE_TEST = True

def naver_stock_crawling(db, code, name):
    # KOSPI
    NAME ='종목 %s %s' % (code, name)  # '종목' means 'stock'
    print("%s Start crawling"%(NAME))
    url = 'https://finance.naver.com/item/sise_day.nhn?code=%s&page=%d'
    key_prefix = 'K%s_' % (code)
    count = 0
    need_to_break = False
    date_list = []
    endpage = 100000
    if PAGE_TEST==True:
        endpage = 2  # range(1, 2) reads page 1 only
    for i in range(1,endpage):
        url_ = requests.get(url % (code,i))
        url_ = url_.content
        html = BeautifulSoup(url_,'html.parser')
        add_data = False

        tds = html.find_all('td')
        index = 100
        for td in tds :
            try : 
                text = td.text.strip()

                try :
                    datetime.datetime.strptime(text, '%Y.%m.%d')
                    get_date = True
                except :
                    get_date = False

                if get_date==True :
                    index = 0
                    date_ = td.text.strip().replace('.','-')
                    if not date_ in date_list:
                        date_list.append(date_)
                    else:
                        # this date was already seen, so we are past the last page
                        print("%s %d End page Done"%(NAME, count))
                        return

                if(index == 1):
                    # closing price
                    add_data = True
                    d = td.text.strip().replace(',','')
                    d = float(d)
                    if PAGE_TEST==True:
                        print("%s %f %s close"%(date_,d,key_prefix))
                    else:
                        if db.insert_with_external_column_name(datetime.datetime.strptime(date_, '%Y-%m-%d'),d,key_prefix+'close')==False:
                            need_to_break = True
                if(index == 3):
                    # opening price
                    add_data = True
                    d = td.text.strip().replace(',','')
                    d = float(d)
                    if PAGE_TEST==True:
                        print("%s %f %s open"%(date_,d,key_prefix))
                    else:
                        if db.insert_with_external_column_name(datetime.datetime.strptime(date_, '%Y-%m-%d'),d,key_prefix+'open')==False:
                            need_to_break = True
                if(index == 4):
                    # high
                    add_data = True
                    d = td.text.strip().replace(',','')
                    d = float(d)
                    if PAGE_TEST==True:
                        print("%s %f %s high"%(date_,d,key_prefix))
                    else:
                        if db.insert_with_external_column_name(datetime.datetime.strptime(date_, '%Y-%m-%d'),d,key_prefix+'high')==False:
                            need_to_break = True
                if(index == 5):
                    # low
                    add_data = True
                    d = td.text.strip().replace(',','')
                    d = float(d)
                    if PAGE_TEST==True:
                        print("%s %f %s low"%(date_,d,key_prefix))
                    else:
                        if db.insert_with_external_column_name(datetime.datetime.strptime(date_, '%Y-%m-%d'),d,key_prefix+'low')==False:
                            need_to_break = True
                if(index == 6):
                    # volume
                    add_data = True
                    d = td.text.strip().replace(',','')
                    d = float(d)
                    if PAGE_TEST==True:
                        print("%s %f %s vol"%(date_,d,key_prefix))
                        count = count + 1
                    else:
                        if db.insert_with_external_column_name(datetime.datetime.strptime(date_, '%Y-%m-%d'),d,key_prefix+'vol')==False:
                            need_to_break = True
                        else:
                            count = count + 1
            except:
                pass

            index += 1

        if add_data == False:
            # no data was found on this page, so it is the last page
            print("%s %d End page Done"%(NAME, count))
            return
        if need_to_break == True :
            print("%s %d Done"%(NAME, count))
            return

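if PAGE_TEST==True :
    db = None
# When PAGE_TEST is False, db has to be created here before the call below; that part is omitted from this post.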

naver_stock_crawling(db,'015760','한국전력')

if PAGE_TEST!=True :
    db.close()

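The part below finds every td in the html, collects them into the tds list, and then opens the td tags one at a time, in order, to check their values. Because index starts at 100, none of the index == 1 to 6 branches fire until a date cell resets it to 0.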
        tds = html.find_all('td')
        index = 100
        for td in tds :

We check whether the text of the td is a date.
                text = td.text.strip()

                try :
                    datetime.datetime.strptime(text, '%Y.%m.%d')
                    get_date = True
                except :
                    get_date = False
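The same check can be written as a small standalone function, shown here as a sketch. The name is_date_cell is an assumption for this example; it also catches only ValueError, which is what strptime raises when the text does not match the format, rather than using a bare except.

```python
import datetime

def is_date_cell(text: str) -> bool:
    """Return True if text looks like a date in the page's YYYY.MM.DD format."""
    try:
        datetime.datetime.strptime(text, "%Y.%m.%d")
        return True
    except ValueError:
        # Prices such as '127,500' and empty spacer cells end up here.
        return False

print(is_date_cell("2019.02.14"))  # True
print(is_date_cell("127,500"))     # False
print(is_date_cell(""))            # False
```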

If it is a date, index is set to 0.
                if get_date==True :
                    index = 0
                    date_ = td.text.strip().replace('.','-')

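Then index is incremented by one each time a td is read. At index=1 the closing price arrives, and at index=3 the opening price. index=2 is the change from the previous day's price, which is not needed, so it is not used. Once the volume at index=6 has been read there is nothing more to read for that day, so count is incremented by one, and the loop simply waits for the next date to come in.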
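The index counter described above can be sketched on its own, working on a plain list of cell texts instead of live td tags. The names FIELD_BY_INDEX and rows_from_cells are made up for this example, and unlike the full listing it collects one dictionary per day instead of printing or inserting each value.

```python
import datetime

# Position of each wanted field relative to the date cell (index 2 is skipped).
FIELD_BY_INDEX = {1: "close", 3: "open", 4: "high", 5: "low", 6: "vol"}

def rows_from_cells(cells):
    """Group a flat list of td texts into one record per date."""
    rows = []
    index = 100      # larger than 6, so nothing is read before the first date
    current = None
    for text in cells:
        try:
            day = datetime.datetime.strptime(text, "%Y.%m.%d").date()
            index = 0                      # a date cell restarts the counter
            current = {"date": day.isoformat()}
        except ValueError:
            pass                           # not a date: keep counting
        if current is not None and index in FIELD_BY_INDEX:
            current[FIELD_BY_INDEX[index]] = float(text.replace(",", ""))
            if index == 6:                 # volume is the last field of a day
                rows.append(current)
        index += 1
    return rows

cells = ["", "2019.02.14", "127,500", "2,500", "125,000",
         "130,500", "125,000", "700,427", ""]
print(rows_from_cells(cells))
```

The empty strings at both ends stand for the spacer cells in the table; they are ignored because index is outside the 1 to 6 range when they arrive.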
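The stop condition is that data we have already read keeps coming in. The test code above does not write to the db, but once the page number goes past the last page, the dates stop changing and the same dates keep coming back. So if a value is already stored in the db, we conclude that we have reached the end. In the listing above, a date that is already in date_list ends the crawl in the same way.
The code for this takes the form below: if inserting a value into the db fails (it fails when a value already exists for that date), the need_to_break variable is set and the function returns.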
if db.insert_with_external_column_name(datetime.datetime.strptime(date_, '%Y-%m-%d'),d,key_prefix+'high')==False:
    need_to_break = True
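Since the author's db object is not shown, the following is only a hypothetical in-memory stand-in that mimics the behavior described: insert_with_external_column_name returns False when a value already exists for that date and column. The class name InMemoryDB and its internals are assumptions; only the method name and argument order are taken from the call above.

```python
import datetime

class InMemoryDB:
    """Hypothetical stand-in for the author's database object (not shown in the post)."""

    def __init__(self):
        self.values = {}

    def insert_with_external_column_name(self, day, value, column):
        key = (day, column)
        if key in self.values:
            return False          # a value already exists for this date and column
        self.values[key] = value
        return True

    def close(self):
        pass                      # nothing to release for an in-memory store

db = InMemoryDB()
day = datetime.datetime.strptime("2019-02-14", "%Y-%m-%d")

first = db.insert_with_external_column_name(day, 130500.0, "K035420_high")
second = db.insert_with_external_column_name(day, 130500.0, "K035420_high")

need_to_break = (second == False)
print(first, second, need_to_break)  # True False True
```

Passing an object like this as db with PAGE_TEST set to False would let the crawl stop on the first repeated date, under the assumption that the real database module behaves the same way.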

Result

With PAGE_TEST set, the test code above fetches only the first page, which holds 10 trading days. The result is as follows.
종목 015760 한국전력 Start crawling
2019-02-28 34850.000000 K015760_ close
2019-02-28 34800.000000 K015760_ open
2019-02-28 35600.000000 K015760_ high
2019-02-28 34500.000000 K015760_ low
2019-02-28 1872254.000000 K015760_ vol
2019-02-27 34950.000000 K015760_ close
2019-02-27 34750.000000 K015760_ open
2019-02-27 35150.000000 K015760_ high
2019-02-27 34550.000000 K015760_ low
2019-02-27 1037064.000000 K015760_ vol
2019-02-26 34800.000000 K015760_ close
2019-02-26 34350.000000 K015760_ open
2019-02-26 35000.000000 K015760_ high
2019-02-26 34250.000000 K015760_ low
2019-02-26 1445412.000000 K015760_ vol
2019-02-25 34000.000000 K015760_ close
2019-02-25 34300.000000 K015760_ open
2019-02-25 34350.000000 K015760_ high
2019-02-25 33800.000000 K015760_ low
2019-02-25 933156.000000 K015760_ vol
2019-02-22 34350.000000 K015760_ close
2019-02-22 33250.000000 K015760_ open
2019-02-22 34600.000000 K015760_ high
2019-02-22 33100.000000 K015760_ low
2019-02-22 2306216.000000 K015760_ vol
2019-02-21 33300.000000 K015760_ close
2019-02-21 33200.000000 K015760_ open
2019-02-21 33450.000000 K015760_ high
2019-02-21 33000.000000 K015760_ low
2019-02-21 873960.000000 K015760_ vol
2019-02-20 33400.000000 K015760_ close
2019-02-20 33100.000000 K015760_ open
2019-02-20 33650.000000 K015760_ high
2019-02-20 33100.000000 K015760_ low
2019-02-20 1049441.000000 K015760_ vol
2019-02-19 33000.000000 K015760_ close
2019-02-19 33050.000000 K015760_ open
2019-02-19 33400.000000 K015760_ high
2019-02-19 32900.000000 K015760_ low
2019-02-19 616276.000000 K015760_ vol
2019-02-18 33050.000000 K015760_ close
2019-02-18 33500.000000 K015760_ open
2019-02-18 33700.000000 K015760_ high
2019-02-18 32850.000000 K015760_ low
2019-02-18 1146918.000000 K015760_ vol
2019-02-15 33500.000000 K015760_ close
2019-02-15 33850.000000 K015760_ open
2019-02-15 34200.000000 K015760_ high
2019-02-15 33250.000000 K015760_ low
2019-02-15 940571.000000 K015760_ vol


