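A lot seems to have changed in the meantime. Most writing now leans heavily on LLMs such as GPT, and blogs written by hand seem to have largely disappeared. That is also part of the reason I have not posted much lately.

From now on, I will mark any part of a post where an LLM was used.

This post covers a way to classify given texts with unsupervised learning. There are many possible approaches; I tried several with the help of GPT, ran a sample of the one I recommend, confirmed that it works, and brought it here. Asking GPT directly tends to produce long-winded answers, takes several rounds of questions before things come together, and the resulting code has not actually been tested, so think of this post as a cleaned-up summary of that work.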

The pipeline runs in the following order.

Sentence splitting → Sentence-BERT sentence embeddings → paragraph embedding (mean of sentence embeddings) → UMAP → HDBSCAN → cluster ID
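
The "paragraph embedding (mean of sentence embeddings)" step can be illustrated with toy vectors. This is only a minimal sketch, assuming each sentence has already been encoded as a unit-length vector; the helper name `mean_pool` and the 3-dimensional values are made up for illustration and are not part of the code in this post.

```python
import numpy as np

def mean_pool(sentence_vectors: np.ndarray) -> np.ndarray:
    """Average sentence vectors (one per row) into a single paragraph vector."""
    return sentence_vectors.mean(axis=0)

# Hypothetical stand-ins for two normalized sentence embeddings.
sentences = np.array([
    [1.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
])

paragraph = mean_pool(sentences)
print(paragraph)
print(np.linalg.norm(paragraph))  # shorter than 1: the mean is no longer unit length
```

The averaged vector is generally shorter than unit length. That should be harmless when the next step compares vectors by cosine distance, because cosine distance ignores vector length.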

The input format is as follows.

Example JSONL data


{"input":"오늘 날씨가 정말 좋다. 하늘이 맑아서 기분이 상쾌하다. 오랜만에 산책을 나갔다."}
{"input":"비가 하루 종일 내렸다. 우산을 챙기지 않아서 옷이 다 젖었다. 날씨가 우울하다."}
{"input":"기온이 갑자기 떨어졌다. 바람도 많이 불어서 체감 온도가 낮다."}

{"input":"점심으로 김치찌개를 먹었다. 국물이 진하고 고기가 부드러웠다. 다음에는 또 오고 싶다."}
{"input":"저녁에 친구들과 치킨을 시켜 먹었다. 양이 많고 바삭해서 만족스러웠다."}
{"input":"이탈리안 레스토랑에 갔다. 파스타와 피자가 정말 맛있었다. 분위기도 좋았다."}

{"input":"파이썬으로 머신러닝을 공부하고 있다. 데이터 전처리가 가장 어렵다. 그래도 재미있다."}
{"input":"딥러닝 모델의 성능을 개선했다. 하이퍼파라미터 튜닝이 효과적이었다."}
{"input":"자연어 처리를 위해 Sentence-BERT를 사용하고 있다. 임베딩 성능이 매우 좋다."}

{"input":"주식 시장이 크게 하락했다. 투자 심리가 위축되고 있다. 당분간 변동성이 클 것 같다."}
{"input":"비트코인 가격이 다시 상승했다. 가상자산 시장에 관심이 몰리고 있다."}
{"input":"환율 변동이 심해지고 있다. 수입 기업들의 부담이 커지고 있다."}

{"input":"The weather is really nice today. The sky is clear and blue. I decided to take a walk."}
{"input":"It rained all day. I forgot my umbrella and got completely soaked."}
{"input":"The temperature dropped suddenly. It feels much colder than yesterday."}

{"input":"I had pasta for lunch. The sauce was rich and delicious. I would love to visit again."}
{"input":"We ordered fried chicken for dinner. It was crispy and well seasoned."}
{"input":"The restaurant had a great atmosphere. The food was amazing and fresh."}

{"input":"I am studying machine learning with Python. Data preprocessing is harder than expected."}
{"input":"The deep learning model achieved better accuracy. Hyperparameter tuning really helped."}
{"input":"Sentence embeddings are useful for text clustering. SBERT works very well."}

{"input":"The stock market dropped significantly today. Investors are getting nervous."}
{"input":"Bitcoin prices are rising again. The crypto market is becoming volatile."}
{"input":"Exchange rates are fluctuating rapidly. Global markets are reacting."}

{"input":"오늘은 회의가 많았다. 프로젝트 일정에 대해 논의했다. 생각보다 시간이 오래 걸렸다."}
{"input":"팀원들과 협업이 잘 되고 있다. 커뮤니케이션이 프로젝트 성공의 핵심이다."}

{"input":"I had a meeting all morning. We discussed the project timeline in detail."}
{"input":"Team collaboration is going well. Clear communication makes everything easier."}

{"input":"파이썬으로 데이터 분석을 진행했다. 판다스와 넘파이를 활용했다. 결과가 만족스럽다."}
{"input":"I analyzed the dataset using pandas and numpy. The results were quite interesting."}

The code is below.

    


# python -m pip install kss spacy langdetect sentence-transformers umap-learn hdbscan
# python -m spacy download en_core_web_sm

import json
import numpy as np
from langdetect import detect, DetectorFactory
from kss import split_sentences as split_ko
import spacy
from sentence_transformers import SentenceTransformer
import umap
import hdbscan

# min_cluster_size guide:
#   2  -> explosion of tiny clusters
#   5  -> moderate (recommended starting point)
#   10 -> only the big topics remain
min_cluster_size = 5
min_samples = min_cluster_size // 2

# -----------------------------------
# 0. Initial setup
# -----------------------------------

DetectorFactory.seed = 42  # make langdetect reproducible
nlp_en = spacy.load("en_core_web_sm")

model = SentenceTransformer(
    "paraphrase-multilingual-mpnet-base-v2"
)

# -----------------------------------
# 1. Per-language sentence splitting functions
# -----------------------------------

def split_english(text: str):
    doc = nlp_en(text)
    return [sent.text.strip() for sent in doc.sents if sent.text.strip()]

def split_korean(text: str):
    try:
        # use mecab if it is available
        return split_ko(text, backend="mecab", strip=True)
    except ImportError:
        # otherwise fall back to the default backend
        return split_ko(text, strip=True)

def split_sentences_auto(text: str):
    """
    Detect the language and dispatch to the matching splitter.
    """
    try:
        lang = detect(text)
    except Exception:
        return [text]

    if lang == "ko":
        sents = split_korean(text)
    elif lang == "en":
        sents = split_english(text)
    else:
        # other languages or very short text
        sents = [text]

    return sents if sents else [text]

# -----------------------------------
# 2. Load JSONL + split sentences
# -----------------------------------

documents = []    # list of sentences per document
raw_texts = []    # original texts

with open("input.jsonl", "r", encoding="utf-8") as f:
    for line in f:
        line = line.strip()

        # ✅ skip blank lines
        if not line:
            continue
            
        obj = json.loads(line)
        text = obj["input"].strip()

        raw_texts.append(text)
        sentences = split_sentences_auto(text)
        documents.append(sentences)

# -----------------------------------
# 3. Sentence-BERT sentence embeddings
# -----------------------------------

doc_embeddings = []

for sentences in documents:
    sent_embeddings = model.encode(
        sentences,
        convert_to_numpy=True,
        normalize_embeddings=True
    )

    # ✅ paragraph embedding = mean of the sentence embeddings
    doc_embedding = sent_embeddings.mean(axis=0)
    doc_embeddings.append(doc_embedding)

doc_embeddings = np.vstack(doc_embeddings)

# -----------------------------------
# 4. UMAP dimensionality reduction
# -----------------------------------

umap_embeddings = umap.UMAP(
    n_neighbors=15,
    n_components=5,
    metric="cosine",
    random_state=42
).fit_transform(doc_embeddings)

# -----------------------------------
# 5. HDBSCAN clustering
# -----------------------------------

clusterer = hdbscan.HDBSCAN(
    min_cluster_size=min_cluster_size,
    min_samples=min_samples,
    metric="euclidean"
)

labels = clusterer.fit_predict(umap_embeddings)

# -----------------------------------
# 6. Print results
# -----------------------------------

for i, label in enumerate(labels):
    print(f"[클러스터 {label}] {raw_texts[i]}")


from collections import defaultdict

# group texts by cluster
clustered = defaultdict(list)

for text, label in zip(raw_texts, labels):
    clustered[label].append(text)

# print grouped results
for label in sorted(clustered.keys()):
    if label == -1:
        print("\n[Noise / Outlier]")
    else:
        print(f"\n[클러스터 {label}] ({len(clustered[label])}개)")

    for t in clustered[label]:
        print(f"- {t}")
        
# -----------------------------------
# 7. Save results as JSONL
# -----------------------------------

with open("result.jsonl", "w", encoding="utf-8") as f:
    for i, label in enumerate(labels):
        out = {
            "input": raw_texts[i],
            "cluster_id": int(label)
        }
        f.write(json.dumps(out, ensure_ascii=False) + "\n")

Execution results

    
[클러스터 0] 오늘 날씨가 정말 좋다. 하늘이 맑아서 기분이 상쾌하다. 오랜만에 산책을 나갔다.
[클러스터 1] 비가 하루 종일 내렸다. 우산을 챙기지 않아서 옷이 다 젖었다. 날씨가 우울하다.
[클러스터 1] 기온이 갑자기 떨어졌다. 바람도 많이 불어서 체감 온도가 낮다.
[클러스터 0] 점심으로 김치찌개를 먹었다. 국물이 진하고 고기가 부드러웠다. 다음에는 또 오고 싶다.
[클러스터 0] 저녁에 친구들과 치킨을 시켜 먹었다. 양이 많고 바삭해서 만족스러웠다.
[클러스터 0] 이탈리안 레스토랑에 갔다. 파스타와 피자가 정말 맛있었다. 분위기도 좋았다.
[클러스터 1] 파이썬으로 머신러닝을 공부하고 있다. 데이터 전처리가 가장 어렵다. 그래도 재미있다.
[클러스터 1] 딥러닝 모델의 성능을 개선했다. 하이퍼파라미터 튜닝이 효과적이었다.
[클러스터 1] 자연어 처리를 위해 Sentence-BERT를 사용하고 있다. 임베딩 성능이 매우 좋다.
[클러스터 1] 주식 시장이 크게 하락했다. 투자 심리가 위축되고 있다. 당분간 변동성이 클 것 같다.
[클러스터 1] 비트코인 가격이 다시 상승했다. 가상자산 시장에 관심이 몰리고 있다.
[클러스터 1] 환율 변동이 심해지고 있다. 수입 기업들의 부담이 커지고 있다.
[클러스터 0] The weather is really nice today. The sky is clear and blue. I decided to take a walk.
[클러스터 1] It rained all day. I forgot my umbrella and got completely soaked.
[클러스터 1] The temperature dropped suddenly. It feels much colder than yesterday.
[클러스터 0] I had pasta for lunch. The sauce was rich and delicious. I would love to visit again.
[클러스터 0] We ordered fried chicken for dinner. It was crispy and well seasoned.
[클러스터 0] The restaurant had a great atmosphere. The food was amazing and fresh.
[클러스터 1] I am studying machine learning with Python. Data preprocessing is harder than expected.
[클러스터 1] The deep learning model achieved better accuracy. Hyperparameter tuning really helped.
[클러스터 1] Sentence embeddings are useful for text clustering. SBERT works very well.
[클러스터 1] The stock market dropped significantly today. Investors are getting nervous.
[클러스터 1] Bitcoin prices are rising again. The crypto market is becoming volatile.
[클러스터 1] Exchange rates are fluctuating rapidly. Global markets are reacting.
[클러스터 1] 오늘은 회의가 많았다. 프로젝트 일정에 대해 논의했다. 생각보다 시간이 오래 걸렸다.
[클러스터 1] 팀원들과 협업이 잘 되고 있다. 커뮤니케이션이 프로젝트 성공의 핵심이다.
[클러스터 -1] I had a meeting all morning. We discussed the project timeline in detail.
[클러스터 1] Team collaboration is going well. Clear communication makes everything easier.
[클러스터 1] 파이썬으로 데이터 분석을 진행했다. 판다스와 넘파이를 활용했다. 결과가 만족스럽다.
[클러스터 1] I analyzed the dataset using pandas and numpy. The results were quite interesting.

[Noise / Outlier]
- I had a meeting all morning. We discussed the project timeline in detail.

[클러스터 0] (8개)
- 오늘 날씨가 정말 좋다. 하늘이 맑아서 기분이 상쾌하다. 오랜만에 산책을 나갔다.
- 점심으로 김치찌개를 먹었다. 국물이 진하고 고기가 부드러웠다. 다음에는 또 오고 싶다.
- 저녁에 친구들과 치킨을 시켜 먹었다. 양이 많고 바삭해서 만족스러웠다.
- 이탈리안 레스토랑에 갔다. 파스타와 피자가 정말 맛있었다. 분위기도 좋았다.
- The weather is really nice today. The sky is clear and blue. I decided to take a walk.
- I had pasta for lunch. The sauce was rich and delicious. I would love to visit again.
- We ordered fried chicken for dinner. It was crispy and well seasoned.
- The restaurant had a great atmosphere. The food was amazing and fresh.

[클러스터 1] (21개)
- 비가 하루 종일 내렸다. 우산을 챙기지 않아서 옷이 다 젖었다. 날씨가 우울하다.
- 기온이 갑자기 떨어졌다. 바람도 많이 불어서 체감 온도가 낮다.
- 파이썬으로 머신러닝을 공부하고 있다. 데이터 전처리가 가장 어렵다. 그래도 재미있다.
- 딥러닝 모델의 성능을 개선했다. 하이퍼파라미터 튜닝이 효과적이었다.
- 자연어 처리를 위해 Sentence-BERT를 사용하고 있다. 임베딩 성능이 매우 좋다.
- 주식 시장이 크게 하락했다. 투자 심리가 위축되고 있다. 당분간 변동성이 클 것 같다.
- 비트코인 가격이 다시 상승했다. 가상자산 시장에 관심이 몰리고 있다.
- 환율 변동이 심해지고 있다. 수입 기업들의 부담이 커지고 있다.
- It rained all day. I forgot my umbrella and got completely soaked.
- The temperature dropped suddenly. It feels much colder than yesterday.
- I am studying machine learning with Python. Data preprocessing is harder than expected.
- The deep learning model achieved better accuracy. Hyperparameter tuning really helped.
- Sentence embeddings are useful for text clustering. SBERT works very well.
- The stock market dropped significantly today. Investors are getting nervous.
- Bitcoin prices are rising again. The crypto market is becoming volatile.
- Exchange rates are fluctuating rapidly. Global markets are reacting.
- 오늘은 회의가 많았다. 프로젝트 일정에 대해 논의했다. 생각보다 시간이 오래 걸렸다.
- 팀원들과 협업이 잘 되고 있다. 커뮤니케이션이 프로젝트 성공의 핵심이다.
- Team collaboration is going well. Clear communication makes everything easier.
- 파이썬으로 데이터 분석을 진행했다. 판다스와 넘파이를 활용했다. 결과가 만족스럽다.
- I analyzed the dataset using pandas and numpy. The results were quite interesting.
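
With only two clusters and a single noise point in this run, it may help to look at cluster sizes and the share of noise before adjusting `min_cluster_size`. The sketch below is one hedged way to do that using only the standard library; the helper name `summarize_labels` is hypothetical, and the example labels are built by hand to match the counts printed above rather than taken from a live run.

```python
from collections import Counter

def summarize_labels(labels):
    """Return (cluster sizes excluding noise, noise ratio) for HDBSCAN-style labels."""
    counts = Counter(int(label) for label in labels)
    noise = counts.pop(-1, 0)  # HDBSCAN marks noise points with -1
    total = noise + sum(counts.values())
    noise_ratio = noise / total if total else 0.0
    return dict(counts), noise_ratio

# Hand-built labels matching the run above: 8 in cluster 0, 21 in cluster 1, 1 noise.
example_labels = [0] * 8 + [1] * 21 + [-1]
sizes, noise_ratio = summarize_labels(example_labels)
print(sizes)
print(round(noise_ratio, 3))
```

In the actual script, `labels` from `clusterer.fit_predict(...)` could be passed in place of `example_labels`.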

Random Oversampling

- Duplicates samples of the minority class, based on Y
- May lead to overfitting
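
The two points above can be sketched by hand. This is a minimal illustration of the idea, not how `imblearn` implements it; the function name `random_oversample`, the fixed seed, and the tiny toy data are assumptions made for the example.

```python
import random
from collections import Counter

def random_oversample(rows, labels, seed=0):
    """Copy randomly chosen minority-class rows until every class matches the largest one."""
    rng = random.Random(seed)
    counts = Counter(labels)
    target = max(counts.values())
    out_rows, out_labels = list(rows), list(labels)
    for cls, n in counts.items():
        pool = [row for row, label in zip(rows, labels) if label == cls]
        extra = rng.choices(pool, k=target - n)  # sample with replacement
        out_rows.extend(extra)
        out_labels.extend([cls] * len(extra))
    return out_rows, out_labels

# Toy data: three "male" rows and two "female" rows (hypothetical values).
rows = [[22.0], [38.0], [26.0], [35.0], [54.0]]
labels = ["male", "female", "male", "male", "female"]

new_rows, new_labels = random_oversample(rows, labels)
print(Counter(new_labels))
```

Every added row is an exact copy of an existing one, which is why a model trained on the result can end up memorizing the repeated minority samples.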

import seaborn as sns
#sns.get_dataset_names()
titanic=sns.load_dataset('titanic')

titanic.head()

	survived	pclass	sex	age	sibsp	parch	fare	embarked	class	who	adult_male	deck	embark_town	alive	alone
0	0	3	male	22.0	1	0	7.2500	S	Third	man	True	NaN	Southampton	no	False
1	1	1	female	38.0	1	0	71.2833	C	First	woman	False	C	Cherbourg	yes	False
2	1	3	female	26.0	0	0	7.9250	S	Third	woman	False	NaN	Southampton	yes	True
3	1	1	female	35.0	1	0	53.1000	S	First	woman	False	C	Southampton	yes	False
4	0	3	male	35.0	0	0	8.0500	S	Third	man	True	NaN	Southampton	no	True

titanic.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 15 columns):
 #   Column       Non-Null Count  Dtype   
---  ------       --------------  -----   
 0   survived     891 non-null    int64   
 1   pclass       891 non-null    int64   
 2   sex          891 non-null    object  
 3   age          714 non-null    float64 
 4   sibsp        891 non-null    int64   
 5   parch        891 non-null    int64   
 6   fare         891 non-null    float64 
 7   embarked     889 non-null    object  
 8   class        891 non-null    category
 9   who          891 non-null    object  
 10  adult_male   891 non-null    bool    
 11  deck         203 non-null    category
 12  embark_town  889 non-null    object  
 13  alive        891 non-null    object  
 14  alone        891 non-null    bool    
dtypes: bool(2), category(2), float64(2), int64(4), object(5)
memory usage: 80.7+ KB

titanic.sex.value_counts()

male      577
female    314
Name: sex, dtype: int64

titanic.columns

Index(['survived', 'pclass', 'sex', 'age', 'sibsp', 'parch', 'fare',
       'embarked', 'class', 'who', 'adult_male', 'deck', 'embark_town',
       'alive', 'alone'],
      dtype='object')

# X may include categorical features
# Y may be categorical as well
XX=['survived', 'pclass', 'age', 'sibsp', 'parch', 'fare', 'embarked', 'class', 'who', 'deck']
YY='sex' 

ALL=XX.copy()
ALL.append(YY)

titanic_new=titanic[ALL]

titanic_new[XX]

	survived	pclass	age	sibsp	parch	fare	embarked	class	who	deck
0	0	3	22.0	1	0	7.2500	S	Third	man	NaN
1	1	1	38.0	1	0	71.2833	C	First	woman	C
2	1	3	26.0	0	0	7.9250	S	Third	woman	NaN
3	1	1	35.0	1	0	53.1000	S	First	woman	C
4	0	3	35.0	0	0	8.0500	S	Third	man	NaN
...	...	...	...	...	...	...	...	...	...	...
886	0	2	27.0	0	0	13.0000	S	Second	man	NaN
887	1	1	19.0	0	0	30.0000	S	First	woman	B
888	0	3	NaN	1	2	23.4500	S	Third	woman	NaN
889	1	1	26.0	0	0	30.0000	C	First	man	C
890	0	3	32.0	0	0	7.7500	Q	Third	man	NaN

891 rows × 10 columns

titanic_new[YY].value_counts()

male      577
female    314
Name: sex, dtype: int64

len(titanic_new) 

from imblearn.over_sampling import RandomOverSampler
x,y=RandomOverSampler().fit_resample(titanic_new[XX],titanic_new[[YY]])

y.value_counts() # the classes in y are now balanced

sex   
female    577
male      577
dtype: int64

x

	survived	pclass	age	sibsp	parch	fare	embarked	class	who	deck
0	0	3	22.00	1	0	7.2500	S	Third	man	NaN
1	1	1	38.00	1	0	71.2833	C	First	woman	C
2	1	3	26.00	0	0	7.9250	S	Third	woman	NaN
3	1	1	35.00	1	0	53.1000	S	First	woman	C
4	0	3	35.00	0	0	8.0500	S	Third	man	NaN
...	...	...	...	...	...	...	...	...	...	...
1149	1	2	29.00	1	0	26.0000	S	Second	woman	NaN
1150	1	3	0.75	2	1	19.2583	C	Third	child	NaN
1151	1	2	21.00	0	0	10.5000	S	Second	woman	NaN
1152	1	2	5.00	1	2	27.7500	S	Second	child	NaN
1153	1	3	2.00	0	1	12.2875	S	Third	child	NaN

1154 rows × 10 columns

SW정리

Sunday, January 25, 2026

text encoding classification

Sunday, May 7, 2023

Random oversampling example