CNN을 적용한 한국어 상품평 감성분석 : 형태소 임베딩을 중심으로 :형태소 임베딩을 중심으로

박현정; 송민채; 신경식

추천

검색

질문

자료유형: 학술저널

저자정보: 박현정 (이화여자대학교) 송민채 (이화여자대학교) 신경식 (이화여자대학교)

저널정보: 한국지능정보시스템학회 지능정보연구 지능정보연구 제24권 제2호

발행연도: 2018.6

수록면: 59 - 83 (25page)

이용수

📌

연구주제

📖

연구배경

🔬

연구방법

🏆

연구결과

초록· 키워드

오류제보하기

고객과 대중의 니즈를 파악하기 위한 감성분석의 중요성이 커지면서 최근 영어 텍스트를 대상으로 다양한 딥러닝 모델들이 소개되고 있다. 본 연구는 영어와 한국어의 언어적인 차이에 주목하여 딥러닝 모델을 한국어 상품평 텍스트의 감성분석에 적용할 때 부딪히게 되는 기본적인 이슈들에 대하여 실증적으로 살펴본다. 즉, 딥러닝 모델의 입력으로 사용되는 단어 벡터(word vector)를 형태소 수준에서 도출하고, 여러 형태소 벡터(morpheme vector) 도출 대안에 따라 감성분석의 정확도가 어떻게 달라지는지를 비정태적(non-static) CNN(Convolutional Neural Network) 모델을 사용하여 검증한다. 형태소 벡터 도출 대안은 CBOW(Continuous Bag-Of-Words)를 기본적으로 적용하고, 입력 데이터의 종류, 문장 분리와 맞춤법 및 띄어쓰기 교정, 품사 선택, 품사 태그 부착, 고려형태소의 최소 빈도수 등과 같은 기준에 따라 달라진다.
형태소 벡터 도출 시, 문법 준수도가 낮더라도 감성분석 대상과 같은 도메인의 텍스트를 사용하고, 문장 분리 외에 맞춤법 및 띄어쓰기 전처리를 하며, 분석불능 범주를 포함한 모든 품사를 고려할 때 감성분석의 분류정확도가 향상되는 결과를 얻었다. 동음이의어 비율이 높은 한국어 특성 때문에 고려한 품사 태그 부착 방안과 포함할 형태소에 대한 최소 빈도수 기준은 뚜렷한 영향이 없는 것으로 나타났다.

With the increasing importance of sentiment analysis to grasp the needs of customers and the public, various types of deep learning models have been actively applied to English texts. In the sentiment analysis of English texts by deep learning, natural language sentences included in training and test datasets are usually converted into sequences of word vectors before being entered into the deep learning models. In this case, word vectors generally refer to vector representations of words obtained through splitting a sentence by space characters. There are several ways to derive word vectors, one of which is Word2Vec used for producing the 300 dimensional Google word vectors from about 100 billion words of Google News data. They have been widely used in the studies of sentiment analysis of reviews from various fields such as restaurants, movies, laptops, cameras, etc.
Unlike English, morpheme plays an essential role in sentiment analysis and sentence structure analysis in Korean, which is a typical agglutinative language with developed postpositions and endings. A morpheme can be defined as the smallest meaningful unit of a language, and a word consists of one or more morphemes. For example, for a word "예쁘고", the morphemes are "예쁘(= adjective)" and "고(=connective ending)". Reflecting the significance of Korean morphemes, it seems reasonable to adopt the morphemes as a basic unit in Korean sentiment analysis. Therefore, in this study, we use "morpheme vector" as an input to a deep learning model rather than "word vector" which is mainly used in English text. The morpheme vector refers to a vector representation for the morpheme and can be derived by applying an existent word vector derivation mechanism to the sentences divided into constituent morphemes.
By the way, here come some questions as follows. What is the desirable range of POS(Part-Of-Speech) tags when deriving morpheme vectors for improving the classification accuracy of a deep learning model? Is it proper to apply a typical word vector model which primarily relies on the form of words to Korean with a high homonym ratio? Will the text preprocessing such as correcting spelling or spacing errors affect the classification accuracy, especially when drawing morpheme vectors from Korean product reviews with a lot of grammatical mistakes and variations?
We seek to find empirical answers to these fundamental issues, which may be encountered first when applying various deep learning models to Korean texts. As a starting point, we summarized these issues as three central research questions as follows. First, which is better effective, to use morpheme vectors from grammatically correct texts of other domain than the analysis target, or to use morpheme vectors from considerably ungrammatical texts of the same domain, as the initial input of a deep learning model? Second, what is an appropriate morpheme vector derivation method for Korean regarding the range of POS tags, homonym, text preprocessing, minimum frequency? Third, can we get a satisfactory level of classification accuracy when applying deep learning to Korean sentiment analysis?
As an approach to these research questions, we generate various types of morpheme vectors reflecting the research questions and then compare the classification accuracy through a non-static CNN(Convolutional Neural Network) model taking in the morpheme vectors. As for training and test datasets, Naver Shopping"s 17,260 cosmetics product reviews are used. To derive morpheme vectors, we use data from the same domain as the target one and data from other domain; Naver shopping"s about 2 million cosmetics product reviews and 520,000 Naver News data arguably corresponding to Google’s News data.
The six primary sets of morpheme vectors constructed in this study differ in terms of the following three criteria. First, they come from two types of data source; Naver news of high grammatical correctness and Naver shopping’s cosmetics product reviews of low grammatical correctness. Second, they are distinguished in the degree of data preprocessing, namely, only splitting sentences or up to additional spelling and spacing corrections after sentence separation. Third, they vary concerning the form of input fed into a word vector model; whether the morphemes themselves are entered into a word vector model or with their POS tags attached. The morpheme vectors further vary depending on the consideration range of POS tags, the minimum frequency of morphemes included, and the random initialization range. All morpheme vectors are derived through CBOW(Continuous Bag-Of-Words) model with the context window 5 and the vector dimension 300.
It seems that utilizing the same domain text even with a lower degree of grammatical correctness, performing spelling and spacing corrections as well as sentence splitting, and incorporating morphemes of any POS tags including incomprehensible category lead to the better classification accuracy. The POS tag attachment, which is devised for the high proportion of homonyms in Korean, and the minimum frequency standard for the morpheme to be included seem not to have any definite influence on the classification accuracy.

#감성분석 #형태소 벡터 #단어 벡터 #딥러닝 #CNN #CBOW #Sentiment Analysis #Morpheme Vector #Word Vector #Deep Learning #CNN #CBOW

참고문헌 (41)

참고문헌 신청

An, J.-y., J.-w. Bae, N.-g. Han, and M. Song, “A Study of ‘Emotion Trigger’ by Text Mining Techniques,” Journal of Intelligence and Information Systems, Vol.21, No.2(2015), 69~92. Chen, P., Z. Sun, L. Bing, and W. Yang, “Recurrent Attention Network on Memory for Aspect Sentiment Analysis,” Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, (2017). Cho, K., van M. Bart, C. Gulcehre, D. Bahdanau, F. Bougares, H. Schwenk, and Y. Bengio, “Learning Phrase Representations Using RNN Encoder-Decoder for Statistical Machine Translation,” arXiv preprint arXiv:1406.1078, (2014). Collobert, R., J. Weston, L. Bottou, M. Karlen, K. Kavukcuglu, and P. Kuksa, “Natural Language Processing (Almost) from Scratch,” Journal of Machine Learning Research, Vol.12, (2011), 2493~2537. Cui, M.-n., Y.-s. Jin, and O.-b. Kwon, “A Method of Analyzing Sentiment Polarity of Multilingual Social Media: A Case of Korean-Chinese Languages,” Journal of Intelligence and Information Systems, Vol.22, No.3(2016), 91~111.

함께 읽어보면 좋을 논문

논문 유사도에 따라 DBpia 가 추천하는 논문입니다. 함께 보면 좋을 연관 논문을 확인해보세요!

이 논문의 저자 정보

박현정

소속기관 고려대학교

주요연구분야 공학 > 산업공학 공학 > 컴퓨터학

논문수 10 이용수 3,795

송민채

소속기관 농협중앙회

주요연구분야 공학 > 산업공학 TOP 10% 공학 > 공학 일반

논문수 7 이용수 2,689

신경식

소속기관 이화여자대학교

주요연구분야 공학 > 산업공학 TOP 5% 공학 > 컴퓨터학

논문수 56 이용수 13,553

이 논문과 함께 이용한 논문

동아시아공동체 담론의 제도화 : ASEAN의 인식과 전략

변창구 정치정보연구 2010 .12

딥러닝 기법을 활용한 소셜미디어 텍스트의 문자 기반 다범주 감성분석 방안

김동성 , 김기태 , 김종우 한국경영과학회 학술대회논문집 2017 .04

한국어 텍스트 분석과 적용 : 머신러닝을 통한 증권발행신고서의 비정형화된 텍스트 분석

김용석 , 조성욱 한국증권학회지 2019 .04

Korean Movie-review Sentiment Analysis Using Parallel Stacked Bidirectional LSTM Model

오영택 , 김민태 , 김우주 Journal of KIISE 2019 .01

Categorization of Korean News Articles Based on Convolutional Neural Network Using Doc2Vec and Word2Vec

김도우 , 구명완 Journal of KIISE 2017 .07

최근 본 자료

전체보기

UCI(KEPA) : I410-ECN-0101-2018-003-003162078

구분	그룹	데이터 항목
AI 학습용 데이터	원문	원문 PDF 파일
AI 학습용 데이터	원문 + 메타 (기본/상세)	원문 PDF 파일 및 서지정보 CSV
대량 구매용 데이터	B2B 구독 방식	특정 자료 한정으로 원문 접근 권한 부여
대량 구매용 데이터	URL 전달 방식	바로 PDF 뷰어를 열람할 수 있는 URL 제공

구분	그룹	데이터 항목
AI 학습용 데이터	기본 메타	발행기관명, 간행물명, 권호명, 권(vol), 호(issue), 통권, 발행연도, 발행월, 논문명, 저자명, 시작페이지, 종료페이지, 전체페이지, 상세페이지URL
상세 메타 데이터	발행기관 메타	발행기관 이명, 영문명, 창립연도, 홈페이지URL, 발행기관 소개
	간행물 메타	부제목, 간행물 유형, ISSN, ISBN, 최초발행연도, 폐간연도, 간행빈도, 발행주기, 등재사항, 이용수, 피인용수, 권호수, 논문수, 표지이미지
	논문 메타	작성 언어, 부제목, 대등제목, 목차, 키워드, 초록, 이미지, 참고문헌, 이용수, 피인용수, 논문활용도, DBpia통합주제분류, KDC분류, DDC분류, 한국연구재단분류, UCI, DOI
	저자 메타	소속기관, 소속부서, 직급, 연구분야, 연구키워드, 이용수, 피인용수, 저자 논문활용도

구분	그룹	데이터 항목
※ 결합형/맞춤형 메타 데이터는 신청 내용에 따라 다양하게 제공 가능
이용순위 정보	주제분야별 많이 이용된 논문	“인문학”에서 많이 이용된 논문 TOP100
	이용기관별 많이 이용된 논문	“중고등학교”에서 많이 이용된 논문 TOP100
	세부기관별 많이 이용된 논문	“서울대학교”에서 많이 이용된 논문 TOP100
	키워드별 많이 이용된 논문	“Chat GPT”에서 많이 이용된 논문 TOP100
키워드 정보	많이 이용된 키워드	특정기간/분야/저널 내 많이 이용된 키워드
	많이 발행된 키워드	특정기간/분야/저널 내 많이 발행된 키워드
	많이 검색된 키워드	특정기간/분야/저널 내 많이 검색된 키워드
	연구 트렌드 키워드	특정 키워드 연관 연구동향 분석 데이터 키워드

논문 기본 정보

초록· 키워드

AI 요약

연구주제

연구배경

연구방법

연구결과

주요내용

목차

참고문헌 (41)

함께 읽어보면 좋을 논문

이 논문의 저자 정보

이 논문과 함께 이용한 논문

최근 본 자료

댓글(0)