
Text classification & Sentence representation

yudam 2022. 1. 18. 00:06


💡
Text classification matters because doing it naturally leads to sentence representation. It lets us automatically assign an article to its field, filter spam mail, decide whether a movie review is positive or negative, and so on.

 

Text classification

  • input: a sentence or paragraph
  • output: one of a fixed set of categories
  • Examples: sentiment analysis, topic/category classification, intent classification, etc.

 

The most important question, then: how do we represent a sentence in a language a computer can handle?

  • ๋ฌธ์žฅ์€ ์ผ๋ จ์˜ tokens์ด๋‹ค. ํ…์ŠคํŠธ ํ† ํฐ์€ arbitrary ํ•œ ์„ฑ๊ฒฉ์„ ๋ค๋‹ค.
  • ํ† ํฐ์„ ๋‚˜๋ˆ„๋Š” ๊ธฐ์ค€์€ ๋‹ค์–‘ํ•˜๋‹ค. ๊ณต๋ฐฑ(white space), ํ˜•ํƒœ์†Œ(morphs), ์–ด์ ˆ, ๋น„ํŠธ ์ˆซ์ž ๋“ฑ๋“ฑ.
  1. vocabulary๋ฅผ ๋งŒ๋“ ๋‹ค. ๊ทธ๋ฆฌ๊ณ  ๊ทธ๊ฒƒ์„ ์ค‘๋ณต๋˜์ง€ ์•Š๋Š” index๋กœ ๋ฐ”๊พผ๋‹ค.
  1. integer sequence๋กœ ์ธ์ฝ”๋”ฉ
  1. one hot encoding, one of k encoding ๋“ฑ ๋ฐฉ๋ฒ•์ด ์žˆ์Œ. integer๋ฅผ binary vector๋กœ ๋ณ€ํ™˜. ๋ชจ๋“  element๋Š” 0. token์„ index๋กœ correspondingํ•˜๋Š” ๊ฒƒ๋งŒ 1๋กœ ์„ธํŒ….
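To make this concrete, here is a minimal Python sketch of steps 1–3; the whitespace tokenizer and the toy sentence are assumptions chosen for illustration:

```python
import numpy as np

sentence = "the movie was great"   # toy example sentence
tokens = sentence.split()          # tokenize on white space

# 1. Build a vocabulary: each unique token gets a unique index.
vocab = {tok: idx for idx, tok in enumerate(sorted(set(tokens)))}

# 2. Encode the sentence as an integer sequence.
int_seq = [vocab[tok] for tok in tokens]

# 3. One-hot (one-of-k) encoding: all elements 0 except a 1 at the
#    position corresponding to each token's index.
one_hot = np.zeros((len(int_seq), len(vocab)))
one_hot[np.arange(len(int_seq)), int_seq] = 1

print(int_seq)   # [2, 1, 3, 0] for this vocabulary
print(one_hot)   # a (4, 4) matrix with a single 1 per row
```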

 

What we actually want is the relationships between tokens: words with similar meanings should sit close together, and unrelated words far apart. How do we create such relationships? One-hot encoding can represent tokens as fixed-length vectors, but every pair of tokens is equally far apart, so it cannot express relationships between words.

 

Embedding: projecting tokens into a continuous vector space lets us express these relationships!

  • Words live in a discrete space, so we have no meaningful metric between them. Can't we make them continuous?
  • Multiply a weight matrix by the one-hot vector; this amounts to a table lookup (see the sketch right after this list). It too is implemented as a node in the DAG (computation graph).
  • As a result, a sentence becomes a sequence of continuous, high-dimensional vectors.
  • For classification, the final output vector's size equals the number of categories, and applying the softmax function to it yields a probability distribution.

 

If table lookup assigns a vector to each token, how do we assign a vector to a whole sentence? The candidates: CBoW, RN, CNN.

  • Continuous bag-of-words (CBoW)
    • Treat the sentence as a bag of tokens, optionally bundling words into n-grams.
    • Word order is ignored.
    • It works surprisingly well, so try it as a baseline.
    • Facebook's fastText is a reference implementation worth consulting.
    • Procedure (a sketch in code follows this list):
      1. A sentence of T tokens is given.
      2. The table lookup layer converts it to vectors (sequence of tokens → sequence of vectors).
      3. An average node averages the vectors into a sentence representation.
        • Rather than one universal representation, you get a representation suited to the problem you are solving (it depends on the trained model).
      4. Sentences that are close in this space have similar meanings, and distant ones do not, which makes the space suitable for classification.
      5. Apply a softmax classifier to get a probability distribution → compute the negative log probability as the loss, backpropagate, and update with stochastic gradient descent → early stopping → training done.
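A minimal NumPy sketch of this CBoW forward pass; the sizes, random weights, token indices, and the true class label are all placeholder assumptions, and the backward pass is only indicated in a comment:

```python
import numpy as np

def softmax(z):
    z = z - z.max()  # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum()

vocab_size, emb_dim, num_classes = 100, 16, 2
rng = np.random.default_rng(0)
E = rng.normal(size=(vocab_size, emb_dim))   # table lookup layer
W = rng.normal(size=(emb_dim, num_classes))  # softmax classifier weights
b = np.zeros(num_classes)

int_seq = np.array([5, 42, 7, 99])     # a sentence of T=4 token indices
vectors = E[int_seq]                   # sequence of tokens -> sequence of vectors
sentence_rep = vectors.mean(axis=0)    # average node: sentence representation
probs = softmax(sentence_rep @ W + b)  # probability distribution over classes

true_class = 1                       # placeholder label
loss = -np.log(probs[true_class])    # negative log probability
# Training would backpropagate this loss into W, b, and E, update them
# with stochastic gradient descent, and stop early on validation loss.
```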

 

 

  • Relation Network (Skip-bigram)
    • Difference from n-grams: it looks at pairs of tokens but allows gaps between them (skips the middle).
    • It looks at every pair of tokens in the sentence and applies a neural network to each pair to build the sentence representation.
    • $h_t = f(x_t, x_1) + \cdots + f(x_t, x_{t-1}) + f(x_t, x_{t+1}) + \cdots + f(x_t, x_T)$
    • ์žฅ์ : ์—ฌ๋Ÿฌ ๋‹จ์–ด๋กœ ๋œ ํ‘œํ˜„์„ ํƒ์ง€
    • ๋‹จ์ : ๋ชจ๋“  ๋‹จ์–ด๊ฐ„์˜ ๊ด€๊ณ„๋ฅผ ๋ณด๊ธฐ ๋•Œ๋ฌธ์—, ์ „ํ˜€ ์—ฐ๊ด€์ด ์—†๋Š” ๋‹จ์–ด๋„ ๋ณด๊ฒŒ ๋จ. (computational efficiency ์•ˆ์ข‹์•„์ง)
  • Convolutional Neural Network (CNN)
    • It looks at k-grams hierarchically, layer by layer.
    • A 1-D CNN over the token sequence.
    • Each layer looks at relations among tokens within a small window, so relations between more distant words either go undetected or require stacking more convolution layers: $h_t = f(x_t, x_{t-k}) + \cdots + f(x_t, x_t) + \cdots + f(x_t, x_{t+k})$
    • Disadvantage: when words are far apart, many layers must be stacked. (A sketch of both $h_t$ computations follows this list.)
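To contrast the two, here is a minimal NumPy sketch of both $h_t$ formulas above; the pairwise function f (a one-layer network over the concatenated pair), the sizes, and the random vectors are all assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
T, d = 6, 4                      # sentence length, vector dimension
X = rng.normal(size=(T, d))      # token vectors after table lookup
U = rng.normal(size=(2 * d, d))  # weights of the assumed pairwise net f

def f(a, b):
    # f(x_i, x_j): one-layer net over the concatenated pair (an assumption)
    return np.tanh(np.concatenate([a, b]) @ U)

def rn_hidden(t):
    # RN: sum f(x_t, x_j) over ALL other positions j != t
    return sum(f(X[t], X[j]) for j in range(T) if j != t)

def cnn_hidden(t, k=1):
    # 1-D CNN flavor: sum f(x_t, x_j) only over the local window |j - t| <= k
    lo, hi = max(0, t - k), min(T - 1, t + k)
    return sum(f(X[t], X[j]) for j in range(lo, hi + 1))

h_rn = np.stack([rn_hidden(t) for t in range(T)])    # (T, d)
h_cnn = np.stack([cnn_hidden(t) for t in range(T)])  # (T, d)
```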

 

So the RN looks at too wide a range (every pair) and the CNN at too narrow a range. Can we strike a balance between the two? Yes: self-attention and the Recurrent Neural Network (RNN).

 

  • Self-attention (a sketch in code follows this list)
    • When computing the t-th token's representation, an additional function α (alpha) determines how relevant each of the other tokens is, weighting every pairwise term: $h_t = \sum_{t'=1}^{T} \alpha(x_t, x_{t'})\, f(x_t, x_{t'})$
    • Advantages:
      • It can handle both long-range and short-range dependencies.
      • It can suppress weakly related tokens and emphasize strongly related ones.
    • Disadvantage:
      • Computational complexity is high, and certain operations such as counting are not easy.
  • Recurrent Neural Network (RNN)
    • It carries a memory $h_t$, so it can store information as it reads.
    • It can compress the sentence's information step by step over time.
    • Disadvantages:
      • As the sentence grows longer, the compressed information must fit into a fixed-size memory, so information learned earlier is forgotten; this amounts to information loss.
      • Tokens must be read sequentially one at a time, so training is slower than with the other networks.
    • Remedies for the long-term dependency problem:
      • Use a bidirectional network.
      • Use RNN variants such as LSTM and GRU.
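A minimal NumPy sketch of self-attention as a weighted version of the RN sum; using dot-product scores normalized by a softmax for α, and using x_j itself in place of f(x_t, x_j), are simplifying assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
T, d = 6, 4
X = rng.normal(size=(T, d))  # token vectors after table lookup

def self_attention(X):
    # alpha(x_t, x_j): relevance weights from pairwise dot-product scores,
    # normalized per row with a softmax (one simple choice, assumed here).
    scores = X @ X.T  # (T, T)
    scores = scores - scores.max(axis=1, keepdims=True)
    alpha = np.exp(scores)
    alpha = alpha / alpha.sum(axis=1, keepdims=True)
    # h_t = sum_j alpha(x_t, x_j) * x_j: weakly related tokens receive
    # small weights (suppressed), strongly related ones large weights.
    return alpha @ X

H = self_attention(X)  # (T, d): one new representation per token
```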

 

These five methods can all be used together as nodes in one computation graph, because every one of them is differentiable.

 

💡
This page summarizes Professor Kyunghyun Cho's lecture "Natural Language Processing with Deep Learning" (딥러닝을 이용한 자연어처리).

'Study > NLP' ์นดํ…Œ๊ณ ๋ฆฌ์˜ ๋‹ค๋ฅธ ๊ธ€

NLP historical overview by weekly NLP  (0) 2022.01.18