
Basic Machine Learning: Supervised Learning

yudam 2022. 1. 17. 23:57

Supervised learning - Overview

 

💡
Machine learning is also called data-driven algorithm design. Here the specification of the data can be ambiguous. For example, with face photos, the taxonomy of what counts as a 'face' (expression, angle, and so on) is vague. Instead, examples are given: the key question is whether, given input-output pairs, we can recover the specification of a 'face' from them.

 

What we are given

  1. Some input-output pairs are given as a training set,
  2. a function (the loss function) is given that tells us, when a machine learning model M takes the training inputs and produces outputs, how far those outputs are from the actual training targets, and
  3. evaluation sets are given (validation & test).

 

What we have to decide

  1. The hypothesis (e.g., an SVM, etc.): which one to use
  2. The optimization algorithm

 

So what does supervised learning actually do? — It finds the appropriate algorithm and model automatically!

  • Using the optimization algorithm, find the best model within each hypothesis.
  1. Use the training set to find the best model for each hypothesis.
  2. Model selection: use the validation set to pick the one that works best (hyperparameter optimization belongs to this step).
  3. Reporting: evaluate on the test set. (The sets must never be mixed; when the data is given, split it carefully. A rough sketch of this workflow follows below.)
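A minimal sketch of this split / select / report workflow, assuming scikit-learn is available. The two candidate hypotheses (an SVM and logistic regression), the dataset, and the hyperparameter values are illustrative choices, not something fixed by the lecture.

```python
# Sketch of the train / validation / test workflow described above.
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.linear_model import LogisticRegression

X, y = load_breast_cancer(return_X_y=True)

# 1. Split once into training / validation / test and never mix them again.
X_train, X_tmp, y_train, y_tmp = train_test_split(X, y, test_size=0.4, random_state=0)
X_val, X_test, y_val, y_test = train_test_split(X_tmp, y_tmp, test_size=0.5, random_state=0)

# 2. For each hypothesis (model family + hyperparameters), fit on the training set only.
candidates = [SVC(C=c) for c in (0.1, 1.0, 10.0)] + [LogisticRegression(max_iter=5000)]
for model in candidates:
    model.fit(X_train, y_train)

# 3. Model selection: pick the candidate that does best on the validation set.
best = max(candidates, key=lambda m: m.score(X_val, y_val))

# 4. Reporting: the test set is touched exactly once, for the final number.
print(type(best).__name__, "test accuracy:", best.score(X_test, y_test))
```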

 

So, as an overview:

  1. Split the big dataset into training, validation, and test sets.
  2. A loss function lets us measure, for each input-output pair fed to the model, the difference from the true output.
  3. We pick a hypothesis and
  4. an optimization algorithm, hand them to the machine, and
  5. it finds the model and algorithm automatically.

 

In other words, what matters is this:

 

💡
Overall summary:
- Build the hypothesis set. In other words, fix the neural net architecture by designing a DAG; how the parameters are set then determines the hypothesis.
- The last thing to do when building a hypothesis is to define the loss function. Having the neural network output a distribution rather than a specific value is effective; once a distribution comes out, the loss function can be fixed using the negative log-probability.
- Finding a good model is the job of the optimization method. Since the hypothesis set is vast, gradient-based optimization is used, and the gradient is computed with the backpropagation algorithm (framework developers have packaged it as modules, so you can simply use them).
- If even this is inefficient because the full gradient has to be computed, use the stochastic gradient descent algorithm to compute an approximation. In that case early stopping has to be done carefully, and when tuning the learning rate is hard, use an adaptive learning rate algorithm.

 

 


 

 

How to decide Hypothesis set

 

What kind of problem are we dealing with? For example, is it classification, regression, etc.?

  • Depending on the answer, decide whether to use an SVM, a Naive Bayes classifier, etc. (for classification) / SVR, linear regression, a Gaussian process, etc. (for regression).

 

How to set the hyperparameters

  • Every time you fix these, you get one more hypothesis set.
  • In general, for an NN there are two things to determine (the weights and the bias vector).

 

How do we design the neural network architecture?

  • Nothing is fixed in advance. Instead, you can refer to existing algorithms and architectures.
  • DAG (for reference)

 

Say, for example, we want to do classification...

  • logistic regression
    • classifies into a positive / negative class
    • dot product + bias + sigmoid function
    • advantage: reusability — TensorFlow, PyTorch, etc. already implement the nodes, so you only have to think about how to connect them
    • in other words, draw the graph and set the hyperparameter values, and you are done (see the sketch right below)
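A minimal sketch of that "dot product + bias + sigmoid" graph in PyTorch. The feature size and the random inputs are made up for illustration only.

```python
# Logistic regression as a tiny DAG: linear node (w·x + b) followed by a sigmoid node.
import torch
import torch.nn as nn

model = nn.Sequential(
    nn.Linear(in_features=10, out_features=1),  # dot product + bias
    nn.Sigmoid(),                               # squashes the score into (0, 1)
)

x = torch.randn(4, 10)   # a mini-batch of 4 made-up examples
p = model(x)             # p(y = 1 | x) for each example
print(p.shape)           # torch.Size([4, 1])
```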

 

How to decide a loss function

 

How should we define the loss function?

There are commonly used ones such as MSE or log-likelihood-based losses, but as you keep using them you start to wonder why exactly those should be used — in fact, the underlying idea was to think in terms of probability!

 

probability

  • Event set: the set of all possible events
    • Ω = {e1, e2, ⋯, eD}
    • when the number of events is finite: discrete
    • when the number of events is infinite: continuous
  • Random variable: a quantity that takes a value from the event set but is not fixed in advance; it can turn out to be any element of the event set.
  • Probability: a function that assigns a value to the random variable taking each element of the event set
    • p(X = ei)
    • it measures 'how likely' it is that X takes the i-th event
  • Properties
    1. Non-negativity: p(X = ei) ≥ 0, a probability can never be negative.
    2. Unit volume: Σ_{e∈Ω} p(X = e) = 1, all the probabilities sum to 1. (A quick numeric check follows below.)
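As a tiny illustration of the two properties, here is a toy discrete distribution (the numbers are made up):

```python
# A 3-event categorical distribution and a check of the two properties above.
probs = [0.2, 0.5, 0.3]                 # p(X = e_i) for a 3-event set
assert all(p >= 0 for p in probs)       # non-negativity
assert abs(sum(probs) - 1.0) < 1e-9     # unit volume: probabilities sum to 1
```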

 

How do we get a loss function out of probability?

  • Given an input x, compute the probability that the output y equals y'.
  • If the neural network model outputs a conditional probability distribution, we can use it to define the loss function!

 

So which probability distributions are there?

  • Binary classification: Bernoulli distribution
    • the probability that y is 0: (1 − μ)
    • the probability that y is 1: μ
    • i.e., there is a single parameter μ (0 < μ < 1) → apply the sigmoid function (which squashes its input into the range between 0 and 1) to produce the distribution as the output
  • Multi-class classification: Categorical distribution
    • e.g., text classification, where there are three or more output classes
    • all the outputs added together must equal 1
    • softmax function (normalize so that everything sums to 1, i.e., turn the scores into ratios)
    • viewed in function space the sigmoid and softmax functions look similar, but in parameter space they are not identical, so they are used in different situations
  • Linear regression: Gaussian distribution
  • Multimodal regression: Mixture of Gaussians (a small sketch of the classification heads follows below)
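A minimal sketch of how a network head turns raw scores into the distribution parameters above. The scores are made-up numbers; only the sigmoid/softmax transformations matter here.

```python
import torch

# Binary case: one score -> Bernoulli parameter μ ∈ (0, 1)
score = torch.tensor([0.7])
mu = torch.sigmoid(score)
print(mu, 1 - mu)                         # p(y=1), p(y=0)

# Multi-class case: one score per class -> categorical parameters that sum to 1
scores = torch.tensor([2.0, 0.5, -1.0])
probs = torch.softmax(scores, dim=0)
print(probs, probs.sum())
```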

 

Defining the loss function automatically

  • If the neural network model outputs a conditional probability distribution, we can use it to define the loss function.
    • Make the conditional distribution the model outputs match the probability distribution of the training samples as closely as possible.
    • That is, maximize the probability of all the training samples.
    • Maximum Likelihood Estimation: this is how the loss function can be defined automatically.
  • To turn it into a minimization problem, attach a minus sign in front (multiply by −1).
  • In the end, the loss function is determined as the sum of the negative log-probabilities. (A minimal sketch follows below.)
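A minimal sketch of the negative log-probability loss for a batch of binary labels, i.e., maximum likelihood with a minus sign. The probabilities and labels are made up.

```python
import torch
import torch.nn.functional as F

mu = torch.tensor([0.9, 0.2, 0.7])   # model's p(y=1 | x) for 3 samples
y = torch.tensor([1.0, 0.0, 1.0])    # observed labels

# loss = -Σ log p(y_i | x_i), averaged over the batch (binary cross-entropy)
loss = F.binary_cross_entropy(mu, y)
print(loss)

# the same thing written out explicitly
manual = -(y * torch.log(mu) + (1 - y) * torch.log(1 - mu)).mean()
print(manual)
```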

 

 

How to optimize

 

So far,

  • We decide on a hypothesis and draw the DAG.
  • Once the DAG's parameters are fixed, we have chosen one model out of the countless hypothesis sets.
  • The neural network of course has an output. If that output is a distribution rather than a single value, we can compute the negative log-probability.
  • The sum of these becomes the final loss function, and the smaller the loss, the better.

 

  • So, what we know so far:
  1. We built a neural network with an arbitrary architecture.
  2. We computed the loss.
  3. Both of these are expressed as a DAG.

 

  • What more do we need to figure out?
  1. We have to pick an optimization algorithm, and
  2. we have to know how to use that optimization algorithm to update the DAG's parameters.

 

 

If it is hard to find the right one directly among countless candidates, optimize iteratively.

  • random guided search
    • the higher the dimension, the more sampling is needed, so efficiency drops
    • advantage: it works no matter which loss function you use
    • disadvantage: it works well when the dimension is small, but because of the curse of dimensionality it takes longer and longer as the dimension grows, and it is slow depending on the sampling
    • what if the loss is continuous and differentiable? → gradient-based optimization
  • gradient descent
    • optimize using derivatives
    • add a very small extra parameter such as the learning rate; then you can tell in which direction the loss gets smaller
    • once you can compute a gradient, you can plug in an off-the-shelf optimization algorithm
    • how is it different from random guided search? the learning rate: it has to be tuned very carefully, and if it is too large, overshooting occurs
    • advantage: compared with random guided search it explores a smaller region, but it can pick a reliable direction
    • disadvantage: if the learning rate is too large or too small, it may never reach the optimum (a toy example follows below)
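A toy sketch of plain gradient descent on a 1-D loss, just to show the role of the learning rate. The function and the learning rate are illustrative.

```python
def loss(w):
    return (w - 3.0) ** 2          # minimum at w = 3

def grad(w):
    return 2.0 * (w - 3.0)         # derivative of the loss

w, lr = 0.0, 0.1                   # too large an lr overshoots, too small a one crawls
for step in range(50):
    w = w - lr * grad(w)           # move against the gradient
print(w, loss(w))                  # w ends up close to 3
```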

 

Then how do we compute the gradient?

  • automatic differentiation (autograd)
    • there are many parameters and many nodes, so computing the gradient is hard; autodiff was devised for exactly this
    • the DAG is differentiable, so the chain rule can be applied
    • process
      1. Compute the Jacobian-vector product at each node.
      2. Propagate it through the acyclic graph (DAG) in reverse order.
    • advantages
      • no need to compute the gradient by hand
      • separation of the [front-end] and the [back-end]
        • [front-end] by building the acyclic graph (DAG), you can focus on designing the neural net (the model)
        • [back-end] the designed acyclic graph (DAG) is compiled into efficient code for the target compute device; framework developers (PyTorch, TensorFlow, ...) implement this for each target, so users only have to use it (though understanding it certainly helps). A short autograd sketch follows below.
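A minimal autograd sketch in PyTorch: the forward pass builds the DAG, and backward() propagates Jacobian-vector products through it in reverse order. The numbers are made up.

```python
import torch

w = torch.tensor([1.0, -2.0], requires_grad=True)  # parameters we want gradients for
x = torch.tensor([0.5, 3.0])                        # a fixed input

y = torch.sigmoid(w @ x + 1.0)     # forward pass: dot product + bias + sigmoid
loss = (y - 1.0) ** 2              # a scalar loss
loss.backward()                    # reverse-mode autodiff through the DAG

print(w.grad)                      # d loss / d w, no hand-derived formulas needed
```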

 

Gradient-based optimization

  • gradient-based optimization
    • since the training set loss is the sum of the per-sample losses, it takes a long time when there are many parameters and samples
  • stochastic gradient descent
    • noise, i.e., random values, are mixed in
    • what if the data is biased? then you need to take care of the correction: Monte Carlo methods or importance sampling. If you do not even know which part is biased, the only option is to construct the validation set carefully, make it resemble the natural distribution, and rely on early stopping.
    • the assumption is that computing the loss on only a few chosen samples approximates the full cost
    1. Select a mini-batch: choose m training samples.
    2. Compute the mini-batch gradient.
    3. Update the parameters.
    4. Stop when the validation set loss no longer improves (early stopping).

    • early stopping: the best recipe for preventing overfitting; stop training at the point where the validation loss is lowest.
    • adaptive learning rate: stochastic gradient descent is sensitive to the learning rate. What should we use to optimize when the noise is large? → Adam, Adadelta, etc., are available as ready-made modules. (A sketch of this loop follows below.)
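A minimal sketch of the mini-batch loop with early stopping; Adam is used as the adaptive-learning-rate variant. The data, batch size, and patience value are illustrative, not from the lecture.

```python
import torch
import torch.nn as nn

# made-up data for illustration
X_train, y_train = torch.randn(800, 10), torch.randint(0, 2, (800, 1)).float()
X_val, y_val = torch.randn(200, 10), torch.randint(0, 2, (200, 1)).float()

model = nn.Sequential(nn.Linear(10, 1), nn.Sigmoid())
loss_fn = nn.BCELoss()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)   # adaptive learning rate

best_val, patience, bad_epochs = float("inf"), 5, 0
for epoch in range(1000):
    for i in range(0, len(X_train), 32):               # 1. pick a mini-batch of m = 32
        xb, yb = X_train[i:i+32], y_train[i:i+32]
        loss = loss_fn(model(xb), yb)                  # 2. mini-batch loss / gradient
        opt.zero_grad(); loss.backward(); opt.step()   # 3. update the parameters
    with torch.no_grad():
        val = loss_fn(model(X_val), y_val).item()
    if val < best_val:
        best_val, bad_epochs = val, 0
    else:
        bad_epochs += 1
    if bad_epochs >= patience:                         # 4. early stopping on validation loss
        break
```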



💡
This page is a summary of Professor Kyunghyun Cho's lecture 'Natural Language Processing with Deep Learning'.
