
[CS231n] Lecture 3 - Loss Functions and Optimization

by ๋ง๋ž‘e 2021. 1. 31.

 

www.youtube.com/watch?v=h7iBpEHGVNc&list=PL3FW7Lu3i5JvHM8ljYj-zLfQRF3EO8sYv&index=3

Stanford University์—์„œ 2017๋…„๋„์— ๊ฐ•์˜ํ•œ CS231n๋ฅผ ๋“ค์œผ๋ฉฐ ์ •๋ฆฌ, ์š”์•ฝํ–ˆ๋‹ค.

Starting the summary of Lecture 3, Loss Functions and Optimization!


Lecture 2 review

๊ณ ์–‘์ด ์‚ฌ์ง„์„ ์ธ์‹ํ•˜๊ฒŒ ํž˜๋“ค๊ฒŒ ๋งŒ๋“œ๋Š” ์›์ธ

Recap from last time

W์˜ row๋ฅผ visualize

์–ด๋–ค W๊ฐ€ ์ข‹์€ W๋ผ๊ณ  ํ•  ์ˆ˜ ์žˆ์„๊นŒ? 

car์€ ์˜ˆ์ธก์„ ์ž˜ํ•˜๊ณ  frog๋Š” ์—„์ฒญ ๋ชปํ•จ

To tell which W is best, we need a loss function.

Which W is the least bad?

 

loss function

x is the input (the image pixels); y is called the label or target.

Compare the prediction f(x, W) with y in some way to get a per-example difference; summing that difference over all N images (and averaging) gives the total loss.
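In symbols, as on the slide:

$$L = \frac{1}{N}\sum_{i=1}^{N} L_i\big(f(x_i, W),\, y_i\big)$$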

 

multiclass SVM

Loss function์˜ ์ผ์ข…

์ง„์งœ ์นดํ…Œ๊ณ ๋ฆฌ์ธ y_i๋ฅผ ์ œ์™ธํ•œ ๋‹ค๋ฅธ ์นดํ…Œ๊ณ ๋ฆฌ j๋“ค์— ๋Œ€ํ•ด์„œ

if the score predicted for the true category is higher than the score predicted for the other category + 1 (1 being the margin),

that is, if S_yi >= S_j + 1, that term contributes 0.

Otherwise, it contributes S_j + 1 - S_yi.

ํ•œ ์ด๋ฏธ์ง€์— ๋ชจ๋“  ์นดํ…Œ๊ณ ๋ฆฌ์— ๋Œ€ํ•ด ๊ตฌํ•œ ์ด ๊ฐ’๋“ค์„ ํ•ฉํ•˜๋ฉด, 1 data set์— ๋Œ€ํ•œ loss ๊ฐ’์„ ๊ตฌํ•  ์ˆ˜ ์žˆ๋‹ค.

๊ทธ๋‹ค์Œ ๋ชจ๋“  training data set์— ๋Œ€ํ•ด์„œ ํ‰๊ท ์„ ๊ตฌํ•œ๋‹ค.

 

The slide also shows the formula with the if-statement ("if the true class score is higher than the other class score...") rewritten as a max.

x์ถ•์€ s_yi๊ณ  y์ถ•์€ loss

You can see that s_yi has to exceed s_j by a full 1 before the loss reaches 0.

 

Example

Here we compute the loss for the cat image.

Compare car score - cat score + 1 with 0, take the max, go through the same process with the other class, and sum.

car์— ๋Œ€ํ•œ total loss๋Š” 0์ด๋‹ค 

The total loss for the frog image is 12.9.

์ดํ•ฉ์— ๋Œ€ํ•œ ํ‰๊ท ์€ 5.27์ด๋‹ค

 

SVM Loss์— ๋Œ€ํ•œ ์ง๊ด€๋ ฅ์„ ๊ธฐ๋ฅผ ์ˆ˜ ์žˆ๋Š” ์งˆ๋ฌธ :

Q. car ์ด๋ฏธ์ง€์—์„œ ๋งŒ์•ฝ car score์ด ์กฐ๊ธˆ ๋ณ€ํ•˜๋ฉด ์–ด๋–ป๊ฒŒ ๋ ๊นŒ?

The loss won't change: the car score is already meaningfully higher than the other scores, so the loss stays 0.

Q. SVM loss์˜ ์ตœ๋Œ€ ์ตœ์†Ÿ๊ฐ’์€?

min์€ 0, max๋Š” ๋ฌดํ•œ๋Œ€

Q. W์˜ ๋ชจ๋“  ๊ฐ’์ด ๋‹ค 0์— ๊ฐ€๊น๋‹ค๋ฉด SVM ๋กœ์Šค ๊ฐ’์€?

(number of classes) - 1

์™œ๋ƒ๋ฉด ๋ชจ๋“  ์ ์ˆ˜๊ฐ€ 0, 0,.., 0์ด ๋  ๊ฒƒ์ด๊ณ  1์ด ํด๋ž˜์Šค -1 ๋งŒํผ ๋”ํ•ด์งˆ ๊ฒƒ ์ด๊ธฐ ๋•Œ๋ฌธ

-> A useful debugging trick: if you initialize W with small values and the first loss is not C - 1, something is wrong.

Q. What if the loss computation also includes y_i (i.e., including j = y_i)?

๋กœ์Šค ๊ฐ’์€ 1๋งŒํผ ์ฆ๊ฐ€ํ•  ๊ฒƒ, S_yi - S_yi + 1์ด ํฌํ•จ๋  ๊ฒƒ์ด๋‹ˆ๊นŒ

Q. What if we use a mean instead of the sum in the boxed formula above?

์ „๊ณผ ํฌ๊ฒŒ ๋ณ€ํ•  ๊ฒƒ์ด ์—†๋‹ค, ๊ทธ๋ƒฅ rescale ํ•œ ๊ฒƒ

Q. What if we use max(...)^2 instead?

์ด๊ฒƒ์€ ๋‹ค๋ฅธ loss function์ด๋‹ค, ์ฐจ์ด๊ฐ€ ํฌ๋ฉด ์ปค์งˆ์ˆ˜๋ก really bad ํ•˜๋‹ค๋Š” ์ •๋ณด๋ฅผ ์ฃผ๊ฒŒ ๋จ

Thinking about which loss to use is itself important.

 

SVM Loss: Example code
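The half-vectorized numpy snippet from the slide, computing the loss for a single example x with label y:

```python
import numpy as np

def L_i_vectorized(x, y, W):
    scores = W.dot(x)                                # scores for every class
    margins = np.maximum(0, scores - scores[y] + 1)  # hinge term per class
    margins[y] = 0                                   # skip the j = y_i term
    loss_i = np.sum(margins)
    return loss_i
```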

W์ผ ๋•Œ L=0์ด๋ผ๋ฉด 2W์ผ ๋•Œ๋„ L=0

So the W achieving zero loss is not unique: there's more than one or two of them.

 

Loss = Data loss + Regularization

Classifying the blue points -> drawing a curve that fits them exactly -> new inputs arrive -> the green line turns out to be the better fit.

ํŒŒ๋ž€์ƒ‰์€ ๋‚˜์˜ training data set์— overfiting ๋˜์—ˆ๊ณ  ๋„ˆ๋ฌด high dimension์ด๋‹ค

The term that helps produce a simpler W is regularization: Occam's razor -> "the simplest is the best".

the whole idea of regularization is just anything that you do to your model,
that sort of penalizes somehow the complexity of the model,
rather than explicitly trying to fit the training data.

In a word: the part that keeps the model from being too biased toward my data.

๋”ฐ๋ผ์„œ ๋žŒ๋‹ค ๊ฐ’์„ ์กฐ์ •ํ•˜๋Š” ๊ฒƒ๋„ ์ค‘์š”ํ•œ ์š”์ธ์ด ๋  ์ˆ˜ ์žˆ๋‹ค.

A larger lambda tends toward a lower-degree model; a smaller lambda allows a higher-degree model (like the blue curve).
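The full objective from the slide, with lambda trading off the two parts:

$$L(W) = \frac{1}{N}\sum_{i=1}^{N} L_i\big(f(x_i, W),\, y_i\big) + \lambda R(W)$$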

 

Regularization

L2 regularization is the usual choice.

L1 and L2 can also be used together (elastic net).

์ด ๋ถ€๋ถ„์„ ์ž์„ธํžˆ ๋‹ค๋ฃจ์ง„ ์•Š์Œ, ์ข€ ๋” ๊ณต๋ถ€ํ•ด์•ผ ํ•  ๊ฒƒ ๊ฐ™๋‹ค

 

Another kind of loss: Softmax

svm์€ loss๊ฐ€ ์ฃผ๋Š” ์˜๋ฏธ๊ฐ€ ๋ณ„๋กœ  ์—†๋‹ค - score์„ ์œ ์˜๋ฏธํ•˜๊ฒŒ ํ•ด์„ํ•˜์ง€ ์•Š์Œ

ํ™•๋ฅ  ๋ถ„ํฌ๋ฅผ ํ•ด์„ํ•˜๋Š” ๋ฐ ์‚ฌ์šฉ

0~1 ์‚ฌ์ด๋กœ ํ™•๋ฅ ์ด ๋‚˜์˜ด

 

Example

 

score์„ exp ํ•˜๊ฒŒ ํ•œ ๋’ค 0~1fh normalize, ๊ทธ ํ›„ Loss ์ทจํ•จ

 

Q. What are the min/max values of this loss?

Min is 0, max is infinity.

P= 1์ผ ๋•Œ Loss๊ฐ€ 0 ์ž„์„ ์•Œ ์ˆ˜ ์žˆ๋‹ค

๊ทธ๋Ÿฐ๋ฐ ๊ทธ๋ ‡๊ฒŒ ๋˜๋ ค๋ฉด y_i๊ฐ€ ์–‘์˜ ๋ฌดํ•œ, ๋‚˜๋จธ์ง€ ํด๋ž˜์Šค๊ฐ€ ์Œ์˜ ๋ฌดํ•œ์ด ์—ฌ์•ผํ•จ

Q ๋ชจ๋“  ํด๋ž˜์Šค์˜ S๊ฐ€ 0์— ๊ทผ์ ‘ํ•˜๋‹ค๋ฉด Loss๋Š”?

After normalizing, each probability is 1/C, so L = -log(1/C) = log C.
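A quick numpy sanity check of that (a minimal sketch; the shift by the max is a standard numerical-stability trick, not something from the slide):

```python
import numpy as np

def softmax_loss(scores, y):
    shifted = scores - np.max(scores)   # stability shift; result unchanged
    probs = np.exp(shifted) / np.sum(np.exp(shifted))
    return -np.log(probs[y])

C = 10
scores = np.zeros(C)                # all scores ~0, as at initialization
print(softmax_loss(scores, 0))      # 2.302... == log(C)
print(np.log(C))
```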

Comparing the SVM loss and Softmax

SVM์€ ์•„๊นŒ jiggle์— ํฐ ์ƒ๊ด€์ด ์—†์—ˆ๋‹ค.

๊ทธ๋Ÿฐ๋ฐ softmax๋Š” ์–ด๋Š ๊ฐ’ ์ด์ƒ์ด๋ผ๊ณ  Loss๊ฐ€ ๊ฐ™์€ ๊ฐ’์„ ๊ฐ€์ง€์ง€ ์•Š๋Š”๋‹ค

๊ทธ๋ž˜์„œ ์ตœ์ ์˜ W๋Š” ์–ด๋–ป๊ฒŒ ์ฐพ์„ ๊ฑด๋ฐ??

 

Optimization

Picture walking around a large valley.

Where is the lowest point?

๋žœ๋ค์œผ๋กœ ์•„๋ฌด ์ ์ด๋‚˜ ์ฐพ๋Š” ์•Œ๊ณ ๋ฆฌ์ฆ˜ - ํ˜„๋ช…ํ•˜์ง€ ์•Š์€ ๋ฐฉ๋ฒ•

Instead, as you walk around, you look and feel the slope with your feet, and keep stepping toward where the ground slopes down.

So what exactly is this slope?

The gradient is a vector with the same dimensions as x, and it gives you the slope along each dimension.

Then how do we compute it?

0.0001์ด๋ผ๋Š” ๊ฐ’์— ์•„์ฃผ ์ž‘์€ ๊ฐ’์„ ๋”ํ•ด ๋ดค๋”๋‹ˆ, loss๊ฐ€ ์ค„์–ด๋“ค์—ˆ๋‹ค

We estimate the gradient the same way we'd take a derivative: (L(W + h) - L(W)) / h.

๋‹ค์Œ entry์—์„œ๋„ ๊ฐ™์€ ๊ณผ์ •์„ ๋ฐ˜๋ณตํ•œ๋‹ค.

Done this way, it will be very slow.
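A sketch of that per-entry loop (close to the eval_numerical_gradient helper in the course notes; f here is assumed to map W to the scalar loss):

```python
import numpy as np

def eval_numerical_gradient(f, W, h=1e-4):
    fx = f(W)                            # loss at the current W
    grad = np.zeros_like(W)
    it = np.nditer(W, flags=['multi_index'])
    while not it.finished:
        idx = it.multi_index
        old = W[idx]
        W[idx] = old + h                 # nudge one entry by h
        grad[idx] = (f(W) - fx) / h      # finite-difference slope
        W[idx] = old                     # restore the entry
        it.iternext()
    return grad
```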

๊ทธ๋ž˜์„œ ์ด๋ ‡๊ฒŒ ํ‘œํ˜„ํ•˜๊ธฐ๋กœ ํ•œ๋‹ค!

WOW,,,,

์ด๋ ‡๊ฒŒ ๋š๋”ฑ ๋œ๋‹ค๊ณ  ํ•จ^^

Still, the numerical method is said to be a good way to debug.

๋Š๋ฆฌ๊ณ  ๋ถ€์ •ํ™• ํ•˜์ง€๋งŒ,, ๋””๋ฒ„๊น…ํ•  ๋•Œ ์“ฐ์ž„

 

Gradient Descent

step_size, also called the learning rate, is an important hyperparameter.

It's said to be the first hyperparameter to check.

gradient์˜ ์Œ์˜ ๋ฐฉํ–ฅ์œผ๋กœ ๊ฐ€๋‹ค ๋ณด๋ฉด minima์— ๋„๋‹ฌํ•  ์ˆ˜ ์žˆ๋‹ค

์–ด๋–ค ๋ฐฉ๋ฒ•์„ ์‚ฌ์šฉํ•˜๋ƒ์— ๋”ฐ๋ผ gradient๋ฅผ ๋‹ค๋ฃจ๋Š” ๋ฐฉ์‹์ด ๋‹ค๋ฅด๋‹ค

Stochastic Gradient Descent (SGD)

์‹ค์ œ๋กœ๋Š” N์ด ๋งค์šฐ ํฌ๋‹ค

N์ด million๋‹จ์œ„๋กœ ๊ฐ€๋ฉด W๋ฅผ ์—…๋ฐ์ดํŠธํ•˜๋Š”๋ฐ ๋งค์šฐ ๋งŽ์€ ์‹œ๊ฐ„์ด ๊ฑธ๋ฆด ๊ฒƒ

So we use minibatches, updating W from the loss computed on each minibatch.
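The SGD loop from the slide, sampling a minibatch of 256 examples per step:

```python
# Stochastic Gradient Descent
while True:
    data_batch = sample_training_data(data, 256)  # sample 256 examples
    weights_grad = evaluate_gradient(loss_fun, data_batch, weights)
    weights += -step_size * weights_grad
```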

The red and blue points can't be separated linearly,

but after transforming the coordinates (e.g., to polar coordinates), they become separable.

์–ด๋Š ์ƒ‰์ด ๊ฐ€์žฅ ๋งŽ์€์ง€ histogram์„ ๋งŒ๋“ฆ

A simple image feature representation.
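A toy version of such a color histogram feature, assuming a hue channel with values in [0, 1) (purely illustrative, not from the lecture):

```python
import numpy as np

def color_histogram(hue, bins=8):
    # Count how many pixels fall into each hue bin, then normalize.
    counts, _ = np.histogram(hue, bins=bins, range=(0.0, 1.0))
    return counts / counts.sum()

hue = np.random.rand(32, 32)   # stand-in for an image's hue channel
print(color_histogram(hue))    # length-8 feature vector summing to 1
```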

HoG (histogram of oriented gradients): find the dominant edge directions and build a histogram of them.

์ž์—ฐ์–ด ์ฒ˜๋ฆฌ์— ๋งŽ์ด ์“ฐ์ธ๋‹ค๊ณ  ํ•จ

Crop random patches and cluster them.

When you actually try it, the pipeline isn't all that different.

The difference is that with ConvNets, the features are extracted during the convolution process itself.
