An Intuitive Understanding of Softmax

ML, July 2, 2020

Go Wiki!

Let's look at the English wiki.

The following function:

\[\textrm{softmax}(k, x_1, \cdots, x_n) = \frac{e^{x_k}}{\sum_{i=1}^n e^{x_i}}\]

is referred to as the softmax function. The reason is that the effect of exponentiating the values \(x_1, \cdots, x_n\) is to exaggerate the differences between them. As a result, \(\textrm{softmax}(k, x_1, \cdots, x_n)\) will return a value close to 0 whenever \(x_k\) is significantly less than the maximum of all the values, and will return a value close to 1 when applied to the maximum value, unless it is extremely close to the next-largest value. Thus, the softmax function can be used to construct a weighted average that behaves as a smooth function (which can be conveniently differentiated, etc.) and which approximates the indicator function.
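Both claims in the quoted passage can be checked numerically. The sketch below is not from the wiki; the helper name `softmax_vec` and the two input vectors are made up for illustration.

```r
# Illustrative helper (the name is an assumption): softmax of a whole vector.
softmax_vec <- function(x) exp(x) / sum(exp(x))

# A clear winner: the gap of 5 between the top two values gets exaggerated.
clear <- softmax_vec(c(1, 2, 7))

# A near tie: the top two values differ by only 0.1.
tie <- softmax_vec(c(1, 2, 2.1))

print(round(clear, 3))
print(round(tie, 3))
```

In `clear` the largest entry comes out at about 0.99, while in `tie` no entry reaches 0.5. That is the "unless it is extremely close to the next-largest value" caveat.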

`argmax`

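The argmax function finds the argument (position) at which the given values reach their maximum, not the maximum value itself. For example,

\(\textrm{argmax}_{1 \leq k \leq 10} (k^3 - 21k^2+98k)\) is \(3\).

Setting \(x_k = k^3 - 21k^2+98k\) and evaluating \(x_k\) for \(k=1, 2, \cdots, 10\) gives the following.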

\[x_1 = 78, x_2=120, x_3=132, x_4=120, x_5=90, x_6=48, x_7=0, \cdots\]

The maximum is \(132\), attained at \(k=3\).

Written in so-called dummy coding, this \(k=3\) becomes \((0,0,1,0,0,0,0,0,0,0)\).
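The same computation can be sketched in R. `which.max` is base R's argmax for a vector, and the dummy coding is built by comparing positions. The names `x`, `k_max`, and `dummy` are only for illustration.

```r
# The polynomial from the example, evaluated at k = 1, ..., 10.
f <- function(k) k^3 - 21 * k^2 + 98 * k
x <- f(1:10)

# which.max returns the index of the (first) maximum, i.e. the argmax.
k_max <- which.max(x)

# Dummy coding: 1 at the argmax position, 0 everywhere else.
dummy <- as.integer(seq_along(x) == k_max)

print(k_max)
print(dummy)
```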

Now let's take \(\textrm{softmax}\) instead of \(\textrm{argmax}\).

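# softmax of a numeric vector: exponentiate, then normalize so the entries sum to 1
softmax = function(x) {
  exp(x)/sum(exp(x))
}

# the polynomial x_k = k^3 - 21k^2 + 98k from the argmax example
f = function(x) {x^3-21*x^2+98*x}

softmax(f(1:10))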

##  [1]  3.532585e-24  6.144137e-06  9.999877e-01  6.144137e-06  5.749452e-19  3.305660e-37  4.711108e-58
##  [8]  6.714102e-79  3.860288e-97 3.612312e-110

Whoa, those numbers are not easy to read, so…

round(softmax(f(1:10)), 2)

##  [1] 0 0 1 0 0 0 0 0 0 0

Huh?

round(softmax(f(1:10)), 5)

##  [1] 0.00000 0.00001 0.99999 0.00001 0.00000 0.00000 0.00000 0.00000 0.00000 0.00000

So, placing the dummy-coded \(\textrm{argmax}\) next to \(\textrm{softmax}\) for comparison:

dummy coding(argmax(x_k)) = (0.00000, 0.00000, 1.00000, 0.00000, 0.00000, 0.00000, 0.00000, 0.00000, 0.00000, 0.00000)
             softmax(x_k) = (0.00000, 0.00001, 0.99999, 0.00001, 0.00000, 0.00000, 0.00000, 0.00000, 0.00000, 0.00000)

Aren't they remarkably similar?

What I am getting at is that softmax can be understood as roughly soft(dummycoding(argmax)).
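One way to put a number on "similar" is the largest entrywise gap between the two vectors. This is a rough check, not a formal statement; `dummy` and `gap` are illustrative names.

```r
softmax <- function(x) exp(x) / sum(exp(x))
f <- function(x) x^3 - 21 * x^2 + 98 * x

x <- f(1:10)
s <- softmax(x)

# Dummy-coded argmax: 1 at the position of the maximum, 0 elsewhere.
dummy <- as.integer(seq_along(x) == which.max(x))

# Largest absolute difference between softmax and the dummy coding.
gap <- max(abs(s - dummy))
print(gap)

# softmax preserves the ordering of its inputs, so both pick the same position.
print(which.max(s) == which.max(x))
```

For this example the gap is on the order of 1e-5, which is why rounding to two decimals reproduces the dummy coding exactly.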

Other properties

\[\textrm{softmax}(k, x_1, \cdots, x_n) = \frac{e^{x_k}}{\sum_{i=1}^n e^{x_i}}\]

Dividing both the numerator and the denominator of the expression above by \(e^{x_k}\) gives the following.

\[\textrm{softmax}(k, x_1, \cdots, x_n) = \frac{1}{\sum_{i=1}^n e^{x_i-x_k}}\]

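So the \(\textrm{softmax}\) value depends only on the differences \(x_1-x_k, x_2-x_k, \cdots\), not on the magnitudes of \(x_1, x_2, \cdots\) themselves. In other words, adding the same constant to every one of \(x_1, x_2, \cdots\), or subtracting it from every one, leaves the \(\textrm{softmax}\) value unchanged.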
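A quick numerical check of this shift invariance, plus one practical consequence: subtracting `max(x)` before exponentiating leaves the result unchanged but keeps `exp` from overflowing. The function name `softmax_stable` and the test vectors are assumptions for illustration.

```r
softmax <- function(x) exp(x) / sum(exp(x))

# Same function, but shifted so that the largest exponent is exp(0) = 1.
softmax_stable <- function(x) {
  z <- x - max(x)
  exp(z) / sum(exp(z))
}

x <- c(1, 2, 3)

# Adding a constant to every entry does not change the result.
print(all.equal(softmax(x), softmax(x + 100)))

# exp(1000) overflows to Inf, so the naive version returns NaN here.
big <- c(1000, 1001, 1002)
print(softmax(big))
print(softmax_stable(big))
```

Because `big` is just `x + 999`, the shifted version returns the same values as `softmax(x)`.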

The following shows how the softmax value for \(x_k\) changes when the largest value is \(x_k=0\) and all the remaining \(x_i(i \neq k)\) are equal to \(-1\), or all equal to \(-2\), and so on.

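library(dplyr)
library(ggplot2)
# n = number of choices, diffx = gap between the largest value (0) and all the others (-diffx)
cond = expand.grid(n=2:10, diffx = 1:10)
# one term exp(0) for the maximum plus (n-1) terms exp(-diffx) for the rest
dat = cond %>% mutate(softmax = 1/(exp(0)+(n-1)*exp(-diffx)))
ggplot(dat, aes(x=n, col=factor(diffx), y=softmax)) +
  geom_point() +
  geom_line() + ylim(0,1) +
  geom_hline(yintercept= 0.9, linetype='dotted') +  # 0.9 reference line
  scale_x_continuous(breaks=1:10)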

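[Figure: maximum softmax value (y) against the number of choices n (x), one line per diffx, with a dotted horizontal line at 0.9]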

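The plot shows that with 5 choices, the maximum softmax value exceeds 0.9 once the gaps to the other \(x_i\) are all about 4 or more: at \(n=5\) the curve for a gap of 3 is still below 0.9, while the curve for a gap of 4 is above it. Once the gaps are all 5 or more, the maximum softmax value stays above 0.9 even with 10 choices.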
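The thresholds can also be computed directly instead of being read off the plot. With \(x_k=0\) and the other \(n-1\) values all equal to \(-d\), the softmax formula reduces to \(1/(1+(n-1)e^{-d})\), and solving \(1/(1+(n-1)e^{-d}) \geq p\) for \(d\) gives \(d \geq \log\left(p(n-1)/(1-p)\right)\). The function names below are illustrative, not from the original code.

```r
# Maximum softmax value when the winner is 0 and the other n - 1 values are -d.
max_softmax <- function(n, d) 1 / (1 + (n - 1) * exp(-d))

# Smallest gap d for which the maximum softmax value reaches p.
min_gap <- function(n, p = 0.9) log(p * (n - 1) / (1 - p))

print(min_gap(5))    # log(36), about 3.58
print(min_gap(10))   # log(81), about 4.39

# Values on the plot around the 0.9 line.
print(max_softmax(5, c(3, 4)))
print(max_softmax(10, 5))
```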

admin
