Matrix, Matrix, Matrix

Uncategorized, September 9, 2019

IEEE stands for Institute of Electrical and Electronics Engineers and is usually read "I-triple-E". AAA batteries are usually called "triple A" in English-speaking countries. NCAA is usually read "N-C-double-A". That does not mean "Matrix, Matrix, Matrix" has to be read "triple matrix".

Matrix-related functions in R

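This post compares and explains the following three functions, each of which converts a data frame to a matrix.

as.matrix : to a matrix
data.matrix : to a data matrix
model.matrix : to a model matrix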

Looking up R's functions

The builtins() function lists the names of R's built-in objects (functions, variables, and so on).

library(magrittr)
builtins() %>% summary()
builtins() %>% head(20)

##    Length     Class      Mode 
##      1345 character character 
##  [1] "zapsmall"              "xzfile"               
##  [3] "xtfrm.Surv"            "xtfrm.POSIXlt"        
##  [5] "xtfrm.POSIXct"         "xtfrm.numeric_version"
##  [7] "xtfrm.factor"          "xtfrm.difftime"       
##  [9] "xtfrm.default"         "xtfrm.Date"           
## [11] "xtfrm.AsIs"            "xtfrm"                
## [13] "xpdrows.data.frame"    "xor"                  
## [15] "writeLines"            "writeChar"            
## [17] "writeBin"              "write.dcf"            
## [19] "write"                 "withVisible"

Of these 1345 objects, the ones whose names contain matrix are the following.

library(stringr)

obj <- builtins()
obj %>% str_extract(".*matrix.*") %>% sort
#sort(obj[grepl("matrix", obj)])

##  [1] "anyDuplicated.matrix"       "as.data.frame.matrix"      
##  [3] "as.data.frame.model.matrix" "as.matrix"                 
##  [5] "as.matrix.data.frame"       "as.matrix.default"         
##  [7] "as.matrix.noquote"          "as.matrix.POSIXlt"         
##  [9] "data.matrix"                "determinant.matrix"        
## [11] "duplicated.matrix"          "is.matrix"                 
## [13] "isSymmetric.matrix"         "matrix"                    
## [15] "prmatrix"                   "subset.matrix"             
## [17] "summary.matrix"             "unique.matrix"
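The commented-out line in the code above hints at a base-R alternative to stringr. A minimal sketch of that idea, assuming only base R and using the illustrative variable name matrix_names (not from the original):

```r
# Keep only the names that contain the literal text "matrix".
# fixed = TRUE treats the pattern as plain text, not a regular expression.
obj <- builtins()
matrix_names <- sort(grep("matrix", obj, fixed = TRUE, value = TRUE))
matrix_names
```

Unlike str_extract(), grep(value = TRUE) drops the non-matching names directly, so there are no NA values for sort() to discard.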

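A function such as summary.matrix is a method: it determines how the generic function summary behaves when it is applied to a matrix.

Next, look for function names containing matrix in the stats package as well.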
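The generic.class naming pattern can be illustrated with a toy generic. This is only a sketch: describe and its methods are hypothetical names invented for the illustration and are not part of R.

```r
# A hypothetical generic: UseMethod() picks a method from the class of x.
describe <- function(x, ...) UseMethod("describe")

# Fallback used when no more specific method exists.
describe.default <- function(x, ...) "not a matrix"

# Method chosen when x is a matrix, just as summary.matrix is for summary().
describe.matrix <- function(x, ...) paste0(nrow(x), " x ", ncol(x), " matrix")

res_matrix <- describe(matrix(1:6, 2, 3))  # dispatches to describe.matrix
res_vector <- describe(1:3)                # falls back to describe.default
res_matrix
res_vector
```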

obj_stats <- ls("package:stats")
obj_stats %>% str_extract(".*matrix.*") %>% sort
#sort(obj_stats[grepl("matrix", obj_stats)])

## [1] "model.matrix"         "model.matrix.default" "model.matrix.lm"

This post compares and explains the following three functions, including model.matrix from the stats package.

as.matrix
data.matrix
model.matrix

`as.matrix`

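as.matrix does what its name says: it turns its argument into a matrix. A key property of a matrix is that all of its elements have the same type. A matrix can hold logical values, integers, real numbers, or character strings, but it cannot hold a mixture of them.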

a <- matrix(1:4, 2, 2)
a

##      [,1] [,2]
## [1,]    1    3
## [2,]    2    4

a[1,1] = 'a'
a

##      [,1] [,2]
## [1,] "a"  "3" 
## [2,] "2"  "4"

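As the result above shows, changing a single element to a string converts every element to a string.

as.matrix is mainly used to convert a data frame to a matrix. If even one column of the data frame is a character column, the resulting matrix is a character matrix.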
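The same coercion happens step by step between the other types as well. A small sketch, using typeof() to watch the element type of a matrix widen from logical to integer to double to character (the variable names are illustrative):

```r
m <- matrix(c(TRUE, FALSE, TRUE, FALSE), 2, 2)
t1 <- typeof(m)    # starts as a logical matrix

m[1, 1] <- 2L      # assigning an integer widens every element to integer
t2 <- typeof(m)

m[1, 2] <- 0.5     # assigning a real number widens every element to double
t3 <- typeof(m)

m[2, 2] <- "x"     # assigning a string widens every element to character
t4 <- typeof(m)

c(t1, t2, t3, t4)
```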

as.matrix(data.frame(x1=c(1,2), x2=c(2,4)))

##      x1 x2
## [1,]  1  2
## [2,]  2  4

as.matrix(data.frame(x1=c(1,2), x2=c('a','b')))

##      x1  x2 
## [1,] "1" "a"
## [2,] "2" "b"
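The element type of the result can also be checked directly instead of reading it off the quotation marks in the printed output. A minimal sketch with two small data frames like the ones above (num_df and mix_df are illustrative names):

```r
num_df <- data.frame(x1 = c(1, 2), x2 = c(2, 4))
mix_df <- data.frame(x1 = c(1, 2), x2 = c("a", "b"))

type_num <- typeof(as.matrix(num_df))  # all columns numeric: numeric matrix
type_mix <- typeof(as.matrix(mix_df))  # one non-numeric column: character matrix

c(type_num, type_mix)
```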

`data.matrix`

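In statistics, data are mostly numeric. Most character data are converted to numbers before they are analyzed.

data.matrix converts a data frame to a numeric matrix. Consider the following example.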

dat = data.frame(bool = c(TRUE, FALSE, FALSE),
                 int = c(1,2,3),
                 num = c(1.2, 4.3, -1.4),
                 fac = factor(c('rock', 'scissor', 'rock')),
                 str = c("don't look back", "go ahead", "no longer")
                 ,stringsAsFactors=FALSE
                 )
as.matrix(dat)

##      bool    int num    fac       str              
## [1,] " TRUE" "1" " 1.2" "rock"    "don't look back"
## [2,] "FALSE" "2" " 4.3" "scissor" "go ahead"       
## [3,] "FALSE" "3" "-1.4" "rock"    "no longer"

data.matrix(dat)

## Warning: NAs introduced by coercion
##      bool int  num fac str
## [1,]    1   1  1.2   1  NA
## [2,]    0   2  4.3   2  NA
## [3,]    0   3 -1.4   1  NA

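As the results above show, for dat, which mixes several data types, as.matrix() returns a character matrix, whereas data.matrix() returns a numeric matrix. (The character column became NA, with a coercion warning. In R 4.0.0 and later, data.matrix() instead converts character columns to factors and then to integer codes, so this column differs by R version.)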
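The numbers in a factor column are the factor's integer codes, so the original labels can be recovered from the levels. A small sketch, assuming a data frame d that repeats the logical, numeric, and factor columns from the example (the character column is left out because its handling depends on the R version):

```r
d <- data.frame(bool = c(TRUE, FALSE, FALSE),
                num  = c(1.2, 4.3, -1.4),
                fac  = factor(c("rock", "scissor", "rock")))
dm <- data.matrix(d)

# Logical values become 1/0; the factor becomes its integer codes.
codes  <- unname(dm[, "fac"])
labels <- levels(d$fac)[codes]   # map the codes back to the level labels
labels
```

The codes follow the order of levels(d$fac), so they say nothing about any natural ordering of the categories.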

`model.matrix`

model.matrix converts a data frame to a matrix suitable for use as \(\mathbf{X}\) in the linear model \(\vec{y} = \mathbf{X}\vec{\beta} + \vec{e}\).

The most basic usage is as follows.

dat1 = data.frame(x1 = rnorm(4),
                  f1 = factor(c("yes", "no", "dont know", "no")))
(mm <- model.matrix(~., dat1))

##   (Intercept)         x1 f1no f1yes
## 1           1 -1.0733481    0     1
## 2           1  0.9378294    1     0
## 3           1  0.1460410    0     0
## 4           1 -0.8485989    1     0
## attr(,"assign")
## [1] 0 1 2 2
## attr(,"contrasts")
## attr(,"contrasts")$f1
## [1] "contr.treatment"

The output looks complicated, but the result is a matrix.

class(mm)

## [1] "matrix"

It simply has several attributes attached.

With the attributes removed, only the matrix itself remains:

mm2 <- mm
attr(mm2, "assign") <- NULL
attr(mm2, "contrasts") <- NULL
mm2

##   (Intercept)         x1 f1no f1yes
## 1           1 -1.0733481    0     1
## 2           1  0.9378294    1     0
## 3           1  0.1460410    0     0
## 4           1 -0.8485989    1     0
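The assign attribute is more than decoration: it records which model term each column came from (0 stands for the intercept). A minimal sketch, assuming a small data frame d with made-up x1 values and the same factor as above:

```r
d <- data.frame(x1 = c(0.5, -1, 2, 0),
                f1 = factor(c("yes", "no", "dont know", "no")))
mm_d <- model.matrix(~ x1 + f1, d)

# Term labels of the formula, with the intercept placed in front (index 0).
term_labels <- c("(Intercept)", attr(terms(~ x1 + f1), "term.labels"))

# assign is 0-based, so add 1 to index into term_labels.
col_terms <- term_labels[attr(mm_d, "assign") + 1]
data.frame(column = colnames(mm_d), term = col_terms)
```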

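By default, model.matrix assumes a model that includes an intercept. x1 is kept as is, and f1no and f1yes are the dummy coding of the factor f1; its first level, dont know, serves as the baseline and gets no column of its own.

A model without the intercept looks like this.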
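Which level serves as the baseline follows from the order of the factor levels, so it can be changed with relevel(). A small sketch, assuming we want no as the baseline (d_ref is an illustrative name):

```r
f1 <- factor(c("yes", "no", "dont know", "no"))
default_levels <- levels(f1)          # alphabetical: "dont know" comes first

# Move "no" to the front so that it becomes the baseline level.
d_ref <- data.frame(f1 = relevel(f1, ref = "no"))
ref_cols <- colnames(model.matrix(~ f1, d_ref))
ref_cols
```

The no column disappears and a column for dont know appears in its place.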

model.matrix(~ . -1, dat1)

##           x1 f1dont know f1no f1yes
## 1 -1.0733481           0    0     1
## 2  0.9378294           0    1     0
## 3  0.1460410           1    0     0
## 4 -0.8485989           0    1     0
## attr(,"assign")
## [1] 1 2 2 2
## attr(,"contrasts")
## attr(,"contrasts")$f1
## [1] "contr.treatment"
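Without the intercept, the factor gets one indicator column per level, so each row has exactly one 1 among the factor columns. A quick check of that property, as a sketch on a data frame that contains only the factor (d0 and X0 are illustrative names):

```r
d0 <- data.frame(f1 = factor(c("yes", "no", "dont know", "no")))
X0 <- model.matrix(~ f1 - 1, d0)

n_cols   <- ncol(X0)      # one column per level
row_sums <- rowSums(X0)   # each row belongs to exactly one level
row_sums
```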

When the data also include a character column, model.matrix first converts it to a factor, as in the following example.

dat2 = data.frame(x1 = rnorm(4),
                  f1 = factor(c("yes", "no", "dont know", "no")),
                  s1 = c("Actually, ...",
                         "No way, ...", 
                         "I dont think ...",
                         "Overall, ..."),
                  stringsAsFactors=FALSE
)
model.matrix(~ ., dat2)

##   (Intercept)         x1 f1no f1yes s1I dont think ... s1No way, ...
## 1           1  0.9040869    0     1                  0             0
## 2           1 -0.9824035    1     0                  0             1
## 3           1 -0.7942396    0     0                  1             0
## 4           1  0.6669408    1     0                  0             0
##   s1Overall, ...
## 1              0
## 2              0
## 3              0
## 4              1
## attr(,"assign")
## [1] 0 1 2 2 3 3 3
## attr(,"contrasts")
## attr(,"contrasts")$f1
## [1] "contr.treatment"
## 
## attr(,"contrasts")$s1
## [1] "contr.treatment"
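One way to confirm that the character column is treated like a factor is to build the same model matrix from both versions of a column and compare. A sketch with a made-up column s1:

```r
vals  <- c("a", "b", "c", "b")
d_chr <- data.frame(s1 = vals, stringsAsFactors = FALSE)  # character column
d_fac <- data.frame(s1 = factor(vals))                    # factor column

mm_chr <- model.matrix(~ s1, d_chr)
mm_fac <- model.matrix(~ s1, d_fac)

same_cols   <- identical(colnames(mm_chr), colnames(mm_fac))
same_values <- all(mm_chr == mm_fac)
c(same_cols, same_values)
```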

Of course, several different linear models can be posited for the same data, each with its own model matrix.

model.matrix(~ . -1, dat1)

##           x1 f1dont know f1no f1yes
## 1 -1.0733481           0    0     1
## 2  0.9378294           0    1     0
## 3  0.1460410           1    0     0
## 4 -0.8485989           0    1     0
## attr(,"assign")
## [1] 1 2 2 2
## attr(,"contrasts")
## attr(,"contrasts")$f1
## [1] "contr.treatment"

model.matrix(~ x1, dat1)

##   (Intercept)         x1
## 1           1 -1.0733481
## 2           1  0.9378294
## 3           1  0.1460410
## 4           1 -0.8485989
## attr(,"assign")
## [1] 0 1

model.matrix(~ I(x1^2) - f1, dat1)

##   (Intercept)    I(x1^2)
## 1           1 1.15207619
## 2           1 0.87952396
## 3           1 0.02132797
## 4           1 0.72012010
## attr(,"assign")
## [1] 0 1
## attr(,"contrasts")
## attr(,"contrasts")$f1
## [1] "contr.treatment"

#model.matrix(~ . - f1, dat1)
model.matrix(~ . + I(x1^2), dat1)

##   (Intercept)         x1 f1no f1yes    I(x1^2)
## 1           1 -1.0733481    0     1 1.15207619
## 2           1  0.9378294    1     0 0.87952396
## 3           1  0.1460410    0     0 0.02132797
## 4           1 -0.8485989    1     0 0.72012010
## attr(,"assign")
## [1] 0 1 2 2 3
## attr(,"contrasts")
## attr(,"contrasts")$f1
## [1] "contr.treatment"
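To connect these matrices back to the linear model, the least-squares coefficients can be computed from a model matrix by solving the normal equations \(\mathbf{X}^\top\mathbf{X}\vec{\beta} = \mathbf{X}^\top\vec{y}\). This is only a sketch on made-up data (the data frame d_fit and its values are assumptions for illustration), and it is not how lm() computes its fit internally, although the coefficients agree:

```r
d_fit <- data.frame(x = c(1, 2, 3, 4, 5),
                    y = c(2.1, 3.9, 6.2, 7.8, 10.1))

X <- model.matrix(~ x, d_fit)   # columns: (Intercept), x

# Solve (X'X) beta = X'y for beta.
beta_hat <- solve(crossprod(X), crossprod(X, d_fit$y))

# Compare with the coefficients reported by lm().
beta_lm <- coef(lm(y ~ x, d_fit))
cbind(normal_equations = as.numeric(beta_hat), lm = as.numeric(beta_lm))
```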

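Wrap-up

In summary, the three functions can be distinguished as follows.

as.matrix : to a matrix
data.matrix : to a data matrix (a numeric matrix)
model.matrix : to a model matrix (the \(\mathbf{X}\) in \(\vec{y} = \mathbf{X}\vec{\beta} + \vec{e}\))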

Tags: builtins matrix function

admin
