Wide to Long 2

Wide format · November 04, 2019

dat <- read.table(header=T, text='
 name height.t1 weight.t1 height.t2 weight.t2 height.t3 weight.t3
 Sohn  158 49 166 56 170 60
 Mija  155 44 160 51 161 54
 James 159 50 165 55 169 59
 Mary  156 47 158 50 159 53
 ')
head(dat)

##    name height.t1 weight.t1 height.t2 weight.t2 height.t3 weight.t3
## 1  Sohn       158        49       166        56       170        60
## 2  Mija       155        44       160        51       161        54
## 3 James       159        50       165        55       169        59
## 4  Mary       156        47       158        50       159        53

`tidyr` review

The easiest way to convert between wide and long format appears to be tidyr::spread/tidyr::gather. Use gather to go from wide to long, and spread for the reverse. In both cases you specify the column names.

library(data.table)
library(dplyr)
library(tidyr)
dat %>% gather(key='key', value='value', height.t1:weight.t3)

##     name       key value
## 1   Sohn height.t1   158
## 2   Mija height.t1   155
## 3  James height.t1   159
## 4   Mary height.t1   156
## 5   Sohn weight.t1    49
## 6   Mija weight.t1    44
## 7  James weight.t1    50
## 8   Mary weight.t1    47
## 9   Sohn height.t2   166
## 10  Mija height.t2   160
## 11 James height.t2   165
## 12  Mary height.t2   158
## 13  Sohn weight.t2    56
## 14  Mija weight.t2    51
## 15 James weight.t2    55
## 16  Mary weight.t2    50
## 17  Sohn height.t3   170
## 18  Mija height.t3   161
## 19 James height.t3   169
## 20  Mary height.t3   159
## 21  Sohn weight.t3    60
## 22  Mija weight.t3    54
## 23 James weight.t3    59
## 24  Mary weight.t3    53

dat %>% gather(key='key', value='value', height.t1:weight.t3) %>%
  spread(key='key', value='value')

##    name height.t1 height.t2 height.t3 weight.t1 weight.t2 weight.t3
## 1 James       159       165       169        50        55        59
## 2  Mary       156       158       159        47        50        53
## 3  Mija       155       160       161        44        51        54
## 4  Sohn       158       166       170        49        56        60

Structured column names

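The column names above follow a repeating pattern. height and weight are the measured quantities, and t1, t2, t3 are the measurement times. A structured column name like this carries two pieces of information: what was measured and when it was measured.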

To separate the information packed into the column names:

dat %>% gather(key='key', value='value', height.t1:weight.t3) %>%
  separate(key, sep='[.]', into=c('var', 'time'))

##     name    var time value
## 1   Sohn height   t1   158
## 2   Mija height   t1   155
## 3  James height   t1   159
## 4   Mary height   t1   156
## 5   Sohn weight   t1    49
## 6   Mija weight   t1    44
## 7  James weight   t1    50
## 8   Mary weight   t1    47
## 9   Sohn height   t2   166
## 10  Mija height   t2   160
## 11 James height   t2   165
## 12  Mary height   t2   158
## 13  Sohn weight   t2    56
## 14  Mija weight   t2    51
## 15 James weight   t2    55
## 16  Mary weight   t2    50
## 17  Sohn height   t3   170
## 18  Mija height   t3   161
## 19 James height   t3   169
## 20  Mary height   t3   159
## 21  Sohn weight   t3    60
## 22  Mija weight   t3    54
## 23 James weight   t3    59
## 24  Mary weight   t3    53

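The two pieces of information in the key column were split at the . (period) into two columns, var and time. Note that sep= takes a regular expression (RE), which is why the period is written as '[.]'.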
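Why the brackets matter can be checked with base R's strsplit, which also treats its split argument as a regular expression by default. This is only a small illustrative sketch; the variable names are made up here.

```r
# A bare '.' is the regex wildcard: every character is treated as a
# separator, so only empty strings are left.
unescaped <- strsplit('height.t1', '.')[[1]]

# '[.]' (or '\\.') matches a literal period.
bracketed <- strsplit('height.t1', '[.]')[[1]]

# fixed = TRUE switches regex interpretation off entirely.
literal <- strsplit('height.t1', '.', fixed = TRUE)[[1]]

print(unescaped)
print(bracketed)
print(literal)
```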

Once you have these two columns, you can spread either one of them back into wide format.

dat2 <- dat %>% 
  gather(key='key', value='value', height.t1:weight.t3) %>%
  separate(key, sep='[.]', into=c('var', 'time'))
dat2 %>% spread(key=var, value=value)

##     name time height weight
## 1  James   t1    159     50
## 2  James   t2    165     55
## 3  James   t3    169     59
## 4   Mary   t1    156     47
## 5   Mary   t2    158     50
## 6   Mary   t3    159     53
## 7   Mija   t1    155     44
## 8   Mija   t2    160     51
## 9   Mija   t3    161     54
## 10  Sohn   t1    158     49
## 11  Sohn   t2    166     56
## 12  Sohn   t3    170     60

dat2 %>% spread(key='time', value=value)

##    name    var  t1  t2  t3
## 1 James height 159 165 169
## 2 James weight  50  55  59
## 3  Mary height 156 158 159
## 4  Mary weight  47  50  53
## 5  Mija height 155 160 161
## 6  Mija weight  44  51  54
## 7  Sohn height 158 166 170
## 8  Sohn weight  49  56  60

Wide to Long 2

What if we want to go from the structured column names straight to the long format shown just above?

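But first, why do it directly instead of going through an intermediate step? Because of memory and disk space, of course. An intermediate step uses memory needlessly and slows things down. With large data, the conversion may even fail for lack of memory.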

`stats::reshape`

Surprisingly, reshape from R's base package stats can do this.

The most basic usage is as follows.

head(dat)

##    name height.t1 weight.t1 height.t2 weight.t2 height.t3 weight.t3
## 1  Sohn       158        49       166        56       170        60
## 2  Mija       155        44       160        51       161        54
## 3 James       159        50       165        55       169        59
## 4  Mary       156        47       158        50       159        53

reshape(dat, direction='long', varying=colnames(dat)[2:7])

##       name time height weight id
## 1.t1  Sohn   t1    158     49  1
## 2.t1  Mija   t1    155     44  2
## 3.t1 James   t1    159     50  3
## 4.t1  Mary   t1    156     47  4
## 1.t2  Sohn   t2    166     56  1
## 2.t2  Mija   t2    160     51  2
## 3.t2 James   t2    165     55  3
## 4.t2  Mary   t2    158     50  4
## 1.t3  Sohn   t3    170     60  1
## 2.t3  Mija   t3    161     54  2
## 3.t3 James   t3    169     59  3
## 4.t3  Mary   t3    159     53  4

Voila! It is not quite the tidy form we wanted, but the basic result is correct.

Let's make a few improvements.

reshape(dat, direction='long',  varying=colnames(dat)[2:7],
        timevar='time',
        times=c('t1', 't2', 't3'),
        idvar='name')

##           name time height weight
## Sohn.t1   Sohn   t1    158     49
## Mija.t1   Mija   t1    155     44
## James.t1 James   t1    159     50
## Mary.t1   Mary   t1    156     47
## Sohn.t2   Sohn   t2    166     56
## Mija.t2   Mija   t2    160     51
## James.t2 James   t2    165     55
## Mary.t2   Mary   t2    158     50
## Sohn.t3   Sohn   t3    170     60
## Mija.t3   Mija   t3    161     54
## James.t3 James   t3    169     59
## Mary.t3   Mary   t3    159     53

Limits of `reshape`: missing values

reshape does not seem to interpret the information in the column names very actively. Consider the following data, in which weight.t2 does not exist.

dat <- read.table(header=T, text='
 name height.t1 weight.t1 height.t2 height.t3 weight.t3
 Sohn  158 49 166 170 60
 Mija  155 44 160 161 54
 James 159 50 165 169 59
 Mary  156 47 158 159 53
 ')
head(dat)

##    name height.t1 weight.t1 height.t2 height.t3 weight.t3
## 1  Sohn       158        49       166       170        60
## 2  Mija       155        44       160       161        54
## 3 James       159        50       165       169        59
## 4  Mary       156        47       158       159        53

reshape(dat, direction='long',  varying=colnames(dat)[2:6],
        timevar='time',
        times=c('t1', 't2', 't3'),
        idvar='name')

## Error in reshapeLong(data, idvar = idvar, timevar = timevar, varying = varying, : 'varying' arguments must be the same length

Of course, there is a way around it.

dat$weight.t2 <- NA
reshape(dat, direction='long',  varying=colnames(dat)[2:7],
        timevar='time',
        times=c('t1', 't2', 't3'),
        idvar='name')

##           name time height weight
## Sohn.t1   Sohn   t1    158     49
## Mija.t1   Mija   t1    155     44
## James.t1 James   t1    159     50
## Mary.t1   Mary   t1    156     47
## Sohn.t2   Sohn   t2    166     60
## Mija.t2   Mija   t2    160     54
## James.t2 James   t2    165     59
## Mary.t2   Mary   t2    158     53
## Sohn.t3   Sohn   t3    170     NA
## Mija.t3   Mija   t3    161     NA
## James.t3 James   t3    169     NA
## Mary.t3   Mary   t3    159     NA

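But something is off. weight.t2 is the missing measurement, yet the NAs above appear in weight at t3. The weight.t2 column was appended after weight.t3, and reshape appears to pair columns with times by their position within each variable rather than by the time suffix in the name. Judging from this, reshape cannot be expected to work correctly in every (general) case.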
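One way to sidestep this, sketched under the assumption that every variable/time combination has a column (added as NA where missing), is to put the columns in a consistent order before calling reshape. measure_cols and dat_sorted are names introduced here for illustration only.

```r
dat <- read.table(header = TRUE, text = '
 name height.t1 weight.t1 height.t2 height.t3 weight.t3
 Sohn  158 49 166 170 60
 Mija  155 44 160 161 54
 James 159 50 165 169 59
 Mary  156 47 158 159 53
 ')
dat$weight.t2 <- NA   # appended last, i.e. after weight.t3

# Sort the measurement columns so that, within each variable,
# the times appear in the order t1, t2, t3.
measure_cols <- sort(setdiff(names(dat), 'name'))
dat_sorted <- dat[, c('name', measure_cols)]

long <- reshape(dat_sorted, direction = 'long', varying = measure_cols,
                timevar = 'time', times = c('t1', 't2', 't3'),
                idvar = 'name')
long
```

With the columns sorted, the NA values should land in the t2 rows of weight and the t3 rows should keep the weight.t3 measurements.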

There is the following alternative, though…

reshape(dat, direction='long', 
        varying=list(c('height.t1', 'height.t2', 'height.t3'),
                     c('weight.t1', 'weight.t2', 'weight.t3')),
        idvar='name', 
        v.names=c('height', 'weight'),
        timevar='time',
        times=c('t1', 't2', 't3'))

##           name time height weight
## Sohn.t1   Sohn   t1    158     49
## Mija.t1   Mija   t1    155     44
## James.t1 James   t1    159     50
## Mary.t1   Mary   t1    156     47
## Sohn.t2   Sohn   t2    166     NA
## Mija.t2   Mija   t2    160     NA
## James.t2 James   t2    165     NA
## Mary.t2   Mary   t2    158     NA
## Sohn.t3   Sohn   t3    170     60
## Mija.t3   Mija   t3    161     54
## James.t3 James   t3    169     59
## Mary.t3   Mary   t3    159     53

dat$weight.t2 <- NA
reshape(dat, direction='long', 
        varying=list(c('height.t1', 'weight.t1'),
                     c('height.t2', 'weight.t2'),
                     c('height.t3', 'weight.t3')),
        idvar='name', 
        v.names=c('t1', 't2', 't3'),
        timevar='var',
        times=c('height', 'weight'))

##               name    var  t1  t2  t3
## Sohn.height   Sohn height 158 166 170
## Mija.height   Mija height 155 160 161
## James.height James height 159 165 169
## Mary.height   Mary height 156 158 159
## Sohn.weight   Sohn weight  49  NA  60
## Mija.weight   Mija weight  44  NA  54
## James.weight James weight  50  NA  59
## Mary.weight   Mary weight  47  NA  53

What to keep?

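This part can be understood as follows. A long format usually stacks the time variable. Once you have specified the variables to convert from wide to long (varying=), the name of the column that results from the conversion (timevar=), and its contents (times=), the basic frame is in place.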

reshape(dat, direction='long', 
        varying=list(c('height.t1', 'weight.t1'),
                     c('height.t2', 'weight.t2'),
                     c('height.t3', 'weight.t3')),
        timevar='var', 
        times=c('height', 'weight'))

##           name    var height.t1 height.t2 height.t3 id
## 1.height  Sohn height       158       166       170  1
## 2.height  Mija height       155       160       161  2
## 3.height James height       159       165       169  3
## 4.height  Mary height       156       158       159  4
## 1.weight  Sohn weight        49        NA        60  1
## 2.weight  Mija weight        44        NA        54  2
## 3.weight James weight        50        NA        59  3
## 4.weight  Mary weight        47        NA        53  4

`data.table::melt`

melt from data.table can be an alternative. To use data.table::melt, first convert the data to a data.table, then decide what to keep in wide format.

DT <- data.table(dat)
melt(DT, measure=patterns('t1', 't2', 't3'))

## Warning in melt.data.table(DT, measure = patterns("t1", "t2", "t3")):
## 'measure.vars' [height.t2, weight.t2] are not all of the same type. By
## order of hierarchy, the molten data value column will be of type 'integer'.
## All measure variables not of type 'integer' will be coerced too. Check
## DETAILS in ?melt.data.table for more on coercion.

##     name variable value1 value2 value3
## 1:  Sohn        1    158    166    170
## 2:  Mija        1    155    160    161
## 3: James        1    159    165    169
## 4:  Mary        1    156    158    159
## 5:  Sohn        2     49     NA     60
## 6:  Mija        2     44     NA     54
## 7: James        2     50     NA     59
## 8:  Mary        2     47     NA     53

melt(DT, measure=patterns('height', 'weight'))

## Warning in melt.data.table(DT, measure = patterns("height", "weight")):
## 'measure.vars' [weight.t1, weight.t3, weight.t2] are not all of the same
## type. By order of hierarchy, the molten data value column will be of type
## 'integer'. All measure variables not of type 'integer' will be coerced too.
## Check DETAILS in ?melt.data.table for more on coercion.

##      name variable value1 value2
##  1:  Sohn        1    158     49
##  2:  Mija        1    155     44
##  3: James        1    159     50
##  4:  Mary        1    156     47
##  5:  Sohn        2    166     60
##  6:  Mija        2    160     54
##  7: James        2    165     59
##  8:  Mary        2    158     53
##  9:  Sohn        3    170     NA
## 10:  Mija        3    161     NA
## 11: James        3    169     NA
## 12:  Mary        3    159     NA

Set the name of the variable column, and of value1 and value2, as appropriate.

melt(DT, measure=patterns('height', 'weight'),
     variable.name='time', 
     value.name=c('height', 'weight'))

## Warning in melt.data.table(DT, measure = patterns("height", "weight"),
## variable.name = "time", : 'measure.vars' [weight.t1, weight.t3, weight.t2]
## are not all of the same type. By order of hierarchy, the molten data
## value column will be of type 'integer'. All measure variables not of type
## 'integer' will be coerced too. Check DETAILS in ?melt.data.table for more
## on coercion.

##      name time height weight
##  1:  Sohn    1    158     49
##  2:  Mija    1    155     44
##  3: James    1    159     50
##  4:  Mary    1    156     47
##  5:  Sohn    2    166     60
##  6:  Mija    2    160     54
##  7: James    2    165     59
##  8:  Mary    2    158     53
##  9:  Sohn    3    170     NA
## 10:  Mija    3    161     NA
## 11: James    3    169     NA
## 12:  Mary    3    159     NA

But it does not work properly when a measurement is missing. Apparently, apart from the names given in patterns(), it only considers the order of the columns.

DT <- data.table(dat)
DT[ , weight.t2:=NULL]
DTlong1 <- melt(DT, measure=patterns('t1', 't2', 't3'))
DTlong1

##     name variable value1 value2 value3
## 1:  Sohn        1    158    166    170
## 2:  Mija        1    155    160    161
## 3: James        1    159    165    169
## 4:  Mary        1    156    158    159
## 5:  Sohn        2     49     NA     60
## 6:  Mija        2     44     NA     54
## 7: James        2     50     NA     59
## 8:  Mary        2     47     NA     53

DTlong2 <- melt(DT, measure=patterns('height', 'weight'))
DTlong2

##      name variable value1 value2
##  1:  Sohn        1    158     49
##  2:  Mija        1    155     44
##  3: James        1    159     50
##  4:  Mary        1    156     47
##  5:  Sohn        2    166     60
##  6:  Mija        2    160     54
##  7: James        2    165     59
##  8:  Mary        2    158     53
##  9:  Sohn        3    170     NA
## 10:  Mija        3    161     NA
## 11: James        3    169     NA
## 12:  Mary        3    159     NA

It is a little disappointing that there is no way to set the contents of variable directly.

DTlong1[, variable:=c('height', 'weight')[variable]]
DTlong1

##     name variable value1 value2 value3
## 1:  Sohn   height    158    166    170
## 2:  Mija   height    155    160    161
## 3: James   height    159    165    169
## 4:  Mary   height    156    158    159
## 5:  Sohn   weight     49     NA     60
## 6:  Mija   weight     44     NA     54
## 7: James   weight     50     NA     59
## 8:  Mary   weight     47     NA     53

DTlong2[ , variable:=c('t1', 't2', 't3')[variable]] 
DTlong2

##      name variable value1 value2
##  1:  Sohn       t1    158     49
##  2:  Mija       t1    155     44
##  3: James       t1    159     50
##  4:  Mary       t1    156     47
##  5:  Sohn       t2    166     60
##  6:  Mija       t2    160     54
##  7: James       t2    165     59
##  8:  Mary       t2    158     53
##  9:  Sohn       t3    170     NA
## 10:  Mija       t3    161     NA
## 11: James       t3    169     NA
## 12:  Mary       t3    159     NA
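The relabeling above relies on the fact that indexing a character vector with a factor uses the factor's integer codes. Below is a small base-R sketch of that behaviour with made-up vector names. Passing labels to factor() is shown as an alternative that does not depend on this indexing rule.

```r
# By default (variable.factor = TRUE) melt() returns 'variable' as a factor;
# here a stand-in factor with levels "1", "2", "3" is built by hand.
variable <- factor(c(1, 1, 2, 3))
time_labels <- c('t1', 't2', 't3')

# Indexing by a factor uses its integer codes, not its level names.
by_code <- time_labels[variable]

# Equivalent relabeling that keeps the result a factor.
relabeled <- factor(variable, labels = time_labels)

print(by_code)
print(as.character(relabeled))
```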

Conclusion

With structured column names, the simplest approach seems to be to specify the variables to keep in wide format, as follows. (Note that the helper is patterns(), not pattern().)

data.table::melt(DT, measure=patterns('t1', 't2', 't3'))

##     name variable value1 value2 value3
## 1:  Sohn        1    158    166    170
## 2:  Mija        1    155    160    161
## 3: James        1    159    165    169
## 4:  Mary        1    156    158    159
## 5:  Sohn        2     49     NA     60
## 6:  Mija        2     44     NA     54
## 7: James        2     50     NA     59
## 8:  Mary        2     47     NA     53

Do not forget to convert the data to a data.table first, and to keep the order and number of the columns that become long consistent!
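A minimal sketch of that preparation, assuming the missing combination is added as an NA column and the measurement columns can simply be sorted by name. DTlong is a name introduced here for illustration, and the '^height' / '^weight' anchors are a slightly stricter variant of the patterns used earlier.

```r
library(data.table)

DT <- data.table(read.table(header = TRUE, text = '
 name height.t1 weight.t1 height.t2 height.t3 weight.t3
 Sohn  158 49 166 170 60
 Mija  155 44 160 161 54
 James 159 50 165 169 59
 Mary  156 47 158 159 53
 '))

# Add the missing combination explicitly, typed as integer so that
# the measurement columns all share one type.
DT[, weight.t2 := NA_integer_]

# Consistent column order: height.t1..t3, then weight.t1..t3.
setcolorder(DT, c('name', sort(setdiff(names(DT), 'name'))))

DTlong <- melt(DT, measure.vars = patterns('^height', '^weight'),
               variable.name = 'time', value.name = c('height', 'weight'))

# Replace the positional index 1, 2, 3 with the time labels.
DTlong[, time := c('t1', 't2', 't3')[time]]
DTlong
```

With the columns aligned this way, the NA values should appear in the t2 rows of weight rather than in the t3 rows.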

Tags: data.table melt reshape

admin
