Visualization, February 26, 2019

Exploratory Data Analysis

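Looking at the data (summary statistics, quick visualization)

Look over the data as a whole to gauge the direction of the analysis. This step can also reveal errors in the data. Pay particular attention to outliers and missing values.

Data almost always contain errors, large and small.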
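As a rough illustration of what "paying attention to outliers and missing values" can look like in practice, the sketch below counts the NAs in a vector and flags values by the usual boxplot rule (more than 1.5 times the IQR beyond the hinges). The toy vector is made up for the example and is not part of the original post.

```r
# Hypothetical toy vector: five ordinary values, one extreme value, one NA
x <- c(10, 12, 11, 13, 12, 95, NA)

n_missing <- sum(is.na(x))      # number of missing values
out <- boxplot.stats(x)$out     # values beyond 1.5 * IQR from the hinges (NAs are ignored)

n_missing
out
```

Whether a flagged value is a genuine error or simply an unusual observation still has to be decided by looking at the data; the rule only points at candidates.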

Checking the overall structure of the data frame df

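df <- mtcars
head(df, n=10); tail(df); View(df)        # first and last rows; View() opens the data viewer in an interactive session
dim(df); nrow(df); ncol(df)               # dimensions
colnames(df); rownames(df)                # column and row names
apply(df, 2, function(x) sum(is.na(x)))   # number of missing values per column
sum(duplicated(df))                       # number of duplicated rows
sum(!complete.cases(df))                  # number of rows with at least one missing value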
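mtcars itself is clean, so the checks above all return zero. The following sketch shows what they report once a problem is present. The injected NA and the duplicated row are hypothetical modifications made only for this illustration.

```r
df <- mtcars
df[2, "mpg"] <- NA           # hypothetical: inject one missing value
df <- rbind(df, df[1, ])     # hypothetical: append a copy of the first row

str(df)                      # column types and the first few values of each column

na_per_col   <- colSums(is.na(df))        # missing values per column
n_dup        <- sum(duplicated(df))       # duplicated rows (row names are ignored)
n_incomplete <- sum(!complete.cases(df))  # rows containing at least one NA

na_per_col["mpg"]
n_dup
n_incomplete
```

colSums(is.na(df)) gives the same per-column counts as the apply() call above and works directly on the logical matrix returned by is.na().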

Summary statistics
- Measures of central tendency: mean, median, mode, trimmed mean
- Measures of variability: range, interquartile range, variance, standard deviation
Data visualization
- One-variable visualization: discrete data (plot), continuous data (hist)
- Two-variable visualization: plot(y ~ x, data = )
- Conditional visualization: xyplot(y ~ x | g, data = )

Summary statistics

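x <- rnorm(100)
x <- c(x, 2, 2)   # append the value 2 twice so that the sample has a repeated value

# Measures of central tendency
mean(x)
median(x)
Mode <- function(x) {
  ux <- unique(x)
  ux[which.max(tabulate(match(x, ux)))]   # the unique value with the highest count
}
Mode(x) # table() gives the same counts
mean(x, trim = 0.2)   # drop 20% of the observations from each end before averaging

# Measures of variability
range(x)
IQR(x)
var(x)
sd(x)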

The function Mode is taken from Stack Overflow.
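The "# table" comment next to Mode(x) hints at an alternative based on a frequency table. A minimal sketch of that idea follows, together with a quick look at why the trimmed mean is listed next to the mean. The helper name mode_table, the fixed seed, and the added outlier of 1000 are assumptions made for this example.

```r
set.seed(1)                      # assumption: fixed seed so the example is reproducible
x <- c(rnorm(100), 2, 2)         # 2 is the only repeated value

# Mode via a frequency table; the table names are strings, so convert back to numeric
mode_table <- function(x) {
  tab <- table(x)
  as.numeric(names(tab)[which.max(tab)])
}
mode_table(x)

# Effect of a single gross outlier on the mean and on the trimmed mean
y <- c(x, 1000)                                        # hypothetical outlier
shift_mean    <- mean(y) - mean(x)                     # the plain mean moves by several units
shift_trimmed <- mean(y, trim = 0.2) - mean(x, trim = 0.2)  # the trimmed mean barely moves
shift_mean
shift_trimmed
```

For continuous data the mode is only meaningful when values repeat, which is why the post appends 2 twice before calling Mode.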

To get summary statistics for every column of a given data frame, use apply(df, 2, func).

data(mtcars)
apply(mtcars, 2, mean)

data("BankWages", package='AER')
apply(BankWages, 2, Mode)
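One thing to keep in mind with apply() on a data frame is that it converts the data frame to a matrix first, so with mixed column types every value becomes a character string. The sketch below compares it with lapply() and sapply(), which work column by column. The small data frame d is hypothetical, and Mode is repeated so the block runs on its own.

```r
# Hypothetical small data frame with one factor and one numeric column
d <- data.frame(
  g = factor(c("a", "b", "a", "a")),
  v = c(1.5, 2.0, 1.5, 3.0)
)

Mode <- function(x) {
  ux <- unique(x)
  ux[which.max(tabulate(match(x, ux)))]
}

by_apply  <- apply(d, 2, Mode)    # goes through as.matrix(): results are character strings
by_lapply <- lapply(d, Mode)      # column by column: factor stays factor, numeric stays numeric
num_means <- sapply(d[sapply(d, is.numeric)], mean)  # means of the numeric columns only

by_apply
by_lapply
num_means
```

For an all-numeric data frame such as mtcars the conversion is harmless, which is why apply(mtcars, 2, mean) works as expected.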

R also provides simpler functions for this. Try summary, psych::describe, and the like.

data('singer', package='lattice')
summary(mtcars)
summary(BankWages)
summary(singer)
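The post mentions psych::describe but only calls summary. A small sketch of both follows. It assumes psych may not be installed, so the describe() call is guarded, and it uses only mtcars so that no other package is required.

```r
s <- summary(mtcars$mpg)           # numeric vector: min, quartiles, median, mean, max
f <- summary(factor(mtcars$cyl))   # factor: number of observations per level
s
f

# psych is an optional package; describe() is only called when it is available
if (requireNamespace("psych", quietly = TRUE)) {
  print(psych::describe(mtcars))
}
```

summary() adapts to the column type, which is why its output for BankWages (mostly factors) looks different from its output for mtcars (all numeric).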

Quick data visualization

One-variable visualization
- Discrete: plot
- Continuous: hist
Two-variable visualization
- plot(y ~ x, data= )
Conditional one-variable visualization
- Discrete: lattice::histogram(~ x | g, data= )
- Continuous: lattice::histogram(~ x | g, data= )
Conditional two-variable visualization
- x, y discrete: lattice::histogram(~ y | x*g, data = )
- x discrete, y continuous: lattice::xyplot(y ~ x | g, data = , jitter.x = TRUE)
- x, y continuous: lattice::xyplot()
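The conditional patterns in the list above can be tried without any extra data package. The following sketch applies three of them to mtcars. Treating cyl and am as factors, and the labels given to am, are choices made for this example rather than something the original post does.

```r
library(lattice)

# Assumption for this sketch: treat cyl and am as discrete variables
cars <- transform(mtcars,
                  cyl = factor(cyl),
                  am  = factor(am, labels = c("automatic", "manual")))

p1 <- histogram(~ mpg | am, data = cars)                    # conditional one-variable plot
p2 <- xyplot(mpg ~ cyl | am, data = cars, jitter.x = TRUE)  # discrete x, continuous y
p3 <- xyplot(qsec ~ hp | am, data = cars)                   # continuous x and y

print(p1); print(p2); print(p3)
```

lattice functions return trellis objects, so inside scripts or functions they need an explicit print() to be drawn; at the console they print automatically.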

Visualization results

library(lattice)
data('BankWages', package='AER')
data(mtcars)
# One variable
hist(BankWages$education)        # lattice equivalent: histogram(~education, data=BankWages)
plot( ~ gender, data=BankWages)  # a factor is drawn as a bar plot

[Figure: output of the one-variable plots above]

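# Two variables
library(dplyr)   # provides %>%, slice() and filter()
# Bank: a modified copy that keeps only the men among the first 200 rows
Bank1 <- BankWages %>% slice(1:200) %>% filter(gender == 'male')
Bank2 <- BankWages %>% slice(-(1:200))
Bank <- rbind(Bank1, Bank2)

plot(job ~ gender, data=BankWages)        # factor ~ factor: spine plot
plot(job ~ gender, data=Bank)
plot(education ~ gender, data=BankWages)  # numeric ~ factor: box plot
plot(education ~ gender, data=Bank)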

[Figure: output of the two-variable plots above]

# Comparison
library(ggplot2)
ggplot(data = Bank, aes(y=education, x=gender)) + geom_boxplot()
ggplot(data = Bank, aes(y=education, x=gender)) + geom_point()
ggplot(data = Bank, aes(y=education, x=gender)) + geom_jitter()
ggplot(data = Bank, aes(y=education, x=gender)) + geom_violin()
ggplot(data = Bank, aes(y=education, x=gender)) + geom_jitter(width=0.02, alpha=0.2) + theme_minimal()
ggplot(data = Bank, aes(y=education, x=gender)) + 
  geom_jitter(width=0.02, height = 0.1, alpha=0.1) + theme_minimal()
ggplot(data = Bank, aes(y=education, x=gender)) + 
  geom_dotplot(binaxis = 'y', stackdir = 'center', dotsize=0.3, width=0.9)

## `stat_bindot()` using `bins = 30`. Pick better value with `binwidth`.

ggplot(data = Bank, aes(y=ordered(education), x=gender)) + 
  geom_count()

plot(qsec ~ hp, mtcars)

[Figure: output of the comparison plots above]

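# Conditional plots
histogram( ~ job | gender * minority, BankWages)
library(dplyr)
par(mfcol=c(1,2))   # two base-graphics plots side by side
# one plot per level of minority; do() must return a data frame, so return an empty one
BankWages %>% group_by(minority) %>% do({
  plot(job~gender, main = .$minority[1], data=.)
  data.frame()
})
xyplot(education ~ job | gender, BankWages)
xyplot(education ~ job | gender, BankWages, jitter.x=TRUE)
xyplot(qsec ~ hp | am, mtcars)
xyplot(qsec ~ hp | mpg, mtcars)   # one panel per distinct mpg value: too many panels
mpgequal <- equal.count(mtcars$mpg, number=3, overlap=0)   # three non-overlapping mpg intervals
xyplot(qsec ~ hp | mpgequal, mtcars)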

## # A tibble: 0 x 1
## # Groups:   minority [0]
## # ... with 1 variable: minority <fct>

[Figure: output of the conditional plots above]
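The last two xyplot calls differ only in how the continuous conditioning variable mpg is handled. Conditioning on mpg directly gives one panel per distinct value, while equal.count() turns it into a shingle with a few intervals. A minimal sketch for inspecting that shingle follows. The name mpg_shingle and the layout argument are choices made for this example.

```r
library(lattice)

# Split mpg into 3 non-overlapping intervals holding roughly equal numbers of cars
mpg_shingle <- equal.count(mtcars$mpg, number = 3, overlap = 0)

levels(mpg_shingle)              # the interval boundaries of the three panels
n_panels <- length(levels(mpg_shingle))

p <- xyplot(qsec ~ hp | mpg_shingle, data = mtcars, layout = c(3, 1))
print(p)
```

With overlap = 0 each car falls into one interval; a positive overlap lets neighbouring panels share observations, which can make trends across panels easier to follow.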

Tags: EDA, histogram, lattice, xyplot, exploratory data analysis

admin
