前一篇介绍了NHANES数据库的加权及数据的下载NHANSE数据库的介绍及使用(一)_Christina-CSDN博客,这一篇主要介绍数据库如何导入软件进行下一步计算合并。
例一:
以NHANSE数据库的文章为例(Brody DJ, Pratt LA, Hughes J. Prevalence of depression among adults aged 20 and over: United States, 2013-2016. NCHS Data Brief, no 303. Hyattsville, MD: National Center for Health Statistics. 2018.)
1.加载安装包
library(dplyr) library(survey)
2.下载数据
此步骤可以在官网上下载,或使用软件下载。
# Download & Read SAS Transport Files # Demographic (DEMO) download.file("https://wwwn.cdc.gov/nchs/nhanes/2013-2014/DEMO_H.XPT", tf <- tempfile(), mode="wb") DEMO_H <- foreign::read.xport(tf)[,c("SEQN","RIAGENDR","RIDAGEYR","SDMVSTRA","SDMVPSU","WTMEC2YR")] download.file("https://wwwn.cdc.gov/nchs/nhanes/2015-2016/DEMO_I.XPT", tf <- tempfile(), mode="wb") DEMO_I <- foreign::read.xport(tf)[,c("SEQN","RIAGENDR","RIDAGEYR","SDMVSTRA","SDMVPSU","WTMEC2YR")] # Mental Health - Depression Screener (DPQ) download.file("http://wwwn.cdc.gov/nchs/nhanes/2013-2014/DPQ_H.XPT", tf <- tempfile(), mode="wb") DPQ_H <- foreign::read.xport(tf) download.file("http://wwwn.cdc.gov/nchs/nhanes/2015-2016/DPQ_I.XPT", tf <- tempfile(), mode="wb") DPQ_I <- foreign::read.xport(tf)
3.合并数据
# Append Files DEMO <- bind_rows(DEMO_H, DEMO_I) DPQ <- bind_rows(DPQ_H, DPQ_I) # Merge DEMO and DPQ files and create derived variables One <- left_join(DEMO, DPQ, by="SEQN") %>% # Set 7=Refused and 9=Don't Know To Missing for variables DPQ010 thru DPQ090 ## mutate_at(vars(DPQ010:DPQ090), ~ifelse(. >=7, NA, .)) %>% mutate(. , # create indicator for overall summary one = 1, # Create depression score as sum of variables DPQ010 -- DPQ090 Depression.Score = rowSums(select(. , DPQ010:DPQ090)), # Create depression indicator as binary 0/100 variable. (is missing if Depression.Score is missing) Depression= ifelse(Depression.Score >=10, 100, 0), # Create factor variables Gender = factor(RIAGENDR, labels=c("Men", "Women")), Age.Group = cut(RIDAGEYR, breaks=c(-Inf,19,39,59,Inf),labels=c("Under 20", "20-39","40-59","60 and over")), # Generate 4-year MEC weight (Divide weight by 2 because we are appending 2 survey cycles) # Note: using the MEC Exam Weights (WTMEC2YR), per the analytic notes on the # Mental Health - Depression Screener (DPQ_H) documentation WTMEC4YR = WTMEC2YR/2 , # Define indicator for analysis population of interest: adults aged 20 and over with a valid depression score inAnalysis= (RIDAGEYR >= 20 & !is.na(Depression.Score)) ) %>% # drop DPQ variables select(., -starts_with("DPQ"))
由于使用了两年的数据,因此weight需要计算,WTMEC4YR=WTMEC2YR
4.定义survey数据集
NHANES_all <- svydesign(data=One, id=~SDMVPSU, strata=~SDMVSTRA, weights=~WTMEC4YR, nest=TRUE)
选择子集
NHANES <- subset(NHANES_all, inAnalysis)
5.统计分析
计算加权均值及标准差,定义函数
getSummary <- function(varformula, byformula, design){ # Get mean, stderr, and unweighted sample size c <- svyby(varformula, byformula, design, unwtd.count ) p <- svyby(varformula, byformula, design, svymean ) outSum <- left_join(select(c,-se), p) outSum }
计算抑郁分层的结果
getSummary(~Depression, ~one, NHANES) #' By sex getSummary(~Depression, ~Gender, NHANES) #' By age getSummary(~Depression, ~Age.Group, NHANES) #' By sex and age getSummary(~Depression, ~Gender + Age.Group, NHANES)
注意,在NHANSE数据库使用过程中,首先要定义survey数据集,再进行subset运算, 不能直接subset取子集计算,否则会导致有偏估计。
例二:以今年发表在EST上的文献为例
Exposure: chloroform (TCM); bromodichloromethane (BDCM); dibromochloromethane (DBCM); bromoform (TBM)
Outcome: thyroid function (FT4;FT3; TT4;TT3; TPOAb; TgAb)
Exclusion criterion: thyroid diseases; prescription medications, pregnant status, <20 years old
Covariates: demographic data; serum cotinine
Year: 2007-2008
1.数据下载
根据文献中的暴露变量及结局变量,协变量,下载相应数据集。
2.数据导入
setwd("C:\\Users\\18896\\Desktop\\NHANSE20211110\\example1") library(foreign) DEMO_E <- read.xport("DEMO_E.XPT") BMX_E<-read.xport("BMX_E.XPT") MCQ_E <- read.xport("MCQ_E.XPT") RHQ_E <- read.xport("RHQ_E.XPT") RXQ_RX_E <- read.xport("RXQ_RX_E.XPT") RXQ_RX_E<-subset(RXQ_RX_E,!duplicated(RXQ_RX_E$SEQN)) THYROD_E <- read.xport("THYROD_E.XPT") VOCMWB_E <- read.xport("VOCMWB_E.XPT")
3.数据合并
##数据库整合 ##并集(Union) data_E <- DEMO_E data_E <- merge(data_E, MCQ_E, by = "SEQN", all = T) data_E <- merge(data_E, BMX_E, by = "SEQN", all = T) data_E <- merge(data_E, RHQ_E, by = "SEQN", all = T) data_E <- merge(data_E, RXQ_RX_E, by = "SEQN", all = T) data_E <- merge(data_E, THYROD_E, by = "SEQN", all = T) data_E <- merge(data_E, VOCMWB_E, by = "SEQN", all = T)
SEQN为唯一ID识别码,注意merge时,使用all=T,否则会丢失样本。
4.数据重命名
在进行数据计算时,需要将我们选择的变量进行重新命名以更好识别
data_new <- plyr::rename(data_E, c(RIDAGEYR="age",DMDEDUC2="Education", RIDEXPRG="pregnant.status",RIAGENDR="Gender", RIDRETH1="race", BMXWT="weight",BMXHT="height",BMXBMI="BMI", LBXVBF="Bromoform",LBXVBM="Bromodichloromethane",LBXVCF="Chloroform", LBXVCM="Dibromochloromethane", LBXT3F="FT3",LBXT4F="FT4",LBXTT3="TT3",LBXTT4="TT4", LBXTPO="TPOAb",LBXATG="TgAb", RXDUSE="medication",MCQ160M="thyroid.deseases" ))
5.数据加权
library(survey) design <- svydesign(id=~SDMVPSU, strata=~SDMVSTRA, weights=~WTMEC2YR, nest=TRUE,data=data_new) design_new<-subset(design,SEQN%in%VOCMWB_E$SEQN & age>=20 & thyroid.deseases!=1 & !is.na(FT4) & !is.na(FT3) &!is.na(TT3) &!is.na(TT4) & !is.na(TPOAb) & !is.na(TgAb)) data_2<-subset(data_new,SEQN%in%VOCMWB_E$SEQN & age>=20 & thyroid.deseases!=1 & !is.na(FT4) & !is.na(FT3) &!is.na(TT3) &!is.na(TT4) & !is.na(TPOAb) & !is.na(TgAb))
6.一般情况分析
计算了数据中的年龄及种族加权及未加权的均值或比例,可以看出加权及未加权结果有很大差异,对数据进行基线信息描述时,应该使用加权结果。
#unweighted age and se mean(data_2$age,na.rm=T) #49.54916 # weighted age and se svymean(~age, design_new, na.rm = TRUE) #45.874 #' Proportion of unweighted interview sample data_2 %>% count(race) %>% mutate(prop= round(n / sum(n)*100, digits=1)) #' Proportion of weighted interview sample data_2 %>% count(race, wt=WTMEC2YR) %>% mutate(prop= round(n / sum(n)*100, digits=1))
具体在论文中呈现时,可以参考以下方式
7.svyglm分析
使用常规的glm和weighted glm会对结果进行有偏估计,应该在构建survey数据库的基础上,进行svyglm分析,以下是三个方法的比较。
#glm Result2 <- glm(TT4~Bromoform+age+Gender+race+BMI+Education, family = gaussian(), data=data_2) summary(Result2) #weighted glm Result3 <- glm(TT4~Bromoform+age+Gender+race+BMI+Education, family = gaussian(), data=data_2,weights =WTMEC2YR ) summary(Result3) #survey-weighted glm Result1 <- svyglm(TT4~Bromoform+age+Gender+race+BMI+Education, family = gaussian(), data=data_new,design=design_new) summary(Result1)
ref:
Sun Y, Xia PF, Korevaar TI, Mustieles V, Zhang Y, Pan XF, Wang YX, Messerlian C. Relationship between Blood Trihalomethane Concentrations and Serum Thyroid Function Measures in US Adults. Environmental Science & Technology. 2021 Oct 7.
Brody DJ, Pratt LA, Hughes JP. Prevalence of depression among adults aged 20 and over: United States, 2013-2016.
Emecen-Huja P, Li HF, Ebersole JL, Lambert J, Bush H. Epidemiologic evaluation of Nhanes for environmental Factors and periodontal disease. Scientific reports. 2019 Jun 3;9(1):1-1.