Calculating standard deviations for survey subgroups in R


10 September 2018

The survey package is one of my favourites in R.

Among its many other uses, it can compute summary statistics by subgroups. For example, if you have a survey of individuals from several countries with an item on the respondents’ income, you can calculate the average income in each subgroup with the svyby() function.
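As a minimal sketch of such a call, the example below uses the api data set that ships with the survey package as a stand-in for a real survey. The variable names (api00 for the outcome, stype for the subgroups, pw for the weights) belong to that example data set, not to the data discussed in this post, and the weights-only design is an assumption made for simplicity.

```r
library(survey)

# Example data shipped with the survey package (California schools),
# used here only as a stand-in for real survey data.
data(api)

# Weights-only design (an assumption for illustration; a real survey
# may also need clusters or strata).
design <- svydesign(ids = ~1, data = apistrat, weights = ~pw)

# Weighted mean of api00 within each school type, with standard errors.
means_by_type <- svyby(~api00, ~stype, design, svymean)
print(means_by_type)
```

The result has one row per subgroup, with the mean in the api00 column and its standard error in the se column.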

However, like many other functions in the package, svyby() returns the standard errors of the mean values but not the standard deviations. It is possible to calculate the variance, and from it the standard deviation, for all observations with the svyvar() function. However, svyvar() has no argument for subgroup analysis, so repeating the calculation for each subgroup quickly becomes tiresome if you have many subgroups or variables to work with.
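To illustrate the full-sample case, here is a small sketch that takes the square root of the svyvar() estimate. It again relies on the api example data from the survey package (api00 and pw are names from that data set) and on a weights-only design, both of which are assumptions for illustration.

```r
library(survey)

# Example data shipped with the survey package, used as a stand-in.
data(api)

# Weights-only design, assumed here for simplicity.
design <- svydesign(ids = ~1, data = apistrat, weights = ~pw)

# Weighted variance of api00 across all observations.
var_api00 <- svyvar(~api00, design, na.rm = TRUE)

# The standard deviation is the square root of the variance estimate.
sd_api00 <- sqrt(as.numeric(coef(var_api00)))
print(sd_api00)
```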

I have recently been working with the European Social Survey data, and I needed to calculate the standard deviations of three variables in 23 countries. Hence, I have written a loop, which I am sure I will use again.

Hoping that it might be useful for others as well, I am leaving it here:

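library(survey)
library(tidyverse)

# Collect one row of standard deviations per subgroup
datalist <- list()

for (i in levels(as.factor(df$subgroups))) {

  # Keep the rows of the current subgroup; which() drops rows where subgroups is NA
  df_subset <- df[which(df$subgroups == i), ]

  # Weights-only design with no clusters
  design <- svydesign(ids = ~1, data = df_subset, weights = ~surveyweight)

  # svyvar() returns the weighted variance; its square root is the standard deviation
  dat <- data.frame(
    sqrt(as.numeric(coef(svyvar(~variable1, design = design, na.rm = TRUE)))),
    sqrt(as.numeric(coef(svyvar(~variable2, design = design, na.rm = TRUE))))
  )

  colnames(dat) <- c("sd.variable1", "sd.variable2")

  datalist[[i]] <- dat
}

# Stack the per-subgroup rows and move the row names into a subgroups column
df_sd <- as.data.frame(do.call(rbind, datalist)) %>%
  rownames_to_column() %>%
  rename(subgroups = rowname)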

where you will need to replace

df with your data frame,
subgroups with the name of the variable that defines the subgroups (e.g. countries),
surveyweight with the name of the weight variable in your survey, and
variable1 and variable2 with the names of the variables for which you are calculating the standard deviations.
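If you would rather not edit the loop each time, one possible variation is to wrap the same steps in a function that takes the names as arguments. The sketch below is only an illustration: subgroup_sd() is a hypothetical helper, not part of the survey package, and it is tried out on the api example data that ships with the package (stype, pw, api00 and api99 are names from that data set).

```r
library(survey)

# Hypothetical helper (not part of the survey package): weighted standard
# deviations of several variables within each subgroup.
subgroup_sd <- function(data, group, weight, vars) {
  groups <- levels(as.factor(data[[group]]))
  rows <- lapply(groups, function(g) {
    # Rows of the current subgroup; which() drops rows with a missing group.
    subset_g <- data[which(data[[group]] == g), ]
    # Weights-only design, as in the loop above.
    design <- svydesign(ids = ~1, data = subset_g,
                        weights = reformulate(weight))
    # Square root of the weighted variance for each requested variable.
    sds <- sapply(vars, function(v) {
      sqrt(as.numeric(coef(svyvar(reformulate(v), design, na.rm = TRUE))))
    })
    data.frame(subgroups = g, as.list(setNames(sds, paste0("sd.", vars))))
  })
  do.call(rbind, rows)
}

# Try it on the example data shipped with the survey package.
data(api)
sd_by_type <- subgroup_sd(apistrat, "stype", "pw", c("api00", "api99"))
print(sd_by_type)
```

The function returns one row per subgroup and one sd. column per variable, so adding a third variable only means extending the vars argument.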
