Recoding Variables

Daniel Lüdecke

2019-09-24

Data preparation is a common task in research, which usually takes the most amount of time in the analytical process. sjmisc is a package with special focus on transformation of variables that fits into the workflow and design-philosophy of the so-called “tidyverse”.

Basically, this package complements the dplyr package in that sjmisc takes over data transformation tasks on variables, like recoding, dichotomizing or grouping variables, setting and replacing missing values, etc. A distinctive feature of sjmisc is the support for labelled data, which is especially useful for users who often work with data sets from other statistical software packages like SPSS or Stata.

This vignette demonstrate some of the important recoding-functions in sjmisc. The examples are based on data from the EUROFAMCARE project, a survey on the situation of family carers of older people in Europe. The sample data set efc is part of this package.

library(sjmisc)
data(efc)

To show the results after recoding variables, the frq() function is used to print frequency tables.

Dichotomization: dividing variables into two groups

dicho() dichotomizes variables into “dummy” variables (with 0/1 coding). Dichotomization is either done by median, mean or a specific value (see argument dich.by).

Like all recoding-functions in sjmisc, dicho() returns the complete data frame including the recoded variables, if the first argument is a data.frame. If the first argument is a vector, only the recoded variable is returned. See this vignette for details about the function-design.

If dicho() returns a data frame, the recoded variables have the same name as the original variable, including a suffix _d.

# age, ranged from 65 to 104, in this output
# grouped to get a shorter table
frq(efc, e17age, auto.grp = 5)
#> 
#> elder' age (e17age) <numeric>
#> # total N=908  valid N=891  mean=79.12  sd=8.09
#> 
#>  val  label frq raw.prc valid.prc cum.prc
#>    1  65-72 212   23.35     23.79   23.79
#>    2  73-80 277   30.51     31.09   54.88
#>    3  81-88 270   29.74     30.30   85.19
#>    4  89-96 124   13.66     13.92   99.10
#>    5 97-104   8    0.88      0.90  100.00
#>   NA   <NA>  17    1.87        NA      NA

# splitting is done at the median by default:
median(efc$e17age, na.rm = TRUE)
#> [1] 79

# the recoded variable is now named "e17age_d"
efc <- dicho(efc, e17age)
frq(efc, e17age_d)
#> 
#> elder' age (e17age_d) <categorical>
#> # total N=908  valid N=891  mean=0.49  sd=0.50
#> 
#>  val frq raw.prc valid.prc cum.prc
#>    0 455   50.11     51.07   51.07
#>    1 436   48.02     48.93  100.00
#>   NA  17    1.87        NA      NA

As dicho(), like all recoding-functions, supports labelled data, the variable preserves it variable label (but not the value labels). You can directly define value labels inside the function:

x <- dicho(efc$e17age, val.labels = c("young age", "old age"))
frq(x)
#> 
#> elder' age (x) <categorical>
#> # total N=908  valid N=891  mean=0.49  sd=0.50
#> 
#>  val     label frq raw.prc valid.prc cum.prc
#>    0 young age 455   50.11     51.07   51.07
#>    1   old age 436   48.02     48.93  100.00
#>   NA      <NA>  17    1.87        NA      NA

To split a variable at a different value, use the dich.by-argment. The value specified in dich.by is inclusive, i.e. all values from lowest to and including dich.by are recoded into the lower category, while all values above dich.by are recoded into the higher category.

# split at upper quartile
x <- dicho(
  efc$e17age, 
  dich.by = quantile(efc$e17age, probs = .75, na.rm = TRUE), 
  val.labels = c("younger three quarters", "oldest quarter")
)
frq(x)
#> 
#> elder' age (x) <categorical>
#> # total N=908  valid N=891  mean=0.24  sd=0.43
#> 
#>  val                  label frq raw.prc valid.prc cum.prc
#>    0 younger three quarters 678   74.67     76.09   76.09
#>    1         oldest quarter 213   23.46     23.91  100.00
#>   NA                   <NA>  17    1.87        NA      NA

Since the distribution of values in a dataset may differ for different subgroups, all recoding-functions also work on grouped data frames. In the following example, first, the age-variable e17age is dichotomized at the median. Then, the data is grouped by gender (c161sex) and the dichotomization is done for each subgroup, i.e. it once relates to the median age in the subgroup of female, and once to the median age in the subgroup of male family carers.

data(efc)
x1 <- dicho(efc$e17age)

x2 <- efc %>% 
  dplyr::group_by(c161sex) %>% 
  dicho(e17age) %>% 
  dplyr::pull(e17age_d)

# median age of total sample
frq(x1)
#> 
#> elder' age (x) <categorical>
#> # total N=908  valid N=891  mean=0.49  sd=0.50
#> 
#>  val frq raw.prc valid.prc cum.prc
#>    0 455   50.11     51.07   51.07
#>    1 436   48.02     48.93  100.00
#>   NA  17    1.87        NA      NA

# median age of total sample, with median-split applied
# to distibution of age by subgroups of gender
frq(x2)
#> 
#> elder' age (x) <numeric>
#> # total N=908  valid N=891  mean=1.50  sd=0.50
#> 
#>  val frq raw.prc valid.prc cum.prc
#>    1 449   49.45     50.39   50.39
#>    2 442   48.68     49.61  100.00
#>   NA  17    1.87        NA      NA

Splitting variables into several groups

split_var() recodes numeric variables into equal sized groups, i.e. a variable is cut into a smaller number of groups at specific cut points. The amount of groups depends on the n-argument and cuts a variable into n quantiles.

Similar to dicho(), if the first argument in split_var() is a data frame, the complete data frame including the new recoded variable(s), with suffix _g, is returned.

x <- split_var(efc$e17age, n = 3)
frq(x)
#> 
#> elder' age (x) <categorical>
#> # total N=908  valid N=891  mean=2.05  sd=0.82
#> 
#>  val frq raw.prc valid.prc cum.prc
#>    1 274   30.18     30.75   30.75
#>    2 294   32.38     33.00   63.75
#>    3 323   35.57     36.25  100.00
#>   NA  17    1.87        NA      NA

Unlike dplyr’s ntile(), split_var() never splits a value into two different categories, i.e. you always get a “clean” separation of original categories. In other words: cases that have identical values in a variable will always be recoded into the same group. The following example demonstrates the differences:

x <- dplyr::ntile(efc$neg_c_7, n = 3)
# for some cases, value "10" is recoded into category "1",
# for other cases into category "2". Same is true for value "13"
table(efc$neg_c_7, x)
#>     x
#>        1   2   3
#>   7   75   0   0
#>   8   99   0   0
#>   9  106   0   0
#>   10  18 102   0
#>   11   0  96   0
#>   12   0  85   0
#>   13   0  14  50
#>   14   0   0  54
#>   15   0   0  45
#>   16   0   0  30
#>   17   0   0  35
#>   18   0   0  26
#>   19   0   0  16
#>   20   0   0  16
#>   21   0   0   2
#>   22   0   0   7
#>   23   0   0   4
#>   24   0   0   3
#>   25   0   0   6
#>   27   0   0   1
#>   28   0   0   2

x <- split_var(efc$neg_c_7, n = 3)
# no separation of cases with identical values.
table(efc$neg_c_7, x)
#>     x
#>        1   2   3
#>   7   75   0   0
#>   8   99   0   0
#>   9  106   0   0
#>   10   0 120   0
#>   11   0  96   0
#>   12   0  85   0
#>   13   0   0  64
#>   14   0   0  54
#>   15   0   0  45
#>   16   0   0  30
#>   17   0   0  35
#>   18   0   0  26
#>   19   0   0  16
#>   20   0   0  16
#>   21   0   0   2
#>   22   0   0   7
#>   23   0   0   4
#>   24   0   0   3
#>   25   0   0   6
#>   27   0   0   1
#>   28   0   0   2

split_var(), unlike ntile(), does therefor not always return exactly equal-sized groups:

x <- dplyr::ntile(efc$neg_c_7, n = 3)
frq(x)
#> 
#> x <integer>
#> # total N=908  valid N=892  mean=2.00  sd=0.82
#> 
#>  val frq raw.prc valid.prc cum.prc
#>    1 298   32.82     33.41   33.41
#>    2 297   32.71     33.30   66.70
#>    3 297   32.71     33.30  100.00
#>   NA  16    1.76        NA      NA

x <- split_var(efc$neg_c_7, n = 3)
frq(x)
#> 
#> Negative impact with 7 items (x) <categorical>
#> # total N=908  valid N=892  mean=2.03  sd=0.81
#> 
#>  val frq raw.prc valid.prc cum.prc
#>    1 280   30.84     31.39   31.39
#>    2 301   33.15     33.74   65.13
#>    3 311   34.25     34.87  100.00
#>   NA  16    1.76        NA      NA

Recode variables into equal-ranged groups

With group_var(), variables can be grouped into equal ranged categories, i.e. a variable is cut into a smaller number of groups, where each group has the same value range. group_labels() creates the related value labels.

The range of the groups is defined in the size-argument. At the same time, the size-argument also defines the lower bound of one of the groups.

For instance, if the lowest value of a variable is 1 and the maximum is 10, and size = 5, then

  1. each group will have a range of 5, and
  2. one of the groups will start with the value 5.

This means, that an equal-ranged grouping will define groups from 0 to 4, 5 to 9 and 10-14. Each of these groups has a range of 5, and one of the groups starts with the value 5.

The group assignment becomes clearer, when group_labels() is used in parallel:

set.seed(123)
x <- round(runif(n = 150, 1, 10))

frq(x)
#> 
#> x <numeric>
#> # total N=150  valid N=150  mean=5.52  sd=2.63
#> 
#>  val frq raw.prc valid.prc cum.prc
#>    1   6    4.00      4.00    4.00
#>    2  19   12.67     12.67   16.67
#>    3  16   10.67     10.67   27.33
#>    4  17   11.33     11.33   38.67
#>    5  20   13.33     13.33   52.00
#>    6  12    8.00      8.00   60.00
#>    7  19   12.67     12.67   72.67
#>    8  16   10.67     10.67   83.33
#>    9  15   10.00     10.00   93.33
#>   10  10    6.67      6.67  100.00
#>   NA   0    0.00        NA      NA

frq(group_var(x, size = 5))
#> 
#> x <numeric>
#> # total N=150  valid N=150  mean=1.68  sd=0.59
#> 
#>  val frq raw.prc valid.prc cum.prc
#>    1  58   38.67     38.67   38.67
#>    2  82   54.67     54.67   93.33
#>    3  10    6.67      6.67  100.00
#>   NA   0    0.00        NA      NA

group_labels(x, size = 5)
#> [1] "0-4"   "5-9"   "10-14"

dummy <- group_var(x, size = 5, as.num = FALSE)
levels(dummy) <- group_labels(x, size = 5)
frq(dummy)
#> 
#> x <categorical>
#> # total N=150  valid N=150  mean=1.68  sd=0.59
#> 
#>    val frq raw.prc valid.prc cum.prc
#>    0-4  58   38.67     38.67   38.67
#>    5-9  82   54.67     54.67   93.33
#>  10-14  10    6.67      6.67  100.00
#>   <NA>   0    0.00        NA      NA

dummy <- group_var(x, size = 3, as.num = FALSE)
levels(dummy) <- group_labels(x, size = 3)
frq(dummy)
#> 
#> x <categorical>
#> # total N=150  valid N=150  mean=2.48  sd=0.96
#> 
#>   val frq raw.prc valid.prc cum.prc
#>   0-2  25   16.67     16.67   16.67
#>   3-5  53   35.33     35.33   52.00
#>   6-8  47   31.33     31.33   83.33
#>  9-11  25   16.67     16.67  100.00
#>  <NA>   0    0.00        NA      NA

The argument right.interval can be used when size should indicate the upper bound of a group-range.

dummy <- group_var(x, size = 4, as.num = FALSE)
levels(dummy) <- group_labels(x, size = 4)
frq(dummy)
#> 
#> x <categorical>
#> # total N=150  valid N=150  mean=2.00  sd=0.74
#> 
#>   val frq raw.prc valid.prc cum.prc
#>   0-3  41   27.33     27.33   27.33
#>   4-7  68   45.33     45.33   72.67
#>  8-11  41   27.33     27.33  100.00
#>  <NA>   0    0.00        NA      NA

dummy <- group_var(x, size = 4, as.num = FALSE, right.interval = TRUE)
levels(dummy) <- group_labels(x, size = 4, right.interval = TRUE)
frq(dummy)
#> 
#> x <categorical>
#> # total N=150  valid N=150  mean=1.78  sd=0.71
#> 
#>   val frq raw.prc valid.prc cum.prc
#>   1-4  58   38.67     38.67   38.67
#>   5-8  67   44.67     44.67   83.33
#>  9-12  25   16.67     16.67  100.00
#>  <NA>   0    0.00        NA      NA

Flexible recoding of variables

rec() recodes old values of variables into new values, and can be considered as a “classsical” recode-function. The recode-pattern, i.e. which new values should replace the old values, is defined in the rec-argument. This argument has a specific “syntax”:

Here are some examples:

frq(efc$e42dep)
#> 
#> elder's dependency (x) <numeric>
#> # total N=908  valid N=901  mean=2.94  sd=0.94
#> 
#>  val                label frq raw.prc valid.prc cum.prc
#>    1          independent  66    7.27      7.33    7.33
#>    2   slightly dependent 225   24.78     24.97   32.30
#>    3 moderately dependent 306   33.70     33.96   66.26
#>    4   severely dependent 304   33.48     33.74  100.00
#>   NA                 <NA>   7    0.77        NA      NA

# replace NA with 5
frq(rec(efc$e42dep, rec = "NA=5;else=copy"))
#> 
#> elder's dependency (x) <numeric>
#> # total N=908  valid N=908  mean=2.96  sd=0.95
#> 
#>  val frq raw.prc valid.prc cum.prc
#>    1  66    7.27      7.27    7.27
#>    2 225   24.78     24.78   32.05
#>    3 306   33.70     33.70   65.75
#>    4 304   33.48     33.48   99.23
#>    5   7    0.77      0.77  100.00
#>   NA   0    0.00        NA      NA

# recode 1 to 2 into 1 and 3 to 4 into 2
frq(rec(efc$e42dep, rec = "1,2=1; 3,4=2"))
#> 
#> elder's dependency (x) <numeric>
#> # total N=908  valid N=901  mean=1.68  sd=0.47
#> 
#>  val frq raw.prc valid.prc cum.prc
#>    1 291   32.05      32.3    32.3
#>    2 610   67.18      67.7   100.0
#>   NA   7    0.77        NA      NA

# recode 1 to 3 into 4 into 2
frq(rec(efc$e42dep, rec = "min:3=1; 4=2"))
#> 
#> elder's dependency (x) <numeric>
#> # total N=908  valid N=901  mean=1.34  sd=0.47
#> 
#>  val frq raw.prc valid.prc cum.prc
#>    1 597   65.75     66.26   66.26
#>    2 304   33.48     33.74  100.00
#>   NA   7    0.77        NA      NA

# recode numeric to character, and remaining values
# into the highest value (="hi") of e42dep
frq(rec(efc$e42dep, rec = "1=first;2=2nd;else=hi"))
#> 
#> elder's dependency (x) <character>
#> # total N=908  valid N=901  mean=1.82  sd=0.54
#> 
#>    val frq raw.prc valid.prc cum.prc
#>    2nd 225   24.78     24.97   24.97
#>      4 610   67.18     67.70   92.67
#>  first  66    7.27      7.33  100.00
#>   <NA>   7    0.77        NA      NA

data(iris)
frq(rec(iris, Species, rec = "setosa=huhu; else=copy", append = FALSE))
#> 
#> Species_r <categorical>
#> # total N=150  valid N=150  mean=2.00  sd=0.82
#> 
#>         val frq raw.prc valid.prc cum.prc
#>        huhu  50   33.33     33.33   33.33
#>  versicolor  50   33.33     33.33   66.67
#>   virginica  50   33.33     33.33  100.00
#>        <NA>   0    0.00        NA      NA

# works with mutate
efc %>%
  dplyr::select(e42dep, e17age) %>%
  dplyr::mutate(dependency_rev = rec(e42dep, rec = "rev")) %>%
  head()
#>   e42dep e17age dependency_rev
#> 1      3     83              2
#> 2      3     88              2
#> 3      3     82              2
#> 4      4     67              1
#> 5      4     84              1
#> 6      4     85              1

# recode multiple variables and set value labels via recode-syntax
dummy <- rec(
  efc, c160age, e17age,
  rec = "15:30=1 [young]; 31:55=2 [middle]; 56:max=3 [old]",
  append = FALSE
)
frq(dummy)
#> 
#> carer' age (c160age_r) <numeric>
#> # total N=908  valid N=901  mean=2.40  sd=0.59
#> 
#>  val  label frq raw.prc valid.prc cum.prc
#>    1  young  48    5.29      5.33    5.33
#>    2 middle 442   48.68     49.06   54.38
#>    3    old 411   45.26     45.62  100.00
#>   NA   <NA>   7    0.77        NA      NA
#> 
#> 
#> elder' age (e17age_r) <numeric>
#> # total N=908  valid N=891  mean=3.00  sd=0.00
#> 
#>  val  label frq raw.prc valid.prc cum.prc
#>    1  young   0    0.00         0       0
#>    2 middle   0    0.00         0       0
#>    3    old 891   98.13       100     100
#>   NA   <NA>  17    1.87        NA      NA

Scoped variants

Where applicable, the recoding-functions in sjmisc have “scoped” versions as well, e.g. dicho_if() or split_var_if(), where transformation will be applied only to those variables that match the logical condition of predicate.

Cheat Sheet

A cheatsheet can be downloaded from the RStudio cheatsheet collection.