Kickstarting R - Contingency tables

## How do I get a crosstab?

You've been locked in a room with a PC containing the data for 248 subjects and they won't let you have lunch until you have crosstabulated all the demographic data. It's almost noon and you only have R. You hesitantly try

`> table(infert$education,infert$parity)`

and you get a very sparse tabulation of the parity (number of births) by educational attainment. You try the enhanced version of this function,

`> xtabs(~education+parity,data=infert)`

and are faced with a slightly more informative display. Unfortunately, you know that Bronwyn will want to know what percentage of women who completed high school had 2 or fewer children and Hans will have to have a chi-squared test for every contingency table. Let's see what can be done.

R follows the precepts of a bunch of brilliant people at Bell Labs in making statistics modular. That is, individual functions do fairly simple, general things very well, and intelligently combining the modules will do almost anything that you want. The beginner's problem is usually figuring out what the heck are the functions that will do the particular things that they want. We'll use the example data frame `infert` provided with R to illustrate how to build on that. First, let's find and retrieve the data.

```
> data()
...
freeny          Freeny's Revenue Data
infert          Secondary infertility matched case-control study
iris            Edgar Anderson's Iris Data as data.frame
...
> data(infert)
```

A quick summary of the data will reveal that parity ranges from 1 to 6. This will have to be reduced to two categories. That's pretty easy to do by assigning the output of a logical comparison.

```
> gt2<-infert$parity>2
> table(infert$education,gt2)
...
```

The observant reader may ask why the comparison "greater than" was used rather than "less than or equal to". Convenience is the answer. `table()` turns its arguments into factors, whose levels are sorted by default, and FALSE (0) sorts before TRUE (1). Using "greater than" here gets the columns the "right way round", with "2 or less" (FALSE) in the first column and "more than 2" (TRUE) in the second. When factors are built from character labels, the levels are ordered alphabetically. You can explicitly order the levels if you wish.
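As a minimal sketch of explicit ordering, the `levels` argument of `factor()` lets you list the levels in whatever order you like. The name `gt2_rev` is made up for this illustration.

```r
data(infert)
gt2 <- infert$parity > 2

# Default: the levels are sorted, so FALSE comes before TRUE.
levels(factor(gt2))

# Explicit order: list the levels yourself to put TRUE first.
gt2_rev <- factor(gt2, levels = c(TRUE, FALSE))
levels(gt2_rev)

# The columns of the table now follow the explicit order.
table(infert$education, gt2_rev)
```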

This is still a pretty laconic table which will have to be explained. Naming the dimensions with the `dnn` argument (the names of the `dimnames`) will help.

`> table(infert$education,gt2,dnn=c("Education","Parity"))`

It would also be nice if there were some descriptive labels rather than just "FALSE" and "TRUE". The really useful function `ifelse()` will do the trick.

```
> gt2<-ifelse(infert$parity>2,"Over 2","2 or less")
> table(infert$education,gt2,dnn=c("Education","Parity"))
```

Notice how the labels have been doctored so that they will be in the conventional order. Now we have a reasonable looking contingency table, but what about Bronwyn's percentages and Hans' chi-squares? We're going to have to go a bit beyond what `table()` will do to get output that will satisfy them. Let's go through the function `format.xtab()`.

First, we check that the minimal data is there, then get the base table from which to derive the rest of the information. In order to calculate the percentages, we'll need the row and column sums. These can be calculated in one hit by using `apply()`. Next up come the row and column names. Here, `formatC()` pops up. Plain old `format()` would have formatted each set of labels to the length of the longest label plus 1, but if we want a neat table, we want all of the labels to be the same length. Also notice that the `fieldwidth` has been given the default value of 10, allowing the user to shrink or expand the columns. `dnn` is given a default value if none was passed, and we're ready to go.
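The ingredients described above can be sketched roughly as follows. This is not the code of `format.xtab()` itself, only an illustration of the same steps under assumed names (`base_table`, `row_sums`, `col_sums`, `row_pct`, `col_pct`, `row_labels` and `col_labels` are all made up here).

```r
data(infert)
gt2 <- ifelse(infert$parity > 2, "Over 2", "2 or less")
base_table <- table(infert$education, gt2)

# Row and column sums via apply(): margin 1 is rows, margin 2 is columns.
row_sums <- apply(base_table, 1, sum)
col_sums <- apply(base_table, 2, sum)

# Row percentages: dividing by a vector of row totals recycles down the rows.
row_pct <- 100 * base_table / row_sums
# Column percentages: sweep() divides each column by its own total.
col_pct <- 100 * sweep(base_table, 2, col_sums, "/")

# Pad every label to the same width; flag "-" left-justifies the row labels.
fieldwidth <- 10
row_labels <- formatC(rownames(base_table), width = fieldwidth, flag = "-")
col_labels <- formatC(colnames(base_table), width = fieldwidth)

# Print the header, then the counts and row percentages for each row.
cat(formatC("", width = fieldwidth), col_labels, "\n")
for (i in seq_len(nrow(base_table))) {
  cat(row_labels[i],
      formatC(as.vector(base_table[i, ]), width = fieldwidth), "\n")
  cat(formatC("", width = fieldwidth),
      formatC(as.vector(row_pct[i, ]), format = "f", digits = 1,
              width = fieldwidth), "\n")
}
```

Bronwyn's figure is then a single cell of `row_pct`, for example `row_pct["12+ yrs", "2 or less"]` if "12+ yrs" is taken to mean completed high school.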

The function then prints the table in order: first the variable names (`dnn`) and the column names, then each of the rows, starting with the cell counts and row counts, then the cell row percentages and the overall row percentages, and then the cell column percentages. After that come the column counts and grand total, and the column percentages. Finally, if a chi-square test was ordered by including the argument `chisq=T`, the rather complicated bit at the bottom prints out the values of the chi-square test. It would be simpler just to run the chi-square test and let it print itself, but we would then get variables labeled as `v1` and `v2`, which might be confusing. You'll also notice when you run this function that `chisq.test()` warns you that some of the cells have smaller than recommended counts. You may wish to recode educational attainment to two categories as an exercise.
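One way to print the chi-square values yourself, rather than letting the test print itself, is to pick the pieces out of the object that `chisq.test()` returns. The following is a sketch and not the code inside `format.xtab()`; the names `tab` and `result` are made up for the example.

```r
data(infert)
gt2 <- ifelse(infert$parity > 2, "Over 2", "2 or less")
tab <- table(infert$education, gt2, dnn = c("Education", "Parity"))

# chisq.test() returns a list of class "htest"; suppressWarnings() hides
# the warning about small expected counts for this illustration.
result <- suppressWarnings(chisq.test(tab))

# Pull out the statistic, degrees of freedom and p-value and label them.
cat("Chi-squared =", round(result$statistic, 3),
    " df =", result$parameter,
    " p =", format.pval(result$p.value, digits = 3), "\n")

# The expected counts show which cells are responsible for the warning.
round(result$expected, 1)
```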

## But wait, I want more than two dimensions!

Contingency tables with more than two dimensions can be pretty difficult to interpret, and the chi-square test will only handle 2D at present anyway. However, that's no reason to get spooked. In the same file as `format.xtab()` is the `xtab()` function.

If you pass it a two element formula, it will act just like `format.xtab()`.

`xtab(v1~v2,mydata)`

If you ask for more than two dimensions, it will print out hierarchical counts and percentages of all levels of variables starting at the last one in the formula. When it gets to the first two, it will print out 2D contingency tables for those variables. It gets silly pretty quickly. Both `table()` and `ftable()` will also display multi-way crosstabulations.
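For comparison, here is a short sketch of a three-way crosstabulation using only base R. The choice of `induced` and `case` as the extra variables is arbitrary, and `three_way` is a name made up for the example.

```r
data(infert)

# table() with three factors returns a three-dimensional array of counts.
three_way <- table(infert$education, infert$induced, infert$case,
                   dnn = c("Education", "Induced", "Case"))
dim(three_way)

# ftable() flattens the array into a single two-dimensional display.
ftable(three_way)

# The same table built from a formula with xtabs().
ftable(xtabs(~ education + induced + case, data = infert))
```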

This is also an introduction to the use of recursion, in which a function calls itself until whatever test you have set is satisfied. In this case, the function stops calling itself when there are at most two variables to be crosstabulated.
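The recursive idea can be sketched in a few lines. This is a hypothetical helper called `nested_tables()`, not the `xtab()` function described above: it splits the data on the last variable and calls itself on the rest until at most two variables remain.

```r
# Hypothetical helper for illustration; not the xtab() function from the text.
nested_tables <- function(data, vars) {
  if (length(vars) <= 2) {
    # Base case: at most two variables left, so tabulate them directly.
    return(table(data[vars]))
  }
  # Recursive case: split on the last variable, then recurse on the rest.
  last <- vars[length(vars)]
  rest <- vars[-length(vars)]
  lapply(split(data, data[[last]]), nested_tables, vars = rest)
}

data(infert)
result <- nested_tables(infert, c("education", "induced", "case"))

# One two-way table of education by induced for each value of case.
names(result)
result[["1"]]
```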

Get the nachos, you deserve it.