A Guide to Bivariate Table 1 • datscience

Sample Characteristics (Table 1)

Including Bivariate Comparisons

Before starting this tutorial, I want to thank the developers of the packages

table1 (Benjamin Rich), and
flextable (David Gohel and Colleagues).

Really all credit should go to the teams maintaining these two packages.
Full disclosure: My function merely provides an easy-to-use API or wrapper around their packages to get a beautiful publication-ready bivariate comparison table 1.

Creating a descriptive sociodemographic table can be a tedious and repetitive task. Most published articles have such a table (usually the first one, hence "Table 1"). I often found myself frustrated by the amount of work it took to create those tables, especially because the task is so repetitive, and I always believed that there had to be an easy solution for this in R. Recently I discovered the table1::table1() function, which made life much easier. I extended the example given by Benjamin Rich and combined it with the ability of flextable::flextable() to get the table nicely formatted into Word.

So what am I talking about, you might think? See for yourself below; it just requires one function call!

# Only 2 mandatory arguments: formula specifying variables to be used & data (see also table1::table1())
flex_table1(
  str_formula = "~ random_Smoking  + Glucose + Insulin + Leptin +Age + BMI|Diagnosis", 
  data = breast_cancer_modified,
  table_caption = c(
    "Table 1",
    "Sample Characteristics, Comparison of Healthy Controls and Cancer Patients."
  )
)

Characteristic	Healthy Controls (N = 52)	Breast Cancer Patients (N = 64)	t / χ²	df	p
Fictional Smoking Status			0.56	2	.756
Non-Smoker	26 (50 %)	28 (44 %)
Occasionally	2 (4 %)	2 (3 %)
Smoker	24 (46 %)	34 (53 %)
Glucose ***	88 (± 10)	110 (± 27)	-4.8	84.48	< .001
Insulin **	6.9 (± 4.9)	13 (± 12)	-3.32	85.57	.0013
Leptin	27 (± 19)	27 (± 19)	0.01	114	.991
Age	58 (± 19)	57 (± 13)	0.45	89.37	.654
BMI	28 (± 5.4)	27 (± 4.6)	1.43	114	.156
Note. Differences are determined by independent sample t-test or Pearson's χ²-test.

In published articles, this kind of table is usually the first one (hence the name Table 1). The example above shows the descriptive statistics of two subgroups of the sample (a bivariate description). The function also automatically conducts group comparisons and, if desired, corrects the obtained p-values for multiple comparisons. The following tests are used:

For metric/interval data: independent-samples t-test; in case of heterogeneity of variances, Welch's correction is applied.
For categorical/ordinal data: Egon Pearson's \(N-1\) version of the \(\chi^2\)-test, or Fisher's exact test (for expected cell counts of ≤ 1).
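To make the two test families more concrete, here is a minimal base R sketch on simulated data (not the tutorial data). It is not the internal code of flex_table1(); in particular, how the function decides that variances are heterogeneous is not covered here. The \(N-1\) statistic is computed as Pearson's uncorrected \(\chi^2\) multiplied by \((N-1)/N\); all object names are hypothetical.

```r
# Simulated, hypothetical data (not the breast cancer data)
set.seed(1)
group <- factor(rep(c("A", "B"), each = 30))
value <- c(rnorm(30, mean = 10, sd = 1), rnorm(30, mean = 11, sd = 3))

# Student's t-test assumes equal variances: df = n1 + n2 - 2
student <- t.test(value ~ group, var.equal = TRUE)
# Welch's t-test drops that assumption: df is adjusted, usually non-integer
welch <- t.test(value ~ group, var.equal = FALSE)
c(student = unname(student$parameter), welch = unname(welch$parameter))

# N-1 chi-squared test: Pearson's statistic (no continuity correction)
# multiplied by (N - 1) / N
counts <- matrix(c(12, 18, 20, 10), nrow = 2)
pearson <- chisq.test(counts, correct = FALSE)
n <- sum(counts)
chi2_n1 <- unname(pearson$statistic) * (n - 1) / n
p_n1 <- pchisq(chi2_n1, df = unname(pearson$parameter), lower.tail = FALSE)
c(chi2 = chi2_n1, p = p_n1)

# Fisher's exact test, the fallback for very small expected cell counts
fisher <- fisher.test(counts)
fisher$p.value
```

The Welch adjustment is why some rows of the table above report non-integer degrees of freedom (for example 84.48 for Glucose), while rows tested with the classical t-test report 114.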

See also the “Details” section in the documentation of flex_table1() for why the function defaults to no correction for multiple comparisons, and why it uses the \(N-1\) version of the \(\chi^2\)-test for cases with expected cell counts above 1 (instead of Fisher’s exact test for expected cell counts below 5).
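Since the default is no correction, it may help to see what a correction for multiple comparisons does to a set of p-values. The sketch below uses base R's p.adjust() on hypothetical p-values; how flex_table1() exposes such a correction is described in its documentation and is not assumed here.

```r
# Hypothetical raw p-values from six group comparisons
p_raw <- c(0.756, 0.0004, 0.0013, 0.991, 0.654, 0.156)

# Holm's step-down procedure and the more conservative Bonferroni correction
p_holm <- p.adjust(p_raw, method = "holm")
p_bonf <- p.adjust(p_raw, method = "bonferroni")

# Side-by-side comparison; adjusted values are never smaller than raw ones
data.frame(raw = p_raw, holm = p_holm, bonferroni = p_bonf)
```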

I will show you how you can create such a table and get it into Word with just a few basic and simple steps.

Step by Step Tutorial

So how do we get there? Just follow these few easy steps.

1.) Data and Setup

1.1) Load Required Packages

We will primarily use the datscience package, as well as a bit of dplyr and labelled to prepare the data. However, please note that the function I am showcasing here relies heavily on the two packages table1 and flextable (as described above).
If you want to customize the appearance of your table, I recommend that you check out the sources above. The function and vignette provided here merely serve convenience purposes: they make the creation of a bivariate comparison Table 1 a less time-consuming task.

Note: If you want to further customize a table created with datscience::flex_table1(), you can additionally use the functions provided by flextable, as flex_table1() just returns a flextable object that you can modify as you desire.

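library(datscience)
# Additionally load dplyr and labelled for Data Wrangling
library(dplyr)
library(labelled)
# flextable provides flextable() and set_table_properties() used below
# (loading it explicitly is harmless if datscience already attaches it)
library(flextable)
# We are also setting seed because we will generate random data with sample below
set.seed(123)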
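As a rough illustration of the note above, the following sketch applies a few flextable functions to a stand-in table built from the iris data. The same calls should work on any flextable object, which by the note above includes the return value of flex_table1(); the footnote text is hypothetical.

```r
library(flextable)

# Stand-in table: any flextable object can be modified the same way
ft <- flextable(head(iris, 3))

ft <- bold(ft, part = "header")                  # bold header row
ft <- fontsize(ft, size = 10, part = "all")      # uniform font size
ft <- align(ft, align = "center", part = "all")  # center all cells
ft <- add_footer_lines(ft, values = "Note. Hypothetical footnote text.")
```

Each function takes a flextable and returns a flextable, so the calls can equally be chained with the pipe.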

1.2) Loading Exemplary Data

The data we are using for this tutorial [Patricio, 2018] is publicly available on the homepage of the UCI Machine Learning Repository. To read in the data, we can just use the base R function read.csv().

# Load the data
path <- "https://archive.ics.uci.edu/ml/machine-learning-databases/00451/dataR2.csv"
breast_cancer <- read.csv(path)

1.3) Modify the Data

Now, for demonstration purposes, we will slightly modify the original data set. We will start by changing the column Classification to a factor and assigning the correct labels, obtained from the UCI repository:

| Labels:
| 1=Healthy controls
| 2=Patients
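A small, self-contained sketch of this recoding on hypothetical codes: spelling out levels explicitly ties each label to its code, so the mapping does not depend on the sort order of the values that happen to be present.

```r
# Hypothetical codes as they would appear in the Classification column
classification <- c(1, 2, 2, 1)

diagnosis <- factor(classification,
  levels = c(1, 2),
  labels = c("Healthy Controls", "Breast Cancer Patients")
)

# Check the mapping by cross-tabulating codes against labels
table(classification, diagnosis)
```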

Additionally we will generate a new fictional variable called random_Smoking with 3 associated factor levels c("Smoker","Non-Smoker","Occasionally").

Please note that this variable is not part of the original data; it was fabricated solely to illustrate how flex_table1() handles the comparison of factors.

# Modify the Data
breast_cancer_modified <- breast_cancer |>
  # Create a Factor Variable from Classification Column
  # And add Labels to the Factor Levels
  mutate(Diagnosis = factor(Classification,
    labels = c("Healthy Controls", "Breast Cancer Patients")
  )) |>
  # Remove the old Variable
  select(-Classification) |>
  # Generate a New Fictional Categorical Variable: random_Smoking
  mutate(random_Smoking = sample(c("Smoker", "Non-Smoker", "Occasionally"),
    size = nrow(breast_cancer),
    replace = TRUE, prob = c(0.49, 0.49, 0.02)
  )) |>
  # Convert the Randomly Generated Variable into a Factor
  mutate(random_Smoking = factor(random_Smoking))

# Giving the new fictional column a name to be shown in the table
var_label(breast_cancer_modified$random_Smoking) <- "Fictional Smoking Status"

2) Inspection and Overview of Data

Let's first get a glimpse of the data. This also gives me the opportunity to showcase another useful function of the datscience package: format_flextable().

# determine the number of decimal places to report by convention
nod <- get_number_of_decimals(nrow(breast_cancer_modified))
head(breast_cancer_modified) |> 
  # round numeric columns
  mutate(across(where(is.numeric), ~ round(.x, nod))) |> 
  flextable() |> 
  # head() returns the first 6 rows by default, hence N = 6
  format_flextable(table_caption = c("Table 2", "First Cases (N = 6) of the Cancer Patients Data Set"))

Table 2
First Cases (N = 6) of the Cancer Patients Data Set
Age	BMI	Glucose	Insulin	HOMA	Leptin	Adiponectin	Resistin	MCP.1	Diagnosis	random_Smoking
48	23.5	70	2.7	0.5	8.8	9.7	8.0	417.1	Healthy Controls	Smoker
83	20.7	92	3.1	0.7	8.8	5.4	4.1	468.8	Healthy Controls	Non-Smoker
82	23.1	91	4.5	1.0	17.9	22.4	9.3	554.7	Healthy Controls	Non-Smoker
68	21.4	77	3.2	0.6	9.9	7.2	12.8	928.2	Healthy Controls	Smoker
86	21.1	92	3.5	0.8	6.7	4.8	10.6	773.9	Healthy Controls	Smoker
49	22.9	92	3.2	0.7	6.8	13.7	10.3	530.4	Healthy Controls	Non-Smoker

2.1) Univariate Descriptives and Distribution

Additionally, let's inspect distributional parameters and descriptive statistics using the function psych::describe(), and format the output again with datscience::format_flextable(). All outputs generated here can also be conveniently saved to Word with datscience::save_flextable().

Descriptive Statistics and Distribution

breast_cancer_modified |> 
  # remove categorical variables
  select(where(is.numeric)) |> 
  # get descriptives of metric variables
  psych::describe() |>
  as.data.frame() |> 
  round(nod) |> 
  # add row names (name of variable) as a separate column
  tibble::rownames_to_column(var = "Variable") |> 
  select(-vars) |> 
  # generate flextable
  flextable() |> 
  # format flextable
  format_flextable(table_caption = c("Table 3", "Complete Sample - Numeric Variables: Distribution and Descriptive Stat."))

Table 3
Complete Sample - Numeric Variables: Distribution and Descriptive Stat.
Variable	n	mean	sd	median	trimmed	mad	min	max	range	skew	kurtosis	se
Age	116	57.3	16.1	56.0	57.3	18.5	24.0	89.0	65.0	0.0	-1.0	1.5
BMI	116	27.6	5.0	27.7	27.4	6.4	18.4	38.6	20.2	0.2	-1.0	0.5
Glucose	116	97.8	22.5	92.0	94.1	11.9	60.0	201.0	141.0	2.5	8.4	2.1
Insulin	116	10.0	10.1	5.9	7.8	3.7	2.4	58.5	56.0	2.5	7.0	0.9
HOMA	116	2.7	3.6	1.4	1.9	1.0	0.5	25.1	24.6	3.7	16.5	0.3
Leptin	116	26.6	19.2	20.3	23.9	15.5	4.3	90.3	86.0	1.3	1.2	1.8
Adiponectin	116	10.2	6.8	8.4	9.1	4.4	1.7	38.0	36.4	1.8	3.6	0.6
Resistin	116	14.7	12.4	10.8	12.5	7.8	3.2	82.1	78.9	2.5	8.3	1.2
MCP.1	116	534.6	345.9	471.3	490.6	304.6	45.8	1,698.4	1,652.6	1.4	2.3	32.1
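The arguments of datscience::save_flextable() are not shown in this tutorial, so as a hedged alternative the sketch below uses flextable's own save_as_docx() to write a stand-in table to a Word file. The file name and the example table are hypothetical.

```r
library(flextable)

# Stand-in table; in the tutorial this would be one of the tables above
ft <- flextable(head(mtcars, 3))

# Write the table to a .docx file in the session's temporary directory
out_file <- file.path(tempdir(), "table_example.docx")
save_as_docx(ft, path = out_file)

file.exists(out_file)
```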

3.) Creation of Bivariate Table 1 Including Comparisons

The function datscience::flex_table1() provides a convenient API for creating a Table 1 with bivariate group comparisons. There are only two mandatory arguments:

data = the data.frame to use. In this example we use the previously prepared, modified breast cancer data: data = breast_cancer_modified
str_formula = the variables that should be displayed and compared, in formula notation. Note that the formula is given as a string in this case. It always starts with a tilde "~", followed by the variables to be shown in the rows, combined with plus signs ("Age + BMI"). Lastly, the grouping variable to compare by is specified after a vertical bar ("| Diagnosis"). For more details, please refer to the table1 documentation.

Let's say that for our cancer sample we want to know whether there are differences
- in Glucose, Insulin, Leptin, Age and BMI, and additionally in the fictional smoking status,
- between cancer patients and healthy controls.
The formula string then looks like this: "~ random_Smoking + Glucose + Insulin + Leptin +Age + BMI|Diagnosis"
For this tutorial we also add a caption to the table.
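If the list of variables is long, the formula string can also be assembled programmatically. The following base R sketch is only an illustration (the helper variables row_vars and group_var are hypothetical); it builds the string and checks that R can parse it as a one-sided formula.

```r
# Variables shown in the rows, and the grouping variable
row_vars <- c("random_Smoking", "Glucose", "Insulin", "Leptin", "Age", "BMI")
group_var <- "Diagnosis"

# "~ var1 + var2 + ... | group"
str_formula <- paste("~", paste(row_vars, collapse = " + "), "|", group_var)
str_formula

# Sanity check: the string parses as a formula containing all variables
parsed <- as.formula(str_formula)
all.vars(parsed)
```

The resulting string differs from the one used in this tutorial only in whitespace, which does not matter once the string is parsed as a formula.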

And that’s it. In the function call below we also pass the argument ref_correction = TRUE, which marks the p-values for which the respective correction was applied. However, TRUE is also the default value; you can change this behavior by passing FALSE.

Creation of the Table

flex_table1(
  str_formula = "~ random_Smoking  + Glucose + Insulin + Leptin +Age + BMI|Diagnosis", 
  data = breast_cancer_modified,
  table_caption = c(
    "Table 4",
    "Sample Characteristics, Comparison of Healthy Controls and Cancer Patients."
  ),
  ref_correction = TRUE
)|>
  # Only For markdown we additionally need to fix the autofit property, this step is not
  # needed when directly saved to word e.g. with datscience::save_flextable()
  set_table_properties(layout = "autofit", width = 1)

Table 4
Sample Characteristics, Comparison of Healthy Controls and Cancer Patients.
Characteristic	Healthy Controls (N = 52)	Breast Cancer Patients (N = 64)	t / χ²	df	p
Fictional Smoking Status			0.56	2	.756
Non-Smoker	26 (50 %)	28 (44 %)
Occasionally	2 (4 %)	2 (3 %)
Smoker	24 (46 %)	34 (53 %)
Glucose ***	88 (± 10)	110 (± 27)	-4.8	84.48	< .001
Insulin **	6.9 (± 4.9)	13 (± 12)	-3.32	85.57	.0013
Leptin	27 (± 19)	27 (± 19)	0.01	114	.991
Age	58 (± 19)	57 (± 13)	0.45	89.37	.654
BMI	28 (± 5.4)	27 (± 4.6)	1.43	114	.156
Note. Differences are determined by independent sample t-test or Pearson's χ²-test.

Multiple Groups

Please note that, as of version 0.2.3 of datscience, one can also compare multiple groups with the flex_table1() function (see also News.md).

Literature

[Patricio, 2018] Patrício, M., Pereira, J., Crisóstomo, J., Matafome, P., Gomes, M., Seiça, R., & Caramelo, F. (2018). Using Resistin, glucose, age and BMI to predict the presence of breast cancer. BMC Cancer, 18(1)
