Sample Characteristics (Table 1)
Including Bivariate Comparisons
Before starting this tutorial, I want to thank the developers of the packages
table1 and flextable. Really, all credit should go to the teams maintaining
these two packages.
Full disclosure: my function merely provides an easy-to-use wrapper around
their packages to produce a publication-ready Table 1 with bivariate
comparisons.
Creating a descriptive sociodemographic table can be a tedious and
repetitive task. Most published articles have such a table (usually the
first one, hence "Table 1"). I found myself frustrated by the amount of work
it took to create those tables, and I always believed that there had to be an
easy solution for this in R. Recently I discovered the table1::table1()
function, which made life much easier. I extended the example given by
Benjamin Rich and combined it with the ability of flextable::flextable() to
get the result nicely formatted into Word.
So what am I talking about? See for yourself below; it requires just one function call!
# Only 2 mandatory arguments: a formula string specifying the variables to use,
# and the data (see also table1::table1())
flex_table1(
  str_formula = "~ random_Smoking + Glucose + Insulin + Leptin +Age + BMI|Diagnosis",
  data = breast_cancer_modified,
  table_caption = c(
    "Table 1",
    "Sample Characteristics, Comparison of Healthy Controls and Cancer Patients."
  )
)
Characteristic | Healthy Controls | Breast Cancer Patients | t / χ² | df | p |
---|---|---|---|---|---|
Fictional Smoking Status | | | 0.56 | 2 | .756 |
Non-Smoker | 26 (50 %) | 28 (44 %) | | | |
Occasionally | 2 (4 %) | 2 (3 %) | | | |
Smoker | 24 (46 %) | 34 (53 %) | | | |
Glucose *** | 88 (± 10) | 110 (± 27) | -4.8 | 84.48 | < .001 |
Insulin ** | 6.9 (± 4.9) | 13 (± 12) | -3.32 | 85.57 | .0013 |
Leptin | 27 (± 19) | 27 (± 19) | 0.01 | 114 | .991 |
Age | 58 (± 19) | 57 (± 13) | 0.45 | 89.37 | .654 |
BMI | 28 (± 5.4) | 27 (± 4.6) | 1.43 | 114 | .156 |

Note. Differences are determined by independent sample t-test or Pearson's χ²-test.
In published articles, this kind of table is usually the first one (hence "Table 1"). The example above shows descriptive statistics for two subgroups of the sample (bivariate descriptives). The function also automatically conducts group comparisons and, if desired, corrects the obtained p-values. The following tests are used:
- For metric/interval data: the independent-samples t-test; if the variances are heterogeneous, Welch's correction is applied.
- For categorical/ordinal data: Egon Pearson's \(N-1\) version of the \(\chi^2\)-test, or Fisher's exact test (for expected cell counts ≤ 1).
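The two kinds of test listed above can be reproduced with base R. The following is a minimal sketch, not the internal code of flex_table1(): it assumes that the \(N-1\) statistic is the uncorrected Pearson \(\chi^2\) multiplied by \((N-1)/N\), it uses the smoking counts from the table above, and the Glucose-like values are simulated purely for illustration.

```r
# Smoking counts copied from the table above (rows: status, columns: group)
smoking_counts <- matrix(
  c(26, 28,
     2,  2,
    24, 34),
  ncol = 2, byrow = TRUE,
  dimnames = list(
    c("Non-Smoker", "Occasionally", "Smoker"),
    c("Healthy Controls", "Breast Cancer Patients")
  )
)
n_total <- sum(smoking_counts)

# Uncorrected Pearson chi-square; R warns about small expected counts,
# which is silenced here because the N-1 version is meant for that case
pearson <- suppressWarnings(chisq.test(smoking_counts, correct = FALSE))

# Assumed N-1 adjustment: scale the Pearson statistic by (N - 1) / N
chi2_n1 <- unname(pearson$statistic) * (n_total - 1) / n_total
df_chi2 <- unname(pearson$parameter)
p_n1    <- pchisq(chi2_n1, df = df_chi2, lower.tail = FALSE)
round(c(chi2 = chi2_n1, df = df_chi2, p = p_n1), 3)

# Welch's t-test on simulated values (NOT the real Glucose measurements)
set.seed(1)
healthy  <- rnorm(52, mean = 88,  sd = 10)
patients <- rnorm(64, mean = 110, sd = 27)
welch <- t.test(healthy, patients, var.equal = FALSE)
welch$method     # names the Welch variant
welch$parameter  # fractional degrees of freedom
```

The \(\chi^2\) value comes out at roughly 0.56 with 2 degrees of freedom, in line with the smoking row of the table. The fractional degrees of freedom reported for some rows are what a Welch-corrected t-test typically produces.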
See also the "Details" section in the documentation of flex_table1() on why
the function defaults to no correction for multiple comparisons and uses the
\(N-1\) version of the \(\chi^2\) test for cases with expected cell counts
above 1 (instead of Fisher's exact test for expected cell counts below 5).
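If you do want to correct for multiple comparisons yourself, base R's p.adjust() is one option. The sketch below is only an illustration and says nothing about how flex_table1() handles corrections internally. It uses the five p-values that the table above reports exactly; Glucose is left out because it is only reported as "< .001", so the adjusted values are not the ones you would obtain for the full table.

```r
# Exactly reported p-values from the table above (Glucose omitted, see text)
p_raw <- c(
  Smoking = 0.756,
  Insulin = 0.0013,
  Leptin  = 0.991,
  Age     = 0.654,
  BMI     = 0.156
)

# Two common corrections from the stats package
p_holm <- p.adjust(p_raw, method = "holm")
p_bonf <- p.adjust(p_raw, method = "bonferroni")

# Side-by-side comparison
round(cbind(raw = p_raw, holm = p_holm, bonferroni = p_bonf), 4)
```

Holm's method is never more conservative than Bonferroni, which is why it is often preferred when a correction is wanted at all.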
I will show you how you can create such a table and get it into Word with just a few basic and simple steps.
Step by Step Tutorial
So how do we get there? Just follow these few easy steps.
1.) Data and Setup
1.1) Load Required Packages
We will primarily use the datscience package, as well as a bit of dplyr and
labelled to prepare the data. However, please note that the function I am
showcasing here heavily benefits from the two packages table1 and flextable
(as described above).
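The package-loading chunk is not reproduced in this text, so here is a minimal sketch of what the setup could look like. It assumes the four packages named in this tutorial are already installed; the availability check is an added convenience and not part of the original setup.

```r
# Packages used in this tutorial (assumed to be installed already)
pkgs <- c("datscience", "dplyr", "labelled", "flextable")

# Check which of them can be found in the current library
is_available <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
missing_pkgs <- pkgs[!is_available]

if (length(missing_pkgs) > 0) {
  message("Please install first: ", paste(missing_pkgs, collapse = ", "))
}

# Attach the packages that are available
for (pkg in pkgs[is_available]) {
  library(pkg, character.only = TRUE)
}
```

The psych package is only called with its namespace prefix (psych::describe()) later on, so it needs to be installed but not attached.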
If you want to customize the appearance of your table, I recommend checking
out the documentation of these two packages. The function and vignette
provided here serve convenience purposes only: they make creating a bivariate
comparison in Table 1 a less time-consuming task.
Note: If you want to customize a table created with
datscience::flex_table1() further, you can use the functions provided by
flextable, because flex_table1() returns a flextable object that you can
modify as you desire.
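As a rough illustration of such customization, the sketch below applies a few flextable functions to a stand-in table. The object ft is built from the built-in iris data only so that the example runs without datscience; with a real result you would start from the object returned by flex_table1() instead. The footer text is made up for the example.

```r
library(flextable)

# Stand-in for the object returned by flex_table1(); any flextable behaves the same
ft <- flextable(head(iris, 3))

ft_custom <- ft |>
  bold(part = "header") |>               # bold header row
  fontsize(size = 10, part = "all") |>   # same font size everywhere
  add_footer_lines("Note. Example footer added with flextable.") |>
  autofit()                              # fit column widths to content

class(ft_custom)
```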
1.2) Loading Exemplary Data
The data used in this tutorial [Patricio, 2018] are publicly available from
the UCI Machine Learning Repository. To read in the data, we can simply use
the base R function read.csv().
# Load the data
path <- "https://archive.ics.uci.edu/ml/machine-learning-databases/00451/dataR2.csv"
breast_cancer <- read.csv(path)
1.3) Modify the Data
Now, for demonstration purposes, we will slightly modify the original
data set. We start by converting the column Classification to a factor and
assigning the correct labels, obtained from the UCI repository:
| Labels:
| 1=Healthy controls
| 2=Patients
Additionally, we will generate a new fictional variable called random_Smoking
with three factor levels, c("Smoker", "Non-Smoker", "Occasionally").
Please note that this variable is not part of the original data and was
fabricated only to illustrate how flex_table1() compares factors.
# Modify the Data
breast_cancer_modified <- breast_cancer |>
  # Create a Factor Variable from Classification Column
  # And add Labels to the Factor Levels
  mutate(Diagnosis = factor(Classification,
    labels = c("Healthy Controls", "Breast Cancer Patients")
  )) |>
  # Remove the old Variable
  select(-Classification) |>
  # Generate a New Fictional Categorical Variable: random_Smoking
  mutate(random_Smoking = sample(c("Smoker", "Non-Smoker", "Occasionally"),
    size = nrow(breast_cancer),
    replace = TRUE, prob = c(0.49, 0.49, 0.02)
  )) |>
  # Convert the Randomly Generated Variable into a Factor
  mutate(random_Smoking = factor(random_Smoking))

# Giving the new fictional column a name to be shown in the table
var_label(breast_cancer_modified$random_Smoking) <- "Fictional Smoking Status"
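Because sample() draws at random, the smoking counts differ on every run and will not match the tables shown in this tutorial exactly. A minimal sketch of how the draw could be made reproducible is given below; the seed value 2024 is arbitrary and is not the one behind the tables shown here. The number of cases, 116, is taken from the descriptive statistics reported in this tutorial.

```r
smoking_levels <- c("Smoker", "Non-Smoker", "Occasionally")
n_cases <- 116  # number of rows in the data set

# Helper (hypothetical, for this example only): draw the fictional variable
draw_smoking <- function(seed) {
  set.seed(seed)  # fixing the seed makes the random draw repeatable
  drawn <- sample(smoking_levels,
    size = n_cases,
    replace = TRUE, prob = c(0.49, 0.49, 0.02)
  )
  # Passing levels explicitly keeps "Occasionally" as a level even if the
  # rare category happens not to be drawn at all
  factor(drawn, levels = smoking_levels)
}

smoking       <- draw_smoking(2024)
smoking_again <- draw_smoking(2024)

table(smoking)
identical(smoking, smoking_again)
```

Calling set.seed() once before the mutate() pipeline above has the same effect on the tutorial data.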
2.) Inspection and Overview of Data
First, we take a quick look at the data. This also gives the opportunity to
showcase another useful function of the datscience package,
format_flextable().
# determine the number of decimal places to report, by convention
nod <- get_number_of_decimals(nrow(breast_cancer_modified))
# head() returns the first six rows by default
head(breast_cancer_modified) |>
  # round numeric columns
  mutate(across(where(is.numeric), ~ round(.x, nod))) |>
  flextable() |>
  format_flextable(table_caption = c("Table 2", "First Cases (N = 6) of the Cancer Patients Data Set"))

Table 2
First Cases (N = 6) of the Cancer Patients Data Set

Age | BMI | Glucose | Insulin | HOMA | Leptin | Adiponectin | Resistin | MCP.1 | Diagnosis | random_Smoking |
---|---|---|---|---|---|---|---|---|---|---|
48 | 23.5 | 70 | 2.7 | 0.5 | 8.8 | 9.7 | 8.0 | 417.1 | Healthy Controls | Smoker |
83 | 20.7 | 92 | 3.1 | 0.7 | 8.8 | 5.4 | 4.1 | 468.8 | Healthy Controls | Non-Smoker |
82 | 23.1 | 91 | 4.5 | 1.0 | 17.9 | 22.4 | 9.3 | 554.7 | Healthy Controls | Non-Smoker |
68 | 21.4 | 77 | 3.2 | 0.6 | 9.9 | 7.2 | 12.8 | 928.2 | Healthy Controls | Smoker |
86 | 21.1 | 92 | 3.5 | 0.8 | 6.7 | 4.8 | 10.6 | 773.9 | Healthy Controls | Smoker |
49 | 22.9 | 92 | 3.2 | 0.7 | 6.8 | 13.7 | 10.3 | 530.4 | Healthy Controls | Non-Smoker |
2.1) Univariate Descriptives and Distribution
Additionally, let's inspect distributional parameters and descriptive
statistics using the function psych::describe(), and format the output again
with datscience::format_flextable(). All outputs generated here can also be
conveniently saved to Word with datscience::save_flextable().
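The arguments of datscience::save_flextable() are not shown in this tutorial, so the sketch below uses flextable's own save_as_docx() as a stand-in for the export step. It assumes flextable (and its dependency officer) is installed; the table content and the file name are made up for the example.

```r
library(flextable)

# Stand-in table; in practice this would be one of the flextables created here
ft <- flextable(head(mtcars, 3))

# Write to a temporary location so the example leaves no files behind
out_path <- file.path(tempdir(), "table1_example.docx")
save_as_docx(ft, path = out_path)

file.exists(out_path)
```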
Descriptive Statistics and Distribution
breast_cancer_modified |>
  # remove categorical variables
  select(where(is.numeric)) |>
  # get descriptives of metric variables
  psych::describe() |>
  as.data.frame() |>
  round(nod) |>
  # add row names (name of variable) as a separate column
  tibble::rownames_to_column(var = "Variable") |>
  select(-vars) |>
  # generate flextable
  flextable() |>
  # format flextable
  format_flextable(table_caption = c("Table 3", "Complete Sample - Numeric Variables: Distribution and Descriptive Stat."))
Table 3
Complete Sample - Numeric Variables: Distribution and Descriptive Stat.

Variable | n | mean | sd | median | trimmed | mad | min | max | range | skew | kurtosis | se |
---|---|---|---|---|---|---|---|---|---|---|---|---|
Age | 116 | 57.3 | 16.1 | 56.0 | 57.3 | 18.5 | 24.0 | 89.0 | 65.0 | 0.0 | -1.0 | 1.5 |
BMI | 116 | 27.6 | 5.0 | 27.7 | 27.4 | 6.4 | 18.4 | 38.6 | 20.2 | 0.2 | -1.0 | 0.5 |
Glucose | 116 | 97.8 | 22.5 | 92.0 | 94.1 | 11.9 | 60.0 | 201.0 | 141.0 | 2.5 | 8.4 | 2.1 |
Insulin | 116 | 10.0 | 10.1 | 5.9 | 7.8 | 3.7 | 2.4 | 58.5 | 56.0 | 2.5 | 7.0 | 0.9 |
HOMA | 116 | 2.7 | 3.6 | 1.4 | 1.9 | 1.0 | 0.5 | 25.1 | 24.6 | 3.7 | 16.5 | 0.3 |
Leptin | 116 | 26.6 | 19.2 | 20.3 | 23.9 | 15.5 | 4.3 | 90.3 | 86.0 | 1.3 | 1.2 | 1.8 |
Adiponectin | 116 | 10.2 | 6.8 | 8.4 | 9.1 | 4.4 | 1.7 | 38.0 | 36.4 | 1.8 | 3.6 | 0.6 |
Resistin | 116 | 14.7 | 12.4 | 10.8 | 12.5 | 7.8 | 3.2 | 82.1 | 78.9 | 2.5 | 8.3 | 1.2 |
MCP.1 | 116 | 534.6 | 345.9 | 471.3 | 490.6 | 304.6 | 45.8 | 1,698.4 | 1,652.6 | 1.4 | 2.3 | 32.1 |
3.) Creation of Bivariate Table 1 Including Comparisons
The function datscience::flex_table1() provides a convenient interface for
creating Table 1 with bivariate group comparisons. There are only two
mandatory arguments:
- data = the data.frame to use. In this example we use the previously
  prepared modified breast cancer data: data = breast_cancer_modified
- str_formula = the variables that should be displayed and compared, in
  formula notation. Note that the formula is given as a string in this case.
  It always starts with a tilde "~", followed by the variables to be shown in
  the rows, combined with plus signs: "Age + BMI". Lastly, the groups to be
  compared are specified after a vertical bar: "| Diagnosis". For more
  details, please refer to the table1 documentation. Let's say that for our
  cancer sample we want to know whether there are differences
  - in the following variables: Glucose, Insulin, Leptin, Age, and BMI, and
    additionally in the fictional smoking status,
  - between cancer patients and healthy controls.
  - The formula string would then look like this:
    "~ random_Smoking + Glucose + Insulin + Leptin +Age + BMI|Diagnosis"
- For this tutorial we also add a caption to the table.
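When the list of variables is long, the formula string described above can also be assembled programmatically rather than typed by hand. The following is a small sketch using only base R; the helper names row_vars and group_var are made up for this example.

```r
# Variables to show in the rows, and the grouping variable
row_vars  <- c("random_Smoking", "Glucose", "Insulin", "Leptin", "Age", "BMI")
group_var <- "Diagnosis"

# "~" + row variables joined by " + " + " | " + grouping variable
str_formula <- paste0("~ ", paste(row_vars, collapse = " + "), " | ", group_var)
str_formula
# "~ random_Smoking + Glucose + Insulin + Leptin + Age + BMI | Diagnosis"

# Sanity check: the string parses as a formula and contains the expected names
parsed <- stats::as.formula(str_formula)
all.vars(parsed)
```

The resulting string can then be passed as the str_formula argument.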
That is all that is needed. In the function call below we also pass the
argument ref_correction = TRUE, which marks p-values where the respective
correction was applied. However, TRUE is also the default value; you can
change this behavior by passing FALSE.
Creation of the Table
flex_table1(
  str_formula = "~ random_Smoking + Glucose + Insulin + Leptin +Age + BMI|Diagnosis",
  data = breast_cancer_modified,
  table_caption = c(
    "Table 4",
    "Sample Characteristics, Comparison of Healthy Controls and Cancer Patients."
  ),
  ref_correction = TRUE
) |>
  # Only for markdown we additionally need to fix the autofit property; this step is not
  # needed when saving directly to Word, e.g. with datscience::save_flextable()
  set_table_properties(layout = "autofit", width = 1)
Table 4
Sample Characteristics, Comparison of Healthy Controls and Cancer Patients.

Characteristic | Healthy Controls | Breast Cancer Patients | t / χ² | df | p |
---|---|---|---|---|---|
Fictional Smoking Status | | | 0.56 | 2 | .756 |
Non-Smoker | 26 (50 %) | 28 (44 %) | | | |
Occasionally | 2 (4 %) | 2 (3 %) | | | |
Smoker | 24 (46 %) | 34 (53 %) | | | |
Glucose *** | 88 (± 10) | 110 (± 27) | -4.8 | 84.48 | < .001 |
Insulin ** | 6.9 (± 4.9) | 13 (± 12) | -3.32 | 85.57 | .0013 |
Leptin | 27 (± 19) | 27 (± 19) | 0.01 | 114 | .991 |
Age | 58 (± 19) | 57 (± 13) | 0.45 | 89.37 | .654 |
BMI | 28 (± 5.4) | 27 (± 4.6) | 1.43 | 114 | .156 |

Note. Differences are determined by independent sample t-test or Pearson's χ²-test.