---
title: "Univariate analysis of continuous and categorical variables"
date: "`r Sys.Date()`"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{Univariate analysis}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)
options(rmarkdown.html_vignette.check_title = FALSE)
```

The first step in data exploration usually consists of a univariate, descriptive analysis of all variables of interest. Tidycomm offers four basic functions to quickly output relevant statistics:

- `describe()` for continuous variables
- `tab_percentiles()` for continuous variables
- `describe_cat()` for categorical variables
- `tab_frequencies()` for categorical variables

```{r setup, message = FALSE, warning = FALSE, include = FALSE}
library(tidycomm)
```

For demonstration purposes, we will use sample data from the [Worlds of Journalism](https://worldsofjournalism.org/) 2012-16 study included in tidycomm.

```{r}
WoJ
```

## Describe continuous variables

`describe()` outputs several measures of central tendency and variability for all variables named in the function call:

```{r}
WoJ %>%
  describe(autonomy_selection, autonomy_emphasis, work_experience)
```

If no variables are passed to `describe()`, all numeric variables in the data are described:

```{r}
WoJ %>%
  describe()
```

Data can be grouped before describing:

```{r}
WoJ %>%
  dplyr::group_by(country) %>%
  describe(autonomy_emphasis, autonomy_selection)
```

The results returned by `describe()` can also be visualized:

```{r}
WoJ %>%
  describe() %>%
  visualize()
```

In addition, percentiles can easily be extracted from continuous variables:

```{r}
WoJ %>%
  tab_percentiles()
```

Percentiles can also be visualized:

```{r}
WoJ %>%
  tab_percentiles(trust_parties) %>%
  visualize()
```

## Describe categorical variables

`describe_cat()` outputs a short summary (number of unique values, mode, N of mode) for all categorical variables named in the function call:

```{r}
WoJ %>%
  describe_cat(reach, employment, temp_contract)
```

If no variables are passed to `describe_cat()`, all categorical variables (i.e., `character` and `factor` variables) in the data are described:

```{r}
WoJ %>%
  describe_cat()
```

Data can be grouped before describing:

```{r}
WoJ %>%
  dplyr::group_by(reach) %>%
  describe_cat(country, employment)
```

The results from `describe_cat()` can be visualized as well:

```{r}
WoJ %>%
  describe_cat() %>%
  visualize()
```

## Tabulate frequencies of categorical variables

`tab_frequencies()` outputs absolute and relative frequencies of all unique values of one or more categorical variables:

```{r}
WoJ %>%
  tab_frequencies(employment)
```

Passing more than one variable will compute relative frequencies based on all combinations of unique values:

```{r}
WoJ %>%
  tab_frequencies(employment, country)
```

You can also group your data beforehand. This yields within-group relative frequencies:

```{r}
WoJ %>%
  dplyr::group_by(country) %>%
  tab_frequencies(employment)
```

(Compare the columns `percent`, `cum_n` and `cum_percent` with the output above.)

The output of `tab_frequencies()` can be visualized, too:

```{r}
WoJ %>%
  tab_frequencies(country) %>%
  visualize()
```
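
To make the difference between relative frequencies over all combinations and within-group relative frequencies more concrete, the following sketch reproduces both calculations with plain dplyr on a small, hypothetical data frame. The object names `toy`, `overall`, and `within_country` are made up for illustration. This is not how tidycomm computes its output internally; it only illustrates the idea.

```r
library(dplyr)

# Hypothetical toy data, not part of the WoJ sample data
toy <- data.frame(
  country    = c("A", "A", "A", "B", "B"),
  employment = c("Full-time", "Full-time", "Part-time", "Full-time", "Part-time")
)

# Relative frequencies over all combinations:
# shares sum to 1 across the whole table
overall <- toy %>%
  count(country, employment) %>%
  mutate(percent = n / sum(n))

# Within-group relative frequencies:
# shares sum to 1 within each country
within_country <- toy %>%
  count(country, employment) %>%
  group_by(country) %>%
  mutate(percent = n / sum(n)) %>%
  ungroup()

overall
within_country
```

In `overall`, the denominator is the total number of rows, whereas in `within_country` it is the number of rows per country, so the same cell can carry a very different share depending on whether the data were grouped first.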