Title: | Easily Extracting Information About Your Data |
---|---|
Description: | Makes it easy to display descriptive information on a data set. Getting an easy overview of a data set by displaying and visualizing sample information in different tables (e.g., time and scope conditions). The package also provides publishable 'LaTeX' code to present the sample information. |
Authors: | Cosima Meyer [cre, aut], Dennis Hammerschmidt [aut] |
Maintainer: | Cosima Meyer <[email protected]> |
License: | GPL-3 |
Version: | 0.0.13 |
Built: | 2024-10-31 22:09:08 UTC |
Source: | https://github.com/cosimameyer/overviewr |
Internal function that builds the 'overview_heat' plot for data.table objects
.overview_heat( dat = NULL, id = NULL, time = NULL, label = FALSE, perc = FALSE, col_low = NULL, col_high = NULL, xaxis = NULL, yaxis = NULL, theme_plot = NULL, exp_total = NULL, col_names = NULL )
dat |
The data set |
id |
The scope (e.g., country codes or individual IDs). The axis is ordered in ascending order by default. |
time |
The time (e.g., time periods given by years, months, ...) |
label |
If TRUE, the total number or percentage of observations is displayed. If FALSE (the default in this internal function), no labels are shown. |
perc |
If FALSE (default) plot returns the total number of observations per time-scope-unit. If TRUE, it returns the number of observations per time-scope-unit in percentage |
col_low |
Hex color code for the lowest value (default is "#dceaf2") |
col_high |
Hex color code for the highest value (default is "#2A5773") |
xaxis |
Label of your x axis ("Time frame" is default) |
yaxis |
Label of your y axis ("Sample" is default) |
theme_plot |
Previously generated theme |
exp_total |
Expected total number of observations (i.e. maximum) for time unit. |
col_names |
The column names (containing id and time) |
A ggplot
Internal function that calculates the 'overview_tab' for data.table objects
.overview_tab(dat = NULL, id = NULL, time = NULL, col_names = NULL)
dat |
Your data set |
id |
Scope (e.g., country codes or individual IDs) |
time |
Time (e.g., time periods given by years, months, ...). There are three options to add a date variable: 1) a character vector containing **one** time variable, 2) a time variable following the YYYY-MM-DD format, or 3) a list containing multiple time variables ('time = list(year = NULL, month = NULL, day = NULL)'). |
col_names |
The column names (containing id and time) |
A data.table
Function used in 'overview_tab' to find running integers
find_int_runs(run = NULL)
run |
Variable (integer) that should be checked for consecutive values |
The function returns a data set
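The idea of "running integers" can be illustrated with a minimal base R sketch. It is not the package's implementation: `collapse_runs` is a hypothetical helper, and it returns a string rather than a data set. It only shows one common way to detect runs of consecutive values, by starting a new run wherever the gap to the previous value is not 1.

```r
# Hypothetical helper (not part of overviewR): collapse an integer vector
# into a string that describes its runs of consecutive values.
collapse_runs <- function(x) {
  if (length(x) == 0) {
    return("")
  }
  x <- sort(unique(x))
  # A new run starts wherever the gap to the previous value is not 1
  run_id <- cumsum(c(TRUE, diff(x) != 1))
  runs <- split(x, run_id)
  labels <- vapply(runs, function(r) {
    if (length(r) == 1) {
      as.character(r)
    } else {
      paste(min(r), max(r), sep = "-")
    }
  }, character(1))
  paste(labels, collapse = ", ")
}

collapse_runs(c(1990, 1991, 1992, 1995, 1997, 1998))
# "1990-1992, 1995, 1997-1998"
```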
Function used in 'overview_na' to generate a new data frame with na_count and percentage share of NAs for each row
overview_add_na_output(dat_result = NULL, dat = NULL)
dat_result |
Data.frame from 'overview_na' |
dat |
Data frame |
The function returns a data set that has the information on the row-wise NA share
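A row-wise NA share can be computed in base R as sketched below. This is an illustration under assumptions, not the package's code: the data frame is invented, and the column name `percentage` is a guess (only `na_count` is named in the description above).

```r
# Invented example data (not the package's toydata)
dat <- data.frame(
  a = c(1, NA, 3),
  b = c(NA, NA, "x"),
  c = c(TRUE, FALSE, NA)
)

# Count NAs per row and express them as a share of the number of columns
na_count <- rowSums(is.na(dat))
percentage <- round(na_count / ncol(dat) * 100, 2)

# Append both columns to the original data
dat_na <- cbind(dat, na_count = na_count, percentage = percentage)
dat_na
```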
This function draws a ggplot that visualizes a cross table.
overview_crossplot( dat, id, time, cond1, cond2, threshold1, threshold2, xaxis = "Condition 1", yaxis = "Condition 2", label = FALSE, color = FALSE, dot_size = 2, fontsize = 2.5 )
dat |
Your data set |
id |
Your scope (e.g., country codes or individual IDs). If the id variable contains NAs, they will not be included in the plot. |
time |
Your time (e.g., time periods given by years, months, ...) |
cond1 |
Variable that describes the first condition |
cond2 |
Variable that describes the second condition |
threshold1 |
A threshold for cond1 |
threshold2 |
A threshold for cond2 |
xaxis |
Label of the x axis ("Condition 1" is default) |
yaxis |
Label of the y axis ("Condition 2" is default) |
label |
Label of the observations. Overlapping labels are avoided by using 'ggrepel' |
color |
Color of the different observation groups |
dot_size |
Optional argument that defines the dot size (default is 2) |
fontsize |
If label is TRUE, the fontsize argument sets the text size of the labels (default is 2.5) |
A ggplot figure that presents the sample information visually in a cross table
data(toydata)
overview_crossplot(
  dat = toydata,
  cond1 = gdp,
  cond2 = population,
  threshold1 = 25000,
  threshold2 = 27000,
  id = ccode,
  time = year
)
Sorts a data set conditionally in a cross table. This can be helpful to get a sense of the time and scope conditions of a data set. Note, if used with a data set that has multiple observations on the id-time unit, the function automatically aggregates this information using the mean.
overview_crosstab(dat, cond1, cond2, threshold1, threshold2, id, time)
dat |
A data set object |
cond1 |
Variable that describes the first condition |
cond2 |
Variable that describes the second condition |
threshold1 |
A threshold for cond1 |
threshold2 |
A threshold for cond2 |
id |
Scope (e.g., country codes or individual IDs) |
time |
Time (e.g., time periods given by years, months, ...) |
A data frame object that contains a summary of the data set that can later be converted to a 'LaTeX' output using overview_latex
data(toydata)
overview_crosstab(
  dat = toydata,
  cond1 = gdp,
  cond2 = population,
  threshold1 = 25000,
  threshold2 = 27000,
  id = ccode,
  time = year
)
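The logic described for the cross table (aggregate duplicates by the mean, then sort units by two thresholds) can be sketched in base R. This is only an illustration: the data and the names `x`, `y`, `threshold_x`, and `threshold_y` are invented, and treating a value at or above the threshold as "high" is an assumption that may differ from the package's convention.

```r
# Invented example data with two observations for some id-time units
dat <- data.frame(
  id   = c("A", "A", "B", "C", "C"),
  time = c(2000, 2000, 2000, 2001, 2001),
  x    = c(10, 30, 5, 40, 60),
  y    = c(1, 3, 9, 8, 2)
)

# Step 1: aggregate to one row per id-time unit using the mean
agg <- aggregate(cbind(x, y) ~ id + time, data = dat, FUN = mean)

# Step 2: classify each unit against the two thresholds
# (assumption: "high" means at or above the threshold)
threshold_x <- 15
threshold_y <- 4
agg$x_high <- agg$x >= threshold_x
agg$y_high <- agg$y >= threshold_y

# Step 3: list the id-time units that fall into each of the four cells
agg$unit <- paste0(agg$id, " (", agg$time, ")")
cells <- tapply(
  agg$unit,
  list(x_high = agg$x_high, y_high = agg$y_high),
  paste,
  collapse = ", "
)
cells
```

Cells without any unit come back as NA, which makes empty combinations of the two conditions easy to spot.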
This function plots a heat map to visualize the coverage of the time-scope-units of the data. Options include total number of cases per time-scope-unit or relative number in percentage.
overview_heat( dat, id, time, perc = FALSE, exp_total = NULL, xaxis = "Time frame", yaxis = "Sample", col_low = "#dceaf2", col_high = "#2A5773", label = TRUE )
dat |
The data set |
id |
The scope (e.g., country codes or individual IDs). The axis is ordered in ascending order by default. |
time |
The time (e.g., time periods given by years, months, ...) |
perc |
If FALSE (default) plot returns the total number of observations per time-scope-unit. If TRUE, it returns the number of observations per time-scope-unit in percentage |
exp_total |
Expected total number of observations (i.e. maximum) for time unit. |
xaxis |
Label of your x axis ("Time frame" is default) |
yaxis |
Label of your y axis ("Sample" is default) |
col_low |
Hex color code for the lowest value (default is "#dceaf2") |
col_high |
Hex color code for the highest value (default is "#2A5773") |
label |
If TRUE (default), the total number of observations/percentages of observations are displayed. If FALSE, it returns no labels. |
A ggplot figure that presents sample coverage visually
data(toydata)
overview_heat(toydata, ccode, year, perc = TRUE, exp_total = 12)
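How perc and exp_total relate can be illustrated with a small base R sketch. It assumes that the percentage is the observed count per time-scope-unit divided by the expected total; that reading follows from the argument descriptions but is not confirmed against the package source. The data are invented.

```r
# Invented example data: one row per observation
dat <- data.frame(
  id   = c("A", "A", "A", "B", "B"),
  year = c(1990, 1990, 1991, 1990, 1991)
)

# Expected (maximum) number of observations per id-year unit
exp_total <- 2

# Count observations per id-year unit
counts <- as.data.frame(table(id = dat$id, year = dat$year), responseName = "n")

# Assumption: percentage = observed count / expected total * 100
counts$perc <- counts$n / exp_total * 100
counts
```

In the example above, exp_total = 12 would fit monthly observations within a year.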
Produces a 'LaTeX' output for output obtained via overview_tab and overview_crosstab
overview_latex( obj, title = "Time and scope of the sample", id = "Sample", time = "Time frame", crosstab = FALSE, cond1 = "Condition 1", cond2 = "Condition 2", save_out = FALSE, file_path, label = "tab:tab1", fontsize, file, path )
obj |
Overview object produced by overview_tab or overview_crosstab |
title |
Caption of the table (default is "Time and scope of the sample") |
id |
The name of the left column (default is "Sample"), will be ignored if crosstab is TRUE |
time |
The name of the right column (default is "Time frame"), will be ignored if crosstab is TRUE |
crosstab |
Logical argument, if TRUE produces a 'LaTeX' cross table (default is FALSE) |
cond1 |
Description for the first condition (character), will be ignored if crosstab is FALSE |
cond2 |
Description for the second condition (character), will be ignored if crosstab is FALSE |
save_out |
Optional argument, exports the output table as a .tex file, default is FALSE |
file_path |
Specifies the path and file name (.tex) where you store your output |
label |
Specifies the label (default is "tab:tab1") |
fontsize |
Specifies the font size (all 'LaTeX' font sizes such as "scriptsize" or "small" work) |
file |
This argument is deprecated. Please use "file_path" instead and add the full path. |
path |
This argument is deprecated. Please use "file_path" instead and add the full path. |
A 'LaTeX' output that can either be copy-pasted into a text document or exported directly as a .tex file
data(toydata)
overview_object <- overview_tab(dat = toydata, id = ccode, time = year)
overview_latex(
  obj = overview_object,
  title = "Some nice title",
  crosstab = FALSE
)

overview_object <- overview_tab(dat = toydata, id = ccode, time = year)
overview_latex(
  obj = overview_object,
  title = "Some nice title",
  file_path = "some/path_to/your_output_file.tex"
)

overview_ct_object <- overview_crosstab(
  dat = toydata,
  cond1 = gdp,
  cond2 = population,
  threshold1 = 25000,
  threshold2 = 27000,
  id = ccode,
  time = year
)
overview_latex(
  obj = overview_ct_object,
  title = "Some nice title for a cross tab",
  crosstab = TRUE,
  cond1 = "Name of first condition",
  cond2 = "Name of second condition"
)
This function plots a ggplot to visualize the distribution of NAs across all variables in the data set.
overview_na( dat, yaxis = "Variables", perc = TRUE, row_wise = FALSE, add = FALSE )
dat |
Your data set |
yaxis |
Label of your y axis ("Variables" is default) |
perc |
If TRUE (default) plot returns the number of NAs in percentage |
row_wise |
If TRUE (FALSE is default) plot returns the number of NAs per row |
add |
If TRUE (FALSE is default) it generates a new data frame with na_count and percentage share of NAs for each row |
Depending on the selection, the function returns a ggplot figure that presents the distribution of NAs in the data set or adds the information on the row-wise NA share
data(toydata)
overview_na(toydata, perc = FALSE)
Provides an overview of the overlap of two data sets. Cautionary note: This function is preliminary and currently supports only two data sets. An extension that allows comparing multiple data sets is in progress.
overview_overlap( dat1, dat2, dat1_id, dat2_id, dat1_name = "Data set 1", dat2_name = "Data set 2", plot_type = "bar" )
dat1 |
A first data set object |
dat2 |
A second data set object |
dat1_id |
Scope (e.g., country codes or individual IDs) of dat1. It is important that both ID variables are exactly the same to generate the perfect match. |
dat2_id |
Scope (e.g., country codes or individual IDs) of dat2. It is important that both ID variables are exactly the same to generate the perfect match. |
dat1_name |
Name of dat1 ("Data set 1" is the default) |
dat2_name |
Name of dat2 ("Data set 2" is the default) |
plot_type |
Type of plot ("bar" and "venn" are the two options, "bar" is the default). "venn" relies on the ggvenn function |
A ggplot2 object (a bar chart by default) that shows the overlap of two data sets.
## Not run:
data(toydata)
toydata2 <- toydata[which(toydata$year > 1992), ]
overview_overlap(
  dat1 = toydata,
  dat2 = toydata2,
  dat1_id = ccode,
  dat2_id = ccode
)
## End(Not run)
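Why the two ID variables need to match exactly can be seen in a small base R sketch of an ID overlap. It is not the package's implementation; the ID vectors are invented and the set operations are only one way to think about overlap.

```r
# Invented ID vectors standing in for dat1_id and dat2_id
ids1 <- c("AGO", "BEN", "FRA", "RWA")
ids2 <- c("BEN", "FRA", "GBR")

# IDs are compared as exact strings, so "FRA" and "fra" would not match
overlap <- list(
  both   = intersect(ids1, ids2),
  only_1 = setdiff(ids1, ids2),
  only_2 = setdiff(ids2, ids1)
)

# Number of IDs in each group
lengths(overlap)
```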
This function plots a ggplot to visualize the distribution of scope objects across the time frame.
overview_plot( dat, id, time, xaxis = "Time frame", yaxis = "Sample", asc = TRUE, color, dot_size = 2 )
dat |
Your data set |
id |
Your scope (e.g., country codes or individual IDs). If the id variable contains NAs, they will not be included in the plot. |
time |
Your time (e.g., time periods given by years, months, ...) |
xaxis |
Label of the x axis ("Time frame" is default) |
yaxis |
Label of the y axis ("Sample" is default) |
asc |
Sorting the y axis in ascending order ("TRUE" is default) |
color |
Optional argument that defines the color |
dot_size |
Optional argument that defines the dot size (default is 2) |
A ggplot figure that presents the sample information visually
data(toydata)
overview_plot(dat = toydata, id = ccode, time = year)
Function used in 'overview_na' to plot the absolute share of NA values
overview_plot_absolute( dat_result = NULL, theme_plot = NULL, yaxis = NULL, xaxis = NULL )
dat_result |
Data frame |
theme_plot |
Theme for the plot (pre-defined) |
yaxis |
Name for yaxis |
xaxis |
Name for xaxis |
The function returns a ggplot
Function used in 'overview_na' to plot the percentage share of NA values
overview_plot_percentage( dat_result = NULL, theme_plot = NULL, yaxis = NULL, xaxis = NULL )
dat_result |
Data frame |
theme_plot |
Theme for the plot (pre-defined) |
yaxis |
Name for yaxis |
xaxis |
Name for xaxis |
The function returns a ggplot
Provides an overview table for the time and scope conditions of a data set. If a data.table object is provided, the function uses data.table's syntax to perform the evaluation
overview_tab( dat, id, time = list(year = NULL, month = NULL, day = NULL), complex_date = FALSE )
dat |
A data frame or data table object |
id |
Scope (e.g., country codes or individual IDs) |
time |
Time (e.g., time periods given by years, months, ...). There are three options to add a date variable: 1) a character vector containing **one** time variable, 2) a time variable following the YYYY-MM-DD format, or 3) a list containing multiple time variables ('time = list(year = NULL, month = NULL, day = NULL)'). |
complex_date |
Boolean argument identifying if there is a more complex (list-wise) date_time parameter (FALSE is the default) |
A data frame object that contains a summary of a sample that can later be converted to a 'LaTeX' output using overview_latex
# With version 1 (and also 2):
data(toydata)
output_table <- overview_tab(dat = toydata, id = ccode, time = year)

# With version 3:
overview_tab(
  dat = toydata,
  id = ccode,
  time = list(
    year = toydata$year,
    month = toydata$month,
    day = toydata$day
  ),
  complex_date = TRUE
)
Internal function that calculates the 'overview_tab' for data.frame objects
overview_tab_df(dat2 = NULL, dat = NULL, id = NULL, time = NULL)
dat2 |
Your data set |
dat |
Your data set |
id |
Scope (e.g., country codes or individual IDs) |
time |
Time (e.g., time periods given by years, months, ...). There are three options to add a date variable: 1) a character vector containing **one** time variable, 2) a time variable following the YYYY-MM-DD format, or 3) a list containing multiple time variables ('time = list(year = NULL, month = NULL, day = NULL)'). |
A data.frame
Internal function that calculates the 'overview_tab' for data.table objects
overview_tab_dt(dat = NULL, id = NULL, time = NULL, col_names = NULL)
dat |
Your data set |
id |
Scope (e.g., country codes or individual IDs) |
time |
Time (e.g., time periods given by years, months, ...). There are three options to add a date variable: 1) a character vector containing **one** time variable, 2) a time variable following the YYYY-MM-DD format, or 3) a list containing multiple time variables ('time = list(year = NULL, month = NULL, day = NULL)'). |
col_names |
The column names (containing id and time) |
A data.table
Defines the theme for the 'overview_heat' plot function
theme_heat_plot()
A theme for the 'overview_heat' plot
Defines the theme for the 'overview_na' plot function
theme_na_plot()
A theme for the 'overview_na' plot
Small, artificially generated toy data set that comes in a cross-sectional format where the unit of analysis is either country-year or country-year-month. It provides artificial information for five countries (Angola, Benin, France, Rwanda, and the UK) for a time span from 1990 to 1999 to illustrate the use of the package.
data(toydata)
An object of class "data.frame"
ccode |
ISO3 country code (as character) for the countries in the sample (Angola, Benin, France, Rwanda, and UK) |
year |
A value between 1990 and 1999 |
month |
An abbreviation (MMM) for month (character) |
gdp |
A fake value for GDP (randomly generated) |
population |
A fake value for population (randomly generated) |
This data set was artificially created for the overviewR package.
data(toydata)
head(toydata)