linelist_aggregation
transform_linelist_aggregation.Rmd
Step 2: Line list aggregation
This report explains necessary data preparation processes after
importing a line list data frame.
Input
- Case data in a line list format
(
cases_clean.rds
) produced in the previous step (link).
output
- Case incidence grouped by time interval
(
cases.rds
)
Loading libraries
The following code loads required packages; missing packages will be installed automatically, but will require a working internet connection for the installation to be successful.
library(dplyr) # For data wrangling
library(incidence2) # For generating incidence data
Input data
In this example, we will use a line list data extracted from Go.Data API. While the example data set is already cleaned, it is important to make sure you have applied necessary data cleaning steps such as:
- Column names are cleaned and standardized
- Data types are properly assigned
- Missing / duplicated values are treated
- Categorical values are cleaned
Format the date columns properly
You will need one date index column to produce the aggregated
incidence output. While this is optional, in this example, since we have
nrow(cases_clean[is.na(cases_clean$date_of_onset),])
out of
nrow(cases_clean)
cases without any
date_of_osnet
values, we will compliment this with the
date_of_reporting
value.
cases_clean <- cases_clean %>%
mutate(
date_of_onset = case_when(
is.na(date_of_onset) ~ date_of_reporting,
TRUE ~ date_of_onset)
)
Apply necessary aggregation and rename
We will convert this into a time-indexed grouped data frame with three columns:
-
date_index
- This is the time index for your analysis (eg. date of infection, onset of symptoms, reporting…) -
group
- Any grouping variable you might need (eg. places, age groups) -
count
- Counts of observations per time and group
There are two ways of achieving this step as follows.
Option 1: Using incidence()
function from the
incidence2 package.
In this method, the output will be in an incidence2 class object. With this, you can also simply plot your incidence data. See more information on the incidence2 package here.
# Generate the incidence object
cases <- incidence(cases_clean %>% filter(!is.na(admin_1_name)),
groups = "admin_1_name",
date_index = "date_of_onset")
# Rename necessary columns
cases <- cases %>%
rename(group = admin_1_name)
head(cases)
plot(cases)