linelist_aggregation

Outline of the report

Step 2: Line list aggregation

This report explains necessary data preparation processes after importing a line list data frame.

Input

Case data in a line list format (cases_clean.rds) produced in the previous step (link).

output

Case incidence grouped by time interval (cases.rds)

Loading libraries

The following code loads required packages; missing packages will be installed automatically, but will require a working internet connection for the installation to be successful.

library(dplyr) # For data wrangling
library(incidence2) # For generating incidence data

Input data

In this example, we will use a line list data extracted from Go.Data API. While the example data set is already cleaned, it is important to make sure you have applied necessary data cleaning steps such as:

Column names are cleaned and standardized
Data types are properly assigned
Missing / duplicated values are treated
Categorical values are cleaned

# Go.Data extract as an example
cases_clean <- readRDS(file.path("inst", "extdata", "godata_exports", "cases_clean.rds"))

head(cases_clean)

Format the date columns properly

You will need one date index column to produce the aggregated incidence output. While this is optional, in this example, since we have nrow(cases_clean[is.na(cases_clean$date_of_onset),]) out of nrow(cases_clean) cases without any date_of_osnet values, we will compliment this with the date_of_reporting value.

cases_clean <- cases_clean %>%
  mutate(
    date_of_onset = case_when(
      is.na(date_of_onset) ~ date_of_reporting,
      TRUE ~ date_of_onset)
  )

Apply necessary aggregation and rename

We will convert this into a time-indexed grouped data frame with three columns:

date_index - This is the time index for your analysis (eg. date of infection, onset of symptoms, reporting…)
group - Any grouping variable you might need (eg. places, age groups)
count - Counts of observations per time and group

There are two ways of achieving this step as follows.

Option 1: Using `incidence()` function from the incidence2 package.

In this method, the output will be in an incidence2 class object. With this, you can also simply plot your incidence data. See more information on the incidence2 package here.

# Generate the incidence object
cases <- incidence(cases_clean %>% filter(!is.na(admin_1_name)), 
                   groups = "admin_1_name", 
                   date_index = "date_of_onset")
  
# Rename necessary columns
cases <- cases %>%
  rename(group = admin_1_name)


head(cases)

plot(cases)

Option 2: Using `group_by()` function from the dplyr package.

In this method, your output will be a data frame object.

cases <- cases_clean %>% filter(!is.na(admin_1_name)) %>%
  group_by(date_index=date_of_onset, group=admin_1_name) %>%
  summarise(count = n())

head(cases)