Outline of the report

Step 2: Line list aggregation

This report explains the data preparation steps needed after importing a line list data frame.

Input

  • Case data in a line list format (cases_clean.rds) produced in the previous step (link).

Output

  • Case incidence grouped by time interval (cases.rds)

Loading libraries

The following code loads the required packages. Note that library() only loads packages that are already installed; installing a missing package requires a working internet connection.

library(dplyr) # For data wrangling
library(incidence2) # For generating incidence data
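One way to handle missing packages before the library() calls is sketched below. The helper names missing_packages() and install_if_missing() are hypothetical and not part of the original report; the sketch only assumes base R's requireNamespace() and install.packages().

```r
# Hypothetical helper: return the names of packages that are not installed.
missing_packages <- function(pkgs) {
  installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
  pkgs[!installed]
}

# Hypothetical helper: install only the packages that are missing.
# install.packages() needs a working internet connection.
install_if_missing <- function(pkgs) {
  to_install <- missing_packages(pkgs)
  if (length(to_install) > 0) {
    install.packages(to_install)
  }
  invisible(to_install)
}
```

Calling install_if_missing(c("dplyr", "incidence2")) before the library() calls would then install whichever of the two packages is absent and do nothing otherwise.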

Input data

In this example, we use line list data extracted from the Go.Data API. Although the example data set is already cleaned, it is important to check that the necessary data cleaning steps have been applied to your own data, such as:

  • Column names are cleaned and standardized
  • Data types are properly assigned
  • Missing / duplicated values are treated
  • Categorical values are cleaned

# Go.Data extract as an example
cases_clean <- readRDS(file.path("inst", "extdata", "godata_exports", "cases_clean.rds"))

head(cases_clean)
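The cleaning checks listed above can be sketched with a few base R calls. The toy data frame below is invented for illustration and deliberately contains problems; only the column names date_of_onset and admin_1_name are taken from this report, and the naming rule in the regular expression is an assumption.

```r
# Toy line list (made-up values) with a duplicated row, missing values
# and inconsistent spelling of a category
toy_cases <- data.frame(
  case_id = c("c1", "c2", "c2", "c4"),
  date_of_onset = as.Date(c("2021-01-03", NA, NA, "2021-01-05")),
  admin_1_name = c("North", "north", "north", NA),
  stringsAsFactors = FALSE
)

# Column names: assumed convention of lower-case letters, digits, underscores
names_ok <- grepl("^[a-z][a-z0-9_]*$", names(toy_cases))

# Data types: first class of each column (dates should be "Date")
col_classes <- vapply(toy_cases, function(col) class(col)[1], character(1))

# Missing values per column
n_missing <- colSums(is.na(toy_cases))

# Number of fully duplicated rows
n_duplicated <- sum(duplicated(toy_cases))

# Categorical values: inconsistent spellings show up as separate levels
category_counts <- table(toy_cases$admin_1_name, useNA = "ifany")
```

Inspecting names_ok, col_classes, n_missing, n_duplicated and category_counts gives a quick overview of what still needs cleaning before aggregation.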

Format the date columns properly

You need one date index column to produce the aggregated incidence output. In this example, some cases have no date_of_onset value (sum(is.na(cases_clean$date_of_onset)) out of nrow(cases_clean) cases), so we complement the missing values with the date_of_reporting value. This step is optional.

cases_clean <- cases_clean %>%
  mutate(
    date_of_onset = case_when(
      is.na(date_of_onset) ~ date_of_reporting,
      TRUE ~ date_of_onset)
  )
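The same fallback can be written more compactly with dplyr's coalesce(), which returns the first non-missing value across its arguments. The sketch below uses made-up dates so it can be run on its own; it is an alternative to the case_when() call above, not the method used in this report.

```r
library(dplyr)

# Toy data (made-up dates): the second case has no onset date
toy_cases <- data.frame(
  date_of_onset = as.Date(c("2021-01-03", NA, "2021-01-05")),
  date_of_reporting = as.Date(c("2021-01-04", "2021-01-06", "2021-01-07"))
)

# Count the cases without an onset date before filling them
n_missing <- sum(is.na(toy_cases$date_of_onset))

# Use date_of_reporting wherever date_of_onset is missing
toy_filled <- toy_cases %>%
  mutate(date_of_onset = coalesce(date_of_onset, date_of_reporting))
```

Both columns must have the same type (here Date) for coalesce() to combine them.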

Apply necessary aggregation and rename

We will convert this into a time-indexed grouped data frame with three columns:

  • date_index - This is the time index for your analysis (e.g. date of infection, onset of symptoms, reporting…)
  • group - Any grouping variable you might need (e.g. places, age groups)
  • count - Counts of observations per time and group

There are two ways to do this.

Option 1: Using the incidence() function from the incidence2 package.

With this method, the output is an incidence2 object, which can be plotted directly with plot(). See the incidence2 package documentation for more information.

# Generate the incidence object
cases <- incidence(cases_clean %>% filter(!is.na(admin_1_name)), 
                   groups = "admin_1_name", 
                   date_index = "date_of_onset")
  
# Rename necessary columns
cases <- cases %>%
  rename(group = admin_1_name)


head(cases)
plot(cases)

Option 2: Using the group_by() and summarise() functions from the dplyr package.

With this method, the output is a plain data frame.

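# Count cases per date and group, dropping rows without a group value
cases <- cases_clean %>%
  filter(!is.na(admin_1_name)) %>%
  group_by(date_index = date_of_onset, group = admin_1_name) %>%
  summarise(count = n(), .groups = "drop") # return an ungrouped data frame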

head(cases)
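Option 2 as written counts cases per day. If a coarser time interval is wanted, the dates can be binned before grouping. The sketch below assumes weekly bins starting on Monday and uses made-up data; it relies on base R's cut() for dates, which is one possible choice and not necessarily how this report's output was produced.

```r
library(dplyr)

# Toy data (made-up values): 2021-01-04 and 2021-01-11 are Mondays
toy_cases <- data.frame(
  date_of_onset = as.Date(c("2021-01-04", "2021-01-05", "2021-01-10",
                            "2021-01-11", "2021-01-12")),
  admin_1_name = c("A", "A", "A", "B", "B")
)

weekly_cases <- toy_cases %>%
  # Replace each date by the first day (Monday) of its week
  mutate(date_index = as.Date(cut(date_of_onset, breaks = "week"))) %>%
  group_by(date_index, group = admin_1_name) %>%
  summarise(count = n(), .groups = "drop")
```

The result keeps the same three columns (date_index, group, count), so downstream steps that expect the daily output should work unchanged with the weekly one.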