2 Simulation algorithm

The Doob-Gillespie algorithm is a stochastic exact solution that is used to simulate continuous-time processes, with several applications in biological modelling. The Doob-Gillespie algorithm can be used in evolutionary biology, for example to efficiently simulate a birth-death process. The island-mainland simulation in the DAISIEmainland package uses a two-part Doob-Gillespie simulation, one for the mainland (DAISIEmainland::sim_mainland) and one for the island (DAISIEmainland::sim_island).

2.1 Mainland simulation

The mainland simulation uses a Doob-Gillespie algorithm to simulate the speciation and extinction of species under a Moran process, whereby every species extinction is immediately followed by a random species giving rise to two new species (speciation). This ensures a constant number of species on the mainland. Then the mainland phylogenetic data is fed into the island simulation (Section 2.2). Here the example shows mainland data being simulated for a time duration of one (million years), five initial mainland species, and a mainland extinction rate of one (per species per million years).

set.seed(
  1,
  kind = "Mersenne-Twister",
  normal.kind = "Inversion",
  sample.kind = "Rejection"
)
mainland <- DAISIEmainland:::sim_mainland(
  total_time = 1,
  m = 5,
  mainland_ex = 1
)

The output is a list of five mainland clades called multi_mainland_clade ( see Section A.2.4).

mainland
#> [[1]]
#>   spec_id main_anc_id spec_type branch_code  branch_t spec_origin_t spec_ex_t
#> 1       1           1         E           A       NaN     0.0000000 0.7579891
#> 2      10           1         E          AA 0.7579891     0.7579891 0.9493026
#> 3      11           1         C          AB 0.7579891     0.7579891 1.0000000
#> 4      12           1         C         AAA 0.9493026     0.9493026 1.0000000
#> 5      13           1         C         AAB 0.9493026     0.9493026 1.0000000
#> 
#> [[2]]
#>   spec_id main_anc_id spec_type branch_code  branch_t spec_origin_t spec_ex_t
#> 1       2           2         E           A       NaN     0.0000000 0.1789954
#> 2       8           2         C          AA 0.1789954     0.1789954 1.0000000
#> 3       9           2         E          AB 0.1789954     0.1789954 0.7579891
#> 
#> [[3]]
#>   spec_id main_anc_id spec_type branch_code  branch_t spec_origin_t spec_ex_t
#> 1       3           3         E           A       NaN     0.0000000 0.1510364
#> 2       6           3         E          AA 0.1510364     0.1510364 0.9493026
#> 3       7           3         C          AB 0.1510364     0.1510364 1.0000000
#> 
#> [[4]]
#>   spec_id main_anc_id spec_type branch_code branch_t spec_origin_t spec_ex_t
#> 1       4           4         E           A      NaN             0 0.1510364
#> 
#> [[5]]
#>   spec_id main_anc_id spec_type branch_code branch_t spec_origin_t spec_ex_t
#> 1       5           5         E           A      NaN             0 0.1789954

Specifically focusing on one of these mainland clades (see Section A.2.5):

mainland[[1]]
#>   spec_id main_anc_id spec_type branch_code  branch_t spec_origin_t spec_ex_t
#> 1       1           1         E           A       NaN     0.0000000 0.7579891
#> 2      10           1         E          AA 0.7579891     0.7579891 0.9493026
#> 3      11           1         C          AB 0.7579891     0.7579891 1.0000000
#> 4      12           1         C         AAA 0.9493026     0.9493026 1.0000000
#> 5      13           1         C         AAB 0.9493026     0.9493026 1.0000000

We can see that the information on this mainland clade include the ID of each species (spec_id). The ID of the mainland ancestor from which that species derived (main_anc_id). The type of species (spec_type) these can be I, C, or E for single lineage clade, cladogenetic species in a clade, or extinct species, respectively. All mainland species are initialised as spec_id = I but as the simulation progresses species become either part of a clade (C) or extinct (E). By keeping the extinct species we have a full history of the mainland and not the so-called reconstructed history which would only contain the living (extant) species. The branch_code provides the evolutionary relationships of the species in the clade and allows the topology of the clade to be constructed. The branch_t is the time when the species speciated from their common ancestor (in time after the start of the simulation and not time before present). The spec_origin_t is the time the species originated and is often equal to the branch_t. The spec_ex_t is the time the species went extinct (again in time after the start of the simulation). Species that do not go extinct are give a spec_ex_t equal to the total time of the simulation.

Another example of what can happen on the mainland is the extinction of a singleton lineage before it underwent speciation.

mainland[[4]]
#>   spec_id main_anc_id spec_type branch_code branch_t spec_origin_t spec_ex_t
#> 1       4           4         E           A      NaN             0 0.1510364

2.2 Island simulation

The island simulation runs after the mainland simulation and uses the mainland data to determine which species can immigrate to the island through time. The Doob-Gillespie algorithm is altered to accommodate the dynamic mainland pool. The time-steps are bounded to not jump over changes on the mainland to ensure the present state of the system (i.e. species on mainland) is always up-to-date. The algorithm checks whether any changes have occured on the mainland since the last time step and if so, the system is updated and the returned to the time at which the mainland last changed. This is valid owing to the Markov (memoryless) property of the Doob-Gillespie algorithm.

set.seed(
  1,
  kind = "Mersenne-Twister",
  normal.kind = "Inversion",
  sample.kind = "Rejection"
)
island_tbl <- DAISIEmainland:::sim_island(
  total_time = 1,
  island_pars = c(1, 1, 10, 1, 1),
  mainland_clade = mainland[[1]],
  mainland_sample_prob = 1,
  mainland_sample_type = "complete"
)
island_tbl
#>   spec_id main_anc_id     col_t spec_type branch_code branch_t          ana_origin
#> 1       1           1 0.7551818         A        <NA>      NaN mainland_extinction
#> 2      10          10 0.8507697         A        <NA>      NaN mainland_extinction

The island simulation outputs a island_tbl (see Section A.2.6). The island_tbl includes the species ID (spec_id), mainland ancestor ID (main_anc_id), species type (spec_type), branching code (branch_code), and branching time (branch_t). It also includes the time at which the species colonised the island (col_t) and the reason a species is anagenetic (i.e. endemic to the island without being in an island clade). The reasons for a species to be anagenetic are: the mainland population of the species goes extinct (mainland_extinction), the species undergoes anagenesis on the island (immig_parent), or the species formed a clade but all other members of the clade went extinct before the present (clade_extinct).

One major difference between the island and mainland data that are produced by sim_mainland() and sim_island() is that the latter only has information from the reconstructed point of view. This means extinct species are removed from the data set and only species extant at the end of the simulation are included.

For both the island and mainland Doob-Gillespie algorithms time steps are sampled from an exponential distribution with rate:

\[X = \lambda e ^{-\lambda x}, \text{ where } \lambda = \sum_j r_j\]

where \(r_j\) are the rates. For the mainland process this is just the rate of mainland extinction (\(\mu_M\)), as this is the only mainland parameter, whereas, for the island algorithm \(r_j\) are the rates of cladogenesis (\(\lambda^c\)), island extinction (\(\mu\)), colonisation (\(\gamma\)), and anagenesis (\(\lambda^a\)). After the time step (\(\Delta\) t) is sampled, the event is sampled from a dynamic discrete probability distribution, weighted by its rate (propensity) relative to all other rates:

\[r_i / \sum_j r_j\]

The system is then updated and the algorithm repeats until the time step exceeds the total time of the simulation.

The function that encapsulates and runs both of these simulations is sim_island_with_mainland(). This function also includes the formatting of the data and the assignment of an endemicity status to each island colonist, which is needed by the DAISIE inference model. The DAISIEmainland simulation outputs two data sets: (1) contains full information of all species colonisation times, and (2) an incomplete information data set which resembles what an empiricist would have access to (see Section A.2.1). These two data sets allow for the quantification of error in estimation when the empiricist does not have access to all the data.

set.seed(
  1,
  kind = "Mersenne-Twister",
  normal.kind = "Inversion",
  sample.kind = "Rejection"
)
daisie_mainland_data <- sim_island_with_mainland(
  total_time = 1,
  m = 5,
  island_pars = c(1, 1, 10, 1, 1), 
  mainland_ex = 1,
  mainland_sample_prob = 1,
  mainland_sample_type = "unsampled",
  replicates = 1 
)
daisie_mainland_data
#> $ideal_multi_daisie_data
#> $ideal_multi_daisie_data[[1]]
#> $ideal_multi_daisie_data[[1]][[1]]
#> $ideal_multi_daisie_data[[1]][[1]]$island_age
#> [1] 1
#> 
#> $ideal_multi_daisie_data[[1]][[1]]$not_present
#> [1] 3
#> 
#> 
#> $ideal_multi_daisie_data[[1]][[2]]
#> $ideal_multi_daisie_data[[1]][[2]]$branching_times
#> [1] 1.0000000 0.1026008
#> 
#> $ideal_multi_daisie_data[[1]][[2]]$stac
#> [1] 2
#> 
#> $ideal_multi_daisie_data[[1]][[2]]$missing_species
#> [1] 0
#> 
#> 
#> $ideal_multi_daisie_data[[1]][[3]]
#> $ideal_multi_daisie_data[[1]][[3]]$branching_times
#> [1] 1.0000000 0.9627315
#> 
#> $ideal_multi_daisie_data[[1]][[3]]$stac
#> [1] 2
#> 
#> $ideal_multi_daisie_data[[1]][[3]]$missing_species
#> [1] 0
#> 
#> 
#> 
#> 
#> $empirical_multi_daisie_data
#> $empirical_multi_daisie_data[[1]]
#> $empirical_multi_daisie_data[[1]][[1]]
#> $empirical_multi_daisie_data[[1]][[1]]$island_age
#> [1] 1
#> 
#> $empirical_multi_daisie_data[[1]][[1]]$not_present
#> [1] 3
#> 
#> 
#> $empirical_multi_daisie_data[[1]][[2]]
#> $empirical_multi_daisie_data[[1]][[2]]$branching_times
#> [1] 1.0000000 0.8489636
#> 
#> $empirical_multi_daisie_data[[1]][[2]]$stac
#> [1] 2
#> 
#> $empirical_multi_daisie_data[[1]][[2]]$missing_species
#> [1] 0
#> 
#> 
#> $empirical_multi_daisie_data[[1]][[3]]
#> $empirical_multi_daisie_data[[1]][[3]]$branching_times
#> [1] 1.00000 0.99999
#> 
#> $empirical_multi_daisie_data[[1]][[3]]$stac
#> [1] 5
#> 
#> $empirical_multi_daisie_data[[1]][[3]]$missing_species
#> [1] 0