2 Simulation algorithm
The Doob-Gillespie algorithm is a stochastic exact solution that is used to
simulate continuous-time processes, with several applications
in biological modelling. The Doob-Gillespie algorithm
can be used in evolutionary biology, for example to efficiently simulate a
birth-death process. The island-mainland simulation in the DAISIEmainland
package uses a two-part Doob-Gillespie simulation, one for the mainland
(DAISIEmainland::sim_mainland
) and one for the island
(DAISIEmainland::sim_island
).
2.1 Mainland simulation
The mainland simulation uses a Doob-Gillespie algorithm to simulate the speciation and extinction of species under a Moran process, whereby every species extinction is immediately followed by a random species giving rise to two new species (speciation). This ensures a constant number of species on the mainland. Then the mainland phylogenetic data is fed into the island simulation (Section 2.2). Here the example shows mainland data being simulated for a time duration of one (million years), five initial mainland species, and a mainland extinction rate of one (per species per million years).
set.seed(
1,
kind = "Mersenne-Twister",
normal.kind = "Inversion",
sample.kind = "Rejection"
)
mainland <- DAISIEmainland:::sim_mainland(
total_time = 1,
m = 5,
mainland_ex = 1
)
The output is a list of five mainland clades called multi_mainland_clade
(
see Section A.2.4).
mainland
#> [[1]]
#> spec_id main_anc_id spec_type branch_code branch_t spec_origin_t spec_ex_t
#> 1 1 1 E A NaN 0.0000000 0.7579891
#> 2 10 1 E AA 0.7579891 0.7579891 0.9493026
#> 3 11 1 C AB 0.7579891 0.7579891 1.0000000
#> 4 12 1 C AAA 0.9493026 0.9493026 1.0000000
#> 5 13 1 C AAB 0.9493026 0.9493026 1.0000000
#>
#> [[2]]
#> spec_id main_anc_id spec_type branch_code branch_t spec_origin_t spec_ex_t
#> 1 2 2 E A NaN 0.0000000 0.1789954
#> 2 8 2 C AA 0.1789954 0.1789954 1.0000000
#> 3 9 2 E AB 0.1789954 0.1789954 0.7579891
#>
#> [[3]]
#> spec_id main_anc_id spec_type branch_code branch_t spec_origin_t spec_ex_t
#> 1 3 3 E A NaN 0.0000000 0.1510364
#> 2 6 3 E AA 0.1510364 0.1510364 0.9493026
#> 3 7 3 C AB 0.1510364 0.1510364 1.0000000
#>
#> [[4]]
#> spec_id main_anc_id spec_type branch_code branch_t spec_origin_t spec_ex_t
#> 1 4 4 E A NaN 0 0.1510364
#>
#> [[5]]
#> spec_id main_anc_id spec_type branch_code branch_t spec_origin_t spec_ex_t
#> 1 5 5 E A NaN 0 0.1789954
Specifically focusing on one of these mainland clades (see Section A.2.5):
mainland[[1]]
#> spec_id main_anc_id spec_type branch_code branch_t spec_origin_t spec_ex_t
#> 1 1 1 E A NaN 0.0000000 0.7579891
#> 2 10 1 E AA 0.7579891 0.7579891 0.9493026
#> 3 11 1 C AB 0.7579891 0.7579891 1.0000000
#> 4 12 1 C AAA 0.9493026 0.9493026 1.0000000
#> 5 13 1 C AAB 0.9493026 0.9493026 1.0000000
We can see that the information on this mainland clade include the ID of each
species (spec_id
). The ID of the mainland ancestor from which that species
derived (main_anc_id
). The type of species (spec_type
) these can be I
,
C
, or E
for single lineage clade, cladogenetic species in a clade, or
extinct species, respectively. All mainland species are initialised as
spec_id = I
but as the simulation progresses species become either
part of a clade (C
) or extinct (E
). By keeping the extinct species we have
a full history of the mainland and not the so-called reconstructed history which
would only contain the living (extant) species. The branch_code
provides the
evolutionary relationships of the species in the clade and allows the topology
of the clade to be constructed. The branch_t
is the time when the species
speciated from their common ancestor (in time after the start of the simulation
and not time before present). The spec_origin_t
is the time the species
originated and is often equal to the branch_t
. The spec_ex_t
is the time the species went extinct (again in time after the start of the simulation). Species
that do not go extinct are give a spec_ex_t
equal to the total time of the
simulation.
Another example of what can happen on the mainland is the extinction of a singleton lineage before it underwent speciation.
mainland[[4]]
#> spec_id main_anc_id spec_type branch_code branch_t spec_origin_t spec_ex_t
#> 1 4 4 E A NaN 0 0.1510364
2.2 Island simulation
The island simulation runs after the mainland simulation and uses the mainland data to determine which species can immigrate to the island through time. The Doob-Gillespie algorithm is altered to accommodate the dynamic mainland pool. The time-steps are bounded to not jump over changes on the mainland to ensure the present state of the system (i.e. species on mainland) is always up-to-date. The algorithm checks whether any changes have occured on the mainland since the last time step and if so, the system is updated and the returned to the time at which the mainland last changed. This is valid owing to the Markov (memoryless) property of the Doob-Gillespie algorithm.
set.seed(
1,
kind = "Mersenne-Twister",
normal.kind = "Inversion",
sample.kind = "Rejection"
)
island_tbl <- DAISIEmainland:::sim_island(
total_time = 1,
island_pars = c(1, 1, 10, 1, 1),
mainland_clade = mainland[[1]],
mainland_sample_prob = 1,
mainland_sample_type = "complete"
)
island_tbl
#> spec_id main_anc_id col_t spec_type branch_code branch_t ana_origin
#> 1 1 1 0.7551818 A <NA> NaN mainland_extinction
#> 2 10 10 0.8507697 A <NA> NaN mainland_extinction
The island simulation outputs a island_tbl
(see Section A.2.6).
The island_tbl
includes the species ID (spec_id
), mainland ancestor
ID (main_anc_id
), species type (spec_type
), branching code (branch_code
), and branching time (branch_t
). It also includes the time at which the species
colonised the island (col_t
) and the reason a species is anagenetic (i.e.
endemic to the island without being in an island clade). The reasons for a
species to be anagenetic are: the mainland population of the species goes
extinct (mainland_extinction
), the species undergoes anagenesis on the island
(immig_parent
), or the species formed a clade but all other members of the
clade went extinct before the present (clade_extinct
).
One major difference between the island and mainland data that are produced by
sim_mainland()
and sim_island()
is that the latter only has
information from the reconstructed point of view. This means extinct species
are removed from the data set and only species extant at the end of the
simulation are included.
For both the island and mainland Doob-Gillespie algorithms time steps are sampled from an exponential distribution with rate:
\[X = \lambda e ^{-\lambda x}, \text{ where } \lambda = \sum_j r_j\]
where \(r_j\) are the rates. For the mainland process this is just the rate of mainland extinction (\(\mu_M\)), as this is the only mainland parameter, whereas, for the island algorithm \(r_j\) are the rates of cladogenesis (\(\lambda^c\)), island extinction (\(\mu\)), colonisation (\(\gamma\)), and anagenesis (\(\lambda^a\)). After the time step (\(\Delta\) t) is sampled, the event is sampled from a dynamic discrete probability distribution, weighted by its rate (propensity) relative to all other rates:
\[r_i / \sum_j r_j\]
The system is then updated and the algorithm repeats until the time step exceeds the total time of the simulation.
The function that encapsulates and runs both of these simulations is sim_island_with_mainland()
. This function also includes the formatting of the data and the assignment of an endemicity status to each island colonist, which is needed by the DAISIE
inference model. The DAISIEmainland
simulation outputs two data sets: (1) contains full information of all species colonisation times, and (2) an incomplete information data set which resembles what an empiricist would have access to (see Section A.2.1). These two data sets allow for the quantification of error in
estimation when the empiricist does not have access to all the data.
set.seed(
1,
kind = "Mersenne-Twister",
normal.kind = "Inversion",
sample.kind = "Rejection"
)
daisie_mainland_data <- sim_island_with_mainland(
total_time = 1,
m = 5,
island_pars = c(1, 1, 10, 1, 1),
mainland_ex = 1,
mainland_sample_prob = 1,
mainland_sample_type = "unsampled",
replicates = 1
)
daisie_mainland_data
#> $ideal_multi_daisie_data
#> $ideal_multi_daisie_data[[1]]
#> $ideal_multi_daisie_data[[1]][[1]]
#> $ideal_multi_daisie_data[[1]][[1]]$island_age
#> [1] 1
#>
#> $ideal_multi_daisie_data[[1]][[1]]$not_present
#> [1] 3
#>
#>
#> $ideal_multi_daisie_data[[1]][[2]]
#> $ideal_multi_daisie_data[[1]][[2]]$branching_times
#> [1] 1.0000000 0.1026008
#>
#> $ideal_multi_daisie_data[[1]][[2]]$stac
#> [1] 2
#>
#> $ideal_multi_daisie_data[[1]][[2]]$missing_species
#> [1] 0
#>
#>
#> $ideal_multi_daisie_data[[1]][[3]]
#> $ideal_multi_daisie_data[[1]][[3]]$branching_times
#> [1] 1.0000000 0.9627315
#>
#> $ideal_multi_daisie_data[[1]][[3]]$stac
#> [1] 2
#>
#> $ideal_multi_daisie_data[[1]][[3]]$missing_species
#> [1] 0
#>
#>
#>
#>
#> $empirical_multi_daisie_data
#> $empirical_multi_daisie_data[[1]]
#> $empirical_multi_daisie_data[[1]][[1]]
#> $empirical_multi_daisie_data[[1]][[1]]$island_age
#> [1] 1
#>
#> $empirical_multi_daisie_data[[1]][[1]]$not_present
#> [1] 3
#>
#>
#> $empirical_multi_daisie_data[[1]][[2]]
#> $empirical_multi_daisie_data[[1]][[2]]$branching_times
#> [1] 1.0000000 0.8489636
#>
#> $empirical_multi_daisie_data[[1]][[2]]$stac
#> [1] 2
#>
#> $empirical_multi_daisie_data[[1]][[2]]$missing_species
#> [1] 0
#>
#>
#> $empirical_multi_daisie_data[[1]][[3]]
#> $empirical_multi_daisie_data[[1]][[3]]$branching_times
#> [1] 1.00000 0.99999
#>
#> $empirical_multi_daisie_data[[1]][[3]]$stac
#> [1] 5
#>
#> $empirical_multi_daisie_data[[1]][[3]]$missing_species
#> [1] 0