Geographical migration and fitness dynamics of Streptococcus pneumoniae - Nature

Data sources and processing

Pneumococcal sequence data and metadata

The genomes included in this study were collected as part of the Global Pneumococcal Sequencing project (GPS), which is a global genomic survey of S. pneumoniae⁴⁰. The invasive disease isolates included here were collected by the National Institute for Communicable Disease in South Africa from 2000 to 2014 (ref. ²²). In the initial phase of genome sequencing, approximately 300 invasive-disease isolates from each year (2005–2014) were selected with a specific target age breakdown (50% from <3 year olds, 25% from 3–5 year olds, 25% from >5 year olds). In the second phase of sequencing, approximately 200 disease isolates from 2000 to 2004 and approximately 100 invasive disease isolates from 2005 to 2010 from children <5 years old were randomly selected for sequencing²² (Supplementary Table 2). The carriage isolates were collected in Soweto, Gauteng (n = 736; collection years 2010, 2012 and 2013) and Agincourt, Mpumalanga (n = 1,114; collection years 2009, 2011 and 2013) by the Wits Vaccines and Infectious Diseases Analytics Research Unit. A random sample of 400 carriage isolates from each year were chosen for genome sequencing (Supplementary Table 2). We included both carriage and invasive-disease isolates and conducted sensitivity analyses throughout to confirm the methods were robust to both carriage and invasive disease individually. Invasive disease is defined as the bacterium being isolated from a typically sterile site. The majority of total isolates were from children aged <5 years (75.0%); 7.6% were from individuals aged 5–20 years and only 17.5% were from adults (>20 years) (Supplementary Table 7). These were distributed across the before and after PCV periods. We additionally included similar GPSCs from the GPS database (for context within the global population) for the RR analysis. We utilized metadata that included collection year and month, residence province of the patient, age of the patient, sampling site and clinical manifestation. The range of sampling sites included nasopharyngeal swabs (for carriage), blood, pleural fluid, cerebrospinal fluid, peritoneal fluid, pus and other joint fluid (for invasive disease).

Population data

We estimated the population for each municipality (n = 234) across South Africa using the population-size estimates from LandScan 2017 (refs. ^41,42) (Extended Data Fig. 11).

Mobility data

The mobility data used in this study were collected using Meta Data for Good Disaster maps from South Africa. These are initiated at the onset of a disaster—in this case, the SARS-CoV-2 pandemic—and track the geographical movement of Meta users⁴³. We used the baseline (adjusting to 2 weeks before) human mobility for each month from January 2020 to July 2021 (refs. ^43,44) to attain a mobility probability from and to each municipality. For each origin (home) municipality (n = 234), we determined the mean monthly number of Meta users that were in each destination municipality in South Africa. Location pairs with a value of zero were given a value of ten. We divided each cell by the total number of users across all destinations for that origin municipality. Each cell in the resultant original–destination matrix therefore represented the probability of being in each destination municipality given your home municipality. This is the probability of mobility from each municipality to each other municipality after 2019. Because we do not have mobility data from the exact years the genomes were sampled, we adjusted the diagonal of the mobility matrix using an estimated parameter. This allows people to stay more or less at home and mitigates the effect of non-year matched mobility data. We define the radius of gyration (R_i) for each municipality as follows:

$${R}_{i}=\sqrt{{\sum }_{i}{m}_{i}{{d}_{i}}^{2}}$$

(1)

Where d_i is distance to region i, and m_i is the probability of mobility to region i. The sum is across all municipalities (n = 234)⁴⁵ (Extended Data Fig. 11).

Generation time distribution

We used a simulation framework to estimate the overall generation time using the separate contributions of the carriage durations and the incubation period. This approach has previously been used for other pathogens⁴⁶.

We sampled 1,000 carriage durations from an exponential distribution with means that are inverse to the clearance rates estimated across serotypes in ref. ³¹ (clearance rate = 0.026 (95% CI = 0.025–0.028) episode per day) and in ref. ³² (clearance rate = 0.032 (95% CI = 0.030–0.034) episodes per day)³².

To sample the day of transmission, we randomly sampled a time point between zero and the time of clearance for each individual. We then separately sampled an incubation period using a uniform distribution of between 1 and 5 days. To account for longer carriages resulting in more opportunities for transmission, we sampled from the distribution of generation, with the probability of sampling each individual weighted by the total carriage duration. The total generation time is then the sum of the duration to transmission and the incubation period.

We repeated these steps 10,000 times and estimated the mean and standard deviation of this final distribution assuming that the generation time follows a gamma distribution with a mean of 35 and a standard deviation of 35 (exponential distribution with a rate of 1/0.096). By comparing the histogram of the distribution with the gamma distribution, this seems a reasonable assumption (Supplementary Fig. 4).

Sample culture and genome sequencing

The pneumococcal isolates were selectively cultured on BD Trypticase soy agar II with 5% sheep blood (Beckton Dickinson) and incubated overnight at 37 °C in 5% CO₂. Genomic DNA was then manually extracted using a modified QIAamp DNA Mini kit (Qiagen) protocol. As part of GPS, pneumococcal isolates were whole-genome sequenced on an Illumina HiSeq platform to produce paired-end reads with an average of 100–125 bp in length, and data were deposited into the European Nucleotide Database. Whole-genome sequence data were processed as previously described^36,47.

AMR

We performed predictive antimicrobial susceptibility profiling using the CDC-AMR pipeline for three classes of antimicrobials: β-lactams (penicillin; encoded by the genes pbp1A, pbp2B and pbp2X)^48,49; sulfonamides (co-trimoxazole; folA and folP); and macrolides (erythromycin and clindamycin; ermB and mefA)^50,51. This was done for 6,798 randomly selected isolates⁴⁰.

Constructing time-resolved phylogenetic trees

We selected the GPSCs for which we had genomes from each of South Africa’s nine provinces and for which we had a minimum of 50 sequences in total to build phylogenies, henceforth referred to as ‘dominant GPSCs’. There were nine dominant GPSCs: GPSC1, GPSC2, GPSC5, GPSC10, GPSC13, GPSC14, GPSC17, GPSC68 and GPSC79 (n = 2.575). Assembly was performed using Wellcome Sanger Institute pathogen informatics automated pipelines and is freely available for download from GitHub under an open-source licence, GNU GPL 3 (ref. ⁵²). For each sample, sequence reads were used to create multiple assemblies using VelvetOptimiser (v.2.2.5) and Velvet (v.1.2.10)⁵³. An assembly improvement step was applied to the assembly with the best N50 and contigs scaffolded using SSPACE (v.2.0)⁵⁴, and sequence gaps were filled using GapFiller (v.1.11)⁵⁵. Assembly quality control parameters included a minimum average sequencing depth of 20× and an assembly length of 1.9–2.3 Mb. Sequences with more than 15% heterozygous SNP sites were excluded.

We created reference genomes for each GPSC using ABACAS (v.1.3.1) to order the contigs from a representative of each GPSC mapped to S. pneumoniae (strain ATCC 700669/Spain 23F-1) (EMBL accession: FM211187)⁵⁶. Any contigs that did not align were concatenated to the end. We multiply mapped all genomes from each dominant GPSC against these references, respectively, using a custom mapping, variant calling and local realignment around indels pipeline (multiple_mappings_to_bam.py)⁵⁷ using bwa-MEM (v.0.7.17)⁵⁸ and samtools mpileup (v.1.6)⁵⁹. The minimum base quality for a base to be considered was 50. The minimum mapping quality for a SNP to be called was 20, with a minimum of 8 reads matching the SNP. We built trees masking recombination regions using Gubbins (v.2.4.1)⁶⁰ with the hybrid model that uses FastTree for the first iteration and RAxML subsequently⁶¹ and a GTR model. We converted branch length to time using BactDating (v.1.0) with a mixed gamma, relaxed clock model⁶². We compared concordance between BEAST (v.1.10.4)⁶³ with both strict and relaxed clocks, and a Bayesian skyline prior. As the results were concordant, we used BactDating owing to its shorter runtime (Supplementary Fig. 14).

RR framework

We used a RR framework to investigate the risk of genetic similarity across geographical distance⁶⁴. We compared the location (loc) and label (G) (that is, GPSC or genetic similarity) of pairs of sequences that were collected around the same time (t). This approach has been shown to be robust to substantial biases in timing and location of isolate collection⁶⁴. We first constructed pair-wise matrices comparing every isolate to every other isolate (n pairs = 6,910). In the numerator was the ratio of pairs that were the same GPSC, collected within a year of each other, from the same province, over the total number of pairs collected within a year of each other from the same province. The denominator was the ratio of pairs that were the same GPSC, collected within a year of each other, from distant provinces (>1,000 km apart) (L_ref), over the total number of pairs collected within a year of each other from distant provinces. Geographical distances were calculated based on the centroid coordinates of each province. To demonstrate the suitability of using centroid distances, we simulated a spatial transmission process for 1,000 separate chains in which at each generation, a daughter point is placed at a randomly located location 350 m in each of the x and y direction. This was repeated over 20 generations. We then identified the centroid of each case based on the closest coordinate rounded to the nearest kilometre. We then calculated the total distance covered for both the true distance and the centroid distances and found that the resulting distances were similar (Supplementary Fig. 15).

To quantify uncertainty, we used a bootstrapping approach whereby in each bootstrap iteration, we randomly sampled with replacement the isolates before recalculating the statistic. We report the 2.5 and 97.5 percentiles from the resulting distribution.

We also repeated the same analysis but used the time-resolved phylogenetic trees to interrogate pairs across increasing divergence times (breaking the GPSCs into higher resolution). For this, rather than matrices designating whether pairs were the same GPSC or different GPSCs, the divergence time between each pair was included. We only included the divergence times between like GPSCs (Fig. 2c–e).

$${{\rm{R}}{\rm{R}}}_{{\rm{l}}{\rm{o}}{\rm{c}}}(\,g1,g2)=\frac{\mathop{\sum }\limits_{i=1}^{n}\mathop{\sum }\limits_{j\ne i}^{n}I({{\rm{l}}{\rm{o}}{\rm{c}}}_{i}={{\rm{l}}{\rm{o}}{\rm{c}}}_{j}\cap {t}_{ij}\le 1\,{\rm{y}}{\rm{e}}{\rm{a}}{\rm{r}}\cap ({G}_{ij} > {g}_{1},{G}_{ij} < {g}_{2}))\,/\,\mathop{\sum }\limits_{i=1}^{n}\mathop{\sum }\limits_{j\ne i}^{n}I({{\rm{l}}{\rm{o}}{\rm{c}}}_{i}={{\rm{l}}{\rm{o}}{\rm{c}}}_{j}\cap {t}_{ij}\le 1\,{\rm{y}}{\rm{e}}{\rm{a}}{\rm{r}})}{\mathop{\sum }\limits_{i=1}^{n}\mathop{\sum }\limits_{j\ne i}^{n}I({{{\rm{L}}}_{{\rm{r}}{\rm{e}}{\rm{f}}}}_{i}={{{\rm{L}}}_{{\rm{r}}{\rm{e}}{\rm{f}}}}_{j}\cap {t}_{ij}\le 1\,{\rm{y}}{\rm{e}}{\rm{a}}{\rm{r}}\cap ({G}_{ij} > {g}_{1},{G}_{ij} < {g}_{2}))\,/\,\mathop{\sum }\limits_{i=1}^{n}\mathop{\sum }\limits_{j\ne i}^{n}I({{{\rm{L}}}_{{\rm{r}}{\rm{e}}{\rm{f}}}}_{i}={{{\rm{L}}}_{{\rm{r}}{\rm{e}}{\rm{f}}}}_{j}\cap {t}_{ij}\le 1\,{\rm{y}}{\rm{e}}{\rm{a}}{\rm{r}})}$$

(2)

We utilized the framework to compare a range of geographical distances, keeping the reference distance to pairs that were >1,000 km apart (Fig. 2).

To identify the divergence time at which pairs had an equal risk of being in the same province as distant provinces (time to homogenization across South Africa), we investigated the divergence time at which there was no increased risk of similarity within a province compared against distant provinces (RR = 1). We stratified distances in South Africa into groups of distances, including the 9 within provinces, 14 pairs that were <500 km apart, 16 pairs 500–1,000 km apart and 6 pairs of provinces >1,000 km apart. We repeated the framework across rolling 20-year time windows at 10-year intervals from 0 to 100 years (g1,g2) (Fig. 2f). We repeated this for pairs for which one was from South Africa and the other was from another country in Africa (Supplementary Fig. 3A). We sampled 300 sequences from each province with replacement to compensate for biased sampling (Extended Data Fig. 2a). Furthermore, to incorporate phylogenetic uncertainty into the statistical framework, we sampled 100 individual phylogenies from the BactDating posterior. We report the 2.5 and 97.5 percentiles from the resulting distribution. We also repeated the analyses only including pairs isolated from patients with pneumococcal disease (Extended Data Fig. 2b).

Probabilistic mobility model

Overall strategy

We extended a previously published mechanistic phylogeographic model⁶⁵ to estimate the mobility of the pneumococcus between pairs of municipalities at each transmission step. To infer the probable path of transmission between sequence pairs, we used the divergence time and the generation time distribution to estimate the number of transmission generations between pairs of sequences. Each generation is a possible transmission event and provides an opportunity for a mobility event.

The approach ultimately aims to estimate an origin destination matrix for a transmission step, whereby each cell represents the probability that the pneumococcus is now in location j after one transmission step given it was previously in location i. As the phylogenetic trees combined with the generation time provide an estimate of how many transmission steps separate pairs of samples, we can use repeated matrix multiplication to integrate over all possible pathways linking two locations (see below for more details). We incorporated the probability of sampling at each geographical location, within each collection year, by GPSC to account for our observation process.

Notation

We follow notation per a previous study⁶⁵. A pair of isolates, C_a and C_b, with sequences, Seq_a and Seq_b included in a phylogeny are found in locations, L_a and L_b, and the samples were taken in the years, T_a and T_b. The inferred MRCA between C_a and C_b is time T_m and located in L_m. The number of transmission generation from the MRCA to C_a is G_a and to C_b is G_b.

Model fitting

Single transmission generation

Considering a single transmission generation, the probability that persons i and j come into contact with each other given i lives in location a and j lives in location b can be written as follows:

$$\begin{array}{l}{\rm{P}}({\rm{person}}\,i\,{\rm{and}}\,j\,{\rm{come}}\,{\rm{into}}\,{\rm{contact}}| {L}_{i}=a,{L}_{j}=b)\\ =\mathop{\sum }\limits_{k}^{N}P({V}_{i}=k| {L}_{i}=a)\cdot P({V}_{j}=k| {L}_{j}=b)\cdot {B}_{k}\end{array}$$

(3)

Where $P\left({V}_{i}=k| {L}_{i}=a\right)$ is the probability that individual i, whose home location is in a, visits location k and $P\left({V}_{j}=k| {L}_{j}=b\right)$ is the probability that individual j whose home location is in b visits location k, and B_k is the location-specific probability of transmission for k.

At time ${\tau }$, one infector in i in location a is expected to transmit to this number of persons in location b:

$$\begin{array}{l}E({\rm{number}}\,{\rm{of}}\,{\rm{persons}}\,{\rm{from}}\,b\,{\rm{infected}}\,{\rm{at}}\,{\rm{time}}\,{\tau }| {L}_{i}=a)\\ =\mathop{\sum }\limits_{k}^{N}P({V}_{i}=k| {L}_{i}=a)\cdot P({V}_{j}=k| {L}_{j}=b)\cdot {B}_{k}\cdot {S}_{b,{\tau },{\rm{gpsc}}}\end{array}$$

(4)

Where ${S}_{b,{\tau },{\rm{gpsc}}}$ is the number of susceptible people in location b at time ${\tau }$ with some lineage = gpsc.

The total number of people infected by the infector is:

$$\begin{array}{l}E\,({\rm{number}}\;{\rm{of}}\;{\rm{persons}}\;{\rm{from}}\;{\rm{all}}\;{\rm{locations}}\;{\rm{infected}}\;{\rm{at}}\;{\rm{time}}\,\tau | {L}_{i}=a)\\ \,\,=\mathop{\sum }\limits_{m}^{N}\mathop{\sum }\limits_{k}^{N}P({V}_{i}=k| {L}_{i}=a)\cdot {P}({V}_{j}=k| {L}_{j}=m)\cdot {B}_{k}\cdot {S}_{b,\tau ,{\rm{gpsc}}}\end{array}$$

(5)

Conditional on transmission occurring, the probability that the infectee’s isolate is taken in location b is:

$$\begin{array}{l}{\delta }_{a,b,\tau ,{\rm{gpsc}}}=P({L}_{j}=b| {L}_{i}=a)\\ \,\,=\frac{{\sum }_{k}^{N}P({V}_{i}=k| {L}_{i}=a)\cdot P({V}_{j}=k| {L}_{j}=b)\cdot {B}_{k}\cdot {S}_{b,\tau ,{\rm{gpsc}}}}{{\sum }_{m}^{N}{\sum }_{k}^{N}P({V}_{i}=k| {L}_{i}=a)\cdot P({V}_{j}=k| {L}_{j}=m)\cdot {B}_{k}\cdot {S}_{b,\tau ,{\rm{gpsc}}}}\end{array}$$

(6)

We then created an NXN transmission matrix,${\Delta }_{{\tau },{\rm{gpsc}},{\rm{gen}}=1}$, with N being the total number of locations, containing the transmission probabilities, asymmetrically, between all pairs of locations at each point in time; ${{\delta }}_{a,b,{\tau },{\rm{gpsc}}}$ is element [a,b] of the matrix.

Human mobility characterization

We use Meta mobility data (MetMob), as described in the mobility data section above, to characterize mobility between the 234 municipalities of South Africa. We aggregated these to the province level (n = 9) to fit the model. Initially we extracted a 234 × 234 matrix that sets out the probability that an individual from municipality a visits location k; again, where k = any municipality.

MetMob comes from individuals using Meta; however, this may not be representative of the amount of time spent at home by those involved in pneumococcus transmission. The mean of the diagonal of the Meta Mobility matrix (234 × 234) is 0.989, implying, on average, 98.9% of Meta Users stay in their home municipality, H_i. To allow for individuals to spend more or less time at home than represented in the MetMob data, we incorporated a parameter to adjust the probability of being at home (${\theta }$). We adjusted the probability of staying home using a standard logistic function and restricted it with bounds of –0.04 and 0.6 to facilitate exploration of a sensible space. The adjustment allowed by the bounds limits the range of movement H_i–0.6 and H_i+0.04.

The probability of a person remaining in location a therefore becomes:

$$\begin{array}{l}P({V}_{i}=a| {H}_{i}=a)\\ =\,{\rm{MetMob}}[a,a]-\left(-0.04+\left(\frac{(\exp ({\theta }))}{(1+\exp ({\theta }))}\right)\cdot (0.6+0.04)\right)\end{array}$$

(7)

Where a value of greater than 1.0 is obtained for a specific municipality, this is replaced by a value of 0.999.

The estimates thus far are mobility per day, but we were interested in mobility across the infectious period. Therefore, we adjusted the diagonal to account for mobility each day within the infectious period:

$${\rm{MetMob}}\left[a,a\right]=1-{{\rm{MetMob}}\left[a,a\right]}^{G}$$

(8)

We rescaled the probabilities so that the sum of all mobility is equal to 1:

$$P({V}_{i}=a| {H}_{i}=a,k\ne a)={\rm{MetMob}}[a,k]/(1-P({V}_{i}=a| {H}_{i}=a))$$

(9)

Where MetMob[a,k] considers all mobility probabilities from the Meta Mobility data.

The sum of the movements to South African municipalities is equal to 1, which assumes that the analysis contains all possible movements of both the pneumococcus and people and implying no external introductions. The outcome of this is that some mobility may be missed, especially around the country borders.

Probability of the pneumococcus being in each location after G transmission generations

To determine the probability that location k contains the home location after G transmission generations, we used matrix multiplication, which integrates across all possible pathways connecting two locations.

$${\Delta }_{{\tau }={T}_{G},{\rm{gpsc}},{\rm{gen}}=G}=\mathop{\prod }\limits_{r=2}^{G}{\Delta }_{{\tau }={t}_{r-1},{\rm{gpsc}},{\rm{gen}}=r-1}\cdot {\Delta }_{{\tau }={t}_{r},{\rm{gpsc}},{\rm{gen}}=1}$$

(10)

Where t_r is the time of generation G_r.

Probability of observing a pair of cases in two specific locations

The probability that C_A has home location L_A and C_B has home location L_B is conditional on the sequences being observed in locations L_A and L_B at times T_A and T_B. We assumed that the location of two cases, L_i and L_j, is dependent on the location of their MRCA, L_m, and the number of transmission generations separating them from their MRCA, G_A and G_B.

The observations processes across locations are independent of each other, and each transmission event is independent of other transmission events. The probability of observing (Obs) a case at L_i at time T_A is not dependent on the number of generations to, or location of, the MRCA. We considered discretized space of the nine provinces of South Africa, resulting in the following equation:

$$\begin{array}{c}P({L}_{A},{L}_{B}| {{\rm{Obs}}}_{{L}_{B},{T}_{B}},{{\rm{Obs}}}_{{L}_{B},{T}_{B}},{{\rm{Seq}}}_{A},{{\rm{Seq}}}_{B},{T}_{A},{T}_{B})\\ =\frac{{\sum }_{{L}_{m}}{\sum }_{{G}_{A}}{\sum }_{{G}_{B}}P({{\rm{Obs}}}_{{L}_{A},{T}_{A}})P({{\rm{Obs}}}_{{L}_{B}{T}_{B}})P({L}_{A}| {L}_{m},{G}_{A})P({L}_{B}| {L}_{m},{G}_{B})P({L}_{m})P({G}_{A},{G}_{B}| {{\rm{Seq}}}_{A},{{\rm{Seq}}}_{B},{T}_{A},{T}_{B})}{{\sum }_{{L}_{i}}{\sum }_{{L}_{j}}{\sum }_{{L}_{m}}{\sum }_{{G}_{A}}{\sum }_{{G}_{B}}P({{\rm{Obs}}}_{{L}_{i},{T}_{A}})P({{\rm{Obs}}}_{{L}_{i}{T}_{B}})P({L}_{i}| {L}_{m},{G}_{A})P({L}_{j}| {L}_{m},{G}_{B})P({L}_{m})P({G}_{A},{G}_{B}| {{\rm{Seq}}}_{A},{{\rm{Seq}}}_{B},{T}_{A},{T}_{B})}\end{array}$$

(11)

Probability of G generations between MRCA and a sequenced isolate

Under the previous equation, $P({G}_{A},{G}_{B}| {{\rm{Seq}}}_{A},{{\rm{Seq}}}_{B},{T}_{A},{T}_{B})$ represents the generation time distribution.

We can extract the joint probability of C_A and C_B being separated from the MRCA by G_A and G_B transmission generations, respectively, using the above-derived generation time distribution and the time-resolved phylogenetic trees.

Assuming the generation time is gamma distributed and all transmission events are independent, the sum of the gamma distribution is also gamma distributed. Additionally, we can extract the evolutionary times E_A and E_B, separating C_A and C_B from the MRCA. As previously described⁶⁵, using equation (19), we can estimate the probability of g transmission events over many trees, allowing us to incorporate phylogenetic including evolutionary parameters from the tree structure.

We determined the probability for the number of generations from MRCA for each isolate for 1–1,000 generations, using the generation time derived above.

Location of the MRCA (P(L
_m))

We estimated the probability that, on average, an MRCA is in each of the nine provinces in South Africa. We estimated parameters for each of the eight provinces, setting Western Cape aside as a reference, and dividing by the total across all nine to ensure that the sum of the probabilities is 1. This again assumes no external introductions.

Calculation of likelihood

We calculated the likelihood using all pairs of available sequenced S. pneumoniae as previously described⁶⁵. We accounted for the observation process by incorporating the probability of sampling in each location for isolates belonging to each GPSC annually.

Likelihood equation

We calculated the likelihood using all pairs of sequenced pneumococci as follows:

$$L\propto \mathop{\prod }\limits_{{\rm{province}}=1}^{9}\mathop{\prod }\limits_{i=1}^{{n}_{{\rm{gpsc}}}}\prod _{j\ne i}P({L}_{i},{L}_{j}| {{\rm{Obs}}}_{{L}_{i},{T}_{i}},{{\rm{Obs}}}_{{L}_{j},{T}_{j}},{{\rm{Seq}}}_{i},{{\rm{Seq}}}_{j},{T}_{i},{T}_{j})$$

(12)

Where n_gpsc are the number of sequences available from GPSC gpsc.

Hamiltonian MCMC

We used an MCMC approach to estimate our parameters using the package fmcmc (v.0.5-1) implemented in R⁶⁶. We estimated nine parameters: a parameter that adjusts the probability of staying in the home municipality compared with Meta mobility data; and eight parameters capturing the relative probability that the MRCA of a pair of individuals was in each of the other eight provinces compared with the province of Western Cape (the reference).

We only used pairs of sequences that were separated by less than 10 years of evolutionary time between them. After 10 years, there are limited spatial signals remaining, as the bacterium has had many opportunities to move. This approach was used to make the model computationally tractable. To incorporate phylogenetic uncertainty, we repeatedly refit the model using 50 randomly selected phylogenies from the BactDating posterior. We report the 2.5 and 97.5 percentiles from the resulting distribution. In a sensitivity analysis, we showed that increasing the model to 15 years resulted in similar estimates (Extended Data Fig. 12a).

Model performance

To assess the performance of our model, we used a simulation framework. We simulated 50,000 pairs of events, locations and the location of the MRCA between pairs, whereby the probability of mobility between each pair of locations was determined by the Meta mobility matrix adjusted by a parameter of known value of −2. We tested our ability to recapture this input parameter. To incorporate the biased observation process, we downsampled the simulated data based on the by-province sampling probabilities from our true data.

We used the downsampled data to fit our model with 20,000 steps of a MCMC with a jump step of 0.08 in 3 chains. We were able to recapture the downsampled data utilizing the human mobility framework and the estimated parameter for the probability of staying at home, accounting for the sampling probability per province. We then utilized the same human mobility framework and the estimated parameter, excluding the sampling probability, and were able to recapture the complete simulated data, including the input parameter (Extended Data Fig. 5a,b).

We repeated these simulations but downsampled with various biases. We determined how well the model performed when we only sampled two provinces. We also tested how far off our estimates would be if our generation time estimate was 50% higher or 50% lower than we had estimated given the Kenyan and Gambian data^31,32 (Extended Data Fig. 5c,d).

Model sensitivity analyses

We estimated our parameters only to include isolates from patients with invasive pneumococcal disease (Supplementary Fig. 1B).

To test the impact of a range of generation times on the parameter estimates, we also re-ran our MCMC with generation times of 15, 35 (as included in the model) and 55 days. To quantify uncertainty, we sampled posterior parameters, reporting the 2.5 and 97% percentiles (Extended Data Fig. 12b).

We confirmed that the chains converged for each of the nine parameters estimated (Supplementary Fig. 16).

Probability guided transmission simulations

Person-to-person transmission chains

We simulated person-to-person transmission chains seeded in a starting municipality weighted by the population size. We determined the RR of being in a specific municipality after 10 transmission generations (approximately 1 year) across 500,000 simulations.

We fixed the starting municipality to be rural (population density <50 km^–2) or urban (population density >500 km^–2) and repeated the above simulation. For 10,000 sequential simulations, we counted the number of unique municipalities affected and distance travelled at each transmission interval weighting by population size. We then determined the number of municipalities travelled to across all transmission chains (Extended Data Fig. 3).

Branching epidemic

We simulated a branching epidemic in which we drew the number of transmission events seeded by each event from a Poisson distribution around an effective reproductive number (R_eff) of 1 and amplitude of 0.15 over 100 iterations. We determined the mean distance from the starting municipality after 60 generations (5.8 years with a 35-day generation time) and the number of municipalities visited over that time. We calculated the uncertainty at the 95% CI of a normal distribution.

Gravity model

We compared the performance of the model that used Meta data with the performance of a simple gravity model whereby the probability of mobility is determined by the distance and human population size of the municipalities. We calculated the probability of mobility between locations i and j (GravMob_i,j) to be the log of the destination population size (popsize_j) raised to parameter β divided by the distance between locations, loc_i and loc_j (dist_i,j) raised to parameter γ.

$${{\rm{GravMob}}}_{i,j}=\frac{\log ({{{\rm{popsize}}}_{j}}^{{\beta }})}{{{{\rm{dist}}}_{i,j}}^{{\gamma }}}$$

(13)

We also tested a model including only distance and estimating an exponent adjustment parameter γ.

We calculated the DIC comparing the three models and found that the Meta mobility model (DIC = 12,290.97) was the best-performing model⁶⁷ when compared with the gravity model (DIC = 12,294.34). Both of these models performed better than distance alone (DIC = 12,424.32) (Extended Data Fig. 4).

All statistical analysis for the RR framework and the mobility model was performed in R (v.3.6.2)⁶⁸.

Population-level fitness model

Overall strategy

We developed logistic growth models to fit the changing prevalence of different serotypes and assess the impact of vaccine implementation, utilizing a method that has previously been implemented for the endemic bacterium Bordetella pertussis⁶⁹.

Vaccine-type model

We first binned serotypes into three groups: those not included in the vaccine (NVTs), serotypes included in PCV7 (4, 6B, 9V, 14, 18C, 19F and 23F), and additional serotypes included in PCV13 (1, 3, 5, 6A, 7F and 19A). We used the full data for this analysis. We computed the relative abundance f_i_,ref of each group of serotypes, i, compared with a reference type, ref.

$${f}_{i,{\rm{ref}}}\left(t\right)=\frac{{N}_{i}\left(t\right)}{{N}_{i}\left(t\right)+{N}_{{\rm{ref}}}\left(t\right)}$$

(14)

We chose to use NVTs as the reference. Varying the reference did not affect the model, as long as the reference samples span all years. We then used a simple logistic model, assuming a constant growth rate to capture the evolution of this abundance, at each time t.

$${f}_{i,{\rm{ref}}}\left(t\right)=\frac{1}{1+\left(\frac{1-{f}_{i,{\rm{ref}},0}}{{f}_{i,{\rm{ref}},0}}\right)\exp \left({-r}_{i,{\rm{ref}}}\cdot t\right)}$$

(15)

where r_i,ref is the growth rate of that abundance shared across all provinces, and f_i,ref,0 is the initial relative abundance of the serotype group i with respect to a chosen ref.

To control for the varying presence of all circulating serotypes through time, we present fitness as the average relative growth rate, $\bar{{r}_{i}}$, for each group with respect to a randomly selected group in the population:

$$\bar{{r}_{i}}=\mathop{\sum }\limits_{j\ne i}^{n}\bar{{f}_{j}}({r}_{i,{\rm{ref}}}-{r}_{j,{\rm{ref}}})$$

(16)

where n is the number of groups, and $\bar{{f}_{j}}$ is the average absolute frequency of the group in the period of time considered.

This average relative growth rate, $\bar{{r}_{i}}$, can be identified as the selection rate coefficient of the group in the population considered⁶⁹. The selection rate coefficient is a direct measure of the fitness advantage of emerging variants and is one of the best indicators as to whether a strain will increase in frequency during an outbreak^70,71.

We can further multiply the selection rate by the mean generation time (35 days) to obtain the selection coefficient per generation, which is the relative fitness advantage per transmission generation.

In the Article, we also present estimates of the relative fitness advantage of groups in particular time frames. We used the same computation as for the average relative growth rate, but tailored it to specific references. For example, to compute the relative fitness advantage of NVTs compared with VTs in the PCV era $\Delta \overline{\,{r}_{{\rm{NVT}},{\rm{VT}},{\rm{after}}{\rm{PCV}}}}$, we computed:

$$\Delta \overline{\,{r}_{{\rm{NVT}},{\rm{VT}},{\rm{after}}{\rm{PCV}}}}=\frac{1}{\overline{\,{f}_{{\rm{PCV}}7}}+\overline{\,{f}_{{\rm{PCV}}13}}}\sum _{i{\epsilon }({\rm{PCV}}7,{\rm{PCV}}13)}\overline{{f}_{i}}\,(\overline{\,{r}_{{\rm{NVT}},{\rm{after}}{\rm{PCV}}}}-\overline{\,{r}_{i,{\rm{after}}{\rm{PCV}}}})$$

(17)

To fit the model, we used Hamiltonian Monte Carlo as implemented in R-Stan⁷², with stan (v.2.26.1). We used a Poisson likelihood to fit the observed proportion of sequences that were of each category and the total number of isolates in that year as an offset. The model estimated the proportion of isolates that were of each type at the start of the dataset and the fitness parameters.

Exploration of vaccine effect

To investigate whether the serotypes fitness changed across after implementation of PCV7 and PCV13, we tested a range of models. We considered a model without any shift of fitness, a model with a single shift in 2009 and a model with two growth rates represented by a shift in fitness in both 2009 and 2011. The 2009 shift pertains only to those serotypes included in PCV7 and the 2011 shift pertains to those additional 6 serotypes in PCV13. Model comparison was done using the Watanabe–Akaike information criterion (WAIC) implemented in the loo package⁷³. Furthermore, we tested for a potential delay between implementation of PCVs and the change in fitness. Model comparison is presented in Extended Data Fig. 7.

The model performed best with an initial fitness switch in 2009 (the initial year of vaccine implementation) and another fitness switch in 2011 (the year of PCV13 implementation). We used this model in the main text.

Estimation of individual serotype fitness

Using the same framework, we also estimated the growth rate per serotype to capture whether the individual dynamics were concordant with, or deviated from, what was expected according to their respective group (NVT, PCV7 or PCV13) (Fig. 4a–c). The reference strain in this analysis was set to serotype 13, which is a NVT and for which samples span all years.

Exploration of predicted GPSC dynamics based on their serotype composition

We explored whether using the proportion of isolates that were VT versus NVT within each GPSC at the start of the study period and our fitness estimates could explain the subsequent dynamics of individual GPSCs.

To predict the dynamics of individual GPSCs based on their serotype composition, we used the same framework as described above, applied to each GPSC serotype group in our dataset. As the number of such groups is large (n = 340 GPSC serotype groups), we restricted the analysis to the GPSCs present at a minimum prevalence of 1%. This led to 26 GPSCs being considered for this analysis (representing 74.7% of the dataset), split in a total of 101 GPSC serotype groups. We then modelled the dynamics of each GPSC serotype group, estimating one starting frequency per group and using the previously estimated fitness parameters, either by VT or serotype. We also considered a model with no fitness parameters (relative fitness = 1). To assess the model performances, we computed the Akaike information criterion (AIC)⁷⁴ using the average likelihood of each model and the number of parameters used in each model. We also computed the predicted GPSC dynamics over time for each model by summing all the predicted GPSC serotype group dynamics for each GPSC. As a measure of goodness of fit we used the coefficient of determination of the observed versus predicted GPSC proportions each year:

$${R}^{2}=1-\frac{{\sum }_{i}{({\rm{o}}{\rm{b}}{\rm{s}}{\rm{e}}{\rm{r}}{\rm{v}}{\rm{e}}{\rm{d}}{\rm{p}}{\rm{r}}{\rm{o}}{\rm{p}}{\rm{o}}{\rm{r}}{\rm{t}}{\rm{i}}{\rm{o}}{\rm{n}}-{\rm{p}}{\rm{r}}{\rm{e}}{\rm{d}}{\rm{i}}{\rm{c}}{\rm{t}}{\rm{e}}{\rm{d}}{\rm{p}}{\rm{r}}{\rm{o}}{\rm{p}}{\rm{o}}{\rm{r}}{\rm{t}}{\rm{i}}{\rm{o}}{\rm{n}})}^{2}}{{\sum }_{i}{({\rm{o}}{\rm{b}}{\rm{s}}{\rm{e}}{\rm{r}}{\rm{v}}{\rm{e}}{\rm{d}}{\rm{p}}{\rm{r}}{\rm{o}}{\rm{p}}{\rm{o}}{\rm{r}}{\rm{t}}{\rm{i}}{\rm{o}}{\rm{n}}-{\rm{m}}{\rm{e}}{\rm{a}}{\rm{n}}{\rm{o}}{\rm{b}}{\rm{s}}{\rm{e}}{\rm{r}}{\rm{v}}{\rm{e}}{\rm{d}}{\rm{p}}{\rm{r}}{\rm{o}}{\rm{p}}{\rm{o}}{\rm{r}}{\rm{t}}{\rm{i}}{\rm{o}}{\rm{n}})}^{2}}$$

(18)

R² is also the proportion of variation explained by the variables considered in each model.

Our model estimated a constant fitness for each group considered (VTs or individual serotypes), assuming that if a group has the highest fitness, it will eventually replace the other groups. This assumption is meaningful for VTs and serotypes, as some are directly targeted by the vaccines, a strong selective force in the population. However, the fitness of each GPSC cannot be modelled with this simple assumption as it has been shown that their fitness is inherently multifactorial, as described in the NFDS model²⁴.

AMR

We then used the model to capture the decreasing proportion of penicillin resistance in the population. To do this, we incorporated the dynamics of VTs (PCV7 and PCV13) and NVTs in the population and the respective proportion of penicillin resistance within them. To keep the model tractable, we did not differentiate here between serotypes included in PCV7 or PCV13, instead we group them into VTs. We use the PCV7 implementation (2009) as the year of fitness shift. This parametrization marginally differed from the best model (next best) (Extended Data Fig. 7) and enabled us to keep the number of categories tractable. We also compared the single switch model to a model with no change in fitness at the time of vaccine implementation. This approach showed that the model with a switch was superior (Supplementary Fig. 17).

We used the same approach described above to model the proportion of VTs, f_VT, and NVTs, f_NVT. We then modelled the proportion of strains that were and were not penicillin resistant within each group, either VT, f_AMR|VT or NVT, f_AMR|NVT. We estimated the fitness of resistance in each.

We then fit the model to all four groups, f_VX,AMR (resistant NVT, susceptible NVT, resistant VT and susceptible VT).

$${f}_{{\rm{VX}},{\rm{AMR}}}(t)={{f}_{{\rm{VX}}}(t)\cdot f}_{{\rm{AMR| VX}}}(t)$$

(19)

Where f_VX is the proportion of either VT or NVT in the population and f_AMR|VX is the proportion of penicillin resistance within each of those groups. We then derived the proportion of penicillin resistance overall in the population by summing the proportion of VT resistant, f_VT,AMR, and NVT resistant, f_NVT,AMR, strains.

Estimating effect of fitness on migration

We included the calculated average relative growth rates in the simulated branching epidemic to calculate the mean number of municipalities visited, distance travelled and probability of being in the home municipality over 5 years. We report uncertainty at one standard deviation from the mean. We repeated this incorporating the post-PCV selection coefficients for NVTs and VTs and estimated the relative increase in the number of municipalities visited for NVT serotypes compared with PCV serotypes after 2 years of transmission.

Fitness model carrying capacity sensitivity

Our fitness model assumed constant fitness over time. This specifically means that if a serotype is found to be fitter than the rest of the population, we expect it to replace the whole population after some time. However, the currently best-supported model proposes that NFDS drives pneumococcus population dynamics²⁴, whereby lower frequency genes become more fit.

To test the effect of our assumption on our fitness estimates, we performed a sensitivity analysis. We introduced minimum and maximum carrying capacities in our logistic model. In equation (20), we let the relative frequency f_i_,ref go to a maximum of K_max and in equation (21), a minimum of K_min:

$${f}_{i,{\rm{ref}}}(t)=\frac{{K}_{\max }}{1+\left(\frac{{K}_{\max }-{f}_{i,{\rm{ref}},0}}{{f}_{i,{\rm{ref}},0}}\right)\exp ({-r}_{i,{\rm{ref}}}\cdot t)},\,{\rm{if}}\,{r}_{i,{\rm{ref}}} > 0$$

(20)

$${f}_{i,{\rm{ref}}}(t)=1-\frac{{1-K}_{\min }}{1+\left(\frac{({1-K}_{\min })-(1-{f}_{i,{\rm{ref}},0})}{(1-{f}_{i,{\rm{ref}},0})}\right)\exp ({r}_{i,{\rm{ref}}}\cdot t)},\,{\rm{if}}\,{r}_{i,{\rm{ref}}} < 0$$

(21)

with ${K}_{\min }\le {K}_{\max }$, and ${K}_{\min }\in \left[0,1\right],\,{K}_{\max }\in \left[0,1\right]$.

We considered a range of values for K_min (0.1, 0.05 and 0) and K_max (0.9, 0.95 and 1) (Supplementary Fig. 20). We tested the effect of this carrying capacity has on both the vaccine status model (Supplementary Fig. 18) and the AMR and vaccine status model (Supplementary Fig. 19). In each case, we compared the fitness estimates obtained. We found that the fitness estimates were robust to variable carrying capacities across those tested, which implied that our assumption does not affect this framework. However, it is important to note that this model is robust to set variable carrying capacities while being unable to estimate them, which is in contrast to NFDS. Our model assumes that there is a population replacement up to the carrying capacity and does not allow for fine-scale estimates of an equilibrium. This fitness model can be used as a quantitative descriptor of the growth of distinct groups (that is, penicillin resistance or VT serotypes in the context of a perturbation) but cannot describe the complex underlying fitness effects of changing frequencies of genes under NFDS²⁴.

All statistical analysis for the fitness model was performed in R (v.4.0.5)⁶⁸.

Ethics

The study was coordinated by the GPS (https://www.pneumogen.net/gps/), whose activities are approved by respective ethics committee in-country. Isolates sequenced in this study come from the National Institute for Communicable Diseases and the University of Witwatersrand carriage studies as previously published. These isolates were all collected as part of existing public health surveillance approved protocols in each country. No personally identifiable information was used as part of this study.

Reporting summary

Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.

Source link

Geographical migration and fitness dynamics of Streptococcus pneumoniae – Nature

Data sources and processing

Pneumococcal sequence data and metadata

Population data

Mobility data

Generation time distribution

Sample culture and genome sequencing

AMR

Constructing time-resolved phylogenetic trees

RR framework

Probabilistic mobility model

Overall strategy

Notation

Model fitting

Single transmission generation

Human mobility characterization

Probability of the pneumococcus being in each location after G transmission generations

Probability of observing a pair of cases in two specific locations

Probability of G generations between MRCA and a sequenced isolate

Location of the MRCA (P(L m))

Calculation of likelihood

Likelihood equation

Hamiltonian MCMC

Model performance

Model sensitivity analyses

Probability guided transmission simulations

Person-to-person transmission chains

Branching epidemic

Gravity model

Population-level fitness model

Overall strategy

Vaccine-type model

Exploration of vaccine effect

Estimation of individual serotype fitness

Exploration of predicted GPSC dynamics based on their serotype composition

AMR

Estimating effect of fitness on migration

Fitness model carrying capacity sensitivity

Ethics

Reporting summary

Latest news

Must read

You might also likeRELATEDRecommended to you

Location of the MRCA (P(L
_m))

You might also likeRELATED
Recommended to you