Introduction

Humans are inextricably connected with bacterial communities. For example, there are 1014 bacterial cells in the human alimentary tract, comprising an order of magnitude greater bacterial cells than human cells (Luckey 1972). Moreover, some bacterial populations engage in intimate relationships with humans and play important roles in our normal homeostasis (Riley et al. 2013). Although gut bacterial populations comprise normal gastrointestinal microbiota, they can still cause life-threatening infections. More than 13% of global deaths each year are associated with bacterial infectious diseases (Punina et al. 2015). Antibiotics are medicinal compounds that can be used to prevent and treat bacterial infections. However, the emergence of antibiotic resistance in bacterial populations can limit antibiotic treatment options for infection control and presents a major threat to human health (Li and Webster 2018; Woodford and Ellington 2007). According to the 2019 Antibiotic Resistance Threats in the United States published by the Centers for Disease Control and Prevention, over 2.8 million antibiotic-resistant infections occur in the United States every year, resulting in more than 35,000 deaths (Centers for Disease Control and Prevention 2019).

Horizontal gene transfer (HGT) is the lateral movement of genetic material between organisms rather than transmission of DNA from parents to offspring (Keeling and Palmer 2008). HGT represents an alternative pathway for bacteria to share antibiotic resistance genes and is thought to have a considerable influence on the dissemination of resistance genes (Kaiser 2020; Martinez and Baquero 2000; ReAct 2005; Von Wintersdorff et al. 2016). Bacteria typically contain genome defense systems to guard against intruding foreign DNA (e.g., resistance genes) including restriction–modification systems (i.e., the innate immune system) and CRISPR-Cas systems (i.e., an adaptive immune system). CRISPR-Cas systems comprise heritable sequence-specific genome defenses against resistance genes by acting as barriers to HGT and significantly inhibiting the dissemination of antibiotic resistance genes (Bar-rangou et al. 2007; Marraffini and Sontheimer 2008). Some studies have suggested that the widespread presence of pre-existing deletions of CRISPR spacers and mutations in Cas genes can compromise or inactivate CRISPR–Cas defenses of antibiotic resistance (Jiang et al. 2013; Lopez-Sanchez et al. 2012). Furthermore, various CRISPR spacer deletion mutants emerge over time when antibiotic selection for the pre-existing resistance plasmid has occurred. The loss of spacers is often coupled with losses of the surrounding spacers, ranging of the deletion size from dozens of to thousands of base pairs (Hullahalli et al. 2017; Huo et al. 2018; Price et al. 2016). Additionally, the proportion of each deletion mutant within a mixed population dynamically changes over time due to the continuous effects of antibiotics. Adaptive immune systems then become compromised with population genetic variation due to antibiotic selection (Huo et al. 2018). In such population with deletions, antibiotic resistance genes can escape CRISPR-Cas defenses through the mutations in the target region, thereby preventing their identification or cleavage. Based on these observations, the changes in the population proportions of deletions significantly impact the spread of antibiotic resistance genes.

Most studies that have identified SVs or mutations using NGS data have focused on human genomes. Korbel first introduced a method for detecting SVs in the human genome using NGS data in 2007 (Korbel et al. 2007). In contrast to human genomes, bacterial genomes consist of a diversity of repetitive sequences that enhance the chance of deletion and therefore increase genomic instability (Bao et al. 2014; Treangen et al. 2009). In addition, the majority of studies have focused on detecting SVs rather than estimating the population proportions of SVs among subpopulations. While the identification of mutations is critical, it is also necessary to understand the variability of mutant subpopulations over time. Some SV detection tools, including Pindel (Ye et al. 2009), Delly (Rausch et al. 2012), and Breseq (Deatherage and Barrick 2014), can report the quantity of reads supporting each identified mutation. Because most of the SV detection tools only use split-mapping information, the supporting reads remain insufficient. The supporting reads can help approximately estimate the population proportions that are carrying mutations. Huo et al. (2018) introduced a method that identifies reads supporting mutations by mapping all paired-end reads to manually created short reference sequences as single-end reads. The short references are sequences that connect some locus before a deletion with another locus after the deletion. Such techniques can effectively detect mutations, but they discard useful paired-end read information that is essential for estimating proportions.

To better understand antibiotic resistance gene transfer among and within bacterial species, the changes in distribution of each deletion within a mixed population over time are critical to assess. To accomplish this, we propose a mixture model that maximizes information from insert sizes of paired-end reads to estimate the proportions of subpopulations carrying single deletion mutations in mixed populations based on the mapping results. The proposed method was then evaluated by three simulation studies to demonstrate its ability to accurately estimate the population proportions of deletion mutations. Moreover, the results of the algorithm were observed to be numerically stable. Comparisons between the results generated with the proposed method against other methods using simulated datasets are evaluated in Sect. 2.3. In addition to contributing to a more comprehensive understanding of the relationships between mutations and antibiotic resistance, this study also provides a deeper insight into mutations as the engine of bacterial evolution. Indeed, to better understand the process of bacterial population evolution, the acquisition of complete information about mutations is needed in addition to their detection (Hershberg 2015).

The rest of this manuscript is organized as follows. In Results, we provide i. the statistical model, ii. three simulation studies of population mixtures, and iii. a case analysis using a deep sequencing dataset of a CRISPR array to demonstrate the performance of the proposed methods. The original source code for the simulation studies and the real datasets that were used in this study are available at https://github.com/zhan1530/EM. In Discussion, we outline and discuss our approaches and highlights potential future research areas. Finally, the modeling and mathematical derivation of the proposed methods are introduced in Materials and methods.

Results

We consider a mixture model in which a metapopulation comprises subpopulations carrying distinct mutations in the genome. Using the mixture model, the subpopulation proportions can be estimated at each time point. Consequently, the variability of population distribution over time can be investigated by evaluating the subpopulation proportions at different times. The model was validated by the three simulation studies presented in this section, demonstrating its accuracy and stability.

Identification of candidate deletions

To estimate the proportion of subpopulations, we have to first identify all candidate deletion mutations. A simple mixture population is first considered in which there are two subpopulations. One subpopulation is the wild-type genome without a mutation, and the other subpopulation has a smaller genome with a deletion mutation (Fig. 1A). After splitting the sample genome into short DNA fragments, our aim is to estimate the proportions of each subpopulation based on next-generation paired-end sequencing data. Next-generation paired-end sequencing technology enables both ends of a DNA fragment to be sequenced, as shown in Fig. 1B. Read length, denoted by r, refers to the number of base pairs (bps) sequenced from each end of a DNA fragment. A set of n DNA fragments is sampled from the mixed population, as represented by i = 1, 2…, n, and Xi is used to indicate the length of ith insert size.

Fig. 1
figure 1

A DNA fragmentation of a mixed population with two subpopulations. B Visualization of a paired-end read. A unique adapter sequence is present on both ends. The left-end read, also known as the forward read, extends from the adapter in the 5′−3′ direction along the forward DNA strand. The right-end read, or the reverse read, extends from the adapter in the 5′−3′direction along the reverse DNA strand. The inner distance is the length of DNA sequence inserted between the left-end read and the right-end reads. Insert size is the length of the insert between the two adaptors. C Read alignments near a deletion. The rectangle layer shows the location where the deletion happened. For each paired-end set of reads, the arrow from left to right indicates the left-end read, while the arrow from right to left indicates the right-end read. The horizontal dashed lines show the segments of the DNA fragments that are not sequenced. Li is the left start locus of the ith read mapped to reference sequence. Ri is the right-end locus of the ith read mapped to the reference sequence. (a) is mapped as concordant reads. (b) is mapped as discordant reads. (c) and (d) are mapped as split reads. (c) is the left-split read and (d) is the right-split read. D The genomes of K subpopulations. When zi = 1, the subpopulation contains the wild-type genome without a deletion. When zi = k, where k = 2, 3, …, K, the subpopulation has a shorter genome due to deletion k. The length of deletion k is labeled as dk. The left most locus and the right most locus of deletion k are labeled as dLk and dRk, respectively

For the next-generation paired-end sequencing data, the identification of deletion mutations requires comparing the DNA sequences to a reference genome that has already been assembled. Figure 1C shows four paired-end reads that are sampled from the genome of subpopulation containing a deletion mutation and aligned to the reference genome. Suppose a set of n paired-end reads are observed, wherein i = 1, 2…, n. The start locus of the left-end read of the ith read mapped to the reference genome is Li, and Ri is the ending locus of the ith right-end read mapped to the reference genome. The mapped distance is the distance between Li and Ri plus the read length r. The Li and Ri can be used to compute the mapped distance and infer the insert size, Xi. After mapping the paired-end reads to the reference genome, possible mapping results are shown in Fig. 1C: including concordant reads, discordant reads, and split reads. In general, if paired-end reads are mapped to the correct strain, in the correct orientation, and the distance between mapped reads is consistent with the insert size distribution, the reads are considered concordantly mapped read pairs (Guan and Sung 2016). In this example, the concordant read [Read (a) in Fig. 1C] is the paired-end read that does not cover the deletion junction, while its mapped distance is the same as the expected distance, which is the insert size of the DNA fragment. The discordant reads are the paired reads that map to incorrect strains and are mapped in an incorrect orientation or contain an incorrect insert size (Guan and Sung 2016). In this example, the discordant read [Read (b) in Fig. 1C] is the paired-end read that covers the deletion junction and has a significant difference between its mapped distance and its insert size. The incorrect insert size of the discordant reads will then provide evidence of a deletion mutation. Both discordant reads and concordant reads are referred to as non-split reads. In contrast, split reads [Read (c) and (d) in Fig. 1C] are instances where a read has one portion that maps to one end of a deletion, with the other portion of the same read mapping to the other end of a deletion, and these can also provide evidence of a deletion mutation. More specifically, both discordant reads and split reads can indicate a possible deletion mutation. Consequently, (Li, Ri, Si) are our observed data of ith paired-end read mapped to the reference genome, where Si  ϵ {NS, LS, RS}. Here, NS is the non-split read (Read (a) and (b) in Fig. 1C), LS is the left-split read (Read (c) in Fig. 1C), and RS is the right-split read (Read (d) in Fig. 1C). NS, LS and RS are mutually exclusive and exhaustive.

The start and end loci, denoted as dLk and dRk, for all candidate deletions are inputs for our methods and thus need to be prepared in the pre-processing step. This can be accomplished with existing SV detection tools that can identify candidate deletions within mixed populations. Most SV detection programs focus on human genomes, with the exception of Breseq (Deatherage and Barrick 2014) that can correctly identify all candidate deletion of simulation datasets in Sect. 2.3. Therefore, the use of Breseq is recommended to detect deletions and obtain dLk and dRk in the bacterial population.

A mixture model

Suppose we obtain K−1 candidate deletion mutations. Each deletion mutation is represented by k = 2, 3, 4…, K, while k = 1 indicates the wild type. We consider a mixture population that comprises K subpopulations. A set of n paired-end reads are observed, i = 1, 2…, n. After mapping to a reference genome, the observed data of the ith read are denoted as (Li, Ri, Si), where Si ∈ {NS, LS, RS}. The ith read is sampled from one subpopulation, represented by a latent variable zi. The joint distribution can then be written as

$$\mathrm{Pr}\left({L}_{i},{R}_{i},{S}_{i}\right)=\sum_{k=1}^{K}\mathrm{Pr}\left({L}_{i},{R}_{i},{S}_{i}\mid {z}_{i}=k\right)\mathrm{Pr}\left({z}_{i}=k\right),$$

where \(\mathrm{Pr}\left({z}_{i}=k\right)\) is the probability that the \(i\)th read comes from the \(k\)th subpopulation, and \(\mathrm{Pr}\left({L}_{i},{R}_{i},{S}_{i}\mid {z}_{i}=k\right)\) is the conditional probability of \(\left({L}_{i},{R}_{i},{S}_{i}\right)\), given that it comes from subpopulation \(k\). We can also define \({\theta }_{k}=\mathrm{Pr}\left({z}_{i}=\right.\) \(k;\theta ),\) and \(\theta =\left({\theta }_{1},\cdots ,{\theta }_{K}\right)\). In biology, \({\theta }_{k}\) is the proportion of sampled DNA fragments from subpopulation \(k\). Then, \({\theta }_{k}\ge 0\) and \(\sum_{k=1}^{K}{\theta }_{k}=1\).

DNA is assumed to be randomly fragmented, wherein the start locus of each DNA fragment is roughly equally likely to be chosen from the genome of subpopulation \(k\). For the kth subpopulation, the start locus is uniformly distributed over the whole genome minus the mean length of DNA fragments theoretically. Therefore, the length is \({l}_{k}-\mu \). Here, \(\mu \) is the mean of insert size and also is the mean length of DNA fragments. \({L}_{i}\mid {z}_{i}=k\) is the left start locus of \(i\)th paired-end read mapped to the reference sequence if the fragment is sampled from the subpopulation, \(k\). Consequently, \({L}_{i}\mid {z}_{i}=k\sim U\left(0,{l}_{k}-\mu \right)\), which is used to compute \(\mathrm{Pr}\left({L}_{i}\mid {S}_{i},{z}_{i}=k;\mu \right)\). Let \({X}_{ik}\) be the insert size of the \(i\)th read if it is sampled from subpopulation \(k\). Then, \({X}_{ik}\) can be inferred from the observed data \(\left({L}_{i},{R}_{i},{S}_{i}\right)\) given \({z}_{i}=k\). We assume that \({X}_{ik}\sim N\left(\mu ,{\sigma }^{2}\right)\), which is used to compute \(\mathrm{Pr}\left({R}_{i}\mid {L}_{i},{S}_{i},{z}_{i}=k;\mu ,\sigma \right)\). Here, \(\sigma \) is the standard deviation of insert size for DNA fragment. Concordant reads will then follow this distribution, while discordant reads will tend to be longer than expected. Involving \(\theta ,\mu \), and \(\sigma \), the probability function is

$$\mathrm{Pr}\left({L}_{i},{R}_{i},{S}_{i};\theta ,\mu ,\sigma \right)=\sum_{k=1}^{K} \mathrm{Pr}\left({L}_{i},{R}_{i},{S}_{i}\mid {z}_{i}=k;\mu ,\sigma \right)\mathrm{Pr}\left({z}_{i}=k;\theta \right).$$

Next, we describe the log-likelihood function of the mixture model based on the observed data \(\left({L}_{i},{R}_{i},{S}_{i}\right)\). The log-likelihood function based on the conditional probability can be written as (please see Supplementary Material A.1 in detail)

$$\ell \left(\theta ,\mu ,\sigma \right)=\sum_{i=1}^{n} \mathrm{log}\sum_{k=1}^{K} \mathrm{Pr}\left({z}_{i}=k;\theta \right)\mathrm{Pr}\left({L}_{i},{R}_{i},{S}_{i}\mid {z}_{i}=k;\mu ,\sigma \right).$$

Here, \({\theta }_{k}\) is the proportion of DNA fragments from subpopulation \(k\), rather than the proportion of cells from subpopulation \(k\), which is denoted by \({p}_{k}.\) \({\theta }_{k}\) is related to both \({p}_{k}\) and the genome lengths of subpopulations. For example, if we have an equal proportion of cells from two subpopulations in an experiment, the subpopulation with the longer genome will have a higher \({\theta }_{k}\). To compute \({p}_{k}\), we can adjust the sample proportion \({\theta }_{k}\) by \({l}_{k}\), the length of the genome for the subpopulation \(k\), as follows:

$${p}_{k}=\frac{\frac{{\theta }_{k}}{{l}_{k}}}{\sum_{k} \frac{{\theta }_{k}}{{l}_{k}}}.$$

Figure 1D shows an example of genomes from \(K\) subpopulations within a mixed population. For subpopulation \(k\), let \({d}_{k}\) be the sequence length of the deletion mutation \(k\), and let \({d}{L}_{k}\) and \({d}{R}_{k}\) denote the left and right locus of deletion \(k\), respectively. Thus, \({d}_{k}=d{R}_{k}-d{L}_{k}+1\). For the wild type, i.e., when \(k=1\), we set \({d}_{1}=0\). Note that \({l}_{k}={l}_{1}-{d}_{k}\).

Most SV detection methods only use split reads to detect deletions, because split reads can provide direct evidence of deletion mutations. However, the split reads only account for a small portion of all data. The remaining large portion of the data comprise non-split reads whose insert sizes can provide indirect evidence of deletion mutations. The focus of this work is in estimating the proportions of deletion mutants rather than detecting the deletions, and thus, the non-split reads can provide useful information for these estimations. In the study, the method using only non-split reads (Method 1) was first used to estimate the proportions to assess how much useful information can be obtained from these non-split reads that were typically discarded by most other algorithms. The method that uses both non-split and split reads together (Method 2) was then introduced. For details of the methods, please see Materials and methods.

Simulation study

Three simulation studies were employed to evaluate the performance of our proposed methods. The reference sequence in the simulation was a 2060 base pairs (bp) region (EF1813) derived from Enterococcus faecalis strain V583 (NCBI reference sequence: NC_004668.1).

Three subpopulations

A mixed population was constructed consisting of three subpopulations (Supplementary Table S1) to test the performance of two methods: (1) using only the non-split reads, and (2) using both the split- and non-split reads. The first population (labeled NA in the Table S1) is the wild-type reference strain NC_004668.1. To simulate the deletion mutants for the second and third subpopulations, we removed the sequences at two loci, 400–599 and 800–1299 (labeled as such in the Table S1), from the original reference genome. The proportions of the three mixed source sequences were set at \(0.20, 0.45\), and \(0.35\). We then randomly drew 20,000 DNA fragments from the mixed population. The insert sizes were simulated from a normal distribution with a mean of 500 bp \(\left(\mu =500\right)\) and a standard deviation of \(50\;\mathrm{ bp }\left(\sigma =50\right)\). Paired-end reads were generated from the sampled DNA fragments, with a read length of 75 bp from both ends. The Burrows–Wheeler Aligner (BWA-MEM) with default parameters was used to map all simulated paired-end reads to the original NC_004668.1 reference sequence. The observed data \(\left({L}_{i},{R}_{i},{S}_{i})\right.\) of the \(i\) th paired-end read were derived from the SAM file after BWA alignment. The parameters of proportions \(\left({p}_{1},{p}_{2},{p}_{3}\right)\) were subsequently estimated for each simulated data set. Simulations were repeated 100 times for each setting. Table 1 provides the average, \(2.5\mathrm{\%}\) and \(97.5\mathrm{\%}\) empirical quantiles, and mean squared error (MSE) values for estimated proportions over 100 simulations, as denoted by \(\left({\hat p_1},{\hat p_2},{\hat p_3}\right)\).

Table 1 The estimated proportions of three subpopulations

The average values of the estimated subpopulation proportions \(\left({\hat p_1},{\hat p_2},{\hat p_3}\right)\) were quite close to the true values \(\left(\mathrm{0.20,0.45,0.35}\right)\) for both methods (Table 1). Using only non-split reads, Method 1 can provide accurate estimations for the proportions. As expected, the MSEs of Method 2 were lower than those of Method 1 for all three subpopulations, although the MSE values were all in the range of \({10}^{-5}\). However, the \(2.5\mathrm{\%}\) and \(97.5\mathrm{\%}\) empirical quantiles for all three subpopulation proportions encompassed the true proportions when using Method 2. In contrast, only one of the intervals encompassed the true value when using Method 1. Therefore, by adding split reads, Method 2 is less biased than Method \(1.\)

In addition to estimating proportions, the algorithm also computes (\(\hat{\mu },\hat{\sigma })\) representing mean and standard deviation of the insert sizes. Supplementary Table S2 shows the average values, 2.5% and 97.5% empirical quantiles, and mean squared errors (MSE) of (\(\hat{\mu ,}\hat{\sigma })\) based on the 100 simulated datasets. The estimates (\(\hat{\mu },\hat{\sigma })\) were close to their true values. However, the intervals of quantiles indicated that the estimations for both methods are slightly biased. The method using both non-split and split reads significantly lowered the MSE of (\(\hat{\mu },\hat{\sigma })\), when compared to the method using only non-split reads. Therefore, Method 2 also exhibited better performance than Method 1 for estimating the distribution of the insert size.

A bootstrap approach was used to estimate the standard errors (SE) of the estimators \(\left(\hat p_{1},\hat p_{2},\hat p_{3},\hat{\mu },\hat{\sigma }\right)\). From each simulated data set, 100 bootstrap samples were drawn, each of which would produce an estimate for a parameter. After 100 iterations, we can estimate the \(\mathrm{SE}\), denoted by \({\widehat{\mathrm{SE}}}_{b}\), of the sampling distribution of the estimator. Supplementary Table S3 shows \(\widehat{\mathrm{SE}}\), the sample standard deviation for estimated parameters based on the 100 simulated datasets, and the average of \({\widehat{\mathrm{SE}}}_{b}\). Figure 2A and B shows the distribution of \({\widehat{\mathrm{SE}}}_{b}\) for Method 1 and Method 2, respectively. The vertical dashed line indicates the \(\widehat{\mathrm{SE}}\). These results suggest that both methods seem to be biased, especially in \({\widehat{\mathrm{SE}}}_{b}\left(\hat p_{3}\right)\), with a noticeable shift to the left.

Fig. 2
figure 2

The distribution of SE with bootstrapping. A, B Used Method 1 and Method 2, respectively

Our methods were compared to previously published SV detection tools using the simulation data, including Delly (Rausch et al. 2012), Breseq (Deatherage and Barrick 2014), and the method Huo et al. (2018) used in their study. In addition to detecting deletions, these tools can also provide quantity of the reads supporting each deletion mutant. Thus, information regarding identified deletions and the numbers of reads supporting these mutations can be used to roughly estimate the population proportions. Figure 3 shows a boxplot of the estimated proportions for the three subpopulations over 100 simulations using Delly, Breseq, and Huo's Method, along with Method 1 and Method 2. Both Breseq and Huo's Method primarily use split reads, thereby discarding a large proportion of non-split reads that contain useful information regarding insert size. After having the candidate deletions and supporting reads for each deletion, the insert size distribution estimated by our methods and the information of genome length were used to approximate the ratio of split reads to the entire data and eventually assess the population proportions of deletions. These will increase the variability of proportion estimation, so Breseq and Huo's method have large variances in Fig. 3. Delly uses both non-split reads and split reads to detect deletions. However, deletions are detected as paired-end outliers for the insert size distribution based on alignments of non-split reads. The default cutoff is three standard deviations from the median insert size, rendering it insensitive to detecting short deletions. Figure 3 suggests that Breseq and Delly are biased, while the method of Huo and our methods are approximately unbiased. Importantly, our proposed methods exhibit smaller variability and higher accuracy than the other approaches.

Fig. 3
figure 3

Boxplot showing the population proportion estimation for the three subpopulations using different methods. The red horizontal dashed lines indicate the true proportions corresponding to 0.45, 0.35, and 0.2

Five subpopulations

The proposed methods were also used to evaluate a more complex mixed population scenario comprising five subpopulations. The first subpopulation was the original NC 004668.1 sequence that was used in the three subpopulation simulation. The genomes of subpopulations 2 through 5 were constructed by deleting sequences of the various lengths of nucleotides from the NC 004668.1 sequence; details for the five subpopulations, including the deletion loci and length in addition to genome sizes, are shown in Supplementary Table S4. Four sample sizes were evaluated including n = 100, 1,000, 10,000, and 100,000. For each sample size, simulations were conducted 100 times. Data generation and parameter estimation were conducted as described in the previous simulation case.

The average values, variance, bias, mean squared errors (MSE), and 2.5% and 97.5% empirical quantiles for each estimated proportions of Method 1 using only non-split reads are shown in Table 2. The corresponding results for Method 2 that used both non-split and split reads are shown in Table 3. The MSE of \(\hat p_{k}\) for both methods after 100 simulations using the four different sample sizes are shown in Fig. 4A. Each plot shows the MSE for one of the five subpopulations.

Table 2 Estimated proportions of the five subpopulations based on Method 1 using only non-split reads
Table 3 Estimated proportions of the five subpopulations based on Method 2 using both non-split reads and split reads
Fig. 4
figure 4

A MSEs of estimated mutant population proportions (\(\hat{{p}_{k}}\)) for both methods using the four different sample sizes. B The MSE of \(\hat{{p}_{k}}\) for both methods using the different standard deviations of insert size

The results indicated that similar observations were present as observed in the three subpopulation case. Method 1, that uses only non-split reads that are discarded by most existing methods, can provide close estimations for mutant population proportions. Adding split reads in Method 2 can further improve the estimation. The bias in estimators for the proportions remained near zero when sample sizes were large (Tables 2, 3). A possible reason for the existence of minor bias may be due to the well-documented instance of normal mixtures can be an ill-posed model when disparities between the component distributions are small (i.e., between subpopulation one and five) (Lourens et al. 2013). Another reason for the bias could be the mapping errors when paired-end reads were mapped to the reference sequence. Furthermore, Tables 2 and 3 demonstrating the proposed method can work for small sample sizes (i.e., n = 100). Also, the MSEs for \(\hat p_{k}\) decreased with increased sample sizes. The MSEs of \(\hat p_{k}\) were always lower for Method 2 than for Method 1 when the sample size was the same (Fig. 4A). When the sample size increased, the difference of MSE between the two methods became smaller. Thus, the method using both non-split and split reads can better estimate population distributions of deletions than the method using only the non-split reads. In other words, including the split reads can further reduce the MSE and improve the overall estimation.

In addition to estimating proportions, the distribution of insert sizes was also estimated. The mean, variance, bias, mean squared error (MSE), and \(2.5\mathrm{\%}\) and \(97.5\mathrm{\%}\) empirical quantiles for estimating \(\mu \) and \(\sigma \) with Method 1 and Method 2 are shown in Supplementary Table S5. The MSEs of \(\hat{\mu }\) and \(\hat{\sigma }\) decreased as the sample size increased for both methods. The MSEs of \(\hat{\mu }\) and \(\hat{\sigma }\) for Method 2 were generally lower than those of Method 1. However, the bias of \(\hat{\mu }\) for Method 2 was slightly higher than for Method 1, suggesting that including split reads introduced some bias in the estimation of insert size. When one end of a paired-end read is split into two or more segments, the length of each read becomes shorter and the chance of mismatches increases. Nevertheless, Method 2 performed better for estimating \(\sigma \) than Method 1.

Three subpopulations in various settings

Since the length of DNA fragments for real data may be much noisier, the proposed methods were also evaluated by two additional sets of simulated reads with larger standard deviations of insert size (i.e., \(\sigma =100\) and \(\sigma =150\)). The settings of mixed populations were the same as the previous three subpopulations case (Table S1). The simulations were also conducted 100 times with a sample size \(\left(n\right)\) of 20,000. Supplementary Table S6 shows the average values, variance, bias, mean squared errors (MSE), and \(2.5\mathrm{\%}\) and \(97.5\mathrm{\%}\) empirical quantiles for each estimated proportion with distinct standard deviations of insert size. The MSE of \(\hat{{p}_{k}}\) for both methods using the different standard deviations of insert size are shown in Fig. 4B. Comparing the two methods, the MSEs of Method 2 were lower than those of Method 1 in all settings. The MSEs were all in the range of \({10}^{-4}\) to \({10}^{-5}\) even as the standard deviations increased, indicating that the proposed methods have less bias and smaller variances. When the effects of increased standard deviations on the MSEs of the three subpopulations were closely looked at, the slight difference in MSEs of \(\hat{{p}_{3}}\) is perhaps due to random sampling variance. However, the MSEs of \(\hat{{p}_{1}}\) and \(\hat{{p}_{2}}\) increased with increased standard deviations. A reason for the increased errors could be the trade-off between Subpopulation 1 and Subpopulation 2. The difference in mapped distance between Subpopulation 1 and Subpopulation 2 should be approximate to the deletion length (200). Because many data will fall outside of one standard deviation of the mean, the difference in lengths of insert size will become larger when the standard deviation increase (i.e., \(\sigma =100\) and \(\sigma =150\)), resulting in smaller disparities between the two distributions.

Another possible scenario that causes smaller disparities between the mixed distributions is when the length of deletions reduces. To evaluate that, we applied our method to two more mixed populations with shorter deletions based on the three subpopulation case. Details for the settings, including the deletion loci, length, and true proportions, are shown in Supplementary Table S7. The corresponding results for each estimated proportion with different deletion lengths are shown in the Supplementary Table S8.

The methods can still approximately estimate proportions of smaller deletions. However, the MSEs for estimated proportions increased with the length of deletion decreased (Supplementary Table S8). When the deletion length is shorter (i.e., 30 bps and 100 bps), the difference between insert size distributions of two populations becomes smaller, introducing errors to algorithms based on the normal mixture model. In addition to the reduced disparities between the mixed distributions, mapping errors could be another reason for the increased MSEs. Between the two methods, the advantages of Method 2 over Method 1 became smaller as the deletion size decreased. Because all reads were equally likely to be sampled from subpopulations, the smaller deletion size reduced the ratio of sampled number of split reads to the number of reads in entire data. This, in turn, results in a reduced improvement in estimation accuracy through involving split reads. In conclusion, the estimation accuracy decreases when the standard deviations of insert size increase and the size of deletion decreases. However, our methods can still approximately estimate proportions of deletions in these kinds of less ideal situations.

Case study: the CRISPR-Cas System of Enterococcus faecalis

Deep sequencing data from an experiment of E. faecalis (Huo et al. 2018) were used to evaluate the performance of the proposed methods. E. faecalis is a Gram-positive bacterium that normally inhabits the gastrointestinal tracts of humans and other animals (Lebreton et al. 2014). They can cause life-threatening infections in hospitalized patients (Arias and Murray 2012; Higuita and Huycke 2014). Antibiotic resistance genes have emerged among E. faecalis strains. These genes can further spread to other cells by HGT through the pheromone-responsive plasmids (PRPs) that are a type of narrow host-range conjugative plasmids that contain and disseminate the antibiotic resistance genes. However, donor PRPs have to overcome the genome defense systems in the recipient cells to achieve successful gene transfers. CRISPR-Cas systems are adaptive immune systems that contain spacers that are derived from past invasions, and which can target PRPs, thereby greatly reducing their dissemination. Thus, the immune systems can significantly impact the spread of antibiotic resistance genes among E. faecalis populations.

In the experiment of this study, the donor was a strain containing a PRP harboring erythromycin resistance via ermB (pAM714), while the recipient was E. faecalis T11RF that encodes a Type II CRISPR-Cas system consisting of 21 unique spacers including space 6 that exhibits 100% sequence identity to pAM714 (Fig. 5A) (Price et al. 2016, 2019). E. faecalis can transiently maintain both active CRISPR-Cas systems and its target PRP (Hullahalli et al. 2017). The presence of antibiotics that exert a selection advantage for cells containing the PRP, toxic CRISPR spacers, like spacer 6 in the data, may be lost over time. We applied the proposed methods to evaluate the dynamics of population structures identified by deletion mutants in spacers under antibiotic selection. In particular, we estimated the proportion of each deletion mutant among the total population and the variation of population proportions over time in the presence of antibiotic selection effects.

Fig. 5
figure 5

A T11 CRISPR-Cas System. B, C The distribution of insert size before and after the log transformation for E. faecalis data. D The distribution of \({L}_{i}\) for E. faecalis data. E The read is non-split and covers the deletion junction, indicating that \({L}_{i}\le {d}{L}_{k}-r\) and \({R}_{i}>{d}{R}_{k}\). F The left-end read is split by the deletion, indicating that \({L}_{i}\in \) \(\left({d}{L}_{k}-r,{d}{L}_{k}\right)\). The insert size is \({R}_{i}-{L}_{i}+1-{d}_{k}+r\). G The right-end is split by the deletion, indicating that \({R}_{i}\in \left({d}{L}_{k}-r,{d}{L}_{k}\right)\)

Deep sequencing data of CRISPR3 amplicons were obtained from serial passage of two distinct T11RF (pAM714) transconjugants (referred to as WT3 and WT5) from the NCBI Sequence Read Archive under BioProject ID: PRJNA418345. The sequencing data represented amplicons at Day 1 and Day 14 from daily serial passage of cultures in media with erythromycin. The sample sizes for Day 1 and Day 14 of WT3 are 4,334,878 and 4,176,691 and the sample sizes for Day 1 and Day 14 of WT5 are 4,490,190, and 3,715,094, respectively. The deep sequencing data comprised 2 × 150 bp paired-end sequence reads based on the 1,763 bp CRISPR3 amplicon reference locus (NZ GG688647.1, positions 646,834–648,596). The CRISPR3-Cas amplicon reference sequence contains 22 repeats and 21 unique spacer sequences (Fig. 5A). Each spacer comprises 30 nucleotides, and each repeat comprises 36 nucleotides. Further details of the data and experiment can be found in Huo et al. (2018).

The Illumina amplicon sequence reads were filtered using BBDuk (https://jgi.doe.gov/data-and-tools/bbtools/) to remove adapters. Reads with read length shorter than 75 bp were discarded after trimming based on Q20 scores. Around 5% of the paired-end reads were removed from each sample during the quality filtering process. The remaining high-quality reads were mapped to the 1763 bp amplicon reference using the BWA aligner. Matches longer than 66 bp were kept, because they could cover at least one repeat-spacer junction. A total of 0.19% and 0.33% of sequences for WT3 at Day 1 and Day 14, respectively, were unmapped, while the unmapping rates were 0.22% for WT5 at both days. After removing the unmapped reads, the proposed algorithm was evaluated with the remaining mapped reads.

Huo et al. (2018) identified two deletion mutants in WT3, the deletion of spacers from 5 to 7 (S5-S7), and the deletion of spacer 6 (S6). Multiple deletions were also observed in WT5 (i.e., S5–S8, S3–S16, S6, S4–S9, S1–S14, and S6–S18). To support these putative deletions, Breseq (Deatherage and Barrick 2014) was used with both WT3 and WT5 genomic data. The top-ranked deletions were the same as those reported by Huo et al. (2018). Thus, these deletions were used to construct candidate subpopulations for our analyses. After aligning reads to the amplicon reference sequence, a preliminary evaluation was conducted for the distribution of the insert sizes using the Day 1 data for WT3. All reads that may encompass a deletion were removed. The distribution of insert size exhibited a long right tail, so a natural log transformation was applied for the insert size. Histograms of insert size before and after the transformation are shown in Fig. 5B and C. \({L}_{i}\) is the start locus of the left-end read of the ith read mapped to the reference genome. The distribution of \({L}_{i}\) was also evaluated using the same data. A histogram of \({L}_{i}\) is shown in Fig. 5D, indicating that the distribution of \({L}_{i}\) is nearly uniform. The proposed methods using both split reads and non-split reads were then employed to estimate the proportions of each deletion mutation. For brevity, only the results using both split reads and non-split reads were shown. To obtain estimates of biases and standard errors, estimations were repeated with 100 bootstrap samples. Summary statistics for the estimators (\(\widehat{p}\)) of WT3 and WT5 are shown in Tables 4 and 5, respectively.

Table 4 Estimated proportions of deletions and parameters for insert size in WT3
Table 5 Estimated proportions of deletions and parameters of insert size for WT5

On Day 1, about 72.50% of the WT3 population did not have a deletion, but by Day 14, only 4.06% of the population did not carry a deletion. Likewise, about 45.63% of WT5 population lacked deletions on Day 1, while only 8.76% of the population remained wild type by Day 14. These results demonstrate a rapid and diverse change in the population distribution favoring the loss of spacer 6 over time due to the antibiotic selection. Although spacer 6 targets pAM714, we identified higher proportions of S5-S7 in WT3, and higher proportions of S5-S8 and S4-S9 in WT5, than S6 alone. Thus, the loss of S6 was often coupled with the loss of flanking spacers (Tables 4, 5). In addition, large differences between WT3 and WT5 may indicate numerous pathways leading to various deletion mutants among distinct T11RF (pAM714) transconjugants. The estimated proportions and rank ordering of mutant populations differed from that reported by Huo et al. (2018), because their estimation was only based on the number of supporting split reads and did not use the insert size information for non-split reads. Additionally, the authors did not consider differences between sample proportions and population proportions.

To further evaluate the robustness of the proposed methods, we introduced a deletion (S13–S15) that did not encompass the targeted spacer 6 to WT3 as a negative control. The estimated proportion of the negative control (S13–S15) was less than 1% (Supplementary Table S9). Additional negative controls, including those detected on Day 14 of WT5, were included for the WT3 populations. Supplementary Table S10 indicates that the estimated proportions of most negative controls were less than 1% on both Day 1 and Day 14, with the only exception of S4–S9 on Day 1 (1.25%). A small proportion of WT3 cells, if any, carried a deletion of S4–S9 on Day 1. Moreover, the estimated µ and σ of insert size were almost identical with or without the negative controls. A deletion (S5–S10) was added for WT5 that encompassed spacer 6 and a deletion (S13–S15) that did not contain spacer 6 were also included as negative controls. In Supplementary Table S10, the proportions of the two negative controls were less than 1%.

Discussion

Antibiotic resistance is one of the biggest threats to global health and can affect all parts of the world. Recently, more patients have been hospitalized because of the COVID-19 pandemic that creates a perfect storm for antibiotic-resistant infections (Centers for Disease Control and Prevention 2021). Although the causes and spread of antibiotic resistance mechanisms are more complicated, that mutation is associated with resistance to antibiotics has been ensured by researchers. The deletion mutations of the CRISPR-Cas system can compromise or inactivate the defense of antibiotic resistance. The loss of a functional CRISPR-Cas system results in a significant increase in antibiotic resistance acquisition (Hatoum and Marraf-fini 2014; Price et al. 2016). Specifically, CRISPR spacer deletion mutants emerged overtime under continuous antibiotic selection for the resistance plasmids. Consequently, it is necessary to study population proportions of deletion mutation and dynamics of subpopulations when an antibiotic selection has occurred.

In the study, we present a statistical method for estimating the proportions of bacterial populations exhibiting deletion mutations using NGS paired-end data. Unlike other SV studies that have focused on detecting deletion mutations, this study estimated the distributional changes over time of populations exhibiting various deletion mutations. Most SV detection tools are either based on insert sizes, which use only non-split reads, or otherwise using only split-mapping with only split reads. The method described here uses a mixture model with both insert size and split-mapping information to improve proportion estimations. Most existing algorithms also can only be used for human genome data. Here, an algorithm that is suitable for short genomes is presented. Since the implementation of EM algorithms does not require a large number of computational resources, our algorithm can also efficiently run on a typical desktop computer.

Some challenges that can influence the performance of our algorithm remain. The first is the NGS read quality. Biases are associated with NGS methods, including bias inherent to NGS library preparation and sequencing bias from the NGS technology. The biases associated with NGS libraries can result in compromised NGS data quality (Dijk et al. 2014). For example, Illumina has been shown to exhibit extreme base compositional biases in the sequencing of GC-poor and GC-rich sequences (Chen et al. 2013). A second challenge arises from the mapping sensitivity. Each genome has unique characteristics, and one particularly problematic characteristic is that bacterial genomes consist of a diversity of repetitive sequences. Genomic repeats complicate the alignment process and impact mapping sensitivity (Kosalai et al. 2017). For example, we observed some mismatches in our simulation studies that are likely due to the BWA aligner exhibiting a higher sensitivity toward longer read (> 100 bp) (Treangen and Salzberg 2011). A third potential challenge is related to a proper size selection step. Highly precise sizing of paired-end sequences provides the most accurate information about distances between read pairs (Science 2016). Finally, the insert size information of paired-end reads was used in our method. However, only Illumina NGS sequencers support paired-end reads.

A limitation may affect the performance of the proposed approaches. If the candidate deletions are not detected completely, our algorithm will overestimate the proportions of subpopulations with similar genome structures, resulting in estimation biases. For future work, to reduce the estimation biases, the SV detection step could be implemented by integrating multiple SV calling programs.

There are a few potential directions of our study in the future. Currently, we only involve the subpopulation of deletion mutations in the algorithms. The algorithm could also be extended to consider more types of SVs, e.g., insertions, inversions, and translocations. In addition, we could integrate the small programs that we used separately and implement our methods as a finished R package. Furthermore, the deep sequencing data of the E. faecalis experiment only contained two stages: the initial (Day 1) and the final (Day 14) stage. We could extend our model to a dynamical model involving time-series data. Another future improvement for our model would be to implement population genetic parameters. For example, population structure is usually inferred by differences in allele frequencies of subpopulations, dramatically different from the plain descriptive definition in this study. In addition to the population structure related to antibiotic resistance, the described method could be useful as a part of population genomics of structural variation studies where exact estimations are needed, such as the study of evolutionary and ecological genomics.

Materials and methods

Method 1: using only non-split reads

Since only non-split reads are used, the conditional distribution \(\mathrm{Pr}\left({L}_{i},{R}_{i}\mid {S}_{i}=\right.\) \(NS)\) was considered. Let \(\Theta =\left(\theta ,\mu ,\sigma \right)\) be all of the parameters that we want to estimate. The log (conditional) likelihood function is then (details can be found in Supplementary Material A.2)

$$\begin{array}{ll}\ell \left(\Theta \mid {S}_{i}=NS\right)& =\sum_{i=1}^{n} \mathrm{log}\sum_{k=1}^{K} \mathrm{Pr}\left({L}_{i},{R}_{i},{z}_{i}=k\mid {S}_{i}={NS};\mu ,\sigma ,\theta \right)\\ & =\sum_{i=1}^{n} \mathrm{log}\sum_{k=1}^{K} \mathrm{Pr}\left({L}_{i}\mid {z}_{i}=k,{S}_{i}={NS};\mu \right)\mathrm{Pr}\left({R}_{i}\mid {L}_{i},{z}_{i}=k,{S}_{i}=NS;\mu ,\sigma \right){\theta }_{k}^{*},\end{array}$$
(1)

where \({\theta }_{k}^{*}=\mathrm{Pr}\left({z}_{i}=k\mid {S}_{i}={NS};\theta \right)\) is the probability that the \(i\)th read is sampled from subpopulation \(k\) among all non-split reads. The insert size of the \(i\)th DNA fragment given \({z}_{i}=k\), denoted by \({X}_{ik}\), is assumed to follow a normal distribution with mean \(\mu \) and variance \({\sigma }^{2}\), if the size selection is strictly performed. The conditional distribution of \(\mathrm{Pr}\left({L}_{i}\mid {z}_{i}=k,{S}_{i}={NS};\mu \right)\) approximately follows a uniform distribution that is related to the length of the \(k\)-th genome, \({l}_{k}\), and the mean length of insert sizes, \(\mu \).

Given \({z}_{i}=1\), which implies that the \(i\)th paired-end read is sampled from the wild-type genome without deletion, \(\mathrm{Pr}\left({L}_{i}\mid {z}_{i}=1,{S}_{i}={NS};\mu \right)=\frac{1}{{l}_{1}-\mu }\), and \(\mathrm{Pr}\left({R}_{i}\mid {L}_{i},{z}_{i}=1,{S}_{i}={NS};\mu ,\sigma \right)=\mathcal{N}\left({X}_{i1};\mu ,{\sigma }^{2}\right)\), where \(\mathcal{N}\left(.;.,.\right)\) is the density function of a normal distribution and the insert size \({X}_{i1}={R}_{i}-{L}_{i}+\) \(1+r\). Thus, the formula 1 becomes

$$\mathrm{Pr}\left({L}_{i},{R}_{i},{z}_{i}=k\mid {S}_{i}=NS;\Theta \right)=\frac{{\theta }_{1}^{*}}{{l}_{1}-\mu }\mathcal{N}\left({X}_{i1};\mu ,{\sigma }^{2}\right).$$

When \({z}_{i}=k\), for \(k=\mathrm{2,3},\dots ,K\), the \(i\) th paired-end read is sampled from a subpopulation with a deletion. Since \({S}_{i}=NS,{L}_{i}\notin \left(d{L}_{k}-r,d{R}_{k}\right]\) and \({R}_{i}\notin \left(d{L}_{k}-r,d{R}_{k}\right].\) We have \(\mathrm{Pr}\left({L}_{i}\mid {z}_{i}=k,{S}_{i}={NS};\mu \right)=\mathcal{U}\left({L}_{i};0,\left({l}_{k}-\right.\right.\left.\mu )\left(1-{a}_{k}-{b}_{k}\right)\right)\) and \(\mathrm{Pr}\left({R}_{i}\mid {L}_{i},{z}_{i}=k,{S}_{i}=\mathrm{NS};\mu ,\sigma \right)=\mathcal{N}\left({X}_{ik};\mu ,{\sigma }^{2}\right)\). Therefore

$$\mathrm{Pr}\left({L}_{i},{R}_{i},{z}_{i}=k\mid {S}_{i}={NS};\Theta \right)=\frac{{\theta }_{k}^{*}}{\left({l}_{k}-\mu \right)\left(1-{a}_{k}-{b}_{k}\right)}\mathcal{N}\left({X}_{ik};\mu ,{\sigma }^{2}\right),$$

where \({a}_{k}=\mathrm{Pr}\left({S}_{i}=LS\mid {z}_{i}=k\right)\) and \({b}_{k}=\mathrm{Pr}\left({S}_{i}=RS\mid {z}_{i}=k\right)\). Detailed formulae for \({a}_{k}\) and \({b}_{k}\) are shown in Supplementary Material A.3. Then, \(\mathrm{Pr}\left({S}_{i}=NS\mid {z}_{i}=k;\theta \right)=1-\left({a}_{k}+{b}_{k}\right)\). There are two possible values for \({X}_{ik}\). If the \(i\) th paired-end read covers the \(k\) th deletion junction, it is considered a discordant read (Fig. 5E), and the insert size equals the mapped distance minus the length of the deletion \(k\), i.e., \({X}_{ik}={R}_{i}-{L}_{i}+1-{d}_{k}+r\). Otherwise, the read is considered concordant read and \({X}_{ik}={R}_{i}-{L}_{i}+1+r\).

Because \({\theta }_{k}^{*}\) is the conditional probability given reads are non-split, we can then obtain \({\theta }_{k}\) as follows:

$${\theta }_{k}=\frac{\frac{{\theta }_{k}^{*}}{1-\left({a}_{k}+{b}_{k}\right)}}{\sum_{k} \frac{{\theta }_{k}^{*}}{1-\left({a}_{k}+{b}_{k}\right)}}.$$

Method 2: using both non-split reads and split reads

For method 2, both non-split and split reads are used. The log-likelihood function can be written as (Supplementary Material A.4)

$$\begin{array}{ll}\ell \left(\Theta \right)& =\sum_{i=1}^{n} \mathrm{log}\sum_{k=1}^{K} \mathrm{Pr}\left({L}_{i},{R}_{i},{S}_{i},{z}_{i}=k;\Theta \right)\\ & =\sum_{i=1}^{n} \mathrm{log}\sum_{k=1}^{K} \mathrm{Pr}\left({L}_{i}\mid {S}_{i},{z}_{i}=k;\mu \right)\mathrm{Pr}\left({R}_{i}\mid {L}_{i},{S}_{i},{z}_{i}=k;\mu ,\sigma \right)\mathrm{Pr}\left({S}_{i}\mid {z}_{i}=k\right)\mathrm{Pr}\left({z}_{i}=k;\theta \right).\end{array}$$

Here, \(\mathrm{Pr}\left({z}_{i}=k;\theta \right)={\theta }_{k}\), and \(\mathrm{Pr}\left({R}_{i}\mid {L}_{i},{S}_{i},{z}_{i}=k;\mu ,\sigma \right)\) can be computed from \({X}_{ik}\) that is the insert size of the \(i\) th DNA fragment given that it is sampled from the subpopulation \(k\). \({S}_{i}\) can be left-split \(\left(LS\right)\), right-split \(\left(RS\right)\), or non-split \(\left(NS\right)\). Next \(\mathrm{Pr}\left({L}_{i},{R}_{i},{S}_{i},{z}_{i}=k;\Theta \right)\) is derived for the three cases.

Case 1: the ith paired-end read is left-split \(\left({{\varvec{S}}}_{{\varvec{i}}}={\varvec{L}}{\varvec{S}}\right)\)

The \(i\) th paired-end read can be sampled from subpopulation \(k\), where \(k\in \) \(\left\{\mathrm{2,3},\dots ,K\right\}\). We only need to consider \(k\), such that \({L}_{i}\in \left({d}{L}_{k}-r,{d}{L}_{k}\right)\) (Fig. 5F). Since \(\mathrm{Pr}\left({S}_{i}=LS\mid {z}_{i}=k\right)={a}_{k},\mathrm{Pr}\left({L}_{i}\mid {S}_{i}=LS,{z}_{i}=k;\mu \right)=\frac{1}{{a}_{k}\left({l}_{k}-\mu \right)}\) and \(\mathrm{Pr}\left({R}_{i}\mid {L}_{i},{S}_{i}={LS},{z}_{i}=k;\mu ,\sigma \right)=\mathcal{N}\left({X}_{ik};\mu ,{\sigma }^{2}\right)\), we have

$$\begin{array}{ll}\mathrm{Pr}\left({L}_{i},{R}_{i},{S}_{i}=LS,{z}_{i}=k;\Theta \right)& ={\theta }_{k}{a}_{k}\frac{1}{{a}_{k}\left({l}_{k}-\mu \right)}\mathcal{N}\left({X}_{ik};\mu ,{\sigma }^{2}\right)\\ & =\frac{{\theta }_{k}}{{l}_{k}-\mu }\mathcal{N}\left({X}_{ik};\mu ,{\sigma }^{2}\right).\end{array}$$

Here, the left-end is split by the deletion and covers the junctions, so the insert size equals the mapped distance minus the length of the deletion \(k\), that is

$${X}_{ik}={R}_{i}-{L}_{i}+1-{d}_{k}+r.$$

Case 2: the ith paired-end read is right-split \(\left({{\varvec{S}}}_{{\varvec{i}}}={\varvec{R}}{\varvec{S}}\right)\)

When the \(i\) th paired-end read is right-split \(\left({S}_{i}={RS}\right)\), we only need to consider subpopulation \(k\) such that \({R}_{i}\in \left({d}{L}_{k}-r,{d}{L}_{k}\right)\) (Fig. 5G). Since \(\mathrm{Pr}\left({S}_{i}=\right.\) \(\left.{RS}\mid {z}_{i}=k\right)={b}_{k}\) and \(\mathrm{Pr}\left({L}_{i}\mid {S}_{i}={RS},{z}_{i}=k;\mu \right)=\frac{1}{{b}_{k}\left({l}_{k}-\mu \right)}\), we have

$$\begin{array}{cc}\mathrm{Pr}\left({L}_{i},{R}_{i},{S}_{i}={RS},{z}_{i}=k;\Theta \right)& ={\theta }_{k}{b}_{k}\frac{1}{{b}_{k}\left({l}_{k}-\mu \right)}\mathcal{N}\left({X}_{ik};\mu ,{\sigma }^{2}\right)\\ & =\frac{{\theta }_{k}}{{l}_{k}-\mu }\mathcal{N}\left({X}_{ik};\mu ,{\sigma }^{2}\right).\end{array}$$

As with the case of left-split case, \({X}_{ik}={R}_{i}-{L}_{i}+1-{d}_{k}+r\).

Case 3: the ith paired-end read is non-split \(\left({{\varvec{S}}}_{{\varvec{i}}}={\varvec{N}}{\varvec{S}}\right)\)

When \({z}_{i}=1,\mathrm{Pr}\left({S}_{i}=NS\mid {z}_{i}=1\right)=1\). As with method 1, we have \(\mathrm{Pr}\left({L}_{i}\mid {S}_{i}=\right.\) \(\left.NS,{z}_{i}=1;\mu \right)=\mathcal{U}\left({L}_{i};0,{l}_{1}-\mu \right)\) and \(\mathrm{Pr}\left({R}_{i}\mid {L}_{i},{S}_{i}=NS,{z}_{i}=1;\mu ,\sigma \right)=\) \(\mathcal{N}\left({X}_{i1};\mu ,{\sigma }^{2}\right)\). Thus

$$\mathrm{Pr}\left({L}_{i},{R}_{i},{S}_{i}=NS,{z}_{i}=1;\Theta \right)={\theta }_{1}\frac{1}{{l}_{1}-\mu }\mathcal{N}\left({X}_{i1};\mu ,{\sigma }^{2}\right),$$

where the insert size is \({X}_{ik}={R}_{i}-{L}_{i}+1+r\), because it is a concordant read. When \({z}_{i}=k,k=\mathrm{2,3},\dots ,K\), we only need to consider \(k\), such that \({L}_{i}\notin \left({d}{L}_{k}-r,{d}{R}_{k}\right]\) and \({R}_{i}\notin \left({d}{L}_{k}-r,{d}{R}_{k}\right]\). Thus, \(\mathrm{Pr}\left({S}_{i}={NS}\mid {z}_{i}=k\right)=\) \(1-{a}_{k}-{b}_{k}\). The \({L}_{i}\) only can fall into the region of the sequence that cannot cause split reads, that is, \(\mathrm{Pr}\left({L}_{i}\mid {S}_{i}={NS},{z}_{i}=k;\mu \right)=\frac{1}{\left({l}_{k}-\mu \right)\left(1-{a}_{k}-{b}_{k}\right)}\). Thus

$$\begin{array}{cc}\mathrm{Pr}\left({L}_{i},{R}_{i},{S}_{i}={NS},{z}_{i}=k;\Theta \right)& ={\theta }_{k}\left(1-{a}_{k}-{b}_{k}\right)\frac{1}{\left({l}_{k}-\mu \right)\left(1-{a}_{k}-{b}_{k}\right)}\mathcal{N}\left({X}_{ik};\mu ,{\sigma }^{2}\right)\\ & =\frac{{\theta }_{k}}{{l}_{k}-u}\mathcal{N}\left({X}_{ik};\mu ,{\sigma }^{2}\right).\end{array}$$

There are two possible values of \({X}_{ik}\) depending on whether the read is concordant or discordant. If the read is discordant, i.e., \({L}_{i}\le d{L}_{k}-r\) and \({R}_{i}>d{R}_{k}\), the insert size is \({X}_{ik}={R}_{i}-{L}_{i}+1-{d}_{k}+r\). Otherwise, \({X}_{ik}={R}_{i}-{L}_{i}+1+r\).

The Expectation–Maximization (EM) algorithm

To derive estimators of \(\Theta \), the maximum-likelihood estimation (MLE) is used. However, it could be difficult to explicitly find the maximum-likelihood estimators of the parameters \(\Theta \), because \({z}_{i}\)'s are the unobserved latent random variables. If the \({z}_{i}\)'s were observed, finding the maximum-likelihood estimators would be straightforward. Consequently, the Expectation–Maximization (EM) algorithm (Dempster et al. 1977) is an efficient method for maximum-likelihood estimation under this scenario.

The EM algorithm is an iterative procedure to find the MLE of a parameter when data are incomplete, have missing values, or have hidden latent variables. The EM algorithm is an important statistical model that has been used in many fields since its introduction including in computer science and computational biology. The basic steps of an EM algorithm are: 1. Expectation (E) step: Creation of a function to obtain expected values using an initial estimate for parameters; 2. maximization (M) step: Computation of the MLE for the parameters using the estimated data from E-step; and 3. the \(\mathrm{E}\) and \(\mathrm{M steps}\) are repeated until convergence.

In this model, the algorithm tries to infer the values of \({z}_{i}\)'s in the E-step. In the M-step, the parameters of the model are updated based on outcomes in the E-step. For each \(i\), let \({Q}_{i}\left({z}_{i}\right)\) be the distribution over \({z}_{i}\)'s, such that \({Q}_{i}\left({z}_{i}\right)>0\) and \(\sum_{k}{Q}_{i} \left({z}_{i}=k\right)=1\). In Supplementary Material B.1, the \(\mathrm{log}-\mathrm{likelihood}\) function is shown as

$${\ell}\left(\Theta \right)\ge \sum_{i=1}^{n} \sum_{k=1}^{K} {Q}_{i}\left({z}_{i}=k\right)\mathrm{log}\frac{\mathrm{Pr}\left({L}_{i},{R}_{i},{S}_{i},{z}_{i}=k;\Theta \right)}{{Q}_{i}\left({z}_{i}=k\right)}.$$

Here, \({Q}_{i}\left({z}_{i}=k\right)=\mathrm{Pr}\left({z}_{i}=k\mid {L}_{i},{R}_{i},{S}_{i};\Theta \right)\) represents the posterior distribution for \({z}_{i}\)'s given \({L}_{i}, {R}_{i}{,S}_{i}\) and the parameter set, \(\Theta \). In general, the iterative EM procedures of Method 1 that uses non-split reads only is a subset of Method 2 that uses split read and non-split reads together. Therefore, Method 2 is used as an example. The steps for the EM algorithm iterative procedure are as follows.

E-Step: calculate posterior distribution

For each \(i\), we compute \({Q}_{i}\left({z}_{i}=k\right)\), denoted by \({w}_{ik}\), for all mixture components \(k=\mathrm{1,2},\dots ,K\). This yields an \(n\times K\) matrix of \({w}_{ik}\), where each of the rows sum to 1 because of \(\sum_{k}{w}_{ik}=1\)

$$\begin{array}{cc}{w}_{ik}& =\mathrm{Pr}\left({z}_{i}=k\mid {L}_{i},{R}_{i},{S}_{i};\theta ,\mu ,\sigma \right)\\ & =\frac{\mathrm{Pr}\left({L}_{i},{R}_{i},{S}_{i},{z}_{i}=k;\theta ,\mu ,\sigma \right)}{\sum_{k=1}^{K} \mathrm{Pr}\left({L}_{i},{R}_{i},{S}_{i},{z}_{i}=k;\theta ,\mu ,\sigma \right)}.\end{array}$$

M-step: update \({\varvec{\Theta}}\) to reach the lower bound

A detailed derivation of M-steps is shown in Supplementary Material B.2. When \(\mathrm{Pr}\left({L}_{i},{R}_{i},{S}_{i},{z}_{i}=k;\mu ,\sigma \right)=0,{w}_{ik}=0\), the situation is not considered in the M-step. When \({w}_{ik}>0\)

$$\Theta =\mathrm{argmax}\sum_{i,k,{w}_{ik}>0} {w}_{ik}\left(\mathrm{log}{\theta }_{k}+\mathrm{log}\frac{1}{{l}_{k}-\mu }+\mathrm{log}\frac{1}{\sigma \sqrt{2\pi }}{e}^{-\frac{{\left({X}_{ik}-\mu \right)}^{2}}{2{\sigma }^{2}}}-\mathrm{log}{w}_{ik}\right).$$

Three groups of parameters \(\left({\theta }_{k},\mu ,\sigma \right)\) can be estimated in the mixture model. The formulas for each of them are as follows:

$$\begin{array}{ll}{\theta }_{k} =\frac{\sum_{i=1}^{n} {w}_{ik}}{n}\\ \mu \approx \frac{\sum_{i=1}^{n} \sum_{k=1}^{K} {w}_{ik}{X}_{ik}}{n}\\ \sigma =\sqrt{\frac{\sum_{i=1}^{n} \sum_{k=1}^{K} {w}_{ik}{\left({X}_{ik}-\mu \right)}^{2}}{n}}.\end{array}$$