Child Care and Early Education Research Connections
Pre-experimental designs.
Pre-experiments are the simplest form of research design. In a pre-experiment either a single group or multiple groups are observed subsequent to some agent or treatment presumed to cause change.
Types of Pre-Experimental Design
One-shot case study design, one-group pretest-posttest design, static-group comparison.
A single group is studied at a single point in time after some treatment that is presumed to have caused change. The carefully studied single instance is compared to general expectations of what the case would have looked like had the treatment not occurred and to other events casually observed. No control or comparison group is employed.
A single case is observed at two time points, one before the treatment and one after the treatment. Changes in the outcome of interest are presumed to be the result of the intervention or treatment. No control or comparison group is employed.
A group that has experienced some treatment is compared with one that has not. Observed differences between the two groups are assumed to be a result of the treatment.
Validity of Results
An important drawback of pre-experimental designs is that they are subject to numerous threats to their validity . Consequently, it is often difficult or impossible to dismiss rival hypotheses or explanations. Therefore, researchers must exercise extreme caution in interpreting and generalizing the results from pre-experimental studies.
One reason that it is often difficult to assess the validity of studies that employ a pre-experimental design is that they often do not include any control or comparison group. Without something to compare it to, it is difficult to assess the significance of an observed change in the case. The change could be the result of historical changes unrelated to the treatment, the maturation of the subject, or an artifact of the testing.
Even when pre-experimental designs identify a comparison group, it is still difficult to dismiss rival hypotheses for the observed change. This is because there is no formal way to determine whether the two groups would have been the same if it had not been for the treatment. If the treatment group and the comparison group differ after the treatment, this might be a reflection of differences in the initial recruitment to the groups or differential mortality in the experiment.
Advantages and Disadvantages
As exploratory approaches, pre-experiments can be a cost-effective way to discern whether a potential explanation is worthy of further investigation.
Disadvantages
Pre-experiments offer few advantages since it is often difficult or impossible to rule out alternative explanations. The nearly insurmountable threats to their validity are clearly the most important disadvantage of pre-experimental research designs.
One-Group Posttest Only Design: An Introduction
The one-group posttest-only design (a.k.a. one-shot case study ) is a type of quasi-experiment in which the outcome of interest is measured only once after exposing a non-random group of participants to a certain intervention.
The objective is to evaluate the effect of that intervention which can be:
- A training program
- A policy change
- A medical treatment, etc.
As in other quasi-experiments, the group of participants who receive the intervention is selected in a non-random way (for example according to their choosing or that of the researcher).
The one-group posttest-only design is especially characterized by having:
- No control group
- No measurements before the intervention
It is the simplest and weakest of the quasi-experimental designs in terms of level of evidence as the measured outcome cannot be compared to a measurement before the intervention nor to a control group.
So the outcome will be compared to what we assume will happen if the intervention was not implemented. This is generally based on expert knowledge and speculation.
Next we will discuss cases where this design can be useful and its limitations in the study of a causal relationship between the intervention and the outcome.
Advantages and Limitations of the one-group posttest-only design
Advantages of the one-group posttest-only design, 1. advantages related to the non-random selection of participants:.
- Ethical considerations: Random selection of participants is considered unethical when the intervention is believed to be harmful (for example exposing people to smoking or dangerous chemicals) or on the contrary when it is believed to be so beneficial that it would be malevolent not to offer it to all participants (for example a groundbreaking treatment or medical operation).
- Difficulty to adequately randomize subjects and locations: In some cases where the intervention acts on a group of people at a given location, it becomes infeasible to adequately randomize subjects (ex. an intervention that reduces pollution in a given area).
2. Advantages related to the simplicity of this design:
- Feasible with fewer resources than most designs: The one-group posttest-only design is especially useful when the intervention must be quickly introduced and we do not have enough time to take pre-intervention measurements. Other designs may also require a larger sample size or a higher cost to account for the follow-up of a control group.
- No temporality issue: Since the outcome is measured after the intervention, we can be certain that it occurred after it, which is important for inferring a causal relationship between the two.
Limitations of the one-group posttest-only design
1. selection bias:.
Because participants were not chosen at random, it is certainly possible that those who volunteered are not representative of the population of interest on which we intend to draw our conclusions.
2. Limitation due to maturation:
Because we don’t have a control group nor a pre-intervention measurement of the variable of interest, the post-intervention measurement will be compared to what we believe or assume would happen was there no intervention at all.
The problem is when the outcome of interest has a natural fluctuation pattern (maturation effect) that we don’t know about.
So since certain factors are essentially hard to predict and since 1 measurement is certainly not enough to understand the natural pattern of an outcome, therefore with the one-group posttest-only design, we can hardly infer any causal relationship between intervention and outcome.
3. Limitation due to history:
The idea here is that we may have a historical event, which may also influence the outcome, occurring at the same time as the intervention.
The problem is that this event can now be an alternative explanation of the observed outcome. The only way out of this is if the effect of this event on the outcome is well-known and documented in order to account for it in our data analysis.
This is why most of the time we prefer other designs that include a control group (made of people who were exposed to the historical event but not to the intervention) as it provides us with a reference to compare to.
Example of a study that used the one-group posttest-only design
In 2018, Tsai et al. designed an exercise program for older adults based on traditional Chinese medicine ideas, and wanted to test its feasibility, safety and helpfulness.
So they conducted a one-group posttest-only study as a pilot test with 31 older adult volunteers. Then they evaluated these participants (using open-ended questions) after receiving the intervention (the exercise program).
The study concluded that the program was safe, helpful and suitable for older adults.
What can we learn from this example?
1. work within the design limitations:.
Notice that the outcome measured was the feasibility of the program and not its health effects on older adults.
The purpose of the study was to design an exercise program based on the participants’ feedback. So a pilot one-group posttest-only study was enough to do so.
For studying the health effects of this program on older adults a randomized controlled trial will certainly be necessary.
2. Be careful with generalization when working with non-randomly selected participants:
For instance, participants who volunteered to be in this study were all physically active older adults who exercise regularly.
Therefore, the study results may not generalize to all the elderly population.
- Shadish WR, Cook TD, Campbell DT. Experimental and Quasi-Experimental Designs for Generalized Causal Inference . 2nd Edition. Cengage Learning; 2001.
- Campbell DT, Stanley J. Experimental and Quasi-Experimental Designs for Research . 1st Edition. Cengage Learning; 1963.
Further reading
- Understand Quasi-Experimental Design Through an Example
- Experimental vs Quasi-Experimental Design
- Static-Group Comparison Design
- One-Group Pretest-Posttest Design
An official website of the United States government
Official websites use .gov A .gov website belongs to an official government organization in the United States.
Secure .gov websites use HTTPS A lock ( Lock Locked padlock icon ) or https:// means you've safely connected to the .gov website. Share sensitive information only on official, secure websites.
- Publications
- Account settings
- Advanced Search
- Journal List
Time series experimental design under one-shot sampling: The importance of condition diversity
Xiaohan Kang
Bruce hajek, yoshie hanzawa.
- Author information
- Article notes
- Copyright and License information
Competing Interests: The authors have declared that no competing interests exist.
* E-mail: [email protected]
Received 2019 Jun 14; Accepted 2019 Oct 16; Collection date 2019.
This is an open access article distributed under the terms of the Creative Commons Attribution License , which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
Many biological data sets are prepared using one-shot sampling, in which each individual organism is sampled at most once. Time series therefore do not follow trajectories of individuals over time. However, samples collected at different times from individuals grown under the same conditions share the same perturbations of the biological processes, and hence behave as surrogates for multiple samples from a single individual at different times. This implies the importance of growing individuals under multiple conditions if one-shot sampling is used. This paper models the condition effect explicitly by using condition-dependent nominal mRNA production amounts for each gene, it quantifies the performance of network structure estimators both analytically and numerically, and it illustrates the difficulty in network reconstruction under one-shot sampling when the condition effect is absent. A case study of an Arabidopsis circadian clock network model is also included.
Introduction
Time series data is important for studying biological processes in organisms because of the dynamic nature of the biological systems. Ideally it is desirable to use time series with multi-shot sampling , where each individual (such as a plant, animal, or microorganism) is sampled multiple times to produce the trajectory of the biological process, as in Fig 1 . Then the natural biological variations in different individuals can be leveraged for statistical inference, and thus inference can be made even if the samples are prepared under the same experimental condition.
Fig 1. Multi-shot sampling.
Each plant is observed four times.
However, in many experiments multi-shot sampling is not possible. Due to stress response of the organisms and/or the large amount of cell tissue required for accurate measurements, the dynamics of the relevant biological process in an individual of the organism cannot be observed at multiple times without interference. For example, in an RNA-seq experiment an individual plant is often only sampled once in its entire life, leaving the dynamics unobserved at other times. See the Discussion section for a review of literature on this subject. We call the resulting time series data, as illustrated in Fig 2 , a time series with one-shot sampling . Because the time series with one-shot sampling do not follow the trajectories of the same individuals, they do not capture all the correlations in the biological processes. For example, the trajectory of observations on plants 1–2–3–4 and that on 1–6–7–4 in Fig 2 are statistically identical. The resulting partial observation renders some common models for the biological system dynamics inaccurate or even irrelevant.
Fig 2. One-shot sampling.
Each plant is observed once.
To address this problem, instead of getting multi-shot time series of single individuals, one can grow multiple individuals under each condition with a variety of conditions, and get one-shot time series of the single conditions. The one-shot samples from the same condition then become a surrogate for multi-shot samples for a single individual, as illustrated in Fig 3 . In essence, if we view the preparation condition of each sample as being random, then there should be a positive correlation among samples grown under the same condition. We call this correlation the condition variation effect . It is similar to the effect of biological variation of a single individual sampled at different times, if such sampling were possible.
Fig 3. One-shot sampling with two different conditions.
For each condition, the one-shot samples at different times are also complemented by biological replicates , which are samples from independent individuals taken at the same time used to reduce technical and/or biological variations. See the Discussion section for a review on how replicates are used for biological inference. With a budget over the number of samples, a balance must be kept between the number of conditions, the number of sampling times and the number of replicates.
To illustrate and quantify the effect of one-shot sampling in biological inference, we introduce a simple dynamic gene expression model with a condition variation effect. We consider a hypothesis testing setting and model the dynamics of the gene expression levels at different sampling times by a dynamic Bayesian network (DBN), where the randomness comes from nominal (or basal) biological and condition variations for each gene. The nominal condition-dependent variation of gene j is the same for all plants in that condition and the remaining variation is biological and is independent across the individuals in the condition. In contrast to GeneNetWeaver [ 1 ], where the effect of a condition is modeled by a random perturbation to the network coefficients, in our model the condition effect is characterized by correlation in the nominal variation terms of the dynamics. Note in both models samples from different individuals under the same condition are statistically independent given the randomness associated with the condition.
The contributions of this paper are threefold.
A composite hypothesis testing problem on the gene regulatory network is formulated and a gene expression dynamic model that explicitly captures the per-gene condition effect and the gene regulatory interactions is proposed.
The performance of gene regulatory network structure estimators is analyzed for both multi-shot and one-shot sampling, with focus on two algorithms. Furthermore, single-gene and multi-gene simulation results indicate that multiple-condition experiments can somewhat mitigate the shortcomings of one-shot sampling.
The difficulty of network reconstruction under one-shot sampling with no condition effect is illustrated. This difficulty connects network analysis and differential expression analysis, two common tasks in large-scale genomics studies, in the sense that the part of network involving non-differentially expressed genes may be harder to reconstruct.
The simulation code for generating the figures is available at [ 2 ].
Materials and methods
Stochastic model of time-series samples.
This section formulates the hypothesis testing problem of learning the structure of the gene regulatory network (GRN) from gene expression data with one-shot or multi-shot sampling. The GRN is characterized by an unknown adjacency matrix. Given the GRN, a dynamic Bayesian network model is used for the gene expression evolution with time. Two parameters σ co, j and σ bi, j are used for each gene j , with the former explicitly capturing the condition variation effect and the latter capturing the biological variation level.
For any positive integer n , let [ n ] = {1, 2, …, n }. We use ( f ( x ) ) x ∈ I to denote the family of elements in the set { f ( x ) : x ∈ I } indexed by the index set I . The indicator function on a statement or a set P is denoted by 1 P . The n -by- n identity matrix is denoted by I n . The transpose of matrix A is denoted by A *.
Model for gene regulatory network topology
Let n be the number of genes and let A ∈ A ⊆ R n × n be the unknown adjacency matrix of the GRN. The sign of the entry a ij of A for i ≠ j indicates the type of regulation of j by i , and the absolute value the strength of the regulation. A zero entry a ij = 0 with i ≠ j indicates no regulation of j by i . The diagonal of A characterizes protein concentration passed from the previous time, protein degradation, and gene autoregulation. Let S = { S 1 , S 2 , … , S | S | } be a finite set of network structures and let D be a mapping from A to S ; D ( A ) represents the network structure of an adjacency matrix A . Then A is partitioned by the associated network structures. Fix a loss function l : S 2 → R . Let Y ∈ Y be the random observation and let δ : Y → S be an estimator for the structure. The performance of an estimator is evaluated by the expected loss E l ( D ( A ) , δ ( Y ) ) . This is a hypothesis testing problem with composite hypotheses { D - 1 ( S ) : S ∈ S } . This paper considers network reconstruction up to regulation type with D ( A ) = ( sgn ( A i j ) ) ( i , j ) ∈ [ n ] 2 , where sgn ( s ) = 1 { s > 0 } - 1 { s < 0 } . In other words, the ternary value of the edge signs (positive, negative, or no edge) are to be recovered. A structure S has the form S = ( S i j ) ( i , j ) ∈ [ n ] 2 with S ij ∈ {0, 1, −1}, and it can be interpreted as a directed graph with possible self-loops. Some examples of loss functions are as follows.
Note the FDR and the FNR are well-defined when S ′ and S contains nonzero elements, respectively, and the FPR is well-defined when S contains zeros. The error rate is always well-defined. It can be seen that l FDR ( S , S ′) = l FNR ( S ′, S ). Also if S does not contain zeros then l FNR ( S , S ′) = l E ( S , S ′). Similarly if S ′ does not contain zeros then l FNR ( S , S ′) = l E ( S , S ′). As an example, for a random guessing algorithm with probabilities of S i j ′ = 0 , 1 , - 1 being 1 − q , q /2, q /2 and a network prior with probabilities of S ij = 0, 1, −1 being 1 − p , p /2, p /2, l FDR = 1 − p /2, l FNR = 1 − q /2, and l FPR = q .
Model for gene expression dynamics
This section models the gene expression dynamics of individuals by a dynamic Bayesian networks with parameters σ co, j and σ bi, j as the condition variation level and biological variation level for gene j .
Let K , T and C be the number of individuals, sampling times, and conditions, respectively. Let X j k ( t ) ∈ R be the expression level of gene j ∈ [ n ] in individual k ∈ [ K ] at time t ∈ [ T ], and let c k ∈ [ C ] be the label that indicates the condition for individual k . Here we assume X j k ( t ) represents both the mRNA abundance and the protein concentration. The gene expression levels evolve according to the Gaussian linear model (GLM) with initial condition X j k ( 0 ) = 0 for any j ∈ [ n ], k ∈ [ K ] and the following recursion (note the values of X can be the expression levels after a logarithm transform, in which case lowly expressed genes have negative X values)
for j ∈ [ n ], k ∈ [ K ], and t ∈ {0, 1, …, T −1}, where ( W co , j c ( t ) ) ( c , j , t ) ∈ [ C ] × [ n ] × [ T ] and ( W bi , k j ( t ) ) ( j , k , t ) ∈ [ n ] × [ K ] × [ T ] are collections of independent standard Gaussian random variables that are used to drive the dynamics. Here the last two terms in ( 1 ) denote the condition variation and biological variation, respectively. To write ( 1 ) in matrix form, we let X ( t ) = ( X j k ( t ) ) ( k , j ) ∈ [ K ] × [ n ] and W ( t ) = ( W j k ( t ) ) ( k , j ) ∈ [ K ] × [ n ] be K -by- n matrices, where W j k ( t ) = σ co , j W co , j c k ( t ) + σ bi , j W bi , j k ( t ) . Then
The variable W j k ( t ) is the nominal mRNA production amount for target gene j , individual k at time t that would occur in the absence of regulation by other genes.
Model for sampling method
In this section two sampling methods are described: one-shot sampling and multi-shot sampling. For simplicity, throughout this paper we consider a full factorial design with CRT samples obtained under C conditions, R replicates and T sampling times, denoted by Y = ( Y c , r , t ) ( c , r , t )∈[ C ]×[ R ]×[ T ] . In other words, instead of X we observe Y , a noisy and possibly partial observation of X . Here the triple index for each sample indicates the condition, replicate, and time. As we will see in the Discussion at the end of this section, for either sampling method, the biological variation level σ bi, j can be reduced by combining multiple individuals to form a single sample.
Multi-shot sampling
Assume an individual can be sampled multiple times. This sampling model corresponds to K = CR and c k = ⌈ k R ⌉ ∈ [ C ] for all k ∈ [ K ]. Equivalently, multi-index ( c , r ) can be used to determine the individual instead of k for X and W with c denoting the condition and r the replicate. Then ( 1 ) for multi-shot sampling can be rewritten as
and the observation for condition c , replicate r and time t is
with ( Z j c , r , t ) ( j , c , r , t ) ∈ [ n ] × [ C ] × [ R ] × [ T ] being a collection of independent standard Gaussian random variables modeling the observation noise, and σ te, j is the technical variance level of gene j . We see that for fixed c and r the observations at different times are from the same individual with the multi-index ( c , r ). As a result, with multi-shot sampling Y is a noisy full observation of X .
One-shot sampling
Assume an individual can be sampled only once. This model corresponds to K = CRT and c k = ⌈ k R T ⌉ ∈ [ C ] for all k ∈ [ K ]. Equivalently, with multi-index ( c , r , s ) denoting the condition, the replicate, and the target sampling time, the evolution ( 1 ) for one-shot sampling can be rewritten as
and the observation is
Again σ te, j is the observation noise level of gene j and the Z ’s are independent standard Gaussian random variables. Note that for fixed c and r the observations at different times are from different individuals because the triple indices are different. Hence with one-shot sampling, Y is a noisy partial observation of X (to see this, note for gene 1 and the individual indexed by condition 1, replicate 1, and target sampling time 1, X 1 1 , 1 , 1 ( 1 ) , which is the expression level at time 1, is observed through Y 1 1 , 1 , 1 but X 1 1 , 1 , 1 ( 2 ) , which is the expression level at time 2, is not observed).
Discussion on sources of variance
The σ co , j W co , j c ( t ) terms measure the condition-dependent nominal production level as global driving noise terms that are shared across individuals under the same condition. They are independent and identically distributed (i.i.d.) across conditions. The σ bi , j W bi , j k ( t ) terms measure the biological nominal production level of individuals as local driving noise terms. They are i.i.d. across individuals. The σ te , j Z j c , r , t terms measure the technical variation of samples as additive observational noise terms that are not in the evolution of X . They are i.i.d. across samples. We then have the following observations.
If the samples of the individuals under many different conditions are averaged and treated as a single sample, then effectively σ co, j , σ bi, j and σ te, j are reduced.
If the samples of R individuals under same conditions (biological replicates) are averaged and treated as a single sample, then effectively σ bi , j 2 and σ te , j 2 are reduced by a factor of R while σ co , j 2 remains unchanged.
If material from multiple individuals grown under the same condition is combined into a composite sample before measuring, then effectively σ bi, j is reduced while σ co, j and σ te, j remain unchanged. Note for microorganisms a sample may consist of millions of individuals and the biological variation is practically eliminated ( σ bi, j ≈ 0).
If the samples from same individuals (technical replicates) are averaged and treated as a single sample, then effectively σ te, j is reduced while σ co, j and σ bi, j remain unchanged.
Note this sampling model with hierarchical driving and observational noises can also be used for single-cell RNA sequencing (scRNAseq) in addition to bulk RNA sequencing and microarray experiments. For scRNAseq, σ co, j can model the tissue-dependent variation (the global effect) and σ bi, j the per-cell variation (the local effect).
Performance evaluation of network structure estimators
This section studies the performance of network structure estimators with multi-shot and one-shot sampling data. First, general properties of the two sampling methods are obtained. Then two algorithms, the generalized likelihood-ratio test (GLRT) and the basic sparse linear regression (BSLR), are studied. The former is a standard decision rule for composite hypothesis testing problems and is shown to have some properties but is computationally infeasible for even a small number of genes. The latter is an algorithm based on linear regression, and is feasible for a moderate number of genes. Finally simulation results for a single-gene network with GLRT and for a multi-gene network with BSLR are shown.
General properties
By ( 3 ), ( 4 ) and ( 5 ), the samples Y are jointly Gaussian with zero mean. The covariance of the random tensor Y is derived under the two sampling methods in the following.
Under multi-shot sampling, the samples under different conditions are independent and hence uncorrelated. Consider Y c , r , t and Y c , r ′, t ′ , which are two samples under the same condition and collected at times t and t ′. The covariance matrix between Y c , r , t and Y c , r ′, t ′ is the sum of the covariance matrices of their common variations at times τ for 1 ≤ τ ≤ t ∧ t ′ multiplied by ( A *) t − τ on the left and A t ′− τ on the right, plus covariance for the observation noise. Let Σ co = diag ( σ co , 1 2 , σ co , 2 2 , … , σ co , n 2 ) , Σ bi = diag ( σ bi , 1 2 , σ bi , 2 2 , … , σ bi , n 2 ) , and Σ te = diag ( σ te , 1 2 , σ te , 2 2 , … , σ te , n 2 ) . Then the covariance matrix of the variations is Σ co + Σ bi if the two samples are from the same individual (i.e., r = r ′), and Σ co otherwise. This yields:
Under one-shot sampling the only difference compared with multi-shot sampling is that two samples indexed by ( c , r , t ) and ( c , r , t ′) are from different individuals if t ≠ t ′. So
For any fixed network structure estimator:
If Σ bi = 0 and C , R and T are fixed, the joint distribution of the data is the same for both types of sampling. So the performance of the estimator would be the same for multi-shot and one-shot sampling.
If Σ bi = 0 and Σ te = 0 (no observation noise) and C , T are fixed, the joint distribution of the data is the same for both types of sampling (as noted in item 1) and any replicates beyond the first are identical to the first. So the performance of the estimator can be obtained even if all replicates beyond the first are discarded.
Under multi-shot sampling, when C , R , T are fixed with R = 1, the joint distribution of the data depends on Σ co and Σ bi only through their sum. So the performance of the estimator would be the same for all Σ co and Σ bi such that Σ co + Σ bi is the same.
In the homogeneous gene case with σ co, j = σ co , σ bi, j = σ bi , σ te, j = σ te for all j with σ co * + σ bi * + σ te * > 0 , suppose that the estimator δ is based on replicate averages y = ( y c , t ) ( c , t )∈[ C ]×[ T ] with y c , t = 1 R ∑ r = 1 R Y c , r , t , and that δ is scale-invariant (i.e., δ ( Y ) = δ ( c 0 Y ) for any c 0 ≠ 0 and Y ). Then under multi-shot sampling, δ ’s performance depends on σ co , σ bi , σ te and R only through the ratio σ te 2 / R σ co 2 + σ bi 2 / R + σ te 2 / R . Under one-shot sampling, the estimator’s performance depends on σ co , σ bi , σ te and R only through the ratios σ te 2 / R σ co 2 + σ bi 2 / R + σ te 2 / R and σ co 2 σ co 2 + σ bi 2 / R (through the latter only when σ co 2 + σ bi 2 > 0 ).
To see 4), recall from observation 2 above that averaging reduces the variance of the biological variation and that of the observation noise by a factor of R due to independence, but preserves the condition variation because it is identical across replicates. Hence the variance of the driving noise in the averages is σ co 2 + σ bi 2 / R and the variance of the observation noise of the averages is σ te 2 / R . Then the averages are essentially single-replicate data, and the performance under multi-shot sampling depends only on the ratio of the new driving noise variance to the new observational noise variance. For one-shot sampling the ratio between the condition variation and the biological variation also matters for the single-replicate data when the condition variation and the biological variation are not both zero, so the performance also depends on σ co 2 σ co 2 + σ bi 2 / R .
Network reconstruction algorithms
In this section we introduce GLRT and BSLR. GLRT is a standard choice in composite hypothesis testing setting. We observe some properties for GLRT under one-shot and multi-shot sampling. However, GLRT involves optimizing the likelihood over the entire parameter space, which grows exponentially with the square of the number of genes. Hence GLRT is hard to compute for multiple-gene network reconstruction. In contrast, BSLR is an intuitive algorithm based on linear regression, and will be shown in simulations to perform reasonably well for multi-gene scenarios.
The GLRT (see, e.g., page 38, Chapter II.E in [ 3 ]) is given by δ ( y ) = D ( A ^ ML ( y ) ) , where A ^ ML ( y ) is the maximum-likelihood estimate for A based on the covariance of Y given the observation Y = y .
Proposition 1 GLRT ( with the knowledge of Σ co , Σ bi and Σ te ) has the following properties .
For a fixed σ 2 , under multi-shot sampling with Σ te = 0 ( no observation noise ), σ co, j = σ co , σ bi, j = σ bi , and σ co 2 + σ bi 2 = σ 2 , the performance of GLRT for sign estimation is the same for all ( R , σ co , σ bi ) excluding ( R ≥ 2, σ bi = 0).
Under one-shot sampling and Σ co = 0, the log likelihood of the data as a function of A ( i . e . the log likelihood function ) is invariant with respect to replacing A by − A . So , for the single-gene n = 1 case , the log likelihood function is an even function of A , and thus the GLRT will do no better than random guessing .
For 2 it suffices to notice in ( 6 ) the covariance is invariant with respect to changing A to − A . A proof of 1 is given below.
Proof of 1) We first prove it for the case of a single gene with constant T and a constant number of individuals, CR . To do that we need to look at the likelihood function closely.
We may assume σ 2 = 1. Because the trajectories for different conditions are independent (for given parameters ( A , σ co 2 ) ), we shall first consider the case with a single condition; i.e., C = 1. There are hence R trajectories of length T . Then the covariance matrix of the length- R driving vector used at time t for the trajectories is
When σ co > 0, Σ is not the identity matrix multiplied by some constant; i.e., the noise vector W ( t ) is colored across replicates. It can be checked when σ co < 1 (i.e., σ bi > 0) the matrix Σ is positive definite. Then there exists an orthogonal matrix U and a diagonal matrix Λ with positive diagonal elements such that Σ = U Λ U *. Let Σ −1/2 = U Λ −1/2 U * and let
for all t ∈ [ T ]. Then the trajectories for the R replicates in a single condition become:
It can be checked that after the linear transformation by Σ −1/2 , which does not depend on A , the new driving vectors are white (i.e., Cov ( W ˜ ( t ) ) = I R ). It follows that the distribution of X ˜ | ( A , σ co 2 ) is the same as the distribution of X |( A , 0) (i.e. σ co = 0). Therefore, for x = ( x r ( t ) ) ( r , t ) ∈ [ R ] × [ T ] , if we let L X ( x | A , σ co 2 ) denote the likelihood of X = x for parameters A , σ co 2 , then
where d ( R , T , σ co 2 ) = ( det Σ ) - T / 2 is a function of R , T and σ co 2 , and x ˜ ( t ) = Σ - 1 / 2 x ( t ) .
Now consider the likelihood function for all CRT samples with general C . It is the product of C likelihood functions for the samples prepared under the C different conditions. It is thus equal to d ( R , T , σ co 2 ) C times the likelihood of the transformed expression levels x ˜ , which is the likelihood function for σ co = 0 and a total of CRT samples. The form of the product depends on C and R only through CR , because under the transformation, all CR trajectories are independent. Hence, for fixed A , σ co 2 , C , R , T the distribution of the maximum likelihood estimate of A , when the samples are generated using a given σ co > 0 (so the R individuals under each condition are correlated) and the likelihood function also uses σ co 2 , is the same as the distribution of the maximum likelihood estimate of A when σ co = 0 (in which case the CR individual trajectories are i.i.d.). Formally,
where E σ co denotes that the condition variation level of the random elements X and Y is σ co 2 . The above fails if σ co = 1 (i.e., σ bi = 0) and R ≥ 2 because then Σ is singular. It also fails if σ co and σ bi are unknown to the GLRT.
For the general model with multiple genes, if σ co, j is the same for each gene j , 1) holds as before—for the proof, apply left multiplication by Σ - 1 2 for each gene, time, and condition to all R samples in the condition. View ( 2 ) as an update equation for an R × n matrix for each group of R samples in one condition. One column of length R per gene, and one row per sample.
In BSLR, replicates are averaged and the average gene expression levels at different times under different conditions are fitted in a linear regression model with best-subset sparse model selection, followed by a Granger causality test to eliminate the false discoveries. BSLR is similar to other two-stage linear regression–based network reconstruction algorithms, notably oCSE [ 4 ] and CaSPIAN [ 5 ]. Both oCSE and CaSPIAN use greedy algorithms in the first build-up stage, making them more suitable for large-scale problems. In contrast, BSLR uses best subset selection, which is conceptually simpler but computationally expensive for large n . For the tear-down stage both BSLR and CaSPIAN use the Granger causality test, while oCSE uses a permutation test.
Build-up stage
In the first stage BSLR finds potential regulatory interactions using a linear regression model. Let Y j c ( t ) = 1 R ∑ r = 1 R Y c , r , t and let Y ( t ) = ( Y j c ( t ) ) ( c , j ) ∈ [ C ] × [ n ] denote the C -by- n matrix. Let
For each target gene j ∈ [ n ], BSLR solves the following best subset selection problem with a subset size k < n :
Denote the solution by ( A *, b *, d *). The output of the first stage is then A *.
A naive algorithm to solve the above optimization has a computational complexity of O ( n k +1 ) for fixed k as n → ∞. Faster near-optimal alternatives exist [ 6 ].
Tear-down stage
The second stage is the same as that of CaSPIAN. For each j ∈ [ n ] and each i ∈ supp ( A · j * ) , let the unrestricted residual sum of squares be
and the restricted residual sum of squares
The F -statistic is given by
The potential parent i of j is removed in the tear-down stage if the p -value of the F -statistic with degrees of freedom (1, C ( T − 1) − k − 2) is above the preset significance level (e.g., 0.05). Note the tests are done for all parents in A ⋅ j simultaneously; both the restricted and the unrestricted models contain the other potential parents regardless of the results of the tests on them.
Simulations on single-gene network reconstruction
The GLM is used to simulate one-shot sampling data with a single gene. The goal is to determine the type of autoregulation of the single gene (activation or repression). The protein concentration passed from the previous time is ignored so the type of autoregulation is represented by the sign of the scalar A . In order to compare one-shot and multi-shot sampling, we view the main expense to be proportional to the number of samples to prepare as opposed to the number of individuals to grow. We thus fix a total budget of CRT = 180 samples and consider full factorial design with C and R varying with CR = 30, and T = 6 with 10 000 simulations. We assume the knowledge of the existence of the autoregulation (i.e., A ≠ 0), in which case the FDR, the FNR and the error rate coincide, so we only look at error rates. The results are plotted in Fig 4 . The four plots on the left are for one-shot sampling and the four on the right are for multi-shot sampling. Consider the homogeneous case with σ co, j = σ co , σ bi, j = σ bi and σ te, j = σ te for all j and let γ = σ co 2 σ co 2 + σ bi 2 be the fraction of condition variation in the driving noise. For each plot the observed probability of sign (of A ) error is shown for γ ∈ {0, 0.2, 0.4, 0.6, 0.8, 1.0} and for R ranging over the divisors of 30 from smallest to largest. Fig 4a–4d show the performance for the GLRT algorithm assuming no observation noise ( σ te = 0), known γ and known total driving variation σ 2 = σ co 2 + σ bi 2 = 1 . Fig 4e–4h show the performance for the GLRT algorithm assuming known driving noise level σ = 1 and observational noise level σ te = 1, while both γ and A are unknown to the algorithm.
Fig 4. Performance of the GLRT in single-gene sign recovery with different numbers of replicates.
The numerical simulations reflect the following observations implied by the analytical model.
Under one-shot sampling, when γ = 0, the GLRT is equivalent to random guessing.
The GLRT performs the same under one-shot and multi-shot sampling when γ = 1.
Under no observation noise, the performance for multi-shot sampling is the same for all γ < 1.
Some empirical observations are in order.
Multi-shot sampling outperforms one-shot sampling (unless γ = 1, where they have the same error probability).
For one-shot sampling, the performance improves as γ increases. Regarding the number of replicates R per condition, if γ = 0.2 (small condition effect), a medium number of replicates (2 to 5) outperforms the single replicate strategy. For larger γ , one replicate per condition is the best.
For multi-shot sampling, performance worsens as γ increases. One replicate per condition ( R = 1) is best.
Comparing Fig 4a–4d vs. Fig 4e–4h , we observe that the performance degrades with the addition of observation noise, though for moderate noise ( σ te = 1.0) the effect of observation noise on the sign error is not large. Also, the effect of the algorithm not knowing γ is not large.
Simulations on multi-gene network reconstruction
This subsection studies the case when multiple genes interact through the GRN. The goal is to compare one-shot vs. multi-shot sampling for BSLR under a variety of scenarios, including different homogeneous γ values, varying number of replicates, varying observation noise level, and heterogeneous γ values.
The performance evaluation for multi-gene network reconstruction is trickier than the single-gene case because of the many degrees of freedom introduced by the number of genes. First, the network adjacency matrix A is now an n -by- n matrix. While some notion of “size” of A (like the spectral radius or the matrix norm) may be important, potentially every entry of A may affect the reconstruction result. So instead of fixing a ground truth A as in Fig 4 , we fix a prior distribution of A with split Gaussian prior described in S2 Appendix (note we assume the knowledge of no autoregulation), and choose A i.i.d. from the prior distribution with d max = 3. Second, because the prior of A can be subject to sparsity constraints and thus far from a uniform distribution, multiple loss functions that are more meaningful than the ternary error rate can be considered for performance. So we consider ternary FDR, ternary FNR and ternary FPR for the multi-gene case. In the simulations we have 20 genes and d max = 3 with in-degree uniformly distributed over {0, 1, …, d max }, so the average in-degree is 1.5. The number of sampling times is T = 6 and CR = 30.
Varying γ , R and σ te
In this set of simulations we fix the observation noise level and vary the number of replicates R and the condition correlation coefficient γ . The performance of BSLR under one-shot and multi-shot sampling is shown in Fig 5 ( σ te = 0) and Fig 6 ( σ te = 1). Note BSLR does not apply to a single condition with 30 replicates due to the constraint that the degrees of freedom C ( T − 1) − k − 2 in the second stage must be at least 1.
Fig 5. Performance of the BSLR in multi-gene network reconstruction with different numbers of replicates, σ te = 0.
Fig 6. Performance of the BSLR in multi-gene network reconstruction with different numbers of replicates, σ te = 1.
For one-shot sampling, when γ = 0, we see in both Figs 5 and 6 that BSLR is no different from random guessing, with an FDR close to 1 - 1 2 1 . 5 19 ≈ 0 . 96 and an FNR and an FPR such that l FNR + 1 2 l FPR ≈ 1 (recall the example of random guessing at the end of the section of the model for gene regulatory network topology). When γ = 1, BSLR performs similarly with one-shot or multi-shot sampling, which is consistent with property 1 in the section on general properties. As γ increases from 0 to 1, under one-shot sampling for a fixed number of replicates, the FDR and FNR reduce greatly. For example, as γ increases from 0.2 to 1, the FDR for single replicate under one-shot sampling decreases from 0.74 to 0.31 with noiseless data ( Fig 5 ), and from 0.79 to 0.36 with noisy data ( Fig 6 ), while the FNR decreases from 0.70 to 0.00 with noiseless data, and from 0.78 to 0.04 with noisy data. This decrease is more pronounced for smaller number of replicates. Note the trend of the performance of BSLR under one-shot sampling as a function of R and γ is very similar to that of GLRT in Fig 4e and 4g .
For multi-shot sampling, in the noiseless case, we see all three losses are invariant with respect to different γ for fixed R , which is consistent with property 4 in the section on general properties because BSLR is an average-based scale-invariant algorithm (note CR is a constant so for different R the performance is different due to the change in C ). In the noisy case, the FDR and FNR slightly decrease as γ increases, which is an opposite trend compared with Fig 4f and 4h .
In summary, the main conclusions from Figs 5 and 6 are the following.
The performance of BSLR under multi-shot sampling is consistently better than that under one-shot sampling.
The performance of BSLR under one-shot sampling varies with γ , from random guessing performance at γ = 0 to the same performance as multi-shot sampling at γ = 1.
By comparing Figs 5 with 6 , we see the observation noise of σ te = 1 has only a small effect on the performance with the two sampling methods.
Reduced number of directly differentially expressed genes
In the above simulations we have assumed all genes are equally directly differentially expressed. In other words, we took σ co , j 2 + σ bi , j 2 = 1 and σ co, j = σ co for all j . To test what happens more generally, we conducted simulations such that only half of the genes are directly differentially expressed genes (DDEGs), while the other half are non-DDEGs. To do so, we assign σ co , j 2 = 0 . 8 and σ bi , j 2 = 0 . 2 for 1 ≤ j ≤ 10, and σ co , j 2 = 0 and σ bi , j 2 = 1 for 11 ≤ j ≤ 20. The results for R = 3 are pictured in Fig 7 . We see that with one-shot sampling the edges coming out of the DDEGs are reconstructed with lower FDR and FNR compared to those coming out of non-DDEGs. However, under one-shot sampling, even the edges from the non-DDEGs in Fig 7 are recovered with much lower FDR and FNR, as compared to one-shot sampling in Fig 6 with γ = 0 and R = 3 (both FDR and FNR are around 0.95). The results indicate that the performance of BSLR under one-shot sampling benefits from diversity in conditions even when not all genes are directly differentially expressed.
Fig 7. Performance of BSLR for heterogeneous σ co, j with σ te = 1.
We summarize the simulations performed in Table 1 . Note the last row is a summary of Table 2 in the Discussion section.
Table 1. Summary of simulation results.
OS and MS stand for the losses of one-shot sampling and multi-shot sampling. RG stands for random guessing. * indicates mathematically proved results.
Table 2. BSLR evaluation on Locke network.
The errors are estimated using Hoeffding’s inequality over 1000 simulations with significance level 0.05.
Information limitation for reconstruction under one shot sampling without condition effect
In the previous section it is shown that both GLRT and BSLR are close to random guessing under one-shot sampling when σ co, j = 0 for all j . This leads us to the following question: is the network reconstruction with no condition effect ( σ co, j = 0 for all j ) information theoretically possible? In this section we examine this question under general estimator-independent settings. Note in this case the trajectories of all individuals are independent given A regardless of ( c k ) k ∈[ K ] .
As we have seen in Proposition 1 part 2, when Σ co = 0, the distribution of the observed data Y is invariant under adjacency matrix A or − A , implying any estimator will have a sign error probability no better than random guessing for the average or worst case over A and − A . Here, instead of sign error probability, we consider the estimation for A itself.
The extreme case with infinite number of samples available for network reconstruction is considered to give a lower bound on the accuracy for the finite data case. Note that with infinite number of samples a sufficient statistic for the estimation of the parameter A is the marginal distributions of X 1 ( t ); no information on the correlation of ( X 1 ( t )) t ∈[ T ] across time t can be obtained. A similar observation is made in [ 7 ] for sampling stochastic differential equations.
We first consider the transient case with X (0) = 0 as stated in the section of the model for gene expression dynamics. With infinite data the covariance matrix Σ t ≜ Cov ( X ( t ) ) = ∑ τ = 1 t ( A * ) t - τ A t - τ can be recovered for t ∈ [ T ]. Now we want to solve A from (Σ t ) t ∈[ T ] . As a special case, if A * A = ρI n (i.e., ρ −1/2 A is orthogonal) then we will have Σ t = ∑ τ = 0 t - 1 ρ τ I n . As a result, given (Σ t ) t ∈[ T ] in the above form, no more information of A can be obtained other than ρ −1/2 A being orthogonal, with n ( n - 1 ) 2 degrees of freedom remaining. In general case it is not clear if A can be recovered from (Σ t ) t ∈[ T ] .
Now consider the case where X k is in steady state; i.e., X (0) is random such that Cov( X ( t )) is invariant with t . With infinite amount of data we can get the covariance matrix Σ, which satisfies Σ = A *Σ A + I . Since covariance matrices are symmetric, we have n ( n + 1 ) 2 equations for n 2 variables in A . Thus A is in general not determined by the equation uniquely. In fact, note that Σ is positive definite. Then by eigendecomposition Σ = Q Λ Q *, where Q is an orthogonal matrix and Λ the diagonal matrix of the eigenvalues of Σ. Then Λ = ( Q * AQ )*Λ( Q * AQ ) + I . Let B = QAQ *. Then Λ = B *Λ B . By the Gram–Schmidt process, B can be determined with n ( n - 1 ) 2 degrees of freedom. So the network cannot be recovered from the stationary covariance matrix.
In summary, the recovery of the matrix A is generally not possible in the stationary case, and also not possible in the transient case at least when A is orthogonal. To reconstruct A , further constraints (like sparsity) may be required.
One-shot sampling in the literature
This section reviews the sampling procedures reported in several papers measuring gene expression levels in biological organisms with samples collected at different times to form time series data. In all cases, the sampling is one-shot, in the sense that a single plant or cell is only sampled at one time.
Microorganisms
In the transcriptional network inference challenge from DREAM5 [ 8 ], three compendia of biological data sets were provided based on microorganisms ( E. coli , S. aureus , and S. cerevisiae ), and some of the data corresponded to different sampling times in a time series. Being based on microorganisms, the expression level measurements involved multiple individuals per sample, a form of one-shot sampling.
In [ 9 ], the plants are exposed to nitrate, which serves as a synchronizing event, and samples are taken from three to twenty minutes after the synchronizing event. The statement “… each replicate is independent of all microarrays preceding and following in time” suggests the experiments are based on one-shot sampling. In contrast, the state-space model with correlation between transcription factors in an earlier time and the regulated genes in a later time fits multi-shot sampling. [ 10 ] studied the gene expression difference between leaves at different developmental stages in rice. The 12th, 11th and 10th leaf blades were collected every 3 days for 15 days starting from the day of the emergence of the 12th leaves. While a single plant could provide multiple samples, namely three different leaves at a given sampling time, no plant was sampled at two different times. Thus, from the standpoint of producing time series data, the sampling in this paper was one-shot sampling. [ 11 ] devised the phenol-sodium dodecyl sulfate (SDS) method for isolating total RNA from Arabidopsis . It reports the relative level of mRNA of several genes for five time points ranging up to six hours after exposure to a synchronizing event, namely being sprayed by a hormone trans -zeatin. The samples were taken from the leaves of plants. It is not clear from the paper whether the samples were collected from different leaves of the same plant, or from leaves of different plants.
[ 12 ] likely used one-shot sampling for their −24, 60, 120, 168 hour time series data from mouse skeletal muscle C2C12 cells without specifying whether the samples are all taken from different individuals. [ 13 ] produced time series data by extracting cells from a human, seeding the cells on plates, and producing samples in triplicate, at a series of six times, for each of five conditions. Multiple cells are used for each sample with different sets of cells being used for different samples, so this is an example of one-shot sampling of in vitro experiment in the sense that each plate of cells is one individual. The use of (five) multiple conditions can serve as a surrogate for a single individual set of cells to gain the effect of multi-shot sampling. Similarly, the data sets produced by [ 14 ] involving the plating of HeLa S3 cells can be classified as one-shot samples because different samples are made from different sets of individual cells. Interestingly, the samples are prepared under one set of conditions, so the use of different conditions is not adopted as a surrogate for multi-shot sampling. However, a particular line of cells was selected (HeLa S3) for which cells can be highly synchronized. Also, the paper does not attempt to determine causal interactions.
The three in silico benchmark suites described in the GeneNetWeaver paper on performance profiling of network inference methods [ 1 ] consisted of steady state, and therefore one-shot, samples from dynamical models. However, the GeneNetWeaver software can be used to generate multi-shot time series data, and some of that was included in the network inference challenges, DREAM3, DREAM4, and DREAM5 [ 1 , 8 ].
On biological replicates
In many biological experiments, independent biological replicates are used to reduce the variation in the measurements and to consequently increase the power of the statistical tests. It turns out that both how to use biological replicates, and the power of biological replicates, depend on whether the sampling is one-shot or multi-shot. To focus on this issue we first summarize how replicates have traditionally been used for the more common problem of gene differential expression analysis, before turning to the use of replicates for recovery of gene regulatory networks.
The following summarizes the use of replicates for gene differential expression analysis. A recent survey [ 15 ] suggests a minimum of three replicates for RNA-seq experiments whenever sample availability allows. Briggs et al. [ 16 ] studies the effect of biological replication together with dye switching in microarray experiments and recommends biological replication when precision in the measurements is desired. Liu et al. [ 17 ] studies the tradeoff between biological replication and sequencing depth under a sequencing budget limit in RNA-seq differential expression (DE) analysis. It proposes a metric for cost effectiveness that suggests a sequencing depth of 10 million reads per library of human breast cells and 2–6 biological replicates for optimal RNA-seq DE design. Schurch et al. [ 18 ] studies the number of necessary biological replicates in RNA-seq differential expression experiments on S. cerevisiae quantitatively with various statistical tools and concludes with the usage of a minimum of six biological replicates.
The choice of replication strategy depends on how the statistical algorithm uses the replicate data. In many differential analysis software packages replicates are treated as independent samples with identical experimental conditions. For example, in edgeR [ 19 ] and sleuth [ 20 ] the logarithm of the abundance of gene i in sample m is assumed to be x m * β i , where x m is the column vector of design characteristics with respect to p variates for sample m and β i the column vector of the associated effects of the p variates to gene i . Replicate samples can then be used to expand the design matrix x with identical columns. Note that, as a result, replicates are not necessary for edgeR and sleuth because samples with different design characteristics can all contribute to the estimation of β . It is then not clear whether it is better to have more replicates under the same condition, or to have more conditions, for a fixed total number of samples.
For regulatory network reconstruction there is even less consensus on how replicates should be used. One straightforward way is to reduce the replicates into a single set of data by averaging either directly or after a random resampling of the original replicated data. In this case the mean of the replicates are used as a better estimate of the population than each single replicate, while higher moments of the empirical distribution of the replicates are practically ignored. Another way adopted in [ 9 ] is to account for all four potential transitions between two replicates in two adjacent sampling times in their machine learning algorithm due to the one-shot nature of the replicates. In the next section, we illustrate why replicates should be used differently for one-shot and multi-shot sampling, in the context of recovering a circadian clock network model.
A case study on Arabidopsis circadian clock network
As we have discussed earlier, the current expression datasets are prominently one-shot, making a direct comparison between one-shot and multi-shot sampling in real biological data difficult. The lack of a well-accepted ground truth of the gene regulatory network also makes performance evaluation hard, if not impossible. To test the applicability of the sampling models on real biological data, we generate expression data from a most-accepted Arabidopsis circadian clock model using stochastic differential equation (SDE) model similar to GeneNetWeaver with condition-dependent Brownian motions, and evaluate the performance of BSLR.
To extend the sampling models in this paper to the more biologically plausible SDE models, we model the individual and condition-dependent variations by independent and coupled Brownian motions. Following the Arabidopsis clock network in [ 21 ], we let genes 1, 2, 3, and 4 be LHY , TOC1 , X and Y, and construct the following group of SDEs (the dark condition in [ 21 ] is assumed here).
Here x i , t k , y i , t k , and z i , t k denote the mRNA abundance, and cytoplasmic and nuclear protein concentrations of gene i at time t in plant k . The B terms are independent standard Brownian motions (Wiener processes). Note the linear diffusion terms attenuate the Brownian motions as the processes get close to 0 and consequently keep the processes nonnegative. Compared to the GLM analyzed in this paper, this SDE model captures the nonlinearity of the regulatory interactions, the continuous-time nature of the dynamics, and the detailed diffusion from mRNA to cytoplasmic and nuclear protein. So the SDE model is much more complicated and considered a basic version of the state-of-the-art circadian clock model (see [ 22 , 23 ]). Nevertheless, the SDE model shares a property with the GLM that allows the gene regulatory interactions to be summarized by a single signed directed graph: the effect of increasing the mRNA abundance of one gene on that of another has a constant sign regardless of the mRNA abundances and the protein concentrations of any genes. For example, the drift of x 1 (mRNA of LHY ) would have a tendency of increasing with an increased x 3 (mRNA of X) through the equations for y 3 (cytoplasmic protein of X) and z 3 (nuclear protein of X) regardless of the specific values of all the x ’s, y ’s and z ’s. Fig 8 shows the signed directed graph for the SDE model of Locke network. Now we have a ground-truth network based on the SDE model.
Fig 8. The signed directed graph for the Locke network.
We choose the parameters in the drift coefficients (the terms in front of the d t ’s) accordingly to the supplementary material of [ 21 ], where the authors optimized the parameters to fit experimental results. We sample the SDE at times 0, 2, 4, 6, 8, and 10 for a single condition ( C = 1) with three replicates ( R = 3), σ co = 0.3 and σ bi = 0.4, and obtain the performance of BSLR in Table 2 . The estimates and the errors are based on 1000 simulations for each sampling method. Here the significance level of the Granger causality test in BSLR is set to 0.5. Note a random guess would have FDR = 0.75 and FNR + 1 2 FPR = 1 . We see from Table 2 that BSLR without replicate averaging on one-shot sampling data is no different from random guessing, while BSLR with replicate averaging doing slightly better in FDR and FNR. This is because replicate averaging of one-shot data practically increases the condition effect by reducing the biological variation, and thus gets a better temporal correlation between one-shot samples of adjacent times. BSLR performs better on multi-shot data compared to one-shot data because the biological variations of the previous times also contribute to the temporal correlation. In particular, BSLR without replicate averaging on the multi-shot data has the best performance because it allows tracking the individual replicates rather than merely tracking their averages. Although the performance numbers appear far from ideal, this demonstrates remarkable improvement from BSLR with replicate averaging on one-shot data to BSLR without replicate averaging on multi-shot data, especially considering the highly nonlinear SDE model, the unobserved protein concentrations levels, the very limited number of 18 samples (3 replicates with 6 times) and the fact that BSLR does not use any knowledge of the (around 60) parameters or the form of the equations, highlighting the difference in the statistical power of one-shot and multi-shot data and its implication in downstream statistical analysis decisions (replicate averaging vs. no replicate averaging).
In summary, we demonstrated a setting of the biologically plausible Arabidopsis circadian clock network with a single condition, where the BSLR performs similarly to a random guessing algorithm under one-shot sampling, and performs significantly better under multi-shot sampling. We also show that whether replicate averaging should be done or not varies with the sampling method.
Conclusions
One-shot sampling can miss a lot of potentially useful correlation information. Often gene expression data collected from plants is prepared under one-shot sampling. One factor that can partially mitigate the shortcomings of one-shot sampling is to prepare samples under a variety of conditions or perturbations. One-shot samples grown under the same condition can then be thought of as a surrogate for the multi-shot samples of an individual plant.
To clarify issues and take a step towards quantifying them, we proposed a gene expression dynamic model for gene regulatory network reconstruction that explicitly captures the condition variation effect. We show analytically and numerically the performance of two algorithms for single-gene and multi-gene settings. We also demonstrate the difficulty of network reconstruction without condition variation effect.
There is little agreement across the biology literature about how to model the impact of condition on the gene regulatory network. In some cases, it is not even clear that we are observing the same network structure as conditions vary. Nevertheless, our results suggest that the preparation of samples under different conditions can partially compensate for the shortcomings of one-shot sampling.
Supporting information
The parameters A , γ , σ , and σ te are assumed unknown and jointly estimated in GLRT.
The random network prior distribution used to generate the multi-gene network.
Data Availability
The computer simulation code is available at https://github.com/Veggente/one-shot-sampling .
Funding Statement
This work was supported by the Plant Genome Research Program from the National Science Foundation (NSF-IOS-PGRP-1823145) to B.H. and Y.H.
- 1. Schaffter T, Marbach D, Floreano D. GeneNetWeaver: in silico benchmark generation and performance profiling of network inference methods. Bioinformatics. 2011;27(16):2263–2270. 10.1093/bioinformatics/btr373 [ DOI ] [ PubMed ] [ Google Scholar ]
- 2. Kang X. One-shot sampling simulations; 2019. Available from: https://github.com/Veggente/one-shot-sampling .
- 3. Poor HV. An Introduction to Signal Detection and Estimation. Springer-Verlag; New York; 1994. [ Google Scholar ]
- 4. Sun J, Taylor D, Bollt EM. Causal Network Inference by Optimal Causation Entropy. SIAM Journal on Applied Dynamical Systems. 2015;14(1):73–106. 10.1137/140956166 [ DOI ] [ Google Scholar ]
- 5. Emad A, Milenkovic O. CaSPIAN: A Causal Compressive Sensing Algorithm for Discovering Directed Interactions in Gene Networks. PLoS ONE. 2014;9(3):e90781 10.1371/journal.pone.0090781 [ DOI ] [ PMC free article ] [ PubMed ] [ Google Scholar ]
- 6. Bertsimas D, King A, Mazumder R. Best subset selection via a modern optimization lens. The Annals of Statistics. 2016;44(2):813–852. 10.1214/15-AOS1388 [ DOI ] [ Google Scholar ]
- 7. Bento J, Ibrahimi M, Montanari A. Learning Networks of Stochastic Differential Equations. In: Advances in Neural Information Processing Systems (NIPS); 2010. p. 172–180. Available from: http://papers.nips.cc/paper/4055-learning-networks-of-stochastic-differential-equations.pdf . [ Google Scholar ]
- 8. Marbach D, Costello JC, Küffner R, Vega NM, Prill RJ, Camacho DM, et al. Wisdom of crowds for robust gene network inference. Nature Methods. 2012;9(8):796–804. 10.1038/nmeth.2016 [ DOI ] [ PMC free article ] [ PubMed ] [ Google Scholar ]
- 9. Krouk G, Mirowski P, LeCun Y, Shasha DE, Coruzzi GM. Predictive network modeling of the high-resolution dynamic plant transcriptome in response to nitrate. Genome Biology. 2010;11(12):R123 10.1186/gb-2010-11-12-r123 [ DOI ] [ PMC free article ] [ PubMed ] [ Google Scholar ]
- 10. Yamaoka C, Suzuki Y, Makino A. Differential Expression of Genes of the Calvin–Benson Cycle and its Related Genes During Leaf Development in Rice. Plant and Cell Physiology. 2015;57(1):115–124. 10.1093/pcp/pcv183 [ DOI ] [ PubMed ] [ Google Scholar ]
- 11. Taniguchi M, Kiba T, Sakakibara H, Ueguchi C, Mizuno T, Sugiyama T. Expression of Arabidopsis response regulator homologs is induced by cytokinins and nitrate. FEBS Letters. 1998;429(3):259–262. 10.1016/s0014-5793(98)00611-5 [ DOI ] [ PubMed ] [ Google Scholar ]
- 12. Trapnell C, Williams BA, Pertea G, Mortazavi A, Kwan G, van Baren MJ, et al. Transcript assembly and quantification by RNA-Seq reveals unannotated transcripts and isoform switching during cell differentiation. Nature Biotechnology. 2010;28(5):511–515. 10.1038/nbt.1621 [ DOI ] [ PMC free article ] [ PubMed ] [ Google Scholar ]
- 13. Kiselev VY, Juvin V, Malek M, Luscombe N, Hawkins P, Novère NL, et al. Perturbations of PIP3 signalling trigger a global remodelling of mRNA landscape and reveal a transcriptional feedback loop. Nucleic Acids Research. 2015;43(20):9663–9679. 10.1093/nar/gkv1015 [ DOI ] [ PMC free article ] [ PubMed ] [ Google Scholar ]
- 14. Whitfield ML, Sherlock G, Saldanha AJ, Murray JI, Ball CA, Alexander KE, et al. Identification of genes periodically expressed in the human cell cycle and their expression in tumors. Molecular Biology of the Cell. 2002;13(6):1977–2000. 10.1091/mbc.02-02-0030 [ DOI ] [ PMC free article ] [ PubMed ] [ Google Scholar ]
- 15. Conesa A, Madrigal P, Tarazona S, Gomez-Cabrero D, Cervera A, McPherson A, et al. A survey of best practices for RNA-seq data analysis. Genome Biology. 2016;17(1):13 10.1186/s13059-016-0881-8 [ DOI ] [ PMC free article ] [ PubMed ] [ Google Scholar ]
- 16. Liang M, Briggs AG, Rute E, Greene AS, Cowley AW. Quantitative assessment of the importance of dye switching and biological replication in cDNA microarray studies. Physiological Genomics. 2003;14(3):199–207. 10.1152/physiolgenomics.00143.2002 [ DOI ] [ PubMed ] [ Google Scholar ]
- 17. Liu Y, Zhou J, White KP. RNA-seq differential expression studies: more sequence or more replication? Bioinformatics. 2014;30(3):301–304. 10.1093/bioinformatics/btt688 [ DOI ] [ PMC free article ] [ PubMed ] [ Google Scholar ]
- 18. Schurch NJ, Schofield P, Gierliński M, Cole C, Sherstnev A, Singh V, et al. How many biological replicates are needed in an RNA-seq experiment and which differential expression tool should you use? RNA. 2016;22(6):839–851. 10.1261/rna.053959.115 [ DOI ] [ PMC free article ] [ PubMed ] [ Google Scholar ]
- 19. McCarthy DJ, Chen Y, Smyth GK. Differential expression analysis of multifactor RNA-Seq experiments with respect to biological variation. Nucleic Acids Research. 2012;40(10):4288–4297. 10.1093/nar/gks042 [ DOI ] [ PMC free article ] [ PubMed ] [ Google Scholar ]
- 20. Pimentel H, Bray NL, Puente S, Melsted P, Pachter L. Differential analysis of RNA-seq incorporating quantification uncertainty. Nature Methods. 2017;14:687–690. 10.1038/nmeth.4324 [ DOI ] [ PubMed ] [ Google Scholar ]
- 21. Locke JCW, Southern MM, Kozma-Bognár L, Hibberd V, Brown PE, Turner MS, et al. Extension of a genetic network model by iterative experimentation and mathematical analysis. Molecular Systems Biology. 2005;1(1). 10.1038/msb4100018 [ DOI ] [ PMC free article ] [ PubMed ] [ Google Scholar ]
- 22. Pokhilko A, Fernández AP, Edwards KD, Southern MM, Halliday KJ, Millar AJ. The clock gene circuit in Arabidopsis includes a repressilator with additional feedback loops. Molecular Systems Biology. 2012;8(1):574 10.1038/msb.2012.6 [ DOI ] [ PMC free article ] [ PubMed ] [ Google Scholar ]
- 23. Seaton DD, Smith RW, Song YH, MacGregor DR, Stewart K, Steel G, et al. Linked circadian outputs control elongation growth and flowering in response to photoperiod and temperature. Molecular Systems Biology. 2015;11(1). 10.15252/msb.20145766 [ DOI ] [ PMC free article ] [ PubMed ] [ Google Scholar ]
Decision Letter 0
Steven m abel.
21 Aug 2019
PONE-D-19-16969
Dear Dr. Kang,
Thank you for submitting your manuscript to PLOS ONE. After careful consideration, we feel that it has merit but does not fully meet PLOS ONE’s publication criteria as it currently stands. Therefore, we invite you to submit a revised version of the manuscript that addresses the points raised during the review process.
In the reviewer comments below, I would emphasize the exploration of actual biological data and the comparison of performance to one or two other methods.
We would appreciate receiving your revised manuscript by Oct 05 2019 11:59PM. When you are ready to submit your revision, log on to https://www.editorialmanager.com/pone/ and select the 'Submissions Needing Revision' folder to locate your manuscript file.
If you would like to make changes to your financial disclosure, please include your updated statement in your cover letter.
To enhance the reproducibility of your results, we recommend that if applicable you deposit your laboratory protocols in protocols.io, where a protocol can be assigned its own identifier (DOI) such that it can be cited independently in the future. For instructions see: http://journals.plos.org/plosone/s/submission-guidelines#loc-laboratory-protocols
Please include the following items when submitting your revised manuscript:
A rebuttal letter that responds to each point raised by the academic editor and reviewer(s). This letter should be uploaded as separate file and labeled 'Response to Reviewers'.
A marked-up copy of your manuscript that highlights changes made to the original version. This file should be uploaded as separate file and labeled 'Revised Manuscript with Track Changes'.
An unmarked version of your revised paper without tracked changes. This file should be uploaded as separate file and labeled 'Manuscript'.
Please note while forming your response, if your article is accepted, you may have the opportunity to make the peer review history publicly available. The record will include editor decision letters (with reviews) and your responses to reviewer comments. If eligible, we will contact you to opt in or out.
We look forward to receiving your revised manuscript.
Kind regards,
Steven M. Abel, Ph.D.
Academic Editor
Journal Requirements:
1. When submitting your revision, we need you to address these additional requirements.
Please ensure that your manuscript meets PLOS ONE's style requirements, including those for file naming. The PLOS ONE style templates can be found at
http://www.journals.plos.org/plosone/s/file?id=wjVg/PLOSOne_formatting_sample_main_body.pdf and http://www.journals.plos.org/plosone/s/file?id=ba62/PLOSOne_formatting_sample_title_authors_affiliations.pdf
[Note: HTML markup is below. Please do not edit.]
Reviewers' comments:
Reviewer's Responses to Questions
Comments to the Author
1. Is the manuscript technically sound, and do the data support the conclusions?
The manuscript must describe a technically sound piece of scientific research with data that supports the conclusions. Experiments must have been conducted rigorously, with appropriate controls, replication, and sample sizes. The conclusions must be drawn appropriately based on the data presented.
Reviewer #1: Partly
2. Has the statistical analysis been performed appropriately and rigorously?
Reviewer #1: Yes
3. Have the authors made all data underlying the findings in their manuscript fully available?
The PLOS Data policy requires authors to make all data underlying the findings described in their manuscript fully available without restriction, with rare exception (please refer to the Data Availability Statement in the manuscript PDF file). The data should be provided as part of the manuscript or its supporting information, or deposited to a public repository. For example, in addition to summary statistics, the data points behind means, medians and variance measures should be available. If there are restrictions on publicly sharing data—e.g. participant privacy or use of data from a third party—those must be specified.
4. Is the manuscript presented in an intelligible fashion and written in standard English?
PLOS ONE does not copyedit accepted manuscripts, so the language in submitted articles must be clear, correct, and unambiguous. Any typographical or grammatical errors should be corrected at revision, so please note any specific errors here.
5. Review Comments to the Author
Please use the space provided to explain your answers to the questions above. You may also include additional comments for the author, including concerns about dual publication, research ethics, or publication ethics. (Please upload your review as an attachment if it exceeds 20,000 characters)
Reviewer #1: This papers explores the effect of condition variance and biological variance under one-shot and multi-shot sampling. It provides a general model for network representation when these parameters are known. Further, the authors describe the relationship between the sampling regimes and their representation in the model described. Finally, the authors simulation from this model and describe the performance of two-estimators under different configurations.
Things I would like to see:
- A table describing the simulations performed and summarizing the behavior
- A description of how the adjacency matrix A was chosen from simulations
- An application on "real" data
- A comparison of performance to other methods on some of the DREAM data
The section "On biological replicates" is a bit odd and seems tacked on. Particularly, the discussion of differential expression tools seems odd. The goal in those tools is not to infer gene regulation, but to infer whether the expression of some set of genes is changed given some experimental perturbation. Not sure what you mean here (line 551):
> It is then not clear whether replicates bring more benefit than the sheer additional amount of data compared to samples under different conditions.
6. PLOS authors have the option to publish the peer review history of their article ( what does this mean? ). If published, this will include your full peer review and any attached files.
If you choose “no”, your identity will remain anonymous but your review may still be made public.
Do you want your identity to be public for this peer review? For information about this choice, including consent withdrawal, please see our Privacy Policy .
Reviewer #1: No
[NOTE: If reviewer comments were submitted as an attachment file, they will be attached to this email and accessible via the submission site. Please log into your account, locate the manuscript record, and check for the action link "View Attachments". If this link does not appear, there are no attachment files to be viewed.]
While revising your submission, please upload your figure files to the Preflight Analysis and Conversion Engine (PACE) digital diagnostic tool, https://pacev2.apexcovantage.com/ . PACE helps ensure that figures meet PLOS requirements. To use PACE, you must first register as a user. Registration is free. Then, login and navigate to the UPLOAD tab, where you will find detailed instructions on how to use the tool. If you encounter any issues or have any questions when using PACE, please email us at [email protected] . Please note that Supporting Information files do not need this step.
Author response to Decision Letter 0
Collection date 2019.
See the response letter.
Submitted filename: response.pdf
Decision Letter 1
17 Oct 2019
PONE-D-19-16969R1
We are pleased to inform you that your manuscript has been judged scientifically suitable for publication and will be formally accepted for publication once it complies with all outstanding technical requirements.
Within one week, you will receive an e-mail containing information on the amendments required prior to publication. When all required modifications have been addressed, you will receive a formal acceptance letter and your manuscript will proceed to our production department and be scheduled for publication.
Shortly after the formal acceptance letter is sent, an invoice for payment will follow. To ensure an efficient production and billing process, please log into Editorial Manager at https://www.editorialmanager.com/pone/ , click the "Update My Information" link at the top of the page, and update your user information. If you have any billing related questions, please contact our Author Billing department directly at [email protected] .
If your institution or institutions have a press office, please notify them about your upcoming paper to enable them to help maximize its impact. If they will be preparing press materials for this manuscript, you must inform our press team as soon as possible and no later than 48 hours after receiving the formal acceptance. Your manuscript will remain under strict press embargo until 2 pm Eastern Time on the date of publication. For more information, please contact [email protected] .
With kind regards,
Acceptance letter
22 Oct 2019
Dear Dr. Kang:
I am pleased to inform you that your manuscript has been deemed suitable for publication in PLOS ONE. Congratulations! Your manuscript is now with our production department.
If your institution or institutions have a press office, please notify them about your upcoming paper at this point, to enable them to help maximize its impact. If they will be preparing press materials for this manuscript, please inform our press team within the next 48 hours. Your manuscript will remain under strict press embargo until 2 pm Eastern Time on the date of publication. For more information please contact [email protected] .
For any other questions or concerns, please email [email protected] .
Thank you for submitting your work to PLOS ONE.
PLOS ONE Editorial Office Staff
on behalf of
Dr. Steven M. Abel
Associated Data
This section collects any data citations, data availability statements, or supplementary materials included in this article.
Supplementary Materials
Data availability statement.
- View on publisher site
- PDF (2.2 MB)
- Collections
Similar articles
Cited by other articles, links to ncbi databases.
- Download .nbib .nbib
- Format: AMA APA MLA NLM
Add to Collections
Click through the PLOS taxonomy to find articles in your field.
For more information about PLOS Subject Areas, click here .
Loading metrics
Open Access
Peer-reviewed
Research Article
Time series experimental design under one-shot sampling: The importance of condition diversity
Roles Writing – original draft, Writing – review & editing
* E-mail: [email protected]
Affiliation Coordinated Science Laboratory and Department of Electrical and Computer Engineering, University of Illinois at Urbana–Champaign, Urbana, Illinois, United States of America
Roles Writing – review & editing
Affiliation Department of Biology, California State University, Northridge, Northridge, California, United States of America
- Xiaohan Kang,
- Bruce Hajek,
- Faqiang Wu,
- Yoshie Hanzawa
- Published: October 31, 2019
- https://doi.org/10.1371/journal.pone.0224577
- See the preprint
- Peer Review
- Reader Comments
Many biological data sets are prepared using one-shot sampling, in which each individual organism is sampled at most once. Time series therefore do not follow trajectories of individuals over time. However, samples collected at different times from individuals grown under the same conditions share the same perturbations of the biological processes, and hence behave as surrogates for multiple samples from a single individual at different times. This implies the importance of growing individuals under multiple conditions if one-shot sampling is used. This paper models the condition effect explicitly by using condition-dependent nominal mRNA production amounts for each gene, it quantifies the performance of network structure estimators both analytically and numerically, and it illustrates the difficulty in network reconstruction under one-shot sampling when the condition effect is absent. A case study of an Arabidopsis circadian clock network model is also included.
Citation: Kang X, Hajek B, Wu F, Hanzawa Y (2019) Time series experimental design under one-shot sampling: The importance of condition diversity. PLoS ONE 14(10): e0224577. https://doi.org/10.1371/journal.pone.0224577
Editor: Steven M. Abel, University of Tennessee, UNITED STATES
Received: June 14, 2019; Accepted: October 16, 2019; Published: October 31, 2019
Copyright: © 2019 Kang et al. This is an open access article distributed under the terms of the Creative Commons Attribution License , which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
Data Availability: The computer simulation code is available at https://github.com/Veggente/one-shot-sampling .
Funding: This work was supported by the Plant Genome Research Program from the National Science Foundation (NSF-IOS-PGRP-1823145) to B.H. and Y.H.
Competing interests: The authors have declared that no competing interests exist.
Introduction
Time series data is important for studying biological processes in organisms because of the dynamic nature of the biological systems. Ideally it is desirable to use time series with multi-shot sampling , where each individual (such as a plant, animal, or microorganism) is sampled multiple times to produce the trajectory of the biological process, as in Fig 1 . Then the natural biological variations in different individuals can be leveraged for statistical inference, and thus inference can be made even if the samples are prepared under the same experimental condition.
- PPT PowerPoint slide
- PNG larger image
- TIFF original image
Each plant is observed four times.
https://doi.org/10.1371/journal.pone.0224577.g001
However, in many experiments multi-shot sampling is not possible. Due to stress response of the organisms and/or the large amount of cell tissue required for accurate measurements, the dynamics of the relevant biological process in an individual of the organism cannot be observed at multiple times without interference. For example, in an RNA-seq experiment an individual plant is often only sampled once in its entire life, leaving the dynamics unobserved at other times. See the Discussion section for a review of literature on this subject. We call the resulting time series data, as illustrated in Fig 2 , a time series with one-shot sampling . Because the time series with one-shot sampling do not follow the trajectories of the same individuals, they do not capture all the correlations in the biological processes. For example, the trajectory of observations on plants 1–2–3–4 and that on 1–6–7–4 in Fig 2 are statistically identical. The resulting partial observation renders some common models for the biological system dynamics inaccurate or even irrelevant.
Each plant is observed once.
https://doi.org/10.1371/journal.pone.0224577.g002
To address this problem, instead of getting multi-shot time series of single individuals, one can grow multiple individuals under each condition with a variety of conditions, and get one-shot time series of the single conditions. The one-shot samples from the same condition then become a surrogate for multi-shot samples for a single individual, as illustrated in Fig 3 . In essence, if we view the preparation condition of each sample as being random, then there should be a positive correlation among samples grown under the same condition. We call this correlation the condition variation effect . It is similar to the effect of biological variation of a single individual sampled at different times, if such sampling were possible.
https://doi.org/10.1371/journal.pone.0224577.g003
For each condition, the one-shot samples at different times are also complemented by biological replicates , which are samples from independent individuals taken at the same time used to reduce technical and/or biological variations. See the Discussion section for a review on how replicates are used for biological inference. With a budget over the number of samples, a balance must be kept between the number of conditions, the number of sampling times and the number of replicates.
To illustrate and quantify the effect of one-shot sampling in biological inference, we introduce a simple dynamic gene expression model with a condition variation effect. We consider a hypothesis testing setting and model the dynamics of the gene expression levels at different sampling times by a dynamic Bayesian network (DBN), where the randomness comes from nominal (or basal) biological and condition variations for each gene. The nominal condition-dependent variation of gene j is the same for all plants in that condition and the remaining variation is biological and is independent across the individuals in the condition. In contrast to GeneNetWeaver [ 1 ], where the effect of a condition is modeled by a random perturbation to the network coefficients, in our model the condition effect is characterized by correlation in the nominal variation terms of the dynamics. Note in both models samples from different individuals under the same condition are statistically independent given the randomness associated with the condition.
The contributions of this paper are threefold.
- A composite hypothesis testing problem on the gene regulatory network is formulated and a gene expression dynamic model that explicitly captures the per-gene condition effect and the gene regulatory interactions is proposed.
- The performance of gene regulatory network structure estimators is analyzed for both multi-shot and one-shot sampling, with focus on two algorithms. Furthermore, single-gene and multi-gene simulation results indicate that multiple-condition experiments can somewhat mitigate the shortcomings of one-shot sampling.
- The difficulty of network reconstruction under one-shot sampling with no condition effect is illustrated. This difficulty connects network analysis and differential expression analysis, two common tasks in large-scale genomics studies, in the sense that the part of network involving non-differentially expressed genes may be harder to reconstruct.
The simulation code for generating the figures is available at [ 2 ].
Materials and methods
Stochastic model of time-series samples.
This section formulates the hypothesis testing problem of learning the structure of the gene regulatory network (GRN) from gene expression data with one-shot or multi-shot sampling. The GRN is characterized by an unknown adjacency matrix. Given the GRN, a dynamic Bayesian network model is used for the gene expression evolution with time. Two parameters σ co, j and σ bi, j are used for each gene j , with the former explicitly capturing the condition variation effect and the latter capturing the biological variation level.
Model for gene regulatory network topology.
Model for gene expression dynamics.
This section models the gene expression dynamics of individuals by a dynamic Bayesian networks with parameters σ co, j and σ bi, j as the condition variation level and biological variation level for gene j .
Model for sampling method.
In this section two sampling methods are described: one-shot sampling and multi-shot sampling. For simplicity, throughout this paper we consider a full factorial design with CRT samples obtained under C conditions, R replicates and T sampling times, denoted by Y = ( Y c , r , t ) ( c , r , t )∈[ C ]×[ R ]×[ T ] . In other words, instead of X we observe Y , a noisy and possibly partial observation of X . Here the triple index for each sample indicates the condition, replicate, and time. As we will see in the Discussion at the end of this section, for either sampling method, the biological variation level σ bi, j can be reduced by combining multiple individuals to form a single sample.
Multi-shot sampling.
One-shot sampling.
Discussion on sources of variance.
- If the samples of the individuals under many different conditions are averaged and treated as a single sample, then effectively σ co, j , σ bi, j and σ te, j are reduced.
- If material from multiple individuals grown under the same condition is combined into a composite sample before measuring, then effectively σ bi, j is reduced while σ co, j and σ te, j remain unchanged. Note for microorganisms a sample may consist of millions of individuals and the biological variation is practically eliminated ( σ bi, j ≈ 0).
- If the samples from same individuals (technical replicates) are averaged and treated as a single sample, then effectively σ te, j is reduced while σ co, j and σ bi, j remain unchanged.
Note this sampling model with hierarchical driving and observational noises can also be used for single-cell RNA sequencing (scRNAseq) in addition to bulk RNA sequencing and microarray experiments. For scRNAseq, σ co, j can model the tissue-dependent variation (the global effect) and σ bi, j the per-cell variation (the local effect).
Performance evaluation of network structure estimators
This section studies the performance of network structure estimators with multi-shot and one-shot sampling data. First, general properties of the two sampling methods are obtained. Then two algorithms, the generalized likelihood-ratio test (GLRT) and the basic sparse linear regression (BSLR), are studied. The former is a standard decision rule for composite hypothesis testing problems and is shown to have some properties but is computationally infeasible for even a small number of genes. The latter is an algorithm based on linear regression, and is feasible for a moderate number of genes. Finally simulation results for a single-gene network with GLRT and for a multi-gene network with BSLR are shown.
General properties.
By ( 3 ), ( 4 ) and ( 5 ), the samples Y are jointly Gaussian with zero mean. The covariance of the random tensor Y is derived under the two sampling methods in the following.
- If Σ bi = 0 and C , R and T are fixed, the joint distribution of the data is the same for both types of sampling. So the performance of the estimator would be the same for multi-shot and one-shot sampling.
- If Σ bi = 0 and Σ te = 0 (no observation noise) and C , T are fixed, the joint distribution of the data is the same for both types of sampling (as noted in item 1) and any replicates beyond the first are identical to the first. So the performance of the estimator can be obtained even if all replicates beyond the first are discarded.
- Under multi-shot sampling, when C , R , T are fixed with R = 1, the joint distribution of the data depends on Σ co and Σ bi only through their sum. So the performance of the estimator would be the same for all Σ co and Σ bi such that Σ co + Σ bi is the same.
Network reconstruction algorithms.
In this section we introduce GLRT and BSLR. GLRT is a standard choice in composite hypothesis testing setting. We observe some properties for GLRT under one-shot and multi-shot sampling. However, GLRT involves optimizing the likelihood over the entire parameter space, which grows exponentially with the square of the number of genes. Hence GLRT is hard to compute for multiple-gene network reconstruction. In contrast, BSLR is an intuitive algorithm based on linear regression, and will be shown in simulations to perform reasonably well for multi-gene scenarios.
Proposition 1 GLRT ( with the knowledge of Σ co , Σ bi and Σ te ) has the following properties .
- Under one-shot sampling and Σ co = 0, the log likelihood of the data as a function of A ( i . e . the log likelihood function ) is invariant with respect to replacing A by − A . So , for the single-gene n = 1 case , the log likelihood function is an even function of A , and thus the GLRT will do no better than random guessing .
For 2 it suffices to notice in ( 6 ) the covariance is invariant with respect to changing A to − A . A proof of 1 is given below.
Proof of 1) We first prove it for the case of a single gene with constant T and a constant number of individuals, CR . To do that we need to look at the likelihood function closely.
In BSLR, replicates are averaged and the average gene expression levels at different times under different conditions are fitted in a linear regression model with best-subset sparse model selection, followed by a Granger causality test to eliminate the false discoveries. BSLR is similar to other two-stage linear regression–based network reconstruction algorithms, notably oCSE [ 4 ] and CaSPIAN [ 5 ]. Both oCSE and CaSPIAN use greedy algorithms in the first build-up stage, making them more suitable for large-scale problems. In contrast, BSLR uses best subset selection, which is conceptually simpler but computationally expensive for large n . For the tear-down stage both BSLR and CaSPIAN use the Granger causality test, while oCSE uses a permutation test.
Build-up stage.
A naive algorithm to solve the above optimization has a computational complexity of O ( n k +1 ) for fixed k as n → ∞. Faster near-optimal alternatives exist [ 6 ].
Tear-down stage.
Simulations on single-gene network reconstruction.
https://doi.org/10.1371/journal.pone.0224577.g004
The numerical simulations reflect the following observations implied by the analytical model.
- Under one-shot sampling, when γ = 0, the GLRT is equivalent to random guessing.
- The GLRT performs the same under one-shot and multi-shot sampling when γ = 1.
- Under no observation noise, the performance for multi-shot sampling is the same for all γ < 1.
Some empirical observations are in order.
- Multi-shot sampling outperforms one-shot sampling (unless γ = 1, where they have the same error probability).
- For one-shot sampling, the performance improves as γ increases. Regarding the number of replicates R per condition, if γ = 0.2 (small condition effect), a medium number of replicates (2 to 5) outperforms the single replicate strategy. For larger γ , one replicate per condition is the best.
- For multi-shot sampling, performance worsens as γ increases. One replicate per condition ( R = 1) is best.
- Comparing Fig 4a–4d vs. Fig 4e–4h , we observe that the performance degrades with the addition of observation noise, though for moderate noise ( σ te = 1.0) the effect of observation noise on the sign error is not large. Also, the effect of the algorithm not knowing γ is not large.
Simulations on multi-gene network reconstruction.
This subsection studies the case when multiple genes interact through the GRN. The goal is to compare one-shot vs. multi-shot sampling for BSLR under a variety of scenarios, including different homogeneous γ values, varying number of replicates, varying observation noise level, and heterogeneous γ values.
The performance evaluation for multi-gene network reconstruction is trickier than the single-gene case because of the many degrees of freedom introduced by the number of genes. First, the network adjacency matrix A is now an n -by- n matrix. While some notion of “size” of A (like the spectral radius or the matrix norm) may be important, potentially every entry of A may affect the reconstruction result. So instead of fixing a ground truth A as in Fig 4 , we fix a prior distribution of A with split Gaussian prior described in S2 Appendix (note we assume the knowledge of no autoregulation), and choose A i.i.d. from the prior distribution with d max = 3. Second, because the prior of A can be subject to sparsity constraints and thus far from a uniform distribution, multiple loss functions that are more meaningful than the ternary error rate can be considered for performance. So we consider ternary FDR, ternary FNR and ternary FPR for the multi-gene case. In the simulations we have 20 genes and d max = 3 with in-degree uniformly distributed over {0, 1, …, d max }, so the average in-degree is 1.5. The number of sampling times is T = 6 and CR = 30.
Varying γ , R and σ te .
In this set of simulations we fix the observation noise level and vary the number of replicates R and the condition correlation coefficient γ . The performance of BSLR under one-shot and multi-shot sampling is shown in Fig 5 ( σ te = 0) and Fig 6 ( σ te = 1). Note BSLR does not apply to a single condition with 30 replicates due to the constraint that the degrees of freedom C ( T − 1) − k − 2 in the second stage must be at least 1.
https://doi.org/10.1371/journal.pone.0224577.g005
https://doi.org/10.1371/journal.pone.0224577.g006
For multi-shot sampling, in the noiseless case, we see all three losses are invariant with respect to different γ for fixed R , which is consistent with property 4 in the section on general properties because BSLR is an average-based scale-invariant algorithm (note CR is a constant so for different R the performance is different due to the change in C ). In the noisy case, the FDR and FNR slightly decrease as γ increases, which is an opposite trend compared with Fig 4f and 4h .
In summary, the main conclusions from Figs 5 and 6 are the following.
- The performance of BSLR under multi-shot sampling is consistently better than that under one-shot sampling.
- The performance of BSLR under one-shot sampling varies with γ , from random guessing performance at γ = 0 to the same performance as multi-shot sampling at γ = 1.
- By comparing Figs 5 with 6 , we see the observation noise of σ te = 1 has only a small effect on the performance with the two sampling methods.
Reduced number of directly differentially expressed genes.
https://doi.org/10.1371/journal.pone.0224577.g007
We summarize the simulations performed in Table 1 . Note the last row is a summary of Table 2 in the Discussion section.
OS and MS stand for the losses of one-shot sampling and multi-shot sampling. RG stands for random guessing. * indicates mathematically proved results.
https://doi.org/10.1371/journal.pone.0224577.t001
The errors are estimated using Hoeffding’s inequality over 1000 simulations with significance level 0.05.
https://doi.org/10.1371/journal.pone.0224577.t002
Information limitation for reconstruction under one shot sampling without condition effect
In the previous section it is shown that both GLRT and BSLR are close to random guessing under one-shot sampling when σ co, j = 0 for all j . This leads us to the following question: is the network reconstruction with no condition effect ( σ co, j = 0 for all j ) information theoretically possible? In this section we examine this question under general estimator-independent settings. Note in this case the trajectories of all individuals are independent given A regardless of ( c k ) k ∈[ K ] .
As we have seen in Proposition 1 part 2, when Σ co = 0, the distribution of the observed data Y is invariant under adjacency matrix A or − A , implying any estimator will have a sign error probability no better than random guessing for the average or worst case over A and − A . Here, instead of sign error probability, we consider the estimation for A itself.
The extreme case with infinite number of samples available for network reconstruction is considered to give a lower bound on the accuracy for the finite data case. Note that with infinite number of samples a sufficient statistic for the estimation of the parameter A is the marginal distributions of X 1 ( t ); no information on the correlation of ( X 1 ( t )) t ∈[ T ] across time t can be obtained. A similar observation is made in [ 7 ] for sampling stochastic differential equations.
In summary, the recovery of the matrix A is generally not possible in the stationary case, and also not possible in the transient case at least when A is orthogonal. To reconstruct A , further constraints (like sparsity) may be required.
One-shot sampling in the literature
This section reviews the sampling procedures reported in several papers measuring gene expression levels in biological organisms with samples collected at different times to form time series data. In all cases, the sampling is one-shot, in the sense that a single plant or cell is only sampled at one time.
Microorganisms.
In the transcriptional network inference challenge from DREAM5 [ 8 ], three compendia of biological data sets were provided based on microorganisms ( E. coli , S. aureus , and S. cerevisiae ), and some of the data corresponded to different sampling times in a time series. Being based on microorganisms, the expression level measurements involved multiple individuals per sample, a form of one-shot sampling.
In [ 9 ], the plants are exposed to nitrate, which serves as a synchronizing event, and samples are taken from three to twenty minutes after the synchronizing event. The statement “… each replicate is independent of all microarrays preceding and following in time” suggests the experiments are based on one-shot sampling. In contrast, the state-space model with correlation between transcription factors in an earlier time and the regulated genes in a later time fits multi-shot sampling. [ 10 ] studied the gene expression difference between leaves at different developmental stages in rice. The 12th, 11th and 10th leaf blades were collected every 3 days for 15 days starting from the day of the emergence of the 12th leaves. While a single plant could provide multiple samples, namely three different leaves at a given sampling time, no plant was sampled at two different times. Thus, from the standpoint of producing time series data, the sampling in this paper was one-shot sampling. [ 11 ] devised the phenol-sodium dodecyl sulfate (SDS) method for isolating total RNA from Arabidopsis . It reports the relative level of mRNA of several genes for five time points ranging up to six hours after exposure to a synchronizing event, namely being sprayed by a hormone trans -zeatin. The samples were taken from the leaves of plants. It is not clear from the paper whether the samples were collected from different leaves of the same plant, or from leaves of different plants.
[ 12 ] likely used one-shot sampling for their −24, 60, 120, 168 hour time series data from mouse skeletal muscle C2C12 cells without specifying whether the samples are all taken from different individuals. [ 13 ] produced time series data by extracting cells from a human, seeding the cells on plates, and producing samples in triplicate, at a series of six times, for each of five conditions. Multiple cells are used for each sample with different sets of cells being used for different samples, so this is an example of one-shot sampling of in vitro experiment in the sense that each plate of cells is one individual. The use of (five) multiple conditions can serve as a surrogate for a single individual set of cells to gain the effect of multi-shot sampling. Similarly, the data sets produced by [ 14 ] involving the plating of HeLa S3 cells can be classified as one-shot samples because different samples are made from different sets of individual cells. Interestingly, the samples are prepared under one set of conditions, so the use of different conditions is not adopted as a surrogate for multi-shot sampling. However, a particular line of cells was selected (HeLa S3) for which cells can be highly synchronized. Also, the paper does not attempt to determine causal interactions.
The three in silico benchmark suites described in the GeneNetWeaver paper on performance profiling of network inference methods [ 1 ] consisted of steady state, and therefore one-shot, samples from dynamical models. However, the GeneNetWeaver software can be used to generate multi-shot time series data, and some of that was included in the network inference challenges, DREAM3, DREAM4, and DREAM5 [ 1 , 8 ].
On biological replicates
In many biological experiments, independent biological replicates are used to reduce the variation in the measurements and to consequently increase the power of the statistical tests. It turns out that both how to use biological replicates, and the power of biological replicates, depend on whether the sampling is one-shot or multi-shot. To focus on this issue we first summarize how replicates have traditionally been used for the more common problem of gene differential expression analysis, before turning to the use of replicates for recovery of gene regulatory networks.
The following summarizes the use of replicates for gene differential expression analysis. A recent survey [ 15 ] suggests a minimum of three replicates for RNA-seq experiments whenever sample availability allows. Briggs et al. [ 16 ] studies the effect of biological replication together with dye switching in microarray experiments and recommends biological replication when precision in the measurements is desired. Liu et al. [ 17 ] studies the tradeoff between biological replication and sequencing depth under a sequencing budget limit in RNA-seq differential expression (DE) analysis. It proposes a metric for cost effectiveness that suggests a sequencing depth of 10 million reads per library of human breast cells and 2–6 biological replicates for optimal RNA-seq DE design. Schurch et al. [ 18 ] studies the number of necessary biological replicates in RNA-seq differential expression experiments on S. cerevisiae quantitatively with various statistical tools and concludes with the usage of a minimum of six biological replicates.
For regulatory network reconstruction there is even less consensus on how replicates should be used. One straightforward way is to reduce the replicates into a single set of data by averaging either directly or after a random resampling of the original replicated data. In this case the mean of the replicates are used as a better estimate of the population than each single replicate, while higher moments of the empirical distribution of the replicates are practically ignored. Another way adopted in [ 9 ] is to account for all four potential transitions between two replicates in two adjacent sampling times in their machine learning algorithm due to the one-shot nature of the replicates. In the next section, we illustrate why replicates should be used differently for one-shot and multi-shot sampling, in the context of recovering a circadian clock network model.
A case study on Arabidopsis circadian clock network
As we have discussed earlier, the current expression datasets are prominently one-shot, making a direct comparison between one-shot and multi-shot sampling in real biological data difficult. The lack of a well-accepted ground truth of the gene regulatory network also makes performance evaluation hard, if not impossible. To test the applicability of the sampling models on real biological data, we generate expression data from a most-accepted Arabidopsis circadian clock model using stochastic differential equation (SDE) model similar to GeneNetWeaver with condition-dependent Brownian motions, and evaluate the performance of BSLR.
https://doi.org/10.1371/journal.pone.0224577.g008
In summary, we demonstrated a setting of the biologically plausible Arabidopsis circadian clock network with a single condition, where the BSLR performs similarly to a random guessing algorithm under one-shot sampling, and performs significantly better under multi-shot sampling. We also show that whether replicate averaging should be done or not varies with the sampling method.
Conclusions
One-shot sampling can miss a lot of potentially useful correlation information. Often gene expression data collected from plants is prepared under one-shot sampling. One factor that can partially mitigate the shortcomings of one-shot sampling is to prepare samples under a variety of conditions or perturbations. One-shot samples grown under the same condition can then be thought of as a surrogate for the multi-shot samples of an individual plant.
To clarify issues and take a step towards quantifying them, we proposed a gene expression dynamic model for gene regulatory network reconstruction that explicitly captures the condition variation effect. We show analytically and numerically the performance of two algorithms for single-gene and multi-gene settings. We also demonstrate the difficulty of network reconstruction without condition variation effect.
There is little agreement across the biology literature about how to model the impact of condition on the gene regulatory network. In some cases, it is not even clear that we are observing the same network structure as conditions vary. Nevertheless, our results suggest that the preparation of samples under different conditions can partially compensate for the shortcomings of one-shot sampling.
Supporting information
S1 appendix. joint estimation for single-gene autoregulation recovery..
The parameters A , γ , σ , and σ te are assumed unknown and jointly estimated in GLRT.
https://doi.org/10.1371/journal.pone.0224577.s001
S2 Appendix. Split Gaussian network prior.
The random network prior distribution used to generate the multi-gene network.
https://doi.org/10.1371/journal.pone.0224577.s002
- View Article
- PubMed/NCBI
- Google Scholar
- 2. Kang X. One-shot sampling simulations; 2019. Available from: https://github.com/Veggente/one-shot-sampling .
- 3. Poor HV. An Introduction to Signal Detection and Estimation. Springer-Verlag New York; 1994.
IMAGES
VIDEO
COMMENTS
Pre-experiments are the simplest form of research design. In a pre-experiment either a single group or multiple groups are observed subsequent to some agent or treatment presumed to cause change. Types of Pre-Experimental Design. One-shot case study design; One-group pretest-posttest design; Static-group comparison; One-shot case study design
We will use “one-shot” test, and “experiment” for a set of tests at specified values of the stimulus. “One-shot” and “single event” indicate that a unit can only be tested once, regardless of the outcome, because after the stimulus the unit is no longer factory-fresh.
Pre-experiments are the simplest form of research design. In a pre-experiment either a single group or multiple groups are observed subsequent to some agent or treatment presumed to cause change. Types of Pre-Experimental Design. One-shot case study design; One-group pretest-posttest design; Static-group comparison; One-shot case study design
The one-group posttest-only design (a.k.a. one-shot case study) is a type of quasi-experiment in which the outcome of interest is measured only once after exposing a non-random group of participants to a certain intervention.
1.1 One-Shot Experimental Case Study The one-shot experiment case study is probably the most primitive type of experiment that might conceivably be termed “research”. An Experimental Treatment (Tx) is introduced, and then a measurement (Obs)- a posttest of some sort - is administered to determine the effects of the treatment.
To illustrate and quantify the effect of one-shot sampling in biological inference, we introduce a simple dynamic gene expression model with a condition variation effect.
Pre-experimental designs are so named because they follow basic experimental steps but fail to include a control group. In other words, a single group is often studied but no comparison between an equivalent non-treatment group is made. Examples include the following: The One-Shot Case Study.
What is “One-shot case study?” Characteristics of one-shot case study No control group, only experimental group No pre-test Compare the result with some intuitive standard An example: One wants to determine whether reading to children an extra ½ ho ur a day would increase their reading skill. A group of children are chosen. The teacher ...
We apply our approach to one-shot learning problems based on a real-world database of immigration records, and show that it outperforms a more standard Bayesian network approach.
To illustrate and quantify the effect of one-shot sampling in biological inference, we introduce a simple dynamic gene expression model with a condition variation effect.