1 Introduction

Social network analysis seeks to characterize the interrelationships between individuals (or other social entities) and can serve both to identify the network structure linking individuals and to examine how particular behaviors or health outcomes may be impacted by social ties. Social network analyses, long a staple in the fields of anthropology and sociology, are becoming more popular among health services researchers, given growing appreciation of the role that social context plays in shaping health-related behaviors and outcomes (O’Malley and Marsden 2008). Key mechanisms through which social relationships are hypothesized to impact health include providing direct support and resources (including tangible resources, emotional support, and information sharing) and shaping behavioral norms (Berkman and Glass 2000). In the health services context, one area of research interest is examining how social network connections between healthcare providers impact provider behavior and health outcomes. Specifically, it is hypothesized that providers who work together and share patients may influence each other through the transmission of information or the development of mutually recognized norms of practice (Brunson and Laubenbacher 2018; Barnett et al. 2011; Breslau et al. 2021; Stein et al. 2017; Pollack et al. 2012a; Manchanda et al. 2008). For instance, if one provider adopts non-opioid treatment alternatives for pain disorders, closely linked colleagues may similarly alter their clinical practice. Other recent applications of social network analysis in the health context include: assessing the impact of provider team structure on health outcomes and cost among Medicare patients (Kuo et al. 2020), examining the relationship between social network characteristics and living donor kidney transplantation (Gillespie et al. 2020), examining the impact of peer and family social network structure on adolescent drinking (McCann et al. 2019), and characterizing the social transmission of positive and negative sentiment towards COVID-19 responses (Hung et al. 2020).

Unlike in many contexts where the analytic sample comprises individuals who are assumed to be an independent sample of the population, individuals constituting a social network do not reflect an independent sample. Indeed, the statistical independence of observed data points is a foundational assumption of most statistical techniques, including regression, comparisons of means, difference-in-differences approaches, instrumental variables, and many more (Naroll 1961a; Grafen and Hamilton 1989; Felsenstein 1985; Rohlf 2006). However, network data have long been known to violate the independence assumption because each data point (i.e., each “node,” often an individual) has a relational connection to other nodes within the social network structure. If a given observed characteristic (i.e., trait) of one node is potentially impacted by the corresponding characteristic of connected nodes in the network—e.g., through the process of social influence—then nodes cannot be assumed to be statistically independent (Leenders 2002). Rather, one individual’s trait value will directly impact the trait values of connected individuals via the social influence process. For example, while physician opioid prescribing behaviors and statin prescribing behaviors may be assumed to be independent, they both may be affected by information diffusion and social influence processes within physician social networks (O’Malley and Marsden 2008; Leenders 2002).

We note that a different problem of statistical nonindependence occurs when the traits of nodes induce connections to form between them, creating a higher likelihood of social ties between similar individuals (referred to as homophily). Although homophily and influence are not fully distinguishable in empirical data, they are conceptually distinct (Shalizi and Thomas 2011). Furthermore, nonindependence of nodes induced by processes acting across network connections (including social influence and homophily) is not the same as confounding that arises from unmeasured traits of nodes. The relationships that induce nonindependence are not features of the nodes themselves, but rather reflect relational connections linking nodes. In the case of social influence, the outcome of interest influences itself based on the relationship structure amongst the nodes. Unmeasured confounding, in contrast, occurs when an observed association of variable Y with variable X is in fact due to both Y and X being caused by a third unmeasured variable. Unmeasured confounding is well known to cause biased estimates for slope coefficients in linear regression, leading to statistical inconsistency: as more data are added to a regression, the model becomes more certain about a spurious association between measured variables when this association is, in fact, driven by the unmeasured variable(s) (Rohlf 2006; Lee 2021; Dow 1984; Naroll 1961).

In the presence of network nonindependence, increasing the sample size does not yield slope estimates that converge around a biased value (unlike under confounding), leading some to the erroneous conclusion that nonindependence is therefore relatively unimportant. Rather, adding more complete data in the context of nonindependence results in a different form of statistical inconsistency, in which the model becomes more likely to find a significant result with a randomly positive or negative slope (Rohlf 2006). Somewhat paradoxically, reducing the sample size relative to the total population through sampling may in fact improve statistical performance for networked (nonindependent) data, because a randomly sampled subset may comprise nodes that are (generally) not directly linked together, and hence are statistically more independent (Fig. 1).

Fig. 1

Illustration of how random sampling can select nodes that are relatively disconnected in a network and hence more statistically independent (relative to the complete network). Note: Dark gray nodes indicate those selected via random sampling

To our knowledge, the performance of random subsampling of network data has not been explored extensively in the social network analysis literature or in health policy research. It has, however, been developed in the context of more hierarchical cross-cultural networks. Murdock and White’s seminal work creating the Standard Cross-Cultural Sample expressly used this approach to address nonindependence, generating a dataset of relatively independent cultures that cross-cultural researchers could use for statistical analyses (Murdock and White 1969). The random sampling approach is likely ineffective specifically for highly hierarchical (tree-like) networks (Mace et al. 1994) but, hypothesizing that it may be effective for some social network contexts, we included it among the candidate methods in our study.

Although the statistical problem of nonindependence in network data has been known for years, it has never been fully resolved. While there is an extensive methodological literature on the exponential random graph model (ERGM), which deals with the formation and dissolution of network connections in the context of homophily, there has been comparatively less methodological development to resolve pure cases of social influence. In the applied literature, studies employ a variety of methods—including the network autoregressive model, phylogenetic autoregression, robust standard errors, dyadic regression, and principal component analysis—that purport to solve the nonindependence problem (detailed below). Although the network autoregressive model is a primary approach used to address social influence, it is well known to exhibit systematic biases in the estimation of key parameters and has never been shown via simulation to produce acceptable statistical properties for Type I error (i.e., false findings of statistical significance) (Mizruchi and Neuman 2008; Neuman and Mizruchi 2010; Dittrich et al. 2017). To date, it is unclear which of these methods perform best, as their relative performance has not been directly evaluated in simulation studies.

To address this gap in the literature, we evaluate the relative performance of numerous analytic methods that have been used in the applied literature to correct for nonindependence arising from network social influence. We conducted a simulation study using simulated network data reflecting social influence processes occurring independently across multiple traits. While traits (and their diffusion) were simulated to be independent, the simulated social influence processes acting on traits mean that nodes are not independent. Our simulation study compares various analytic methods for estimating the relationship between traits; methods that adequately correct for node nonindependence (either via a “robust” estimator or by transforming the network structure of the data) should infer no significant correlation between traits (estimating significant associations no more frequently than the nominal significance level). To isolate the issue of nonindependence induced by network influence, our simulation assumed that the correct network structure was known with complete accuracy and that there was no confounding or homophily. If available statistical methods cannot correct for nonindependence under these conditions, they are very unlikely to perform well under real-world conditions in which additional complexities such as measurement inaccuracy, confounding, and homophily are common.

1.1 Statistical corrections for network nonindependence

We will focus on the context of estimating the association between outcome trait Y and predictor trait X. In simplest form, this could be assessed with a regression model of the form \(y=X\beta +e\), in which the estimated coefficient \(\widehat{\beta }\) quantifies the association of interest. Interrelated network data violate the assumption of independence of observations (and in turn of the \(e\) terms) underlying many regression methods such as ordinary least squares. Many statistical methods to account for nonindependence of network data have been proposed. Some methods attempt to correct the variance of the regression coefficients (i.e., slopes or betas) for nonindependence, but they do so in different ways. Other methods correct the standard error of the slopes so as to obtain correct p-values for the slope (which may or may not be estimated correctly in terms of magnitude). Herein we briefly characterize each method that we tested, all of which have received full treatments in the articles we cite. Table 1 presents a compact overview and comparison of the methods, followed by brief descriptions of each.

Table 1 Comparison of Methods to Correct for Nonindependence in Network Data

Network autoregression models were developed specifically to deal with nonindependence in network data. They keep the relationship data in their natural dyadic form (i.e., connections from node A to B, A to C, etc.) and specify an autoregressive structure for the error term \(e\). This approach divides the residual variation into a portion that is correlated across network ties and an uncorrelated residual, as would be typical of any regression model (O’Malley 2008; Leenders 2002; Grafen 1989; Pagel 1999). We implemented this method using the lnam function in the “sna” R package (Butts 2020). Specifically, lnam fits the following linear network autocorrelation model: \(y=X\beta +e\), with \(e=\rho We+\upsilon\). In this model, \(y\) denotes a vector of the outcome trait, \(X\) is a covariate trait matrix, \(W\) is the adjacency matrix (with autocorrelation factor \(\rho\)), and the uncorrelated residual \(\upsilon \sim N(0,{\sigma }^{2})\). \(W\) parameterizes the autocorrelation of each disturbance in \(y\) on its neighbors.
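To make this concrete, below is a minimal sketch in R of fitting this model, assuming simulated data (all object names are illustrative); lnam's W2 argument carries the disturbance autocorrelation matrix, matching the error structure above.

```r
# Minimal sketch (hypothetical data): network disturbances model via sna::lnam.
library(sna)

set.seed(1)
n <- 100
W <- rgraph(n, tprob = 0.05)       # random binary adjacency matrix (stand-in)
W <- W / pmax(rowSums(W), 1)       # row-normalize ties, guarding isolates
x <- rnorm(n)                      # stand-ins for observed node traits
y <- rnorm(n)

# W2 encodes autocorrelation of the disturbances: e = rho * W %*% e + upsilon
fit <- lnam(y, x = cbind(x), W2 = W)
summary(fit)
```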

Phylogenetic autoregression applies the same general statistical approach as does network autoregression, but it first simplifies the network structure into a best-fitting bifurcating tree structure (Mathew and Perreault 2015a). The advantage of this approach is that it greatly reduces the mathematical complexity and circularity of the original network. Prior simulation work has shown this method to be generally valid when the network is very close to a bifurcating tree structure (Matthews 2019), but the method presents a clear disadvantage if the network is not treelike, in which case the phylogeny fails to represent aspects of the full network structure. We implemented this approach using the hclust function in R to perform hierarchical cluster analysis, identifying the closest-fitting tree for the network structure. We then used the corPagel function in the “ape” R package (Paradis and Schliep 2019) to estimate Pagel's lambda correlation structure (\(\lambda\)) from a phylogenetic representation of the estimated tree. Again, the underlying model is \(y=X\beta +e\), with \(e=\lambda We+\upsilon\), but now \(\lambda\) plays the role of the scaling factor \(\rho\) and \(W\) reflects the closest-fitting tree. Finally, we regressed \(y\) on \(X\) using the gls function in the R package “nlme” (Pinheiro et al. 2022), specifying the correlation structure as our estimated \(\lambda\).
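A minimal sketch of this pipeline in R, assuming a precomputed matrix of pairwise network distances (here called geodist, with node names as dimnames; all object names are illustrative):

```r
# Minimal sketch (hypothetical data): tree approximation plus Pagel's lambda GLS.
library(ape)   # as.phylo, corPagel
library(nlme)  # gls

tree <- as.phylo(hclust(as.dist(geodist)))  # closest-fitting hierarchical tree
dat  <- data.frame(y = y, x = x, row.names = tree$tip.label)

# corPagel estimates lambda, which scales the tree-implied correlation structure
fit <- gls(y ~ x, data = dat,
           correlation = corPagel(1, phy = tree, form = ~1))
summary(fit)
```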

Conley standard errors are a method derived from spatial statistics that has been applied to deal with social influence on social networks (Schulz et al. 2019). Specifically, this method uses a Generalized Method of Moments approach and computes a variance–covariance matrix with spatial weights. While typically applied to geographic spatial data, the method can be applied to social network data using spatial weights that represent the network distances between nodes as physical distances (i.e., treating the social network space as if it existed in physical geographic space). To implement this method, one must transform the social network data into a two-dimensional array (e.g., representing latitude and longitude of nodes); we extracted 2 components from each network via nonmetric multidimensional scaling, using the isoMDS function in the “MASS” R package (Venables and Ripley 2002). We then used the “conleyreg” R package (Düben et al. 2022) to estimate a regression model of the form \(y=X\beta +e\), specifying the 2 estimated components as the “latitude” and “longitude” used to calculate the spatial weights.
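A minimal sketch in R, again assuming the hypothetical network distance matrix geodist; because conleyreg expects coordinates on a geographic scale, the embedding is rescaled here into plausible latitude/longitude ranges:

```r
# Minimal sketch (hypothetical data): Conley SEs on an MDS embedding of the network.
library(MASS)       # isoMDS
library(conleyreg)  # conleyreg

mds <- isoMDS(as.dist(geodist), k = 2)   # nonmetric MDS to 2 dimensions

# Rescale coordinates into valid latitude/longitude ranges, since the
# embedding is treated as if it occupied geographic space
rescale <- function(v, lo, hi) lo + (hi - lo) * (v - min(v)) / (max(v) - min(v))
dat <- data.frame(y = y, x = x,
                  lat = rescale(mds$points[, 1], -60, 60),
                  lon = rescale(mds$points[, 2], -60, 60))

# dist_cutoff (in km) bounds which neighbors enter the covariance correction;
# the value used here is purely illustrative
conleyreg(y ~ x, data = dat, dist_cutoff = 500, lat = "lat", lon = "lon")
```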

The random effects for network communities model also adjusts the residual variation with a simplification, but it does so by first breaking up the network into a set of network communities. Membership in a network community is then assigned as a nominal variable to each individual, and this variable is entered into the model as a random effect; in other words, this approach fits a hierarchical linear model with network community as the grouping variable (Landon et al. 2018). In the literature this model is also known as a “clustered error” model or “mixed hierarchical” model. We inferred network communities with the cluster_walktrap function in the “igraph” R package (Csardi and Nepusz 2005) and then used the lmer function in the “lme4” R package (Bates et al. 2015) to estimate a regression model of the form \(y=X\beta +u+e\), where \(u\) denotes the network community random effect. Once again, while this simplifies the mathematical work needed to fit the regression, it may fail to account for important structural features of any given social network.
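A minimal sketch in R, assuming an igraph network g and node traits y and x (illustrative names):

```r
# Minimal sketch (hypothetical data): community random-intercept model.
library(igraph)  # cluster_walktrap, membership
library(lme4)    # lmer

comm <- membership(cluster_walktrap(g))  # community label for each node
dat  <- data.frame(y = y, x = x, community = factor(comm))

fit <- lmer(y ~ x + (1 | community), data = dat)
summary(fit)
```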

The principal components from networks approach first passes the network structure through a dimension reduction procedure to extract several continuously varying variables. These are intended to recapitulate individuals' positions in the network through a set of variables, which are then entered into the regression model as fixed effects (Mathew and Perreault 2015b). We extracted 5 components from each network via nonmetric multidimensional scaling, using the isoMDS function in the “MASS” R package (Venables and Ripley 2002), and then estimated a linear regression model of the following form: \(y=X\beta +{\beta }_{1}{C}_{1}+{\beta }_{2}{C}_{2}+{\beta }_{3}{C}_{3}+{\beta }_{4}{C}_{4}+{\beta }_{5}{C}_{5}+e\), in which \({C}_{1}\), …, \({C}_{5}\) denote the 5 components.
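A minimal sketch in R, reusing the hypothetical network distance matrix geodist:

```r
# Minimal sketch (hypothetical data): network coordinates as fixed-effect controls.
library(MASS)  # isoMDS

C <- isoMDS(as.dist(geodist), k = 5)$points  # 5 coordinates per node
fit <- lm(y ~ x + C)                         # matrix C expands into 5 regressors
summary(fit)
```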

Robust standard errors are an approach to estimating unbiased standard errors of OLS coefficients under certain violations of the standard OLS assumption of independent and identically distributed (i.i.d.) error terms. In particular, robust standard errors can address heteroscedasticity (i.e., heterogeneous variance of the error terms). Notably, robust standard errors do not require specifying the functional form of the underlying covariance matrix, unlike a weighted least squares approach. Although robust standard errors (and weighted least squares) were only ever designed to correct for violations of the identically distributed component of the i.i.d. assumption (not the independence component), we included them as a candidate method because robust standard errors have seen wide application in the literature far beyond their original intent (King and Roberts 2015). We implemented robust standard error estimation in R with the “sandwich” (Zeileis et al. 2020) and “lmtest” (Zeileis and Hothorn 2002) packages, using the coeftest command (specifying the “HC1” option) when estimating a regression model of the form \(y=X\beta +e\).
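A minimal sketch in R of this standard idiom, assuming node traits y and x:

```r
# Minimal sketch: OLS with HC1 heteroscedasticity-robust standard errors.
library(sandwich)  # vcovHC
library(lmtest)    # coeftest

fit <- lm(y ~ x)
coeftest(fit, vcov = vcovHC(fit, type = "HC1"))
```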

Dyadic regression with network covariates is a technique we developed as part of this research, based on prior network studies (De Nooy 2011; O'Malley and Christakis 2011; Matthews et al. 2013). In this approach we first convert the traits into pairwise distances. We then regress these distance values, one for each pair of data points \({y}_{dist}\) and \({X}_{dist}\), against one another, while including a binary indicator term for the presence of a network tie between a pair (denoted \(Z\)). We also include two random effect terms corresponding to the two nodes comprising a given dyad. We fitted this model, of the form \({y}_{dist}=\left({X}_{dist}\right)\beta +{\beta }_{1}Z+{u}_{1}+{u}_{2}+e\), with the lmer function in the lme4 package, where \({u}_{1}\) and \({u}_{2}\) denote the node random effects.
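A minimal sketch in R of this dyadic construction, assuming trait vectors y and x and a binary adjacency matrix adj (illustrative names; absolute differences are used as one natural choice of pairwise distance):

```r
# Minimal sketch (hypothetical data): dyadic regression with node random effects.
library(lme4)

dyads <- t(combn(length(y), 2))  # all node pairs (i, j)
dat <- data.frame(
  ydist = abs(y[dyads[, 1]] - y[dyads[, 2]]),  # pairwise outcome distances
  xdist = abs(x[dyads[, 1]] - x[dyads[, 2]]),  # pairwise predictor distances
  tie   = adj[dyads],                          # 1 if a network tie links the pair
  node1 = factor(dyads[, 1]),
  node2 = factor(dyads[, 2])
)
fit <- lmer(ydist ~ xdist + tie + (1 | node1) + (1 | node2), data = dat)
summary(fit)
```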

Random subsampling of network nodes relies on the property that as fewer nodes are sampled from a complete network, it becomes less likely that the sampled nodes are connected. If all the sampled nodes are sufficiently distant from each other, i.e., separated by at least one intermediary node, then they can be treated as independent data points in statistical models. Even in large samples of the U.S. population like the General Social Survey, node independence can be assumed because the sample is far less than even 1% of the total population. Physician networks, however, have often been analyzed for nearly complete sets of physicians in entire markets (Brunson et al. 2018). Random subsampling may be viable in very large markets, but in smaller markets it may be intractable due to the obvious loss of statistical power this method entails. We implemented random subsampling by first selecting either a 10%, 30%, or 50% random sample of nodes and then estimating a standard ordinary least squares regression model of the form \(y=X\beta +e\).
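A minimal sketch in R of the 10% variant, assuming trait vectors y and x:

```r
# Minimal sketch (hypothetical data): OLS on a 10% random subsample of nodes.
set.seed(1)
keep <- sample(seq_along(y), size = ceiling(0.10 * length(y)))

fit <- lm(y[keep] ~ x[keep])
summary(fit)
```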

It is worth noting at this point that a prior study tested many of these same statistical corrections via simulation for highly hierarchical network structures derived from language trees (Matthews et al. 2019). That study found that random effects for network communities, principal components from networks, and phylogenetic autoregression all performed at acceptable levels for false positive rates when traits were simulated to diffuse via social influence across network ties that were tree-like. However, when the traits were inherited longitudinally down the branches of the tree, without horizontal diffusion, only the phylogenetic autoregression method performed acceptably. In both forms of trait simulation, the network autoregression method had false positive rates at least 3 times greater than they should have been. Thus, we acknowledge that network structural characteristics can matter a great deal to the performance of statistical models for network data. In the current paper we examine the types of social networks most commonly encountered in health policy research; for hierarchical (tree-like) networks, readers should consult Matthews et al. (2019). Further research is needed to identify the effects of other types of network structures on model performance.

2 Methods

Our simulation analyses considered 2 data structure conditions: (1) fully simulated data with either 50 or 100 nodes and (2) partially simulated data, in which the network structure was empirically defined based on a physician network with 914 nodes and node traits were simulated to diffuse via social influence. For both conditions, we simulated 500 datasets comprising 1,000 traits that each diffused independently on the network structures. That is, traits were neither correlated nor functionally related in our data. The social influence diffusion process for traits across networks was generated as follows. For each trait, each node was initially assigned a trait value based on a random draw from a Gaussian distribution. For each node and each trait, the mean trait value among the node's direct network connections was calculated, and the node's initial trait value was updated by adding half of the distance between the initial value and the average value among its network connections. Traits were generated to be independent in order to assess the performance of methods with regard to Type I errors—in this setting, optimally performing statistical methods should identify statistically significant correlations among the traits at or below the specified significance threshold, which we set at 0.05.
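A minimal sketch in R of the diffusion step just described, assuming a binary adjacency matrix adj (illustrative name):

```r
# Minimal sketch: one trait diffused via social influence on the network.
diffuse_trait <- function(adj) {
  n  <- nrow(adj)
  z0 <- rnorm(n)                                 # initial Gaussian trait values
  neighbor_mean <- as.vector(adj %*% z0) / pmax(rowSums(adj), 1)
  z0 + 0.5 * (neighbor_mean - z0)                # move halfway toward neighbors
}
```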

For the fully synthetic dataset, a Barabasi game was used to generate the simulated network structure using linear preferential attachment (the default for the barabasi.game function in the “igraph” R package). We used the Barabasi game as a starting point for our analysis because it is a well-known model in network science that has been used in many prior simulation studies. The Barabasi game reproduces a pattern common to real networks in which nodes with many connections tend to acquire even more (a right-skewed degree distribution). We recognize, however, that it fails to approximate some features of real-world networks. For this reason, we also assessed performance on an empirical network comprising the largest network community among physicians who treat serious mental illness in the state of Colorado. We extracted this community with the walktrap algorithm, using the “igraph” R package (Csardi and Nepusz 2005), on a network in which ties between physicians were inferred when they shared more than two patients in common. We interpret results that are common to both the simulated Barabasi networks and the empirical network as indicating robustness, especially given the many structural differences between the two.
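A minimal sketch in R of the two network constructions (the tie-inference step for the physician network is omitted because it requires the underlying patient-sharing data; g_phys is an assumed igraph object):

```r
# Minimal sketch: simulated preferential-attachment network, plus extraction of
# the largest walktrap community (as done for the empirical physician network).
library(igraph)

g_sim <- barabasi.game(100, directed = FALSE)  # linear preferential attachment

comm    <- cluster_walktrap(g_phys)            # g_phys: assumed physician network
largest <- induced_subgraph(
  g_phys, which(membership(comm) == which.max(sizes(comm)))
)
```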

Having generated simulated diffused traits on networks (both simulated and empirical), we then assessed the relative performance of different analytic approaches. Beginning with naïve linear regression, we regressed one diffused trait (the outcome, \(y\)) on another randomly diffused trait (\(X\)) using a model of the form \(y=X\beta +e\), without any correction for nonindependence. The estimated slope coefficient for the predictor trait is the estimate of interest and should be 0, as traits were generated to be independent. We also implemented the 10 different statistical correction approaches described previously: robust standard errors; Conley standard errors; principal components; random effects; network autoregression; phylogenetic autoregression; dyadic regression; 10% random subsample; 30% random subsample; and 50% random subsample. For each method, we assessed 3 metrics of performance across the 500 simulation replications: bias of the estimated slope (average difference between the estimated slope \(\widehat{\beta }\) and the true value of 0), estimated standard error, and Type I error (frequency of rejecting the null hypothesis when it is true, i.e., when there truly is no effect). For techniques that demonstrated acceptable false positive rates, we then conducted simulations to estimate statistical power. For these simulations, traits were generated to be correlated, with correlation values between the traits ranging from 0.2 to 0.6 at intervals of 0.1 (generated using the corgen function in the R package “ecodist” (Goslee and Urban 2007)). All analyses were conducted using R version 4.0.4.
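As an illustration of this evaluation loop, the sketch below estimates the Type I error rate of naive OLS across replications, reusing the diffuse_trait function sketched above (adj again denotes a hypothetical adjacency matrix); the final lines show an assumed corgen call for generating correlated trait pairs in the power simulations.

```r
# Minimal sketch (hypothetical data): Type I error of naive OLS over 500 replications.
type1 <- mean(replicate(500, {
  y <- diffuse_trait(adj)  # two independently diffused traits
  x <- diffuse_trait(adj)
  summary(lm(y ~ x))$coefficients["x", "Pr(>|t|)"] < 0.05
}))
type1  # a well-calibrated method should sit near the 0.05 nominal rate

# Power simulations: trait pairs with a specified correlation, e.g., r = 0.3
library(ecodist)
xy <- corgen(len = 914, r = 0.3)  # returns a list with components x and y
```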

3 Results

The statistical methods we tested all involved regressing an outcome (i.e., a given trait) against a hypothesized predictor (i.e., a different trait). In our simulations, in order to assess false positive rates, data were generated such that node traits were fully independent (had no relationship to each other). However, in all simulation conditions, traits were simulated to diffuse across the social network such that the trait values of a node's connections affected its own trait values, thereby reflecting a social influence process. This influence process results in nonindependence of traits across nodes. Social influence is distinct from unmeasured confounding in that nonindependence does not bias slope estimates while unmeasured confounding does. Indeed, histograms of slope estimates (Fig. 2) demonstrate that slope estimates were generally centered around zero, indicating minimal bias, as expected.

Fig. 2

Histogram of slopes from linear regression of simulated traits on 50 and 100 node simulated networks, and the physician network

Even though the slopes were centered on zero, the results from the linear regression models indicated that our simulated trait diffusion process reliably created datapoint nonindependence and elevated the false positive rate (for randomly positive or negative slopes) (Table 2). Surprisingly, the two techniques that have received the most attention in the network science literature (random effects for network communities and network autoregression) did not adequately correct the elevated false positive rate. Methods adapted from the evolutionary sciences (principal components and phylogenetic autoregression) did not correct the nonindependence problem either (Table 2). Methods derived from econometrics and spatial statistics, including robust standard errors and Conley standard errors, also failed to produce nominal Type I error rates. Treating the simulated traits as distances in a dyadic regression, an adaptation of prior studies on relational data (De Nooy 2011; O'Malley and Christakis 2011), also failed to lower the false positive rate to the expected 0.05 (Table 2). Among the methods we tested, only randomly sampling to 10% of the original data successfully produced correct false positive rates.

Table 2 Summary of statistical performance across candidate statistical methods

It is notable, however, that among the methods that sought to use all the data by transforming residuals (robust standard errors, Conley standard errors, random effects for network communities, network autoregression, phylogenetic autoregression), the network autoregression method performed best. This method was developed by network scientists expressly for the purpose of correcting network nonindependence. Although it does not fully succeed in this goal, it shows the most promise, likely because, unlike the other residual-transforming methods, it does not simplify the network structure and thereby lose important statistical information.

We estimated statistical power for random sampling of 10% of the network, as this was the only method that exhibited acceptable false positive rates in the simulated networks and close to acceptable false positive rates in the empirical physician network. The findings indicate that power was low when the networks had only 50 or 100 nodes. In the physician network of 914 nodes, statistical power to detect a significant result when traits were correlated was 85% or higher when the simulated correlations were 0.3 or greater (Table 3). In real-world applications, power would likely be reduced by factors including measurement error and the inclusion of spurious control variables that are unrelated to the outcome. Variables are spurious when they are not actually related to the outcome variable but are included in a multiple regression as putative control variables; this is a form of model misspecification that biases slope estimates toward zero and thereby reduces statistical power. While noting these limitations, our findings clearly demonstrate that, at least in principle, random sampling can achieve high statistical power while simultaneously yielding an appropriate Type I error rate (Table 3).

Table 3 Power and slope estimates for random 10% sampling of network

4 Discussion

Our simulations indicate that numerous analytic methods commonly used in applied network analyses do not adequately control for the statistical nonindependence that is introduced by processes in which nodes in a network (e.g., healthcare providers) influence one another. This calls into question research that has used standard regression-based methods to examine the relationships among variables determined in part by networks (Landon 2018; Barnett 2012; Pollack 2012). Such variables can include those measured from the network itself, such as number of connections or centrality, and also traits that are known to spread across network ties, such as physician prescribing practices.

What matters for statistical nonindependence to alter the false positive rate is not whether network measurements were part of a research design, but rather whether social influence in the network truly affects the traits of interest. If network influence is present, the data points are relationally connected within a network. In this context, conducting regression analyses of traits without any correction for network dynamics corresponds to the simple linear regression model in our simulations. Our results indicated that this approach yielded false positive rates that were three to four times larger than expected. Additionally, we observed that the common application of robust standard errors does not appropriately correct for nonindependence. This is consistent with recent literature arguing that robust standard errors are being used to correct for statistical violations well outside the context of their original development for heteroscedastic residuals, and that they are unlikely to actually fix the wide range of statistical problems to which empirical researchers are applying them (King and Roberts 2015).

A limitation of our current study is that we did not study how the statistical models perform under conditions of homophily or mixed homophily and influence. Although homophily can create patterns of trait correlation across network connections identical to those created by influence, it is unclear whether the statistical models we tested would perform similarly under pure homophily conditions, because homophily processes do not actually alter the trait values of individuals based on network connections, whereas social influence does exactly this. Thus, the performance of the statistical models under homophily or mixed homophily/social influence conditions is an important matter for further research.

The many studies of practice patterns by physicians within a single hospital system provide concrete examples for the statistical nonindependence problem (Landon 2018; Barnett 2012; Pollack 2012). Physicians within a hospital system are likely linked together by a social network, and through this network likely influence each other about practice patterns and clinical norms. Studies that use regression in such a setting to compare practice patterns to other variables like cost and patient outcomes, but without accounting explicitly for network interdependence, may be subject to elevated false positive rates if included variables are affected by common network influence processes they share. Although traits like cost of patient care may at first glance seem to be wholly patient-driven rather than physician-driven features, the articles cited (Landon 2018; Barnett 2012; Pollack 2012) contended that physician medical judgement, in fact, plays an important role in cost. For example, with respect to judgements about the appropriate level of imaging needed for conditions ranging from kidney stones to back pain, prior work has shown that different physicians can make different judgements about patients with the same clinical presentation and physicians likely influence each other (via the physician network) about how to make these judgements. This type of process is how Pollack et al. (2012) explained higher cost of patient care for physician networks in which specialists are more central relative to costs of care in networks in which internal medicine physicians are more central (despite similar patient traits and outcomes across networks). The central physicians in these different networks may disproportionately influence their surrounding network physicians as to appropriate levels of diagnostics and procedures being applied to the same clinical presentations of patients, thereby influencing overall cost of care.

The results further suggest that research trends encouraging the use of complex “big data” may lead to erroneous findings if the data have an underlying network structure that is not accounted for by current analytic methods. For example, consider the increasingly common scenario in which a research group obtains the complete electronic health records from a major hospital system, then uses these data to evaluate the correlation of physician traits, such as the correlation between exposure to continuing medical education (CME) and a specific prescribing behavior targeted by CME. Under the likely scenario that physicians' prescribing is influenced by discussions among themselves regarding prescribing, the study is at risk of conflating any effect of CME on prescribing with the effect on prescribing due to network influence, thereby potentially overestimating the impact of CME. Our results indicate that the deleterious effects of nonindependence would be reduced if such studies were to sample only a portion of the physicians in the system rather than use the complete physician population. Perhaps counterintuitively, our results suggest that in the presence of network structures, representative sampling will yield better statistical performance than complete data, even though the latter often is perceived to be inherently better.

Unfortunately, existing procedures that incorporate complete network data into a regression model did not fully correct the problem, or failed to correct it at all. In our simulations, the only method that generated accurate false positive rates was to randomly sample 10% of the nodes in the network. In our applied example of physician network data, this method maintained false positive rates near the nominal level and high power. Random sampling solves the statistical problem of nonindependence by restricting to nodes that are (generally) sufficiently distant from each other in the network, such that the assumption of statistical independence is plausible. Random sampling could be used to correct for network nonindependence even in cases in which the network structure has not been explicitly characterized. So long as the sampling is truly random with respect to the network structure, our simulations suggest that sampling 10% or fewer nodes will separate them enough in the network to eliminate or largely mitigate the statistical effects of network nonindependence. We note that in a specific empirical example, it likely would be judicious to perform the random sample over a number of iterations to ensure robustness to the sampling variance of any specific random 10%. We also note that, in practice, random sampling will be viable from a statistical power standpoint only when sample sizes are large (N > 1000) and/or when the true effects being estimated are large (Pearson r > 0.6).

It is intuitive to further suggest that systematically sampling nodes that are disconnected in the network might allow more than 10% of the data to be retained, thereby increasing power while still eliminating nonindependence. We leave this as a question for further research, but we note that non-random sampling is a fraught effort because it will likely not sample the attribute variables representatively. This non-representative sampling of attributes may be exacerbated by diffusion of those same attributes on the network, or by the attributes tending to induce network connections (homophily). Hence, although nonrandom sampling may conserve statistical power better than random sampling, it likely induces other statistical problems by reducing the representativeness of the sample. Overall, further statistical work is particularly needed to develop methods that correct for nonindependence in network data while allowing use of the full data, thus preserving statistical power. Our simulations indicate that the network autoregression model shows the most promise for such further development.