International migration beyond gravity: A statistical model for use in population projections

Cohen, Joel E.; Roig, Marta; Reuman, Daniel C.; GoGwilt, Cai

doi:10.1073/pnas.0808185105

Research Article

Social Sciences

International migration beyond gravity: A statistical model for use in population projections

Joel E. Cohen [email protected], Marta Roig, Daniel C. Reuman, and Cai GoGwiltAuthors Info & Affiliations

October 7, 2008

105 (40) 15269-15274

https://doi.org/10.1073/pnas.0808185105

PDF/EPUB

Abstract

International migration will play an increasing role in the demographic future of most nations if fertility continues to decline globally. We developed an algorithm to project future numbers of international migrants from any country or region to any other. The proposed generalized linear model (GLM) used geographic and demographic independent variables only (the population and area of origins and destinations of migrants, the distance between origin and destination, the calendar year, and indicator variables to quantify nonrandom characteristics of individual countries). The dependent variable, yearly numbers of migrants, was quantified by 43653 reports from 11 countries of migration from 228 origins and to 195 destinations during 1960–2004. The final GLM based on all data was selected by the Bayesian information criterion. The number of migrants per year from origin to destination was proportional to (population of origin)^0.86(area of origin)^−0.21(population of destination)^0.36(distance)^−0.97, multiplied by functions of year and country-specific indicator variables. The number of emigrants from an origin depended on both its population and its population density. For a variable initial year and a fixed terminal year 2004, the parameter estimates appeared stable. Multiple R², the fraction of variation in log numbers of migrants accounted for by the starting model, improved gradually with recentness of the data: R² = 0.57 for data from 1960 to 2004, R² = 0.59 for 1985–2004, R² = 0.61 for 1995–2004, and R² = 0.64 for 2000–2004. The migration estimates generated by the model may be embedded in deterministic or stochastic population projections.

International migration will play an increasing role in the demographic future of nations if fertility continues to decline in most countries. In projecting international migration, the United Nations Population Division (ref. 1, paragraphs 57–59) identified the need for a demographically plausible, programmable algorithm that automatically projects a zero world balance of net migration and prevents projected net emigration from completely depleting the population of any sending country. To meet this need, we propose an algorithm (based only on demographic and constant geographic variables) for projecting future numbers of international migrants from any country or region to any other. It is comparable in transparency and generality to standard cohort-component methods of projecting births and deaths. The approach presented here is different from methods of projecting migrant flows currently practiced in international demographic institutions, the United States, European countries, and other developed countries (2–6).

Most theories of international migration draw on social, economic and/or political factors to explain migration (4, 7–9), such as differences among countries in gross domestic product, labor markets, migration policies, social networks of prior migrants, and cognitive and behavioral attributes of individuals (3, 10–11). For multidecadal demographic projections, it seems more difficult to project such nondemographic variables than it is to project demographic variables such as fertility and mortality. The proposed model assumes the availability only of constant geographic variables and of population sizes that can be projected incrementally in time by accepted demographic procedures. The model makes possible both deterministic and stochastic projections of migration and hence of population.

The intellectual antecedents of the proposed model include Zipf's (12, 13) model of intercity migration, which is one of several “gravity” models in the social sciences (6). Zipf (12) aimed to “show with supporting data that the number of persons that move between any two communities in the United States whose respective populations are P₁ and P₂ and which are separated by the shortest transportation distance, D, will be proportionate to the ratio, P₁·P₂/D, subject to the effect of modifying factors.” Unlike Zipf, we distinguish the number of people who move from community 1 to community 2 from the number of people who move from 2 to 1. Taking logarithms of P₁·P₂/D and adding an error term yields a linear model in log-transformed variables, log(migrants) = a₀ + a₁ log(ppnorig) + a₂ log(ppndest) + a₃ log(distance) + error, where ppnorig is the population of the origin, ppndest is the population of the destination, and the error term characterizes random deviations. Here and throughout, log refers to log₁₀, and ln refers to the natural logarithm log_e. Zipf posits that a₁ = 1, a₂ = 1, and a₃ = −1. We estimate all coefficients from data using a generalized linear model (GLM) (14).

Zipf (12) treated cities as points of negligible spatial extent. Our communities are countries or regions and our subject is international migration. To let the data reveal whether the area of a country influences its numbers of migrants, we add two terms to the above equation: log(areaorig), the log area of the origin, and log(areadest), the log area of the destination. By definition, the population density of the origin is ppnorig/areaorig, so log(density) = log(ppnorig) − log(areaorig). If the number of migrants from origin to destination depends on ppnorig and ppndest and not on their areas, then the estimated coefficients of log(areaorig) and log(areadest) should be close to zero. However, if the number of migrants from origin to destination depends on the population density of the origin and the population density of the destination, but not on their respective population numbers per se, then the estimated coefficients of log(areaorig) and log(areadest) should be nearly the negative of the respective estimated coefficients of log(ppnorig) and log(ppndest). The estimated coefficients of the terms for log population and log area of origin and destination reveal the relative importance of population per se and population density.

To allow for differences in migratory intensities among origins or destinations, we let the data reveal whether different origins and destinations had numbers of migrants different from the numbers of migrants expected on average from their respective populations, areas, and distances. We introduced four indicator variables for each country that provided data. For example, Australia was a source of data on numbers of emigrants and immigrants by year. One indicator variable for Australia, orig.indicator$Australia, equaled 1 whenever Australia was the origin and equaled 0 if any other country or area was the origin. A second indicator variable, dest.indicator$Australia, similarly equaled 1 whenever Australia was the destination and equaled 0 if any other country or area was the destination. A third indicator variable, orig.is.datasource$Australia, equaled 1 if Australia was the origin and was the source of the data and otherwise equaled 0. A fourth indicator variable, dest.is.datasource$Australia, equaled 1 if Australia was the destination and was the source of the data and otherwise equaled 0. Similar indicator variables were defined for each country that provided data on its flows of immigrants and/or emigrants.

To allow for the possibility that migration rates changed over time and for the simultaneous effects of the other independent variables (populations, areas, distance, and indicator variables), we introduced year as an independent variable. We used the centered variable year minus 1985 to assure the stability of the model's estimated intercept.

To summarize in notation similar to that of Zipf's gravity model, let P be population, A be area, and D be distance. Detailed definitions of these variables are given in Methods. Our “starting” model was the logarithmic transformation of model Eq. 1. This logarithmic transformation guaranteed that the number of migrants estimated by the model was positive.

where ε ∼ N(0,σ²) are independent.

This model contains factors of the form [10 if origin is Australia; 1 otherwise]^g. The common logarithm of this expression is g × [1 if origin is Australia; 0 otherwise], and the expression in square brackets is an indicator (or dummy) variable. The coefficient g summarizes the nonrandom effects on the expected number of migrants from Australia (in this case) apart from those due to calendar year, the populations and areas of origin and destination, and distance. The indicator variables quantified how numbers of migrants were affected by nonrandom characteristics of individual countries: migratory history, policy, and statistical completeness; economic, geographic, and cultural affinities or disparities; and effects of geographical adjacency not captured by the chosen measure of distance.

We estimated the intercept (log constant) and exponents a, b, c, d, …, which were linear coefficients in the GLM. We fitted this and other models to data from 11 countries (Australia, Belgium, Canada, Denmark, Germany, Italy, the Netherlands, Spain, Sweden, the U.K., and the U.S.A.). These countries were selected on the basis of the quality of their data on international migrants by year and places of origin and destination from 1960 to 2004. The data included 228 origins and 195 destinations of migrants. Oceania was the only destination not also an origin, so 229 countries or regions in total were named in these data.

Because of the limited quantity and quality of migration data, this article demonstrates a method and illustrates a modeling approach, rather than specifying numerical parameters definitively. Parameters and models will evolve as more and better data become available. The analysis showed that statistically simple and demographically interpretable modeling accounted for more than half the variation in the migration data. How much more than half could be accounted for by this approach with better data remains to be determined.

Results

Descriptive Bivariate Relationships.

On average, but with enormous variability, the log number of migrants increased with increasing log population of origin (r = 0.43), increasing log area of origin (r = 0.18), increasing log population of destination (r = 0.27), and increasing log area of destination (r = 0.10) (Fig. 1 A–D). The log number of migrants decreased, on average but with enormous variability, with increasing log distance from origin to destination (r = −0.24) and increased weakly with year (r = 0.01) (Fig. 1 E and F). Log population and log area were highly correlated (Fig. 1 G and H) for origins (r = 0.74) and destinations (r = 0.64). Because these were the two highest-magnitude correlations, collinearity was not a problem in fitting the GLM. Correlations between year and log migrants, and between log(ppnorig) and log(distance), were insubstantial.

Fig. 1.

Bivariate relationships in international migration data, 1960–2004. (A) Log number of migrants versus log population of origin. (B) Log number of migrants versus log area of origin. (C) Log number of migrants versus log population of destination. (D) Log number of migrants versus log area of destination. (E) Log number of migrants versus log distance from origin to destination. (F) Log number of migrants versus year. (G) Log population of origin versus log area of origin. (H) Log population of destination versus log area of destination. (I) Log population of origin versus log population of destination. Each plot has 43,653 points.

Log(ppnorig) and log(ppndest) were negatively correlated (r = −0.19) (Fig. 1I). This negative correlation could reflect the absence of data sources among countries with four million or fewer people, which may account for the absence of data points in the lower left quadrant of Fig. 1I. The long horizontal and vertical streaks in Fig. 1I represent the populations of countries that were data sources as origins and destinations, respectively, whereas the short diagonal streaks largely reflect population growth of a given (origin, destination) pair.

Model Selection.

The starting linear model was fitted [supporting information (SI) Table S1] with the dependent variable log(migrants) and with the six independent variables that we call “basic” [year minus 1985, log(ppnorig), log(areaorig), log(ppndest), log(areadest), log(distance)] and all indicator variables (orig.indicator, dest.indicator, orig.is.datasource, dest.is.datasource). The Multiple R² was 0.5693 and the Adjusted R² was 0.5689 (see Methods).

When the stepwise algorithm with Bayesian information criterion (15) was applied to this starting model, log(areadest) was eliminated and all other independent variables were retained in the resulting “final” model (Table 1). To the four significant digits shown, the Multiple R² and the Adjusted R² were unchanged between the starting and the final models.

Table 1.

The “final” model of log migrants as a function of year, log population of origin, log area of origin, log population of destination, log distance, and indicator variables; specification of the model in R and resulting coefficients and statistics

Call: lm (formula = logmigrants ∼ I (year−1985) + logppnorig + logareaorig + logppndest + logdistance + orig.indicator + dest.indicator + orig.is.datasource + dest.is.datasource)
Coefficients	Estimate	SE	t value
(Intercept)	−2.5135	0.0886	−28.3730
I (year–1985)	0.0016	0.0003	5.1670
Logppnorig	0.8631	0.0083	103.6400
Logareaorig	−0.2103	0.0066	−31.9050
Logppndest	0.3604	0.0089	40.7010
Logdistance	−0.9685	0.0102	−94.5470
orig.indicatorAustralia	1.1295	0.0436	25.8900
orig.indicatorBelgium	−0.2557	0.0404	−6.3300
orig.indicatorDenmark	−0.0441	0.0409	−1.0760
orig.indicatorGermany	0.0699	0.0409	1.7080
orig.indicatorItaly	0.1844	0.0401	4.5960
orig.indicatorNetherlands	0.0250	0.0408	0.6120
orig.indicatorSweden	0.1602	0.0473	3.3840
orig.indicatorUnited Kingdom	0.2486	0.0397	6.2580
dest.indicatorAustralia	1.4362	0.0558	25.7360
dest.indicatorBelgium	0.1314	0.0524	2.5090
dest.indicatorCanada	0.8557	0.0457	18.7160
dest.indicatorDenmark	0.2560	0.0523	4.8980
dest.indicatorGermany	0.5875	0.0502	11.7020
dest.indicatorItaly	0.7551	0.0493	15.3090
dest.indicatorNetherlands	0.4805	0.0509	9.4480
dest.indicatorSpain	0.6400	0.0470	13.6210
dest.indicatorSweden	0.2528	0.0696	3.6340
dest.indicatorUnited Kingdom	0.6284	0.0491	12.7910
dest.indicatorUnited States of America	1.1444	0.0457	25.0180
orig.is.datasourceAustralia	−0.3038	0.0633	−4.8010
orig.is.datasourceBelgium	0.4595	0.0626	7.3410
orig.is.datasourceDenmark	0.2340	0.0642	3.6450
orig.is.datasourceGermany	0.4872	0.0613	7.9500
orig.is.datasourceItaly	−0.4761	0.0630	−7.5620
orig.is.datasourceNetherlands	0.2112	0.0645	3.2750
orig.is.datasourceSweden	−0.0733	0.0658	−1.1140
orig.is.datasourceUnited Kingdom	1.3529	0.0682	19.8440
dest.is.datasourceAustralia	−0.0317	0.0719	−0.4420
dest.is.datasourceBelgium	0.5479	0.0716	7.6510
dest.is.datasourceCanada	0.1462	0.0638	2.2930
dest.is.datasourceDenmark	0.2687	0.0721	3.7270
dest.is.datasourceGermany	0.5646	0.0674	8.3730
dest.is.datasourceItaly	−0.2356	0.0687	−3.4280
dest.is.datasourceNetherlands	0.4550	0.0718	6.3360
dest.is.datasourceSpain	−0.2294	0.0678	−3.3850
dest.is.datasourceSweden	0.1258	0.0831	1.5140
dest.is.datasourceUnited Kingdom	1.4979	0.0762	19.6680
dest.is.datasource United States of America	NA	NA	NA
Residual standard error: 0.6957 on 43610 degrees of freedom
Multiple R²: 0.5693, Adjusted R²: 0.5689
F statistic: 1,372 on 42 and 43,610 DF, nominal P value: < 2.2e-16

Because the assumption of independence among observations was implausible, conventional measures of statistical significance were probably inapplicable. The destination.is.data.source parameter for the U.S. is labeled “NA” because one of the destination.is.data.source variables must be eliminated to prevent a singularity in estimating the coefficients. See SI Text for details.

When the values of log(migrants) were independently and randomly permuted in each of 100 simulations, the maximum of the 100 simulated multiple R² values was 0.00147, far smaller than the multiple R² value of 0.5693 for the data. Hence, the latter multiple R² value could not have been an artifact of the fitting procedure alone.

When all indicator variables were suppressed and only the six “basic” independent variables were retained, the multiple R² value dropped to 0.4345 (Table S3). The only notable change in the coefficients of the six “basic” independent variables was the increase in the coefficient of log(areadest) from 0.0239 to 0.1604. A model that omitted the indicator variables would have misleadingly suggested that log(areadest) was fairly influential. The stepwise algorithm, however, retained the indicator variables and dropped log(areadest).

Residuals.

In the final model, the interquartile range of the residuals was −0.4352 to 0.4414 log(migrants), meaning that half the time, the observed numbers of migrants fell in the interval from 36.7% to 2.763 times the predicted number of migrants (the predicted number was 10^{expected log(migrants)}). The smallest and largest residuals were −3.2449 and 3.2918, corresponding to cases where the observed number of migrants was >1,000 times smaller or larger than the predicted number.

The largest residuals occurred at intermediate fitted values from 0.5 to 3.5, corresponding to predicted numbers of migrants from ≈3 to 3,000 (Fig. S1A). The scatter of the residuals was clearly not constant over the range of fitted values. This lack of homoscedasticity justified the use of the Bayesian information criterion for model selection instead of a probabilistic interpretation of F tests for omitted variables. The validity of the latter approach assumes homoscedasticity of residuals and independence of observations.

When the fitted values were >4 on the log₁₀ scale (corresponding to 10,000 migrants per year or more), residuals were systematically negative, indicating fewer reported migrants than predicted. This pattern could result, among other possible causes, from underreporting of large migrant flows or from systematic policies intended to diminish the largest predicted flows.

Model Coefficients.

In the final model (Table 1), the predicted number of migrants increased by 0.38% = 10^0.00163 per year, in addition to the changes in numbers of migrants resulting from changes over time in log(ppnorig) and log(ppndest). The predicted number of migrants was proportional to the population of origin raised to 0.86 and to the population of destination raised to 0.36. Increases in log(ppnorig) increased log(migrants) more than twice as much as increases in the log(ppndest). In light of the small standard errors of these estimates (Table 1), these exponents very probably differed from the exponents of 1 in Zipf's (12) gravity model, even if the distributional assumptions of the linear model were not precisely met. The predicted number of migrants was proportional to the distance raised to −0.97, a bit more than three standard errors from Zipf's posited exponent of −1. The predicted number of migrants was proportional to the area of origin to the power −0.21. Because P^0.86A^−0.21 = P^0.65(P/A)^0.21, the number of migrants increased with both the population of origin (to the power of 0.65) and the population density of origin (to the power 0.21), and the population of origin contributed more to the number of migrants than did the population density of origin.

The indicator variables revealed substantial heterogeneity among countries in their propensity to send or receive migrants and in their reporting practices, given the other independent variables. According to its coefficient for orig.indicator, Australia had 13.47 = 10^1.1295 times as many emigrants as expected. At the opposite extreme, Belgium had 56% as many emigrants as expected from its other characteristics. Denmark, Germany, and the Netherlands had approximately as many emigrants as expected. According to its coefficient for dest.indicator, Australia had 27.31 times as many immigrants as expected from its geographic and demographic characteristics, followed by the U.S.A. and Canada with 13.95 and 7.17 times as many immigrants as expected, respectively. At the opposite extreme, Belgium had 1.35 times as many immigrants as expected, the smallest multiple among the countries that provided data in this study. According to dest.indicator, all reporting countries had more immigrants than expected on average. The countries that provided data for this study are countries with the resources to support effective statistical systems, and such countries are likely to be attractive destinations of migration.

According to its coefficient for orig.is.datasource, when Australia reported the number of emigrants from Australia, it reported on average 50% = 10^−0.3038 of the number of emigrants from Australia that the average country that reported emigration data in this study would have reported, given Australia's population and area, the destination and year, and the propensity to emigrate from Australia estimated without regard to the source of the data. For example, Australia reported 3,971 migrants from Australia to the U.K. in 1998, whereas the U.K. reported 41,800 migrants from Australia to the U.K. in 1998 (see Methods). Australia reported emigrants leaving permanently. The U.K. reported immigrants staying for 1 year or longer.

At the opposite extreme from low reporters of emigration like Australia (50%) and Italy (33%), the U.K. reported 22.54 times as many emigrants as would be expected otherwise, whereas Germany reported 3.07 times as many. The U.K. data are estimates derived mainly from a survey of arriving and departing international passengers; U.K.'s recording of emigration and immigration flows is equally complete. Australia, by contrast, focuses on entries. These differences between countries are detected by the statistical analysis and reflected in the estimated coefficients of the model.

According to its coefficient for dest.is.datasource, when the U.K. reported the number of immigrants to the U.K., its reports were on average 31.47 times the number of immigrants that would have been reported by the average reporting system over all countries in the study, given the U.K.'s population and area, the destination and year, and the propensity to immigrate to the U.K. estimated without regard to the source of the data. At the opposite extreme, Spain and Italy reported 59% and 58%, respectively, of the numbers of immigrants expected from other factors.

In general, immigration is better recorded at destinations than emigration is recorded at origins, in part because migrants often have more formal incentives to register at their destination than to deregister at their origin. The estimated coefficients of the indicator variables (Table 1) are consistent with this belief, although other interpretations are possible. For example, dest.indicator for Australia had coefficient 1.4362, greater than the coefficient 1.1295 of orig.indicator for Australia. The same inequality, dest.indicator coefficient > orig.indicator coefficient, held for all eight countries that supplied both immigration and emigration data. Similarly, the comparable inequality, dest.is.datasource coefficient > orig.is.datasource coefficient, held for all eight countries that supplied both immigration and emigration data. These inequalities indicated greater detection (or greater intensity, an alternative interpretation) of immigration than emigration in every country for which the comparison could be made.

Parameter Stability: How Much of the Past Is Relevant to the Future?

Estimated coefficients varied systematically as a function of the time interval from which data were drawn and as a function of the subset of variables selected from the starting model Eq. 1 (Table S3). The starting model rather than the final model was used for this analysis to allow for the possibility that the log area of destination might become dramatically important for some subset of 1960–2004. As it turned out, this possibility did not occur.

For each set of independent variables, the most recent five years (2000–2004) gave the highest multiple R². Using all variables, all data from 1960 to 2004 gave a multiple R² value of 0.57, whereas the 2000–2004 data gave a multiple R² value of 0.64 (Table S3). This higher value of the multiple R² may be partially due to improved quality of data in more recent years but may also be due to fewer external perturbations to migratory flows during 2000–2004 than during 1960–2004. For example, during the 45-year period of 1960–2004, the Berlin wall and the Soviet Union fell, and Germany was reunified, whereas no such events marked the 5-year period of 2000–2004, an interval only one-ninth as long. For each set of independent variables, as the initial year moved forward while the terminal year was 2004, the multiple R² value increased. Each set of independent variables accounted for more variation in log(migrants) when using the data from 1985 to 2004 than by adding additional data from earlier years.

All estimated coefficients were stabler when using a moving initial year than when using a moving terminal year. They were least stable when using moving tranches of 5 or 10 years' duration (Table S3). For the starting model with all variables, the standard deviation 0.3877 of the estimated intercept for the five time intervals with moving terminal year 1960–1984, 1960–1989, 1960–1994, 1960–1999, and 1960–2004 was larger than the standard deviation 0.2489 of the estimated intercept for the five time intervals with moving initial year 1965–2004, 1970–2004, 1975–2004, 1980–2004, and 1985–2004. The same inequality held for the standard deviations of the estimates of the coefficients of all six basic variables when the five time intervals with moving terminal year were compared with the five time intervals with moving initial year.

Discussion

We assembled annual data on immigrants and emigrants from 11 countries' sources and combined them with data on the populations and areas of 228 origins and 195 destinations and the distances between origins and destinations. These 43,653 reports did not suffice to cover the 228 × 195 = 44,460 possible origin-destination pairs in a single year and offered very sparse coverage over 45 years. A simple GLM was able to account for more than half the variation in log(migrants). Despite the present limitations of data, this approach may improve on existing demographic procedures for projecting international migration and may motivate the collection of better data.

The bivariate relations among variables demonstrated the need for a multivariate model. For example, the number of migrants increased with the population of origin and the population of destination, but the population of origin and the population of destination were negatively correlated. Only a multivariate model could reveal how the number of migrants depended on the populations of origin and of destination jointly.

In the final GLM, the coefficients of log(ppnorig) and log(ppndest) were both positive and <1. If either coefficient had been negative, then the estimated number of migrants could have diverged to infinity as the population of origin or destination became smaller. Because the coefficients were <1, the numbers of migrants did not increase in proportion to ppnorig or ppndest.

The GLM served two distinct purposes: understanding and prediction. For scientific understanding, we eliminated superfluous independent variables to obtain the most economical model, then interpreted the values of the coefficients in the model. For prediction, we sought as much predictive power as possible by adding variables that gave the highest coefficient of determination, provided that the parameter estimates did not become unstable (sensitive to the inclusion or exclusion of a small number of data points and/or other predictor variables). The starting and final models considered here balanced interpretability and predictive ability. Other models are discussed in SI Text (Table S2).

The principal problems with this method were the lack of data and the lack of comparability (discrepancies between countries in definitions and measurements) where the data existed (Methods). Most countries lack a system to record migration flows. Many do not publish their information on migration. The data did not consistently distinguish moves from movers (6); individuals who crossed borders multiple times may have been interpreted as multiple migrants.

Comparisons with Some Related Studies.

The 2003 Technical Panel on Assumptions and Methods of the Social Security Administration (16) suggested assuming that the number of net migrants to the U.S. will grow, at least in the long run, in direct proportion to the size of the U.S.A. population. The coefficients in the final model (Table 1) suggested, by contrast, that immigration to the United States is expected to grow in proportion to the population of the United States raised to the power 0.36, times independent multiplicative effects of the calendar year. The final model also anticipates changing log(migrants) as a result of population growth in countries of origin. Likewise, according to the final model, emigration from the United States should be expected to grow in proportion to the population of the United States raised to the power 0.86, times multiplicative effects that depend on year and country-of-destination populations.

If countries' populations changed by a factor of λ > 0, holding constant all other variables and GLM coefficients, the number of migrants would be multiplied by a factor of λ^a+b because (λ·ppnorig)^a(λ·ppndest)^b = λ^a+b·ppnorig^a·ppndest^b, where a + b is the sum of the coefficients of log(ppnorig) and log(ppndest). In the final model (Table 1), a + b = 1.22; so if ppnorig and ppndest both doubled, the predicted number of migrants from origin to destination would increase by a factor of 2^1.22 = 2.34. For moving time intervals in the starting model with all variables (Table S3), a + b tended to increase with the moving initial year or moving final year or tranche. For example, for data in the time intervals 1980–2004 and 1985–2004, a + b = 1.26 and 1.30, respectively.

Bijak et al. (17) projected migratory flows among 27 European countries by multiplying the initial emigration rates by an overall trend (mobility increasing by 0.5% yearly) and temporal effects of labor market policies. For comparison, our final model estimated that the global number of migrants rose by 10^0.00163 = 1.0038 = 0.4% per year in the data of 1960–2004, apart from the multiplicative effects of population growth or decline and the indicator variables of the countries or regions of origin and destination. This agreement in estimates is remarkable considering the difference in methods, data, and context (Europe versus the globe). Raymer (18) reconstructed the migratory flows for the European Union using a different log-linear model with multiplicative components.

Some migration models are based on transition probability matrices (6). In such models, the number of immigrants is independent of the destination's population and proportional to a weighted sum of the populations of origins or of the fractions of global population in different origins. The GLMs estimated here suggest that each destination's number of immigrants is a nonlinear function of the populations of both origin and destination and of other variables.

Use in Population Projection.

The projected number of migrants can be embedded in a population projection algorithm, initially ignoring age structure and then incorporating it. The initial goal is, given a vector of country population sizes P(t) with elements P(i,t) for country i at time t, to compute the population vector P(t+1) at the next time step. The GLM can estimate the number of migrants M(i,j,t) from country i to country j between t and t+1. Let M(i,i,t) = 0 for all i (despite some countries' reporting positive numbers of migrants from the country to itself). The matrix M(t) with elements M(i,j,t) is called the migration matrix at time t. It will be assumed that M(i,j,t) obtained from the GLM approximates the number of people who were in country i at time t and in country j (≠i) at time t + 1.

The number of emigrants from country i between t and t+1, denoted E(i,t), is then E(i,t) = Σ_jM(i,j,t). The vector E(t) with elements E(i,t) is called the emigration vector at time t. The number of immigrants to country i between t and t + 1, denoted I(i,t), is I(i,t) = Σ_hM(h,i,t). The vector I(t) with elements I(i,t) is called the immigration vector at time t. The number of net migrants to country i between t and t + 1, denoted N(i,t), is N(i,t) = I(i,t) − E(i,t). The vector N(t) with elements N(i,t) will be called the net migration vector at time t. Then populations of countries or regions can be projected as

where B(i,t) is the number of births and D(i,t) is the number of deaths projected for country i from separate models of fertility and mortality.

It has been assumed thus far that the elements of the migration matrix would be the predicted values 10^{expected log(migrants)} produced by the GLM, yielding a deterministic population projection. A stochastic (Monte Carlo) method would be to sample numerically from the distribution of residuals for given values of the independent variables. Then the number of migrants from an origin to a destination would become a random variable, and a population projection that incorporated the migration matrix would become a probabilistic ensemble of projections.

The migration matrix at time t depends on the indicator variables, which are estimated from data up to and including t. Given P(t+1) from Eq. 2, the simplest approach to computing the migration matrix at time t + 1 would be to keep the coefficients of the indicator variables constant and to update only the population vector P(t + 1). A more sophisticated approach would be to examine trends over time in the coefficients of the indicator variables and to extrapolate those trends forward to obtain updated coefficients for the indicator variables of future migration matrices.

To obtain an age-specific net migration vector, one could apply age-specific models of migration patterns for different countries and regions to E(i,t) and to I(i,t) (19–21) and combine that with the projected age-specific fertility and mortality vectors as in Eq. 2.

This method of projecting international migration in combination with cohort-component methods of projecting fertility and mortality solves the two problems identified by the United Nations Population Division (1). First, the global sum of net migrants is guaranteed to be 0 when the numbers of emigrants, immigrants, and net migrants are derived from a migration matrix (22). Second, net emigration should not completely deplete the population of any sending country for realistic model parameters. In all models considered here, the exponents of the populations of origin and destination are positive and the predicted number of emigrants is a small fraction of the population of origin. Consequently, the projected number of emigrants from an origin declines to 0 as its population declines to 0. Likewise, if the destination's population declines, the models predict that fewer people will migrate there. This feature of the model depends on realistic parameter values and may not hold in all mathematically possible cases. If the coefficients of the model were unrealistic and gave an unrealistically large number of emigrants, then it is possible that the origin population could be depleted.

Open Research Questions.

Many open questions remain. Why did the area of the destination influence the numbers of migrants much less than the area of origin? How well would the proposed models work for migration within a country, taking account of international migration? For a given origin-destination pair, were the residuals correlated over time or independent as the error term in the model assumed?

International migrants are mainly younger individuals of working age and their families. Migratory flows of elderly individuals may also be important. Would the fit of the models be improved by replacing total population size with a weighted average that emphasized age groups most prone to migrate, or by a simple index such as the proportion of the population age 20–34 years? Age-structure seems likely to matter to migration increasingly as all countries undergo population aging (23).

What is the long-run behavior of the projection model Eq. 2 assuming constant birth rates and death rates and constant coefficients in Eq. 1? For example, when, if ever, does there exist a fixed vector P of population size by country and a stable growth rate ρ such that lim_{t → ∞} P(t)/ρ^t = P? If this case arises, how does ρ depend on the parameters of the basic model Eq. 2? What irreducibility conditions on the migration matrix assure the uniqueness of P? When the migration matrix is reducible, could different “ergodic sets” of countries (sets of countries linked through migration flows) have different fixed vectors P of population size by country and different stable growth rates ρ? Under what conditions on the coefficients of log(ppnorig) and log(ppndest) can country populations snowball to infinity (in finite time or with infinite time) or vanish? How sensitive to the initial conditions P(i,0) are each country's proportion of world population P(i,t)/Σ_h/P(h,t) for large t, where the summation runs over all countries? How sensitive are country-specific population growth rates P(i,t+1)/P(i,t), for large t, to initial conditions? In short, what ergodic theorems hold (24)?

Methods

Data.

All 43,653 data records are provided in Dataset S1. Each record contains 12 variables: a unique serial number, the year in the Western calendar (1960–2004), the name of the country or region of origin of migrants, the log population of the origin in that year, the name of the country or region of destination of migrants, the log population of the destination in that year, the log number of migrants from origin to destination in that year, the log area (square kilometers) of the origin, the log area (square kilometers) of the destination, the log great circle distance (kilometers) from the capital of the origin to the capital of the destination, the source of the migration data, and “neighbor” (see SI Text). Records for which the value of any variable was missing were excluded.

Each country's definitions of what constituted a migrant, of the origin or destination of a migrant, and of the accounting year were used (Table S4). Differences among definitions and in the effectiveness of collecting migration data led to hundreds of discrepancies when both the origin and the destination reported migration data in the same year. In SI Text, sources are listed and methods of collecting migration data are discussed.

Data Analysis.

A GLM was fitted to a starting model with dependent variable log(migrants) and with all six basic independent variables [year minus 1985, log(ppnorig), log(areaorig), log(ppndest), log(areadest), and log(distance)] and all indicator variables (orig.indicator, dest.indicator, orig.is.datasource, dest.is.datasource). The stepwise regression algorithm stepAIC was applied to this linear model to obtain a final model. Details of data management and statistical software are in SI Text.

Acknowledgments.

Earlier versions of this article were presented to the Technical Working Group on Long-Range Population Projections, Population Division, United Nations Headquarters, on June 30, 2003; the United Nations Population Division Seminar on October 21, 2004; the Stanford Workshop on Formal and Quantitative Demography on August 15, 2005; the Policy Research Division of the Population Council on January 23, 2006; and the Estimates and Projection Section of the Population Division of the United Nations on June 6, 2008. We thank for helpful comments the many participants in those presentations and Adam E. Cohen, Jakub Bijak, Joshua R. Goldstein, Ryuichi Kaneko, Ronald D. Lee, Jim Oeppen, and Hania Zlotnik. We thank Mr. and Mrs. William T. Golden and family for their hospitality during this work. Priscilla K. Rogerson assisted in preparing data and the manuscript. J. Bijak and R. Kaneko helpfully reviewed two versions of the article. This work was supported by U.S. National Science Foundation Grants DEB-9981552 and DMS-0443803.

Supporting Information

Supporting Information (PDF)

Supporting Information

Download
3.34 MB

SD1.csv

Download
4.74 MB

ST3.xls

Download
49.00 KB

SD1.xls

References

1

United Nations Population Projections to 2300. ESA/P/WP.184, 30 June 2003 (UNPD, New York, 2003).

Google Scholar

2

M Fertig, CM Schmidt, Aggregate-Level Migration Studies as a Tool for Forecasting Future Migration Streams. International Migration, ed S Djajic (Routledge, London), pp. 110–136 (2001).

Google Scholar

3

N Howe, R Jackson Projecting Immigration: A Survey of the Current State of Practice and Theory. A Report of the CSIS Global Aging Initiative, with Contributions by Rebecca Strauss and Keisuke Nakashima (Center for Strategic and International Studies, Washington, DC, 2005).

Google Scholar

4

J Bijak (Central European Forum for Migration and Population Research, Warsaw, PhD Dissertation. (2008).

Google Scholar

5

, eds J Raymer, F Willekens (Wiley, Chichester International Migration in Europe: Data, Models and Estimates, 2008).

Google Scholar

6

A Rogers, Demographic modeling of the geography of migration and population: A multiregional perspective. Geog Analysis 40, 276–296 (2008).

Crossref

Google Scholar

7

DS Massey, et al., Theories of international migration: A review and appraisal. Popul Dev Rev 19, 431–466 (1993).

Crossref

Google Scholar

8

DS Massey, et al. Worlds in Motion: Understanding International Migration at the End of the Millennium (IUSSP International Studies in Demography) (Clarendon, Oxford, 1998).

Google Scholar

9

T Faist The Volume and Dynamics of International Migration and Transnational Social Spaces (Oxford Univ Press, Oxford, 2000).

Crossref

Google Scholar

10

PN Ritchey, Explanations of migration. Annu Rev Sociol 2, 363–404 (1976).

Crossref

Google Scholar

11

G Dorigo, W Tobler, Push-pull migration laws. Ann Assoc Am Geogr 73, 1–17 (1983).

Crossref

Google Scholar

12

GK Zipf, The P₁P₂/D hypothesis: On the inter-city movement of persons. Am Sociol Rev 11, 677–686 (1946).

Crossref

Google Scholar

13

GK Zipf Human Behavior and the Principle of Least Effort (Addison Wesley, Cambridge MA, 1949).

Google Scholar

14

P McCullagh, JA Nelder Generalized Linear Models (Chapman and Hall, 2d ed, London, 1989).

Crossref

Google Scholar

15

GE Schwarz, Estimating the dimension of a model. Ann Stat 6, 461–464 (1978).

Crossref

Google Scholar

16

Report to the Social Security Advisory Board (Social Security Administration, Washington, DC, 2003).

Google Scholar

17

J Bijak, D Kupiszewska, M Kupiszewski, Replacement migration revisited: Simulations of the effects of selected population and labour market strategies for the ageing Europe, 2002–2052. Popul Res Policy Rev 27, 321–342 (2008).

Crossref

Google Scholar

18

J Raymer International Migration in Europe: Data, Models and Estimates, eds J Raymer, F Willekens (Wiley, Chichester), pp. 209–234 (2008).

Google Scholar

19

A Rogers, LJ Castro Model migration schedules. Research Report-81–30 (International Institute for Applied Systems Analysis, Laxenburg, Austria, 1981).

Google Scholar

20

A Rogers, F Willekens, J Raymer, Imposing age and spatial structures on inadequate migration-flow datasets. Prof Geogr 55, 56–69 (2003).

Google Scholar

21

A Rogers, LJ Castro, M Lea, Model migration schedules: Three alternative linear parameter estimation methods. Mathematical Population Studies 12, 17–38 (2004).

Crossref

Google Scholar

22

JE Cohen, Constant global population with demographic heterogeneity. Demogr Res 18, 409–436 (2008).

Crossref

Google Scholar

23

J Raymer, A Rogers International Migration in Europe: Data, Models and Estimates, eds J Raymer, F Willekens (Wiley, Chichester), pp. 175–192 (2008).

Google Scholar

24

JE Cohen, Ergodic theorems of demography. Bull Amer Math Soc NS 1, 275–295 (1979).

Crossref

Google Scholar

Information & Authors

Information

Published in

Proceedings of the National Academy of Sciences

Vol. 105 | No. 40
October 7, 2008

PubMed: 18824693

Classifications

Copyright

Submission history

Received: June 18, 2008

Published online: October 7, 2008

Published in issue: October 7, 2008

Keywords

Acknowledgments

Earlier versions of this article were presented to the Technical Working Group on Long-Range Population Projections, Population Division, United Nations Headquarters, on June 30, 2003; the United Nations Population Division Seminar on October 21, 2004; the Stanford Workshop on Formal and Quantitative Demography on August 15, 2005; the Policy Research Division of the Population Council on January 23, 2006; and the Estimates and Projection Section of the Population Division of the United Nations on June 6, 2008. We thank for helpful comments the many participants in those presentations and Adam E. Cohen, Jakub Bijak, Joshua R. Goldstein, Ryuichi Kaneko, Ronald D. Lee, Jim Oeppen, and Hania Zlotnik. We thank Mr. and Mrs. William T. Golden and family for their hospitality during this work. Priscilla K. Rogerson assisted in preparing data and the manuscript. J. Bijak and R. Kaneko helpfully reviewed two versions of the article. This work was supported by U.S. National Science Foundation Grants DEB-9981552 and DMS-0443803.

Notes

This article contains supporting information online at www.pnas.org/cgi/content/full/0808185105/DCSupplemental.

Authors

Affiliations

Joel E. Cohen^† [email protected]

Laboratory of Populations, Rockefeller and Columbia Universities, 1230 York Avenue, Box 20, New York, NY 10065-6399;

View all articles by this author

Marta Roig

Population Division, Department of Economic and Social Affairs, 2, United Nations Plaza DC2-1984, United Nations, New York, NY 10017;

View all articles by this author

Daniel C. Reuman

Laboratory of Populations, Rockefeller and Columbia Universities, 1230 York Avenue, Box 20, New York, NY 10065-6399;

Imperial College London, Silwood Park Campus, Buckhurst Road, Ascot SL5 7PY Berkshire United Kingdom; and

View all articles by this author

Cai GoGwilt

321 East 54th Street Apartment 4B, New York, NY 10022

View all articles by this author

Notes

†

To whom correspondence should be addressed. E-mail: [email protected]

Contributed by Joel E. Cohen, August 18, 2008

Author contributions: J.E.C. designed research; J.E.C. performed research; J.E.C., D.C.R., and C.G. contributed new reagents/analytic tools; J.E.C. and M.R. analyzed data; and J.E.C., M.R., and D.C.R. wrote the paper.

Competing Interests

The authors declare no conflict of interest.

Metrics & Citations

Metrics

Note: The article usage is presented with a three- to four-day delay and will update daily once available. Due to ths delay, usage data will not appear immediately following publication. Citation information is sourced from Crossref Cited-by service.

Citation statements

Altmetrics

Citations

If you have the appropriate software installed, you can download article citation data to the citation manager of your choice. Simply select your manager software from the list below and click Download.

Cited by

View Options

View options

PDF format

Download this article as a PDF file

DOWNLOAD PDF

Get Access

Login options

Check if you have access through your login credentials or your institution to get full access on this article.

Personal login Institutional Login

Recommend to a librarian

Recommend PNAS to a Librarian

Save for later

Purchase options

Purchase this article to get full access to it.

Single Article Purchase

International migration beyond gravity: A statistical model for use in population projections

Featured Topics

Articles By Topic

Featured Topics

Articles By Topic

Featured Topic

Articles By Topic

Abstract

Sign up for PNAS alerts.

Results

Descriptive Bivariate Relationships.

Model Selection.

Residuals.

Model Coefficients.

Parameter Stability: How Much of the Past Is Relevant to the Future?

Discussion

Comparisons with Some Related Studies.

Use in Population Projection.

Open Research Questions.

Methods

Data.

Data Analysis.

Acknowledgments.

Supporting Information

References

Information

Published in

Classifications

Copyright

Submission history

Keywords

Acknowledgments

Notes

Authors

Affiliations

Notes

Competing Interests

Metrics

Citation statements

Altmetrics

Citations

Cited by

View options

PDF format

Get Access

Login options

Recommend to a librarian

Purchase options

Restore content access

Figures

Tables

Other

Share

Share article link

Share on social media

Further reading in this issue

Biomechanical ordering of dense cell populations

Generation and regulation of human CD4+ IL-17-producing T cells in ovarian cancer

Temperature increase of 21st century mitigation scenarios

Widespread seawater intrusions beneath the grounded ice of Thwaites Glacier, West Antarctica

Solving the 250-year-old mystery of the origin and global spread of the German cockroach, Blattella germanica

Protecting scientific integrity in an age of generative AI

Sign up for thePNAS Highlights newsletter

Generation and regulation of human CD4⁺ IL-17-producing T cells in ovarian cancer