1 Motivation

Literacy in science has been argued to be related to language use in a fundamental sense (Norris & Phillips, 2003; von Weizsäcker, 2004). “Nothing resembling what we know as western science would be possible without text” (Norris & Phillips, 2003). Hence, studying linguistic phenomena in disciplines such as the natural sciences can point to potential affordances and challenges for becoming literate in them. In a formal sense, language is defined as a symbolic alphabet that forms words that form sentences constrained by the rules of grammar (Nowak et al., 2002). Language can be characterized by nested structures (words embedded in sentences), by hierarchical order among the elements (e.g., phrase structures) and other universal features (de Beule, 2008). Hence, language is complex by design, and potentially exhibits complex systems behavior.

From infancy, learners are confronted with linguistic stimuli in their respective communities and learn inductively to generalize the input to produce language (Wallace, 1996). Language learning was consequently characterized and shown to be a probabilistic process, where linguistic properties of the constituents of the language eventually direct the learning of it (Jurafsky & Martin, 2014).

Language is an important medium for representation of knowledge such as facts and relationships among concepts and terms. Science knowledge can be characterized by its hierarchical structure and interconnectedness of concepts (Nousiainen & Koponen, 2012). Science curricula in the United States (National Research Council, 2012) and Germany (KMK, 2020) stress the existence of core concepts that are central to science learning. For example, matter, system, energy, and interaction (force) are outlined in the German state physics curriculum to be pertinent to most topics in physics (KMK, 2020). It is unclear, however, to what extent these knowledge structures also manifest in science-specific language. Given the complexity and probabilistic nature of language, it would be desirable to develop principled and quantitative approaches to studying science language (Agrawal et al., 2016).

This study seeks to employ natural language processing and network analysis techniques to analyze science language in a principled (hence, reproducible) way and extract linguistic properties and interdependencies of science-related terms. To do so, we analyze widely used and well-administered bodies of science language data, namely German and English Wikipedia articles that were categorized as science-related. On the order of 10,000 articles could be collected to extract the structure and interdependencies of terms in science. Networks were then formed based on the interconnections of terms within sentences in these articles and formed the basis for our analyses.

2 Physics language

Communication and representation within the science disciplines is largely reliant on language: “Within the philosophy of science, it has typically been assumed that the fundamental representational resources are linguistic, mathematics being understood as a kind of language” (Giere, 2004). Language allows humans to “transfer unlimited non-genetic information among individuals, and it gives rise to cultural evolution” (Nowak et al., 2002). Besides equations, graphs, and diagrams, language is one of the primary representational means to convey science ideas (Brookes & Etkina, 2009). More generally, Vygotsky’s sociocultural theory of cognitive development states that learning and development are socially mediated processes, in which language forms a primary means to convey cultural beliefs, values, and knowledge to others (Vygotsky, 1978; 1963). Language has been labeled “the most pervasive system of semiotic resources” (Lemke, 1998). Humans use language to make sense of their science-related experiences and communicate them with others (Halliday & Matthiessen, 2007; Brookes & Etkina, 2015). Language is also the means by which humans become acquainted with science contents (Brookes & Etkina, 2015). Learning science-specific language is then essential to becoming a member of a science community (Lemke, 1998).

Science language, like language in other domains, can be characterized as an open dynamical system: it changes over time and it is open to external influences. For example, new concepts and terms are introduced with the advent of advanced theories (Touger, 1991). The advancement of theories is oftentimes accompanied by a refinement of the concepts and terms used. For example, the medieval concept of “impetus” that a once resting object carries when thrown was refined with the advent of Newtonian physics. Momentum and kinetic energy replaced this concept entirely (Halloun & Hestenes, 1985). Hence, new terms are used and old terms are abandoned. Moreover, science language is infused with everyday language. Confusion in understanding science contents is related to interferences in language that arise from using science concepts in different domains and everyday language. “Heat” is technically a process variable in science; however, it is oftentimes used as a state function in what is called a caloric metaphor in everyday language (Brookes & Etkina, 2015).

Understanding and meaning making through language is bound by the context that language appears in. The distributional hypothesis states that one understands a word by the company (of words) it keeps (Jurafsky & Martin, 2014; Harris, 1954). And within cognitive semantics, the concept of ancillary knowledge (Redish & Kuo, 2015) states that we understand the meaning of terms by a contextual web of concepts. For example, the concept “current” is understood by its definition as a stream of charged particles. However, the definition itself is only understood by the concepts “stream”, “charged”, and “particles” (Redish & Kuo, 2015). Certain concepts, then, are more central compared to other concepts and can be used as prototypes. Prototype theory posits that a bird such as a “robin” is more representative of the category bird than, say, a penguin (Rosch, 1975). Similarly, in the sciences, core concepts can be singled out that interconnect disciplines and that can be hypothesized to be central in science-term networks where usage of terms is linked together. In German state curricula, concepts such as matter, system, energy, and interaction play a central role and underlie contents in physics (KMK, 2020). In the Next Generation Science Standards, some disciplinary core ideas for physics are force and motion, systems, energy, and matter (National Research Council, 2012). These concepts also function as organizing principles for curricula across many countries.

3 Modeling language

It proved intractable to specify all rules that govern language comprehension and production, and hence deterministically model language comprehension and production (Halevy et al., 2009). Information theory and complex systems theory were found to provide powerful frameworks to explain some phenomena related to language use, because of the complexity involved in any language-related phenomena. Human language is optimized to some degree to convey as much information without confusion, hence, certain words occur more frequently in order to enhance processing speed (Montemurro & Zanette, 2010). It is suspected that some form of “principle of least action” explains the complex systems behavior of language, where a vocabulary for efficient communication needs to be found such that few words are used more frequently. Most other words occur only rarely (Zanette, 2014).

One robust finding for language as a complex system is the power law behavior of word occurrence, called Zipf’s law (Font-Clos & Corral, 2015; Alstott et al., 2014). Complex systems such as language typically comprise elementary units, called tokens, that can be grouped (by means of similarity) into larger entities, called types (Font-Clos & Corral, 2015). For language, tokens are the individual instances (realizations) of words and types are the abstract entities, i.e., the elements of the vocabulary. The frequency of word occurrence can be predicted based on this power law behavior. Language dynamics also follow evolutionary principles (Lieberman et al., 2007; Nowak et al., 2002). For example, it has been shown for the English language that the half-life of an irregular verb scales with the frequency with which it is used (Lieberman et al., 2007). As such, language is characterized by regularities at small and large scales, which is important in modeling language-related phenomena.

Discipline-specific language can be expected to adhere to similar regularities. As such, learners in a discipline will be confronted with central terms more often and get acquainted with words by accumulating and integrating the different meanings in different contexts (Lemke, 1998). The theory of lexical concepts and cognitive models advances a usage-based account of meaning making from language, i.e., situated meaning-construction (Evans, 2006). In this context it is suggested that words have meaning potentials that are activated as a function of the context they appear in. Learners are confronted with these different contexts to various degrees (Palmer, 1997). However, it has been argued that learners are confronted with insufficient information to explicitly learn the meaning and rules of words and concepts, which has been called the “poverty of stimulus” (Nowak et al., 2002). It is thus quite perplexing that speakers who grow up in the same speech community reliably speak the same language (Nowak et al., 2002). Language learning is in large part inductive inference (Nowak et al., 2002).

For complex systems such as language, network structures have been found to provide means to model relevant mechanisms such as information flow (Brockmann, 2021). Networks, in their simplest form, are defined by nodes (also: vertices) and edges, which connect the nodes. Networks appear in many complex systems, such as websites and social networks. The diameter of the World-Wide Web was measured by computing the average shortest distance between any two nodes (Albert et al., 1999). The WWW appears to be only 18.59 in diameter. This means that, on average, any website can be reached from any other website in fewer than 19 steps. The average distance in the social graph of Facebook was only 4.74 (Ugander et al., 2011). Moreover, real-world networks were found to follow powerlaws. Similar to Zipf’s law, this means for example that many nodes in a network have few connected edges (i.e., low degree) and a small but non-negligible fraction of nodes have a large number of connecting edges (Barabási & Albert, 1999). The few well-connected nodes then dominate and mediate information flow in the networks. Behavior that follows powerlaws is also called scaling behavior, because it does not depend on the magnification with which a system is observed (West, 2017).

In discipline-based educational research, network analysis has been used, among other purposes, to analyze the immersion of students in communication networks. Researchers found that position within these networks is predictive of students’ performance in physics (Brunn & Brewe, 2013; Grunspan et al., 2014). Besides social networks, the knowledge in a discipline can also be represented in the form of networks (Koponen & Pehkonen, 2010). The natural sciences in particular represent disciplines where terms are logically connected and build upon each other. For example, to understand the Newtonian force concept in physics, the concepts of displacement, velocity, and acceleration have to be introduced first. Physics knowledge/curriculum structures are also hierarchical. This knowledge is stored in textbooks and curricula, and more and more in internet databases such as Wikipedia.

With regards to analyses of science language, most studies apply in-depth, qualitative research approaches such as content analysis (Brookes & Etkina, 2015; Carlsen, 2007). These approaches are based on human experts’ interpretations of the language data. Even though this assures meaningful analysis of the data, it is difficult to scale this approach to larger amounts of language data in science that will become increasingly available in the future (Baig et al., 2020). Computational approaches could facilitate a more data-centered, bottom-up approach to language analysis in science education research. Natural language processing emerged as a particularly powerful tool for systematically analyzing and modelling language data. Natural language processing encompasses a wide array of tools such as part-of-speech tagging or named entity recognition (Jurafsky & Martin, 2014). All these tools can enhance computational analyses of natural language.

4 Research questions

The present study utilizes natural language processing to facilitate network analysis of science-specific language. We seek to examine linguistic properties and interdependencies of science-related terms. The goal is to identify central terms in science-specific language and examine the properties of relations among the terms through a network analysis approach. The following research questions guide this study:

  • What are typical network parameter values for term networks in the natural science disciplines? In what ways are the networks for biology, chemistry, and physics similar or different?

  • Which terms in biology, chemistry, and physics emerge as central based on their network properties when analyzing a large corpus of science-specific texts, respectively?

  • In what ways do the contexts of the most central terms differ when used in the other considered disciplines?

5 Method

5.1 Science-related Wikipedia articles

For science, Wikipedia has been proven to be a reliable source of knowledge (Giles, 2005; Agrawal et al., 2016; Ponzetto & Strube, 2007), almost as accurate as its commercial competitors (Giles, 2005). Consequently, Wikipedia has been used as a resource to automatically assist teachers in curriculum design (Agrawal et al., 2016) and to enhance natural language processing applications such as coreference resolution (Ponzetto & Strube, 2007). Given the validity of science-related Wikipedia articles, we chose this as the text corpus to analyze science-related terms. We were furthermore interested in contrasting German and English science language to better understand how generalizable certain patterns are across these two western languages. Because we were interested in what science terms are central, the articles had to be cleaned to retrieve only plain text articles. Hence, mathematical formulas, references, and URLs were removed from the articles with the help of natural language processing tools.

Wikipedia articles are annotated in the form of categories that are assigned to the articles. Only articles that were labeled as science-related (i.e., biology, chemistry, and physics) were retrieved and respectively analyzed. Table 1 displays information on the retrieved articles for physics. In this study, we will often focus on physics language. In the online supplement we include the respective information for biology and chemistry.

Table 1 Characterization of the German and English Wikipedia articles for physics

As can be seen, German Wikipedia had overall more physics-related articles; however, physics articles in English Wikipedia were longer. Interestingly, for biology and chemistry, English Wikipedia had more articles that were also longer. Sentence lengths in English articles were longer compared to German articles’ sentences for all science disciplines. Given that German language allows for long compound nouns, we can also see that German articles had an overall greater vocabulary compared to English Wikipedia articles for all disciplines (note: for biology, the relative size adjusted by number of articles would be greater as well). The type-token-ratio for all science disciplines was greater in German compared to English. This means that German language uses more specific terms per token.
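The type-token-ratio reported above can be illustrated with a minimal sketch. The function name `type_token_ratio` and the whitespace tokenization of the toy sentence are assumptions made for illustration only; they do not reproduce the preprocessing used for the Wikipedia corpus.

```python
def type_token_ratio(tokens):
    """Return the number of distinct types divided by the number of tokens."""
    if not tokens:
        raise ValueError("need at least one token")
    types = set(tokens)  # types are the distinct entries of the vocabulary
    return len(types) / len(tokens)


# Toy sentence with naive whitespace tokenization (an assumption, not the study's pipeline).
tokens = "the force on the particle equals the mass times the acceleration".split()
ttr = type_token_ratio(tokens)  # 8 types over 11 tokens
```

A higher ratio means that fewer tokens are repetitions of an already seen type, which is the sense in which German uses more specific terms per token.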

5.2 Preliminary analysis of science-related language in Wikipedia articles

To focus analyses on science-related terms, a subdataset was extracted in which only nouns were kept for the analyses. In physics ontology, typically entities, objects, and concepts are represented by nouns and processes are represented by verbs (Brookes & Etkina, 2015; Lemke, 1998). Hence, to map physics language and physics knowledge-structure, it is sensible (as a first step) to restrict analyses to nouns. The natural language processing technique of part-of-speech tagging (as implemented in the spaCy library for Python) makes it possible to perform this analysis in many different languages (Honnibal & Montani, 2017).
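A minimal sketch of the noun filter could look as follows. It relies only on the `pos_` and `lemma_` attributes that spaCy exposes on its tokens; the function name `extract_nouns`, the use of lemmas, and the restriction to the tag `NOUN` (excluding proper nouns) are assumptions for illustration, since the exact filtering rules are not specified here. Stand-in token objects are used so that the sketch runs without a language model.

```python
from types import SimpleNamespace


def extract_nouns(tokens):
    """Keep the lemmas of all tokens whose part-of-speech tag is NOUN.

    `tokens` is any iterable of objects with `pos_` and `lemma_` attributes,
    e.g. the tokens of a document processed by a spaCy pipeline.
    """
    return [tok.lemma_ for tok in tokens if tok.pos_ == "NOUN"]


# Hand-tagged stand-in for "The energy of the system increases" (hypothetical data).
sentence = [
    SimpleNamespace(lemma_="the", pos_="DET"),
    SimpleNamespace(lemma_="energy", pos_="NOUN"),
    SimpleNamespace(lemma_="of", pos_="ADP"),
    SimpleNamespace(lemma_="the", pos_="DET"),
    SimpleNamespace(lemma_="system", pos_="NOUN"),
    SimpleNamespace(lemma_="increase", pos_="VERB"),
]
nouns = extract_nouns(sentence)
```

With spaCy installed, the same function could be applied to the tokens of a processed document; which pretrained German and English pipelines to load is left open here.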

To linguistically characterize the sample articles and the noun dataset in more detail and analyze potential linguistic differences between German and English Wikipedia, we examined to what extent powerlaw behavior and Zipf’s law applied for the articles (Font-Clos & Corral, 2015; Clauset et al., 2009). Powerlaws are of great interest, because the distributions exhibit heavy tails, meaning that all values are expected to occur, allowing for scale-free behavior (Alstott et al., 2014; West, 2017). In the powerlaw \(\nu (t)\sim t^{-\alpha }\), \(t\) refers to the word rank, \(\nu (t)\) to the word count, and \(-\alpha\) to the powerlaw exponent, which is necessarily below zero. If the behavior of the system follows Zipf’s law, the exponent should be −1 (i.e., \(\alpha = 1\)) in the abovementioned representation. A slightly different, though more common, representation considers the number of words with the same number of counts \(\nu (t)\). Here, the exponent for the Zipf distribution is expected to be around −2. With maximum likelihood methods, viable tools became available to analyze powerlaw behaviors. Researchers developed open source software packages for evaluating empirical data with regards to powerlaw behavior (Alstott et al., 2014). In this software package, the empirical data is mapped to a probability density distribution. A minimum value on the x-axis, above which the powerlaw behavior typically starts, is additionally fitted so that a powerlaw can be considered for only part of the distribution.
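To make the maximum likelihood idea concrete, the following sketch implements the standard estimator for a continuous powerlaw with a fixed lower bound, \(\hat{\alpha } = 1 + n\left[ \sum _i \ln (x_i / x_{\min })\right] ^{-1}\) with standard error \(\sigma = (\hat{\alpha } - 1)/\sqrt{n}\) (Clauset et al., 2009). It is a simplified illustration and not the procedure of the software package used in this study, which additionally fits \(x_{\min }\) and handles discrete data. The function name `fit_power_law_exponent` and the synthetic sample are assumptions.

```python
import math
import random


def fit_power_law_exponent(values, xmin):
    """Maximum likelihood estimate of alpha for p(x) ~ x^(-alpha), x >= xmin.

    Returns (alpha, sigma), where sigma is the standard error of alpha.
    Continuous approximation with a fixed xmin (simplifying assumption).
    """
    tail = [x for x in values if x >= xmin]
    log_sum = sum(math.log(x / xmin) for x in tail)
    if not tail or log_sum == 0:
        raise ValueError("need values above xmin to estimate the exponent")
    alpha = 1.0 + len(tail) / log_sum
    sigma = (alpha - 1.0) / math.sqrt(len(tail))
    return alpha, sigma


# Synthetic check: draw from a powerlaw with alpha = 2 by inverse transform sampling.
rng = random.Random(0)
true_alpha, xmin = 2.0, 1.0
samples = [xmin * (1.0 - rng.random()) ** (-1.0 / (true_alpha - 1.0)) for _ in range(20000)]
alpha_hat, sigma_hat = fit_power_law_exponent(samples, xmin)
```

On the synthetic sample the estimate should lie close to the exponent that generated the data, which is the kind of recovery one would want before interpreting exponents from empirical word counts.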

It is furthermore informative to contrast the powerlaw against other distributions to infer the data generating process. The data generating process for a normal distribution is adding random variables X together. Hence, many observables in the real world follow a normal distribution. For a lognormal distribution, positive random variables are multiplied together. Creating a powerlaw requires more elaborate data generating processes that are oftentimes not well understood (Alstott et al., 2014). However, mere draws from a uniform distribution of characters plus a space can reproduce some of the regularities of language related to the powerlaw behavior (Li, 1992).

Figure 1 displays the empirical and fitted distributions in a log-log plot. Except for the lower tails in the right-hand side plots, the distributions resemble straight lines, as expected when powerlaw behavior is present. The exponents are close to 2 for the German articles (see Table 2). For English articles they show greater variability but are still close to 2 (though not always within the error bounds, σ). Table 2 further indicates that a powerlaw distribution describes the data statistically significantly better than an exponential distribution (likelihood ratio R and significance value p). This holds true for all science disciplines (see online supplement). This comparison shows that the distributions are heavy-tailed (Alstott et al., 2014). However, compared with the lognormal distribution (Rlogn, plogn), we cannot confirm a statistically significant better fit of the powerlaw distribution in all cases. For example, the negative likelihood ratio for English physics articles indicates a better fit for the lognormal distribution, which is also a heavy-tailed distribution. The equal fit of lognormal and powerlaw seems to be common in empirical data analyses (Alstott et al., 2014). Figure 1 also indicates that the heavy tail for the noun-only dataset is smaller compared to the entire dataset.

Fig. 1

Graphical representation of Zipf’s law. Fitted curves are dashed. Upper left: all articles of German Wikipedia; upper right: all articles of English Wikipedia; lower left: only nouns in German Wikipedia; lower right: only nouns in English Wikipedia. Blue lines: probability density function; red lines: complementary cumulative probability density function

Table 2 Parameters for power law fits and comparison with other distributions

Another important phenomenon related to language is the positive correlation between word length and rank (starting with most frequent terms as 1): longer words tend to be less frequent (Zanette, 2014). We examined this relationship in the present datasets. For German, Pearson correlations were .23 and .23 for all words and only nouns, respectively. For English, correlations were .21 and .18 for all words and only nouns, respectively. Correlations for the other disciplines were: .25, .26, .20, .19 (biology) and .25, .25, .23, .19 (chemistry). Hence, we can confirm that science-related language also exhibits this general relationship in languages.
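The length-rank relationship can be sketched as follows. The helper names `pearson` and `length_rank_correlation` are hypothetical, and correlating over types (one data point per distinct word) is an assumption, since the unit of analysis is not stated above. Ties in frequency are ranked in order of first appearance.

```python
import math
from collections import Counter


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sd_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    sd_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (sd_x * sd_y)


def length_rank_correlation(tokens):
    """Correlate the frequency rank of each type (1 = most frequent) with its length."""
    ranked_types = [word for word, _ in Counter(tokens).most_common()]
    ranks = list(range(1, len(ranked_types) + 1))
    lengths = [len(word) for word in ranked_types]
    return pearson(ranks, lengths)


# Toy data in which rarer words are strictly longer (hypothetical example).
toy_tokens = ["a"] * 3 + ["bb"] * 2 + ["ccc"]
r = length_rank_correlation(toy_tokens)
```

A positive value indicates that words further down the frequency ranking tend to be longer.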

6 Parameters of science-term networks

In RQ1 typical parameters of the different science-term networks for the disciplines will be displayed and compared. First, the different frequencies of nouns and the overall vocabulary size, i.e., number of nodes in the network, will be displayed. Alongside the number of nodes we will also display the number of edges between the nodes. A common property of networks is the density (Grunspan et al., 2014). The density is calculated as the proportion of realized edges to the number of possible edges. Another interesting property of the networks is the average shortest path length. We indicated that for social network graphs this distance is typically rather low, which refers to the property that any other node can be reached from any node (here: person) in only a few steps. Finally, the transitivity will be calculated. Transitivity is a measure of cohesion (Grunspan et al., 2014). It measures the number of realized triads in relation to the number of possible triads in the network.
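The three parameters described above can be sketched in plain Python for a small undirected toy network. The function names and the toy edges are assumptions for illustration; for networks of the size analyzed here one would rather rely on a graph library (networkx, for instance, provides `density`, `transitivity`, and `average_shortest_path_length`). For disconnected graphs, this sketch averages over reachable pairs only.

```python
from collections import deque
from itertools import combinations


def build_adjacency(edges):
    """Undirected adjacency sets from an edge list; self-loops are ignored."""
    adj = {}
    for u, v in edges:
        if u != v:
            adj.setdefault(u, set()).add(v)
            adj.setdefault(v, set()).add(u)
    return adj


def density(adj):
    """Realized edges divided by the n(n-1)/2 possible edges."""
    n = len(adj)
    m = sum(len(nb) for nb in adj.values()) / 2
    return 2 * m / (n * (n - 1)) if n > 1 else 0.0


def transitivity(adj):
    """Closed triads divided by all connected triads (paths of length two)."""
    closed = triads = 0
    for neighbors in adj.values():
        k = len(neighbors)
        triads += k * (k - 1) // 2
        closed += sum(1 for a, b in combinations(neighbors, 2) if b in adj[a])
    return closed / triads if triads else 0.0


def average_shortest_path_length(adj):
    """Mean breadth-first-search distance over all reachable ordered pairs."""
    total = pairs = 0
    for source in adj:
        dist = {source: 0}
        queue = deque([source])
        while queue:
            current = queue.popleft()
            for nxt in adj[current]:
                if nxt not in dist:
                    dist[nxt] = dist[current] + 1
                    queue.append(nxt)
        total += sum(dist.values())
        pairs += len(dist) - 1
    return total / pairs if pairs else 0.0


# Hypothetical toy network of four terms.
toy_edges = [("energy", "system"), ("energy", "particle"),
             ("system", "particle"), ("particle", "matter")]
toy_adj = build_adjacency(toy_edges)
```

In the toy network, four of six possible edges are realized, one triangle closes three of five triads, and the six node pairs are on average 4/3 steps apart.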

Another important phenomenon for real-world networks is the scale-free behavior of node degrees (Brockmann, 2021). As with the linguistic properties analysis above (Zipf’s law), the frequency of node degrees can be similarly described with a powerlaw. We will use a similar analysis as outlined above. The frequency of nodes for a certain degree will be plotted as histograms and probability density distributions with their respective fitted curves. The analysis of the scaling parameter α and the comparison with exponential and lognormal distributions will reveal properties of the data distribution.
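The input to such a degree analysis is simply the number of nodes per degree. A minimal sketch, assuming an undirected edge list in which duplicate edges and self-loops are ignored (the function name `degree_distribution` and the toy edges are hypothetical):

```python
from collections import Counter


def degree_distribution(edges):
    """Map each degree to the number of nodes that have this degree."""
    degree = Counter()
    seen = set()
    for u, v in edges:
        key = frozenset((u, v))
        if u == v or key in seen:
            continue  # skip self-loops and repeated edges
        seen.add(key)
        degree[u] += 1
        degree[v] += 1
    return Counter(degree.values())


# Hypothetical toy network of four terms.
toy_edges = [("energy", "system"), ("energy", "particle"),
             ("system", "particle"), ("particle", "matter")]
distribution = degree_distribution(toy_edges)
```

The resulting counts are what a histogram of node degrees would display, and the underlying list of degrees is what a powerlaw fit would take as data.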

7 Identification of central terms through network analysis

The more frequent terms in a language are crucial for processes of language acquisition, language perception, and production. An intuitive way to analyze frequent terms in physics-specific languages would be to count occurrences of types in the Wikipedia articles. However, the most frequent words in English, such as “the”, would be uninformative. Hence, only nouns were considered. The nouns were split by their linguistic function as being a subject or object, given that this adds information on the extent to which a term is used as agent (subject) in a sentence. The Python library spaCy was used to generate these datasets.

The structure of sentences in English and German is organized in phrases, where the subject (in a noun phrase) determines the agent in a sentence which is related to objects via verbs. In our analysis of central terms, we therefore extracted every subject for the respective Wikipedia articles and linked them to their nouns in a sentence. To create a network representation, every subject and object was stored as a node in the network. Extracting subjects and objects was again performed with the spaCy library in Python. Each link between a subject and object was stored as an edge. A network representation can then be generated through complex modeling. For example, the spring layout utilizes the concept from physics where each edge is a spring and an equilibrium distribution is to be found through optimization techniques.
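The construction of the network from subject-object links can be sketched as follows. The input format (one pair of subject nouns and object nouns per sentence), the function name `build_term_network`, and the choice to store directed, weighted edges from subject to object are assumptions for illustration; in practice the subjects and objects would come from a dependency parse (spaCy exposes the dependency label of a token as `token.dep_`, with label sets that depend on the model and language).

```python
from collections import Counter
from itertools import product


def build_term_network(sentences):
    """Count subject -> object links across sentences.

    `sentences` is an iterable of (subjects, objects) pairs, each a list of
    noun lemmas. Returns a Counter mapping (subject, object) to the number
    of sentences in which the link occurs.
    """
    edge_weights = Counter()
    for subjects, objects in sentences:
        for subj, obj in product(set(subjects), set(objects)):
            if subj != obj:  # no self-loops
                edge_weights[(subj, obj)] += 1
    return edge_weights


# Hypothetical parsed sentences.
parsed = [
    (["energy"], ["system"]),
    (["energy"], ["particle", "system"]),
    (["particle"], ["matter"]),
]
network = build_term_network(parsed)
nodes = {term for edge in network for term in edge}
```

The edge weights correspond to the strength of a link between two terms, which is what allows the strongest links to be highlighted in a plot of the network.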

Retrieving central nodes, i.e., central terms, can be done in multiple ways. A simple approach is counting the incoming and outgoing edges. However, the PageRank algorithm has been found to detect important nodes more effectively. It is based on the observation that a node with fewer links from otherwise more important other nodes should be ranked higher than a node with many links from irrelevant nodes (Page et al., 1999). Hence, we will use the Python library networkx’s implementation of PageRank to identify central terms in science language (RQ2).
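The idea behind PageRank can be illustrated with a short power iteration. This is a simplified sketch and not the networkx implementation used in the study; the function name, the damping factor of 0.85, the uniform treatment of nodes without outgoing links, and the toy edges are assumptions for illustration.

```python
def pagerank(edges, damping=0.85, tol=1e-10, max_iter=200):
    """Power iteration for PageRank on a list of directed (source, target) edges."""
    nodes = sorted({node for edge in edges for node in edge})
    out_links = {node: [] for node in nodes}
    for source, target in edges:
        out_links[source].append(target)
    n = len(nodes)
    rank = {node: 1.0 / n for node in nodes}
    for _ in range(max_iter):
        # Rank held by nodes without outgoing links is spread uniformly.
        dangling = sum(rank[u] for u in nodes if not out_links[u])
        new_rank = {node: (1.0 - damping) / n + damping * dangling / n for node in nodes}
        for u in nodes:
            if out_links[u]:
                share = damping * rank[u] / len(out_links[u])
                for v in out_links[u]:
                    new_rank[v] += share  # a node passes its rank on to its targets
        change = sum(abs(new_rank[node] - rank[node]) for node in nodes)
        rank = new_rank
        if change < tol:
            break
    return rank


# Hypothetical toy network: "energy" is linked from two terms, one of which is itself well linked.
toy_edges = [("system", "energy"), ("particle", "energy"),
             ("matter", "particle"), ("energy", "system")]
scores = pagerank(toy_edges)
most_central = max(scores, key=scores.get)
```

Because a node passes a share of its own score to the nodes it links to, a link from a highly ranked term counts for more than a link from a peripheral term, which distinguishes this approach from simply counting edges.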

In RQ3 we will use the most central terms in the physics articles and analyze how they are used in the different disciplines. We compare the use of physics terms in the disciplines biology, chemistry, and politics as contrasting cases. Given that English and German analyses yield similar results with regard to central terms, we will focus on the English articles in this analysis. For the new disciplines, we also retrieved all articles from Wikipedia and the respective noun dataset. We then analyzed the links that each of the physics terms had with other terms in the respective discipline. This analysis will eventually yield differences in the contexts in which terms are used in the disciplines.
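A minimal sketch of such a context comparison is given below. It assumes that each discipline's network is available as a mapping from term pairs to link counts; the function name `term_contexts`, this data format, and the toy networks are hypothetical and serve only to illustrate the comparison of linked terms across disciplines.

```python
from collections import Counter


def term_contexts(edge_weights, term):
    """Count the terms that are linked to `term` in a weighted edge mapping."""
    context = Counter()
    for (a, b), weight in edge_weights.items():
        if a == term:
            context[b] += weight
        elif b == term:
            context[a] += weight
    return context


# Hypothetical networks for two disciplines.
physics_edges = {("energy", "system"): 3, ("energy", "particle"): 2}
politics_edges = {("energy", "policy"): 4, ("system", "energy"): 1}

physics_context = term_contexts(physics_edges, "energy")
politics_context = term_contexts(politics_edges, "energy")

shared = set(physics_context) & set(politics_context)
only_physics = set(physics_context) - set(politics_context)
only_politics = set(politics_context) - set(physics_context)
```

Terms that appear only in one discipline's context point to the discipline-specific usage of a central physics term.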

8 Findings

8.1 Parameters of the science-term networks

In RQ1 we calculated important parameters of the respective science-term networks (see Table 3). As a baseline comparison, we depict the number of nodes in the networks. It can be seen that the German networks have more nodes compared to the English networks. However, the English networks have more edges (i.e., links between the nodes) compared to the German networks. Hence, the density for the English networks is more than twice the density for the German physics network. In all networks, the average shortest path length is slightly above 3 for English and around 4 for German language. This means that any term can on average be reached from any other term in only a few steps. The transitivity (cohesion) in the English networks is greater compared to the German networks. This is likely related to the greater density of the English networks, which indicates that everything is more closely tied together. Overall, the within-group differences in a language seem to be smaller compared to the within-group differences in a domain.

Table 3 Summary of network parameters for the different groups

The node degrees of these real-world term networks for science disciplines show scale-free behavior, i.e., they follow a powerlaw distribution (see Table 4). The scaling parameter α is around 2 for all networks. The R and p values indicate that the data is better approximated by a powerlaw distribution as compared to an exponential distribution. Hence, the upper tail is populated with nodes that have many connections (i.e., high degree). The comparison with the lognormal distribution (Rlogn, plogn) is less conclusive: sometimes the lognormal is a better fit, and sometimes the powerlaw. However, both distributions (powerlaw and lognormal) have a heavy tail.

Table 4 Powerlaw parameters for the science-term networks

The powerlaw behavior can be observed in the distributions as well (see Fig. 2). The histograms indicate that many nodes have a low degree. Following powerlaw scaling, few nodes have large degrees. This indicates that few terms are central in the networks and function as hubs that connect the different regions of the networks.

Fig. 2

Powerlaw behavior of the science-term networks. Histograms on the left represent the frequency over number of degrees for the nodes in the respective networks. Probability density distributions on the right represent the probability density (red: cumulated, blue: probability; solid: empirical, dotted: \(x_{\min }\) fixed to 1, dashed: \(x_{\min }\) free to vary)

8.2 Central terms in networks

In RQ2 we evaluate what science terms emerge as central from analyzing filtered German and English Wikipedia. The resulting network for German physics-related Wikipedia articles can be seen in Fig. 3. Only the most connected nodes are represented to make the network readable. The highlighted edges represent the strongest links between any two terms. The 20 most central terms are: Energie, Teilchen, Zeit, Physik, Theorie, Begriff, Arbeiten, Form, Eigenschaften, Teil, Elektronen, Arbeit, Entwicklung, Masse, System, Bereich, Körper, Materie, Größe, Beispiel.

Fig. 3

Network representation of central physics terms in German Wikipedia

The terms refer to sociology/philosophy of science (Theorie [theory], Begriff [term], Untersuchungen [investigation], Wissenschaftler [scientist]) and to discipline-specific concepts (Energie [energy], Teilchen [particle], Zeit [time], Elektronen [electron], Kraft [force]). We will focus on the discipline-specific terms. Energy emerged as the most central term. It is linked to many other terms. For example, energy is linked to Teilchen (particle), Form (form), System (system), Masse (mass). This is well in line with our expectations. Forms (of energy) are a common approach to teaching energy. Furthermore, discussing particles (Teilchen) oftentimes involves energy. In elementary particle physics, energy is an important concept (alongside momentum) to analyze experiments and detect potential new particles. Energy is also linked to system and system is then linked to state. These links express the importance of system identification when dealing with energy. Furthermore, linking it to state suggests that energy is a state function that is independent of trajectories (Brookes & Etkina, 2015).

Particle is then linked to matter. This link expresses the fact that matter is made up of particles. As an example of a particle, electron is linked to particle. It is noteworthy that electron emerged as a specific particle. The electron is a well-studied object in physics. For one, electromagnetism is in large part concerned with the behavior of electrons as negatively charged elementary particles (fermions). Furthermore, the study of the electron in the early days of quantum mechanics (e.g., the Dirac equation) and the observation that electrons as particles exhibit wave-like properties were crucial to advance physics. Electron is then linked to atoms, because atoms comprise electrons. Understanding the behavior of electrons in atoms enables the prediction of properties of molecule formation and chemical reactions.

Temperature is linked to velocity. This can be attributed to the fact that temperature is determined by the average kinetic energy, and hence the velocities, of the microscopic particles in a system. Space and time are also linked. This might be attributed to the strong connection of these two concepts in the realm of relativity theory.

For English Wikipedia, the 20 most central physics terms were: theory, time, energy, system, example, number, field, particles, effect, work, model, process, physics, result, state, equation, method, experiment, form, part. As is evident from this list, many terms in English Wikipedia can be mapped to German Wikipedia. Also in English Wikipedia energy emerges as the most central physics concept (see Fig. 4). Energy is linked to system. Space and time are also linked in English Wikipedia as in the German Wikipedia.

Fig. 4

Network representation of central physics concepts in English Wikipedia

However, many other connections also emerge in the English Wikipedia. For example, equation is highly interlinked with other terms. Laws are commonly linked to equations, since laws such as Ohm’s law or Newton’s second law can be encapsulated in an equation. The term “terms” is also linked to equation, which refers to the mathematical constituents of an equation. Furthermore, field and density are linked to equation. Field equations are common in gravitational theory, electromagnetism, and quantum field theory, among others. Density could be linked to equation via the continuity equation.

Finally, time is well interlinked in the English Wikipedia network. Besides space, time is also connected to process. This reflects the fact that processes are inherently time-bound. Time is also linked to equation. This might be attributed to the fact that dynamical equations in physics model processes in time, i.e., time is a parameter in these equations. The link between time and concepts could indicate that time itself is oftentimes introduced as a concept. In general relativity theory, time is exposed as fundamentally bound to space. Recent quantum theories such as relational quantum mechanics and its predecessors treat time as a fundamental ingredient in physics theories.

8.3 Contrasting term use by discipline

In RQ3 we now use the 20 most central terms (extracted by PageRank) in physics English Wikipedia and determine in what contexts they appear in disciplines other than physics. The disciplines biology, chemistry, and politics were considered, because biology and chemistry are closely related to physics, whereas politics is rather different in terms of concepts and language use. Table 5 shows the counts of each physics term in the other domains. It is noteworthy that all physics terms could be found in all other disciplines. The counts varied. For example, “number” was the most frequently encountered physics term in politics articles, whereas in biology and in chemistry “process” was the most frequent physics-derived term. The least used terms were particles, physics, and physics for politics, biology, and chemistry, respectively.

Table 5 Word contexts of different disciplines with regards to the most central terms in physics

To analyze differences in contexts, the three most connected nouns for each physics term were determined (see politics/biology/chemistry 1/2/3 in Table 5). The term “system” in politics is linked to representation, government, and method. This relates to a political system and its function of representation. In biology, “system” is connected to cells, species, and systems. The cell is described as a structurally separable, autonomous, and self-sustaining system, hence its close relation to system. Finally, in chemistry “system” is connected to equilibrium, state, and example. Equilibrium chemistry is concerned with systems in which the involved chemical entities do not change with time. This is typically an important assumption in order to model processes and phenomena mathematically. Even though the underlying concepts of system in all inspected disciplines share commonalities (e.g., a whole composed of parts), the connected words are entirely different. Similarly, the term theory in politics refers to government, in biology to evolution, and in chemistry to orbitals (atoms). This indicates that language learners in different disciplines get acquainted with different contexts for the same term.
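The context extraction described above can be illustrated with a minimal sketch. It assumes that each sentence has already been reduced to a list of nouns and that "connected" simply means co-occurring in the same sentence; the function name `top_contexts` and the toy sentences are hypothetical and do not reproduce the actual pipeline or data.

```python
from collections import Counter


def top_contexts(sentences, term, k=3):
    """Return the k nouns that most often share a sentence with `term`.

    `sentences` is a list of noun lists (one list per sentence).
    Ties are broken alphabetically so the result is deterministic.
    """
    counts = Counter()
    for nouns in sentences:
        if term in nouns:
            # Count each co-occurring noun once per sentence.
            counts.update(n for n in set(nouns) if n != term)
    ranked = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return [noun for noun, _ in ranked[:k]]


# Hypothetical toy data standing in for chemistry sentences.
chemistry = [
    ["system", "equilibrium", "state"],
    ["system", "equilibrium"],
    ["system", "equilibrium", "state", "example"],
    ["system", "state", "example", "equilibrium"],
    ["reaction", "equilibrium"],
]

print(top_contexts(chemistry, "system"))
```

Running the same function on noun lists from politics or biology articles would yield the discipline-specific contexts for the same term.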

9 Discussion

This study sought to analyze science-specific language in a principled way with the help of natural language processing and network analysis methods. To retrieve a representative body of science-related language, Wikipedia articles that were categorized with the labels biology, chemistry, and physics were analyzed. The respective German and English versions of Wikipedia were analyzed as contrasting cases. German Wikipedia had more articles overall in physics; in English Wikipedia, however, the articles were longer. In terms of linguistic properties, both languages followed Zipf’s law. This means that few terms appear very often, whereas most terms appear rarely. This is a well-documented phenomenon for languages (Moreno-Sánchez et al., 2016).
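To make the rank-frequency relationship behind Zipf's law concrete, the following sketch estimates the exponent from a token list by a least-squares fit on the log-log rank-frequency curve. This simple regression is an illustrative assumption, not necessarily the fitting procedure used in this study, and the function name `zipf_exponent` and the synthetic tokens are hypothetical.

```python
import math
from collections import Counter


def zipf_exponent(tokens):
    """Estimate alpha in frequency ~ rank**(-alpha) via log-log least squares.

    Requires at least two distinct tokens.
    """
    freqs = sorted(Counter(tokens).values(), reverse=True)
    xs = [math.log(rank) for rank in range(1, len(freqs) + 1)]
    ys = [math.log(freq) for freq in freqs]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / sum(
        (x - mean_x) ** 2 for x in xs
    )
    return -slope


# Synthetic tokens whose frequencies are exactly 60 / rank (60, 30, 20, 15, 12).
tokens = [f"w{rank}" for rank in range(1, 6) for _ in range(60 // rank)]
alpha = zipf_exponent(tokens)
print(round(alpha, 3))
```

An exponent close to 1 corresponds to the classical form of Zipf's law; for real corpora, more robust estimators than a plain log-log regression are usually preferable.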

In RQ1 we analyzed the general properties of the different science-term networks. English science-term networks appear to have fewer nodes than the German networks; however, the English networks have more connections between the nodes and are hence denser than the German networks. This raises the cohesion of the English science-term networks. The average shortest path between nodes in English and German science-term networks is approximately 3 and 4, respectively. This number is well in line with other networks such as social networks, e.g., Facebook has an average shortest path of around 5 (Brockmann, 2021). The distribution of node degrees was shown to follow a power-law distribution. This indicates that the science-term networks exhibit scale-free behavior and have heavy tails (Barabási & Albert, 1999). This means that few terms form central hubs in these networks and dominate the information flow. Arguably, these terms have to be specifically accounted for in educational efforts and curricula.
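The two network properties discussed here, density and average shortest path, can be sketched in a few lines for an undirected term network. The sketch assumes a plain adjacency-set representation and a tiny hypothetical network; it is meant to illustrate the definitions, not to reproduce the tooling used for the reported values.

```python
from collections import deque


def density(adj):
    """Share of all possible undirected edges that are actually present."""
    n = len(adj)
    if n < 2:
        return 0.0
    m = sum(len(neighbors) for neighbors in adj.values()) / 2
    return 2 * m / (n * (n - 1))


def average_shortest_path(adj):
    """Mean breadth-first-search distance over all connected ordered pairs."""
    total, pairs = 0, 0
    for source in adj:
        dist = {source: 0}
        queue = deque([source])
        while queue:
            node = queue.popleft()
            for neighbor in adj[node]:
                if neighbor not in dist:
                    dist[neighbor] = dist[node] + 1
                    queue.append(neighbor)
        total += sum(dist.values())
        pairs += len(dist) - 1  # exclude the source itself
    return total / pairs if pairs else 0.0


# Hypothetical toy network: state - system - energy - particle - matter.
edges = [
    ("energy", "system"),
    ("system", "state"),
    ("energy", "particle"),
    ("particle", "matter"),
]
adj = {}
for a, b in edges:
    adj.setdefault(a, set()).add(b)
    adj.setdefault(b, set()).add(a)

print(density(adj), average_shortest_path(adj))
```

A denser network (more edges for the same number of nodes) tends to have a shorter average path, which is consistent with the contrast between the English and German networks described above.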

In the context of RQ2 we sought to identify these central terms. To do so, the subjects in each sentence were linked to their respective objects, similar to network analyses where people are linked to other people or websites are linked in search engines. The PageRank algorithm was used to extract the most important terms in German and English Wikipedia. The most central terms in German and English are almost identical for physics, and also for biology and chemistry, respectively. The terms were also linked to other terms in similar ways in both languages. For physics in particular, the analyzed terms referred to philosophy of science/sociology and physics-specific concepts. The physics-specific terms particularly match expectations as expressed in physics state curricula. It is interesting to note that no domains in physics (e.g., thermodynamics, mechanics, optics) appear as central terms. We hypothesize that the core concepts (interaction, system, energy, force) are more important because they apply across domains.
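The idea of ranking terms by PageRank on subject-object links can be sketched with a plain power iteration. The damping factor of 0.85, the handling of terms without outgoing links, and the toy subject-object pairs are assumptions made for illustration; the sketch does not claim to match the implementation or parameters used in this study.

```python
def pagerank(edges, damping=0.85, iterations=100):
    """Power-iteration PageRank on directed (subject, object) term pairs."""
    nodes = sorted({node for edge in edges for node in edge})
    n = len(nodes)
    out_links = {node: [] for node in nodes}
    for subject, obj in edges:
        out_links[subject].append(obj)

    rank = {node: 1.0 / n for node in nodes}
    for _ in range(iterations):
        # Rank held by terms without outgoing links is spread evenly.
        dangling = sum(rank[node] for node in nodes if not out_links[node])
        base = (1.0 - damping) / n + damping * dangling / n
        new_rank = {node: base for node in nodes}
        for node in nodes:
            for target in out_links[node]:
                new_rank[target] += damping * rank[node] / len(out_links[node])
        rank = new_rank
    return rank


# Hypothetical subject -> object links extracted from sentences.
links = [
    ("particle", "energy"),
    ("system", "energy"),
    ("form", "energy"),
    ("mass", "energy"),
    ("energy", "system"),
    ("energy", "particle"),
    ("system", "state"),
]
ranks = pagerank(links)
most_central = max(ranks, key=ranks.get)
print(most_central)
```

In this toy network, the term that receives links from many other terms obtains the highest score, which mirrors how central terms such as energy can emerge from the subject-object network.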

In RQ3 we applied the most central terms found in physics to other domains in order to examine differences in contexts across the disciplines. We found that all terms also occurred in the other domains (politics, biology, chemistry). The contexts of the terms (i.e., the words with which they appear in a sentence), however, varied considerably across the disciplines and matched expected terms in the respective disciplines. For example, the term “theory” in biology was related to evolution, whereas in chemistry it was related to orbitals and atoms. This finding hints at the challenges of meaning-making in different disciplines: the same terms are used in different contexts with similar, yet non-identical meanings. This creates the cognitive challenge for learners to always consider the context when encountering a specific term. In fact, effects of framing and context for reasoning have been investigated in science education research (Palmer, 1997).

10 Limitations

Our study has several limitations that restrict the generalizability of our findings. We submit that concepts are oftentimes captured in a single term; however, there are important concepts that are represented by more than one word. For example, in German, “Freier Fall” (free fall) refers to the free fall of a moving body. Free fall is a specific physics concept, where assumptions such as no friction are made. These cases are not necessarily captured in our approach. Identifying bigrams or including the entire noun chunk of the subject could enhance this analysis. We also could not verify whether there are representational biases in the Wikipedia articles. The marked differences between English and German Wikipedia articles might be attributable to the fact that the German language more uniquely captures concepts in compound nouns, which account for a greater number of nodes and, potentially, less dense networks.

Our analyses focused on linguistic properties of terms in science language. However, cognition, expertise, and learning in science are in large part visual. Allegedly, Kekulé discovered the structure of benzene by visual means. Forming a visual representation of a science problem is an indicator of science expertise (Singh, 2008). Hence, the integration of visual and language-related modes of representation would be crucial to fully understand the development of expertise in science. This interconnection of visual experience and language development in forming physics intuitions and conceptions is an important future direction for science education research.

Finally, we can only hypothesize that this scale-free behavior and the central terms will emerge similarly in non-western languages. We restricted our analyses to two western languages with well-developed and curated Wikipedia databases for the natural sciences. However, these languages cover only a fraction of the existing languages and speakers. It would certainly be worthwhile to replicate these analyses in other languages. All described libraries and computational tools allow for these kinds of analyses in non-western languages as well.

11 Conclusions

Applications of network analyses have been found useful in many domains such as education, where large unstructured datasets could be systematically analyzed to identify active learning and knowledge acquisition (Grunspan et al., 2014; Brunn & Brewe, 2013). With the help of natural language processing, in particular, these network analyses can be extended to systematically analyze large corpora of language data. Our approach showed how term-based networks can be extracted from large text corpora. Thus, we provide a template for knowledge-oriented networks that are expected to enhance social network analysis (Brass, 2022). Moreover, social network analysis in conjunction with language networks captures essential aspects of Vygotsky’s sociocultural theory of cognitive development insofar as it makes it possible to account for the complex interdependencies of socially mediated knowledge and language acquisition. The language networks particularly capture aspects of the interrelated knowledge in science. This can enhance curriculum design (Agrawal et al., 2016) and diagnostics of beliefs, values, and knowledge (Wulff et al., 2022).

Our analyses also indicate that complex and dynamic systems analyses can play an important role in the methods portfolio of education researchers (Brass, 2022; Hilpert & Marchand, 2018). Educational research related to learning and language can focus on micro, meso, or macro processes, e.g., individual learning, group learning in classes, or learning as a cultural phenomenon on the societal level. In any layer, learning appears to be complex and interrelated with the other layers. For example, learning on the individual level is impacted by societal discourses, but also by individual cognitions such as beliefs and values. It is inherently difficult to extract relationships, laws, and even theories under these circumstances (Halevy et al., 2009). However, complex and dynamic systems analyses provide a means to extract underlying relationships and laws. This has been documented extensively for complex systems in the natural world. Crickets, birds, cells, or even human crowds were shown to behave like complex systems (Strogatz, 2003). Even though their behavior appears to be chaotic, complexity science revealed that simple laws govern these natural systems and give rise to complex behaviors (West, 2017; Strogatz, 2003; Brockmann, 2021; Wolfram, 2002). Our analyses showed that simple laws such as the power law underlie the science-term networks extracted from Wikipedia. These laws allow educational researchers to delineate normative distributions and patterns of language use. Experts’ language can be hypothesized to approximate these normative distributions of terms, given the exposure of experts to science-related language over extended periods of time (Ericsson, 1998). Complex systems analyses can facilitate the detection of outliers, i.e., novices who potentially lack important terms in the distributions. Such linguistic modules could be implemented in intelligent tutoring systems (Graesser et al., 2004), which are language-bound in large part and need some form of evidence to diagnose the competences of the tutees.