
Databases - History & Early Development

Annotated Bibliography of the History of Databases

The purpose of this Annotated Bibliography is to illustrate the development of database models from the 1st generation (the network model etc.) to the 2nd generation (relational) to the 3rd generation (e.g. semantic). The Bibliography is presented in chronological order of publication.

1. Bachman, C.W. 1969. Data structure diagrams. ACM SIGMIS Database, 1(2): 4 – 10    

Charles Bachman, a researcher at General Electric in Phoenix, Arizona, introduces Data Structure Diagrams for the first time: a graphical notation for illustrating data models. To lay the foundation of data structure theory, the author first defines the terms 'entity', 'entity class', 'entity set', and 'set class' and gives an example of employees and departments where each department is the 'owner' of a set of employee 'members', and, independently, the employees are 'owners' of their spouses and children. A department, an employee member, and his spouse and child are three separate entity classes. A collection of member entity classes together with their single owner entity class is called a 'set class'. The concept of a one-to-many owner-to-member ratio forms the basis of his theory. Data structure diagrams use blocks, which represent entity classes, and arrows, which represent set classes and the relationship between owner and member. The author illustrates how data structure diagrams can unambiguously define hierarchies, trees, networks and other structures, and how well suited they are to describing physical databases. Data structure diagrams are still used today to model representations of databases at various levels of abstraction, and it was Bachman's theory that laid the foundation for this.
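
To make the owner/member idea concrete, here is a minimal sketch in modern Python (with hypothetical names and data, not taken from Bachman's paper) of the one-to-many set classes his diagrams depict:

    from dataclasses import dataclass, field

    @dataclass
    class Dependent:              # entity class: a spouse or child
        name: str

    @dataclass
    class Employee:               # entity class: member of DEPT-EMP,
        name: str                 # independently owner of EMP-DEPENDENT
        dependents: list[Dependent] = field(default_factory=list)

    @dataclass
    class Department:             # entity class: owner of the DEPT-EMP set
        name: str
        employees: list[Employee] = field(default_factory=list)

    # One owner, many members: the one-to-many ratio at the heart of the model.
    sales = Department("Sales")
    alice = Employee("Alice", dependents=[Dependent("Bob")])
    sales.employees.append(alice)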
  

2. Codd, E.F. 1970. A relational model of data for large shared data banks. Communications of the ACM, 13(6): 377 – 387.  

In this groundbreaking paper, Codd, an IBM researcher, introduces the theory that information systems should be data independent, i.e. that the user of a database system should not be concerned with the internal workings of the system. Codd proposes that users should be abstracted from the internal representation of the data, such that if the internal representation of that data were to change (e.g. because of system growth), the way that the user perceives the data should remain unchanged. To this end, the author proposes a 'relational view' of data, rather than the pre-existing graph or network views. The relational model represents data in its "natural structure" so that it is not specific to any particular data system. In describing data systems in terms of set theory, Codd explains that users should only interact with a collection of time-varying 'relations', i.e. a user should only know the name of the 'relation' together with its 'domain' (e.g. department is the domain of employees, and the employees are owned by the department). Codd defines a number of terms and their properties ('active domain', 'primary key', 'foreign key' and 'nonsimple domain') and uses them to give a step-by-step description of how to convert the relational model to 'normal form'. Normal form simplifies the way the names of data items are presented to the user. Overall, Codd describes only the theory of the relational model; he does not describe how to implement it in a particular language or system. His theory nevertheless marked the shift of data models from the 1st generation (e.g. network models) to the 2nd.
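
As a rough illustration (hypothetical data, with relations modelled as Python sets of tuples), the following shows the kind of normalization step Codd describes: a 'nonsimple' domain, whose values are themselves relations, is flattened into a separate relation linked by a primary/foreign key pair:

    # Unnormalized: the children domain is nonsimple (relation-valued).
    employee_unnormalized = {
        ("E1", "Alice", (("Bob", 4), ("Carol", 7))),
        ("E2", "Dan", ()),
    }

    # Normal form: every domain is atomic. 'emp_id' is the primary key of
    # employee and appears in children as a foreign key.
    employee = {("E1", "Alice"), ("E2", "Dan")}
    children = {("E1", "Bob", 4), ("E1", "Carol", 7)}

    # Users query by relation name and domain values, never by storage layout.
    alices_children = {(c, a) for (emp_id, c, a) in children if emp_id == "E1"}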
 

3. Metaxides, A., Helgeson, W.B., Seth, R.E., Bryson, G.C., Coane, M.A., Dodd, G.G., Earnest, C.P., Engles, R.W., Harper, L.N., Hartley, P.A., Hopkin, D.J., Joyce, J.D., Knapp, S.C., Lucking, J.R., Muro, J.M., Persily, M.P., Ramm, M.A., Russell, J.F., Schubert, R.F., Sidlo, J.R., Smith, M.M. & Werner, G.T. 1971. Data base task group report to the CODASYL programming language committee

The DBTG report of 1971 consisted of five main sections, covering a wide range of topics, including "a proposal for a data description language for describing a database, a data description language for describing that part of the data base known to a program, and a data manipulation language". A data description language describes the part of a database of interest to a particular program; it is concerned with the names and descriptions of data items and how they relate to data aggregates, record areas and sets, all terms that are clearly defined in the report. The data manipulation language "is the language which the programmer uses to cause data to be transferred between his program and the database". The first-generation, network-based data model was proposed and standardized in this DBTG report.
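
As a loose sketch (plain Python, emphatically not the report's actual CODASYL syntax, with invented names), the split between describing data and manipulating it might look like this:

    # "Data description": record types and the sets that connect them.
    schema = {
        "records": {"DEPARTMENT": ["DEPT-NAME"], "EMPLOYEE": ["EMP-NAME"]},
        "sets": {"DEPT-EMP": {"owner": "DEPARTMENT", "member": "EMPLOYEE"}},
    }

    # "Data manipulation": the program navigates set occurrences record by
    # record, transferring data between the database and the program.
    database = {"DEPT-EMP": {"Sales": ["Alice", "Dan"]}}

    def find_members(set_name, owner_key):
        """Return the member records of one set occurrence, given its owner."""
        return database[set_name].get(owner_key, [])

    print(find_members("DEPT-EMP", "Sales"))  # -> ['Alice', 'Dan']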

4. Codd, E.F. 1972. Relational completeness of data base sublanguages. Courant Computer Science Symposium 6

Here again Codd, one of the masters of database design theory, proposes a theoretical foundation for further research on database languages. The author sets out to illustrate how to determine whether a database sublanguage is complete, independently of its host language; such sublanguages, he suspects, will be developed for updating and interacting with future databases. The author also weighs the pros and cons of calculus-based languages against algebra-based languages and concludes that using calculus as a base for database languages is more beneficial. Nowadays, database languages such as SQL are of paramount importance in database systems, and I would suspect that Codd's theory contributed to the development of such languages.
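
The contrast Codd draws can be sketched in Python (hypothetical relation; the paper itself works with formal algebra and calculus, not code): the algebraic style composes explicit operators, while the calculus style states a predicate the answer must satisfy:

    employees = {("Alice", "Sales", 50), ("Dan", "Sales", 40), ("Eve", "HR", 45)}

    # Algebra-like: build the answer by composing explicit operators.
    def select(rel, pred):
        return {t for t in rel if pred(t)}

    def project(rel, cols):
        return {tuple(t[c] for c in cols) for t in rel}

    answer_algebra = project(select(employees, lambda t: t[1] == "Sales"), [0])

    # Calculus-like: state a predicate the answer tuples must satisfy.
    answer_calculus = {(name,) for (name, dept, _pay) in employees if dept == "Sales"}

    assert answer_algebra == answer_calculus  # same answer, different styles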
  

5. Bachman, C.W. 1975. The data structure set model. Proceedings of the ACM SIGMOD, Debate on Data Model: Data Structure Set Versus Relation  

In this paper, Charles Bachman, then a researcher at Honeywell, tries to unify the relational model of data and the data-structure-set model. Bachman begins by outlining the fundamentals of both models and shows how they are different yet congruent ways of modelling data. The author explains that the aim of all databases is to retrieve data in the best and most efficient way possible, and that both modelling techniques aim to "capture information about entities and model them in the abstract". The author demonstrates this by establishing a relationship between objects in a data-structure-set model and objects in a relational model. Whereas a data-structure-set has a collection (set) of records (called member records) together with an owner record, where at least one match key in each member record (a secondary key) 'matches' the primary key of its owner record, a relation has a domain (a set of values for a specific attribute) and roles (the ways in which attributes of that domain are used). Bachman concludes that domains and roles, on the one hand, and primary and secondary keys, on the other, are almost identical. The author establishes further theoretical observations before concluding that, while both are valid ways to abstract data, the relational model is a well-constructed theory lacking practical implementations, and the data-structure-set model the reverse. He concludes that, being more visual, more ergonomic and more widely applied, the data-structure-set model is a more 'natural' way of modelling data systems, although the relational model has the support of a larger body of theory.
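
A tiny sketch of that correspondence (invented data): the match key that ties a member record to its owner plays the same role as a foreign key referencing a relation's primary key:

    # Data-structure-set view: an owner record with linked member records.
    dept_owner = {"dept_id": "D1", "members": [{"emp": "Alice"}, {"emp": "Dan"}]}

    # Relational view: the same facts as two relations sharing a key.
    department = {("D1", "Sales")}
    employee = {("Alice", "D1"), ("Dan", "D1")}  # "D1": match key / foreign key

    # Both views answer "who works in D1?", by following the set or the key.
    via_set = [m["emp"] for m in dept_owner["members"]]
    via_key = [name for (name, dept_id) in employee if dept_id == "D1"]
    assert sorted(via_set) == sorted(via_key)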
 

6. Steel, T.B. Jr. 1975. Data base standardization. Proceedings of the 1975 ACM SIGMOD international conference on Management of data: 149 – 156

Steel, a researcher for the Equitable Life Assurance Society, outlines the main advances of the ANSI/X3/SPARC Study Group on Data Base Management Systems. The main result of this stage of the study group was a standard way of modelling the varying views/interfaces of data. The author presents a number of diagrams which represent different views of data and how data "passes" across them. One of the main outcomes of the study group was to crystallize existing elements of the two-schema approach ('internal' & 'external' schemas) and to explicitly define an intermediate schema (a 'conceptual' schema). A 'conceptual schema' is a tangible document, compiled and validated by an enterprise administrator, which contains definitions of all the data items of the project, the required data constraints, and the relationships between the data items. The internal schema is concerned with how the data is stored (e.g. hierarchically, relationally, etc.) and with the general internal workings of the DBMS. The 'external schema' (or any number of them) conveys how the application programmer views the data. As long as the 'conceptual schema' is thorough, it should be possible to derive the 'internal' and 'external' schemas from it. The author uses the term "symbolic abstraction" to describe the models of the different views of the data.
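
A toy sketch of the idea (hypothetical schemas in Python, not the study group's notation): external and internal schemas are both derived from a single conceptual schema:

    # One conceptual schema: all data items, constraints and relationships.
    conceptual = {"EMPLOYEE": {"emp_id": "int", "name": "str", "salary": "int"}}

    # One of possibly many external schemas: what one application sees.
    external_payroll = {"EMPLOYEE": ["emp_id", "salary"]}

    # Internal schema: storage decisions invisible to the external views.
    internal = {"EMPLOYEE": {"storage": "b-tree on emp_id", "page_size": 4096}}

    def external_view(record_type, row):
        """Project a conceptual record down to one external schema's fields."""
        return {f: row[f] for f in external_payroll[record_type]}

    print(external_view("EMPLOYEE", {"emp_id": 7, "name": "Alice", "salary": 50}))
    # -> {'emp_id': 7, 'salary': 50}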

7. Schmid, H.A. & Swenson, J.R. 1975. On the semantics of the relational data model. Proceedings of the 1975 ACM SIGMOD international conference on Management of data: 211 – 223

Whereas the 'relational model' was expounded on a mathematical basis by Codd in his 1970 paper, a "real world" interpretation of the model is presented and illustrated by Schmid & Swenson in this paper. While the relational theory of data models allows sound mathematical principles to be applied to databases, it is too formal for "real world" implementation. For example, when compressing a real-world concept into a mathematical model, it is unclear whether a collection of independent and separate objects are related to each other, or whether some of the objects exist only in conjunction with (to describe) other objects. These two situations are distinguished by the authors as 'associations' and 'characteristics'. A series of assumptions and definitions is established, according to which "a data base consists of complex independent object types, and associations among them" and a "complex independent object type is formed by a kernel that is an independent object type, and by all its characteristics, and the characteristics of these characteristics, and so on."
The graphical 'semantic' model is built on all of these concepts. It is then demonstrated how to apply normalization from a 'semantic' point of view. Conversion to first normal form separates descriptions of 'complex independent object types' from 'associations' and from other 'complex independent object types'. Conversion to third normal form splits up 'functionally associated independent data types'. A set of insertion-deletion rules is defined based on the semantic model (e.g. to ensure that independent objects that still participate in associations are not deleted). Finally, the authors suggest two ways in which the 'relational model' can be improved upon, based on the semantic model they have proposed.

The aim is to illustrate that some of the core concepts of database theory ('relation', 'normalization' & 'functional dependency') are too vague to define relations in the real world, and that relational model theory can in fact be applied to a concrete semantic data model. The authors illustrate that one of the examples presented in Codd's paper begs the question of what it actually means to collect attributes into a relation.
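
As a worked toy example of the third-normal-form split described above (invented data; the paper's own examples differ), a relation mixing two functional associations is decomposed into one relation per association:

    # One relation mixing two facts: employee -> dept and dept -> location.
    emp_dept_loc = {("Alice", "Sales", "Phoenix"), ("Dan", "Sales", "Phoenix")}

    # Third normal form: one relation per functional association.
    emp_dept = {(e, d) for (e, d, _l) in emp_dept_loc}
    dept_loc = {(d, l) for (_e, d, l) in emp_dept_loc}

    # The join of the pieces reconstructs the original relation losslessly.
    rejoined = {(e, d, l) for (e, d) in emp_dept for (d2, l) in dept_loc if d == d2}
    assert rejoined == emp_dept_loc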

8. Chamberlin, D.D. 1976. Relational Data-Base Management Systems. ACM Computing Surveys (CSUR), 8(1): 43 – 66

Chamberlin, a colleague of Codd and a researcher at IBM, discusses the basics of relational data models, the concept of 'normalization' and how the relational model evolved, and outlines some of the ways databases are being studied and efficiently implemented by various researchers in the field. Whereas contemporary DBMS systems rely on the user's knowledge of how data is actually stored in the system, the trend, as described by Codd, is for data access to be independent of how the data is stored internally in the system. Chamberlin begins by defining the term 'relation' from a mathematical point of view and explains that a data model can be viewed as simply the complete set of relations in a system, and that using n-ary relations for database management provides simplicity, data independence and symmetry. Certain configurations of relations are more efficient for updating the data than others, and normalization is needed to convert a relation into such a configuration. Chamberlin outlines the benefits of, and the steps needed in, converting relations to successive degrees of normalization. Finally, he expounds on the sound theoretical foundation of relational models and how the theory is being applied to languages (e.g. those based on first-order predicate calculus) and database systems.

9. Chen, P.P.S. 1976. The Entity-Relationship Model - Toward a Unified View of Data. ACM Transactions on Database Systems (TODS), 1(1): 9 – 36

The entity-relationship model is an important idea in the ancestry of database design, and is something that we have used in our assignments to capture data. The genius of Entity-Relationship Modelling can be attributed to Peter Pin-Shan Chen, then a researcher at the Massachusetts Institute of Technology. The entity-relationship model is a data model that incorporates some of the meaning and context of data, called semantics. Not only does he provide a description and theory of this model, but also a practical diagrammatic method for capturing data. As the title suggests, the Entity-Relationship Model provides a solid foundation for the unification of many views of data, viz. the network view, the relational view and the entity set view. I have discussed how this unification of data views was expounded in some later papers and facilitated the evolution of data models (views) to more advanced generations. In terms of representing data in a natural way by separating entities from relationships, the network model (1st generation) was great, but it failed miserably in terms of data independence (separating data from how it is stored). The relational model, as proposed by Codd, made some great improvements, but failed at representing the real world. And the entity set view of data lacks the intuitiveness needed in a data model. If there was a way to reconcile these three views of data, it was achieved by Chen in this paper.
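
A minimal sketch of Chen's building blocks (hypothetical schema; Chen's paper uses diagrams and tables, not code): entity sets, a relationship set connecting them, and attributes on both:

    # Entity sets with their attributes.
    entity_sets = {
        "EMPLOYEE": ["emp_id", "name"],
        "PROJECT": ["proj_id", "title"],
    }

    # A relationship set connects entity sets and may carry its own attributes.
    relationship_sets = {
        "WORKS-ON": {
            "connects": ("EMPLOYEE", "PROJECT"),
            "attributes": ["hours"],
            "cardinality": "many-to-many",
        },
    }

    # One relationship instance: an (employee, project) pair plus a value.
    works_on = {("E1", "P9", 12)}  # employee E1 spends 12 hours on project P9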
 

10. Codd, E.F. 1979. Extending the relational database model to capture more meaning. ACM Transactions on Database Systems (TODS), 4(4): 397 – 434

Interest in semantic models of data (which aim to capture the context and meaning of data) had been growing in the years since E.F. Codd introduced his groundbreaking paper on relational databases. Almost ten years after "A relational model of data for large shared data banks", Codd proposes to extend the relational database model to capture more meaning. He builds his theory on several of those semantic ideas, proposing new rules for insertion, update and deletion, and introducing new, more powerful algebraic operators.
 

11. Hammer, M. & McLeod, D. 1981. Database description with SDM: a semantic database model. ACM Transactions on Database Systems (TODS), 6(3): 351 – 386

In this paper, M. Hammer of the Massachusetts Institute of Technology and D. McLeod of the University of Southern California pursue the study of semantic databases as a means to capture more of the meaning and context of data than is possible in a relational database. They use SDM, a "high-level semantics-based database description and structuring formalism for databases". The authors summarize how SDM succeeds over conventional database models:
"In brief, these conventional database models are too oriented toward computer data structures to allow for the natural expression of application semantics. SDM, on the other hand, is based on the high-level concepts of entities, attributes, and classes."
The authors discuss not only the theory and design principles of SDM, but also the actual syntax and specifications of the language. SDM, the authors propose, can convey a more realistic view of the world. This paper is one of a number of interesting and important papers from that era which strive towards a more natural representation of data, and it could be thought of as part of the evolution from 2nd to 3rd generation database systems.
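
In the loose spirit of those concepts (plain Python with invented data, emphatically not SDM's actual syntax), classes and subclasses are defined by meaning rather than by storage structure:

    # A class of entities described by meaning, not by computer data structures.
    ships = [
        {"name": "Aurora", "type": "tanker", "tonnage": 90_000},
        {"name": "Meridian", "type": "ferry", "tonnage": 8_000},
    ]

    # A subclass in the SDM sense: membership follows from a predicate on
    # attributes, so the schema captures application semantics directly.
    def subclass(base, predicate):
        return [entity for entity in base if predicate(entity)]

    tankers = subclass(ships, lambda s: s["type"] == "tanker")
    print([s["name"] for s in tankers])  # -> ['Aurora']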
 

12. Codd, E.F. 1982. Relational database: a practical foundation for productivity. Communications of the ACM, 25(2): 109 – 117

The 1981 ACM Turing Award was presented to Edgar F. Codd on November 9, 1981 at the ACM Annual Conference in Los Angeles, California. This paper is based on the lecture Codd gave at the award ceremony. I like it because it presents Codd's thoughts on how database technology had progressed up to that point (viz. data independence, structural simplicity and relational processing ability) and where it could be improved upon. Codd discusses a number of benchmarks which show the success of the relational database. He also alludes to the title of the speech, in that relational theory offers only "a practical foundation for productivity" rather than a total solution for productivity. Whereas relational database theory deals with the "shared data component of application programs and end-user interactions", it falls short in other areas. He proposes that the failings of relational databases could be rectified by using complementary technologies and programming languages.
  

13. Lien, Y.E. 1982. On the Equivalence of Database Models. Journal of the ACM (JACM), 29(2): 333 - 362

This paper is a historical document which illustrates the transformation between first-generation and second-generation data models. The author, from Bell Laboratories, establishes an equivalence between network databases (the 1st generation, proposed by Bachman) and a subclass of relational databases (the 2nd generation, proposed by Codd). By doing this, the author intends to further the field of database design algorithms. He concludes that his results are useful for designing a multimodel DBMS (a hybrid of both networks and relations), for converting a network database to a relational one, for comparing various DBMSs, and for furthering understanding in the field of database design in general.
     

14. Brodie, M.L. & Schmidt, J.W. 1982. Final report of the ANSI/X3/SPARC DBS-SG relational database task group. ACM SIGMOD Record, 12(4): 1 – 62

In 1975, ANSI produced an interim report called the "Interim Report: ANSI/X3/SPARC Study Group on Data Base Management Systems". Data models (or data schemas) are used to illustrate the requirements, specification or functioning of a database at different levels of abstraction. The ANSI/SPARC architecture is based on a three-level model which conveys data at increasingly lower levels of abstraction: the external level, the conceptual level and the internal level. These three schemas correspond to the different steps of database design: the conceptual schema can be thought of as the end product of logical design; an external schema as the result of view design; and the internal schema as the end product of physical design. The various schemas represent different levels of abstraction and have different functions.
 
Codd, in his paper "A relational model of data for large shared data banks", introduced a similar idea (see: Codd, 1970). Codd proposed that users should be abstracted from the internal representation of the data, such that if the internal representation of that data were to change (e.g. because of system growth or a disk change), the way that the user perceives the data should remain unchanged. This is similar to the internal/external representations of data. But whereas Codd discussed the theory of relational database models in mathematical language, the ANSI/X3/SPARC study group expounded on the practical considerations and implementation (security etc.) of such a system. The ANSI/X3/SPARC DBS-SG relational database task group came up with a solid, industry-wide standard for relational databases.
This report had three outcomes which were paramount in the development of database technology:
1. Identification of the fundamental concepts of the Relational Data Model.
2. Characterization of the features of existing and potential RDBMSs to determine the interface functions.
3. Investigation into the role of the RDM and RDBMS in a DBMS architectural framework such as the ANSI/X3/SPARC prototypical architecture, and in a coherent family of DBMS standards.

15. Teorey, T.J., Yang, D. & Fry, J.P. 1986. A logical design methodology for relational databases using the extended entity-relationship model. ACM Computing Surveys (CSUR), 18(2): 197 – 222

In this paper, the authors propose a systematic methodology for relational database design. The first step is to compile a conceptual design based on the extended ER model and then to apply further extensions to the ER model, such as "ternary relationships, optional relationships and the generalization abstraction". The authors develop a set of transformations for entity relations, extended entity relations and relationship relations. The aim of this methodology is to produce databases that can be easily modified for future processing requirements, and that are accurate representations of the real world. The systematic strategy of separately modelling the basic data relationships and then applying techniques such as normalization is something that the authors suggest would be conducive to creating a database design software tool, and it is also similar to the methodology we used in our coursework.
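
A condensed sketch of the flavour of such transformations (hypothetical schema and helper, far simpler than the paper's transformation rules): each entity becomes a relation, and a many-to-many relationship becomes a relation keyed by the two entity keys:

    er_schema = {
        "entities": {"EMPLOYEE": ["emp_id"], "PROJECT": ["proj_id"]},
        "relationships": {"WORKS-ON": ("EMPLOYEE", "PROJECT", "many-to-many")},
    }

    def to_relations(er):
        """Map an ER description onto relation schemas (keys only)."""
        relations = {name: attrs[:] for name, attrs in er["entities"].items()}
        for name, (left, right, cardinality) in er["relationships"].items():
            if cardinality == "many-to-many":
                # Relationship relation: the pair of entity keys is its key.
                relations[name] = er["entities"][left] + er["entities"][right]
        return relations

    print(to_relations(er_schema))
    # -> {'EMPLOYEE': ['emp_id'], 'PROJECT': ['proj_id'],
    #     'WORKS-ON': ['emp_id', 'proj_id']}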
 
