
European Literature and Project Gutenberg

By Frank Boumphrey - July 2000

Frank Boumphrey reports on Gutenberg at the HTML Writers Guild (HWG). 'Project Gutenberg' is an initiative started by Michael Hart. Its mission was to convert the world's great literature to ASCII etexts. To date this has produced over 4000 texts, many of them of high quality. These texts are in English, although a large number of them are English translations of classic European books. Gutenberg at HWG is an initiative sponsored by the HTML Writers Guild to convert these, and other suitable etexts, to XML. Because XML has full support for Unicode, this initiative encourages the transcription of documents in their native language and character sets.

- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -

Introduction

'Gutenberg at HWG' aims to make the world's great literature available to the general public in easily accessible, easily readable electronic formats [1].

Background

There are numerous sources of electronic literature available, probably the best known of which are the texts transcribed by "Project Gutenberg".

Project Gutenberg was founded in 1971 by Michael Hart. By pure serendipity Michael was given a windfall of several megabytes of storage space (worth millions of dollars in those far-off days!), and he decided to use this windfall to store the world's great literature. His criteria were (and remain):

  1. The text must be copyright free
  2. There must be a Latin character version available.

For the most part, Gutenberg texts are rendered in 7-bit ASCII (128 characters), with 65-70 characters to a line, and paragraphs are delineated by double line breaks. A Gutenberg text can also draw on multiple source documents. To date over 4000 books and documents have been transcribed in this format, and these are freely available from the "Project Gutenberg" web site. It so happens that well over three quarters of the texts available are European classics, either English or English translations.
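The plain-text layout described above (hard-wrapped lines, with blank lines between paragraphs) is simple enough to recover programmatically. The following is a minimal Python sketch, not a Project Gutenberg tool: the function name `split_paragraphs` and the sample text are invented for illustration, and real etexts also carry headers and other conventions that this sketch ignores.

```python
import re


def split_paragraphs(etext):
    """Split a hard-wrapped plain-text etext into a list of paragraphs."""
    # Normalise Windows and old Mac line endings to "\n".
    normalised = etext.replace("\r\n", "\n").replace("\r", "\n")
    # One or more blank (or whitespace-only) lines separate paragraphs.
    blocks = re.split(r"\n(?:[ \t]*\n)+", normalised.strip())
    paragraphs = []
    for block in blocks:
        # Re-join the hand-wrapped lines of a paragraph into one line.
        lines = [line.strip() for line in block.splitlines() if line.strip()]
        if lines:
            paragraphs.append(" ".join(lines))
    return paragraphs


# Hypothetical sample, laid out the way the article describes.
sample = (
    "This is the first paragraph of a\n"
    "hypothetical etext, wrapped by hand.\n"
    "\n"
    "This is the second paragraph.\n"
)

for number, paragraph in enumerate(split_paragraphs(sample), start=1):
    print(number, paragraph)
```

Paragraph boundaries are about the only structure such a file records reliably, so everything beyond them (chapters, headings, verse) has to be added by a human marker.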

Other notable efforts to make available texts in electronic form include texts marked up in SGML (mainly using the TEI DTD). Both Oxford University in the U.K. [2] and the University of Virginia in the USA have large repositories of literature available, both as SGML and as ASCII text.

There are many other scattered efforts, and there is indeed a great need for a source to co-ordinate and centralize all these efforts, so that they become truly available to everyone.

It became obvious to us that if texts were marked up in a format that could be easily read on most desktop computers then they would become more usable and thus functionally more available. The obvious choice of markup language to use was XML, which had several advantages over HTML. Our plan is to use our membership to mark up existing documents.

The mission of 'Gutenberg at HWG' is to mark up e-texts in XML or some other non-binary format in order that the content and structure of the world's great literature, including literature written in non-ASCII character sets, is not only preserved, but is also functionally accessible for perusal, research and conversion to other media formats.

The HTML Writers Guild is a not-for-profit educational foundation with over 130,000 members, including several thousand in Europe. Although most of the available texts are English, or English translations, we would like to expand our activity to non-English languages. We already have a cadre of volunteers who are prepared to mark up books and documents written in Italian, German, French, Spanish, and Polish.

Why Bother with E-texts

Before looking at the pros and cons of various e-text efforts, we would do well to answer the question, "Why bother?"

By and large two negative attitudes prevail.

  1. The first, which I call the "complacent attitude", goes along the lines of "All these books are available for the asking at any decent library, so why not just go and look up the texts there? Why the need for electronic versions?"
  2. The second, which I call the "elitist attitude", goes "If they want to research texts, or mark them up or whatever, why don't they learn SGML and TEI, invest in a few good SGML tools, and access or add to the texts which are already available?"

The fact is, as anyone with teenagers understands, that we are living in an electronic, 'instant gratification' age, and newcomers to literature are just not going to do either of the above. Even the texts set out in school curriculums for the most part do not get read; the 'scholars' usually employ some 'Hamlet in a Nutshell' kind of series to get their work done. The vast majority of students in developed countries are not going to go out of their way to find the classics of European literature. They are going to stumble across them, and then, if they are lucky, the magic may start eating into their soul!

If we are honest, most of us came to an appreciation of literature that way! I personally became fascinated by poetry when I was forced to learn Walter de la Mare's "The Traveler" by heart as a punishment! I was lucky, and had access to a good library in order to pursue my new-found love. Many are not so lucky. For them etexts are the answer.

It is only those already converted to the merits of classic literature who will find the time and the energy to go out of their way to access good literature.

There is indeed a need to make good literature available and easily accessible to everyone.

Drawbacks of Various e-formats

As mentioned previously, there already exist etexts in various formats, so we should ask the question: "What problems exist with the existing available literature, available in ASCII text, SGML, and other formats, and how does XML overcome these problems?"

ASCII texts

The ASCII texts of the Project Gutenberg books are readable, and are accessible to anyone with a text reader. However, ASCII text as a format for reading books has numerous drawbacks.

  1. The Presentation is Bland and difficult to read on screen.
  2. It is Anglocentric.
  3. It has Unstructured content.
  4. The markup (or lack thereof) is Non-descriptive of content.
  5. There are Accessibility issues.

The Presentation is bland and difficult to read on screen.

The problems of reading on the screen are well known. By and large the reading rate is about half of that for the printed word, and the retention rate is about a third. This is due to several factors. Poor resolution makes serifed type (the preferred style for reading) difficult to read; the postcard format of the screen makes orientation difficult; and scrolling means that the reader's eyes constantly have to relock on a point of reference. All these problems are compounded by the use of unstyled ASCII text. Styling using, say, XML and a style sheet can alleviate, if not completely overcome, many of these problems. Hopefully in the near future custom-designed readers will further alleviate screen reading problems so that reading on a screen becomes almost as easy as reading from the printed page. Although it is possible to print out a styled document, this is wasteful of environmental resources; work is, however, proceeding on developing a re-usable paper for these purposes.

It is Anglocentric

ASCII text, especially as used in the Gutenberg project, supports only the English character set. Although it is possible to use the other Latin character sets with 8-bit extended ASCII (256 characters), the Cyrillic and Greek sets are not supported, not to mention scripts such as Arabic, Kanji, and Hebrew. This is certainly a big drawback of a plain-text system.
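The limitation can be checked directly. The Python sketch below tries to encode a few book titles with three codecs. It is an illustration only: `latin-1` stands in here for the 256-character extended set mentioned above, `utf-8` stands in for a Unicode encoding of the kind XML permits, and the helper name `can_encode` is invented.

```python
def can_encode(text, codec):
    """Return True if every character of text exists in the given codec."""
    try:
        text.encode(codec)
        return True
    except UnicodeEncodeError:
        return False


# Titles in their original languages.
titles = {
    "English": "Great Expectations",
    "French": "Les Misérables",
    "Russian": "Война и мир",
    "Greek": "Οδύσσεια",
}

for language, title in titles.items():
    results = {
        codec: can_encode(title, codec)
        for codec in ("ascii", "latin-1", "utf-8")
    }
    print(language, results)
```

Only the English title survives 7-bit ASCII, the French title additionally survives the 8-bit Latin set, and the Cyrillic and Greek titles need a Unicode encoding.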

It has Unstructured content

Although the text is human readable, there is no easy way for a computer to analyse and parse the content of the document. This is a big drawback if the document is to be used for research.

The markup (or lack thereof) is Non-descriptive of content

Even minimal markup of semantic or structural content of a document increases its value and usability.

There are Accessibility issues

Although text documents can be used by text readers, the lack of markup makes for a plain translation. Marking up a document in XML means that intelligent user agents are able to better render a document for the client.

A further disadvantage of ASCII is that it cannot capture any diagrams or art work that may accompany the text.

Disadvantages of SGML

SGML has several disadvantages for the markup of popular texts, both for the potential reader, and for those who would mark up the texts.

For the reader of a text, SGML parsers are not readily available, and they are expensive. An SGML parser/reader can cost nearly 2000 Euros! For the volunteer who is going to mark up the text, SGML has a steep learning curve, and it is not likely that a part-time marker would invest the necessary time and effort to learn it.

XML, on the other hand, is a powerful subset of SGML, with most of SGML's functionality and only a fraction of its difficulty. Furthermore, there is an abundance of XML tools available, and most of these are free!

Advantages of XML

Marking up a text in XML immediately provides numerous advantages which can be summarized as follows.

  1. Structure is described
  2. Content is described
  3. Ease of presentation
  4. Ease of conversion
  5. XML with its Unicode support allows alternate character sets.
  6. Added benefit to disadvantaged users

Structure and Content are described

Both the structure of the document and the nature of its content are easily described by using XML markup. This is of tremendous benefit both to serious researchers and to the casual reader.
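As a rough illustration of what described structure buys a researcher, the Python sketch below builds a tiny marked-up book and then queries it. The element names (`book`, `title`, `author`, `chapter`, `heading`, `para`) and the sample content are invented for this example; they are not the project's DTDs.

```python
import xml.etree.ElementTree as ET


def build_book(title, author, chapters):
    """Build a small XML tree; chapters is a list of (heading, paragraphs)."""
    book = ET.Element("book")
    ET.SubElement(book, "title").text = title
    ET.SubElement(book, "author").text = author
    for heading, paragraphs in chapters:
        chapter = ET.SubElement(book, "chapter")
        ET.SubElement(chapter, "heading").text = heading
        for text in paragraphs:
            ET.SubElement(chapter, "para").text = text
    return book


# Hypothetical content, used only to exercise the structure.
book = build_book(
    "A Hypothetical Novel",
    "A. N. Author",
    [
        ("Chapter 1", ["First paragraph.", "Second paragraph."]),
        ("Chapter 2", ["Third paragraph."]),
    ],
)

# Serialise the tree to a string of XML.
xml_text = ET.tostring(book, encoding="unicode")

# Because the structure is explicit, a program can ask questions of it.
headings = [heading.text for heading in book.iter("heading")]
para_counts = [len(chapter.findall("para")) for chapter in book.findall("chapter")]

print(xml_text)
print(headings, para_counts)
```

Answering the same questions (which chapters exist, how long each is) from an unstructured ASCII file would require guessing at layout conventions.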

Ease of presentation

XML can be combined with a style sheet, and an attractive and readable display will result. IE 5, Netscape 6, and Opera 4 all have support for XML.

Ease of conversion

It is easy to convert an XML document from one document type to another. A simple XSL style sheet can convert one XML document type to another, including XHTML, which means that the documents and texts can be read in down-level browsers as well as the most recent ones.
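The kind of conversion meant here can be sketched as follows. Note the substitution: this example does not use XSL. It performs the same sort of element-to-element mapping in plain Python with the standard library, because that keeps the sketch self-contained. The source element names (`book`, `title`, `chapter`, `heading`, `para`) and the function name `book_to_xhtml` are hypothetical.

```python
import xml.etree.ElementTree as ET

# A hypothetical source document in an invented book vocabulary.
SOURCE = """<book>
  <title>A Hypothetical Novel</title>
  <chapter>
    <heading>Chapter 1</heading>
    <para>First paragraph.</para>
    <para>Second paragraph.</para>
  </chapter>
</book>"""

XHTML_NS = "http://www.w3.org/1999/xhtml"


def book_to_xhtml(source):
    """Map the invented book vocabulary onto XHTML elements."""
    book = ET.fromstring(source)
    html = ET.Element("html", {"xmlns": XHTML_NS})
    head = ET.SubElement(html, "head")
    ET.SubElement(head, "title").text = book.findtext("title")
    body = ET.SubElement(html, "body")
    ET.SubElement(body, "h1").text = book.findtext("title")
    for chapter in book.findall("chapter"):
        # Each chapter heading becomes an h2, each para a p.
        ET.SubElement(body, "h2").text = chapter.findtext("heading")
        for para in chapter.findall("para"):
            ET.SubElement(body, "p").text = para.text
    return ET.tostring(html, encoding="unicode")


xhtml = book_to_xhtml(SOURCE)
print(xhtml)
```

An XSL style sheet would express the same mapping declaratively, as one template per source element.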

XML with its Unicode support allows alternate character sets

XML is built on Unicode, whose wide (two-byte, 16-bit) characters mean that it can be used with virtually every character set in the world. This means that European documents can be displayed with their correct characters and nuances.

Added benefit to disadvantaged users

XML markup is a boon to sight-impaired users. It allows them to easily bookmark and 'browse' through a document.

Gutenberg at HWG Time Line

We are just starting out with Project Gutenberg, and have just got beyond the 'Proof of Concept' stage. We already have an impressive cadre of volunteers, have produced several tutorials, and have marked up about 250 texts in XML or XHTML [1].

Our immediate goals include the following:

Organize a series of suitable XML document types.

Our initial concept was to use TEI-XML for markup, but this proved to be too complex for most of our volunteers. We have therefore developed some interim DTDs, and plan to develop a series of DTDs specifically for this project. We have also developed a series of tutorials for a simplified TEI subset that we are encouraging our markers to use.

Develop a Dublin Core RDF module to describe document content.

Murray Altheim of Sun Microsystems is spearheading this effort. This really needs to be done in collaboration with other existing library groups.
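To give a sense of what a Dublin Core description in RDF can look like, the Python sketch below generates a small record. It is only an illustration and does not represent the module under development: the two namespace URIs are the standard RDF and Dublin Core element namespaces, while the function name `describe_etext` and the `example.org` address are invented.

```python
import xml.etree.ElementTree as ET

# Standard namespace URIs for RDF and the Dublin Core element set.
RDF = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
DC = "http://purl.org/dc/elements/1.1/"
ET.register_namespace("rdf", RDF)
ET.register_namespace("dc", DC)


def describe_etext(url, title, creator, language):
    """Build an RDF description of one etext using Dublin Core elements."""
    root = ET.Element("{%s}RDF" % RDF)
    description = ET.SubElement(
        root, "{%s}Description" % RDF, {"{%s}about" % RDF: url}
    )
    ET.SubElement(description, "{%s}title" % DC).text = title
    ET.SubElement(description, "{%s}creator" % DC).text = creator
    ET.SubElement(description, "{%s}language" % DC).text = language
    return root


# The address below is a placeholder, not a real project URL.
record = describe_etext(
    "http://example.org/etexts/faust.xml",
    "Faust",
    "Johann Wolfgang von Goethe",
    "de",
)
rdf_text = ET.tostring(record, encoding="unicode")
print(rdf_text)
```

A record like this travels alongside the text, so library catalogues can index a work without parsing its full markup.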

Collaborate with other organisations with compatible mission statements.

We do not consider ourselves to be in competition with anyone. We are particularly interested in collaborating with any organization with a similar mission statement to ours. In particular we are interested in collaboration with any national group that is interested in transcribing and marking up its heritage literature.

Organize volunteers and an advisory board.

Volunteers are the lifeblood of our organization. We are looking for volunteer markers as well as those expert in the library sciences for our advisory group. We are also looking for persons with administrative skills (perhaps a retired person?) who can commit to a half day's work a week.

We are particularly interested in those who have the skills to mark up non-English documents. If anyone is interested in volunteering, please e-mail Frank Boumphrey at Frank@hwg.org and put "Project Gutenberg Volunteer" in the subject line.

Develop a strong infrastructure and find suitable document storage space.

We have almost finished setting up a dedicated server for Gutenberg at HWG. We are also interested in being mirrored on other sites, and will mirror similar endeavors on our site. Of course we wish to maintain a series of links on our Web pages to other sites specializing in e-commerce.

Ensure continuity.

Too often an organization tends to center around the personality of its founding members, and when they leave the organization tends to wither. Making European and other classics available in electronic forms is for the benefit of future generations as well as our own, so we are particularly interested in producing a viable, forward-looking board, and strong collaboration with other groups to make sure that what has been started will continue.

Conclusion

Gutenberg at HWG is an exciting project that is just getting under way. We would certainly like to co-operate with the European Cultural heritage program in every way that we can, and we are particularly keen to hear from volunteers who would like to participate in this project.

References

  1. The provisional Project Gutenberg Web site
    <http://www.hwg.org/opcenter/gutenberg/>
    The permanent Project Gutenberg Web site (coming in July or August 2000)
    <http://gutenberg.hwg.org>
  2. Oxford Text Archive Web site
    <http://ota.ahds.ac.uk/>

- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -

Author Details

Frank Boumphrey

frank@hwg.org
<http://www.hwg.org/>

Frank Boumphrey is the Vice President of the HTML Writers Guild. He is Director of Gutenberg at HWG.

- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -

For citation purposes:
Boumphrey, F. "European Literature and Project Gutenberg", Cultivate Interactive, issue 1, 3 July 2000
URL: <http://www.cultivate-int.org/issue1/gutenberg/>