Library of Congress Workshop on Etexts - Part 12

HOCKEY remarked on several large projects, particularly in Europe, for the compilation of dictionaries, language studies, and language analysis, in which people have built up archives of text and have begun to recognize the need for an encoding format that will be reusable and multifunctional, that can be used not just to print the text, which may be assumed to be a byproduct of what one wants to do, but to structure it inside the computer so that it can be searched, built into a Hypertext system, etc.

+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++ WEIBEL * OCLC's approach to preparing electronic text: retroconversion, keying of texts, more automated ways of developing data * Project ADAPT and the CORE Project * Intelligent character recognition does not exist *

Advantages of SGML * Data should be free of procedural markup; descriptive markup strongly advocated * OCLC's interface illustrated *

Storage requirements and costs for putting a lot of information on line *

Stuart WEIBEL, senior research scientist, Online Computer Library Center, Inc. (OCLC), described OCLC's approach to preparing electronic text. He argued that the electronic world into which we are moving must accommodate not only the future but the past as well, and to some degree even the present. Thus, starting out at one end with retroconversion and keying of texts, one would like to move toward much more automated ways of developing data.

For example, Project ADAPT had to do with automatically converting document images into a structured document database with OCR text as indexing and also a little bit of automatic formatting and tagging of that text. The CORE project, hosted by Cornell University, Bellcore, OCLC, the American Chemical Society, and Chemical Abstracts, constitutes WEIBEL's principal concern at the moment. This project is an example of converting text for which one already has a machine-readable version into a format more suitable for electronic delivery and database searching.

(Since Michael LESK had previously described CORE, WEIBEL would say little concerning it.) Borrowing a chemical phrase, de novo synthesis, WEIBEL cited the Online Journal of Current Clinical Trials as an example of de novo electronic publishing, that is, a form in which the primary form of the information is electronic.

Project ADAPT, then, which OCLC completed a couple of years ago and in fact is about to resume, is a model in which one takes page images either in paper or microfilm and converts them automatically to a searchable electronic database, either on-line or local. The operating assumption is that accepting some blemishes in the data, especially for retroconversion of materials, will make it possible to accomplish more.

Not enough money is available to support perfect conversion.

WEIBEL related several steps taken to perform image preprocessing (processing on the image before performing optical character recognition), as well as image postprocessing. He denied the existence of intelligent character recognition and asserted that what is wanted is page recognition, which is a long way off. OCLC has experimented with merging of multiple optical character recognition systems that will reduce errors from an unacceptable rate of 5 characters out of every 1,000 to an unacceptable rate of 2 characters out of every 1,000, but it is not good enough. It will never be perfect.
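The summary does not say how OCLC merged the output of multiple recognition systems. As a minimal sketch of one possible approach, assuming the readings have already been aligned to equal length (a strong simplification) and using a simple per-character majority vote, the idea might look like the following. The function names and sample strings are hypothetical and are not drawn from Project ADAPT.

```python
from collections import Counter


def vote_characters(readings):
    """Merge equal-length OCR readings by per-position majority vote."""
    if not readings:
        return ""
    length = len(readings[0])
    if any(len(r) != length for r in readings):
        raise ValueError("this sketch assumes pre-aligned, equal-length readings")
    merged = []
    for column in zip(*readings):
        # Keep the character seen most often at this position;
        # ties fall to the first character encountered.
        merged.append(Counter(column).most_common(1)[0][0])
    return "".join(merged)


def errors_per_thousand(text, truth):
    """Substitution errors per 1,000 characters against a known-good text."""
    wrong = sum(1 for a, b in zip(text, truth) if a != b)
    return 1000 * wrong / len(truth)


# Invented sample data: three engines, each wrong in a different place.
truth = "intelligent character recognition"
readings = [
    "inte1ligent character recognition",
    "intelligent charactcr recognition",
    "intelligent character recogniti0n",
]
merged = vote_characters(readings)
print(merged)
print(errors_per_thousand(readings[0], truth), errors_per_thousand(merged, truth))
```

Voting helps only where the engines make different mistakes; where two of three agree on the same wrong character, the error survives, which is consistent with WEIBEL's point that merging lowers the error rate without eliminating it.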

Concerning the CORE Project, WEIBEL observed that Bellcore is taking the typography files, extracting the page images, and converting those typography files to SGML markup. LESK hands that data off to OCLC, which builds that data into a Newton database, the same system that underlies the on-line system in virtually all of the reference products at OCLC.

The long-term goal is to make the systems interoperable so that not just Bellcore's system and OCLC's system can access this data, but other systems can as well, and the key to that is the Z39.50 common command language and the full-text extension. Z39.50 is fine for MARC records, but is not enough to do it for full text (that is, make full texts interoperable).

WEIBEL next outlined the critical role of SGML for a variety of purposes, for example, as noted by HOCKEY, in the world of extremely large databases, using highly structured data to perform field searches.

WEIBEL argued that by building the structure of the data in (i.e., the structure of the data originally on a printed page), it becomes easy to look at a journal article even if one cannot read the characters and know where the title or author is, or what the sections of that document would be.

OCLC wants to make that structure explicit in the database, because it will be important for retrieval purposes.

The second big advantage of SGML is that it gives one the ability to build structure into the database that can be used for display purposes without contaminating the data with instructions about how to format things. The distinction lies between procedural markup, which tells one where to put dots on the page, and descriptive markup, which describes the elements of a document.

WEIBEL believes that there should be no procedural markup in the data at all, that the data should be completely unsullied by information about italics or boldness. That should be left up to the display device, whether that display device is a page printer or a screen display device.

By keeping one's database free of that kind of contamination, one can make decisions down the road, for example, reorganize the data in ways that are not cramped by built-in notions of what should be italic and what should be bold. WEIBEL strongly advocated descriptive markup. As an example, he illustrated the index structure in the CORE data. With subsequent illustrated examples of markup, WEIBEL acknowledged the common complaint that SGML is hard to read in its native form, although markup decreases considerably once one gets into the body. Without the markup, however, one would not have the structure in the data. One can pass markup through a LaTeX processor and convert it relatively easily to a printed version of the document.
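The separation WEIBEL described can be sketched in a few lines. In this illustration, which is not OCLC's system, the data records only what each element is, and two interchangeable style tables decide how it looks on a given device. The element names, the sample text, and the style tables are all hypothetical.

```python
# Descriptive markup: each piece of text is labeled by what it is,
# not by how it should look. (Element names and text are invented.)
document = [
    ("title", "An Example Article"),
    ("author", "A. N. Author"),
    ("para", "Structure is recorded once and styled later."),
]

# Presentation lives outside the data, one table per display device.
LATEX_STYLE = {
    "title": "\\section*{{{text}}}",
    "author": "\\textit{{{text}}}",
    "para": "{text}",
}
SCREEN_STYLE = {
    "title": "== {text} ==",
    "author": "by {text}",
    "para": "{text}",
}


def render(doc, style):
    """Apply a device-specific style table to descriptively marked-up data."""
    return "\n".join(style[element].format(text=text) for element, text in doc)


print(render(document, LATEX_STYLE))
print(render(document, SCREEN_STYLE))
```

Because italics appear only in the style table, a later decision to set authors in bold or to drop them from a screen view requires no change to the stored data.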

WEIBEL next illustrated an extremely cluttered screen dump of OCLC's system, in order to show as much as possible the inherent capability on the screen. (He noted parenthetically that he had become a supporter of X-Windows as a result of the progress of the CORE Project.) WEIBEL also illustrated the two major parts of the interface: 1) a control box that allows one to generate lists of items, which resembles a small table of contents based on key words one wishes to search, and 2) a document viewer, which is a separate process in and of itself. He demonstrated how to follow links through the electronic database simply by selecting the appropriate button and bringing them up. He also noted problems that remain to be accommodated in the interface (e.g., as pointed out by LESK, what happens when users do not click on the icon for the figure).

Given the constraints of time, WEIBEL omitted a large number of ancillary items in order to say a few words concerning storage requirements and what will be required to put a lot of things on line. Since it is extremely expensive to reconvert all of this data, especially if it is just in paper form (and even if it is in electronic form in typesetting tapes), he advocated building journals electronically from the start. In that case, if one only has text, graphics, and indexing (which is all that one needs with de novo electronic publishing, because there is no need to go back and look at bit-maps of pages), one can get 10,000 journals of full text, or almost 6 million pages per year. These pages can be put in approximately 135 gigabytes of storage, which is not all that much, WEIBEL said. For twenty years, something less than three terabytes would be required. WEIBEL calculated the costs of storing this information as follows: If a gigabyte costs approximately $1,000, then a terabyte costs approximately $1 million to buy in terms of hardware. One also needs a building to put it in and a staff like OCLC to handle that information.

So, to support a terabyte, multiply by five, which gives $5 million per year for a supported terabyte of data.
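WEIBEL's arithmetic can be checked with a short calculation. The sketch below uses only the figures quoted above and assumes decimal units (1,000 gigabytes per terabyte) and a round 6 million pages per year. The per-page figure is derived here for illustration and was not stated in the talk.

```python
# Figures quoted in the talk.
PAGES_PER_YEAR = 6_000_000   # "almost 6 million pages per year"
GB_PER_YEAR = 135            # storage for one year of 10,000 journals
YEARS = 20
COST_PER_GB = 1_000          # dollars, hardware only
SUPPORT_MULTIPLIER = 5       # building and staff, per the talk

# Assumption for this sketch: decimal units.
GB_PER_TB = 1_000
KB_PER_GB = 1_000_000

total_tb = GB_PER_YEAR * YEARS / GB_PER_TB                      # twenty-year archive
hardware_cost_per_tb = COST_PER_GB * GB_PER_TB                  # hardware alone
supported_cost_per_tb = hardware_cost_per_tb * SUPPORT_MULTIPLIER
kb_per_page = GB_PER_YEAR * KB_PER_GB / PAGES_PER_YEAR          # derived, not quoted

print(f"Twenty-year archive: {total_tb} TB")
print(f"Hardware per terabyte: ${hardware_cost_per_tb:,}")
print(f"Supported terabyte per year: ${supported_cost_per_tb:,}")
print(f"Implied storage per page: {kb_per_page} KB")
```

Twenty years at 135 gigabytes per year comes to 2.7 terabytes, matching the "something less than three terabytes" in the talk, and the implied budget of roughly 22.5 kilobytes per page is plausible only because no page bit-maps are stored.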

+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++ DISCUSSION * Tapes saved by ACS are the typography files originally supporting publication of the journal * Cost of building tagged text into the database *

During the question-and-answer period that followed WEIBEL's presentation, these clarifications emerged. The tapes saved by the American Chemical Society are the typography files that originally supported the publication of the journal. Although they are not tagged in SGML, they are tagged in very fine detail. Every single sentence is marked, all the registry numbers, all the publications issues, dates, and volumes. No cost figures on tagging material on a per-megabyte basis were available. Because ACS's typesetting system runs from tagged text, there is no extra cost per article. It was unknown what it costs ACS to keyboard the tagged text rather than just keyboard the text in the cheapest process. In other words, since one intends to publish things and will need to build tagged text into a typography system in any case, if one does that in such a way that it can drive not only typography but an electronic system (which is what ACS intends to do--move to SGML publishing), the marginal cost is zero. The marginal cost represents the cost of building tagged text into the database, which is small.

+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++ SPERBERG-McQUEEN * Distinction between texts and computers * Implications of recognizing that all representation is encoding * Dealing with complicated representations of text entails the need for a grammar of documents * Variety of forms of formal grammars * Text as a bit-mapped image does not represent a serious attempt to represent text in electronic form * SGML, the TEI, document-type declarations, and the reusability and longevity of data * TEI conformance explicitly allows extension or modification of the TEI tag set * Administrative background of the TEI * Several design goals for the TEI tag set * An absolutely fixed requirement of the TEI Guidelines * Challenges the TEI has attempted to face * Good texts not beyond economic feasibility * The issue of reproducibility or processability * The issue of images as simulacra for the text redux * One's model of text determines what one's software can do with a text and has economic consequences *

Prior to speaking about SGML and markup, Michael SPERBERG-McQUEEN, editor, Text Encoding Initiative (TEI), University of Illinois-Chicago, first drew a distinction between texts and computers: Texts are abstract cultural and linguistic objects while computers are complicated physical devices, he said. Abstract objects cannot be placed inside physical devices; with computers one can only represent text and act upon those representations.

The recognition that all representation is encoding, SPERBERG-McQUEEN argued, leads to the recognition of two things: 1) The topic description for this session is slightly misleading, because there can be no discussion of pros and cons of text-coding unless what one means is pros and cons of working with text with computers. 2) No text can be represented in a computer without some sort of encoding; images are one way of encoding text, ASCII is another, SGML yet another. There is no encoding without some information loss, that is, there is no perfect reproduction of a text that allows one to do away with the original. Thus, the question becomes, What is the most useful representation of text for a serious work?

This depends on what kind of serious work one is talking about.

The projects demonstrated the previous day all involved highly complex information and fairly complex manipulation of the textual material.

In order to use that complicated information, one has to calculate it slowly or manually and store the result. It needs to be stored, therefore, as part of one's representation of the text. Thus, one needs to store the structure in the text. To deal with complicated representations of text, one needs somehow to control the complexity of the representation of a text; that means one needs a way of finding out whether a document, or an electronic representation of a document, is legal or not; and that means one needs a grammar of documents.
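What it means to test whether a document is "legal" against a grammar can be shown with a toy example. The sketch below is not SGML and does not use any TEI declaration. It assumes a hypothetical grammar in which each element's permitted sequence of children is written as a regular expression over child names, and it checks a small nested document against that grammar.

```python
import re

# Hypothetical document grammar: for each structured element, a regular
# expression over the space-separated names of its children.
GRAMMAR = {
    "article": re.compile(r"title( author)+( section)+"),
    "section": re.compile(r"heading( para)+"),
}


def is_legal(node):
    """Return True if the (name, content) tree conforms to GRAMMAR."""
    name, content = node
    if isinstance(content, str):
        # Leaf elements hold text only, so the grammar must not
        # define a child sequence for them.
        return name not in GRAMMAR
    if name not in GRAMMAR:
        return False
    child_names = " ".join(child[0] for child in content)
    if not GRAMMAR[name].fullmatch(child_names):
        return False
    return all(is_legal(child) for child in content)


good = ("article", [
    ("title", "An Example Article"),
    ("author", "A. N. Author"),
    ("section", [("heading", "Introduction"), ("para", "Some text.")]),
])
bad = ("article", [
    ("author", "A. N. Author"),   # author before title violates the grammar
    ("title", "An Example Article"),
])

print(is_legal(good), is_legal(bad))
```

An SGML document-type declaration plays the role of GRAMMAR here, with content models richer than these regular expressions; the point of the sketch is only that legality becomes a mechanical check once the grammar is explicit.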

SPERBERG-McQUEEN discussed the variety of forms of formal grammars, implicit and explicit, as applied to text, and their capabilities. He argued that these grammars correspond to different models of text that different developers have. For example, one implicit model of the text is that there is no internal structure, but just one thing after another, a few characters and then perhaps a start-title command, and then a few more characters and an end-title command. SPERBERG-McQUEEN also distinguished several kinds of text that have a sort of hierarchical structure that is not very well defined, which, typically, corresponds to grammars that are not very well defined, as well as hierarchies that are very well defined (e.g., the Thesaurus Linguae Graecae) and extremely complicated things such as SGML, which handle strictly hierarchical data very nicely.

SPERBERG-McQUEEN conceded that one other model not illustrated on his two displays was the model of text as a bit-mapped image, an image of a page, and confessed to having been converted to a limited extent by the Workshop to the view that electronic images constitute a promising, probably superior alternative to microfilming. But he was not convinced that electronic images represent a serious attempt to represent text in electronic form. Many of their problems stem from the fact that they are not direct attempts to represent the text but attempts to represent the page, thus making them representations of representations.

In this situation of increasingly complicated textual information and the need to control that complexity in a useful way (which begs the question of the need for good textual grammars), one has the introduction of SGML.

With SGML, one can develop specific document-type declarations for specific text types or, as with the TEI, attempt to generate general document-type declarations that can handle all sorts of text.

The TEI is an attempt to develop formats for text representation that will ensure the kind of reusability and longevity of data discussed earlier.

It offers a way to stay alive in the state of permanent technological revolution.

It has been a continuing challenge in the TEI to create document grammars that do some work in controlling the complexity of the textual object while also allowing one to represent the real text that one will find.

Fundamental to the notion of the TEI is that TEI conformance allows one the ability to extend or modify the TEI tag set so that it fits the text that one is attempting to represent.

SPERBERG-McQUEEN next outlined the administrative background of the TEI.

The TEI is an international project to develop and disseminate guidelines for the encoding and interchange of machine-readable text. It is sponsored by the Association for Computers and the Humanities, the Association for Computational Linguistics, and the Association for Literary and Linguistic Computing. Representatives of numerous other professional societies sit on its advisory board. The TEI has a number of affiliated projects that have provided assistance by testing drafts of the guidelines.

Among the design goals for the TEI tag set, the scheme first of all must meet the needs of research, because the TEI came out of the research community, which did not feel adequately served by existing tag sets.

The tag set must be extensive as well as compatible with existing and emerging standards. In 1990, version 1.0 of the Guidelines was released (SPERBERG-McQUEEN illustrated their contents).

SPERBERG-McQUEEN noted that one problem besetting electronic text has been the lack of adequate internal or external documentation for many existing electronic texts. The TEI guidelines as currently formulated contain few fixed requirements, but one of them is this: There must always be a document header, an in-file SGML tag that provides 1) a bibliographic description of the electronic object one is talking about (that is, who included it, when, what for, and under which title); and 2) the copy text from which it was derived, if any. If there was no copy text or if the copy text is unknown, then one states as much.

Version 2.0 of the Guidelines was scheduled to be completed in fall 1992 and a revised third version is to be presented to the TEI advisory board for its endorsement this coming winter. The TEI itself exists to provide a markup language, not a marked-up text.

Among the challenges the TEI has attempted to face is the need for a markup language that will work for existing projects, that is, handle the level of markup that people are using now to tag only chapter, section, and paragraph divisions and not much else. At the same time, such a language also will be able to scale up gracefully to handle the highly detailed markup which many people foresee as the future destination of much electronic text, and which is not the future destination but the present home of numerous electronic texts in specialized areas.

SPERBERG-McQUEEN dismissed the lowest-common-denominator approach as unable to support the kind of applications that draw people who have never been in the public library regularly before, and make them come back. He advocated more interesting text and more intelligent text.

Asserting that it is not beyond economic feasibility to have good texts, SPERBERG-McQUEEN noted that the TEI Guidelines listing 200-odd tags contains tags that one is expected to enter every time the relevant textual feature occurs. It contains all the tags that people need now, and it is not expected that everyone will tag things in the same way.

The question of how people will tag the text is in large part a function of their reaction to what SPERBERG-McQUEEN termed the issue of reproducibility. What one needs to be able to reproduce are the things one wants to work with. Perhaps a more useful concept than that of reproducibility or recoverability is that of processability, that is, what can one get from an electronic text without reading it again in the original. He illustrated this contention with a page from Jan Comenius's bilingual Introduction to Latin.

SPERBERG-McQUEEN returned at length to the issue of images as simulacra for the text, in order to reiterate his belief that in the long run more than images of pages of particular editions of the text are needed. Just as second-generation photocopies and second-generation microfilm degenerate, so second-generation representations tend to degenerate: one tends to overstress some relatively trivial aspects of the text, such as its layout on the page (which is not always significant, despite what the text critics might say), and to slight other pieces of information, such as the very important lexical ties between the English and Latin versions of Comenius's bilingual text.

Moreover, in many crucial respects it is easy to fool oneself concerning what a scanned image of the text will accomplish. For example, in order to study the transmission of texts, information concerning the text carrier is necessary, which scanned images simply do not always handle.

Further, even the high-quality materials being produced at Cornell lose much of the information that one would need if studying those books as physical objects. It is a choice that has been made. It is an arguably justifiable choice, but one does not know what color those pen strokes in the margin are or whether there was a stain on the page, because it has been filtered out. One does not know whether there were rips in the page because they do not show up, and on a couple of the marginal marks one loses half of the mark because the pen is very light and the scanner failed to pick it up, and so what is clearly a checkmark in the margin of the original becomes a little scoop in the margin of the facsimile.

These are standard problems for facsimile editions, not new to electronics but also true of light-lens photography. They are remarked here because it is important that we not fool ourselves: even if we produce a very nice image of this page with good contrast, we are not replacing the manuscript any more than microfilm has replaced the manuscript.

The TEI comes from the research community, where its first allegiance lies, but it is not just an academic exercise. It has relevance far beyond those who spend all of their time studying text, because one's model of text determines what one's software can do with a text. Good models lead to good software. Bad models lead to bad software. That has economic consequences, and it is these economic consequences that have led the European Community to help support the TEI, and that will lead, SPERBERG-McQUEEN hoped, some software vendors to realize that if they provide software with a better model of the text they can make a killing.

+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++ DISCUSSION * Implications of different DTDs and tag sets * ODA versus SGML *

During the discussion that followed, several additional points were made.

Neither AAP (i.e., Association of American Publishers) nor CALS (i.e., Computer-aided Acquisition and Logistics Support) has a document-type definition for ancient Greek drama, although the TEI will be able to handle that. Given this state of affairs and assuming that the technical-journal producers and the commercial vendors decide to use the other two types, then an institution like the Library of Congress, which might receive all of their publications, would have to be able to handle three different types of document definitions and tag sets and be able to distinguish among them.