Library of Congress Workshop on Etexts - Part 9

Part 9

The most important factor that distinguished the vendors under consideration was their identification with the customer. The size and internal complexity of the company were also important factors. POB was looking at large companies that had substantial resources. In the end, the process generated two competitive proposals for Yale, with Xerox's the clear winner. WATERS then described the components of the proposal, the design principles, and some of the costs estimated for the process.

Components are essentially four: a conversion subsystem; a network-accessible storage subsystem for 10,000 books (POB expects storage at 200 to 600 dpi); browsing stations distributed on the campus network; and network access to the image printers.

Among the design principles, POB wanted conversion at the highest possible resolution. Assuming TIFF files with Group 4 compression, TCP/IP, and an Ethernet network on campus, POB wanted a client-server approach with image documents distributed to the workstations and made accessible through native workstation interfaces such as Windows. POB also insisted on a phased approach to implementation: 1) a stand-alone, single-user, low-cost entry into the business, with a workstation focused on conversion and allowing POB to explore user access; 2) movement to higher-volume conversion with network-accessible storage and multiple access stations; and 3) high-volume conversion, full-capacity storage, and multiple browsing stations distributed throughout the campus.

The costs proposed for start-up assumed the existence of the Yale network and its two DocuTech image printers. Other start-up costs are estimated at $1 million over the three phases. At the end of the project, the annual operating costs, estimated primarily for the proposed software and hardware, come to about $60,000, but these exclude costs for labor needed in the conversion process, network and printer usage, and facilities management.

Finally, the selection process produced for Yale a more sophisticated view of the imaging markets: the management of complex documents in image form is not a preservation problem, not a library problem, but a general problem in a broad, general industry. Preservation materials are useful for developing that market because of the qualities of the material; for example, much of it is out of copyright. The resolution of key issues such as the quality of scanning and image browsing also will affect development of that market.

The technology is readily available but changing rapidly. In this context of rapid change, several factors affect quality and cost, and POB intends to pay particular attention to them; one example is the various levels of resolution that can be achieved. POB believes it can bring resolution up to 600 dpi, though an interpolation process from 400 to 600 dpi is more likely. The variation in microfilm quality will prove to be a highly important factor. POB may reexamine the standards used to film in the first place by looking at this process as a follow-on to microfilming.

Other important factors include: the techniques available to the operator for handling material, the ways of integrating quality control into the digitizing work flow, and a work flow that includes indexing and storage. POB's requirement was to be able to deal with quality control at the point of scanning. Thus, thanks to Xerox, POB anticipates having a mechanism which will allow it not only to scan in batch form, but to review the material as it goes through the scanner and control quality from the outset.

The standards for measuring quality and costs depend greatly on the uses of the material, including subsequent OCR, storage, printing, and browsing. But especially at issue for POB is the facility for browsing.

This facility, WATERS said, is perhaps the weakest aspect of imaging technology and the most in need of development.

A variety of factors affect the usability of complex documents in image form, among them: 1) the ability of the system to handle the full range of document types, not just monographs but serials, multi-part monographs, and manuscripts; 2) the location of the database of record for bibliographic information about the image document, which POB wants to enter once and in the most useful place, the on-line catalog; 3) a document identifier for referencing the bibliographic information in one place and the images in another; 4) the technique for making the basic internal structure of the document accessible to the reader; and finally, 5) the physical presentation on the CRT of those documents. POB is ready to complete this phase now. One remaining decision concerns which material to scan.

+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++ DISCUSSION * TIFF files constitute de facto standard * NARA's experience with image conversion software and text conversion * RFC 1314 * Considerable flux concerning available hardware and software solutions * NAL through-put rate during scanning * Window management questions *

In the question-and-answer period that followed WATERS's presentation, the following points emerged:

* ZIDAR's statement about using TIFF files as a standard meant de facto standard. This is what most people use and typically exchange with other groups, across platforms, or even occasionally across display software.

* HOLMES commented on the unsuccessful experience of NARA in attempting to run image-conversion software or to exchange between applications: files that are supposedly TIFF go into other software that is supposed to be able to accept TIFF but cannot recognize the format and cannot deal with it, which renders the exchange useless. Regarding text conversion, he noted the different recognition rates obtained by substituting the make and model of scanner in NARA's recent test of an "intelligent" character-recognition product for a new company. In the selection of hardware and software, HOLMES argued, software no longer constitutes the overriding factor it did until about a year ago; rather, it is perhaps important to look at both now.

* About a month ago, Danny Cohen and Alan Katz of the University of Southern California Information Sciences Institute began circulating an Internet RFC (RFC 1314) proposing a standard TIFF interchange format for Internet distribution of monochrome bit-mapped images, which LYNCH said he believed would be used as a de facto standard. (A brief illustrative sketch of writing such a file appears after this discussion list.)

* FLEISCHHAUER's impression from hearing these reports and thinking about AM's experience was that there is considerable flux concerning available hardware and software solutions. HOOTON agreed, and commented at the same time on ZIDAR's statement that the equipment employed affects the results produced: one cannot conclude that it is difficult or impossible, for example, to perform OCR on scanned microfilm merely because that proved true with one device, one set of parameters, and one set of system requirements, since numerous other people are accomplishing just that, perhaps using other components.

HOOTON opined that both the hardware and the software were highly important. Most of the problems discussed today have been solved in numerous different ways by other people. Though it is good to be cognizant of various experiences, this is not to say that it will always be thus.

* At NAL, the through-put rate of the scanning process for paper, page by page, performing OCR, ranges from 300 to 600 pages per day; not performing OCR is considerably faster, although how much faster is not known. These rates are for scanning from bound books, which is much slower.

* WATERS commented on window management questions: DEC proposed an X Windows solution, which was problematic for two reasons. One was POB's requirement to be able to bring images down to the workstation itself and manipulate them there; the other was network usage.
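
As a concrete illustration of the TIFF exchange discussed above, the following minimal Python sketch (an assumption for illustration, using the Pillow library; it is not RFC 1314 itself and was not part of the workshop) writes a monochrome, Group 4-compressed TIFF of the kind treated here as a de facto interchange format.

    # Illustrative sketch: write a bitonal, Group 4-compressed TIFF.
    from PIL import Image

    def save_group4_tiff(bitonal_pixels, size, path, dpi=600):
        """Pack a flat sequence of 0/1 pixels into a 1-bit TIFF file."""
        img = Image.new("1", size)            # mode "1": one bit per pixel
        img.putdata([255 if p else 0 for p in bitonal_pixels])
        img.save(path, compression="group4", dpi=(dpi, dpi))

    # Example: an 8 x 8 checkerboard "page"
    pixels = [(x + y) % 2 for y in range(8) for x in range(8)]
    save_group4_tiff(pixels, (8, 8), "sample.tif")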

+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++ THOMA * Illustration of deficiencies in scanning and storage process * Image quality in this process * Different costs entailed by better image quality * Techniques for overcoming various deficiencies: fixed thresholding, dynamic thresholding, dithering, image merge * Page edge effects *

George THOMA, chief, Communications Engineering Branch, National Library of Medicine (NLM), illustrated several of the deficiencies discussed by the previous speakers. He introduced the topic of special problems by noting the advantages of electronic imaging. For example, it is regenerable because it is a coded file, and real-time quality control is possible with electronic capture, whereas in photographic capture it is not.

One of the difficulties discussed in the scanning and storage process was image quality which, without belaboring the obvious, means different things for maps, medical X-rays, or broadcast television. In the case of documents, THOMA said, image quality boils down to legibility of the textual parts, and fidelity in the case of gray or color photo print-type material. Legibility boils down to scan density, the standard in most cases being 300 dpi. Increasing the resolution with scanners that perform 600 or 1200 dpi, however, comes at a cost.

Better image quality entails at least four different kinds of costs: 1) equipment costs, because a CCD (i.e., charge-coupled device) with a greater number of elements costs more; 2) time costs that translate into actual capture costs, because manual labor is involved and because more data has to be moved around in the scanning devices, the network, and the storage; 3) media costs, because at high resolutions larger files have to be stored; and 4) transmission costs, because there is simply more data to be transmitted.
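
To make the scaling concrete, the following minimal Python sketch (an illustrative assumption, not part of the workshop proceedings) computes the uncompressed size of a one-bit-per-pixel scan of a letter-size page at several densities; storage and transmission costs grow roughly with the square of the resolution, and Group 4 compression reduces, but does not eliminate, that growth.

    # Illustrative sketch: uncompressed size of a bitonal (1-bit) scan
    # of an 8.5 x 11 inch page at several scan densities.
    PAGE_W_IN, PAGE_H_IN = 8.5, 11.0  # letter-size page (assumed)

    def raw_bitonal_bytes(dpi: int) -> int:
        """Return the uncompressed size in bytes at the given density."""
        width_px = int(PAGE_W_IN * dpi)
        height_px = int(PAGE_H_IN * dpi)
        return width_px * height_px // 8  # one bit per pixel

    for dpi in (300, 600, 1200):
        mb = raw_bitonal_bytes(dpi) / 1_000_000
        print(f"{dpi:>5} dpi: {mb:5.1f} MB uncompressed")
    # Roughly 1 MB per page at 300 dpi, 4 MB at 600 dpi, 17 MB at 1200 dpi.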

But while resolution takes care of the issue of legibility in image quality, other deficiencies have to do with contrast and with elements on the scanned page or image that need to be removed or clarified. Thus, THOMA proceeded to illustrate various deficiencies, how they are manifested, and several techniques to overcome them.

Fixed thresholding was the first technique described, suitable for black-and-white text, when the contrast does not vary over the page. One can have many different threshold levels in scanning devices. Thus, THOMA offered an example of extremely poor contrast, which resulted from the fact that the stock was a heavy red. This is the sort of image that when microfilmed fails to provide any legibility whatsoever. Fixed thresholding is the way to change the black-to-red contrast to the desired black-to-white contrast.
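
A minimal sketch of fixed thresholding follows (illustrative only, assuming numpy and an 8-bit grayscale scan; it is not NLM's implementation): every pixel is compared against one global level, chosen so that, for example, text on heavy red stock still separates cleanly into black on white.

    # Illustrative sketch of fixed thresholding on an 8-bit grayscale scan.
    import numpy as np

    def fixed_threshold(gray: np.ndarray, level: int = 128) -> np.ndarray:
        """Map grayscale pixels to bitonal: 1 (white) at or above `level`,
        0 (black) below it. The level is tuned per stock."""
        return (gray >= level).astype(np.uint8)

    # Example: a tiny synthetic "page" with poor contrast
    page = np.array([[200, 190,  60,  55],
                     [195, 185,  58,  50],
                     [205, 180,  62,  57]], dtype=np.uint8)
    print(fixed_threshold(page, level=120))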

Other examples included material that had been browned or yellowed by age. This was also a case of contrast deficiency, and correction was done by fixed thresholding. A final example boiled down to the same thing, slight variability in contrast, but it was not significant, and fixed thresholding solved this problem as well. The microfilm equivalent is certainly legible, but it comes with dark areas. Though THOMA did not have a slide of the microfilm in this case, he did show the reproduced electronic image.

When one has variable contrast over a page or the lighting over the page area varies, especially in the case where a bound volume has light shining on it, the image must be processed by a dynamic thresholding scheme. One scheme, dynamic averaging, allows the threshold level not to be fixed but to be recomputed for every pixel from the neighboring characteristics. The neighbors of a pixel determine where the threshold should be set for that pixel.
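
A minimal sketch of one dynamic thresholding scheme follows (an assumption for illustration, not NLM's actual scheme; it relies on numpy and scipy): the threshold is recomputed for every pixel from the mean of its neighborhood, so uneven lighting across a bound volume is tolerated.

    # Illustrative sketch of dynamic averaging: per-pixel thresholds
    # derived from each pixel's neighborhood mean.
    import numpy as np
    from scipy.ndimage import uniform_filter

    def dynamic_average_threshold(gray: np.ndarray,
                                  window: int = 15,
                                  offset: float = 10.0) -> np.ndarray:
        """Binarize an 8-bit grayscale image with a locally adaptive
        threshold: the mean of a `window` x `window` neighborhood,
        lowered by a small `offset` to keep faint background white."""
        g = gray.astype(np.float32)
        local_mean = uniform_filter(g, size=window)
        return (g >= local_mean - offset).astype(np.uint8)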

THOMA showed an example of a page that had been made deficient by a variety of techniques, including a burn mark, coffee stains, and a yellow marker. Application of a fixed-thresholding scheme, THOMA argued, might take care of several deficiencies on the page but not all of them.

Performing the calculation for a dynamic threshold setting, however, removes most of the deficiencies so that at least the text is legible.

Another problem is representing a gray level with black-and-white pixels by a process known as dithering, or electronic screening. But dithering does not provide good image quality for pure black-and-white textual material, a point THOMA illustrated with examples. Although its suitability for photoprint is the reason for electronic screening, dithering cannot be used for every compound image. In the document distributed by CXP, THOMA noticed that the dithered image of the IEEE test chart evinced some deterioration in the text. He presented an extreme example of such deterioration, in which compound documents had to be set right by other techniques. The technique illustrated by this example was an image merge, in which the page is scanned twice, with the settings changed from a fixed threshold to the dithering matrix, and the resulting images are merged to give the best results of each technique.
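
The following minimal sketch (an illustrative assumption, not the CXP or NLM implementation) shows ordered dithering with a small Bayer matrix and the kind of image merge described above, in which a dithered pass and a fixed-threshold pass are combined region by region.

    # Illustrative sketch of ordered dithering (electronic screening)
    # and of merging a dithered pass with a fixed-threshold pass.
    import numpy as np

    # 4x4 Bayer matrix, spread over the 0-255 gray range
    BAYER_4X4 = (np.array([[ 0,  8,  2, 10],
                           [12,  4, 14,  6],
                           [ 3, 11,  1,  9],
                           [15,  7, 13,  5]], dtype=np.float32) + 0.5) * (255.0 / 16)

    def ordered_dither(gray: np.ndarray) -> np.ndarray:
        """Render gray levels as patterns of black-and-white pixels."""
        h, w = gray.shape
        tiled = np.tile(BAYER_4X4, (h // 4 + 1, w // 4 + 1))[:h, :w]
        return (gray >= tiled).astype(np.uint8)

    def image_merge(thresholded: np.ndarray, dithered: np.ndarray,
                    photo_mask: np.ndarray) -> np.ndarray:
        """Use dithered pixels where `photo_mask` marks halftone regions
        and fixed-threshold pixels elsewhere (the textual parts)."""
        return np.where(photo_mask, dithered, thresholded)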

THOMA illustrated how dithering is also used in nonphotographic or nonprint materials with an example of a grayish page from a medical text, which was reproduced so as to show all of the gray that appeared in the original. Dithering likewise reproduced all the gray of the original in another example from the same text.

THOMA finally illustrated the problem of bordering, or page-edge, effects. Books and bound volumes that are placed on a photocopy machine or a scanner produce page-edge effects that are undesirable for two reasons: 1) the aesthetics of the image; after all, if the image is to be preserved, one does not necessarily want to keep all of its deficiencies; 2) compression (with the bordering problem THOMA illustrated, the compression ratio deteriorated tremendously). One way to eliminate this more serious problem is to have the operator, at the point of scanning, window the part of the image that is desirable and automatically turn all of the pixels outside that window to white.
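
A minimal sketch of that windowing step follows (an illustrative assumption, not the actual operator interface): everything outside the operator-selected page area is forced to white, which both removes the dark border and restores the compression ratio.

    # Illustrative sketch: keep only the operator-selected window of a
    # bitonal page image and set all other pixels to white (1).
    import numpy as np

    def window_to_white(bitonal: np.ndarray,
                        top: int, left: int,
                        bottom: int, right: int) -> np.ndarray:
        out = np.ones_like(bitonal)  # start from an all-white page
        out[top:bottom, left:right] = bitonal[top:bottom, left:right]
        return out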

+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++ FLEISCHHAUER * AM's experience with scanning bound materials * Dithering *

Carl FLEISCHHAUER, coordinator, American Memory, Library of Congress, reported AM's experience with scanning bound materials, which he likened to the problems involved in using photocopying machines. Very few devices in the industry offer book-edge scanning, let alone book cradles.

The problem may be unsolvable, FLEISCHHAUER said, because a large enough market does not exist for a preservation-quality scanner. AM is using a Kurzweil scanner, which is a book-edge scanner now sold by Xerox.

Devoting the remainder of his brief presentation to dithering, FLEISCHHAUER related AM's experience with a contractor who was using unsophisticated equipment and software to reduce moire patterns from printed halftones. AM took the same image and used the dithering algorithm that forms part of the same Kurzweil Xerox scanner; it disguised moire patterns much more effectively.

FLEISCHHAUER also observed that dithering produces a binary file, which is useful for numerous purposes, for example, printing it on a laser printer without having to "re-halftone" it. But it tends to defeat efficient compression, because the very pixel patterns that dithering introduces to reduce moire also work against compression schemes. AM thought the difference in image quality was worth it.

+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++ DISCUSSION * Relative use as a criterion for POB's selection of books to be converted into digital form *

During the discussion period, WATERS noted that one of the criteria for selecting books among the 10,000 to be converted into digital image form would be how much relative use they would receive--a subject still requiring evaluation. The challenge will be to understand whether coherent bodies of material will increase usage or whether POB should seek material that is being used, scan that, and make it more accessible.

POB might decide to digitize materials that are already heavily used, in order to make them more accessible and decrease wear on them. Another approach would be to provide a large body of intellectually coherent material that may be used more in digital form than it is currently used in microfilm. POB would seek material that was out of copyright.

+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++ BARONAS * Origin and scope of AIIM * Types of documents produced in AIIM's standards program * Domain of AIIM's standardization work * AIIM's structure * TC 171 and MS23 * Electronic image management standards * Categories of EIM standardization where AIIM standards are being developed *

Jean BARONAS, senior manager, Department of Standards and Technology, Association for Information and Image Management (AIIM), described the not-for-profit association and the national and international programs for standardization in which AIIM is active.

Accredited for twenty-five years as the nation's standards development organization for document image management, AIIM began life in a library community developing microfilm standards. Today the association maintains both its library and business image-management standardization activities--and has moved into electronic image management (EIM) standardization.