Guidelines for Electronic Preservation of Visual Materials

Aspects of Collection Analysis for Preservation Digitization

Material Form - Content - Electronic Form - Use

To clearly discuss the issues in digital imaging for preservation of visual materials, it is important to begin by developing a set of terms.

MATERIAL FORM

The material form of an item encompasses those aspects of an item which may affect the manner in which it is captured digitally, but which are not intrinsic to its primary use as a conveyor of information. In essence, the item's material form is an historical accident -- it was prescribed by the container technology which existed at the time the item's information was reduced to physical form.

An interesting dilemma when making decisions about digital capture from a physical item is in deciding which aspects of its materiality are of central interest in its capture. Depending on policy issues and assumptions made about the potential consumers of the digital information, the coffee stain on that manuscript may be either an accident of its current embodiment or a key historical fact. When one factors in a potential doubling of the conversion cost based on what is deemed important, it is clear that such decisions must be faced before capture begins.

We break down the concept of material form into some component elements.

artifact type

The artifact type is a convenient label which describes a class of objects which served a particular purpose or which embodied a particular social or economic practice. These practices often dictated a particular material form for the item. Such classes may be more or less general. Some artifact types include:

book
manuscript
magazine
newspaper
photograph
cuneiform tablet
tax form

physical peculiarities/features

Physical peculiarities/features describes the particular properties of an item which, while incidental, may have direct bearing on the ease with which the item may be scanned. Some examples include:

bound
transparent
brittle
specularly reflective
made of clay

overall size

The size of an item is a physical peculiarity which deserves special mention, since it determines the total number of pixels required of the scanning device once the spatial resolution has been chosen.

When faced with a population of documents having a few "outliers" -- items which are exceptionally small or large -- a system architect may have to choose among acquiring a separate (and potentially more expensive) scanner for those few items, breaking the scans of the item into tiles or other subdivisions, or altering the capture characteristics (e.g. lowering the spatial resolution) for those items. While the last course is the least defensible on technical grounds (for an archival image), it seems to be rather widely used, perhaps since it is analogous to the microfilming operation of running a camera up and down a copystand's vertical stanchion. Note that this approach may make some sense for access-quality images.
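The trade-off among these three options can be made concrete with a little arithmetic. The following is a minimal sketch only: the function names and the example scanner limits (2550 x 3300 pixels per capture) are assumptions chosen for illustration, not figures from this report.

```python
import math

def pixels_required(width_in, height_in, dpi):
    """Pixel dimensions needed to cover an item at a chosen spatial resolution."""
    return math.ceil(width_in * dpi), math.ceil(height_in * dpi)

def tiles_required(width_in, height_in, dpi, max_px_w, max_px_h):
    """Tiles (columns, rows) needed when one capture yields at most
    max_px_w x max_px_h pixels. Tile overlap is ignored in this sketch."""
    w, h = pixels_required(width_in, height_in, dpi)
    return math.ceil(w / max_px_w), math.ceil(h / max_px_h)

def resolution_to_fit(width_in, height_in, max_px_w, max_px_h):
    """Highest resolution at which the whole item fits in a single capture."""
    return min(max_px_w / width_in, max_px_h / height_in)

# Hypothetical scanner able to deliver 2550 x 3300 pixels per capture.
typical = pixels_required(8.5, 11, 300)            # (2550, 3300): fits in one capture
outlier = pixels_required(24, 36, 300)             # (7200, 10800): does not fit
tiles = tiles_required(24, 36, 300, 2550, 3300)    # (3, 4): twelve tiles
reduced = resolution_to_fit(24, 36, 2550, 3300)    # about 92 dpi instead of 300
```

In this example the oversized item either needs twelve tiles at the chosen resolution or a drop to less than a third of that resolution, which illustrates why lowering the resolution is hard to defend for an archival image.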

CONTENT

Two types of content may be discussed. The first, material content, addresses the actual form of the information features embodied in an item. The second, electronic content, is the digital equivalent which has been deemed to be sufficient or appropriate to the preservation of the item's corresponding material content. This mapping,

Material Content ---> Electronic Content

is the crux of the problem this report attempts to address.

Material content

Material content comprises the visible information features present in a physical item. These may be pieces of typeset or handwritten text, line art or engravings, halftones, or continuous tone regions. The several material content types present on an item may be in separate regions of the page or they may be inter-mixed or overlapping.

feature sizes

A key characteristic of a piece of material content is the range of feature sizes present in it. In text, a characteristic size might be the stroke width of a sans serif character such as the letter "l" (ell) or the radius of the finest serif tip seen in the font.

In a traditional halftone, the minimum feature size is the diameter of the smallest dot seen in the lighter highlight parts of the halftone.

In a continuous tone region, the minimum feature size would be the diameter of a Gaussian blur spot in the highest-definition portion of the photographic print.

feature tonal content

The tonal content of a feature is the range of differing shades or colors seen across all instances of the feature within the material content region.

A continuous tone region would exhibit a high gray tonal content.

A text region has a low tonal content, exhibiting only a few predominant shades or colors. Such a region is often described as bitonal, since it has only two intrinsically meaningful tones (shades of gray or shades of color). Any additional tones seen in such an image are created by samples which straddle these bitonal edges.

feature contrasts -- An aspect of feature tonal content is feature contrast. Whereas the tonal range is found across an entire material content region, feature contrasts are locally determined and relate to the local tonal difference between the feature and its immediately adjacent background. While text is commonly high contrast, features such as handwritten marginalia may have very low contrast.
feature colors -- A special case of tonal variation is color variation. Note that a feature such as text may be bitonal and still possess color content. The printing term for this is spot color. Features in color photographs can have a wide range of color variation.

feature types/configurations

Besides having certain sizes or contrast ratios, different types of features have specific shapes or configurations which dictate how they would best be treated.

text -- Text is characterized by compact markings with regular, relatively wide spacing between long, thin strokes of relatively uniform width. The strokes have consistent curvatures and predictable, gradual changes of orientation. Fonts may have serifs or be sans serif, the latter having a more uniform stroke width.
halftones -- Halftones are traditionally dark dots of varying size laid out on a regularly-spaced grid. This grid is most often oriented at an angle to the scanning axes. A variety of historically significant processes have been used to produce several different styles of traditional halftones. Their configurations of foreground and background are distinctive enough to perhaps allow determination of the process type from automated analysis of a high-resolution scan.

Several more modern styles of halftone now exist which are different than this traditional clustered, ordered halftone. These distributed, pseudo-random halftones have black dots all of the same minimal size which occur in more or less proximity to one another to achieve different simulated gray shades. They tend, however, to be laid out such that they do not touch one another, since this produces undesirable visual effects.
line art and engravings -- Line art, like text, is characterized by relatively uniform-width markings of long thin strokes. Line art may tend to have greater stroke width variation than text, depending on the artist's tool used as a marking device.
continuous tones -- Continuous tone regions have very smooth variations between adjacent reflective values. They typically have no shades which might be termed foreground or background as in a bitonal region.

It is more difficult to identify a minimum feature size in a continuous tone region, since this may most typically occur in a fine texture, rather than at an isolated stroke. The measurement of the minimum characteristic size of such a texture is complicated by the need to look for reflective maxima or minima across which to perform the measurement, rather than by using the clearly isolated features seen in bitonal regions. Such a technique can perhaps be automated by transform domain techniques using the data which would be available in a JPEG compression system.
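One possible reading of the transform domain suggestion above is sketched here. It is an illustration under stated assumptions, not a method given in this report: it works on a single row of eight samples rather than the 8 x 8 blocks a JPEG system uses, the function names are hypothetical, and the significance threshold is an arbitrary tuning value. The idea is that the highest cosine frequency carrying significant energy indicates the spacing between adjacent reflective maxima and minima in the texture.

```python
import math

def dct_1d(samples):
    """Unnormalized DCT-II of a short run of samples."""
    n = len(samples)
    return [
        sum(s * math.cos(math.pi * (i + 0.5) * k / n) for i, s in enumerate(samples))
        for k in range(n)
    ]

def highest_significant_frequency(samples, threshold):
    """Index of the highest non-DC coefficient whose magnitude exceeds
    threshold, or 0 when no such coefficient exists (a flat region)."""
    coeffs = dct_1d(samples)
    significant = [k for k, c in enumerate(coeffs) if k > 0 and abs(c) > threshold]
    return max(significant, default=0)

def estimated_feature_size(samples, threshold):
    """Approximate distance, in samples, between an adjacent maximum and
    minimum of the finest significant texture. Basis function k spans k
    half-cycles across the run, so the half-period is len(samples) / k."""
    k = highest_significant_frequency(samples, threshold)
    return None if k == 0 else len(samples) / k

fine_texture = [0, 255, 0, 255, 0, 255, 0, 255]   # alternates every sample
smooth_ramp = [0, 10, 20, 30, 40, 50, 60, 70]     # slow tonal change
print(estimated_feature_size(fine_texture, 100))
print(estimated_feature_size(smooth_ramp, 50))
```

The fine texture yields an estimate close to one sample, while the ramp yields the full run length, which suggests how such a measure could separate finely textured regions from smooth ones.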

Electronic content

For clarity, we have introduced a distinction between material content -- that which exists a priori in the artifact itself, and electronic content -- the format of the digital image chosen by the imaging system architect or user to best preserve the material content with reasonable fidelity.

Another concept we have stressed is that there may be multiple material content areas present on the artifact (typically a page). Each of these may best have a different electronic content type used to convey it, although practical considerations sometimes prevent this approach.

We further distinguish the electronic content from the electronic form. To some extent, the electronic content embodies the primary choices made about the quality of the digital image and the electronic form merely serves as a container for the resulting data, providing compression or another encoding means and an orderly file format for unambiguously interchanging that data.

With the introduction of lossy compression schemes, this distinction is no longer entirely justified: choosing an inappropriately high loss in the compression step can have a large impact on image quality.

Nevertheless, the distinction is still useful. It is relatively difficult to reverse inappropriate choices of electronic content (made at scanning and processing time), whereas it is relatively easy to modify the electronic form of the image file at a later date.

spatial resolution

Spatial resolution characterizes the spacing between the centers of sampling points in the digital image and thus determines the minimum feature size discernible in the image.

In the abstract, a digital image is an array of luminance and/or chrominance values received from a (typically) rectangular array of perfectly abutting square sampling areas.

In practice, however, the spot size focused on the sampling area is larger than the area, if only due to the fact that optical beams are elliptical in cross section. Typically, scanners use linear arrays of sampling sensors, with the coverage of the other axis achieved by motion either of the page or of some mirror arrangement. If this motion is not properly chosen, a line of samples may overlap the previous line of samples by as much as 1/2 or 3/4 of a sample. Similarly, optical focus, electronic amplification characteristics and the mathematics of sampling functions all can contribute to a degradation of the actual image acuity achieved.

Despite all these flaws, the scanning device will still be described as having the same spatial resolution, computed by dividing the total number of pixels along the sensor line by the length of the line's projection onto the document. So, clearly the spatial resolution is at best a measure of the potential quality of the digital image and of its ability to preserve fine detail. Characterization of the scanner by means of test charts must be used to see if that potential is met.
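The nominal figure described above is a simple quotient. As a minimal sketch (the function name and the 5100-element sensor are illustrative assumptions):

```python
def nominal_resolution(sensor_pixels, projected_length_in):
    """Nominal spatial resolution in pixels per inch: the number of pixels
    along the sensor line divided by the length of that line's projection
    onto the document."""
    return sensor_pixels / projected_length_in

# A hypothetical 5100-element linear array.
close_up = nominal_resolution(5100, 8.5)    # 600.0 pixels per inch
pulled_back = nominal_resolution(5100, 17)  # 300.0 pixels per inch
```

The calculation contains no term for spot size, line overlap, or focus, so two devices with the same nominal figure may preserve fine detail quite differently.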

An additional complication is the concept of interpolated spatial resolution. The spatial resolution described above can be called the optical resolution, relating as it does to the count of the actual optical sensing devices. Some scanners fit a curve (more or less well) to the luminance values obtained from the sensors. This curve is then itself sampled at some new number of places, producing the grayscale interpolated resolution.

While this practice, if performed intelligently, may result in some smoothing of perceived jags in the image, it does not increase the ability of the image to preserve the finest details and is therefore to be identified clearly for what it is. It may be useful to match the dot pitch of a printing device by this means if the system is designed primarily for printing (as in a digital photocopier). This obviates performing similar manipulations at the printer (using a lower quality binary interpolation technique), but the value of such a technique is suspect in a preservation application.

It is worth noting that some scanners use a more simple-minded technique to produce more pixels from a sensor array. This is called pixel replication. It involves simple repetition of grayscale or even binary pixel values to get the requisite number of pixels, and it produces spatially inaccurate results.
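The difference between the two techniques can be seen on a single line of samples. This sketch assumes piecewise-linear interpolation as the fitted curve, which is only one of many curves a scanner might use; the function names are illustrative.

```python
def replicate(samples, factor):
    """Pixel replication: repeat each sensor value `factor` times."""
    return [s for s in samples for _ in range(factor)]

def interpolate_linear(samples, factor):
    """Fit straight segments between neighboring sensor values and
    resample `factor` times more finely along the line."""
    out = []
    for j in range((len(samples) - 1) * factor + 1):
        i, step = divmod(j, factor)
        if step == 0:
            out.append(float(samples[i]))
        else:
            t = step / factor
            out.append(samples[i] * (1 - t) + samples[i + 1] * t)
    return out

sensor_line = [0, 100, 0]
print(replicate(sensor_line, 2))           # [0, 0, 100, 100, 0, 0]
print(interpolate_linear(sensor_line, 2))  # [0.0, 50.0, 100.0, 50.0, 0.0]
```

Both outputs contain more pixels than the sensor delivered, but every new value is derived from the same three measurements. Neither technique recovers detail that fell between the sensing elements.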

tonal resolution

The number of distinguishable shades or colors possible in a piece of electronic image content is set by its tonal resolution.

binary -- Binary images, as the name implies, support only two tonal values by use of a single bit per pixel. Note that the term binary best applies to the electronic content, whereas the term bitonal best applies to the original material content.
grayscale (number of levels) -- Grayscale images are most typically stored using 8 bit luminance values which potentially permit 256 shades to be distinguished.

Note that grayscale data permits more accurate preservation of bitonal regions like text by correctly handling the case where the sample point lies half on and half off of a stroke. Rendering such a sample as a midrange gray provides the eye with more information to properly reconstruct the character.
color -- Color images may be paletted, where a select few colors have been found in the image, or full-color images. Paletted images typically have 8, 15 or 16 bits per pixel, permitting distinction between 256, 32,768, or 65,536 shades. The accuracy of their color rendition depends on the color quantization algorithm used to determine which colors are prevalent in the image.

Full color images most often have 24 or 32 bits available to represent either 3 or 4 color components, each of which most typically uses 8 bits.
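The figures above follow from the number of bits per pixel, and the advantage of grayscale at a stroke edge can be shown with a single sample. This is a minimal sketch: the function names, the 0.5 threshold, and the convention that 0 is black in the grayscale output are assumptions made for illustration.

```python
def tonal_levels(bits_per_pixel):
    """Number of distinct values a pixel of the given depth can hold."""
    return 2 ** bits_per_pixel

def raw_size_bytes(width_px, height_px, bits_per_pixel):
    """Uncompressed image size, rounded up to whole bytes."""
    return (width_px * height_px * bits_per_pixel + 7) // 8

def sample_binary(ink_coverage, threshold=0.5):
    """1 (ink) or 0 (paper) for a sample area partly covered by a stroke."""
    return 1 if ink_coverage >= threshold else 0

def sample_gray(ink_coverage, bits=8):
    """Gray level for the same sample area, where 0 is black."""
    white = 2 ** bits - 1
    return int((1 - ink_coverage) * white + 0.5)

coverages = (0.0, 0.25, 0.5, 1.0)            # fraction of the sample on the stroke
print([sample_binary(c) for c in coverages])  # [0, 0, 1, 1]
print([sample_gray(c) for c in coverages])    # [255, 191, 128, 0]
print(raw_size_bytes(2550, 3300, 1))          # 1051875
print(raw_size_bytes(2550, 3300, 8))          # 8415000
```

The binary samples force a quarter-covered and a half-covered area to opposite extremes, while the grayscale samples record how much of the stroke each area saw, at eight times the uncompressed storage.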

ELECTRONIC FORM

The electronic form of a digital image encompasses its data encoding or compression means and its file format.

data encoding/compression

Image compression is a special instance of the general idea of encoding. Image encoding is a set of techniques for putting image brightness and color values into computer readable form. Each compression technique is optimized for a specific electronic content type and may be either proprietary or standard.

Compression may be lossy or lossless. Lossless compression is most typically applied to binary images which have had their loss introduced in the earlier thresholding step.
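To illustrate what lossless means for a binary image, here is a minimal run-length coding sketch. It is not any of the standard facsimile or JBIG schemes, which add further modeling and entropy coding; the function names and the convention that each line starts with a white run are assumptions of this example.

```python
from itertools import groupby

def rle_encode(bits):
    """Encode a binary scanline (0 = white, 1 = black) as alternating run
    lengths, beginning with a white run. A leading 0 is emitted when the
    line starts with black so that decoding stays unambiguous."""
    runs = [len(list(group)) for _, group in groupby(bits)]
    if bits and bits[0] == 1:
        runs.insert(0, 0)
    return runs

def rle_decode(runs):
    """Rebuild the scanline from alternating white and black run lengths."""
    bits, value = [], 0
    for length in runs:
        bits.extend([value] * length)
        value = 1 - value
    return bits

scanline = [0] * 10 + [1] * 3 + [0] * 7
encoded = rle_encode(scanline)          # [10, 3, 7]
assert rle_decode(encoded) == scanline  # every pixel is recovered exactly
```

Twenty pixels are carried by three numbers and decoding returns the original line exactly. Any loss relative to the page was introduced earlier, when the gray samples were thresholded to binary.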

file format

The file format is the outer wrapper or transport portion of the electronic content. It carries the encoded image data as well as any ancillary data needed to use the image, such as an identification of the encoding method and any parameters needed to perform decoding (for example, line length). File formats may be either proprietary or standard.
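The wrapper role can be sketched with a toy header. Everything here is hypothetical and invented for illustration: the "DEMO" signature, the field layout, and the encoding identifier do not correspond to any real file format.

```python
import struct

# Hypothetical header layout, big-endian with no padding:
# 4-byte signature, encoding id, line length, line count, bits per pixel.
HEADER = struct.Struct(">4sBHHB")
MAGIC = b"DEMO"

def wrap(encoding_id, line_length, line_count, bits_per_pixel, payload):
    """Prefix encoded image data with the ancillary data needed to decode it."""
    header = HEADER.pack(MAGIC, encoding_id, line_length, line_count, bits_per_pixel)
    return header + payload

def unwrap(blob):
    """Split a wrapped file back into decoding parameters and encoded data."""
    magic, encoding_id, line_length, line_count, bpp = HEADER.unpack_from(blob)
    if magic != MAGIC:
        raise ValueError("unrecognized file signature")
    return {
        "encoding_id": encoding_id,
        "line_length": line_length,
        "line_count": line_count,
        "bits_per_pixel": bpp,
        "payload": blob[HEADER.size:],
    }

blob = wrap(1, 2550, 3300, 1, b"\x0a\x03\x07")
info = unwrap(blob)
```

The payload is opaque to the wrapper. Re-wrapping the same encoded data in a different format at a later date leaves the image content untouched, which is why changes to the electronic form are comparatively easy to make.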

USE

The uses to which an image is to be put are a key aspect of digital imaging for preservation. We distinguish preservation quality images from access quality images.

Preservation Copy

The initial image capture sets an upper limit on the image quality that is available for any subsequent use of the image. With suitable image compression, the cost of image storage (media cost) for the preservation quality image is already likely to be lower than the labor cost for the scanning operation.

The purpose of preservation quality images is the reconstruction (to the extent possible) of a faithful copy of the original source document. Another anticipated use is the production of access quality images, which can be optimized for interactive retrieval over a network.

The longevity of the preservation copy can be viewed as being dependent not only on the longevity of the archival media (which can be refreshed by making perfect digital copies to achieve arbitrary longevity of the information it contains), but even more so on the quality of the initial image capture being sufficient to support all future needs.

replacement preservation copy

The decision on whether the original artifact should be retained at the Library can be based on its value as an artifact, once its information has been preserved and is independently accessible. Of course, if there is "information" in the document that cannot reasonably be captured during scanning (such as a distinctive watermark in the paper, or a distinctive type of paper itself), retention of the document may be necessary to support some researchers.

surrogate preservation copy

Even in the case where the original artifact is retained, the existence of preservation and access copies will aid in the preservation of the original, because the need to physically handle the original will be restricted to a smaller group of researchers.

For the preservation quality image, the cost of any "excess" quality shows up only once, as increased archival media cost, so it makes sense to look towards the future and anticipate the highest quality that will ever be desired, rather than planning on labor-intensive re-scanning in the future.

access copy

The above is in contrast to the economics for access quality images, where any increase in the file size increases the cost of every retrieval. The ability to automatically produce access quality images from the preservation images, with the decision as to the appropriate format for the access quality images dependent on the technology of the time, allows the captured images to "keep up" with advancing technology without requiring re-scanning.

types of access

The proper creation of a preservation copy in digital form can also facilitate the accessibility of the information, possibly with multiple levels of access quality to support multiple research activities.

In the analog realm, the standard for a suitably faithful access copy is commonly an image captured on 35mm microfilm at a low reproduction ratio.

One may note that microfilm images are normally monochrome and high contrast. The high contrast of microfilm increases the perceived quality of most documents by making the writing (which should be dark) even darker, while making the background paper (which should be light) even lighter.

However, this contrast enhancement will also reduce the contrast of light writing (by making it lighter), and it will increase the contrast of some undesired features (such as strike-through on onion skin paper). User satisfaction with such images is reasonably high, however, indicating that the quality loss that occurs on some documents is not severe enough to impact their usefulness.

On this basis, it is reasonable to propose that many access quality images may be binary despite the fact that their corresponding (and parent) preservation copy might be grayscale.
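Deriving such a binary access copy from a grayscale preservation copy can be sketched as follows. This is an illustration under stated assumptions: block averaging and a fixed cutoff of 128 are simple stand-ins for whatever resampling and thresholding a production system would use, and the function names are hypothetical.

```python
def downsample(gray, factor):
    """Reduce spatial resolution by averaging factor x factor blocks.
    `gray` is a list of rows of 8-bit values, where 0 is black."""
    rows = len(gray) // factor
    cols = len(gray[0]) // factor
    out = []
    for r in range(rows):
        row = []
        for c in range(cols):
            block = [
                gray[r * factor + dr][c * factor + dc]
                for dr in range(factor)
                for dc in range(factor)
            ]
            row.append(sum(block) // len(block))
        out.append(row)
    return out

def threshold(gray, cutoff=128):
    """Convert grayscale to binary: 1 (ink) where darker than cutoff."""
    return [[1 if value < cutoff else 0 for value in row] for row in gray]

preservation = [
    [255, 255, 0, 0],
    [255, 255, 0, 0],
    [255, 200, 40, 0],
    [255, 255, 0, 0],
]
access = threshold(downsample(preservation, 2))   # [[0, 1], [0, 1]]
faint_mark = threshold([[180]])                   # [[0]]: light writing is lost
```

The derivation runs in one direction only. The access copy can be regenerated with different parameters at any time, but the faint mark dropped by the threshold can be recovered only from the grayscale parent.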

The following types of access can be distinguished. For a given collection, a currently appropriate access copy and access strategy should be devised which meets user needs within the confines of current technology.

reference use vs. reading of literature
browsing vs. sustained reading
document location, full-text searching
photocopying of selected passages vs. full document delivery
frequent vs. occasional use
reformatting
