Sustainability of Digital Formats: Planning for Library of Congress Collections


Microsoft Office Word 97-2003 Binary File Format (.doc)

Identification and description

Full name: Microsoft Office Word 97-2003 Binary File Format (.doc)
Description

The Microsoft Word Binary File format, with the .doc extension and referred to here as DOC, was the default format used for documents in Microsoft Word from Word 97 (released in 1997) through Microsoft Office 2003. Although it cannot support all of the functionality introduced in the Word application since Word 2007, the DOC format has remained available in Word as an alternative to the DOCX/OOXML format (standardized in ISO/IEC 29500) for saving document files. As of late 2020, Microsoft's documentation page "File formats that are supported in Word" lists it as "Word 97-2003 Document." [Note: In other contexts, the same format has been called "Word 97-2004 Document" or "Word 97-2007 Document."]

According to the Wikipedia entry for Microsoft Word, the .doc extension has been used for four distinct file formats: (a) Word for DOS; (b) Word for Windows 1 and 2 and Word 3 and 4 for Mac OS; (c) Word 6 and Word 95 for Windows and Word 6 for Mac OS; (d) Word 97 and later for Windows and Word 98 and later for Mac OS. This format description is for the last of these formats. For convenience, the term "DOC" will be used here to refer specifically to this variant of the Microsoft Word files with .doc as extension.

Although the DOC format is proprietary, it has been covered by Microsoft's Open Specification Promise since 2007. The specification released in 2007 is available as Microsoft Office Word 97-2007 Binary File Format Specification [*.doc]. The structure of the DOC format is documented and kept up to date in [MS-DOC].

Since the release of Word 6.0, in 1993, the structure of a Word document with the .doc extension has been an OLE (object linking and embedding) Compound File Binary file as specified in [MS-CFB]. In 1997, the detailed structure of the CFB file used for Word documents was modified. The CFB format provides a file-system-like structure within a file for the storage of arbitrary, application-specific streams of data. It consists of storages, streams, and substreams.

A DOC file begins with a CFB header and must include a CFB root directory (identified by the name "Root Entry" in UTF-16). The root directory has entries for each stream or storage object at the top level of the compound file hierarchy. Each object entry has a name (also encoded in UTF-16, although most of the document content is usually stored in 1-byte characters) and points to the location in the file for the named object. Mandatory streams in a DOC file include a stream with the name "WordDocument" (also referred to as the "main stream") and a "table" stream with the name "1Table" or "0Table". The content of the WordDocument stream follows the CFB header and begins with a File Information Block (Fib), which contains information about the document, including a code identifying the DOC file as a Word Document, and specifies the file pointers to the various portions that make up the document. Streams that are not required by the specification, but are typically present in files written by Microsoft Word, include a SummaryInformation stream (with basic file-level metadata) and a DocumentSummaryInformation stream.

A Word file in the DOC format begins as follows, with all values given as they occur in the physical file, for example when viewed using a hex dump utility:

  • CFB header (usually 512 bytes):
    • Header Signature for the CFB format with 8-byte Hex value D0CF11E0A1B11AE1. Gary Kessler