Sustainability of Digital Formats: Planning for Library of Congress Collections


bzip2

Format Description Properties

Identification and description

Full name bzip2
Description

bzip2 is a freely available, patent-free data compression format and program created by Julian Seward. The name refers to both the file format and the program that produces it. The program compresses single files only. It was created as the successor to bzip in order to avoid potential patent issues. See History for more information.

Different versions of bzip2 maintain file format compatibility: newer versions can read files created by older versions, which provides a degree of stability. However, the format's creator acknowledges limitations in the compressed file format, and provides source code for decompressing older files created by bzip-0.21.

A bz2 stream consists of a 4-byte header, followed by zero or more compressed blocks, followed by an end-of-stream marker. The end-of-stream marker contains a 32-bit CRC computed over the whole uncompressed (plaintext) stream. The compressed blocks are bit-aligned, and no padding occurs between them.
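The layout described above can be inspected with Python's standard bz2 module. This is a minimal sketch, not a format parser: the function name is hypothetical, and the meaning of the fourth header byte (a block-size digit from 1 to 9) and the two 48-bit marker values are assumptions drawn from common descriptions of the format rather than from this page.

```python
import bz2

# Assumed marker values (not stated on this page).
BLOCK_MAGIC = bytes.fromhex("314159265359")          # start of a compressed block
END_OF_STREAM_MAGIC = bytes.fromhex("177245385090")  # start of the end-of-stream marker


def read_stream_header(stream: bytes):
    """Return (signature, block_size_level) from the 4-byte stream header."""
    if len(stream) < 4 or stream[:3] != b"BZh":
        raise ValueError("not a bzip2 stream")
    level = chr(stream[3])
    if level not in "123456789":
        raise ValueError("unexpected block-size digit in header")
    return stream[:3], int(level)


# A stream with zero compressed blocks: header, then the end-of-stream
# marker (48-bit magic followed by the 32-bit whole-stream CRC).
empty = bz2.compress(b"")
header, marker, crc = empty[:4], empty[4:10], empty[10:14]

# A non-empty stream: the first block begins directly after the header.
# Later blocks and the end marker are bit-aligned, so they cannot be
# located by simple byte slicing like this.
non_empty = bz2.compress(b"hello", 5)
first_block_magic = non_empty[4:10]
```

Only the first block and the marker of an empty stream are guaranteed to start on a byte boundary, which is why the sketch stops there.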

Production phase May be used at any lifecycle phase for bundling/packaging files together for exchange, storage, or distribution.
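Because bzip2 compresses a single stream, bundling several files usually means wrapping them in an archive format first. The sketch below assumes tar as that archive layer (tar is not mentioned on this page) and uses Python's standard tarfile module; the function names are hypothetical.

```python
import io
import tarfile


def bundle_to_tar_bz2(files: dict) -> bytes:
    """Pack {name: bytes} into a tar archive compressed with bzip2."""
    buffer = io.BytesIO()
    with tarfile.open(fileobj=buffer, mode="w:bz2") as archive:
        for name, payload in files.items():
            info = tarfile.TarInfo(name=name)
            info.size = len(payload)
            archive.addfile(info, io.BytesIO(payload))
    return buffer.getvalue()


def unbundle_tar_bz2(blob: bytes) -> dict:
    """Read a bzip2-compressed tar archive back into {name: bytes}."""
    restored = {}
    with tarfile.open(fileobj=io.BytesIO(blob), mode="r:bz2") as archive:
        for member in archive.getmembers():
            if member.isfile():
                restored[member.name] = archive.extractfile(member).read()
    return restored


sample = {"readme.txt": b"hello", "data/values.csv": b"a,b\n1,2\n"}
package = bundle_to_tar_bz2(sample)
```

The resulting bytes are an ordinary bzip2 stream whose uncompressed content happens to be a tar archive, so any bzip2 decompressor can recover the tar layer.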

Local use

LC experience or existing holdings The Library of Congress has a small number of bzip2 files across its varied collections.
LC preference bzip2 is not included in the Library of Congress Recommended Formats Statement.

Sustainability factors

Disclosure No formal specification for the bzip2 file format exists. Comments welcome.
Documentation

Two unofficial documentation resources are commonly cited.

Adoption

Widely adopted.

The bzip2 file format “ships standard on many Unix/Linux systems.”

Often compared to gzip and ZIP File Format (PKWARE).

Licensing and patents

The bzip2 homepage states that the software is licensed under a GNU General Public License (GPL), but it is unclear which version of the GPL would apply. Other sources conflict with this and describe the license as a Berkeley Software Distribution (BSD)-style license. Comments welcome.

Transparency Depends upon algorithms and tools to read. Would require sophistication to build tools from scratch.
Self-documentation

Identifies itself as a bzip2-compressed file through its magic numbers (see the Magic numbers entry under File type signifiers and format identifiers). The format appears to make no specific provision for including other metadata, although documentation is sparse. Comments welcome.

Accessibility Features

No specific features in the file format. Features to support accessibility would be found in the bundled and compressed files (such as embedded captions and subtitles in audiovisual content, tagged and structured text in textual documents, and alt text for images). Aggregate files can also contain separate files for transcripts, timed text or captions as part of the bundled package. See Relationships to other formats for details.

External dependencies None, beyond the availability of software to extract and decompress the files contained in a bzip2 file.
Technical protection considerations Does not support encryption.

Quality and functionality factors

Aggregate
Compression According to the bzip2 software official manual, bzip2 files are compressed using the Burrows-Wheeler block-sorting text compression algorithm, and Huffman coding.
Support for Error Detection Unknown. Comments welcome.
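To illustrate the block-sorting step named in the Compression entry above, here is a textbook sketch of the Burrows-Wheeler transform and its inverse. It is deliberately naive and is not how the bzip2 program implements the transform: the function names and the "$" sentinel are assumptions made for illustration, and the later Huffman coding stage is omitted.

```python
def bwt(text: str, sentinel: str = "$") -> str:
    """Return the last column of the sorted rotations of text + sentinel."""
    if sentinel in text:
        raise ValueError("sentinel must not occur in the input")
    padded = text + sentinel
    rotations = sorted(padded[i:] + padded[:i] for i in range(len(padded)))
    return "".join(rotation[-1] for rotation in rotations)


def inverse_bwt(last_column: str, sentinel: str = "$") -> str:
    """Rebuild the original text by repeatedly prepending and sorting."""
    table = [""] * len(last_column)
    for _ in range(len(last_column)):
        table = sorted(ch + row for ch, row in zip(last_column, table))
    # The row that ends with the sentinel is the original padded text.
    original = next(row for row in table if row.endswith(sentinel))
    return original[:-1]


transformed = bwt("banana")  # "annb$aa"
```

The transform does not shrink the data by itself. It tends to group identical characters together, which makes the output easier for later coding stages to compress.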

File type signifiers and format identifiers

Tag Value Note
Filename extension bz2
Used for bzip2. See Wikidata: https://www.wikidata.org/wiki/Q27866052
Internet Media Type application/x-bzip2
See the Mozilla list of common MIME types. Not listed in IANA.
Magic numbers Hex: 42 5a 68
ASCII: BZh

For more details see:

Note that this header, when converted from hexadecimal to ASCII, reads “BZh”. “BZ” stands for “bzip”, and “h” stands for “Huffman coding,” one of the compression techniques used by bzip2. Some sources, such as Wikipedia, cite the magic numbers as “BZh” rather than in hexadecimal.
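The signature check described above can be sketched in a few lines of Python. The function names are hypothetical. A matching signature only suggests that a file is bzip2 data; it does not prove the rest of the stream is valid.

```python
import bz2
import os
import tempfile

# The hexadecimal magic numbers from the table; identical to b"BZh" in ASCII.
BZIP2_SIGNATURE = bytes.fromhex("42 5a 68")


def starts_with_signature(head: bytes) -> bool:
    """Return True if the given bytes begin with the bzip2 magic numbers."""
    return head[:3] == BZIP2_SIGNATURE


def file_has_signature(path: str) -> bool:
    """Read only the first three bytes of a file and compare them."""
    with open(path, "rb") as handle:
        return starts_with_signature(handle.read(3))


def demo() -> tuple:
    """Write one bzip2 file and one plain file, then check both."""
    with tempfile.TemporaryDirectory() as folder:
        packed = os.path.join(folder, "sample.bz2")
        plain = os.path.join(folder, "sample.txt")
        with open(packed, "wb") as handle:
            handle.write(bz2.compress(b"example payload"))
        with open(plain, "wb") as handle:
            handle.write(b"example payload")
        return file_has_signature(packed), file_has_signature(plain)
```

Checking content rather than the bz2 filename extension is useful when files have been renamed or delivered without extensions.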

Pronom PUID x-fmt/268
See https://www.nationalarchives.gov.uk/PRONOM/x-fmt/268
Wikidata Title ID Q27866052
See https://www.wikidata.org/wiki/Q27866052

Notes

General

The bzip2 program, and by extension the bzip2 file format, is based on its predecessor, bzip. Despite the similarity in name and appearance, bzip2 was rewritten and re-engineered to address potential patent issues with bzip. The format created by bzip2 is not compatible with bzip, and the two were deliberately not made compatible so that the patent concerns would stay avoided. Seward expressed a commitment to backwards compatibility for future changes. The predecessor program bzip is no longer available.

History

Julian Seward released bzip2, version 0.15, in July 1996. The compressor’s popularity grew over the next several years due to its stability.

Julian Seward released version 1.0 in late 2000.

In June 2019 Federico Mena became the new maintainer of bzip2.

In 2019, Mark Wielaard began maintaining a bzip2 stable repository at Sourceware. In June 2021 Micah Snyder became the new maintainer of the Sourceware repository.


Format specifications


Useful references

URLs


Last Updated: 04/30/2024