Mapping FDDs to PRONOM and Wikidata Unique Identifiers
Background
This site uses a simple protocol to define unique identifiers: each format description document (known as an "fdd") is assigned a six-digit number, starting with 000001, and a prefix of "fdd" (e.g., fdd000001 for WAVE Audio File Format). New fdds are assigned the next available number in a sequential list. These unique identifiers are also used in the URL for each fdd, so fdd000001 has the URL https://www.loc.gov/preservation/digital/formats/fdd/fdd000001.shtml for its HTML version and https://www.loc.gov/preservation/digital/formats/fddXML/fdd000001.xml for its XML version. For more about format descriptions in XML, see Digital Formats Descriptions as XML.
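The identifier and URL pattern described above can be sketched in a few lines of Python. This is only an illustration of the naming scheme: the function names fdd_id and fdd_urls are hypothetical, and the two base URLs are taken from the fdd000001 example rather than from any published API.

```python
# Base URLs inferred from the fdd000001 example above (an assumption,
# not a documented API).
HTML_BASE = "https://www.loc.gov/preservation/digital/formats/fdd/"
XML_BASE = "https://www.loc.gov/preservation/digital/formats/fddXML/"


def fdd_id(number: int) -> str:
    """Return the identifier for a sequence number: 'fdd' plus six zero-padded digits."""
    if not 1 <= number <= 999999:
        raise ValueError(f"fdd number out of range: {number}")
    return f"fdd{number:06d}"


def fdd_urls(number: int) -> dict:
    """Return the HTML and XML URLs that follow the pattern shown for fdd000001."""
    ident = fdd_id(number)
    return {
        "html": f"{HTML_BASE}{ident}.shtml",
        "xml": f"{XML_BASE}{ident}.xml",
    }


if __name__ == "__main__":
    print(fdd_id(1))
    print(fdd_urls(1)["html"])
    print(fdd_urls(1)["xml"])
```

Calling fdd_urls(1) yields the two URLs quoted above for WAVE Audio File Format; whether a given number actually resolves depends on whether that fdd has been published.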
PRONOM, the file format registry hosted by The National Archives UK to support the file format identification tool DROID, uses PUIDs (PRONOM Unique Identifiers) such as fmt/6. Wikidata, a linked data site that aggregates file format information from a variety of data sources, uses a unique number with a "Q" prefix, such as Q217570.
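As a rough illustration, the three identifier styles can be told apart by shape alone. The sketch below is an assumption-laden example, not part of either registry's tooling: the function name classify_identifier is hypothetical, and the patterns are inferred from the examples on this page, plus the "x-fmt/" prefix that PRONOM also uses for some PUIDs.

```python
import re

# Patterns inferred from examples such as fdd000001, fmt/6 and Q217570.
# The optional "x-" prefix covers PRONOM's x-fmt/ identifiers; treat all
# three patterns as assumptions rather than formal specifications.
FDD_RE = re.compile(r"^fdd\d{6}$")
PUID_RE = re.compile(r"^(?:x-)?fmt/\d+$")
QID_RE = re.compile(r"^Q[1-9]\d*$")


def classify_identifier(value: str) -> str:
    """Return 'fdd', 'puid', 'qid', or 'unknown' based on the identifier's shape."""
    text = value.strip()
    if FDD_RE.match(text):
        return "fdd"
    if PUID_RE.match(text):
        return "puid"
    if QID_RE.match(text):
        return "qid"
    return "unknown"


if __name__ == "__main__":
    for sample in ("fdd000001", "fmt/6", "Q217570", "fdd 518"):
        print(sample, "->", classify_identifier(sample))
```

Shorthand such as "fdd 518" is reported as unknown here because the sketch accepts only the full six-digit form.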
Starting around 2018, we have included information on equivalent matches found between the Library of Congress fdds and PRONOM or Wikidata. This information can be found in the File Type Signifiers section of the fdd so that format researchers have access to multiple sources of data. For example, WebM (fdd 518), a non-proprietary, royalty-free, open source format developed and maintained by Google and optimized for web-based media content, has exact matches in both PRONOM (fmt/573) and Wikidata (Q309440). Each of the three resources describes this format at the same level of granularity and specificity.
This isn't always the case, however. Sometimes the match isn't exact. Take, for instance, TIFF 6 (fdd 022). Wikidata (Q27231633) has an exact match for this, but PRONOM (fmt/353) doesn't distinguish between the different versions of TIFF. The PRONOM entry is therefore not an exact match, which is important to document because the versions of TIFF differ in ways that a file format researcher or digital preservation practitioner would want to understand.
And in some cases, there's no match at all. This can just be a case of PRONOM or Wikidata not making an entry for that format yet, as with DivX Video Codec (fdd 069), or there just may not be a match at the same level of hierarchy.
This site has a class of fdds called "family fdds" that describe the common characteristics of a related group of formats, such as PDF (Portable Document Format) Family (fdd 030), while separate fdds go into more detail on subtypes or versions, such as PDF 2.0, ISO 32000-2 (2017, 2020) (fdd 474). There's no PRONOM match for fdd 030 because that resource doesn't describe formats at the aggregate "family" level, but there is one (fmt/1129) for the subtype PDF 2.0.
There's another class of content called "combo packs" which describe a specific encoding in a wrapper or container such as QuickTime File Format with V210 Video Encoding (fdd 368). There simply aren't equivalent entries in PRONOM or Wikidata for this type of fdd.
Because this mapping process is relatively recent, many fdds, especially older ones, do not yet include PRONOM or Wikidata information. This data will be filled in over time.
FDD - PUID - Wikidata mappings
This spreadsheet provides a mapping of Library of Congress fdds to PRONOM PUID and Wikidata QID identifiers and is updated monthly to reflect new or revised matches. The spreadsheets are saved as CSV files with the file names representing the date of the data pull in YYYYMMDD format.
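Because each CSV file name carries its data pull date in YYYYMMDD format, the most recent spreadsheet in a set of downloads can be picked out programmatically. The following is a minimal sketch under stated assumptions: the function names pull_date and latest_file and the file names in the usage lines are hypothetical, and the only thing assumed about a real file name is that it contains exactly one eight-digit date.

```python
import re
from datetime import date, datetime

# Matches a standalone run of exactly eight digits (assumed to be YYYYMMDD).
DATE_RE = re.compile(r"(?<!\d)(\d{8})(?!\d)")


def pull_date(filename: str) -> date:
    """Extract the YYYYMMDD data pull date embedded in a file name."""
    match = DATE_RE.search(filename)
    if match is None:
        raise ValueError(f"no YYYYMMDD date found in {filename!r}")
    # strptime raises ValueError for impossible dates such as 20241340.
    return datetime.strptime(match.group(1), "%Y%m%d").date()


def latest_file(filenames):
    """Return the file name with the most recent data pull date."""
    return max(filenames, key=pull_date)


if __name__ == "__main__":
    # Hypothetical file names, for illustration only.
    names = ["mapping_20231201.csv", "mapping_20240115.csv"]
    print(latest_file(names), pull_date(latest_file(names)))
```

Parsing the date rather than sorting the names as strings keeps the comparison correct even if the text surrounding the date differs between files.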
This zip file contains the README instructions and Python scripts to run the mapping on the XML versions of the fdds: