Different types of data are acquired, processed and stored (preserved and/or archived) in different ways, and these can be discipline specific. When starting a new project and creating a Data Management Plan (DMP), one of the first decisions should be which file formats to use. Many proprietary file formats are “containers” for standard file formats. By packaging data into these containers, a software and/or hardware developer can provide additional functionality, usually by streamlining the analysis of data acquired on their platform. However, this has the negative consequence of making these data less interoperable.
Moreover, file formats can be either lossless or lossy. A lossless format preserves all of the original information, whether the data are stored uncompressed (such as uncompressed TIFF for images) or compressed reversibly (such as FLAC for audio). A lossy format (such as JPEG for images) reduces file size by permanently discarding information. It is common practice to analyse lossy data, but this does not necessarily mean that these are the data to keep for long-term storage. The most important file to consider for long-term storage throughout its curation lifecycle is most likely either the first file (the one initially captured from an instrument) or a direct, lossless, standard-format version of it (see also the guide on raw data, to be available soon).
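The difference between lossless and lossy storage can be illustrated with a small sketch. It uses only Python's standard `zlib` module; the byte sequence `raw` is a made-up stand-in for an instrument file, and the 4-bit quantisation is a toy substitute for a real lossy codec such as JPEG, not how JPEG actually works.

```python
import zlib

# Hypothetical 8-bit readings standing in for a raw instrument file.
raw = bytes([12, 12, 13, 200, 201, 199, 57, 57, 57, 58] * 100)

# Lossless: compression is reversible, so decompressing restores every byte.
compressed = zlib.compress(raw, level=9)
restored = zlib.decompress(compressed)

# Lossy (toy example): keep only the top 4 bits of each value.
# The discarded low bits cannot be recovered afterwards.
quantised = bytes(value & 0xF0 for value in raw)

print("raw size (bytes):        ", len(raw))
print("lossless size (bytes):   ", len(compressed))
print("lossless copy identical: ", restored == raw)
print("lossy copy identical:    ", quantised == raw)
```

The lossless copy is smaller than the original yet restores to identical bytes, whereas the quantised copy no longer matches the original. This is why the captured file, or a lossless version of it, is the safer candidate for long-term storage.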
The key considerations are longevity and interoperability, so that the data are as FAIR as possible. When choosing between proprietary and standard formats, the equipment used, the brand of software/hardware and the research discipline may all be contributing factors. There is no guarantee that existing proprietary file formats will still be supported in the future. For example, many Microsoft file formats such as Word and Excel may be in common use now, but they could still become obsolete. Software and file format obsolescence will be even more pronounced for bespoke file formats that were created as part of an individual project.
As an example, in the biomedical imaging field, the realisation that a huge variety of file formats exists led to an initiative to make them interoperable. Bio-Formats, developed by the Open Microscopy Environment (OME) consortium behind OMERO, is a software tool that reads many established proprietary and standard file formats and converts them to open standard formats. Image analysis software such as ImageJ (free and open source) has adopted Bio-Formats as a plugin, allowing users to read and write their image data without having to consider its origin. However, such tools may not exist for every discipline, and researchers should consider storing their acquired data in a standard format at the earliest available opportunity. Many commercial and open source software packages allow conversion of data into standard formats, and this should be exploited.
Over the course of the digital revolution, a number of file formats have become recognised as the formats of choice for longevity and interoperability.
As an example, the following table (from the UK Data Service) lists file formats that are either recommended or acceptable for different types of data.
When writing a DMP, researchers are advised to refer to tables such as this one to help decide on the best file formats for their project, and to state their choice clearly in the plan.
Type of data | Recommended formats | Acceptable formats |
Tabular data with extensive metadata (variable labels, code labels, and defined missing values) | | proprietary formats of statistical packages: SPSS (.sav), Stata (.dta), MS Access (.mdb/.accdb) |
Tabular data with minimal metadata (column headings, variable names) | | |
Geospatial data (vector and raster data) | | |
Textual data | | |
Image data | TIFF 6.0 uncompressed (.tif) | |
Audio data | Free Lossless Audio Codec (FLAC) (.flac) | |
Video data | | AVCHD video (.avchd) |
Documentation and scripts | | |
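A format policy stated in a DMP can also be checked automatically. The sketch below is only an illustration: the function names `classify` and `audit` and the example filenames are hypothetical, and the two sets contain only the extensions that appear in the table above, so they would need to be extended from the full UK Data Service table before real use.

```python
from pathlib import Path

# Assumption: only extensions named in the table above are listed here.
RECOMMENDED = {".tif", ".flac"}
ACCEPTABLE = {".sav", ".dta", ".mdb", ".accdb", ".avchd"}


def classify(filename: str) -> str:
    """Return 'recommended', 'acceptable' or 'review' for one file name."""
    suffix = Path(filename).suffix.lower()
    if suffix in RECOMMENDED:
        return "recommended"
    if suffix in ACCEPTABLE:
        return "acceptable"
    # Not in either set: a person should decide whether to convert the file.
    return "review"


def audit(filenames):
    """Group file names by their classification."""
    report = {"recommended": [], "acceptable": [], "review": []}
    for name in filenames:
        report[classify(name)].append(name)
    return report


# Hypothetical project files, for illustration only.
example_files = ["scan_001.TIF", "interview.flac", "survey.sav", "notes.docx"]
for status, names in audit(example_files).items():
    print(f"{status}: {names}")
```

A check on the extension alone cannot confirm details such as whether a TIFF file is actually uncompressed, so files flagged as recommended may still need inspection with a format-aware tool.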