Research Data Management (RDM): File Formats

UCT Libraries Research Data Services provide guidance and support for all aspects of the data lifecycle, from planning your data management strategy during the proposal phase through preserving your data at the conclusion of your project.

Choosing File Formats

In planning a research project, it is important that you consider which file formats you will use to store your data. In some cases, this will be dictated by the software you are using or the conventions of your discipline. In other cases you may have to make a choice between several options.

These are likely to be some of the key factors in your decision-making:

  • what software and formats you or your colleagues have used in past projects
  • any discipline-specific norms (and any peer support that comes with them)
  • what software is compatible with hardware you already have
  • whether you have funding for new software
  • how you plan to analyse, sort, or store your data

But you should also consider:

  • what formats will be easiest to share with colleagues for future projects
  • what formats are at risk of obsolescence, because of new versions or their dependence on particular software
  • what formats will allow you to open and read your data in the future
  • what formats will be the easiest to annotate with metadata so that you and others can interpret them days, months, or years in the future

In some cases, it might be best to use one format for data collection and analysis, and then convert your data to another format for archiving once your project is complete.

Best formats for preservation

If you are not aware of any disciplinary standards, these are some good file formats for preserving the most common data types:

  • Textual data: XML, TXT, HTML, PDF/A (Archival PDF)
  • Tabular data (including spreadsheets): CSV
  • Databases: XML, CSV
  • Images: TIFF, PNG, JPEG (note: JPEG is a 'lossy' format that loses information each time it is re-saved, so use it only if you are not concerned about image quality)
  • Audio: FLAC, WAV, MP3

source: Data Management Guide - University of Cambridge
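
As an illustration of converting a database to CSV for preservation, here is a minimal Python sketch using only the standard library. It assumes the data lives in a SQLite database; the function name `export_table_to_csv`, the table name `samples`, and the sample rows are hypothetical and not part of the guidance above.

```python
import csv
import io
import sqlite3


def export_table_to_csv(conn: sqlite3.Connection, table: str, out) -> int:
    """Write every row of `table` to `out` as CSV with a header row.

    Returns the number of data rows written.
    """
    # Table names cannot be bound as SQL parameters, so quote the identifier.
    quoted = '"' + table.replace('"', '""') + '"'
    cursor = conn.execute(f"SELECT * FROM {quoted}")
    writer = csv.writer(out, lineterminator="\n")
    # cursor.description holds one tuple per column; item 0 is the column name.
    writer.writerow([col[0] for col in cursor.description])
    count = 0
    for row in cursor:
        writer.writerow(row)
        count += 1
    return count


# Demonstration with an in-memory database and made-up sample data.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE samples (site TEXT, depth_m REAL)")
conn.executemany(
    "INSERT INTO samples VALUES (?, ?)",
    [("Site A", 1.5), ("Site B, north", 2.0)],
)
buffer = io.StringIO()
rows_written = export_table_to_csv(conn, "samples", buffer)
csv_text = buffer.getvalue()
conn.close()
```

When writing to a real file rather than an in-memory buffer, you would typically open it with `open(path, "w", newline="", encoding="utf-8")` so that the output uses a standard character encoding.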

File Formats

Ideal File Format Types:

The file format in which you save your research has long-term implications for use and access. For example, if the format is proprietary, its long-term accessibility is unpredictable, because it depends on the success and longevity of the business behind it. Technology changes, so researchers should plan for both hardware and software obsolescence and choose file formats that will remain usable and accessible in the long term. The following guidelines can help you choose an appropriate file format for your research:

  • Non-proprietary
  • Uncompressed
  • Unencrypted
  • Commonly used by the general research community
  • Open, documented standards
  • Using standard character encodings (ASCII, UTF-8)
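
One way to check the character-encoding guideline is to test whether a file's bytes decode cleanly as ASCII or UTF-8. The Python sketch below is only an illustration: the function name `classify_encoding` is hypothetical, and a file that decodes as UTF-8 is not guaranteed to have been written as UTF-8.

```python
def classify_encoding(data: bytes) -> str:
    """Return 'ascii', 'utf-8', or 'other' for the given bytes."""
    try:
        data.decode("ascii")
        return "ascii"
    except UnicodeDecodeError:
        pass
    try:
        data.decode("utf-8")
        return "utf-8"
    except UnicodeDecodeError:
        # Not valid ASCII or UTF-8: likely a legacy or binary encoding.
        return "other"
```

To check a file on disk, read it in binary mode, for example `classify_encoding(open(path, "rb").read())`, and consider re-saving anything reported as 'other' in UTF-8 before archiving.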

Preferred File Formats:

  • Text: XML, PDF/A, HTML, ASCII, UTF-8 (not Word)
  • Tabular Data: CSV (not Excel)
  • Still Images: TIFF, JPEG 2000, PDF, PNG, BMP (not GIF or JPG)
  • Moving Images: MOV, MPEG, AVI, MXF (not Quicktime)
  • Sounds: WAVE, AIFF, MP3, MXF
  • Databases: XML, CSV
  • Statistics: ASCII, DTA, POR, SAS, SAV
  • Containers: TAR, GZIP, ZIP
  • Geospatial: SHP, DBF, GeoTIFF, NetCDF
  • Web Archive: WARC
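
The "not Word", "not Excel", and "not GIF or JPG" notes in the list above can be turned into a simple check on file names. The Python sketch below is an assumption-laden illustration: the mapping from file extensions to formats and the function name `flag_discouraged` are hypothetical, and the suggested alternatives are copied from the list above.

```python
from pathlib import PurePath

# Hypothetical mapping: extension of a discouraged format -> preferred
# alternatives taken from the list above.
DISCOURAGED = {
    ".doc": "XML, PDF/A, HTML, or plain text",
    ".docx": "XML, PDF/A, HTML, or plain text",
    ".xls": "CSV",
    ".xlsx": "CSV",
    ".gif": "TIFF, JPEG 2000, or PNG",
    ".jpg": "TIFF, JPEG 2000, or PNG",
    ".jpeg": "TIFF, JPEG 2000, or PNG",
}


def flag_discouraged(filenames):
    """Return {filename: suggested alternatives} for discouraged formats."""
    flagged = {}
    for name in filenames:
        # Compare extensions case-insensitively (".DOCX" and ".docx" match).
        suffix = PurePath(name).suffix.lower()
        if suffix in DISCOURAGED:
            flagged[name] = DISCOURAGED[suffix]
    return flagged


report = flag_discouraged(["notes.DOCX", "data.csv", "scan.tif", "photo.jpg"])
```

Here `report` contains entries for `notes.DOCX` and `photo.jpg` only. A file extension is a hint rather than proof of a file's format, so treat the result as a prompt for manual review.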

Oregon State University has a table of other acceptable formats in addition to the preferred file formats.

Recommended and Acceptable File Formats

Type of data, with recommended and acceptable formats

Tabular data with extensive metadata (variable labels, code labels, and defined missing values)

Recommended formats:

  • SPSS portable format (.por)
  • delimited text and command ('setup') file (SPSS, Stata, SAS, etc.)
  • structured text or mark-up file of metadata information, e.g. DDI XML file

Acceptable formats:

  • proprietary formats of statistical packages: SPSS (.sav), Stata (.dta), MS Access (.mdb/.accdb)

Tabular data with minimal metadata (column headings, variable names)

Recommended formats:

  • comma-separated values (.csv)
  • tab-delimited file (.tab)
  • delimited text with SQL data definition statements

Acceptable formats:

  • delimited text (.txt) with characters not present in data used as delimiters
  • widely-used formats: MS Excel (.xls/.xlsx), MS Access (.mdb/.accdb), dBase (.dbf), OpenDocument Spreadsheet (.ods)
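
For delimited text, the delimiter must be a character that does not occur anywhere in the data. A minimal Python sketch of that check follows; the function name `pick_delimiter`, the candidate list, and the sample rows are hypothetical, and the sketch assumes that no field contains a line break.

```python
CANDIDATES = ["\t", "|", ";", "~"]


def pick_delimiter(rows, candidates=CANDIDATES):
    """Return the first candidate that appears in no field of `rows`."""
    for delim in candidates:
        if not any(delim in str(field) for row in rows for field in row):
            return delim
    raise ValueError("every candidate delimiter occurs in the data")


# Made-up sample data: one field contains "|" and another contains a tab.
rows = [
    ["id", "note"],
    ["1", "left|right"],
    ["2", "tab\there"],
]
delimiter = pick_delimiter(rows)
lines = [delimiter.join(row) for row in rows]
```

With this sample data the tab and the pipe are rejected because they occur in the fields, so the semicolon is chosen.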

Geospatial data (vector and raster data)

Recommended formats:

  • ESRI Shapefile (.shp, .shx, .dbf, .prj, .sbx, .sbn optional)
  • geo-referenced TIFF (.tif, .tfw)
  • CAD data (.dwg)
  • tabular GIS attribute data
  • Geography Markup Language (.gml)

Acceptable formats:

  • ESRI Geodatabase format (.mdb)
  • MapInfo Interchange Format (.mif) for vector data
  • Keyhole Mark-up Language (.kml)
  • Adobe Illustrator (.ai), CAD data (.dxf or .svg)
  • binary formats of GIS and CAD packages

Textual data

Recommended formats:

  • Rich Text Format (.rtf)
  • plain text, ASCII (.txt)
  • eXtensible Mark-up Language (.xml) text according to an appropriate Document Type Definition (DTD) or schema

Acceptable formats:

  • Hypertext Mark-up Language (.html)
  • widely-used formats: MS Word (.doc/.docx)
  • some software-specific formats: NUD*IST, NVivo and ATLAS.ti

Image data

Recommended formats:

  • TIFF 6.0 uncompressed (.tif)

Acceptable formats:

  • JPEG (.jpeg, .jpg, .jp2) if original created in this format
  • GIF (.gif)
  • TIFF other versions (.tif, .tiff)
  • RAW image format (.raw)
  • Photoshop files (.psd)
  • BMP (.bmp)
  • PNG (.png)
  • Adobe Portable Document Format (PDF/A, PDF) (.pdf)

Audio data

Recommended formats:

  • Free Lossless Audio Codec (FLAC) (.flac)

Acceptable formats:

  • MPEG-1 Audio Layer 3 (.mp3) if original created in this format
  • Audio Interchange File Format (.aif)
  • Waveform Audio Format (.wav)

Video data

Recommended formats:

  • MPEG-4 (.mp4)
  • OGG video (.ogv, .ogg)
  • motion JPEG 2000 (.mj2)

Acceptable formats:

  • AVCHD video (.avchd)

Documentation and scripts

Recommended formats:

  • Rich Text Format (.rtf)
  • PDF/UA, PDF/A or PDF (.pdf)
  • XHTML or HTML (.xhtml, .htm)
  • OpenDocument Text (.odt)

Acceptable formats:

  • plain text (.txt)
  • widely-used formats: MS Word (.doc/.docx), MS Excel (.xls/.xlsx)
  • XML marked-up text (.xml) according to an appropriate DTD or schema, e.g. XHTML 1.0