The version of this document of 9 December 2016 has been published as:
Katz DS, Niemeyer KE, Smith AM, Anderson WL, Boettiger C, Hinsen K, Hooft R, Hucka M, Lee A, Löffler F, Pollard T, Rios F. (2016) Software vs. data in the context of citation. PeerJ Preprints 4:e2630v1 https://doi.org/10.7287/peerj.preprints.2630v1
Author/Editors: Daniel S. Katz, Kyle E. Niemeyer, Arfon M. Smith
Additional Authors: William L. Anderson, Carl Boettiger, Konrad Hinsen, Rob Hooft, Mike Hucka, Allen Lee, Frank Löffler, Tom Pollard, Bernadette M. Randles, Fernando Rios
This repository is intended to be used to discuss and document the differences between software and data in the context of citation in the research record.
It was created while the FORCE11 Software Citation Working Group [FORCE11 Software Citation Working Group] was writing the FORCE11 Software Citation Principles [Smith et al. 2016a], and while the editors were submitting them to PeerJ Computer Science [Smith et al. 2016b] and responding to reviewer comments.
We start with the idea that software, while similar to data in that it has not traditionally been cited in publications, is also different from data. For the purposes of this document, software is defined as programs that tell a computer which actions to perform. Software is made up of code, and might be in the form of packages, libraries, scripts, compiled code, proprietary packages, services, or any other form of instructions for the computer to interpret. Software also encompasses additional components such as comments, instructions, and other information necessary to run the software. In the context of research (e.g., in science), the term "data" usually refers to electronic records of observations made in the course of a research study ("raw data") or to information derived from such observations by some form of processing ("processed data"), as well as the output of simulation or modeling software ("simulated data"). In the following, we use the term "data" in this specific sense.
The confusion about the distinction between software and data comes in part from the much wider sense that the term "data" has in computing and information science, where it refers to anything that can be processed by a computer. In that sense, software is just a special kind of data.
The remainder of this document gives examples of these differences.
If you want to add a new difference, please do so via a pull request. Similarly, if you want to add a citation or a new explanation, please also do this via a pull request. If you want to discuss a difference (for example, you don't think it's correct), please open a new issue or discuss via an existing issue. If you do add text in a pull request, also add yourself as an additional author in that same request, following the existing format and keeping the additional author list in alphabetical order by surname. (And add a comma after all authors but the last one.)
Explanation if needed,
including or followed by:
Evidence: Citations
A commonsense definition of software is that it is "a set of instructions that direct a computer to do a specific task" [Chun 2004]. On the other hand, data is simply a collection of facts or measurements (real or simulated). In other words, software is functionally active, while data is passive. Of course, software (in form) can be considered data as well, especially to functional programmers familiar with LISP and other homoiconic languages [Kay 1969]. However, from the point of view of conducting research with software, the main difference is that software is associated with action: knowledge creation, information transformation, visualization, etc. An action can be thought of as a functional transformation between two states of data: from a "before" state (e.g., input files, parameter settings, unstructured or tacit information) to an "after" state (e.g., output files, transformed data, structured knowledge). That is, software generally performs a function upon something (e.g., software processes data), while data generally has a function performed upon it (e.g., data is processed by software). If we accept the definitions of software and data given at the beginning of this section, then (at least in scientific research) the difference between data and software can be summarized by the statement of [Matthews et al. 2010]: "we are more interested in what software does rather than what software is."
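The active/passive distinction above can be illustrated with a minimal sketch. The names `readings` and `calibrate`, and the numeric values, are invented for this illustration and do not come from any real project.

```python
# Hypothetical illustration: all names and values are invented for this sketch.

# Data: a passive collection of measurements (the "before" state).
readings = [12.1, 11.8, 12.4]


def calibrate(values, offset=0.5):
    """Software: an action that maps a 'before' state to an 'after' state."""
    return [round(v - offset, 2) for v in values]


# The "after" state is again data; the function is what acted upon it.
calibrated = calibrate(readings)
print(calibrated)

# The input data is left unchanged: it had a function performed upon it.
print(readings)
```

Here `calibrate` is of interest for what it does, while `readings` and `calibrated` are of interest for what they contain.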
Software exists to perform a task, while data does not. Software is fundamentally a logical construct, while data is fundamentally an empirical observation. Software can be used to express or explain processes and concepts, oftentimes with data as input. These differences have important consequences for how each may be re-used in the future: software may be used by any researchers seeking to apply the same methods, data by any researchers seeking evidence about the same facts.
In particular, software is generally subject to copyright protection as a creative work that can continue to evolve over time, while scientific data is frequently considered outside the domain of copyright because it consists of contextual facts about the world (you cannot copyright the height of Mt. Everest). Major scientific data repositories (e.g., Dryad, figshare) automatically apply licenses suited to data that may not be suited to software.
Evidence: "Can I apply a Creative Commons license to software?" [Creative Commons]; "Non-software licenses" [Choose a License]
Software suffers from a different type of bit rot than data: it is frequently built to use other software, leading to complex dependencies, and the software packages it depends on also change frequently.
In general, software must be constantly maintained and updated in order to continue to function as both the hardware and software environments on which it depends change. Operating systems, software and system libraries, programming language toolchains, and other compile-time and run-time dependencies all evolve as their respective maintainers and developers find and fix bugs, and as user requirements demand new features and capabilities. This is sometimes called "software rot" [Raymond 1996] and other times called "bit rot." On the other hand, bit rot for data, or data degradation [Wikipedia], is generally thought of as changes in the underlying hardware or storage media that hold the bits, or changes in the software capable of interpreting the data. This definition of bit rot also affects software, since software is actually stored as a set of bits on a filesystem, but software bit rot is generally thought of as a higher-level concern than data- or file-level bit rot.
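One way to see why the two kinds of rot differ: file-level bit rot can be detected by comparing checksums, whereas software rot can occur while the stored bits stay identical. The following is a minimal sketch of the checksum side only, assuming an invented byte string as the stored record; the helper names `sha256_hex` and `is_intact` are hypothetical.

```python
import hashlib


def sha256_hex(payload: bytes) -> str:
    """Return the SHA-256 digest of a byte string as hexadecimal text."""
    return hashlib.sha256(payload).hexdigest()


def is_intact(payload: bytes, expected_digest: str) -> bool:
    """Check whether the stored bits still match a previously recorded digest."""
    return sha256_hex(payload) == expected_digest


# Invented example record; the digest is taken when the file is archived.
original = b"temperature,12.1\n"
recorded_digest = sha256_hex(original)

# Simulate storage-level bit rot by flipping a single bit in the first byte.
corrupted = bytes([original[0] ^ 0x01]) + original[1:]

print(is_intact(original, recorded_digest))   # True
print(is_intact(corrupted, recorded_digest))  # False
```

A source file can pass such a check and still fail to build or run, because the libraries, compilers, or operating system it depends on have changed. That is the higher-level concern described above, and a checksum alone does not capture it.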
The lifetime of software can reach 20 years or more, especially for well-maintained projects. The life of software can end if the task it was supposed to do is no longer needed, or if other software does it in a better way. Data, on the other hand, often represents the results of an experiment. It might become less interesting with time, but it cannot be replaced, as it is connected to one particular experiment at that particular time. In this sense, software is replaceable (by other software), while data is usually not.
A 1995 NRC report, Preserving Scientific Data on Our Physical Universe [NRC 1995], provides the following recommendations regarding retention criteria and the appraisal process (p. 40): "As a general rule, all observational data that are nonredundant, useful, and documented well enough for most primary uses should be permanently maintained. Laboratory data sets are candidates for long-term preservation if there is no realistic chance of repeating the experiment, or if the cost and intellectual effort required to collect and validate the data were so great that the long-term retention is clearly justified. For both observational and experimental data, the following retention criteria should be used to determine whether a data set should be saved: uniqueness, adequacy of documentation (metadata), availability of hardware to read the data records, cost of replacement, and evaluation by peer review. Complete metadata should define the content, format or representation, structure, and context of a data set."
While software is often replaced by newer software, computational models and data analyses can be important digital artifacts that should be preserved [Rollins et al. 2014] along with datasets in order to properly verify or reproduce [Peng 2011] published findings. Long-term preservation of the software used in computational science is a wicked problem, as outlined in the final report from the 2013 meeting Preserving.exe: Toward a National Strategy for Software Preservation [Preserving.exe 2013]. At this time, best practices to facilitate reproducibility of computational science involve archiving the following, in durable, plaintext formats:
- the software itself, in source code form in a trusted digital repository
- structured or unstructured narrative documentation (e.g., the ODD protocol [Grimm et al. 2013]) specifically covering key components of the software
- descriptive provenance metadata on the software dependencies needed to compile and run the software as well as any input data dependencies
though these practices may change as virtualization and containerization become more common.
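As a rough illustration of the third item, descriptive metadata about run-time dependencies can be captured in a plaintext format such as JSON. The sketch below is an assumption-laden example for Python-based software only: the function name `describe_environment` and the field names are hypothetical, and it records only the interpreter, the platform, and the installed Python packages, not compilers, system libraries, or input data.

```python
import json
import platform
from importlib import metadata


def describe_environment():
    """Collect a plaintext-friendly record of the current Python environment.

    Hypothetical helper: the field names are chosen for this sketch and do
    not follow any particular metadata standard.
    """
    packages = {}
    for dist in metadata.distributions():
        name = dist.metadata.get("Name")
        if name:  # skip installations with missing or broken metadata
            packages[name] = dist.version
    return {
        "python_version": platform.python_version(),
        "implementation": platform.python_implementation(),
        "platform": platform.platform(),
        "packages": dict(sorted(packages.items())),
    }


record = describe_environment()

# JSON is a durable, plaintext format that can be archived with the source.
print(json.dumps(record, indent=2))
```

Such a record could be stored alongside the source code and narrative documentation, so that a later reader knows which versions the software was run against.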
[Choose a License] Choose an open source license, "Non-software licenses," http://choosealicense.com/non-software/ Accessed: 2016-08-16.
[Chun 2004] W. H. K. Chun, "On software, or the persistence of visual knowledge," Grey Room, vol. 18, pp. 26–51, 2004. doi:10.1162/1526381043320741
[Creative Commons] Creative Commons, FAQ, "Can I apply a Creative Commons license to software?", https://wiki.creativecommons.org/index.php/Frequently_Asked_Questions#Can_I_apply_a_Creative_Commons_license_to_software.3F Accessed: 2016-08-16.
[FORCE11 Software Citation Working Group] FORCE11 Software Citation Working Group, GitHub repository, https://github.com/force11/force11-scwg. Accessed: 2016-07-10.
[Grimm et al. 2013] Volker Grimm, Gary Polhill, Julia Touza, Documenting Social Simulation Models: The ODD Protocol as a Standard. In Simulating Social Complexity: A Handbook, pp. 117-133, 2013. http://dx.doi.org/10.1007/978-3-540-93813-2_7
[Matthews et al. 2010] B. Matthews, A. Shaon, J. Bicarregui, and C. Jones, “A framework for software preservation,” International Journal of Digital Curation, vol. 5, no. 1, pp. 91–105, 2010. doi:10.2218/ijdc.v5i1.145
[NRC 1995] National Research Council, Preserving Scientific Data on Our Physical Universe: A New Strategy for Archiving the Nation's Scientific Information Resources, 1995. http://www.nap.edu/catalog/4871.html
[Peng 2011] Roger D. Peng, Reproducible Research in Computational Science, Science, vol 334, issue 6060, pp. 1226-1227, 2011. http://dx.doi.org/10.1126/science.1213847
[Preserving.exe 2013] Library of Congress, Preserving.exe: Toward a National Strategy for Software Preservation, 2013. http://www.digitalpreservation.gov/multimedia/documents/PreservingEXE_report_final101813.pdf
[Rollins et al. 2014] Nathan D. Rollins, C. Michael Barton, Sean Bergin, Marco A. Janssen, Allen Lee, A Computational Model Library for publishing model documentation and code, Environmental Modelling and Software, vol 61, pp. 59-64, 2014. http://dx.doi.org/10.1016/j.envsoft.2014.06.022
[Smith et al. 2016a] A. M. Smith, D. S. Katz, K. E. Niemeyer, and FORCE11 Software Citation Working Group, “Software Citation Principles,” FORCE2016 Website, https://www.force11.org/software-citation-principles, 2016. Accessed: 2016-07-10.
[Smith et al. 2016b] A. M. Smith, D. S. Katz, K. E. Niemeyer, and FORCE11 Software Citation Working Group, “Software Citation Principles,” PeerJ Computer Science 2:e86, 2016. https://doi.org/10.7717/peerj-cs.86
[Wikipedia] Wikipedia, “Data degradation”. https://en.wikipedia.org/wiki/Data_degradation Accessed: 2016-11-23.
[Kay 1969] Kay, A. C. The Reactive Engine. The University of Utah, AAI7003806, 1969.
[Raymond 1996] Raymond, Eric S. The New Hacker's Dictionary. MIT Press, 1996.