# Effective Documentation of Research
Research papers in academia typically follow a structured format, beginning with a domain-specific topic and a comprehensive literature review that establishes the research questions or hypotheses. We won’t go into detail about how these efforts are done, but suffice it to say that the general process follows a common trajectory:
1. The study is designed.
2. Data is collected and analyzed.
3. Results are presented and discussed in relation to existing literature.
4. The conclusion summarizes findings and suggests future research directions.
Although these publications nearly always present results in great detail, with tables, figures, and visuals to enhance understanding, they do not always include supplementary materials that invite replication and validation. In effect, this omission inhibits reproducibility and transparency, which hinders collaboration and leads to redundant effort.
Supplementary materials provide additional context, data, or analyses beyond what can be included in the main paper, enhancing transparency and reproducibility. They often contain crucial details, such as raw data, code, or extended analyses, that are necessary for fully understanding and replicating the study.
Without access to these materials, other researchers may struggle to reproduce the results or verify the findings, leading to uncertainty about the robustness of the research. Supplementary materials can help augment understanding of the research process, methodologies, and any limitations or caveats associated with the study.
In the absence of published materials to reproduce results, it becomes challenging for readers to evaluate the validity and reliability of the research, potentially undermining trust in the scientific process. Furthermore, collaborative efforts in science rely on the ability to access and build upon existing research. Supplementary materials enable researchers to engage with and expand upon previous work by providing access to the relevant underlying data and methodologies. Scientific collaboration benefits directly from the inclusion of supplementary materials, which in turn supports scientific progress and innovation.
There are several ways to structure supplementary materials when doing research in industry:
## GitHub repositories
A GitHub repository serves as a central hub for hosting and sharing code and documentation. To make it easy for others to use, the repository should be organized into clear, intuitive directories, with descriptive names and README files that provide an overview of the contents and instructions for usage. Code files should be well commented and follow best practices and coding standards (covered in detail in the chapter on “Open Access to Data and Code”). Issues and discussions facilitate communication and problem-solving among collaborators, while version control and continuous integration (CI) tools automate testing and deployment, ensuring code quality and reliability.
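One lightweight way to enforce such conventions is a small check that can run in CI. The sketch below is illustrative, not from this chapter: the expected entry names (`README.md`, `src`, `docs`, `tests`) are assumptions about one common repository layout, and the temporary directory merely stands in for a real clone.

```python
from pathlib import Path
import tempfile

# Illustrative layout conventions; adjust to your own repository structure.
EXPECTED = ["README.md", "src", "docs", "tests"]

def check_repo_layout(repo: Path) -> list[str]:
    """Return the expected entries missing from the repository root."""
    return [name for name in EXPECTED if not (repo / name).exists()]

# Demonstrate on a throwaway directory standing in for a real repository.
with tempfile.TemporaryDirectory() as tmp:
    repo = Path(tmp)
    (repo / "README.md").write_text("# Project overview and usage\n")
    (repo / "src").mkdir()
    missing = check_repo_layout(repo)
    print(missing)  # entries we forgot to create, e.g. docs and tests
```

A script like this can be wired into a CI job so that a pull request fails early when the agreed-upon structure is violated.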
## Data hosting platforms
Datasets should be published outside of GitHub itself, so that repositories are not encumbered by the often large volume of scientific data at clone and build time. There are a number of discipline-specific and general-purpose data repositories where datasets can be published for public access:
- Zenodo: A general-purpose repository operated by CERN, which allows researchers to deposit datasets, software, and other research outputs.
- Open Science Framework (OSF): A free, open-source web platform developed by the Center for Open Science (COS). It serves as a collaborative research management tool designed to support the entire research lifecycle. OSF provides researchers with a centralized space to organize and manage their research projects, data, and materials.
- Governmental and intergovernmental platforms: Some governmental agencies and intergovernmental organizations provide platforms for hosting and sharing scientific datasets.
  - NOAA’s National Centers for Environmental Information (NCEI): Provides access to a wide range of environmental datasets, including climate, weather, oceanographic, and geophysical data.
  - NASA Earthdata: Offers access to a variety of Earth science datasets, including satellite imagery, climate data, and atmospheric observations.
  - European Union Open Data Portal: Hosts datasets from various EU institutions and agencies, covering topics such as agriculture, environment, health, and research.
- Cloud storage platforms: Cloud storage services such as Amazon S3, Google Cloud Storage, and Microsoft Azure Blob Storage can also be used to host scientific datasets publicly. Researchers can upload their datasets to these platforms and configure access permissions to make them publicly accessible.
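Whichever platform hosts the data, it helps to publish a checksum manifest alongside the dataset so downstream users can verify that their download matches the deposited files. A minimal sketch of one way to do this follows; the filenames and manifest format are illustrative, not prescribed by any of the platforms above.

```python
import hashlib
import tempfile
from pathlib import Path

def sha256_of(path: Path) -> str:
    """Compute the SHA-256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 16), b""):
            digest.update(chunk)
    return digest.hexdigest()

def write_manifest(data_dir: Path) -> Path:
    """Write a SHA256SUMS-style manifest covering every file in data_dir."""
    manifest = data_dir / "MANIFEST.sha256"
    lines = [
        f"{sha256_of(p)}  {p.name}"
        for p in sorted(data_dir.iterdir())
        if p.is_file()
    ]
    manifest.write_text("\n".join(lines) + "\n")
    return manifest

# Demonstrate on a throwaway directory standing in for a dataset.
with tempfile.TemporaryDirectory() as tmp:
    data = Path(tmp)
    (data / "observations.csv").write_text("station,temp\nA,21.3\n")
    manifest_text = write_manifest(data).read_text()
    print(manifest_text)
```

Users who download the dataset can then recompute the digests locally (for example with `sha256sum -c`) and confirm nothing was corrupted or silently changed between versions.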
## Documentation websites
It’s best practice to host documentation through a human-readable, navigable, and easily discoverable interface. A common approach is to build a dedicated website that renders the documentation in an intuitive and digestible format. These websites are often hosted from the affiliated GitHub repository itself using GitHub Pages. Some great utilities for building such websites when documenting a project or workflow include:
- Jupyter Book (discussed in the chapter on “Use of Notebooks” and, as a reminder, what was used to build this website itself)
- Quarto, which is especially useful for those working in the R programming language
- Read the Docs, a great option when documenting something like a software package
In effect, dedicated documentation websites for methods, code, and datasets serve as comprehensive guides that show users how to use, understand, and contribute to a project. The goal is to offer clear instructions and examples for setting up, configuring, and using the provided code and data. In some cases, one might include troubleshooting guides and common error messages to help downstream users resolve issues they encounter while running the code. Best practice also involves including API references that document the key functions, classes, and modules, along with brief explanations of the logic behind notable design decisions. If contributions to the published work are anticipated or solicited, it’s helpful to include guidance on the contribution process, coding standards, and testing procedures. Lastly, it’s important to reflect versioning clearly, for example in a changelog.

Suffice it to say, a well-documented project is more likely to be adopted by other researchers and collaborators at large. Documentation websites showcase a project’s capabilities, benefits, and use cases, attracting a wider audience and increasing its visibility within the community.
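Much of the API-reference material described above can live directly in the code as docstrings, which tools like Read the Docs (via Sphinx) can render into the website automatically. The function below is a hypothetical example, not from any chapter in this book; it simply illustrates the documentation style, with parameters, return values, errors, and a runnable usage example all recorded next to the implementation.

```python
def rolling_mean(values: list[float], window: int) -> list[float]:
    """Compute a simple trailing rolling mean.

    Parameters
    ----------
    values : list of float
        The input series, in chronological order.
    window : int
        Number of trailing observations to average; must be positive
        and no larger than ``len(values)``.

    Returns
    -------
    list of float
        One mean per position from index ``window - 1`` onward, so the
        result has ``len(values) - window + 1`` entries.

    Raises
    ------
    ValueError
        If ``window`` is not in ``[1, len(values)]``.

    Examples
    --------
    >>> rolling_mean([1.0, 2.0, 3.0, 4.0], window=2)
    [1.5, 2.5, 3.5]
    """
    if window <= 0 or window > len(values):
        raise ValueError("window must be in [1, len(values)]")
    return [
        sum(values[i - window + 1 : i + 1]) / window
        for i in range(window - 1, len(values))
    ]

print(rolling_mean([1.0, 2.0, 3.0, 4.0], window=2))  # → [1.5, 2.5, 3.5]
```

A side benefit of the `Examples` section is that it doubles as a test: `doctest` or `pytest --doctest-modules` can execute it, so the published documentation stays verifiably in sync with the code.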
Some examples of documentation websites built in this capacity include: