What is the data life-cycle and how can ARCC play a role in it?

In this section of the workshop we will discuss data management for research workflows and why it is important, and introduce how you can use ARCC resources to manage your data. This page gives background information for future topics. If you are looking for specific examples, please head back to the main Data Management page to navigate to other pages.

It is also important to note that the content of these pages consists of general suggestions and guidelines to assist in your research workflows. The content is NOT a set of rules or requirements for using ARCC resources during your research project.



Research Data Life-cycle

The life-cycle of data in a research project can be broken down into multiple phases. These can be thought of as distinct, but they often blend into each other with little to distinguish one from the next. Below we provide details and guidelines for each phase.

 

Figure: Research Data Life-cycle (ResearchDataLife.png)

The Planning Phase

Data management planning is an often overlooked but critical phase of the Research Data Management Life-cycle. Not only is a plan useful for carrying out your research project, but a formalized plan is often required by funding agencies such as the National Science Foundation (NSF) and the National Institutes of Health (NIH), among many others. The planning phase usually comes after a research project has been conceptualized but before the project is underway (or even funded), and it can always be revisited informally. During this phase it is important to establish goals for your data and to consider questions such as:

  • What kind of data is required to answer our research question?

  • What file formats will be collected?

  • Is there a particular software needed in the other phases that requires the data to be formatted in a particular way?

  • Are there any federal compliance requirements?

  • How will the data be stored and protected prior to analysis?

  • Will the data be preserved or discarded after the project is complete?


How ARCC Can Help With Planning

ARCC is a good resource for many phases of the Research Data Management Life-cycle, but its role in the planning phase is more limited in scope. That said, there are several ways researchers can work with ARCC when forming this plan:

  • The ARCC documentation and policies can provide researchers with much of the background information required about the resources available

  • ARCC resources are described in a Facilities Statement

  • ARCC is always willing to meet with researchers to discuss any Data Management issues by scheduling through our ticketing system

  • We work closely with UWyo Libraries, who are well versed in Data Management, and can refer you to them for more nuanced questions

    • They also administer the UWyo instance of a Data Management Planning Tool called DMPTool, which can be very useful for writing data management plans

    • They also have resources available for publishing research data, which will be discussed later in this module


The Collection Phase

There are multiple types of data, and how they are collected varies greatly depending on the kind of research being done. Below is a table of some types of data that could apply to any management scenario. Please note this is not a comprehensive list; many more types of data exist.

Classic

  • Text files

  • Tabular (Spreadsheets, Databases, etc.)

  • Matrices

  • Observations/field notes

Simulated/automated

  • Computer Models

  • Instruments (Microscopes, Weather Stations, Satellite Imagery, etc.)

  • Audio/video recordings

Social

  • Surveys

  • Interviews

  • Focus groups

  • Exit Polling

During this phase, it is important to keep the data being collected organized and named with appropriate conventions to assist with the next phases. Examples will be discussed in other modules.
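As a rough illustration of what a naming convention might look like, here is a minimal Python sketch. The pattern (project, description, ISO date, zero-padded version) and the function name `build_filename` are assumptions chosen for this example, not an ARCC standard; adapt them to whatever convention your project agrees on.

```python
import re
from datetime import date


def build_filename(project: str, description: str, collected: date,
                   version: int, extension: str) -> str:
    """Build a name such as 'snowpack_site-a-depth_2024-03-05_v02.csv'.

    The field order and separators are a hypothetical convention.
    """

    def slug(text: str) -> str:
        # Lowercase the text and collapse every run of characters that is
        # not a letter or digit into a single '-'.
        return re.sub(r"[^a-z0-9]+", "-", text.lower()).strip("-")

    # ISO 8601 dates (YYYY-MM-DD) sort chronologically as plain text.
    stamp = collected.isoformat()
    suffix = extension.lstrip(".").lower()
    return f"{slug(project)}_{slug(description)}_{stamp}_v{version:02d}.{suffix}"


example_name = build_filename("Snowpack", "Site A depth", date(2024, 3, 5), 2, ".CSV")
print(example_name)  # snowpack_site-a-depth_2024-03-05_v02.csv
```

Avoiding spaces and special characters keeps the names easy to handle on command-line systems, and the ISO date keeps files in chronological order when sorted by name.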


How ARCC Can Help With the Collection Phase

Since ARCC does not advise on how research should be done, data collection is not usually an area where we provide expertise. However, we can provide advice on how the data may be used in later phases of the Research Data Management Life-cycle, which you may want to keep in mind while you are in the collection phase. If you are unsure about anything you run into, please remember that ARCC provides the following resources that may assist you:

  • The ARCC documentation and policies can provide researchers with much of the background information required about the resources available

  • ARCC resources are described in a Facilities Statement

  • ARCC is always willing to meet with researchers to discuss any Data Management issues by scheduling through our ticketing system


The Storage Phase

Once your research data are collected, you will need a place to keep them before moving on to the next phases. This phase is often the longest and sometimes overlaps many of the others. While seemingly trivial, the storage phase is vital to the Data Management Life-cycle. Before we discuss the systems and services ARCC provides for this phase, it is important to ask yourself a few questions before deciding where your data will be stored:

  • Do the data fall under any federal compliance or other security restrictions?

  • How are the data to be accessed and how frequently?

  • Do the data need to be backed up or version controlled?

  • Do other collaborators require access and are they local to your institution or not?


How ARCC Can Help With the Storage Phase

Research data storage is a core service that ARCC provides, and we have several storage options available that will be discussed in subsequent modules. Briefly, ARCC provides three core storage systems, each filling a different role in the Research Data Life-cycle, as detailed in the table below:

Alcova

  • Free for UWyo researchers up to a default limit

  • Accessible via the UWyo network or VPN

  • Includes backups and snapshots

MedicineBow

  • Home (for configuration and profiles)

  • Project (for shared data during analysis)

  • gscratch (for actively read/write during analysis)

    • MedicineBow is NOT backed up, but includes snapshots

Pathfinder (Storage)

  • Cloud-like backend

  • Web-enabled S3 buckets for data storage, data transfer, etc.

  • Is NOT backed up

Transferring data to and from these systems is discussed in another workshop. Please also be aware that none of these systems meet any federal compliance requirements.


The Analysis Phase

The analysis phase can involve a variety of methodologies and tools. It also often involves different stages and versions of data. Here are a few questions to ask yourself before entering this phase of the Research Data Life-cycle:

  • How large are the data that I am working with?

  • Will I need a powerful system such as a High Performance Computing system to complete this work?

  • What software will I need to perform the analysis?

  • Will new data be generated as a result of this work (simulated data for model training, a summarized subset of raw data, etc.)?

  • Will this work change my raw data and do I need to keep a copy of either the raw data or results?

  • How will I manage the changes that will happen during this phase and maintain a record of them?
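One way to approach the last two questions is to record a checksum for every raw file before analysis begins and compare against that record later. The sketch below is only an illustration using the Python standard library; the directory layout and the function names (`build_manifest`, `changed_files`) are hypothetical, and the temporary directory stands in for a real raw-data folder.

```python
import hashlib
import tempfile
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Return the SHA-256 hex digest of a file, read in 1 MiB chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(1024 * 1024), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(raw_dir: Path) -> dict[str, str]:
    """Map each file's path (relative to raw_dir) to its checksum."""
    return {
        p.relative_to(raw_dir).as_posix(): sha256_of(p)
        for p in sorted(raw_dir.rglob("*"))
        if p.is_file()
    }


def changed_files(raw_dir: Path, manifest: dict[str, str]) -> list[str]:
    """List files that were added, removed, or modified since the manifest."""
    current = build_manifest(raw_dir)
    return sorted(
        name for name in set(current) | set(manifest)
        if current.get(name) != manifest.get(name)
    )


# Demonstration in a throwaway directory (a stand-in for real raw data).
with tempfile.TemporaryDirectory() as tmp:
    raw = Path(tmp) / "raw"
    raw.mkdir()
    (raw / "plot1.csv").write_text("id,value\n1,3.2\n")
    manifest = build_manifest(raw)           # record the state before analysis
    untouched = changed_files(raw, manifest)
    (raw / "plot1.csv").write_text("id,value\n1,9.9\n")  # simulate an edit
    modified = changed_files(raw, manifest)

print(untouched)  # []
print(modified)   # ['plot1.csv']
```

A manifest like this does not replace backups or version control, but it gives a quick check on whether the raw data were altered during analysis.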


How ARCC Can Help With the Analysis Phase

High Performance Computing is another core ARCC service, and we offer an assortment of support for this type of work. Along with the MedicineBow HPC system, we provide documentation, troubleshooting consultations, software management, and workshops, in addition to administering the system itself. Additionally, we provide facilitation of and technical support for the NCAR Wyoming Supercomputing Center’s Derecho system.

If neither of these systems meets your needs for the analysis phase and you still require assistance, please reach out to us via our service portal to discuss what your requirements are and potential options.

Another ARCC service that may be of use during this phase is GitLab, for collaborative code development and version control. We also recommend maintaining a README file alongside your work to record additional metadata that will be useful for the publishing phase of the Research Data Life-cycle. Metadata and README files are discussed in the next module.
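As a small preview, a README skeleton could be generated with a few lines of Python. The fields used here (title, contact, creation date, description, file list) are illustrative assumptions rather than a required ARCC or repository format, and `render_readme` is a hypothetical helper name.

```python
from datetime import date


def render_readme(title: str, contact: str, description: str,
                  files: dict[str, str], created: date) -> str:
    """Return plain-text README content with a few common metadata fields."""
    lines = [
        title,
        "=" * len(title),
        "",
        f"Contact: {contact}",
        f"Created: {created.isoformat()}",
        "",
        "Description",
        "-----------",
        description,
        "",
        "Files",
        "-----",
    ]
    # One line per file: its path and a short note on what it contains.
    lines += [f"{name}: {purpose}" for name, purpose in files.items()]
    return "\n".join(lines) + "\n"


readme_text = render_readme(
    title="Snowpack depth survey",
    contact="researcher@example.edu",  # placeholder address
    description="Depth measurements collected at field plots.",
    files={"raw/plot1.csv": "Unmodified instrument export"},
    created=date(2024, 3, 5),
)
print(readme_text)
```

Keeping the README next to the data and updating it as files change makes it easier to assemble scholarly metadata when the data are published.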


The Publishing Phase

This phase of the Research Data Life-cycle usually occurs after the work has been completed but before other work (such as a manuscript) is published. What exactly it involves depends on the requirements of the funding agencies and/or scientific journals that you are working with. For example, if your work was funded by the NSF, the resulting data must be made publicly available, and if you want to publish in the Journal of Science, your data have to be available before your manuscript itself will be published. Good scholarly metadata (described in the next section) will be key to completing this phase. Other key concepts in this phase are:

  • Discipline-specific data repositories

  • General or institutional data repositories

  • Digital identifiers, such as Digital Object Identifiers (DOIs)

  • Personal scholarly identifiers, such as an ORCID


How ARCC Can Help With the Publishing Phase

ARCC supports the Wyoming Data Repository for publishing research data, along with the Data Librarians at The University of Wyoming Libraries. The Data Librarians will be the primary points of contact during this phase and can seek ARCC’s assistance if needed. Additionally, some larger datasets will require ARCC to host or move them for the researcher. Lastly, if the data to be published are already stored on one of ARCC’s systems, ARCC can assist in getting them moved to the appropriate place for publishing.

 


 


Next Steps