Another critical aspect of data management is deciding which datasets are the most valuable and what protections they need. This section covers the stages data can pass through, what to consider before choosing a storage option, a comparison of storage options including those offered by ARCC, and long-term planning for the data.



Stages of Data

During a research project, data pass through different stages of use, each with different storage requirements. Some research projects will use all of these stages; others will use only a few.

  1. Potential Data - Data that are not yet collected, but there is a plan to store them.

  2. Raw Data - This transitional stage includes everything that has been collected and placed somewhere for processing or pre-processing.

  3. Prepared Data - Raw data that have been pre-processed in preparation for a model or other process.

  4. Intermediate Data - The most temporary of all stages. These data could be produced by a step in a process before further processing into final data, or they could be simulated data that help train a model.

  5. Final Data - The resulting data of a process. These data tell the story of the research and indicate the results.

  6. Published Data - These data are the same as the final data, but are in a format optimized for sharing.

  7. Archived Data - Data that are no longer needed for ongoing research projects but are not deleted.


Assessing the Needs

| Stage | Storage Need |
| --- | --- |
| Potential | None yet, but a plan for raw data is in place |
| Raw | Could be stored in a temporary place, or in a more permanent place if keeping raw data is determined to be valuable |
| Prepared | Should be stored in, or transferred to, storage that will optimize the next process |
| Intermediate | Should be stored on storage with high read and write performance; backups are not necessarily a requirement |
| Final | Should be stored in a safe place if the process is difficult to redo |
| Published | May be in a different format for data sharing, possibly a compressed file stored in a repository |
| Archived | Could be stored in "cold" storage in the most cost-effective way possible |
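
The stage-to-storage mapping above can be sketched as a small lookup, which may be handy when annotating a data management plan. This is only an illustration: the names `Stage`, `STORAGE_NEED`, and `storage_need` are hypothetical, and the descriptions are paraphrased from the table rather than taken from any ARCC tool.

```python
from enum import Enum


class Stage(Enum):
    """The seven data stages, in the order listed in this section."""
    POTENTIAL = "Potential"
    RAW = "Raw"
    PREPARED = "Prepared"
    INTERMEDIATE = "Intermediate"
    FINAL = "Final"
    PUBLISHED = "Published"
    ARCHIVED = "Archived"


# Storage needs paraphrased from the table above (hypothetical wording).
STORAGE_NEED = {
    Stage.POTENTIAL: "none yet, but a plan for raw data is in place",
    Stage.RAW: "temporary storage, or permanent storage if raw data are valuable",
    Stage.PREPARED: "storage that optimizes the next process",
    Stage.INTERMEDIATE: "fast read/write storage; backups not necessarily required",
    Stage.FINAL: "a safe place if the process is difficult to redo",
    Stage.PUBLISHED: "a sharing-friendly format, possibly compressed in a repository",
    Stage.ARCHIVED: "cost-effective cold storage",
}


def storage_need(stage_name):
    """Return the storage need for a stage name, ignoring case and padding."""
    # Stage(...) raises ValueError when the name is not one of the seven stages.
    stage = Stage(stage_name.strip().capitalize())
    return STORAGE_NEED[stage]


print(storage_need("intermediate"))
```

Keeping the stages in an `Enum` means a misspelled stage name fails loudly instead of silently returning nothing.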


Comparing Storage Options

| Storage Type | Advantages | Disadvantages |
| --- | --- | --- |
| External storage, e.g., portable hard drive or laptop | Fully user-controlled, can be encrypted, portable, and not accessible without physical access | Easily lost, vulnerable to damage, no extra copy, only as safe as the circumstances |
| Cloud-backed service, e.g., Google Drive or Dropbox | User-friendly, accessible from anywhere, interactive use of native files, shareable, syncable | Possibly costly, subject to unexpected terms-of-service changes, potential for unauthorized access |
| Cloud storage services, e.g., AWS, GCP, or Azure | Robust, scalable storage with customizable access and interoperability within the cloud environment | Potentially costly egress fees, terms-of-service changes |
| Institutional research storage service, e.g., ARCC Data Portal | Free up to default limits, support for UWyo researchers, backups and snapshots included | Requires a UWyo-based PI, does not include an offsite backup, non-compliant data only |
| Institutional HPC storage, e.g., ARCC MedicineBow | Access to compute power, specialized directories for performance and collaboration, snapshots | Linux-only permissions, not backed up, non-compliant data only |
| Specialized institutional storage, e.g., ARCC Pathfinder | Cloud-like backend and functionality with S3 protocol for sharing | Not backed up, requires specialized software clients to interact with, non-compliant data only |


Considering Other Requirements

Before determining a storage solution for a research project, researchers should consider all of their requirements and which compromises they can accept. Here are a few questions to consider prior to making a choice:

  • How frequently will I need access to my data, and how do I want to access them?

  • Will I have collaborators that need access?

  • Do I require backups?

  • Will I need to compute on these data?

  • Are there any federal compliance requirements such as HIPAA or NIST 800-171?

  • Are these production-like data that need a system with near 100% uptime?

  • Do I require proprietary software to access the data?
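
A few of these questions can be turned into a rough first-pass filter over the ARCC options in the comparison table. The sketch below is an assumption-laden illustration, not an ARCC tool: the class and function names are hypothetical, only three of the questions are modeled, and the `compute` flag for the Data Portal and Pathfinder is an assumption (the table only mentions compute for MedicineBow).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class StorageOption:
    name: str
    backed_up: bool  # table: "included backups" vs "not backed up"
    compute: bool    # table mentions compute only for MedicineBow (assumption for others)


# All three ARCC options are listed as "non-compliant data only" in the table.
ARCC_OPTIONS = [
    StorageOption("ARCC Data Portal", backed_up=True, compute=False),
    StorageOption("ARCC MedicineBow", backed_up=False, compute=True),
    StorageOption("ARCC Pathfinder", backed_up=False, compute=False),
]


@dataclass(frozen=True)
class Requirements:
    needs_backups: bool = False
    needs_compute: bool = False
    has_compliance_rules: bool = False  # e.g., HIPAA or NIST 800-171


def candidates(req):
    """Return names of ARCC options that satisfy the stated requirements."""
    if req.has_compliance_rules:
        # None of the ARCC options in the table accept compliance-regulated data.
        return []
    return [
        option.name
        for option in ARCC_OPTIONS
        if (option.backed_up or not req.needs_backups)
        and (option.compute or not req.needs_compute)
    ]


print(candidates(Requirements(needs_compute=True)))
```

An empty result does not mean no solution exists; it only signals that the requirements need a conversation beyond what the table captures.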


How to Decide

[Figure: image-20240514-000127.png]

