Learn more about the core concepts of research data management

 


Roles and Responsibilities

Outline and communicate roles and responsibilities

A document that clarifies roles and responsibilities helps set expectations - whether you are on a team or on your own - and helps ensure that no data management tasks fall by the wayside due to turnover, confusion, or lack of guidance.

A Roles and Responsibilities document is also a way to help you keep on top of tasks related to data management, and it facilitates communication of those tasks to others. Whether you have a large lab, a small one, or are working on your own, a document that outlines expectations, contact information, timelines, and other important information related to your project can play a critical role in monitoring your work and ensuring you are compliant with university, regional, and national requirements.

Examples of roles in data management:

  • data collector
  • data cleaner
  • data analyzer
  • metadata generator
  • instrument operator
  • coder
  • data backup and storage
  • data visualizer

Assign data management responsibilities to appropriate members of your group

In assigning data management responsibilities, be sure to:

  • Think about your research timeline, and be as specific as possible. The questions you ask yourself might include:
    • Who is collecting data?
    • When are they collecting data?
    • When does data analysis occur?
    • How often is raw data deleted from an instrument?
  • Understand university requirements (e.g., timeline for data destruction), and assign and manage roles so that those requirements are met. This might require brief onboarding and good documentation, especially in a lab that has a high turnover rate.
  • Explicitly state your expectations.

These questions, among others relevant to your work, will help inform how you assign data management responsibilities.


File Naming 

Use descriptive file names

Use a file naming convention that can help you figure out, at a glance, what is in a particular file. For example, a strong file naming convention might look like:

ProjectName_subject[nnn]_informationType_initialsOfDataCollector_YYYYMMDD

For a hypothetical project on letter learning, which uses a series of subject numbers to uniquely identify participants, a filename following this convention might look like:

letterLearning_subject001_responses_TA_20170405.csv
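
The convention above can be assembled programmatically so that every name comes out in the same shape. This is a minimal Python sketch, not a prescribed tool; the function name `build_file_name` and its parameters are hypothetical and chosen only for illustration.

```python
from datetime import date


def build_file_name(project, subject_number, information_type, initials,
                    collected_on, extension):
    """Assemble a file name following the convention described above."""
    subject = f"subject{subject_number:03d}"     # zero-padded, e.g. subject001
    stamp = collected_on.strftime("%Y%m%d")      # YYYYMMDD sorts chronologically
    return f"{project}_{subject}_{information_type}_{initials}_{stamp}.{extension}"


print(build_file_name("letterLearning", 1, "responses", "TA", date(2017, 4, 5), "csv"))
# prints: letterLearning_subject001_responses_TA_20170405.csv
```

Generating names from one function, rather than typing them by hand, keeps padding and date order consistent across a whole project.
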

Avoid special characters and spaces

Special characters (such as * & ^ % $ # / \ [ ]), spaces between words, and periods anywhere other than before the file extension can cause corruption, or prevent file use across programs and platforms. Instead, you can use "camel caps," a technique that helps divide a long string of characters by capitalizing the first letter of each new word, or use the underscore character (_). Hyphens are also acceptable in place of a space, but they can also cause problems, so either an underscore or camel caps is ideal.
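
As an illustration of the advice above, the following Python sketch rewrites a name using camel caps. The helper `to_safe_name` is hypothetical, and it assumes that the last period in the name marks the file extension and that the extension itself contains no special characters.

```python
import re


def to_safe_name(raw_name):
    """Return a camel-cased name without spaces or special characters."""
    stem, dot, extension = raw_name.rpartition(".")
    if not dot:                      # no period at all, so no extension
        stem, extension = raw_name, ""
    # Split on anything that is not a letter, digit, or underscore.
    words = [word for word in re.split(r"[^A-Za-z0-9_]+", stem) if word]
    if not words:
        raise ValueError("name contains no usable characters")
    # Keep the first word as typed; capitalize the first letter of later words.
    camel = words[0] + "".join(word[0].upper() + word[1:] for word in words[1:])
    return f"{camel}.{extension}" if extension else camel
```

For example, `to_safe_name("letter learning results (final).csv")` returns `letterLearningResultsFinal.csv`.
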


File Versioning

Find a system that works for you, and be consistent

Writing down your system, or making a plan for yourself, can help you remember what you've set up for yourself. It can also help -- down the road -- when you come back to a project and have included abbreviations or other information that is not immediately clear. Having shared and documented expectations can be especially beneficial if you are collaborating with others.

Keep raw data in a read-only format, and use iterative copies each time you manipulate data

You are able to assign a "read-only" status to files on your computer, often by modifying the properties. 

For example, your raw data, transferred to your work station from an instrument, might be named:

letterLearning_subject001_rawData_TA_20170205.dcm

Any formatting or analysis of this data should be a new file. For example, this file needs to undergo some manipulation before it can be added to a larger dataset. The file that has undergone this manipulation might be named:

letterLearning_subject001_preProcessed_TA_20170209.dcm

Or maybe you need to return to a file multiple times for analysis. Again, each time you manipulate a file, try to change the date:

letterLearning_subject001_analysis_TA_20170210
letterLearning_subject001_analysis_TA_20170211
letterLearning_subject001_analysis_TA_20170215
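
The read-only and "new file for each manipulation" practices above can be sketched in Python as follows. The function names are hypothetical, and the sketch assumes the standard library's permission handling is sufficient for your system; your operating system's file properties dialog achieves the same result.

```python
import os
import shutil
import stat


def make_read_only(path):
    """Clear the write-permission bits on a file."""
    mode = stat.S_IMODE(os.stat(path).st_mode)
    os.chmod(path, mode & ~(stat.S_IWUSR | stat.S_IWGRP | stat.S_IWOTH))


def start_working_copy(raw_path, working_path):
    """Protect the raw file, then copy its contents to a new, writable file."""
    if os.path.exists(working_path):
        raise FileExistsError(f"{working_path} already exists")
    make_read_only(raw_path)
    # copyfile copies contents only, so the new file does not inherit
    # the read-only permission bits of the raw file.
    shutil.copyfile(raw_path, working_path)
    return working_path
```

Refusing to overwrite an existing working file is a deliberate choice here: each manipulation gets its own new file name, so an earlier version is never silently replaced.
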

Remember, the best versioning system is the one that works for you! 

"Low tech" versioning options - no programming or command line necessary

One way to version your work is to simply update the date of your file name each time it is modified. This way, you can easily go back in time and see previous versions of your work! For example:

manuscript_journal_20170105
manuscript_journal_20170106
manuscript_journal_20170215

For an added layer of depth, you could also create a spreadsheet or another separate file where you document what you changed each time. This gives you even more context for, and understanding of, the transformations your work has undergone!
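
A change log like the one described above can be kept by hand in a spreadsheet. For those who prefer a script, this is a minimal Python sketch that appends one row per change to a CSV file; the column names and the `log_change` helper are assumptions made for illustration.

```python
import csv
import os
from datetime import date

LOG_COLUMNS = ["date", "file_name", "what_changed"]   # hypothetical column names


def log_change(log_path, file_name, what_changed, changed_on=None):
    """Append one row describing a change to a CSV change log."""
    changed_on = changed_on or date.today()
    is_new_log = not os.path.exists(log_path)
    with open(log_path, "a", newline="", encoding="utf-8") as handle:
        writer = csv.writer(handle)
        if is_new_log:
            writer.writerow(LOG_COLUMNS)               # header row, written once
        writer.writerow([changed_on.strftime("%Y%m%d"), file_name, what_changed])
```

Because the log is itself a plain CSV file, it can be opened in any spreadsheet program and stored alongside the versions it describes.
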

Tech-assisted - may require programming or command line knowledge

Use a program that tracks changes for you. Tools and services like Git and GitHub are well established in many communities, and are well-suited to collaborative projects.

A further benefit of a tool like Git is that, if you have multiple collaborators working on the same document, like a complex spreadsheet or code, Git helps limit instances of corruption or data loss by recording each person's changes as separate commits and flagging conflicting edits when those changes are merged. Git can also facilitate documenting changes to data: whenever changes are committed, users are prompted to write a short summary of what changed. This summary makes it easier to find specific changes later.

Note that some solutions, regardless of how convenient they are, may not be a good fit for your work. For example, storing health information in a public Git repository would be considered a breach of confidentiality.


Sustainable Data Formats

Use file formats that are stable, non-proprietary, and well-documented

The file format you select is a primary factor in ensuring the ability to use your data in the future. As technology continually changes, it's important to consider using file formats that will not become obsolete. 

Formats that are more likely to be accessible in the future are:

  • Non-proprietary
  • Open standard
  • Documented standard
  • Commonly used by your community
  • Unencrypted
  • Uncompressed
  • Standard representation (ASCII, Unicode)

All of these characteristics help protect against software and hardware obsolescence. Consider migrating your data into a format with the above characteristics, in addition to keeping a copy in the original software format.

Examples of preferred formats include:

  • CSV, instead of XLSX
  • MPEG-4, instead of QuickTime
  • TIFF or JPEG2000, instead of GIF or JPG
  • XML or RDF, not RDBMS
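
As one illustration of migrating data into an open format, the sketch below exports a database table to CSV. It assumes the data lives in an SQLite database, chosen only because Python's standard library can read it; the helper name `export_table_to_csv` is hypothetical, and other systems would need their own export route.

```python
import csv
import sqlite3


def export_table_to_csv(connection, table_name, csv_path):
    """Write every row of one table to a UTF-8 CSV file with a header row."""
    # Quote the table name as an identifier; it cannot be passed as a parameter.
    quoted = '"' + table_name.replace('"', '""') + '"'
    cursor = connection.execute(f"SELECT * FROM {quoted}")
    header = [column[0] for column in cursor.description]
    rows = cursor.fetchall()
    with open(csv_path, "w", newline="", encoding="utf-8") as handle:
        writer = csv.writer(handle)
        writer.writerow(header)
        writer.writerows(rows)
    return len(rows)                 # number of data rows written
```

The original database file should still be kept; the CSV is an additional copy in a format that does not depend on particular software.
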

More advice on formats is available at PRONOM, a service developed by The National Archives of the United Kingdom in support of digital preservation initiatives.

 

Metadata

Describe your data with a level of precision necessary for re-use in your field

Metadata describes important information about a data set. Appropriate metadata helps others find, use, and cite your data. Depending on your field, it will be important to collect a variety of parameters, for example age range, organism information, phenotype information, sample characteristics, technologies used, and more. At a minimum, it will be important to collect:

Title
Name of the dataset or research project that produced it

Creator
Names and addresses of the organization or people who created the data

Identifier
Number used to identify the data, even if it is just an internal project reference number

Dates
Key dates associated with the data, including project start and end date, data modification date, data release date, and time period covered by the data

Subject
Keywords or phrases describing the subject or content of the data

Funders
Organizations or agencies who funded the research

Rights
Any known intellectual property rights held for the data

Language
Language(s) of the intellectual content of the resource, when applicable

Location
Where the data relates to a physical location, record information about its spatial coverage

Methodology
How the data was generated, including equipment or software used, experimental protocol, and other details you might include in a lab notebook
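
The minimum fields listed above can be kept in a small, machine-readable file stored next to the data. The Python sketch below is one possible approach, not a metadata standard: the choice of JSON, the lowercase field names, and the decision to treat Language and Location as optional (since the list above describes them as applying only in some cases) are all assumptions.

```python
import json

# Field names mirror the list above; treating these as required is an assumption.
EXPECTED_FIELDS = ["title", "creator", "identifier", "dates",
                   "subject", "funders", "rights", "methodology"]
WHEN_APPLICABLE = ["language", "location"]


def missing_fields(record):
    """Return the expected fields that are absent or empty in a record."""
    return [field for field in EXPECTED_FIELDS if not record.get(field)]


def save_metadata(record, path):
    """Write the record as human-readable UTF-8 JSON, refusing incomplete records."""
    gaps = missing_fields(record)
    if gaps:
        raise ValueError(f"metadata record is missing: {', '.join(gaps)}")
    with open(path, "w", encoding="utf-8") as handle:
        json.dump(record, handle, indent=2, ensure_ascii=False)
```

Checking for gaps at save time surfaces missing information while the people who know the answers are still on the project.
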

Where possible, use metadata standards

A metadata standard will help you describe your data accurately and precisely and, when adopted by a community, sustains shared understanding and practice. Metadata standards also help increase the future usability of your data by providing important contextual information. This not only helps others find your data, it may also help in the future when other researchers are looking to combine similar sets of information in a comprehensive study, or to gain insight into phenomena in new ways.

As there are many different types of metadata available, the following resources can help you locate more information:

As always, if you require more assistance, please reach out to us at the Data Working Group!

 

Security

Tools and techniques can help enhance the security of your work

One of the greatest challenges we face with security is the human element -- perhaps a member of your group uses insecure passwords, someone accidentally leaves a laptop in a cafe and it is gone, or a window in your ground-level lab is left open and someone takes your desktop workstation.

You can take a few steps to help secure your work:

  • When using cloud storage, be sure to use a secure storage solution, like Box from UMass Amherst
  • Encrypt your machine
  • Keep your applications up to date
  • Control access to your machine -- password protect it, keep it in a locked room, etc.
  • Use sufficiently complex passwords
  • Use anti-virus software
  • Use anti-malware software
  • Use the internet and email safely -- keep an eye out for 'phishy' emails and try to work from secure wireless connections

Read more at UMass Amherst IT, and be sure to check out their Security Checklist for Personal Computers.

Certain data types require stricter security standards

Be sure you work to appropriately secure your data. For example, when you work with human subjects, endangered species, or artifacts of cultural significance, you must appropriately safeguard the information you collect. 

You may also be required to appropriately destroy data after a certain time period. See UMass Amherst's IT page on Hard Drive and Magnetic Tape destruction.

Learn more at UMass Amherst's Research Compliance page, and be sure you are complying with any federal, university, or local policy.

Critically read and understand Terms of Service and Use

Make sure you read terms of use! For example, the Terms of Service for Google Drive state:

When you upload, submit, store, send or receive content to or through Google Drive, you give Google a worldwide license to use, host, store, reproduce, modify, create derivative works (such as those resulting from translations, adaptations or other changes we make so that your content works better with our services), communicate, publish, publicly perform, publicly display and distribute such content. The rights you grant in this license are for the limited purpose of operating, promoting, and improving our services, and to develop new ones. This license continues even if you stop using our services unless you delete your content. Make sure you have the necessary rights to grant us this license for any content that you submit to Google Drive.

Certain Terms of Service may be incompatible with your data.


Storage and Backup During Your Project

Automate your backups

Take advantage of backup systems built into your operating system, or those pre-loaded onto external hardware. These often let you schedule your backups and choose how granular each backup should be. Ideally, you should back up your data in such a way that data loss would only minimally impact your work. This might mean that you perform backups daily, weekly, or even monthly.

Store your data following the 3 - 2 - 1 rule

Best practice for data backup is the 3 - 2 - 1 rule: keep three copies of your data, on two different types of storage media, with one copy in a geographically distinct location. This helps protect your data from both local media failure, like your hard drive crashing or your laptop being dropped, and catastrophic events, like a flood or fire. If data is also backed up in a geographically distinct area, a catastrophe that happens locally is very unlikely to also occur in a different state or time zone. Cloud storage is often distributed across several different geographic areas, and can often meet the geographically distinct requirement.

One way to follow the 3 - 2 - 1 rule is to:

  • Store one copy on your local machine
  • Store one copy on your departmental servers
  • Store one copy with an appropriate cloud service, like Box, which is secure, and for UMass Amherst users, offers unlimited storage. 
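
The copies listed above can be made and checked with a short script. This is a minimal Python sketch; the `back_up` helper is hypothetical, and it assumes the departmental server and the cloud service are both reachable as ordinary folders on your machine (for example, a mounted network drive and a synced folder).

```python
import filecmp
import os
import shutil


def back_up(source_path, backup_directories):
    """Copy a file into each backup directory and confirm each copy matches."""
    copies = []
    for directory in backup_directories:
        os.makedirs(directory, exist_ok=True)
        destination = os.path.join(directory, os.path.basename(source_path))
        shutil.copy2(source_path, destination)     # copy2 also keeps timestamps
        # shallow=False compares file contents, not just size and timestamp.
        if not filecmp.cmp(source_path, destination, shallow=False):
            raise IOError(f"backup at {destination} does not match the source")
        copies.append(destination)
    return copies
```

A script like this can be run by your operating system's scheduler, in keeping with the advice to automate backups.
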

Be wary of Terms of Use

Make sure you read terms of use! For example, the Terms of Service for Dropbox state:

When you use our Services, you provide us with things like your files, content, messages, contacts and so on ("Your Stuff"). Your Stuff is yours. These Terms don't give us any rights to Your Stuff except for the limited rights that enable us to offer the Services.

We need your permission to do things like hosting Your Stuff, backing it up, and sharing it when you ask us to. Our Services also provide you with features like photo thumbnails, document previews, commenting, easy sorting, editing, sharing and searching. These and other features may require our systems to access, store and scan Your Stuff. You give us permission to do those things, and this permission extends to our affiliates and trusted third parties we work with.

Certain Terms of Service may be incompatible with your data or your agreements with your funders, organization, or colleagues.

 

Sharing Data

Data papers

Data papers are one emerging method of sharing data with appropriate context, and are published specifically to facilitate use and re-use. Many are peer-reviewed. There are several venues that publish data papers:

Data journals, which publish only data, include:

Mixed journals, which publish a combination of data and research articles, include:

Other journals and venues for publishing data are available, as well.

Use an appropriately robust, redundant, and findable platform for sharing

Appropriate venues for sharing vary from discipline to discipline, but typically, a repository or other systematically managed option is an ideal solution for sharing your data. Personal websites are a challenge to maintain in the face of many other priorities, and it is up to you as the administrator to migrate your website to new versions, ensure all links work appropriately, and migrate data to new formats as they arise.

Many repositories will take care of this for you. In addition, they typically use an appropriately robust level of metadata and are harvested by search engines, and some provide metrics like number of hits, downloads, and citations, which can better demonstrate your impact on your community and the world.

Licensing

A license can help others quickly understand how you wish your work to be re-used. A commonly adopted licensing framework is Creative Commons, which allows for several levels of re-use, from a public domain dedication (CC0) to a very restrictive license that requires attribution, does not permit commercial use, and does not allow derivatives (CC BY-NC-ND).

 


Archiving Data for Long-Term Access

Use a data repository

Making data available for long-term access requires a lot of time and effort: you must ensure that the file format is compatible with today's operating systems, the link still works, the data is not corrupted, the hardware is not malfunctioning, there is appropriate security in place, data has not been tampered with by an outside source, and more. This work is often why some repositories charge a fee -- digital data is fragile, and a data repository is much more than a series of servers in a room. 

A good data repository will take care of some, if not all, of the above challenges. Be sure to do your due diligence when selecting a repository. 
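
One of the checks mentioned above, confirming that data has not been corrupted or altered, is commonly done by recording a checksum for each file and recomputing it later. The Python sketch below illustrates the idea with SHA-256; the function names are hypothetical, and a repository's own fixity checking will differ in detail.

```python
import hashlib
import os


def sha256_of(path, chunk_size=1024 * 1024):
    """Compute the SHA-256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(directory):
    """Map each file's path (relative to directory) to its checksum."""
    manifest = {}
    for root, _dirs, files in os.walk(directory):
        for name in files:
            full_path = os.path.join(root, name)
            manifest[os.path.relpath(full_path, directory)] = sha256_of(full_path)
    return manifest


def changed_files(directory, manifest):
    """List recorded files that are now missing or have a different checksum."""
    current = build_manifest(directory)
    return sorted(path for path, checksum in manifest.items()
                  if current.get(path) != checksum)
```

The manifest would normally be saved somewhere separate from the data it describes, so that damage to one does not also affect the other.
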

It is also beneficial to select a repository at the beginning of your project. This way, you can plan your work around ultimately sharing it, which helps you avoid having to re-create the circumstances under which the data was created -- be it weeks or years ago!

Use formats that are well-suited for long-term accessibility

Similar to data formats that are sustainable (see: Sustainable Data Formats, above), formats that are well-suited for long-term accessibility will also be non-proprietary, well-documented, and well-adopted by your community. 

However, certain repositories or other venues of dissemination may have strict requirements on what data formats they will allow, so when selecting the repository you'd like to use, be sure that you can appropriately convert your data to a shareable format.

The Library of Congress has more information on preservation and long-term accessibility in its Recommended Formats Statement.