Step 5 - Testing and Controls
Last updated on 2025-05-09
Testing for Validity and Integrity
Unintended changes to datasets can happen easily. An added comma in a sentence can throw out a line in a spreadsheet. Missing data can alter a calculation. A broken link or download that didn’t complete properly can give us incomplete data without us realising.
But what can we do about this?
There are a few checks we can run to help identify when this has happened.
Are there the expected number of lines?
Are there the expected number of columns?
If you sort a column and find the unique entries, do those entries make sense? For example, in a column that you’re expecting to find months, is there the word “Monday” in it? If so, there may have been some movement in your cells.
Do the calculations make sense? If you have a range of 1-100 for an attribute, is the mean of that column an impossible number such as 254?
Does an analysis fail to run? If you look into it, there may be an unexpected value, such as a letter where it was expecting a number.
Again, R and Python have real strengths here, as you can run code to help you find problematic entries. For example, R has a function called head() that by default prints the first 6 rows of a dataset, so you can quickly do a visual check of the data. You can also use functions like filter() to pick out any rows with values above or below what is expected.
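The checks listed above can also be written as a short script. The following is a minimal sketch in Python using only the standard library. The sample data, the column names (month, score), the expected row count and the 1-100 range are all assumptions made up for illustration, so swap in what you expect from your own dataset.

```python
import csv
import io
import statistics

# Hypothetical example data; in practice you would read your own file.
SAMPLE_CSV = """month,score
January,42
February,87
Monday,254
March,abc
"""

# Assumptions about what this made-up dataset should look like.
EXPECTED_COLUMNS = ["month", "score"]
EXPECTED_ROWS = 4
VALID_MONTHS = {
    "January", "February", "March", "April", "May", "June", "July",
    "August", "September", "October", "November", "December",
}
SCORE_MIN, SCORE_MAX = 1, 100


def check_table(text):
    """Return a list of human-readable problems found in the CSV text."""
    problems = []
    reader = csv.DictReader(io.StringIO(text))
    rows = list(reader)

    # Are there the expected columns and the expected number of rows?
    if reader.fieldnames != EXPECTED_COLUMNS:
        problems.append(f"unexpected columns: {reader.fieldnames}")
        return problems  # the remaining checks rely on the column names
    if len(rows) != EXPECTED_ROWS:
        problems.append(f"expected {EXPECTED_ROWS} rows, found {len(rows)}")

    # Do the unique entries in the month column make sense?
    unexpected = sorted({row["month"] for row in rows} - VALID_MONTHS)
    if unexpected:
        problems.append(f"unexpected month values: {unexpected}")

    # Is every score a number, and is it inside the possible range?
    scores = []
    for line_no, row in enumerate(rows, start=2):  # line 1 is the header
        try:
            value = float(row["score"])
        except (TypeError, ValueError):
            problems.append(f"line {line_no}: non-numeric score {row['score']!r}")
            continue
        if not SCORE_MIN <= value <= SCORE_MAX:
            problems.append(f"line {line_no}: score {row['score']} is out of range")
        scores.append(value)

    # Does the calculation make sense? The mean cannot leave the range.
    if scores and not SCORE_MIN <= statistics.mean(scores) <= SCORE_MAX:
        problems.append("mean score is outside the possible range")
    return problems


for problem in check_table(SAMPLE_CSV):
    print(problem)
```

Collecting the problems in a list, rather than stopping at the first one, lets you see everything that needs attention in a single run.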
Coding Resources for testing the integrity of data
The Carpentries software programming courses also run through some basic tests in their workshops
The Turing Way has a great tutorial on how to write robust code.
Microsoft Excel default settings
As per the following paper: Ziemann, M., Eren, Y. & El-Osta, A. Gene name errors are widespread in the scientific literature. Genome Biol 17, 177 (2016). https://doi.org/10.1186/s13059-016-1044-7 licenced as CC-BY 4.0
“The spreadsheet software Microsoft Excel, when used with default settings, has a tendency to automatically convert values. The result may be that data is converted into a format you don’t need, for example converting date data from the UK date format of dd/mm/yyyy to mm/dd/yyyy. In the paper above, the authors highlight that Excel is known to convert gene names to dates and floating-point numbers. The error is common and has even made it into approximately one-fifth of papers with supplementary Excel gene lists.”
To remedy this, there is now an option to turn off automatic data conversion - but you need to be aware of it.
The authors suggested the following check to reviewers and editorial staff:
“the kind of errors we describe can be spotted by copying the column of gene names and pasting it into a new sheet, and then sorting the column. Any gene symbols converted to dates will appear as numbers at the top of the column.”
If you are importing data into, or exporting data from, software like Excel, double-check any areas where a default formatting option could alter your data.
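A check like the one the authors describe can also be scripted. Below is a minimal sketch in Python that flags entries in a column of gene symbols that look like dates (such as "2-Sep") or like numbers in scientific notation (such as "2.31E+13"). The function name, the two patterns and the sample list are assumptions for illustration only; they will not catch every format Excel can produce, so treat a clean result as a hint rather than proof.

```python
import re

# Abbreviated month names, as Excel tends to display converted dates.
MONTHS = "Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec"

# Matches values such as "2-Sep" or "Sep-2" (assumed display formats).
DATE_LIKE = re.compile(
    rf"^(\d{{1,2}}-({MONTHS})|({MONTHS})-\d{{1,2}})$", re.IGNORECASE
)

# Matches values such as "2.31E+13" (a number in scientific notation).
SCI_NOTATION = re.compile(r"^\d+(\.\d+)?E[+-]?\d+$", re.IGNORECASE)


def find_suspect_symbols(symbols):
    """Return the entries that look like converted dates or numbers."""
    suspects = []
    for symbol in symbols:
        cleaned = symbol.strip()
        if DATE_LIKE.match(cleaned) or SCI_NOTATION.match(cleaned):
            suspects.append(symbol)
    return suspects


# Hypothetical column of gene symbols copied out of a spreadsheet.
gene_column = ["TP53", "2-Sep", "BRCA1", "1-Mar", "2.31E+13"]
print(find_suspect_symbols(gene_column))
```

Any entry the function returns is worth comparing against the original source file before the analysis continues.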
Where did the pizza go wrong?
We’ve shared our pizza recipe with our friend, who attempted to cook it last night. They’ve called you up today and said “It didn’t work out… why?”
What questions can we ask to discover what could have gone wrong?
Questions could include:
Was it burnt? Or did the cheese not melt?
Is the base too thick/tough?
Do the ingredients taste ‘off’?
Where did the pizza go wrong? (continued)
Now for each question, let’s develop a test. We want to offer a way for our friend at each step to ‘test’ that their pizza is on the correct track.
Tests could include:
Was the oven set to 190 degrees Celsius for 20 minutes?
Did the cheese melt?
Did the dough double in size when it rose?
Were the ingredients still within their use-by or best-before dates?
Physical Testing and Quality Assurance
Consider your ‘hardware’ - the machines and devices you use in your research. Are they calibrated? For your control samples, are you getting the expected outcomes?
Consider your ‘consumables’ - Have your reagents gone out of date? Are they stored and labelled correctly? Did your blood samples heat up during transportation?
Document how you have checked and accounted for Quality Assurance.
Providing Authenticity and Validity
Data lineage
Data lineage considers the data origin, what happens to it, and where it moves over time.
Consider your research data. What is its original source?
Did you obtain it through:
Conducting a survey?
Work in the physical field (such as tree mapping or rock art identification)?
From a collaborator?
Open dataset online?
A physical object (such as paintings)
A catalogue of images (such as satellite imagery)
Discussion
If the data is something you haven’t created from scratch (for example, by running a trial or collecting the data yourself), have you noted where that source is?
Have you noted where your source obtained the information?
Your source may not be the data owner - who is?
What copyright and access limitations are on the data?
If you are using a repository that is regularly updated (satellite images, weather patterns, government policies or legislation etc), have you noted the version of the data?
This is all useful information to include in your documentation
This may be a good time for a refresh of Step 1 - Metadata on your files
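One way to keep the lineage details from this discussion next to the data is a small provenance record. The sketch below, in Python, builds such a record and adds a SHA-256 checksum of the raw data, so an incomplete download or an unintended change can be detected later. The field names, the example.org URL and the sample values are all hypothetical; use whatever fields your documentation already relies on.

```python
import hashlib
import json


def build_provenance_record(raw_bytes, source, data_owner, licence,
                            version, retrieved_on):
    """Describe where a dataset came from and fingerprint its contents."""
    return {
        "source": source,
        "data_owner": data_owner,
        "licence": licence,
        "version": version,
        "retrieved_on": retrieved_on,
        "size_bytes": len(raw_bytes),
        "sha256": hashlib.sha256(raw_bytes).hexdigest(),
    }


def matches_record(raw_bytes, record):
    """Return True if the data still has the checksum that was recorded."""
    return hashlib.sha256(raw_bytes).hexdigest() == record["sha256"]


# Hypothetical raw data; in practice, read your file in binary mode.
raw_bytes = b"site,tree_height_m\nA,12.5\nB,9.8\n"

record = build_provenance_record(
    raw_bytes,
    source="https://example.org/tree-survey.csv",  # placeholder URL
    data_owner="Example Collaborator",             # placeholder owner
    licence="CC-BY 4.0",
    version="2024-03 release",
    retrieved_on="2024-04-17",
)
print(json.dumps(record, indent=2))
print(matches_record(raw_bytes, record))
```

Saving the printed record as a text file beside the raw data keeps the provenance and the data together in your folder structure.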
Tracking your Analysis history
Now that we know where our raw data comes from and how it was made, let’s think about the changes we make to it.
Let’s start with data cleaning. Have you made a log of the changes you have made?
OpenRefine, NVivo and SPSS all have logs of actions that you can download and save.
An SPSS analysis pipeline is saved as a .sps script file. You may see it referred to as ‘Syntax’.
SAS uses a .sas file for pipelines.
Stata uses a .do file for pipelines. You may see it referred to as ‘commandlog’.
This is also where R and Python have huge strengths. Writing an R or Python script enables you to rerun with certainty the same analysis every time.
If you are using a random number generator, take note of the seed number.
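To show why the seed matters, here is a minimal sketch in Python using the standard random module. The seed value, the function name and the list of participant numbers are arbitrary choices for illustration. The point is that the same seed gives the same "random" selection every time the script is rerun.

```python
import random

# Record this number in your documentation; the value itself is arbitrary.
SEED = 20240417


def draw_sample(seed, population, k):
    """Draw k items using a generator started from a known seed."""
    rng = random.Random(seed)  # a separate generator, not the global one
    return rng.sample(population, k)


# Hypothetical participant IDs 1 to 100.
participants = list(range(1, 101))

first_run = draw_sample(SEED, participants, 5)
second_run = draw_sample(SEED, participants, 5)

print(first_run)
print(first_run == second_run)
```

Using a dedicated random.Random(seed) object, rather than the shared global generator, means other code that also draws random numbers cannot disturb your sequence.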
This is a lot of work. What are the benefits to me?
Infinite undos: Control versions between active, live and archived.
Branching and experimentation: Copy code or other technical formulae and change the copy to test hypotheses.
Collaboration: Track changes, merge input.
These analysis pipelines can be saved as part of your folder structure from lesson 1.
Standard Operating Procedures
We talked about Standard Operating Procedures in Step 4 - Documentation. This is where an SOP is really powerful - your pipeline can even be part of your SOP.
Version Control and Tracking
Let’s now consider tracking your versions of your analysis pipeline.
You may be making changes to your analysis pipeline as you go
How are you taking notes of these changes?
What version of software are you using?
Have you noted the name, model and version number of any hardware you may be using (for example, cameras, microscopes, MRI machines, IoT sensors)?
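Software versions can be recorded by the analysis script itself. The following is a minimal sketch in Python that collects the Python version, the operating system and the versions of any packages you name. The function name and the package names passed in are placeholders. Hardware details such as a microscope model still need to be written down by hand.

```python
import json
import platform
from importlib import metadata


def describe_environment(packages):
    """Collect version details for Python, the platform and some packages."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = "not installed"
    return {
        "python": platform.python_version(),
        "platform": platform.platform(),
        "packages": versions,
    }


# Placeholder package names; list the ones your own pipeline imports.
environment = describe_environment(["pip", "a-package-that-does-not-exist"])
print(json.dumps(environment, indent=2))
```

Saving this output alongside each set of results gives you a record of which software versions produced them.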
Version control is keeping track of each change you have made, so that if you need to go back to a previous version of your analysis pipeline, you can!
Think of version control as an ‘undo’ button.
Discussion
We have already learnt some skills in Step 1 - Organising your files and folders to track versions of files. We can use the same principles here. Your pipelines can be labelled v1.0, v1.1 etc.
You can use the first number as a major step in versions (for example, Draft v1.0 to Reviewed v2.0), and the second number as a minor step in versions to indicate a change has occurred (from v1.1 to v1.2).
Retaining a copy of your raw data prior to any cleaning or analysis is also a useful part of version control.
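The major and minor numbering in this discussion can be sketched in a few lines of Python. The function names and the exact label format (a "v", a major number, a dot, a minor number) are assumptions for illustration, not a fixed standard.

```python
import re

# Assumed label format: "v" followed by major.minor, for example "v1.2".
VERSION_PATTERN = re.compile(r"^v(\d+)\.(\d+)$", re.IGNORECASE)


def parse_version(label):
    """Split a label such as 'v1.2' into the numbers (1, 2)."""
    match = VERSION_PATTERN.match(label)
    if match is None:
        raise ValueError(f"not a version label: {label!r}")
    return int(match.group(1)), int(match.group(2))


def bump(label, major=False):
    """Return the next label: a minor step by default, or a major step."""
    major_no, minor_no = parse_version(label)
    if major:
        return f"v{major_no + 1}.0"
    return f"v{major_no}.{minor_no + 1}"


print(bump("v1.1"))              # a minor change
print(bump("v1.2", major=True))  # a major step, such as draft to reviewed

# Sorting by the parsed numbers keeps v1.10 after v1.9.
print(sorted(["v1.10", "v1.9", "v2.0"], key=parse_version))
```

Sorting on the parsed numbers rather than the text avoids the common surprise where "v1.10" is listed before "v1.9".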
Beginner
A great place to start is:
Save your analysis pipeline and store it safely in your organised folders.
Advanced
Your next move can be:
Ensure your data is well described (as per Step 2).
Check that it is clear which of your datasets pair with which analysis pipelines, and in what order.
Publish your protocols and code.
Resources
ADACS-Australia/good-code-etiquette. Manodeep Sinha, Paul Hancock, Rebecca Lange (2019, October 18). GitHub. Retrieved on 2024-04-17 from https://github.com/ADACS-Australia/good-code-etiquette/tree/master licenced as Creative Commons Attribution-ShareAlike 4.0 International License.
R-Pkgs Hadley Wickham and Jennifer Bryan (2024) ‘13 Testing basics’ Retrieved on 2024-04-17 https://r-pkgs.org/testing-basics.html licenced as CC BY-NC-ND 4.0
References
OSF. Valerie Collins, Alicia Hofelich Mohr, Samantha T Porter (2023) Reproducible research practices in Excel (yes, Excel). Retrieved on 2024-04-17 from https://osf.io/p2bdq/ licenced as CC-By Attribution 4.0 International
The Turing Way Community (updated 2023) Writing robust code. Github.com. Retrieved on 2024-04-11 from https://the-turing-way.netlify.app/reproducible-research/code-quality/code-quality-robust licenced as CC-BY 4.0
Ziemann, M., Eren, Y. & El-Osta, A. Gene name errors are widespread in the scientific literature. Genome Biol 17, 177 (2016). https://doi.org/10.1186/s13059-016-1044-7 licenced as CC-BY 4.0
Abeysooriya M, Soria M, Kasu MS, Ziemann M (2021) Gene name errors: Lessons not learned. PLoS Comput Biol 17(7): e1008984. https://doi.org/10.1371/journal.pcbi.1008984 licenced as CC-BY 4.0
In this lesson, we have learnt:
Why we should be checking our data for validity and integrity during processing
What we should be looking for when inspecting our data
Tools for inspecting data
That physical testing and hardware QA play an important part too
Our data may have a lineage of origin, and we need to be aware of and document the provenance of our data
It is important to track our analysis history (and how to record it)
That version control is a way to track changes over time
We build trust in our knowledge by:
Testing our data for validity and integrity - and being able to show how we are testing!
Tracking our versions of software, hardware and analysis pipelines, so that our work is easier to reproduce later
We retain knowledge using:
Tracking metadata about our data (for example, where did an image come from? Who originally made the dataset?) for later reference
Recording the different versions of software and hardware, so we can go back to previous versions for reproducibility.
Tracking the different versions of our analysis pipelines