Root Cause Analysis in Clinical Trials
Summary
Root cause analysis (RCA) in clinical trial analyses is difficult and time-consuming, especially when SAS macros are involved. Yet issues are unavoidable in data analyses that span thousands of lines of code and many interconnected programs. For example, it is not uncommon to detect an issue in a TLF whose origin lies in the raw data. In this article, we describe how Verisian’s Tracer speeds up the identification and resolution of root causes by providing full traceability across the programs and dataset lineage of your entire study. You can try it out for free here.
Root Cause Analysis is Traceability Analysis
The goal of RCA is to identify the underlying cause of a problem and then fix that cause, rather than the symptom. Consider the following simple example:
/* Import the variable metadata sheet from the Excel workbook */
proc import
    datafile="example.xls"
    out=_temp
    dbms=xls
    replace;
    sheet="VARIABLE_METADATA";
run;
...
/* Keep the rows for the current domain and order them by keysequence */
proc sort
    data=_temp;
    where keysequence ne . and domain="&dataset";
    by keysequence;
run;
During the execution of the program, SAS throws an error in the proc sort:
ERROR: WHERE clause operator requires compatible variables.
The error occurs because “.” is the numeric missing value in SAS, so the comparison is only valid if keysequence is a numeric variable; here, keysequence is a character variable. It would be possible to attempt a fix by adjusting the comparison itself, but that would only treat the symptom. The root cause lies in the proc import step, where a column that should be numeric is imported as a character variable. The correct fix is to ensure that keysequence is numeric, as is both assumed and desired.
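One possible way to apply such a fix is sketched below. It assumes that the column really did arrive as a character variable and that an explicit conversion directly after the import is acceptable in your program. The temporary name keysequence_chr is hypothetical and used only for illustration.

```sas
/* Sketch under assumptions: keysequence was imported as character.      */
/* Convert it once, right after the import, so that every downstream    */
/* step sees a numeric keysequence.                                      */
data _temp;
    /* Rename the character column so the original name is free to reuse */
    set _temp(rename=(keysequence=keysequence_chr));

    /* The ?? modifier suppresses log errors for values that cannot be   */
    /* read as numbers; such values become numeric missing (.)           */
    keysequence = input(strip(keysequence_chr), ?? 32.);

    drop keysequence_chr;
run;
```

With the conversion in place, the where clause in the proc sort compares a numeric variable to the numeric missing value and no longer triggers the error. Values that silently become missing deserve a separate check, because they usually point to the cells that caused the column to be read as character in the first place.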
Although this is a toy example, it illustrates the essential part of RCA: understanding the error in its full context, which can only be achieved by tracing the error upstream to its cause. Hundreds to thousands of lines of code, spanning macros and programs, can lie between the point where an issue is detected and its root cause, and all of it must be navigated to understand which code could be involved. Worse yet, if macros potentially contribute to the issue, proper tracing is only possible by reviewing, in addition to the programs, the execution logs and the macro-generated code that MPRINT writes to them.
Log-based RCA is cumbersome and time consuming
SAS logs show code, macro invocations, resolved macro code (if the MPRINT option is enabled), and log messages, interspersed with line breaks, custom log messages, and other diagnostic information. Looking at logs alone, programmers can drown in partially redundant and often irrelevant information. Statistical analysis is difficult enough; trying to discern non-trivial implementation details from a SAS log, or worse, from a combination of the SAS log and SAS code, is cumbersome and mentally exhausting.
When doing RCA, programmers need to:
- Trace and keep track of the problematic code pieces across programs and macros
- Search the log for resolved macro code, given that macro invocations are “black boxes” in the original code
- Simultaneously diagnose the problem
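To give a sense of what the second task involves when done by hand, the following minimal sketch pulls the macro-generated statements out of a single log file with a data step. The file name adsl.log and the dataset name resolved_code are hypothetical. The sketch relies only on the fact that MPRINT lines start with the prefix MPRINT(macro name): and it does not handle long statements that wrap onto continuation lines without that prefix.

```sas
/* Hypothetical path; point this at one of your own log files */
filename studylog "adsl.log";

data resolved_code;
    infile studylog truncover lrecl=1024;
    input line $char1024.;

    length macro_name $32 code $1024;

    /* MPRINT lines look like: MPRINT(MACRONAME):   generated statement */
    if line =: "MPRINT(" then do;
        /* Second token when splitting on parentheses is the macro name */
        macro_name = scan(line, 2, "()");
        /* Everything after the first "):" is the generated statement   */
        code = left(substr(line, index(line, "):") + 2));
        output;
    end;

    keep macro_name code;
run;
```

Even with such a helper, the extracted statements still have to be matched to the right program, the right macro invocation, and the right dataset by hand, which is where most of the effort goes.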
Tracer: Full Traceability, Full Clarity
We built the Verisian Tracer, which requires only the study’s SAS log files, to vastly improve the speed and quality of RCA and thereby the overall quality and integrity of the study analysis.
First, the Tracer provides an easy-to-understand, navigable interface that puts full traceability throughout an entire study into the programmer’s hands. As shown in the image below (taken from our free demo study), every node represents a dataset created during the analysis and shows which other datasets it depends on. Relative to the selected dataset, all upstream datasets are shown in light blue and all downstream datasets in dark blue. Datasets shown in grey do not directly contribute to or influence the currently selected dataset and can therefore safely be ignored for the RCA of the selected dataset.
When selecting a dataset, the code panel on the right shows the code that created the dataset. The Tracer also displays any log messages associated with the current code and thereby sets them into the proper context to enhance your RCA process. By clicking on any of the upstream nodes, you seamlessly navigate within and across your programs, skipping anything irrelevant to your current question.
Selecting the "resolved" tab shows the code where macro invocations are fully resolved, provided MPRINT is enabled.
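For reference, MPRINT is a SAS system option that can be switched on at the top of a program. A minimal sketch follows; the macro sort_by is hypothetical and exists only to produce some macro-generated code.

```sas
/* Write the statements generated by macro execution to the log */
options mprint;

/* Hypothetical macro used only for illustration */
%macro sort_by(ds, key);
    proc sort data=&ds;
        by &key;
    run;
%mend sort_by;

%sort_by(_temp, keysequence)
```

With the option on, each statement generated by the macro is written to the log on a line prefixed with MPRINT(SORT_BY):, which is the information needed to show resolved code.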
The "Log" tab shows where the execution logs are stored, offering the source of truth for what was executed. The "Derivation" tab assembles all code necessary to create the currently selected dataset from across your programs into a single file, skipping any code that does not contribute to the derivation of interest.
Taken together, instead of browsing through files and hundreds to thousands of lines of code, the programmer can use the dataset traceability graph and the corresponding code to explore the full provenance of any dataset in a study. Once the root cause has been identified, the programmer knows exactly in which file(s) and at which line(s) the solution(s) should be introduced, and can understand what downstream effects these changes will have.
Next Steps
Expanding Dataset Traceability
We will soon introduce support for tracking formats, which commonly encode codelists, and for identifying the datasets to which they are applied.
Variable Level Traceability
Currently, the Verisian Tracer provides dataset-level traceability: it displays how datasets are interconnected, enabling you to easily browse through programs across a study. Soon, we will also introduce variable-level traceability. This new feature will allow you to select any variable in your analysis, and the Verisian Tracer will show the exact code necessary to derive that single variable. This will drastically reduce the amount of code to be reviewed and understood during RCA.
Q&A System - Built on Traceability, empowered by AI
Large Language Models have revolutionized AI and, when applied appropriately, provide a valuable new resource for automation. In the near future, we will introduce our own custom models that summarize traceability information in natural language to create a study-specific question-answering system, fine-tuned and optimized for explaining analysis code and full derivations. At the core of our approach lies our traceability analysis, rooted in graph theory, which ensures that inferences are fully reproducible and free of side effects and hallucinations, and which lets us provide the source of truth for any inference to guarantee proper validation. This ensures that we use AI only for what it excels at: translating and summarizing.
If you want to try the Tracer, either explore the free demo or upload your own logs in our free cloud trial (coming soon).
You can find a set of tutorials and articles for the Verisian Tracer and our other products here.
You can also watch this webinar.