Tomás Sabat Stöfsel
Dec 3, 2023
34 min read

Inaugural Podcast: Validation in Clinical Programming

Episode Summary

In the first episode of the Verisian Community Podcast, Tomás and Henning, along with industry expert Sunil, discuss how new technology can improve the world of statistical programming. They look at how the Verisian Validation Engine helps programmers guarantee the integrity of their analyses. The discussion moves through the evolution of data analytics, the potential of AI in clinical programming, and a preview of Verisian's forward-looking initiatives in clinical reporting.


About The Verisian Community Podcast

The Verisian Community Podcast brings together experts in clinical trials to exchange innovative ideas and best practices central to clinical reporting, submission and review. Aligned with Verisian's mission to accelerate the evaluation and market launch of new medical treatments, each episode features expert insights, with guests ranging from statistical programmers to medical writers, to discuss the challenges and opportunities of the latest software and technology.

You can listen to us on Apple Podcasts or Spotify.

Resources

Podcast Transcript

Tomás: Welcome everyone to the first edition of the Verisian Community Podcast. It's my pleasure to introduce two good friends of mine, Henning and Sunil. But first, before we dive into the topic we want to talk about today, a little bit of background on who we are and why we're doing this podcast. Henning and I started Verisian only, I think, five months ago, and we've been very passionate about sharing the knowledge we've gained from people in the industry. One of our big missions in creating the Verisian community is to bring people together, virtually and in person.

And one of the ways we want to do that is by hosting a regular series of podcasts: inviting the people we know, sharing knowledge, building a community and seeing what happens. I think there's really nothing more beautiful than bringing people together and seeing how we can build a better future, whether that's with technology, healthcare, or any other field. Stay tuned with us, and we're open to feedback. So, to all of our listeners: if there are specific topics or people you'd like to hear about, or if you'd like to join as a guest on one of our future podcasts, please let us know. We've got an online community as well, so check out our page online; we'll be updating it regularly alongside blog posts, webinars, and upcoming in-person events. With that, it's my pleasure to introduce our guests, from all over the world actually. I'm currently in London, where it's freezing. Henning is in Berlin. How's the weather in Berlin, Henning?

Henning: Also freezing here with snow and everything.

Tomás: And Sunil, you're on the other side of the world, in Los Angeles, which I'm assuming is a little bit better.

Sunil: Yeah, Los Angeles weather is really great. Evenings are getting a little cool, but during the day it's very pleasant. Definitely walking weather, nice and sunny for sure.

Tomás: Excellent. Firstly, thank you so much both for joining us today. So now I'd love for you to introduce yourself a little bit to the audience. Can you tell us a little bit about your story, your journey?

Sunil: Sure. Yeah. I've been in the pharmaceutical industry for well over three decades now, and I'm really passionate about working with clinical trials: working on studies for medical device and pharmaceutical companies, working with the clinical data, and attending conferences of like-minded people.

That's really exciting. Recently, I was the co-chair of the PHUSE Single Day Event, and that's where we had the opportunity to meet you, Tomás, as an invited speaker. We had some really good topics and discussions. I'm really passionate about leveraging technology. In my earlier days I was very active in leveraging metadata, because I'm a big advocate of standardization and automation; there's tremendous benefit to be gained from that. And with that, I've been fortunate enough to meet you and then to be invited to this podcast.

Tomás: And I think that was a wonderful day. Actually, I think we met the night before, because there was a dinner and we had some margaritas at someone's place beforehand; there were a lot of margaritas flowing that evening, so that was awesome. And gosh, that's already three months ago. Time flies.

Sunil: That's true. Yep. A lot has happened in those three months.

Tomás: And Henning, of course we've known each other for many years in different capacities. But tell the audience a little bit about yourself and how you ended up co-founder and CTO of Verisian.

Henning: Sure. Yeah. By background I'm mostly a biomedical scientist, but I also studied software engineering. During my PhD I worked on a variety of topics: I did some work in epigenetics, sensory perception and metabolism. But independent of the field, I was always building software in order to do science.

And to run or analyze experiments in ways that weren't possible without that software. After my academic career, I worked at Bayer for about seven years. There I worked in IT and infrastructure, did some product management but also engineering, and finally ended up in basic computational research, which is basically where you and I met again, when you were at Vaticle.

I was using TypeDB, formerly known as Grakn. I did some work with Grakn and built a tool now called TypeDB Loader. So that's how we've known each other, I think since 2019 already. And then we started Verisian at the beginning of this year, or really more towards the middle of the year, at least officially.

Tomás: Yeah, and it's been quite the journey. I remember it was a shame we didn't get to meet in person back then. For context for the audience: the last company I was building was an open source database company, and Henning, at Bayer, was a very valued customer of ours. We organized one of our main user conferences.

You won't believe it, and I still think about how lucky we were. I don't know if I told you, Sunil, but we organized a very big conference with 300 guests and 25 to 30 speakers, literally at the beginning of February 2020, and four or six weeks later the lockdown happened. So those were definitely different times. Switching a little bit to the topic we want to discuss today: we've recently released a demo called the Validation Engine. In that demo we tried to put something together very quickly, built in just a couple of weeks, to showcase to the world a little bit of our philosophy on how we think we can help the world of statistical programming. Henning, could you start by telling the audience a little bit about what we've done with the Validation Engine and why you think it's useful for programmers?

Henning: Yeah, sure. The Validation Engine, like you said, is basically an expression of the core technology that we're building. It's powered by our graph analytics. What we do is analyze the SAS logs of a clinical trial or study analysis, and this can be the logs for an entire study or even just a single log file for a single analysis.

From this, let's go with the case of a full study, we can produce a graph that contains a representation of essentially everything that happens in that analysis: every line of code, every dataset, every variable, every derivation, all of it is in there. And so we can provide full traceability. You can pick any point in the code, any dataset you're interested in, any variable, and figure out everything that is required to produce it and everything it depends on. That is what we call full traceability, and fundamentally the demo shows exactly that.

One of the views of the demo is that you can actually look into that study graph and traverse the whole dependency graph of all the datasets; that's the level we chose to display for the demo. The other thing is that we wanted to really showcase the value of it from a specific angle, which is the logs. When we find log messages, we can attach them to the graph, because a log message of course comes from some executed code, which maybe didn't work the way it was supposed to. So the log messages are now attached to that graph.

What we can then show is, for any log message, exactly where it occurred. Anyone can do that, you can just look into the logs. But we can also do a couple of other things: we can very easily figure out what the log message is likely about, and, even more exciting, we can show how log messages are connected. Maybe one log message that is upstream in the analysis is an indication of why there are downstream problems, which are in turn indicated by other log messages. That is really what the demo is trying to show: a mixture of capability but also immediate use, because you get an overview of an entire trial or trial analysis.

So it's easy to understand how it was built, it's easier to onboard people who are new to that trial, and it gives you a full overview of the log messages and how they're connected, as well as suggestions for how to fix them.
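
To make the idea concrete: Verisian's actual engine works on a much richer representation (down to individual code lines and variables), but a minimal sketch of the dataset-level version, assuming only the standard SAS NOTE lines about datasets read and created and an illustrative log fragment, might look like this in Python:

import re
from collections import defaultdict

# Illustrative log fragment; a real study would supply one or many full SAS logs.
LOG = """
NOTE: There were 150 observations read from the data set WORK.DM.
NOTE: The data set WORK.ADSL has 150 observations and 12 variables.
NOTE: There were 900 observations read from the data set WORK.LB.
NOTE: There were 150 observations read from the data set WORK.ADSL.
WARNING: Multiple lengths were specified for the variable AVAL.
NOTE: The data set WORK.ADLB has 900 observations and 20 variables.
"""

READ  = re.compile(r"observations read from the data set (\S+?)\.?$")
MADE  = re.compile(r"NOTE: The data set (\S+) has \d+ observations")
ISSUE = re.compile(r"^(ERROR|WARNING):")

parents = defaultdict(set)    # dataset -> datasets it was built from
issues  = defaultdict(list)   # dataset -> log messages attached to the step that made it
pending_inputs, pending_msgs = set(), []

for line in LOG.strip().splitlines():
    if (m := READ.search(line)):
        pending_inputs.add(m.group(1))
    elif ISSUE.match(line):
        pending_msgs.append(line)
    elif (m := MADE.search(line)):            # a step finished: record its edges and messages
        out = m.group(1)
        parents[out] |= pending_inputs
        issues[out] += pending_msgs
        pending_inputs, pending_msgs = set(), []

def upstream(ds, seen=None):
    """All datasets that a given dataset (transitively) depends on."""
    seen = seen if seen is not None else set()
    for p in parents.get(ds, ()):
        if p not in seen:
            seen.add(p)
            upstream(p, seen)
    return seen

print(upstream("WORK.ADLB"))                  # WORK.LB, WORK.ADSL, WORK.DM (order may vary)
for ds in {"WORK.ADLB"} | upstream("WORK.ADLB"):
    for msg in issues[ds]:
        print(f"{ds} carries: {msg}")

The same parents map that answers "what does this dataset depend on?" is what allows an upstream WARNING to be surfaced next to the downstream dataset it ultimately feeds.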

Tomás: And Sunil, you've worked in the industry for a long time and you have so much experience. I listened to your talk at the PHUSE SDE in Silicon Valley, and I was amazed by your expertise. I'm curious: how do you see this fitting into the workflow of a statistical programmer?

And what are your thoughts generally about the demo and what Henning just said?

Sunil: Yeah, actually I'm very excited about this. Traditionally, if you're working in the SAS environment, you have a display manager, and in that display manager you have the individual programs that you're working with. For programmers, taking a look at the log is definitely part of the process as you run the programs and debug.

So programmers are very familiar with the log, but it's a very manual, individual-program process that you go through to make sure the programs are working fine. What I really like about this approach and the technology being introduced here is that it leverages the metadata behind the programs. Programmers aren't just working with one program; they're working with a collection of programs that may call macros or other programs. So there's a level of complexity, hierarchy, organization and dependency there. Instead of manually tracing and going through the programs (of course, as programmers, we still have to structure and set all of this up), if we can have technology or tools that keep us organized, especially in a visual way that connects the programs, that has great potential. And not only that: the log actually contains the executed code. When we run macros we may not have the resolved code in front of us, but in the log you can see what got executed, and being able to trace the executed code back to the actual code representation has very great potential for SAS programmers. I'm really excited to learn more about it.
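
For readers less familiar with SAS: when a study is run with options mprint, the log carries the macro-resolved statements on lines prefixed with MPRINT(MACRO-NAME):, which is what makes the executed code recoverable from the log alone. A small sketch of pulling that code back out, using a made-up log fragment, could be:

import re

# Made-up log fragment; in practice this comes from the study's .log files.
LOG = """
73   %derive_bmi(inds=vs, outds=advs);
MPRINT(DERIVE_BMI):   data advs;
MPRINT(DERIVE_BMI):   set vs;
MPRINT(DERIVE_BMI):   bmi = weight / (height/100)**2;
MPRINT(DERIVE_BMI):   run;
NOTE: The data set WORK.ADVS has 450 observations and 15 variables.
"""

MPRINT = re.compile(r"^MPRINT\(([^)]+)\):\s*(.*)$")

resolved = {}                                  # macro name -> executed statements
for line in LOG.strip().splitlines():
    if (m := MPRINT.match(line)):
        resolved.setdefault(m.group(1), []).append(m.group(2))

for macro, statements in resolved.items():
    print(f"resolved code generated by %{macro.lower()}:")
    print("\n".join(statements))

Code reconstructed this way is the raw material from which a dependency graph or a reusable code library can then be built.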

Tomás: So now, Henning, obviously our mission goes beyond just the Validation Engine. Building on what Sunil said, tell our listeners a little bit about how the Validation Engine connects to the larger mission of Verisian.

Henning: Yeah, so maybe we start with the larger mission itself. What we really want to do is make clinical trial analysis faster, cheaper, and better. That's the magic triangle that people say you can't actually achieve, because if you make something faster and cheaper then you can't raise the quality, and so on.

There's always a trade-off between these three dimensions. But the truth, of course, is that you can change it, only then you have to change the whole paradigm of how something is done; you have to fundamentally do it differently. And that's really what we're trying to do: introduce a paradigm shift with new software and tooling that's based on our graph analytics and also the graph-based AI that we're developing.

I would say the core piece of it, at least for the next year or so, is the statistical programming hub, of which the Validation Engine is one component. We're calling these components engines. The Validation Engine basically focuses on ensuring quality and analysis integrity, as well as adherence to SOPs, regulatory authority requirements, all of these sorts of things. So it's a bit of the quality-control arm of the whole thing. You have that in software development as well: you have your testing, you have your QA, and so on.

The other part is really the building. If we want to do things faster, that also means we have to build faster. It's really two sides of a single coin: the Validation Engine on one side and the Analysis Engine on the other.

One of the things Sunil just mentioned is the macro system. Macro systems are great for generating code and doing that in a standardized way. But it moves towards generation: you just call this little line, but you don't always know what's behind it.

And for people to then understand what is behind it, it's literally a small black box that you can't look into. What we can do now is resolve all of that by analyzing the logs, so we can tell what code was actually executed. The next logical step, which would be great, is that you would be able to take that extracted code and work with it directly, right?

Independent of whether it came out of a macro or not. And that's where the Analysis Engine really starts. We can use the Analysis Engine in combination with the Validation Engine to extract old code and reuse it. But we can also build code much more quickly in the Analysis Engine, because it's integrated with the Validation Engine.

As you build new code, as you read in a dataset and create a new one, it updates in real time the graph that shows you the overall overview of your analysis, so you can actually see how you're changing the structure. And we have a feature in mind called edit in confidence. Say there's already an analysis, but something's not quite right, you're working on a little piece of it, and you're changing how a variable is derived in a dataset.

It will tell you exactly what consequences that change has in all of the downstream datasets that depend on the line of code you just changed. So instead of making a fix and running into the problem that all programmers have, that you might be introducing more bugs than you're fixing with that one-line change, you get something you can think of a little bit like a to-do list: these are all the things you have affected, this is exactly how you affected them, is this what you meant to do? If you go through it and you say yes, then your fix is an actual fix. It ensures that when you do make edits, they actually have the intended consequence.

And especially this part is something that a lot of the users we've been talking to, the clinical analysts, respond to: the pressure to be correct, to have your analysis actually be sound and not introduce any new errors, is so high that a feature which gives you, and everyone else, the confidence that as you develop new things or try to fix things your changes actually fix things and don't introduce more problems would be of great value. And again, it just saves an extraordinary amount of time, because instead of having to go through potentially hundreds of lines of code where your changes might have had effects, it tells you exactly where to look, and you only have to look there. It's going to cut down the time that takes absolutely dramatically.
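
The edit-in-confidence idea amounts to walking the same dependency graph in the other direction, from a changed node to everything downstream of it. A minimal sketch, with an entirely made-up variable-level edge list (the real engine derives these edges from the code and logs), might be:

from collections import defaultdict, deque

# children[x] = things derived directly from x; "dataset.variable" is an
# illustrative granularity, and these edges are made up for the example.
children = defaultdict(set)
edges = [
    ("ADSL.TRT01P", "ADAE.TRTA"),
    ("ADSL.TRT01P", "ADLB.TRTA"),
    ("ADLB.TRTA",   "ADLB.TRTAN"),
    ("ADLB.TRTAN",  "T_14_3_01.TRT_COLUMN"),   # a downstream table output
]
for src, dst in edges:
    children[src].add(dst)

def downstream(node):
    """Breadth-first 'to-do list' of everything affected by editing `node`."""
    todo, seen = deque([node]), set()
    while todo:
        for nxt in children[todo.popleft()]:
            if nxt not in seen:
                seen.add(nxt)
                todo.append(nxt)
    return seen

# If the derivation of TRT01P in ADSL changes, what needs to be re-checked?
for affected in sorted(downstream("ADSL.TRT01P")):
    print("re-check:", affected)

The point of the to-do list is not the traversal itself, which is trivial, but that the edges come from the actual executed code rather than from documentation that may be out of date.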

Tomás: So Sunil, what do you think about that? Specifically, tell me and our listeners your thoughts on changing the way programmers go about building their programs, whether it's ADaM, TFLs or anything else. Are programmers open to those changes? And how would you even go about making them, given that these are large organizations with very established environments? We've had a lot of conversations with programmers who are very open to this, but they themselves aren't quite sure how to go about it, because they're not the ones responsible. What do you think?

Sunil: Yeah, definitely. Those are all good and very valid questions. Many organizations are well established; they have systems, programs and macros in place and everything. Many organizations also have a log scan, so they realize the importance of checking the results of SAS programs. Ideally these log scans are summarized into a list, but there are a lot of limitations in what organizations have as far as log scanning goes. The ability to make program corrections with this type of tool is phenomenal. I think people will need to see it in action to believe that a change made through the tool could actually update the programs. But more important is a tool that helps us identify issues visually, with a dashboard, and everybody is pretty much sold on dashboards.

The key thing about dashboards is that you put the key things in there: the KPIs, the things that are most important. When we're talking about SAS logs, we're talking about errors, warnings and notes. We want to know, across the board, not just in one program but across the collection of programs used to create SDTMs, ADaMs or TLGs, where the vulnerable points are.

Where are those issues? Obviously we want clean logs, so identifying and removing all the errors, but there can be cases where warnings or notes come up. In particular, organizations have a software development lifecycle: they develop the individual programs, and when those pass QC they put them in a production environment, where scripts run them on a regular basis. What I see organizations doing is leveraging this type of technology in their production environment, mainly because it's an automated way to scan programs and logs and give them a first-hand view. We always want to be proactive, and a technology like this enables us to be very proactive without manual effort: we can have dashboards that give us information in advance. Previously, organizations might put code inside their programs to capture error messages and feed them back, maybe as an email, because the last thing you want is for errors to surface downstream, days or weeks later, with someone else finding the issues in your program. So while this is a kind of revolutionary approach, where you have a system that monitors and then actually makes updates to the programs, I think programmers will definitely welcome it. There will be a little bit of "okay, can this really work for me?", so they'll be cautious, but I think they'll be excited to see how this plays out.
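
The scan log Sunil describes is typically a small utility that sweeps every log in a study area and tallies the messages worth looking at. A minimal sketch, assuming a folder of .log files, the standard ERROR/WARNING/NOTE prefixes, and an illustrative (not exhaustive) list of notes that usually deserve attention, might be:

from pathlib import Path
from collections import Counter

# NOTEs that usually deserve attention even though they are not errors (illustrative list).
SUSPICIOUS_NOTES = (
    "MERGE statement has more than one data set with repeats of BY values",
    "Missing values were generated",
    "Numeric values have been converted to character",
)

def scan_log(path: Path) -> Counter:
    counts = Counter()
    for line in path.read_text(errors="replace").splitlines():
        if line.startswith("ERROR:"):
            counts["ERROR"] += 1
        elif line.startswith("WARNING:"):
            counts["WARNING"] += 1
        elif line.startswith("NOTE:") and any(s in line for s in SUSPICIOUS_NOTES):
            counts["NOTE?"] += 1
    return counts

def scan_study(log_dir: str) -> None:
    print(f"{'program':<30}{'ERROR':>8}{'WARNING':>10}{'NOTE?':>8}")
    for log in sorted(Path(log_dir).glob("*.log")):
        c = scan_log(log)
        print(f"{log.stem:<30}{c['ERROR']:>8}{c['WARNING']:>10}{c['NOTE?']:>8}")

scan_study("./logs")   # hypothetical folder containing adsl.log, adae.log, t_14_3_01.log, ...

What the graph-based approach adds on top of a tally like this is the connection between the counts: which upstream message explains which downstream one.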

Tomás: And those are really interesting conversations. I was in China three weeks ago, and what's really interesting is that all of these challenges and conversations are ubiquitous across the board. Everyone talks about the decoupling between the West and China, but I find it fascinating that the passion people have, whether they're programmers in the US, Europe, India, China or Japan, is all very similar. A question for you, Henning, and it's a bit of a loaded question; I want to hear your opinion on this as well, Sunil. Henning, you've got a lot of experience in software engineering across different fields. In your opinion, how do you define code quality, and specifically, how does that view of code quality differ between general software engineering and clinical programming?

Henning: Yeah, I would try to split that up, first into the commonalities and then into what the differences would be. Common to both, I would say, is robustness, which is probably one of the main things: the code that you write should be able to handle situations that you maybe didn't think of.

There's the classical try-catch construct, where you try something and you formalize in code the assumptions under which you think you can do it. So it's really the robustness of being able to handle unexpected events and unexpected data.

I think that's very similar in both cases. The other thing, which becomes more and more important especially as you get more senior in development, is how transparent your code is. How easy is it for you yourself to understand after two months? How easy is it for someone else to understand?

That has immediate implications for maintainability, especially as soon as you work on projects beyond just yourself. And even if it's just you, usually after two months, when you look back at anything non-trivial, you wonder: what exactly happened here? What did I do? What was I thinking? So I'd rather be more verbose. I do a lot of things which I know I could do differently, but I do them because I know that if I go back to the code two months later, I'll understand what I was doing.

That also flows into extendability: you write code in such a way that if you have to modify it, or you decide this is an interesting avenue or an important feature, you can build on top of it afterwards. Which is similar to reusability: you write it in such a way that you can reuse the functionality all over the place.

That is of course a massive topic for anything analytics-focused rather than software-development-focused, because having to rewrite an analysis you've done before is a waste of time in all worlds. The other thing that's more common in software engineering is testing, because testing in data analytics is a bit more difficult: you're basically testing data and how data flows through something, and that tends to be harder than testing the functionality of code itself. But I think fundamentally the principles are very much the same.

The main difference is that in an analysis you have more or less defined inputs and you have an end to the analysis, right? It starts somewhere and it ends somewhere, whereas if you work on application code there's not really a defined end.

It's more that there's a set of processes that each have a start and an end, of course, but they're all interconnected. So I know it's a bit of a wishy-washy answer, but I think the core commonalities are robustness, writing clean code, and having several steps instead of one giant step that solves everything but that neither you nor anyone else can understand, and that's impossible to debug when changes have to be made. I think those are really the main ones.

And of course, if you are in the analytical space, what's most important, I would say, is that you can handle, or at least be aware of, unexpected data and inputs. Because the greatest danger is that no errors are thrown, because the code works, but it's doing something you didn't intend it to do. I think the danger of that is actually higher in data analysis than in more classical software engineering. And that's one of the things we want to look into very strongly with the Validation Engine.

It's not just catching log messages, which, fundamentally, like Sunil says, everyone probably has a tool for. It tells you, say, that you're trying to perform an operation on missing values, but the question is: where do these missing values come from?

And as someone develops or modifies code, can't we look at it in real time and say: look, if you do this, given that you're at this spot in the data analysis, and there is, let's say, a missing value three steps up, because that's just the data that comes in, then you're going to have a problem, which you can't see right now because it hasn't happened yet?

It's not in the log yet, but your code is written in such a way that you're going to produce this problem. So there's this predictive power, which is basically the fusion of looking at the code, really understanding what someone is writing, and then being able to say either that this code is actually robust or that we can already identify cases in which it will fail. Highlighting that, and then even connecting it to log messages that appear in a different spot but really indicate that the problem lies somewhere else, is where this traceability comes in extremely handy. So we are going to be pushing a lot of these software engineering principles into our analysis, and not to overwhelm people, right? It's not going to be big words and complicated explanations of how we did it. What we're trying to do is abstract that away, so there should simply be a message that says: look, this code can produce the following issues; if you write it this way instead, all of that disappears. And then, ideally, just a button where you say: okay, accept, I agree. That's the idea.
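
The predictive part Henning describes comes down to propagating a property, for example "can be missing", forward through the derivation graph before the code is ever run. A minimal sketch of that propagation, with an entirely made-up derivation map and variable names, could look like:

# Which source variables are known (from the data or its metadata) to contain
# missing values, and how downstream variables are derived from them.
# Both maps are made up for illustration; a real engine would extract them from code.
can_be_missing = {"VS.HEIGHT"}                  # e.g. height not collected for some subjects
derivations = {
    "ADVS.BMI":    {"VS.WEIGHT", "VS.HEIGHT"},  # bmi = weight / (height/100)**2
    "ADVS.BMIGR1": {"ADVS.BMI"},                # a grouping derived from BMI
}

def propagate(can_be_missing, derivations):
    """Flag derived variables whose inputs can be missing, before any log says so."""
    flagged = set(can_be_missing)
    changed = True
    while changed:                               # fixed-point iteration over the derivation map
        changed = False
        for out, inputs in derivations.items():
            if out not in flagged and inputs & flagged:
                flagged.add(out)
                changed = True
    return flagged - set(can_be_missing)

for var in sorted(propagate(can_be_missing, derivations)):
    print(f"{var} can become missing because an upstream input can be missing")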

Tomás: And Sunil, how do you see this topic of code quality in clinical programming?

Sunil: Yeah, just to add onto that, I totally agree. There's the concept of program validation and compliance: in this regulatory environment, we're talking about strict compliance. With a tool like this, what happens is you have a team.

You have individual programmers, a source programmer, a QC programmer, several people, and a lead who has the ultimate responsibility. The lead would use this type of tool to look at what's going on overall across the programs. And I really like the concept of traceability. As mentioned before, the log scan is pretty standard and common in the industry, but a tool such as this is going to enable us to do a lot more, and that's exciting. With traceability you can trace the impact. Many organizations do some type of impact analysis, so having a tool to facilitate impact analysis enables us to make better decisions in a very timely manner, and anything that eliminates a manual process helps. A tool like this has to be fully integrated with the programs, and that makes it very interactive. So it's really exciting to see the visual component of that and to be able to drill down. I think that really empowers SAS programmers to provide better service and higher compliance in this regulatory environment. These are really nice features that programmers will welcome.

Henning: And maybe if I can add to that, just to pick up on the drill-down idea Sunil was referring to: you have a visual representation, but you also have this integration directly with the code. I think one way of thinking about this is that the ultimate source of truth for everything we're doing is either the log or the code that we extract out of it, and for probably the vast majority of the functionality it's going to be the code itself. So what we're doing with this overview, showing how the datasets are created and how they depend on each other and so on, comes out of the code.

You can think of it as one view of the actual code, but we can show a similar view at variable level, and a similar view for formats, if you're interested in those: where are they created, how are they stored, where and when are they applied to which dataset, and so on. Because if you use formats and formatting features, there are a lot of things that can go wrong because of the way you're using them.

What we're trying to do is figure out the best ways of displaying this ultimately extremely high complexity of everything that happens in a complicated analysis, and give it to you piecemeal, so that it's easy to go through things one by one and it all becomes very transparent. And it's not just for you, the programmer; it also helps if you're in a discussion with someone, let's say a statistician.

You can actually say: this is exactly what I implemented. Is this in accordance with the SAP? Is this in accordance with what you were thinking? You have a different basis for communication. It's not just the code: ultimately, anything that we show you, at any level, you can always trace back to the source of truth in the code, and we can tell you exactly where it was specified, exactly where it was applied, and how we can tell that these are the same two things. That's what gives you the confidence, while at the same time you can move up to these higher levels of abstraction to facilitate the communication.

That can happen within your team, within the programming team, between validation and the people writing the first exploratory ideas, and, as we imagine, Tomás, very possibly also towards regulatory authorities eventually, so that it's easier for them to understand what is happening in an analysis and how it was structured, et cetera. But that's a bit more down the line.
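
As one concrete illustration of the format view Henning mentions, the places where SAS formats are defined and applied can already be recovered from the source code with fairly simple pattern matching. A rough sketch, using a made-up code snippet and ignoring many real-world cases (format catalogs, formats defined in other files, INFORMAT statements), might be:

import re

# Made-up SAS source; the real engine works from the parsed study code and logs.
CODE = """
proc format;
    value agegrpf 1='<65' 2='65-74' 3='>=75';
run;

data adsl;
    set dm;
    agegr1n = 1 + (age >= 65) + (age >= 75);
    format agegr1n agegrpf.;
run;
"""

DEFINE = re.compile(r"\bvalue\s+(\w+)", re.IGNORECASE)
APPLY  = re.compile(r"\bformat\s+([\w\s]+?)\s+(\w+)\.", re.IGNORECASE)
DATA   = re.compile(r"\bdata\s+(\w+)\s*;", re.IGNORECASE)

defined, applied, current_ds = set(), [], None
for line in CODE.splitlines():
    if (m := DATA.search(line)):
        current_ds = m.group(1)
    if (m := DEFINE.search(line)):
        defined.add(m.group(1).lower())
    if (m := APPLY.search(line)):
        applied.append((current_ds, m.group(1).strip(), m.group(2).lower()))

for ds, variables, fmt in applied:
    status = "defined in this code" if fmt in defined else "NOT defined here"
    print(f"dataset {ds}: format {fmt}. applied to {variables} ({status})")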

Tomás: So the last thing I wanted to talk about today is something that's on everyone's mind, something we have to touch on even though I know some of the opinions you both have on it: generative AI, the biggest buzzword around today. Sunil, I'd like to start with you.

How do you see generative AI, and is there any role at all for gen AI in statistical programming, where 99% is just not good enough and it has to be a hundred percent correct? These generative methods are, by definition, probabilistic in nature, and asking for a deterministic AI system goes against the whole principle of an LLM.

But I'd love to hear your view on this.

Sunil: Yeah, obviously when you're talking about AI, it conjures up maybe some fear, some uncertainty, an unknown, because you're not exactly sure what it is. But on the other side, we have seen technology advance, in cars for example, so we can see the benefits. When I look at a new technology, or in this case AI, I ask: how can I relate to it? How does it make sense to me? As a programmer, I try to look at the design of it, the backend of it. Maybe it's working off of metadata components, and then I can think through how it would behave and how the technology can make that happen. One of the key things people should look at with tools, and specifically AI tools, is that we want some type of benefit from them, because there's some intelligence, some smarts, like we have with smartphones; they empower us to do more. But we have to be careful to make sure that what a tool is actually doing is what I would consider validated and actually useful for the user. Those are the two things: how useful is it for me, in a way that translates into true benefits, and is it really validated? If I feel comfortable about those two things, I think it can go a long way to saving me time while still delivering the high-quality results that my team would expect from me.

Tomás: And how do you see the role of AI in clinical programming, Henning?

Henning: Yeah, I think it's very difficult, obviously, because it's moving very quickly and there are developments everywhere. I've been struggling with this myself, trying to put it into some sort of frame: where can it be useful?

And where should it be useful? I think that's a question that isn't asked enough: where should it be applied, and how should it be applied? The way I've been thinking about it, just from first principles, and I think Sunil basically said the same thing, is: what is the source?

Where does it come from, and in which direction does it generate? If it generates from natural language into an analysis, it's trying to go from something where someone says: this is my clinical trial, these are the kinds of patients I have in it, this is the kind of endpoint I want, now generate me an analysis. I find that very unlikely to be successful any time soon, because you're going from, let's say, lower information density and trying to get something extremely specific, and possibly very complicated, out of it.

But you can go about it the other way: where can we go to an actual source of truth and, from there, generate something that is maybe a bit less dense? This is probably apparent by now, but we're focusing on code. And the reason why, Tomás, I gave you a lot of my opinion on that at the very beginning, is that the ultimate source of truth for what a clinical trial is, or what happens in it, is actually the analysis, right? A clinical trial really is the analysis plus the data, and the analysis sets everything into context with the metadata, with the data, and ultimately with the conclusions.

What I think is that if you start with code, you train and fine-tune on code, and then you try to support the processes that currently take a lot of effort but are largely repetitive, you go from this higher information density to something that's a bit less dense.

Then I think these systems can be very effective, always with the idea that they make the processes more efficient and don't take over the process of creating the science itself. Because for science to work, I think a lot of the discovery and creativity comes out of friction with the problem itself.

If people are too far removed from that friction, I think a lot is going to be lost. But let me try to get very concrete now, after all this philosophy. If we start with code, what are the things we can do in clinical trials that I think would be very useful?

One thing is that there's boilerplate code that everyone has to write, and that's just a pain. We have this in software engineering too, and the big models are now actually quite good at writing the really boring stuff, but as soon as you get to more complicated things they go off the rails very quickly.

So the boilerplate code, I think we can say, will be useful. If you have an AI agent that is trained on SAS, specifically trained for clinical trials and how things are done, it can probably write some useful things automatically and save you time.

The other thing is that there's a big transition now, or at least an attempted transition, from SAS to R. Going from one language to another language, I think, is also realistic. So if there are certain things, modules, macro systems, or things that macro systems produce that would be useful to translate from SAS to R, I think that's also realistic.

And the other thing is that I think it could greatly help with communication. For example, helping medical writers write first drafts of certain submission documents, because a lot of that information is either in the data, in the code, or in the combination of data and code. If you start with the correct bottom line, which is code and data, and generate from there, I think that's actually quite realistic, at least for a first draft.

Whereas if you start from documentation and then try to generate more documentation, I'd be very careful with that, because the information density of what you start from is very low again. And the other thing that I really like is the summarization of code,

which is also something LLMs are actually quite good at right now. If you already have code, take my favorite use case: I wrote code three months ago and I'm looking at it again. Then to have something that basically says: look, fundamentally this step does X, Y, Z, it's in this pipeline, and this pipeline overall does this, right?

And these three steps do X, Y, Z, they apply a filter, and the data is filtered based on these three variables. These kinds of summaries just make it easy to quickly orient yourself, which again is an efficiency gain in terms of orientation, onboarding, knowing where you are and what you need to do.

I think there it can become very useful very quickly. As for us, we're starting to build this right now, exactly these three use cases, because I think they're the ones that are most realistic, useful enough, and of high enough fidelity to be deployed.

But of course no one's tried it yet, so we'll have to see how good it is, whether the quality is high enough, or whether statistical programmers say it's causing more trouble than it's helping with. Everything is moving extremely fast, so we're going to be continuously monitoring it and continuously trying to turn it into something very useful:

making the people who already have a lot to do much more efficient, and getting rid of the boring stuff that occupies their brains so they can use them for the things that are actually challenging and really important.
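
Of the three use cases, code summarization is the easiest to sketch, because the engine already has the executed code for each step and the model only needs to be asked to condense it. A rough illustration, where send_to_llm is a placeholder for whichever model client would actually be used and the SAS step is made up, could be:

# Hypothetical sketch: only the prompt construction is shown; `send_to_llm`
# stands in for whatever LLM client would actually be called.
SAS_STEP = """
data adlb;
    merge adsl(in=a keep=usubjid trt01p saffl) lb(in=b);
    by usubjid;
    if a and b and saffl = 'Y';
    if lbtestcd in ('ALT', 'AST', 'BILI');
run;
"""

def summarization_prompt(step_code: str, pipeline_context: str) -> str:
    return (
        "You are summarizing one step of a clinical trial analysis written in SAS.\n"
        f"Pipeline context: {pipeline_context}\n"
        "In two or three plain-English sentences, state what this step reads, "
        "how it filters or derives data, and what dataset it produces.\n\n"
        + step_code
    )

prompt = summarization_prompt(
    SAS_STEP,
    "Step 3 of 7 in the laboratory analysis pipeline producing ADLB.",
)
# summary = send_to_llm(prompt)   # hypothetical call; any model client could go here
print(prompt)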

Tomás: That's a good lead-up to the last question I wanted to ask, and this one is for you, Henning. Tell our listeners a little bit about what's next for Verisian. What are we building next? What's on the roadmap for the next 12 months?

Henning: For the next 12 months, it's really a focus on the statistical programming hub, which I already mentioned. We have the Validation Engine now, but we're going to expand it quite strongly. Right now we're looking at log messages; we're going to start adding things like a robustness score and scanning for code that opens the analysis up to problems if there's missing data or a certain construct it's not expecting.

Those sorts of things. Then we also want to move on to the Analysis Engine, which is this idea of building studies. The Analysis Engine is actually a combination: it's both building new studies and modifying existing studies. As I said, this edit in confidence: if you do make a change, what are the downstream effects?

If you do need to make a change or write something new, you know the best place to start your analysis from, and if you read in a certain dataset, you really know where that dataset comes from and everything it depends on. The other thing we can do, as a combination of the Validation Engine and the Analysis Engine, is look at and process a lot of old studies that have already been submitted or run.

If we look at those logs, we can extract the code, including the macro-resolved code; we already talked about that. So we can basically build up a library of code that has been written before, and we can organize that library, let's say by indication or by phase or something like that.

We can create a library of code for these different pieces that we can put together, and thereby maybe make it much more efficient to build new studies. This actually ties into the AI discussion we just had, because fine-tuning these AI systems for statistical programming of course requires that you fine-tune them with code.

And the tooling that we're building is, incidentally, all about handling code: we're extracting, handling, rearranging, doing whatever we want with code. So we can use that whole framework, all the tooling that we have at Verisian, to create datasets to train AI agents. One of the ideas that we really want to get into now, and we're looking for companies that want to do this with us, is this: if you're a company that has a lot of studies, let's say in a certain indication over the past 20 or 30 years, we go through all of those studies in that particular indication and create an AI agent for that company and that indication, fine-tuned on exactly how those kinds of studies were analyzed. That then helps again with the efficiency idea: it can maybe suggest certain ways of doing an analysis or even generate very specific code. So that's, I think, a really exciting avenue that we want to explore, and it's all part of the Analysis and Validation Engines.

Once they both exist, we can couple them, so that the Validation Engine isn't something you run only at the end, after having written a bunch of programs, to get a score and an overview, but instead works in real time.

It's attached to the Analysis Engine, and as you write code, as you get the AI engine to generate something, or as you reuse code from before, it gives you live feedback. You write a piece of code and it says: look, if you write it like that, you're not considering the case introduced 20 steps above, for X, Y, Z; that sort of idea. That's why these things belong together in the statistical programming hub: as programmers build, or as they look at already-built studies, the engines feed off each other and everything becomes real time.

Picking up on what Sunil was saying, the idea then is that it's not that you put a script into your production environment and run it as a monitor; it's always running. So you get the feedback not only in production but also as you develop every single little piece of code, which of course again makes the cycle times much shorter. Beyond that, and this basically depends on feedback from the industry, what people are interested in and what the greatest pain points are, we have a couple of other avenues.

One is the medical writing hub that we've been looking at, which is basically what we just mentioned: creating first drafts of some of these submission documents based on the code analysis that we can do. I'll stay high level on these. Then there's the submission hub, which we also already mentioned, where we really want to give regulatory authorities the ability to explore studies that were created through the Verisian platform. So if you get a validation from us, using the Validation Engine, that says: look, your code quality is extremely high, you're handling all kinds of crazy cases, it's extremely robust, it's very transparent, we can't find any problems, and all of the data standards are up to snuff, then we can package that and turn it in, maybe as part of a submission, so that it's yet another stamp of approval that says: look, this is really well done.

So that's one of the ideas, and then also giving the authorities access to explore the analysis, which they're free to do, or would be free to do. And underneath all of it, what we're imagining, as Sunil also already indicated, is that it would really be useful to have a dashboarding system that covers the whole process.

We want a real-time overview of everything; we're calling that the monitoring hub. We want a live overview of data quality as it's being collected and as it enters the realm of the statistical programmers, of code quality, but also of the status of the study itself. For a certain dataset you're interested in, or a certain statistic or TLG: given the data that's been collected so far, what does that table look like?

What does that statistic look like? So that you have a constant idea of: is your code actually running? Is it processing the data the way it's supposed to? As new patients come in, do the right buckets fill up, or is there a problem with the analysis that filters out a certain population you really want to have?

And I think that just needs to be real time. In all kinds of other industries, as data is created we know exactly where it's going, and that needs to happen here as well. So that will basically cover the entire process in everything we're doing. I think those will be the next steps: the statistical programming hub to build and validate analyses, the medical writing hub, the submission hub, and real-time monitoring to have awareness over the entire study as it progresses.

Tomás: Thank you, Henning, for running our listeners through that. There's certainly a lot of work for us to do over the next 12 months, and we're looking forward to it. Sunil, do you have any final remarks, whether on what Henning just mentioned or anything else you'd like to impart on our listeners before we conclude our first episode?

Sunil: No, other than that it all seems really exciting. From what I'm seeing, which is the tip of the iceberg, I'm really excited about the possibilities becoming available to SAS programmers and the things we can do to better support our teams using these types of tools and advances.

So I'm really excited to be part of this effort. Thank you.

Tomás: And I think we're going to see you at PHUSE US Connect, right?

Sunil: Yes, exactly. I'm giving a presentation over there, so I'm looking forward to seeing both of you over there as well.

Tomás: What's the name of the presentation? Just to give a little bit of advertisement for your talk.

Sunil: Sure. It's in the professional development section, and as was mentioned, there is definitely a big transition going on from SAS to R. What I've done is put together a platform, r-guru.com, where I feature a lot of things that make it easier for SAS programmers to better understand R. In addition, I have a whole page on pharmaverse, which is dedicated to R packages supporting the pharmaceutical industry. I think that's the big trend that's going on, and I look forward to giving that presentation.

Tomás: Yep. And we'll add the website to the podcast notes. Anyone who's interested should definitely go to the website, reach out to you, and certainly attend your talk at PHUSE US Connect.

Sunil: Thank you.

Tomás: Okay. And of course we're going to be there too, and we've got some exciting stuff; we're presenting our work there as well.

If anyone's interested in connecting with me, with Henning, or with Sunil, don't wait until PHUSE US Connect, because, gosh, that's still three months away. We've got a webinar coming up very soon, and there's a lot more coming. I'm really passionate about building our community, and not one just led by us; we want it to be a community effort. My background is in open source, and I think there are a lot of great things we can do by having open conversations. The move to R is particularly relevant here, because we've also seen a lot of interesting push towards R, and it's difficult, because up until today probably almost 98% of every single study has been done in SAS. Henning, is there a final remark you want to make before we conclude the episode?

Henning: I think it's really about open source and R; that would be something great to get feedback on, as Sunil is also looking into R. Everything that we've built so far, and that we're building, we can also do for R. We've had these discussions going back and forth, and actually one of the first things Sunil asked was exactly that: can we do this for R as well? Of course we can, but we have to focus our development efforts somewhere. What I forgot to mention about the next 12 months of roadmap is that we are currently building everything for SAS, but if the community has a stronger pull towards R, then that is also something we can explore and get started on.

I would really look forward to feedback in that regard, to get a feeling for where everything's heading and where people's heads are at. Otherwise, I hope to see many people at the webinar, and I hope also to meet some of the listeners in Washington in February, or before then; who knows how our paths will cross. But otherwise, thank you very much. Thanks, Sunil. It was an honor to be on here with you.

Sunil: Thank you. I appreciate being here. Thank you for inviting me.

Tomás: Thank you all, and thank you to all of our listeners for listening to the first episode of the Verisian Community Podcast. I hope you found it interesting, and if you didn't, let us know; that's great feedback too. I hope to see you at the next podcast or webinar, wherever that is. Henning, Sunil, have a wonderful day and evening, and until next time.

Sunil: Thank you.

Tomás: Bye-Bye.

Henning: Thanks.
