3 steps to make your research more reproducible

Reproducible research is a key part of research in the pharma industry.

It allows for transparency, understanding, and accuracy in the research process. But how can you make your research more reproducible?

Today, I talk with Heidi Seibold who has dedicated her career to helping researchers become more reproducible.

Let’s take a look at 3 steps that she recommended for making your research more reproducible:

Get your files and folders in order
Use meaningful names
Document your work

Making sure that your research is reproducible is important in the pharma industry because it allows others to understand what you have done and how you arrived at your conclusions. By following these three steps (getting your files and folders in order, using meaningful names, and documenting your work), you will be able to make sure that your research is reproducible for anyone who needs access to it now or in the future!

Resource: TranspariMED

Share this episode with your friends and colleagues who can learn from this!

Dr. Heidi Seibold

She helps research teams with open innovation and reproducible data science by providing interactive workshops and consulting. As a well-known expert on open science and research software, she gives keynote talks and serves as an ambassador for good scientific practice in the digital era. She used to work as a biostatistics and machine learning researcher at the University of Zurich, LMU Munich, University of Bielefeld and Helmholtz AI.

Transcript

[00:00:00] Alexander: Welcome to another episode of The Effective Statistician. Today, I’m really excited to have Heidi here with me and the topic we will talk about. Hi, how are you doing?

[00:00:14] Heidi: Hi. Thanks for having me. I’m great.

[00:00:17] Alexander: Greetings to Munich.

[00:00:20] Heidi: Thank you.

[00:00:21] Alexander: What a wonderful city.

[00:00:24] Heidi: That’s true. I like living here.

[00:00:28] Alexander: For those who don’t know you, maybe you can introduce yourself a little bit and tell us what brought you into creating your own business now.

[00:00:40] Heidi: Yeah. Thank you. My name is Heidi Seibold. I’m a trained data scientist and statistician, and I decided this year to become self-employed. I’m now a trainer and consultant for open and reproducible data science. And I guess what that means, we’ll talk a little bit more about later on.

Why am I self-employed and not at a university or research institution or a company? I was a researcher since 2014. I did my PhD at the University of Zurich in computational biostatistics, and then went on to do several postdoc positions at different research institutions here in Munich. So LMU and the Helmholtz Center, and also at the University of Bielefeld, and they were all great. But I felt more and more that I’m not like a researcher myself, or I don’t wanna be a researcher myself, but I wanna help people do better research. So, I’m really excited about improving the way we do research and that’s what I’m committing my time to now.

[00:01:45] Alexander: Yeah. And as we record this, it is the end of 2022, and the podcast goes live in 2023. You’re building your company quite nicely. And just before we hit the record button, we were talking about the retreat that you have organized and that is fully booked out. Congratulations on that.

[00:02:12] Heidi: I’m super excited about it. Yeah, it’s called open science retreat.

[00:02:15] Alexander: Yeah. And it’s happening in April and I’m pretty sure it will not be the last time that this is happening.

[00:02:24] Heidi: Pretty sure. So if this will be a success, which I really hope it will be, and we’re putting all our efforts into it being a success, then we’ll for sure organize another one, or many more actually, because I really like organizing events.

[00:02:42] Alexander: And given that it is fully booked, I think that speaks for itself already. All slots have been filled: not only the paid slots, but also those where you give out free or supported slots.

[00:02:57] Heidi: Yes, exactly. We have sponsors and we were able to give out travel stipends. It’s gonna be six or seven stipends. It’s not fixed at the moment as we speak, but it looks like it’s gonna be around six stipends that we can give out to people who cannot afford joining with their own finances or the finances of their institution. But we were able to find sponsors that help us with that, which is really cool.

[00:03:26] Alexander: Yeah. So, let’s dive a little bit into the background. We talked about reproducible research now a couple of times. What is that and what’s the underlying problem there?

[00:03:39] Heidi: Yeah, that’s a good question. So what is reproducible research? For now, for the sake of this podcast, I will focus on the part that’s happening on the computer, what we also call computational reproducibility. And we speak of reproducibility in that context if, with the same data and the same analysis, we get the same results. And that should be like the minimum standard, right?

If you do the same thing with the same data, then you should get the same results, although that, it turns out, is already pretty hard. There are several reasons for that, and that’s a huge issue, especially when we talk about research. But in statistics generally, or in data science, non-reproducibility is a huge issue that we wanna avoid. And why is this such a big thing? There have been several studies now that show that a lot of research out there is not reproducible for one reason or another. So that may be because the data or code aren’t available. That may also be because the software packages have changed in the meantime. That may be because the results were generated not with code, but with point-and-click programs or something where it’s really hard to reproduce results in the first place. So there are all kinds of different reasons why results are not reproducible.
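To make the "same data, same analysis, same results" standard concrete, here is a minimal Python sketch. It is not something discussed in the episode: the toy bootstrap analysis, the function names `run_analysis` and `fingerprint`, and the seed value are all assumptions chosen for illustration. The idea is simply to run the analysis twice and compare a hash of the results.

```python
import hashlib
import json
import random
import statistics


def run_analysis(data, seed=2022):
    """Toy analysis (hypothetical): bootstrap the mean of `data` with a fixed seed."""
    rng = random.Random(seed)  # local generator, so no hidden global state
    boot_means = []
    for _ in range(200):
        resample = [rng.choice(data) for _ in data]
        boot_means.append(statistics.mean(resample))
    return {
        "mean": statistics.mean(data),
        "boot_se": statistics.stdev(boot_means),
    }


def fingerprint(results):
    """Hash the results so that two runs can be compared exactly."""
    canonical = json.dumps(results, sort_keys=True)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


data = [4.1, 5.3, 3.8, 6.0, 5.5, 4.7]
first = fingerprint(run_analysis(data))
second = fingerprint(run_analysis(data))
print("reproducible:", first == second)
```

Without the fixed seed, the two fingerprints would usually differ, which is one small example of why getting identical results is harder than it sounds.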

[00:05:02] Alexander: What does that have to do with the results that you see in the paper being traceable back to the data? So that if you see the results, you understand where they are coming from and how they got into the paper.

[00:05:20] Heidi: Yeah. That’s a good question. The traceability is super important because if I cannot check how you got your results, so if I don’t have access to the code, for example, then it’s really hard for me to understand whether your results are reproducible or not. This question of availability of data and code is super important and very much linked to the question of reproducibility and the possibility to check whether something is reproducible or not.

[00:05:57] Alexander: That’s cool. So what would be the ideal state for reproducible research?

[00:06:05] Heidi: Yeah. If everything was reproducible, that would be great. And the gold standard that we speak of today is when we have something that’s called a research compendium. Essentially, you can think of it as a little box, and everything you need to reproduce the results is within that box, and there’s just a button that you need to push that runs all the analysis again, and you get the results in a way that you can actually understand what’s happening as well.

[00:06:37] Alexander: So not a black box thing? Yeah?

[00:06:39] Heidi: Exactly. Not a black box thing, but something where you can actually look inside and check what the person who did the analysis the first time actually did. Which decisions did they make? And what does the data look like? How was the data cleaned? How was it analyzed? What decisions were made regarding the modeling, inclusion of parameters, or whatever. Yeah.
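The "box with one button" idea can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the step functions `clean`, `analyze`, and `report`, the single entry point `run_all`, and the tiny in-memory data set are all hypothetical. A real research compendium would read raw data from files and write out figures and tables.

```python
def clean(raw_rows):
    """Step 1: drop rows with missing values (the decision is visible in code)."""
    return [row for row in raw_rows if row["value"] is not None]


def analyze(clean_rows):
    """Step 2: compute the mean value per group."""
    groups = {}
    for row in clean_rows:
        groups.setdefault(row["group"], []).append(row["value"])
    return {name: sum(vals) / len(vals) for name, vals in groups.items()}


def report(summary):
    """Step 3: turn the numbers into text that could go into the paper."""
    return "\n".join(
        f"{name}: mean = {mean:.2f}" for name, mean in sorted(summary.items())
    )


def run_all(raw_rows):
    """The 'one button': run every step, in order, from raw data to report."""
    return report(analyze(clean(raw_rows)))


raw = [
    {"group": "control", "value": 1.0},
    {"group": "control", "value": 3.0},
    {"group": "treatment", "value": 4.0},
    {"group": "treatment", "value": None},  # removed by clean()
]
print(run_all(raw))
```

Because each step is a small named function, a reader can open the box and see how the data was cleaned and which modeling decisions were made, rather than only seeing the final numbers.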

[00:07:04] Alexander: Okay. Very good. So you can see it step by step backwards, and I guess you want to do it in a way that, let’s say, the average researcher can actually do that.

[00:07:14] Heidi: Yes.

[00:07:14] Alexander: You’re not aiming for any lay person to be able to look into it.

[00:07:18] Heidi: Yeah. So I think that’s where we’re at right now, because if we would go a step further and say everybody needs to be able to check it, then we would need to have a common language, and not every person will be able to, for example, understand our code. And then we would need an even higher level where we would do that.

I could imagine that a step towards that is using literate programming, for example. So R Markdown or Quarto or these kinds of tools. I think that’s already a little bit of a step in that direction, where we combine text that explains what we’re doing with the actual code.

[00:07:58] Alexander: Yeah. So that is maybe already one big step towards it, having everything documented and commented. So, you actually put together six helpful steps towards reproducible research, and we will not go into all six today. We’ll put them into the show notes so you can find them. But we’ll go into the first three, which are probably helpful for everybody. And honestly, I’d say I probably messed them all up in the beginning, and even if you don’t work on data, you can probably get a couple of these things right on your personal files. The other point is, even in big companies, I’ve seen that theoretically maybe they’re in place, but when you look into the details, everything is so cryptic that it’s nearly impossible to really find something unless you belong to this core team of 3, 4, 5 people that actually work on this day-to-day. And that’s probably not the idea. Yeah. So the idea is that any kind of reasonably well-trained researcher should be able to follow this. Okay, so let’s get into step number one. What is that?

[00:09:21] Heidi: Yeah, so you mentioned that you did those wrong, and I really learned this first step the hard way. I had such a mess on my computer. So step one is: get your files and folders in order. And that’s really something that I recommend to anyone who works with computers, but especially to people who wanna work on complex projects, potentially in teams. And it’s such a simple recommendation at first sight, but it does need a lot of discussions and thinking, and I think it’ll never be perfect, but thinking about it and implementing good folder structures, for example, is so helpful in getting your life together, with reproducible research as a side product, I would say.

[00:10:12] Alexander: Yeah. So, that is actually one of the interesting things. Some people recommend using tags rather than folders. What’s your kind of experience with that? Do you have any experience with that?

[00:10:25] Heidi: Tags?

[00:10:27] Alexander: Yeah. The problem with folders is that it’s always hierarchical, whereas tags are something that is much easier to structure. Yeah. So imagine you have, let’s say, the same document across different studies. Now, you can, of course, have, let’s say, a compound folder, and under that the study folder, and under that all the different other folders. And then you have the folder where these documents are. Now, if you wanna find all the different documents across all your studies, it becomes really hard.

[00:11:02] Heidi: Yeah.

[00:11:02] Alexander: Because you have decided that the documents are at the lowest end. Of course, you could also say, okay, here are all the SAPs, and then within the SAP folder you have this and then that, and then you have a problem if you wanna have all documents related to one study together. So it’s always some kind of problem if you decide on a hierarchy: what is the hierarchy?

[00:11:32] Heidi: Yeah, very good question. So I always teach folder structures because people understand what it means. Okay. They already work with it and they already know if they have a mess or not with folder structures.

But I actually honestly never thought about tags before. So I’m happy to discuss this more, because I think that’s a really cool idea as well, especially since you can, as you mentioned, use different hierarchies in different ways. But what I usually teach, and I teach mostly PhD students and people who work in research, is to have one folder or repository per research project. And then of course you might have the same data set in several research projects, or whatever other documents as well. But in that case we start with the simple version, and you might have duplicates in that simple version, right? And so my recommendation is to have one folder for each research project, and then each research project is organized in a similar fashion. So you have maybe one folder that’s called paper, one folder that’s called analysis, one folder that’s called data. And then in data you might have raw data and clean data, because you always wanna have a raw data folder to let that be the raw data forever and never touch that. So these are like the general recommendations that I have for reproducible research in the classical research projects.
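The folder layout described here (paper, analysis, and data with raw and clean subfolders) can be set up with a few lines of Python. This is a minimal sketch, not the template mentioned later in the episode: the function name `create_project` and the project name are assumptions, and the example writes into a temporary directory so it is safe to run anywhere.

```python
import tempfile
from pathlib import Path

# One folder per research project, always organized the same way.
PROJECT_FOLDERS = ["paper", "analysis", "data/raw", "data/clean"]


def create_project(root, name):
    """Create the standard folder skeleton for one research project."""
    project = Path(root) / name
    for folder in PROJECT_FOLDERS:
        (project / folder).mkdir(parents=True, exist_ok=True)
    return project


with tempfile.TemporaryDirectory() as tmp:
    project = create_project(tmp, "my_research_project")  # hypothetical name
    created = sorted(
        p.relative_to(project).as_posix() for p in project.rglob("*") if p.is_dir()
    )
    print(created)
```

The script only creates empty folders. Treating `data/raw` as read-only remains a team convention that a script like this does not enforce.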

[00:13:03] Alexander: Yeah. So that’s good. And what I also like is that if these folders have meaningful names, and that gets us to step number two. So things like test is probably not a good name, I would say.

[00:13:21] Heidi: Yeah. So there’s a couple of things that we can do well and we can do terribly when thinking about names. And with names, I usually talk about names for folders, names for files, and then also names for things in a script, right? Like variables and functions. And the rules are usually the same or at least similar. The first recommendation is to use names that go well with your computer. So don’t use spaces in file names, because that’s always a mess. Don’t use characters that are strange. So please no Chinese characters, for example, and to the Germans: no umlauts.

These are always hard for computers to understand, especially if you, for example, send it to a different operating system. Yeah. That will always mess things up. Computers are just not good with these special characters or spaces in file names. On the other hand, it should also be useful for humans, not only for computers. So the file name should have some information on what the file contains. Often it makes sense to add more structure. For example, for my slide decks, I usually start with the year and the month. And then I have the content and where I gave the talk. And that really plays well with the default ordering of the computer. And I can easily see how many talks I gave in 2022, for example. So this is like a little bit of an extra thing that I often do, depending on what kind of file it is, right?
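The naming rules above (no spaces, no special characters, year and month first, then content and venue) can be sketched as a small Python helper. The function names `safe_part` and `talk_filename`, the underscore and hyphen separators, and the example talk are assumptions for illustration, not a convention stated in the episode.

```python
import re
import unicodedata
from datetime import date


def safe_part(text):
    """Lowercase the text, strip accents and umlauts, and turn everything
    else that is not a letter or digit into '-'.

    Characters with no ASCII decomposition (for example Chinese characters)
    are dropped entirely.
    """
    decomposed = unicodedata.normalize("NFKD", text)
    ascii_text = decomposed.encode("ascii", "ignore").decode("ascii")
    return re.sub(r"[^a-z0-9]+", "-", ascii_text.lower()).strip("-")


def talk_filename(day, content, venue, extension="pdf"):
    """Build a name like 2022-12_content_venue.pdf that sorts by date."""
    parts = [day.strftime("%Y-%m"), safe_part(content), safe_part(venue)]
    return "_".join(parts) + "." + extension


# Hypothetical talk, used only to show the pattern.
print(talk_filename(date(2022, 12, 1), "Reproducible Research", "München Meetup"))
# prints: 2022-12_reproducible-research_munchen-meetup.pdf
```

Putting the year and month first means the computer's default alphabetical ordering is also chronological.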

[00:14:55] Alexander: Yeah, I completely agree. And honestly, I think file names can rather be a little bit longer than what they currently are very often. I’m just thinking about all the file names that are commonly used in CDISC, which are very short. And if you are not 100% trained, it’s sometimes really difficult to understand what’s in there. And for sure, don’t just number your files. I’ve also worked on a study where people numbered outputs 1, 2, 3, 4, up to 8,500 or whatever. And so that’s also not especially helpful.

[00:15:40] Heidi: Especially not for the humans. The computers might be fine, but the humans not.

[00:15:46] Alexander: Yeah. So have meaningful names. What’s your recommendation in terms of the ending “final” or something like this?

[00:15:58] Heidi: Yeah, that’s my favorite. So you write a paper, for example, and then you call it “my paper version one”, and then you call it “my paper draft”, and then you finally think you’re done, and you call it “my paper final”. And then you go to your supervisor, your collaborators, and they say, we have these and those comments, can you please still integrate that? And then it’s suddenly “final two”. And 10 days later it’s “my paper final, I’m gonna kill myself”. It just doesn’t end. That can be a very frustrating experience, and my general recommendation towards that is just using a version control system rather than versioning your files by file name. So something like Git, for example, is an excellent choice.

[00:16:45] Alexander: Yeah. Do you help people using Git?

[00:16:48] Heidi: I do, yeah, I do teach Git to researchers. So I always say I’m not a Git expert, but I know everything you need to know for research purposes. And that’s usually enough. And the hardest part is always the installation for people. It’s crazy. Yeah.

[00:17:05] Alexander: Yeah. Okay so once we have the file names, in the correct folders, what’s step number three?

[00:17:15] Heidi: Documentation. And that’s again something that doesn’t sound very sexy, like the first steps as well. And by the way, steps four, five, and six are of course very sexy, but we are not gonna tell you about them now. Yeah, documentation is super important, and so is thinking about how to do it well, how to do it for future you, but also for people that are not you. And there are various ways of documenting. It starts with using code comments or literate programming, as we mentioned before, to show what this piece of code actually does. Also having a README or something that explains what the project is about. That README can, for example, also contain the information on naming conventions and the folder structure.

So that new collaborators who come into the project can understand the reasoning behind the things that you came up with. Yeah. The README should also always contain the information on the research question and the steps that need to be taken to fulfill the project. Yeah, and just general information about how to use the folder that you work with.
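As a rough illustration of what such a README could contain, here is a Python sketch that assembles one from the pieces mentioned above: research question, steps, naming conventions, and folder structure. The function `build_readme`, the section headings, and all example values are assumptions made for this sketch, not a format prescribed in the episode.

```python
def build_readme(title, research_question, steps, naming_convention, folders):
    """Assemble a plain-text README from the project's key information."""
    lines = [f"# {title}", "", "## Research question", research_question, ""]
    lines += ["## Steps"]
    lines += [f"{i}. {step}" for i, step in enumerate(steps, start=1)]
    lines += ["", "## Naming conventions", naming_convention, ""]
    lines += ["## Folder structure"]
    lines += [f"- {folder}: {purpose}" for folder, purpose in folders.items()]
    return "\n".join(lines) + "\n"


# All values below are hypothetical placeholders.
readme = build_readme(
    title="Example project",
    research_question="Does treatment A improve outcome B compared to placebo?",
    steps=["Clean the raw data", "Fit the model", "Create tables and figures"],
    naming_convention="Lowercase, no spaces, date first: YYYY-MM_content.ext",
    folders={
        "data/raw": "original data, never edited",
        "data/clean": "data after cleaning",
        "analysis": "scripts that produce all results",
        "paper": "manuscript files",
    },
)
print(readme)
```

In practice the README is usually written by hand. Generating it from a function like this mainly helps when many projects should start from the same skeleton.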

[00:18:34] Alexander: Actually, one of the other things that I recommend is listing the people that work on this project, and what I especially love are hyperlinks in there to all the relevant documents, so that you can access them really fast. That makes things so much more useful, because you don’t need to click through all the different folders to get to a couple of the commonly used documents.

[00:19:00] Heidi: Absolutely. Yeah. And then, adding to what else needs to be documented: data usually needs some information as well. The information about data is what we call metadata. So thinking about metadata, like how to cite the data, for example, who owns the data, what license the data has. That’s also a really important part and very useful for both the collaborators and also for outsiders who are interested in the project.
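A minimal sketch of such metadata, written as a Python dictionary that could be saved as JSON next to the data. The field names (`owner`, `license`, `how_to_cite`, `variables`), the example values, and the helper `missing_fields` are assumptions for illustration. Established metadata standards exist and may be a better choice for a real project.

```python
import json

# Hypothetical metadata for a hypothetical data set.
metadata = {
    "title": "Example questionnaire data",
    "owner": "Example Research Group",
    "license": "CC-BY-4.0",
    "how_to_cite": "Example Research Group (2022). Example questionnaire data.",
    "variables": {
        "patient_id": "pseudonymized patient identifier",
        "score": "total questionnaire score, range 0 to 100",
    },
}

REQUIRED_FIELDS = ["title", "owner", "license", "how_to_cite"]


def missing_fields(meta):
    """Return the required fields that are absent or empty."""
    return [field for field in REQUIRED_FIELDS if not meta.get(field)]


print(json.dumps(metadata, indent=2))
print("missing:", missing_fields(metadata))
```

A check like `missing_fields` can run before data is shared, so that questions about ownership, license, and citation are answered up front.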

[00:19:33] Alexander: Yeah. For example, if you use any scales, okay, what questionnaire are these? What are the items? Where is it described? Where’s the manual for analyzing these? All these kind of different things. If you have all that in place for your first project. My recommendation is to create some kind of template folder.

[00:19:57] Heidi: Absolutely.

[00:19:57] Alexander: Do you do that as well?

[00:20:00] Heidi: Yeah. Just two weeks ago I created a GitHub repository that people can actually download, with a template for general research project structures. And this is super helpful because it’s gonna be similar most of the time, right? Your projects are usually quite similar. You have quite similar folders that you need, quite similar READMEs, and it’s super, super helpful, especially if you’re also a bigger research group, right, to have a template, and everybody uses that template, and then it solves so many problems just with that.

[00:20:35] Alexander: Yeah. That makes it so easy. Everybody that is new in the team gets trained on, okay, we use this structure, that’s the naming convention. And then the folders are always called the same. And you know where the typical documentation is. And whenever there’s a change in people, which is not a question of if, but when, it’s much easier to start. Yeah.

[00:21:03] Heidi: Yes, absolutely.

[00:21:04] Alexander: Can we get a link to that GitHub?

[00:21:06] Heidi: Sure, yeah. We can just put it in the show notes.

[00:21:09] Alexander: Yep. Awesome. Very good.

[00:21:10] Heidi: It’s a repository. I’m happy to share.

[00:21:12] Alexander: Yeah. I’m pretty sure lots of pharma companies will hopefully have something like this. But for any research organization or smaller company, I think that will be hugely helpful to have something like this. Yeah. Or at least have a look into it and get some inspiration from.

[00:21:29] Heidi: Yeah, so I’m not claiming that my recommendation is like the gold standard or the perfect thing. It’s something that needs discussion, right? So each research group or research field may have and need different conventions for that, because they have different types of data, for example, that have special needs. I know that, for example, in the neurosciences, they have like a template for the field, which is really cool. I haven’t found any other field so far, but the neurosciences, they are top-notch. They already thought about it for their entire field, which is really cool.

[00:22:11] Alexander: Really cool. And of course, if you think about it, in the end, you put all your research into the box that we talked about at the beginning and you publish it, and that makes it super easy for people to follow what you have done.

[00:22:27] Heidi: Yeah, exactly.

[00:22:27] Alexander: I think that’s actually maybe one of the other sides of reproducible research: transparency. Where do you see the landscape in terms of that?

[00:22:39] Heidi: Yeah, I think we’re still at the very beginning. So I’m talking now about academia, because that’s where I know things best. I’m not so sure about pharma. I think that it’s even entirely different. But in academia, we have, in most fields, a culture that’s focused on publishing and writing papers, and the data and the code that are important for reproducible research are more of like a tool rather than an output. And that is something that is slowly changing. So we see changes here on the side of the researchers, the funders, and the institutions, and the journals as well. But it’s moving slowly. So most researchers, when they publish a paper, they don’t think about the data and the code afterwards, and they don’t publish them either. So we’re at the very beginning of a journey towards more openness in research, and I think that’s something that we need if we want to move more towards reproducible research, because otherwise it may be reproducible, so the person who did the research may be able to reproduce the results in the future, but we will never know. So it’s impossible to know. And so openness is so super key when we talk about reproducible research.

[00:24:02] Alexander: Yeah, I know that at least for clinical studies, there’s a push to provide more access to these. There’s a variety of companies where you can get access to the original data. Unfortunately, I think it’s pretty rare that you get some access to the code as well. But at least then, with the published description and a copy of the original data, hopefully you should be able to get a sense of whether the outcome is reproducible. But of course, it would be really lovely if you have a paper and, in the electronic appendix of the paper, you would have all the other things. Given, of course, that you have patient privacy and these kinds of things taken care of, because the other side of transparency is making sure that you can’t identify the individual patient.

[00:24:56] Heidi: Absolutely. Yeah. So with clinical trials, it’s interesting, because we would already be happy if all clinical trials would publish the results in the first place. Because another issue with clinical trials is that often the results aren’t published because the trial failed in one way or another. And failure might also mean that there was no superior effect of the new drug, and that’s a huge problem. Let’s say 100 studies are being made on the same drug and only five are published because they showed a positive effect, and the 95 other studies showed there was no positive effect or a negative effect. Then we have a huge bias towards just these interesting findings, or these findings that support our hoped-for hypothesis. And so there the first step will be that we really ensure that all trial results are actually published, wherever the trial was started. This is something that is actually, in Germany, as far as I know, required by law, but it’s just not enforced so far. And we’re getting better also here, but it’s still not great. I think there is a website that actually checks, for each institution, whether they’re following the rules.

[00:26:18] Alexander: Yeah. With clinicaltrials.gov, I think at least from the pharma side, things are pretty clear now. You can’t run a study without getting it onto clinicaltrials.gov.

And if it’s there, then sooner or later the results need to appear there. I think the problem with clinicaltrials.gov is that the way it’s required to publish the results there is less than optimal, let’s put it that way. Even if you’re an expert, it’s really difficult to follow and find everything. And the discouragement of using any visuals is also not helpful from my perspective. But these are my 2 cents. I think the goal there was to get it all out and maybe electronically readable. But looking into it from the human perspective, it is maybe not that optimal. I think there are always probably a lot of legal considerations in terms of it not being promotional and all these kinds of other things. So that’s yet another topic, but…

[00:27:27] Heidi: It’s a huge topic, by the way, and if you wanna talk to someone, I recommend Till Bruckner from TranspariMED. He is like the go-to person when it comes to just these topics, evidence in medicine and clinical trials. He knows all about it.

[00:27:46] Alexander: Awesome. Very good. So at the end of the episode, we got another recommendation for a future episode.

[00:27:53] Heidi: Exactly.

[00:27:54] Alexander: Yes, one other thing that I wanted to talk with you about, and that is: what are you planning for 2023?

[00:28:04] Heidi: Oh wow. So lots of things. First of all, we talked already about the open science retreat that will happen in April. So that’s an event where we go to a mountain lake and a castle and hang out, reboot, but also talk open science and think about what the future of open science can look like. I’m working with several research projects on making their work more open and reproducible. I’m teaching PhD students in reproducible data science. I work with some research groups more closely on implementing their steps towards open and reproducible data science. I will probably be moderating two conferences. So the first is the Helmholtz AI conference. That’s happening in summer next year. And the other one I can’t tell yet because I’m not sure if I’m gonna be chosen, but…

[00:29:04] Alexander: Okay, cool.

[00:29:05] Heidi: And of course the two of us will be working together next year, which I’m super excited about. Setting up a training program on reproducible research as well.

[00:29:14] Alexander: Yes. That will be awesome. So first, on the 25th of April, there will be the Effective Statistician Conference, the first conference that I’m running, and I’m super excited to have Heidi be there and talk about reproducible research and how you can make things effective for yourself. And that conference will also mark the launch of this program that we are putting together, which I’m also super excited about because, yeah, if I think back to my beginnings, as I said, I did all the things in the wrong way and missed out on all these different things. And I remember when I was looking into my programs and saying, when did I do this? Why did I do this? And I don’t know how many times I recreated stuff just because I couldn’t find it anymore.

[00:30:11] Heidi: Yeah, absolutely. And I think what’s also really cool about the program that we’re putting together is that we’re not only gonna talk about the technical aspects of reproducible research, but also the social and the change management aspects, which you are like a super expert in, and I’m looking forward to learning more about that myself. And I think that will be super cool for people who really wanna not only implement these things for themselves, but also change their entire group or institution or company or whatever. So I think that’s really what we need to think about, because open science and reproducible research is partly technical, but a big part of it is social and social change.

[00:30:55] Alexander: Yeah, I completely agree. Because it’s really easy to change yourself. Well, in a way it’s easy. Yeah. As we record this at the end of December, the New Year’s resolutions are just around the corner, and we know how reliable those things are. But that’s another topic that is actually covered in some other podcast episodes. So if you scroll back to my episodes about goal setting, you can learn about that.

Yes. The change part is a big aspect of it. If you want to establish something like that within your department, or maybe even within your university overall, that’s a big change, and most of these change projects fail because people don’t understand even the basics of change management. So I’m really stoked about this as well. And we are very much looking forward to a great 2023.

[00:32:00] Heidi: Yeah, absolutely.

[00:32:03] Alexander: Thanks so much, Heidi. That was an awesome discussion about reproducible research, what it is, three basic steps towards it, transparency, and all the different things that are associated with it. I really enjoyed this discussion and I’m really looking forward to our collaboration.

[00:32:25] Heidi: Yehey! See you next year and yeah let’s do this. And people come to the conference and join the training program.

[00:32:33] Alexander: Yes, yes, for sure.