Moving from SAS to R – Practical Tips Part 1

The use of R in the pharmaceutical and healthcare industry has been increasing over the years, with larger companies like Roche adopting it as their primary statistical software. Others like GSK, Novo Nordisk, and Merck are investing heavily in it.

In this episode, Thomas Neitmann discusses various lessons from his career within pharma and the role R played in it. He provides insights into how to start with R and what the key differences from SAS are.

We also discuss the new SAS to R course of The Effective Statistician, and you’ll learn whether this is the right course for you.

We provide a couple of learnings from the course for you to get an impression of what we cover in the course.

Click here to get to the course overview!

We also discuss the following points:

  1. What are some recommended courses for healthcare businesses looking to transition to R programming?
  2. Why is it essential for beginners to install R and RStudio?
  3. How can you find help about R programming?
  4. Which data exploration techniques can be used in R programming?
  5. How can data manipulation packages help in R programming?
  6. What is Quarto software in relation to R programming?
  7. What is Admiral, and how is it used in clinical trial studies?
Interested in learning more? Check out the links and the course. Share this link with your friends and colleagues who can benefit from this episode!

Thomas Neitmann

He is an R enthusiast currently working for Swiss pharmaceutical company Roche as a Statistical Programmer Analyst for late-phase clinical trials in neuroscience indications.

His R journey began in 2014 when a coworker told him to run an R script to “analyze some data”. Having never programmed before at that time, he was overwhelmed. But he took on the challenge and soon realized the power and joy of programming.

Since then, he learned a couple of other programming languages including Matlab, Python, and SAS. But his favorite is still by far R.

He enjoys sharing his knowledge and started doing so publicly on LinkedIn in late 2019. Since then he went from around 300 to 7000+ followers. Many of those encouraged him to create a blog to have a central place for all his posts. So that’s what he did.

As the name suggests this blog focuses predominantly on R. He will occasionally cover other, related topics such as git, though. He also enjoys data visualization a lot so you’ll likely find some posts on that, too.

If you have a specific topic you would like him to write about, please feel free to reach out. The best option to do so is via his LinkedIn. If you are not yet connected with him, make sure to send him a request!

Transcript

Moving SAS to R Part 1

[00:00:00] Alexander: Welcome to another episode of The Effective Statistician, and today I have again Thomas Neitmann here. And if you scroll back a couple of episodes, actually a couple more episodes, we already recorded a podcast together about SAS vs R at the time, with Sam Gardner as a big SAS advocate and Thomas as the R advocate.

And so I’m really happy to have Thomas again on the show. How are you doing?

[00:00:33] Thomas: I’m doing very well, Alex, and it’s great to be here.

[00:00:35] Alexander: You have developed even further in your career since then and moved out of big pharma into a smaller biotech company, but you’re still working on R, aren’t you?

[00:00:50] Thomas: Absolutely. Definitely.

[00:00:51] Alexander: Yes. Yeah, that, that’s so cool. So just, you know, to get started, where do you see R at the moment?

[00:01:02] Thomas: Yeah, that’s a, that’s a very good question. To be quite frank, SAS is still very much the dominant language people use these days, but if you look at the growth rate R is really on a very much inclining adoption curve.

And I think it’s really at the point where it’s not only the, you know, very techy people who like to play with new things. We’re getting to a point where we’re sort of getting into the early majority of adoption. You can really see that a lot of big pharma companies have invested heavily into this space.

Not only Roche, where I have previously worked, but for example, Novo Nordisk will have a very exciting webinar, I think actually a couple of weeks from now, where they talk about how they adopted R internally. And I think it’s not a spoiler if I say they actually used it for a submission very recently, and I’m very excited to hear about how that went.

So, yeah. Very much growing. Which also means that if you are someone who is not yet into R that much, I think it’s a great point in time to pick it up because I think it’s a skill that will be very valuable in the future.

[00:02:00] Alexander: Yeah. I think it’s not just valuable in the future. It’s really valuable at the moment already. There’s a couple of things that are much, much easier, much faster with R. I’m just thinking about, for example, simulations. Yeah. There’s the course that I’m doing together with Kim and Jamie from Exploristics, and they show how you can do simulations in R. And especially if you think about study design and these kinds of areas, you’re pretty free in terms of which software you use, and you can do things so much faster and easier with R than with SAS there. The other point, of course, is everything around data visualization. Yeah. If I look into the submissions for the Wonderful Wednesday webinar series of the Data Visualization Special Interest Group (by the way, if you have never heard about this, then definitely check out the special interest groups on the PSI homepage), most submissions, well, nearly 100% of the submissions are based on R, and so that is really the dominant language there as well.

Okay. So more and more companies are joining into R. Where do you see the biggest challenge for the companies to transition from SAS to R?

[00:03:35] Thomas: Yeah. I think really the biggest challenge. Or maybe let me start with what is easy. It is very easy to get new talent in, which is good in R because that’s what people learn these days. Whether it’s R or Python, these kind of open source languages. That’s really what, you know, people of my age sort of grow up with, learn at university and a lot of jobs.

If you look outside of pharma, if someone has a title of data scientist or statistician, that’s what they use. But then of course you have a lot of people who’ve been with the companies for five years, 10 years, 20 years, and they are typically very fluent in SAS, but not so much in R. And then this becomes really a challenge, to have this change management effort to say, we have all these people in-house who have a lot of expertise, which obviously we want to keep in-house, but move them over to a new tech.

I have in the past given a lot of workshops within Roche, and I kind of know, the pain that is part of that process. ’cause you have to imagine someone is very proficient at something and then we tell them, here’s this other great tool, which once you get it might make you even, or I would say very likely, makes you even more efficient.

But initially what happens is that people struggle a lot because, you know, they know the command in SAS, but they have no idea how to do it in R. So initially their productivity actually slumps, goes down, which I think is very frustrating to people. That’s why you really have to make this change management effort of, you know, having a plan in place.

How you get those people from where they are now, through that little valley of, you know, where things get hard, and move them up the slope where they will be even more proficient than they are right now. And this has to really be strategically embedded within the organization. This is not something that you just tell people to do in their free time.

This has to be at the core of your business. If you decide that that’s the way you want to go through.

[00:05:22] Alexander: Completely agree. And of course, upskilling, training people is at the core of it. And I’m not just talking about, you know, the statistical programmers. It’s the same for the statisticians that do a lot of programming. You need to be fluent in R and get more into it. There’s so many things that you can do really, really fast in R that take you forever in SAS. And of course you can, you know, grab a lot of open source code from all the different places to kind of adapt and learn from. Now, there are lots of R training courses, lots of R books, whatsoever.

My biggest challenge was that when I had a dip into this, it was too kind of generic. You know, it’s kind of for anyone who wants to program in R, or for anyone who wants to work with data in R, but of course our data is different, our environment is different, our way we do analysis is very specific.

Yeah. All these kind of different things. And so I always struggled quite a lot with these generic courses. What’s your point on that?

[00:06:49] Thomas: No, I think so obviously it’s not my personal experience, but the people I’ve worked with who made that transition, they very much echoed what you just said.

It can often be frustrating if you want to learn this to apply to your job, but the course you take, or the book you read, is something very generic or something even related to a totally different industry. Yes, you can pick up some things, but at the end of the day, what you want to know is, I don’t know, how to fit a particular survival model for my study here, or how do I create this data transformation which I need to use in my analysis dataset there.

So oftentimes this is what leads people to actually quit rather quickly, because they feel that whatever they learn is not really tailored towards what they need. And if it’s not something that they can immediately apply to their job, why kind of make the effort? So I think it’s of paramount importance to have a course that actually is designed specifically for what you want to do at the end of the day.

So if you’re someone who works in clinical trials within the pharmaceutical environment, you need to have something that is tailored specifically towards that.

[00:07:49] Alexander: Yes, completely agree. So that is why I’m super happy that Thomas and myself, we offer you a specific course that is tailored to your needs in pharma, in CROs, in universities, working with clinical trial data and these other kinds of typical data sets that we have in the healthcare business. And Thomas drew on his quite awesome experience of teaching R to his colleagues and created a very, very nice course syllabus. And in the episode today and the next ones, we want to go over that and give you a little bit of a peek inside what’s there and also show you a couple of quick tips you can use directly to improve your R programming skills.

So the course actually starts with an overall introduction to R. So, you know, how do I actually open R?

[00:09:04] Thomas: Very good point. So the first thing you would need to do is actually install it on your machine. So, there is the r-project.org website where you can download R, and once you’ve done that, it comes with a very bare-bones graphical user interface.

Or if you know how to use the terminal, you can actually just type in R and then, you know, you are in the terminal and can write commands there. That is not what I would recommend because that is quite tedious. What I always would recommend is that you install an additional software called RStudio, which is what is called an integrated development environment, specifically for the R language.

So in addition to a text editor with syntax highlighting, you have your R console, you have a terminal. You see what is in your global environment in terms of variables. You have a window where you will see any kind of plots or data visualizations to create. So overall, it just makes the experience so much more enjoyable because anything you need is really there.

And once you kind of know what the four different panes in that IDE are, you will get accustomed to it very quickly and you never want to go back. That’s what I can tell you.

[00:10:08] Alexander: To this interface. How is that different to the SAS interface?

[00:10:13] Thomas: Yeah, so SAS actually has so many different interfaces. There is PC SAS, there’s Enterprise Guide, but I think what it is most similar to is actually SAS Studio, which is a web-based interface. And it also looks, I would say, very much like any kind of modern IDE where you have these kinds of features I highlighted. So a text editor with syntax highlighting, places to look at your data sets, your logs and whatnot. So, yeah, that is probably the most similar. But given that the languages are so different, there’s certainly a lot of differences too.

[00:10:44] Alexander: IDE, by the way, stands for?

[00:10:47] Thomas: So, Integrated Development Environment. That’s a very fancy way to say that this is a great program to make you more efficient in writing code.

[00:10:55] Alexander: Yeah. And it’s developed actually by a company that was formerly called RStudio. They recently changed their name to something that is just not on my mind.

[00:11:06] Thomas: It’s Posit.

[00:11:07] Alexander: Posit. Yeah. Because they’re not just promoting R anymore, but also Python and other things as well. And they have, by the way, lots and lots of training as well on their homepage. And, you know, getting into RStudio is very, very easy.

Now, one of the key areas I always looked into when I went into SAS is the SAS documentation. Yeah. Looking for, okay, what does that look like? What does the documentation for R look like?

[00:11:40] Thomas: Yeah, so it is different, right? Because SAS is by SAS Institute, the company, and all the documentation is basically in one place. You have this one doc website where everything is collated, which to be honest is quite nice.

R is open source in nature. When you install R, it comes sort of with a set of packages, which are just collections of functionality. And those, as I said, ship with the language. If you are in R itself, you can always type in the question mark followed by the name of a function, and you will get to the help page, or you can use the help command and say help, package equals package name, and it will list all the functions that are available.

But really the strength of R is these extension packages, which are not necessarily written by the people who write R itself, but by someone like me, for example, by pharmaceutical companies, or by some graduate student for a thesis. It’s really all over the place. And if they are picked up as an R package, once you install it, again, you can access the help page with this question mark operator.

But these days most of the R packages actually have a very nice website. There’s a standard way to set it up, and it’s actually much more visually appealing to look at and easier to navigate. So for example, dplyr is a very popular package for data manipulation. If you type that into Google, I’m pretty sure the first or second result is the actual package website.

Where you can see all the examples long form documentation and what is called vignettes and so on and so forth. So sometimes it can be a bit tricky if you are looking for, you know, how to do operation X, Y, Z, and you’re not actually yet sure which package is it that you want to use. But once you’ve made that initial step, and there’s lots of great blog posts, for example, which tell you what is a good package to use and once you’ve made it to their website, these are typically very well written.
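The help lookups Thomas describes can be sketched in a few lines of R. This is a minimal illustration rather than something shown in the episode; the function and package names used here (mean, stats, grid, and the search phrase) are only example choices, picked because they ship with a standard R installation.

```r
# Open the help page for a single function with the question mark operator
?mean
help("mean")                    # equivalent long form of ?mean

# List the documented functions of an installed package
help(package = "stats")

# Not sure which function or package you need? Search the installed
# help pages by keyword (the short form is ??"linear model")
help.search("linear model")

# List the long-form vignettes a package ships, if any
vignette(package = "grid")
```

The same calls work for extension packages once they are installed, for example help(package = "dplyr").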

[00:13:25] Alexander: Awesome. Yeah. Very, very good. And in the course we’ll show you in detail how that all works and how you can also install and load additional packages. Absolutely. What are the pitfalls there? And kind of the trustworthy environments and the maybe less trustworthy environments to pick things up from, because that is of course one of the differences compared to SAS, where everything comes from one source.

Now if you do your first analysis what are your recommendations there? In terms of exploratory data analysis, what do you usually do there as first steps?

[00:14:09] Thomas: Oh yeah, great point. So if it’s a completely new data set, I would start with something very simple. How many rows are there, how many variables do I have? What are the identifiers here? Do some initial kind of descriptive statistics? If it’s a numerical column or if it’s just a factor variable or something qualitative. I look at, you know, what is the frequency distribution of the different levels of the variable. And certainly you can do that all with kind of functions that outputs tables or whatnot.

But I’m a visual person, so I like to do plots, so maybe a bar chart of the frequency distribution, some histograms or density estimates. Yeah. And then if there’s something where I feel like, hey, maybe there’s a correlation, you can start doing some scatter plots or whatnot. And I think in general, R makes it very easy to do these kinds of analyses, and if you then want to save that and maybe even share it, potentially you could use something like R Markdown. Where you kind of write a little bit of text and then you have a little bit of code, see the output, and then go on from there. So in a way, you document your kind of thought process as well, not just the output.

So you don’t end up with maybe 10 or 15 plots in PNG documents, but you have this one consolidated report where you kind of say, hey, I started off with this. Oh look, we have 200 rows and 15 columns, and go from there and explore the data.
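The first exploratory steps Thomas lists can be sketched with base R alone. The data set below is a made-up toy example: the name adsl, its columns, and all values are assumptions for illustration only, not data from the episode or the course.

```r
# Hypothetical toy data standing in for a new data set (all values simulated)
set.seed(1)
adsl <- data.frame(
  USUBJID = sprintf("SUBJ-%03d", 1:200),
  ARM     = factor(sample(c("Placebo", "Active"), 200, replace = TRUE)),
  AGE     = round(rnorm(200, mean = 55, sd = 10)),
  WEIGHT  = round(rnorm(200, mean = 75, sd = 12), 1)
)

dim(adsl)                     # how many rows and columns are there?
str(adsl)                     # variable names and types
anyDuplicated(adsl$USUBJID)   # 0 means USUBJID identifies each row uniquely

summary(adsl$AGE)             # descriptive statistics for a numeric column
table(adsl$ARM)               # frequency distribution of a factor

# Visual checks: bar chart, histogram, and a scatter plot
barplot(table(adsl$ARM), main = "Subjects per arm")
hist(adsl$AGE, main = "Age distribution", xlab = "Age (years)")
plot(adsl$AGE, adsl$WEIGHT, xlab = "Age (years)", ylab = "Weight (kg)")
```

Placed inside an R Markdown or Quarto document, the same lines would produce the single consolidated report described above instead of a folder of separate image files.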

[00:15:26] Alexander: This is actually one of the things that I really love about R, these integrations with reports. I think that is really something of a best practice. Yeah. That you don’t, like in SAS, you know, create one table after the other, but you directly create some kind of report. Especially, for example, if you work on an exploratory data analysis. Yeah. And you go through all these different steps and you document all these different steps, or you simulate something and you directly document what you have simulated and what you have changed, what you have updated.

So you have kind of a track record for all the different things that you stepped through. And it helps you to document and to share it really, really easily. So how do I actually get into this and what kind of recommendation would you give to set up something like this process that you just described?

[00:16:34] Thomas: Sorry, you just broke up for me so I didn’t get the middle part of the question.

[00:16:38] Alexander: If you want to set up this kind of process with the writing and analyzing and updating your analysis and kind of getting this kind of dossier or report of all that you did. How do you actually set that up?

[00:16:57] Thomas: Yes. So if you are using this RStudio IDE, which I talked about earlier in our conversation, then there is a little menu item where you can create what is called a Quarto document, which allows you to intermingle written text, which would just get displayed as if you were to write a Word document at the end of the day, with code, where you can say, I want to actually show the code or not.

And then the output. Once you have that template document open and start writing the document, you can always click the render button. And then this kind of plain text document that you created gets rendered either into an HTML document or a PDF, depending on what you specify. And then you can kind of look at what the final output would be.

And if you say, oh, maybe I want to add something more, you go back to your source document. Add what you need to add and then re-render the document. So in a way this becomes a cycle, right? So you have your plain text document where you add something, you render it, you look at the output, and then maybe you go back to adjust. So in a way it’s pretty easy, because once you’ve made that initial commitment that, hey, I want to use, for example, a Quarto document, there’s really no reason why you should deviate from that. It’s sort of very natural to say, then, hey, I add some code. I look how it looks, I go back and adjust, and so on and so forth.

So you’re really in that loop and there’s no reason why you would want to break out of it. Until the point where you either have a complete report that you would like to share with someone, and you can just send that over to people. Or maybe you say, hey, actually this is only for my internal consumption.

And now I want to create maybe a different kind of report where I don’t specify on these exploratory findings. But actually I tell a story about what, what I discovered here.

[00:18:34] Alexander: You mentioned Quarto. What is Quarto?

[00:18:38] Thomas: So we do have an excellent episode on the Effective Data Scientist podcast about that. So I would encourage people to take a look at that. In essence, it allows you to create these kinds of documents I just spoke about. So you have a plain text document where you can intermingle, usually, just plain text, as you would have in a Word document, with code and the output. And Quarto is sort of the name of the software that then renders this document. It executes the R code and then captures the output and puts everything in the final report, whether it’s HTML or PDF.

There’s actually many more output formats, which I don’t have on my mind right now. But it’s a great software tool and it kind of came out of the R community. There was a predecessor called R Markdown, and Quarto is even more general because it allows you to use different languages, not just R but also Python, for example. And I think they even have a SAS engine. So if someone is eager to try, you can go ahead.
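To make the write, render, adjust cycle concrete, here is a rough sketch that writes a minimal Quarto source file from R. The file content, title, and the built-in cars data set are illustrative assumptions. The code fence inside the document is assembled with strrep() only so that this example can itself sit inside a code fence; in a real .qmd file you would simply type three backticks. Rendering assumes that the Quarto command line tool and the quarto R package are installed, so that line is left commented out.

```r
# Build the lines of a minimal Quarto (.qmd) source document
fence <- strrep("`", 3)   # three backticks, i.e. a code chunk delimiter

qmd_lines <- c(
  "---",
  "title: \"Exploring a new data set\"",
  "format: html",
  "---",
  "",
  "We start by looking at the size of the data.",
  "",
  paste0(fence, "{r}"),
  "#| echo: false",        # chunk option: hide the code, keep the output
  "dim(cars)",
  fence,
  "",
  "Next, a quick summary of every column.",
  "",
  paste0(fence, "{r}"),
  "summary(cars)",
  fence
)

# Write the plain text source document to a temporary location
qmd_path <- file.path(tempdir(), "explore.qmd")
writeLines(qmd_lines, qmd_path)

# Render to HTML (assumes Quarto and the quarto R package are available):
# quarto::quarto_render(qmd_path)
```

Clicking the render button in RStudio does the equivalent of the commented-out call; changing format: html to format: pdf switches the output type.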

[00:19:33] Alexander: Have fun with that as well. Exactly. Awesome. Yeah, and thanks for mentioning the Effective Data Scientist. This is a sister podcast that was created some time ago, and Paolo, Thomas and myself host it together and push it forward.

And it’s, yeah, mostly the data science part of it and much more kind of programming heavy. So if you want to look into this, there’s lots of episodes there about this as well. There’s also some episodes that are more kind of basic statistics, because the audience are more data scientists and there’s a lot of data scientists that never had any, you know, formal statistical education.

The next module in the SAS to R course that we are talking about is about data manipulation. Now, I always found it pretty tedious to manipulate data with SAS. I don’t know why, but, you know, it’s always kind of a pain. Yeah. So how do you compare SAS to R in terms of, you know, aggregating datasets, merging, transposing, and all these kinds of different things?

[00:20:53] Thomas: I would say if there’s one strength to SAS, I would actually argue it’s this part of data manipulation. But if you compare it to the way that R does it, it is very, very different. The kind of mental model you need to have to understand what is going on, it’s just completely different. So in SAS you have these data steps, and in a way it’s an implicit for loop over every row of the dataset.

And then you can have special tricks where you sometimes can push out multiple rows in kind of one iteration. So your dataset actually becomes larger than the input dataset, but we’re not gonna talk about that. So let’s focus on the R side of things, where there are certainly different approaches to how you could do data manipulation. One very, very popular package is called dplyr, which comes from what is called the Tidyverse. We talked about Posit, the company. This is something they are heavily investing in. And in a way it allows you to use what they call simple verbs, and then to chain these verbs in sequence to do a data transformation.

So imagine you have a source dataset, then you do what is called piping. So you take that and you pipe it into the next function, which could be a select function where you say, hey, I only wanna keep the first three columns. And then you maybe pipe the result of that into the next function where you say, you know what?

I want to actually only filter the subjects which come from Site X, and then maybe you say, now it’s time to do some aggregation. So let me first group by, you know, maybe a biomarker of interest. And then you say, and now let’s summarize and kind of squash down each group to a single row.

So in a way, it’s this sort of step-by-step approach where you take the source dataset, you pipe it into the next function, and then in the next function and the next function. And in a way, you, you could write your whole script in that manner, which I probably would not recommend, but it’s, it’s a nice model.

I think that you just take the result of the previous computation and push it to the next function. So in a way, every function itself is very simple because it only really does one thing. It selects some columns, it filters some rows, it does some aggregation. But if you have all these Lego bricks together, now you can build whatever you want in a way. And that can be, you know, something very simple or it can be something very, very complex depending on how much effort you put in.
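The chain of verbs described above (select, filter, group, summarize) might look like the following with dplyr. It assumes the dplyr package is installed; the data set trial, its column names, and its values are hypothetical and exist only to make the sketch runnable.

```r
library(dplyr)

# Hypothetical source data set: one row per subject measurement
trial <- tibble(
  SITEID  = c("X", "X", "X", "Y", "Y", "X"),
  BMRKR   = c("HIGH", "LOW", "HIGH", "LOW", "HIGH", "LOW"),
  AVAL    = c(10, 20, 30, 40, 50, 60),
  COMMENT = c("a", "b", "c", "d", "e", "f")
)

result <- trial %>%
  select(SITEID, BMRKR, AVAL) %>%   # keep only the first three columns
  filter(SITEID == "X") %>%         # keep only subjects from site X
  group_by(BMRKR) %>%               # one group per biomarker level
  summarise(                        # squash each group down to a single row
    n         = n(),
    mean_aval = mean(AVAL)
  )

result
```

Each verb takes a data frame and returns a data frame, which is what makes the steps chainable. On R 4.1 or later, the native pipe |> can be used in place of %>% in this example.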

[00:23:02] Alexander: There’s a couple of interesting packages and there’s one package that I want to specifically mention here, and that is Admiral. Can you talk a little bit about this one?

[00:23:16] Thomas: Absolutely. So this is something very dear to my heart because I was involved in creating that. And the idea was that, you know, there are these general purpose data manipulation packages in R, but again, what we do for clinical trials, for transforming especially SDTM to analysis or ADaM data sets, is somewhat specific.

And there are certain algorithms we use all over again. So it would actually be nice to not have to use these individual verbs and try to put them together ourselves all the time, but wrap them up into sort of higher level functions, which can then do the stuff that we need in a very efficient manner.

So when Roche decided that they really want to heavily go into R, they saw that this is somewhat of a gap that we need to fill. And then I was happy enough to lead that effort of creating this R package, which then became a collaborative effort with GSK initially, so Roche and GSK, and later on actually got a lot of traction within industry, also from other companies.

And now there’s actually extension packages for Admiral, which go into different therapeutic areas, whether it’s oncology or vaccine development. Yeah, so quite exciting stuff, I have to say. And we will definitely touch upon that in the course. Probably not too much in depth, but I definitely want people to be aware of that.

Because at the end of the day, yes, you can always go back to, you know, the basic building blocks and build everything yourselves. The question is though, why should you, and why would you? Also, even if you’re a great programmer, we all make mistakes sometimes. So if you can just use a function which has already been heavily battle tested, I would encourage people to use that instead.

[00:24:46] Alexander: This is exactly why, I think, one of the reasons why Roche moved into that direction with the open source environment. Yeah, you don’t need to recreate everything yourself because it’s really a community approach, and that’s one of the really great things about R. In the next episode, we’ll dive a little bit more into the key strengths of R, how you can move from SAS to R, and a couple of further differences.

So stay tuned for that one. And if you haven’t registered yet for the SAS to R course, then head over to The Effective Statistician, you’ll find it under the courses, or just go directly to the blog of this episode and then you can check the links there as well. Thanks so much, Thomas, and talk to you soon.

[00:25:46] Thomas: Thanks a lot, Alexander. Bye.
