Dark Data: Why what you don’t know matters

Interview with David Hand

What is Dark Data?
What inspired David to write this book? 
What did he hope to accomplish in its writing?

Today, we all make decisions using data. Dark Data shows us all how to reduce the risk of making bad ones.

In this episode, we focus on how you can grow in your effectiveness as scientists and leaders. With that in mind, we dive into the concept of dark data and how, where, when, and why it affects decision making in the pharmaceutical industry.

  • How is dark data an issue in healthcare, particularly in pharmaceutical R&D, clinical trials, manufacturing, marketing, and health technology assessments?
  • What’s the taxonomy of dark data?
  • Where might statisticians/data scientists have their own “blind spots” related to dark data?
  • From the standpoint of “effectiveness”, what is David’s advice to statisticians when it comes to dark data?

Reference: Dark Data

Listen to this amazing episode and share it with your friends and colleagues!

David Hand

David Hand is Emeritus Professor of Mathematics and a Senior Research Investigator at Imperial College, London, where he previously chaired the Statistics Section. He is a Fellow of the British Academy and a former President of the Royal Statistical Society. He has served on many boards and advisory committees, including the Board of the UK Statistics Authority, the European Statistical Advisory Committee, the AstraZeneca Expert Statistics Panel, the GSK Biometrics Advisory Board, and many others. For eight years he was Chief Scientific Advisor at Winton Capital Management.

He has received many awards for his research, including the Guy Medal of the Royal Statistical Society, the Box Medal from the European Network for Business and Industrial Statistics, the Credit Collections and Risk award for Contributions to the Credit Industry, and the International Research Medal of the IFCS. His 31 books include Principles of Data Mining, Artificial Intelligence Frontiers in Statistics, The Improbability Principle, Statistics – A Very Short Introduction, and The Wellbeing of Nations. His latest book, Dark Data, deals with the challenges for statistics, machine learning, and AI arising from incomplete and distorted data.



CV:  https://www.imperial.ac.uk/people/d.j.hand/cv/CV%20for%20Imperial%20website%2020200222.rtf


Alexander: You’re listening to The Effective Statistician podcast, a weekly podcast with Alexander Schacht, Benjamin Piske, and Sam Gardner, designed to help you reach your potential to really create science and serve patients without becoming overwhelmed by work. Today, Sam and I talk with David Hand. He wrote a book called Dark Data. What’s dark data? Stay tuned, and now, some music. Dark data is actually all around us, like dark matter in the universe: things that we don’t see, but that still have an influence on us. David wrote this really insightful book about it, and you will learn lots from his experience with dark data and about what we as statisticians in the health industry can learn from it. If you want to learn more about being impactful, check the resources on our homepage, theeffectivestatistician.com. I’m producing this podcast in association with PSI, a community dedicated to leading and promoting the use of statistics within the healthcare industry for the benefit of patients. Join PSI today to further develop your statistical capabilities with access to the video-on-demand content library, free registration to all PSI webinars, and much, much more. Visit the PSI website, psiweb.org, to learn more about PSI activities and become a PSI member today.

Welcome to another episode of The Effective Statistician. Today, I’m here with my co-host, Sam. Hi, Sam, how are you doing?

Sam: I’m doing great, really excited about today’s podcast. 

Alexander: Yeah, me too, because we have a famous guest here. David Hand. Hi, David, how are you doing? 

David: Thank you very much for inviting me.

Alexander: Okay, so I mentioned a couple of things already in the podcast intro. Of course, we’ll also talk about your new book, Dark Data. I think there are a lot of areas where dark data plays a role, including healthcare and healthcare data. There are lots of aspects, so we’ll surely dive into that. But first, I want to hand it over to Sam to raise some questions.

Sam: Sure. Well, first of all, you’ve been working as a statistician for a long time, and I’d just like to know a little bit more about your background and history. How was it that you came to be a statistician? How did you fall upon that career?

David: That’s a nice question. The short answer is that when I was a child, I became interested in science fiction. It probably starts from that, and that led to an interest in science. I had this childish view of what a scientist was: you get up in the morning, you prove a theorem before breakfast, you dig up a fossil before lunch, in the afternoon you brew up some bubbling chemicals, and in the evening, when it gets dark, you discover a new galaxy. Then I discovered as I got older that science wasn’t quite like that. You had to focus down, and the more advanced you became, the narrower and narrower your focus. So in the end, you became a world leader in a pinhead of knowledge. That was a bit frustrating, because I had so many interests: I was interested in biology, physics, chemistry, and so on. And then I discovered statistics purely by accident. I had focused on math, because that, I thought, was general. Then I discovered statistics. I’m sure you’re familiar with John Tukey’s famous saying that the great thing about statistics is you get to play in everyone’s backyard. I realized that was true: I could work with archaeologists in the morning, chemists in the afternoon, and astronomers in the evening, because they all had data that they needed analyzed. So I could contribute across the board. That’s really one of the wonderful things about statistics: you get to play in everyone’s backyard if you want to. You don’t have to narrow down. If you’re an expert in data analysis, collecting data, and so on, then the world is your oyster. What I sometimes say about statisticians is that they’re like modern-day explorers. You can see things that nobody else has ever seen. That’s the great thing about statistics.

Sam: I think we all have our own unique and interesting story about how we came to be statisticians, if you’re working in that area. For me, I was a physics and chemistry major initially at university and discovered after a couple of years that I was a lot better at math than I was at either of the other sciences. So I changed to math. But when I changed my major, I went to the academic advisor, and he looked at me and said, well, you’ve got to take a probability and statistics course as part of your degree. And in my mind, I said, no way, I’m not going to do that. That’s boring. Why would I do that? It wasn’t until I got into my professional career and then into graduate school that I got forced, really, to learn some statistics and probability, and then took some classes and just fell in love with it.

David: It’s very interesting, what you just said, that you thought it was boring. For most of my life, the public perception of statistics has indeed been that it’s a boring discipline, and I’ve spent my life trying to convince people that it’s really exciting and very relevant and all the rest. And suddenly, over the last 10 years, mainly because of the impact of computers and the transition to data science, which is really mainly statistics, it’s become the exciting discipline. So we won in the end.

Alexander: Yep. We know data science is the sexiest job of the century, so we’re all very sexy, so to say, although we may not look that way.

Sam: You just said something that some people might think is a little controversial: statisticians or data scientists. I was actually in a discussion just this morning with a team that I’m working with, and we were asking: are we data scientists? Are we statisticians? Are we both? And I wonder, what do you think about that?

David: Oh, I think we’re both. I think data science is, in a deep sense, statistics. It’s about extracting understanding and illumination from data. Clearly, you’ve got to be able to manipulate the data, you’ve got to be able to collect it, you’ve got to be able to search, filter, and do all these computer-science-type things as well, and you’ve got to be able to code and so on. But extracting understanding from data is fundamentally statistics.

Sam: Yeah. And I think part of that is that you can get very specialized just within our field, right? You talked about science, and that the key to success in science is often specialization. Even inside statistics and data science there is specialization, like “I’m really good at data engineering” or “I’m really good at predictive modeling”, and you get those types of focuses. I think that’s where some of the conflict comes in, because right now, I think people view data scientists as people who can manage huge volumes of data and build predictive models. That’s mostly what that’s about.

Alexander: But I think it will become a team sport, with areas where you get experts in programming, experts in data visualization, experts in machine learning, experts in certain types of data sets, and so on. So I think it’s just all different facets of the same thing.

David: Yes, I think that’s absolutely right. And I also agree that data science is fundamentally a team exercise, you do have to have people who really are experts at these different areas, so that you can come together to do a good job. 

Sam: Well, that’s a nice lead-in to talk a little bit about your book and what dark data is. I’m going to read this introduction from the Amazon website, where your book is listed. It says, “In the era of big data, it’s easy to imagine that we have all the information we need to make good decisions. But in fact, the data we have is never complete, and may be only the tip of the iceberg. Just as much of the universe is composed of dark matter, invisible to us but nonetheless present, the universe of information is full of dark data that we overlook at our peril.” So Dark Data explores the ways in which we can be blind to missing data and how that can lead us to conclusions and actions that are mistaken, dangerous, or even disastrous. There’s more in that intro, but I’m going to stop there. So, I gave that little intro; could you give us a short synopsis of what dark data is?

David: Yeah, sure. Basically, dark data is data you think you have, or hope to have, or want to believe you have, but you don’t actually have. It’s missing for some reason, or perhaps inadequate for some reason. It might not have been recorded and collected in the first place. Perhaps it’s been distorted by error. But one way or another, it’s hidden from you. Perhaps it’s been distorted because you summarized the data into an average, which obviously doesn’t tell you about the extremes. We’ll get into the other reasons why it might not be there, but it’s data you haven’t got, and perhaps you’ve overlooked the fact that you haven’t got it. That’s why it’s dangerous: your inferences and conclusions are based on assumptions about having data which you don’t have.

Alexander: And there are actually lots of examples of that. One of the older ones I’m thinking about is from World War II: a collection of where the planes that came back from Germany had been hit. You have this nice graphic where you can see all the areas where planes got hit by fire, and then there are certain areas that are completely free. I think that’s a really nice story about the data that you don’t...

David: Absolutely. It’s a wonderful example, because the question was, where should you put armor on these aircraft to protect them? And the obvious answer, “obvious” in inverted commas, is where all the bullet holes are, because that’s where they seem to be getting hit. But Abraham Wald in New York at the time said, that’s quite wrong, you should do the opposite. You should put the armor where the bullet holes aren’t, because if you’re getting hit there, the aircraft aren’t coming back, and you don’t see any bullet holes. Really clever thinking. And, as you say, a perfect illustration.
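
Editor’s sketch (not part of the conversation): the Wald story can be reproduced with a small simulation. Everything here is an assumption made for illustration, including the section names, the loss probabilities, and the idea that each aircraft takes exactly one hit at a uniformly random location. It is not Wald’s actual analysis; it only shows how counting holes on returning aircraft inverts the picture.

```python
import random

random.seed(0)

# Hypothetical aircraft sections and the assumed chance that a single hit
# there brings the plane down. These numbers are invented for illustration.
LOSS_PROB = {"engine": 0.6, "cockpit": 0.5, "fuselage": 0.1, "wings": 0.1}
SECTIONS = list(LOSS_PROB)

def fly_mission():
    """Return (section_hit, returned) for one aircraft taking one hit."""
    section = random.choice(SECTIONS)  # hits land uniformly at random
    returned = random.random() > LOSS_PROB[section]
    return section, returned

missions = [fly_mission() for _ in range(20_000)]

all_hits = {s: 0 for s in SECTIONS}   # the full data, including lost planes
seen_hits = {s: 0 for s in SECTIONS}  # holes visible on returning planes only
for section, returned in missions:
    all_hits[section] += 1
    if returned:
        seen_hits[section] += 1

for s in SECTIONS:
    print(f"{s:9s} hits taken: {all_hits[s]:5d}   holes observed: {seen_hits[s]:5d}")
```

In this toy setup every section is hit about equally often, yet the returning aircraft show far fewer holes in the engine and cockpit, which are exactly the places where a hit is most dangerous.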

Sam: Yeah, it’s interesting. That example hits home with me, because the first job I had out of university was working in the United States Air Force as an aircraft survivability analyst, doing modeling and simulation of weapon systems: how easy would it be for them to shoot down and destroy aircraft? And that was a classic example. There’s a book by Robert Ball called Introduction to Aircraft Combat Survivability, or something like that, and it’s got that in it.

David: Wow!

Alexander: Yeah, so that’s very often called survivorship bias or something like this, right? What other sources do you see for dark data?

David: Oh, you name it. There are any number of ways. There are lots of familiar ones: nonresponse in surveys, people who just refuse to answer, for example. Then there are standard, or familiar, ways in the pharma sector as well, such as dropouts from clinical trials. If you base your conclusions only on the people that you had at the end of the study, you could get very misleading conclusions. Perhaps there is preferential dropout from one of the treatment arms, for example. So you have to be very careful. But basically, there are any number of mechanisms leading to data being inadequate. Measurement error is an example. A more sophisticated reason is regression to the mean. In trading, for example, if you just identify and trade on the companies which have been most successful in the past, ignoring all the others, you might get a surprise in the future, because you will have been selecting the ones which, perhaps just by chance, have done well.

Alexander: Same with fluctuating diseases. If you have a study with a pretty high bar for certain symptoms, then you get all these patients enrolled who have, at that time point, these high-severity symptoms, and just by natural fluctuation, lots of them will decrease. And then that’s the so-called placebo effect.

David: Exactly. And if you’re not aware of this regression to the mean phenomenon, you think, Aha! My medication works. This is wonderful, I shall sell it to the public. And then funnily enough, it doesn’t work. In practice, things go horribly wrong. 
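
Editor’s sketch (not part of the conversation): a minimal simulation of regression to the mean under an inclusion threshold. The severity scale, the threshold of 70, and the noise levels are all hypothetical. No treatment is simulated at all, so any improvement between screening and follow-up comes purely from selecting patients on a noisy measurement.

```python
import random
import statistics

random.seed(1)

def simulate_trial(n_patients=50_000, threshold=70.0):
    """Enroll patients whose screening score exceeds a threshold, then remeasure.

    Each patient has a stable underlying severity plus independent
    visit-to-visit fluctuation. Nobody is treated. All numbers are hypothetical.
    """
    screening, follow_up = [], []
    for _ in range(n_patients):
        true_severity = random.gauss(50, 10)
        first = true_severity + random.gauss(0, 10)   # screening visit
        second = true_severity + random.gauss(0, 10)  # later visit, untreated
        if first > threshold:                         # inclusion criterion
            screening.append(first)
            follow_up.append(second)
    return statistics.mean(screening), statistics.mean(follow_up), len(screening)

mean_screening, mean_follow_up, n_enrolled = simulate_trial()
print(f"enrolled:          {n_enrolled}")
print(f"mean at screening: {mean_screening:.1f}")
print(f"mean at follow-up: {mean_follow_up:.1f}")
```

The enrolled group looks markedly better at follow-up even though nothing was done to it, which is why a randomized control arm, selected by the same criterion, is needed to separate a treatment effect from this artifact.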

Sam: Does regression to the mean also impact how we assess risk? What I mean by that is, sometimes we take the mean as the answer, right? But there’s almost no average patient. The average patient doesn’t exist.

David: Yeah, I think that’s right. There is always a tendency to focus too much on the average. I think this is another kind of dark data, where you focus on a summary statistic, an average, for example, ignoring the fact that the average might mean that everybody is clustered very close to it, or it could mean there’s a huge range, with people being quite different at either end. It could also mean that you’ve got a very skewed distribution, where most people are made slightly worse by the medicine and a few people are made a great deal better. It can be very misleading.

Sam: So when you wrote this book, I’m just wondering, what was the inspiration for what made you want to write this book?

David: So over the years, and it’s something that Alex said at the beginning, I’ve always been interested in working on important problems and, in a sense, tackling things that people care about. I’ve done a lot of consultancy work with all sorts of organizations, because that got me out of the narrow mathematical statistics of universities to do things which people want to know the answers to. And I kept running up against these sorts of dark data issues. They would be presented in different ways. I can remember early on studying nonresponse in surveys, which is a familiar kind of dark data, and one which has been very well explored; people have developed tools. Then there was the same sort of thing for coping with dropouts in clinical trials. But then I encountered work in retail credit scoring, where the objective is to build a scorecard, a predictive statistical model to predict those people who are likely to default on loans or whatever. And I encountered the problem that, in that industry, they call reject inference. Basically, your sample of data is a set of people who have been given a loan in the past. You follow up the people you’ve given the loans to and see who defaults and who doesn’t, and you can build a statistical model on that. But that simple perspective fails to take into account that those people who were given loans were selected in the first place. The people that were thought to be really bad don’t appear in the sample, yet they’re part of the population of people applying for loans. So your model is built on a distorted sample and could be completely wrong. And the bank, this was quite a few years ago now, approached me and said, how do we cope with this? What should we do about it?
So that was one of the things which I think was one of the first sort of real problems that got me focusing on this. And then after that, I sort of realized that these sorts of issues cropped up just all over the place. They’re ubiquitous. 
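
Editor’s sketch (not part of the conversation): a toy illustration of why the accepted-only sample is distorted. The latent risk score, the logistic default curve, and the acceptance cutoff are hypothetical, and the simulation can see outcomes for rejected applicants only because it invents them. A real lender cannot, which is the whole difficulty of reject inference. This shows the bias, not a method for correcting it.

```python
import math
import random

random.seed(2)

def default_probability(risk):
    """Assumed true relationship: higher risk, higher chance of default."""
    return 1 / (1 + math.exp(-(risk - 1.0) * 1.5))

applicants = []
for _ in range(50_000):
    risk = random.gauss(0, 1)  # hypothetical latent riskiness of the applicant
    defaulted = random.random() < default_probability(risk)
    applicants.append((risk, defaulted))

# Historical policy: only applicants judged low-risk were given a loan,
# so outcomes were only ever recorded for them.
accepted = [(r, d) for r, d in applicants if r < 0.5]

def default_rate(rows):
    """Share of rows that ended in default."""
    return sum(d for _, d in rows) / len(rows)

rate_accepted = default_rate(accepted)    # what the lender's data shows
rate_everyone = default_rate(applicants)  # dark data: unobservable in practice

print(f"default rate among accepted applicants: {rate_accepted:.3f}")
print(f"default rate among all applicants:      {rate_everyone:.3f}")
```

A model fitted only to the accepted rows never sees the high-risk region, so applying it to the whole through-the-door population extrapolates beyond its data.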

Alexander: Yeah, I read an article about, I think, a big consultancy company. They were looking into career promotion data, which kind of people got promoted, and then said, okay, in the future, we’ll really focus on these people, and let’s recruit mostly these people. And then they saw, oh, the model rejected anybody who was not a white male, because most of the data was from the past, when they had recruited white males. So lots of the more senior people were white males, and anybody else didn’t have a chance. Fortunately, they found out quite fast that there was some kind of shortcoming in the data.

David: And that’s a classic kind of problem. If you train your algorithm or build your statistical model or whatever on data that you’ve collected in the past, well, maybe that data from the past isn’t like the data you need to be applying the models to now. So, absolutely.

Sam: I think probably anybody who’s worked in this area for a while has a long collection of examples where this has been the case. The one I think about was another human resources one. The site that I worked at gave everyone who worked in operations a pre-employment aptitude test. Basically, it was like an IQ test, or some skill and IQ test, right? And what they wanted to know was whether that was predictive of performance in the future. So when they hired somebody, did the scores that person got on the test predict how well they performed? They gave me all the data, and there was no correlation. There was just nothing there, and they couldn’t figure it out. And I said, well, tell me how you actually hire the people. They described it, and it turned out that the test was one aspect they used, but they were really selecting the upper tail of the distribution, if you want to think about it that way. So everyone was already close together in that aptitude. They already had these thresholds and cutoffs they applied, so they weren’t experiencing the full range of variation that those scores could have. So it’s no wonder that there was no relation. Even if it was predictive, they didn’t have any data to show that it was. And in some respect, if it was predictive, why would you pick someone from the lower end of the tail just to show that they...

David: I mean, that was one of the things that I said they should do in the credit industry to tackle reject inference: they should accept a few people they thought were poor risks, just so that they could build a better model. It would save them money in the long run. But it was an uphill struggle persuading them to accept people on whom they thought they would lose money. Incidentally, your point, I think, is very important, because the way to tackle these risks is to try to be aware of them, to try to spot things that you might be overlooking. And the way to do that, of course, is to have a diversity of perspectives. So it’s your point about needing a team: you’ve got to have a lot of different ways of looking at things, so people can spot things that I would miss and vice versa.
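
Editor’s sketch (not part of the conversation): the aptitude-test story is an instance of range restriction, which a short simulation can illustrate. The score scale, the true correlation of 0.5, and the hiring cutoff of 125 are hypothetical choices, not figures from the episode.

```python
import math
import random

random.seed(3)

def pearson(xs, ys):
    """Plain Pearson correlation coefficient."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / math.sqrt(vx * vy)

# Hypothetical applicant pool: test score and later job performance,
# constructed so that the true correlation is 0.5.
scores, performance = [], []
for _ in range(20_000):
    s = random.gauss(100, 15)
    p = 0.5 * (s - 100) / 15 + random.gauss(0, math.sqrt(0.75))
    scores.append(s)
    performance.append(p)

CUTOFF = 125  # assumed hiring threshold: only the upper tail is hired
hired = [(s, p) for s, p in zip(scores, performance) if s >= CUTOFF]

r_all = pearson(scores, performance)
r_hired = pearson([s for s, _ in hired], [p for _, p in hired])

print(f"correlation in full applicant pool: {r_all:.2f}")
print(f"correlation among those hired:      {r_hired:.2f}  (n = {len(hired)})")
```

Among those hired the scores barely vary, so the observed correlation shrinks toward zero even though the test is predictive in the full pool. Performance data for the rejected applicants is the dark data here.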

Alexander: I think you need to have a good understanding of how data happens. Our co-host Sam described it in recent episodes: you actually look at the experiment. You talk to the people who collected the data, and you see how the data moves through the systems, whether there’s any kind of filtering, or any surprising things that are happening.

David: And that reminds me, I should have said this at the start, when we discussed data science. One of the things about data science is having a contribution from the domain that you’re studying, that you’re applying it to. It’s not enough just to have statisticians and computer scientists and AI people. You’ve got to have people from the pharmaceutical sector or the finance sector or whatever area you’re working in, because they understand that data.

Sam: Yeah, you’ve really taken a liking to my term there, how data happens. I came up with that on my own a long time ago, and I really do think it’s important to view the collection of data as a process. Sometimes that tells you what you would expect the variation in the data to be, and also, in this case, where you might have missing data or dark data. So you came up with a taxonomy of dark data. It’s more detailed than I thought it would be when I read the book. You have 15 categories of dark data, and I’m not going to read every single one of them. But for instance, type 1 is data that we know are missing, and type 2 is data we don’t know are missing, and that seems pretty simple. But how did you come up with these categories? How did you settle on the number?

David: Yeah, I suppose the short answer is, again, from problems that I’d worked on. In a way, you can start with Rumsfeld’s known unknowns, but that’s just a very crude categorization. As you dig down and look at the problems you’ve worked on, problems that have been reported in the literature, and problems other people describe to you, you see that you need to refine it. And I’d just like to comment on my 15-item taxonomy: I’m sure it’s not complete. As we go on, new kinds of data are being collected and new ways of collecting data are being developed. So I’m sure that there will be other, new kinds of dark data that I haven’t covered.

Sam: I remember reading it. And there’s also some overlap, even in some of the examples you gave in each category. Sometimes there’s not just one category that an example fits in. What do you think is the usefulness of the categorization or the taxonomy?

David: I think the usefulness is that if you recognize that your particular data set is suffering from one of these, then you can start to think, how might we cope with this? In what way might our conclusions be wrong? How can we adjust for that? But I do think, as in the point you just made, that it is important to recognize that any one data set is unlikely to be suffering from just one of those problems. It’s more likely to be suffering from several. So just because you have spotted that it’s suffering from dark data type 3, or whatever, you can’t say, right, that’s okay, I’ll tackle that, and then my data is clean for whatever I’m doing. It’s not like that. It’s more likely that you’ve got more than one type, and they can work together in a horrible sort of demonic synergy to make things even more complicated.

Alexander: Do you think you could use these taxonomy categories as kind of a checklist? Once you dive into a new data set, you kind of go through all these differences and say, does it apply here? Does it not apply here? What’s the likelihood that this applies? 

David: I think you could. Funnily enough, arising from my conversations with people, I’ve been meaning to actually focus down and produce a checklist, to turn that taxonomy into something more practically useful. So I think you could use it as a checklist. It certainly gives you an indication of the sorts of problems to look out for, but I think it can be transformed into something more useful. One day I hope to get around to that. I hope to get the time to do that.

Sam: So, the target audience of this podcast is statisticians working in the pharmaceutical industry. That’s primarily what this podcast came about for, although I hope more and more statisticians who don’t work in pharma learn about it, because I think the problems are the same. They just have a different flavor or a different label on them, but the problems we face in those areas are pretty common. Thinking about pharmaceuticals, do you have any examples, something that you’ve seen, where dark data happened and maybe had an impact, where potentially a wrong decision was made, or not the best decision was made?

David: First, there are the generic, classic things like dropouts and, perhaps even more important, exclusion criteria. For example, you’re probably very familiar with Caroline Criado Perez’s book Invisible Women. That’s really worth looking at if you haven’t read it; I think it’s a really excellent book. Essentially, it’s about dark data in the sense of the male/female difference. It shows you how much data across the world, and in the pharmaceutical industry (Chapter 10 deals with the pharma and health sectors), relates to men. So I think there’s a big gap. People are recognizing it now, so things are improving, but in the past, men have been regarded as the default. For example, diagnostic criteria for heart attacks tended to look at the symptoms that men evidenced, whereas women produce rather different symptoms. In other areas, I suppose COVID threw up a lot of examples of dark data. We’ve learned a lot over the last year. But as the year progressed, it was interesting observing that, at first, people weren’t aware of this, and then they began to have a suspicion that maybe men were impacted more seriously than women. And then there began to be a suspicion that maybe older people were impacted. And then maybe there was an ethnic dimension, and it was realized that it was related to deprivation, and so on. I suppose one could write a whole book about COVID and dark data and how these issues arose. Of course, people have focused a lot of attention on them, so we’ve learned about it, but at the time, it wasn’t so clear. I suppose another example in the pharma sector, really more general but certainly true in the pharma sector, is publication bias.
There are specific examples. You’re probably familiar with this one, given that it’s pharma: Scott Harkonen. I was going to say he was found guilty of p-hacking. The judge didn’t put it like that, but he basically ransacked his data. The treatment wasn’t effective on the primary endpoint, so he looked through the unblinded data to find a subgroup that it did appear to work on and published that as if it was a proper discovery. So essentially, there’s dark data there, because he’s ignoring all the other stuff. There’s also a wonderful example of a machine learning system for predicting which patients were likely to die from pneumonia. I can send you references; this is a great example. This machine learning system appeared to find that patients with a history of asthma were less likely to die from pneumonia. This is related to the biased data sets that we were talking about before. It was very clear and obvious from the data: if they had a history of breathing problems in the past, then they were less likely to die from pneumonia. And, you know, this could be a great discovery, and it’s possible to contrive biological mechanisms: perhaps they developed some sort of immunity which made them more resistant. But then it was discovered that the patients who had a history of asthma or breathing problems were sent to an intensive care unit, where they received extra special treatment, and so were less likely to die and more likely to recover. So, distorted data sets, but a real example.
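
Editor’s sketch (not part of the conversation): a toy version of the pneumonia and asthma story. The baseline risk, the extra risk from asthma, the prevalence, and the protective effect of intensive care are invented numbers, chosen only to show how an unrecorded treatment pathway can flip the apparent direction of a risk factor.

```python
import random

random.seed(4)

BASE_RISK = 0.10     # assumed mortality risk without asthma
ASTHMA_EXTRA = 0.05  # assumed additional risk caused by asthma
ICU_FACTOR = 0.4     # assumed risk multiplier under intensive care

def simulate(n=40_000, icu_policy=True):
    """Return observed mortality rate keyed by asthma status (True/False)."""
    deaths = {True: 0, False: 0}
    counts = {True: 0, False: 0}
    for _ in range(n):
        asthma = random.random() < 0.2
        risk = BASE_RISK + (ASTHMA_EXTRA if asthma else 0.0)
        if asthma and icu_policy:
            risk *= ICU_FACTOR  # asthma patients are routed to intensive care
        died = random.random() < risk
        counts[asthma] += 1
        deaths[asthma] += died
    return {status: deaths[status] / counts[status] for status in counts}

with_icu = simulate(icu_policy=True)      # what the hospital records show
without_icu = simulate(icu_policy=False)  # counterfactual: no special routing

print(f"observed:       asthma {with_icu[True]:.3f}  vs  no asthma {with_icu[False]:.3f}")
print(f"counterfactual: asthma {without_icu[True]:.3f}  vs  no asthma {without_icu[False]:.3f}")
```

A model trained on the observed records would learn that asthma is protective. If it were then used to decide who needs intensive care, it would send exactly the wrong patients home.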

Alexander: Yeah, that’s a typical thing that you see quite often: whenever the treatment doesn’t work, you get put on a better treatment, and you don’t see any more what actually happens in the long run.

David: I think this is a special problem in the medical and pharma sector, exactly as you say, because so much of it is observational, not controlled. The patient comes along and, as you say, isn’t getting better, so you try another treatment. And it’s not randomization in any sense: the doctor chooses one because they have experience with it from the past and think it will be better. So you’ve a whole load of patients who have gone through these strange sequences of medications, and it’s very difficult to extract sensible conclusions from that.

Alexander: That reminds me of data sets that I analyzed very early in my career, when I was still in academia. A physician brought some data from his surgery, and we looked into all kinds of different outcomes. One of the factors that we could look into was the surgeon. There was, of course, the head of the department, a professor, and then there were some more experienced ones and more junior ones. We looked at the outcome rates, and the head of the department had the worst outcome rates. Oh, so the head of the department doesn’t seem to be that good a surgeon? And he said, no, no, no, no. Look at these patients. These are all those that had no other chance. They would never have gone to some junior surgeon.

David: They are really sick patients. Yes.

Alexander: He is only called if nobody else wants to touch it.

Sam: A little bit of covariate adjustment there. 

Alexander: So, talking about covariate adjustment, is that one of the ways to go: you adjust for factors to see how that dark data might affect your analysis?

David: That’s right, yes. That’s one of the strategies for adjusting for this. I mean, fundamentally, the way to cope with dark data is to understand why you’ve got dark data and what might be missing. When you have that, you can think about: right, how do I adjust for this? How do I compensate for it? And use of covariates is indeed one way to do that. In fact, referring to the credit scoring example way back, that was one of the strategies that I described to the bank that was employing me as a consultant. But more generally, in pharma and in the health sector, yeah, absolutely.
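The surgeon anecdote and the covariate adjustment David describes can be sketched numerically. The counts below are invented purely for illustration, and the names `counts`, `crude_rate` and `standardized_rate` are hypothetical. The sketch assumes patient risk was recorded; it compares crude complication rates with rates standardized to a common case mix, which is one simple form of covariate adjustment.

```python
# Hypothetical counts, invented for illustration: (patients, complications)
counts = {
    "head_of_department": {"low_risk": (20, 1), "high_risk": (80, 32)},
    "junior_surgeon":     {"low_risk": (90, 9), "high_risk": (10, 5)},
}

def crude_rate(strata):
    """Overall complication rate, ignoring patient risk."""
    patients = sum(n for n, _ in strata.values())
    events = sum(e for _, e in strata.values())
    return events / patients

def standardized_rate(strata, reference_mix):
    """Rate the surgeon would have if treating the reference case mix."""
    total = sum(reference_mix.values())
    return sum(reference_mix[s] * e / n for s, (n, e) in strata.items()) / total

# Reference case mix: all patients pooled across both surgeons.
reference_mix = {
    stratum: sum(counts[surgeon][stratum][0] for surgeon in counts)
    for stratum in ("low_risk", "high_risk")
}

crude = {s: crude_rate(strata) for s, strata in counts.items()}
adjusted = {s: standardized_rate(strata, reference_mix) for s, strata in counts.items()}

for surgeon in counts:
    print(f"{surgeon}: crude {crude[surgeon]:.3f}, adjusted {adjusted[surgeon]:.3f}")
```

With these made-up numbers the head of department looks worse on the crude rate and better once case mix is taken into account. If risk had never been recorded, it would be dark data, and no adjustment of this kind would be possible.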

Sam: You know, when I read Dark Data, I was reminded of something that W. Edwards Deming wrote a long time ago in his 14 points for management; I don’t know if you’ve ever read that. One of the things he said is that the key figures that management need are either unknown or unknowable. You can say that as a statement, and if you’ve ever seen any recordings of Deming, he’s very blunt; he’d just hit you with a hammer with what he said. So what do you do in a situation where the data you need is either unknown or unknowable? How do you handle that?

David: Yeah. I mean, the simple answer is you try to find proxies for it. But of course, you have to be very careful, because there are any number of stories of real cases of proxies misleading you, because they don’t quite capture what you’re interested in, or because they’re distorted in some other way. But that’s fundamentally what you have to do. Or, and I quite like this in some contexts, maybe change the question: maybe you don’t really want to know that. You want to know this, and I have got data for this. Sometimes you can do that.

Alexander: I’ve tried to use that approach as well: finding a related question and getting an answer to that. What do you think about more extrapolating approaches? If you go back to the example of predicting defaults, you could look into a couple of these covariates and see what happens if we move out of this range. You could assume a linear effect, or some kind of exponential effect, and then get a little bit outside of the area where we have data.

David: I think you can use those, but you have to be very careful. When I was working in the credit sector around the year 2000, I used to show in my presentations a graph of the growth in retail credit: credit cards, that sort of thing, consumer credit. It was basically exponentially increasing up to the year 2000. And I would say, “What do you think’s going to happen next? Look, I can fit a perfect exponential curve to this data. Do you think it’s going to go on increasing exponentially forever? Or do you think something is going to happen?” I’ve subsequently claimed I predicted 2008. My prediction was useless, because I didn’t say when it would be. Anyway, I used that illustration to try to convince that sector that it shouldn’t just use data-driven models, models which just fit a pattern, a configuration in the data, but should try to inject some kind of understanding or theory into what it’s doing. In that case it would be economic understanding, but it could be chemical or biochemical understanding in the pharma sector. Because that constrains things: you’d never fit an exponential model which was going to go on increasing forever, because real life doesn’t do that. So I think that’s one way you can approach these things.

And perhaps I can say, I think there’s a more general issue here. One of the reasons that data science, machine learning and so on have taken off in such a big way is that you can take a data set, probably a large data set, and just fit a model to it using machine learning algorithms. But these things, I think, are fundamentally brittle, because if the world changes, like in my retail credit example, your model just won’t work anymore. So many of the models are of that kind: they haven’t injected theory about what’s going on, they’re just data driven. And so I think there’s a bit of a risk underlying this tremendous interest in data science.
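David’s point about the exponential curve can be illustrated with a small sketch. Everything in it is an assumption made for illustration: the “true” process is taken to be logistic growth with an invented ceiling, and only the early period is observed. It is not David’s credit data. The sketch fits a purely data-driven exponential by least squares on the log scale and then extrapolates it.

```python
import math

# Hypothetical "true" process: logistic growth that saturates at a ceiling.
CEILING, RATE, MIDPOINT = 100.0, 0.5, 12.0

def true_level(t):
    return CEILING / (1.0 + math.exp(-RATE * (t - MIDPOINT)))

# Only the early period is observed, where growth still looks exponential.
observed_t = list(range(9))  # t = 0, 1, ..., 8
observed_y = [true_level(t) for t in observed_t]

# Fit y = exp(a + b * t) by ordinary least squares on log(y).
log_y = [math.log(y) for y in observed_y]
n = len(observed_t)
mean_t = sum(observed_t) / n
mean_log_y = sum(log_y) / n
slope_num = sum((t - mean_t) * (ly - mean_log_y) for t, ly in zip(observed_t, log_y))
slope_den = sum((t - mean_t) ** 2 for t in observed_t)
b = slope_num / slope_den
a = mean_log_y - b * mean_t

def fitted_level(t):
    return math.exp(a + b * t)

# Largest relative error of the exponential fit inside the observed range.
worst_in_sample_error = max(
    abs(fitted_level(t) - y) / y for t, y in zip(observed_t, observed_y)
)

# Extrapolate far beyond the data and compare with the saturating truth.
FUTURE_T = 30
extrapolated = fitted_level(FUTURE_T)
truth = true_level(FUTURE_T)

print(f"worst in-sample relative error: {worst_in_sample_error:.3f}")
print(f"at t={FUTURE_T}: exponential fit {extrapolated:.1f}, saturating truth {truth:.1f}")
```

Inside the observed range the exponential fits closely, yet far outside it the fit overshoots the saturating process by orders of magnitude. Knowing that a ceiling must exist is the kind of theory that the data alone cannot supply.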

Alexander: So is it about model uncertainty in a way?

David: Part of it’s about model uncertainty, but I think it’s really about understanding. Obviously, if your theory is wrong, well, that’s a different matter. But if your theory has an element of truth in it, then that constrains the shapes of models that you’re likely to fit. So model uncertainty comes into it, but I think it’s really about understanding.

Sam: I think we all see that we generally understand, you know, these issues, but do you think statisticians have any blind spots with respect to dark data?

David: Oh, yeah, we all have blind spots. I’m sure I’ve made a lot of mistakes in the past. The trouble is, of course, that when we’re trying to identify our blind spots, we’re trying to identify things we haven’t thought of, which is trying to see something which isn’t there, which is the dark data problem. And I think that’s fundamentally difficult. So I suppose I come back to this point about diversity, what we were saying earlier about teams: having other people who might see things, who might not have the same blind spots, so that we cover for each other. I’m also an enthusiast for red teams: you build a model, and then you get someone to criticize it, to point out, “Well, did you think about this? Maybe the data is inadequate in that way,” that kind of thing. Try to see through the blind spot.

Alexander: So basically, something like a peer review system, and some kind of person who plays devil’s advocate and says, “What happens if something here goes wrong?” Or, as my former professor at university would say, “What happens at the margin?” It’s an interesting question.

Sam: And perhaps more formally, you could do that by listing out the assumptions you’re making and then stress testing them.

David: Yes. And again, I think that’s another example of why it’s good to have other people as well, because they can point out, “Oh, but you also assumed this, and you didn’t say that. You assumed that the cases were independent; you didn’t actually write that down.” So yeah.

Alexander: Yeah. And that is actually a very nice scenario, where you would hugely overestimate your precision if you think that everything you observe is independent when in fact it’s not.

David: Yeah, I’m sure you’ve had examples of this as well. I used to work at the Institute of Psychiatry in London as a statistician, quite a long time ago now. And I can certainly remember cases of researchers at the institute coming to me, and we would spend half an hour talking about the data, and they had so many data points. And then at the end (this was when I was young; I’ve learned not to do it now) I would discover that they were based on only three cases; it’s just that each case had 100 observations.

Alexander: Everybody was rated by the same physician and you’re saying like,

Sam: I wonder where that fits in the taxonomy, because really it has to do with what’s the independent data, as opposed to the correlated data. In the absence of that, perhaps that person thought they had hundreds of data points, but apparently they had three.

David: So what’s missing there is a crucial aspect of the data description. It’s related to one of the items in my taxonomy, which is essentially missing variables. I haven’t called it that, but it’s variables you hadn’t thought to use or thought to measure, and yet they explain a lot of things. Of course, there are classic examples in the literature of correlations which you try to explain in a causal way, when in fact there’s a third variable which explains them.

Sam: The thing that I find a lot is repeated measurements of the same thing. People think it’s just more data, but it’s not always more data. The phrase I use a lot with people is: it’s as if I had knocked on the door, you answered, and I said hello five times. It still means hello, right? That’s all it means; you just got it five times. And that’s what happens sometimes with replicates: there’s almost no replication error, because it’s all based on the system that generated that result, and you’re just going to get the same number every time anyway.

David: Exactly, with that tiny variation, and then you draw a conclusion based on that tiny standard deviation. So you get very precise conclusions, but they don’t generalize to the population you want to generalize to.
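The “three cases with 100 observations each” situation can be sketched as follows. The data are simulated under invented assumptions (the case levels, the within-case spread and the seed are all hypothetical), and analysing one mean per case is just one simple way to respect the dependence, not the only one.

```python
import math
import random
import statistics

rng = random.Random(42)  # fixed seed so the sketch is reproducible

# Hypothetical data: 3 cases, each measured 100 times with little
# within-case variation. Levels and spread are invented for illustration.
case_levels = [4.0, 5.5, 7.0]
WITHIN_SD = 0.1
observations = {
    case: [level + rng.gauss(0.0, WITHIN_SD) for _ in range(100)]
    for case, level in enumerate(case_levels)
}

# Naive analysis: treat all 300 values as independent observations.
pooled = [x for values in observations.values() for x in values]
naive_se = statistics.stdev(pooled) / math.sqrt(len(pooled))

# Case-level analysis: one summary value per case, so n = 3.
case_means = [statistics.mean(values) for values in observations.values()]
cluster_se = statistics.stdev(case_means) / math.sqrt(len(case_means))

print(f"naive standard error (n = {len(pooled)}): {naive_se:.3f}")
print(f"case-level standard error (n = {len(case_means)}): {cluster_se:.3f}")
```

The naive standard error comes out many times smaller than the case-level one, which is the overestimated precision Alexander mentions: the 300 values describe three cases very well and the wider population hardly at all.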

Alexander: You now know everything about these three rats, as long as they are still alive.

Okay, very good. So, we talked quite a lot about all kinds of different reasons why dark data matters. We talked about it happening in all kinds of different areas: the military, HR data, finance data, lots of medical examples. There’s a whole taxonomy. So I very much encourage everybody to have a look at the book and read it, be more aware of what might happen, and stay curious. Because I think that is one of the underlying qualities that we need to have: we need to be curious. We need to be data detectives, in a sense, to really dig into the data and see if there is anything missing. I’m just thinking about that example with the cholera outbreak in London. There was this one building that had no cases although it was directly next to the suspicious water source: it was a brewery, and that was a missing data point, a missing variable.

Sam: Yes. It’s because they had purified water.

Alexander: And they had their own well, so they weren’t affected.

David: That’s a wonderful example.

Alexander: So, all kinds of these things. If you understand them well, you can adjust for that, and you can state your assumptions carefully. You can come up with better ways to explain it to those who actually base decisions on this data, so that they can better understand the uncertainties and risks that come with it. David, are there any final words that you would like to give to the listener?

David: I think you’ve summarized it very nicely. I think, you know, be aware of the issues. Ask: is there any other way that these results could be explained, some distortion in the data which might explain my results rather than the fact that there is a real effect? Of every study, we should ask what the weaknesses of these data are and try to probe what might compromise the conclusions. And, I suppose, the diversity point. As I say, I’m an enthusiast for red teams, a sort of peer review, people trying to pick holes in your conclusions: “Well, yes, I could explain that by such and such distortion in the data.” And then you can check that the data don’t suffer from those kinds of problems. That way, your conclusions will be more robust and more likely to be correct.

Alexander: So avoid any group think and things like this, where you just don’t want to challenge each other anymore because you all want to be consistent.

Sam: Yeah. I think this book is a great resource to give to the people that you collaborate with, because it may help them understand a little bit the way statisticians think. Why is it that we ask ten questions when all they want is a sample size?

David: Yes. That’s absolutely right. 

Sam: And it helps them understand our process for, you know, thinking about how data happens and the impact of that. And maybe sometimes where data doesn’t happen.

David: I think it’s a good point. I mean, statisticians have had a reputation for being cautious and unhelpful. And the reason why we seem unhelpful is that we’re aware that these problems can occur. If one can help the people we’re advising recognize that we’re just trying to protect them, maybe that’s a very good thing.

Alexander: I think we just need to be helpful in explaining why the data can’t answer that question, and not just say “I can’t answer this” but ask: what could we do to answer it in the future? What are related questions that we can answer? What are the risks associated with it? So do some scenario planning. Maybe put assumptions in it, but if we stress test it, how fast does it break? That’s what we have; that’s the stress test. Do you want to base your decision on it? What’s the risk that you’re taking of making the wrong decision here?

Sam: We’re not just here to say no.

Alexander: Exactly.

Sam: Well David, thank you so much for taking the time to talk with us. It has really been a delight to speak with you and I think this has been a great, great talk together. 

David: Thank you very much indeed. It’s been great fun. I’ve enjoyed it. Very, very searching questions. Thank you very much indeed. 

Alexander: The show was created in association with PSI. Thanks to Reinne, who helps us with the show in the background, and thank you for listening. Head over to theeffectivestatistician.com to find the link to the book that we talked about and much more, including other podcast episodes. There is material on how to better influence, how to better visualize, and other things that you will likely need as a statistician in the health sector to boost your career, reach your potential, lead great science and serve patients. Just be an effective statistician.
