Mumbai: Daily new COVID-19 cases in India rose to over 350,000 on April 25, a day which saw 2,808 deaths. These are just the numbers we know. The numbers we do not know could be a lot more. We do not know how many more cases there could potentially be because of a lack of testing, or because test results are not revealed. Trickier still is assessing the number of COVID-19 deaths. Going by several media reports, the number of deaths at present is much higher than what is being officially documented. Reports suggest the number of fatalities in some cities is almost 10-15 times what is being revealed, particularly by crematoria, municipal administrations and other local governing bodies. Whether these deaths are because of COVID-19 or not, we do not know.

What are the implications of such huge data gaps in the middle of a raging pandemic, in the near term for Indians trying to understand how badly we are affected, and in the longer term for scientists and researchers trying to model and predict the trajectory of COVID-19 cases to inform future public health responses? How do public health professionals try and make a case for responses in terms of infrastructure, medicine, health facilities, etc., with such data gaps in the system? And lastly, what can India do to fix this, whether at the policy or execution level, or by using technology innovatively?

For answers, we speak with Giridhar R. Babu, professor and head, life-course epidemiology at the Public Health Foundation of India in Bengaluru. Babu has a master's in Public Health and a PhD from the University of California at Los Angeles, and began as a resident at the All India Institute of Medical Sciences, New Delhi. We also speak with Murad Banaji, a PhD in mathematics, currently a senior lecturer at Middlesex University, London. And finally, Rukmini Srinivasan, a data journalist, who is currently writing a book titled 'Whole Numbers and Half-Truths: What Data Tells Us About Modern India'.

Edited excerpts:

Dr Babu, what are the COVID-19 data gaps India is facing, why are these critical and what are the implications of such data gaps?

GB: It's nice to be with two experts in analysing data and providing inferences to the public. I have a very simple understanding of how to use data. We should use data for action. If you are not using the data that could decrease inequities and ensure smooth provision of services, then there's no use in collecting such data. I'll give you an example: If State A has better data in terms of number of COVID-19 cases per million, and State B does not, [the latter] is actually disempowering its population by not detecting the infection when it is already present, by not providing them treatment and by not giving them an opportunity to prevent deaths. Inequities already exist in the community, and when you don't do justice to data collection, analysis and inferences, you are actually causing more harm than doing service to the people.

Further, for any action you need to take, you need to use the data in a timely manner. If the data is analysed maybe after a year, then it is not useful. For example, the genomic sequencing data coming after the surge has already started is not useful. If the genomic sequencing data is not matched to cluster investigations to see how many people are getting infected because of a particular variant of concern, then it will not be useful. So it is not just a question of collection of data and data accuracy; it is also about being timely.

MB: I second everything that Dr Babu has just said about data not just being collected but also being shared in time, if it is to be useful. Speaking from the point of view of a modeller, i.e. somebody who tries to look at past data to make some kind of prediction of what's going to happen next and where the disease is going to go, if you have good quality data then you can construct models which have some hope of predicting and explaining what's happened. For example, if we know how much COVID-19 has spread in different places, then we might be able to predict what the effect of a second wave will be on those places. If we know which COVID variants have been spreading, then we might be able to predict how fast the spread will occur. If we know how many people have died, then that also acts as a very important indicator of how much spread there was.

On the other hand, if we have vastly different levels of surveillance between different states, then you are in a very complicated game. Let's take the example of Maharashtra, which we know has been very badly hit in both waves of this pandemic so far. But we don't know the extent to which something is actually specific to Maharashtra's demography, to its urbanisation, or to some other factors, and to what extent it is specific to the degree of surveillance in Maharashtra vis-à-vis some other states. So it is very hard to unpack the story of what's happening in one state, whether there are reasons why COVID-19 is spreading faster or causing more deaths without actually having some sanity checks [steps to check data accuracy] on how good the disease surveillance is. Once you've lost that basic data integrity, making reasonable predictions becomes very hard.

We have seen that with how out-of-depth some of the modellers have been in terms of trying to predict what's going to happen with this second wave. We've got a very wide range of different predictions and I wouldn't blame the modellers for that. It's very hard. How do you parameterise your models, i.e. how do you accurately construct them in order to make predictions [without quality data]?

Not to mention all the multiple other issues, such as the fact that it's basic natural justice that people's deaths need to be counted. I'm talking about prediction and forecasting, but there's also obviously questions of justice and equity in terms of accurately recording the impact the pandemic is having on different communities.

Ms Srinivasan, what would be the two or three biggest pain points when it comes to trying to understand India's COVID-19 data?

RS: As someone who's not just looking at the data but also used to seeing the limitations of the Indian state in all its various forms, I think of this at two to three levels. One level is the data that we would like to see but that has simply not been collected. For instance, sero surveys simply do not exist in large parts of the country. Wanting to know about it is one thing, but hypothesising about it is pointless. And we know that data collection is not going to start overnight. So I think we need to be practical about the things for which data does not exist and just be clear about that.

The second level is data which has been collected but is either not very good or unusable, or is not being published. In terms of the not very good and unusable aspects, there you see this great community coming together to try and put things together. For example, covid19india.org exists because the data [COVID-19 cases in India compiled from government bulletins from the start of the pandemic till date, disaggregated by state] does exist, but it was simply not put together by the Indian government with all the data checks needed. Some of it is unusable, either due to its format or for reasons like denominator issues [how a population has been defined for data-gathering]. There's a good survey that the Bihar government did initially with returning migrants, from which we were able to get some indication of high positivity from particular source areas, but that was not a perfectly done study, so it leaves you unsure about how to use that data.

Then, lastly, there is the issue of data that's not being released or, as Dr Babu said, data that's being released too late. It's a real problem that the national sero-survey data is coming with such gaps. There is a whole lot of scientific research for which there isn't published data, although claims are being made for it. So there's that universe of data which simply is not being shared, and that's a problem.

You mentioned data that's not being released at all. What's the one data-set that stands out the most for not having been released?

RS: We've been told about COVID-19 surveys that some states have done which they've not made public, information that they did collect but they haven't put out. Multiple states have told me off the record that they've done, for example, audits of the COVID-19 deaths in the state. We sent Right to Information applications to multiple states asking for these audit committee reports, but we didn't get them. So that data is simply not being revealed. Even the PM-CARES fund is in that category of information that we do need but are not getting the data for.

Dr Babu, what are the three or four data collection areas that are crying out to be addressed? What data are we losing out on? And how do we convince people that if you've not collected the data, you've perhaps lost the opportunity to ever collect it? Because in many cases in India, if things get solved by themselves, we tend to forget about them and no change happens.

GB: I'll give you an example here. If you look at the top five states in terms of cases per million population, it's probably going to average around 21,000 cases per million. And if you go to the bottom five, it will be around 2,000 per million. Now where is 2,000 per million and where is 20,000 per million? There is a 10-fold difference between the states that we are looking at within India. So to answer your question very specifically, when we review a programme, we should ask the right question. What is the right question? It is: Why are you not detecting enough cases in your population? Why are you not doing proper testing? Why are you not using a syndromic approach? Instead of that, you are asking: Why does your state have more cases?

Now, you see, the entire focus has shifted elsewhere. This is the irony of public health implementation. Let's just assume, without getting into any political ramifications, that states like Delhi, Karnataka and Maharashtra have good governance. They'll do better testing, thus they'll have more cases and they'll ask for more help, more beds, more oxygen, etc. Thus they may receive more support. On the other hand, we are not concentrating on areas where people are not getting tested, are not getting treated, and where there are probably more COVID-19 deaths. So see the difference that arises just because the wrong question is being asked.
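The comparison Dr Babu draws rests on simple arithmetic: reported cases scaled to a common population. The sketch below is illustrative only. The two states and their case and population counts are hypothetical, chosen so that the results roughly mirror the 21,000 and 2,000 per million he quotes; they are not real state data.

```python
def cases_per_million(cases: int, population: int) -> float:
    """Reported cases scaled to a population of one million."""
    return cases * 1_000_000 / population


# Hypothetical states (assumed numbers, not real data).
state_a = cases_per_million(cases=1_260_000, population=60_000_000)  # more testing
state_b = cases_per_million(cases=200_000, population=100_000_000)   # less testing

ratio = state_a / state_b

print(f"State A: {state_a:,.0f} reported cases per million")
print(f"State B: {state_b:,.0f} reported cases per million")
print(f"State A reports {ratio:.1f} times as many cases per million as State B")
```

A gap of this size says as much about how hard each state is looking for cases as about how much infection there is, which is why the question to ask of the low-reporting state is about detection.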

This is not unique to COVID-19 control. This has happened in every public health programme. I worked in the National Polio Surveillance Project of the World Health Organization (WHO) in India. Till the WHO started working on this, all states were reporting a low number of polio cases. But once we started, Uttar Pradesh and Bihar reported the highest number of polio cases until we eradicated it. How did it happen? It happened in the same system, but by giving the work of data collection and inference to an independent and autonomous agency within the entire governance structure. We need to evolve mechanisms which take data collection and inferences for action out of the general management of health systems. This is purely technical work. I don't think any administration mechanisms will be able to tackle this.

Dr Banaji, some of us have seen what you're trying to do with existing COVID-19 data. What's your concern about using data which you are unsure about, and trying to model? Is there a sort of moral hazard in doing all of this?

MB: You have to take the uncertainties into account and build that into your modelling approach, otherwise you risk saying very wrong things. And you have to constantly update what you're doing. So modelling is not just taking data and using it. In some ways, it's also commenting on the data at some points.

So for example, when I was tracking Mumbai's COVID-19 epidemic in the early days, March, April and May 2020, there were certain very clear and stark trends which started to appear. What struck me as a modeller was that, based on early data, the city was not generating the number of COVID-19 fatalities that the models were predicting. And the question arose, why? Is it because COVID-19 deaths are not being recorded in the city, or because the disease is spreading in a different population where there are fewer fatalities occurring, or is it for some other reason? I explored those issues in a couple of publications at the time. I gave space to all the different possibilities, but I said that I believed that there were deaths which were going missing in the city, that were off the record, so to speak. And then later on, this was vindicated. There was a massive reconciliation, where some 1,700 deaths were added back into the city's toll. Now it's nice to be vindicated when you say that stats are going missing and that later turns out to be true. But the problem with that episode is that we still don't really know when those deaths occurred.

So when we're tracking the early trajectory of the epidemic, when it was raging in the slums of Mumbai, we don't know exactly who were the people being affected by this and how badly. And when it comes to trying to predict what's going to happen, and trying to understand Mumbai's second wave, that gap in our knowledge is actually a huge, very important gap. So even though there was some reconciliation and an improvement which was a very positive thing--and I do feel that Mumbai's data has improved a lot since those early days--that gap still left a shadow on attempts to understand what's going on with the second wave.

Since then, we've had a lot of sero-survey data and a variety of other kinds of data which have helped to inform on Mumbai's epidemic. But the broad point is what [happens] when you have poor quality data. For instance, if you look at what's happening in Gujarat and Madhya Pradesh, where there are reports of death undercounting, this is going to leave a shadow in terms of understanding how the epidemic has actually travelled through these states, what are the vulnerabilities for future waves, and perhaps also questions about how vaccination ought to be targeted in the future. So it leaves very practical gaps in our knowledge.

Ms Srinivasan, you've written about under-counting of deaths. You've looked at Kerala and Maharashtra and were able to show that there was over-counting and under-counting of deaths. Is this only because you have data? Were these states revealing more than others were? If this was the position in these states, what could it have been in any other large, populous state that was not giving out such data?

RS: The data that we have from Kerala and Mumbai isn't really able to give us a wide enough picture for the rest of the country. For example, Kerala did not find an increase in all-cause mortality in 2020, which is a surprise. I think one possibly sort of political way of looking at it would be for the state to praise itself and say that it did a good job. But even those in the state administration told me that they did not feel confident of the data and they would not take this as proof of having done particularly well. And they do think that future surveys, going over the data or an audit in the future would be a good idea. In Mumbai, not having a cause-by-cause breakdown of the data is a problem because, as we know, accidental deaths are significant in Mumbai. Both of these are very specific regions and generalising about them becomes difficult.

There is also an attempt to use the Centre for Monitoring Indian Economy's database, which is not actually intended to look at health data. But in the absence of other sources, this is what some have looked at. Murad made the point earlier about data from crematoria as well.

I think a lot of time last year was spent in polarised arguments about how many deaths were from COVID-19, or not. Epidemiologically and from a public health perspective, it is very important to figure out how many deaths were from COVID-19. I think what we're seeing right now, what's coming out from news reporting from burial grounds and crematoria, is that there is an increase in the number of people dying. And maybe at this moment, we do not have the infrastructure, but we are going to need a small survey to tell us how many of these were COVID-19 deaths. But at the very least, I think they do seem to be saying something about all-cause mortality. Unfortunately, that data for India last came out for 2018. I think politicians [and others] really need to push the government to do a sample survey or something of the sort to figure out all-cause mortality for 2020 and 2021. As this year is going, it's the all-cause mortality that personally bothers me the most.
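The all-cause mortality argument can be made concrete with a small calculation: excess deaths are the deaths observed in a period minus the number expected from earlier comparable periods. The sketch below assumes a simple average of past years as the baseline, which is only one of several possible choices, and every figure in it is hypothetical.

```python
from statistics import mean


def excess_deaths(observed: int, baseline_years: list[int]) -> float:
    """Observed all-cause deaths minus the average of earlier comparable periods."""
    expected = mean(baseline_years)
    return observed - expected


# Hypothetical death registrations for one city, same month in each year.
baseline = [5_000, 5_200, 5_100]  # assumed pre-pandemic counts
observed = 7_650                  # assumed count during the pandemic

excess = excess_deaths(observed, baseline)
print(f"Expected about {mean(baseline):,.0f} deaths, observed {observed:,}")
print(f"Excess deaths: {excess:,.0f}")
```

An excess figure like this does not say how many of the additional deaths were caused by COVID-19. That still needs the kind of survey or audit described above, but it does not depend on how each death was certified.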

Dr Babu, on the COVID-19 dashboard, what data stands out most for its absence, if not its presence?

GB: Definitely, the excess deaths which are the result of an epidemic are of great interest to epidemiologists, and also help us understand how much priority should be given to some control programmes over others. At the same time, we should be mindful that whatever we do in terms of understanding the data, we should build robust systems which will take care of a long-term perspective, not just for one small programme. If you were to have a stronger surveillance platform, which collects the data and then analyses it in a timely manner, adding another programme to it will not be a burden.

India's premier Integrated Disease Surveillance Program (IDSP) is one such programme, which had the ambition of integrating all the other surveillance programmes. But people will be surprised to learn there are at least eight or 10 vertical programmes, and some of them collect health data which are either duplicate or redundant. As a result, all of this load [of collecting data] is on the basic health worker in the field. They have to collect the same data for different vertical programmes. I think it is time to reorient the systems. First, data should be integrated horizontally into one system, which is feasible within the context. Second, it should have the flexibility to onboard some of the pandemics, or any other short-term health programmes that we do. So instead of picking one indicator, I would say we need to build systems.

Dr Babu, your thoughts on Dr Banaji's point about how the lack of data is affecting India's likely vaccination strategy?

GB: So now, regarding vaccination, there's a lot of data that one would like to know. At least in the initial phase, because one vaccine was permitted in a clinical trial mode, we wanted to see what kind of effects there are after taking that. There are these mechanisms of building trust in the community which are very important, and that comes only from transparency in data. I have worked with several governments on several programmes. Most times, it's not that the policymakers don't want to share data. They are scared that once an adverse event is known to the community, maybe people will not turn up for vaccination. So [not sharing data] comes from a very paternalistic approach, of 'let me take care of society'. There are also other reasons why data is not shared. But I think this is the time to reorient our systems. And this will also require people who manage the data to be independent of the people who run the programme, or who make the administrative decisions. This is as important as getting a few data indicators set right.

Many of India's flagship health programmes also have dedicated budgets for monitoring and data collection, don't they, Dr Babu?

GB: We want the data to be collected for every programme. But then the resources provided for data collection are very meagre. There is no separate budget head in most programmes for data collection. I will just give you a comparison. The National Centre for Disease Control, which is the place for looking at data in terms of IDSP and all the other programmes, gets a minuscule share out of the entire health budget. The equivalent Centers for Disease Control in the US gets around $650 million and the tasks are the same. So you want to under-staff, under-resource and undervalue the people who are engaged in data collection and inference, but then you want them to overperform, especially during pandemics, but also in other routine programmes.

Dr Banaji, you have looked at other countries. England is a good example of a country emerging from a lockdown. What role did better data collection and therefore better modelling play in this?

MB: I haven't tracked the UK's epidemic as closely as I've tracked India's, but there are a few stark contrasts. I think there were a lot of things which were done wrong in the UK as well. For example, the tendency to spin data rather than to present it transparently. But one of the plus points there is that some of the bodies which have been handling data in the UK are quite independent of the government.

There are a couple of UK programmes which I think would have been really fantastic to see in India. One is that there's been very regular surveying of the population to decide on incidence, i.e. how much COVID-19 is there in the community at the moment. This is surveying of different kinds, including for active infection via RT-PCR tests and surveying for antibody levels, so that you've got a sense of how many people have at least recently had the disease. So the UK had a kind of tracking mechanism.

It's easy to pick out incidence levels rising with a tracking mechanism. For instance, when a new variant was starting to spread, it was quite early and clearly visible in the data that something was going on. The UK has also had a high level of genome sequencing, so it was possible to track the rise of the new COVID variant, B.1.1.7, in the UK. And also, as COVID-19 vaccination was rolled out very, very quickly relative to many other countries, we were able to track the effects vaccination was having and how successful it was proving to be at bringing down severe disease, and then ultimately, also infection levels and transmission.

We would like in India to be able to say that although vaccination has been limited, in that it's been targeted at groups which are most vulnerable to severe disease, it's having an impact, it's bringing down deaths. The trouble with this is that, in the middle of a raging wave, it's extremely hard to say 'yes, vaccination is working and is bringing down deaths', although I believe it is. But the kind of data that you need to be gathering is very specific to be able to say that. You need to be very carefully tracking people who have been vaccinated a sufficient time ago and to see how many of them subsequently actually test positive. And to see ultimately, if any of those COVID-19 cases are of severe disease. Instead, what we saw is essentially a very big spin operation claiming that there are very few breakthrough infections occurring in people who are vaccinated, which Rukmini revealed in IndiaSpend. The intention was probably good, to increase positivity and confidence in the vaccines, but it was done on the basis of poor quality data analysis and in fact, I think, some clear dishonesty, because the necessary data wasn't being tracked until early April. You can't build confidence in vaccination without actually properly tracking what's happening. And I would love to be able to say with confidence that the vaccination drive is having a positive impact in India. And I don't understand why there isn't more of an effort to actually gather the kind of data that you need to make that claim.
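The tracking Dr Banaji describes can be sketched as a simple count over individual records: take people vaccinated a sufficient time ago, see how many later tested positive, and see how many of those cases were severe. The record layout, the `breakthrough_summary` helper, the 14-day window and the sample records below are all assumptions made for illustration. They do not describe any actual Indian data system.

```python
from dataclasses import dataclass
from datetime import date, timedelta
from typing import Optional


@dataclass
class Person:
    vaccinated_on: date
    tested_positive_on: Optional[date] = None
    severe: bool = False


def breakthrough_summary(people, as_of, min_days=14):
    """Count infections among people vaccinated at least `min_days` before `as_of`.

    Only positive tests dated `min_days` or more after vaccination are counted.
    The 14-day default is an assumption, not an official definition.
    """
    window = timedelta(days=min_days)
    eligible = [p for p in people if p.vaccinated_on + window <= as_of]
    infected = [
        p for p in eligible
        if p.tested_positive_on is not None
        and p.vaccinated_on + window <= p.tested_positive_on <= as_of
    ]
    severe = [p for p in infected if p.severe]
    return {"eligible": len(eligible), "infected": len(infected), "severe": len(severe)}


# Hypothetical records.
people = [
    Person(date(2021, 3, 1)),
    Person(date(2021, 3, 1), tested_positive_on=date(2021, 4, 10)),
    Person(date(2021, 3, 5), tested_positive_on=date(2021, 3, 8)),  # too soon after the dose
    Person(date(2021, 4, 20)),                                       # vaccinated too recently
    Person(date(2021, 3, 10), tested_positive_on=date(2021, 4, 15), severe=True),
]

summary = breakthrough_summary(people, as_of=date(2021, 4, 25))
print(summary)
```

The count is only meaningful if vaccination and test records can be linked for the same individuals, which is the data that, by his account, was not being tracked until early April.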

Dr Banaji, what macro or micro policy interventions can India do today, for the near and long-term future, in the context of data? By micro policy, I mean at the level of urban local bodies, which are also critical.

MB: At the moment, it's a firefighting exercise, the COVID-19 wave is so overwhelming. It's a tsunami of cases. Hospitals are being overwhelmed. There's almost certainly a huge rise in all-cause mortality that Rukmini talked about, not just in COVID-19 deaths. Because once hospitals are overwhelmed, people with every condition risk an avoidable death. So it's very difficult to talk about what data should be collected at the moment when really all resources need to be poured into just slowing down the spread of disease, in order to control the wave. So I think the first priority at the moment is obviously just to slow the spread. And we understand the basic mechanisms by which you slow the spread of COVID. We have a lot of experience now with that. I won't go into all the different kinds of mitigation, but that has to be the broad principle.

And obviously, we also need to continue vaccination as quickly as possible. Increasing the pace of vaccination needs to be a key priority. I do feel that it's important to focus on those who are most vulnerable until you've got extremely good coverage in those groups. So focus on people who are elderly and people with comorbidities, because we shouldn't yet be seeing vaccination as a way of bringing down the spread of disease. We should be seeing it as a way of stopping severe disease and reducing mortality. And then once we've achieved some success with that, then we can talk about reducing transmission and slowing the spread of disease.

In the longer term, looking forward, I think that all the kinds of excess mortality that we're seeing right now will need to be surveyed. When you see the reports coming from Gujarat and Madhya Pradesh, and I'm sure this will spread to many other parts of the country, when you see the huge and devastating kind of tolls and the obituaries pages in local newspapers, you have to say we need to survey to at least track what happened during this second wave, and how. Then the understanding will come about why the disease spread so very fast, why health systems were so quickly overwhelmed. And hopefully, all of that understanding will lead to some changes in the future.

Ms Srinivasan, where do you feel India could devote resources from a macro policy point of view, looking several years ahead rather than just right now?

RS: One is genomic sequencing, which both now and in the future is going to be hugely valuable. First, the issue was that there was too little genomic sequencing and now there's a little more. But the thing is, we don't know anything about the samples. For example, I saw some reporting from the UK that an inexplicably large number of the Indian samples from March of this year were from West Bengal. Now that has an impact on what you're going to find from it. So we need much better transparency about what the overall Indian sample [for genomic sequencing] is, and why. What is the strategy behind what samples are being sequenced? And for the future, I can picture how more genomic sequencing is going to be of great value.

Secondly, I think we need better health systems for rural areas and smaller cities. We've had to rely on the hard work of local reporters to understand what's going on in most other places. The sort of graphs that Murad has been able to produce showing differential spread in slum and non-slum areas of Mumbai, it's great that we have it for Mumbai and tragic that we don't have it for most of the country. The extent of true infection is not known for non-metros and we don't have centralised data about hospitalisation, so all that we rely on is reporters standing outside of hospitals telling us what's going on there.

And finally, for the future, I would want a culture of transparency and of data integrity. India has the abilities and the capacity to produce that. And producing data without spin. That's about as much as I can ask for.

Dr Babu, if you don't want to reveal the numbers, if you want to put up scaffolding and corrugated sheets to ensure that people don't look at crematoria, there is a more fundamental problem. So how do we address that, in terms of what India could be focusing on as you also look ahead?

GB: There were some brilliant points discussed throughout the interview. If I can sum up, we need to shift from policy-based evidence generation, where evidence is produced to cater to how the policy works, to evidence-based policy implementation. So that's one difference. Right now we're trying to show data that justifies what the policy actions are. That's not what we should be doing. We should have the evidence, and evidence should dictate what policy we implement. In order to do this, we need to generate timely, reliable and disaggregated data at every level. So that's the first task.

The second is in terms of changing the culture in order to encourage data for decision-making at every level. And this is not an easy thing; it doesn't just come to everybody. You have to change the entire mindset and culture. We have dealt with this in several programmes and this can be done.

Finally, we have enough data in different systems. So for example, there is death data collected in vital statistics, there is also the National Cancer Registry. We need to combine all of this. The approach has to be multi-sectoral. Just collecting data from one department and one arm is not going to work. The dashboard has to be for the Government of India, which is an all-of-society approach, and that should include all the sectors involved. This requires phenomenal engagement of software developers, people in public health and in social sciences, and this has to be a reform by itself. It can't be patchy, isolated or fragmented. So that culture, that approach, that perspective has to happen now.

The COVID-19 pandemic has made us realise how vulnerable our data systems are, but this vulnerability continues throughout for all the other programmes. At least now we have to make a decision and say 'let's build'. We have the best software companies also in India, so why not begin from that?