---
title: "Brief analysis of UK coronavirus DHSC daily death figures"
author: "Alan Grafen"
date: "2020-05-19"
output: html_document
---


```{r echo=FALSE}

## This gets the data in
get_data_from_last_file_in_folder<-function(dir,cols,text="Select one of the data files"){
	
ordir<-getwd()

if (missing(dir)) {
	message("before flush.console\n")
	cat("\n")}

# message("before flush.console")

# This crashed R	
# flush.console()

# message("after flush.console")

if (missing(dir)) {
	a<-file.choose()
	dir<-dirname(a)}

files<-list.files(dir,no..=TRUE)

setwd(dir)

lastfile<-files[length(files)]

# Second read each one in turn, and do the small tasks

outdf<-read.csv(lastfile)[,cols]

setwd(ordir)
return(outdf)}

## This line gets the data for me -- it is commented out so if necessary, it will prompt you for the relevant folder,
##   which should contain DHSC datafiles from after the 29th April 2020
dft<-get_data_from_last_file_in_folder("~/OneDrive - Nexus365/Coronavirus stuff/Data/deaths/posteq29")

dft<-subset(dft,Area.type=="UK")
Deaths<-dft$Daily.change.in.deaths[nrow(dft):1]
## We now have all the data we are carrying forward in Deaths

library("MASS")
ndays<-100
downames<-c("1Sunday","2Monday","3Tuesday","4Wednesday","5Thursday","6Friday","7Saturday")[c(6:7,1:5)]
dow<-factor(rep(downames,length=ndays))
t<-c(1:ndays)
week<-rep(1:100,each=7,length=ndays)
weeknames<-c("6 Mar-","13 Mar-","20 Mar-","27 Mar-","3 Apr-","10 Apr-","17 Apr-","24 Apr-","1 May-","8 May-", "15 May-")

fulldf<-data.frame(t=t,dow=dow,week=week)
df<-fulldf[1:length(Deaths),]
df$Deaths<-Deaths
df$wts<-1  

rm(Deaths,t,dow,week)
```

## Simple views of the data

The data are the daily deaths from Covid-19 in the UK, announced by DHSC each day, and available at https://coronavirus.data.gov.uk/archive. The data were revised on 29th April to include all deaths certified to be caused by the coronavirus, and not just those in hospital. There was a retrospective revision of all daily figures, and now for each "publication day" there is a whole series of death tallies for each past day. (I previously provided [analyses of the pre-29th April data](http://users.ox.ac.uk/~grafen/covstats/Tabulan.html).)

[An initial look](http://users.ox.ac.uk/~grafen/covstats/background.html) at the whole archive concludes that

(i) the revisions are insubstantial, so it is reasonable to operate with only the most recent dataset. Of course, this may change in future.

(ii) there are strong day-of-the-week effects, but these changed between Fri 17th April and Sat 18th April, for a reason unknown to me. Furthermore, they changed suddenly, with no evidence for change within either half of the dataset. This may also change as new data arrives. The data will be presented here for all days and, for the moment, the statistical analysis will only be applied to Sat 18th April onwards.

This document was prepared using [R-markdown](http://rmarkdown.rstudio.com) and [RStudio](http://www.rstudio.com). To find all the R code used to generate the tables and figures, download [the .Rmd file](http://users.ox.ac.uk/~grafen/covstats/foreground.Rmd). The only thing you will need to change is the directory to which you have downloaded the data file, in .csv format, for the day to be analysed. The program expects that file to be alphabetically the last file in the directory, as it will be if the only files you keep there are the series of .csv files for death rates from 29th April onwards.

Turning to the data, Table 1 shows the death tallies arranged by day of the week, and Figure 1 plots them by day of the week. Each curve has a much simpler pattern than if we ignored day of the week. The weekly pattern is presumably due to working shifts and administrative arrangements for people involved in different stages of reporting a death.

```{r echo=FALSE}
## warr is just an array of the Deaths figures, by day of the week, and week
## warrrat is the ratio of one day's figure to the figure exactly a week beforehand,
##  and so only starts in the second week of data.
warr<-array(c(df$Deaths,rep(0,7)),dim=c(7,1+(length(df$Deaths)-1)%/%7))
rownames(warr)<-downames
#colnames(warr)<-seq(1:ncol(warr))
colnames(warr)<-weeknames
warrrat<-warr[,2:ncol(warr)]/warr[,1:(ncol(warr)-1)]

## wartimes is the number of weeks it takes to double the number of deaths (or halve, if negative)
## 1/wartimes is the number of doublings per week (or halvings, if negative)
myf<-function(x) log(2)/log(x)
wartimes<-apply(warrrat,c(1,2),myf)

mff<-function(myar){
  mfo<-myar
  if (!(length(df$Deaths)%%7==0)) mfo[(1+length(df$Deaths)%%7):7,ncol(myar)]<-""
  return(noquote(mfo))
}

## These show the data in a presentable form
options(width=120)
mff(format(warr))
```

Table 1. The numbers of deaths arranged by day of the week

```{r echo=FALSE}
## This code plots the data for deaths by day and week, for ratios, and for the number of doublings/halvings per week
mycs<-c("pink","darkred","lightblue","darkblue","lightgreen","darkgreen","brown")
artypes<-c("Daily toll", "Ratio to previous week","Doubling/halving times")
myp<-function(ar,artype) {
	plot(ar[1,],ylim=c(0,max(ar,na.rm=TRUE)),col="transparent",ylab=artypes[artype],xlab="Week")
	for (j in 1:7) lines(ar[j,],col=mycs[j])
	legend(1,1200,legend=downames,lwd=5,col=mycs)
	}
	
myp(ifelse(warr!=0,warr,NA),1)
```

Figure 1. The numbers of deaths plotted by day of the week


Next, in Table 2 and Figure 2, we express each figure as a ratio, by dividing it by the figure of exactly one week previously. These ratios are not affected by day of the week effects, and they also incorporate a whole week of change, so are likely to be more reliable indicators. The obvious pattern is that the ratios are high to begin with, and do get gradually lower. The grey line at a ratio of one separates ratios indicating numbers are increasing, from those that indicate a decrease

```{r echo=FALSE}
mff(format(warrrat,digits=2))
```

Table 2. The ratio of number of deaths to the preceding same day of the week

```{r echo=FALSE}
warrratplot<-function(){
warrratc<-c(ifelse(warrrat!=0,warrrat,NA))
plot(warrratc,ylab="Ratio to previous week (log scale)",xlab="Days",col="transparent",ylim=c(0.5,15)) ## now omitted ",log='y'", as Fig3 does that
lines(warrratc)
lines(c(0,length(warrratc)),c(1,1), col="grey")}

wartimesplot<-function(){wartimesp<-c(ifelse(wartimes!=0,wartimes,NA))
plot(1/wartimesp,ylab="Number of doublings per week",xlab="Day",ylim=c(-1.5,4.5),col="transparent")
lines(1/wartimesp)
lines(c(0,length(wartimesp)),c(0,0),col="grey")}

warrratplot()

```

Figure 2. The ratios plotted against time

## How quickly will the death rates fall?

The ratios seem to be currently between 0.7 and 0.9, and we can ask how quickly will the numbers decrease? A transformation of the ratios allows us to see in Table 3 and Figure 3 the number of doublings or halvings per week. In the early days, there were large positive figures, indicating more than two doublings per week, so four times as many people were dying at the end of the week than at the beginning. This was a very steep ascent at the beginning of the epidemic.

Later, the doubling times came down, and did so gradually, and eventually became negative on the 15th April (apart from a couple of rare reversions). Once negative, we can take the signs off, and think of them as the number of halvings in a week. For the figures to drop as fast as they rose, we would need to see the number of halvings per week to rise to more than 2, to match the number of doublings per week of 2. However, the number of halvings hovers around 0.3 to 0.4, suggesting it would take 2.5 to 3 weeks to halve the number of deaths. This clearly indicates a very slow tailing off, certainly compared to the very rapid initial increase.

The slowness of decline must be the result of ineffectiveness in our lockdown. Infectious people are still meeting susceptible people, and passing the infection on. These figures don't tell us whether these are key workers who use public transport, health workers and hospital patients, care home workers and residents, or people not obeying the lockdown restrictions. Or, indeed, it is possible that non-key workers who are obeying the lockdown restrictions are still not sufficiently protected. I am sure someone connected to SAGE is studying these important questions, with more informative data.

These figures can also be used to note that a delay of one week in imposing the lockdown, at a time when there were two doublings per week, would result in an extension of over four weeks now, to get down to any given level. The sharp rise and slow fall makes those early decisions look very important.


```{r echo=FALSE}
mff(format(1/wartimes,digits=2))
```

Table 3. The doublings/halvings per week, based on the ratios in Table 2. It is the number of doublings in a week if positive, and the number of halvings in a week if negative.

```{r echo=FALSE}
wartimesplot()
```

Figure 3. The doublings/halvings per week plotted against time. For the epidemic to reduce at the same rate it increased, the later figures would have to reach the same magnitude as the earlier figures, but be negative. Unfortunately, the early doublings per week are high, being greater than 2 for a number of weeks. The halvings per week reach only low numbers, and indicate a long "shoulder" to the decline in deaths.


## Very rough estimates of the number of remaining deaths

If we assume the ratio from one week to the previous week is really fixed, and the variations we see in the tables are the result of random fluctuations, then we can estimate how many deaths are still to come in this wave of the epidemic. The estimates also assume no change in the lockdown conditions, as they would almost certainly change the ratio.

One method for estimation is to analyse a sequence of days' data, allowing for day of the week effects, and estimating the rate of decline in numbers. That rate of decline estimates the ratio $r$ from one week to the next, and comes with a standard error to indicate the imprecision in the estimate. The remaining number of deaths can then be estimated from the number of deaths in the most recent week $D$ (note that here I always use the most recent week up until the day of the analysis, even when the rate is estimated using data from longer ago). The number next week is expected to be $r D$, then $r^2 D$, $r^3 D$, and so on. Summing this geometric progression gives the estimate of all future deaths as
$$ \frac{r D}{1-r}\, . $$
95% confidence intervals on the rate of decline give 95% confidence intervals on the ratio, which can be used to provide a 95% prediction interval on the expected future number of deaths in the first wave.

Note that these prediction intervals do not take all the uncertainties into account. The most recent week's deaths are also subject to error in a sense. Further, the same figures that are used to calculate the sum of the most recent week's deaths are often also used in the estimation of the rate. I can see no simple way to take these additional factors into account, but they emphasise the need for caution in interpreting the prediction intervals. A second point along the same lines is that these are approximate _prediction intervals_, but even if they are exactly right, the actual number of deaths will have random fluctuations around those predictions.

This exercise has been carried out in Tables 4 and 5. It has already been stressed that the estimates depend on the ratio not changing. The prediction intervals are made wide by our uncertainy in estimating the decline. Tables 4 and 5 use different sequences of days, but all the sequences are fairly short, so the estimate of rate of decline is rather fuzzy. It is not so long since the deaths started to decline, so it is nor surprising we can't estimate the rate sharply. Tables 4 and 5 further restrict their use of data to after the sudden change in day-of-the-week effects on 17th April. Allowing parameters for that change would effectively reduce the number of datapoints, which could only be made up for by going more than one week further back -- but, by then, the epidemic was only just turning round. There is a tension between including more data to get more accurate estimates, on the one hand, and using only recent data so we are more sure the evidence is relevant to future rates.


```{r echo=FALSE, error=FALSE, message=FALSE, warning=FALSE}
# ```{r echo=FALSE}
## These two functions look in a defined subset of the data (from day=start to day=start+duration-1)
##  on the basis of a quasi-Poisson model and a negative binomial model. They return the slope of Deaths
##  against time (controlling for day of the week), the residual deviance, the residual degrees of freedom,
##  the confidence interval for the slope, and a p-value testing for a quadratic term in time.
dofn<-function(start,duration){
	if (start+duration-1>nrow(df)) stop("Period oversteps end of data")
	dftemp<-df[start:(start+duration-1),]
	m1<-glm(Deaths~dow,data=dftemp,family=quasipoisson)
	m2<-update(m1,~.+t)
	m3<-update(m2,~.+I(t^2))
	myan<-anova(m2,m3,test="F")
	return(list(slope=m2$coef["t"], resdev=m2$deviance, resdf=m2$df.residual,ci=confint(m2)["t",],pt2=myan$`Pr`[length(myan$`Pr`)]))
}

dofn.nb<-function(start,duration){
	if (start+duration-1>nrow(df)) stop("Period oversteps end of data")
	dftemp<-df[start:(start+duration-1),]
	m1<-glm.nb(Deaths~dow,data=dftemp,init.theta=20)
	m2<-update(m1,~.+t)
	m3<-update(m2,~.+I(t^2))
	myan<-anova(m2,m3,test="Chisq")
	## print(paste0("start at ",start,", theta = ",m2$theta,", with SE", m2$SE.theta))
	return(list(slope=m2$coef["t"], resdev=m2$deviance, resdf=m2$df.residual, theta=m2$theta, ci=confint(m2)["t",],pt2=myan$`Pr`[length(myan$`Pr`)],theta=m2$theta))
}

## This code uses dofn and dofn.nb and applies them to (i) all fortnights in the period of the second day-of-the-week pattern, then
##  all periods of a least a fortnight ending with the most recent day, and staring within the second day-of-the-week pattern.
##  Quasi-Poisson first, then negative binomial.
realstart<-44 ## SO, unreliable up to 43, and then reliable from 44 onwards
mydur<-14
np<-nrow(df)-mydur+1-(realstart-1)
sla<-array(NA,dim=np)
deva<-array(NA,dim=np)
dfa<-array(NA,dim=np)
cia<-array(NA,dim=c(np,2))
pt2a<-array(NA,dim=np)
 for (s in realstart:(nrow(df)-mydur+1)) {dofn1<-dofn(s,14);sla[s-realstart+1]<-dofn1[["slope"]];deva[s-realstart+1]<-dofn1[["resdev"]];dfa[s-realstart+1]<-dofn1[["resdf"]];cia[s-realstart+1,]=dofn1[["ci"]];pt2a[s-realstart+1]<-dofn1[["pt2"]]}
 
# print(cia)
 
mymindur<-14
np<-nrow(df)-mymindur+1-(realstart-1)
slb<-array(NA,dim=np)
devb<-array(NA,dim=np)
dfb<-array(NA,dim=np)
cib<-array(NA,dim=c(np,2))
pt2b<-array(NA,dim=np)
 for (s in realstart:(nrow(df)-mymindur+1)) {dofn1<-dofn(s,nrow(df)-s+1);slb[s-realstart+1]<-dofn1[["slope"]];devb[s-realstart+1]<-dofn1[["resdev"]];dfb[s-realstart+1]=dofn1[["resdf"]];cib[s-realstart+1,]=dofn1[["ci"]];pt2b[s-realstart+1]<-dofn1[["pt2"]]}

```

```{r echo=FALSE}
## This function takes the total number of Deaths in a week ("oneweek") and an estimated slope for day, and works out
##  the total number of remaining deaths on the assumption that slope remains the same, and there is no error in the model from
##  now on.
trd<-function(oneweek,tslope) {ratio<-exp(7*tslope); return(oneweek*ratio/(1-ratio))}
 
startnames<-c("18-Apr","19-Apr","20-Apr","21-Apr","22-Apr","23-Apr","24-Apr","25-Apr","26-Apr",
              "27-Apr","28-Apr","29-Apr","30-Apr","1-May","2-May","3-May","4-May","5-May","6-May","7-May",
              "8-May","9-May","10-May")
## First is quasi-Poisson, fortnight-length, varying endpoint
mytb1<-cbind(apply(cia,c(1,2),function(x){trd(sum(df$Deaths[(nrow(df)-6):nrow(df)]),x)}),deva/dfa,pt2a)
## Then always ending at last day, but starting from day 1 up to fortnight before end
mytb2<-cbind(apply(cib,c(1,2),function(x){trd(sum(df$Deaths[(nrow(df)-6):nrow(df)]),x)}),devb/dfb,pt2b)
colnames(mytb1)<-c("API.lo","API.hi","sc.f.","t-Sq?")
colnames(mytb2)<-colnames(mytb1)
rownames(mytb1)<-startnames[1:nrow(mytb1)]
rownames(mytb2)<-rownames(mytb1)
print(mytb1)
```

Table 4. Minimum approximate prediction intervals for expected remaining number of deaths based on estimate and SE of time-slope in a log-linear quasi-Poisson model. Each row is based on data from a fortnight beginning on the day specified on each row, ending with the most recent fortnight. The third column gives the scale factor in the quasi-Poisson model, and the fourth the p-value for adding a squared term in time.

```{r echo=FALSE}
print(mytb2)
```

Table 5. As Table 4, but the data used always ends with the most recent day, while the starting day gets one day later with each row. The final row uses exactly a fortnight's data. This uses more data than Table 4, but it is not balanced over days of the week and, by virtue of being longer, the periods are more likely to have substantial mis-specification.


Table 4 uses exactly one fortnight's data in each row, but looks at different fortnights back to 18th April. This exactly balances day of the week effects, but keeps the period short and so keeps the uncertainty high. Table 5 uses all the data from a starting point up to the most recent data, and varies that starting point from 18th April up until a fortnight ago. Table 5 thus uses more data, and the prediction intervals should be narrower, especially higher in the table. If the rate of decline is still changing over time, this strictly invalidates the models behind both tables, but Table 4 will be less affected than Table 5, because it discards older data.

The slow rate of decline sadly predicts many more deaths. On 10th May, the prediction intervals go between about 7,000 to over 25,000. These predictions should go down over time, because the future becomes the past, and the _remaining_ deaths should go down. On 17th May, recent poor figures have extended the prediction interval to over 100,000. By 19th May, both the fortnight analysis and the analysis of all days since the change in day of the week pattern both show a 95% confidence interval with an upper limit of over 100,000. Two poor days (this term detracts from the personal tragedy of each death) have made the picture much worse.

The final conclusion is an obvious and very gloomy one, that the epidemic is declining very slowly from a considerable height. We would do well to impose a lockdown much sooner in future waves, and also to take steps if possible to make the lockdown more rigorous, so that the decline happens more quickly.


### Some technical points

I end with some technical points about Tables 4 and 5. The scale factor measures the degree of over-disperion. This is high in the higher parts of both tables, and the reason can be seen in the variability in ratio in the final columns of Table 2. These ratios seem to settle down and become less variable. In the analyses beginning on 24th April, the scale factors suddenly plunge below 1, and stays there for two days before going right back up. There is clearly some lumpiness that is not determined by day-of-the-week -- I wonder about the cause. It seems when there are no lumps, or possibly when the lumps all occur on matching days of the week, the variability is close to Poisson, but just one lump, or one mis-matched lump, puts the scale factor up to 5.

The final column in Tables 4 and 5 looks at evidence for non-linearity (on the log scale) in the effect of time. The value is a p-value testing for a quadratic effect. With multiple p-values, it is important not to get too excited over occasional values less than 0.05. The two "significant" values on 24th and 25th April correspond to two disappointing ratios on 7th and 8th May, so it looks as though the possible quadratic term would be reducing the rate of decline. The general reason for providing this p-value is that it is one way of looking for mis-specification of the linear model from which the estimate and confidence interval are taken. At some speeds of change, the older Table 5 results would become unreliable and this could be indicated by the p-value for non-linearity.