Here I describe the statistical modelling behind the predictions about the remaining deaths in the first wave of the UK coronavirus epidemic. The data are available at https://coronavirus.data.gov.uk/archive. The general linear model approach is taken, to fit a log-linear model to the daily number of deaths. The predictors will be time (a continuous variable t, in days, starting from 1), and day-of-the-week (a categorical variable dow, starting from 5Thursday on day 1, which is 6th March). The functions make it easy to find one’s way through fitting the model t+dow, and then adding terms of the following kinds
It is natural to represent the time pattern as splines, and so expect a number of time kinks that match empirically the arbitrary curve of the course of the epidemic. However, if a time step seems powerfully suggested by the data on statistical grounds, then it would be wrong to omit it. The day of the week effects will change when the reporting methods change, which they have done at least twice. This is because the delays in reporting depend on shift patterns and the bureaucratic effects of weekends. The initial model has no “dow kink”, so the initial day of the week effects are assumed to be constant over time. But if the reporting lags drift over time, there could be a kink. And once the figures include all deaths, not just hospital deaths, it may be that the different sources of information have different lags, and the changing balance of sources could result in a time trend in the day of the week effects.
The functions use a “pattern”, which is a specification of four sets of days, the days of time steps, the days of time kinks, the days of dow steps and the days of dow kinks. add.terms takes a pattern, and fits all possible further steps and kinks, one at a time, and shows the “best” in each category. Relevant information is provided for each of the four “best” changes. One can then be chosen, and added to the pattern, and add.terms can be run again. At each stage, and certainly at the end, it is useful to run lsearch, which takes a pattern, and checks all combinations of days that are found by allowing each separate day to remain the same, or to move up or down by one day. lsearch reports which the best model is from all of those possibilities. If the best outcome matches it, then the input pattern is a locally best model by varying the days.
Here, first, is the code defining the functions. It also works out details of changes in the archive record, but the printing has been suppressed. So far, these haven’t suggested doing other than analysing the most recent data. But it would revise our understanding of previous outcomes if the data was revised a lot. To use this code yourself, you will need to get the data in somehow. The simplest is to have all the deaths data from the https://coronavirus.data.gov.uk/archive in directory, and replace my directory as as argument to get_data_from_folder with your own directory. It takes the most recent file in that directory and uses that data for all days. Everything should work after that.
## ============== Here below is the edited version from 2020_06_02 ==========
get_data_from_folder<-function(dir,cols,text="Select one of the data files",addIDcol=FALSE,IDcolName="filename"){
ordir<-getwd()
if (missing(dir)) {
message("before flush.console\n")
cat("\n")}
# message("before flush.console")
# This crashed R
# flush.console()
# message("after flush.console")
if (missing(dir)) {
a<-file.choose()
dir<-dirname(a)}
files<-list.files(dir,no..=TRUE)
setwd(dir)
# Second read each one in turn, and do the small tasks
for (i in files) {
cat(i)
resin<-read.csv(i)[,cols]
if (addIDcol) {resin[,ncol(resin)+1]<-i; if (is.element(IDcolName,colnames(resin))) stop("IDcolName is already a column name in the datasets"); colnames(resin)[ncol(resin)]<-IDcolName}
if (i!=files[1]) resdf<-rbind(resdf,resin) else resdf<-resin
}
# Third, remove the unneeded original copy of the last file and dataframe, and reset wd
remove(resin)
setwd(ordir)
return(resdf)}
## This line gets the data for me -- it is commented out so if necessary, it will prompt you for the relevant folder,
## which should contain DHSC datafiles from from after the 29th April 2020
deathspost<-get_data_from_folder("~/OneDrive - Nexus365/Coronavirus stuff/Data/deaths/posteq29",addIDcol=TRUE)
## coronavirus-deaths_202004292132.csvcoronavirus-deaths_202004301702.csvcoronavirus-deaths_202005011647.csvcoronavirus-deaths_202005021621.csvcoronavirus-deaths_202005031513.csvcoronavirus-deaths_202005041501.csvcoronavirus-deaths_202005051554.csvcoronavirus-deaths_202005061549.csvcoronavirus-deaths_202005071534.csvcoronavirus-deaths_202005081443.csvcoronavirus-deaths_202005091431.csvcoronavirus-deaths_202005101500.csvcoronavirus-deaths_202005111457.csvcoronavirus-deaths_202005121454.csvcoronavirus-deaths_202005131522.csvcoronavirus-deaths_202005141344.csvcoronavirus-deaths_202005151534.csvcoronavirus-deaths_202005161352.csvcoronavirus-deaths_202005171653.csvcoronavirus-deaths_202005181505.csvcoronavirus-deaths_202005191711.csvcoronavirus-deaths_202005201630.csvcoronavirus-deaths_202005211346.csvcoronavirus-deaths_202005221347.csvcoronavirus-deaths_202005231511.csvcoronavirus-deaths_202005241458.csvcoronavirus-deaths_202005251534.csvcoronavirus-deaths_202005261537.csvcoronavirus-deaths_202005271508.csvcoronavirus-deaths_202005281507.csvcoronavirus-deaths_202005291612.csvcoronavirus-deaths_202005301512.csvcoronavirus-deaths_202005311434.csvcoronavirus-deaths_202006011604.csvcoronavirus-deaths_202006021540.csvcoronavirus-deaths_202006031412.csvcoronavirus-deaths_202006041356.csvcoronavirus-deaths_202006051441.csvcoronavirus-deaths_202006061346.csvcoronavirus-deaths_202006071502.csvcoronavirus-deaths_202006081341.csvcoronavirus-deaths_202006091440.csvcoronavirus-deaths_202006101522.csvcoronavirus-deaths_202006111329.csvcoronavirus-deaths_202006121339.csvcoronavirus-deaths_202006131430.csv
## Adding day and pubday - this converts the dates within the files to days counting from 6th March 2020, and converts
## the filenames into pubdays i.e. publication days. Days are the day on which the deaths were reported, while pubday
## is the the datafile was released. The point of the distinction is that sometimes a figure is revised after first
## publication, and this is seen on later pubdays.
add.days<-function(daf){
repyear<-with(daf,as.numeric(substr(Reporting.date,1,4)))
repmonth<-with(daf,as.numeric(substr(Reporting.date,6,7)))
repday<-with(daf,as.numeric(substr(Reporting.date,9,10)))
day<-repday-5
day<-ifelse(repmonth>=4,day+31,day)
day<-ifelse(repmonth>=5,day+30,day)
day<-ifelse(repmonth>=6,day+31,day)
day<-ifelse(repmonth>=7,day+30,day)
day<-ifelse(repmonth>=8,day+31,day)
daf$day<-day
pyear<-with(daf,as.numeric(substr(filename,20,23)))
pmonth<-with(daf,as.numeric(substr(filename,24,25)))
pday<-with(daf,as.numeric(substr(filename,26,27)))
pubday<-pday-5
pubday<-ifelse(pmonth>=4,pubday+31,pubday)
pubday<-ifelse(pmonth>=5,pubday+30,pubday)
pubday<-ifelse(pmonth>=6,pubday+31,pubday)
pubday<-ifelse(pmonth>=7,pubday+30,pubday)
pubday<-ifelse(pmonth>=8,pubday+31,pubday)
daf$pubday<-pubday
return(daf)
}
deathspost<-add.days(deathspost)
myxt<-xtabs(~day+pubday+Area.name,data=deathspost)
#max(myxt)
#min(myxt)
#sum(myxt)
#dim(deathspost)
## This should be max=1, min=0.
## This looks for variation in the reports for one day, across different publication days. And then looks
## for the datapoints that show the difference. At the time of writing, there is only one large revision
## in all the datafiles, but of course it persists through all subsequent publication days. There are two
## small changes in Northern Ireland, where it looks like one death was initially misallocated by one day.
myag<-aggregate(deathspost$Daily.change.in.deaths,by=with(deathspost,list(day=day,Area.name=Area.name)),FUN=sd)
myag0<-subset(myag,x!=0)
#print(myag0)
# for (j in (1:nrow(myag0))) print(subset(deathspost,day==myag0$day[j] & Area.name==myag0$Area.name[j]))
#subset(deathspost,day==55 & Area.name=="United Kingdom")
lastf<-deathspost$filename[nrow(deathspost)]
dft<-subset(deathspost,filename==lastf)
dft$filename<-NULL
dft<-subset(dft,Area.type=="UK")
Deaths<-dft$Daily.change.in.deaths[nrow(dft):1]
####### Deaths<-c(Deaths,412,377) ## a cheat for when the archive was unavailable
## We now have all the data we are carrying forward in Deaths
## but we also indicate first and last dates of the data
datafirstdate<-"2020-03-06"
datalastdate<-paste0(substr(lastf,20,23),"-",substr(lastf,24,25),"-",substr(lastf,26,27))
####### datalastdate<-"2020-05-27" ## a cheat for when the archive was unavailable
ndays<-length(Deaths)+100
downames<-c("1Sunday","2Monday","3Tuesday","4Wednesday","5Thursday","6Friday","7Saturday")
dow<-factor(rep(downames[c(6:7,1:5)],length=ndays))
dowSun<-ifelse(dow==downames[1],1,ifelse(dow==downames[7],-1,0))
dowMon<-ifelse(dow==downames[2],1,ifelse(dow==downames[7],-1,0))
dowTue<-ifelse(dow==downames[3],1,ifelse(dow==downames[7],-1,0))
dowWed<-ifelse(dow==downames[4],1,ifelse(dow==downames[7],-1,0))
dowThu<-ifelse(dow==downames[5],1,ifelse(dow==downames[7],-1,0))
dowFri<-ifelse(dow==downames[6],1,ifelse(dow==downames[7],-1,0))
dowdummies<-"(dowSun+dowMon+dowTue+dowWed+dowThu+dowFri)"
downames<-c("1Sunday","2Monday","3Tuesday","4Wednesday","5Thursday","6Friday","7Saturday")
dowdummies<-"(dowSun+dowMon+dowTue+dowWed+dowThu+dowFri)"
weeknames<-c("6 Mar-","13 Mar-","20 Mar-","27 Mar-","3 Apr-","10 Apr-","17 Apr-","24 Apr-","1 May-","8 May-", "15 May-","22-May","29-May")
makedf<-function(ndays){
dow<-factor(rep(downames[c(6:7,1:5)],length=ndays))
dowSun<-ifelse(dow==downames[1],1,ifelse(dow==downames[7],-1,0))
dowMon<-ifelse(dow==downames[2],1,ifelse(dow==downames[7],-1,0))
dowTue<-ifelse(dow==downames[3],1,ifelse(dow==downames[7],-1,0))
dowWed<-ifelse(dow==downames[4],1,ifelse(dow==downames[7],-1,0))
dowThu<-ifelse(dow==downames[5],1,ifelse(dow==downames[7],-1,0))
dowFri<-ifelse(dow==downames[6],1,ifelse(dow==downames[7],-1,0))
contrasts(dow)<-contr.sum(7)
t<-c(1:ndays)
week<-rep(1:100,each=7,length=ndays)
fulldf<-data.frame(t=t,dow=dow,week=week,wts=1,dowSun=dowSun,dowMon=dowMon,
dowTue=dowTue,dowWed=dowWed,dowThu=dowThu,dowFri=dowFri)
return(fulldf)
}
df<-makedf(length(Deaths))
df$Deaths<-Deaths
rm(Deaths,dow,dowSun,dowMon,dowTue,dowWed,dowThu,dowFri)
## Now prepare for fitpat
nulpat<-list(t.steps=c(),t.kinks=c(),dow.steps=c(),dow.kinks=c())
mytstep<-function(t,where,dir) if (dir>=0) ifelse(t-where>=0,1,0) else ifelse(t-where<0,1,0)
mytkink<-function(t,where,dir){ ifelse(dir*(t-where)>=0,(t-where),0) }
masch<-function(i){if (i>=0) return(as.character(i)) else return(paste0("m",as.character(-i)))}
## This used to be in fitpat, but I need to run it for predictions with indf=fulldf
## I am suspending the backwards calculation for dow.kinks, in light of marginality and
## the base model not having dow:t in it. 2020_06_10
expatdf<-function(indf,pat,breakdir=1){
modeldev<-"~."
# Now t.steps
for (ix in seq_along(pat$t.steps)) {
indf[,ncol(indf)+1]<-mytstep(indf$t,pat$t.steps[ix],breakdir)
backpart<-if (breakdir>0) "" else "back."
varname<-paste0(backpart,"tstep",masch(pat$t.steps[ix]))
colnames(indf)[ncol(indf)]<-varname
modeldev<-paste0(modeldev,"+",varname)
}
# Now t.kinks
for (ix in seq_along(pat$t.kinks)) {
indf[,ncol(indf)+1]<-mytkink(indf$t,pat$t.kinks[ix],breakdir)
backpart<-if (breakdir>0) "" else "back."
varname<-paste0(backpart,"tkink",masch(pat$t.kinks[ix]))
colnames(indf)[ncol(indf)]<-varname
modeldev<-paste0(modeldev,"+",varname)
}
# Now dow.steps
for (ix in seq_along(pat$dow.steps)) {
indf[,ncol(indf)+1]<-mytstep(indf$t,pat$dow.steps[ix],breakdir)
backpart<-if (breakdir>0) "" else "back."
varname<-paste0(backpart,"dowstep",masch(pat$dow.steps[ix]))
colnames(indf)[ncol(indf)]<-varname
modeldev<-paste0(modeldev,"+",varname,":",dowdummies)
}
# Now dow.kinks
for (ix in seq_along(pat$dow.kinks)) {
indf[,ncol(indf)+1]<-mytkink(indf$t,pat$dow.kinks[ix],1) ## The last "1" was breakdir until 2020_06_10
backpart<-if (breakdir>0) "" else "back."
varname<-paste0(backpart,"dowkink",masch(pat$dow.kinks[ix]))
colnames(indf)[ncol(indf)]<-varname
modeldev<-paste0(modeldev,"+t:",varname,":",dowdummies)
}
return(list(df=indf,modeldev=modeldev))
}
## Fits a pattern, and does some other things if required...
fitpat<-function(model,model2,from=1,to=nrow(df),pat=nulpat,breakdir=1,outp){
# print(breakdir)
dftemp<-df[c(from:to),]
# if (!missing(control)) mycontrol<-control else mycontrol<-glm.control()
m<-glm(model,family=quasipoisson,weights=wts,data=dftemp)
initrd<-m$deviance
initrdf<-m$df.residual
## if (missing(breaklocations)) stop("fap4.p requires breaklocations to be specified") -- Delete soon!
epd<-expatdf(dftemp,pat,breakdir)
dftemp<-epd$df
modeldev<-epd$modeldev
if (!missing(model2)) modeldev<-paste0(modeldev,"+",model2)
if (length(unlist(pat))>0) m<-update(m,modeldev)
m$st.res<-(m$y - m$fitted.values)/sqrt(m$fitted.values)
if (!missing(outp)) if(outp) {
print(anova(m,test="F"))
print(summary(m))
plot(dftemp$t,m$st.res)
}
return(list(m=m,dftemp=dftemp))
}
### add.term takes a pattern, and adds each possible new time step, time kink, dow step and dow kink, one at a time,
### and presents the best of each. "Each possible" is provided as a parameter patpos, and excludes values too close
### to existing values (in patav). breakdir determines whether the time breaks are fitted forwards, so the coefficient
### of t is the initial slope of the epidemic (breakdir>0), or backwards, when that coefficient is the final slope
### (breakdir<0). The output $inpat gives the pattern that was provided as a parameter, which is handy for checking
### with lsearch, or for storing, when the decision is not to add any more days before checking.
add.term<-function(pat,patpos=patpos0,patav=patav0,breakdir=1) {
smear<-function(i,j) {c((i-j):(i+j))}
for (ip in ipvec) {
for (it in seq_along(pat[[ip]])) {
givenplus<-unique(as.vector(vapply(pat[[ip]],FUN=function(x) smear(x,patav[[ip]]),FUN.VALUE=rep(3,2*patav[[ip]]+1))))
patpos[[ip]]<-setdiff(patpos[[ip]],givenplus)
}
}
patout<-nulpat
for (ip in ipvec) {
patout[[ip]]$day<-array(NA,dim=length(patpos[[ip]]))
patout[[ip]]$pvalue<-array(NA,dim=length(patpos[[ip]]))
patout[[ip]]$dev.drop<-array(NA,dim=length(patpos[[ip]]))
patout[[ip]]$res.dev<-array(NA,dim=length(patpos[[ip]]))
patout[[ip]]$df.drop<-array(NA,dim=length(patpos[[ip]]))
patout[[ip]]$res.df<-array(NA,dim=length(patpos[[ip]]))
patout[[ip]]$mdd<-array(NA,dim=length(patpos[[ip]]))
}
#print("first glm about to happen")
fpbase<-fitpat(Deaths~t+dow, pat=pat, breakdir=breakdir)
#print("first glm just gone")
dftemp<-fpbase$dftemp
for (ip in ipvec) {
keepday<-0
for (bp in patpos[[ip]]) {
#print(paste(ip,bp,keepday))
keepday<-keepday+1
if (ip=="t.steps") {newvar<-mytstep(dftemp$t,bp,breakdir); newmod<-"~.+newvar"}
if (ip=="t.kinks") {newvar<-mytkink(dftemp$t,bp,breakdir);newmod<-"~.+newvar"}
if (ip=="dow.steps") {newvar<-mytstep(dftemp$t,bp,breakdir);newmod<-paste0("~.+newvar:",dowdummies)}
if (ip=="dow.kinks") {newvar<-mytkink(dftemp$t,bp,1);newmod<-paste0("~.+newvar:t:",dowdummies)}
## ## The last "1" was breakdir until 2020_06_10 -- because of marginality, dow.kinks is different
newm<-update(fpbase$m,newmod)
patout[[ip]]$day[keepday]<-bp
patout[[ip]]$res.dev[keepday]<-newm$deviance
patout[[ip]]$res.df[keepday]<-newm$df.residual
patout[[ip]]$dev.drop[keepday]<-fpbase$m$deviance - newm$deviance
patout[[ip]]$df.drop[keepday]<-fpbase$m$df.residual-newm$df.residual
num<-patout[[ip]]$dev.drop[keepday]/patout[[ip]]$df.drop[keepday]
den<-patout[[ip]]$res.dev[keepday]/patout[[ip]]$res.df[keepday]
patout[[ip]]$pvalue[keepday]<-1-pf(num/den,patout[[ip]]$df.drop[keepday],patout[[ip]]$res.df[keepday])
patout[[ip]]$mdd[keepday]<-patout[[ip]]$dev.drop[keepday]/patout[[ip]]$df.drop[keepday]
}
for (j in seq_along(patout[[ip]])) { names(patout[[ip]][[j]])<-patout[[ip]]$day }
}
best<-nulpat
for (ip in ipvec){ ## seq_along(patout) changed to ipvec as part of 2020_06_01 error fixing, but for clarity not substance
bpix<-which(patout[[ip]]$mdd==max(patout[[ip]]$mdd,na.rm=TRUE))
best[[ip]]<-list(day=patout[[ip]]$day[bpix],mdd=patout[[ip]]$mdd[bpix], pvalue=patout[[ip]]$pvalue[bpix], dev.drop=patout[[ip]]$dev.drop[bpix], df.drop=patout[[ip]]$df.drop[bpix], res.dev=patout[[ip]]$res.dev[bpix], res.df=patout[[ip]]$res.df[bpix])
if (min(patout[[ip]]$res.df)!=max(patout[[ip]]$res.df)) print(paste0("Beware -- residual df vary for ",names(patout)[[ip]]))
#print(unlist(best[ip]))
}
besta<-array(NA,dim=c(4,7))
rownames(besta)<-ipvec
colnames(besta)<-c("day","mdd","pvalue","dev.drop","df.drop","res.dev","res.df")
for (ip in ipvec) for (j in 1:7) besta[ip,j]<-best[[ip]][[j]] ## i changed to ip over ipvec for clarity not substance over 2020_06_01 errors
print(unlist(pat)) ## Note from 2020_06_01 error fixing: this will be in the original (and possibly irregular) order
print(besta) ## This will be in the regular ipvec order, but any change in order will not reflect an error in calculations
return(list(inpat=pat,best=besta,patout=patout))
}
putintopat<-function(v,pat){
pat1<-list()
i<-0
for (ip in ipvec) for (j in seq_along(pat[[ip]])) pat1[[ip]][[j]]<-v[i<-i+1]
return(pat1)
}
## Key vector that orders pat effectively inside functions, and provides a framework within which missing types
## can be handled.
ipvec<-c("t.steps","t.kinks","dow.steps","dow.kinks")
# lcheck varies each day in the pattern independently with -1,+0,+1, so there are 3^(number of break days)
# models to fit. It's return includes the pattern for the best of these models (as $bestone).
lcheck<-function(pat,patwid=patwid0,spa){
list.of.lists<-function(len,width){
if (len==0) return ("empty")
if (len==1) return (as.list(-width:width))
outlist<-lapply(list.of.lists(len-1,width),function(x) append(x,-width,0))
for (i in (-width+1):width) outlist<-append(outlist,lapply(list.of.lists(len-1,width),function(x) append(x,i,0)))
return(outlist)
}
empty<-function(x) identical(x,"empty")
mll<-nulpat
for (ip in ipvec) mll[[ip]]<-list.of.lists(length(pat[[ip]]),patwid[[ip]])
devdims<-c()
for (ip in ipvec) devdims<-c(devdims,c(rep(2*patwid[[ip]]+1,length(pat[[ip]]))))
## The ipvec in the preceding line fixes the error I found on 2020_06_01. And it is right because
## the code for devind below uses the ipvec order and not the seq_along(pat) order
if (missing(spa)) devaa<-array(NA,dim=devdims) else devaa<-spa
devdf<-devaa
trypat<-nulpat
for (j1 in seq_along(mll[["t.steps"]])) {
if (empty(mll[["t.steps"]])) j1<-NULL else {
trypat[["t.steps"]]<- if (is.null(pat[["t.steps"]])) nulpat[["t.steps"]] else pat[["t.steps"]]+mll[["t.steps"]][[j1]]}
for (j2 in seq_along(mll[["t.kinks"]])) {
if (empty(mll[["t.kinks"]])) j2<-NULL else {
trypat[["t.kinks"]]<- if (is.null(pat[["t.kinks"]])) nulpat[["t.kinks"]] else pat[["t.kinks"]]+mll[["t.kinks"]][[j2]]}
for (j3 in seq_along(mll[["dow.steps"]])) {
if (empty(mll[["dow.steps"]])) j3<-NULL else {
trypat[["dow.steps"]]<- if (is.null(pat[["dow.steps"]])) nulpat[["dow.steps"]] else pat[["dow.steps"]]+mll[["dow.steps"]][[j3]]}
for (j4 in seq_along(mll[["dow.kinks"]])) {
if (empty(mll[["dow.kinks"]])) j4<-NULL else {
trypat[["dow.kinks"]]<- if (is.null(pat[["dow.kinks"]])) nulpat[["dow.kinks"]] else pat[["dow.kinks"]]+mll[["dow.kinks"]][[j4]]}
# print(unlist(trypat))
fp<-fitpat(Deaths~t+dow,pat=trypat)
inds<-c("t.steps"=j1,"t.kinks"=j2,"dow.steps"=j3,"dow.kinks"=j4)
devind<-c()
for (ip in ipvec) devind<-c(devind,if (is.null(pat[[ip]])) NULL else patwid[[ip]]+1+mll[[ip]][[inds[ip]]])
# print(devind)
# print(inds)
devaa[matrix(devind,nrow=1)]<-fp$m$deviance
devdf[matrix(devind,nrow=1)]<-fp$m$df.residual
# print(sum(is.na(devaa)))
}}}}
index<-0
for (ip in ipvec) {
for (i in seq_along(pat[[ip]])) {
index<-index+1
dimnames(devaa)[[index]]<-dimnames(devdf)[[index]]<-paste0(ip,".",pat[[ip]][[i]]+c(-patwid[[ip]]:patwid[[ip]]))
}
}
bestpoints<-which(devaa==min(devaa),arr.ind=TRUE)
# print(bestpoints)
patwidadj<-patwid
for (ip in ipvec) patwidadj[[ip]]<-if (is.null(pat[[ip]])) NULL else rep(patwid[[ip]],length(pat[[ip]]))
bestpoints<-bestpoints+array(rep(unlist(pat)-unlist(patwidadj)-1,each=nrow(bestpoints)),dim=dim(bestpoints))
# print(bestpoints)
if (min(devdf)!=max(devdf)) print("Beware -- residual df vary")
if (nrow(bestpoints)==1) {
print("one best")
best<-as.vector(bestpoints)
best<-putintopat(best,pat) ## relies on me having ordered the space by the sequence of ipvec.
# print(best)
print(unlist(pat)) ## An irregularly ordered pat will show up here irregularly, but it will be fixed
print(unlist(best)) ## in order in best -- but the sums will all have been right. See 2020_06_01 error.
if (all(unlist(best)==unlist(pat))) print("locally best")
# print(best)
return(list(bestone=best,arr=devaa,arr.df=devdf))
} else {
print("multiple bests"); return(list(bests=bestpoints,arr=devaa,arr.df=devdf))
}
}
make.pred<-function(pat,ndaysextra=7*26) {
sumbyweek<-function(x) {
outvec<-array(NA,dim=c(length(x)%/%7))
for (i in 1:(length(x)%/%7)) {
outvec[i]<-sum(x[(1+7*(i-1)):(7*i)])
# print(outvec[i])
}
return(outvec)}
m.1a<-fitpat(Deaths~t+dow,pat=pat,breakdir=1)
m.1b<-fitpat(Deaths~t+dow,pat=pat,breakdir=-1)
cat(paste0("\n","The model fits ",nrow(df)," days of data from ",datafirstdate," to ",datalastdate," and has the following pattern:\n"))
print(unlist(pat))
cat(paste0("and leaves residual deviance ",
format(m.1a$m$deviance,nsmall=1,digits=1),
" on ",m.1a$m$df.residual," degrees of freedom.\n"))
weekly.ratio<-exp(7*m.1b$m$coef["t"])
slope.figs<-c(m.1a$m$coef["t"],m.1b$m$coef["t"],-m.1a$m$coef["t"]/m.1b$m$coef["t"])
names(slope.figs)<-c("initial slope","final slope","ratio")
dwsf<-sum(df$Deaths[length(df$Deaths)+(-6:0)])
dwsfpred<-sum(m.1b$m$fitted.values[length(m.1b$m$fitted.values)+(-6:0)])
cat(paste0("\n","Total number in last week = ",dwsf,", and fitted number = ",format(dwsfpred,nsmall=1,digits=1),"\n\n"))
if (weekly.ratio<1) {
## weekly ratio less than one
weekly.predictions<-c(weekly.ratio,weekly.ratio/(1-weekly.ratio),dwsf*weekly.ratio/(1-weekly.ratio),dwsfpred*weekly.ratio/(1-weekly.ratio))
names(weekly.predictions)<-c("weekly ratio","rem. mult.","n pred","fit pred")
print(noquote(format(slope.figs,digits=2,nsmall=3)))
cat("\n")
# cat(paste0("\n"))
# print(weekly.predictions)
# cat("\n")
cit95<-suppressMessages(confint(m.1b$m,"t"))
wr.lo<-exp(7*cit95[1])
wr.hi<-exp(7*cit95[2])
weekly.pred.pi.lo<-weekly.pred.pi.hi<-weekly.predictions
weekly.pred.pi.lo<-c(wr.lo,wr.lo/(1-wr.lo),dwsf*wr.lo/(1-wr.lo),dwsfpred*wr.lo/(1-wr.lo))
weekly.pred.pi.hi<-c(wr.hi,wr.hi/(1-wr.hi),dwsf*wr.hi/(1-wr.hi),dwsfpred*wr.hi/(1-wr.hi))
wpia<-rbind(lo.95=weekly.pred.pi.lo,pred=weekly.predictions,hi.95=weekly.pred.pi.hi)
colnames(wpia)<-names(weekly.predictions)
print(noquote(format(wpia,digits=4,nsmall=1)))
cat("\n")
fulldf<-makedf(nrow(df)+ndaysextra)
epd<-expatdf(fulldf,pat,breakdir=1)
mp<-suppressWarnings(predict(m.1a$m,epd$df[c((nrow(df)+1):nrow(epd$df)),],se.fit=TRUE,type="response"))
preds.week<-sumbyweek(mp$fit)
cat(paste0("The predicted future 26 weekly totals are:\n"))
cat(format(preds.week,nsmall=1,digits=1))
cat(",\n")
cat(paste0("which sum to ", format(sum(preds.week),nsmall=1,digits=1),"."))
return(invisible(list(wpia=wpia,daily=mp,weekly=preds.week))) } else {
## weekly ratio greater than or equal to one
cat(paste0("The weekly ratio on the basis of the final time trend is ",format(weekly.ratio,digits=2,nsmall=3),"\n\n"))
cit95<-suppressMessages(confint(m.1b$m,"t"))
wr.lo<-exp(7*cit95[1])
wr.hi<-exp(7*cit95[2])
cat(paste0("with a 95% confidence interval of (",format(wr.lo,digits=2,nsmall=3),", ",format(wr.hi,digits=2,nsmall=3),"). \n\n"))
cat(paste0("The weekly ratio estimate is greater than or equal to one, and therefore we cannot sensibly ",
"estimate the future number of deaths in this first wave. If the weekly ratio is genuinely above one, ",
"then on the current simple model, the number would be infinite. More realistically, the numbers would ",
"grow until a lockdown occurs or the number of people who have had the disease, assuming they are and ",
"remain immune, will dampen the rate of spread below 1 again. It is, of course, possible that the current ",
"estimate is wrong and that the true rate is less than one. We can hope."),fill=TRUE)
}
}
## This function (i) puts a single pattern into "regular order", namely t.steps, t.kinks, dow.steps, dow.kinks, and eliminates any
## duplication within each category. And it merges an arbitrary number of separate patterns into a single pattern. It is useful
## for finding a pattern to fit the data, building on the existing output and needing to include only the extra items in each
## command line. In principle, it would help with automating the search.
onepat<-function(pat1,pat2,...){
outpat<-list()
if (missing(pat2)) {
for (ip in ipvec) outpat[[ip]]<-sort(unique(c(pat1[[ip]])))
return(outpat) } else {
for (ip in ipvec) outpat[[ip]]<-sort(unique(c(pat1[[ip]],pat2[[ip]])))
return(onepat(outpat,...))
}
}
pat0<-list(t.steps=c(4),t.kinks=c(10),dow.steps=c(20),dow.kinks=c(30))
patpos0<-list(t.steps=c(3:(nrow(df)-2)),t.kinks=c(3:(nrow(df)-3)),dow.steps=c(15:(nrow(df)-14)),dow.kinks=c(0:(nrow(df)-14)))
patav0<-list(t.steps=1,t.kinks=1,dow.steps=6,dow.kinks=6)
patwid0<-list(t.steps=1,t.kinks=1,dow.steps=1,dow.kinks=1)
Here is an example of the use of the functions, to find a reasonable fit to the time course. It may or not be the use that led to today’s best time-pattern.
at1<-add.term(nulpat)#
## NULL
## day mdd pvalue dev.drop df.drop res.dev res.df
## t.steps 24 19437.1207 0.0000000 19437.1207 1 6754.565 91
## t.kinks 32 24314.8860 0.0000000 24314.8860 1 1876.799 91
## dow.steps 33 176.7888 0.7256402 1060.7330 6 25130.952 86
## dow.kinks 86 55.2902 0.9805846 331.7412 6 25859.944 86
at1<-add.term(list(t.kinks=c(32)))#
## t.kinks
## 32
## day mdd pvalue dev.drop df.drop res.dev res.df
## t.steps 44 143.33381 0.007662420 143.3338 1 1733.466 90
## t.kinks 22 147.85721 0.006726465 147.8572 1 1728.942 90
## dow.steps 47 63.81988 0.002956384 382.9193 6 1493.880 85
## dow.kinks 0 39.18583 0.070511980 235.1150 6 1641.684 85
at1<-add.term(onepat(list(t.kinks=c(22,32))))#
## t.kinks1 t.kinks2
## 22 32
## day mdd pvalue dev.drop df.drop res.dev res.df
## t.steps 38 203.65707 0.0008668241 203.6571 1 1525.285 89
## t.kinks 44 212.65409 0.0006538554 212.6541 1 1516.288 89
## dow.steps 47 65.84026 0.0010712115 395.0416 6 1333.901 84
## dow.kinks 0 39.03679 0.0514512147 234.2208 6 1494.721 84
at1<-add.term(onepat(list(t.kinks=c(22,32,44))))#
## t.kinks1 t.kinks2 t.kinks3
## 22 32 44
## day mdd pvalue dev.drop df.drop res.dev res.df
## t.steps 40 58.03186 0.064617611 58.03186 1 1458.256 88
## t.kinks 25 10.00682 0.446553061 10.00682 1 1506.281 88
## dow.steps 47 54.31850 0.002203684 325.91102 6 1190.377 83
## dow.kinks 0 46.00342 0.009029873 276.02054 6 1240.268 83
at1<-add.term(onepat(at1$inpat,list(dow.steps=c(47))))#
## t.kinks1 t.kinks2 t.kinks3 dow.steps
## 22 32 44 47
## day mdd pvalue dev.drop df.drop res.dev res.df
## t.steps 83 41.623060 0.08853151 41.623060 1 1148.7540 82
## t.kinks 46 9.044604 0.43044451 9.044604 1 1181.3325 82
## dow.steps 79 35.224625 0.01715029 211.347751 6 979.0293 77
## dow.kinks 80 13.391700 0.47916418 80.350202 6 1110.0269 77
nrow(df)
## [1] 100
at1<-add.term(onepat(at1$inpat,list(dow.steps=c(79))))#
## t.kinks1 t.kinks2 t.kinks3 dow.steps1 dow.steps2
## 22 32 44 47 79
## day mdd pvalue dev.drop df.drop res.dev res.df
## t.steps 80 25.089595 1.614990e-01 25.089595 1 953.9397 76
## t.kinks 46 8.676953 4.123054e-01 8.676953 1 970.3524 76
## dow.steps 86 77.833958 1.678829e-08 467.003746 6 512.0256 71
## dow.kinks 72 61.480302 5.450476e-06 368.881812 6 610.1475 71
lc1<-lcheck(at1$inpat)#
## [1] "one best"
## t.kinks1 t.kinks2 t.kinks3 dow.steps1 dow.steps2
## 22 32 44 47 79
## t.kinks1 t.kinks2 t.kinks3 dow.steps1 dow.steps2
## 23 33 44 47 79
lc1<-lcheck(lc1$bestone)#
## [1] "one best"
## t.kinks1 t.kinks2 t.kinks3 dow.steps1 dow.steps2
## 23 33 44 47 79
## t.kinks1 t.kinks2 t.kinks3 dow.steps1 dow.steps2
## 24 33 44 47 79
lc1<-lcheck(lc1$bestone)#
## [1] "one best"
## t.kinks1 t.kinks2 t.kinks3 dow.steps1 dow.steps2
## 24 33 44 47 79
## t.kinks1 t.kinks2 t.kinks3 dow.steps1 dow.steps2
## 25 33 44 47 79
lc1<-lcheck(lc1$bestone)#
## [1] "one best"
## t.kinks1 t.kinks2 t.kinks3 dow.steps1 dow.steps2
## 25 33 44 47 79
## t.kinks1 t.kinks2 t.kinks3 dow.steps1 dow.steps2
## 25 33 44 47 79
## [1] "locally best"
pat.1<-lc1$bestone
at1<-add.term(lc1$bestone)
## t.kinks1 t.kinks2 t.kinks3 dow.steps1 dow.steps2
## 25 33 44 47 79
## day mdd pvalue dev.drop df.drop res.dev res.df
## t.steps 9 32.84100 1.059515e-01 32.84100 1 932.4060 76
## t.kinks 10 19.40193 2.156483e-01 19.40193 1 945.8451 76
## dow.steps 86 77.70824 1.138986e-08 466.24943 6 498.9976 71
## dow.kinks 72 61.25975 4.420484e-06 367.55852 6 597.6885 71
at1<-add.term(onepat(at1$inpat,list(dow.steps=c(86))))
## t.kinks1 t.kinks2 t.kinks3 dow.steps1 dow.steps2 dow.steps3
## 25 33 44 47 79 86
## day mdd pvalue dev.drop df.drop res.dev res.df
## t.steps 9 32.75598 0.029828023 32.75598 1 466.2416 70
## t.kinks 10 19.32677 0.097530465 19.32677 1 479.6708 70
## dow.steps 40 19.29622 0.007110432 115.77734 6 383.2202 65
## dow.kinks 65 10.17328 0.188940777 61.03970 6 437.9579 65
at1<-add.term(onepat(at1$inpat,list(dow.steps=c(40))))
## t.kinks1 t.kinks2 t.kinks3 dow.steps1 dow.steps2 dow.steps3 dow.steps4
## 25 33 44 40 47 79 86
## day mdd pvalue dev.drop df.drop res.dev res.df
## t.steps 9 33.65418 0.01569005 33.65418 1 349.5661 64
## t.kinks 10 19.35107 0.06968195 19.35107 1 363.8692 64
## dow.steps 31 10.97796 0.07424781 65.86775 6 317.3525 59
## dow.kinks 65 10.05066 0.10746646 60.30396 6 322.9163 59
lc1<-lcheck(at1$inpat)
## [1] "Beware -- residual df vary"
## [1] "one best"
## t.kinks1 t.kinks2 t.kinks3 dow.steps1 dow.steps2 dow.steps3 dow.steps4
## 25 33 44 40 47 79 86
## t.kinks1 t.kinks2 t.kinks3 dow.steps1 dow.steps2 dow.steps3 dow.steps4
## 25 33 45 40 47 79 85
lc1<-lcheck(lc1$bestone)
## [1] "Beware -- residual df vary"
## [1] "one best"
## t.kinks1 t.kinks2 t.kinks3 dow.steps1 dow.steps2 dow.steps3 dow.steps4
## 25 33 45 40 47 79 85
## t.kinks1 t.kinks2 t.kinks3 dow.steps1 dow.steps2 dow.steps3 dow.steps4
## 24 33 46 40 47 79 85
lc1<-lcheck(lc1$bestone)
## [1] "Beware -- residual df vary"
## [1] "one best"
## t.kinks1 t.kinks2 t.kinks3 dow.steps1 dow.steps2 dow.steps3 dow.steps4
## 24 33 46 40 47 79 85
## t.kinks1 t.kinks2 t.kinks3 dow.steps1 dow.steps2 dow.steps3 dow.steps4
## 23 33 46 40 47 79 85
lc1<-lcheck(lc1$bestone)
## [1] "Beware -- residual df vary"
## [1] "one best"
## t.kinks1 t.kinks2 t.kinks3 dow.steps1 dow.steps2 dow.steps3 dow.steps4
## 23 33 46 40 47 79 85
## t.kinks1 t.kinks2 t.kinks3 dow.steps1 dow.steps2 dow.steps3 dow.steps4
## 23 33 46 40 47 79 85
## [1] "locally best"
at1<-add.term(onepat(lc1$bestone))
## t.kinks1 t.kinks2 t.kinks3 dow.steps1 dow.steps2 dow.steps3 dow.steps4
## 23 33 46 40 47 79 85
## day mdd pvalue dev.drop df.drop res.dev res.df
## t.steps 32 23.335711 0.04171336 23.335711 1 345.8258 64
## t.kinks 9 9.780095 0.19162798 9.780095 1 359.3814 64
## dow.steps 31 12.313913 0.03432043 73.883481 6 295.2780 59
## dow.kinks 65 9.175103 0.13142552 55.050621 6 314.1109 59
pat.2<-at1$inpat
make.pred(pat.1)
##
## The model fits 100 days of data from 2020-03-06 to 2020-06-13 and has the following pattern:
## t.kinks1 t.kinks2 t.kinks3 dow.steps1 dow.steps2
## 25 33 44 47 79
## and leaves residual deviance 965.2 on 77 degrees of freedom.
##
## Total number in last week = 1197, and fitted number = 1279.2
##
## initial slope final slope ratio
## 0.232 -0.030 7.710
##
## weekly ratio rem. mult. n pred fit pred
## lo.95 0.7944 3.8627 4623.6648 4941.0271
## pred 0.8102 4.2675 5108.2418 5458.8648
## hi.95 0.8261 4.7489 5684.4556 6074.6293
##
## The predicted future 26 weekly totals are:
## 1036.3 839.6 680.2 551.1 446.5 361.7 293.0 237.4 192.3 155.8 126.2 102.3 82.9 67.1 54.4 44.1 35.7 28.9 23.4 19.0 15.4 12.5 10.1 8.2 6.6 5.4,
## which sum to 5436.0.
These are the predictions with three t.kinks, which make the epidemic increase turn down and then flatten off a bit, with two changes of day-of-the-week patterns.
make.pred(pat.2)
##
## The model fits 100 days of data from 2020-03-06 to 2020-06-13 and has the following pattern:
## t.kinks1 t.kinks2 t.kinks3 dow.steps1 dow.steps2 dow.steps3 dow.steps4
## 23 33 46 40 47 79 85
## and leaves residual deviance 369.2 on 65 degrees of freedom.
##
## Total number in last week = 1197, and fitted number = 1244.8
##
## initial slope final slope ratio
## 0.254 -0.032 8.046
##
## weekly ratio rem. mult. n pred fit pred
## lo.95 0.7895 3.7496 4488.3270 4667.7270
## pred 0.8016 4.0395 4835.3075 5028.5763
## hi.95 0.8137 4.3674 5227.8285 5436.7865
##
## The predicted future 26 weekly totals are:
## 997.8 799.8 641.1 513.9 411.9 330.2 264.7 212.1 170.1 136.3 109.3 87.6 70.2 56.3 45.1 36.2 29.0 23.2 18.6 14.9 12.0 9.6 7.7 6.2 4.9 4.0,
## which sum to 5012.6.
The second set allows two further changes in day-of-the-week pattern.