DOI: https://doi.org/10.21203/rs.3.rs49697/v1
COVID-19 has taken a frightening form. It spreads more widely with each passing day and has become a pandemic. The death toll, first counted in the hundreds, rose to the thousands and then to the hundreds of thousands. If this situation persists, people in every country will be endangered. Since January 2020, many scientists, researchers and doctors have been working on this complex problem so that governments can make proper arrangements in hospitals and the death rate can be reduced. This article presents mortality estimates obtained with an ARIMA model and a regression model. The data were taken from the DataHub Novel Coronavirus 2019 dataset and cover 22^{nd} January to 29^{th} June 2020. To characterize current mortality, the correlation coefficients of the attributes were computed, and the forecasts were assessed with accuracy measures (MAE, MSE, RMSE and MAPE); the mean absolute percentage error indicated a model accuracy of 99.09%. The ARIMA analysis produces the auto_arima SARIMAX results, the auto_arima residual plots, the ARIMA model results and the corresponding prediction plots on the training data set. These results indicate a continuous decline in death cases. For the regression model, the fitted coefficients are reported, and the actual and predicted death cases are compared and analyzed. The predicted mortality is found to decrease after May 2, 2020. These findings can help governments and doctors prepare their next plans. Methods validated on short-period predictions can then be used to forecast over longer periods.
According to the World Health Organization (WHO), COVID-19 is a communicable disease that spreads from one person to another. Personal contact and small droplets in the breath of an infected person can pass the virus to others and cause severe acute respiratory syndrome [1]. In January 2020, WHO first informed the world about a pneumonia of unknown cause, and it became known that the disease was spreading from person to person [2]. The outbreak began in the city of Wuhan, China [3]. According to WHO situation reports, 507,435 people worldwide lost their lives between January 15, 2020 and July 1, 2020, and this number is continuously increasing [4]. Coronaviruses are a group of RNA viruses [5]. In humans they cause respiratory tract infections, and their deadly varieties include SARS, MERS and COVID-19. There are as yet no vaccines or antiviral medications that protect people from coronavirus infection [29]. Common symptoms include high fever, tiredness, cough, shortness of breath, and loss of taste and smell; complications include pneumonia and acute respiratory infection [6]. The first confirmed deaths from Asia to Europe were as follows: the first confirmed death in Wuhan, China on 9^{th} January 2020; the first confirmed death outside China, in the Philippines, on February 1, 2020; and the first confirmed death in Europe, in France, on February 14, 2020 [7]. As of 16 June 2020 the death rate was 5.4%, with 437,283 deaths among 8,051,732 cases. This figure varies over time and from region to region [8].
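The death rate quoted above is a simple ratio of deaths to confirmed cases. As a minimal sketch of that arithmetic (the function name `case_fatality_ratio` is ours and purely illustrative), it can be checked as follows:

```python
def case_fatality_ratio(deaths: int, cases: int) -> float:
    """Return deaths as a percentage of confirmed cases."""
    if cases <= 0:
        raise ValueError("cases must be positive")
    return 100.0 * deaths / cases


# Figures quoted in the text for 16 June 2020.
cfr = case_fatality_ratio(437_283, 8_051_732)
print(f"{cfr:.2f}%")  # 5.43%
```

The result, about 5.43%, agrees with the rounded 5.4% given in the text.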
The aim of our research is to predict future death cases using a machine learning regression model and a time series ARIMA model. Both models are used to predict future values. No effective and extensively tested vaccine against COVID-19 has yet been developed, so the key part of the response to this pandemic is to lower the peak of the outbreak, that is, to flatten the pandemic curve. The task of data scientists and data mining analysts is to integrate the relevant data and to apply new techniques for understanding the infection and its characteristics, which helps in making the right decisions and taking concrete action. This will prompt stronger measures to establish infrastructure, medicines and vaccines, and to control comparable outbreaks with better prospects. The aims of the present investigation are as follows.
Apart from this introduction, the remainder of the paper is structured as follows:
Section II: Background, Section III: Methodology, Section IV: Dataset description and analysis, Section V: Results and discussion, Section VI: Conclusion.
Although only about six months have passed since the COVID-19 pandemic began to spread, many researchers have already done a great deal of work on it, and work continues. Some of the published studies are described below.
Benvenuto D., Giovanetti M., Vassallo L., Angeletti S., Ciccozzi M. [9] carried out an ARIMA model forecast on COVID-2019 epidemiological data gathered from Johns Hopkins to predict the prevalence and incidence of the disease. For further comparison, or from a future perspective, case definitions and data collection must be maintained in real time.
Zeynep Ceylan [10] collected comprehensive COVID-19 data from the WHO website for Feb. 21 to April 15, 2020. Several ARIMA models with different parameters were fitted. ARIMA (0,2,1) was selected for Italy with the lowest MAPE (4.7520), and ARIMA (1,2,0) and ARIMA (0,2,1) were selected for Spain and France with the lowest MAPE values of 5.58486 and 5.6335, respectively. The study shows that ARIMA models are suitable for understanding the effect of COVID-19. Its results can shed light on the trends of the outbreak and give an idea of the epidemiological stage of these regions.
MHDM R., Silva R.G., Mariani V.C., Coelho L.S. [11] assessed several models for time series analysis: ARIMA, CUBIST, RF, RIDGE, SVR and a stacking-ensemble method. The developed models can produce accurate forecasts, achieving errors in the ranges 0.87%–3.51%, 1.02%–5.63% and 0.95%–6.90% for one, three and six days ahead, respectively. The ranking of the models from best to worst in terms of accuracy, in all scenarios, is SVR, stacking-ensemble learning, ARIMA, CUBIST, RIDGE and RF.
Pandey, G.; Chaudhary, P.; Gupta, R.; Pal, S. [12] analyzed the outbreak in India up to March 30, 2020 and estimated the number of cases for the following 14 days. Using data collected from the Johns Hopkins University repository for January 30, 2020 to March 30, 2020, they applied the SEIR model and a regression model. Model performance was evaluated with RMSLE, which was 1.52 for the SEIR model and 1.75 for the regression model. The RMSLE error rate between the SEIR model and the regression model was 2.01. In addition, the value of R_{0}, which measures the spread of the disease, was estimated at 2.02. The number of cases was expected to rise to between 5000 and 6000 in the following 14 days.
Chakraborty T., Ghosh I. [13] used data up to April 4, 2020, by which date the pandemic had caused more than 1,116,643 confirmed infections and more than 59,170 reported deaths worldwide. The focus of their paper is twofold: (a) generating short-term (real-time) forecasts of future COVID-19 cases for several countries; (b) risk assessment of the novel COVID-19 for some profoundly affected countries. For the first problem, they presented a hybrid approach based on an ARIMA model and a wavelet-based forecasting model that can generate short-term (ten days ahead) forecasts of the number of daily confirmed cases for Canada, France, India, South Korea and the UK. For the second, they applied an optimal regression tree algorithm to find the essential causal variables that significantly affect the case fatality rates of different countries.
Chintalapudi N, Battineni G, Amenta F. [14] used COVID-19 patient data on registered and recovered cases, published by the Italian Ministry of Health, for mid-February to the end of March. They carried out seasonal ARIMA forecasting using the R statistical package. The model accuracy was 93.75% for registered cases and 84.4% for recovered cases. They estimated that by the end of May the number of registered cases might reach 182,757 and the number of recovered cases 81,635. Their findings indicate that a reduction of approximately 35% in registered cases and an improvement of approximately 66% in recovered cases are possible.
Vardavas CI, Nikitara K. [15] reported that as of March 18, 2020 a total of 194,909 COVID-19 cases had been recorded, including 7,876 deaths, a large part of which were in China (3,242) and Italy (2,505). In a multivariate logistic regression analysis, a history of smoking was a risk factor for disease progression (OR = 14.28; 95% CI: 1.58–25.00; p = 0.018). From the published data they calculated that smokers were 1.4 times more likely (RR = 1.4, 95% CI: 0.98–2.00) to have severe symptoms of COVID-19 and approximately 2.4 times more likely to be admitted to an ICU, need mechanical ventilation or die, compared with non-smokers (RR = 2.4, 95% CI: 1.43–4.04).
Yan CH, Faraji M, Prajapati DP, Boone CE. [16] used logistic regression to identify symptoms associated with COVID-19 positivity. Between March 3, 2020 and March 29, 2020, a total of 1,480 patients with influenza-like symptoms were tested for COVID-19. Their study captured 59 of 102 (58%) COVID-19-positive patients and 203 of 1,378 (15%) COVID-19-negative patients. Loss of smell and taste was reported by 68% (40/59) and 71% (42/59) of COVID-19-positive subjects, respectively, compared with 16% (33/203) and 17% (35/203) of COVID-19-negative patients (p < 0.001). In addition, smell and taste impairment were independently and strongly associated with COVID-19 positivity (anosmia: adjusted odds ratio [aOR] 10.9; 95% CI, 5.08–23.5; ageusia: aOR 10.2; 95% CI, 4.74–22.1), whereas sore throat was associated with COVID-19 negativity (aOR 0.23; 95% CI, 0.11–0.50). Of the patients who reported loss of smell associated with COVID-19, 74% (28/38) reported that the anosmia resolved with clinical resolution of the illness.
The data for this study were collected from the DataHub Novel Coronavirus 2019 dataset. The data set covers COVID-19 patients from January 22, 2020 to June 29, 2020 and has attributes for global confirmed cases, recovered cases, death cases and the COVID-19 increase rate. Two methods are used here to analyze the pandemic outbreak: ARIMA and regression models, both of which predict future values. We also analyzed the correlation between deaths and all the other attributes.
ARIMA Model
Because administrators need to plan carefully around the course of the disease over time, this exploratory paper examines the autoregressive integrated moving average (ARIMA) model. The ARIMA model has also been used as an efficient tool for planning emergency department resources [17, 18]. Another application of the ARIMA model is to predict and study the impact of COVID-19 [19–21].
ARIMA modeling is a specific kind of time series forecasting method. ARIMA, or “Auto Regressive Integrated Moving Average”, is a class of models that explains a given time series based on its own past values, that is, its own lags and the lagged forecast errors, so that the resulting equation can be used to forecast future values. A non-seasonal ARIMA model is defined as follows.
An ARIMA model is characterized by three terms: p, d and q,
where
p – order of the autoregressive (AR) term
q – order of the moving average (MA) term
d – number of differencing operations required to make the time series stationary.
The value of d is the minimum number of differencing operations needed to make the series stationary. If the time series is already stationary, then d = 0.
“p” is the order of the AR term. It refers to the number of lags of Y to be used as predictors. “q” is the order of the MA term. It refers to the number of lagged forecast errors that should go into the ARIMA model.
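To illustrate what the differencing order d means in practice, the following minimal sketch applies first-order differencing repeatedly to a short series. The function name `difference` and the toy data are assumptions made for illustration; they are not taken from the study.

```python
def difference(series, d=1):
    """Apply first-order differencing d times and return the result as a list."""
    values = list(series)
    for _ in range(d):
        # Each pass replaces the series by the changes between neighbours.
        values = [curr - prev for prev, curr in zip(values, values[1:])]
    return values


# A toy cumulative series with quadratic growth (illustrative data only).
cumulative = [1, 4, 9, 16, 25, 36]
print(difference(cumulative, 1))  # [3, 5, 7, 9, 11]
print(difference(cumulative, 2))  # [2, 2, 2, 2]
```

In this toy case the series still trends after one pass but is constant after two, which corresponds to d = 2. Each pass shortens the series by one observation.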
A pure AR model is one where Y_{t} depends only on its own lags. That is, Y_{t} is a function of the 'lags of Y_{t}':
Y_{t} = α + β_{1}Y_{t-1} + β_{2}Y_{t-2} + … + β_{p}Y_{t-p} + ϵ_{t}
where Y_{t-1} is the lag 1 of the series, β_{1} is the coefficient of lag 1 that the model estimates, and α is the intercept term, also estimated by the model.
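As a rough illustration of how the intercept α and the lag-1 coefficient β_{1} of a pure AR model can be estimated, the sketch below fits an AR(1) model by ordinary least squares on pairs (Y_{t-1}, Y_{t}). This is a simplification chosen for clarity, not the estimation procedure used by the ARIMA software in this study; the function name `fit_ar1` and the synthetic series are assumptions.

```python
def fit_ar1(series):
    """Estimate (alpha, beta1) in Y_t = alpha + beta1 * Y_{t-1} by least squares."""
    x = series[:-1]  # lagged values Y_{t-1}
    y = series[1:]   # current values Y_t
    n = len(x)
    if n < 2:
        raise ValueError("need at least three observations")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    if sxx == 0:
        raise ValueError("lagged values are constant; beta1 is not identifiable")
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    beta1 = sxy / sxx
    alpha = mean_y - beta1 * mean_x
    return alpha, beta1


# Noise-free synthetic series generated with alpha = 2 and beta1 = 0.5.
series = [10.0]
for _ in range(20):
    series.append(2.0 + 0.5 * series[-1])

alpha, beta1 = fit_ar1(series)
print(round(alpha, 6), round(beta1, 6))
```

Because the synthetic series contains no noise, the fit recovers the generating values; with real data the estimates would carry sampling error.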
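Likewise, a pure MA model is one where Y_{t} depends only on the lagged forecast errors:
Y_{t} = α + ϵ_{t} + φ_{1}ϵ_{t-1} + φ_{2}ϵ_{t-2} + … + φ_{q}ϵ_{t-q}
where φ_{1}, …, φ_{q} are the MA coefficients and the error terms are the errors of the autoregressive models of the respective lags. The errors ϵ_{t} and ϵ_{t-1} are the errors from the following equations:
Y_{t} = α + β_{1}Y_{t-1} + β_{2}Y_{t-2} + … + ϵ_{t}
Y_{t-1} = α + β_{1}Y_{t-2} + β_{2}Y_{t-3} + … + ϵ_{t-1}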
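An ARIMA model is one where the time series has been differenced at least once to make it stationary and the AR and MA terms are combined. With AR coefficients β and MA coefficients φ, the equation for the differenced series Y_{t} becomes
Y_{t} = α + β_{1}Y_{t-1} + β_{2}Y_{t-2} + … + β_{p}Y_{t-p} + ϵ_{t} + φ_{1}ϵ_{t-1} + φ_{2}ϵ_{t-2} + … + φ_{q}ϵ_{t-q}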
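To make the combination of differencing, AR terms and MA terms concrete, the sketch below produces a one-step-ahead forecast from an ARIMA(1, d, 1) model with given coefficients. It is a deliberately reduced illustration: the model selected later in this paper has two MA terms, the coefficient values used here are arbitrary, and the names `forecast_next`, `alpha`, `beta1` and `phi1` are assumptions rather than output of any library.

```python
def difference(values):
    """Return the first differences of a list."""
    return [b - a for a, b in zip(values, values[1:])]


def forecast_next(series, d, alpha, beta1, phi1):
    """One-step-ahead forecast from an ARIMA(1, d, 1) with known coefficients."""
    # Keep every differencing level so the forecast can be integrated back.
    levels = [list(series)]
    for _ in range(d):
        levels.append(difference(levels[-1]))
    w = levels[-1]  # the differenced (assumed stationary) series
    if not w:
        raise ValueError("series is too short for the requested d")

    # Rebuild the lagged forecast error recursively, starting from zero.
    prev_error = 0.0
    for t in range(1, len(w)):
        predicted = alpha + beta1 * w[t - 1] + phi1 * prev_error
        prev_error = w[t] - predicted

    # AR part plus MA part gives the next value of the differenced series.
    next_value = alpha + beta1 * w[-1] + phi1 * prev_error

    # Undo the differencing: the next value at each level is its last
    # value plus the forecast change from the level below.
    for level in reversed(levels[:-1]):
        next_value = level[-1] + next_value
    return next_value


# Toy quadratic series: second differences are constant at 2.
print(forecast_next([1, 4, 9, 16, 25], d=2, alpha=2.0, beta1=0.0, phi1=0.0))  # 36.0
```

In practice the coefficients are estimated from the data and the initial errors are handled more carefully than by setting them to zero.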
The specification of such models differs in some important respects from that of explanatory models, and fitting them is generally taken to require at least 50 observations. To limit the burden on administrators, large amounts of data are not needed before the month of an intervention; once fitted, the model can then be used to assess subsequent patterns across the range of parameter settings [22].
Regression Model
Linear regression is a predictive statistical method for modeling the relationship between a dependent variable and a given set of independent variables. It is a linear approach to modeling the relationship between a dependent variable and one or more independent variables. When there is just one independent variable, it is called simple linear regression; with more than one independent variable, the procedure is called multiple linear regression. This study used simple linear regression and multiple regression to predict COVID-19 cases [23].
The linear regression representation is a linear equation that combines a specific set of input values (x), the solution to which is the predicted output (y) for that set of input values. The linear equation assigns a scale factor, called a coefficient, to each input value or column; it is represented by the Greek letter beta, "β". One additional coefficient is also added, which gives the line an additional degree of freedom and is often called the intercept or bias coefficient.
In a simple regression problem, the form of the model would be:
y = β_{0} + β_{1}x
where
β_{0} – intercept
β_{1} – coefficient
x – independent variable
y – dependent variable
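The intercept β_{0} and coefficient β_{1} of the simple model can be obtained with the closed-form least-squares formulas. The following is a minimal sketch using toy data; the function name `fit_line` is an assumption for illustration and the numbers are not from the COVID-19 data set.

```python
def fit_line(x, y):
    """Return (beta0, beta1) minimising the squared error of y = beta0 + beta1 * x."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("x and y must have the same length, at least 2")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    if sxx == 0:
        raise ValueError("x is constant; the slope is not identifiable")
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    beta1 = sxy / sxx                 # slope
    beta0 = mean_y - beta1 * mean_x   # intercept
    return beta0, beta1


# Toy data lying exactly on y = 1 + 2x.
beta0, beta1 = fit_line([0, 1, 2, 3, 4], [1, 3, 5, 7, 9])
print(beta0, beta1)  # 1.0 2.0
```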
In higher dimensions, when there is more than one input (x), the line becomes a plane or hyperplane. The representation therefore consists of the form of the equation and the specific values used for the coefficients (for example, β_{0} and β_{1}).
The general equation for a multiple linear regression with n independent variables is:
y = β_{0} + β_{1}x_{1} + β_{2}x_{2} + … + β_{n}x_{n} + ϵ
where
β_{0}, β_{1}, β_{2}, …, β_{n} – coefficients
x_{1}, x_{2}, …, x_{n} – independent variables
y – dependent variable
ϵ – random error ("noise")
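One common way to estimate the coefficients of a multiple linear regression is to solve the normal equations (XᵀX)β = Xᵀy. The sketch below does this in plain Python with Gaussian elimination so that it stays self-contained; it is an illustration under that assumption, not the routine used in this study, and the function names and toy data are hypothetical.

```python
def solve_linear_system(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    # Augmented matrix [a | b], copied so the inputs stay untouched.
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("singular system: predictors may be collinear")
        m[col], m[pivot] = m[pivot], m[col]
        for row in range(col + 1, n):
            factor = m[row][col] / m[col][col]
            for k in range(col, n + 1):
                m[row][k] -= factor * m[col][k]
    # Back substitution.
    x = [0.0] * n
    for row in range(n - 1, -1, -1):
        tail = sum(m[row][k] * x[k] for k in range(row + 1, n))
        x[row] = (m[row][n] - tail) / m[row][row]
    return x


def fit_multiple_regression(rows, y):
    """Return [beta0, beta1, ..., betan] from the normal equations."""
    design = [[1.0] + list(r) for r in rows]  # leading 1 models the intercept
    p = len(design[0])
    xtx = [[sum(r[i] * r[j] for r in design) for j in range(p)] for i in range(p)]
    xty = [sum(r[i] * yi for r, yi in zip(design, y)) for i in range(p)]
    return solve_linear_system(xtx, xty)


# Toy data generated exactly from y = 1 + 2*x1 + 3*x2.
rows = [(1, 2), (2, 1), (3, 4), (4, 3), (5, 5)]
y = [1 + 2 * x1 + 3 * x2 for x1, x2 in rows]
betas = fit_multiple_regression(rows, y)
print([round(b, 6) for b in betas])
```

For strongly correlated predictors the normal equations can be numerically unstable, so library implementations usually rely on more robust decompositions.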
Dataset description and analysis
The COVID-19 data set is taken from the DataHub Novel Coronavirus data set and covers January 22, 2020 to June 29, 2020. It contains 160 instances and five attributes: date, confirmed cases, recovered cases, deaths and increase rate. As the data set shows, the death toll increased over time until June 29. This is confirmed by Figure 2.
The earliest COVID-19 patients in the data set were recorded on January 22, 2020, and we used the records from that date to June 29, 2020. The following analyses were carried out on the data set to explore it and extract useful information.
Correlation coefficients
The correlation coefficient is a statistical measure of the strength of the relationship between the relative movements of two variables. Its values range from -1 to +1. A value greater than +1 or less than -1 indicates an error in the correlation measurement. A correlation of -1 is a perfect negative correlation, a correlation of +1 is a perfect positive correlation, and a value of 0.0 indicates no linear relationship between the two variables [24].
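For reference, the Pearson correlation coefficient described above can be computed as in the following minimal sketch. The function name `pearson` and the toy inputs are assumptions for illustration only.

```python
from math import sqrt


def pearson(x, y):
    """Return the Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("x and y must have the same length, at least 2")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    var_x = sum((xi - mean_x) ** 2 for xi in x)
    var_y = sum((yi - mean_y) ** 2 for yi in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / sqrt(var_x * var_y)


print(pearson([1, 2, 3, 4], [2, 4, 6, 8]))  # perfectly increasing: 1.0
print(pearson([1, 2, 3, 4], [8, 6, 4, 2]))  # perfectly decreasing: -1.0
```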
Correlation statistics can be used to describe the relationships between the different attributes of the disease. Correlation coefficients were calculated between confirmed cases, recovered cases, deaths and the increase rate under the current pandemic situation, as shown in Table 1 and Figure 3. We found that the correlation between COVID-19 confirmed cases and recovered cases is highly positive.
Table 1: Correlation coefficients of attributes
| | Confirmed | Recovered | Deaths | Increase rate |
|---|---|---|---|---|
| Confirmed | 1.000000 | 0.986051 | 0.988177 | 0.378478 |
| Recovered | 0.986051 | 1.000000 | 0.950569 | 0.337027 |
| Deaths | 0.988177 | 0.950569 | 1.000000 | 0.401742 |
| Increase rate | 0.378478 | 0.337027 | 0.401742 | 1.000000 |
ARIMA Model Results
In the ARIMA model we must choose the parameters p, d and q [28]. Rather than reading them off plots, we use auto_arima to find suitable parameters. The auto_arima function works by conducting differencing tests (Kwiatkowski–Phillips–Schmidt–Shin, Augmented Dickey–Fuller or Phillips–Perron) to determine the order of differencing, d, and then fitting models within the defined start_p, max_p, start_q and max_q ranges [25]. If the seasonal option is enabled, auto_arima also seeks to identify the optimal P and Q hyperparameters after conducting the Canova–Hansen test to determine the optimal order of seasonal differencing, D. Figure 4 shows the parameters obtained by auto_arima.
The residual plots from the auto_arima model are shown in Figure 5.
The output of the auto_arima model is interpreted as follows:
Standardized residual: The residual errors fluctuate around a mean of zero and have a uniform variance.
Histogram and density plot: The density plot suggests a normal distribution with a mean of zero.
Q-Q plot: All the blue dots (the ordered distribution of residuals) fall on the red line; any significant deviation would imply a skewed distribution. The residuals therefore appear to follow the standard normal distribution N(0, 1).
Correlogram: The correlogram, or ACF plot, shows that the residual errors are not autocorrelated. Any autocorrelation would imply that there is some pattern in the residual errors that is not explained by the model.
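The correlogram check described above rests on the sample autocorrelation of the residuals at each lag. A minimal sketch of that quantity is given below; the function name `autocorrelation` and the toy residuals are assumptions for illustration and are not the residuals of the fitted model.

```python
def autocorrelation(residuals, lag):
    """Return the sample autocorrelation of a sequence at the given lag."""
    n = len(residuals)
    if not 0 <= lag < n:
        raise ValueError("lag must satisfy 0 <= lag < len(residuals)")
    mean = sum(residuals) / n
    centered = [r - mean for r in residuals]
    denom = sum(c * c for c in centered)
    if denom == 0:
        raise ValueError("autocorrelation is undefined for a constant sequence")
    num = sum(centered[t] * centered[t - lag] for t in range(lag, n))
    return num / denom


# Toy residuals that flip sign at every step: strong negative lag-1 correlation.
toy_residuals = [1, -1, 1, -1, 1, -1, 1, -1]
print(autocorrelation(toy_residuals, 1))  # -0.875
```

Values close to zero at every lag above 0 are what one would hope to see for well-behaved residuals; the toy series here is chosen to show the opposite case.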
The optimal values of p, d and q obtained by auto_arima are 1, 2 and 2, respectively. An ARIMA model was then built with these best parameters (1, 2, 2); the results are shown in Figure 6.
Figure 6 shows the summary of the ARIMA model. Here we focus on the coefficient table. The coef column shows the weight of each term and how each term affects the time series. The P>|z| column indicates the significance of each weight. Here the p-value of each weight is less than or close to 0.05, so it is reasonable to retain all of them in our model.
These observations suggest that our model fits well and can help us understand the time series and forecast future values. Although the fit is satisfactory, some parameters of the ARIMA model could occasionally be changed to improve it. Having obtained a model for the time series, we can now use it to produce forecasts [26]. We first compare the predicted values with the actual values of the time series, which helps us understand the accuracy of the forecasts. The forecasts and associated confidence intervals can then be used to further understand the time series and anticipate what lies ahead. Our forecasts suggest that the time series can be expected to maintain a consistent rate of growth.
As we forecast further into the future, it is natural to become less confident in the predicted values. This is reflected in the confidence intervals generated by our model, which grow wider the further ahead we forecast. We predict death cases on the test data set with a 95% confidence level. Figure 7 shows the prediction results.
In Figure 7, the actual deaths in the training data set are shown by the blue line and the predicted deaths by the red line. The predicted deaths on the red line decline, which suggests that deaths will keep falling in the future as more and more people recover quickly and people maintain social distancing during the pandemic.
Using summary statistics, we computed metrics that condense the residuals into single values related to the model's predictive ability.
To judge the prediction results, we apply commonly used accuracy measures; the results are shown in Table 2.
Table 2: Measures of accuracy for the ARIMA model
| Measure of accuracy | Value |
|---|---|
| Mean Absolute Error (MAE) | 0.12044588473307338 |
| Mean Squared Error (MSE) | 0.023012953284359018 |
| Root Mean Squared Error (RMSE) | 0.15170020858376898 |
| Mean Absolute Percentage Error (MAPE) | 0.009196691386663233 |
The MAE of our model is 0.1204, which is quite small given that the death-case values in our data start at 0.01.
The MSE, 0.0230, is smaller than the MAE, by roughly an order of magnitude. This is expected here because the individual errors are below 1, so squaring them makes them smaller.
The RMSE, 0.1517, is analogous to a standard deviation and measures how widely the residuals are spread.
A MAPE of around 0.91% implies that the model is about 99.09% accurate in predicting the test set observations.
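The four accuracy measures discussed above follow directly from the residuals. The sketch below shows one way to compute them; the function name `accuracy_measures` and the toy numbers are assumptions for illustration, and MAPE is returned as a fraction (multiply by 100 for a percentage), matching the scale of the value reported in Table 2.

```python
from math import sqrt


def accuracy_measures(actual, predicted):
    """Return MAE, MSE, RMSE and MAPE (as a fraction) for two sequences."""
    if len(actual) != len(predicted) or not actual:
        raise ValueError("inputs must be non-empty and of equal length")
    errors = [a - p for a, p in zip(actual, predicted)]
    n = len(errors)
    mae = sum(abs(e) for e in errors) / n
    mse = sum(e * e for e in errors) / n
    rmse = sqrt(mse)  # RMSE is the square root of MSE
    # MAPE divides by the actual values, so they must all be non-zero.
    mape = sum(abs(e / a) for e, a in zip(errors, actual)) / n
    return {"MAE": mae, "MSE": mse, "RMSE": rmse, "MAPE": mape}


# Toy values, not taken from the study.
measures = accuracy_measures([100.0, 200.0, 400.0], [110.0, 190.0, 400.0])
for name, value in measures.items():
    print(f"{name}: {value:.4f}")
```

The same relationship holds for the reported figures: the square root of the MSE in Table 2 (0.0230) is the RMSE (0.1517).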
Regression Model Results
To find out which factors influence the forecast output most and how the factors relate to one another, we consider the input features "confirmed cases", "recovered cases" and "increase rate". Based on these features, we predict the deaths of COVID-19 patients. The data set was split 80%:20% into training and test sets, respectively.
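An 80%:20% split of the 160 instances can be sketched as follows. The text does not state how the split was drawn, so the shuffling, the seed value and the function name `train_test_split` in this sketch are assumptions made for illustration.

```python
import random


def train_test_split(rows, test_fraction=0.2, seed=0):
    """Shuffle row indices reproducibly and split them into train and test rows."""
    indices = list(range(len(rows)))
    random.Random(seed).shuffle(indices)  # seed is an arbitrary, assumed value
    n_test = round(len(rows) * test_fraction)
    test_rows = [rows[i] for i in indices[:n_test]]
    train_rows = [rows[i] for i in indices[n_test:]]
    return train_rows, test_rows


# Stand-in for the 160 instances of the data set (instance numbers only).
instances = list(range(160))
train, test = train_test_split(instances)
print(len(train), len(test))  # 128 32
```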
In multiple linear regression, the model selects the best coefficients for all attributes [27]. The coefficients of the regression model are shown in Table 3.
Table 3: Coefficients of the regression model
| Attribute | Coefficient |
|---|---|
| Confirmed | 0.103305 |
| Recovered | -0.100568 |
| Increase rate | 69.616876 |
Table 3 shows that, with the other attributes held fixed, an increase of 1 unit in "recovered cases" corresponds to a decrease of 0.1005 units in "death cases", and vice versa. Similarly, an increase of 1 unit in "confirmed cases" or in "increase rate" corresponds to an increase in "death cases" of 0.1033 units or 69.6168 units, respectively.
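The interpretation of the coefficients can be illustrated with a short calculation of the predicted change in deaths for given changes in the inputs. The sketch uses the values of Table 3 and assumes, following the text, that the coefficient of recovered cases is negative; the intercept is not reported, so only changes (not absolute predictions) are computed. The names `COEFFICIENTS` and `change_in_deaths` are illustrative.

```python
# Coefficients from Table 3; the negative sign for "recovered" follows the
# interpretation given in the text and is an assumption of this sketch.
COEFFICIENTS = {
    "confirmed": 0.103305,
    "recovered": -0.100568,
    "increase_rate": 69.616876,
}


def change_in_deaths(d_confirmed=0.0, d_recovered=0.0, d_increase_rate=0.0):
    """Predicted change in deaths for given changes in the three inputs."""
    return (
        COEFFICIENTS["confirmed"] * d_confirmed
        + COEFFICIENTS["recovered"] * d_recovered
        + COEFFICIENTS["increase_rate"] * d_increase_rate
    )


print(round(change_in_deaths(d_confirmed=1000), 3))  # 103.305
print(round(change_in_deaths(d_recovered=1000), 3))  # -100.568
```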
We now predict on the test data and compare the actual and predicted values in Table 4.
Table 4: Difference between the actual value and predicted value
| Instance number | Actual value | Predicted value |
|---|---|---|
| 110 | 286697 | 221975.301362 |
| 112 | 297539 | 286646.565236 |
| 143 | 430047 | 423127.482077 |
| 7 | 133 | 6528.684075 |
| 44 | 3459 | 2713.950271 |
| 101 | 244129 | 236968.993751 |
| 122 | 342565 | 329894.990367 |
| 66 | 31990 | 47224.597929 |
| 85 | 148157 | 160515.287829 |
| 86 | 157022 | 167041.159151 |
| 133 | 386298 | 376198.729391 |
| 92 | 193926 | 198189.689192 |
| 26 | 1868 | 1385.556916 |
| 146 | 443685 | 438945.896459 |
| 119 | 328483 | 318945.015040 |
| 62 | 19026 | 25233.066196 |
| 51 | 5411 | 808.770349 |
| 97 | 221109 | 221511.564448 |
| 128 | 365380 | 355638.073651 |
| 90 | 180475 | 187102.115303 |
The actual and predicted values are plotted and compared in Figure 8.
As Table 4 and Figure 8 show for the multiple regression model, the predicted number of deaths is initially higher than the actual number, but further along the data, from May 2^{nd} 2020 onward, the predicted number of deaths falls below the actual number.
Overall, this study suggests a reduction in deaths worldwide, which is a good sign for human society.
In this study, two AI models, ARIMA and regression, were used to analyze and predict changes in the spread of COVID-19 infection. We analyzed the data and predicted that the number of deaths will fall relative to the overall situation. The decline shown in the ARIMA model graph (Figure 7) indicates that future mortality will decrease (based on the current situation). The mean absolute percentage error (MAPE of about 0.91%, corresponding to an accuracy of 99.09%) indicates the accuracy of the model. The regression model initially predicted more deaths than actually occurred but, from 2^{nd} May 2020 onward, predicted fewer deaths than the actual number (Table 4 and Figure 8).
Based on the above results and discussion of the ARIMA and regression models, we conclude that deaths worldwide can be reduced and are likely to fall. Over time, new ways of dealing with this pandemic will emerge. Many researchers, scientists, doctors, nurses, medical support staff and government agencies are all playing their part. However, we ourselves have a responsibility to follow the guidelines these agencies provide. If we do not maintain social distancing, if we gather in public places, and if we do not keep our neighborhoods clean, how can we overcome the COVID-19 pandemic?
Competing interests:
The authors declare no competing interests.