Instructions
1. An individual written report on your paper or video/webinar (this should be about 2 pages
of written text; it can be longer depending on your paper and whether or not you do a
demonstration of the technique):
Section #1: will briefly summarize and describe the paper’s objective.
Section #2: will briefly describe the SAS procedures/techniques (or SAS code) used, provide a
description of any examples/applications and perhaps illustrate your own demo of the
procedures. If you feel the method is beyond your current skill level you can just describe what
they did. Some use of the SAS documentation may be required if the syntax is not well
explained in the paper.
Section #3: will discuss the questions you were asked and your answers to those questions.
Section #4: will briefly describe why the paper is of interest to you and provide commentary on
the paper. The commentary should include some ways the paper could be improved, and some
sentences on whether you think this paper and/or technique will be useful to you or a general
SAS user.
The grade will be based on:
-Was the paper well written and grammatically correct?
-Were all sections present?
-Were written explanations clear?
-Was a demonstration done if appropriate?
Section 3:
• Can you brief me about the industry profiling analysis in this paper?
Page number 18 in the paper
• What approach did you follow to remove negative values while forecasting the number of cases
per hour?
Page number 11 in the paper
UNCLASSIFIED
1
Paper 1047-2021
SAS® Time Series Analysis & Forecasting
(TSAF) at the Canada Revenue Agency
(CRA), with COVID impacts
Jason A. Oliver, Senior Compliance Analyst, Canada Revenue Agency (CRA)
ABSTRACT
It may well be a recurring theme of this year’s SAS Global Forum that we are faced with
more pressure to use flexible thinking – not just critical thinking – and when it comes to
time series analysis and forecasting (TSAF) in SAS, it’s all about “rethinking the curve”.
At the Canada Revenue Agency (CRA) Compliance Programs Branch (CPB), we have
grappled with reliable forecasting for macro-level tax variables on a month-to-month
basis, even before the COVID-19 pandemic hit. But now we face a particularly difficult
challenge. As with many large organizations, it is not easy to foretell what the fallout
may be from such a cataclysm.
In setting up SAS to right the trajectory, we must be extra cautious about some of the
fallacies in applying TSAF in this context: the lagged effect for tax revenues realized
based on audits of the previous tax year, the need to differentiate average tax recovery
per case from sum of tax recovery (month-to-month), realizing that industry sectors
are not “one size fits all”, and accounting for relatively temporary effects of staffing
reorientation in the conversion to a virtual workplace versus the more enduring effects of
business disruptions. With SAS Enterprise Miner’s abilities to continuously adjust
forecasts, sub-categorize datapoints by tax office or industry sector, and apply lagged
cross-correlation analysis, we are suitably equipped with the right tools and this can
provide abstract learnings for other large organizations.
INTRODUCTION
The Canada Revenue Agency (CRA) is Canada’s federal tax administration. As with all tax
jurisdictions, the CRA has been challenged to keep pace with COVID-19 shocks and
manifestations, which began in March 2020 (the last month of our fiscal year).
Fortunately, SAS® Enterprise Miner™ has been an invaluable aid in gauging these impacts.
Enterprise Miner™ includes a highly versatile set of functional nodes for configuring and
processing time series data. It can decompose time series components such as seasonality
and trend, show trend lines and expected forecast within configurable prediction intervals,
and demonstrate complex correlation analyses.
While this has been of great benefit to the CRA in gauging the trajectory of macro-variables
related to tax revenues and auditor performance, the findings of this research paper could
conceivably be applied in the abstract to large organizations with process-oriented
functions, and not just to other foreign tax jurisdictions.
Let us provide a Glossary of terms to set the stage:
• TSAF: Time Series Analysis & Forecasting.
• TEBA: tax earned by audit, which is the amount of tax collectible that is agreed
upon in the course of a taxpayer audit. It is in NPV (Net Present Value).
• TAR: the tax-at-risk, which is the amount that CRA risk assessors arrive at as
the precursor to auditing activity.
• C/AR ratio: the ratio of [audit] cases completed, to action requests [submitted]
for assistance. It is a tentative measure of auditor productivity.
• Integras: the tool used by CRA auditors to process cases.
TIME SERIES FUNCTIONAL NODES & SETUP
In SAS® Enterprise Miner™, you have six TSAF nodes in the “Time Series” ribbon, but we’re
only going to use four of them. Below is the Time Series ribbon with the functional nodes in
question:
Figure 1. Time Series Functional Nodes
• TS Data Preparation: this node allows you to specify basic time series properties
including interval, cycle, start/end time, and accumulation (i.e. by total, min or max,
mean, etc.)
o Below, the interval defaults to “Automatic”, so we specify “Month” as the interval.
o We can leave the seasonal cycle and start/end time as “Default”, as SAS®
Enterprise Miner™ will auto-determine these parts from the data.
o In our case, the data was pre-accumulated in SAS® Enterprise Guide® row-by-row
on a per-month basis, so we can leave Accumulation = “Total” (otherwise,
we would have to set it to “Average”).
Figure 2. TS Data Preparation node – basic properties
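To make the Accumulation setting concrete, the following is a minimal Python sketch (not SAS code, and not a description of how Enterprise Miner is implemented) of rolling records up to one value per month. The function name `accumulate_monthly` and all data values are hypothetical.

```python
from collections import defaultdict
from statistics import mean

def accumulate_monthly(records, how="total"):
    """Roll (month, value) records up to one value per month.

    `how` mirrors the idea behind the Accumulation property:
    "total" sums the values in each month, "average" takes their mean.
    """
    if how not in ("total", "average"):
        raise ValueError("how must be 'total' or 'average'")
    buckets = defaultdict(list)
    for month, value in records:
        buckets[month].append(value)
    reducer = sum if how == "total" else mean
    return {month: reducer(values) for month, values in sorted(buckets.items())}

# Hypothetical raw records: several rows per month.
raw = [("2020-01", 4.0), ("2020-01", 6.0), ("2020-02", 9.0)]
# Hypothetical pre-accumulated data: exactly one row per month.
pre = [("2020-01", 5.0), ("2020-02", 9.0)]

print(accumulate_monthly(raw, "total"))    # the two settings differ on raw rows
print(accumulate_monthly(raw, "average"))
print(accumulate_monthly(pre, "total") == accumulate_monthly(pre, "average"))
```

With one row per month, “Total” passes each value through unchanged, which is why the setting is harmless for data that was already accumulated upstream.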
• TS Decomposition: this node allows you to specify similar basic settings to that of
the TS Data Prep node, but the Number of Periods can be configured, and moreover,
you can configure which Export Components you want to display.
o By default, it will only display the “Trend-Cycle” component (=Yes), which is
generally regarded as the most salient one.
o However, in our case, we want to view ALL Components, so we would set that
value to “Yes”.
Figure 3. TS Decomposition node – properties
TS Correlation: this node allows you to set up your TSA for autocorrelation analysis, or
alternatively for CCA (Cross-correlation analysis). When you select one of those methods,
the other one’s properties will be greyed out.
Figure 4. TS Correlation node – properties
Both the TS Correlation and TS Decomposition nodes must be preceded by a TS Data
Preparation node (which occurs right after the source data node).
TS Exponential Smoothing: this node allows you to conduct forecasting based on your
known data; as such, you would connect it to a TS Data Preparation node, not directly to
your source data node.
• The interval is automatic (which will be month in the case of our pre-accumulated
data), and the accumulation defaults to “Total” (which is OK in our case, for the
same reason).
• SAS will pick what it deems to be the best forecasting method.
• The default selection criterion is MSE, or Mean Squared Error.
• We will see more on the Forecast lead, back, and significance level parameters
during the forecast demonstration in this paper.
Figure 5. TS Exponential Smoothing node – properties
For our initial workspace setup, we can focus on the C/AR (Case to Action Request)
ratio, which as per our glossary is a tentative measure of tax auditor performance. The
initial diagram workspace is called “Aggreg_Integras_27mths”, which runs from January
2018 to March 2020. It is arranged this way for a reason: it ends on the month
of the COVID shutdown.
Our dataset name is “TSA_AGGREG_SINGLE_LINE_27MTHS”.
So, when I bring this in, I need to set all variables to Role = “Rejected” except a) C/AR ratio
and b) my MONTH (Time ID) variable.
Figure 6. Variable Role selection from data source
You would set your variables once you bring the data source to your diagram (workspace).
Figure 7. TS Data Source to Diagram flow
NOTE: I do not cover the mechanics behind bringing in a data source, as the principal focus
is on conducting TSAF in SAS® Enterprise Miner™. All we need to be concerned with is that
as Data Sources become available in the top-left menu, we can drag-and-drop them to our
diagram workspace (which is also created by right-clicking “Diagrams” in the left panel).
In examining the TS Data Preparation node, it is fairly simple: we see the known trajectory
of the C/AR variable, simply by right-clicking the node → Run → Results.
Figure 8. Time Series Plot, for C/AR ratio variable
We can see that the C/AR ratio has fallen off as of mid-2018, and continued on a very
gradual downward path. This means that case auditors are completing disproportionately
fewer cases relative to the action requests they submit for help, albeit with a seasonal
factor and some rebounding of the trend-line in March 2020.
So, we can examine the more specific components of the time series line by using a TS
Decomposition node.
DECOMPOSITION OF TIME SERIES
In running our TS Decomposition node, and viewing the results, the first one to examine
is the Seasonal Component Plot. When it comes to the C/AR ratio, the seasonal index
ranges from a high of about 1.3 down to a low of about 0.75.
Figure 9. Seasonal Component Plot, for C/AR ratio variable
During the months of March and December, we see fairly high seasonality. This is normal
for the time of year, since the push to complete cases is higher at the end of the CRA
fiscal year (March), and ostensibly at the end of the calendar year, also. Auditors are
completing proportionally more cases vs. the number of action requests they submit to the
service desk. So it is likely that they are fulfilling cases that do not require as many
interventions during those months. Even in March 2020, C/AR still remained high; it was
resilient to the initial COVID effects, due to being a ratio variable and not an absolute
sum variable.
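As a rough illustration of what a seasonal index near 1.3 or 0.75 means, the sketch below computes a simple multiplicative index: the mean of each position in the cycle divided by the overall mean. This is an assumption-laden simplification (it ignores trend) and is not the decomposition algorithm the TS Decomposition node uses; the function name and data are hypothetical.

```python
from statistics import mean

def seasonal_indices(values, period=12):
    """Return one multiplicative seasonal index per position in the cycle.

    Index = (mean of all observations at that position) / (overall mean).
    This simple version ignores trend, so it is only a rough illustration.
    """
    overall = mean(values)
    return [mean(values[pos::period]) / overall for pos in range(period)]

# Hypothetical quarterly series (period=4) covering two full cycles.
series = [8.0, 10.0, 12.0, 10.0, 8.0, 10.0, 12.0, 10.0]
indices = seasonal_indices(series, period=4)
print([round(i, 2) for i in indices])
```

An index above 1 marks a position in the cycle that typically runs above the series average, and an index below 1 marks one that runs below it.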
In the decomposed results, we can also examine combinatory components; for instance, the
Trend-Cycle Component Plot:
Figure 10. Trend-Cycle Component Plot, for C/AR ratio variable
This tells us what we had surmised from the initial data preparation, that the series has
been on a steady downward trajectory. Now when it comes to tax-related time series
data, there is no real cycle per se; at best, it is an inherited cycle from world economy
fluctuations. The proper definition of cycle in a TSA context is not the entity’s operational
lifecycle; rather, it refers to the boom-and-bust business cycles which are largely
unpredictable. Ergo, we are mainly concerned about trend here.
Now, if we substitute the Average TEBA (tax earned by audit) variable for C/AR [using the
Data Source node shown in figure 6 earlier], we can see what emerges in our decomposed
time series results.
Figure 11. Paneled Component Plots, TS Decomp. for Avg. TEBA
This time, as per the panel graph at bottom-left, we see that our seasonality index is
broader than that of C/AR ratio; it goes from a high of about 1.8 to a low of ~0.7. This is
largely attributable to the heightened pressures towards fiscal year-end to increase
realization of TEBA, which we see in Feb.-March. At the opposite end, we see rather low
seasonality for May, August, and November.
For the original series plot, bottom-right, the trend continues gradually upwards with
seasonality readily apparent. In the trend-cycle component plot, at top-left, we see that the
trend (with cycle, such as it is) is rising steadily upwards but then reaches a virtual plateau.
The key challenge, then, has been to resolve and reconcile the expected forecast as of March
2020 with the new COVID-19 realities.
FORECASTING MACRO TAX VARIABLES
AVERAGE TEBA
We can proceed to evaluate the expected trajectory of the AVG. TEBA variable, on a
monthly interval. Recall that this variable is pre-accumulated at data source.
When we conduct our forecast, we use the TS Exponential Smoothing node.
Figure 12. TS Exponential Smoothing node in the TSAF diagram
We let SAS® pick the best forecasting method, as well as selection criterion (forecast
measure). In this case, the latter value is the MSE [Mean Squared Error] as you can see at
the bottom of the properties of the node.
Figure 13. Properties of the TS Exponential Smoothing node
We set the Significance Level to 0.5; it governs the blue bracket around the
forecast line, a.k.a. the prediction interval. So it is a confidence band of sorts. Note that
this figure is the “alpha” value rather than the confidence level, so it runs in the opposite
direction from the confidence percentages some of us might know from frequentist confidence
intervals: the lower the “alpha” value, the wider the band (prediction interval). An “alpha”
of 0.01 would produce a very wide band, and an “alpha” of 0.99 would be virtually limited to
just the forecast line itself. So we aim in the middle (which actually is
closer to the outline of the trend line, as this figure is more “log-like” in its manifestation).
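The non-linear relationship between the significance level and the width of the band can be sketched as follows. This assumes normally distributed forecast errors, which is an assumption of the illustration rather than something the paper states; `half_width_multiplier` is a hypothetical helper name.

```python
from statistics import NormalDist

def half_width_multiplier(alpha):
    """z such that forecast +/- z * (std. error) covers 100 * (1 - alpha) percent.

    Assumes normally distributed forecast errors (an assumption of this sketch).
    """
    return NormalDist().inv_cdf(1 - alpha / 2)

for alpha in (0.01, 0.25, 0.5, 0.99):
    z = half_width_multiplier(alpha)
    print(f"alpha={alpha:<5} coverage={1 - alpha:.0%} multiplier={z:.3f}")
```

Under this assumption the multiplier shrinks quickly as alpha moves away from zero, so an alpha of 0.5 already gives a band much narrower than half the width of the alpha = 0.01 band.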
Figure 14. TEBA_NPV_Mean: forecast line from trend
SAS logically expects the trend will continue upwards (while maintaining seasonality, of
course) due to “series momentum”. Had we begun our time series at, say, January 2016
rather than Jan. 2018, that momentum might have been more pronounced. The clichés of
“future behavior is governed by past behavior” and “you can’t know where you’re going,
unless you know where you’ve been” have never been truer. However, enter COVID-19,
and that is a whole new wrench in the gears of the tax-auditing apparatus.
As for the selection of “Best” Forecasting Method: you could try to experiment with
different models (there are eight in all, as per fundamental TSAF science), but I can tell
from the shape of the forecast line that it’s based, appropriately, on the Additive Winters
method¹. I ascertained this by running the node with this method selected, and the
resulting graph was identical to that of the “best” method. Unlike the Multiplicative Winters
method, this forecast line is predicated on fairly consistent seasonal “inverted V” shapes in
the curve. If those inverted V shapes became noticeably larger (or smaller) over time, then
Multiplicative Winters would likely be the “best” method that SAS would auto-select.
Figure 15. Available Forecasting Methods, properties of TS Exp. Smoothing node
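For readers who want to see the mechanics behind the Additive Winters method, here is a minimal textbook-style sketch in Python. It is not the SAS implementation: the smoothing weights `alpha`, `beta`, and `gamma` are fixed, hypothetical values (SAS estimates its own), the initialisation is deliberately simple, and the input series is made up.

```python
def additive_winters(y, period, alpha=0.3, beta=0.1, gamma=0.1, horizon=12):
    """Forecast `horizon` steps ahead with additive Holt-Winters smoothing.

    Needs at least two full seasonal cycles for the simple initialisation.
    """
    if len(y) < 2 * period:
        raise ValueError("need at least two full seasonal cycles")
    first = sum(y[:period]) / period
    second = sum(y[period:2 * period]) / period
    level = first
    trend = (second - first) / period
    seasonal = [y[i] - first for i in range(period)]

    for t, obs in enumerate(y):
        s = seasonal[t % period]
        prev_level = level
        # Level: deseasonalised observation blended with the previous projection.
        level = alpha * (obs - s) + (1 - alpha) * (prev_level + trend)
        # Trend: latest change in level blended with the previous trend.
        trend = beta * (level - prev_level) + (1 - beta) * trend
        # Seasonal: an additive offset from the level, not a multiplier.
        seasonal[t % period] = gamma * (obs - level) + (1 - gamma) * s

    n = len(y)
    return [level + h * trend + seasonal[(n + h - 1) % period]
            for h in range(1, horizon + 1)]

# Hypothetical series with a constant-size seasonal swing and no trend.
history = [10.0, 20.0, 30.0] * 3
print([round(v, 2) for v in additive_winters(history, period=3, horizon=3)])
```

Because the seasonal term is added to the level, the size of each seasonal swing stays constant in the forecast; the multiplicative variant would instead scale the swing with the level.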
We see that the resulting forecast predicts ahead exactly 12 months. This is the
difference between the figures of “Forecast Lead” and “Forecast Back” in the properties. We
saw earlier that “Forecast Back” = 6; this acts as our validation partition,
using the last six months of known data (i.e. Oct. 2019 to March 2020). So this gets
subtracted from the “Forecast Lead” value of 18 to arrive at 12 periods out. Ideally, you
want your “back” [validation] period to be between 20-25% of your known data, which it is
(6 out of 27 months is about 22%); even when we increase the known months to 30, it will
still be 20% of this.
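The lead/back arithmetic above can be checked with a few lines of Python. This is only a sketch of the calculation as the paper describes it; the function name `forecast_windows` is hypothetical.

```python
def forecast_windows(n_known, lead, back):
    """Split a forecast run into holdout and true out-of-sample periods."""
    ahead = lead - back             # periods forecast beyond the last known value
    holdout_share = back / n_known  # share of known data used for validation
    return ahead, holdout_share

ahead, share = forecast_windows(n_known=27, lead=18, back=6)
print(ahead, f"{share:.1%}")

_, share_30 = forecast_windows(n_known=30, lead=18, back=6)
print(f"{share_30:.1%}")
```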
SUM OF TEBA
When we run a TSAF experiment on the SUM of TEBA, as opposed to its average, we
see a drastic difference in the scale. Because TEBA is a sum value, not a ratio (i.e.
C/AR, or [Average] TEBA/case), it is simply not as resilient to sudden shocks like COVID-19,
as we will later see when adjusting the forecast based on incremental months (April, May,
June) of known values.
1 The essence of the Winters method is to combine discernible trend with seasonality.
Figure 16. TEBA SUM Forecast (post-March 2020)
Note that the MSE selection criterion (default) graphs a trend line around the known values
(which are represented by the red dots here). The SUM TEBA for Feb. 2020 is nearly double
what it was for March 2020, as you can see by the relatively large separation of the red dots
from the blue dots (on trendline) for those two months. Yet SAS® “thinks” that the trend
will continue positively, as it is “COVID-agnostic”.
What may also seem shocking to the reader is that the lower limit of the prediction interval
for April 2020 (at ~$674.5M) actually exceeds the actual value for April 2019, which was
slightly below $500 million. It is not until the fall that we see the midpoint of actual
2019 data approximate the LCL (lower confidence limit) of the forecasted band for Sept.
2020. This is ostensibly due to the “positive momentum” of the time series that I alluded to
earlier.
C/AR RATIO
Next, we switch out the SUM of TEBA for the C/AR ratio, once again. In forecasting a
relatively low continuous ratio variable such as C/AR, the prediction interval can be less
reliable. We have to examine the midpoint distribution. While the midpoint post-March
2020 tends to be at or above the 10.0 line, this is rare for 2019 datapoints.
Figure 17. C/AR ratio Forecast
I used the Mean Relative Abs. Error as the forecast metric (selection criterion), which I
found to be more appropriate. Regardless, what we see in the actuals for the spring of 2020
is a very low C/AR ratio, telling us that case throughput has suffered as a result of the
pandemic AND that Action Requests for help did not decline proportionally; there was still
an apparent high need for action requests.
FORECASTING AVG. HOURS PER CASE
For forecasting average hours per [audit] case, I determined that the more ideal Selection
Criterion was “Median Relative Abs. Error”. No matter what Selection Criterion I used (or
Significance Level), the prediction interval still dipped into the negative range. Sometimes,
this is unavoidable. But then the prediction interval becomes spurious; you can’t have
negative hours. So we tend to just focus on the midpoint values in this situation.
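The paper's approach is simply to rely on the midpoint. As a side illustration only, and not something the paper does, the sketch below flags an interval whose lower limit falls below zero and floors it for display. Flooring is a presentation workaround rather than a statistical fix; the function name and numbers are hypothetical.

```python
def floor_lower_limit(forecast, half_width, floor=0.0):
    """Return (lower, upper, was_truncated) for a symmetric interval.

    The midpoint (`forecast`) is left untouched; only the displayed lower
    limit is floored, and the flag records that the raw band was spurious.
    """
    lower = forecast - half_width
    upper = forecast + half_width
    truncated = lower < floor
    return (max(lower, floor), upper, truncated)

# Hypothetical forecast of 20 hours per case with a +/- 30 hour band.
print(floor_lower_limit(20.0, 30.0))
```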
Figure 18. Average hours per case Forecast
We can see that the midpoint goes very subtly upwards for the first few forecasted points
(post-March 2020), then sharply up for summer. As it turns out, this is a fairly good
approximation of the reality, since the Avg. Hours per case during the middle of 2020 is
about 1.5-2.0 times that of the previous year. What is especially pronounced is that the
Average Hours of March 2019 were only 6.25, whereas for March 2020, it was 35.44. This
was predicated on an Agency policy-induced change; refer to the link and passage below:
https://www.mondaq.com/canada/audit/1030308/cra-moves-forward-with-international-audits-despite-continued-backlog-?email_access=on
In March 2020, the CRA announced that it was suspending the vast majority of audit activity for a
minimum of four weeks, other than audits involving the very largest taxpayers. This suspension meant
that the CRA ceased requests for information relating to existing audits, finalizing existing audits, and
issuing reassessments. Further, deadlines for information or document requests were suspended and no
action was required from taxpayers under audit during this time. This suspension remained in effect until
June 2020, though audits of small and medium businesses did not resume until late fall.
This is also arguably responsible for the “pulse” effect we see in actual Avg. TEBA for July
2020, as per the monthly incremental analysis that comes next.
INCREMENTAL ALIGNMENT
APRIL 2020, KNOWN VALUES
Now when we add the month of April 2020 to our data (making it 28 mths total), we would
expect the AVG. TEBA actuals for subsequent months to become closer to / within forecast
range. As an example in the graph cross-section that follows, the forecast for September,
October, and December 2020 becomes more within range of later-known actuals, once we
add April 2020 data. However, the July 2020 actual (~$122,000) is still above the forecast
band for this incremental dataset’s forecast. This was likely due to the resumption of
standard large business audit as of June 2020 (see the article passage quoted earlier).
Figure 19. Revised AVG. TEBA forecast, incremental inclusion of APRIL 2020
Again, we typically use the measure of MSE [Mean Squared Error] in gauging efficacy or
proximity of a forecast to actual [values]. See the Appendix tables at the end of this paper
for a breakdown of this analysis, where I illustrate monthly incremental effect on accuracy
of the last six months of the calendar year (i.e. from July to Dec. 2020).
MAY 2020, KNOWN VALUES
Clearly, the addition of April wasn’t enough to right the trajectory of the expanding “COVID
window”. So in continuing our analysis of monthly incremental effect, I added May 2020’s
known data and I changed the forecast significance level from 0.5 to 0.25. But it makes no
difference: July actual is still out of forecast range. We must simply accept that July 2020
Avg. TEBA is an irregular value (~$122K), since July 2018 had Avg. TEBA =~$45K, and July
2019’s Avg. TEBA was ~$57K. It is clear that this is a COVID-adjustment spike.
Figure 20. Revised AVG. TEBA forecast, incremental inclusion of MAY 2020
We can therefore define July 2020 as a pulse, or a one-time brief event, that caused a
spike in the accumulated time series value for that month. This emphasis on larger
business for audit while suspending SMB audits at the time is further substantiated by the
fact that in July 2020, there was an average of 50.75 hrs per case completed, which is
extremely high. For April, which had a very high Average TEBA of $185.5K, the figure was
52.16 average hours per case.
JUNE 2020, KNOWN VALUES
Predictably, the addition of June 2020 didn’t improve the forecast band enough to include the
actual Avg. TEBA for July. So this strengthens the theory that July’s value was a one-time
event, or pulse, in the time series. It also strengthens the theory that Avg. TEBA was more
resilient to initial COVID-19 transition measures (being a ratio value, in essence). To wit:
observe below that the April-May-June line for the original forecast (left) and actual data
points (right) is just above the $50K line, and follows the same trajectory.
Figure 21. Comparing Q1 of FY2020-21 forecast vs. actual data points
In taking MSE and RMSE (R is “root”) measurements for both the as-of-March and as-of-June
forecasts, we only note a slight improvement (reduction) in that value. This also
goes to show the resilience of this variable, and the “pulse” nature of July’s spike.
MEASURE / as of MONTH    MARCH 2020           JUNE 2020
AVG. TEBA (MSE)          $ 954,467,257.64     $ 888,454,004.34
RMSE                     $ 30,894.45          $ 29,806.95
Table 1. Point-in-time [R]MSE for AVG. TEBA forecast-to-actual: July to Dec. 2020
Refer to the Appendix at the end of this paper for a more detailed month-by-month
breakdown of these calculations.
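The relationship between the two rows of Table 1 can be verified directly: RMSE is the square root of MSE. The short Python check below uses only the MSE figures reported in Table 1; the variable names are hypothetical.

```python
import math

# MSE values as reported in Table 1 (AVG. TEBA, July to Dec. 2020).
mse = {"March 2020": 954_467_257.64, "June 2020": 888_454_004.34}

# RMSE is the square root of MSE, which puts the error back in dollars per case.
rmse = {as_of: math.sqrt(value) for as_of, value in mse.items()}
for as_of, value in rmse.items():
    print(f"as of {as_of}: RMSE = ${value:,.2f}")

mse_reduction = 1 - mse["June 2020"] / mse["March 2020"]
rmse_reduction = 1 - rmse["June 2020"] / rmse["March 2020"]
print(f"MSE reduction: {mse_reduction:.1%}, RMSE reduction: {rmse_reduction:.1%}")
```

The MSE falls by roughly 7% and the RMSE by roughly 3.5% between the two forecasts, which is consistent with the "slight improvement" described above.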
FALLACY: COMPARING SUM OF TEBA SHIFT TO AVG. TEBA CHANGES
TSAF works best when you accumulate data records by average, not by sum total. If we
tried this exercise using SUM TEBA per month, it would not turn out very well, because sum
totals are immediately impacted by any severe transition, i.e. auditor work re-arrangements
and temporary audit case policy due to COVID-19 fallout as of March 2020.
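A toy example shows why a sum reacts to a throughput shock while an average may not. The figures below are entirely hypothetical and are chosen so that only the number of completed cases changes, not the value recovered per case.

```python
from statistics import mean

def monthly_summary(case_values):
    """Return (sum, average) of tax earned across the cases closed in a month."""
    return sum(case_values), mean(case_values)

# Hypothetical months: the shock halves the number of cases completed,
# while the value recovered per case stays the same.
normal_month = [50_000.0] * 100
shock_month = [50_000.0] * 50

print(monthly_summary(normal_month))
print(monthly_summary(shock_month))
```

In this constructed case the monthly sum is cut in half while the per-case average is unchanged, which is the kind of resilience the ratio variables show in the analysis.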
Evaluating the March 2019-2020 comparison in the following table, the TEBA_SUM and
Case Count have dropped significantly in March 2020, yet the C/AR ratio has increased.
Table 2. Year-over-Year March comparison, key macro-variables in TSA
However, as the staffing situation has attempted to stabilize in the intervening months
(April to June 2020), the C/AR ratio has dropped dramatically. (Not shown in above table.)
The same is true for the TEBA/AR pattern.
SUM OF TEBA: DRASTIC CHANGE
We now compare the SUM TEBA forecast as of March 2020 (left image) and that of June
2020 known data points (right image).
Figure 22. Comparison of SUM of TEBA forecast as of March vs. as of June (2020)
For the first image, none of the actuals of the last six months of 2020 fall in the forecast
band. Whereas, for the second image, two of the actuals of the last six months (Oct., Nov.)
fall in the forecast band.
Also observe how some of the accumulated data points in the forecast are more “depressed”
in the latter graph; while there is a discernible peak, it doesn’t quite have the same
buoyancy or upwards momentum as the former graph. (We must keep in mind, though,
that this is still using the MSE method, i.e. taking a line of best fit, where the red dots are
the actual values.)
So, there is little point in using the MSE to gauge efficacy of the monthly adjustment, simply
because the values would be so huge (as opposed to those in the Avg. TEBA MSE).
ADVERSE IMPACTS AND DELAYED EFFECTS
LATENT EFFECTS OF SHOCKS
We would also expect that lower Avg. TEBA wouldn’t manifest until much later in the fiscal
year 2020-21, due to most of 2020 consisting of past year audits. The graph below covers
known Avg. TEBA trend data points right up to December 2020, the lowest point.
Figure 23. Calendar-year-end (2020) Avg. TEBA; lowest point
This extremely low Average TEBA of ~$32,000 per case could be a harbinger of further
average TEBA decline, but we’d have to observe the last quarter of the fiscal year (January
to March 2021, once available) and validate that theory. (Then we might apply an
intervention to the time series line.)
Incidentally, when it comes to SUM of TEBA with actuals up to Dec. 2020, the forecast trend
line for 2021 is far more credible, showing all datapoints as being well under $1 billion, and
mostly under $500 million.
INTERVENTIONS
As alluded to before, a TSAF exercise may use interventions, if the extreme or irregular
event is known in advance (or shortly thereafter). This is an adjustment to the “regular”
time series, using a “dummy” variable for the period of observation. In this case study,
we’d recommend an intervention for the SUM of TEBA as of March 2020, and possibly for
AVG TEBA as of Dec. 2020. Plus, we might use a “pulse effect” for July 2020. However,
programming an intervention requires SAS® Studio, which is out of scope for this paper.
Figure 24. Basic denotation of input variables (interventions) by type
Figure 23 callout: Lowest actual in 3 years; Dec. 2020 Avg. TEBA of $32,404.
Figure 24 callout: A step would work best as an intervention (for March 2020 and Dec. 2020),
since the trend line shift is sudden and sustained; it does not happen gradually then return
to baseline.
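Although the paper leaves intervention programming out of scope, the dummy variables themselves are easy to picture. The Python sketch below (not SAS code; all function names are hypothetical) builds a step dummy starting in March 2020 and a pulse dummy for July 2020 over a monthly index from January 2018 to December 2020.

```python
def month_range(start, end):
    """List 'YYYY-MM' labels from start to end inclusive."""
    y, m = map(int, start.split("-"))
    end_y, end_m = map(int, end.split("-"))
    labels = []
    while (y, m) <= (end_y, end_m):
        labels.append(f"{y:04d}-{m:02d}")
        y, m = (y + 1, 1) if m == 12 else (y, m + 1)
    return labels

def step_dummy(months, start):
    """1 from `start` onward, 0 before: a sudden, sustained shift."""
    return [1 if month >= start else 0 for month in months]

def pulse_dummy(months, at):
    """1 only in the month `at`: a one-time event."""
    return [1 if month == at else 0 for month in months]

months = month_range("2018-01", "2020-12")
covid_step = step_dummy(months, "2020-03")
july_pulse = pulse_dummy(months, "2020-07")
print(len(months), sum(covid_step), sum(july_pulse))
```

Such columns would then be supplied as input variables to a forecasting model that supports regressors; how that is done in SAS is not covered here.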
TS CORRELATION NODE
AUTOCORRELATION
When we deal with a significant seasonal and/or trend component, we usually find a greater
degree of autocorrelation (the autocorrelation function is abbreviated “ACF”). As the name
suggests, this is the tendency of a variable to self-influence. It could also be regarded as
momentum, or “muscle memory”.
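To show what an ACF plot is measuring, here is a minimal Python sketch of the standard sample autocorrelation at a given lag. It is a generic textbook formula, not the TS Correlation node's code, and the series is hypothetical.

```python
def autocorrelation(x, lag):
    """Sample autocorrelation of x at the given lag (lag 0 is always 1)."""
    n = len(x)
    m = sum(x) / n
    denom = sum((v - m) ** 2 for v in x)
    num = sum((x[t] - m) * (x[t - lag] - m) for t in range(lag, n))
    return num / denom

# Hypothetical series with a repeating 4-period pattern.
series = [1.0, 2.0, 3.0, 2.0] * 6
print([round(autocorrelation(series, k), 2) for k in range(0, 9)])
```

For a seasonal series like this one, the autocorrelation is strongly positive at lags that are multiples of the seasonal period and negative half a cycle away, which is the signature to look for in the ACF plot.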
In a similar vein, when frontline auditing teams are performing well, some of that
momentum carries over from one period to the next, as they build “muscle memory” and
are better-equipped to deal with more trying scenarios that have [abstract] aspects in
common with recent cases worked on. This presents opportunities for “boilerplate” copying
and pasting of common findings from one case to another, adjusting for specifics, and
accelerating average time to complete as well as garnering more average TEBA per case.
Clearly, during the current COVID-19 climate at this writing, and the embargo of SMB case
audit during the spring 2020 period, we can expect some of that momentum to be adversely
impacted, since auditors were working on more complex large business cases overall. But
first, let us examine a baseline from the years 2018-2019, below:
Figure 25. ACF Plot, three key tax-related macro-variables (2018-2019)
From the three variables