The New Profit and Loss Attribution Tests: Not Ready for Prime Time

Introduction

The market risk capital framework in the Basel proposal[1] includes new statistical tests designed to measure how effectively a bank’s risk management models capture the same risks as the bank’s trading desk models. The motivation for these tests is the observation that banks’ risk management models, which are used for capital calculations, often differ from the trading desk models. A bank’s risk management model may use slightly different data, fewer risk factors or different pricing models than the desk models. The reason for these differences is pragmatic: bank risk models generally must do significantly more computation than desk models and so must be faster and more efficient. Some differences between risk management and trading desk models would be expected, but if the differences are excessive, a bank’s risk management models may fail to capture its risks adequately. To help validate banks’ risk models, the Basel proposal introduced two new Profit and Loss Attribution (PLA) tests that are designed to measure the consistency of trading desk and risk management model Profit and Loss (P&L) estimates. If the risk models fail either test, the bank must stop using its risk management models for capital calculations and instead use the standardized approach.

The proposal’s discussion of the PLA tests is difficult for non-technical readers to follow, since it is filled with forbiddingly arcane, highly technical language. The purpose of this note is to explain to a reader not steeped in statistics how these PLA tests work and to assess whether they are fit for purpose. In this note, we will explain the tests using some simple examples. The PLA tests, when stripped of their technical baggage, are conceptually intuitive. Unfortunately, as will become apparent, these tests suffer from a fatal flaw that should prevent them from being used as proposed: the more effectively a bank hedges its market risks, the more the tests will tend to fail. More troubling, these test failures often cannot be fixed even if a bank rectifies the problems that led to the failures in the first place.

When the tests fail, banks are required to use the standardized approach for capital calculations, resulting in highly overestimated capital charges, especially for hedged portfolios. To measure the risk of hedged portfolios accurately, it is very important to use the correct correlations between the risk factors. However, the standardized approach requires the use of pre-defined correlations that may not represent the actual correlations. The PLA tests thus create extremely perverse incentives for banks. To avoid punitive capital penalties under the proposal, banks would be better off hedging less effectively. Random test failures that do not necessarily have any material significance can cause capital treatment to flip back and forth between the standardized and advanced approaches, a problem that gets worse as hedging gets better.

In light of these fundamental issues, the PLA tests are not ready for prime time and should not be used to validate banks’ internal risk models or determine eligibility for models-based approaches. Instead, the agencies should repurpose the tests to be for reporting and monitoring purposes only. The agencies could also consider eliminating the Spearman correlation test entirely, as it is the more problematic of the two tests.

The New PLA Tests

The proposal introduces two new statistical tests that are designed to compare the trading desk model’s daily P&L to the risk model’s daily P&L. The first PLA test, the Spearman correlation test, assesses whether the risk model’s P&L is highly correlated with the desk model’s P&L. If the correlation is too low, the risk model will fail the test. The second PLA test, the Kolmogorov-Smirnov (KS) test, checks whether the relative frequencies of daily P&L values produced by the desk and risk models are sufficiently similar. If not, the risk model will fail the test.

The PLA tests use a red-amber-green traffic light system. If the risk model fails either PLA test, it falls into the red zone. When a risk model is in the red zone, a bank must use the standardized approach. If the bank’s risk model passes both tests, the model will fall into the green zone and can be used for capital calculations. If the bank’s models are outside the green zone on at least one test but not in the red zone on any test, they are in the amber zone. In that case, the bank can continue to use its internal risk management models but must pay a capital penalty that is a fraction of the difference between the standardized and internal model’s capital requirement. A bank can return to the green zone by subsequently passing both PLA tests as well as backtesting. Each PLA test is conducted quarterly and requires that the previous 250 trading days of daily P&L be used.
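To make the zoning logic concrete, here is a minimal sketch in Python. The KS boundaries (0.09 and 0.12) are the proposal’s thresholds discussed later in this note; the Spearman boundaries shown are illustrative assumptions consistent with the examples below.

```python
# Sketch of the PLA traffic-light logic described above. The KS zone
# boundaries (0.09, 0.12) are the proposal's; the Spearman boundaries
# here are illustrative assumptions, not quoted from the proposal.

def pla_zone(spearman_corr: float, ks_stat: float) -> str:
    """Classify a trading desk's risk model into a PLA zone."""
    SPEARMAN_AMBER, SPEARMAN_RED = 0.80, 0.70  # assumed illustrative cutoffs
    KS_AMBER, KS_RED = 0.09, 0.12              # proposal's KS cutoffs

    # Red zone if either test fails outright: standardized approach required.
    if spearman_corr < SPEARMAN_RED or ks_stat > KS_RED:
        return "red"
    # Green zone only if both tests pass: internal models may be used.
    if spearman_corr >= SPEARMAN_AMBER and ks_stat <= KS_AMBER:
        return "green"
    # Amber zone otherwise: internal models allowed, plus a capital add-on.
    return "amber"
```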

PLA Test 1: The Spearman Correlation

To understand how the Spearman correlation is calculated, we will work through a simple example using just 10 days rather than 250 days of hypothetical P&Ls. The calculations would work the same way for 250 days. Table 1 shows an example of 10 days of hypothetical P&Ls produced by the desk and risk management models.

Table 1

[Table 1: Ten days of hypothetical desk and risk model P&Ls]

Table 1 depicts a typical situation in which the desk and risk model P&Ls differ from day to day but are in the same ballpark. To perform the Spearman test, we must first rank the desk and risk P&Ls. The lowest P&L would have a rank of 1, the next lowest would have a rank of 2, etc. Table 2 displays both P&L series ranked from smallest to largest.

Table 2

[Table 2: Desk and risk P&Ls sorted from lowest to highest, with ranks assigned]

In Table 2, we simply sorted the desk and risk P&Ls from lowest to highest daily value and then assigned ranks.

The next step is to find the correlation between the daily ranks under the two models (not between the actual P&L values). The idea behind the Spearman test is that if two P&L series are consistent, their daily P&L ranks should be highly correlated. When models are perfectly correlated, the rank of each day would be the same under both models, even if the P&L values themselves differ. The metric the proposal specifies is the Spearman correlation, which calculates the correlation between the ranks of two data sets. We can find the Spearman correlation coefficient $r_s$, which measures the correlation of the ranks, using the following formula:[2]

$$ r_s = 1 - \frac{6 \sum_{i=1}^{n} d_i^2}{n(n^2 - 1)} $$

where $d_i$ is the difference in the ranks on each day and $n$ is the number of data points. This formula makes intuitive sense. If the ranks of the risk and desk P&L were exactly the same on each day, then the difference of the ranks would be zero on each day: $\sum_{i=1}^{n} d_i^2 = 0$ and $r_s = 1$, i.e., perfect correlation. When the ranks are mostly the same on each day but a few are a little different, then the second term would subtract from 1 and the correlation would fall below 1. As the P&L series become less correlated, the daily ranks would become more different from each other, the second term would subtract more from 1, and the correlation would be lower still. At the extreme, when the P&Ls are negatively correlated, the risk P&L with the highest rank would be paired with the desk P&L of the lowest rank, the risk P&L with the second highest rank would be paired with the desk P&L of the second lowest rank, etc. In that case, $r_s$ would approach -1 as $n$ gets large.[3]

For the example above, Table 3 shows how to calculate $\sum_{i=1}^{n} d_i^2$ in the formula:

Table 3

[Table 3: Daily rank differences $d_i$ and squared differences $d_i^2$ for the desk and risk P&Ls]

Using the formula, $r_s = 1 - \frac{6 \times 24}{10(10^2 - 1)} = 0.85$, which is the Spearman correlation coefficient for this series of data. The last step is to consult the table given in the Basel proposal[4] to see in which zone the Spearman correlation metric lies:

Table 4

[Table 4: Traffic light zones for the Spearman correlation metric]

Since $r_s$ was calculated to be 0.85, we are in the green zone and have passed the test. As can be seen in Table 4, the intuitive idea behind the Spearman test is that if the ranks of the risk and desk P&Ls are highly correlated, then that is evidence that the risk models are not too far off from the desk models. One major problem with this test, which will be discussed in more detail later, is that it is not clear how to set the traffic light thresholds in an objective manner.
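For readers who want to reproduce the arithmetic, the sketch below computes $r_s$ for ten days of made-up P&Ls (illustrative values, not the figures in Table 1), both from the rank-difference formula and via `scipy.stats.spearmanr` as a cross-check.

```python
import numpy as np
from scipy.stats import spearmanr, rankdata

# Ten days of hypothetical P&Ls (illustrative values only; not the
# figures from Table 1).
desk_pl = np.array([12.0, -5.0, 3.0, -8.0, 20.0, -1.0, 7.0, -15.0,  9.0, 2.0])
risk_pl = np.array([10.0, -6.0, 5.0, -7.0, 18.0, -3.0, 4.0, -12.0, 11.0, 1.0])

# Rank each series (1 = lowest P&L), then apply the formula
# r_s = 1 - 6 * sum(d_i^2) / (n * (n^2 - 1)).
d = rankdata(desk_pl) - rankdata(risk_pl)
n = len(desk_pl)
r_s = 1 - 6 * np.sum(d**2) / (n * (n**2 - 1))

# Cross-check against SciPy's implementation.
r_scipy, _ = spearmanr(desk_pl, risk_pl)
print(f"formula: {r_s:.4f}, scipy: {r_scipy:.4f}")  # the two should agree
```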

PLA Test 2: The Kolmogorov-Smirnov Test

The Kolmogorov-Smirnov (KS) test assesses the similarity of the frequency of daily P&Ls over a 250-day period. Since the KS calculations are straightforward but tedious, we will illustrate how the test works using a chart of the full 250 days of hypothetical P&L data.

Chart 1

[Chart 1: Empirical frequency distributions of 250 days of hypothetical risk and desk P&Ls]

Chart 1 depicts 250 days of hypothetical P&Ls produced by the risk and desk models. To perform the KS test, we sorted and graphed the data so that for each P&L on the x-axis, the percentage of P&L values below that value is recorded on the y-axis. For example, at a P&L of zero, about 52 percent of both the desk and risk P&Ls are lower than zero and about 48 percent are higher than zero. At a P&L of $120 million, about 87 percent of the desk P&Ls are lower than $120 million whereas about 92 percent of the risk P&Ls are lower.

The KS test statistic measures the maximum difference between the two empirical frequency distributions of P&Ls, depicted by the green line on the graph. The maximum difference occurs at a P&L of -$47.3 million. In this case, about 28 percent of the desk P&Ls are lower while about 35.6 percent of the risk P&Ls are lower. The KS statistic for this series of data is the difference: 35.6 percent – 28 percent = 7.6 percent. We then use the table provided in the proposal[5] to determine the zone in which the KS statistic falls. Table 5 shows the KS thresholds.

Table 5

[Table 5: Traffic light zones for the KS statistic]

Since the KS statistic in this example is less than 0.09, or 9 percent, we would pass the test.
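The KS mechanics can likewise be reproduced in a few lines. The sketch below builds the two empirical frequency distributions from simulated stand-in P&Ls (not the data behind Chart 1), takes the maximum gap between them, and cross-checks the result against `scipy.stats.ks_2samp`.

```python
import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(0)
# 250 days of simulated stand-in P&Ls (not the data behind Chart 1).
desk_pl = rng.normal(0.0, 50.0, 250)
risk_pl = rng.normal(0.0, 50.0, 250)

# The KS statistic is the largest gap between the two empirical frequency
# distributions, evaluated over the pooled set of observed P&L values.
pooled = np.sort(np.concatenate([desk_pl, risk_pl]))
ecdf_desk = np.searchsorted(np.sort(desk_pl), pooled, side="right") / len(desk_pl)
ecdf_risk = np.searchsorted(np.sort(risk_pl), pooled, side="right") / len(risk_pl)
ks_manual = np.max(np.abs(ecdf_desk - ecdf_risk))

ks_scipy = ks_2samp(desk_pl, risk_pl).statistic  # cross-check
print(f"manual: {ks_manual:.4f}, scipy: {ks_scipy:.4f}")
```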

It is interesting to note that if we applied the Spearman test to the P&L data in Chart 1, we would get a Spearman correlation coefficient of only 0.07, implying just a 7 percent correlation between the ranks of the two P&L series. This large Spearman test failure would put us in the red zone under Table 4 above. The tests disagree because they test different properties of the P&Ls. The hypothetical P&L data for the KS example was generated using two identical but uncorrelated frequency distributions. Since the frequency distributions were the same, the KS test passed. The Spearman test failed because it tests whether the P&L series are highly correlated, and the P&Ls were constructed to be uncorrelated in this example.
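This divergence between the two tests is easy to replicate. Below is a minimal simulation in the spirit of the construction just described, drawing two independent 250-day series from the same hypothetical distribution.

```python
import numpy as np
from scipy.stats import ks_2samp, spearmanr

rng = np.random.default_rng(42)
# Two independent series drawn from the SAME hypothetical distribution,
# mimicking the construction behind Chart 1.
desk_pl = rng.normal(0.0, 50.0, 250)
risk_pl = rng.normal(0.0, 50.0, 250)

ks = ks_2samp(desk_pl, risk_pl).statistic  # small: the distributions match
rho, _ = spearmanr(desk_pl, risk_pl)       # near zero: the series are uncorrelated
print(f"KS = {ks:.3f} (passes), Spearman = {rho:.3f} (fails badly)")
```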

What Can Go Wrong With These Tests?

For unhedged portfolios, the PLA tests can be effective. However, the intrinsic flaw in the PLA tests is that when portfolios are hedged, as is typical for market-making banks, the tests can easily fail for spurious statistical reasons. In contrast to the unhedged example in Table 1, a hedged portfolio’s P&L fluctuates randomly around a small or zero value from day to day. Hedged desk and risk P&Ls can therefore look like minimally correlated random processes to the tests. The random fluctuations of the desk and risk P&Ls will not necessarily be highly correlated nor will they necessarily be the same, causing the tests to fail.

For example, risk departments might use vendor data for better consistency of coverage across desks, whereas trading desks might use their own trading observations, a source of random P&L variation. Data errors can also create additional randomness in the risk P&L. Sometimes, risk departments may collect data under naming conventions that differ from those used by desk systems. Importantly, risk departments may, for practical reasons, use a smaller number of risk factors or computationally less intensive pricing models, since they need to complete the entire daily risk calculation in one day. The need to complete an enormous amount of computation within a 24-hour period can sometimes require risk models to price trades earlier in the day than desk models, implying that the market risk factor inputs the risk models use are observed earlier as well. Each of these factors may be small in magnitude by itself, but when added to a hedged P&L that is small or zero on average, together they can make the risk and desk P&L series appear to a statistical test to be poorly correlated random processes. Not surprisingly, the Spearman test will often fail under these circumstances. Perversely, the better a bank hedges its market risk, the more likely the Spearman test is to fail.
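A simple simulation illustrates the effect. In the sketch below (all parameters are hypothetical), both models observe the same directional P&L plus their own small, independent noise from the data and model differences just described; as hedging removes more of the directional component, the Spearman correlation collapses.

```python
import numpy as np
from scipy.stats import spearmanr

rng = np.random.default_rng(1)
common = rng.normal(0.0, 1.0, 250)       # directional P&L both models capture
noise_desk = rng.normal(0.0, 1.0, 250)   # desk-specific noise (data, timing, ...)
noise_risk = rng.normal(0.0, 1.0, 250)   # risk-model-specific noise

# hedge_ratio = fraction of the directional risk removed by hedging.
for hedge_ratio in [0.0, 0.9, 0.99]:
    scale = 1.0 - hedge_ratio            # surviving directional exposure
    desk_pl = 10.0 * scale * common + noise_desk
    risk_pl = 10.0 * scale * common + noise_risk
    rho, _ = spearmanr(desk_pl, risk_pl)
    print(f"hedged {hedge_ratio:>4.0%}: Spearman = {rho:.2f}")
# Output pattern: near 1.0 unhedged, falling toward 0 as hedging improves.
```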

Although not as obvious, the KS test suffers from a related difficulty produced by hedged portfolios. Given all the factors introducing random variation in the hedged P&Ls, there is no reason the random variation should be exactly the same for the risk and desk models. Since the KS test measures whether the P&L frequency distributions of the risk and desk models are identical, even small differences can cause the KS test to fail. For example, suppose we create hypothetical P&L distributions in which the desk model has slightly smaller random variations in its hedged P&L than does the hedged risk P&L.[6] For this example, we assume the typical P&L variation in the risk model is $1 million while the typical variation in the desk model is $750K. Chart 2 shows that the KS test fails with a statistic of 13.2 percent, the maximum difference having occurred at a P&L of -$700K.

Chart 2

[Chart 2: Empirical frequency distributions of hedged desk and risk P&Ls with slightly different standard deviations]

It is important to observe that only a very small difference between the two P&L frequency distributions was required to produce a KS test failure. When portfolios are hedged, even slightly more randomness in the risk P&Ls can get amplified into distributions that are different enough for the KS test to fail, even if the differences are not economically material on an absolute basis. In a real-world setting, the effective false positive rate is too high.

Statistical tests, such as the KS test, also have a false negative rate. If the false negative rate is too high, especially when the test is distinguishing differences that are not economically material, then a statistical test can treat banks with the same underlying risks differently. To see this problem in the context of the KS test, if we simulate 1,000 desk and risk P&Ls with the hypothetical frequency distributions above, we find that the KS test fails 43 percent of the time, implying the KS test produces a false negative 57 percent of the time. The underlying risk and desk P&L distributions in this experiment were statistically different but the differences were not economically material. Thus, whether the KS test fails because of small differences between the desk and risk P&L distributions is just the luck of the draw. Bank A and Bank B could have the same fundamental desk and risk P&L distributions. Bank A could have bad luck, fail the KS test and be forced to pay the standardized capital penalty. Bank B, on the other hand, could have good luck, pass the KS test and be allowed to continue to use its internal models.
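This experiment is easy to replicate as a sketch under the assumptions of footnote 6 (zero-mean normal P&Ls with a $750K desk standard deviation and a $1 million risk standard deviation).

```python
import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(7)
N_TRIALS, N_DAYS, KS_RED = 1_000, 250, 0.12  # 0.12 = proposal's red-zone cutoff

fails = 0
for _ in range(N_TRIALS):
    desk_pl = rng.normal(0.0, 0.75, N_DAYS)  # desk std dev $750K (in $MM)
    risk_pl = rng.normal(0.0, 1.00, N_DAYS)  # risk std dev $1MM
    # Count how often the KS statistic lands in the red zone.
    if ks_2samp(desk_pl, risk_pl).statistic > KS_RED:
        fails += 1

# The note reports a failure rate of about 43 percent for this experiment.
print(f"KS red-zone failure rate: {fails / N_TRIALS:.0%}")
```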

Can’t Banks Fix These Problems By Getting Better Data and Model Alignment Between Risk and Desk Models?

It might be argued that the tests are doing what they are supposed to be doing in these examples. The tests should fail when data or models are misaligned. The higher capital that results from the test failures will serve as a powerful incentive for banks to fix their problems.

However, this objection misunderstands the inherent problem in PLA tests. The tests do not account for the magnitudes of the random variations of a hedged portfolio around zero. The tests only look at the nature of the randomness of the P&L series. A bank could spend significant time and expense reducing data discrepancies between the desk and risk models, aligning pricing models, and reducing basis risks, but as long as some random variation remains, as it always will, the tests will fail in exactly the same way. Furthermore, these efforts would do nothing to reduce the actual risk faced by the bank, and therefore should not influence the bank’s ultimate capital charge, even if they were successful in improving performance under the P&L tests. Indeed, these efforts could increase the risks of banks by diverting resources and attention away from more significant risks.

To see how this could happen, suppose we take the KS example discussed previously but assume the bank conducted a massive overhaul of data and models, reducing data and model discrepancies significantly so that now the random differences from the average zero P&Ls are one tenth of the size assumed in the previous example.[7] Will the KS test now pass?

Chart 3

[Chart 3: Empirical frequency distributions after scaling both P&L series down by a factor of 10]

Chart 3 shows that the KS test would produce the same result as in Chart 2, a KS statistic of 13.2 percent. A comparison of Charts 2 and 3 reveals that they are the same chart; the only difference is that Chart 3’s numbers are scaled down by a factor of 10. The KS test still fails because it does not consider the absolute magnitudes of the P&Ls but rather measures the relationship between the relative frequencies of the P&Ls, which have not changed.

The same phenomenon holds true for the Spearman test. Reducing the magnitudes of the desk and risk P&L random variation by getting better data and model alignment will not necessarily change the relative rankings of the P&Ls in the Spearman test. As long as the random variation in the risk and desk P&Ls continues to look minimally correlated, the Spearman test will fail, regardless of any improvements a bank might make.
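Both statistics depend only on ranks and relative frequencies, so multiplying every P&L by the same constant leaves them exactly unchanged. A quick check on simulated stand-in data confirms this:

```python
import numpy as np
from scipy.stats import ks_2samp, spearmanr

rng = np.random.default_rng(3)
desk_pl = rng.normal(0.0, 0.75, 250)   # hypothetical hedged desk P&L ($MM)
risk_pl = rng.normal(0.0, 1.00, 250)   # hypothetical hedged risk P&L ($MM)

for scale in [1.0, 0.1]:               # 0.1 = the "massive overhaul" of Chart 3
    ks = ks_2samp(scale * desk_pl, scale * risk_pl).statistic
    rho, _ = spearmanr(scale * desk_pl, scale * risk_pl)
    print(f"scale {scale}: KS = {ks:.4f}, Spearman = {rho:.4f}")
# Both lines print identical statistics: the tests are scale-invariant.
```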

Where Do the Test Thresholds Come From?

Table 6 puts together the proposed traffic light thresholds for the Spearman and KS tests into one table for easy comparison.

Table 6

[Table 6: Proposed traffic light thresholds for the Spearman and KS tests]

One essential difference between the KS and Spearman tests is that the KS metric is a statistical test of an underlying hypothesis that the two P&L distributions are the same.[8] Using a KS threshold of 0.12 means that if we ran an experiment in which we simulated identical risk and desk P&L distributions over and over, checking each time whether the KS statistic was greater than 0.12, we would find that the KS statistic exceeded 0.12 about 5.5 percent of the time. Thus, we would falsely conclude the P&L distributions were different 5.5 percent of the time when they were in fact identical. The p-value in the table can thus be thought of as a false positive rate. Intuitively, if the KS statistic is greater than 12 percent, we have good statistical evidence that the P&L distributions really are different.
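This interpretation can be checked by simulation (a sketch; the 5.5 percent figure is the proposal’s implied p-value for the 0.12 threshold).

```python
import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(11)
N_TRIALS, N_DAYS = 5_000, 250

# Draw the desk and risk P&Ls from the SAME distribution and count how
# often the KS statistic nonetheless exceeds the 0.12 red-zone threshold.
false_positives = sum(
    ks_2samp(rng.normal(size=N_DAYS), rng.normal(size=N_DAYS)).statistic > 0.12
    for _ in range(N_TRIALS)
)
print(f"False positive rate: {false_positives / N_TRIALS:.1%}")  # about 5.5%
```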

In the academic literature, a p-value of 10 percent is often taken to indicate near rejection of a hypothesis, while a p-value of 5 percent indicates full rejection. In contrast with Table 6, a KS traffic light table consistent with the typical standards in the academic literature would be:

Table 7

[Table 7: KS traffic light thresholds at conventional academic p-values]

The proposal’s motivation for using a p-value of 26.4 percent rather than 10 percent to separate the green zone from the amber zone is unclear. One way to select the thresholds would be to balance the false positive and false negative rates of the KS test on realistic hedged and unhedged P&L distributions that would be encountered in practice. The proposal provides no indication that such a calibration exercise was attempted, and it may not even be possible: as we saw earlier, a well-hedged portfolio could easily create a fairly large false negative rate, producing inconsistent capital treatment of banks. Whatever the reasons for the choice of the oddly precise p-value of 26.4 percent, one motivation seems to have been to produce simple thresholds. A p-value of 26.4 percent results in a threshold of 0.0899989, or approximately 9 percent. Similarly, a p-value of 5.5 percent results in a threshold of 0.11989, or approximately 12 percent. If the proposal had used the standard 5 percent p-value, the threshold would have been 0.1215, or 12.15 percent, not as concise a number.
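The arithmetic behind these thresholds can be reproduced with the standard large-sample approximation for the two-sample KS critical value, $D_{\mathrm{crit}} = \sqrt{-\ln(\alpha/2)/2} \cdot \sqrt{(n+m)/(nm)}$, with $n = m = 250$ trading days.

```python
import numpy as np

def ks_critical_value(alpha: float, n: int = 250, m: int = 250) -> float:
    """Large-sample two-sample KS critical value at significance level alpha."""
    c_alpha = np.sqrt(-np.log(alpha / 2.0) / 2.0)
    return c_alpha * np.sqrt((n + m) / (n * m))

for alpha in [0.264, 0.055, 0.05]:
    print(f"p-value {alpha:>5.1%} -> KS threshold {ks_critical_value(alpha):.5f}")
# p-value 26.4% -> KS threshold 0.09000  (the proposal's 0.0899989)
# p-value  5.5% -> KS threshold 0.11989
# p-value  5.0% -> KS threshold 0.12147  (the 0.1215 cited above)
```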

The Spearman test statistic, in contrast, does not represent a statistical hypothesis. It is thus not clear how to set the thresholds in such a test. From a statistical point of view, the chosen thresholds for the various zones in the Spearman test appear to be intrinsically arbitrary. The KS test does have the essential advantage that a statistical interpretation can be attached to the thresholds, unlike the Spearman test, and so in principle the KS test threshold calibration might be supportable. However, as presented in the proposal, no such calibration was used to justify the thresholds for the KS test, which appear to be defined arbitrarily as well.

Conclusion

To summarize, the problems with the PLA tests are:

  • The PLA tests tend to fail more frequently when a bank improves its market risk management by hedging more effectively
  • Misleading PLA test failures on hedged portfolios cannot be fixed by banks better aligning data and models
  • PLA test failures can be caused by economically small changes in P&L distributions
  • PLA test failures can occur randomly, implying that banks with the same risks do not necessarily get the same capital treatment
  • The proposed PLA test thresholds are arbitrary, although the KS thresholds have the benefit of a statistical interpretation, allowing non-arbitrary threshold calibration in principle

As proposed, the PLA tests suffer from very significant deficiencies in the validation of risk models for hedged portfolios that market-making financial institutions typically hold. The tests produce perverse incentives for banks to reduce their hedging, since the tests fail more frequently the better a bank hedges, penalizing better market risk management with higher capital charges. As a result, the tests should not be used to validate internal bank risk models for capital purposes. Nonetheless, PLA tests should not be dropped either because they may have some value to detect problems in banks’ risk models if portfolio risk is directional. To use PLA tests more effectively, the agencies should dispense with the traffic light system and repurpose the tests to serve as additional reporting and diagnostic tools that could be used to assess the effectiveness of bank risk models. Because it is unclear how to calibrate the Spearman thresholds for diagnostic purposes, the agencies could also consider eliminating that test.


[1]  “Regulatory capital rule: Amendments applicable to large banking organizations and to banking organizations with significant trading activity,” available at https://www.govinfo.gov/content/pkg/FR-2023-09-18/pdf/2023-19200.pdf

[2] The Basel proposal states that $r_s$ should be computed as $\mathrm{Cov}(\mathrm{Rank}_{\mathrm{desk}}, \mathrm{Rank}_{\mathrm{risk}}) / \left( \sigma(\mathrm{Rank}_{\mathrm{desk}}) \, \sigma(\mathrm{Rank}_{\mathrm{risk}}) \right)$, where $\mathrm{Cov}(\cdot)$ is the covariance of the ranks and $\sigma(\cdot)$ is the standard deviation of the ranks. The formula in the text above is equivalent if there are no ties in the ranking process. We use the formula in the text rather than the formula in the proposal because it is simpler to show example calculations.

[3] The mathematically minded reader can verify this assertion by noting that for even $n$ and perfect negative correlation, $\sum_{i=1}^{n} d_i^2 = \frac{1}{3}(n-1)^3 + (n-1)^2 + \frac{2}{3}(n-1)$, so $r_s \to -1$ as $n \to \infty$.

[4] Basel proposal at 64270

[5] Basel proposal at 64270

[6] We create the hypothetical P&L distributions by assuming they are drawn from a normal distribution with a mean of zero. The desk P&L standard deviation is $750K while the risk P&L standard deviation is $1 million.

[7] The hypothetical overhauled P&L distributions would be drawn from a normal distribution with a mean of zero. The desk P&L standard deviation would be $75K while the risk P&L standard deviation would be $100K. To create an apples-to-apples comparison, we scale the previous hypothetical P&Ls by 1/10 rather than simulate new P&L distributions.

[8] A KS statistic greater than 0.12 formally rejects the hypothesis that the risk and desk P&Ls come from the same distribution at a p-value of 5.5 percent.