How to Define the Scope of an Internal Model for Credit Risk

The world is going through a profound change driven by technological progress. These changes affect all sectors, especially the banking industry. Data professionals must adapt quickly to remain efficient, productive, and competitive.
For experienced professionals with a strong foundation in mathematics, statistics, and practice, this transition can be natural. However, it can be very challenging for beginners who have not yet mastered these basic skills.
In the area of credit risk, the development of these skills requires a clear understanding of the bank's exposures and the methods used to manage the associated risks.
My next articles will focus more on managing credit risk within a regulatory framework. The European Central Bank (ECB) allows banks to use internal models to assess the credit risk of their different exposures. These exposures can include loans granted to companies to finance long-term projects or loans granted to households to finance real estate projects.
These models aim to measure several important parameters:
- PD (Probability of Default): the probability that the borrower will not be able to meet their payment obligations.
- EAD (Exposure at Default): the amount of the exposure at the time of default.
- LGD (Loss Given Default): the magnitude of the loss in the event of default.
So we can distinguish between PD models, EAD models, and LGD models. In this series, I will focus mainly on PD models. These models are used to assign ratings to borrowers and contribute to the calculation of regulatory capital requirements, which protect banks from unexpected losses.
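To see how these three parameters fit together, here is a minimal sketch that computes the expected loss EL = PD x LGD x EAD for a few exposures. The column names and figures are illustrative assumptions, not data from this article.

```python
import pandas as pd

# Hypothetical exposures: names and figures are illustrative only.
exposures = pd.DataFrame({
    "counterparty_id": [1, 2, 3],
    "pd_1y": [0.02, 0.005, 0.10],         # one-year probability of default
    "lgd": [0.45, 0.40, 0.60],            # loss given default (share of EAD lost)
    "ead": [1_000_000, 250_000, 50_000],  # exposure at default, in EUR
})

# Expected loss combines the three parameters: EL = PD x LGD x EAD.
exposures["expected_loss"] = exposures["pd_1y"] * exposures["lgd"] * exposures["ead"]
print(exposures)
```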
In this first article, I will focus on defining and constructing the modeling scope.
Definition of default
Building a model requires a clear understanding of the modeling objective and an accurate definition of default. Assessing a counterparty's probability of default involves looking at the transition from a healthy (performing) state to a default state over a given horizon h. In the following, we will assume that this horizon is set to one year (h = 1).
The definition of default was harmonized and brought under regulatory oversight following the 2008 financial crisis. The aim was to establish a standard definition applicable to all banking institutions.
This definition is based on several criteria, including:
- a significant deterioration in the counterparty's financial condition,
- the existence of past-due amounts (arrears),
- forbearance measures,
- contagion effects within the group of connected exposures.
Historically, there was an earlier definition, known as the Old Definition of Default (ODOD), which gradually evolved into the current New Definition of Default (NDOD).
For example, a counterparty is considered to be in default when the debtor is more than 90 days past due on a material credit obligation.
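As a minimal illustration of this criterion, the snippet below flags a counterparty as defaulted when it is more than 90 days past due on a material obligation. The field names are hypothetical, and a real implementation would combine all the criteria listed above.

```python
import pandas as pd

# Hypothetical repayment snapshot: column names are illustrative assumptions.
obligations = pd.DataFrame({
    "counterparty_id": [101, 102, 103],
    "days_past_due": [0, 95, 30],
    "obligation_is_material": [True, True, True],
})

# Flag a default under the 90-days-past-due criterion.
obligations["in_default"] = (
    obligations["obligation_is_material"] & (obligations["days_past_due"] > 90)
)
print(obligations[["counterparty_id", "in_default"]])
```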
Once the definition of default is clearly established, the institution can apply it to all its clients. It may then face a potentially diverse portfolio made up of large corporations, small and medium-sized enterprises (SMEs), retail clients, and private organizations.
To manage risk effectively, it is important to identify these different categories and build homogeneous sub-portfolios. This segmentation then allows each sub-portfolio to be modeled in a more efficient and accurate manner.
Definition of filters
Defining filters makes it possible to determine the scope of modeling and keep only homogeneous counterparties for analysis. Filters are variables used to delimit this scope.
These variables can be identified by statistical methods, such as clustering techniques, or specified by subject matter experts based on business knowledge.
For example, if the focus is on large companies, turnover can serve as an appropriate size variable to establish a threshold. One can choose to include only counterparties with an annual turnover of more than €30 million.
Additional variables can be used to further characterize this segment, such as industry sector, geographic region, financial ratios, or ESG indicators.
Another possible modeling scope would be to focus specifically on retail customers who have taken out loans to finance personal projects. In this case, income can be used as a filter, while other relevant variables may include employment status, type of collateral, and type of loan.
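As a sketch of how such filters might be applied in practice, assuming a pandas DataFrame with hypothetical column names and the €30 million threshold mentioned above, one could write:

```python
import pandas as pd

# Hypothetical portfolio snapshot; column names are assumptions for illustration.
portfolio = pd.DataFrame({
    "counterparty_id": [1, 2, 3, 4],
    "segment": ["corporate", "corporate", "retail", "retail"],
    "annual_turnover_eur": [45e6, 12e6, None, None],
    "annual_income_eur": [None, None, 35_000, 80_000],
})

# Large-corporate scope: keep counterparties with turnover above EUR 30 million.
large_corporates = portfolio[
    (portfolio["segment"] == "corporate")
    & (portfolio["annual_turnover_eur"] > 30e6)
]
print(large_corporates)

# A retail scope would be delimited analogously, e.g. using annual income.
```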
Once the objective is clearly defined, the default definition is well defined, and the scope is well organized with appropriate filters, building a modeling dataset becomes a natural next step.
Modeling Dataset Construction
Since the objective is to predict the probability of default over a one-year horizon, for each year (N) we must keep all healthy counterparties, meaning those that have not defaulted at any time during year (N) (from 01/01/N to 12/31/N).
On December 31 of year N, the characteristics of these performing counterparties are observed and recorded. For example, if we focus on businesses, then as of 12/31/N, the values of the following variables are collected for each company: turnover, industry sector, and financial ratios.
For each of these counterparties, we then construct a binary variable by looking at year (N+1). The variable takes the value 1 if the counterparty defaults at least once during year (N+1), and 0 otherwise.
This is the target variable of the model, denoted Y (the default flag). The chart below illustrates the process described above.
In summary, for each fixed year (N), we obtain a rectangular dataset where:
- Each row corresponds to a counterparty that was healthy as of 12/31/N,
- The columns include all explanatory variables measured on that date, denoted (Xi) for counterparty (i),
- The last column corresponds to the target variable (Yi), indicating whether counterparty (i) defaults at least once during year (N+1) (1) or not (0).
For example, if (N = 2015), the explanatory variables are measured as of 12/31/2015, and the target variable is observed during the year 2016.
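A minimal sketch of this construction is shown below. It assumes we already have a table of yearly counterparty snapshots and a table of default events with their dates; all column names (snapshot_year, default_date, counterparty_id) are hypothetical.

```python
import pandas as pd

def build_cohort(snapshots: pd.DataFrame, defaults: pd.DataFrame, year: int) -> pd.DataFrame:
    """Build the modeling dataset for cohort year N.

    `snapshots` holds explanatory variables observed as of 12/31/N (one row per
    counterparty and year); `defaults` holds one row per default event, with
    `default_date` as a datetime column. Column names are assumptions.
    """
    # Counterparties that defaulted at least once during year N are excluded.
    defaulted_in_n = defaults.loc[
        defaults["default_date"].dt.year == year, "counterparty_id"
    ].unique()

    # Keep only counterparties that stayed healthy throughout year N.
    cohort = snapshots[
        (snapshots["snapshot_year"] == year)
        & (~snapshots["counterparty_id"].isin(defaulted_in_n))
    ].copy()

    # Target Y: 1 if the counterparty defaults at least once during year N+1.
    defaulted_in_n1 = set(
        defaults.loc[defaults["default_date"].dt.year == year + 1, "counterparty_id"]
    )
    cohort["Y"] = cohort["counterparty_id"].isin(defaulted_in_n1).astype(int)
    cohort["cohort_year"] = year
    return cohort
```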
The regulator requires modeling datasets to be built using at least five years of historical data in order to capture different economic cycles. Since the models are calibrated over time, the regulator also requires these models to be Through-the-Cycle (TTC), which means they must be relatively insensitive to short-term macroeconomic fluctuations.
Let's say we have client data spanning six years, from 01/01/2015 to 12/31/2020. Using the procedure described above for each year (N) between 2015 and 2019, five consecutive data sets can be constructed.
The first dataset, corresponding to the year 2015, includes all companies that remained healthy from 01/01/2015 to 12/31/2015. Their explanatory variables (X1,…,Xk) are measured as of 12/31/2015, while the default variable (Y) is observed during the year 2016. It takes the value 1 if the counterparty defaults at least once during 2016, and 0 otherwise.
The same process is repeated for subsequent years up to the 2019 dataset. This final dataset includes all counterparties that remained healthy from 01/01/2019 to 12/31/2019. Their explanatory variables (X1,…,Xk) are measured as of 12/31/2019, and the default variable (Y) is observed in 2020. It takes the value 1 if the counterparty defaults at any time during 2020, and 0 otherwise.
The final modeling scope corresponds to a direct concatenation of all datasets generated as of 12/31/N. In our example, N ranges from 2015 to 2019. The resulting dataset can be represented by the rectangular table below.

Each statistical observation is identified by a pair consisting of an identifier and the year (ID x year) in which the explanatory variables were measured (as of 12/31/N), and the number of rows equals the number of observations.
For example, a counterparty with identifier (ID = 1) may appear in both 2015 and 2018. This corresponds to two separate and independent observations in the dataset, denoted respectively by the pairs (1 x 2015) and (1 x 2018).
This method offers several advantages. In particular, it prevents temporal overlap between cohorts and reduces autocorrelation between observations, as each record is uniquely identified by its (ID x year) pair.
In addition, it increases the chances of building a more robust and representative dataset. By pooling observations over many years, the number of default events becomes large enough to support reliable model estimation. This is especially important when analyzing portfolios of large companies, where default events are often rare.
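Continuing the earlier sketch, the yearly cohorts can then be stacked into the final modeling scope, with each observation keyed by its (ID x year) pair. This reuses the hypothetical build_cohort helper and column names assumed above.

```python
# Stack the yearly cohorts (2015-2019) into the final modeling scope,
# using the hypothetical build_cohort helper from the earlier sketch.
cohorts = [build_cohort(snapshots, defaults, year) for year in range(2015, 2020)]
modeling_scope = pd.concat(cohorts, ignore_index=True)

# Each observation should be uniquely identified by its (ID x year) pair.
assert not modeling_scope.duplicated(subset=["counterparty_id", "cohort_year"]).any()
```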
Finally, the financial institution must implement appropriate organizational measures to ensure effective data management and security throughout the data life cycle. To this end, the ECB requires financial entities to comply with common regulatory standards, such as the Digital Operational Resilience Act (DORA).
Institutions should establish a comprehensive strategic framework for information security management, as well as a dedicated data protection framework that directly covers the data used in internal models.
Furthermore, human supervision should always be central to these processes. Procedures should therefore be carefully documented, and clear guidelines should be developed that explain how and when human judgment should be used.
Conclusion
Defining the model development and application scope, and documenting it accordingly, are key steps to reduce model risk, not only in the design phase but throughout the model's life cycle.
The main goal is to ensure that the development scope is representative of the intended portfolio and, when necessary, to clearly identify any extensions, restrictions, or deviations made when using the model compared to its original design.
Preparing a standard document that clearly defines the variables used to establish the scope is considered good practice. At a minimum, the following information should be readily visible: the technical name of the variable, its format, and its source.
In my next article, I will use a credit risk data set to demonstrate how to predict default probabilities for various counterparties. I will describe the steps necessary to properly understand the available dataset and, where possible, explain how to handle and process different variables.
References
European Central Bank. (2025). Supervisory Guide: Guide to the SSM Supervisory Review and Evaluation Process (SREP). European Central Bank.
Photo Credits
All images and visualizations in this article were created by the author using Python (pandas, matplotlib, seaborn, and plotly) and Excel, unless otherwise noted.
Disclaimer
I write to learn, so mistakes are common, although I try my best. Please let me know if you see anything. And I'm open to any suggestions for new articles!



