Using Transformers to Predict Incredibly Rare Solar Flares

Introduction: The X45 Flare
Everything about prediction changes when the event we are trying to predict is extremely rare. We have to fundamentally rethink what we are modeling to focus on tail events. From performance metrics and target definition to tail modeling and transformer output heads, predicting abnormal events is difficult. It's hard, but worth it.
The Halloween storms of 2003 began as disturbances on the Sun: dark sunspot groups that grew into one of the most intense space weather events of the satellite era. From late October to early November, a series of large active regions rotated across the solar disk, releasing powerful flares and hurling clouds of magnetized plasma toward Earth. The storms also painted spectacular auroras at unusually low latitudes.
Satellites malfunctioned, GPS and radio communications were disrupted, and airlines rerouted flights away from the affected polar routes. According to NOAA, power grids around the world were affected: geomagnetically induced currents exceeded 100 amps in places, and in Sweden this triggered the Malmö blackout. At 20:07 UT, the outage hit the region, leaving an estimated 50,000 customers without power for 20 to 50 minutes.
Photo credit: NASA / Solar Dynamics Observatory (SDO) / AIA. Public domain
The grand finale of the event saturated the X-ray sensors on the GOES satellites, so the flare's true magnitude can only be estimated by reconstruction. It is often called X45, after its estimated magnitude: 450 times more intense than an M1, a medium-sized flare. The table below shows the flare classification scale, sometimes called the "Richter scale" of solar flares.

The Prediction Problem
The ironic thing about disasters is that the more catastrophic they are, the rarer they tend to be. Think floods, snowstorms, and avalanches: a fifty-year event happens, on average, once every fifty years. This is usually a good thing, but their rarity also makes them very difficult to predict.
There are several things that make predicting rare events a very interesting challenge in machine learning:
- Our evaluation metrics must change
- Features need to be built from magnetic data
- Model the tail to capture rare events
- Combine the tail model with the full distribution model using a transformer
Take accuracy, normally a reasonable metric for binary classification. If only 100 out of 10,000 predictions correspond to major flares, a model that always predicts "no flare" misses every single flare and still scores 99% accuracy:
Accuracy = (10,000 − 100) / 10,000 = 9,900 / 10,000 = 0.99 = 99%
Recall = 0
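The arithmetic above can be checked in a few lines of Python. The counts are the hypothetical ones from the example: 100 major flares out of 10,000 predictions, and a model that never predicts a flare.

```python
# A trivial "always negative" classifier on an imbalanced flare dataset.
n_days, n_flares = 10_000, 100

y_true = [1] * n_flares + [0] * (n_days - n_flares)
y_pred = [0] * n_days  # the model never predicts a flare

accuracy = sum(int(t == p) for t, p in zip(y_true, y_pred)) / n_days
recall = sum(int(t == 1 and p == 1) for t, p in zip(y_true, y_pred)) / n_flares

print(accuracy)  # 0.99 -- looks great
print(recall)    # 0.0  -- yet it misses every single flare
```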
The Data
Curiously, almost everything we measure about solar flares comes from a completely different layer of the Sun than the one where flares actually occur. Our data comes from the photosphere, the lowest visible layer of the Sun, while flares erupt in the chromosphere and corona above it. The data is collected by the Solar Dynamics Observatory (SDO), a NASA spacecraft that constantly watches the Sun to monitor its activity, using its Helioseismic and Magnetic Imager (HMI) instrument.

Model Input
Fortunately, thanks to NASA, the satellite has already been designed, built, launched, and commissioned, so we can focus on our modeling. The HMI vector magnetogram measures the magnetic field vector B across the visible disk. Magnetograms come in two flavors: line-of-sight maps, which capture only the component of B along our viewing direction, and full vector maps.

From this point on, the Space-Weather HMI Active Region Patches (SHARPs) pipeline does two things:
- Localization: selecting the active regions on the Sun where flares are most likely to originate
- Feature engineering: computing the magnetic summary parameters that best describe each region's structure
The important lesson here is that, to cope with how rare the event we are trying to predict is, we focus data collection on the regions where it is most likely to occur. From the raw magnetic field measurements we compute derived properties such as the total unsigned magnetic flux, the total unsigned current helicity, and mean gradients of the field:



Our input data thus becomes a function of time and of the engineered features:

If our model uses the last 24 hours of data, sampled hourly, with 9 engineered features, each input sample is a 24 × 9 matrix.
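As a concrete sketch, here is what that input looks like as an array. The dimensions (24-hour window, 9 features, batch of 32) are the hypothetical ones from the text, and the random values stand in for real SHARP parameters:

```python
import numpy as np

# Hypothetical dimensions: a 24-hour window sampled hourly,
# with 9 engineered SHARP-style features per time step.
T, F = 24, 9

rng = np.random.default_rng(0)
# One input sample: rows are hourly time steps, columns are features.
x = rng.normal(size=(T, F))

# A training batch of N active-region windows has shape (N, T, F),
# the (batch, sequence, features) layout transformers expect.
batch = rng.normal(size=(32, T, F))

print(x.shape)      # (24, 9)
print(batch.shape)  # (32, 24, 9)
```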

Model Target
We can now make our target precise. We define it as the probability of observing an M1-class or larger event in the next 24 hours, given the magnetic history, where the magnetic history is the full input window described above.


There are, however, many subtle design decisions baked into this definition, which the following table makes explicit.
Note that there are many choices involved in defining the target, and this makes comparing different models difficult. It is also important to realize that using more history is not automatically better: events far in the past are usually weaker predictors of the future, so a longer training window can worsen the signal-to-noise ratio of your training data.
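One way to make the target concrete is a small labeling helper. This is a sketch under stated assumptions: `make_labels` and `M1_FLUX` are hypothetical names, the series is hourly, and M1 is taken at its GOES definition of 1e-5 W/m² peak soft X-ray flux:

```python
import numpy as np

M1_FLUX = 1e-5  # W/m^2: GOES soft X-ray threshold for an M1-class flare

def make_labels(peak_flux: np.ndarray, horizon: int = 24) -> np.ndarray:
    """Binary target: does an M1+ event occur within the next `horizon` steps?

    peak_flux[t] is the peak soft X-ray flux observed at hour t. The label
    at hour t looks strictly *forward*, never at the current step, to avoid
    leaking the event into its own prediction.
    """
    n = len(peak_flux)
    labels = np.zeros(n, dtype=np.int64)
    for t in range(n):
        future = peak_flux[t + 1 : t + 1 + horizon]
        labels[t] = int(future.size > 0 and future.max() >= M1_FLUX)
    return labels

# Tiny example: a single M1.5 flare at hour 30 of a 48-hour series.
flux = np.full(48, 1e-7)   # quiet background
flux[30] = 1.5e-5          # one M-class event
y = make_labels(flux)
print(y[6], y[29], y[30], y[31])  # 1 1 0 0
```

Hours 6 through 29 are labeled positive because the event at hour 30 falls inside their 24-hour lookahead; the event's own hour is labeled negative, since it looks only forward.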

The TSS Metric
To address the problem we saw earlier, a model with 99% accuracy and zero recall, we introduce a different metric, the True Skill Statistic (TSS). It is defined as the difference between the true positive rate and the false positive rate, TSS = TP/(TP+FN) − FP/(FP+TN). TSS rewards catching real events while penalizing false alarms, and unlike accuracy it is insensitive to class imbalance.
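The definition translates directly into code. Reusing the confusion-matrix counts from the earlier accuracy example shows why TSS is the more honest score:

```python
def tss(tp: int, fn: int, fp: int, tn: int) -> float:
    """True Skill Statistic: TPR - FPR, ranging from -1 to +1."""
    tpr = tp / (tp + fn)  # recall: fraction of real flares we caught
    fpr = fp / (fp + tn)  # false-alarm rate on quiet days
    return tpr - fpr

# The "always predict no flare" model from earlier: 100 flares all missed,
# 9,900 quiet days all correctly called quiet.
print(tss(tp=0, fn=100, fp=0, tn=9900))  # 0.0 despite 99% accuracy

# A model that catches 80 of 100 flares at the cost of 500 false alarms.
print(round(tss(tp=80, fn=20, fp=500, tn=9400), 3))  # 0.749
```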

Modeling the Tail
If we train with a standard objective on the raw event stream, quiet periods with no flares dominate the loss. Rare events barely contribute, because they occur so rarely, even though they are exactly what we are trying to predict. A model can become very good at the bulk of the distribution while learning very little about the extreme events we care about. That is why it makes sense to model the tail separately.

We can state the problem more precisely: our objective is frequency-weighted, meaning that frequent events dominate the loss while infrequent (rare) events contribute little, even though they are what our model most needs to learn.

So that our model can learn from rare events, we choose a fixed threshold on a continuous measure of intensity, such as the soft X-ray flux; anything that measures the intensity of the flare would work. We then set our target to the excess, the difference between the observed intensity and the threshold, and use only the data in the tail of the distribution.

Then the data we model is the set of excesses y = x − u, for every observation x that exceeds the threshold u.
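This peaks-over-threshold construction is a one-liner. The helper name and the flux values below are hypothetical, with the threshold placed at the M1 level:

```python
import numpy as np

def exceedances(x: np.ndarray, u: float) -> np.ndarray:
    """Peaks-over-threshold: keep only observations above u, shifted to excesses.

    Returns y = x - u for every x > u; everything below the threshold is
    discarded, so the tail model sees only the extreme part of the data.
    """
    return x[x > u] - u

# Hypothetical flux observations over a mostly quiet background.
x = np.array([2e-7, 5e-6, 1.2e-5, 3e-7, 4.0e-5, 9e-6])
u = 1e-5  # threshold at the M1 level
y = exceedances(x, u)
print(y)  # excesses for the two observations above the threshold
```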

Using Transformers
We can now combine our full-distribution model and the tail model using a transformer, yielding a more robust solution that learns what happens both below and above the rare-event threshold. In other words, we want the model to learn both the bulk of the activity distribution and the excess risk described by the tail model. We can do this with a transformer with multiple output heads: the model encodes the historical magnetic data into a representation h, and separate heads predict different quantities, such as the exceedance probability and the tail distribution parameters, each with its own uncertainty.
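The shape of that architecture can be sketched in plain NumPy. This is only a structural sketch, not the author's implementation: a single random linear map stands in for the transformer encoder, and all weight names and sizes are hypothetical. What matters is the pattern of one shared representation h feeding several heads:

```python
import numpy as np

rng = np.random.default_rng(0)
T, F, D = 24, 9, 16  # window length, features, hidden width (hypothetical)

# Stand-in for the transformer encoder: any map from a (T, F) window to a
# fixed-size representation h. A real model would use attention layers here.
W_enc = rng.normal(scale=0.1, size=(T * F, D))

def encode(x: np.ndarray) -> np.ndarray:
    return np.tanh(x.reshape(-1) @ W_enc)  # h, shape (D,)

# Head 1: exceedance probability (sigmoid output).
w_cls = rng.normal(scale=0.1, size=D)
# Head 2: GPD tail parameters sigma > 0 and xi (softplus keeps sigma positive).
w_sigma = rng.normal(scale=0.1, size=D)
w_xi = rng.normal(scale=0.1, size=D)

def forward(x: np.ndarray) -> dict:
    h = encode(x)
    p = 1.0 / (1.0 + np.exp(-(h @ w_cls)))  # P(flare exceeds threshold)
    sigma = np.log1p(np.exp(h @ w_sigma))   # GPD scale (softplus)
    xi = h @ w_xi                           # GPD shape (tail heaviness)
    return {"p": p, "sigma": sigma, "xi": xi}

out = forward(rng.normal(size=(T, F)))
print(sorted(out))  # ['p', 'sigma', 'xi']
```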

A classifier head, which estimates the probability of our target event given the data, is typically trained with binary cross-entropy, possibly re-weighted to deal with class imbalance.
For the tail head we can use the Generalized Pareto Distribution (GPD), which provides a compact model of extremes (our tail distribution). Here, σ controls the scale and ξ controls the heaviness of the tail. The transformer's representation h of the Sun's recent state is mapped to the GPD parameters, so different magnetic histories produce different tail distributions for a given active region (sunspot group).
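Training the tail head means minimizing the GPD negative log-likelihood of the excesses. The function below is a sketch of that loss; the density it implements is the standard GPD form, with the ξ → 0 exponential limit handled separately:

```python
import numpy as np

def gpd_nll(y: np.ndarray, sigma: float, xi: float) -> float:
    """Negative log-likelihood of excesses y > 0 under a Generalized Pareto
    Distribution with scale sigma > 0 and shape xi.

    Density: f(y) = (1/sigma) * (1 + xi * y / sigma) ** (-1/xi - 1),
    with the xi -> 0 limit being the exponential distribution.
    """
    y = np.asarray(y, dtype=float)
    if abs(xi) < 1e-8:  # exponential limit
        return float(np.sum(np.log(sigma) + y / sigma))
    z = 1.0 + xi * y / sigma
    if np.any(z <= 0):  # outside the distribution's support
        return float("inf")
    return float(np.sum(np.log(sigma) + (1.0 / xi + 1.0) * np.log(z)))

# A heavier tail (larger xi) explains a sample containing one big excess
# better than a thin exponential tail does.
y = np.array([0.2, 0.5, 0.1, 3.0])
print(gpd_nll(y, sigma=0.5, xi=0.0) > gpd_nll(y, sigma=0.5, xi=0.5))  # True
```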

The full objective combines two predictive tasks. The classification term teaches the model to estimate whether a flare will exceed the chosen threshold, while the tail term teaches it what the excess intensity looks like once that threshold has been exceeded. This is important because the model should not only learn "flare or no flare." It should also learn how big the event can be once it gets into the dangerous part of the distribution.
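The two terms can be combined into one loss. This is a minimal sketch, assuming the heads output per-sample p, σ, ξ as above and assuming ξ > 0 on the tail samples; the weight `lam` and the function name are hypothetical:

```python
import numpy as np

def combined_loss(p, y_cls, sigma, xi, excess, lam=1.0):
    """Two-part objective: binary cross-entropy on threshold exceedance,
    plus GPD negative log-likelihood on the excess sizes, applied only to
    samples where an exceedance actually occurred.
    """
    eps = 1e-12
    # Classification term: did the flare exceed the threshold?
    bce = -np.mean(y_cls * np.log(p + eps) + (1 - y_cls) * np.log(1 - p + eps))

    # Tail term: how large was the excess, where there was one?
    mask = y_cls == 1
    if mask.any():
        z = 1.0 + xi[mask] * excess[mask] / sigma[mask]
        tail = np.mean(np.log(sigma[mask]) + (1.0 / xi[mask] + 1.0) * np.log(z))
    else:
        tail = 0.0
    return bce + lam * tail

# Three samples: two exceedances with known excesses, one quiet period.
p = np.array([0.9, 0.1, 0.8])
y_cls = np.array([1, 0, 1])
sigma = np.array([0.5, 0.5, 0.7])
xi = np.array([0.3, 0.3, 0.2])
excess = np.array([0.4, 0.0, 1.1])
print(combined_loss(p, y_cls, sigma, xi, excess) > 0)  # True
```

Masking the tail term this way is what lets quiet periods inform the classifier without diluting what the GPD head learns about event sizes.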




Photo credit: NASA, "Sunspots 1302 Sep 2011", September 24, 2011, via Wikimedia Commons. Public domain
Conclusion
Getting good predictions for a very rare event out of a transformer takes more than plugging in data and minimizing a loss function. For solar flares, localization and feature engineering must first be applied to the raw magnetograms. We then need a carefully specified target model that can distinguish positive from negative events, and a metric, like TSS, that rewards true positives while penalizing false alarms. Because of the large class imbalance, it also makes sense to model the tail using a Generalized Pareto Distribution fitted to the excesses over a threshold. These strategies and loss functions can be implemented as separate transformer heads that both classify exceedances and learn how big an event can become once it enters the dangerous part of the distribution. What we get from this is improved predictive performance and a better-specified model.
Website | LinkedIn | GitHub




