Machine Learning

Using Transformers to Predict Incredibly Rare Solar Flares

Introduction (X-45)

Prediction changes dramatically when the event we are trying to predict is very rare. We have to fundamentally change what we model to focus on tail events: from performance metrics and target definition to tail modeling and transformer output heads, predicting abnormal events is difficult. Hard, but worth it.

The Halloween storms of 2003 began as disturbances on the Sun: a cluster of dark spots that grew into one of the most intense space weather events of the satellite era. From late October to early November, a series of large active regions rotated across the solar disk, releasing powerful flares and clouds of magnetized plasma toward Earth. The event also painted the skies with aurorae visible at unusually low latitudes.

Satellites malfunctioned, GPS and radio communications were disrupted, and airlines rerouted flights away from the polar regions. According to NOAA, power grids around the world were affected, with geomagnetically induced currents exceeding 100 amps, leading to a blackout in Malmö, Sweden: at 20:07 UT a power outage hit the region, leaving an estimated 50,000 customers without power for 20 to 50 minutes.

The Sun bursting with powerful magnetic activity, its corona glowing in extreme ultraviolet light, with bright active regions and a powerful arc of burning plasma (a prominence) above the surface.
Photo credit: NASA / Solar Dynamics Observatory (SDO) / AIA. Public domain

The event saturated the X-ray sensors on the GOES satellites, so the true size of the flare can only be estimated by reconstruction. It is often called X45 after its reconstructed magnitude: 45 times an X1, and 450 times an M1 ("M" for medium) flare. The classification scale, sometimes described as a Richter scale for flares, works as follows.

Solar flare classes are measured by soft X-ray brightness at Earth. Each letter class is ten times stronger than the one before it, and the number after the letter gives the multiplier within that class: X45 is 45 times more powerful than X1, 450 times more powerful than M1, and 4,500 times more powerful than C1.
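The scale can be made concrete with a small helper. This is a sketch: the flux thresholds are the standard GOES soft X-ray bands (X1 = 1e-4 W/m²), but the function name and formatting are ours.

```python
def flare_class(flux_wm2: float) -> str:
    """Map a peak soft X-ray flux (W/m^2) to a flare class label like 'M1.0'."""
    # Standard GOES bands: each letter class is 10x the previous one.
    thresholds = [("X", 1e-4), ("M", 1e-5), ("C", 1e-6), ("B", 1e-7), ("A", 1e-8)]
    for letter, base in thresholds:
        if flux_wm2 >= base:
            return f"{letter}{flux_wm2 / base:.1f}"
    return "A0.0"  # below the A1 level

print(flare_class(1e-5))    # M1.0
print(flare_class(4.5e-3))  # X45.0, the reconstructed Halloween-storm magnitude
```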

The Prediction Problem

The ironic problem with disasters is that the more catastrophic they are, the rarer they tend to be. Think of floods, snowstorms, and avalanches: a "50-year event" happens, on average, once every fifty years. This is usually a good thing, but their very rarity makes them extremely difficult to predict.

There are several things that make predicting rare events a very interesting challenge in machine learning:

  1. Our model testing metrics must change
  2. Features need to be built from magnetic data
  3. Model the tail to capture rare events
  4. Combine the tail model with the full distribution model using a transformer

Take accuracy, a standard metric for binary classification. If only 100 of 10,000 prediction windows contain a major flare, a model that simply always predicts "no flare" misses every single flare yet still achieves 99% accuracy:

Accuracy = (10,000 - 100) / 10,000 = 9,900 / 10,000 = 0.99 = 99%

True positives = 0, so recall = 0.
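The arithmetic above can be checked directly. A minimal sketch with the same numbers as the example (10,000 windows, 100 flares, an "always quiet" model):

```python
# Why accuracy misleads on rare events: a classifier that always predicts
# "no flare" on 10,000 windows, 100 of which contain a major flare.
n, n_flares = 10_000, 100
y_true = [1] * n_flares + [0] * (n - n_flares)
y_pred = [0] * n  # the "always quiet" model

tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)

accuracy = correct / n   # 0.99 -- looks excellent
recall = tp / (tp + fn)  # 0.0  -- every flare is missed
print(accuracy, recall)
```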

The data

There is a twist in where this data comes from: almost all the information we have about solar flares is measured in a completely different layer of the Sun than the one the flares come from. Our measurements come from the photosphere, the first visible layer of the Sun.

Flares, on the other hand, originate in the corona and chromosphere. The data is collected by the Solar Dynamics Observatory (SDO), a NASA spacecraft that constantly watches the Sun to monitor its activity, using the Helioseismic and Magnetic Imager (HMI).

Solar flare prediction relies on measuring the magnetic field directly in the photosphere, the visible surface of the Sun, while the release of flare energy occurs higher up in the corona. Photospheric sunspot and magnetic field data are therefore used to estimate the build-up of coronal magnetic stress that can lead to reconnection and flares. Image made with the help of ChatGPT.

Model input

Fortunately, thanks to NASA, the satellite has already been built, deployed, and pointed at the Sun, so we can focus on our modeling. HMI's magnetograms come in two flavors, line-of-sight and full vector; we use the vector magnetogram, which measures the magnetic field vector B.

From this point on, the Space Weather HMI Active Region Patch (SHARP) pipeline does two things:

  1. Localization
  2. Feature engineering

That is, it selects active regions on the Sun (localization) and computes the magnetic parameters that best describe their structure (feature engineering).

The important lesson here is that, to cope with how rare the target event is, we focus data collection on the regions where it is most likely to occur. From the raw magnetic field measurements we then compute derived properties such as:

Four magnetic quantities are used to understand the active regions that produce flares: the magnetic flux measures how field lines connect the sunspot polarities, the electric current traces the flows that store energy in those fields, the magnetic twist captures the helical winding inside a flux tube, and the magnetic helicity describes the large-scale linking, weaving, and winding of the coronal magnetic fields. Image made with the help of ChatGPT.
A solar flare starts when magnetic energy builds up in stressed field lines above a sunspot region. As the field reconnects, the stored energy is released as intense radiation, plasma ejections, and post-flare magnetic loops. Image made with the help of ChatGPT.

Our input data becomes a function of time and engineered features. If our model uses the last 24 hours of history and 9 engineered features, each input sample is a T × 9 matrix, with one row per time step.
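As a sketch of that shaping step (the 12-minute cadence matches HMI's SHARP data products, so 24 hours gives T = 120 steps; the random series is a placeholder for real SHARP parameters):

```python
import numpy as np

T, F = 120, 9  # 24 h at 12-minute cadence, 9 engineered features
rng = np.random.default_rng(0)
series = rng.normal(size=(1000, F))  # placeholder feature time series

def make_windows(x: np.ndarray, T: int) -> np.ndarray:
    """Stack sliding windows so each sample is a (T, F) history matrix."""
    return np.stack([x[i : i + T] for i in range(len(x) - T + 1)])

X = make_windows(series, T)
print(X.shape)  # (881, 120, 9): one (T, F) history per valid end time
```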

Model Target

We can now make our target precise. We define it as the probability of observing an M1-class or larger event in the next 24 hours, given the magnetic history, where the magnetic history is the input data described above.
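A minimal sketch of that labeling rule, assuming a soft X-ray flux series at the same cadence as the inputs (the function name, horizon, and synthetic flux are ours; M1 = 1e-5 W/m² is the standard threshold):

```python
import numpy as np

M1 = 1e-5  # M1-class threshold in W/m^2

def make_labels(flux: np.ndarray, horizon: int) -> np.ndarray:
    """y[t] = 1 if the peak flux in the NEXT `horizon` steps reaches M1."""
    n = len(flux) - horizon
    return np.array(
        [flux[t + 1 : t + 1 + horizon].max() >= M1 for t in range(n)], dtype=int
    )

rng = np.random.default_rng(1)
flux = 10 ** rng.uniform(-8, -4.5, size=500)  # synthetic log-uniform flux
y = make_labels(flux, horizon=120)            # 24 h at 12-minute cadence
print(y.shape, y.mean())                      # label shape and positive rate
```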

But there are many subtle design decisions behind this definition, made explicit in the following table.

Note that there are many options when defining the target, which is a real problem when comparing different models. It is also important to note that using more historical data is not automatically better: events farther in the past are often weaker predictors of future events, which introduces a signal-to-noise trade-off in the choice of training window.

Metric TSS

To solve the previously mentioned problem of a model with 99% accuracy and zero recall, we introduce a new metric, the True Skill Statistic (TSS), defined as the difference between the true positive rate (recall) and the false positive rate: TSS = TP/(TP+FN) - FP/(FP+TN). TSS rewards true positives while penalizing false alarms.

Modeling the tail

If we train with a standard objective, quiet periods with no flare dominate the loss. Rare events contribute almost nothing because they occur so infrequently, even though they are exactly what we are trying to predict. A model can fit the bulk of the distribution very well while learning very little about the extreme events we care about. That is why it makes sense to model the tail explicitly.

Empirical risk (what most ML minimizes)

We can state the problem more precisely by saying that our objective is frequency-weighted: frequent events dominate the loss, while infrequent (rare) events contribute little, even though they are exactly what our model needs to learn.
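Written out, the standard frequency-weighted objective makes the issue explicit (notation is ours: ℓ is a per-sample loss and f_θ the model):

```latex
\mathcal{R}(\theta) \;=\; \mathbb{E}_{(x,y)\sim p}\big[\ell(f_\theta(x), y)\big]
\;\approx\; \frac{1}{N}\sum_{i=1}^{N} \ell\big(f_\theta(x_i), y_i\big)
```

If only 1% of the samples are flares, they contribute roughly 1% of the gradient signal, so the optimizer has little incentive to fit them.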

NASA's Solar Dynamics Observatory captured the opening moments of an X4.9-class solar flare on Feb. 24, 2014, seen here at multiple wavelengths as a bright burst on the left side of the Sun. The flare peaked at 7:49 p.m. EST; loops of hot plasma tower above the active region in the corona. Credit: NASA/SDO. License: NASA Image Use Policy, public domain.

To let our model learn from rare events, we choose a fixed threshold u on a continuous measure of intensity, such as the soft X-ray flux (anything that measures flare intensity would work). We then set our target to the excess over the threshold, y - u, and use only data from the tail of the distribution.

The data we model is then the set of excesses {yᵢ - u : yᵢ > u}.
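A peaks-over-threshold sketch in NumPy, assuming synthetic flux values and using method-of-moments estimates for the GPD parameters as a simple stand-in for maximum likelihood (the 95th-percentile threshold and all names are ours):

```python
import numpy as np

rng = np.random.default_rng(2)
y = rng.lognormal(mean=-13, sigma=1.5, size=20_000)  # synthetic flux values
u = np.quantile(y, 0.95)                             # tail threshold
excesses = y[y > u] - u                              # keep only y - u above u

# Method-of-moments GPD estimates from the mean and variance of excesses.
m, v = excesses.mean(), excesses.var()
xi = 0.5 * (1.0 - m * m / v)         # shape: xi > 0 means a heavy tail
sigma = 0.5 * m * (m * m / v + 1.0)  # scale: always positive
print(f"{excesses.size} excesses, sigma={sigma:.3g}, xi={xi:.3g}")
```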

Using Transformers

We can now combine our baseline model and the tail model using a transformer to get a more robust solution, one that learns what happens both below and above the rare-event threshold. In other words, we would like the model to learn both the distribution of ordinary activity and the excess risk described by the tail model. To do this, we can use a transformer with multiple output heads: the model encodes the historical magnetic data into a representation h, and separate heads predict different quantities, such as the probability of exceeding the threshold and the parameters of the tail distribution.
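The multi-head idea can be sketched in plain NumPy, with a single self-attention layer standing in for a full transformer. All weights here are random placeholders (a trained model would learn them), and the names are ours:

```python
import numpy as np

rng = np.random.default_rng(3)
T, F, D = 120, 9, 32  # history length, input features, model width

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

# Random projection weights standing in for learned parameters.
W_in = rng.normal(size=(F, D)) / np.sqrt(F)
W_q, W_k, W_v = (rng.normal(size=(D, D)) / np.sqrt(D) for _ in range(3))
w_cls = rng.normal(size=D) / np.sqrt(D)        # head 1: exceedance probability
W_tail = rng.normal(size=(D, 2)) / np.sqrt(D)  # head 2: GPD (sigma, xi)

def encode(x):
    """One self-attention layer over a (T, F) magnetic history, mean-pooled."""
    h = x @ W_in
    q, k, v = h @ W_q, h @ W_k, h @ W_v
    attn = softmax(q @ k.T / np.sqrt(D))  # (T, T) attention weights
    return (attn @ v).mean(axis=0)        # pooled representation h

def heads(h):
    p = 1.0 / (1.0 + np.exp(-(h @ w_cls)))  # sigmoid: flare probability
    log_sigma, xi = h @ W_tail
    return p, np.exp(log_sigma), xi         # exp keeps sigma > 0

x = rng.normal(size=(T, F))
p, sigma, xi = heads(encode(x))
print(p, sigma, xi)
```

The key design point is the shared encoder: both heads read the same representation h, so what the classifier learns about active-region structure also informs the tail estimates.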

A classifier head, which estimates the probability of our target given our data, is typically trained with binary cross-entropy, possibly reweighted to deal with class imbalance.

For the tail we can use the Generalized Pareto Distribution (GPD), which provides a compact model of extremes. Here σ controls the scale and ξ controls the heaviness of the tail. The transformer encodes the recent state of the Sun into a representation h, which is mapped to the GPD parameters; different magnetic histories therefore imply different tail distributions for a given active region (sunspot).
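For reference, the GPD's distribution function for an excess x ≥ 0, in the standard (σ, ξ) parameterization used above, is:

```latex
G_{\sigma,\xi}(x) \;=\; 1 - \left(1 + \frac{\xi x}{\sigma}\right)^{-1/\xi},
\qquad x \ge 0,\; \sigma > 0
```

As ξ → 0 this reduces to an exponential tail, 1 - e^{-x/σ}; ξ > 0 gives a heavy, power-law tail, which is the relevant regime for flares.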

The full objective combines two terms. The classification term teaches the model to estimate whether a flare will exceed the chosen threshold, while the tail term teaches it what the excess intensity looks like once that threshold has been exceeded. This matters because the model should not only learn "to flare or not to flare"; it should also learn how big the event can be once it enters the dangerous part of the distribution.

Combined loss: L = L_cls + λ · L_tail, where L_tail is the negative log-likelihood of the GPD on the excesses
Classification loss: L_cls = -[y log p + (1 - y) log(1 - p)]
Weighted classification loss: L_w = -[w₊ y log p + w₋ (1 - y) log(1 - p)]
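These pieces fit together in a few lines. A sketch for a single sample, assuming the tail term applies only when the threshold was actually exceeded; the positive-class weight w_pos and balance λ are hypothetical hyperparameters:

```python
import numpy as np

def combined_loss(p, y, excess, sigma, xi, w_pos=50.0, lam=1.0):
    """Weighted BCE on the exceedance label plus GPD NLL on the excess."""
    eps = 1e-9  # numerical floor for the logs
    bce = -(w_pos * y * np.log(p + eps) + (1 - y) * np.log(1 - p + eps))
    if y == 1:  # the tail term is only defined for actual exceedances
        # Negative log-density of GPD(sigma, xi) at the observed excess.
        gpd_nll = np.log(sigma) + (1 + 1 / xi) * np.log1p(xi * excess / sigma)
        return bce + lam * gpd_nll
    return bce

quiet = combined_loss(p=0.05, y=0, excess=0.0, sigma=1.0, xi=0.3)
flare = combined_loss(p=0.05, y=1, excess=2.0, sigma=1.0, xi=0.3)
print(quiet, flare)  # an under-predicted flare costs far more than a quiet miss
```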
Sunspot group AR 1302, photographed on September 24, 2011. NASA described the active region as producing large solar flares during Solar Cycle 24.
NASA, "Sunspots 1302 Sep 2011", September 24, 2011, via Wikimedia Commons. Public domain.

The conclusion

Getting a good prediction for a very rare event with a transformer takes more than plugging in data and minimizing a loss function. For solar flares, localization and feature-engineering techniques must first be applied to the data. We then need to specify a target that cleanly separates positive from negative events, and choose a metric that rewards true positives while penalizing false alarms. Because of the large class imbalance, it also makes sense to model the tail, using a Generalized Pareto Distribution for the excesses over a threshold. These objectives can be served by different transformer heads, so the model can both predict whether an event will occur and learn how big it can be once it enters the dangerous part of the distribution. What we get is improved predictive performance and a better-specified model.



Marco Hening Tallarico
The author
