Log Link vs Log Transformation in R – The Difference that Can Mislead Your Data Analysis

The normal distribution is widely assumed, yet much real-world data is far from normal. When faced with heavily skewed data, it is tempting to reach for a log transformation to normalize the distribution and stabilize the variance. I recently worked on a project analyzing the energy consumption of training AI models, using data from Epoch AI [1]. There is no official figure for the energy use of each model, so I estimated it by multiplying each model's hardware power draw by its training time. The resulting variable, energy (in kWh), was heavily right-skewed, with some extreme outliers (Figure 1).
Faced with this skewness and heteroskedasticity, my first instinct was to apply a log transformation to the energy variable. The distribution of log(energy) looks approximately normal (Figure 2), and a Shapiro–Wilk test returned a borderline p-value.
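As a quick illustration of that check (a sketch on simulated right-skewed data, not the Epoch AI dataset):

```r
# Simulated stand-in for a right-skewed energy variable (log-normal),
# showing how the Shapiro-Wilk verdict changes after a log transform.
set.seed(42)
energy <- rlnorm(100, meanlog = 8, sdlog = 1.5)  # heavily right-skewed

raw_p <- shapiro.test(energy)$p.value       # normality strongly rejected
log_p <- shapiro.test(log(energy))$p.value  # log scale looks far more normal

c(raw = raw_p, log = log_p)
```

On the raw scale the p-value is essentially zero; on the log scale it is orders of magnitude larger, which is exactly the pattern that invites a log transformation.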

The Modeling Dilemma: Log Transformation vs Log Link
Everything looked good, but when I moved on to modeling, I faced a question: should I model the log-transformed response (log(Y) ~ X), or should I model the original response with a log link function (Y ~ X, link = "log")? I also considered two distributions – the Gaussian (normal) distribution and the Gamma distribution – and paired each with both approaches. This gave me four different models, all fitted as generalized linear models (GLMs):
all_gaussian_log_link <- glm(Energy_kWh ~ Parameters +
                               Training_compute_FLOP +
                               Training_dataset_size +
                               Training_time_hour +
                               Hardware_quantity +
                               Training_hardware,
                             family = gaussian(link = "log"), data = df)

all_gaussian_log_transform <- glm(log(Energy_kWh) ~ Parameters +
                                    Training_compute_FLOP +
                                    Training_dataset_size +
                                    Training_time_hour +
                                    Hardware_quantity +
                                    Training_hardware,
                                  data = df)

all_gamma_log_link <- glm(Energy_kWh ~ Parameters +
                            Training_compute_FLOP +
                            Training_dataset_size +
                            Training_time_hour +
                            Hardware_quantity +
                            Training_hardware + 0,
                          family = Gamma(link = "log"), data = df)

all_gamma_log_transform <- glm(log(Energy_kWh) ~ Parameters +
                                 Training_compute_FLOP +
                                 Training_dataset_size +
                                 Training_time_hour +
                                 Hardware_quantity +
                                 Training_hardware + 0,
                               family = Gamma(), data = df)
Comparing the Models: AIC Scores and Diagnostics
I compared the four models using the Akaike Information Criterion (AIC), an estimator of relative prediction error. In general, the lower the AIC, the better the model fit.
AIC(all_gaussian_log_link, all_gaussian_log_transform, all_gamma_log_link, all_gamma_log_transform)
df AIC
all_gaussian_log_link 25 2005.8263
all_gaussian_log_transform 25 311.5963
all_gamma_log_link 25 1780.8524
all_gamma_log_transform 25 352.5450
Among the four models, the ones fitted to the log-transformed response had far lower AIC values than those using a log link. Since the gap between the log-transformed and log-link models was large (311 and 352 versus 1780 and 2005), I also checked each model's diagnostic plots to make sure the log-transformed models really were the better fit.
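The diagnostics in question are the standard panels R draws when you call plot() on a fitted glm object; a sketch on a stand-in model (built-in mtcars data, since the energy dataset is not reproduced here):

```r
# plot() on a glm draws residuals-vs-fitted, normal Q-Q,
# scale-location, and residuals-vs-leverage panels
m <- glm(log(mpg) ~ wt + hp, data = mtcars)
op <- par(mfrow = c(2, 2))  # show the four panels in one grid
plot(m)
par(op)
```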
Based on the AIC values and the diagnostics, I decided to move forward with the log-transformed Gamma model: its AIC was among the lowest, and its residual plots looked better than those of the log-transformed Gaussian model.
I then explored which variables were informative and which interactions might matter. The final model I chose was:
glm(formula = log(Energy_kWh) ~ Training_time_hour * Hardware_quantity +
Training_hardware + 0, family = Gamma(), data = df)
Interpreting Coefficients
However, when I first interpreted the model coefficients, something seemed off. Because only the response variable is log-transformed, the predictions are on the log scale, and we need to exponentiate the coefficients to bring them back to the original scale. A one-unit increase in x multiplies the expected y by exp(β); equivalently, each additional unit of x leads to a (exp(β) − 1) × 100% change in y [2].
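In code, that back-transformation is a one-liner (illustrative β values, not the fitted ones):

```r
# Percent change in y per one-unit increase in x, from a log-scale beta
beta <- c(0.25, -0.03)   # hypothetical coefficients
(exp(beta) - 1) * 100    # about +28.4% and -3.0%, respectively
```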
Looking at the model summary below, we have Training_time_hour, Hardware_quantity, and their interaction term Training_time_hour:Hardware_quantity. These are continuous variables, so their coefficients represent slopes. Meanwhile, because I specified + 0 in the model formula, all levels of Training_hardware act as intercepts: each hardware type serves as the intercept β₀ when its corresponding dummy variable is active.
> glm(formula = log(Energy_kWh) ~ Training_time_hour * Hardware_quantity +
Training_hardware + 0, family = Gamma(), data = df)
Coefficients:
Estimate Std. Error t value Pr(>|t|)
Training_time_hour -1.587e-05 3.112e-06 -5.098 5.76e-06 ***
Hardware_quantity -5.121e-06 1.564e-06 -3.275 0.00196 **
Training_hardwareGoogle TPU v2 1.396e-01 2.297e-02 6.079 1.90e-07 ***
Training_hardwareGoogle TPU v3 1.106e-01 7.048e-03 15.696 < 2e-16 ***
Training_hardwareGoogle TPU v4 9.957e-02 7.939e-03 12.542 < 2e-16 ***
Training_hardwareHuawei Ascend 910 1.112e-01 1.862e-02 5.969 2.79e-07 ***
Training_hardwareNVIDIA A100 1.077e-01 6.993e-03 15.409 < 2e-16 ***
Training_hardwareNVIDIA A100 SXM4 40 GB 1.020e-01 1.072e-02 9.515 1.26e-12 ***
Training_hardwareNVIDIA A100 SXM4 80 GB 1.014e-01 1.018e-02 9.958 2.90e-13 ***
Training_hardwareNVIDIA GeForce GTX 285 3.202e-01 7.491e-02 4.275 9.03e-05 ***
Training_hardwareNVIDIA GeForce GTX TITAN X 1.601e-01 2.630e-02 6.088 1.84e-07 ***
Training_hardwareNVIDIA GTX Titan Black 1.498e-01 3.328e-02 4.501 4.31e-05 ***
Training_hardwareNVIDIA H100 SXM5 80GB 9.736e-02 9.840e-03 9.894 3.59e-13 ***
Training_hardwareNVIDIA P100 1.604e-01 1.922e-02 8.342 6.73e-11 ***
Training_hardwareNVIDIA Quadro P600 1.714e-01 3.756e-02 4.562 3.52e-05 ***
Training_hardwareNVIDIA Quadro RTX 4000 1.538e-01 3.263e-02 4.714 2.12e-05 ***
Training_hardwareNVIDIA Quadro RTX 5000 1.819e-01 4.021e-02 4.524 3.99e-05 ***
Training_hardwareNVIDIA Tesla K80 1.125e-01 1.608e-02 6.993 7.54e-09 ***
Training_hardwareNVIDIA Tesla V100 DGXS 32 GB 1.072e-01 1.353e-02 7.922 2.89e-10 ***
Training_hardwareNVIDIA Tesla V100S PCIe 32 GB 9.444e-02 2.030e-02 4.653 2.60e-05 ***
Training_hardwareNVIDIA V100 1.420e-01 1.201e-02 11.822 8.01e-16 ***
Training_time_hour:Hardware_quantity 2.296e-09 9.372e-10 2.450 0.01799 *
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for Gamma family taken to be 0.05497984)
Null deviance: NaN on 70 degrees of freedom
Residual deviance: 3.0043 on 48 degrees of freedom
AIC: 345.39
Yet when I converted the slopes into percentage changes in the response, each continuous effect came out almost exactly zero, and even slightly negative:
All the intercepts, once exponentiated, translated back to around 1 kWh on the original scale. These results made no sense: at the very least, the slopes for training time and hardware quantity should point toward increasing energy use. I wondered whether a log-link model with the same predictors would behave differently, so I refitted the model and got:
glm(formula = Energy_kWh ~ Training_time_hour * Hardware_quantity +
Training_hardware + 0, family = Gamma(link = "log"), data = df)
Coefficients:
Estimate Std. Error t value Pr(>|t|)
Training_time_hour 1.818e-03 1.640e-04 11.088 7.74e-15 ***
Hardware_quantity 7.373e-04 1.008e-04 7.315 2.42e-09 ***
Training_hardwareGoogle TPU v2 7.136e+00 7.379e-01 9.670 7.51e-13 ***
Training_hardwareGoogle TPU v3 1.004e+01 3.156e-01 31.808 < 2e-16 ***
Training_hardwareGoogle TPU v4 1.014e+01 4.220e-01 24.035 < 2e-16 ***
Training_hardwareHuawei Ascend 910 9.231e+00 1.108e+00 8.331 6.98e-11 ***
Training_hardwareNVIDIA A100 1.028e+01 3.301e-01 31.144 < 2e-16 ***
Training_hardwareNVIDIA A100 SXM4 40 GB 1.057e+01 5.635e-01 18.761 < 2e-16 ***
Training_hardwareNVIDIA A100 SXM4 80 GB 1.093e+01 5.751e-01 19.005 < 2e-16 ***
Training_hardwareNVIDIA GeForce GTX 285 3.042e+00 1.043e+00 2.916 0.00538 **
Training_hardwareNVIDIA GeForce GTX TITAN X 6.322e+00 7.379e-01 8.568 3.09e-11 ***
Training_hardwareNVIDIA GTX Titan Black 6.135e+00 1.047e+00 5.862 4.07e-07 ***
Training_hardwareNVIDIA H100 SXM5 80GB 1.115e+01 6.614e-01 16.865 < 2e-16 ***
Training_hardwareNVIDIA P100 5.715e+00 6.864e-01 8.326 7.12e-11 ***
Training_hardwareNVIDIA Quadro P600 4.940e+00 1.050e+00 4.705 2.18e-05 ***
Training_hardwareNVIDIA Quadro RTX 4000 5.469e+00 1.055e+00 5.184 4.30e-06 ***
Training_hardwareNVIDIA Quadro RTX 5000 4.617e+00 1.049e+00 4.401 5.98e-05 ***
Training_hardwareNVIDIA Tesla K80 8.631e+00 7.587e-01 11.376 3.16e-15 ***
Training_hardwareNVIDIA Tesla V100 DGXS 32 GB 9.994e+00 6.920e-01 14.443 < 2e-16 ***
Training_hardwareNVIDIA Tesla V100S PCIe 32 GB 1.058e+01 1.047e+00 10.105 1.80e-13 ***
Training_hardwareNVIDIA V100 9.208e+00 3.998e-01 23.030 < 2e-16 ***
Training_time_hour:Hardware_quantity -2.651e-07 6.130e-08 -4.324 7.70e-05 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for Gamma family taken to be 1.088522)
Null deviance: 2.7045e+08 on 70 degrees of freedom
Residual deviance: 1.0593e+02 on 48 degrees of freedom
AIC: 1775
With this model, Training_time_hour and Hardware_quantity increase total energy use by about 0.18% per additional hour and 0.07% per additional chip, respectively. Meanwhile, their interaction reduces energy use by roughly 2.7 × 10⁻⁵% per unit of the hour × chip product. These results make much more sense once you note that Training_time_hour can reach 7,000 hours and Hardware_quantity up to 16,000 units.
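The tiny interaction coefficient matters precisely because it multiplies very large products of hours and chips. A quick check of the log-scale contributions at those extremes (coefficients copied from the log-link output above):

```r
b_time <- 1.818e-03   # per training hour
b_hw   <- 7.373e-04   # per chip
b_int  <- -2.651e-07  # per (hour x chip)

# contributions to log(energy) at 7000 hours and 16000 chips
c(time        = b_time * 7000,           # ~ +12.7
  hardware    = b_hw * 16000,            # ~ +11.8
  interaction = b_int * 7000 * 16000)    # ~ -29.7
```

At the joint extreme, the negative interaction is on the same order as the two positive main effects combined, which is why such a small coefficient is still highly significant.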

To visualize the difference, I created two plots comparing the predictions (shown as dashed lines) from both models. The left panel uses the log-transformed Gamma GLM, whose predicted lines are nearly flat and close to zero, nowhere near the strong trend in the actual data. The right panel uses the log-link Gamma GLM, whose predicted lines track the actual lines closely.
test_data <- df[, c("Training_time_hour", "Hardware_quantity", "Training_hardware")]
prediction_data <- df %>%
  mutate(
    pred_energy1 = exp(predict(glm3, newdata = test_data)),                    # log-transformed model
    pred_energy2 = predict(glm3_alt, newdata = test_data, type = "response")   # log-link model
  )
y_limits <- c(min(df$Energy_kWh, prediction_data$pred_energy1, prediction_data$pred_energy2),
              max(df$Energy_kWh, prediction_data$pred_energy1, prediction_data$pred_energy2))
p1 <- ggplot(df, aes(x = Hardware_quantity, y = Energy_kWh, color = Training_time_group)) +
  geom_point(alpha = 0.6) +
  geom_smooth(method = "lm", se = FALSE) +
  geom_smooth(data = prediction_data, aes(y = pred_energy1), method = "lm", se = FALSE,
              linetype = "dashed", size = 1) +
  scale_y_log10(limits = y_limits) +
  labs(x = "Hardware Quantity", y = "log of Energy (kWh)") +
  theme_minimal() +
  theme(legend.position = "none")

p2 <- ggplot(df, aes(x = Hardware_quantity, y = Energy_kWh, color = Training_time_group)) +
  geom_point(alpha = 0.6) +
  geom_smooth(method = "lm", se = FALSE) +
  geom_smooth(data = prediction_data, aes(y = pred_energy2), method = "lm", se = FALSE,
              linetype = "dashed", size = 1) +
  scale_y_log10(limits = y_limits) +
  labs(x = "Hardware Quantity", color = "Training Time Level") +
  theme_minimal() +
  theme(axis.title.y = element_blank())

p1 + p2

Why Does the Log Transformation Fail?
To understand why the log-transformed model fails to capture the trend that the log-link model picks up, let's walk through what happens when we fit a model to log-transformed data.
Suppose the true response y equals some function of x plus an additive error term:

y = f(x) + ε
When we apply the log transform to y, we squeeze both f(x) and the error together:

log(y) = log(f(x) + ε)
That means we are modeling a different variable, log(y). Whatever we fit for our g(x) – in my case g(x) = Training_time_hour * Hardware_quantity + Training_hardware – is trying to capture the combined effect of the "shrunken" f(x) and the error term at once.
In contrast, when we use a log link, we still model the original y, not a transformed version of it. The model simply exponentiates our g(x) to produce the prediction:

E[y] = exp(g(x)), or equivalently log(E[y]) = g(x)
The model then minimizes the discrepancy between the actual y and the predicted exp(g(x)), so the error term stays on the original scale of y:

y = exp(g(x)) + ε
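A minimal simulation makes the practical consequence concrete (simulated data with multiplicative noise, not the energy dataset): even when the log-transformed model fits well on the log scale, naively exponentiating its predictions understates y, while the log-link model targets the mean of y directly.

```r
set.seed(1)
n <- 500
x <- runif(n, 0, 3)
y <- exp(1 + 0.7 * x) * exp(rnorm(n, sd = 0.8))  # strictly positive, right-skewed

fit_log  <- lm(log(y) ~ x)                            # log-transformed response
fit_link <- glm(y ~ x, family = Gamma(link = "log"))  # log link

mean(y)                     # the target
mean(exp(fitted(fit_log)))  # too low: exp(E[log y]) < E[y] (Jensen's inequality)
mean(fitted(fit_link))      # close to mean(y)
```

The gap in the middle line is the classic retransformation bias; correcting it requires a smearing-type adjustment, whereas the log link avoids the issue entirely by keeping predictions on the original scale.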
Takeaway
Log-transforming the response is not the same as using a log link, and it does not always yield reliable results. Under the hood, the log transformation warps the response variable itself, distorting both the signal and the noise. Understanding this subtle mathematical difference behind your models is just as important as searching for the best-fitting one.
[1] Epoch AI. Data on Notable AI Models. Retrieved from
[2] University of Virginia Library. Interpreting Log Transformations in a Linear Model. Retrieved from