Multiple Linear Regression Analysis | Towards Data Science

The full code for this example is at the bottom of this post.
Multiple regression is used when your response variable Y is continuous and you have k covariates (independent variables). The data are of the form:
(Y₁, X₁), …, (Yᵢ, Xᵢ), …, (Yₙ, Xₙ)
where Xᵢ = (Xᵢ₁, …, Xᵢₖ) is a vector of covariates and n is the number of observations. Here, Xᵢ is the vector of the k covariate values for the i-th observation.
Understanding the data
To make this concrete, consider the following scenario:
You enjoy running and track your performance by recording the distance you run each day. Over 100 consecutive days, you collect four pieces of information:
- The distance you run,
- The number of hours you spent running,
- The number of hours you slept the night before,
- And the number of hours you worked.
Now, on the 101st day, you recorded everything except the distance you ran. You want to estimate that missing value using the information you do have: the number of hours you spent running, the number of hours you slept the night before, and the number of hours you worked that day.
To do this, you can rely on the data from the previous 100 days, which take the form:
(Y₁, X₁), …, (Yᵢ, Xᵢ), …, (Y₁₀₀, X₁₀₀)
Here, each Yᵢ is the distance you ran on day i, and each covariate vector Xᵢ = (Xᵢ₁, Xᵢ₂, Xᵢ₃) corresponds to:
- Xᵢ₁: the number of hours spent running,
- Xᵢ₂: the number of hours slept the previous night,
- Xᵢ₃: the number of hours worked on that day.
The index i = 1, …, 100 refers to the 100 days with complete data. With this dataset, you can now fit a multiple linear regression model to estimate the missing response variable for day 101.
Specification of the model
If we assume a linear relationship between the response variable and the covariates, which you can measure using the Pearson correlation, we can specify the model as:
Yᵢ = Σⱼ₌₁ᵏ βⱼ Xᵢⱼ + εᵢ
for i = 1, …, n, where E(εᵢ | Xᵢ₁, …, Xᵢₖ) = 0. To account for the intercept, the first variable is set to Xᵢ₁ = 1 for i = 1, …, n. To estimate the coefficients, the model is expressed in matrix notation:
Y = (Y₁, …, Yₙ)ᵀ, and X is the n × k matrix whose i-th row is (Xᵢ₁, …, Xᵢₖ)
The coefficients and the errors are denoted by:
β = (β₁, …, βₖ)ᵀ and ε = (ε₁, …, εₙ)ᵀ
Then, we can rewrite the model as:
Y = Xβ + ε
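Estimation of the coefficients
Assuming that the k × k matrix XᵀX is invertible (k counts the intercept column Xᵢ₁ = 1), the least squares estimate is given by:
β̂ = (XᵀX)⁻¹ XᵀY
We can derive the estimate of the regression function, an unbiased estimate of σ², and an approximate 1 − α confidence interval for βⱼ:
- Estimate of the regression function: r̂(x) = Σⱼ₌₁ᵏ β̂ⱼ xⱼ
- σ̂² = (1 / (n − k)) Σᵢ₌₁ⁿ ε̂ᵢ², where ε̂ = Y − Xβ̂ is the vector of residuals
- And β̂ⱼ ± t × se(β̂ⱼ) is an approximate (1 − α) confidence interval, where t = tₙ₋ₖ,₁₋α/₂ is the 1 − α/2 quantile of the Student t distribution with n − k degrees of freedom, and se(β̂ⱼ) is the square root of the j-th diagonal element of the matrix σ̂² (XᵀX)⁻¹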
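As a rough illustration of these formulas, the sketch below applies them with NumPy to synthetic data shaped like the running example. The data, the values in `true_beta`, and all variable names are assumptions made up for this illustration; `scipy.stats.t.ppf` supplies the Student t quantile.

```python
import numpy as np
from scipy import stats

# Synthetic data (an assumption for illustration, not the article's dataset)
rng = np.random.default_rng(0)
n = 100
hours_run = rng.uniform(0.5, 2.0, n)
hours_slept = rng.uniform(5.0, 9.0, n)
hours_worked = rng.uniform(4.0, 10.0, n)

# Design matrix: the first column is the constant 1 (intercept), so k = 4 here
X = np.column_stack([np.ones(n), hours_run, hours_slept, hours_worked])
true_beta = np.array([1.0, 8.0, 0.5, -0.3])  # hypothetical coefficients
Y = X @ true_beta + rng.normal(0.0, 1.0, n)

k = X.shape[1]
XtX_inv = np.linalg.inv(X.T @ X)

# Least squares estimate: beta_hat = (X^T X)^{-1} X^T Y
beta_hat = XtX_inv @ X.T @ Y

# Unbiased estimate of sigma^2 from the residuals, with n - k degrees of freedom
residuals = Y - X @ beta_hat
sigma2_hat = residuals @ residuals / (n - k)

# Standard errors: square roots of the diagonal of sigma2_hat * (X^T X)^{-1}
se = np.sqrt(np.diag(sigma2_hat * XtX_inv))

# Approximate 95% confidence intervals: beta_hat_j +/- t * se(beta_hat_j)
alpha = 0.05
t_crit = stats.t.ppf(1 - alpha / 2, df=n - k)
ci_lower = beta_hat - t_crit * se
ci_upper = beta_hat + t_crit * se

for j in range(k):
    print(f"beta_{j + 1}: {beta_hat[j]:.3f}  [{ci_lower[j]:.3f}, {ci_upper[j]:.3f}]")
```

In practice, `np.linalg.lstsq` or a library such as statsmodels is numerically safer than inverting XᵀX explicitly; the explicit inverse is used here only to mirror the formulas.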
Example of application
Because we did not record data on our own running performance, we will use a crime dataset from 47 states in 1960, which is available here. Before we fit a linear regression, there are several steps we must follow.
Understanding the different variables of the data.
The first 9 observations of the data are given by:
    R  Age  S   Ed  Ex0  Ex1   LF     M    N   NW   U1  U2    W    X
 79.1  151  1   91   58   56  510   950   33  301  108  41  394  261
163.5  143  0  113  103   95  583  1012   13  102   96  36  557  194
 57.8  142  1   89   45   44  533   969   18  219   94  33  318  250
196.9  136  0  121  149  141  577   994  157   80  102  39  673  167
123.4  141  0  121  109  101  591   985   18   30   91  20  578  174
 68.2  121  0  110  118  115  547   964   25   44   84  29  689  126
 96.3  127  1  111   82   79  519   982    4  139   97  38  620  168
155.5  131  1  109  115  109  542   969   50  179   79  35  472  206
 85.6  157  1   90   65   62  553   955   39  286   81  28  421  239
The data has 14 variables: the continuous response variable R, 12 continuous predictor variables, and one binary categorical variable S:
- R: Crime rate: # of offenses reported to the police, for each state
- Age: The number of males of age 14-24 per 1000 population
- S: Indicator variable for Southern states (0 = No, 1 = Yes)
- Ed: Mean # of years of schooling x 10 for persons of age 25 or older
- Ex0: 1960 per capita expenditure on police by state and local government
- Ex1: 1959 per capita expenditure on police by state and local government
- LF: Labor force participation rate per 1000 civilian urban males age 14-24
- M: The number of males per 1000 females
- N: State population size in hundred thousands
- NW: The number of non-whites per 1000 population
- U1: Unemployment rate of urban males per 1000 of age 14-24
- U2: Unemployment rate of urban males per 1000 of age 35-39
- W: Median value of transferable goods and assets or family income in tens of $
- X: The number of families per 1000 earning below 1/2 the median income
The data has no missing values.
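A quick way to verify this kind of claim is to load the data into pandas and count missing values. The sketch below does so on the nine observations shown above only (copied verbatim); the names `raw` and `sample` are chosen for this illustration, and the same checks would apply to the full dataset.

```python
import io
import pandas as pd

# The nine observations listed above, copied verbatim (not the full dataset)
raw = """R Age S Ed Ex0 Ex1 LF M N NW U1 U2 W X
79.1 151 1 91 58 56 510 950 33 301 108 41 394 261
163.5 143 0 113 103 95 583 1012 13 102 96 36 557 194
57.8 142 1 89 45 44 533 969 18 219 94 33 318 250
196.9 136 0 121 149 141 577 994 157 80 102 39 673 167
123.4 141 0 121 109 101 591 985 18 30 91 20 578 174
68.2 121 0 110 118 115 547 964 25 44 84 29 689 126
96.3 127 1 111 82 79 519 982 4 139 97 38 620 168
155.5 131 1 109 115 109 542 969 50 179 79 35 472 206
85.6 157 1 90 65 62 553 955 39 286 81 28 421 239
"""

# Whitespace-separated columns
sample = pd.read_csv(io.StringIO(raw), sep=r"\s+")

# Number of rows and columns
print(sample.shape)

# Count of missing values per column, then in total
missing_per_column = sample.isna().sum()
total_missing = int(missing_per_column.sum())
print(missing_per_column)
print("Total missing values:", total_missing)

# The binary indicator S should only take the values 0 and 1
s_values = sorted(sample["S"].unique().tolist())
print("Values of S:", s_values)
```

On this sample the checks should report 9 rows, 14 columns, and zero missing values.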
Graphical analysis of the relationship between the covariates X and the response variable Y
Graphical analysis of the relationship between the explanatory variables and the response variable is a step to take when performing linear regression.
It helps to visualize linear trends, detect anomalies, and assess the relevance of variables before building any model.

Some variables are positively correlated with the crime rate, while others are negatively correlated.
For instance, we observe a strong positive relationship between R (the crime rate) and Ex1.
In contrast, Age appears to be negatively correlated with crime.
Finally, the boxplot of the binary variable S (indicating region: North or South) suggests that the crime rate is relatively similar between the two regions. Then, we can analyze the correlation matrix.
Heatmap of the Spearman correlation matrix
The correlation matrix allows us to study the strength of the relationship between variables. While the Pearson correlation is commonly used to measure linear relationships, the Spearman correlation is more appropriate when we want to capture monotonic, potentially non-linear relationships between variables.
In this analysis, we will use the Spearman correlation to better account for such non-linear associations.
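To see why the choice matters, here is a small sketch on made-up data where y grows exponentially with x: the relationship is perfectly monotonic but not linear. The toy DataFrame `demo` and its columns are assumptions for this illustration only.

```python
import numpy as np
import pandas as pd

# Toy data (an assumption for illustration): y is a monotonic,
# strongly non-linear function of x
x = np.arange(1, 21, dtype=float)
demo = pd.DataFrame({"x": x, "y": np.exp(x / 3.0)})

# Same DataFrame.corr call as in the article's code, with both methods
pearson = demo.corr(method="pearson").loc["x", "y"]
spearman = demo.corr(method="spearman").loc["x", "y"]

print(f"Pearson:  {pearson:.3f}")
print(f"Spearman: {spearman:.3f}")
```

Spearman works on ranks, so it reaches 1 for any strictly increasing relationship, whereas Pearson is pulled below 1 by the curvature.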

The first row of the correlation matrix shows the strength of the relationship between each covariate and the response variable R.
For example, Ex0 and Ex1 both show a correlation greater than 60% with R, indicating a strong association. These variables therefore appear to be good predictors of the crime rate.
However, since the correlation between Ex0 and Ex1 is nearly perfect, they likely convey the same information. To avoid redundancy, we can select just one of them, preferably the one with the stronger correlation with R.
When several variables are strongly correlated with each other (a correlation above 60%, for example), they tend to carry redundant information. In such cases, we keep only one of them: the one most strongly correlated with the response variable R. This allows us to reduce multicollinearity.
This exercise allows us to select these variables: ['Ex1', 'LF', 'M', 'N', 'NW', 'U2'].
Study of multicollinearity using the VIF (Variance Inflation Factor)
Before fitting the linear regression, it is important to study multicollinearity.
When predictors are correlated with each other, the standard errors of the coefficient estimates increase, which inflates their variance. The Variance Inflation Factor (VIF) is a diagnostic tool used to measure how much the variance of a predictor's coefficient is inflated due to multicollinearity, and it is typically provided in the regression output under a "VIF" column.

The VIF is computed for each predictor in the model. The approach is to regress the i-th predictor variable against all the other predictors. We then obtain Rᵢ², which can be used to compute the VIF using the formula:
VIFᵢ = 1 / (1 − Rᵢ²)
The table below presents the VIF values for the six remaining variables, all of which are below 5. This indicates that multicollinearity is not a concern, and we can proceed with fitting the linear regression model.
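The definition above can be sketched directly with NumPy. The helper `vif` below is a hypothetical function written for this illustration (it is not the statsmodels implementation), and the data are synthetic: column `c` is built as a noisy copy of column `a`, so both should get a large VIF, while the independent column `b` should stay close to 1.

```python
import numpy as np

def vif(X: np.ndarray, i: int) -> float:
    """VIF of column i: regress it on the other columns plus an intercept."""
    target = X[:, i]
    others = np.delete(X, i, axis=1)
    design = np.column_stack([np.ones(len(target)), others])
    coef, *_ = np.linalg.lstsq(design, target, rcond=None)
    fitted = design @ coef
    ss_res = np.sum((target - fitted) ** 2)
    ss_tot = np.sum((target - target.mean()) ** 2)
    r_squared = 1.0 - ss_res / ss_tot
    return 1.0 / (1.0 - r_squared)

# Synthetic predictors (an assumption for illustration)
rng = np.random.default_rng(42)
a = rng.normal(size=200)
b = rng.normal(size=200)
c = a + 0.1 * rng.normal(size=200)  # nearly a copy of a
X_demo = np.column_stack([a, b, c])

vifs = [vif(X_demo, i) for i in range(X_demo.shape[1])]
for name, value in zip(["a", "b", "c"], vifs):
    print(f"VIF({name}) = {value:.2f}")
```

Note that this helper adds the intercept itself, whereas the article's code adds a constant column with `add_constant` before calling statsmodels' `variance_inflation_factor`.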

Fitting a linear regression on six variables
If we fit a linear regression of the crime rate on the six selected variables, we get the following:

Diagnosis of residuals
Before interpreting the regression results, we must first assess the quality of the residuals, particularly by checking for autocorrelation, homoscedasticity (constant variance), and normality. The diagnosis of residuals is given by the table below:

- A Durbin-Watson statistic ≈ 2 indicates no autocorrelation in the residuals.
- From Omnibus to Kurtosis, all values show that the residuals are symmetric and have a normal distribution.
- The low condition number (3.06) confirms that there is no multicollinearity among the predictors.
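To make these diagnostics less of a black box, the sketch below computes the Durbin-Watson statistic, the skewness, and the kurtosis by hand with NumPy. The two residual series are simulated (an assumption for illustration, not the residuals of the article's model): one is independent noise, the other follows an AR(1)-style recursion with coefficient 0.9.

```python
import numpy as np

def durbin_watson(residuals: np.ndarray) -> float:
    """Sum of squared successive differences over the sum of squares."""
    diff = np.diff(residuals)
    return float(np.sum(diff ** 2) / np.sum(residuals ** 2))

def skewness_and_kurtosis(residuals: np.ndarray) -> tuple:
    """Sample skewness and (non-excess) kurtosis; about 0 and 3 for normal data."""
    z = (residuals - residuals.mean()) / residuals.std()
    return float(np.mean(z ** 3)), float(np.mean(z ** 4))

# Simulated residuals (an assumption for illustration)
rng = np.random.default_rng(1)
independent = rng.normal(size=500)

autocorrelated = np.empty(500)
autocorrelated[0] = independent[0]
for t in range(1, 500):
    # Each value depends strongly on the previous one
    autocorrelated[t] = 0.9 * autocorrelated[t - 1] + independent[t]

dw_independent = durbin_watson(independent)
dw_autocorrelated = durbin_watson(autocorrelated)
skew, kurt = skewness_and_kurtosis(independent)

print(f"Durbin-Watson, independent residuals:    {dw_independent:.2f}")
print(f"Durbin-Watson, autocorrelated residuals: {dw_autocorrelated:.2f}")
print(f"Skewness: {skew:.2f}, kurtosis: {kurt:.2f}")
```

The independent series should give a statistic near 2, while positive autocorrelation pushes it towards 0.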
Main points to remember
We can also assess the overall quality of the model through indicators such as the R-squared and the F-statistic, which show satisfactory results in this case. (See the appendix for more details.)
We can now interpret the regression coefficients from a statistical perspective.
We intentionally exclude any business-specific interpretation of the results.
The objective of this analysis is to illustrate a few simple and essential steps for modeling a problem using multiple linear regression.
At the 5% significance level, two coefficients are statistically significant: Ex1 and NW.
This is not surprising, as these were the two variables that showed a correlation greater than 40% with the response variable R.
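For readers who want to see where such a significance statement comes from, here is a minimal sketch of the underlying t-test on synthetic data. Everything in it is an assumption for illustration: `x1` is constructed to influence `y`, `x2` is pure noise, and the two-sided p-value comes from `scipy.stats.t.sf`.

```python
import numpy as np
from scipy import stats

# Synthetic data (an assumption for illustration)
rng = np.random.default_rng(7)
n = 200
x1 = rng.normal(size=n)  # truly related to y
x2 = rng.normal(size=n)  # unrelated to y
X = np.column_stack([np.ones(n), x1, x2])
y = 2.0 + 1.5 * x1 + rng.normal(size=n)

# Ordinary least squares fit and standard errors
k = X.shape[1]
XtX_inv = np.linalg.inv(X.T @ X)
beta_hat = XtX_inv @ X.T @ y
resid = y - X @ beta_hat
sigma2_hat = resid @ resid / (n - k)
se = np.sqrt(np.diag(sigma2_hat * XtX_inv))

# t-statistic for H0: beta_j = 0, and its two-sided p-value
t_stats = beta_hat / se
p_values = 2 * stats.t.sf(np.abs(t_stats), df=n - k)
significant = p_values < 0.05

for name, p, flag in zip(["const", "x1", "x2"], p_values, significant):
    print(f"{name}: p-value = {p:.4f}, significant at 5%: {bool(flag)}")
```

A coefficient whose true value is zero, like the one for `x2`, is still flagged as significant about 5% of the time by chance, which is exactly what the significance level means.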
This post gives you some guidelines for performing linear regression:
- It is important to check linearity through graphical analysis and to study the correlation between the response variable and the predictors.
- Examining correlations among variables helps reduce multicollinearity and supports variable selection.
- When two predictors are highly correlated, they may convey redundant information. In such cases, you can retain the one that is more strongly correlated with the response or, based on domain expertise, the one with greater business relevance or practical interpretability.
- The Variance Inflation Factor (VIF) is a useful tool to quantify and assess multicollinearity.
- Before interpreting the model coefficients statistically, it is essential to check the autocorrelation, normality, and homoscedasticity of the residuals to ensure that the model assumptions are met.
While this analysis provides valuable insights, it also has certain limitations.
The absence of missing values in the dataset simplifies the study, but this is rarely the case in real-world scenarios.
If you are building a predictive model, it is important to split the data into training, testing, and possibly an out-of-time validation set to ensure robust evaluation.
For variable selection, techniques such as stepwise selection and other feature selection methods can be applied.
When comparing multiple models, it is essential to define appropriate performance metrics.
In the case of linear regression, commonly used metrics include the Mean Absolute Error (MAE) and the Mean Squared Error (MSE).
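As a minimal sketch of that workflow, the example below splits synthetic data 80/20, fits the regression on the training part only, and reports the MAE and MSE on the held-out part. The data, the split ratio, and the variable names are assumptions for this illustration; the article itself fits the model on the full dataset without a split.

```python
import numpy as np

# Synthetic data (an assumption for illustration)
rng = np.random.default_rng(3)
n = 200
X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])
y = X @ np.array([1.0, 2.0, -1.0]) + rng.normal(scale=0.5, size=n)

# Random 80/20 split into training and test indices
indices = rng.permutation(n)
train_idx, test_idx = indices[:160], indices[160:]

# Fit on the training set only
beta_hat, *_ = np.linalg.lstsq(X[train_idx], y[train_idx], rcond=None)

# Evaluate on the held-out test set
errors = y[test_idx] - X[test_idx] @ beta_hat
mae = float(np.mean(np.abs(errors)))  # Mean Absolute Error
mse = float(np.mean(errors ** 2))     # Mean Squared Error

print(f"Test MAE: {mae:.3f}")
print(f"Test MSE: {mse:.3f}")
```

The MSE penalizes large errors more heavily than the MAE, so reporting both gives a fuller picture when comparing models.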
Image credits
All images and visualizations in this article were created by the author using Python (pandas, matplotlib, seaborn, and plotly) unless otherwise stated.
References
Wasserman, L. (2013). All of Statistics: A Concise Course in Statistical Inference. Springer Science & Business Media.
Data & licensing
The dataset used in this article contains crime-related and demographic statistics for 47 U.S. states in 1960.
It originates from the FBI's Uniform Crime Reporting (UCR) Program and additional U.S. government sources.
As a U.S. government work, the data is in the public domain under 17 U.S. Code § 105 and is free to use, share, and reproduce without restriction.
Sources:
Codes
Import data
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
# Load the dataset
df = pd.read_csv('data/Multiple_Regression_Dataset.csv')
df.head()
Visual analysis of the variables
# Extract response variable and covariates
response = 'R'
covariates = [col for col in df.columns if col != response]
# Create a new figure with a 4 x 4 grid of subplots
fig, axes = plt.subplots(nrows=4, ncols=4, figsize=(20, 18))
axes = axes.flatten()
# Plot boxplot for binary variable 'S'
sns.boxplot(data=df, x='S', y='R', ax=axes[0])
axes[0].set_title('Boxplot of R by S')
axes[0].set_xlabel('S')
axes[0].set_ylabel('R')
# Plot regression lines for all other covariates
plot_index = 1
for cov in covariates:
    if cov != 'S':
        sns.regplot(data=df, x=cov, y='R', ax=axes[plot_index], scatter=True, line_kws={"color": "red"})
        axes[plot_index].set_title(f'{cov} vs R')
        axes[plot_index].set_xlabel(cov)
        axes[plot_index].set_ylabel('R')
        plot_index += 1
# Hide unused subplots
for i in range(plot_index, len(axes)):
    fig.delaxes(axes[i])
fig.tight_layout()
plt.show()
Correlation analysis between variables
spearman_corr = df.corr(method='spearman')
plt.figure(figsize=(12, 10))
sns.heatmap(spearman_corr, annot=True, cmap="coolwarm", fmt=".2f", linewidths=0.5)
plt.title("Correlation Matrix Heatmap")
plt.show()
Filtering predictors with high intercorrelation (ρ > 0.6)
# Step 2: Correlation of each variable with response R
spearman_corr_with_R = spearman_corr['R'].drop('R')  # exclude R-R
# Step 3: Identify pairs of covariates with strong inter-correlation (|ρ| > 0.6)
strong_pairs = []
threshold = 0.6
covariates = spearman_corr_with_R.index
for i, var1 in enumerate(covariates):
    for var2 in covariates[i+1:]:
        if abs(spearman_corr.loc[var1, var2]) > threshold:
            strong_pairs.append((var1, var2))
# Step 4: From each correlated pair, keep only the variable most correlated with R
to_keep = set()
to_discard = set()
for var1, var2 in strong_pairs:
    if abs(spearman_corr_with_R[var1]) >= abs(spearman_corr_with_R[var2]):
        to_keep.add(var1)
        to_discard.add(var2)
    else:
        to_keep.add(var2)
        to_discard.add(var1)
# Final selection: all covariates excluding the ones to discard due to redundancy
final_selected_variables = [var for var in covariates if var not in to_discard]
final_selected_variables
Multicollinearity analysis using the VIF
from statsmodels.stats.outliers_influence import variance_inflation_factor
from statsmodels.tools.tools import add_constant
# Add an intercept column so each auxiliary regression includes a constant
X = df[final_selected_variables]
X_with_const = add_constant(X)
# Compute the VIF of every column of the design matrix
vif_data = pd.DataFrame()
vif_data["variable"] = X_with_const.columns
vif_data["VIF"] = [
    variance_inflation_factor(X_with_const.values, i)
    for i in range(X_with_const.shape[1])
]
# Drop the row for the constant: only the predictors are of interest
vif_data = vif_data[vif_data["variable"] != "const"]
print(vif_data)
Fitting a linear regression model on six variables after standardization, without splitting the data into train and test
from sklearn.preprocessing import StandardScaler
from statsmodels.api import OLS, add_constant
import pandas as pd
# Variables
X = df[final_selected_variables]
y = df['R']
# Standardize the predictors (zero mean, unit variance)
scaler = StandardScaler()
X_scaled_vars = scaler.fit_transform(X)
# Keep the original index so that X and y stay aligned row by row
X_scaled_df = pd.DataFrame(X_scaled_vars, columns=final_selected_variables, index=X.index)
# Add the intercept column and fit ordinary least squares
X_scaled_df = add_constant(X_scaled_df)
model = OLS(y, X_scaled_df).fit()
print(model.summary())
