Multiple Linear Regression Analysis | Towards Data Science

The full code for this example is at the bottom of this post.

Multiple regression is used when your response variable Y is continuous and you have k covariates. The data are of the form:

(Y₁, X₁), …, (Yᵢ, Xᵢ), …, (Yₙ, Xₙ)

where Xᵢ = (Xᵢ₁, …, Xᵢₖ) is a vector of covariates and n is the number of observations. Here, Xᵢ is the vector of the k covariate values for a particular observation i.

Understanding the data

To make this concrete, imagine the following scenario:

You enjoy running and track your performance by recording the distance you run each day. Over 100 consecutive days, you collect four pieces of information:

  • The distance you run,
  • The number of hours you spent running,
  • The number of hours you slept the previous night,
  • And the number of hours you worked.

Now, on the 101st day, you recorded everything except the distance you ran. You want to estimate that missing value using the information you do have: the number of hours you spent running, the number of hours you slept the previous night, and the number of hours you worked that day.

To do this, you can rely on the data from the previous 100 days, which takes the form:

(Y₁, X₁), …, (Yᵢ, Xᵢ), …, (Y₁₀₀, X₁₀₀)

Here, each Yᵢ is the distance you ran on day i, and each covariate vector Xᵢ = (Xᵢ₁, Xᵢ₂, Xᵢ₃) corresponds to:

  • Xᵢ₁: The number of hours spent running,
  • Xᵢ₂: The number of hours slept the previous night,
  • Xᵢ₃: The number of hours worked that day.

The index i = 1, …, 100 refers to the 100 days with complete data. With this dataset, you can now fit a multiple linear regression model to estimate the missing response variable for day 101.

Specification of the model

If we assume a linear relationship between the response variable and the covariates, which you can measure using the Pearson correlation, we can specify the model as:

Yᵢ = Σⱼ₌₁ᵏ βⱼ Xᵢⱼ + εᵢ

for i = 1, …, n, where E(εᵢ | Xᵢ₁, …, Xᵢₖ) = 0. To include an intercept, the first variable is set to Xᵢ₁ = 1 for i = 1, …, n. To estimate the coefficients, the model is expressed in matrix notation.

Note: the outcomes will be denoted by the column vector:

Y = (Y₁, …, Yₙ)ᵀ

X is the design matrix (with the intercept and the k covariates), β is the column vector of coefficients used in the linear regression model, and ε is the column vector of random error terms, one per observation.

We can then rewrite the model as:

Y = Xβ + ε
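As a minimal sketch of this matrix notation, assuming NumPy and made-up numbers for three days of the running example (the values, the coefficient vector, and the variable names are illustrative assumptions, not data from this post), the design matrix can be assembled by prepending a column of ones for the intercept:

```python
import numpy as np

# Hypothetical covariates for 3 days: hours running, hours slept, hours worked
covariates = np.array([
    [1.0, 7.5, 8.0],
    [0.5, 6.0, 9.0],
    [1.5, 8.0, 7.0],
])

# Prepend a column of ones so that the first coefficient acts as the intercept
X = np.column_stack([np.ones(len(covariates)), covariates])

# Hypothetical coefficient vector (intercept first) and zero noise, for illustration only
beta = np.array([1.0, 6.0, 0.2, -0.1])
eps = np.zeros(len(X))

# Matrix form of the model: Y = X beta + eps
Y = X @ beta + eps
print(X.shape)
print(Y)
```

Each row of X is one observation and each entry of Y is the matching response, so the single matrix product replaces the n separate sums of the scalar form.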

Estimation of the coefficients

Assuming that the (k + 1) × (k + 1) matrix XᵀX is invertible, the least squares estimate of β is:

β̂ = (XᵀX)⁻¹ XᵀY
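The least squares estimate can be sketched in a few lines of NumPy. The data below are simulated (an assumption made purely for illustration), and the normal equations are solved with `np.linalg.solve` rather than by forming the inverse explicitly, which is usually the more stable choice numerically:

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulated data (assumption): n = 100 observations, intercept plus 3 covariates
n = 100
X = np.column_stack([np.ones(n), rng.normal(size=(n, 3))])
beta_true = np.array([2.0, 1.0, -0.5, 0.3])
Y = X @ beta_true + rng.normal(scale=0.1, size=n)

# Normal equations: solve (X^T X) beta = X^T Y
beta_hat = np.linalg.solve(X.T @ X, X.T @ Y)

# Cross-check with NumPy's general least-squares solver
beta_lstsq, *_ = np.linalg.lstsq(X, Y, rcond=None)

print(beta_hat)
```

Both routes should agree up to floating-point error, and with this little noise the estimate lands close to the coefficients used to simulate the data.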

We can derive an estimate of the regression function, an unbiased estimate of σ², and an approximate 1 − α confidence interval for βⱼ:

  • Estimate of the regression function: r̂(x) = Σⱼ₌₁ᵏ β̂ⱼ xⱼ
  • σ̂² = (1 / (n − k)) Σᵢ₌₁ⁿ ε̂ᵢ², where ε̂ᵢ = Yᵢ − r̂(Xᵢ) are the residuals and k counts the fitted coefficients βⱼ, j = 1, …, k
  • And β̂ⱼ ± tₙ₋ₖ, ₁₋α/₂ × ŝe(β̂ⱼ) is an approximate (1 − α) confidence interval, where ŝe(β̂ⱼ) is the square root of the jth diagonal element of the matrix σ̂² (XᵀX)⁻¹
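The three quantities above can be computed directly. The sketch below assumes NumPy and SciPy, uses simulated data, and writes `k` for the number of columns of X with the intercept included, so that n − k is the residual degrees of freedom; all names are illustrative:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)

# Simulated data (assumption); k counts the columns of X, intercept included
n, k = 200, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, k - 1))])
beta_true = np.array([1.0, 2.0, -1.0])
Y = X @ beta_true + rng.normal(scale=0.5, size=n)

# Least squares estimate and residuals
XtX_inv = np.linalg.inv(X.T @ X)
beta_hat = XtX_inv @ X.T @ Y
residuals = Y - X @ beta_hat

# Unbiased estimate of the error variance
sigma2_hat = residuals @ residuals / (n - k)

# Standard errors: square roots of the diagonal of sigma2_hat * (X^T X)^{-1}
se = np.sqrt(np.diag(sigma2_hat * XtX_inv))

# Approximate (1 - alpha) confidence intervals based on the t distribution
alpha = 0.05
t_crit = stats.t.ppf(1 - alpha / 2, df=n - k)
lower = beta_hat - t_crit * se
upper = beta_hat + t_crit * se

for j in range(k):
    print(f"beta_{j}: {beta_hat[j]:.3f}  [{lower[j]:.3f}, {upper[j]:.3f}]")
```

The explicit inverse is used here only because its diagonal is needed for the standard errors.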

Example of application

Because we did not record data on our running performance, we will use a crime dataset covering 47 states in 1960 that can be obtained here. Before we fit a linear regression, there are several steps we must follow.

Understanding the different variables of the data.

The first 9 observations of the data are given by:

R      Age  S  Ed   Ex0  Ex1  LF   M     N    NW   U1   U2  W    X
79.1   151  1  91   58   56   510  950   33   301  108  41  394  261
163.5  143  0  113  103  95   583  1012  13   102  96   36  557  194
57.8   142  1  89   45   44   533  969   18   219  94   33  318  250
196.9  136  0  121  149  141  577  994   157  80   102  39  673  167
123.4  141  0  121  109  101  591  985   18   30   91   20  578  174
68.2   121  0  110  118  115  547  964   25   44   84   29  689  126
96.3   127  1  111  82   79   519  982   4    139  97   38  620  168
155.5  131  1  109  115  109  542  969   50   179  79   35  472  206
85.6   157  1  90   65   62   553  955   39   286  81   28  421  239

The data has 14 variables: the continuous response variable R, 12 continuous predictor variables, and one binary categorical variable S:

  1. R: Crime rate: number of offenses reported to the police for each state
  2. Age: The number of males aged 14-24 per 1000 population
  3. S: Indicator variable for Southern states (0 = No, 1 = Yes)
  4. Ed: Mean number of years of schooling x 10 for persons aged 25 or older
  5. Ex0: 1960 per capita expenditure on police by state and local government
  6. Ex1: 1959 per capita expenditure on police by state and local government
  7. LF: Labor force participation rate per 1000 civilians aged 14-24
  8. M: The number of males per 1000 females
  9. N: State population size in hundred thousands
  10. NW: The number of non-whites per 1000 population
  11. U1: Unemployment rate of urban males per 1000, aged 14-24
  12. U2: Unemployment rate of urban males per 1000, aged 35-39
  13. W: Median value of transferable goods and assets or family income, in tens of $
  14. X: The number of families per 1000 earning below 1/2 the median income

The data has no missing values.

Graphical analysis of the relationship between the covariates X and the response variable Y

Graphical analysis of the relationship between the explanatory variables and the response variable is a step to carry out when performing a linear regression.

It helps to visualize linear trends, spot anomalies, and assess the relevance of variables before building any model.

Box plots and scatter plots with fitted linear regression lines illustrate the trend between each variable and R.

Some variables are positively correlated with the crime rate, while others are negatively correlated.

For example, we see a strong positive relationship between R (the crime rate) and Ex1.

In contrast, Age appears to be negatively correlated with crime.

Finally, the boxplot of the binary variable S (indicating region: North or South) suggests that the crime rate is similar between the two regions. Next, we can analyze the correlation matrix.

Heatmap of the Spearman correlation matrix

The correlation matrix allows us to study the strength of the relationship between variables. While the Pearson correlation is commonly used to measure linear relationships, the Spearman correlation is more appropriate when we want to capture monotonic, possibly non-linear relationships between variables.

In this analysis, we will use the Spearman correlation to better account for non-linear associations.
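To illustrate the difference between the two coefficients, here is a small sketch on toy data (the data frame `toy` and its values are assumptions made for illustration). The relationship y = exp(x) is perfectly monotonic but far from linear, so the rank-based Spearman coefficient reaches 1 while the Pearson coefficient stays below it:

```python
import numpy as np
import pandas as pd

# Toy data (assumption): y grows monotonically but non-linearly with x
x = np.arange(1, 11, dtype=float)
toy = pd.DataFrame({"x": x, "y": np.exp(x)})

# Pearson measures linear association; Spearman is Pearson applied to the ranks
pearson = toy.corr(method="pearson").loc["x", "y"]
spearman = toy.corr(method="spearman").loc["x", "y"]

print(f"Pearson:  {pearson:.3f}")
print(f"Spearman: {spearman:.3f}")
```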

Heatmap of the correlation matrix, produced in Python

The first row of the correlation matrix shows the strength of the relationship between each covariate and the response variable R.

For example, Ex0 and Ex1 both show a correlation of more than 60% with R, which indicates a strong association. These variables appear to be good predictors of the crime rate.

However, since the correlation between Ex0 and Ex1 is almost perfect, they likely convey the same information. To avoid redundancy, we can select just one of them, preferably the one with the stronger correlation with R.

When several variables are strongly correlated with each other (a correlation of 60%, for example), they tend to carry redundant information. In such cases, we keep only one of them: the one most strongly correlated with the response variable R. This allows us to reduce multicollinearity.

This exercise allows us to select these variables: ['Ex1', 'LF', 'M', 'N', 'NW', 'U2'].

Study of multicollinearity using the VIF (Variance Inflation Factor)

Before fitting the linear regression, it is important to study multicollinearity.

When correlation exists among predictors, the standard errors of the coefficient estimates increase, which inflates their variances. The Variance Inflation Factor (VIF) is a diagnostic tool used to measure how much the variance of a predictor's coefficient is inflated due to multicollinearity, and it is typically reported in the regression output under a "VIF" column.

Interpretation of the VIF

The VIF is computed for each predictor in the model. The approach is to regress the i-th predictor on all the other predictors. We then obtain Rᵢ², which can be used to compute the VIF of the i-th variable using the formula:

VIFᵢ = 1 / (1 − Rᵢ²)
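The definition can be followed literally in code. The sketch below assumes NumPy and simulated predictors; the helper `vif` is a hypothetical function written for this illustration and is not the statsmodels routine used in the appendix:

```python
import numpy as np

def vif(X: np.ndarray) -> np.ndarray:
    """VIF of each column of X (predictors only, without an intercept column)."""
    n, p = X.shape
    out = np.empty(p)
    for i in range(p):
        target = X[:, i]
        # Regress predictor i on an intercept plus all the other predictors
        others = np.column_stack([np.ones(n), np.delete(X, i, axis=1)])
        coef, *_ = np.linalg.lstsq(others, target, rcond=None)
        resid = target - others @ coef
        r2 = 1.0 - resid @ resid / np.sum((target - target.mean()) ** 2)
        out[i] = 1.0 / (1.0 - r2)
    return out

# Simulated predictors (assumption): c is nearly a copy of a
rng = np.random.default_rng(2)
a = rng.normal(size=500)
b = rng.normal(size=500)
c = a + 0.1 * rng.normal(size=500)

vif_independent = vif(np.column_stack([a, b]))
vif_collinear = vif(np.column_stack([a, b, c]))
print(vif_independent)
print(vif_collinear)
```

With independent predictors every VIF sits near 1. Once the near-duplicate column is added, the VIFs of the two entangled columns rise far above the usual threshold of 5 while the unrelated column is barely affected.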

The table below presents the VIF values for the six remaining variables, all of which are below 5. This indicates that multicollinearity is not a concern, and we can proceed with fitting the linear regression model.

The VIF of each variable is below 5.

Fitting a linear regression on the six variables

If we fit a linear regression of the crime rate on the six selected variables, we get the following:

Output of the multiple linear regression analysis. The corresponding code is provided in the appendix.

Residual diagnostics

Before interpreting the regression results, we must first assess the quality of the residuals, particularly by checking for autocorrelation, homoscedasticity (constant variance), and normality. The residual diagnostics are given in the table below:

Residual diagnostics, taken from the regression summary
  • A Durbin-Watson statistic ≈ 2 indicates no autocorrelation in the residuals.
  • From Omnibus to Kurtosis, all values show that the residuals are symmetric and approximately normally distributed.
  • The low condition number (3.06) confirms that there is no multicollinearity among the predictors.
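To make the first diagnostic less of a black box, here is a small sketch of the Durbin-Watson statistic, the sum of squared successive differences of the residuals divided by their sum of squares. It assumes NumPy and simulated residuals; the helper name `durbin_watson` is chosen for this illustration:

```python
import numpy as np

def durbin_watson(residuals) -> float:
    """Sum of squared successive differences divided by the sum of squares."""
    e = np.asarray(residuals, dtype=float)
    return float(np.sum(np.diff(e) ** 2) / np.sum(e ** 2))

# Simulated residuals (assumption)
rng = np.random.default_rng(3)
independent = rng.normal(size=5000)   # no autocorrelation
trending = np.cumsum(independent)     # strong positive autocorrelation

dw_independent = durbin_watson(independent)
dw_trending = durbin_watson(trending)
print(f"independent residuals: {dw_independent:.2f}")
print(f"trending residuals:    {dw_trending:.4f}")
```

Values near 2 are what uncorrelated residuals produce, values near 0 point to positive autocorrelation, and values near 4 point to negative autocorrelation.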

Key points to remember

We can also assess the overall quality of the model through indicators such as the R-squared and the F-statistic, which show satisfactory results in this case. (See the appendix for more details.)
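Both indicators follow from the same two sums of squares. The sketch below assumes NumPy and simulated data, with `p` denoting the number of predictors excluding the intercept; it is meant to show where the numbers in a regression summary come from, not to reproduce the figures of this post:

```python
import numpy as np

rng = np.random.default_rng(4)

# Simulated data (assumption): p predictors plus an intercept
n, p = 150, 2
X = np.column_stack([np.ones(n), rng.normal(size=(n, p))])
y = X @ np.array([1.0, 2.0, -1.0]) + rng.normal(size=n)

beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
fitted = X @ beta_hat

ss_res = np.sum((y - fitted) ** 2)     # residual sum of squares
ss_tot = np.sum((y - y.mean()) ** 2)   # total sum of squares

# Share of the variance of y explained by the model
r_squared = 1.0 - ss_res / ss_tot

# F-statistic for the null hypothesis that all p slope coefficients are zero
f_stat = (r_squared / p) / ((1.0 - r_squared) / (n - p - 1))

print(f"R-squared:   {r_squared:.3f}")
print(f"F-statistic: {f_stat:.1f}")
```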

We can now interpret the regression coefficients from a statistical point of view.
We deliberately leave out any business-specific interpretation of the results.
The goal of this analysis is to illustrate a few simple and essential steps for modeling a problem using multiple linear regression.

At the 5% significance level, two coefficients are statistically significant: Ex1 and NW.

This is not surprising, as these were the two variables showing a correlation above 40% with the response variable R.

This post gives you guidelines for performing a linear regression:

  • It is important to check linearity through graphical analysis and to study the correlation between the response variable and the predictors.
  • Examining the correlations among variables helps reduce multicollinearity and supports variable selection.
  • When two predictors are highly correlated, they may convey redundant information. In such cases, you can keep the one that is more strongly correlated with the response or, based on domain expertise, the one with greater business relevance or practical interpretability.
  • The Variance Inflation Factor (VIF) is a useful tool to quantify and assess multicollinearity.
  • Before interpreting the model coefficients statistically, it is essential to check the autocorrelation, normality, and homoscedasticity of the residuals to ensure that the model assumptions are met.

While this analysis provides valuable insights, it also has certain limitations.

The absence of missing values in the dataset simplifies the study, but this is rarely the case in real-world scenarios.

If you are building a predictive model, it is important to split the data into training, testing, and possibly an out-of-time validation set to ensure robust evaluation.
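A random split can be sketched with NumPy alone. The 80/20 proportion and the variable names below are assumptions for illustration, and the sample size of 47 simply mirrors the number of states in the dataset:

```python
import numpy as np

rng = np.random.default_rng(5)

# Number of observations (47 mirrors the dataset size; the split ratio is an assumption)
n = 47
test_fraction = 0.2

# Shuffle the row positions once, then cut the shuffled order in two
indices = rng.permutation(n)
n_test = int(round(test_fraction * n))
test_idx = indices[:n_test]
train_idx = indices[n_test:]

print(len(train_idx), len(test_idx))
```

The index arrays can then be passed to `df.iloc[...]` to build the two subsets. For an out-of-time validation set, the rows would instead be ordered by date and the most recent block held out rather than shuffled.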

For variable selection, techniques such as stepwise selection and other feature selection methods can be applied.

When comparing multiple models, it is essential to define appropriate performance metrics.

In the case of linear regression, commonly used metrics include the Mean Absolute Error (MAE) and the Mean Squared Error (MSE).
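Both metrics are short enough to write by hand. The sketch below assumes NumPy and uses four made-up observations and predictions:

```python
import numpy as np

# Made-up true values and predictions (assumption, for illustration only)
y_true = np.array([3.0, -0.5, 2.0, 7.0])
y_pred = np.array([2.5, 0.0, 2.0, 8.0])

errors = y_true - y_pred

# Mean Absolute Error: average size of the errors, in the units of y
mae = np.mean(np.abs(errors))

# Mean Squared Error: average squared error, which penalizes large errors more
mse = np.mean(errors ** 2)

print(f"MAE: {mae:.3f}")  # 0.500
print(f"MSE: {mse:.3f}")  # 0.375
```

MAE is easier to read because it stays in the units of the response, while MSE reacts more strongly to a few large misses. Which one to prefer depends on how costly large errors are for the problem at hand.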

Image Credits

All images and visualizations in this article were created by the author using Python (pandas, matplotlib, seaborn, and plotly), unless otherwise stated.

References

Wasserman, L. (2013). All of Statistics: A Concise Course in Statistical Inference. Springer Science & Business Media.

Data & Licensing

The dataset used in this article contains crime-related and demographic statistics for 47 US states in 1960.
It originates from the FBI's Uniform Crime Reporting (UCR) Program and additional US government sources.

As a US government work, the data is in the public domain under 17 U.S. Code § 105 and is free to use, share, and reproduce without restriction.

Sources:

Code

Import the data

import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Load the dataset
df = pd.read_csv('data/Multiple_Regression_Dataset.csv')
df.head()

Visual analysis of the variables

# Extract the response variable and the covariates
response = 'R'
covariates = [col for col in df.columns if col != response]

# Create a new figure with a 4 x 4 grid of subplots (1 boxplot + 12 scatter plots)
fig, axes = plt.subplots(nrows=4, ncols=4, figsize=(20, 18))
axes = axes.flatten()

# Plot boxplot for binary variable 'S'
sns.boxplot(data=df, x='S', y='R', ax=axes[0])
axes[0].set_title('Boxplot of R by S')
axes[0].set_xlabel('S')
axes[0].set_ylabel('R')

# Plot regression lines for all other covariates
plot_index = 1
for cov in covariates:
    if cov != 'S':
        sns.regplot(data=df, x=cov, y='R', ax=axes[plot_index], scatter=True, line_kws={"color": "red"})
        axes[plot_index].set_title(f'{cov} vs R')
        axes[plot_index].set_xlabel(cov)
        axes[plot_index].set_ylabel('R')
        plot_index += 1

# Hide unused subplots
for i in range(plot_index, len(axes)):
    fig.delaxes(axes[i])

fig.tight_layout()
plt.show()

Analysis of the correlation between variables

spearman_corr = df.corr(method='spearman')
plt.figure(figsize=(12, 10))
sns.heatmap(spearman_corr, annot=True, cmap="coolwarm", fmt=".2f", linewidths=0.5)
plt.title("Correlation Matrix Heatmap")
plt.show()

Filtering predictors with high intercorrelation (ρ > 0.6)

# Step 2: Correlation of each variable with response R
spearman_corr_with_R = spearman_corr['R'].drop('R')  # exclude the R-R entry

# Step 3: Identify pairs of covariates with strong inter-correlation (|ρ| > threshold)
strong_pairs = []
threshold = 0.6
covariates = spearman_corr_with_R.index

for i, var1 in enumerate(covariates):
    for var2 in covariates[i+1:]:
        if abs(spearman_corr.loc[var1, var2]) > threshold:
            strong_pairs.append((var1, var2))

# Step 4: From each correlated pair, keep only the variable most correlated with R
to_keep = set()
to_discard = set()

for var1, var2 in strong_pairs:
    if abs(spearman_corr_with_R[var1]) >= abs(spearman_corr_with_R[var2]):
        to_keep.add(var1)
        to_discard.add(var2)
    else:
        to_keep.add(var2)
        to_discard.add(var1)

# Final selection: all covariates excluding the ones to discard due to redundancy
final_selected_variables = [var for var in covariates if var not in to_discard]

final_selected_variables

Analysis of multicollinearity using the VIF

from statsmodels.stats.outliers_influence import variance_inflation_factor
from statsmodels.tools.tools import add_constant

# Predictors retained after the correlation filter
X = df[final_selected_variables]

# Add an intercept column so each VIF is computed against a model with a constant
X_with_const = add_constant(X)

vif_data = pd.DataFrame()
vif_data["variable"] = X_with_const.columns
vif_data["VIF"] = [variance_inflation_factor(X_with_const.values, i)
                   for i in range(X_with_const.shape[1])]

vif_data = vif_data[vif_data["variable"] != "const"]

print(vif_data)

Fitting a linear regression model on the six variables after standardization, without splitting the data into train and test

from sklearn.preprocessing import StandardScaler
from statsmodels.api import OLS, add_constant
import pandas as pd

# Predictors kept after the correlation filter, and the response
X = df[final_selected_variables]
y = df['R']

# Standardize the predictors (zero mean, unit variance)
scaler = StandardScaler()
X_scaled_vars = scaler.fit_transform(X)

# Rebuild a DataFrame, keeping the original index so rows stay aligned with y
X_scaled_df = pd.DataFrame(X_scaled_vars, columns=final_selected_variables, index=X.index)

# Add the intercept column
X_scaled_df = add_constant(X_scaled_df)

# Fit ordinary least squares on the full dataset and print the summary
model = OLS(y, X_scaled_df).fit()
print(model.summary())

Image by the author: OLS Regression Results
