I Raced XGBoost Against Logistic Regression on 358 Matches. The Boring Model Won.

Most of us share a reflex when handed a new modeling problem: reach for the model that wins. These days that means gradient boosting, and the reflex is usually right. XGBoost has earned its reputation across an astonishing range of problems.
So when I lined up five classifiers on the same task and a single linear model beat the Kaggle champion, the result was the kind that surprises no one who has shipped models on real data, and nearly everyone who is still learning.
Five classifiers, same task, same features: predict whether an international match ends in a home win, a draw, or an away win. The contenders ranged from humble logistic regression through random forest, KNN, and a small neural network to XGBoost.
The simplest one won. More interesting than the win is why it won, and that why is one of the most useful ideas in applied machine learning. Here is the experiment, the result, and the theory that explains it cleanly.
The setup
This came out of building a suite of eleven World Cup models, where I needed an outcome classifier and wanted to know which model family to trust. Each model saw the same three features on 358 historical international matches (the World Cups of 2010-2022 and the Euros of 2020 and 2024): the strength gap between the teams, their combined strength, and a win flag. The target is the three-way outcome.
I scored them with 5-fold cross-validation, and the primary metric is log loss, not accuracy. That choice does a lot of work in this article, so it is worth being explicit about it up front. Accuracy only asks whether the top class was right. Log loss grades the entire probability vector and punishes confident mistakes hard:
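from sklearn.model_selection import cross_val_predict
from sklearn.metrics import log_loss, accuracy_score
# Assumes `model` is an unfitted classifier, `X` the feature matrix, and `y` labels encoded 0, 1, 2.
# Out-of-fold class probabilities from 5-fold cross-validation.
proba = cross_val_predict(model, X, y, cv=5, method="predict_proba")
# Log loss grades the whole probability vector; accuracy only checks the top class.
print(log_loss(y, proba), accuracy_score(y, proba.argmax(1)))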
For a forecasting model whose whole job is to emit calibrated probabilities, log loss is the honest scorecard and accuracy is a sanity check. The number to keep in your pocket is ln(3) ≈ 1.099: the log loss you would get by sitting back and predicting a uniform 1/3 for all three classes. Beat 1.099 and your model knows something. Score above it and you would have been better off guessing.
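To make that baseline concrete, here is a minimal standard-library sketch. The label list is hypothetical and the `log_loss` helper is a hand-rolled stand-in for the scikit-learn function, written only to show that a uniform forecast scores ln(3) whatever the outcomes turn out to be.

```python
import math

def log_loss(y_true, proba):
    """Mean negative log-probability assigned to the true class."""
    return -sum(math.log(p[k]) for k, p in zip(y_true, proba)) / len(y_true)

# Hypothetical labels: 0 = home win, 1 = draw, 2 = away win.
y = [0, 0, 1, 2, 0, 2, 1, 0]

# A forecaster that knows nothing: 1/3 on every class, for every match.
uniform = [[1 / 3, 1 / 3, 1 / 3] for _ in y]

baseline = log_loss(y, uniform)
print(round(baseline, 3), round(math.log(3), 3))
```

Because every match receives the same 1/3 on its true class, the result does not depend on which labels you pass in.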
The result
Two things in the results below should bother you.
The first is the podium: logistic regression posted the best log loss, and XGBoost, the model that wins Kaggle competitions, finished last. The second is stranger and easy to skim past. XGBoost did not just lose; it scored above 1.099, the uniform-guess baseline. A model with a respectable 48% accuracy is, by the metric that matters here, worse than a three-sided coin.
| Model | CV log loss (lower is better) | CV accuracy |
|---|---|---|
| Logistic regression | 1.001 | 54% |
| Random forest | 1.011 | 56% |
| KNN | 1.013 | 53% |
| Neural network | 1.115 | 52% |
| XGBoost | 1.169 | 48% |
Both facts have the same root, and it is the most useful idea in this whole article.
Why the boring model won: bias and variance
The cleanest way to think about this is the bias-variance decomposition. A model's expected out-of-sample error splits into three parts:
Error = Bias² + Variance + Irreducible noise
- Bias is the error from wrong assumptions: a rigid model misses real structure in the data.
- Variance is the error from sensitivity to the particular training sample: an overly flexible model fits noise that will not recur.
- Irreducible noise is the genuine randomness of the thing you are predicting. In football it is huge: a single deflected shot can decide a knockout tie. No model touches this term, which is why even the best classifier here sits near 50% accuracy.
The whole game is a trade-off between the first two. High-capacity models, such as boosted trees or neural nets, buy low bias with enough flexibility to bend to almost any shape in the data. The bill for that flexibility is variance, and it comes due only when you do not have enough data to pin the model down.
And that is exactly our situation. With 358 examples split across a three-way target, you have roughly 120 matches per class on average. An XGBoost ensemble, meanwhile, has thousands of effective parameters spread across its trees. There is simply not enough signal to pin them all down, so they latch onto quirks that happen to appear in one validation fold and vanish in the next. That is textbook overfitting, and it explains the first puzzle: cross-validation did its job by catching the flexible models red-handed on data they had never seen.
So why did XGBoost fall below random instead of merely sitting mid-table? This is where the choice of log loss pays off. The penalty on a single example is −ln(p_true_class), and it is brutally convex.
Predict the eventual outcome at a hedged 0.5 and you eat −ln(0.5) = 0.69. Predict it at a confident-but-wrong 0.1 and you eat −ln(0.1) = 2.30, more than three times the pain for being sure and wrong. An overly flexible model on small data does not just make mistakes; it makes them confidently, emitting sharp 60–70% probabilities and being wrong often enough that the convex penalty drags its average score to worse than the timid 1/3-1/3-1/3 baseline.
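The arithmetic is easy to check. The sketch below is a toy calculation, not a model of the real experiment: `expected_log_loss` is a hypothetical helper that assumes a forecaster puts `p_top` on its pick, splits the rest evenly between the other two classes, and is right with probability `accuracy`. The 0.70 and 0.40 confidence levels are illustrative assumptions.

```python
import math

def expected_log_loss(accuracy, p_top):
    """Expected 3-class log loss when the top pick gets p_top and the rest is split evenly.

    Assumes the pick is right with probability `accuracy`; otherwise the true
    class is one of the two classes that each received (1 - p_top) / 2.
    """
    p_other = (1 - p_top) / 2
    return -(accuracy * math.log(p_top) + (1 - accuracy) * math.log(p_other))

hedged_penalty = -math.log(0.5)     # about 0.69
confident_penalty = -math.log(0.1)  # about 2.30

# Same 48% hit rate, two different levels of confidence.
sharp = expected_log_loss(accuracy=0.48, p_top=0.70)
timid = expected_log_loss(accuracy=0.48, p_top=0.40)
uniform = math.log(3)

print(f"hedged {hedged_penalty:.2f}, confident-but-wrong {confident_penalty:.2f}")
print(f"sharp {sharp:.3f}, timid {timid:.3f}, uniform {uniform:.3f}")
```

Under these assumptions, the sharp forecaster lands above the uniform baseline and the timid one lands below it, even though both pick the winner equally often. Accuracy cannot see that difference; log loss can.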
The proper name for this failure is miscalibration through overconfidence, and it is the signature of too much model for too little data. Whatever edge XGBoost gained on its occasional bold calls could not pay back what its overconfidence cost everywhere else.
Why logistic regression in particular
Knowing that flexible models will struggle is only half the story. The linear model did not just avoid a trap; it was, for this problem, the right tool. Two structural facts make that so:
- The true relationship is close to linear in the log-odds. Most of what predicts the outcome is "how big is the strength gap," and win probability rises smoothly and monotonically with it, which is exactly the functional form logistic regression assumes. When a model's bias matches the data-generating process, you need far less data to estimate it well. Trees, by contrast, have to discover that smooth curve out of piecewise-constant splits, spending precious data to approximate something the regression gets for free.
- Three features, weak interactions. Trees and nets earn their keep by hunting for interactions among many features. With only three features and little interaction among them, there is nothing for that machinery to find, so it adds variance without adding any signal to show for it.
There is a rule of thumb from classical statistics worth carrying around: you want on the order of 10–20 observations per parameter to get stable estimates.
Logistic regression estimates a handful of coefficients against 358 matches, well within that budget. A boosted ensemble is orders of magnitude over it. The mismatch was baked in before a single model was trained.
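A back-of-the-envelope version of that budget looks like the sketch below. The logistic regression count follows from three features and three classes. The boosted-tree count is a deliberately crude assumption (100 rounds of fully grown depth-3 trees, counted by leaf values), not a measurement of the ensemble used in the experiment.

```python
n_matches = 358
n_features = 3
n_classes = 3

# Multinomial logistic regression: one weight per feature plus an intercept, per class.
logit_params = n_classes * (n_features + 1)

# Boosted trees, counted crudely as leaf values. Illustrative assumptions only:
# 100 boosting rounds, one depth-3 tree per class per round, every tree fully grown.
rounds, depth = 100, 3
boosted_params = rounds * n_classes * 2 ** depth

for name, k in [("logistic regression", logit_params), ("boosted trees", boosted_params)]:
    print(f"{name}: {k} parameters, {n_matches / k:.2f} matches per parameter")
```

Leaf counts overstate the effective number of parameters, since shrinkage and regularization tie them together, but the gap between the two models survives any reasonable way of counting.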
How to read the scoreboard honestly
Before drawing conclusions from that table, two cautions about reading it, because the same small dataset that sinks XGBoost also makes the numbers noisier than they look.
The first is the variance of the metric itself. With 358 matches, each of the five folds holds out only about 72 matches, so the CV scores themselves wobble. The gaps between logistic regression, random forest, and KNN (1.001 vs. 1.011 vs. 1.013) are well inside that wobble. The three are effectively tied.
What is solid and repeatable is the two ends of the table: the simple linear model is reliably at the top, and the most flexible models are reliably at the bottom. Read the tiers, not the photo finish.
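For a rough sense of how large that wobble is, the sketch below computes the standard error of the mean log loss for a forecaster that always predicts a fixed outcome mix. The 27% draw rate comes from this article; the 45/28 split between home and away wins is a hypothetical assumption, and treating matches as independent is a simplification.

```python
import math

# Assumed outcome mix: the 27% draw rate is from the text; the home/away split is hypothetical.
probs = {"home": 0.45, "draw": 0.27, "away": 0.28}

# Per-match log loss of a forecaster that always predicts exactly this mix.
losses = {k: -math.log(p) for k, p in probs.items()}
mean = sum(probs[k] * losses[k] for k in probs)
var = sum(probs[k] * (losses[k] - mean) ** 2 for k in probs)

def standard_error(n):
    """Standard error of the mean log loss over n independent matches."""
    return math.sqrt(var / n)

se_fold = standard_error(72)   # one of five folds
se_all = standard_error(358)   # all out-of-fold predictions pooled
print(f"per fold: {se_fold:.3f}, pooled: {se_all:.3f}")
```

Under these assumptions the noise on a single fold is a few hundredths and the pooled noise is still above 0.01, which is the same size as the gaps among the top three models. Comparing models on identical folds is a paired comparison, so the true uncertainty on a difference is somewhat smaller, but not small enough to separate 1.001 from 1.013 with confidence.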
The second is the accuracy column, which you should resist over-reading entirely. Three-way football outcomes are especially hard because the draw is a genuine third outcome with no strong predictor: historically about 27 percent of these matches end level, and a draw is nearly impossible to call in advance from team strength alone.
Even a model that knew each team's true win probability could not push accuracy much past the high 50s, because the irreducible-noise term is so large. Seen that way, logistic regression's 54% is not mediocre; it is close to the practical ceiling for this feature set. The real difference between the models was never how often they picked the winner. It was calibration, which is exactly what log loss measures and accuracy hides. So: lead with a proper scoring rule, and keep accuracy as a gut check.
Can the trees be rescued? With discipline, yes.
None of this is an indictment of XGBoost. It is a statement about configuration relative to data size, and the same algorithm, handled differently, can close most of the gap. The lever is regularization: giving up some variance in exchange for a little bias.
- For XGBoost: shallow trees (max_depth=2–3), a strong min_child_weight, subsample and colsample_bytree below 1, an L2 penalty (lambda), a low learning rate with early stopping on a validation fold, and fewer rounds.
- For logistic regression: the L2 penalty (C) is already doing quiet regularization in the background, which is part of why it is stable out of the box.
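As a starting point, those levers could be written down as in the sketch below. Every value is an illustrative assumption to be tuned against cross-validated log loss, not the configuration used in this experiment. The keys follow the scikit-learn style `XGBClassifier` wrapper, where the L2 penalty is named `reg_lambda`; passing `early_stopping_rounds` to the constructor assumes a reasonably recent xgboost release.

```python
# Illustrative starting values only; tune them with cross-validated log loss.
xgb_params = {
    "max_depth": 2,               # shallow trees
    "min_child_weight": 10,       # each leaf must cover a meaningful share of 358 rows
    "subsample": 0.7,             # row subsampling per tree
    "colsample_bytree": 0.7,      # feature subsampling per tree
    "reg_lambda": 5.0,            # L2 penalty on leaf weights ("lambda" in native params)
    "learning_rate": 0.03,        # small steps
    "n_estimators": 300,          # upper limit on boosting rounds
    "early_stopping_rounds": 20,  # stop once the validation fold stops improving
}

# Smaller C means a stronger L2 penalty in scikit-learn's LogisticRegression.
logit_params = {"C": 1.0}

# Usage, assuming xgboost is installed and X_train, y_train, X_val, y_val exist:
#   from xgboost import XGBClassifier
#   model = XGBClassifier(**xgb_params)
#   model.fit(X_train, y_train, eval_set=[(X_val, y_val)])
```

The early-stopping fold should be carved out of the training data inside each CV split, so that the fold used for scoring never influences when training stops.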
Tuned hard enough, a regularized gradient-boosting model would probably match logistic regression here. But notice that "matching a single line after careful tuning" is the lesson, not a counterexample.
(A caveat in the other direction: very large, over-parameterized models can enter a "double descent" regime in which error falls again past the interpolation threshold. That happens at data and parameter scales far, far away from 358 matches.)
So how would you know, empirically, when trees finally suit your problem? Plot a learning curve: held-out log loss against training-set size, for each model.
There are two diagnostic patterns. A high-bias model like logistic regression plateaus early: more data does not help, because the bias floor dominates. A high-variance model like XGBoost starts out much worse but keeps improving as the data grows, because additional examples are exactly what shrink its variance. The point where the two curves cross is the data budget at which the flexible model starts to win.
At 358 international matches we are sitting clearly to the left of that crossover. Feed XGBoost tens of thousands of club matches with rich features (xG, rest days, scheduling) and it will probably pull ahead. Same algorithm, different data regime, opposite conclusion. That contingency is the point.
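A learning curve of this kind can be sketched with scikit-learn's `learning_curve`. The data below is synthetic (generated by `make_classification`, not the match dataset), and scikit-learn's `GradientBoostingClassifier` stands in for XGBoost so the example needs no extra dependency. Where the curves cross, if they cross at all, depends entirely on the data, so treat the output as a template for your own problem, not as a result.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import learning_curve

# Synthetic stand-in for the match data: 3 features, 3 classes (not the real dataset).
X, y = make_classification(
    n_samples=2000, n_features=3, n_informative=3, n_redundant=0,
    n_classes=3, n_clusters_per_class=1, random_state=0,
)

models = {
    "logistic regression": LogisticRegression(max_iter=1000),
    # scikit-learn's gradient boosting stands in for XGBoost here.
    "gradient boosting": GradientBoostingClassifier(random_state=0),
}

curves = {}
for name, model in models.items():
    sizes, _, test_scores = learning_curve(
        model, X, y, cv=5, scoring="neg_log_loss",
        train_sizes=np.linspace(0.1, 1.0, 5), shuffle=True, random_state=0,
    )
    # Scores are negated log loss, so flip the sign and average over the folds.
    curves[name] = -test_scores.mean(axis=1)
    for n, loss in zip(sizes, curves[name]):
        print(f"{name:20s} n={n:5d} held-out log loss={loss:.3f}")
```

Plotting `curves` against `sizes` gives the picture described above: look for the early plateau of the simple model and check whether the flexible model is still improving when the data runs out.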
Key takeaway: match the model to your data
Model complexity should match the data, not the hype. On large, messy, feature-rich problems, gradient boosting and deep nets often dominate. That is why they are famous, and why the reflex to reach for them is usually a good one.
But on a small, clean, low-dimensional problem like this one, the reflex is wrong. The discipline is to start simple, establish a strong baseline, measure with a proper scoring rule, and add complexity only when held-out data says it has earned its place. Logistic regression is not a consolation prize here. Given the data, it is the right answer.
This discipline (start simple, validate honestly with log loss and calibration, scale complexity deliberately) runs through the modeling chapters of Soccer Analytics with Machine Learning (O'Reilly, 2026, just off the press!): regression and classification in Chapter 5, and tree-based methods (XGBoost included), along with when their extra firepower pays off, in Chapter 6.
So before you reach for the big model on your next project, ask two questions: how much data do you really have, and how will you know that the complexity helped? Sometimes the best-fitting line is also the bottom line.



