Machine Learning

How LLMs Work: Reinforcement Learning, RLHF, DeepSeek R1, OpenAI o1, AlphaGo

Welcome to Part 2 of my LLM deep dive. If you haven't read Part 1, I highly encourage you to check it out first.

Previously, we covered the first two major stages of training an LLM:

  1. Pre-training – Learning from massive datasets to build a base model.
  2. Supervised fine-tuning (SFT) – Refining the model with curated examples to make it useful.

Now, we move on to the next major stage: Reinforcement Learning (RL). While pre-training and SFT are well established, RL is still evolving but has already become a critical part of the training pipeline.

I've taken reference from Andrej Karpathy's widely popular YouTube video. Andrej is a founding member of OpenAI, so his insights are worth paying attention to. You get the idea.

Let's go 🚀

What's the purpose of reinforcement learning (RL)?

Humans and LLMs process information differently. What is intuitive for us, such as basic arithmetic, may not be intuitive for an LLM, which only sees text as a sequence of tokens. Conversely, an LLM can generate expert-level responses on complex topics simply because it has seen enough examples during training.

This difference in cognition makes it challenging for human annotators to provide the “perfect” set of labels that consistently guide an LLM toward the right answer.

RL bridges this gap by allowing the model to learn from its own experience.

Instead of relying solely on explicit labels, the model explores different token sequences and receives feedback, in the form of reward signals, on which outputs are most useful. Over time, it learns to align better with human intent.

Intuition behind RL

LLMs are stochastic, meaning their responses aren't fixed. Even with the same prompt, the output varies because it is sampled from a probability distribution.
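To make the sampling idea concrete, here is a minimal sketch of drawing a next token from a probability distribution. The three-word vocabulary, the scores, and the `temperature` parameter are made up for illustration; a real LLM does the same thing over a vocabulary of many thousands of tokens.

```python
import math
import random

def softmax(logits, temperature=1.0):
    """Turn raw scores into probabilities that sum to 1."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)                      # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def sample_token(vocab, logits, rng, temperature=1.0):
    """Draw one token at random, weighted by its probability."""
    probs = softmax(logits, temperature)
    return rng.choices(vocab, weights=probs, k=1)[0]

# Hypothetical next-token scores for the prompt "The sky is"
vocab = ["blue", "clear", "falling"]
logits = [2.0, 1.0, -1.0]

rng = random.Random(0)                   # fixed seed so the run is repeatable
samples = [sample_token(vocab, logits, rng) for _ in range(1000)]
counts = {tok: samples.count(tok) for tok in vocab}
print(counts)
```

The same prompt yields different tokens on different draws, with likelier tokens showing up more often. That variety is what RL later exploits.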

We can harness this randomness by generating thousands or even millions of possible responses in parallel. Think of it as the model exploring different paths, some good, some bad. Our goal is to encourage it to take the better paths more often.

To do this, we train the model on the token sequences that lead to better outcomes. Unlike supervised fine-tuning, where human experts provide labeled data, reinforcement learning allows the model to learn from itself.

The model discovers which responses work best, and after each training step we update its parameters. Over time, this makes the model more likely to produce high-quality answers when given similar prompts in the future.

But how do we determine which responses are best? And how much RL should we do? The details are tricky, and getting them right is not trivial.

RL is not “new”: it can surpass human expertise (AlphaGo, 2016)

A great example of RL's power is DeepMind's AlphaGo, the first AI to defeat a professional Go player and later surpass human-level play.

In the 2016 Nature paper (graph below), when a model was trained purely by SFT (giving it tons of good examples to imitate), it was able to reach human-level performance, but never surpass it.

The dotted line represents the performance of Lee Sedol, one of the best Go players in the world.

This is because SFT is about replication, not innovation. It doesn't allow the model to discover new strategies beyond human knowledge.

However, RL enabled AlphaGo to play against itself, refine its strategies, and ultimately exceed human expertise (blue line).

Image taken from the 2016 AlphaGo paper

RL represents an exciting frontier in AI, where models can explore strategies beyond human imagination when we train them on a diverse and challenging pool of problems to refine their thinking strategies.

RL foundations recap

Let's quickly recap the key components of a typical RL setup:

Image by author
  • Agent – The learner or decision maker. It observes the current situation (state), chooses an action, and then updates its behavior based on the outcome (reward).
  • Environment – The external system in which the agent operates.
  • State – A snapshot of the environment at a given step t.

At each timestep, the agent performs an action in the environment, which changes the environment's state to a new one. The agent also receives feedback indicating how good or bad the action was.

This feedback is called a reward and is represented as a number. A positive reward encourages that behavior, and a negative reward discourages it.

By using feedback from different states and actions, the agent gradually learns the optimal strategy to maximize the total reward over time.
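The loop of state, action, and reward can be sketched in a few lines. The `Corridor` environment below is a hypothetical toy I made up for illustration (walk right to reach the goal cell); the step cost of -0.1 and the goal bonus of 1.0 are arbitrary choices.

```python
import random

class Corridor:
    """Toy environment (hypothetical): walk from cell 0 to the last cell."""

    def __init__(self, length=5):
        self.length = length
        self.state = 0

    def reset(self):
        self.state = 0
        return self.state

    def step(self, action):
        # action is -1 (left) or +1 (right); the walls clamp the position
        self.state = min(max(self.state + action, 0), self.length - 1)
        done = self.state == self.length - 1
        reward = 1.0 if done else -0.1   # small cost per step, bonus at the goal
        return self.state, reward, done

def run_episode(env, choose_action, max_steps=100):
    """Run one episode and return the total reward the agent collected."""
    state = env.reset()
    total_reward = 0.0
    for _ in range(max_steps):
        action = choose_action(state)            # agent picks an action
        state, reward, done = env.step(action)   # environment responds
        total_reward += reward
        if done:
            break
    return total_reward

rng = random.Random(0)
random_agent = lambda state: rng.choice([-1, 1])
always_right = lambda state: 1

print("random agent:", run_episode(Corridor(), random_agent))
print("always right:", run_episode(Corridor(), always_right))
```

An agent that wanders at random pays more step costs before it reaches the goal, so it can never beat the one that heads straight there. Learning a good strategy means discovering that difference from the reward alone.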

Policy

The policy is the agent's strategy. If the agent follows a good policy, it will consistently make good decisions, leading to higher rewards over many steps.

In mathematical terms, it is a function that determines the probability of different actions for a given state, written πθ(a|s).
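As a rough illustration of πθ(a|s), here is a tiny tabular policy: one score per (state, action) pair, turned into probabilities with a softmax. The class name, the learning rate, and the REINFORCE-style update are my own choices for the sketch, not something prescribed above.

```python
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    z = sum(exps)
    return [e / z for e in exps]

class TabularPolicy:
    """pi_theta(a|s) = softmax(theta[s])[a], with one row of scores per state."""

    def __init__(self, n_states, n_actions):
        self.theta = [[0.0] * n_actions for _ in range(n_states)]

    def probs(self, state):
        return softmax(self.theta[state])

    def reinforce_update(self, state, action, reward, lr=0.5):
        # Gradient of log pi(action|state) w.r.t. theta[state][k]
        # is (1 if k == action else 0) - pi(k|state).
        p = self.probs(state)
        for k in range(len(p)):
            grad_log = (1.0 if k == action else 0.0) - p[k]
            self.theta[state][k] += lr * reward * grad_log

policy = TabularPolicy(n_states=1, n_actions=3)
before = policy.probs(0)
policy.reinforce_update(state=0, action=2, reward=1.0)   # action 2 was rewarded
after = policy.probs(0)
print(before)
print(after)
```

After a positive reward, the probability of the rewarded action goes up and the others go down, which is the "encourage good behavior" idea in numeric form.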

Value function

An estimate of how good it is to be in a certain state, considering the long-term expected reward. For an LLM, the reward might come from human feedback or from a reward model.
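One common way to put a number on "long-term expected reward" is the discounted return, averaged over several episodes that start from the same state. The discount factor `gamma` and both helper functions below are assumptions for this sketch; the article itself does not fix a particular formula.

```python
def discounted_returns(rewards, gamma=0.9):
    """Return G_t = r_t + gamma * G_{t+1} for every step of one episode."""
    returns = [0.0] * len(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns

def estimate_value(episodes, gamma=0.9):
    """Monte Carlo estimate of V(s): the average return over episodes
    that all start in the same state s."""
    starts = [discounted_returns(ep, gamma)[0] for ep in episodes]
    return sum(starts) / len(starts)

# One episode where the only reward arrives at the final step
print(discounted_returns([0.0, 0.0, 1.0]))

# Two hypothetical episodes from the same start state: one succeeds, one fails
print(estimate_value([[0.0, 1.0], [0.0, 0.0]]))
```

Rewards that arrive later count for less, so states closer to a payoff get a higher value.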

Actor-Critic architecture

It is a popular RL setup that combines two components:

  1. Actor – Learns and updates the policy (πθ), deciding which action to take in each state.
  2. Critic – Evaluates the value function (V(s)) to give feedback to the actor on whether its chosen actions are leading to good outcomes.

How it works:

  • The actor picks an action based on its current policy.
  • The critic evaluates the outcome (reward + next state) and updates its value estimate.
  • The critic's feedback helps the actor refine its policy so that future actions lead to higher rewards.
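The three steps above can be sketched on the smallest possible problem: a single state with two actions, where action 1 pays a reward of 1 and action 0 pays nothing. Everything here (the toy reward, the learning rates, the function name) is an assumption for illustration. Because there is only one state, the critic's feedback reduces to `reward - value`, with no next-state term.

```python
import math
import random

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    z = sum(exps)
    return [e / z for e in exps]

def train_actor_critic(steps=2000, lr_actor=0.5, lr_critic=0.1, seed=0):
    rng = random.Random(seed)
    theta = [0.0, 0.0]   # actor: one score per action
    value = 0.0          # critic: estimate of V(s) for the single state
    for _ in range(steps):
        probs = softmax(theta)
        # 1. The actor picks an action from its current policy.
        action = rng.choices([0, 1], weights=probs, k=1)[0]
        reward = 1.0 if action == 1 else 0.0     # toy environment
        # 2. The critic compares the outcome with its estimate and updates it.
        advantage = reward - value
        value += lr_critic * advantage
        # 3. The critic's feedback (the advantage) scales the actor's update.
        for k in range(2):
            grad_log = (1.0 if k == action else 0.0) - probs[k]
            theta[k] += lr_actor * advantage * grad_log
    return softmax(theta), value

final_probs, final_value = train_actor_critic()
print(final_probs, final_value)
```

The actor ends up strongly preferring the rewarded action, and the critic's estimate drifts toward the reward the actor now collects.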

Putting it all together for LLMs

The state can be the current text (the prompt or conversation), and the action can be the next token to generate. A reward signal (for example, from human feedback or a reward model) tells the model how good or bad its generated text is.

The policy is the model's strategy for picking the next token, while the value function estimates how beneficial the current context is in terms of eventually producing high-quality responses.

DeepSeek-R1 (published 22 Jan 2025)

To highlight RL's importance, let's explore DeepSeek-R1, a reasoning model that achieves top-tier performance while remaining open source. The paper introduced two models: DeepSeek-R1-Zero and DeepSeek-R1.

  • DeepSeek-R1-Zero was trained solely via large-scale RL, skipping supervised fine-tuning (SFT).
  • DeepSeek-R1 builds on it, addressing the challenges encountered.

Let's dive into some of the key points.

1. RL algo: Group Relative Policy Optimization (GRPO)

One key game-changing RL algorithm is Group Relative Policy Optimization (GRPO), a variant of the widely popular Proximal Policy Optimization (PPO). GRPO was introduced in the DeepSeekMath paper in February 2024.

Why GRPO over PPO?

PPO struggles with reasoning tasks due to:

  1. Dependency on a critic model.
    PPO needs a separate critic model, effectively doubling memory and compute.
    Training the critic can be complex for nuanced or subjective tasks.
  2. High computational cost, as RL pipelines demand substantial resources to evaluate and optimize responses.
  3. Absolute reward evaluations
    When you rely on an absolute reward, meaning there is a single standard or metric to judge whether an answer is “good” or “bad”, it can be hard to capture the nuances of open-ended, diverse tasks across different reasoning domains.

How GRPO addresses these challenges:

GRPO eliminates the critic model by using relative evaluation: responses are compared within a group rather than judged against a fixed standard.

Imagine students solving a problem. Instead of a teacher grading them individually, they compare answers and learn from each other. Over time, performance converges toward higher quality.
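Here is a minimal sketch of what "compared within a group" can look like in code: each response's reward is normalized by the mean and standard deviation of its own group. The function name, the example rewards, and the use of the population standard deviation with a small `eps` are assumptions on my part.

```python
import statistics

def group_relative_advantages(rewards, eps=1e-8):
    """Score each response relative to the other responses in its group."""
    mean = statistics.mean(rewards)
    std = statistics.pstdev(rewards)     # population std of the group
    return [(r - mean) / (std + eps) for r in rewards]

# Hypothetical rewards for five sampled answers to the same question
# (1.0 = correct, 0.0 = wrong)
rewards = [1.0, 0.0, 0.0, 1.0, 1.0]
advantages = group_relative_advantages(rewards)
print(advantages)
```

Answers that beat the group average get a positive advantage and the rest get a negative one, so no separate critic is needed to say what "good" means. If every answer in the group gets the same reward, all advantages are zero and the group provides no learning signal.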

How does GRPO fit into the whole training process?

GRPO modifies how the loss is calculated while keeping the other training steps unchanged:

  1. Gather data (queries + responses)
    – For LLMs, queries are like questions
    – The old policy (an older snapshot of the model) generates several candidate answers for each query
  2. Assign rewards – Each response in the group is scored (the “reward”).
  3. Compute the GRPO loss
    Traditionally, you'd compute a loss that measures the deviation between the model's prediction and the true label.
    In GRPO, however, you measure:
    a) How likely is the new policy to produce the past responses?
    b) Are those responses relatively better or worse?
    c) Apply clipping to prevent extreme updates.
    This yields a scalar loss.
  4. Backpropagation + gradient descent
    – Backpropagation calculates how each parameter contributed to the loss
    – Gradient descent updates those parameters to reduce the loss
    – Over many iterations, this gradually shifts the new policy to prefer higher-reward responses
  5. Update the old policy occasionally to match the new policy.
    This refreshes the baseline for the next round of comparisons.
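Step 3 can be sketched as a PPO-style clipped objective. This is a simplified illustration under several assumptions: it works on one log-probability per whole response rather than per token, it leaves out the KL penalty toward a reference model that the full GRPO objective also includes, and the function name and the `clip_eps=0.2` value are placeholders.

```python
import math

def clipped_surrogate_loss(logp_new, logp_old, advantages, clip_eps=0.2):
    """Scalar loss over a group of responses.

    logp_new / logp_old: log-probability of each response under the new and
    old policy (3a). advantages: relative score of each response (3b).
    """
    terms = []
    for new, old, adv in zip(logp_new, logp_old, advantages):
        ratio = math.exp(new - old)                          # pi_new / pi_old
        clipped = min(max(ratio, 1.0 - clip_eps), 1.0 + clip_eps)
        # Take the more pessimistic of the two estimates (3c).
        terms.append(min(ratio * adv, clipped * adv))
    return -sum(terms) / len(terms)    # negate: we minimize the loss

# Hypothetical numbers: the new policy made the good response more likely
# and the bad response less likely than the old policy did.
loss = clipped_surrogate_loss(
    logp_new=[-1.0, -3.0],
    logp_old=[-1.5, -2.5],
    advantages=[1.0, -1.0],
)
print(loss)
```

Clipping caps how much credit a single update can claim for moving away from the old policy, which is what keeps updates from becoming extreme. In a real training loop, an autodiff framework would compute this loss on tensors and backpropagate through `logp_new`.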

2. Chain of thought (CoT)

Traditional LLM training follows pre-training → SFT → RL. However, DeepSeek-R1-Zero skipped SFT, allowing the model to directly explore CoT reasoning.

Like humans thinking through a tough question, CoT enables models to break problems into intermediate steps, boosting complex reasoning capabilities. OpenAI's o1 model also leverages this, as noted in its September 2024 report: o1's performance improves with more RL (train-time compute) and more reasoning time (test-time compute).

DeepSeek-R1-Zero exhibited reflective tendencies, autonomously refining its reasoning.

A key graph (below) in the paper showed increased thinking during training, leading to longer (more tokens), more detailed, and better responses.

Image taken from the DeepSeek-R1 paper

Without explicit programming, it began revisiting previous reasoning steps, improving accuracy. This highlights chain-of-thought reasoning as an emergent property of RL training.

The model also had an “aha moment” (below), a fascinating example of how RL can lead to unexpected and sophisticated outcomes.

Image taken from the DeepSeek-R1 paper

Note: Unlike DeepSeek-R1, OpenAI does not show the full chains of thought for o1, as they are concerned about a distillation risk, where someone tries to recover the reasoning ability simply by imitating those reasoning traces. Instead, o1 only shows summaries of these chains of thought.

Reinforcement Learning from Human Feedback (RLHF)

For tasks with verifiable outputs (e.g. math problems, factual Q&A), AI responses can be evaluated easily. But what about areas like summarization or creative writing, where there is no single “correct” answer?

This is where human feedback comes in, but naïve RL approaches are unscalable.

Image by author

Let's look at the naive approach with some arbitrary numbers.

Image by author

That's one billion human evaluations needed! This is too costly, slow, and unscalable. Hence, a smarter solution is to train an AI “reward model” to learn human preferences, dramatically reducing human effort.

Ranking responses is also easier and more intuitive than absolute scoring.

Image by author
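One common way to turn human rankings into a training signal for a reward model is a pairwise loss: for every pair of responses, the model should score the one humans ranked higher above the other. The sketch below assumes that approach (a Bradley-Terry style loss). The function name and the example scores are hypothetical, and the article does not say which loss is used in practice.

```python
import math
from itertools import combinations

def pairwise_ranking_loss(scores_best_to_worst):
    """Average of -log(sigmoid(better - worse)) over all ranked pairs.

    scores_best_to_worst: the reward model's scores for the responses,
    listed in the order the human ranked them (best first).
    """
    losses = []
    for better, worse in combinations(scores_best_to_worst, 2):
        # -log(sigmoid(x)) written as log(1 + exp(-x))
        losses.append(math.log1p(math.exp(-(better - worse))))
    return sum(losses) / len(losses)

# The reward model agrees with the human ranking: low loss
print(pairwise_ranking_loss([2.0, 1.0, 0.0]))

# The reward model has the order backwards: high loss
print(pairwise_ranking_loss([0.0, 1.0, 2.0]))
```

Minimizing this loss pushes the reward model's scores to respect the human ordering, without anyone having to assign an absolute score to any single response.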

Upsides of RLHF

  • Can be applied to any domain, including creative writing, poetry, summarization, and other open-ended tasks.
  • Ranking outputs is much easier for human labellers than generating creative outputs themselves.

Downsides of RLHF

  • The reward model is an approximation – it may not perfectly reflect human preferences.
  • RL is good at gaming the reward model – if run for too long, the model might exploit loopholes, generating nonsensical outputs that still get high scores.

Do note that RLHF is not the same as traditional RL.

For empirical, verifiable domains (e.g. math, coding), RL can run indefinitely and discover novel strategies. RLHF, on the other hand, is more like a fine-tuning step to align models with human preferences.

Conclusion

And that's a wrap! I hope you enjoyed Part 2 🙂 If you haven't already read Part 1, do check it out here.

Got questions or ideas for what I should cover next? Drop them in the comments. I'd love to hear your thoughts. See you in the next article!
