Training Large Language Models: From TRPO to GRPO

DeepSeek has recently made quite a buzz in the AI community, thanks to its impressive performance at relatively low cost. I think this is a perfect opportunity to dive deeper into how Large Language Models (LLMs) are trained. In this article, we will focus on the Reinforcement Learning (RL) side of things: we will cover TRPO, PPO, and, more recently, GRPO (don't worry, I will explain all these terms soon!).
I have aimed to keep this article easy to read and accessible by minimizing the math, so you won't need a deep Reinforcement Learning background to follow along. However, I will assume that you have some familiarity with Machine Learning, Deep Learning, and a basic understanding of how LLMs work.
I hope you enjoy the article!
The 3 steps of LLM training
Before diving into RL specifics, let's briefly recap the three main stages of training a Large Language Model:
- Pre-training: the model is trained on a massive dataset to predict the next token in a sequence based on the preceding tokens.
- Supervised Fine-Tuning (SFT): the model is then fine-tuned on more targeted data and aligned with specific instructions.
- Reinforcement Learning (often called RLHF, for Reinforcement Learning from Human Feedback): this is the focus of this article. The main goal is to further align responses with human preferences by allowing the model to learn directly from feedback.
Reinforcement Learning basics

Before diving deeper, let's briefly revisit the core ideas behind Reinforcement Learning.
RL is quite straightforward to understand at a high level: an agent interacts with an environment. The agent is in a specific state within the environment and can take actions to move to other states. Each action yields a reward from the environment: this is how the environment provides feedback that guides the agent's future actions.
Consider the following example: a robot (the agent) navigates (and tries to exit) a maze (the environment).
- The state is the current situation of the environment (the robot's position in the maze).
- The robot can take different actions: for example, it can move forward, turn left, or turn right.
- Successfully navigating towards the exit yields a positive reward, while hitting a wall or getting stuck in the maze yields negative rewards.
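To make the agent/environment loop concrete, here is a minimal sketch in Python. The one-dimensional corridor, the reward values, and the names `step`, `run_episode`, and `always_right` are all made up for illustration; they do not come from any RL library.

```python
import random

# Hypothetical toy "maze": a corridor of cells 0..4 with the exit at cell 4.
EXIT = 4
ACTIONS = ("left", "right")

def step(state, action):
    """Environment: return (next_state, reward, done) for one action."""
    target = state + (1 if action == "right" else -1)
    if target < 0:                 # bumped into the wall: negative reward
        return state, -1.0, False
    if target == EXIT:             # reached the exit: positive reward
        return target, 10.0, True
    return target, 0.0, False      # ordinary move: no reward

def run_episode(policy, max_steps=50, seed=0):
    """Agent loop: observe the state, pick an action, collect the reward."""
    rng = random.Random(seed)
    state, total_reward = 0, 0.0
    for _ in range(max_steps):
        action = policy(state, rng)
        state, reward, done = step(state, action)
        total_reward += reward
        if done:
            break
    return state, total_reward

def always_right(state, rng):
    """A trivial hand-written policy that ignores the state."""
    return "right"

final_state, total = run_episode(always_right)
print(final_state, total)
```

The policy here is hand-written; the point of RL is to learn such a policy from the rewards alone.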
Easy! Now, let's draw an analogy to how RL is used in the context of LLMs.
RL in the context of LLMs

When used during LLM training, RL is defined by the following components:
- Agent: the LLM itself.
- Environment: everything external to the LLM, including user prompts, feedback systems, and other contextual information. This is basically the framework the LLM interacts with during training.
- Actions: these are the model's responses to a query. More specifically, these are the tokens the LLM decides to generate in response to a query.
- State: the current query being answered, along with the tokens the LLM has generated so far (i.e., the partial response).
- Rewards: this is a bit trickier here. Unlike the maze example above, there is usually no binary reward. In the context of LLMs, rewards usually come from a separate reward model, which outputs a score for each (query, response) pair. This model is trained on human-annotated data (hence "RLHF") in which annotators rank different responses. The goal is for higher-quality responses to receive higher rewards.
Note: in some cases, rewards can actually be simpler. For example, in DeepSeekMath, rule-based approaches can be used because math answers tend to be more deterministic (the answer is either correct or wrong).
Policy is the final concept we need for now. In RL terms, a policy is simply the strategy for deciding which action to take. In the case of an LLM, the policy outputs a probability distribution over the possible tokens at each step. Concretely, the policy is determined by the model's parameters (weights). During RL training, we adjust these parameters so that the LLM becomes more likely to produce "better" tokens, that is, tokens that lead to higher reward scores.
We often write the policy as:

$$\pi_\theta(a \mid s)$$

where a is the action (a token to generate), s is the state (the query and the tokens generated so far), and θ denotes the model's parameters.
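As a rough illustration of what "a probability distribution over tokens" means, the sketch below turns a handful of scores into π_θ(· | s) with a softmax and samples one token from it. The four-token vocabulary and the logit values are invented for the example; a real LLM produces one logit per vocabulary entry.

```python
import math
import random

def softmax(logits, temperature=1.0):
    """Turn raw scores into a probability distribution."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)                          # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical vocabulary and logits for the state "What is 2+2? 2+2 is"
vocab = ["4", "5", "four", "banana"]
logits = [3.0, 0.5, 2.0, -1.0]

probs = softmax(logits)                      # plays the role of pi_theta(. | s)
rng = random.Random(0)
action = rng.choices(vocab, weights=probs, k=1)[0]   # sample the next token

for token, p in zip(vocab, probs):
    print(f"{token!r}: {p:.3f}")
print("sampled action:", action)
```

Changing θ changes the logits, which changes these probabilities; that is the lever RL training pulls.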
Finding the best policy is the whole point of RL! Since we don't have labeled data (as we do in supervised learning), we use rewards to adjust our policy so that it takes better actions. (In LLM terms: we adjust the parameters of our LLM so that it generates better tokens.)
TRPO (Trust Region Policy Optimization)
An analogy with supervised learning
Let's take a quick step back to how supervised learning typically works. You have labeled data and use a loss function (like cross-entropy) to measure how close your model's predictions are to the true labels.

We can then use algorithms such as backpropagation and gradient descent to minimize our loss function and update the weights θ of our model.
Recall that our policy also outputs probabilities! In that sense, it is analogous to the model's predictions in supervised learning, and we are tempted to write something like:

$$L(\theta) = -A(s, a)\,\log \pi_\theta(a \mid s)$$

where s is the current state and a is a possible action.
A(s, a) is called the advantage function and measures how good the chosen action is in the current state, compared to a baseline. This is very much like the notion of labels in supervised learning, but derived from rewards instead of explicit labeling. To simplify, we can write the advantage as:

$$A(s, a) = \text{reward} - \text{baseline}$$

In practice, the baseline is calculated using a value function. This is a common term in RL that I will explain later. What you need to know for now is that it measures the expected reward we would receive if we kept following the current policy from state s.
What is TRPO?
TRPO (Trust Region Policy Optimization) builds on this idea of using the advantage function but adds a critical ingredient for stability: it constrains how far the new policy can deviate from the old policy at each update step (similar in spirit to limiting the step size in gradient descent).
- It introduces a KL divergence term (think of it as a measure of how different two distributions are) between the current and the old policy:

$$D_{\mathrm{KL}}\big(\pi_{\theta_{\text{old}}}(\cdot \mid s) \,\|\, \pi_\theta(\cdot \mid s)\big)$$

- It also divides the new policy's probability by the old policy's probability. This ratio, multiplied by the advantage function, gives us a sense of how beneficial each update is relative to the old policy.
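Putting it all together, TRPO tries to maximize a surrogate objective (which involves the advantage and the policy ratio) subject to a KL divergence constraint:

$$\max_\theta \; \mathbb{E}\left[\frac{\pi_\theta(a \mid s)}{\pi_{\theta_{\text{old}}}(a \mid s)}\, A(s, a)\right] \quad \text{subject to} \quad \mathbb{E}\Big[D_{\mathrm{KL}}\big(\pi_{\theta_{\text{old}}}(\cdot \mid s) \,\|\, \pi_\theta(\cdot \mid s)\big)\Big] \le \delta$$

where δ is the size of the trust region.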
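The two ingredients, the ratio-weighted advantage and the KL constraint, can be sketched for a single state with three possible actions. This is only an illustration of the quantities involved, not TRPO itself (the actual algorithm solves the constrained problem with second-order machinery that is omitted here). The probabilities, advantages, and the threshold `delta` are made-up values.

```python
import math

def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions over the same actions."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def surrogate(new_probs, old_probs, advantages):
    """Expected (ratio * advantage), with actions drawn from the old policy."""
    return sum(po * (pn / po) * adv
               for pn, po, adv in zip(new_probs, old_probs, advantages))

# Hypothetical numbers for one state with three actions.
old_policy = [0.5, 0.3, 0.2]
candidate = [0.6, 0.25, 0.15]        # proposed updated policy
advantages = [1.0, -0.5, -1.0]       # first action is better than the baseline
delta = 0.05                         # assumed trust-region size

kl = kl_divergence(old_policy, candidate)
accepted = kl <= delta               # the update stays inside the trust region
improvement = (surrogate(candidate, old_policy, advantages)
               - surrogate(old_policy, old_policy, advantages))

print(f"KL = {kl:.4f}, accepted = {accepted}, improvement = {improvement:.3f}")
```

The candidate shifts probability towards the action with positive advantage, so the surrogate goes up, while the KL check keeps the shift small.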

PPO (Proximal Policy Optimization)
While TRPO was a significant advancement, it is no longer widely used in practice, especially for training LLMs, because its gradient calculations are computationally expensive.
Instead, PPO is now the preferred approach and is reportedly used to train many LLMs, including ChatGPT, Gemini, and more.
It is actually quite similar to TRPO, but instead of enforcing a hard constraint on the KL divergence, PPO introduces a "clipped surrogate objective" that implicitly restricts policy updates and greatly simplifies the optimization process.
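Here is a breakdown of the PPO objective function that we maximize to adjust our model's parameters:

$$J^{\text{PPO}}(\theta) = \mathbb{E}\Big[\min\big(\rho(\theta)\, A(s, a),\; \mathrm{clip}\big(\rho(\theta),\, 1 - \varepsilon,\, 1 + \varepsilon\big)\, A(s, a)\big)\Big], \qquad \rho(\theta) = \frac{\pi_\theta(a \mid s)}{\pi_{\theta_{\text{old}}}(a \mid s)}$$

where ρ(θ) is the same policy ratio used in TRPO and ε is a small clipping range that keeps the ratio close to 1.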
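To see what the clipping does, here is a small sketch of the term inside the expectation for a single action. The function names and the default `eps=0.2` are assumptions made for this example.

```python
def clip(x, low, high):
    """Restrict x to the interval [low, high]."""
    return max(low, min(x, high))

def ppo_clipped_term(ratio, advantage, eps=0.2):
    """min(ratio * A, clip(ratio, 1 - eps, 1 + eps) * A) for one action."""
    unclipped = ratio * advantage
    clipped = clip(ratio, 1.0 - eps, 1.0 + eps) * advantage
    return min(unclipped, clipped)

# Good action (A > 0): pushing the ratio beyond 1 + eps brings no extra gain.
print(ppo_clipped_term(ratio=1.5, advantage=1.0))
# Bad action (A < 0): lowering the ratio below 1 - eps brings no extra gain.
print(ppo_clipped_term(ratio=0.5, advantage=-1.0))
# Moves in the "wrong" direction are not clipped and are fully penalized.
print(ppo_clipped_term(ratio=1.5, advantage=-1.0))
```

Once the ratio leaves the interval [1 − ε, 1 + ε] in the direction that would increase the objective, the term becomes flat, so its gradient vanishes and the policy has no incentive to move further from the old one.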

GRPO (Group Relative Policy Optimization)
How is the value function usually obtained?
Let's first talk more about the advantage and the value function I introduced earlier.
In typical setups (like PPO), a value model is trained alongside the policy. Its goal is to predict the value of each action we take (each token generated by the model), using the rewards we obtain (remember that the value should represent the expected reward).
Here is how it works in practice. Take the query "What is 2+2?" as an example. Our model outputs "2+2 is 4" and receives a reward of 0.8 for that response. We then go backward and attribute discounted rewards to each prefix:
- "2+2 is 4" gets a value of 0.8
- "2+2 is" (1 token back) gets a value of 0.8γ
- "2+2" (2 tokens back) gets a value of 0.8γ²
- etc.
where γ is the discount factor (0.9, for example). We then use these prefixes and their associated values to train the value model.
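The backward attribution above is simple enough to sketch in a few lines. The function name `prefix_value_targets` and the whitespace "tokenization" are simplifications made for this example.

```python
def prefix_value_targets(tokens, final_reward, gamma=0.9):
    """Pair every prefix of the response with its discounted reward."""
    n = len(tokens)
    targets = []
    for k in range(n, 0, -1):                 # from the full response backward
        prefix = " ".join(tokens[:k])
        steps_back = n - k                    # how many tokens were removed
        targets.append((prefix, final_reward * gamma ** steps_back))
    return targets

tokens = ["2+2", "is", "4"]                   # toy tokenization of "2+2 is 4"
targets = prefix_value_targets(tokens, final_reward=0.8)

for prefix, value in targets:
    print(f"{prefix!r} -> {value:.3f}")
```

With γ = 0.9, the three targets come out at roughly 0.8, 0.72, and 0.648; these (prefix, value) pairs are what the value model would be trained to regress.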
Important note: the value model and the reward model are two different things. The reward model is trained before the RL process, using (query, response) pairs and human rankings. The value model is trained concurrently with the policy and aims to predict the expected future reward at each step of the generation process.
What's new in GRPO
Even if, in practice, the reward model is often derived from the policy (training only the "head"), we still end up maintaining several models and handling multiple training procedures (policy, reward, and value model). GRPO streamlines this by introducing a more efficient method.
Remember what I said earlier?

$$A(s, a) = \text{reward} - \text{baseline}$$

In PPO, we decided to use our value function as the baseline. GRPO chooses something else. Concretely, for each query, GRPO generates a group of responses (a group of size G) and uses their rewards to calculate each response's advantage as a z-score:

$$A_i = \frac{r_i - \mu}{\sigma}$$

where rᵢ is the reward of the i-th response, and μ and σ are the mean and standard deviation of the rewards in that group.
This naturally eliminates the need for a separate value model. The idea makes a lot of sense when you think about it! It plays the same role as the value function we introduced before: the group mean also measures, in a sense, an "expected" reward we can obtain. This method is also well suited to our problem because LLMs can easily generate multiple non-deterministic outputs for the same query by sampling with a non-zero temperature (which controls the randomness of token generation).
This is the main idea behind GRPO: getting rid of the value model.
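The group-relative advantage is easy to compute once the rewards of a group are available. In the sketch below, the reward values are invented, and two details are assumptions of this example rather than something stated above: the population standard deviation is used, and a small `eps` is added to avoid dividing by zero when all rewards in a group are equal.

```python
import statistics

def group_advantages(rewards, eps=1e-8):
    """Z-score each reward against the mean and std of its own group."""
    mu = statistics.fmean(rewards)
    sigma = statistics.pstdev(rewards)        # population std (an assumption)
    return [(r - mu) / (sigma + eps) for r in rewards]

# Hypothetical rewards for G = 4 responses sampled for the same query.
rewards = [0.2, 0.8, 0.5, 0.9]
advantages = group_advantages(rewards)

for r, a in zip(rewards, advantages):
    print(f"reward {r:.1f} -> advantage {a:+.2f}")
```

Responses that score above the group mean get a positive advantage and are reinforced; those below get a negative one. No value model is involved: the group itself provides the baseline.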
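Finally, GRPO adds a KL divergence term directly to its objective (to be exact, GRPO uses a simple approximation of the KL divergence, which improves the algorithm further), comparing the current policy to a reference policy (often the post-SFT model).
See the final formulation below:

$$J^{\text{GRPO}}(\theta) = \mathbb{E}\left[\frac{1}{G}\sum_{i=1}^{G}\frac{1}{|o_i|}\sum_{t=1}^{|o_i|}\Big(\min\big(\rho_{i,t}\, A_i,\; \mathrm{clip}(\rho_{i,t},\, 1 - \varepsilon,\, 1 + \varepsilon)\, A_i\big) - \beta\, D_{\mathrm{KL}}\big(\pi_\theta \,\|\, \pi_{\text{ref}}\big)\Big)\right]$$

where oᵢ is the i-th response in the group, ρᵢ,ₜ is the ratio between the current and the old policy for the t-th token of that response, Aᵢ is the group-relative advantage, ε is the clipping range, and β weights the KL penalty against the reference policy.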
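For the KL term, the DeepSeekMath paper describes a per-token estimator of the form π_ref/π_θ − log(π_ref/π_θ) − 1. A minimal sketch, working from log-probabilities (the function name and the sample probabilities are made up for the example):

```python
import math

def kl_estimate(logp_policy, logp_ref):
    """Per-token estimate: ref/policy - log(ref/policy) - 1 (never negative)."""
    log_ratio = logp_ref - logp_policy
    return math.exp(log_ratio) - log_ratio - 1.0

# Hypothetical probabilities the two models assign to the same sampled token.
same = kl_estimate(math.log(0.40), math.log(0.40))
drifted = kl_estimate(math.log(0.70), math.log(0.40))

print(f"identical policies: {same:.4f}")
print(f"policy has drifted: {drifted:.4f}")
```

The estimate is zero when both models assign the same probability to the token and grows as the current policy drifts away from the reference, which is what the β-weighted penalty discourages.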

And that's mostly it for GRPO! I hope this gives you a clear overview of the process: it still relies on the same foundational ideas as TRPO and PPO but introduces additional improvements that make training more efficient and faster, which are key factors behind DeepSeek's success.
Conclusion
Reinforcement Learning has become a cornerstone of training today's Large Language Models, particularly through PPO and, more recently, GRPO. Each method rests on the same RL fundamentals but adds its own refinement:
• TRPO introduced strict policy constraints via KL divergence
• PPO eased those constraints with a clipped objective
• GRPO took an extra step by removing the need for a value model and using group-based rewards
Of course, DeepSeek also benefits from other innovations, such as high-quality data and other training strategies, but that is for another time!
I hope this article gave you a clearer picture of how these methods connect and evolve. I believe that Reinforcement Learning will become the main focus in training LLMs to improve their performance, surpassing pre-training and SFT in driving future innovations.
If you're interested in diving deeper, feel free to check out the references below or explore my previous posts.
Thanks for reading, and feel free to leave a clap and a comment!
References: