The Untaught Lessons of RAG Question Parsing: Structure Before Search

A companion to the Enterprise Document Intelligence series, whose philosophy is laid out in Augment the Expert.
It zooms in on brick 2 (question parsing) of the four-brick architecture and surfaces the lessons most tutorials skip.
Most RAG tutorials skip question parsing. The user's string goes straight to retrieval, cosine similarity picks the top-k, and the model is handed whatever comes back. We don't do that, for one reason: the user's question is not a search query. Treat the two as the same thing and you get silently incomplete answers, and in production that is where most RAG systems break without anyone noticing.
📓 Runnable notebooks are on GitHub: doc-intel/notebooks-vol1.

The naive baseline this article moves beyond

A naive pipeline embeds the user's string and asks the vector store for the top-k most similar chunks. Nothing in that setup knows that the question has two parts, or that the user wanted a specific value rather than a paragraph. So we apply one extra brick to the question itself: a row in question_df with five typed columns (keywords, scope, shape, decomposition, clarification) plus satellite tables, and two derived briefs (RetrievalQuery for the retrieval brick, GenerationBrief for the generation brick).

The anatomy diagram shows the five core columns, but the production question_df carries two more, which decide how wide a retrieval window is passed on to generation. This context directive is measured in lines (not characters, which are too noisy; not pages, which are too coarse). The table below shows three sample rows: one factual lookup, one yes/no boolean, one list question. Each row sizes its context window differently, by reading the answer shape and the decomposition pattern.

The two emerald columns are the cheap directive most pipelines never write down. A factual lookup (premium amount, effective date, deductible amount) needs almost no surrounding context: a line or two before the anchor for confirmation, and a few after it to catch the trailing punctuation. A list question needs zero context before the anchor and a long forward window, because the list extends down the section. The parser fills lines_before_anchor and lines_after_anchor from the classified shape and the decomposition; retrieval honors them; no magic top-k cutoff travels with the pipeline.
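A minimal sketch of how the two window columns might be filled and honored. The column names lines_before_anchor and lines_after_anchor come from the text; the shape labels, the line counts, and the function names are illustrative assumptions, not the series' actual values. For brevity the sketch reads only the answer shape and ignores the decomposition.

```python
from __future__ import annotations
from dataclasses import dataclass


@dataclass(frozen=True)
class ContextWindow:
    lines_before_anchor: int
    lines_after_anchor: int


# Hypothetical line counts per answer shape; tune them on your own corpus.
WINDOW_BY_SHAPE = {
    "fact": ContextWindow(lines_before_anchor=2, lines_after_anchor=3),
    "boolean": ContextWindow(lines_before_anchor=3, lines_after_anchor=5),
    "list": ContextWindow(lines_before_anchor=0, lines_after_anchor=30),
}


def context_window(shape: str) -> ContextWindow:
    """Return the line window for a classified answer shape."""
    if shape not in WINDOW_BY_SHAPE:
        raise ValueError(f"unknown answer shape: {shape!r}")
    return WINDOW_BY_SHAPE[shape]


def slice_lines(lines: list[str], anchor: int, window: ContextWindow) -> list[str]:
    """Cut the context around the anchor line, clamped to the document bounds."""
    start = max(0, anchor - window.lines_before_anchor)
    end = min(len(lines), anchor + window.lines_after_anchor + 1)
    return lines[start:end]
```

Because the window is data on the question row, retrieval needs no per-intent branch: it calls slice_lines with whatever the parser wrote.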
Below are the six untaught lessons that hold the brick together.
Lesson 1 – A relational schema, symmetric with the document side
The literature covers “query understanding” and “query rewriting”, but both treat the question as a string transformed into another string. Modeling it as a row in question_df plus satellite tables is not how people usually frame it. What makes it click is the symmetry with the document side (line_df, toc_df, span_df): both sides are relational, both can be joined, and retrieval becomes a filter across them.
Why it matters. Most production pipelines keep the question as a single string inside the LLM prompt template. There is no notion that “the question has a shape”, “the question has a scope”, or “the question decomposes”. When the team needs a new capability (handling negation, handling mixed questions, handling scope), the only place to add it is the prompt template. Six months in, the prompt carries sixty lines of special-case clauses and no traceable audit trail. Structuring the question once at the parser boundary, the way parsing structures the document at its own boundary, removes that rot at its source.
Concrete example. A user asks “What is the premium amount and the renewal deadline?”. The naive baseline embeds that string and ranks chunks. The series fills one row of question_df: keywords ["premium", "amount", "renewal", "deadline"], scope "contract", shape (Amount, Date), decomposition "independent" (two sub-questions). Retrieval now has a row to filter line_df against, and generation has a typed shape to fill.
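To make the relational framing concrete, here is a small sketch of the example row and a filter over a stand-in for line_df. The five column names follow the text; the class name QuestionRow, the function filter_lines, the line_df columns, and the sample lines are all made up for illustration, and plain Python structures stand in for the DataFrames.

```python
from __future__ import annotations
from dataclasses import dataclass


@dataclass(frozen=True)
class QuestionRow:
    keywords: tuple[str, ...]
    scope: str
    shape: tuple[str, ...]
    decomposition: str
    clarification: str | None = None  # None when nothing needs clarifying


# The row from the premium/renewal example.
row = QuestionRow(
    keywords=("premium", "amount", "renewal", "deadline"),
    scope="contract",
    shape=("Amount", "Date"),
    decomposition="independent",
)

# Stand-in for line_df: one dict per document line (hypothetical columns and text).
line_df = [
    {"line_no": 12, "scope": "contract", "text": "Annual premium amount: $1,488"},
    {"line_no": 40, "scope": "contract", "text": "Renewal deadline: 30 June"},
    {"line_no": 77, "scope": "annex", "text": "Premium tables by age band"},
    {"line_no": 90, "scope": "contract", "text": "Governing law: Delaware"},
]


def filter_lines(question: QuestionRow, lines: list[dict]) -> list[dict]:
    """Keep lines in the question's scope that mention at least one keyword."""
    return [
        line for line in lines
        if line["scope"] == question.scope
        and any(kw in line["text"].lower() for kw in question.keywords)
    ]


hits = filter_lines(row, line_df)
print([line["line_no"] for line in hits])  # [12, 40]
```

The annex line mentions "premium" but is filtered out by scope, which a flat string sent to a vector store has no way to express.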
A second one. A legal counsel asks “Does the indemnification clause survive termination, and if so, for how long?”. The naive approach passes the whole string to the LLM; the answer usually comes back with a yes or no on survival and silently drops the duration. The series fills question_df with shape (Boolean, Duration) and decomposition "conditional" (the duration matters only if survival is True), so the downstream bricks know exactly which sub-question is gated on which.
→ Article 6A: Parse the question before you search walks the parser boundary end to end.
Lesson 2 – A schema, not branching code
Most RAG codebases grow their question-handling logic as branching code, gated by if intent == "..." chains that pile up over the months. We grow the brick as a schema instead: a new capability is a column added to question_df and curated by the expert, not a new code path. The cost of a new feature stays linear in the number of columns, not quadratic in the combinations of branches.
Concrete example. Add “negation handling” to the brick. The naive way: a new branch in the prompt-assembly code, plus a test, plus a regression integration test. The series way: add a negation_present column (boolean), add a row to the dictionary of negation tokens, document the downstream behavior, and let the dispatcher read that column where it needs to.
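A minimal sketch of the schema-first version of that change. The column name negation_present is from the text; the token list and the function names are assumptions chosen for illustration.

```python
import re

# Hypothetical negation dictionary; the expert adds an entry to extend it.
NEGATION_TOKENS = {"not", "no", "never", "without", "except", "excluding"}


def detect_negation(question: str) -> bool:
    """True when any dictionary token appears as a whole word."""
    words = re.findall(r"[a-z']+", question.lower())
    return any(word in NEGATION_TOKENS for word in words)


def parse(question: str) -> dict:
    """Build a question row; a new capability is one more key, not a new branch."""
    return {
        "text": question,
        "negation_present": detect_negation(question),
    }


print(parse("Which treatments are not covered?")["negation_present"])  # True
```

Matching on whole words keeps "notice period" from tripping the "no" or "not" entries, a detail that is easy to test in isolation once negation is a column rather than a prompt clause.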
→ Article 6B: Five fields a RAG should extract from any question builds the five columns one by one.
Lesson 3 – Two briefs, one per downstream brick
The default is a single prompt that carries everything, so retrieval has to ignore the generation-only fields and generation has to re-parse the retrieval fields. We split them: the retrieval brick gets only what it can act on (keywords, scope, structural hints), and the generation brick gets only what it needs (intent, output shape, exclusions). Each downstream brick reads a brief sized to its job, not the whole question.
Concrete example. For “What is the premium amount in dollars, not euros?”, the retrieval brief carries keywords ["premium", "amount"] and scope "contract". The generation brief carries shape "Amount(value, currency='USD')" and exclusions ["EUR"]. Retrieval does not need to know about the currency exclusion; generation does not need to re-extract the keywords.
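The split can be sketched as two small projections of one parsed row. The names RetrievalQuery and GenerationBrief come from the text; their exact fields, the dict layout of the row, and the helper derive_briefs are assumptions based on the example, and only a subset of the fields listed in the lesson is shown.

```python
from __future__ import annotations
from dataclasses import dataclass


@dataclass(frozen=True)
class RetrievalQuery:
    keywords: tuple[str, ...]
    scope: str


@dataclass(frozen=True)
class GenerationBrief:
    shape: str
    exclusions: tuple[str, ...]


def derive_briefs(row: dict) -> tuple[RetrievalQuery, GenerationBrief]:
    """Project one parsed row onto the two downstream briefs."""
    retrieval = RetrievalQuery(keywords=tuple(row["keywords"]), scope=row["scope"])
    generation = GenerationBrief(shape=row["shape"], exclusions=tuple(row["exclusions"]))
    return retrieval, generation


row = {
    "keywords": ["premium", "amount"],
    "scope": "contract",
    "shape": "Amount(value, currency='USD')",
    "exclusions": ["EUR"],
}
retrieval_query, generation_brief = derive_briefs(row)
print(retrieval_query)
print(generation_brief)
```

Neither brief has a field for the other brick's concerns, so a retrieval change cannot accidentally depend on the currency exclusion.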
→ Article 6A splits the question into the two briefs, and Article 6B extracts the columns.
Lesson 4 – An expert dictionary that beats embeddings
The standard story sells embeddings as the way to handle synonyms: the user types “premium”, and the model “knows” it relates to “monthly contribution”. In practice, concept_keywords_df maps the user's term to the document's term before any search, at a fraction of the cost and with none of the drift. The expert maintains the dictionary like a wiki; the embedding model has no idea which term is the official one in your corpus.
Concrete example. A user types “How much do I pay each month?”. The naive baseline embeds the string, and cosine similarity returns generic “payment” pages. The series checks concept_keywords_df first: "pay each month" maps to ["premium", "monthly contribution", "monthly installment"] in this insurance corpus. Retrieval runs a keyword search on those three terms; the actual line (“premium of $124 / month”) lights up immediately.
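A minimal sketch of the lookup-then-search order described above. A plain dict stands in for concept_keywords_df, with the single mapping taken from the example; the function names and the two sample lines are assumptions.

```python
from __future__ import annotations

# Stand-in for concept_keywords_df: user phrasing -> corpus terms.
# A real table would be maintained by the domain expert.
CONCEPT_KEYWORDS = {
    "pay each month": ["premium", "monthly contribution", "monthly installment"],
}


def expand_terms(question: str) -> list[str]:
    """Return the corpus terms for every user phrase found in the question."""
    text = question.lower()
    terms: list[str] = []
    for phrase, corpus_terms in CONCEPT_KEYWORDS.items():
        if phrase in text:
            terms.extend(corpus_terms)
    return terms


def keyword_search(lines: list[str], terms: list[str]) -> list[str]:
    """Plain substring search over document lines; no embeddings involved."""
    return [line for line in lines if any(term in line.lower() for term in terms)]


lines = [
    "Payment methods: card, bank transfer",
    "premium of $124 / month",
]
terms = expand_terms("How much do I pay each month?")
print(keyword_search(lines, terms))  # ['premium of $124 / month']
```

The generic "Payment methods" line, which is the kind of text an embedding of "pay" tends to surface, never matches because none of the three corpus terms appears in it.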
→ Article 6B: Five fields a RAG should extract from any question describes the concept_keywords_df mechanism.
Lesson 5 – Four patterns for mixed questions, none of them silent
A two-part question (“amount and deadline”) is often answered on one part while the other is silently dropped. The series names four patterns (independent, sequential, combined, conditional) and forces the parser to mark which one applies. The pipeline then decomposes (and runs the parts in parallel), chains (and feeds part A into part B), or refuses to answer the part it cannot close. No silently partial answer.
Concrete example. A user asks “What is the deductible if the claim exceeds the cap, and what is the cap?”, a sequential combination. Naive RAG sends both parts as a single string; the LLM answers about the cap and forgets the conditional deductible clause. The series detects decomposition = "sequential", parses part A (cap?) and part B (deductible if claim > cap?), runs them in order, and returns each answer with its own citation, or marks one as not found if it genuinely is.
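The "no silent drop" rule can be sketched as a runner that always returns one entry per sub-question. Everything here is an assumption made for illustration: the function run_parts, the answer_part callback (a stand-in for retrieval plus generation), the literal "yes" gate, and the made-up cap value. Only three of the four patterns are covered; the combined pattern is left out.

```python
from __future__ import annotations
from typing import Callable, Optional

# answer_part(sub_question, context) returns an answer, or None when the
# corpus does not contain it. Hypothetical stand-in for retrieval + generation.
AnswerFn = Callable[[str, Optional[str]], Optional[str]]
SUPPORTED = ("independent", "sequential", "conditional")


def run_parts(decomposition: str, parts: list[str], answer_part: AnswerFn) -> dict[str, str]:
    """Answer every sub-question explicitly; no part is dropped silently."""
    if decomposition not in SUPPORTED:
        raise ValueError(f"pattern not covered by this sketch: {decomposition!r}")
    results: dict[str, str] = {}
    previous: Optional[str] = None
    for index, part in enumerate(parts):
        # Conditional: later parts are gated on the first part answering "yes".
        if decomposition == "conditional" and index > 0 and results[parts[0]] != "yes":
            results[part] = "not applicable"
            continue
        # Sequential and conditional parts see the previous answer as context.
        context = None if decomposition == "independent" else previous
        answer = answer_part(part, context)
        results[part] = answer if answer is not None else "not found"
        previous = answer
    return results


def fake_answer(part: str, context: Optional[str]) -> Optional[str]:
    known = {"cap?": "$10,000"}  # made-up value; the deductible clause is absent
    return known.get(part)


report = run_parts("sequential", ["cap?", "deductible if claim > cap?"], fake_answer)
print(report)
```

With this toy corpus the second part comes back as "not found" instead of disappearing from the answer, which is the behavior the lesson asks for.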
→ Article 6B: Five fields a RAG should extract from any question lays out the four combined patterns.
Lesson 6 – A deterministic dispatcher, not LLM-decides
The agent reflex says: let the LLM pick which retrievers, schemas, and prompt fragments to fire on each call. There are three main modes: user-explicit (a form drives the execution), deterministic-dispatcher (rules in code map question signals to execution), and LLM-decides (the model orchestrates itself). The first two stay. We drop the third for enterprise use, because a system that re-plans on every call cannot be audited the same way twice.
Concrete example. The same compliance question runs through the system twice. With the deterministic dispatcher, the audit log shows the same dispatch path both times: decide.py line 47 fired, route = "factual_lookup", retrieval methods ["keyword", "toc"] triggered, generation schema AmountWithEvidence. With LLM-decides, the audit history shows two different sequences, and you cannot guarantee the same route tomorrow. The first is readable. The second is not.
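A minimal sketch of a rule-table dispatcher that returns its own audit record. The route name "factual_lookup", the retrieval methods, and the schema name AmountWithEvidence come from the example; the rule ids, the signal names, the second rule (including the name ListWithEvidence), and the fallback are illustrative assumptions, not the contents of the series' decide.py.

```python
from __future__ import annotations

# Ordered rules: the first match wins.
RULES = [
    {
        "rule_id": "R1",
        "when": {"shape": "Amount", "decomposition": "independent"},
        "route": "factual_lookup",
        "retrieval_methods": ["keyword", "toc"],
        "generation_schema": "AmountWithEvidence",
    },
    {
        "rule_id": "R2",
        "when": {"shape": "List"},
        "route": "list",
        "retrieval_methods": ["toc"],
        "generation_schema": "ListWithEvidence",
    },
]

FALLBACK = {
    "rule_id": "fallback",
    "route": "refuse",
    "retrieval_methods": [],
    "generation_schema": None,
}


def dispatch(signals: dict) -> dict:
    """Map parsed-question signals to a route and return the audit record."""
    for rule in RULES:
        if all(signals.get(key) == value for key, value in rule["when"].items()):
            return {key: value for key, value in rule.items() if key != "when"}
    return dict(FALLBACK)


signals = {"shape": "Amount", "decomposition": "independent"}
first, second = dispatch(signals), dispatch(signals)
assert first == second  # same input, same route, same audit record
print(first["rule_id"], first["route"])
```

Because the record names the rule that fired, two runs of the same question can be compared field by field, which is not possible when a model re-plans each call.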
→ Article 6C: One parsed RAG question, four decisions covers the dispatcher pattern.
The six lessons share one move: take a step the usual playbook treats as inline string processing, and make it a typed brick. Once the question is a row with columns, the rest of the pipeline can filter, type-check, and dispatch in ways a flat string never could. The deep dives (6A, 6B, 6C, 6bis) ship code that runs on real corpora; this piece is the catalog that points to them.
A note on intent detection. Vol. 1 stays small on intents: the dispatcher recognizes a core set (factual lookup, list, quick summary read from parsing_summary.summary, deep summary built from the TOC plus first lines, reference resolution, out-of-corpus refusal), which is enough to route the most common enterprise PDF questions correctly. The full intent taxonomy arrives in Volume 2 (translation, cross-document summary, comparison, reformatting, evidence checking), where an intent × format matrix produces dozens of dispatch paths on top of the four-brick backbone. Vol. 1 keeps the backbone clean; Vol. 2 builds the matrix.
Across industries and functions
The brick treats every domain the same way: extract the typed columns from the question, then derive the two briefs. The expert dictionary inside concept_keywords_df is domain-specific; the schema and the dispatch logic are universal. Five industries below, one parsing pattern, the same five columns.

What changes from row to row is the expert dictionary. An insurer's concept_keywords_df maps “pay each month” to ["premium", "monthly contribution", "monthly installment"]; the medical equivalent maps “blood thinner” to ["anticoagulant", "warfarin", "heparin", "DOAC"]; the finance equivalent maps “top line” to ["revenue", "net revenue", "GAAP revenue"]. The brick's columns, the dispatch, and the audit trail stay the same.
Where these lessons fit in the series
The numbered articles develop each lesson with code, in runnable notebooks:
- Article 6A (question parsing: the thesis) makes the case that a string is not a question and shows the relational shape.
- Article 6B (question parsing: the fields) walks through the five families of columns (keywords, scope, shape, decomposition, clarification) that populate question_df.
- Article 6C (question routing: the dispatch) develops the dispatcher that turns a parsed question into routing decisions.
- Article 6bis (the clarification loop) handles the case where the question is too ambiguous to act on and the system asks one focused clarifying question.
Sources and further reading
The book and article literature on query understanding is shaped by consumer search (Elastic, Google) and does not transfer cleanly to a small enterprise corpus, where the expert vocabulary is the asset. The series' position is a reconstruction of the relational shape on top of a structured document side.
- Parse the question before you search (Article 6A). The published thesis on question parsing.
- Five fields a RAG should extract from any question (Article 6B). The fields, column by column, in code.
- One RAG question, four decisions (Article 6C). The dispatcher pattern that turns parsed columns into routing decisions.
- When RAG users ask ambiguous questions (Article 6bis). The clarification loop that learns a default after asking once.