Using Google's LangExtract and Gemma for Structured Data Extraction

Documents such as insurance policies, medical records, and compliance reports are notoriously long and tedious to parse.

Important details (e.g., coverage limits and obligations in insurance policies) are buried in dense, unstructured text that is hard for the average person to sift through and digest.

Large language models (LLMs), already known for their versatility, are powerful tools for cutting through this complexity: they pull out the key facts and turn messy documents into clear, structured information.

In this article, we explore Google's LangExtract framework and its open-source LLM, Gemma 3, which together make extracting structured information from unstructured text accurate and efficient.

To bring this to life, we will walk through a demo on parsing an insurance policy, showing how details such as exclusions can be surfaced effectively.

Contents

(1) Understanding LangExtract and Gemma
(2) Under the Hood of LangExtract
(3) Example Walkthrough

The accompanying GitHub repo for this article can be found here.


(1) Understanding LangExtract and Gemma

(i) LangExtract

LangExtract is an open-source Python library (released under Google's GitHub) that uses LLMs to extract structured information from messy, unstructured text based on user-defined instructions.

It enables LLMs to excel at named entity recognition (such as coverage limits, exclusions, and clauses) and relationship extraction (logically linking each clause to its conditions) by grouping related entities together.

Its popularity stems from its simplicity, as just a few lines of code are enough to perform structured information extraction. Beyond its simplicity, several key features make LangExtract stand out:

  • Exact source alignment: Each extracted item is linked back to its precise location in the original text, ensuring full traceability.
  • Built for long documents: Handles the "needle-in-a-haystack" problem with smart chunking, parallel processing, and multiple extraction passes to maximize recall and find additional entities.
  • Broad model compatibility: Works seamlessly with different LLMs, from cloud-based models like Gemini to local open-source options.
  • Domain agnostic: Adapts to any domain with only a handful of examples, removing the need for costly fine-tuning.
  • Consistent structured outputs: Uses few-shot examples and controlled generation (only for certain LLMs like Gemini) to enforce a stable output schema and produce reliable, structured results.
  • Interactive visualization: Generates an interactive HTML file to visualize and review extracted entities in their original context.

(ii) Gemma 3

Gemma is a family of lightweight, state-of-the-art open LLMs from Google, built from the same research used to create the Gemini models.

Gemma 3 is the latest release in the Gemma family and is available in five parameter sizes: 270M, 1B, 4B, 12B, and 27B. It is also touted as the current most capable model that runs on a single GPU.

It can handle prompt inputs of up to 128K tokens (in the 4B and larger sizes), allowing us to process many multi-page articles (or hundreds of images) in a single prompt.

In this article, we will use the Gemma 3 4B model (the 4-billion-parameter variant), deployed locally via Ollama.


(2) Under the Hood of LangExtract

LangExtract comes with many standard features expected in modern LLM frameworks, such as document ingestion, preprocessing (e.g., tokenization), prompt management, and output handling.

What caught my attention are the three capabilities that support optimized long-context information extraction:

  • Smart chunking
  • Parallel processing
  • Multiple extraction passes

To see how these were implemented, I dug into the source code and traced how they work under the hood.

(i) Chunking strategies

LangExtract uses smart chunking strategies to improve extraction quality over a single inference pass on a large document.

The goal is to split documents into smaller, focused chunks of manageable context size, so that the relevant text stays well-formed and easy for the model to understand.

Instead of mindlessly cutting at character limits, it respects sentences, paragraphs, and newlines.

Here is a summary of the key behaviors of the chunking strategy:

  • Sentence- and paragraph-aware: Chunks are formed from whole sentences where possible (by respecting text delimiters such as paragraph breaks), so that the context stays intact.
  • Handles long sentences: If a sentence is too long, it is broken at natural points such as newlines. It splits within a sentence only when necessary.
  • Edge case handling: If a single word or token is longer than the limit, it becomes a chunk by itself to avoid errors.
  • Token-based splitting: All cuts respect token boundaries, so words are never split midway.
  • Context preservation: Each chunk carries metadata (token and character positions) that maps it back to the source document.
  • Efficient processing: Chunks can be grouped into batches and processed in parallel, so the quality gains do not add extra latency.

As a result, LangExtract creates well-formed chunks that pack in as much context as possible while avoiding messy splits, which helps the LLM maintain extraction quality across large documents.
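To make the behaviors above concrete, here is a minimal sketch of sentence-aware chunking in plain Python. It is not LangExtract's implementation: it works on characters rather than tokens, and the names chunk_text, _split_long, and SENTENCE_RE are made up for illustration. It only shows the general idea of packing whole sentences into a chunk, falling back to word boundaries for over-long sentences, letting an over-long word stand alone, and recording character offsets.

```python
import re

# Assumption: a "sentence" ends at . ! or ? followed by whitespace/end of text,
# or at a newline. Real tokenizers are more careful than this.
SENTENCE_RE = re.compile(r".+?(?:[.!?](?=\s|$)|\n|$)", re.S)


def _split_long(text, start, end, max_chars):
    """Split an over-long sentence at word boundaries; an over-long word stays whole."""
    pieces, piece_start = [], start
    for m in re.finditer(r"\S+\s*", text[start:end]):
        word_start, word_end = start + m.start(), start + m.end()
        if word_end - piece_start > max_chars and word_start > piece_start:
            pieces.append((piece_start, word_start))
            piece_start = word_start
    pieces.append((piece_start, end))
    return pieces


def chunk_text(text, max_chars):
    """Greedily pack sentence spans into chunks of at most max_chars characters."""
    spans = []
    for m in SENTENCE_RE.finditer(text):
        if m.end() - m.start() > max_chars:
            spans.extend(_split_long(text, m.start(), m.end(), max_chars))
        else:
            spans.append((m.start(), m.end()))

    chunks, cur_start, cur_end = [], None, None
    for start, end in spans:
        # Close the current chunk if adding this span would exceed the budget.
        if cur_start is not None and end - cur_start > max_chars:
            chunks.append({"text": text[cur_start:cur_end], "start": cur_start, "end": cur_end})
            cur_start = None
        if cur_start is None:
            cur_start = start
        cur_end = end
    if cur_start is not None:
        chunks.append({"text": text[cur_start:cur_end], "start": cur_start, "end": cur_end})
    return chunks
```

Because every chunk keeps its start and end offsets, joining the chunk texts reproduces the input exactly, which is what makes it possible to map an extraction back to its position in the source document.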


(ii) Parallel processing

LangExtract's support for parallel processing at LLM inference (as seen in the model provider scripts) allows extraction quality to stay high over long documents (i.e., good entity coverage) without significantly increasing overall latency.

When given a list of text chunks, the max_workers parameter controls how many tasks can run in parallel. These workers send multiple chunks to the LLM concurrently, with up to max_workers chunks processed at the same time.
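The pattern can be sketched with a standard thread pool. This is a generic illustration rather than LangExtract's provider code: fake_llm_extract is a hypothetical stand-in for a real model call, and extract_in_parallel is a made-up helper name.

```python
from concurrent.futures import ThreadPoolExecutor


def fake_llm_extract(chunk):
    """Hypothetical stand-in for an LLM call: treats capitalized words as 'entities'."""
    return [word.strip(".,") for word in chunk.split() if word[:1].isupper()]


def extract_in_parallel(chunks, max_workers=4):
    """Send up to max_workers chunks to the model at once; results keep chunk order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map returns results in the same order as the input chunks,
        # even if some calls finish earlier than others.
        return list(pool.map(fake_llm_extract, chunks))
```

Threads suit this workload because each call mostly waits on the model server, so a higher max_workers reduces wall-clock time only as long as the server can actually handle the concurrent requests.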


(iii) Multiple extraction passes

The purpose of iterative extraction passes is to improve recall by capturing entities that might be missed in any single run.

In essence, it adopts a multi-sample-and-merge strategy, where extraction is run multiple times independently, relying on the LLM's stochastic nature to surface entities that a single run might miss.

Afterwards, the results from all passes are merged. If two extractions cover the same region of text, the version from the earlier pass is kept.

This approach boosts recall by capturing additional entities across runs, while resolving conflicts with a first-pass-wins rule. The downside is that it reprocesses the same tokens multiple times, which can increase costs.
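The merge rule can be sketched as follows. This is a simplified illustration under my own assumptions: each extraction is a dict with start and end character offsets, and merge_passes is a hypothetical helper, not a LangExtract function.

```python
def merge_passes(passes):
    """Merge extraction passes; on overlapping text spans, the earlier pass wins.

    `passes` is a list of passes, each a list of dicts with 'start', 'end', 'text'.
    """
    if not passes:
        return []
    merged = list(passes[0])  # everything from the first pass is kept
    for later_pass in passes[1:]:
        for candidate in later_pass:
            overlaps = any(
                candidate["start"] < kept["end"] and kept["start"] < candidate["end"]
                for kept in merged
            )
            if not overlaps:  # only new, non-overlapping spans are added
                merged.append(candidate)
    return sorted(merged, key=lambda ext: ext["start"])
```

With this rule, a second pass can only add entities in regions the first pass left untouched. It never overwrites an earlier result.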


(3) Example Walkthrough

Let's put LangExtract and Gemma to the test on a sample motor insurance policy document, publicly available on the MSIG Singapore website.

Check out the accompanying GitHub repo to follow along.

Preview of the MSIG motor insurance policy document | Source: MSIG Singapore

(i) Initial setup

LangExtract can be installed from PyPI with:

pip install langextract

We then download and run Gemma 3 (4B model) locally with Ollama.

Ollama is an open-source tool that simplifies running LLMs on our own computer or a local server. It lets us interact with these models without needing an Internet connection or relying on cloud services.

To install Ollama, visit the download page and choose the installer for your operating system. Once done, verify the installation by running ollama --version in your terminal.

Important: Make sure your local device has GPU access for Ollama, as this dramatically speeds up performance.

After Ollama is installed, we get the service running by opening the application (macOS or Windows) or entering ollama serve on Linux.

To download Gemma 3 (4B) locally (3.3GB in size), we run the command ollama pull gemma3:4b, and then run ollama list to verify that Gemma has been downloaded to the local system.
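The same check can be done from Python before running the extraction. The sketch below assumes Ollama's local REST API is reachable at its default address and that the /api/tags endpoint returns a JSON object with a models list; the helper names are my own.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434"  # default local Ollama address


def model_names(payload):
    """Pull model names out of an /api/tags response payload."""
    return [model["name"] for model in payload.get("models", [])]


def list_local_models(base_url=OLLAMA_URL):
    """Ask the running Ollama service which models are available locally."""
    with urllib.request.urlopen(f"{base_url}/api/tags", timeout=5) as response:
        return model_names(json.load(response))


# Usage (requires the Ollama service to be running):
#   "gemma3:4b" in list_local_models()
```

If the call raises a connection error, the Ollama service is most likely not running yet.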


(ii) PDF parsing and processing

The first step is to read the PDF policy document and parse its contents using PyMuPDF (installed with pip install PyMuPDF).

We create a Document class to store a piece of text along with its associated metadata, and a PDFProcessor class for the overall document parsing.

The PDFProcessor methods work as follows:

  • load_documents: Goes through each page, extracts the text blocks, and saves them as Document objects. Each block includes the text and its metadata (e.g., page number, coordinates, and page width/height).
    The coordinates capture where the text appears on the page, preserving layout information such as whether it is a header, body text, or footer.
  • get_all_text: Combines all extracted text into a single string, with clear markers separating the pages.
  • get_page_text: Gets only the text from a specific page.
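For readers without the repo at hand, here is a minimal sketch of what such classes might look like. It is my own reconstruction from the description above, not the author's code: the metadata keys and the page marker format are assumptions, and PyMuPDF is imported lazily inside load_documents so the text helpers work without it.

```python
from dataclasses import dataclass, field


@dataclass
class Document:
    """A piece of text plus metadata describing where it came from."""
    text: str
    metadata: dict = field(default_factory=dict)


class PDFProcessor:
    def __init__(self):
        self.documents = []

    def load_documents(self, pdf_path):
        """Extract text blocks from every page and store them as Document objects."""
        import fitz  # PyMuPDF

        self.documents = []
        with fitz.open(pdf_path) as pdf:
            for page_number, page in enumerate(pdf, start=1):
                # Each block is (x0, y0, x1, y1, text, block_no, block_type).
                for x0, y0, x1, y1, text, _block_no, block_type in page.get_text("blocks"):
                    if block_type != 0 or not text.strip():
                        continue  # skip image blocks and empty text
                    self.documents.append(Document(
                        text=text.strip(),
                        metadata={
                            "page": page_number,
                            "bbox": (x0, y0, x1, y1),
                            "page_width": page.rect.width,
                            "page_height": page.rect.height,
                        },
                    ))
        return self.documents

    def get_page_text(self, page_number):
        """Return only the text that came from one page."""
        return "\n".join(
            doc.text for doc in self.documents if doc.metadata.get("page") == page_number
        )

    def get_all_text(self):
        """Join all pages into one string, with a marker line before each page."""
        pages = sorted({doc.metadata["page"] for doc in self.documents})
        return "\n\n".join(f"--- Page {p} ---\n{self.get_page_text(p)}" for p in pages)
```

Keeping the bounding box and page size in the metadata is what later lets us tell headers and footers apart from body text by their vertical position.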

(iii) Prompt engineering

The next step is to provide instructions that guide the LLM through the extraction process via LangExtract.

We start with a system prompt that specifies the structured information we want to extract, focusing on policy exclusion clauses.

In the prompt, I explicitly specified JSON output as the expected response format. Without this, we are likely to hit a langextract.resolver.ResolverParsingError.

The issue is that Gemma does not include built-in structured-output enforcement, so by default it outputs unstructured text in natural language. It may also inadvertently include extra text or malformed JSON, which can break the strict JSON parsers in LangExtract.

However, if we use LLMs like Gemini that support schema-constrained decoding (i.e., that can be configured for structured output), then the JSON formatting instructions (lines 11-21 of the prompt) can be omitted.

Next, we introduce few-shot prompting by providing an example of what an exclusion clause means in the insurance context.

LangExtract's ExampleData class serves as a template that shows the LLM worked examples of how text should map to structured output, telling it what to extract and how to format it.

It contains a list of Extraction objects representing the desired output, where each one is a container class holding the attributes of a single piece of extracted information.
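A minimal sketch of this setup is shown below. The prompt wording, the sample clause, and the explanation attribute are hypothetical placeholders, not the ones used in the repo. The ExampleData and Extraction classes are accessed through lx.data, as in the LangExtract README, and langextract is imported inside the function so the prompt can be inspected without the library installed.

```python
import textwrap

# Hypothetical prompt: the wording in the accompanying repo will differ.
PROMPT = textwrap.dedent("""\
    Extract every policy exclusion clause from the insurance text.
    Use the exact wording from the text for each extraction; do not paraphrase.
    For each exclusion, add a short plain-English explanation as an attribute.
    Respond with valid JSON only, with no extra text before or after it.
    """)


def build_examples():
    """Build one few-shot example showing how text maps to a structured extraction."""
    import langextract as lx

    return [
        lx.data.ExampleData(
            text="We will not pay for loss or damage while the vehicle is used for racing.",
            extractions=[
                lx.data.Extraction(
                    extraction_class="exclusion",
                    extraction_text="loss or damage while the vehicle is used for racing",
                    attributes={"explanation": "Damage that happens during racing is not covered."},
                )
            ],
        )
    ]
```

Note that extraction_text is copied verbatim from the example text, which is what allows each extraction to be aligned back to its source span.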


(iv) Extraction run

With our PDF parser and prompts set up, we are ready to run the extraction with LangExtract's extract method.

Here is an explanation of the parameters passed to extract:

  • We pass our input text, prompt, and few-shot examples into the text_or_documents, prompt_description, and examples parameters respectively
  • We pass the model version gemma3:4b into model_id
  • The model_url defaults to the local Ollama endpoint (http://localhost:11434). Make sure the Ollama service is already running on your local machine
  • We set fence_output and use_schema_constraints to False, since Gemma is not geared for structured output and LangExtract does not yet support schema constraints for Ollama
  • max_char_buffer sets the maximum number of characters per inference call. Smaller values improve accuracy (by reducing the context size) but increase the number of LLM calls
  • extraction_passes sets the number of extraction passes, with more passes improving recall

On a GPU with 8GB of VRAM, the 10-page document took under 10 minutes to complete parsing and extraction.
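Put together, the call might look like the sketch below. The values for max_char_buffer, extraction_passes, and max_workers are placeholders of mine, not the settings used in the repo, and build_extract_kwargs and run_extraction are hypothetical helper names. To my knowledge the keyword is spelled use_schema_constraints (plural) in the LangExtract README.

```python
def build_extract_kwargs(input_text, prompt, examples):
    """Collect the arguments for lx.extract in one place (values are illustrative)."""
    return dict(
        text_or_documents=input_text,
        prompt_description=prompt,
        examples=examples,
        model_id="gemma3:4b",
        model_url="http://localhost:11434",  # local Ollama endpoint
        fence_output=False,
        use_schema_constraints=False,
        max_char_buffer=1000,   # placeholder: characters per inference call
        extraction_passes=2,    # placeholder: number of independent passes
        max_workers=4,          # placeholder: chunks processed in parallel
    )


def run_extraction(input_text, prompt, examples):
    """Run the extraction; needs langextract installed and Ollama serving gemma3:4b."""
    import langextract as lx

    return lx.extract(**build_extract_kwargs(input_text, prompt, examples))
```

Since each extra pass re-sends every chunk, total LLM calls scale roughly with the number of chunks times extraction_passes, so it is worth starting small and increasing only if recall is lacking.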


(v) Save and post-process the output

We finally save the output using LangExtract's io module.

Custom post-processing is then applied to tidy up the result for easier viewing.

We can see that the LLM responses contain structured extractions from the original text, grouped by category (specifically, exclusions), with both the source text line and a plain-English explanation for each.

This format makes the output easier to interpret, offering a clear mapping between formal policy language and simple summaries.
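As a rough sketch of these two steps: saving uses lx.io.save_annotated_documents as shown in the LangExtract README, while format_extractions is a hypothetical post-processing helper of mine. It assumes each extraction has been loaded as a dict with extraction_class, extraction_text, and attributes keys, and that the explanation is stored under an explanation attribute, which may not match the repo.

```python
from collections import defaultdict


def save_result(result, output_name="extraction_results.jsonl", output_dir="."):
    """Write the annotated document returned by lx.extract to a JSONL file."""
    import langextract as lx

    lx.io.save_annotated_documents([result], output_name=output_name, output_dir=output_dir)


def format_extractions(extractions):
    """Group extraction dicts by class and render them as readable text."""
    grouped = defaultdict(list)
    for ext in extractions:
        grouped[ext["extraction_class"]].append(ext)

    lines = []
    for extraction_class, items in grouped.items():
        lines.append(f"[{extraction_class}]")
        for item in items:
            lines.append(f'  Source: "{item["extraction_text"]}"')
            explanation = (item.get("attributes") or {}).get("explanation")
            if explanation:
                lines.append(f"  Plain English: {explanation}")
    return "\n".join(lines)
```

Pairing the verbatim source text with its explanation in the same record keeps the summary auditable against the original policy wording.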


(4) Wrapping It Up

In this article, we explored how LangExtract's chunking, parallel processing, and multiple extraction passes, combined with Gemma 3's capabilities, enable reliable extraction of structured information from lengthy documents.

These techniques show how the right combination of models and extraction strategies can turn long, complex documents into structured insights that are accurate, traceable, and ready for practical use.

Before you go

I welcome you to follow my GitHub and LinkedIn pages for more engaging and practical content. Meanwhile, have fun extracting structured information with LangExtract and Gemma 3!
