
One Model, Three Modalities: ByteDance Releases Lance for Image and Video Understanding, Generation, and Editing

Building a single model that can both understand and generate images and videos is harder than it sounds. The two tasks pull in opposite directions. Understanding benefits from high-level semantic features that are tightly aligned with language. Generation needs continuous, low-level representations that preserve texture, geometry, and temporal dynamics. Most systems handle this tension by splitting the two into separate architectures and then bolting them together post hoc.

A team of ByteDance researchers took a different approach with Lance. Rather than stitching together separate components, the team designed a model that unifies understanding, generation, and editing across the image and video modalities, trained jointly from scratch.

What Lance Can Do

Lance organizes its capabilities into three output families: text (X2T), images (X2I), and videos (X2V). On the understanding side, this covers image and video captioning, visual question answering, OCR, visual grounding, and reasoning. On the generation side, it handles text-to-image, text-to-video, image-to-video, subject-driven generation, image editing, and video editing, including multi-turn editing across modalities.

This all-in-one coverage is a significant milestone. Typical unified architectures stop at basic image understanding and text-to-image generation, whereas Lance is among the few that cover the full image-video ecosystem for both understanding and generation tasks.

How the Architecture Works

The architecture rests on two principles: unified context modeling and decoupled capability pathways.

For unified context, Lance converts all inputs (text, images, and videos) into a single shared multimodal sequence. Text tokens come from the Qwen2.5-VL embedding layer. For understanding-oriented visual inputs, the Qwen2.5-VL ViT encoder produces compact semantic visual tokens. For generation-oriented visual inputs, the Wan2.2 3D causal VAE encoder encodes images and videos into continuous latent representations, applying 16× spatial downsampling and 4× temporal downsampling. All of these heterogeneous token types (text, semantic visual, and latent visual) live in the same sequence. The model then applies generalized 3D causal attention over the full context, with text tokens using causal attention and visual tokens using bidirectional attention.
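
The mixed attention pattern described above can be illustrated with a small mask builder. This is a minimal sketch, not Lance's implementation: it assumes that every token attends causally to earlier tokens and that tokens inside one visual segment additionally attend to each other in both directions. The function name `build_attention_mask` and the segment layout are hypothetical.

```python
import numpy as np

def build_attention_mask(segments):
    """Build a boolean attention mask for an interleaved sequence.

    segments: list of (kind, length) tuples, kind is "text" or "visual".
    Returns a (total, total) array where mask[q, k] is True when
    query position q is allowed to attend to key position k.
    """
    total = sum(length for _, length in segments)
    # Start from a plain causal (lower-triangular) mask for every token.
    mask = np.tril(np.ones((total, total), dtype=bool))
    start = 0
    for kind, length in segments:
        end = start + length
        if kind == "visual":
            # Assumption: tokens of one visual segment see each other
            # bidirectionally, so the whole diagonal block is opened.
            mask[start:end, start:end] = True
        start = end
    return mask

# Example layout: 2 text tokens, 3 visual tokens, 2 text tokens.
mask = build_attention_mask([("text", 2), ("visual", 3), ("text", 2)])
print(mask.astype(int))
```

In this layout the first text token cannot see the second, the three visual tokens all see each other, and no visual token sees the text that follows it.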

For the decoupled pathways, Lance uses a dual-stream mixture of experts initialized from Qwen2.5-VL 3B. The understanding expert (LLM_UND) processes text and semantic visual tokens, producing multimodal reasoning and text generation outputs. The generation expert (LLM_GEN) processes VAE latent tokens for visual synthesis and editing. Crucially, both experts operate on the same shared multimodal sequence: they share context but do not compete for the same parameters. The understanding expert is trained with a next-token prediction loss; the generation expert is trained with a flow matching objective in the continuous latent space. The two losses are combined with adjustable weights throughout training.
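
The two training objectives and their weighted sum can be sketched as follows. This is an illustrative sketch only: the article does not give the exact flow matching formulation or the loss weights, so the linear noise-to-data interpolation, the velocity target, and the names `w_und` and `w_gen` are assumptions.

```python
import numpy as np

def next_token_loss(logits, targets):
    """Mean cross-entropy over token positions.

    logits: (N, V) unnormalized scores, targets: (N,) integer token ids.
    """
    shifted = logits - logits.max(axis=-1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))
    return float(-log_probs[np.arange(len(targets)), targets].mean())

def flow_matching_loss(predict_velocity, clean_latents, rng):
    """Flow matching loss on continuous latents.

    Assumption: a straight path from noise (t=0) to data (t=1), with the
    model regressing the constant velocity (data - noise).
    """
    noise = rng.standard_normal(clean_latents.shape)
    t = rng.uniform(size=(clean_latents.shape[0], 1))
    noisy = (1.0 - t) * noise + t * clean_latents
    target_velocity = clean_latents - noise
    predicted = predict_velocity(noisy, t)
    return float(np.mean((predicted - target_velocity) ** 2))

def combined_loss(ntp, fm, w_und=1.0, w_gen=1.0):
    """Weighted sum of the two objectives (weights are hypothetical)."""
    return w_und * ntp + w_gen * fm

rng = np.random.default_rng(0)
ntp = next_token_loss(np.zeros((3, 4)), np.array([0, 1, 2]))
latents = rng.standard_normal((4, 8))
# A dummy "model" that always predicts zero velocity.
fm = flow_matching_loss(lambda x, t: np.zeros_like(x), latents, rng)
print(combined_loss(ntp, fm, w_und=1.0, w_gen=0.5))
```

In a real dual-stream model the first loss would be computed on the understanding expert's outputs and the second on the generation expert's outputs, while both attend over the same sequence.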

Modality-Aware Rotary Positional Encoding (MaPE)

Placing ViT semantic tokens, clean VAE condition tokens, and noisy VAE target tokens in the same sequence creates a subtle problem. Standard 3D-RoPE encodes positions from spatiotemporal structure alone, so it has no way to tell these token groups apart. When several visual token groups occupy the same sequence, their positional boundaries become ambiguous, which can hurt cross-task alignment.

Lance introduces Modality-Aware Rotary Positional Encoding (MaPE) to fix this. MaPE applies a fixed temporal offset to each modality group, based on the group's index in the sequence. Spatial coordinates are left unchanged, so the internal structure of images and videos is preserved. The temporal offset alone is enough to separate token groups in the global positional space without disturbing the temporal ordering inside any individual video.
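
The offset idea can be made concrete with position IDs. This is a minimal sketch under stated assumptions: each visual group is a (frames, height, width) grid, positions are (t, h, w) triples as in 3D-RoPE, and the offset size `temporal_offset=1000` is a hypothetical value chosen only so that groups do not overlap. The article does not state the actual offset.

```python
import numpy as np

def grid_positions(frames, height, width):
    """Return (frames*height*width, 3) integer positions as (t, h, w)."""
    t, h, w = np.meshgrid(
        np.arange(frames), np.arange(height), np.arange(width), indexing="ij"
    )
    return np.stack([t.ravel(), h.ravel(), w.ravel()], axis=-1)

def mape_positions(groups, temporal_offset=1000):
    """Shift the temporal coordinate of each group by index * offset.

    groups: list of (frames, height, width) for each visual token group,
    in the order they appear in the sequence. Spatial coordinates are
    left untouched.
    """
    shifted = []
    for index, (frames, height, width) in enumerate(groups):
        positions = grid_positions(frames, height, width)
        positions[:, 0] += index * temporal_offset
        shifted.append(positions)
    return shifted

# Example: a single-frame condition image followed by a 2-frame target clip.
condition, target = mape_positions([(1, 2, 2), (2, 2, 2)])
print(condition[:, 0])  # temporal coordinates of the first group
print(target[:, 0])     # temporal coordinates of the second group, shifted
```

Without the shift, the first frame of both groups would share temporal index 0 and identical (h, w) coordinates, so the two groups would be indistinguishable by position.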

Removing MaPE lowers GenEval from 80.94 to 80.56, GEdit-Bench from 6.86 to 6.30, and VBench from 81.81 to 80.95, a consistent degradation across generation, editing, and understanding.

Training: Four Stages, One Unified Framework

Lance is trained in four sequential stages, each building on the last.

Pre-Training (PT) lays the foundation using roughly 1B image-text pairs and 140M video-text pairs, covering 1.5T training tokens. This stage establishes basic multimodal alignment and generative capability. The VAE and ViT encoders are frozen here; only the backbone and the connectors are trained.

Continued Training (CT) widens the task space by introducing the remaining multi-task data (editing samples, subject-driven generation samples, and multimodal understanding data) over roughly 300B tokens. A progressive data-mixing schedule gradually raises the share of harder tasks such as editing as training proceeds.
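
One simple way to picture a progressive data-mixing schedule is a linear interpolation between a starting and an ending task mixture. The sketch below is an assumption for illustration: the article does not describe the schedule's shape, and the task names and ratios are made up.

```python
def mixing_weights(step, total_steps, start, end):
    """Linearly interpolate per-task sampling weights over training.

    start, end: dicts mapping task name -> weight at the first and last
    step. Returned weights are renormalized to sum to 1.
    """
    progress = min(max(step / total_steps, 0.0), 1.0)
    raw = {
        task: (1.0 - progress) * start[task] + progress * end[task]
        for task in start
    }
    norm = sum(raw.values())
    return {task: weight / norm for task, weight in raw.items()}

# Hypothetical mixtures: editing grows from 10% to 40% of sampled data.
start_mix = {"text_to_image": 0.7, "editing": 0.1, "understanding": 0.2}
end_mix = {"text_to_image": 0.4, "editing": 0.4, "understanding": 0.2}

for step in (0, 5000, 10000):
    print(step, mixing_weights(step, 10000, start_mix, end_mix))
```

A data loader would sample each batch's task according to these weights, so harder tasks appear more often later in training.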

Supervised Fine-Tuning (SFT) strengthens instruction following, editing precision, and identity consistency using curated high-quality data over 72B tokens.

Reinforcement Learning (RL) uses Group Relative Policy Optimization (GRPO), with PaddleOCR serving as the reward model, to further sharpen text rendering accuracy and text-image alignment.
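
The group-relative part of GRPO can be sketched in a few lines: several samples are generated for the same prompt, each is scored, and each sample's advantage is its reward standardized within the group. In the sketch below, the OCR step is replaced by plain strings and the reward `text_reward` is a hypothetical string-similarity score; it does not call PaddleOCR and is not the reward used by the authors.

```python
import numpy as np
from difflib import SequenceMatcher

def text_reward(recognized_text, target_text):
    """Hypothetical reward: similarity between the text an OCR system
    read from a generated image and the text the prompt asked for."""
    matcher = SequenceMatcher(None, recognized_text.lower(), target_text.lower())
    return matcher.ratio()

def group_relative_advantages(rewards, eps=1e-6):
    """Standardize rewards within one group of samples for the same prompt."""
    rewards = np.asarray(rewards, dtype=np.float64)
    return (rewards - rewards.mean()) / (rewards.std() + eps)

# Pretend OCR outputs for four images generated from the prompt
# 'a shop sign that says "OPEN"'.
ocr_outputs = ["OPEN", "OPFN", "0PEN", "CLOSED"]
rewards = [text_reward(text, "OPEN") for text in ocr_outputs]
advantages = group_relative_advantages(rewards)
print(rewards)
print(advantages)
```

Samples whose rendered text matches the target better than the group average receive a positive advantage and are reinforced; the rest are pushed down.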

The whole pipeline fits within a peak training budget of 128 GPUs.

Results

Image Generation. On GenEval, Lance scores 0.90 overall, tying TUNA for the top spot among unified models. Subcategory scores include counting (0.84), colors (0.97), and position (0.87). On DPG-Bench, Lance scores 84.67, with strong relation modeling, although TUNA (86.76) and TUNA-2 (86.54) lead that benchmark. To put the parameter efficiency in perspective: Janus-Pro-7B scores 0.80 on GenEval, and Show-o2 (7B) scores 0.76. Lance matches the top unified-model score with 3B activated parameters.

Video Generation. On VBench, Lance achieves a Total Score of 85.11 (using LLM rewriting), the highest among unified models. The next-best unified model, TUNA, scores 84.06. Lance also surpasses dedicated generation-only models, including HunyuanVideo (83.43) and Wan2.1-T2V (83.69).

Image Editing. On GEdit-Bench, Lance scores 7.30 Avg/G_O, the highest among unified models. It leads in background change, material alteration, motion change, portrait beautification, subject removal, subject replacement, and tone transfer. Text change is flagged as a remaining weakness.

Video Understanding. On MVBench, Lance scores 62.0, the highest among unified models. Show-o2 (7B), the next-best unified model, scores 55.7. Lance also outperforms several understanding-only models with more parameters, which is notable given that it is simultaneously trained for generation and editing.


Key Takeaways

  1. Lance is a native unified multimodal model with 3B activated parameters that handles image and video understanding, generation, and editing within a single jointly trained framework.
  2. A dual-stream mixture-of-experts architecture with Modality-Aware Rotary Positional Encoding (MaPE) decouples the understanding and generation pathways while keeping them in a shared multimodal context.
  3. Lance scores 0.90 on GenEval and 85.11 on VBench, the highest VBench Total Score among unified models, and was trained within a peak budget of 128 GPUs.
  4. On MVBench, Lance scores 62.0, the highest among unified models, outperforming Show-o2 (7B) at 55.7 while also supporting generation and editing.
  5. Lance is open source under Apache 2.0, with weights available on Hugging Face.
