Generative AI

Ukuqhathanisa izikhathi ezi-6 eziphezulu ze-6

Amamodeli amakhulu olimi manje akhawulelwe ngaphansi kokuqeqeshwa nokuningi ngendlela esheshayo futhi eshibhile singasebenza ngayo amathokheni ngaphansi kwethrafikhi yangempela. Lokho kwehlela kuze kube yimininingwane emithathu yokuqalisa: ukuthi izicelo ze-Runtime ezihamba phambili kanjani, kanjani okweqisa kuze kube manje futhi zinquma, nokuthi zikusebenzisa kanjani i-KV Cache. Izinjini ezihlukile zenza abathengisi abahlukahlukene kulezi zimbazo, ezibonakala ngqo njengokwehluka kumathokheni ngehora lesibili, P50 / P99 latency, kanye nokusetshenziswa kwememori ye-GPU.

Lo mbhalo uqhathanisa izikhathi eziyisithupha ezibonisa kaninginingi ezinhlakeni zokukhiqiza:

  • vllm
  • Tensorrt llm
  • Ukuqabulana Ubuso Bekungaphansi Kombhalo (TGI V3)
  • Lmdcwalisa
  • Shang
  • Ukuzithoba okujulile / okuvinjelwe zero

1. Vllm

Ukusongoza icebo

I-VLLM yakhelwe nxazonke Cwaziswa. Esikhundleni sokugcina isilondolozi se-KV ngakunye ku-buffer enkulu esontayo, ihlukanisa i-KV kumabhulokhi osayizi ongaguquki bese isebenzisa ungqimba lokulandelana kokulandelana kokulandelana kokulandelana kwamabhulokhi.

Lokhu kunikeza:

  • Ukuqhekeka okuphansi kakhulu kwe-KV (kubikwa <4% imfucuza vs 60-80% kuma-naïve allocators)
  • Ukusetshenziswa okuphezulu kwe-GPU ngokuhanjiswa okuqhubekayo
  • Ukuxhaswa kwendabuko kokwabelana nge-prefix kanye ne-KV sebenzisa kabusha ezingeni le-block

Izinguqulo zakamuva zengeza I-KV Inani (FP8) kanye nokuhlanganisa izikhwebu zesitayela se-flashtattion.

Ukwenza

Ukuhlolwa kwe-VLLM:

  • I-VLLM ifinyelela 14-24 × Ukuphikisana okuphezulu kunokuqabulana nobuso be-transformers kanye 2.2-3.5 × ephakeme kune-TGI yokuqala yamamodeli we-llama ku-nvidia gpus.

KV nokuziphatha kwenkumbulo

  • I-Pefatter ihlinzeka ngesakhiwo se-KV esinobungane bobabili i-GPU enobungane futhi ihlukaniswa.
  • I-FP8 KVanization inciphisa usayizi we-kv futhi ithuthukise ukuguqulwa kwethonya lapho i-compute akuyona ibhodlela.

Lapho kufanelana khona

  • Injini yokusebenza okuzenzakalelayo lapho udinga i-General LLM ekhonza i-Backlend nge-Foot Photoput, i-TTFT enhle, ne-Hardware.

2. I-Tensorrt llm

Ukusongoza icebo

I-Tensort LLM iyinjini esekelwe eqoqweni ngaphezulu kweNvidia TensORRT. Ikhiqiza ama-kerners afakiwe ngemodeli nemodeli, futhi adalule i-api efakiwe esetshenziswa ngohlaka olufana neTriton.

I-SUBSTSTSTST yayo ye-KV iyacaca futhi inesici esicebile:

  • I-KV cache
  • Isilondolozi se-KV (Int8, FP8, ngenhlanganisela ethile isavela)
  • Isilondolozi se-buffer kayindiso
  • I-KV cache basebenzise kabushakufaka phakathi ukulayisha i-KV ku-CPU nokusebenzisa kabusha ngokuzungeza ukunciphisa i-TTFT

INvidia ibika ukuthi i-CPU esekelwe KV Reule inganciphisa isikhathi sokufika kokuqala 14 × Ku-H100 nangokwengeziwe ku-GH200 kwizimo ezithile.

Ukwenza

I-Tensort LLM iguquke kakhulu, ngakho-ke imiphumela iyahluka. Amaphethini ajwayelekile avela ekuqhathaniseni Komphakathi kanye namabhentshi abathengisi:

  • Phansi kakhulu Isicelo Esingashadi Esihle Ku-nvidia gpus lapho izinjini zihlanganiswa imodeli ngqo nokucushwa.
  • Ngesikhathi esilinganiselwe, kungahle kubhekwe noma ku-TTFT ephansi noma ngokudlula okuphezulu; Ekuhlanganeni okuphezulu kakhulu, ukudlula kwamaphrofayili alungiselelwe ukusunduza P99 Up ngenxa yokuvusa ngolaka.

KV nokuziphatha kwenkumbulo

  • I-KV Plus Plus I-KV ehlukaniswe i-KV inikeza ukulawula okuqinile kokusetshenziswa kwememori kanye nomkhawulokudonsa.
  • I-EBUCE NE-Memory APIS vumela ukuthi uklame izinqubomgomo ze-cache ezaziyo

Lapho kufanelana khona

  • Ukulayisha okubucayi kwe-latency kanye nama-nvidia kuphela izindawo, lapho amaqembu angatshala khona imali enjini ekheni kanye nokuhleleka kwamamodeli.

3. Ukuqabula ubuso be-TGI v3

Ukusongoza icebo

Ukutholwa Kwezizukulwane (I-TGI) yiseva egxile kwisitaki nge:

  • Iseva yokugqwala esekwe ku-HTTP ne-GRPC Server
  • Ukuhamba okuqhubekayo, ukusakaza, izingwegwe zokuphepha
  • I-Backtorch ye-Pytorch ne-TensOrgt futhi ivale ukuhlanganiswa kobuso bukabuso

I-TGI V3 Ingeza ipayipi elisha elide elide:

  • I-Chunkled Found kokufakwa okude
  • Isiqalo se-KV Umlando wengxoxo emide kangaka awubuyiselwa esicelweni ngasinye

Ukwenza

Kuma-Products ajwayelekile, kukhombisa umsebenzi wakamuva weqembu lesithathu:

  • I-VLLM ivame ukuphenya i-TGI kumathokheni aluhlaza okwesibili ngokuhlangana okuphezulu ngenxa ye-petatatuntion ngenxa yokuqothuka, kepha umehluko awukhulu kuma-setups amaningi.
  • I-TGI v3 icubungula amathokheni ama-3 × ngaphezulu futhi ingaphezulu kwe-13 × ngokushesha kune-vllm ekukhuphiseni okudengaphansi kwesethaphu ngemilando emide kakhulu kanye nesiqalo se-prefix esivunyelwe.

Iphrofayili ye-Latency:

  • I-P50 isikhathi esifushane nobude obude bufana ne-VLLM lapho bobabili bahlelwe ngokuthuthuka okuqhubekayo.
  • Ngemilando yengxoxo emide, okuqala kokubusa kumapayipi angenangqondo; Ukusetshenziswa kabusha kwe-TGI v3 kwangaphambilini amathokheni anikeza ukuwina okukhulu ku-TTFT ne-P50.

KV nokuziphatha kwenkumbulo

  • I-TGI isebenzisa ukulondolozwa kwe-KV ngama-pernels wesitayela sokunakwa futhi kunciphisa inkumbulo yezinyawo ngokusebenzisa i-chunking of flues
  • Ihlanganisa ukwakhiwa kwamanani ngebhithi nama-byte kanye ne-GPTQ kanye ne-run kuzo zonke izipebe ezimbalwa zehardware.

Lapho kufanelana khona

  • Ukukhiqizwa kwesitaki esivele kuvale ubuso, ikakhulukazi imithwalo yezitayela yezindaba enemiyalezo emide lapho isiqalo se-prefix sinikeza izinzuzo ezinkulu zomhlaba zangempela.

4. Lmdeplobhishe

Ukusongoza icebo

Lmdcwalisa iyithuluzi lokucindezela kanye nokuhanjiswa kusuka ku-interll emcomelweni ye-interllm. Iveza izinjini ezimbili:

  • Umtshendela: Ukusebenza okuphezulu kweCuda Kernels yeNvidia GPUS
  • Injini yePytorch: Ukuvuka okuguquguqukayo

Izici ze-Key Runnitime:

  • Ukuqothuka okuqhubekayo, okuqhubekayo
  • Ivinjiwe i-KV cache nemenenja yokwabiwa nokusebenzisa kabusha
  • Ukuhlukaniswa okunamandla kanye ne-fuse for amabhlogo wokunakwa
  • Ukufana kwesibeletho
  • Isisindo kuphela futhi i-KVanization (kufaka phakathi i-AWQ ne-Intanethi ye-Int8 / Int4 KV)

I-LMDEPHSEPS Kufika ku-1.8 × Isicelo Esiphezulu Phezulu kune-VLLMebeka lokhu ekuphakamiseni okuqhubekayo, okuvinjelwe i-KV namakhekhe alungiselelwe.

Ukwenza

Ukuhlola Khombisa:

  • Okwezinhlobo ezi-4 zeLlama zesitayela sezitayela ku-A100, i-LMDEDY ingafinyelela amathokheni aphezulu ngomzuzwana ku-vllm ngaphansi kwezinkinga zokuqhathanisa eziqhathaniswa neConmurrency ephezulu.
  • Kubika futhi lokho I-4 yokulinganisa kancane imayelana ne-2.4 × ngokushesha kune-FP16 amamodeli asekelwayo.

I-Latency:

  • Isicelo esisodwa TTFT siku-ballpark efanayo nezinye izinjini ezenzelwe i-GPU uma zilungiselelwe ngaphandle kwemikhawulo ye-batch eyedlulele.
  • Ngaphansi kokuhlangenwe nakho okusindayo, ukuhlangana okuqhubekayo kwenhlanganisela okuvinjelwe i-KV Let LmDeplocled streak ukuphakama ngaphandle kokuwohloka kwe-TTFT.

KV nokuziphatha kwenkumbulo

  • I-KV cache evinjelwe ngokusobala ngokulandelana kwama-buffers ngegridi yama-KV Chunks aphethwe yisikhathi sokugijima, okufana nomoya we-VLLM kodwa ngesakhiwo esihlukile sangaphakathi.
  • Ukusekelwa kwe isisindo nesisindo ne-kva ihlose amamodeli amakhulu ku-GPU ecindezelwe.

Lapho kufanelana khona

  • Ukuhanjiswa kwe-Nvidia Centrus okudinga ukugcwaliswa okuphezulu futhi kukhululekile ukusebenzisa i-turbomnd kanye ne-LMDEDH.

5. Sglang

Ukusongoza icebo

USglang womabili:

  • A I-DSL Okwenziwa ngezinhlelo ezihleliwe ze-LLM ezifana nama-ejenti, ukugeleza komsebenzi we-rag kanye namapayipi ethuluzi
  • Isikhathi sokusebenza esisebenzayo I-Radixattlentionindlela yokusebenzisa kabusha i-KV eyabelana ngeziqalo zisebenzisa isakhiwo sesihlahla se-radix esikhundleni se-block hashes elula.

I-Radixattlention:

  • Igcina i-KV ngezicelo eziningi esihlahleni se-prefix esikhishwe ngamathokheni
  • Inika amandla amanani aphezulu we-KV Hit lapho izingcingo eziningi zabelana ngeziqalo, okufana nokuvuselelwa okumbalwa okudutshuliwe, amaketanga amaningi, noma amaketanga ethuluzi

Ukwenza

Ukuqonda okusemqoka:

  • I-Sglang ifinyelela kufika ku-6.4 × ngaphezulu okuphezulu na- Kufika ku-3.7 × eliphansi langemuva Kunezindlela eziyisisekelo ezifana ne-VLLM, LMQL nabanye ngemisebenzi ehlelekile yomsebenzi.
  • Ukuthuthuka kukhulu kakhulu lapho kubuye kube nokusebenzisa isiqalo esindayo, isibonelo, ukuxoxa okuningi noma imithwalo yokuhlola enomongo ophindaphindwayo.

Kubikwe amanani e-KV cache hit aqala kusuka cishe I-50% kuya ku-99%namasheduli ochungechuse amakhekhe asondele kakhulu emazingeni okushaya afanele kumabhentshi alinganisiwe.

KV nokuziphatha kwenkumbulo

  • I-Radixattlention ihlala ngaphezulu kwezikhumba zesitayela sokunakwa kwe-petaged futhi igxile kuyo sebenzisa futhi kunokuba kube ukwabiwa.
  • USgglang uhlanganisa kahle nezinhlelo zokulondolozwa kwezikhundla zokugcina ukuhambisa i-KV phakathi kwe-GPU ne-CPU lapho kulandela ukulandelana, yize lezo zinhlelo zivame ukusetshenziswa njengamaphrojekthi ahlukile.

Lapho kufanelana khona

  • Izinhlelo ze-Agentic, amapayipitela amathuluzi, nezicelo ezisindayo ze-rag lapho izingcingo eziningi zabelana ngeziqalo ezinkulu ezisheshayo kanye nama-KV Reule izindaba ezingeni lesicelo.

I-6. Ukuzithoba okujulile

Ukusongoza icebo

I-Deepspeed ihlinzeka ngezingcezu ezimbili ezifanelekile zokutholwa:

  • Ukutholwa okujulile: Kwenzelwe ama-transformer kernels kanye ne-tensor ne-pipeline parallelism
  • I-Zero Indence / Zero Offload: Amasu alayishwa izinsimbi zemodeli, nakwezinye isilondolozi se-KV, ku-CPU noma i-NVME ukugijima amamodeli amakhulu kakhulu kwimemori ekhawulelwe ye-GPU

Ukutholwa kwe-zero kugxile ku:

  • Ukugcina Izinsimbi Ezincane Noma Zomukeli
  • Ukusakaza tincanyana kusuka ku-CPU noma i-NVME njengoba kudingeka
  • Ukuqondisa ububanzi kanye nosayizi wemodeli kunokuba kuncishiswe i-latency ephansi

Ukwenza

Ku-zero ukuvancence chofoza u-30b Isibonelo ku-V100 32GB:

  • Ukulayisha okugcwele kwe-CPU kufinyelela cishe Amathokheni angama-43 ngomzuzwana
  • Ukulayishwa okugcwele kwe-NVME Amathokheni angama-30 ngomzuzwana
  • Bobabili 1.3-2.4 × ngokushesha kunokucushwa kokulayishwa okuyingxenye, ngoba ukulahlwa okugcwele kunika amandla amasayizi amakhulu ama-batch

Lezi zinombolo zincanyana ngokuqhathaniswa ne-GPU Residence LLM CRIMS ku-A100 noma H100, kepha ziyasebenza kwimodeli engahambisani nemvelo ku-32GB.

Inhlanganisela yakamuva ye-I / O ye-DeepSpeed ​​and FlexGen iqinisekisa ukuthi amasistimu asuselwa ekuthambitheni aphethwe ngokufundwa kwama-kib ama-128 amancane nokuthi i / o Ukuziphatha kube yi-mainleneck enkulu.

KV nokuziphatha kwenkumbulo

  • Isisindo semodeli futhi kwesinye isikhathi amabhlogo we-KV alayishwa ku-CPU noma e-SSD ukuze alungele amamodeli angaphezu kwamandla e-GPU.
  • I-TTFT ne-P99 kuphakama kuqhathaniswa nezinjini ezihlanzekile ze-GPU, kepha i-TradeFOFF yikhono lokusebenzisa amamodeli amakhulu kakhulu okungenzeka ukuthi awafaneleki.

Lapho kufanelana khona

  • Ukungaxhunyiwe ku-inthanethi noma i-batch, noma izinsizakalo eziphansi ze-QPS lapho usayizi wemodeli kufanelekile ngaphezu kwe-latency kanye ne-GPU ukubala.

Ukuqhathanisa amatafula

Le thebula ifingqa ama-raneffs amakhulu afanele:

Isikhathi sokuqalisa Umbono oyinhloko Amandla ahlobene Isu le-KV Icala elijwayelekile lokusebenzisa
vllm I-PefatThention, Ukuqhuma Okuqhubekayo Amathokheni aphezulu ngomzuzwana ngamunye enikezwe i-TTFT Amabhulokhi we-KV, ukusekelwa kwe-FP8 KV Inhloso ejwayelekile ye-GPU esebenza, i-dwardware eningi
Tensorrt llm Izikhwebu ezihlanganisiwe ku-Nvidia + KV Sebenzisa kabusha I-latency ephansi kakhulu kanye nokudlula okuphezulu ku-nvidia I-Peged, i-KV eyenziwe kabusha, iphinde isebenzise kabusha I-Nvidia kuphela, i-latency inzwetive
I-TGI v3 I-HF esebenza ungqimba nge-ende esheshayo Ukusebenza okusheshayo okusheshayo, isitaki esihlanganisiwe I-KV ye-KV, i-CHANED FUSTIAL, ISIQINISEKISO SOKUGCINA HF centric APIs, umlando omude wengxoxo
Lmdcwalisa Izikhwebu ze-turbomind, zivimbe i-KV, nekana Kufika ku-1.8 × VLLM PASHT ekuhlolweni komthengisi Ivinjelwe i-KV cache, isisindo kanye ne-kv. Ukuhanjiswa kweNvidia kugxile ekutholeni okuluhlaza
Shang Ama-radixattrion kanye nezinhlelo ezihlelekile Kufika ku-6.4 × ukugcwala kanye ne-3.7 × eliphansi le-latency emisebenzini ehlelekile I-radix Tree KV isebenzisa kabusha iziqalo Ama-ejenti, i-rag, isiqalo esiphakeme
I-Deepspeed I-GPU CPU NVME Off in amamodeli amakhulu Inika amandla amamodeli amakhulu ku-GPU encane; Ukungena okuqondiswe Izinsimbi ezilayishiwe futhi kwesinye isikhathi i-KV Amamodeli amakhulu kakhulu, ungaxhunyiwe ku-inthanethi noma ama-QPs aphansi

Ukukhetha isikhathi sokusebenza ngokuzilolonga

Ngohlelo lokukhiqiza, ukukhetha kuvame ukuwa emaphethini ambalwa alula:

  • Ufuna injini eqinile eqinile enomsebenzi omncane wangokwezifiso: Ungaqala nge vllm. Kukunika okubucayi okuhle, i-TTFT enengqondo, nokuphathwa okuqinile kwe-KV ku-Hardware ejwayelekile.
  • Uzibophezele ku-NVIDIA futhi ufuna ukulawulwa okuhle okukhohlisayo phezu kwe-latency ne-KV: Ungasebenzisa Tensorrt llmcishe ngemuva kweTriton noma i-TGI. Hlela injini ethile yemodeli yakha nokuhleleka.
  • Isitaki sakho sesivele sikubambe ubuso futhi unendaba nezingxoxo ezinde: Ungasebenzisa I-TGI v3. Ukulondolozwa kwayo okusheshayo okusheshayo kanye nesiqalo se-prefix kusebenza kakhulu kwithrafikhi yesitayela sengxoxo.
  • Ufuna ukugcwala okuphezulu nge-GPU ngamamodeli ahlukanisiwe: Ungasebenzisa Lmdcwalisa Nge-turbomind futhi ivimbe i-KV, ikakhulukazi amamodeli omndeni ama-4.
  • Ukwakha ama-ejenti, amaketanga ethuluzi noma izinhlelo ezisindayo ze-rag: Ungasebenzisa Shang Futhi ukuklama okukhohlisayo ukuze i-KV yokusebenzisa kabusha nge-radixattive iphakeme.
  • Kufanele usebenzise amamodeli amakhulu kakhulu ku-GPU elinganiselwe: Ungasebenzisa Ukuzithoba okujulile / okuvinjelwe zeroYamukela i-latency ephakeme, bese uphatha i-GPU njengenjini yokufakelwa nge-SSD ku-loop.

Sekukonke, zonke lezi zinjini ziguqulwa ngombono ofanayo: I-CACE ye-KV iyisisetshenziswa sangempela se-bottleneck. Abaphumelele yizikhathi zokuziphatha eziphatha i-KV njengesakhiwo sokuqala sedatha yeklasi ukuze ziboshwe, zilinganiswe, zisetshenziswe kabusha futhi zilayishwe kancane, hhayi nje ukuthi kufakwe i-tensor enkulu kwimemori ye-GPU.


UMichal Sutter ungumsebenzi wesayensi yedatha ene-Master of Science ku-Data Science evela e-University of PADOVA. Ngesisekelo esiqinile ekuhlaziyeni kwezibalo, ukufunda ngomshini, kanye nobunjiniyela bedatha, ama-Mikhali ama-Excels ekuguquleni imininingwane eyinkimbinkimbi ekutholeni okusebenzayo.

Landela uMarktechpost: Sengeze njengomthombo owuthandayo ku-Google.

Source link

Related Articles

Leave a Reply

Your email address will not be published. Required fields are marked *

Back to top button