Why Care About Prompt Caching in LLMs?

We have talked a lot about what an amazing tool RAG is for bringing the power of AI to custom data. But whether we are talking about plain LLM API requests, RAG applications, or more complex AI agents, one common question remains the same: how do all these things scale? In particular, what happens to cost and latency as the number of requests in such applications grows? For more advanced AI agents, which can make multiple calls to an LLM to process a single user query, these questions become especially important.
Fortunately, when making calls to an LLM, the same input tokens are usually repeated across many requests. Users ask some specific questions much more often than others, the system prompts and instructions built into AI-powered applications are repeated in every user query, and even for a single prompt, models perform repeated calculations to generate an entire response (remember how LLMs produce text by predicting tokens one at a time?). As in other applications, caching can help significantly in reducing the cost and latency of LLM requests. For instance, according to the OpenAI documentation, Prompt Caching can reduce latency by up to an impressive 80% and input token costs by up to 90%.
What about caching?
In general, caching in computing is not a new idea. At its core, a cache is a component that stores data temporarily so that future requests for the same data can be served faster. We can distinguish between two basic cache outcomes: a cache hit and a cache miss. In particular:
- A cache hit occurs when the requested data is found in the cache, allowing for a quick and cheap retrieval.
- A cache miss occurs when the data is not in the cache, forcing the application to go to the original source, which is more expensive and time-consuming.
One of the most typical uses of caching is in web browsers. When you visit a website for the first time, the browser checks for the URL in its cache memory but finds nothing (that is a cache miss). Since the data we are looking for is not available locally, the browser has to make a more expensive and time-consuming request to the web server across the internet, in order to fetch the data from the remote server where it originally lives. Once the page has finally loaded, the browser typically copies that data into its local cache. If we try to reload the same page 5 minutes later, the browser will look for it in its local storage. This time it will find it (a cache hit) and load it from there, without going back to the server. This makes the browser faster and lets it consume fewer resources.
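The hit/miss behaviour described above can be sketched in a few lines of Python. This is only a toy illustration: `SimpleCache` and `slow_origin` are hypothetical names made up for this example, and `slow_origin` merely stands in for a real request to a remote web server.

```python
class SimpleCache:
    """A toy cache that counts hits and misses."""

    def __init__(self, fetch):
        self._fetch = fetch   # function that reads from the slow original source
        self._store = {}      # key -> cached value
        self.hits = 0
        self.misses = 0

    def get(self, key):
        if key in self._store:
            # Cache hit: serve the stored copy without touching the source.
            self.hits += 1
            return self._store[key]
        # Cache miss: go to the original source, then keep a local copy.
        self.misses += 1
        value = self._fetch(key)
        self._store[key] = value
        return value


def slow_origin(url):
    # Hypothetical stand-in for an expensive network request.
    return f"<html>content of {url}</html>"


cache = SimpleCache(slow_origin)
first = cache.get("https://example.com")   # first visit: miss
second = cache.get("https://example.com")  # reload: hit
print(f"hits={cache.hits}, misses={cache.misses}")
```

The second call returns the same content as the first, but it never calls `slow_origin`, which is where the time and cost savings come from.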
As you may imagine, caching is particularly useful in systems where the same data is requested many times. In most systems, data access is not uniform, but rather tends to follow a distribution where a small fraction of the data accounts for the vast majority of requests. A large portion of real-life applications follows the Pareto principle, meaning that about 80% of the requests concern about 20% of the data. If it were not for the Pareto principle, the cache would need to be as large as the primary memory of the system, which would make it very, very expensive.
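To make the skewed-access argument a bit more concrete, here is a small sketch that computes the share of requests a cache would serve if it held only the most popular items. It assumes a Zipf-like popularity curve (the item ranked r is requested with weight 1/r), which is an assumption chosen for illustration and not something measured in this post; `hit_rate_for_top_fraction` is a hypothetical helper name.

```python
def hit_rate_for_top_fraction(num_items: int, cached_fraction: float) -> float:
    """Share of requests served by caching the most popular items.

    Assumes a Zipf-like popularity: the item ranked r has weight 1 / r.
    """
    weights = [1.0 / rank for rank in range(1, num_items + 1)]
    cached_items = round(num_items * cached_fraction)
    # Requests for cached items divided by all requests.
    return sum(weights[:cached_items]) / sum(weights)


rate = hit_rate_for_top_fraction(num_items=100, cached_fraction=0.2)
print(f"A cache holding 20% of the items serves about {rate:.0%} of requests")
```

Under this particular assumption, the result lands somewhat below the 80% of the Pareto rule of thumb. The exact share depends on how skewed the real access pattern is, but the point stands: a small cache can absorb most of the traffic.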
Prompt Caching and a Little Bit about LLM Inference
The concept of caching (storing frequently used data somewhere and retrieving it from there, instead of fetching it again from its primary source) is used in a similar way to improve the efficiency of LLM calls, allowing for significantly reduced cost and latency. Caching can be applied to various elements of an AI application, the most important of which is Prompt Caching. Caching can also provide great benefits in other parts of an AI application, such as caching in RAG retrieval or query-response caching. This post, however, focuses solely on Prompt Caching.
To understand how Prompt Caching works, we must first understand a little about how LLM inference (using a trained LLM to generate text) works. LLM inference is not a single continuous process, but is rather divided into two distinct stages. Those are:
- Pre-fill, which refers to processing the entire prompt at once to produce the first token. This stage requires heavy computation and is therefore compute-bound. We may picture a very simplified version of this stage as each token attending to all other tokens, or something like comparing every token with every previous token.
- Decoding, which appends the last generated token back to the sequence and generates the next one auto-regressively. This stage is memory-bound, as the system must load the entire context of previous tokens from memory to generate each new token.
For example, imagine we have the following prompt:
What should I cook for dinner?
From which we may then get the first token:
Here
and the following decoding iterations:
Here
Here are
Here are 5
Here are 5 easy
Here are 5 easy dinner
Here are 5 easy dinner ideas
The issue with this is that, in order to generate the complete response, the model would have to process the same previous tokens over and over again to produce each next word during the decoding stage, which, as you may imagine, is highly inefficient. In our example, this means that the model would process the tokens 'What should I cook for dinner? Here are 5 easy dinner' again in order to produce the output 'ideas', even though it already processed the tokens 'What should I cook for dinner? Here are 5 easy' a few milliseconds earlier.
To solve this, KV (Key-Value) Caching is used in LLMs. This means that the intermediate Key and Value tensors for the input prompt and the previously generated tokens are calculated once and then stored in the KV cache, instead of being recomputed from scratch at each iteration. As a result, the model performs only the minimum calculations needed to produce each new token. In other words, at each decoding iteration, the model only performs calculations for the newest token, predicts the next one, and appends the new Key and Value entries to the KV cache.
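A rough way to see the saving is to count how many token positions have to be processed with and without a KV cache. The sketch below is a simplified counting model, not real attention: it assumes the prompt is 7 tokens long and that the reply "Here are 5 easy dinner ideas" is 6 tokens, both of which are illustrative assumptions (real tokenizers split text differently). The function names are hypothetical.

```python
def tokens_processed_without_cache(prompt_len: int, new_tokens: int) -> int:
    """Every decoding step re-processes the whole sequence so far."""
    total = 0
    seq_len = prompt_len
    for _ in range(new_tokens):
        total += seq_len  # re-process the prompt plus everything generated so far
        seq_len += 1      # the newly generated token joins the sequence
    return total


def tokens_processed_with_cache(prompt_len: int, new_tokens: int) -> int:
    """Pre-fill processes the prompt once; later steps process only the newest token."""
    if new_tokens == 0:
        return 0
    # Pre-fill yields the first token; each remaining step handles one token.
    return prompt_len + (new_tokens - 1)


# Assumed sizes for the cooking example (illustrative only).
PROMPT_LEN = 7
NEW_TOKENS = 6

print(tokens_processed_without_cache(PROMPT_LEN, NEW_TOKENS))  # 57
print(tokens_processed_with_cache(PROMPT_LEN, NEW_TOKENS))     # 12
```

The gap grows quickly with longer prompts and longer responses, since the uncached count grows roughly quadratically with sequence length while the cached count grows linearly.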
Nonetheless, KV caching only works within a single prompt and for generating a single response. Prompt Caching extends the principles used in KV caching so that the cache can be reused across different prompts, users, and sessions.
In practice, with prompt caching, we save the repeated parts of a prompt after the first request. These repeated parts usually take the form of large prefixes, such as system prompts, instructions, or retrieved context. In this way, when a new request contains the same prefix, the model reuses the computations made previously instead of recalculating them from scratch. This is incredibly convenient since it can significantly reduce the operating costs of an AI application (repeated inputs containing the same tokens are billed at a reduced rate), as well as reduce latency (we do not have to wait for the model to process tokens that have already been processed). This is especially useful in applications where prompts contain large repeated instructions, such as RAG pipelines.
It is important to understand that this caching operates at the token level. In practice, this means that even if two prompts differ at the end, as long as they share the same token prefix, the cached computations for that shared portion can still be reused, and new calculations are performed only for the tokens that differ. The tricky part here is that the common tokens have to be at the start of the prompt, so the way we structure our prompts and instructions becomes particularly important. In our cooking example, we can imagine the following consecutive prompts.
Prompt 1
What should I cook for dinner?
and then if we input the prompt:
Prompt 2
What should I cook for lunch?
The shared tokens 'What should I cook for' should be a cache hit, and thus one should expect a significantly smaller number of newly processed tokens for Prompt 2.
Nonetheless, if we had the following prompts…
Prompt 1
Dinner time! What should I cook?
and then
Prompt 2
Lunch time! What should I cook?
This would be a cache miss, since the first token of each prompt is different. Because the prompt prefixes differ, we cannot hit the cache, even though the semantics of the two prompts are essentially the same.
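The difference between the two pairs of prompts can be checked with a short sketch that measures how many leading tokens two prompts have in common. Splitting on whitespace is a crude stand-in for a real tokenizer and is purely an assumption for illustration; `toy_tokenize` and `shared_prefix_length` are hypothetical helpers.

```python
def toy_tokenize(text: str) -> list[str]:
    # Crude whitespace "tokenizer"; real LLM tokenizers work on sub-word units.
    return text.split()


def shared_prefix_length(a: list[str], b: list[str]) -> int:
    """Number of leading tokens that are identical in both sequences."""
    count = 0
    for token_a, token_b in zip(a, b):
        if token_a != token_b:
            break
        count += 1
    return count


same_start = shared_prefix_length(
    toy_tokenize("What should I cook for dinner?"),
    toy_tokenize("What should I cook for lunch?"),
)
different_start = shared_prefix_length(
    toy_tokenize("Dinner time! What should I cook?"),
    toy_tokenize("Lunch time! What should I cook?"),
)
print(same_start)       # 5 shared leading tokens: reusable prefix
print(different_start)  # 0 shared leading tokens: nothing to reuse
```

Even though the second pair shares most of its words, the mismatch at the very first token means there is no common prefix to reuse.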
As a result, a basic rule of thumb for getting prompt caching to work is to always place any static information, like instructions or system prompts, at the beginning of the model input. On the flip side, any variable information, such as timestamps or user identifiers, should go at the end of the prompt.
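One way to follow this rule of thumb is to assemble prompts through a small helper that always emits the static block first and the per-request details last. The sketch below is just one possible layout; `STATIC_INSTRUCTIONS` and `build_prompt` are hypothetical names, and the fields included are only examples.

```python
from datetime import datetime, timezone

# Static part: identical for every request, so it can form a cacheable prefix.
STATIC_INSTRUCTIONS = (
    "You are a helpful cooking assistant.\n"
    "Always return a numbered list.\n"
)


def build_prompt(user_id: str, question: str, now: datetime) -> str:
    """Put static instructions first and variable details last."""
    dynamic_part = (
        f"User: {user_id}\n"
        f"Time: {now.isoformat()}\n"
        f"Question: {question}"
    )
    return STATIC_INSTRUCTIONS + dynamic_part


prompt_a = build_prompt("user-1", "What should I cook for dinner?",
                        datetime(2025, 1, 1, 18, 0, tzinfo=timezone.utc))
prompt_b = build_prompt("user-2", "What should I cook for lunch?",
                        datetime(2025, 1, 2, 12, 0, tzinfo=timezone.utc))

# Both prompts start with exactly the same static prefix.
print(prompt_a.startswith(STATIC_INSTRUCTIONS) and prompt_b.startswith(STATIC_INSTRUCTIONS))
```

Had the timestamp or user identifier been placed at the top instead, the two prompts would diverge from the first characters and share no prefix at all.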
Getting our hands dirty with the OpenAI API
Nowadays, most frontier foundation models, like GPT or Claude, provide some kind of Prompt Caching functionality integrated directly into their APIs. More specifically, in the mentioned APIs, the prompt cache is shared among all users of an organization accessing the same API key. In other words, once a user makes a request and its prefix is stored in the cache, any other user who sends a prompt with the same prefix gets a cache hit. That is, we get to reuse previously computed results, which significantly reduces the cost of input tokens and makes response generation faster. This is particularly useful when deploying AI applications in the enterprise, where we expect many users to use the same application, and thus the same input prefixes.
On the most recent models, Prompt Caching is activated automatically by default, but some level of parametrization is available. We can distinguish between:
- In-memory prompt cache retention, where cached prefixes are kept for roughly 5-10 minutes, and up to 1 hour, and
- Extended prompt cache retention (only available for specific models), which allows the cached prefix to be kept for longer, up to a maximum of 24 hours.
But let's take a closer look!
We can see all of this in practice with the following minimal Python example, which makes requests to the OpenAI API, using Prompt Caching and the cooking prompts mentioned earlier. I added a rather large shared prefix to my prompts, so as to make the effects of caching more visible:
from openai import OpenAI
api_key = "your_api_key"
client = OpenAI(api_key=api_key)
prefix = """
You are a helpful cooking assistant.
Your task is to suggest simple, practical dinner ideas for busy people.
Follow these guidelines carefully when generating suggestions:
General cooking rules:
- Meals should take less than 30 minutes to prepare.
- Ingredients should be easy to find in a regular supermarket.
- Recipes should avoid overly complex techniques.
- Prefer balanced meals including vegetables, protein, and carbohydrates.
Formatting rules:
- Always return a numbered list.
- Provide 5 suggestions.
- Each suggestion should include a short explanation.
Ingredient guidelines:
- Prefer seasonal vegetables.
- Avoid exotic ingredients.
- Assume the user has basic pantry staples such as olive oil, salt, pepper, garlic, onions, and pasta.
Cooking philosophy:
- Favor simple home cooking.
- Avoid restaurant-level complexity.
- Focus on meals that people realistically cook on weeknights.
Example meal styles:
- pasta dishes
- rice bowls
- stir fry
- roasted vegetables with protein
- simple soups
- wraps and sandwiches
- sheet pan meals
Diet considerations:
- Default to healthy meals.
- Avoid deep frying.
- Prefer balanced macronutrients.
Additional instructions:
- Keep explanations concise.
- Avoid repeating the same ingredients in every suggestion.
- Provide variety across the meal suggestions.
""" * 80
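# A large repeated prefix, so the prompt clears the roughly 1,000-token minimum needed to activate prompt caching.
prompt1 = prefix + "What should I cook for dinner?"
# First request: the prefix is not cached yet, so this call populates the cache.
response1 = client.responses.create(
    model="gpt-5.2",
    input=prompt1
)
print("\nResponse 1:")
print(response1.output_text)
print("\nUsage stats:")
print(response1.usage)
and then for prompt 2:
prompt2 = prefix + "What should I cook for lunch?"
# Second request: same prefix, so the usage stats should report cached input tokens.
response2 = client.responses.create(
    model="gpt-5.2",
    input=prompt2
)
print("\nResponse 2:")
print(response2.output_text)
print("\nUsage stats:")
print(response2.usage)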

So, for prompt 2, only the remaining, non-identical part of the prompt has to be processed from scratch and billed at the full input rate. That is the input tokens minus the cached tokens: 20,014 – 19,840 = only 174 tokens, or in other words, about 99% fewer tokens. The 19,840 cached tokens are served from the cache and billed at the discounted cached-input rate.
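The arithmetic above can be reproduced with a few lines of Python. The two token counts are the ones reported in this post; the `cached_price_ratio` parameter is a hypothetical knob for the discount applied to cached tokens, and the value 0.1 used below is only an assumption for illustration (actual pricing depends on the model). `cache_savings` is a made-up helper name.

```python
def cache_savings(input_tokens: int, cached_tokens: int, cached_price_ratio: float):
    """Summarize how much of the input was served from the prompt cache.

    cached_price_ratio is the assumed price of a cached token relative to a
    regular input token (hypothetical; check the pricing for your model).
    """
    uncached_tokens = input_tokens - cached_tokens
    cached_share = cached_tokens / input_tokens
    # Input cost expressed in "full-price token equivalents".
    effective_tokens = uncached_tokens + cached_tokens * cached_price_ratio
    return uncached_tokens, cached_share, effective_tokens


uncached, share, effective = cache_savings(
    input_tokens=20_014,     # total input tokens reported for prompt 2
    cached_tokens=19_840,    # cached tokens reported for prompt 2
    cached_price_ratio=0.1,  # assumption for illustration only
)
print(uncached)          # 174
print(f"{share:.1%}")    # 99.1%
print(round(effective))  # full-price token equivalents under the assumed ratio
```

In a real application, the same two numbers would be read from the usage statistics returned with each response rather than typed in by hand.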
In any case, since OpenAI imposes a minimum threshold of 1,024 tokens for activating prompt caching and the cache can be retained for a maximum of 24 hours, it becomes clear that these cost benefits can be obtained in practice only when running AI applications at scale, with many active users making many requests every day. Nonetheless, as explained, in such cases the Prompt Caching feature can provide substantial cost and time benefits for LLM-powered applications.
On my mind
Prompt Caching is a powerful optimization for LLMs that can significantly improve the efficiency of AI applications in terms of both cost and time. By reusing previous computations for identical prompt prefixes, the model can skip redundant calculations and avoid repeatedly processing the same input tokens. The result is faster responses and lower costs, especially in applications where large parts of the prompts (such as system instructions or retrieved context) remain unchanged across many requests. As AI systems scale and the number of LLM calls grows, this optimization becomes increasingly important.
All images by the author, unless stated otherwise.



