Understanding Application Performance with Roofline Modeling

When measuring application performance, real-world performance and theoretical performance can differ. With an ecosystem of products whose performance demands keep growing, such as high-performance computing (HPC), gaming, or, in the current landscape, large language models (LLMs), it is essential to measure application performance accurately.
Simply measuring theoretical GFLOPS (giga floating-point operations per second) is not enough, as applications rarely reach these peak values in the real world. This is where the Roofline model comes in, offering a clear visual method to estimate an application's performance and highlighting the critical role of hardware-specific optimizations.
Why simple metrics aren't enough
When we think about measuring performance, a few metrics come to mind:
- Execution time: This tells you how long a task took but offers no insight into why.
- Cycles per instruction (CPI): This only measures the processor's compute performance.
- Serial vs. parallel execution: This measures compute performance while ignoring any hardware optimizations.
- Floating-point operations per second (FLOP/s): This only represents a theoretical maximum that is often unachievable in a real-world scenario.
While these are good metrics, they generally do not provide enough information. For instance, floating-point operations per second is a theoretical limit that is rarely reached, so using it as the only metric is not enough: it ignores a common performance limiter, data movement.
Roofline Modeling
The Roofline model is a powerful tool that visually maps an application's performance against the capabilities of a specific hardware architecture, such as a CPU or GPU. The model gets its name from the shape of the graph it produces, which shows a "roof" composed of a slanted line and a flat, horizontal line. This shape represents the ultimate performance limits imposed by the hardware.
In this modeling technique, two parameters define the limits achievable on the hardware:
- Data movement: The time it takes to move data, calculated as the total data size divided by the system's peak memory bandwidth.
- Computation: The time required for calculations, determined by dividing the total number of floating-point operations by the system's peak compute performance (commonly measured in GFLOP/s).
The total execution time of an application is determined by the greater of these two values: max{data_movement, computation}.
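The max{data_movement, computation} bound can be sketched in a few lines of Python. The hardware figures (1 TB/s of memory bandwidth, 10 TFLOP/s of peak compute) and the workload sizes are made-up values chosen only for illustration, and the function name is hypothetical.

```python
def roofline_time_bound(total_bytes, total_flops, peak_bandwidth, peak_flops):
    """Return (lower bound on runtime in seconds, name of the limiting factor)."""
    data_movement = total_bytes / peak_bandwidth  # seconds spent moving data
    computation = total_flops / peak_flops        # seconds spent computing
    if data_movement >= computation:
        return data_movement, "data movement"
    return computation, "computation"


# Assumed hardware: 1e12 bytes/s memory bandwidth, 1e13 FLOP/s peak compute.
# Assumed workload: 4e9 bytes moved, 8e9 floating-point operations.
bound_seconds, limiter = roofline_time_bound(4e9, 8e9, 1e12, 1e13)
print(f"{bound_seconds * 1e3:.1f} ms, limited by {limiter}")
```

With these assumed numbers the data movement term dominates, so faster arithmetic alone would not shorten the run.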
Even on hardware with better compute performance, data movement is often the bottleneck. Roofline modeling introduces the concept of arithmetic intensity (AI). AI is the ratio of floating-point operations performed to bytes of data moved from memory.
- An algorithm with high arithmetic intensity is considered compute-hungry (compute-bound). Its performance is limited by how quickly calculations can be performed.
- An algorithm with low arithmetic intensity is considered data-hungry (memory-bound). Its performance is limited by how quickly data can be moved.
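One way to make the high/low distinction concrete is to compare an algorithm's AI with the hardware's "ridge point", the AI at which the bandwidth limit and the compute limit meet. The sketch below uses that standard Roofline idea; the peak values are assumed for illustration and the function names are hypothetical.

```python
PEAK_FLOPS = 1e13      # assumed peak compute, FLOP/s
PEAK_BANDWIDTH = 1e12  # assumed peak memory bandwidth, bytes/s


def ridge_point(peak_flops, peak_bandwidth):
    """AI (FLOP/byte) above which the compute roof becomes the limit."""
    return peak_flops / peak_bandwidth


def attainable_flops(ai, peak_flops, peak_bandwidth):
    """Upper bound on FLOP/s for a kernel with arithmetic intensity `ai`."""
    return min(peak_flops, ai * peak_bandwidth)


def classify(ai, peak_flops, peak_bandwidth):
    """Label a kernel by which roof limits it."""
    if ai >= ridge_point(peak_flops, peak_bandwidth):
        return "compute-bound"
    return "memory-bound"


for ai in (2.0, 50.0):
    label = classify(ai, PEAK_FLOPS, PEAK_BANDWIDTH)
    bound = attainable_flops(ai, PEAK_FLOPS, PEAK_BANDWIDTH)
    print(f"AI={ai}: {label}, at most {bound:.2e} FLOP/s")
```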
Understanding the graph
[Figure: Roofline graph. License: Creative Commons Attribution-ShareAlike 4.0 International]
A Roofline graph plots attainable FLOP/s (y-axis) against arithmetic intensity (x-axis). The "roof" itself shows the hardware's limitations. The slanted part of the roof represents the peak data bandwidth (in GB/s), while the flat part represents the peak compute performance. Note that everything in the figure is on a logarithmic scale.
- Points below the roof: Indicate suboptimal performance, showing that there is room for improvement.
- Points hitting the slanted line: A data-hungry application. Its performance is limited by data bandwidth.
- Points hitting the flat line: A compute-hungry application. It is using the full computational power of the processor.
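To see how a measured point relates to the roof, the following sketch samples the roof at logarithmically spaced intensities and reports what fraction of the roof a point reaches. All numbers are assumed for illustration, and the helper names are hypothetical.

```python
def roof(ai, peak_flops, peak_bandwidth):
    """Height of the roof (FLOP/s) at arithmetic intensity `ai`."""
    return min(peak_flops, ai * peak_bandwidth)


def roof_samples(peak_flops, peak_bandwidth, exponents=range(-2, 7)):
    """Sample the roof at AI = 2**k, evenly spaced on a logarithmic x-axis."""
    return [(2.0 ** k, roof(2.0 ** k, peak_flops, peak_bandwidth)) for k in exponents]


def fraction_of_roof(ai, achieved_flops, peak_flops, peak_bandwidth):
    """1.0 means the point sits on the roof; lower values mean headroom."""
    return achieved_flops / roof(ai, peak_flops, peak_bandwidth)


# Assumed hardware: 1e13 FLOP/s peak compute, 1e12 bytes/s peak bandwidth.
samples = roof_samples(1e13, 1e12)
# Assumed measurement: AI of 4 FLOP/byte, 1e12 FLOP/s achieved.
fraction = fraction_of_roof(4.0, 1e12, 1e13, 1e12)
print(f"point reaches {fraction:.0%} of the roof")
```

The `samples` list can be handed to any plotting library with both axes set to a logarithmic scale.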
Why is Roofline modeling important?
Roofline modeling provides a visual, intuitive way to understand application performance, showing key characteristics such as operational intensity, GPU capabilities, and attainable FLOP/s. This kind of modeling helps programmers make targeted optimizations to their application for the hardware on which better results can be obtained.
- Bottleneck analysis: A visual aid makes it easy for the developer to determine whether the bottleneck is memory or compute. If the application is memory-bound, the developer can focus on improving data locality with techniques such as caching or loop tiling. If it is compute-bound, the focus can shift to enabling more parallel computation or taking advantage of compiler optimizations.
- Hardware and software design: Software engineers should not fear the underlying hardware. Instead, the hardware design should be embraced and optimized for. Software engineers can use insights from Roofline modeling to embrace and exploit the specific architecture they are using.
Roofline Modeling in Practice
To perform Roofline modeling, we need to profile the application to understand its performance. From profiling, we can obtain metrics such as floating-point operations (FLOPs) and memory bandwidth usage, both of which are required for a Roofline model. This article explores two such tools: NVIDIA's ncu, the Nsight Compute CLI for GPU analysis, and PyTorch's profiler, specifically for applications that use PyTorch.
For detailed CUDA kernel optimization and precise FLOP/byte calculations, ncu provides direct GPU hardware counter information. In contrast, torch.profiler.profile offers a higher-level view within PyTorch, helping to understand operator-level performance, tensor memory usage, and overall application behavior across both CPU and GPU activities.
Profiling with ncu
ncu is the command-line interface used for profiling CUDA kernels [2]. It can display results directly in the terminal or save them to a log file for later analysis. To build a Roofline model, we need to capture specific metrics that allow us to calculate arithmetic intensity.
We will use the PyTorch ImageNet repository [3] as our example. It is a good choice because it is easy to understand, well documented by PyTorch, and works with their profiler, so we can dig into the performance.
Step 1: Run the ncu command to collect metrics
The first step is to run the application through ncu to collect the necessary hardware-level data. The command looks like this:
ncu --log-file <log_file> \
    --metrics <metrics> \
    --target-processes all \
    python3 <application>
- log-file: The log file in which we want to store the results.
- metrics: This is the most important parameter and specifies the metrics we want to capture. For calculating arithmetic intensity, we consider:
  - dram__sectors_write.sum: the number of DRAM sectors written
  - dram__sectors_read.sum: the number of DRAM sectors read
  - smsp__sass_thread_inst_executed_op_fadd_pred_on.sum: the sum of floating-point additions
  - smsp__sass_thread_inst_executed_op_fmul_pred_on.sum: the sum of floating-point multiplications
  - smsp__sass_thread_inst_executed_op_ffma_pred_on.sum: the sum of floating-point fused multiply-add operations
- target-processes: The all flag ensures that we profile the entire application.
Our ncu command becomes:
ncu --log-file logs_example \
    --metrics dram__sectors_write.sum,dram__sectors_read.sum,smsp__sass_thread_inst_executed_op_fadd_pred_on.sum,smsp__sass_thread_inst_executed_op_fmul_pred_on.sum,smsp__sass_thread_inst_executed_op_ffma_pred_on.sum \
    --target-processes all \
    python3 main.py /imagenet --arch resnet50 --epochs 1 --batch-size 10 \
    --print-freq 10 --seed 42
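Because the metric names are long and must be joined by commas without spaces, it can help to assemble the command programmatically. The sketch below only builds and prints the command line using the flags shown above; the function name is hypothetical, and nothing is executed.

```python
import shlex

# The five metrics needed for arithmetic intensity, as listed above.
METRICS = [
    "dram__sectors_write.sum",
    "dram__sectors_read.sum",
    "smsp__sass_thread_inst_executed_op_fadd_pred_on.sum",
    "smsp__sass_thread_inst_executed_op_fmul_pred_on.sum",
    "smsp__sass_thread_inst_executed_op_ffma_pred_on.sum",
]


def build_ncu_command(log_file, app_args):
    """Return the ncu invocation as an argument list (nothing is run here)."""
    return [
        "ncu",
        "--log-file", log_file,
        "--metrics", ",".join(METRICS),  # comma-separated, no spaces
        "--target-processes", "all",
        "python3", *app_args,
    ]


cmd = build_ncu_command(
    "logs_example",
    ["main.py", "/imagenet", "--arch", "resnet50", "--epochs", "1",
     "--batch-size", "10", "--print-freq", "10", "--seed", "42"],
)
print(shlex.join(cmd))  # copy-pasteable shell command
```

The resulting list could be passed to `subprocess.run` on a machine where ncu is installed.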
Step 2: Calculate FLOPs from the metrics
Once the profiler has run, we can aggregate the collected metrics to calculate the total number of floating-point operations. The formula is:
\[ \text{FLOPs} = 2 \times \text{FMA\_count} + \text{FADD\_count} + \text{FMUL\_count} \]
- FLOPs: The total count of floating-point operations.
- FMA_count: Fused multiply-add (FMA) operations are typically counted as 2 FLOPs (one multiplication and one addition). This is represented by the smsp__sass_thread_inst_executed_op_ffma_pred_on.sum metric.
- FADD_count: This is represented by the smsp__sass_thread_inst_executed_op_fadd_pred_on.sum metric.
- FMUL_count: This is represented by the smsp__sass_thread_inst_executed_op_fmul_pred_on.sum metric.
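As a small worked example of the formula, the sketch below applies it to a dictionary of counter values. The counter values are invented for illustration; in practice they would be read from the ncu log.

```python
def total_flops(ffma_count, fadd_count, fmul_count):
    """FLOPs = 2 * FMA_count + FADD_count + FMUL_count."""
    return 2 * ffma_count + fadd_count + fmul_count


# Made-up counter values keyed by the ncu metric names.
counters = {
    "smsp__sass_thread_inst_executed_op_ffma_pred_on.sum": 1_000_000,
    "smsp__sass_thread_inst_executed_op_fadd_pred_on.sum": 250_000,
    "smsp__sass_thread_inst_executed_op_fmul_pred_on.sum": 150_000,
}

flops = total_flops(
    counters["smsp__sass_thread_inst_executed_op_ffma_pred_on.sum"],
    counters["smsp__sass_thread_inst_executed_op_fadd_pred_on.sum"],
    counters["smsp__sass_thread_inst_executed_op_fmul_pred_on.sum"],
)
print(flops)
```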
Step 3: Calculate the bytes transferred
Next, we calculate the total data transferred to and from DRAM. The ncu metrics provide the number of DRAM sectors read and written. Assuming the common sector size of 32 bytes for modern GPUs:
\[ \text{Total\_DRAM\_bytes} = (\text{dram\_\_sectors\_read.sum} + \text{dram\_\_sectors\_write.sum}) \times 32 \]
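The same conversion in Python, as a minimal sketch. The sector counts are invented, and the 32-byte sector size is the assumption stated above rather than a value queried from the device.

```python
SECTOR_BYTES = 32  # assumed DRAM sector size, as stated in the text


def total_dram_bytes(sectors_read, sectors_written, sector_bytes=SECTOR_BYTES):
    """Convert DRAM sector counts into bytes transferred."""
    return (sectors_read + sectors_written) * sector_bytes


# Made-up values standing in for dram__sectors_read.sum and dram__sectors_write.sum.
dram_bytes = total_dram_bytes(sectors_read=3_000_000, sectors_written=1_000_000)
print(dram_bytes)
```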
Step 4: Calculate the arithmetic intensity
With the total FLOPs and the total bytes, we can now calculate the arithmetic intensity:
\[ \text{AI} = \frac{\text{FLOPs}}{\text{Total\_DRAM\_bytes}} \]
Step 5: Calculate the execution time
To find the application's performance in FLOP/s, we also need the execution time. For this, we can use NVIDIA Nsight Systems (nsys), a system-wide profiler that can accurately measure the runtime of application segments. We run our application again, this time with nsys, to generate a time-based report. From this report, we can extract the total GPU running time.
nsys profile -f true -o <output_file> python3 <application>
Our nsys command becomes:
nsys profile -f true -o time.qdrep python3 main.py /imagenet \
    --arch resnet50 --epochs 1 --batch-size 10 --print-freq 10 \
    --seed 42
After running this command, we can obtain GPU_RUNNING_TIME.
Step 6: Calculate the application performance
Finally, we calculate the achieved performance in FLOP/s by dividing the total FLOPs by the execution time:
\[ \text{FLOP/s} = \frac{\text{FLOPs}}{\text{GPU\_RUNNING\_TIME}} \]
This value gives us the achieved FLOP/s, which we can plot on our Roofline graph.
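Steps 4 and 6 together give the coordinates of one point on the graph: arithmetic intensity on the x-axis and achieved FLOP/s on the y-axis. The sketch below computes that point from assumed inputs. The numbers are invented and do not come from an actual ResNet-50 run.

```python
def roofline_point(flops, dram_bytes, gpu_running_time_s):
    """Return (arithmetic intensity in FLOP/byte, achieved FLOP/s)."""
    arithmetic_intensity = flops / dram_bytes
    achieved_flops_per_s = flops / gpu_running_time_s
    return arithmetic_intensity, achieved_flops_per_s


# Assumed measurements: 2.4e12 FLOPs, 1.2e11 DRAM bytes, 2 s of GPU time.
ai, achieved = roofline_point(
    flops=2_400_000_000_000,
    dram_bytes=120_000_000_000,
    gpu_running_time_s=2.0,
)
print(f"x = {ai:.1f} FLOP/byte, y = {achieved:.2e} FLOP/s")
```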
Profiling with torch
For applications written in PyTorch, the built-in torch.profiler.profile offers a user-friendly way to gather performance data. There are two options available to developers:
- Use the profiler context manager
- Targeted profiling of specific neural network layers
Profiler context manager
The part of the code that we want to profile can be wrapped in the torch.profiler.profile() context manager. In the with statement, you can define the activities to trace (CPU, CUDA, or both), set a schedule to profile specific training steps, and choose whether to record tensor shapes, memory usage, or FLOPs. Once inside the context, you must call prof.step() at the end of each iteration to signal the profiler to advance, especially when a schedule is used.
with profile(
    activities=<activities>,
    schedule=torch.profiler.schedule(<schedule arguments>),
    record_shapes=<True|False>,
    profile_memory=<True|False>,
    with_flops=<True|False>
) as prof:
    ...
    prof.step()
- activities: Specifies what to profile: CPU, CUDA, or both.
- schedule: Useful for profiling multiple steps in the training loop. If the schedule parameter is used, the code needs to call prof.step() to move to the next step.
- record_shapes: Whether to record the shapes of the tensors.
- profile_memory: Whether to capture memory usage.
- with_flops: This is experimental but is used to estimate FLOPs for operators.
Our profiler command becomes:
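import torch
from torch.profiler import profile, ProfilerActivity

with profile(
    activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA],
    schedule=torch.profiler.schedule(wait=1, warmup=1, active=3, repeat=2),
    record_shapes=True,
    profile_memory=True,
    with_flops=True
) as prof:
    ...          # one training iteration goes here
    prof.step()  # advance the profiler schedule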
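To make the effect of schedule(wait=1, warmup=1, active=3, repeat=2) easier to picture, here is a pure-Python simulation of which phase each training step falls into. It reflects my understanding of the documented schedule semantics (with no skipped initial steps) and is not PyTorch's own implementation; the function and phase names are hypothetical.

```python
def schedule_phase(step, wait, warmup, active, repeat):
    """Return the phase of a zero-based training step under a cyclic schedule."""
    cycle = wait + warmup + active
    if repeat > 0 and step >= cycle * repeat:
        return "done"      # all cycles finished, nothing more is recorded
    position = step % cycle
    if position < wait:
        return "wait"      # profiler idle
    if position < wait + warmup:
        return "warmup"    # tracing starts, results discarded
    return "active"        # step is recorded


phases = [schedule_phase(s, wait=1, warmup=1, active=3, repeat=2) for s in range(11)]
for step, phase in enumerate(phases):
    print(step, phase)
```

Under these settings, steps 2 to 4 and 7 to 9 are the recorded ones, so a run needs at least ten iterations to capture both cycles.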
Targeted profiling of specific neural network layers
The profiler can also be used in a targeted manner to analyze specific layers of a neural network. This helps to check whether a particular layer contributes more to the runtime than other layers, giving the developer the option of modifying specific layers. While this approach is very easy to use, in most cases the first option works better. The PyTorch profiler results can also be exported and visualized in TensorBoard.
profiler.start()
self.conv2(x)
profiler.stop()
LLMs and Roofline Modeling
Coming to the topic everyone has been waiting for: does Roofline modeling help with measuring LLM performance? The short answer is yes.
LLMs are complex neural network architectures with billions of parameters, and they process massive amounts of data. While training is a very resource-intensive task, inference and fine-tuning of the model also need to be efficient.
- Bottlenecks: During inference, LLMs can suffer from bottlenecks due to the large number of parameters they work with. These parameters are the model's weights, and they cause memory bandwidth issues. Using Roofline modeling, the exact layers can be profiled for bottlenecks.
- Hardware selection: As most organizations fine-tune existing models rather than training them from scratch, choosing the right infrastructure is crucial for managing costs. This underscores the importance of choosing optimal infrastructure for training. For example, choosing hardware according to your LLM architecture, or optimizing your model to run on a specific architecture, can cut training and inference costs.
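A rough back-of-the-envelope sketch shows why weights create bandwidth pressure during inference. It estimates the arithmetic intensity of a single linear layer under simplifying assumptions that are mine, not the article's: 16-bit values, every weight and activation moved through DRAM exactly once, and no caching effects. The layer size is arbitrary and the function name is hypothetical.

```python
def linear_layer_ai(batch_tokens, d_in, d_out, bytes_per_value=2):
    """Estimated FLOP/byte for y = x @ W with x of shape (batch_tokens, d_in)."""
    flops = 2 * batch_tokens * d_in * d_out  # one multiply and one add per weight per token
    weight_bytes = d_in * d_out * bytes_per_value
    activation_bytes = batch_tokens * (d_in + d_out) * bytes_per_value  # inputs + outputs
    return flops / (weight_bytes + activation_bytes)


# Assumed 4096 x 4096 layer, processed for 1 token and for 64 tokens at once.
ai_single = linear_layer_ai(1, 4096, 4096)
ai_batched = linear_layer_ai(64, 4096, 4096)
print(f"1 token: {ai_single:.2f} FLOP/byte, 64 tokens: {ai_batched:.2f} FLOP/byte")
```

Under these assumptions a single-token pass performs roughly one FLOP per byte, which places it far to the left on a Roofline graph, while batching more tokens reuses each weight and moves the layer to the right.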
Conclusion
The Roofline model offers a powerful visual analysis of application performance optimization. By visualizing application performance across memory and compute, it gives clear guidance on choosing the best way to approach optimization. While this article only considered naive Roofline models, there are more advanced techniques, such as hierarchical Roofline models or adding ceilings for specific compute optimizations.
References
[1]
[2]
[3]
[4]



