Scaling ML Inference on Databricks: Liquid or Partitioned? Salted or Not?

Introduction

This case study is about predicting a continuous variable for four different products. The machine learning pipeline is built on Databricks and has two major components.

  1. Feature preparation in SQL on serverless compute.
  2. Inference on an ensemble of several hundred models, using job clusters to keep control over compute power.

In our first attempt, a 420-core cluster spent nearly 10 hours processing just 18 partitions.

The goal is to tune the data flow to maximize cluster utilization and ensure scalability. Inference is done on four sets of ML models, one set per product. However, we will focus on how the data is stored, because that determines how much parallelism we can leverage for inference. We will not focus on the inner workings of the inference itself.

If there are too few file partitions, the cluster takes a long time scanning large files. At that point, unless the data is repartitioned (which means additional network latency and data shuffling), each task may also end up running inference on a very large set of rows. Both effects lead to long run times.

Figure 1. Don't be afraid to add a little salt to your data if you need to. Photo by Faran Raufi on Unsplash

However, the business has limited patience for shipping ML pipelines that have a direct impact on the org, so the tests are limited.

In this article, we review our feature data landscape, give an overview of the ML inference, and then present results and a discussion of inference performance for four dataset treatment scenarios:

  1. Partitioned table, no salt, no row limit per partition (non-salted and partitioned)
  2. Partitioned table, salted, with a 1M row limit (salted and partitioned)
  3. Liquid clustered table, no salt, no row limit per partition (non-salted and liquid)
  4. Liquid clustered table, salted, with a 1M row limit (salted and liquid)

Data Landscape

The dataset contains the features that the set of ML models uses for inference. It has ~550M rows and covers four products, identified by the attribute ProductLine:

  • Product A: ~10.45M (1.9%)
  • Product B: ~4.4M (0.8%)
  • Product C: ~100M (17.6%)
  • Product D: ~354M (79.7%)

There is also a low-cardinality attribute, AttrB, which has only two distinct values and is used as a filter to extract subsets of the dataset for each part of the ML system.

In addition, RunDate records the date on which the features were generated. The features are append-only. Finally, the dataset is read using the following query:

SELECT
  Id,
  ProductLine,
  AttrB,
  AttrC,
  RunDate,
  {model_features}
FROM
  catalog.schema.FeatureStore
WHERE
  ProductLine = :product AND
  AttrB = :attributeB AND
  RunDate = :RunDate
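
As a rough illustration of how this template could be filled in for one inference slice, here is a minimal sketch in plain Python. The helper name `build_feature_query`, the feature column names, and the parameter values are all hypothetical; only the query text comes from the article.

```python
# Query text as shown above; {model_features} is filled in per model set.
QUERY_TEMPLATE = """
SELECT
  Id,
  ProductLine,
  AttrB,
  AttrC,
  RunDate,
  {model_features}
FROM
  catalog.schema.FeatureStore
WHERE
  ProductLine = :product AND
  AttrB = :attributeB AND
  RunDate = :RunDate
"""


def build_feature_query(model_features, product, attribute_b, run_date):
    """Return (sql_text, named_args) for one (product, AttrB, RunDate) slice.

    Hypothetical helper, not part of the original pipeline.
    """
    if not model_features:
        raise ValueError("at least one feature column is required")
    # Column names are interpolated into the SQL text, so validate them first.
    for name in model_features:
        if not name.isidentifier():
            raise ValueError(f"unexpected column name: {name!r}")
    sql_text = QUERY_TEMPLATE.format(model_features=",\n  ".join(model_features))
    named_args = {"product": product, "attributeB": attribute_b, "RunDate": run_date}
    return sql_text, named_args


# Hypothetical feature names and parameter values, for illustration only.
sql_text, named_args = build_feature_query(["feat_1", "feat_2"], "D", "x", "2024-01-01")
print(sql_text)

# On a cluster with PySpark 3.4 or later, the named markers could be bound with:
# df = spark.sql(sql_text, args=named_args)
```

Keeping the filter values as named parameters, rather than formatting them into the string, means only the feature column list is interpolated into the SQL text.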

Implementing Salting

The salt here is generated dynamically. Its purpose is to distribute the data according to volume: larger products receive more buckets and smaller products receive fewer. For example, Product D should receive roughly 80% of the buckets, given the data proportions.

We do this to get predictable run times and to maximize cluster utilization.

from pyspark.sql import functions as F

# Calculate percentage of each (ProductLine, AttrB) based on row counts
brand_cat_counts = df_demand_price_grid_load.groupBy(
   "ProductLine", "AttrB"
).count()
total_count = df_demand_price_grid_load.count()
brand_cat_percents = brand_cat_counts.withColumn(
   "percent", F.col("count") / F.lit(total_count)
)

# Collect the per-group rows once and reuse them for both lookups
brand_cat_rows = brand_cat_percents.collect()

# Percentages as a dict with string keys (this later determines
# the number of salt buckets each product receives)
brand_cat_percent_dict = {
   f"{row['ProductLine']}|{row['AttrB']}": row['percent']
   for row in brand_cat_rows
}

# Counts as a dict with string keys (this helps to add an
# additional bucket if the count is not divisible by the number of
# buckets for the product)
brand_cat_count_dict = {
   f"{row['ProductLine']}|{row['AttrB']}": row['count']
   for row in brand_cat_rows
}

# Helper to flatten key-value pairs for create_map
def dict_to_map_expr(d):
   expr = []
   for k, v in d.items():
       expr.append(F.lit(k))
       expr.append(F.lit(v))
   return expr

percent_case = F.create_map(*dict_to_map_expr(brand_cat_percent_dict))
count_case = F.create_map(*dict_to_map_expr(brand_cat_count_dict))

# Add string key column in pyspark
df_demand_price_grid_load = df_demand_price_grid_load.withColumn(
   "product_cat_key",
   F.concat_ws("|", F.col("ProductLine"), F.col("AttrB"))
)

df_demand_price_grid_load = df_demand_price_grid_load.withColumn(
   "percent", percent_case.getItem(F.col("product_cat_key"))
).withColumn(
   "product_count", count_case.getItem(F.col("product_cat_key"))
)

# Set min/max buckets
min_buckets = 10
max_buckets = 1160

# Calculate buckets per row based on (ProductLine, AttrB) percentage
df_demand_price_grid_load = df_demand_price_grid_load.withColumn(
   "buckets_base",
   (F.lit(min_buckets) + (F.col("percent") * (max_buckets - min_buckets))).cast("int")
)

# Add an extra bucket if product_count is not divisible by buckets_base
df_demand_price_grid_load = df_demand_price_grid_load.withColumn(
   "buckets",
   F.when(
       (F.col("product_count") % F.col("buckets_base")) != 0,
       F.col("buckets_base") + 1
   ).otherwise(F.col("buckets_base"))
)

# Generate salt per row based on (ProductLine, AttrB) bucket count
df_demand_price_grid_load = df_demand_price_grid_load.withColumn(
   "salt",
   (F.rand(seed=42) * F.col("buckets")).cast("int")
)

# Perform the repartition using the core attributes and the salt column
df_demand_price_grid_load = df_demand_price_grid_load.repartition(
   1200, "AttrB", "ProductLine", "salt"
).drop("product_cat_key", "percent", "product_count", "buckets_base", "buckets", "salt")
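
To make the bucket arithmetic concrete, the following plain-Python sketch reproduces the same formula outside Spark. It is only an approximation of the real run: it works at product level using the rounded shares and row counts from the Data Landscape section, whereas the code above computes shares per (ProductLine, AttrB) pair. The helper name `salt_buckets` is hypothetical.

```python
MIN_BUCKETS = 10
MAX_BUCKETS = 1160


def salt_buckets(percent, row_count, min_buckets=MIN_BUCKETS, max_buckets=MAX_BUCKETS):
    """Mirror the bucket formula: linear in the group's share of rows,
    plus one extra bucket when the row count is not divisible by the base."""
    base = int(min_buckets + percent * (max_buckets - min_buckets))
    return base + 1 if row_count % base != 0 else base


# Approximate product-level figures taken from the Data Landscape section.
products = {
    "A": (0.019, 10_450_000),
    "B": (0.008, 4_400_000),
    "C": (0.176, 100_000_000),
    "D": (0.797, 354_000_000),
}

buckets = {name: salt_buckets(pct, rows) for name, (pct, rows) in products.items()}
total_buckets = sum(buckets.values())

for name, n in buckets.items():
    print(f"Product {name}: {n} buckets ({n / total_buckets:.1%} of all buckets)")
```

Because every group starts from `min_buckets`, the smallest products end up with somewhat more than their proportional share of buckets, and the largest product with slightly less.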

Finally, we save our dataset to the feature table and set a maximum number of rows per partition. This prevents Spark from producing partitions with too many rows, which it can still do even after we have computed the salt.

Why 1M rows? The main focus is model inference time, not file size. After a few tests with 1M, 1.5M, and 2M, the first gave the best performance for us. Again, budget and time are tightly constrained on this project, so we have to make the most of our resources.

# Parentheses allow the method chain to span several lines
(
   df_demand_price_grid_load.write
   .mode("overwrite")
   .option("replaceWhere", f"RunDate = '{params['RunDate']}'")
   .option("maxRecordsPerFile", 1_000_000)  # cap the number of rows per written file
   .partitionBy("RunDate", "AttrB", "ProductLine")  # column names as used in the salting code
   .saveAsTable(f"{params['catalog_revauto']}.{params['schema_revenueautomation']}.demand_features_price_grid")
)
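
The effect of the per-file record cap can be sketched without Spark. The helper below is hypothetical and only models the arithmetic: a write task that holds more rows than the cap for one output directory rolls over to a new file each time the cap is reached. The 2.3M figure is an invented example, not a measurement.

```python
def files_written_by_task(rows_in_task, max_records_per_file=1_000_000):
    """Return the row counts of the files one write task would produce
    for a single output directory when a per-file record cap applies."""
    if rows_in_task < 0 or max_records_per_file <= 0:
        raise ValueError("rows must be >= 0 and the cap must be > 0")
    full_files, remainder = divmod(rows_in_task, max_records_per_file)
    sizes = [max_records_per_file] * full_files
    if remainder:
        sizes.append(remainder)
    return sizes


# Hypothetical task holding 2.3M rows for one output directory.
sizes = files_written_by_task(2_300_000)
print(len(sizes), sizes)
```

Since the cap only bounds file size from above, the salt is still what spreads rows evenly across write tasks in the first place.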

Why not rely on Spark's Adaptive Query Execution (AQE)?

Recall that the main focus is inference time, not metrics tuned for standard Spark SQL queries, such as file size. Using AQE alone was our first attempt. As you will see in the results, run times were far from desirable and cluster utilization was not maximized, given our data proportions.

Machine Learning Inference

There is a pipeline with 4 tasks, one per product. Every task performs the following general steps:

  • Loads the features for the corresponding product
  • Loads the subset of ML models for the corresponding product
  • Performs inference on one half of the subset, sliced by AttrB
  • Performs inference on the other half, sliced by AttrB
  • Saves the data to the results table
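
The steps above can be sketched as a small driver function. This is an illustrative outline in plain Python, not the pipeline's actual code: `run_product_task`, the injected callables, the toy models, and the simple averaging of model outputs are all assumptions made for the example.

```python
def run_product_task(product, attr_b_values, load_features, load_models, predict, save):
    """Hypothetical outline of one per-product task: load the models once,
    then run one inference stage per AttrB value and save each result."""
    models = load_models(product)
    rows_saved = 0
    for attr_b in attr_b_values:  # two AttrB values, so two inference stages
        features = load_features(product, attr_b)
        predictions = [predict(models, row) for row in features]
        rows_saved += save(product, attr_b, predictions)
    return rows_saved


# Toy stand-ins so the outline can be run end to end.
feature_rows = {("D", "x"): [1.0, 2.0, 3.0], ("D", "y"): [4.0, 5.0]}
results = []


def save_to_list(product, attr_b, predictions):
    results.extend((product, attr_b, p) for p in predictions)
    return len(predictions)


rows_saved = run_product_task(
    "D",
    ["x", "y"],
    load_features=lambda product, attr_b: feature_rows[(product, attr_b)],
    load_models=lambda product: [lambda v: v * 2, lambda v: v + 1],
    # Assumption: the ensemble output is the plain average of the model outputs.
    predict=lambda models, row: sum(m(row) for m in models) / len(models),
    save=save_to_list,
)
print(rows_saved, results[0])
```

In the real pipeline each of these steps runs on Spark, so the number of rows each task receives is decided by the file layout rather than by a Python loop.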

We will focus on one of the inference stages so as not to overload this article with numbers; the other stage is very similar in structure and results. You can see the DAG of the inference stage under evaluation in Fig. 2.

Figure 2. DAG of the ML inference Spark stage. Own authorship.

It looks straightforward, but run times can vary depending on how your data is stored and on the size of your cluster.

Cluster configuration

For the inference stage we are analyzing, there is one cluster per product, tuned to the project's infrastructure limits and to the data distribution:

  • Product A: 35 workers (Standard_DS14v2, 420 cores)
  • Product B: 5 workers (Standard_DS14v2, 70 cores)
  • Product C: 1 worker (Standard_DS14v2, 14 cores)
  • Product D: 1 worker (Standard_DS14v2, 14 cores)

In addition, Adaptive Query Execution (AQE) is enabled by default, which lets Spark decide how best to lay out the data given the context you provide.

Results and discussion

For each scenario you will see a visualization of the number of file partitions per product and the average number of rows per partition, to indicate how many rows the ML pipeline runs inference on per Spark task. We also present Spark UI metrics to examine run-time performance and the distribution of data during inference. We cover the Spark UI part only for Product D, the largest product, to avoid including too much information. In addition, depending on the scenario, inference on Product D becomes the run-time bottleneck, which is another reason it is the main focus of the results.

Non-salted and Partitioned

You can see in Fig. 3 that the average file partition has tens of millions of rows, which means considerable run time for a single executor. The largest on average is Product C, with more than 45M rows in a single partition. The smallest is Product B, with roughly 12M rows on average.

Figure 3. Average row count per partition by product.

Fig. 4 shows the number of partitions per product, with a total of 26 across all products. Looking at Product D, its 18 partitions fall far short of the 420 cores we have available, and on average each partition runs inference on ~40M rows.
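
A quick back-of-the-envelope check makes the gap concrete. The sketch below assumes one Spark task per file partition and one core per task, which are simplifying assumptions rather than measured values; `stage_parallelism` is a hypothetical helper.

```python
import math


def stage_parallelism(num_partitions, total_cores):
    """Estimate how many cores a stage can keep busy, assuming one task
    per file partition and one core per task (simplifying assumptions)."""
    busy_cores = min(num_partitions, total_cores)
    return {
        "busy_cores": busy_cores,
        "utilization": busy_cores / total_cores,
        # Number of scheduling rounds needed to run every task once.
        "waves": math.ceil(num_partitions / total_cores),
    }


# Figures quoted above for Product D: 18 partitions on a 420-core cluster.
non_salted = stage_parallelism(18, 420)
print(f"busy cores: {non_salted['busy_cores']}, "
      f"utilization: {non_salted['utilization']:.1%}")
```

Under these assumptions more than 95% of the cores have no task to run, however long the 18 running tasks take.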

Figure 4. Total number of file partitions per product

Take a look at Fig. 5. In total, the cluster ran for 9.9 hours and still had not finished; we had to kill the job because it was getting expensive and was blocking other people's tests.

Figure 5. Summary of the inference stage for the partitioned, non-salted dataset for Product D.

From the summary statistics in Fig. 6 for the tasks that did finish, we can see heavy skew across the partitions for Product D. The maximum input size was ~56M and the maximum run time was 7.8h.

Figure 6. Summary statistics for executor inference on the partitioned, non-salted dataset.

Non-salted and Liquid

In this scenario, we observe very similar results in terms of the average number of rows per file partition and the number of partitions per product, as seen in Fig. 7 and Fig. 8, respectively.

Figure 7. Average row count per partition by product

Product D has 19 file partitions, still far short of the 420 cores.

Figure 8. Total number of file partitions per product

We could already anticipate that this test would be very expensive, so I decided to skip the inference test for this scenario. In an ideal situation we would run it anyway, but there is a backlog of tickets on my board.

Salted and Partitioned

After applying the salting and partitioning process, we end up with an average of 2.5M records per partition for products A and B, and ~1M for products C and D, as shown in Fig. 9.

Figure 9. Average row count per partition by product

In addition, we can see in Fig. 10 that the number of file partitions increased to about 860 for Product D, which gives about 430 for each inference stage.

Figure 10. Total number of file partitions per product

This results in a run time of 3h for inference on Product D with 360 tasks, as seen in Fig. 11.

Figure 11. Summary of the inference stage for the partitioned, salted dataset

Looking at the summary statistics in Fig. 12, the distribution looks balanced, with run times around 1.7h, but the longest task takes 3h, which should be investigated further in the future.

Figure 12. Summary statistics for executor inference on the partitioned, salted dataset.

Another major benefit is that the salt distributes the data according to the product proportions. If we had more resources available, we could increase the number of shuffle partitions in repartition() and add workers in line with the data proportions. This ensures that our process scales predictably.

Salted and Liquid

This scenario combines the two strongest levers we have explored so far: salting, to control file size and parallelism, and liquid clustering, to keep related data colocated without rigid partition boundaries.
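
For readers who have not used liquid clustering, here is a minimal sketch of the Databricks SQL statements involved, built as plain strings in Python. The article does not say which clustering columns were chosen, so using the three filter columns of the read query is an assumption, as is the helper name `liquid_clustering_statements`. `ALTER TABLE ... CLUSTER BY` applies to an unpartitioned Delta table.

```python
def liquid_clustering_statements(table_name, clustering_columns):
    """Build the SQL statements that enable liquid clustering on an existing,
    unpartitioned Delta table and then trigger clustering of its data."""
    # Databricks documents a limit of four clustering columns.
    if not 1 <= len(clustering_columns) <= 4:
        raise ValueError("expected between one and four clustering columns")
    column_list = ", ".join(clustering_columns)
    return [
        f"ALTER TABLE {table_name} CLUSTER BY ({column_list})",
        f"OPTIMIZE {table_name}",
    ]


# Assumption: cluster on the columns the read query filters on.
statements = liquid_clustering_statements(
    "catalog.schema.FeatureStore", ["ProductLine", "AttrB", "RunDate"]
)
for statement in statements:
    print(statement)

# On Databricks each statement could then be executed with spark.sql(statement).
```

Unlike partitionBy, this does not create one directory per value combination, so the salted row cap remains the main control over how many rows each file holds.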

After applying the same salting strategy and the 1M row limit per partition, the liquid clustered table shows partition sizes very similar to the salted and partitioned case, as shown in Fig. 13. Products C and D stay close to the 1M row target, while products A and B sit slightly above that limit.

Figure 13. Average row count per partition by product

However, the main difference comes from how these partitions are distributed and consumed by Spark. As shown in Fig. 14, Product D again reaches a high number of file partitions, providing enough parallelism to saturate the available cores during inference.

Figure 14. Total number of file partitions per product.

Unlike its partitioned counterpart, liquid clustering allows Spark to adapt the file layout over time while still benefiting from the salt. This results in an even distribution of work across executors, with fewer extreme outliers in both input size and task duration.

From the stage summary and summary statistics in Fig. 15 and Fig. 16, we see that most tasks finish within a tight run-time window, and that the maximum task duration is lower than in the salted and partitioned scenario. This indicates reduced skew and better load balancing across the cluster.

Figure 15. Summary of the inference stage for the liquid clustered, salted dataset

Figure 16. Summary statistics for executor inference on the liquid clustered, salted dataset.

An important side effect is that liquid clustering preserves data locality on the filtered columns without enforcing rigid partition boundaries. This allows Spark to still benefit from data skipping, while the salt ensures that no single executor is hit with tens of millions of rows.

Overall, salted and liquid emerges as the most robust setup: it maximizes parallelism, minimizes skew, and reduces operational risk as inference workloads grow or cluster configurations change.

Key Takeaways

  • Inference scalability is often limited by data layout, not model complexity. Poorly sized file partitions can leave hundreds of cores idle while a few executors process tens of millions of rows.
  • Partitioning alone is not enough for large-scale inference. Without control over file size, partitioned tables can still produce huge partitions that lead to long-running, skewed tasks.
  • Salting is an effective tool for unlocking parallelism. Introducing a salt key and enforcing a row limit per partition dramatically increases the number of runnable tasks and stabilizes run times.
  • Liquid clustering complements salting by reducing skew without rigid boundaries. It allows Spark to adapt the file layout over time, making the system more resilient as data grows.
