Using Polars Instead of Pandas: A Performance Deep Dive

# Introduction
Over the past decade, Pandas has been the foundation of data work in Python. For datasets that fit in memory, it is fast and familiar enough that switching libraries rarely crosses anyone's mind.
However, once you start working with millions of rows, the cracks begin to show: group-by operations that take several seconds, intermediate copies that eat RAM, and window functions that run as Python-level loops rather than vectorized C or Rust code.
Polars is a DataFrame library built in Rust on top of Apache Arrow. It was designed with parallelism and lazy evaluation as first-class features. Pandas executes each operation eagerly and sequentially, whereas Polars can build a query plan and optimize it before execution, running many operations in parallel across all available CPU cores by default.
In this article, we work through three real data problems using actual interview questions from the StrataScratch coding platform. For each problem, we compare solutions in both libraries and point out where the performance difference matters most.

# Using rank() vs. with_row_count(): Activity Rank
In this question, the goal is to find each user's email activity rank based on the total number of emails sent. The user with the most emails gets rank 1. Results should be sorted by total emails in descending order, using alphabetical order of the user ID as the tiebreaker, and every rank must be unique, even when two users have the same email count.
// Data Preview
The google_gmail_emails table stores one row per email sent, with the sender's ID (from_user), the recipient's ID (to_user), and the day the email was sent. Here is a preview of the table:
| id | from_user | to_user | day |
|---|---|---|---|
| 0 | 6edf0be4b2267df1fa | 75d295377a46f83236 | 10 |
| 1 | 6edf0be4b2267df1fa | 32ded68d89443e808 | 6 |
| 2 | 6edf0be4b2267df1fa | 55e60cfcc9dc49c17e | 10 |
| 3 | 6edf0be4b2267df1fa | e0e0defbb9ec47f6f7 | 6 |
| 4 | … | … | … |
| 314 | e6088004caf0c8cc51 | e6088004caf0c8cc51 | 5 |
Grain (what one output row represents): one user, with their total email count and a unique activity rank.
// Common Mistake
The question asks for a unique rank even when two users have the same email count. A common mistake is to use rank(method='dense') in Pandas, which gives tied users the same rank. The correct method is 'first', which breaks ties by position in the frame being ranked. Because groupby returns users in sorted user_id order before the ranking step, the resulting ranks are unique and deterministic.
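To make the difference concrete, here is a minimal sketch on made-up data (the user IDs and counts are hypothetical, not taken from the table above) showing how 'dense' and 'first' treat a tie:

```python
import pandas as pd

# Hypothetical totals: users "a" and "b" are tied on 19 emails.
totals = pd.DataFrame({
    "user_id": ["a", "b", "c"],
    "total_emails": [19, 19, 18],
})

# 'dense' gives tied users the same rank, so ranks are not unique.
dense_rank = totals["total_emails"].rank(method="dense", ascending=False).astype(int)

# 'first' breaks ties by row position, so every rank is distinct.
first_rank = totals["total_emails"].rank(method="first", ascending=False).astype(int)

print(dense_rank.tolist())  # [1, 1, 2]
print(first_rank.tolist())  # [1, 2, 3]
```

With 'first', the tie is resolved by whichever row comes earlier, which is why the row order going into the ranking step matters.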
The optimal Polars solution avoids the rank operation entirely. After sorting by ["total_emails", "user_id"] in descending and ascending order, respectively, .with_row_count("activity_rank", offset=1) assigns sequential numbers starting at 1 (recent Polars releases rename this method to with_row_index). No tie-breaking logic is needed because the sort has already handled it.
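The sort-then-number idea can also be sketched in Pandas terms: once the rows are sorted by both keys, a plain running counter is already a valid unique rank. The data below is hypothetical and deliberately starts out of alphabetical order.

```python
import numpy as np
import pandas as pd

# Hypothetical totals; "zed" appears before "amy" on purpose.
totals = pd.DataFrame({
    "user_id": ["zed", "amy", "bob"],
    "total_emails": [19, 19, 18],
})

# Sort by count (descending), then by user_id (ascending) as the tiebreaker.
ranked = (
    totals
    .sort_values(["total_emails", "user_id"], ascending=[False, True])
    .reset_index(drop=True)
)

# After the sort, numbering the rows 1..n is the rank.
ranked["activity_rank"] = np.arange(1, len(ranked) + 1)

print(ranked["user_id"].tolist())        # ['amy', 'zed', 'bob']
print(ranked["activity_rank"].tolist())  # [1, 2, 3]
```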
// Solutions
1. Pandas Solution
We rename from_user to user_id, group by user, count emails, assign a 'first'-method rank, and sort by email count in descending order, breaking ties alphabetically.
import pandas as pd
import numpy as np
google_gmail_emails = google_gmail_emails.rename(columns={"from_user": "user_id"})
result = google_gmail_emails.groupby(
['user_id']).size().to_frame('total_emails').reset_index()
result['activity_rank'] = result['total_emails'].rank(method='first', ascending=False)
result = result.sort_values(by=['total_emails', 'user_id'], ascending=[False, True])
2. Polars Solution
We use a lazy chain that groups, sorts, and assigns row numbers in a single pass, after renaming the sender column. The .collect() call at the end materializes the result.
import polars as pl
google_gmail_emails = google_gmail_emails.rename({"from_user": "user_id"})
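result = (
    google_gmail_emails.lazy()
    .group_by("user_id")
    # pl.count() counts rows per group (newer Polars releases spell this pl.len())
    .agg(total_emails=pl.count())
    # most emails first; ties broken alphabetically by user_id
    .sort(
        by=["total_emails", "user_id"],
        descending=[True, False]
    )
    # sequential numbering after the sort gives a unique rank starting at 1
    .with_row_count("activity_rank", offset=1)
    .select([
        pl.col("user_id"),
        "total_emails",
        "activity_rank"
    ])
    .collect()
)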
// Performance Comparison
The Pandas solution makes two passes over the data after grouping: one to compute the group sizes and one to assign ranks. Internally, rank(method='first') allocates a rank array, resolves ties via an argsort, and writes the result back, which costs more than a single column scan. The Polars group_by operation splits the work across all available CPU cores, which speeds up aggregation on large tables. And because .with_row_count() is a single sequential O(n) pass after the sort, it replaces the rank operation with about the cheapest operation available. On a table with millions of email records, combining parallel aggregation with the absence of a rank step can yield a 5–10x improvement in wall-clock time compared with the Pandas approach.
Here is a preview of the code's output:
| user_id | total_emails | activity_rank |
|---|---|---|
| 32ded68d89443e808 | 19 | 1 |
| ef5fe98c6b9f313075 | 19 | 2 |
| 5b8754928306a18b68 | 18 | 3 |
| 55e60cfcc9dc49c17e | 16 | 4 |
| 91f59516cb9dee1e88 | 16 | 5 |
| … | … | … |
| e6088004caf0c8cc51 | 6 | 25 |
# Using cumcount() + pivot() vs. over(): Finding User Purchases
In this question, we are asked to identify returning active users: specifically, those who made a second purchase between 1 and 7 days after their first. Purchases made on the same day should not count. The output is simply a list of qualifying user_id values.
// Data Preview
The amazon_transactions table has one row per purchase, with user_id, item, created_at date, and revenue.
Here is a preview of the table:
| id | user_id | item | created_at | revenue |
|---|---|---|---|---|
| 1 | 109 | milk | 2020-03-03 | 123 |
| 2 | 139 | biscuit | 2020-03-18 | 421 |
| 3 | 120 | milk | 2020-03-18 | 176 |
| … | … | … | … | … |
| 100 | 117 | bread | 2020-03-10 | 209 |
Grain (what one output row represents): one ID of a user who made a qualifying purchase within 7 days of their first.
// Edge Case
Same-day purchases must be excluded, which means the gap between the first and second purchase must be greater than 0 days and at most 7 days. A customer who buys twice on the same day does not qualify.
// Solutions
Both solutions find each user's first purchase date and then filter for a subsequent purchase within the 1-to-7-day window. One thing to watch: if created_at holds timestamps rather than plain dates, you need to truncate to the date before comparing. Otherwise, two purchases made at different times on the same day will incorrectly pass the strict inequality.
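The timestamp pitfall is easy to reproduce. The sketch below uses three hypothetical purchase times for a single user (the values are assumptions for illustration) and compares the strict inequality before and after truncating to the date:

```python
import pandas as pd

# Hypothetical purchases for one user: two on the same day, one two days later.
purchases = pd.to_datetime(pd.Series([
    "2020-03-03 09:00",
    "2020-03-03 18:30",
    "2020-03-05 08:00",
]))

# Raw timestamps: the 18:30 purchase counts as strictly later than the first.
later_raw = (purchases > purchases.min()).tolist()

# Truncated to midnight: the same-day purchase is no longer strictly later.
dates = purchases.dt.normalize()
later_by_date = (dates > dates.min()).tolist()

print(later_raw)      # [False, True, True]
print(later_by_date)  # [False, False, True]
```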
1. Pandas Solution
In Pandas, the solution involves extracting the unique purchase dates per user, ranking them with cumcount(), pivoting to get the first and second dates side by side, and computing the day difference.
import pandas as pd
amazon_transactions["purchase_date"] = pd.to_datetime(amazon_transactions["created_at"]).dt.date
daily = amazon_transactions[["user_id", "purchase_date"]].drop_duplicates()
ranked = daily.sort_values(["user_id", "purchase_date"])
ranked["rn"] = ranked.groupby("user_id").cumcount() + 1
first_two = (ranked[ranked["rn"] <= 2]
.pivot(index="user_id", columns="rn", values="purchase_date")
.reset_index()
.rename(columns={1: "first_date", 2: "second_date"}))
first_two = first_two.dropna(subset=["second_date"])
first_two["diff"] = (pd.to_datetime(first_two["second_date"]) - pd.to_datetime(first_two["first_date"])).dt.days
result = first_two[(first_two["diff"] >= 1) & (first_two["diff"] <= 7)][["user_id"]]
2. Polars Solution
The Polars solution computes each user's first purchase date as a window expression with .over("user_id"), filters for purchases that fall inside the time window, and returns a deduplicated user_id list.
import polars as pl
# returning active users: 2nd purchase 1–7 days after the first (ignore same-day)
returning_users = (
    amazon_transactions
    .lazy()
    # first purchase date per user (window so we avoid .groupby on LazyFrame)
    .with_columns(
        pl.col("created_at").min().over("user_id").alias("first_purchase_date")
    )
    # keep transactions strictly 1-7 days after that first purchase
    .filter(
        (pl.col("created_at") > pl.col("first_purchase_date")) &
        (pl.col("created_at") <= pl.col("first_purchase_date") + pl.duration(days=7))
    )
    # distinct user list
    .select("user_id")
    .unique()
    .sort("user_id")
    # materialize the lazy plan into a DataFrame
    .collect()
)
// Performance Comparison
Notice the number of separate DataFrame allocations in the Pandas solution: the deduplicated daily table, the sorted ranked table, the pivoted frame, the dropna result, and the filtered output. That is five distinct objects, each copying data into a new block of memory. On a large transactions table, the pivot step alone can noticeably increase memory usage, because it reshapes the data into a wide format.
The Polars lazy chain does not materialize anything until .collect(). The .over("user_id") window expression computes each user's first purchase date in a single pass, the .filter() is applied right away in the same step, and .unique() runs in parallel across all CPU cores. There is no pivot, no intermediate sorted copy, and no separate date-casting step: Polars handles date arithmetic natively inside its expression engine. This approach uses less memory and runs faster, even on medium-sized datasets.
Here is a preview of the code's output:
| user_id |
|---|
| 100 |
| 103 |
| 105 |
| … |
| 143 |
# Using expanding().mean() vs. cum_mean(): Monthly Sales Rolling Average
In this question, we are asked to compute the cumulative average of monthly book sales for 2022. The average grows each month using all preceding months: February's value is the average of January and February, March's is the average of all three, and so on. The output should include the month, the total sales for that month, and the cumulative average rounded to the nearest whole number.
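As a quick worked example of that definition, the sketch below takes the first three monthly totals from the output preview at the end of this section (145, 250, 315) and computes the unrounded cumulative average in plain Python:

```python
from itertools import accumulate

# January to March totals, taken from the output preview in this section.
monthly_sales = [145, 250, 315]

# Running totals after each month: 145, 145 + 250, 145 + 250 + 315.
running_totals = list(accumulate(monthly_sales))

# Divide each running total by the number of months seen so far.
cumulative_avg = [total / n for n, total in enumerate(running_totals, start=1)]

print(running_totals)  # [145, 395, 710]
print(cumulative_avg)  # [145.0, 197.5, 236.66666666666666]
```

Rounded to the nearest whole number with halves going up, these become 145, 198, and 237, matching the preview.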
// Data Preview
The amazon_books table has one row per book, along with its unit price. Here is a preview of the table:
| book_id | book_title | unit_price |
|---|---|---|
| B001 | The Hunger Games | 25 |
| B002 | The Outsiders | 50 |
| B003 | To Kill a Mockingbird | 100 |
| … | … | … |
| B020 | The Pillars of the Earth | 60 |
The book_orders table has one row per book order, linking each order ID to the order date, the book ID, and the quantity ordered:
| order_id | order_date | book_id | quantity |
|---|---|---|---|
| 1001 | 2022-01-10 | B001 | 1 |
| 1002 | 2022-01-10 | B009 | 1 |
| 1003 | 2022-01-15 | B012 | 2 |
| … | … | … | … |
| 1084 | 2023-02-01 | B009 | 1 |
Grain (what one output row represents): one month in 2022, with the total sales for that month and the cumulative average of all monthly sales up to and including that month.
// Trade-off
With Pandas, .expanding().mean() is the idiomatic choice, but it treats the calculation as a series of growing windows rather than as a dedicated cumulative operation. For a monthly summary with 12 rows, that overhead is negligible. On daily or hourly data at scale (for example, three years of hourly transactions), the extra per-window work can add up row by row.
Polars' cum_mean() uses a single pass in Rust and is naturally faster at scale. There is one catch: the question requires rounding to the nearest whole number, and Pandas uses banker's rounding (round half to even) by default. The Polars solution therefore uses NumPy's cumsum with an explicit floor(x + 0.5) formula to force round-half-up behavior. If you need an exact match with the expected output, the NumPy approach is more reliable than the built-in rounding in either library.
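The rounding difference only shows up on exact halves. Here is a minimal sketch comparing NumPy's default rounding with the floor(x + 0.5) formula; the sample values are chosen for illustration, with 197.5 being the unrounded February average from this problem:

```python
import numpy as np

# Values that sit exactly on a half.
values = np.array([0.5, 1.5, 2.5, 197.5])

# np.round uses round-half-to-even (banker's rounding).
half_even = np.round(values).astype(int)

# floor(x + 0.5) always pushes a half upward (for non-negative inputs like sales).
half_up = np.floor(values + 0.5).astype(int)

print(half_even.tolist())  # [0, 2, 2, 198]
print(half_up.tolist())    # [1, 2, 3, 198]
```

Note that 197.5 happens to round to 198 either way because 198 is even; the two methods disagree on values such as 0.5 and 2.5.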
// Solutions
1. Pandas Solution
We merge books with orders, filter to 2022, aggregate monthly sales, and use .expanding().mean() to compute the cumulative average.
import pandas as pd
import numpy as np
import datetime as dt
merged = pd.merge(book_orders, amazon_books, on="book_id", how="inner")
merged["order_date"] = pd.to_datetime(merged["order_date"])
merged["order_month"] = merged["order_date"].dt.month
merged["year"] = merged["order_date"].dt.year
merged["sales"] = merged["unit_price"] * merged["quantity"]
merged = merged.loc[(merged["year"] == 2022), :]
result = (
merged.groupby("order_month")["sales"]
.sum()
.to_frame("monthly_sales")
.sort_values(by="order_month")
.reset_index()
)
result["rolling_average"] = result["monthly_sales"].expanding().mean().round(0)
result
2. Polars: Building the Lazy Pipeline and Collecting
We join the two tables inside a lazy chain, compute sales as unit_price * quantity, filter to 2022, aggregate by month, and then call .collect() to switch to eager mode before the NumPy rounding step.
import polars as pl
import numpy as np
# Step 1: Prepare monthly sales (LazyFrame)
monthly_sales_lazy = (
    book_orders.lazy()
    .join(amazon_books.lazy(), on="book_id", how="inner")
    .with_columns([
        (pl.col("unit_price") * pl.col("quantity")).alias("sales"),
        # expressions in one with_columns all read the original columns,
        # so year and month are derived from the cast expression directly
        pl.col("order_date").cast(pl.Datetime).dt.year().alias("year"),
        pl.col("order_date").cast(pl.Datetime).dt.month().alias("order_month")
    ])
    .filter(pl.col("year") == 2022)
    .group_by("order_month")
    .agg(pl.col("sales").sum().alias("monthly_sales"))
    .sort("order_month")
)
# Step 2: Switch to eager mode for rolling computation
monthly_sales = monthly_sales_lazy.collect()
3. Computing the Rolling Average and Finalizing
With the monthly sales as a NumPy array, we divide the cumulative sum by the month count, apply round-half-up rounding, add the result back to the Polars DataFrame, and select the output columns.
# Step 3: Rolling average with round-half-up
sales_np = monthly_sales["monthly_sales"].to_numpy()
cumsum = np.cumsum(sales_np)
rolling_avg = np.floor(cumsum / np.arange(1, len(cumsum)+1) + 0.5).astype(int)
# Step 4: Add back to Polars DataFrame
monthly_sales = monthly_sales.with_columns([
pl.Series("rolling_average", rolling_avg)
])
# Step 5: Final result with correct column names
result = monthly_sales.select(["order_month", "monthly_sales", "rolling_average"])
// Performance Comparison
This question has two operations that affect performance most: the join and the cumulative window. In Pandas, pd.merge combines all rows from both tables before the 2022 filter is applied. That means orders from every year are processed before the rows outside the target period are discarded. Polars builds a lazy query plan and can push the filter(year == 2022) condition down so that it runs before the join, which then operates on a smaller dataset from the start. This predicate pushdown happens automatically, with no extra code required.
The more notable difference is in the cumulative window. Pandas' .expanding().mean() models the calculation as a window that grows one row at a time, whereas Polars' cum_mean() computes the whole column in a single Rust loop with no Python overhead. The difference may not be noticeable on monthly data, but if you run the same query on three years of daily data (about 1,000 rows), the Polars version finishes in a fraction of a second, while Pandas can show measurable lag because of the growing window.
Here is a preview of the code's output:
| order_month | monthly_sales | rolling_average |
|---|---|---|
| 1 | 145 | 145 |
| 2 | 250 | 198 |
| 3 | 315 | 237 |
| … | … | … |
| 12 | 710 | 402 |
# Conclusion
Across all three problems, the Polars solutions follow the same pattern: build a lazy query plan, push as much computation as possible into the optimizer, and call .collect() only when you need a concrete result.
The syntax takes some adjustment if, like many analysts, you have years of Pandas habits, but the operations map closely. .groupby() becomes .group_by(), .rename() takes a plain dictionary instead of a columns= keyword, and ranking becomes a sort followed by .with_row_count().
The real difference shows up at scale. With small datasets, both libraries return results quickly enough that the difference is not noticeable. As row counts reach the millions, Polars' Rust-level parallelism and single-pass algorithms pay off. If you are running into performance problems with Pandas, these three problems are a good starting point for a migration.
Nate Rosidi is a data scientist who works in product strategy. He is also an adjunct professor teaching analytics, and the founder of StrataScratch, a platform that helps data scientists prepare for interviews with real interview questions from top companies. Nate writes about the latest trends in the job market, gives interview advice, shares data science projects, and covers everything SQL.



