Using Polars Instead of Pandas: A Performance Deep Dive

# Introduction

For the past decade, Pandas has been the foundation of data work in Python. For datasets that fit in memory, it is fast and familiar enough that switching libraries rarely crosses anyone's mind.

However, once you start working with millions of rows, the cracks begin to show: group-by operations that take several seconds, intermediate copies that eat RAM, and window functions that run as Python-level loops instead of vectorized C or Rust code.

Polars is a DataFrame library built in Rust on top of Apache Arrow. It was designed with parallelism and lazy evaluation as first-class features. Pandas executes each operation eagerly and in sequence, whereas Polars can build a query plan and optimize it before executing, with many operations running in parallel across all available CPU cores by default.

In this article, we work through three real-world data problems using actual questions from the StrataScratch coding platform. For each problem, we compare the solutions in both libraries and point out where the performance difference matters most.

Polars vs. Pandas

# Using rank() vs. with_row_count(): Activity Rank

In this question, the goal is to find each user's email activity rank based on the total number of emails sent. The user with the most emails gets rank 1. Results should be sorted by total emails in descending order, using alphabetical order as a tiebreaker, and every rank must be unique, even when two users have the same email count.

// Data Preview

The google_gmail_emails table stores one row per email sent, with the sender's ID (from_user), the recipient's ID (to_user), and the day the email was sent. Here's a preview of the table:

id from_user to_user day
0 6edf0be4b2267df1fa 75d295377a46f83236 10
1 6edf0be4b2267df1fa 32ded68d89443e808 6
2 6edf0be4b2267df1fa 55e60cfcc9dc49c17e 10
3 6edf0be4b2267df1fa e0e0defbb9ec47f6f7 6
4
314 e6088004caf0c8cc51 e6088004caf0c8cc51 5

Grain (what one output row represents): one user, with their total email count and a unique activity rank.

// Common Pitfall

The question asks for a unique rank even when two users have the same email count. A common mistake is to use rank(method='dense') in Pandas, which gives tied users the same rank. The correct method is 'first', which breaks ties by position in the frame. Since the grouped frame is already ordered alphabetically by user_id before ranking, the resulting ranks are unique and deterministic.

The Polars solution avoids the rank operation entirely. After sorting by ["total_emails", "user_id"] in descending and ascending order, respectively, .with_row_count("activity_rank", offset=1) assigns sequential numbers starting from 1. No tie-breaking logic is needed because the sort has already handled it.
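To make the pitfall concrete, here is a minimal Pandas-only sketch on a hypothetical three-user table (the user IDs and counts are invented for illustration, not taken from the dataset). It compares the two rank methods and then mimics the sort-then-number idea with a plain range, which is only an analogue of what .with_row_count() does in Polars.

```python
import pandas as pd

# Hypothetical toy counts: users "a" and "b" are tied on 3 emails.
counts = pd.DataFrame({
    "user_id": ["a", "b", "c"],
    "total_emails": [3, 3, 5],
})

# 'dense' hands both tied users the same rank, which the question forbids.
dense = counts["total_emails"].rank(method="dense", ascending=False).astype(int)

# 'first' breaks the tie by row position, so every rank is unique.
first = counts["total_emails"].rank(method="first", ascending=False).astype(int)

# Sort once, then number the rows 1..n: the sort itself resolves the tie.
ordered = counts.sort_values(
    ["total_emails", "user_id"], ascending=[False, True]
).reset_index(drop=True)
ordered["activity_rank"] = range(1, len(ordered) + 1)

print(dense.tolist())   # [2, 2, 1]
print(first.tolist())   # [2, 3, 1]
print(ordered)
```

Both the 'first' ranks and the row numbers put user "a" ahead of user "b", because "a" comes first alphabetically in each case.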

// Solutions

1. Pandas Solution

We rename from_user to user_id, group by user, count the emails, assign a rank with method='first', and sort by email count in descending order, with alphabetical order as the tiebreaker.

import pandas as pd
import numpy as np
google_gmail_emails = google_gmail_emails.rename(columns={"from_user": "user_id"})
result = google_gmail_emails.groupby(
    ['user_id']).size().to_frame('total_emails').reset_index()
result['activity_rank'] = result['total_emails'].rank(method='first', ascending=False)
result = result.sort_values(by=['total_emails', 'user_id'], ascending=[False, True])

2. Polars Solution

We use a lazy chain that groups, sorts, and assigns row numbers in a single pass over the renamed table. Calling .collect() at the end materializes the result.

import polars as pl
google_gmail_emails = google_gmail_emails.rename({"from_user": "user_id"})
result = (
    google_gmail_emails.lazy()
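    .group_by("user_id")
    # count rows per sender; recent Polars releases spell this pl.len()
    .agg(total_emails = pl.count())
    # most emails first, alphabetical user_id as the tiebreaker
    .sort(
        by=["total_emails", "user_id"],
        descending=[True, False]
    )
    # sequential 1..n rank; renamed with_row_index() in recent Polars releases
    .with_row_count("activity_rank", offset=1)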
    .select([
        pl.col("user_id"),
        "total_emails",
        "activity_rank"
    ])
    .collect()
)

// Performance Comparison

The Pandas solution passes over the data twice after grouping: once to compute the sizes and once to assign ranks. Internally, rank(method='first') allocates a rank array, resolves ties with an argsort, and writes the result back, which costs more than scanning a single column. The Polars group_by operation splits the work across all available CPU cores, which leads to faster aggregation on large tables. And because .with_row_count() is a single sequential O(n) pass after the sort, it replaces the rank operation with the cheapest operation available. On a table holding millions of email records, parallel aggregation without the rank step can lead to a 5-10x improvement in wall-clock time compared with the Pandas approach.

Here's a preview of the code output:

user_id total_emails activity_rank
32ded68d89443e808 19 1
ef5fe98c6b9f313075 19 2
5b8754928306a18b68 18 3
55e60cfcc9dc49c17e 16 4
91f59516cb9dee1e88 16 5
e6088004caf0c8cc51 6 25

# Using cumcount() + pivot() vs. over(): Finding User Purchases

In this question, we are asked to identify returning active users: specifically, those who made a second purchase between 1 and 7 days after their first. Purchases made on the same day should not be counted. The output is simply a list of qualifying user_id values.

// Data Preview

The amazon_transactions table has one row per purchase, with user_id, item, the created_at date, and revenue.

Here's a preview of the table:

id user_id item created_at revenue
1 109 milk 2020-03-03 123
2 139 biscuit 2020-03-18 421
3 120 milk 2020-03-18 176
100 117 bread 2020-03-10 209

Grain (what one output row represents): one ID of a user who made a qualifying purchase within 7 days of their first.

// Edge Case

Same-day purchases must be excluded, which means the gap between the first and second purchase must be greater than 0 days and at most 7 days. A customer who buys twice on the same day does not qualify.

// Solutions

Both solutions find each user's first purchase date and then filter for later purchases within the 1-to-7-day window. One thing to watch: if created_at holds timestamps rather than bare dates, you need to truncate to the date before comparing. Otherwise, two purchases made at different times on the same day will incorrectly pass the strict inequality.
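The timestamp caveat is easy to reproduce. The sketch below is a small Pandas illustration on invented data (the rows and the helper name returning_users are hypothetical): user 1 buys twice on the same day at different times, user 2 buys two days apart. Run on raw timestamps, the strict "greater than first purchase" test lets user 1 through; after truncating to the date, only user 2 remains.

```python
import pandas as pd

# Hypothetical transactions with full timestamps.
tx = pd.DataFrame({
    "user_id": [1, 1, 2, 2],
    "created_at": pd.to_datetime([
        "2020-03-01 09:00", "2020-03-01 18:00",  # user 1: same day, different times
        "2020-03-01 09:00", "2020-03-03 08:00",  # user 2: two days apart
    ]),
})

def returning_users(frame, col):
    """Users with a purchase more than 0 and at most 7 days after their first."""
    first = frame.groupby("user_id")[col].transform("min")
    gap = frame[col] - first
    mask = (gap > pd.Timedelta(0)) & (gap <= pd.Timedelta(days=7))
    return sorted(frame.loc[mask, "user_id"].unique().tolist())

# Raw timestamps: the 9-hour gap counts as "after the first purchase".
raw = returning_users(tx, "created_at")

# Truncate to midnight first, so same-day purchases have a gap of exactly zero.
tx["purchase_date"] = tx["created_at"].dt.normalize()
by_date = returning_users(tx, "purchase_date")

print(raw)      # [1, 2]
print(by_date)  # [2]
```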

1. Pandas Solution

In Pandas, the solution involves isolating the unique purchase dates for each user, ranking them with cumcount(), pivoting to get the first and second dates side by side, and computing the difference in days.

import pandas as pd
amazon_transactions["purchase_date"] = pd.to_datetime(amazon_transactions["created_at"]).dt.date
daily = amazon_transactions[["user_id", "purchase_date"]].drop_duplicates()
ranked = daily.sort_values(["user_id", "purchase_date"])
ranked["rn"] = ranked.groupby("user_id").cumcount() + 1
first_two = (ranked[ranked["rn"] <= 2]
             .pivot(index="user_id", columns="rn", values="purchase_date")
             .reset_index()
             .rename(columns={1: "first_date", 2: "second_date"}))
first_two = first_two.dropna(subset=["second_date"])
first_two["diff"] = (pd.to_datetime(first_two["second_date"]) - pd.to_datetime(first_two["first_date"])).dt.days
result = first_two[(first_two["diff"] >= 1) & (first_two["diff"] <= 7)][["user_id"]]

2. Polars Solution

The Polars solution computes each user's first purchase date as a window expression with .over("user_id"), filters to the purchases that fall inside the time window, and returns the deduplicated user_id list.

import polars as pl
# returning active users: 2nd purchase 1–7 days after the first (ignore same-day)
returning_users = (
    amazon_transactions
    .lazy()
    # first purchase date per user (window so we avoid .groupby on LazyFrame)
    .with_columns(
        pl.col("created_at").min().over("user_id").alias("first_purchase_date")
    )
    # keep transactions strictly 1-7 days after that first purchase
    .filter(
        (pl.col("created_at") > pl.col("first_purchase_date")) &
        (pl.col("created_at") <= pl.col("first_purchase_date") + pl.duration(days=7))
    )
    # distinct user list
    .select("user_id")
    .unique()
    .sort("user_id", descending=[False])
    .collect()  # materialize the lazy plan; without this the result stays a LazyFrame
)

// Performance Comparison

Notice the number of separate DataFrame allocations in the Pandas solution: the deduplicated daily table, the sorted and ranked table, the pivoted frame, the dropna result, and the filtered output. That is five distinct objects, each copying data into a new block of memory. On a large transactions table, the pivot step alone can increase memory usage substantially, since it reshapes the ranked rows into a wide format.

The Polars lazy chain does not materialize any intermediate result until .collect(). The .over("user_id") window expression computes each user's first purchase date in a single pass, the .filter() is applied right away in the same step, and .unique() runs in parallel across all CPU cores. There is no pivot, no intermediate sorted copy, and no separate date-casting step: Polars handles date arithmetic natively inside the expression engine. This approach uses less memory and runs faster, even on medium-sized datasets.

Here's a preview of the code output:

user_id
100
103
105
143

# Using expanding().mean() vs. cum_mean(): Monthly Sales Rolling Average

In this question, we are asked to determine the cumulative average of monthly book sales for 2022. The average grows each month using all preceding months: February's value is the average of January and February, March's is the average of the first three months, and so on. The output should include the month, the total sales for that month, and the cumulative average rounded to the nearest whole number.

// Data Preview

The amazon_books table has one row per book, together with its unit price. Here's a preview of the table:

book_id book_title unit_price
B001 The Hunger Games 25
B002 The Outsiders 50
B003 To Kill a Mockingbird 100
B020 The Pillars of the Earth 60

The book_orders table has one row per book order, linking each order ID to the order date, the book ID, and the quantity ordered:

order_id order_date book_id quantity
1001 2022-01-10 B001 1
1002 2022-01-10 B009 1
1003 2022-01-15 B012 2
1084 2023-02-01 B009 1

Grain (what one output row represents): one month of 2022, the total sales for that month, and the cumulative average of all monthly sales up to and including that month.

// The Trade-off

With Pandas, .expanding().mean() is the idiomatic choice, but it goes through the general-purpose windowing machinery, which adds overhead compared with a plain running sum. For a 12-row monthly summary, that cost is negligible. On daily or hourly data at scale (for example, three years of hourly transactions), the overhead of the growing window accumulates row by row.

Polars' cum_mean() uses a single pass in Rust and is inherently faster at scale. There is one catch: the question requires rounding to the nearest whole number, and Pandas uses banker's rounding (round half to even) by default. The Polars solution uses NumPy's cumsum with an explicit floor(x + 0.5) formula to force round-half-up behavior. If you need an exact match with the expected output, the NumPy approach is more reliable than the built-in rounding in either library.
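The rounding difference only shows up when an average lands exactly on .5, so a small sketch helps. The values below are illustrative (197.5 is the two-month average from the output preview; 198.5 is an invented value that exposes the difference). Note that floor(x + 0.5) behaves as round-half-up only for non-negative inputs, which is assumed here because sales totals are not negative.

```python
import numpy as np
import pandas as pd

# Illustrative averages; two of them sit exactly on a .5 boundary.
averages = pd.Series([197.5, 198.5, 236.67])

# Series.round follows NumPy: halves go to the nearest even integer.
bankers = averages.round(0)

# Explicit round-half-up for non-negative values.
half_up = np.floor(averages.to_numpy() + 0.5).astype(int)

print(bankers.tolist())  # [198.0, 198.0, 237.0]
print(half_up.tolist())  # [198, 199, 237]
```

The two methods agree on 197.5 only because 198 happens to be even; on 198.5 they diverge.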

// Solutions

1. Pandas Solution

We merge the books with the orders, filter to 2022, aggregate sales by month, and use .expanding().mean() to compute the cumulative average.

import pandas as pd
import numpy as np
import datetime as dt
merged = pd.merge(book_orders, amazon_books, on="book_id", how="inner")
merged["order_date"] = pd.to_datetime(merged["order_date"])
merged["order_month"] = merged["order_date"].dt.month
merged["year"] = merged["order_date"].dt.year
merged["sales"] = merged["unit_price"] * merged["quantity"]
merged = merged.loc[(merged["year"] == 2022), :]
result = (
    merged.groupby("order_month")["sales"]
    .sum()
    .to_frame("monthly_sales")
    .sort_values(by="order_month")
    .reset_index()
)
result["rolling_average"] = result["monthly_sales"].expanding().mean().round(0)
result

2. Polars: Building the Lazy Pipeline and Collecting

We join the two tables inside a lazy chain, compute sales as unit_price * quantity, filter to 2022, aggregate by month, and then call .collect() to switch to eager mode before the NumPy rounding step.

import polars as pl
import numpy as np
# Step 1: Prepare monthly sales (LazyFrame)
monthly_sales_lazy = (
    book_orders.lazy()
    .join(amazon_books.lazy(), on="book_id", how="inner")
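    .with_columns([
        (pl.col("unit_price") * pl.col("quantity")).alias("sales"),
        # Expressions in one with_columns call all read the input columns, so
        # year/month below assume order_date is already a date/datetime type.
        pl.col("order_date").cast(pl.Datetime),
        pl.col("order_date").dt.year().alias("year"),
        pl.col("order_date").dt.month().alias("order_month")
    ])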
    .filter(pl.col("year") == 2022)
    .group_by("order_month")
    .agg(pl.col("sales").sum().alias("monthly_sales"))
    .sort("order_month")
)
# Step 2: Switch to eager mode for rolling computation
monthly_sales = monthly_sales_lazy.collect()

3. Computing the Rolling Average and Finalizing

With the monthly sales as a NumPy array, we apply round-half-up rounding, add the result back to the Polars DataFrame, and select the output columns.

# Step 3: Rolling average with round-half-up
sales_np = monthly_sales["monthly_sales"].to_numpy()
cumsum = np.cumsum(sales_np)
rolling_avg = np.floor(cumsum / np.arange(1, len(cumsum)+1) + 0.5).astype(int)
# Step 4: Add back to Polars DataFrame
monthly_sales = monthly_sales.with_columns([
    pl.Series("rolling_average", rolling_avg)
])
# Step 5: Final result with correct column names
result = monthly_sales.select(["order_month", "monthly_sales", "rolling_average"])

// Performance Comparison

This question has two operations that affect performance the most: the join and the cumulative window. In Pandas, pd.merge combines all rows from both tables before the 2022 filter is applied. This means every year of orders is processed before the rows outside the target period are discarded. Polars builds a lazy query plan and can push the filter(year == 2022) condition down ahead of the join, so that it joins a smaller dataset from the start. That predicate pushdown happens automatically, with no extra code required.
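What pushdown buys you can be sketched by doing the reordering by hand. The Pandas example below uses hypothetical miniature versions of the two tables and shows that filtering before the join returns the same rows as filtering after it, while the join itself sees fewer rows. This is a manual analogue, not the Polars optimizer; whether Polars applies the rewrite to a particular query can be checked by printing the optimized plan with LazyFrame.explain().

```python
import pandas as pd

# Hypothetical miniature tables; one order falls outside 2022.
orders = pd.DataFrame({
    "order_id": [1, 2, 3],
    "order_date": pd.to_datetime(["2022-01-10", "2022-02-05", "2023-02-01"]),
    "book_id": ["B001", "B002", "B001"],
    "quantity": [1, 2, 1],
})
books = pd.DataFrame({"book_id": ["B001", "B002"], "unit_price": [25, 50]})

# Join first, filter afterwards: the 2023 row is joined and then thrown away.
late = orders.merge(books, on="book_id", how="inner")
late = late[late["order_date"].dt.year == 2022]
late = late.sort_values("order_id").reset_index(drop=True)

# Filter first, join afterwards: the join only ever sees 2022 rows.
orders_2022 = orders[orders["order_date"].dt.year == 2022]
early = orders_2022.merge(books, on="book_id", how="inner")
early = early.sort_values("order_id").reset_index(drop=True)

print(late.equals(early))           # True
print(len(orders), len(orders_2022))  # 3 2
```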

The more notable difference is the cumulative average. Pandas' .expanding().mean() grows its window one row at a time through the general windowing machinery, whereas Polars' cum_mean() computes the whole column in a single Rust loop with no Python overhead. While the difference may go unnoticed on monthly data, if you run this same query on three years of daily data (about 1,000 rows), the Polars version finishes in a fraction of a second while Pandas can show a measurable delay because of the growing window.
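For readers who want to convince themselves that the single-pass formulation gives the same numbers, here is a short check in Pandas and NumPy. It uses the first three monthly totals from the output preview and compares .expanding().mean() with a running sum divided by the row position, which is the arithmetic the NumPy step in the Polars solution performs.

```python
import numpy as np
import pandas as pd

# First three monthly totals from the output preview.
monthly_sales = pd.Series([145, 250, 315])

# Expanding-window mean: each row averages everything up to and including it.
expanding = monthly_sales.expanding().mean()

# Single pass: running total divided by the number of rows seen so far.
running = monthly_sales.cumsum() / np.arange(1, len(monthly_sales) + 1)

print(np.allclose(expanding, running))  # True
print(running.round(2).tolist())        # [145.0, 197.5, 236.67]
```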

Here's a preview of the code output:

order_month monthly_sales rolling_average
1 145 145
2 250 198
3 315 237
12 710 402

# Conclusion

Across all three problems, the Polars solutions follow the same pattern: build a lazy query plan, push as much computation as possible into the optimizer, and call .collect() only when you need a concrete result.

The syntax takes some adjustment if, like many analysts, you have years of Pandas habits, but the operations map closely. .groupby() becomes .group_by(), .rename() takes a plain dictionary instead of a columns= keyword, and ranking becomes a sort followed by .with_row_count().

The real difference shows up at scale. When you work with small datasets, both libraries return results quickly enough that the difference is not noticeable. As row counts reach the millions, Polars' Rust-level parallelism and single-pass algorithms pay off. If you are running into performance problems with Pandas, these three challenges are a good starting point for migrating.

Nate Rosidi is a data scientist who also works in product strategy. He is also an adjunct professor teaching analytics, and is the founder of StrataScratch, a platform that helps data scientists prepare for their interviews with real interview questions from top companies. Nate writes about the latest trends in the job market, gives interview advice, shares data science projects, and covers everything SQL.
