
Tutorial: Semantic Clustering of User Messages with LLM Prompts


As a developer advocate, it's challenging to keep up with user forum messages and understand the big picture of what users are saying. There's a lot of valuable content, but how can you quickly spot the key conversations? In this tutorial, I'll show you an AI hack to perform semantic clustering simply by prompting LLMs!

TL;DR 🔄 This blog post is about how to go from (data science + code) to (AI prompts + LLMs) 🤖⚡. It is organized as follows:

  • Inspiration and Data Sources
  • Exploring the Data with Dashboards
  • LLM Prompting to Produce KNN Clusters
  • Experimenting with Custom Embeddings
  • Clustering Across Multiple Discord Servers

Inspiration and Data Sources

First, I'll give props to the December 2024 paper Clio (Claude insights and observations), a privacy-preserving platform that uses AI assistants to analyze and surface aggregated usage patterns across millions of conversations. Reading this paper inspired me to try this.

Data. I used only publicly available Discord messages, specifically “forum threads”, where users ask for technical help. In addition, I aggregated and anonymized the content for this blog. Per thread, I formatted the data into conversation-turn format, with roles identified as either “user”, the person asking the question, or “assistant”, anyone answering the user's initial question. I also added a simple binary sentiment score (0 or 1), based on whether the user said thanks anywhere in the thread. For VectorDB vendors, I used Zilliz/Milvus, Chroma, and Qdrant.
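To make the conversation-turn format concrete, here is a minimal sketch of the role labeling described above. It is not the code I actually ran: the input keys 'author' and 'content' are hypothetical, and it assumes the thread starter is the “user” and everyone else is an “assistant”.

```python
from dataclasses import dataclass


@dataclass
class Turn:
    thread_id: int
    role_name: str        # "user" or "assistant"
    message_content: str


def to_turns(raw_messages):
    """Label each message as 'user' (thread starter) or 'assistant' (anyone else).

    Assumes raw_messages is a list of dicts sorted by time, with the
    hypothetical keys 'thread_id', 'author', and 'content'.
    """
    thread_starter = {}   # thread_id -> author of the first message seen
    turns = []
    for msg in raw_messages:
        starter = thread_starter.setdefault(msg["thread_id"], msg["author"])
        role = "user" if msg["author"] == starter else "assistant"
        turns.append(Turn(msg["thread_id"], role, msg["content"]))
    return turns


# Toy messages, made up for illustration only.
raw = [
    {"thread_id": 2, "author": "alice", "content": "My index build is slow."},
    {"thread_id": 2, "author": "bob", "content": "Try a smaller batch size."},
    {"thread_id": 2, "author": "alice", "content": "thanks!"},
]
turns = to_turns(raw)
print([t.role_name for t in turns])  # ['user', 'assistant', 'user']
```

The field names on Turn mirror the DataFrame columns used in the scoring code (thread_id, role_name, message_content).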

The first step was to convert the data into a pandas DataFrame. Below is an excerpt. You can see that for thread_id = 2, the user asked only 1 question. But for thread_id = 3, the user asked 4 different questions in the same thread (the other 2 questions have much later timestamps and are not shown below).

I added a naive 0 | 1 sentiment scoring function.

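import os


def calc_score(df):
   # Adds a 'score' column: 1 if any user message in the thread contains
   # a thank-you word or emoji, else 0.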
   # Define the target words
   target_words = ["thanks", "thank you", "thx", "🙂", "😉", "👍"]


   # Helper function to check if any target word is in the concatenated message content
   def contains_target_words(messages):
       concatenated_content = " ".join(messages).lower()
       return any(word in concatenated_content for word in target_words)


   # Group by 'thread_id' and calculate score for each group
   thread_scores = (
       df[df['role_name'] == 'user']
       .groupby('thread_id')['message_content']
       .apply(lambda messages: int(contains_target_words(messages)))
   )
   # Map the calculated scores back to the original DataFrame
   df['score'] = df['thread_id'].map(thread_scores)
   return df


...


if __name__ == "__main__":
  
   # Load parameters from YAML file
   config_path = "config.yaml"
   params = load_params(config_path)
   input_data_folder = params['input_data_folder']
   processed_data_dir = params['processed_data_dir']
   threads_data_file = os.path.join(processed_data_dir, "thread_summary.csv")
  
   # Read data from Discord Forum JSON files into a pandas df.
   clean_data_df = process_json_files(
       input_data_folder,
       processed_data_dir)
  
   # Calculate score based on specific words in message content
   clean_data_df = calc_score(clean_data_df)


   # Generate reports and plots
   plot_all_metrics(processed_data_dir)


   # Concat thread messages & save as CSV for prompting.
   # (Parentheses let the assignment span two lines without a syntax error.)
   thread_summary_df, avg_message_len, avg_message_len_user = (
       concat_thread_messages_df(clean_data_df, threads_data_file)
   )
   assert thread_summary_df.shape[0] == clean_data_df.thread_id.nunique()

Exploring the Data with Dashboards

From the processed data above, I built traditional dashboards:

  • Message volumes: One-off peaks for vendors such as Qdrant and Milvus (possibly due to marketing events).
  • User engagement: Bar charts of top users and scatterplots of response time vs. number of user turns show that, in general, more user turns mean higher satisfaction. However, satisfaction does not look correlated with response time: the dark scatterplot dots appear random with respect to the y-axis (response time). Maybe users are not in production, so their questions are not very urgent? Outliers exist, such as Qdrant and Chroma, which may have bot-driven anomalies.
  • Satisfaction trends: Around 70% of users appear happy to have any interaction. Data note: make sure to check emojis per vendor, since users sometimes respond with emojis instead of words! Examples: Qdrant and Chroma.

Image by author of aggregated, anonymized data. Top left: Charts show that Chroma has the highest message volume, followed by Qdrant, then Milvus. Top right: Top messaging users; Qdrant and Chroma possibly have bots (see the top bar in the top-messaging-users charts). Middle right: Scatterplots of response time vs. number of user turns show no correlation between the dark dots and the y-axis (response time). Satisfaction is usually higher along the x-axis (user turns), except for Chroma. Left: Bar charts of satisfaction levels; make sure you catch emoji-based feedback, see Qdrant and Chroma.
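For readers who want to reproduce the engagement scatterplot, the two per-thread quantities behind it can be computed roughly as follows. This is a sketch under assumptions, not my dashboard code: 'timestamp' is a hypothetical ISO-8601 field, and “response time” is taken to mean the delay between the first user message and the first reply.

```python
from datetime import datetime


def thread_metrics(messages):
    """Return {thread_id: (user_turns, first_response_seconds or None)}.

    Assumes messages are sorted by time. 'timestamp' is a hypothetical
    field name; 'thread_id' and 'role_name' follow the columns used earlier.
    """
    first_user, first_reply, user_turns = {}, {}, {}
    for msg in messages:
        tid = msg["thread_id"]
        ts = datetime.fromisoformat(msg["timestamp"])
        if msg["role_name"] == "user":
            user_turns[tid] = user_turns.get(tid, 0) + 1
            first_user.setdefault(tid, ts)      # keep only the first user message
        elif tid in first_user:
            first_reply.setdefault(tid, ts)     # keep only the first reply

    metrics = {}
    for tid, turns in user_turns.items():
        reply = first_reply.get(tid)
        wait = None
        if reply is not None:
            wait = (reply - first_user[tid]).total_seconds()
        metrics[tid] = (turns, wait)
    return metrics


# Toy data, made up for illustration only.
messages = [
    {"thread_id": 2, "role_name": "user", "timestamp": "2025-02-01T10:00:00"},
    {"thread_id": 2, "role_name": "assistant", "timestamp": "2025-02-01T10:05:00"},
    {"thread_id": 2, "role_name": "user", "timestamp": "2025-02-01T10:06:00"},
    {"thread_id": 3, "role_name": "user", "timestamp": "2025-02-01T11:00:00"},
]
metrics = thread_metrics(messages)
print(metrics)
```

Threads that never received a reply get None for the response time, so they can be dropped or plotted separately.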

LLM Prompting to Produce KNN Clusters

For prompting, the next step was to aggregate the data by thread_id. For LLMs, you need the texts concatenated together. I separated the user messages from the entire thread's messages, to see whether one or the other would produce better clusters. I ended up using just the user messages.

Example anonymized data for prompting. All message texts are concatenated together.
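The aggregation step can be sketched with a pandas groupby. The helper concat_thread_messages_df used in the script above is not shown in this post, so treat the following as one plausible way to do it rather than the actual implementation; the column names come from the scoring code and from Prompt #1, and the toy rows are made up.

```python
import pandas as pd

# Toy stand-in for the cleaned per-message DataFrame (values are made up).
df = pd.DataFrame({
    "thread_id": [2, 2, 3, 3, 3],
    "role_name": ["user", "assistant", "user", "assistant", "user"],
    "message_content": [
        "How do I filter results?",
        "Try the filter option.",
        "Import fails on startup.",
        "Which version?",
        "Latest. thanks",
    ],
})


def concat_messages(frame, column_name):
    """Join all messages of each thread into one string, keeping row order."""
    return (
        frame.groupby("thread_id")["message_content"]
        .agg(" ".join)
        .rename(column_name)
    )


# Full-thread text next to user-only text, one row per thread.
all_text = concat_messages(df, "message_content")
user_text = concat_messages(df[df["role_name"] == "user"], "message_content_user")
thread_summary_df = all_text.to_frame().join(user_text).reset_index()

# CSV text that can be appended to a prompt.
csv_data = thread_summary_df.to_csv(index=False)
print(csv_data)
```

Keeping both columns side by side makes it easy to compare clusters built from whole threads against clusters built from user messages only.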

With a CSV file for prompting, you're ready to start using an LLM to do data science!

!pip install -q google-generativeai
import os
import google.generativeai as genai


# Get API key from local system
api_key=os.environ.get("GOOGLE_API_KEY")


# Configure API key
genai.configure(api_key=api_key)


# List all the model names
for m in genai.list_models():
   if 'generateContent' in m.supported_generation_methods:
       print(m.name)


# Try different models and prompts
GEMINI_MODEL_FOR_SUMMARIES = "gemini-2.0-pro-exp-02-05"
model = genai.GenerativeModel(GEMINI_MODEL_FOR_SUMMARIES)
# Combine the prompt and CSV data.
# `prompt` is one of the prompt strings listed below; `csv_data` is the text
# of the CSV file prepared for prompting (neither is defined in this excerpt).
full_input = prompt + "\n\nCSV Data:\n" + csv_data
# Inference call to Gemini LLM
response = model.generate_content(full_input)


# Save response.text as .json file...


# Check token counts and compare to model limit: 2 million tokens
print(response.usage_metadata)

Image by author. Top: Example LLM model names. Bottom: Example usage metadata from the Gemini LLM. Token counts: prompt_token_count = input tokens; candidates_token_count = output tokens; total_token_count = total number of tokens used.

Unfortunately, the Gemini API kept cutting response.text short. I had better luck using AI Studio directly.

Image by author: Screenshot of example outputs from Google AI Studio.

My 5 prompts to Gemini Flash & Pro (temperature set to 0) are below.

Prompt #1: Get thread summaries:

Given this .csv file, per row, add 3 columns:
– thread_summary = 205 characters or less summary of the row's column 'message_content'
– user_thread_summary = 126 characters or less summary of the row's column 'message_content_user'
– thread_topic = 3-5 word super-high-level category
Make sure the summaries capture the main content without losing too much detail. Make user thread summaries straight to the point, capturing the main content without losing too much detail; skip the intro text. If a shorter summary is good enough, prefer the shorter summary. Make sure the topic is general enough that there are fewer than 20 high-level topics for all the data. Prefer fewer topics. Output JSON columns: thread_id, thread_summary, user_thread_summary, thread_topic.

Prompt #2: Get cluster stats:

Given this CSV file of messages, use column = 'user_thread_summary' to perform semantic clustering of all the rows. Use technique = Silhouette, with linkage method = ward, and distance_metric = cosine similarity. Just give me the stats for the Silhouette analysis method for now.

Prompt #3: Perform initial clustering:

Given this CSV file of messages, use column = 'user_thread_summary' to perform semantic clustering of all the rows into N = 6 clusters using the Silhouette method. Use column = “thread_topic” to summarize each cluster topic in 1-3 words. Output JSON with columns: thread_id, level0_cluster_id, level0_cluster_topic.

The silhouette score measures how similar an object is to its own cluster (cohesion) compared with other clusters (separation). Scores range from -1 to 1. A higher average silhouette score generally indicates better-defined, well-separated clusters. For more details, check out the scikit-learn silhouette score documentation.
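To show what the score actually computes, here is a small pure-Python sketch of the mean silhouette coefficient with cosine distance. It is for illustration only and is not what the LLM runs internally (that is unknown to me). For each point, a is the mean distance to the rest of its own cluster, b is the mean distance to the nearest other cluster, and the score is (b - a) / max(a, b). The function name and toy points are made up; in practice, scikit-learn's sklearn.metrics.silhouette_score does the same job.

```python
import math


def cosine_distance(u, v):
    """1 - cosine similarity for two non-zero vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    return 1.0 - dot / (norm_u * norm_v)


def mean_silhouette(points, labels, distance=cosine_distance):
    """Mean silhouette coefficient over all points (needs at least 2 clusters)."""
    clusters = {}
    for idx, label in enumerate(labels):
        clusters.setdefault(label, []).append(idx)

    scores = []
    for i, label in enumerate(labels):
        own = [j for j in clusters[label] if j != i]
        if not own:                      # singleton cluster: score is defined as 0
            scores.append(0.0)
            continue
        # a = cohesion: mean distance to the other members of i's cluster
        a = sum(distance(points[i], points[j]) for j in own) / len(own)
        # b = separation: mean distance to the nearest other cluster
        b = min(
            sum(distance(points[i], points[j]) for j in members) / len(members)
            for other, members in clusters.items()
            if other != label
        )
        denom = max(a, b)
        scores.append((b - a) / denom if denom > 0 else 0.0)
    return sum(scores) / len(scores)


# Two obvious groups of toy 2-D vectors (made up for illustration).
points = [(1.0, 0.1), (1.0, 0.2), (0.1, 1.0), (0.2, 1.0)]
good = mean_silhouette(points, [0, 0, 1, 1])   # labels match the groups
bad = mean_silhouette(points, [0, 1, 0, 1])    # labels cut across the groups
print(f"good split: {good:.2f}, bad split: {bad:.2f}")
```

A labeling that matches the natural groups scores close to 1, while one that cuts across them goes negative, which is why the score is useful for choosing the number of clusters.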

Applying it to the Chroma data. Below, I show the results from Prompt #2 as a plot of silhouette scores. I chose N = 6 clusters as a compromise between a high score and fewer clusters. Most LLMs used for data analysis these days take CSV as input and output JSON.

Image by author of aggregated, anonymized data. Left: I chose N = 6 clusters as a compromise between a higher score and fewer clusters. Right: The actual clusters using N = 6. The highest sentiment (highest scores) is for topics about Query. The lowest sentiment (lowest scores) is for topics about “Client Problems”.

From the plot above, you can see we are finally getting into the meat of what users are saying!

Prompt #4: Get hierarchical cluster stats:

Given this CSV file of messages, use column = 'thread_summary_user' to perform semantic clustering of all the rows into hierarchical (agglomerative) clustering with 2 levels. Use the silhouette score. What is the optimal number of Level0 and Level1 clusters? How many threads are in each Level1 cluster? Just give me the stats for now; we'll do the actual clustering later.

Prompt #5: Perform hierarchical clustering:

Accept this clustering with 2 levels. Add cluster topics that summarize the text column “thread_topic”. Cluster topics should be as short as possible without losing too much detail about the cluster's meaning.
– Level0 cluster topics ~ 1-3 words.
– Level1 cluster topics ~ 2-5 words.
Output JSON with columns: thread_id, level0_cluster_id, level0_cluster_topic, level1_cluster_id, level1_cluster_topic.
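Since LLM output can be truncated or inconsistent, it is worth sanity-checking the returned JSON before plotting it. The sketch below is an optional extra, not part of my original workflow: it assumes the model returns a JSON array of row objects with the columns requested in Prompt #5, and the function name and sample rows are made up.

```python
import json
from collections import Counter, defaultdict

REQUIRED = {
    "thread_id", "level0_cluster_id", "level0_cluster_topic",
    "level1_cluster_id", "level1_cluster_topic",
}


def check_hierarchy(json_text, expected_thread_ids):
    """Parse the LLM's JSON rows and sanity-check the 2-level structure."""
    rows = json.loads(json_text)

    for row in rows:
        missing = REQUIRED - set(row)
        if missing:
            raise ValueError(f"row {row} is missing columns: {sorted(missing)}")

    # Every input thread should come back exactly once.
    returned = Counter(row["thread_id"] for row in rows)
    duplicated = any(count > 1 for count in returned.values())
    if set(returned) != set(expected_thread_ids) or duplicated:
        raise ValueError("thread_ids in the output do not match the input")

    # Each level1 cluster should sit under exactly one level0 parent.
    parents = defaultdict(set)
    for row in rows:
        parents[row["level1_cluster_id"]].add(row["level0_cluster_id"])
    conflicting = {k: v for k, v in parents.items() if len(v) > 1}
    if conflicting:
        raise ValueError(f"level1 clusters with several parents: {conflicting}")

    # Threads per level1 cluster, the stat that Prompt #4 asks about.
    return Counter(row["level1_cluster_id"] for row in rows)


# Toy response, made up for illustration only.
sample = json.dumps([
    {"thread_id": 1, "level0_cluster_id": 0, "level0_cluster_topic": "Query",
     "level1_cluster_id": 0, "level1_cluster_topic": "Slow queries"},
    {"thread_id": 2, "level0_cluster_id": 0, "level0_cluster_topic": "Query",
     "level1_cluster_id": 0, "level1_cluster_topic": "Slow queries"},
    {"thread_id": 3, "level0_cluster_id": 1, "level0_cluster_topic": "Client",
     "level1_cluster_id": 1, "level1_cluster_topic": "Client install errors"},
])
counts = check_hierarchy(sample, expected_thread_ids=[1, 2, 3])
print(counts)
```

A missing thread_id is a quick signal that the response was cut short, which is the failure I ran into with the API.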

I also prompted it to generate Streamlit code to visualize the clusters (since I'm not a JS expert 😄). The results for the same Chroma data are shown below.

Image by author of aggregated, anonymized data. Left image: Each scatterplot dot is a thread with hover info. Right image: Hierarchical clustering with raw-data drill-down. API and package errors look like the most urgent topic for Chroma to fix, because sentiment is low and message volume is high.

I found this very insightful. For Chroma, clustering revealed that while users were happy with topics like Query, Distance, and Performance, they were unhappy with areas such as Data, Client, and Deployment.

Experimenting with Custom Embeddings

I repeated the clustering prompts above, using the numerical embeddings (“user_embedding”) in the CSV instead of the raw text (“user_text”). I've explained embeddings in detail in previous blogs, along with the risk of overfit models on leaderboards. OpenAI has reliable embeddings that are very inexpensive per API call. Below is an example code snippet showing how to create the embeddings.

import os

from openai import OpenAI


EMBEDDING_MODEL = "text-embedding-3-small"
EMBEDDING_DIM = 512 # 512 or 1536 possible


# Initialize client with API key
openai_client = OpenAI(
   api_key=os.environ.get("OPENAI_API_KEY"),
)


# Function to create embeddings
def get_embedding(text, embedding_model=EMBEDDING_MODEL,
                 embedding_dim=EMBEDDING_DIM):
   response = openai_client.embeddings.create(
       input=text,
       model=embedding_model,
       dimensions=embedding_dim
   )
   return response.data[0].embedding


# Function to call per pandas df row in .apply()
def generate_row_embeddings(row):
   return {
       'user_embedding': get_embedding(row['user_thread_summary']),
   }


# Generate embeddings using pandas apply
embeddings_data = df.apply(generate_row_embeddings, axis=1)
# Add embeddings back into df as separate columns
df['user_embedding'] = embeddings_data.apply(lambda x: x['user_embedding'])
display(df.head())


# Save as CSV ...

Example data for prompting. The column “user_embedding” is an array of length 512 of floating-point numbers.
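One practical detail when saving embeddings to CSV: a list-valued column is typically written out as a string such as "[0.1, -0.2, ...]", so it has to be parsed again when the file is read back. A minimal stdlib sketch, assuming the column is named user_embedding as above; the toy vectors have 4 dimensions instead of 512, and the helper name is made up.

```python
import csv
import io
import json

EMBEDDING_DIM = 4  # toy size for this example; the post uses 512

# Toy CSV text shaped like the file described above (values are made up).
csv_text = (
    "thread_id,user_embedding\n"
    '2,"[0.1, -0.2, 0.3, 0.4]"\n'
    '3,"[0.0, 0.5, -0.5, 0.1]"\n'
)


def load_embeddings(text, dim):
    """Return {thread_id: list of floats}, checking each vector's length."""
    embeddings = {}
    for row in csv.DictReader(io.StringIO(text)):
        vector = json.loads(row["user_embedding"])   # string -> list
        if len(vector) != dim:
            raise ValueError(
                f"thread {row['thread_id']}: expected {dim} values, got {len(vector)}"
            )
        embeddings[int(row["thread_id"])] = [float(x) for x in vector]
    return embeddings


embeddings = load_embeddings(csv_text, EMBEDDING_DIM)
print(embeddings[2])
```

Checking the vector length on load catches rows that were truncated or mangled on the way through the CSV.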

Interestingly, both Perplexity Pro and Gemini 2.0 Pro sometimes hallucinated cluster topics (e.g., misclassifying a question about slow queries as “Personal Matter”).

Conclusion: When doing NLP with prompts, let the LLM generate its own embeddings; externally generated embeddings seem to confuse the model.

Image by author of aggregated, anonymized data. Both Perplexity Pro and Gemini 1.5 Pro hallucinated cluster topics when given an externally generated embedding column. Conclusion: when doing NLP with prompts, just keep the raw text and let the LLM create its own embeddings behind the scenes. Feeding in externally generated embeddings seems to confuse the LLM!

Clustering Across Multiple Discord Servers

Finally, I broadened the analysis to include Discord messages from three different VectorDB vendors. The resulting visualization highlighted common issues, such as both Milvus and Chroma facing authentication problems.

Image by author of aggregated, anonymized data. One thing that stands out is that both Milvus and Chroma have trouble with authentication.

Summary

Here's a summary of the steps I followed to perform semantic clustering using LLM prompts:

  1. Extract Discord threads.
  2. Format the data into conversation turns with roles (“user”, “assistant”).
  3. Score sentiment and save as CSV.
  4. Prompt Google Gemini 2.0 Flash for thread summaries.
  5. Prompt Perplexity Pro or Gemini 2.0 Pro for clustering based on the thread summaries, using the same CSV.
  6. Prompt Perplexity Pro or Gemini 2.0 Pro to write Streamlit code to visualize the clusters (because I'm not a JS expert 😆).

By following these steps, you can quickly transform raw forum data into actionable insights. What used to take days of coding can now be done in a single afternoon!

References

  1. Clio: Privacy-Preserving Insights into Real-World AI Use
  2. Anthropic blog about Clio
  3. Milvus Discord Server, last accessed February 7, 2025
    Chroma Discord Server, last accessed February 7, 2025
    Qdrant Discord Server, last accessed February 7, 2025
  4. Gemini models
  5. Blog about Gemini 2.0 models
  6. Scikit-learn Silhouette Score
  7. OpenAI Matryoshka embeddings
  8. Streamlit
