
Advanced Topic Modeling with LLMs

This article is a continuation of Topic Modeling Open-Source Intelligence (OSINT) from the OpenAlex API. In the previous article, I gave an introduction to topic modeling, the data used, and a traditional NLP approach using Latent Dirichlet Allocation (LDA).

See the previous article here:

This article takes a more advanced approach to topic modeling, using representation models, generative AI, and other advanced techniques. We leverage the power of BERTopic to bring several models together in a single pipeline, visualize our topics, and explore variations of topic models.

Image by author

The BERTopic pipeline

Using a traditional approach to topic modeling can be difficult: you need to build your own pipeline to clean your data, tokenize it, and create features. Traditional models such as LDA or LSA are also computationally expensive.

BERTopic leverages transformer architectures through embedding models and adds other components, such as dimensionality reduction and topic representation models, to create high-performing topic models. BERTopic also provides variations of topic models to fit different data and use cases, visualizations to explore results, and more.

Image by author

The biggest advantage of BERTopic is its modularity. As seen above, the pipeline is made up of several different models:

  1. Embedding model
  2. Dimensionality reduction model
  3. Clustering model
  4. Tokenizer
  5. Weighting scheme
  6. Representation model (optional)

Therefore, we can experiment with different models in each component, each with its own parameters. For example, we can try different embedding models, switch the dimensionality reduction from PCA to UMAP, or fine-tune the parameters of our clustering model. This is a major advantage because it lets us fit the topic model to our data and use case.


First, we need to import the necessary modules. Most of these are used to build the components of our BERTopic model.

#import packages for data management
import pickle

#import packages for topic modeling
from bertopic import BERTopic
from bertopic.representation import KeyBERTInspired
from bertopic.vectorizers import ClassTfidfTransformer
from sentence_transformers import SentenceTransformer
from umap.umap_ import UMAP
from hdbscan import HDBSCAN
from sklearn.feature_extraction.text import CountVectorizer

#import packages for data manipulation and visualization
import pandas as pd
import matplotlib.pyplot as plt
from scipy.cluster import hierarchy as sch

Embedding model

The core component of the BERTopic model is the embedding model. First, we initialize the model using a sentence transformer. You can specify whichever embedding model you would like to use.

In this case, I am using a small model (~30 million parameters). While we could get better results with larger embedding models, I decided to use a small model to emphasize speed in this pipeline. You can find and compare embedding models by size, performance, intended use, etc. using the MTEB leaderboard from Hugging Face (https://huggingface.co/spaces/mteb/leaderboard).

#initialize embedding model
embedding_model = SentenceTransformer('thenlper/gte-small')

#calculate embeddings
embeddings = embedding_model.encode(data['all_text'].tolist(), show_progress_bar=True)

Once we have run our model, we can use the .shape attribute to see the size of the vectors produced. Below, we see that each embedding contains 384 values that capture the meaning of each document.

#investigate shape and size of vectors
embeddings.shape

#output: (6102, 384)

Dimensionality reduction model

The next component of the BERTopic model is the dimensionality reduction model. Because high-dimensional data can be difficult to model, we can use a dimensionality reduction model to represent the embeddings in a lower-dimensional space without losing too much information.

Image by author

There are several different types of dimensionality reduction algorithms, with Principal Component Analysis (PCA) being the most popular. In this case, we will use a Uniform Manifold Approximation and Projection (UMAP) model. UMAP is a non-linear model and is likely to capture the complex relationships in our data better than PCA.

#initialize dimensionality reduction model and reduce embeddings
umap_model = UMAP(n_neighbors=5, min_dist=0.0, metric='cosine', random_state=42)
reduced_embeddings = umap_model.fit_transform(embeddings)

It is important to note that dimensionality reduction is not a solve-all for high-dimensional data. It presents a tradeoff between speed and accuracy, because information is lost. These models need to be thought out carefully and experimented with to avoid losing too much information while maintaining speed and scalability.

Clustering model

The third step is to use the reduced embeddings to create clusters. While clustering is not strictly required for topic modeling, we can leverage density-based clustering models to isolate outliers and eliminate noise in our data. Below, we initialize the Hierarchical Density-Based Spatial Clustering of Applications with Noise (HDBSCAN) model and create our clusters.
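One rough way to put a number on the information lost is to measure how much variance a linear projection keeps. The sketch below uses PCA (computed with NumPy's SVD) as a stand-in, because UMAP has no directly comparable measure. The toy data and the helper name `explained_variance_ratio` are assumptions for illustration, not part of the original pipeline.

```python
import numpy as np

def explained_variance_ratio(X: np.ndarray, n_components: int) -> float:
    """Share of total variance kept by the top principal components."""
    centered = X - X.mean(axis=0)
    # Squared singular values of the centered data are proportional
    # to the variance along each principal component.
    singular_values = np.linalg.svd(centered, compute_uv=False)
    variances = singular_values ** 2
    return float(variances[:n_components].sum() / variances.sum())

# Toy stand-in for real embeddings: 200 points in 10 dimensions whose
# variance is concentrated in the first two dimensions (an assumption).
rng = np.random.default_rng(42)
scales = np.array([10.0, 5.0] + [0.1] * 8)
toy_embeddings = rng.normal(size=(200, 10)) * scales

kept = explained_variance_ratio(toy_embeddings, n_components=2)
print(f"Variance kept by 2 components: {kept:.3f}")
```

On real embeddings the variance is usually spread over many more dimensions, so a check like this can help decide how aggressively to reduce.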

#initialize clustering model and cluster
hdbscan_model = HDBSCAN(min_cluster_size=30, metric='euclidean', cluster_selection_method='eom').fit(reduced_embeddings)
clusters = hdbscan_model.labels_

A density-based approach gives us a few advantages. Documents are not forced into clusters they do not belong to, which isolates outliers and reduces noise in our data. Also, unlike centroid-based models, we do not specify the number of clusters, and the clusters are more likely to be well defined.

See my guide to clustering algorithms:

See the code below to visualize the results of the clustering model.
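Since HDBSCAN marks outliers with the label -1, a quick summary of the labels shows how many clusters were found and how much of the data was set aside as noise. This is a minimal sketch on hypothetical labels; in the article's pipeline the input would be `hdbscan_model.labels_`, and the helper name `summarize_clusters` is made up for illustration.

```python
import numpy as np

def summarize_clusters(labels):
    """Return (number of clusters, share of outliers, size per cluster)."""
    labels = np.asarray(labels)
    # HDBSCAN assigns the label -1 to points it treats as noise.
    outlier_share = float(np.mean(labels == -1))
    ids, counts = np.unique(labels[labels != -1], return_counts=True)
    sizes = {int(i): int(c) for i, c in zip(ids, counts)}
    return len(sizes), outlier_share, sizes

# Hypothetical labels standing in for hdbscan_model.labels_
toy_labels = [0, 0, 1, -1, 1, 1, -1, 2, 0, 2]
n_clusters, outlier_share, sizes = summarize_clusters(toy_labels)
print(n_clusters, outlier_share, sizes)
```

A very high outlier share can be a hint that `min_cluster_size` or the UMAP settings deserve another look.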

#create dataframe of reduced embeddings and clusters
df = pd.DataFrame(reduced_embeddings, columns = ['x', 'y'])
df['Cluster'] = [str(c) for c in clusters]

#split between clusters and outliers
to_plot = df.loc[df.Cluster != '-1', :]
outliers = df.loc[df.Cluster == '-1', :]

#plot clusters
plt.scatter(outliers.x, outliers.y, alpha = 0.05, s = 2, c = 'grey')
plt.scatter(to_plot.x, to_plot.y, alpha = 0.6, s = 2, c = to_plot.Cluster.astype(int), cmap = 'tab20b')
plt.axis('off')
Image by author

We can see distinct, well-defined clusters that do not overlap. We can also see some smaller clusters grouping together to form higher-level topics. Lastly, we can see that several documents are greyed out and identified as outliers.


Building the BERTopic pipeline

We now have the components needed to build our BERTopic pipeline (embedding model, dimensionality reduction model, clustering model). We can take the models we initialized and fit them to our data using the BERTopic class.

#combine the models above into a BERTopic pipeline and fit it
topic_model = BERTopic(
  embedding_model=embedding_model,          # Step 1 - Extract embeddings
  umap_model=umap_model,                    # Step 2 - Reduce dimensionality
  hdbscan_model=hdbscan_model,              # Step 3 - Cluster reduced embeddings
  verbose = True).fit(data['all_text'].tolist(), embeddings)

Since I know I pulled papers about human-machine interfaces (augmented reality, virtual reality), let's see which topics align with the term “augmented reality”.

#topics most similar to 'augmented reality'
topic_model.find_topics("augmented reality")

#output: ([18, 3, 16, 24, 12], [0.9532771, 0.9498462, 0.94966936, 0.9451431, 0.9417263])

From the output above, we see that topics 18, 3, 16, 24, and 12 align most closely with the term “augmented reality”. All of these topics should (hopefully) contribute to the broader theme of augmented reality, with each one covering a different aspect.
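The output is a pair of parallel lists, topic ids and similarity scores. The short sketch below pairs them up using the values copied from the output above. The variable names are chosen for illustration only.

```python
# Values copied from the article's output: topic ids and similarity scores
topics, similarities = (
    [18, 3, 16, 24, 12],
    [0.9532771, 0.9498462, 0.94966936, 0.9451431, 0.9417263],
)

# Pair each topic id with its score; the lists are already sorted by score
ranked = list(zip(topics, similarities))
best_topic, best_score = ranked[0]
for topic_id, score in ranked:
    print(f"topic {topic_id:>2}: similarity {score:.3f}")

# Gap between the best and the fifth-best match
spread = similarities[0] - similarities[-1]
print(f"spread between first and last: {spread:.4f}")
```

The scores are very close to one another, which is one reason to inspect the topic representations rather than rely on the ranking alone.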

To confirm this, let's investigate the topic representations. A topic representation is a list of terms that aims to capture the underlying theme of the topic. For example, the words “cake”, “candles”, “family”, and “presents” may together represent the topic of birthdays or birthday parties.

We can use the get_topic() function to investigate the representation of topic 18.

#investigate topic 18
topic_model.get_topic(18)
Image by author

In the representation above, we see some useful words such as “reality”, “virtual”, “visual”, “augmented”, etc. The representation as a whole is not yet very informative, however, because BERTopic uses bag of words as the default way to represent topics. This representation may also match other representations about augmented reality.

Next, we will improve our BERTopic pipeline to create more meaningful topic representations that give us more insight into these themes.


Improving topic representations

We can improve the topic representations by adding a weighting scheme, which highlights the most important terms and better differentiates our topics.

This does not replace the bag-of-words model, but improves upon it. Below we add a TF-IDF model (BERTopic's class-based variant, ClassTfidfTransformer) to better determine the importance of each term, together with a CountVectorizer that removes English stop words. We use the update_topics() function to update our pipeline.

#initialize tokenizer model
vectorizer_model = CountVectorizer(stop_words="english")

#initialize ctfidf model to weight terms
ctfidf_model = ClassTfidfTransformer()

#add tokenizer and ctfidf to pipeline
topic_model.update_topics(data['all_text'].tolist(), vectorizer_model=vectorizer_model, ctfidf_model=ctfidf_model)
#investigate how topic representations have changed
topic_model.get_topic(18)
Image by author

With TF-IDF, these topic representations are much more useful. We can see that the meaningless stop words are gone, that other words appear that help describe the topic, and that the words are reordered by their importance.
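To see why this weighting pushes generic words down, here is a minimal NumPy sketch of class-based TF-IDF as described in the BERTopic documentation: each term's frequency within a topic is multiplied by log(1 + A / f), where A is the average number of words per topic and f is the term's frequency across all topics. The toy vocabulary and counts are assumptions, and this is an illustration of the formula, not the library's implementation.

```python
import numpy as np

def c_tf_idf(counts: np.ndarray) -> np.ndarray:
    """counts[c, t] = occurrences of term t in all documents of topic c."""
    tf = counts / counts.sum(axis=1, keepdims=True)  # term frequency per topic
    avg_words = counts.sum(axis=1).mean()            # A: average words per topic
    term_freq = counts.sum(axis=0)                   # f: term frequency over all topics
    idf = np.log(1 + avg_words / term_freq)
    return tf * idf

# Toy vocabulary and counts (hypothetical)
vocab = ["the", "reality", "gene"]
counts = np.array([
    [50, 30, 0],   # topic 0
    [50, 0, 30],   # topic 1
], dtype=float)

weights = c_tf_idf(counts)
top_by_count = [vocab[i] for i in counts.argmax(axis=1)]
top_by_weight = [vocab[i] for i in weights.argmax(axis=1)]
print("top term by raw count:", top_by_count)
print("top term by c-TF-IDF: ", top_by_weight)
```

By raw count, “the” leads both topics; after weighting, the topic-specific term leads each one.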

But we don't have to stop here. Thanks to the countless recent developments in AI and NLP, there are methods we can use to fine-tune these representations.

To fine-tune, we can take one of two approaches:

  1. A representation model
  2. A generative model

Fine-tuning with a representation model

First, let's add the KeyBERTInspired model as our representation model. It leverages BERT to compare the semantic similarity of the TF-IDF representations with the documents themselves, to better determine the relevance of each term rather than its importance.

See all representation model options here:

#initialize representation model and add to pipeline
representation_model = KeyBERTInspired()
topic_model.update_topics(data['all_text'].tolist(), vectorizer_model=vectorizer_model, ctfidf_model=ctfidf_model, representation_model=representation_model)
Image by author

Here, we see a fairly major change in the terms, with some additional terms and acronyms. Comparing this with the TF-IDF representation, we again get a better understanding of what this topic is about. Also notice that the scores changed from TF-IDF weights, which mean little without context, to scores between 0 and 1. These new scores represent similarity scores.
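The idea of ranking terms by relevance can be sketched as a cosine-similarity re-ranking of candidate words against a vector that represents the topic. The three-dimensional vectors below are made up for illustration; in practice both the words and the documents would be embedded with the same embedding model, and the library's own procedure involves more steps than this.

```python
import numpy as np

def cosine_similarity(vector: np.ndarray, matrix: np.ndarray) -> np.ndarray:
    """Cosine similarity between one vector and every row of a matrix."""
    vector = vector / np.linalg.norm(vector)
    matrix = matrix / np.linalg.norm(matrix, axis=1, keepdims=True)
    return matrix @ vector

# Hypothetical candidate words and toy embeddings
candidates = ["reality", "virtual", "study"]
word_vectors = np.array([
    [0.9, 0.1, 0.0],
    [0.8, 0.3, 0.1],
    [0.1, 0.2, 0.9],
])
# Hypothetical topic vector, e.g. an average of representative document embeddings
topic_vector = np.array([1.0, 0.2, 0.0])

scores = cosine_similarity(topic_vector, word_vectors)
reranked = sorted(zip(candidates, scores), key=lambda pair: pair[1], reverse=True)
for word, score in reranked:
    print(f"{word}: {score:.3f}")
```

Cosine similarity can in general be negative; the toy vectors here are non-negative, so the scores fall between 0 and 1.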

Topic model visualizations

Before we move on to generative models for fine-tuning, let's explore some of the visualizations that BERTopic offers. Visualizing topic models is critical for understanding your data and how the model is working.

First, we can visualize our topics in a 2-dimensional space, which lets us see the size of the topics and which other topics are similar. Below, we can see that we have many topics, with clusters of topics making up larger themes. We can also see one topic that is large and isolated, indicating that there is a lot of similar research regarding CRISPR.

Image by author

Let's zoom in on these clusters of topics to see how they break down higher-level themes. Below, we zoom in on topics regarding virtual and augmented reality and see how each topic covers different domains and applications.

Image by author
Image by author

We can also visualize the most important or most relevant words in each topic. Again, this depends on your approach to topic representations.

Image by author

We can also use a heatmap to explore the similarity between topics.

Image by author

These are just a few of the visualizations that BERTopic offers. See the full list here:

Generative models

For our last step in fine-tuning our topic representations, we can leverage generative AI to produce representations that are coherent descriptions of the topic.

BERTopic offers a simple way to use OpenAI's GPT models to interact with the topic model. We first write a prompt that shows the model the documents and the current representations of the topics. We then ask it to generate a short label for each topic.

We then initialize the client and model, and update our pipeline.

import openai
from bertopic.representation import OpenAI

#prompt for GPT to create topic labels
prompt = """
I have a topic that contains the following documents:
[DOCUMENTS]

The topic is described by the following key words: [KEYWORDS]

Based on the information above, extract a short topic label in the following format:
topic: 
"""

#import GPT
client = openai.OpenAI(api_key='API KEY')

#add GPT as representation model
representation_model = OpenAI(client, model = 'gpt-3.5-turbo', exponential_backoff=True, chat=True, prompt=prompt)
topic_model.update_topics(data['all_text'].tolist(), representation_model=representation_model)

Now, let's return to the augmented reality topic.

#investigate how topic representations have changed
topic_model.get_topic(18)

#output: [('Comparative analysis of virtual and augmented reality for immersive analytics',1)]

The topic representation now reads “Comparative analysis of virtual and augmented reality for immersive analytics”. The topic is now much clearer, as we can see the objectives, technologies, and domain covered in these documents.
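The prompt used above relies on two placeholders, [DOCUMENTS] and [KEYWORDS], which are filled in for each topic before the request is sent. The sketch below imitates that substitution with plain string replacement so you can preview what the model would see. The helper `fill_prompt` and the sample documents are hypothetical, and this is not BERTopic's internal code.

```python
def fill_prompt(template, documents, keywords):
    """Substitute sample documents and keywords into the prompt template."""
    doc_block = "\n".join(f"- {doc}" for doc in documents)
    return (
        template
        .replace("[DOCUMENTS]", doc_block)
        .replace("[KEYWORDS]", ", ".join(keywords))
    )

# Template text copied from the article's prompt
template = """
I have a topic that contains the following documents:
[DOCUMENTS]

The topic is described by the following key words: [KEYWORDS]

Based on the information above, extract a short topic label in the following format:
topic:
"""

# Hypothetical representative documents and keywords for one topic
filled = fill_prompt(
    template,
    documents=["A study of augmented reality headsets.", "Virtual reality for data analysis."],
    keywords=["virtual", "augmented", "reality"],
)
print(filled)
```

Previewing a filled prompt like this is a cheap way to check its length and wording before spending API calls.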

Below is the full list of our new topic representations.

Image by author

It doesn't take much code to see how powerful generative AI is in supporting our topic model and its representations. It is critically important to dig deeper and validate these outputs as you build your model, and to do plenty of experimentation with different models, parameters, and approaches.

Leveraging topic model variations

Finally, BERTopic provides several variations of topic models to offer solutions for different data and use cases. These include time-series, hierarchical, supervised, semi-supervised, and many more.

See the full list and documentation here:

Let's quickly explore one of these possibilities with hierarchical topic modeling. Below, we create a linkage function using SciPy, which establishes the distances between our topics. We can easily fit it to our data and visualize the hierarchy of topics.

#create linkages between topics
linkage_function = lambda x: sch.linkage(x, 'single', optimal_ordering=True)
hierarchical_topics = topic_model.hierarchical_topics(data['all_text'].tolist(), linkage_function=linkage_function)

#visualize topic model hierarchy
topic_model.visualize_hierarchy(hierarchical_topics=hierarchical_topics)
Image by author

In the visualization above, we can see how topics roll up together to create broader and broader topics. For example, we see topics 25 and 30 come together to form “Smart cities and sustainable development”. This model provides a powerful ability to zoom in and out and decide how broad or narrow we would like our topics to be.
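The zooming in and out described above corresponds to cutting the linkage tree at different heights. The following sketch applies the same SciPy single-linkage call to four made-up topic vectors and cuts the tree at three distance thresholds. The vectors and thresholds are assumptions for illustration, not values from the article's model.

```python
import numpy as np
from scipy.cluster import hierarchy as sch

# Hypothetical 2-d topic vectors: two pairs of closely related topics
toy_topic_vectors = np.array([
    [0.0, 0.0],   # topic 0
    [0.1, 0.0],   # topic 1
    [5.0, 5.0],   # topic 2
    [5.1, 5.0],   # topic 3
])

# Same linkage settings as the linkage function used in the article
Z = sch.linkage(toy_topic_vectors, 'single', optimal_ordering=True)

# Cutting the tree at increasing distances gives broader and broader groups
narrow = sch.fcluster(Z, t=0.05, criterion='distance')
medium = sch.fcluster(Z, t=1.0, criterion='distance')
broad = sch.fcluster(Z, t=10.0, criterion='distance')

for name, labels in [("narrow", narrow), ("medium", medium), ("broad", broad)]:
    print(name, len(set(labels)), "group(s):", labels.tolist())
```

A low threshold keeps every topic separate, a middle one merges the two related pairs, and a high one collapses everything into a single theme.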


Conclusion

In this article, we saw the power of BERTopic for topic modeling. BERTopic's use of transformers and embedding models dramatically improves results over traditional approaches. The BERTopic pipeline also offers both power and modularity, combining several models and letting you plug in other models to fit your data. All of these models can be fine-tuned and put together to create a powerful topic model.

You can also integrate representation and generative models to improve the topic representations and their interpretability. BERTopic also offers many visualizations to explore your data and validate your model. Finally, BERTopic offers several variations of topic modeling, such as time-series or hierarchical topic modeling, to better fit your use case.


I hope you enjoyed my article! Please feel free to comment, ask questions, or request other topics.

Connect with me on LinkedIn:
