LyRec: Lyrics Song Recommendation 🎶 | by Sujan Dutta | January, 2025

Dataset

Of course, the first thing I needed was a dataset of lyrics. Luckily, I found one on Kaggle! This dataset is licensed under a Creative Commons (CC0: Public Domain) license.

This dataset contains about 60K song lyrics along with the title and artist name. I know 60K may not cover all your favorite songs, but I think it's a good starting point for LyRec.

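import pandas as pd

songs_df = pd.read_csv(f"{root_dir}/spotify_millsongdata.csv")  # root_dir: data folder
songs_df = songs_df.drop(columns=["link"])  # the link column is not needed
songs_df["song_id"] = songs_df.index + 1  # 1-based ID for each song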

I didn't need to do any pre-processing on this data. I just removed the link column and added an ID for each song.

Models

I needed to choose two LLMs: one for computing the embeddings and one for generating the song summaries. Choosing the right LLM for your task can be tricky because of the sheer variety! It's a good idea to check the leaderboards to find the current best ones. For the embedding model, I checked the MTEB leaderboard hosted by Hugging Face.

I wanted a smaller model (obviously!) without compromising too much accuracy; therefore, I decided on gte-Qwen2-1.5B-instruct.

from sentence_transformers import SentenceTransformer
import torch

model = SentenceTransformer(
    "Alibaba-NLP/gte-Qwen2-1.5B-instruct",
    model_kwargs={"torch_dtype": torch.float16}
)

For summarization, I just needed a small enough instruction-following LLM, so I went with Gemma-2-2b-it. In my experience, it is one of the best small models right now.

import torch
from transformers import pipeline

pipe = pipeline(
    "text-generation",
    model="google/gemma-2-2b-it",
    model_kwargs={"torch_dtype": torch.bfloat16},
    device="cuda",
)

Pre-computing the Embeddings

Computing the lyrics embeddings was straightforward. I just used the .encode(…) method with a batch_size of 32 for faster processing.

import numpy as np

song_lyrics = songs_df["text"].values

# Encode all lyrics in batches of 32
lyrics_embeddings = model.encode(
    song_lyrics,
    batch_size=32,
    show_progress_bar=True
)

np.save(f"{root_dir}/60k_song_lyrics_embeddings.npy", lyrics_embeddings)

At this point, I saved these embeddings in a .npy file. I could have used a more structured format, but it did the job for me.

Coming to the summary embeddings, I first needed to create the summaries. I had to make sure each summary captured the mood and theme of the song while not being too long. So, I came up with the following prompt for Gemma-2.

You are an expert song summarizer. 
You will be given the full lyrics to a song.
Your task is to write a concise, cohesive summary that
captures the central emotion, overarching theme, and
narrative arc of the song in 150 words.

{song lyrics}

Here is the code snippet to generate the summaries. For simplicity, the following shows sequential processing. I have included the batch-processing version in the GitHub repo.

from tqdm import tqdm

tqdm.pandas()  # registers .progress_apply on pandas objects

def get_summary(song_lyrics):
    # The instruction and the lyrics go into a single user message
    messages = [
        {"role": "user",
         "content": f'''You are an expert song summarizer.
You will be given the full lyrics to a song.
Your task is to write a concise, cohesive summary that
captures the central emotion, overarching theme, and
narrative arc of the song in 150 words.\n\n{song_lyrics}'''},
    ]

    outputs = pipe(messages, max_new_tokens=256)
    # The last message of the returned chat is the model's reply
    assistant_response = outputs[0]["generated_text"][-1]["content"].strip()
    return assistant_response

songs_df["summary"] = songs_df["text"].progress_apply(get_summary)

Unsurprisingly, this step took a lot of time. Fortunately, it only needs to be done once, and again only when we want to update the database with new songs.
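The batch-processing version lives in the repo and is not reproduced here, so the following is only a minimal sketch of the chunking part, not the author's implementation. `iter_batches` is a hypothetical helper name; it simply splits a list of lyrics into fixed-size groups that could then be summarized together instead of one song at a time.

```python
def iter_batches(items, batch_size):
    """Yield consecutive slices of `items` with at most `batch_size` elements."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


# Hypothetical stand-ins for real lyrics
lyrics = ["song a", "song b", "song c", "song d", "song e"]
batches = list(iter_batches(lyrics, batch_size=2))
print(batches)
```

The last batch is allowed to be smaller than the others, so no song is dropped when the total is not a multiple of the batch size.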

Then, I computed and saved the embeddings just like last time.

song_summary = songs_df["summary"].values

summary_embeddings = model.encode(
    song_summary,
    batch_size=32,
    show_progress_bar=True
)

np.save(f"{root_dir}/60k_song_summary_embeddings.npy", summary_embeddings)

Vector Search

With the embeddings in place, it was time to implement semantic search based on embedding similarity. There are many wonderful open-source vector databases available for this task. I decided to use a lightweight one called FAISS (Facebook AI Similarity Search). It only takes two lines to add the embeddings to the database. First, we create the FAISS index. Here, we need to specify the similarity metric we want to use for the search and the dimension of the vectors. I used the dot product (inner product) as the similarity measure. Then, we add the embeddings to the index.

Note: Our database is small enough to perform an exhaustive search using the dot product. For larger databases, it is recommended to perform an approximate nearest neighbor (ANN) search. FAISS supports that as well.

import faiss
import numpy as np

lyrics_embeddings = np.load(f"{root_dir}/60k_song_lyrics_embeddings.npy")
lyrics_index = faiss.IndexFlatIP(lyrics_embeddings.shape[1])
lyrics_index.add(lyrics_embeddings.astype(np.float32))

summary_embeddings = np.load(f"{root_dir}/60k_song_summary_embeddings.npy")
summary_index = faiss.IndexFlatIP(summary_embeddings.shape[1])
summary_index.add(summary_embeddings.astype(np.float32))
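One thing worth keeping in mind with an inner-product index: the raw dot product equals cosine similarity only when the vectors have unit length. The article does not say whether the saved embeddings are normalized, so the sketch below only illustrates the relationship. Plain Python lists stand in for NumPy arrays, and the helper names are hypothetical.

```python
import math


def l2_normalize(vector):
    """Scale a vector to unit length; a zero vector is returned unchanged."""
    norm = math.sqrt(sum(x * x for x in vector))
    if norm == 0.0:
        return list(vector)
    return [x / norm for x in vector]


def inner_product(a, b):
    return sum(x * y for x, y in zip(a, b))


def cosine_similarity(a, b):
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return inner_product(a, b) / (norm_a * norm_b)


a, b = [3.0, 4.0], [4.0, 3.0]
raw_score = inner_product(a, b)  # grows with vector length
unit_score = inner_product(l2_normalize(a), l2_normalize(b))  # direction only
```

If the embeddings are not unit length, longer vectors get an advantage in the ranking, which may or may not be what you want.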

To find the most similar songs for a query, we first need to compute the embedding of the query and then call the .search(…) method on the index. Under the hood, this method calculates the similarity between the query and every entry in our database and returns the top k entries and their corresponding scores. Here is the code that performs a semantic search on the lyrics embeddings.

query_lyrics = 'Imagine the last song you fell in love with'
# Prefix the query with the task instruction before encoding
query_embedding = model.encode(
    f"Instruct: Given the lyrics, retrieve relevant songs\nQuery: {query_lyrics}"
)
query_embedding = query_embedding.reshape(1, -1).astype(np.float32)
lyrics_scores, lyrics_ids = lyrics_index.search(query_embedding, 10)

Note that I added a simple instruction prompt to the query. This is recommended for this model. The same applies to the summary embeddings.

k = 10  # number of results to return
query_description = 'Describe the type of song you wanna listen to'
query_embedding = model.encode(
    f"Instruct: Given a description, retrieve relevant songs\nQuery: {query_description}"
)
query_embedding = query_embedding.reshape(1, -1).astype(np.float32)
summary_scores, summary_ids = summary_index.search(query_embedding, k)
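The query strings in this article share one template: a task instruction, a newline, then the query itself. A small helper can keep that template in one place. This is just a sketch; `build_query` and the two constant names are hypothetical and not part of the original code.

```python
def build_query(task_description, query):
    """Return the instruction-prefixed text that is passed to the embedding model."""
    return f"Instruct: {task_description}\nQuery: {query}"


LYRICS_TASK = "Given the lyrics, retrieve relevant songs"
DESCRIPTION_TASK = "Given a description, retrieve relevant songs"

lyrics_query = build_query(LYRICS_TASK, "Imagine the last song you fell in love with")
print(lyrics_query)
```

The returned string is what would be handed to the embedding model's .encode(…) call for a query; the stored lyrics and summaries were encoded without any prefix.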

Pro tip: How do you do a sanity check?
Just use any entry from the database as the query and see if the search returns that same entry with the top score!
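To make the sanity check concrete, here is a minimal sketch that uses a brute-force inner-product search in plain Python instead of FAISS. The tiny `database` and the function name `top_k_inner_product` are made up for illustration; a real check would run the same idea against the actual index.

```python
def top_k_inner_product(query, vectors, k):
    """Exhaustively score `query` against every vector; return [(row, score), ...]."""
    scores = [sum(q * v for q, v in zip(query, vec)) for vec in vectors]
    ranked = sorted(range(len(vectors)), key=lambda i: scores[i], reverse=True)
    return [(i, scores[i]) for i in ranked[:k]]


# Hypothetical unit-length "embeddings" for three songs
database = [
    [1.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
    [0.6, 0.8, 0.0],
]

# Each entry, used as its own query, should come back as the top hit
self_hits = [top_k_inner_product(vec, database, k=1)[0][0] for vec in database]
sanity_check_passed = self_hits == list(range(len(database)))
```

With inner-product scoring, an entry is only guaranteed to be its own best match when all vectors have the same length (for example, unit length). With unnormalized vectors, a longer vector pointing in a similar direction can outscore it.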

Implementing the Features

At this stage, I had the building blocks of LyRec. Now, it was time to put these together. Remember the three goals I laid out earlier? Here's how I did that.

To keep things tidy, I created a class named LyRec that has a method for each feature. The first two features are very easy to implement.

The method .get_songs_with_similar_lyrics(…) takes a string (lyrics) and an integer k as input and returns a list of the k most similar songs based on lyrics similarity. Each element in the list is a dictionary containing the artist's name, song title, and lyrics.

Likewise, .get_songs_with_similar_description(…) takes a free-form text and an integer k as input and returns a list of the k most similar songs based on the description.

Here is the relevant code snippet.

class LyRec:
    def __init__(self, songs_df, lyrics_index, summary_index, embedding_model):
        self.songs_df = songs_df
        self.lyrics_index = lyrics_index
        self.summary_index = summary_index
        self.embedding_model = embedding_model

    def get_records_from_id(self, song_ids):
        # Index positions are 0-based while song_id starts at 1, hence the +1
        songs = []
        for _id in song_ids:
            songs.extend(self.songs_df[self.songs_df["song_id"]==_id+1].to_dict(orient='records'))
        return songs

    def get_songs_with_similar_lyrics(self, query_lyrics, k=10):
        query_embedding = self.embedding_model.encode(
            f"Instruct: Given the lyrics, retrieve relevant songs\nQuery: {query_lyrics}"
        ).reshape(1, -1).astype(np.float32)

        # search returns one row of results per query; we sent a single query
        scores, song_ids = self.lyrics_index.search(query_embedding, k)
        return self.get_records_from_id(song_ids[0])

    def get_songs_with_similar_description(self, query_description, k=10):
        query_embedding = self.embedding_model.encode(
            f"Instruct: Given a description, retrieve relevant songs\nQuery: {query_description}"
        ).reshape(1, -1).astype(np.float32)

        scores, song_ids = self.summary_index.search(query_embedding, k)
        return self.get_records_from_id(song_ids[0])
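The `_id+1` in get_records_from_id can look odd at first. The index numbers its vectors from 0 in the order they were added, while song_id was defined as the DataFrame index plus one. Here is a small sketch of that mapping with a made-up three-song table; it uses a dictionary lookup rather than the DataFrame filter from the class, purely for illustration.

```python
# Hypothetical miniature of the song table: row position 0 holds song_id 1, and so on
songs = [
    {"song_id": 1, "artist": "Artist A", "song": "First Song"},
    {"song_id": 2, "artist": "Artist B", "song": "Second Song"},
    {"song_id": 3, "artist": "Artist C", "song": "Third Song"},
]
songs_by_id = {record["song_id"]: record for record in songs}


def records_from_positions(positions):
    """Map 0-based index positions to records via the 1-based song_id."""
    return [songs_by_id[int(pos) + 1] for pos in positions]


# Positions as a search might return them, best match first
results = records_from_positions([2, 0])
```

The output keeps the order of the input positions, so the best match stays first.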

The last feature was a little trickier to implement. Remember that we first need to retrieve the top songs based on the lyrics and then re-rank them based on the text description. The first retrieval was easy. For the second one, we only need to consider the top-scoring songs from the first retrieval. I decided to create a temporary FAISS index of the summary embeddings of those top songs and then search it for the songs with the highest summary similarity scores. Here is my implementation.

    # Method of the LyRec class
    def get_songs_with_similar_lyrics_and_description(self, query_lyrics, query_description, k=10):
        query_lyrics_embedding = self.embedding_model.encode(
            f"Instruct: Given the lyrics, retrieve relevant songs\nQuery: {query_lyrics}"
        ).reshape(1, -1).astype(np.float32)

        # Stage 1: retrieve the 500 best matches by lyrics similarity
        scores, song_ids = self.lyrics_index.search(query_lyrics_embedding, 500)
        top_k_indices = song_ids[0]

        # Fetch the stored summary embedding of each candidate
        summary_candidates = []
        for idx in top_k_indices:
            emb = self.summary_index.reconstruct(int(idx))
            summary_candidates.append(emb)
        summary_candidates = np.array(summary_candidates, dtype=np.float32)

        # Stage 2: temporary index that holds only the candidates
        temp_index = faiss.IndexFlatIP(summary_candidates.shape[1])
        temp_index.add(summary_candidates)

        query_description_embedding = self.embedding_model.encode(
            f"Instruct: Given a description, retrieve relevant songs\nQuery: {query_description}"
        ).reshape(1, -1).astype(np.float32)

        scores, temp_ids = temp_index.search(query_description_embedding, k)
        # Map positions in the temporary index back to positions in the full index
        final_song_ids = [top_k_indices[i] for i in temp_ids[0]]

        return self.get_records_from_id(final_song_ids)
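The retrieve-then-re-rank logic is easier to see on toy data. The sketch below is not the implementation above: it scores the candidates directly with a dot product instead of building a temporary FAISS index, and it uses made-up 2-D vectors. The function name `rerank` and its parameters are hypothetical.

```python
def inner_product(a, b):
    return sum(x * y for x, y in zip(a, b))


def rerank(lyrics_query, description_query, lyrics_vecs, summary_vecs,
           n_candidates, k):
    """Retrieve by lyrics similarity, then re-rank the candidates by summary similarity."""
    # Stage 1: positions of the n_candidates best lyrics matches
    by_lyrics = sorted(
        range(len(lyrics_vecs)),
        key=lambda i: inner_product(lyrics_query, lyrics_vecs[i]),
        reverse=True,
    )[:n_candidates]
    # Stage 2: order only those candidates by their summary score
    by_summary = sorted(
        by_lyrics,
        key=lambda i: inner_product(description_query, summary_vecs[i]),
        reverse=True,
    )
    return by_summary[:k]


# Hypothetical 2-D embeddings for four songs
lyrics_vecs = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.8, 0.2]]
summary_vecs = [[1.0, 0.0], [0.0, 1.0], [0.0, 1.0], [0.5, 0.5]]

top = rerank([1.0, 0.0], [0.0, 1.0], lyrics_vecs, summary_vecs,
             n_candidates=3, k=2)
```

In this toy data, song 2 has a perfect summary score but never makes it into the result because its lyrics did not pass the first stage. That is the intended behavior: the description only re-orders songs that already match the lyrics.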

Voilà! Finally, LyRec is ready. You can find the complete implementation in this repo. Please leave a star if you found this helpful! 😃
