I Built an AI Pipeline for Summarizing Kindle Highlights

I read a lot, and I like to highlight things (I use a Kindle). I feel that by reading alone I can't retain more than 10% of the information, but by re-reading the highlights, or summarizing the book from them, I actually understand what I read.
The problem is, sometimes, I end up highlighting too much.
And by too much I mean A LOT. At that point we can't even call them "important notes."
So in those cases, after finishing the book, I either spend a lot of time summarizing or I just skip it (the latter is more common).
I just finished a book that I really enjoyed, and I would like to keep everything that struck me. But, again, it was one of those books where I highlighted a lot.
And I didn't want to spend too much of my free time on it. So I decided to put my tech/data skills to work and automate the process. Because I'm happy with the result, I thought I'd share it so that anyone interested can take advantage of this tool.
Disclaimer: my Kindle is quite old, so this should work for newer ones too. In fact, for newer Kindles there is an even better way (also explained in this post).
The project
Let's define the goal: generate a summary from the highlights of a book on our Kindle.
When I thought about it, I came up with the following simple pipeline for a single book:
- Get the highlights of the book
- Create a RAG or something similar
- Extract the summary
The final result differs a bit from this first sketch, mostly because of the pre-processing required by the way the data is structured.
So I'm going to organize this post into two main sections:
- Data retrieval and processing
- AI model and output
1. Data Retrieval and Processing
My intuition told me there had to be a way to extract the highlights from my Kindle. After all, they are stored there, so I just needed a way to get them out.
There are several ways to do it, but I wanted one that works both with books bought from the official Kindle store and with PDFs or files I've sent from my laptop.
And I decided not to use any existing software to extract the data: just my e-reader, my laptop, and the USB cable that connects the two.
Luckily for us, no jailbreak is required, and there are two ways to do it depending on your Kindle version:
- All versions (probably) have a file in the documents folder named My Clippings.txt. It contains literally every clipping you've ever made in any book.
- Newer Kindles also have an SQLite file in the system directory named annotations.db, which stores the highlights in a well-structured way.
For this post I will use method 1 (My Clippings.txt), mainly because my Kindle does not have an annotations.db database. But if you are lucky enough to have the DB, use it: the data will be clean and of high quality (most of the processing we will see next will probably not be necessary).
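If you do have annotations.db, I can't tell you its exact schema (my Kindle lacks the file entirely), so the table and column names are unknown to me. A safe first step is simply listing whatever tables and columns the file contains and adapting the extraction query from there:

```python
import sqlite3

def list_tables(db_path):
    # Print every table and its columns so you can adapt the
    # extraction query to whatever schema your Kindle uses
    con = sqlite3.connect(db_path)
    try:
        tables = [r[0] for r in con.execute(
            "SELECT name FROM sqlite_master WHERE type='table'")]
        for t in tables:
            cols = [c[1] for c in con.execute(f"PRAGMA table_info({t})")]
            print(t, cols)
        return tables
    finally:
        con.close()
```

Once you know the real table name, a single `SELECT` should replace most of the TXT parsing below.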
So extracting the clippings is as easy as reading a TXT file. Here are some quirks and problems I encountered with this method:
- All books are in the same file.
- I'm not sure about Amazon's exact definition of a "clipping," but the way I've seen it work is: whatever you highlight at any given time. Even if you later delete or extend a highlight, the original entry remains in the TXT. I suspect this is because appending to a plain text file is easy, while reliably removing stale entries is not.
- There is a clipping limit: I don't know the exact number, but once you cross it, no more clippings are recorded. Presumably this exists so that nobody can highlight a full book, download the file, and share it illegally.
And here is the anatomy of a clipping:

```text
==========
Book Name (Author Name)
- Your Highlight on page 145 | Location 2212-2212 | Added on Sunday, August 30, 2020 11:25:29 PM

transparency problem ends up in the same place as
==========
```
So the first step is to separate the highlights, and this is where we first see some Python code:
```python
import re
from pathlib import Path

def parse_clippings(file_path):
    raw = Path(file_path).read_text(encoding="utf-8")
    entries = raw.split("==========")
    highlights = []
    for entry in entries:
        lines = [l.strip() for l in entry.strip().split("\n") if l.strip()]
        if len(lines) < 3:
            continue
        book = lines[0]
        if "Highlight" not in lines[1]:
            continue
        location_match = re.search(r"Location (\d+)", lines[1])
        if not location_match:
            continue
        location = int(location_match.group(1))
        text = " ".join(lines[2:]).strip()
        highlights.append(
            {
                "book": book,
                "location": location,
                "text": text
            }
        )
    return highlights
```
Given the path to the clippings file, all this does is split the text into separate entries and step through them. For each entry, it extracts the book name, location, and highlighted text.
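As a quick sanity check, here is the same parsing logic inlined to run on an in-memory sample clipping (the sample string is fabricated from the anatomy shown above):

```python
import re

SAMPLE = """==========
Talking to Strangers (Malcolm Gladwell)
- Your Highlight on page 145 | Location 2212-2212 | Added on Sunday, August 30, 2020 11:25:29 PM

transparency problem ends up in the same place as
=========="""

def parse_clippings_text(raw):
    # Same logic as parse_clippings, but on a string instead of a file
    highlights = []
    for entry in raw.split("=========="):
        lines = [l.strip() for l in entry.strip().split("\n") if l.strip()]
        if len(lines) < 3:
            continue
        if "Highlight" not in lines[1]:
            continue
        m = re.search(r"Location (\d+)", lines[1])
        if not m:
            continue
        highlights.append({
            "book": lines[0],
            "location": int(m.group(1)),
            "text": " ".join(lines[2:]).strip(),
        })
    return highlights

print(parse_clippings_text(SAMPLE))
# One entry: book name, location 2212, and the highlighted sentence
```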
This structure (a list of dictionaries) makes it easy to filter by book:

```python
[
    h for h in highlights
    if book_name.lower() in h["book"].lower()
]
```
Once filtered, we must order the highlights. Since the clippings file is append-only, the entries are ordered by when you highlighted them, not by where they appear in the book.
And I personally want my summary to follow the book, so sorting is necessary:

```python
sorted(highlights, key=lambda x: x["location"])
```
Now, if you check your clippings file, you may find duplicate (or near-duplicate) clippings. This happens because whenever you edit a highlight (because you failed to include all the words you intended, for example), it is counted as a new one. So there will be two nearly identical clippings in the TXT, or even more if you edit it multiple times.
We need to handle this with some deduplication. It's easier than expected:
```python
def deduplicate(highlights):
    clean = []
    for h in highlights:
        text = h["text"]
        duplicate = False
        for c in clean:
            if text == c["text"]:
                duplicate = True
                break
            if text in c["text"]:
                duplicate = True
                break
            if c["text"] in text:
                c["text"] = text
                duplicate = True
                break
        if not duplicate:
            clean.append(h)
    return clean
```
It's very simple and could be refined, but basically we check whether two entries share the same text (or one contains the other) and keep the longest version.
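To see the containment logic in action, here is a compact variant (the equal-text and substring checks merged into one condition) run on a fabricated edit sequence, where the longer edited highlight should win:

```python
def deduplicate(highlights):
    # Keep only the longest version of overlapping highlights
    clean = []
    for h in highlights:
        text = h["text"]
        duplicate = False
        for c in clean:
            if text == c["text"] or text in c["text"]:
                duplicate = True
                break
            if c["text"] in text:
                c["text"] = text  # keep the longer, edited version
                duplicate = True
                break
        if not duplicate:
            clean.append(h)
    return clean

hs = [
    {"text": "default to truth"},
    {"text": "default to truth, the idea that we assume honesty"},
    {"text": "default to truth"},
]
print(deduplicate(hs))
# A single entry survives, carrying the longest text
```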
We now have the book's highlights cleaned and sorted, and we could stop pre-processing here. But I can't. I also like to highlight section titles, because when summarizing I can then assign each highlight to its section.
But our code can't yet differentiate between an actual highlight and a section title… until now. See below:
```python
def is_probable_title(text):
    text = text.strip()
    if len(text) > 120:
        return False
    if text.endswith("."):
        return False
    words = text.split()
    if len(words) > 12:
        return False
    # chapter style prefix
    if has_chapter_prefix(text):
        return True
    # capitalization ratio
    capitalized = sum(
        1 for w in words if w[0].isupper()
    )
    cap_ratio = capitalized / len(words)
    # stopword ratio
    stopword_count = sum(
        1 for w in words if w.lower() in STOPWORDS
    )
    stop_ratio = stopword_count / len(words)
    score = 0
    if cap_ratio > 0.6:
        score += 1
    if stop_ratio < 0.3:
        score += 1
    if len(words) <= 6:
        score += 1
    return score >= 2
```
It may seem silly, and it's not the best possible solution to this problem, but it works very well. It uses a heuristic based on capitalization, length, stopwords, and chapter-style prefixes.
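Here is the heuristic in action on three made-up sample strings (the helper definitions are repeated so the demo is self-contained; `looks_like_title` condenses the same scoring as `is_probable_title`):

```python
import re

STOPWORDS = {
    "the","and","or","but","of","in","on","at","for","to",
    "is","are","was","were","be","been","being",
    "that","this","with","as","by","from"
}

def has_chapter_prefix(text):
    # Matches "chapter 3", "part 1", "2." / "2)", or roman numerals like "iv."
    return bool(re.match(
        r"^(chapter|part|section)\s+\d+|^\d+[.)]|^[ivxlcdm]+\.",
        text.lower()))

def looks_like_title(text):
    # Same heuristic as is_probable_title above, condensed
    text = text.strip()
    words = text.split()
    if len(text) > 120 or text.endswith(".") or len(words) > 12:
        return False
    if has_chapter_prefix(text):
        return True
    cap_ratio = sum(1 for w in words if w[0].isupper()) / len(words)
    stop_ratio = sum(1 for w in words if w.lower() in STOPWORDS) / len(words)
    score = (cap_ratio > 0.6) + (stop_ratio < 0.3) + (len(words) <= 6)
    return score >= 2

for t in ["Chapter 3 The Transparency Problem",
          "transparency problem ends up in the same place as",
          "The Default to Truth Problem"]:
    print(t, "->", looks_like_title(t))
# -> True, False, True
```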
This function is called for every highlight, inside the same kind of loop we have seen in the previous functions, to check whether the highlight is a section title or not. The result is a list of section dictionaries, each with two keys:
- title: the title of the section.
- highlights: the highlights belonging to that section.
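The grouping loop can be sketched as follows. To keep the sketch self-contained, it takes plain strings and the title check as an injected function (the real version works on the highlight dictionaries and calls the heuristic directly):

```python
def group_by_sections(highlights, is_title):
    # Open a new section whenever a probable title appears;
    # everything before the first title lands in "Introduction"
    sections = []
    current = {"title": "Introduction", "highlights": []}
    for text in highlights:
        if is_title(text):
            sections.append(current)
            current = {"title": text, "highlights": []}
        else:
            current["highlights"].append(text)
    sections.append(current)
    return sections

demo = group_by_sections(
    ["some opening remark", "The Default to Truth Problem", "a highlight"],
    is_title=lambda t: t == "The Default to Truth Problem")
print(demo)
# Two sections: "Introduction" and "The Default to Truth Problem"
```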
And now, yes, we are ready to summarize.
2. AI Model and Output
I wanted this to be a free project, so we need an open-source AI model.
I figured Ollama [1] was one of the best options for running a project like this (at least locally). Also, our data never leaves our machine, and we can use the models offline.
Once installed, the code was simple. I'm no expert here, so anyone with more experience can probably get better results, but this is what works for me:
```python
import subprocess

def summarize_with_ollama(text, model):
    prompt = f"""
You are summarizing a book from reader highlights.

Produce a structured summary with:
- Main thesis
- Brief summary
- Key ideas
- Important concepts
- Practical takeaways

Highlights:
{text}
"""
    result = subprocess.run(
        ["ollama", "run", model],
        input=prompt,
        text=True,
        capture_output=True
    )
    return result.stdout
```
Simple, I know. But it works, partly because the data processing beforehand was intensive, and partly because the models Ollama ships are already quite capable.
But what do we do with the summary? I like to use Obsidian [2], so exporting a Markdown file makes the most sense. Here you have it:
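One caveat: local models have limited context windows, and a heavily highlighted book could exceed them. A simple workaround (not part of the original pipeline, and the character budget below is an arbitrary guess rather than a measured limit) is to split the highlights into chunks and summarize each chunk separately:

```python
def chunk_text(text, max_chars=6000):
    # Split on line boundaries so no highlight is cut in half;
    # max_chars is a rough stand-in for the model's context limit
    chunks, current, size = [], [], 0
    for line in text.split("\n"):
        if size + len(line) > max_chars and current:
            chunks.append("\n".join(current))
            current, size = [], 0
        current.append(line)
        size += len(line) + 1
    if current:
        chunks.append("\n".join(current))
    return chunks
```

Each chunk can then be passed to `summarize_with_ollama`, and the partial summaries concatenated or summarized once more.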
```python
from pathlib import Path

def export_markdown(book, sections, summary, output):
    md = f"# {book}\n\n"
    for section in sections:
        md += f"## {section['title']}\n\n"
        for h in section["highlights"]:
            md += f"- {h}\n"
        md += "\n"
    md += "\n---\n\n"
    md += "## Book Summary\n\n"
    md += summary
    output_path = Path(output)
    output_path.parent.mkdir(parents=True, exist_ok=True)
    output_path.write_text(md, encoding="utf-8")
    print(f"\nSaved to {output_path}")
```
Et voilà.
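For reference, the exported file ends up shaped roughly like this (book, titles, and text here are placeholders, not real output):

```markdown
# Some Book (Some Author)

## Introduction

- an early highlight

## The First Section Title

- a highlight from this section
- another one

---

## Book Summary

Main thesis: ...
```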
And this is how I go from raw highlights to a full Markdown summary (dropped straight into Obsidian if I want) in less than 300 lines of Python code!
Full Code and Testing
Here is the full code, if you just want to copy and paste it. It includes comments, a few helper functions, and argument parsing:
```python
import re
import argparse
import subprocess
from pathlib import Path


# ---------- PARSE CLIPPINGS ----------
def parse_clippings(file_path):
    raw = Path(file_path).read_text(encoding="utf-8")
    entries = raw.split("==========")
    highlights = []
    for entry in entries:
        lines = [l.strip() for l in entry.strip().split("\n") if l.strip()]
        if len(lines) < 3:
            continue
        book = lines[0]
        if "Highlight" not in lines[1]:
            continue
        location_match = re.search(r"Location (\d+)", lines[1])
        if not location_match:
            continue
        location = int(location_match.group(1))
        text = " ".join(lines[2:]).strip()
        highlights.append(
            {
                "book": book,
                "location": location,
                "text": text
            }
        )
    return highlights


# ---------- FILTER BOOK ----------
def filter_book(highlights, book_name):
    return [
        h for h in highlights
        if book_name.lower() in h["book"].lower()
    ]


# ---------- SORT ----------
def sort_by_location(highlights):
    return sorted(highlights, key=lambda x: x["location"])


# ---------- DEDUPLICATE ----------
def deduplicate(highlights):
    clean = []
    for h in highlights:
        text = h["text"]
        duplicate = False
        for c in clean:
            if text == c["text"]:
                duplicate = True
                break
            if text in c["text"]:
                duplicate = True
                break
            if c["text"] in text:
                c["text"] = text
                duplicate = True
                break
        if not duplicate:
            clean.append(h)
    return clean


# ---------- TITLE DETECTION ----------
STOPWORDS = {
    "the", "and", "or", "but", "of", "in", "on", "at", "for", "to",
    "is", "are", "was", "were", "be", "been", "being",
    "that", "this", "with", "as", "by", "from"
}


def has_chapter_prefix(text):
    return bool(
        re.match(
            r"^(chapter|part|section)\s+\d+|^\d+[.)]|^[ivxlcdm]+\.",
            text.lower()
        )
    )


def is_probable_title(text):
    text = text.strip()
    if len(text) > 120:
        return False
    if text.endswith("."):
        return False
    words = text.split()
    if len(words) > 12:
        return False
    # chapter style prefix
    if has_chapter_prefix(text):
        return True
    # capitalization ratio
    capitalized = sum(
        1 for w in words if w[0].isupper()
    )
    cap_ratio = capitalized / len(words)
    # stopword ratio
    stopword_count = sum(
        1 for w in words if w.lower() in STOPWORDS
    )
    stop_ratio = stopword_count / len(words)
    score = 0
    if cap_ratio > 0.6:
        score += 1
    if stop_ratio < 0.3:
        score += 1
    if len(words) <= 6:
        score += 1
    return score >= 2


# ---------- GROUP SECTIONS ----------
def group_by_sections(highlights):
    sections = []
    current = {
        "title": "Introduction",
        "highlights": []
    }
    for h in highlights:
        text = h["text"]
        if is_probable_title(text):
            sections.append(current)
            current = {
                "title": text,
                "highlights": []
            }
        else:
            current["highlights"].append(text)
    sections.append(current)
    return sections


# ---------- SUMMARY ----------
def summarize_with_ollama(text, model):
    prompt = f"""
You are summarizing a book from reader highlights.

Produce a structured summary with:
- Main thesis
- Brief summary
- Key ideas
- Important concepts
- Practical takeaways

Highlights:
{text}
"""
    result = subprocess.run(
        ["ollama", "run", model],
        input=prompt,
        text=True,
        capture_output=True
    )
    return result.stdout


# ---------- EXPORT MARKDOWN ----------
def export_markdown(book, sections, summary, output):
    md = f"# {book}\n\n"
    for section in sections:
        md += f"## {section['title']}\n\n"
        for h in section["highlights"]:
            md += f"- {h}\n"
        md += "\n"
    md += "\n---\n\n"
    md += "## Book Summary\n\n"
    md += summary
    output_path = Path(output)
    output_path.parent.mkdir(parents=True, exist_ok=True)
    output_path.write_text(md, encoding="utf-8")
    print(f"\nSaved to {output_path}")


# ---------- MAIN ----------
def main():
    parser = argparse.ArgumentParser()
    parser.add_argument("--book", required=True)
    parser.add_argument("--output", required=False, default=None)
    parser.add_argument(
        "--clippings",
        default="Data/My Clippings.txt"
    )
    parser.add_argument(
        "--model",
        default="mistral"
    )
    args = parser.parse_args()
    highlights = parse_clippings(args.clippings)
    highlights = filter_book(highlights, args.book)
    highlights = sort_by_location(highlights)
    highlights = deduplicate(highlights)
    sections = group_by_sections(highlights)
    all_text = "\n".join(
        h["text"] for h in highlights
    )
    summary = summarize_with_ollama(all_text, args.model)
    if args.output:
        export_markdown(
            args.book,
            sections,
            summary,
            args.output
        )
    else:
        print("\n---- HIGHLIGHTS ----\n")
        for h in highlights:
            print(f"{h['text']}\n")
        print("\n---- SUMMARY ----\n")
        print(summary)


if __name__ == "__main__":
    main()
```
But let's see how it works! The code itself is useful, but I bet you want to see the results. The output is long, so I removed the first part, since all it does is echo the highlights.
I randomly chose a book I read six years ago (2020): Talking to Strangers by Malcolm Gladwell (a bestseller and a great read). Here is the model's printed output (not the Markdown export):
$ python3 kindle_summary.py --book "Talking to Strangers"
---- HIGHLIGHTS ----
...
---- SUMMARY ----
Title: Talking to Strangers: What We Should Know About Human Interaction
Main Thesis: The book explores the complexities and paradoxes of human
interaction, particularly in conversations with strangers, and emphasizes
the importance of caution, humility, and understanding the context in
which these interactions occur.
Brief Summary: The author delves into the misconceptions and shortcomings
in our dealings with strangers, focusing on how we often make incorrect
assumptions about others based on limited information or preconceived
notions. The book offers insights into why this happens, its consequences,
and strategies for improving our ability to understand and communicate
effectively with people we don't know.
Key Ideas:
1. The transparency problem and the default-to-truth problem: People often
assume that others are open books, sharing their true emotions and
intentions, when in reality this is not always the case.
2. Coupling: Behaviors are strongly linked to specific circumstances and
conditions, making it essential to understand the context in which a
stranger operates.
3. Limitations of understanding strangers: There is no perfect mechanism
for peering into the minds of those we do not know, emphasizing the need
for restraint and humility when interacting with strangers.
Important Concepts:
1. Emotional responses falling outside expectations
2. Defaulting to truth
3. Transparency as an illusion
4. Contextual understanding in dealing with strangers
5. The paradox of talking to strangers (need versus terribleness)
6. The phenomenon of coupling and its influence on behavior
7. Blaming the stranger when things go awry
Practical Takeaways:
1. Recognize that people may not always appear as they seem, both
emotionally and behaviorally.
2. Understand the importance of context in interpreting strangers'
behaviors and intentions.
3. Be cautious and humble when interacting with strangers, acknowledging
our limitations in understanding them fully.
4. Avoid jumping to conclusions about strangers based on limited
information or preconceived notions.
5. Accept that there will always be some degree of ambiguity and
complexity in dealing with strangers.
6. Avoid penalizing others for defaulting to truth as a defense mechanism.
7. When interactions with strangers go awry, consider the role one might
have played in contributing to the situation rather than solely blaming
the stranger.
And all of this in a few seconds. Pretty good, in my opinion.
The conclusion
And this is how I can now save a lot of free time (which I can use to write posts like this one) by putting my data and AI skills to use.
I hope you enjoyed reading and feel inspired to give it a try! It won't be better than a summary you write yourself… but it won't be far from it!
Thanks for your attention, and feel free to comment if you have any other ideas or suggestions!
Resources
[1] Ollama. (n.d.). Ollama.
[2] Obsidian. (n.d.). Obsidian.



