I Built an AI Pipeline for Summarizing Kindle Highlights

I read a lot, and I like to highlight things (I use a Kindle). I feel that by reading alone I can't retain more than 10% of the information, but by re-reading the highlights, or summarizing the book from them, I actually understand what I read.
The problem is, sometimes, I end up highlighting too much.
And by too much I mean A LOT. At that point we can't even call them "important notes."
So in those cases, after finishing the book, I either spend a lot of time summarizing or I just skip it (the latter is more common).
I just finished a book that I really enjoyed, and I would like to keep everything that struck me. But, again, it was one of those books where I highlighted a lot.
And I didn't want to spend too much of my free time on it. So I decided to put my tech/data skills to work and automate the process. Because I'm happy with the result, I thought I'd share it so that anyone interested can take advantage of this tool.
Disclaimer: my Kindle is quite old, so this should work for newer ones too. In fact, for newer Kindles there is an even better way (also explained in this post).
The project
Let's define the goal: generate a summary from the highlights of a book on our Kindle.
When I thought about it, I came up with the following simple pipeline for a single book:
- Get the highlights of the book
- Create a RAG or something similar
- Extract the summary
The final result differs a bit from this first sketch, mostly because of the pre-processing required by the way the data is structured.
So I'm going to organize this post into two main sections:
- Data retrieval and processing
- AI model and output
1. Data Retrieval and Processing
My intuition told me there had to be a way to extract the highlights from my Kindle. After all, they are stored there, so I just needed a way to get them out.
There are several ways to do it, but I wanted one that works both with books bought from the official Kindle store and with PDFs or files I've sent from my laptop.
And I decided not to use any existing software to extract the data: just my e-reader, my laptop, and the USB cable that connects the two.
Luckily for us, no jailbreak is required, and there are two ways to do it depending on your Kindle version:
- All versions (probably) have a file in the documents folder named My Clippings.txt. It contains literally every clipping you've ever made in any book.
- Newer Kindles also have an SQLite file in the system directory named annotations.db, which stores the highlights in a well-structured way.
For this post I will use method 1 (My Clippings.txt), mainly because my Kindle does not have an annotations.db database. But if you are lucky enough to have the DB, use it: the data will be clean and of high quality (most of the processing we will see next will probably not be necessary).
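If you do have annotations.db, I can't tell you its exact schema (my Kindle lacks the file entirely), so the table and column names are unknown to me. A safe first step is simply listing whatever tables and columns the file contains and adapting the extraction query from there:

```python
import sqlite3

def list_tables(db_path):
    # Print every table and its columns so you can adapt the
    # extraction query to whatever schema your Kindle uses
    con = sqlite3.connect(db_path)
    try:
        tables = [r[0] for r in con.execute(
            "SELECT name FROM sqlite_master WHERE type='table'")]
        for t in tables:
            cols = [c[1] for c in con.execute(f"PRAGMA table_info({t})")]
            print(t, cols)
        return tables
    finally:
        con.close()
```

Once you know the real table name, a single `SELECT` should replace most of the TXT parsing below.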
So extracting the clippings is as easy as reading a TXT file. Here are some quirks and problems I encountered with this method:
- All books are in the same file.
- I'm not sure about Amazon's exact definition of a "clipping," but the way I've seen it work is: whatever you highlight at any given time. Even if you later delete or extend a highlight, the original entry remains in the TXT. I suspect this is because appending to a plain text file is easy, while reliably removing stale entries is not.
- There is a clipping limit: I don't know the exact number, but once you cross it, no more clippings are recorded. Presumably this exists so that nobody can highlight a full book, download the file, and share it illegally.
And here is the anatomy of a clipping:

```text
==========
Book Name (Author Name)
- Your Highlight on page 145 | Location 2212-2212 | Added on Sunday, August 30, 2020 11:25:29 PM

transparency problem ends up in the same place as
==========
```
So the first step is to separate the highlights, and this is where we first see some Python code:
```python
import re
from pathlib import Path

def parse_clippings(file_path):
    raw = Path(file_path).read_text(encoding="utf-8")
    entries = raw.split("==========")
    highlights = []
    for entry in entries:
        lines = [l.strip() for l in entry.strip().split("\n") if l.strip()]
        if len(lines) < 3:
            continue
        book = lines[0]
        if "Highlight" not in lines[1]:
            continue
        location_match = re.search(r"Location (\d+)", lines[1])
        if not location_match:
            continue
        location = int(location_match.group(1))
        text = " ".join(lines[2:]).strip()
        highlights.append(
            {
                "book": book,
                "location": location,
                "text": text
            }
        )
    return highlights
```
Given the path to the clippings file, all this does is split the text into separate entries and step through them. For each entry, it extracts the book name, location, and highlighted text.
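As a quick sanity check, here is the same parsing logic inlined to run on an in-memory sample clipping (the sample string is fabricated from the anatomy shown above):

```python
import re

SAMPLE = """==========
Talking to Strangers (Malcolm Gladwell)
- Your Highlight on page 145 | Location 2212-2212 | Added on Sunday, August 30, 2020 11:25:29 PM

transparency problem ends up in the same place as
=========="""

def parse_clippings_text(raw):
    # Same logic as parse_clippings, but on a string instead of a file
    highlights = []
    for entry in raw.split("=========="):
        lines = [l.strip() for l in entry.strip().split("\n") if l.strip()]
        if len(lines) < 3:
            continue
        if "Highlight" not in lines[1]:
            continue
        m = re.search(r"Location (\d+)", lines[1])
        if not m:
            continue
        highlights.append({
            "book": lines[0],
            "location": int(m.group(1)),
            "text": " ".join(lines[2:]).strip(),
        })
    return highlights

print(parse_clippings_text(SAMPLE))
# One entry: book name, location 2212, and the highlighted sentence
```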
This structure (a list of dictionaries) makes it easy to filter by book:

```python
[
    h for h in highlights
    if book_name.lower() in h["book"].lower()
]
```
Once filtered, we must order the highlights. Since the clippings file is append-only, the entries are ordered by when you highlighted them, not by where they appear in the book.
And I personally want my summary to follow the book, so sorting is necessary:

```python
sorted(highlights, key=lambda x: x["location"])
```
Now, if you check your clippings file, you may find duplicate (or near-duplicate) clippings. This happens because whenever you edit a highlight (because you failed to include all the words you intended, for example), it is counted as a new one. So there will be two nearly identical clippings in the TXT, or even more if you edit it multiple times.
We need to handle this with some deduplication. It's easier than expected:
```python
def deduplicate(highlights):
    clean = []
    for h in highlights:
        text = h["text"]
        duplicate = False
        for c in clean:
            if text == c["text"]:
                duplicate = True
                break
            if text in c["text"]:
                duplicate = True
                break
            if c["text"] in text:
                c["text"] = text
                duplicate = True
                break
        if not duplicate:
            clean.append(h)
    return clean
```
It's very simple and could be refined, but basically we check whether two entries share the same text (or one contains the other) and keep the longest version.
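To see the containment logic in action, here is a compact variant (the equal-text and substring checks merged into one condition) run on a fabricated edit sequence, where the longer edited highlight should win:

```python
def deduplicate(highlights):
    # Keep only the longest version of overlapping highlights
    clean = []
    for h in highlights:
        text = h["text"]
        duplicate = False
        for c in clean:
            if text == c["text"] or text in c["text"]:
                duplicate = True
                break
            if c["text"] in text:
                c["text"] = text  # keep the longer, edited version
                duplicate = True
                break
        if not duplicate:
            clean.append(h)
    return clean

hs = [
    {"text": "default to truth"},
    {"text": "default to truth, the idea that we assume honesty"},
    {"text": "default to truth"},
]
print(deduplicate(hs))
# A single entry survives, carrying the longest text
```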
We now have the book's highlights cleaned and sorted, and we could stop pre-processing here. But I can't. I also like to highlight section titles, because when summarizing I can then assign each highlight to its section.
But our code can't yet differentiate between an actual highlight and a section title… until now. See below:
```python
def is_probable_title(text):
    text = text.strip()
    if len(text) > 120:
        return False
    if text.endswith("."):
        return False
    words = text.split()
    if len(words) > 12:
        return False
    # chapter style prefix
    if has_chapter_prefix(text):
        return True
    # capitalization ratio
    capitalized = sum(
        1 for w in words if w[0].isupper()
    )
    cap_ratio = capitalized / len(words)
    # stopword ratio
    stopword_count = sum(
        1 for w in words if w.lower() in STOPWORDS
    )
    stop_ratio = stopword_count / len(words)
    score = 0
    if cap_ratio > 0.6:
        score += 1
    if stop_ratio < 0.3:
        score += 1
    if len(words) <= 6:
        score += 1
    return score >= 2
```
It may seem silly, and it's not the best possible solution to this problem, but it works very well. It uses a heuristic based on capitalization, length, stopwords, and chapter-style prefixes.
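Here is the heuristic in action on three made-up sample strings (the helper definitions are repeated so the demo is self-contained; `looks_like_title` condenses the same scoring as `is_probable_title`):

```python
import re

STOPWORDS = {
    "the","and","or","but","of","in","on","at","for","to",
    "is","are","was","were","be","been","being",
    "that","this","with","as","by","from"
}

def has_chapter_prefix(text):
    # Matches "chapter 3", "part 1", "2." / "2)", or roman numerals like "iv."
    return bool(re.match(
        r"^(chapter|part|section)\s+\d+|^\d+[.)]|^[ivxlcdm]+\.",
        text.lower()))

def looks_like_title(text):
    # Same heuristic as is_probable_title above, condensed
    text = text.strip()
    words = text.split()
    if len(text) > 120 or text.endswith(".") or len(words) > 12:
        return False
    if has_chapter_prefix(text):
        return True
    cap_ratio = sum(1 for w in words if w[0].isupper()) / len(words)
    stop_ratio = sum(1 for w in words if w.lower() in STOPWORDS) / len(words)
    score = (cap_ratio > 0.6) + (stop_ratio < 0.3) + (len(words) <= 6)
    return score >= 2

for t in ["Chapter 3 The Transparency Problem",
          "transparency problem ends up in the same place as",
          "The Default to Truth Problem"]:
    print(t, "->", looks_like_title(t))
# -> True, False, True
```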
This function is called for every highlight, inside the same kind of loop we have seen in the previous functions, to check whether the highlight is a section title or not. The result is a list of section dictionaries, each with two keys:
- title: the title of the section.
- highlights: the highlights belonging to that section.
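The grouping loop can be sketched as follows. To keep the sketch self-contained, it takes plain strings and the title check as an injected function (the real version works on the highlight dictionaries and calls the heuristic directly):

```python
def group_by_sections(highlights, is_title):
    # Open a new section whenever a probable title appears;
    # everything before the first title lands in "Introduction"
    sections = []
    current = {"title": "Introduction", "highlights": []}
    for text in highlights:
        if is_title(text):
            sections.append(current)
            current = {"title": text, "highlights": []}
        else:
            current["highlights"].append(text)
    sections.append(current)
    return sections

demo = group_by_sections(
    ["some opening remark", "The Default to Truth Problem", "a highlight"],
    is_title=lambda t: t == "The Default to Truth Problem")
print(demo)
# Two sections: "Introduction" and "The Default to Truth Problem"
```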
And now, yes, we are ready to summarize.
2. AI Model and Output
I wanted this to be a free project, so we need an open-source AI model.
I figured Ollama [1] was one of the best options for running a project like this (at least locally). Also, our data never leaves our machine, and we can use the models offline.
Once installed, the code was simple. I'm no expert here, so anyone with more experience can probably get better results, but this is what works for me:
```python
import subprocess

def summarize_with_ollama(text, model):
    prompt = f"""
You are summarizing a book from reader highlights.

Produce a structured summary with:
- Main thesis
- Brief summary
- Key ideas
- Important concepts
- Practical takeaways

Highlights:
{text}
"""
    result = subprocess.run(
        ["ollama", "run", model],
        input=prompt,
        text=True,
        capture_output=True
    )
    return result.stdout
```
Simple, I know. But it works, partly because the data processing beforehand was intensive, and partly because the models Ollama ships are already quite capable.
But what do we do with the summary? I like to use Obsidian [2], so exporting a Markdown file makes the most sense. Here you have it:
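One caveat: local models have limited context windows, and a heavily highlighted book could exceed them. A simple workaround (not part of the original pipeline, and the character budget below is an arbitrary guess rather than a measured limit) is to split the highlights into chunks and summarize each chunk separately:

```python
def chunk_text(text, max_chars=6000):
    # Split on line boundaries so no highlight is cut in half;
    # max_chars is a rough stand-in for the model's context limit
    chunks, current, size = [], [], 0
    for line in text.split("\n"):
        if size + len(line) > max_chars and current:
            chunks.append("\n".join(current))
            current, size = [], 0
        current.append(line)
        size += len(line) + 1
    if current:
        chunks.append("\n".join(current))
    return chunks
```

Each chunk can then be passed to `summarize_with_ollama`, and the partial summaries concatenated or summarized once more.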
```python
from pathlib import Path

def export_markdown(book, sections, summary, output):
    md = f"# {book}\n\n"
    for section in sections:
        md += f"## {section['title']}\n\n"
        for h in section["highlights"]:
            md += f"- {h}\n"
        md += "\n"
    md += "\n---\n\n"
    md += "## Book Summary\n\n"
    md += summary
    output_path = Path(output)
    output_path.parent.mkdir(parents=True, exist_ok=True)
    output_path.write_text(md, encoding="utf-8")
    print(f"\nSaved to {output_path}")
```
Et voilà.
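For reference, the exported file ends up shaped roughly like this (book, titles, and text here are placeholders, not real output):

```markdown
# Some Book (Some Author)

## Introduction

- an early highlight

## The First Section Title

- a highlight from this section
- another one

---

## Book Summary

Main thesis: ...
```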
And this is how I go from raw highlights to a full Markdown summary (dropped straight into Obsidian if I want) in less than 300 lines of Python code!
Full Code and Testing
Here is the full code, if you just want to copy and paste it. It includes comments, a few helper functions, and argument parsing:
```python
import re
import argparse
import subprocess
from pathlib import Path


# ---------- PARSE CLIPPINGS ----------
def parse_clippings(file_path):
    raw = Path(file_path).read_text(encoding="utf-8")
    entries = raw.split("==========")
    highlights = []
    for entry in entries:
        lines = [l.strip() for l in entry.strip().split("\n") if l.strip()]
        if len(lines) < 3:
            continue
        book = lines[0]
        if "Highlight" not in lines[1]:
            continue
        location_match = re.search(r"Location (\d+)", lines[1])
        if not location_match:
            continue
        location = int(location_match.group(1))
        text = " ".join(lines[2:]).strip()
        highlights.append(
            {
                "book": book,
                "location": location,
                "text": text
            }
        )
    return highlights


# ---------- FILTER BOOK ----------
def filter_book(highlights, book_name):
    return [
        h for h in highlights
        if book_name.lower() in h["book"].lower()
    ]


# ---------- SORT ----------
def sort_by_location(highlights):
    return sorted(highlights, key=lambda x: x["location"])


# ---------- DEDUPLICATE ----------
def deduplicate(highlights):
    clean = []
    for h in highlights:
        text = h["text"]
        duplicate = False
        for c in clean:
            if text == c["text"]:
                duplicate = True
                break
            if text in c["text"]:
                duplicate = True
                break
            if c["text"] in text:
                c["text"] = text
                duplicate = True
                break
        if not duplicate:
            clean.append(h)
    return clean


# ---------- TITLE DETECTION ----------
STOPWORDS = {
    "the", "and", "or", "but", "of", "in", "on", "at", "for", "to",
    "is", "are", "was", "were", "be", "been", "being",
    "that", "this", "with", "as", "by", "from"
}


def has_chapter_prefix(text):
    return bool(
        re.match(
            r"^(chapter|part|section)\s+\d+|^\d+[.)]|^[ivxlcdm]+\.",
            text.lower()
        )
    )


def is_probable_title(text):
    text = text.strip()
    if len(text) > 120:
        return False
    if text.endswith("."):
        return False
    words = text.split()
    if len(words) > 12:
        return False
    # chapter style prefix
    if has_chapter_prefix(text):
        return True
    # capitalization ratio
    capitalized = sum(
        1 for w in words if w[0].isupper()
    )
    cap_ratio = capitalized / len(words)
    # stopword ratio
    stopword_count = sum(
        1 for w in words if w.lower() in STOPWORDS
    )
    stop_ratio = stopword_count / len(words)
    score = 0
    if cap_ratio > 0.6:
        score += 1
    if stop_ratio < 0.3:
        score += 1
    if len(words) <= 6:
        score += 1
    return score >= 2


# ---------- GROUP SECTIONS ----------
def group_by_sections(highlights):
    sections = []
    current = {
        "title": "Introduction",
        "highlights": []
    }
    for h in highlights:
        text = h["text"]
        if is_probable_title(text):
            sections.append(current)
            current = {
                "title": text,
                "highlights": []
            }
        else:
            current["highlights"].append(text)
    sections.append(current)
    return sections


# ---------- SUMMARY ----------
def summarize_with_ollama(text, model):
    prompt = f"""
You are summarizing a book from reader highlights.

Produce a structured summary with:
- Main thesis
- Brief summary
- Key ideas
- Important concepts
- Practical takeaways

Highlights:
{text}
"""
    result = subprocess.run(
        ["ollama", "run", model],
        input=prompt,
        text=True,
        capture_output=True
    )
    return result.stdout


# ---------- EXPORT MARKDOWN ----------
def export_markdown(book, sections, summary, output):
    md = f"# {book}\n\n"
    for section in sections:
        md += f"## {section['title']}\n\n"
        for h in section["highlights"]:
            md += f"- {h}\n"
        md += "\n"
    md += "\n---\n\n"
    md += "## Book Summary\n\n"
    md += summary
    output_path = Path(output)
    output_path.parent.mkdir(parents=True, exist_ok=True)
    output_path.write_text(md, encoding="utf-8")
    print(f"\nSaved to {output_path}")


# ---------- MAIN ----------
def main():
    parser = argparse.ArgumentParser()
    parser.add_argument("--book", required=True)
    parser.add_argument("--output", required=False, default=None)
    parser.add_argument(
        "--clippings",
        default="Data/My Clippings.txt"
    )
    parser.add_argument(
        "--model",
        default="mistral"
    )
    args = parser.parse_args()
    highlights = parse_clippings(args.clippings)
    highlights = filter_book(highlights, args.book)
    highlights = sort_by_location(highlights)
    highlights = deduplicate(highlights)
    sections = group_by_sections(highlights)
    all_text = "\n".join(
        h["text"] for h in highlights
    )
    summary = summarize_with_ollama(all_text, args.model)
    if args.output:
        export_markdown(
            args.book,
            sections,
            summary,
            args.output
        )
    else:
        print("\n---- HIGHLIGHTS ----\n")
        for h in highlights:
            print(f"{h['text']}\n")
        print("\n---- SUMMARY ----\n")
        print(summary)


if __name__ == "__main__":
    main()
```
But let's see how it works! The code itself is useful, but I bet you want to see the results. The output is long, so I removed the first part, since all it does is echo the highlights.
I randomly chose a book I read six years ago (2020): Talking to Strangers by Malcolm Gladwell (a bestseller and a great read). Here is the model's printed output (not the Markdown export):
$ python3 kindle_summary.py --book "Talking to Strangers"
---- HIGHLIGHTS ----
...
---- SUMMARY ----
Title: Talking to Strangers: What We Should Know About Human Interaction
Main Thesis: The book explores the complexities and paradoxes of human
interaction, particularly in conversations with strangers, and emphasizes
the importance of caution, humility, and understanding the context in
which these interactions occur.
Brief Summary: The author delves into the misconceptions and shortcomings
in our dealings with strangers, focusing on how we often make incorrect
assumptions about others based on limited information or preconceived
notions. The book offers insights into why this happens, its consequences,
and strategies for improving our ability to understand and communicate
effectively with people we don't know.
Key Ideas:
1. The transparency problem and the default-to-truth problem: People often
assume that others are open books, sharing their true emotions and
intentions, when in reality this is not always the case.
2. Coupling: Behaviors are strongly linked to specific circumstances and
conditions, making it essential to understand the context in which a
stranger operates.
3. Limitations of understanding strangers: There is no perfect mechanism
for peering into the minds of those we do not know, emphasizing the need
for restraint and humility when interacting with strangers.
Important Concepts:
1. Emotional responses falling outside expectations
2. Defaulting to truth
3. Transparency as an illusion
4. Contextual understanding in dealing with strangers
5. The paradox of talking to strangers (need versus terribleness)
6. The phenomenon of coupling and its influence on behavior
7. Blaming the stranger when things go awry
Practical Takeaways:
1. Recognize that people may not always appear as they seem, both
emotionally and behaviorally.
2. Understand the importance of context in interpreting strangers'
behaviors and intentions.
3. Be cautious and humble when interacting with strangers, acknowledging
our limitations in understanding them fully.
4. Avoid jumping to conclusions about strangers based on limited
information or preconceived notions.
5. Accept that there will always be some degree of ambiguity and
complexity in dealing with strangers.
6. Avoid penalizing others for defaulting to truth as a defense mechanism.
7. When interactions with strangers go awry, consider the role one might
have played in contributing to the situation rather than solely blaming
the stranger.
And all of this in a few seconds. Pretty good, in my opinion.
The conclusion
And this is how I can now save a lot of free time (which I can use to write posts like this one) by putting my data and AI skills to use.
I hope you enjoyed reading and feel inspired to give it a try! It won't be better than a summary you write yourself… but it won't be far from it!
Thanks for your attention, and feel free to comment if you have any other ideas or suggestions!
Resources
[1] Ollama. (n.d.). Ollama.
[2] Obsidian. (n.d.). Obsidian.



