A Beginner's Guide to Data Extraction with LangExtract and LLMs


Photo by the Author
Getting Started
Did you know that most important information still lives in unstructured text? Think of research papers, clinical notes, financial reports, and so on. Extracting reliable, structured information from these documents has always been a challenge. LangExtract is an open-source Python library (released by Google) that solves this problem using large language models (LLMs). You describe what you want to extract with a simple prompt and a few examples, and LangExtract then uses an LLM (such as Google's Gemini, OpenAI models, or local models) to pull that information out of any document. It is also useful because it supports very long documents (through chunking and parallel processing) and provides an interactive visualization of the results. Let's explore this library in more detail.
1. Installation and Setup
To install LangExtract, first make sure you have Python 3.10+ installed. The library is available on PyPI. In your terminal, create and activate a virtual environment, then install the package:
python -m venv langextract_env
source langextract_env/bin/activate  # On Windows: .\langextract_env\Scripts\activate
pip install langextract
Other installation methods are also available; you can check them in the official LangExtract repository.
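To confirm the installation worked, you can run a quick check in Python. This is a minimal sanity check, not part of LangExtract's own tooling; it simply imports the package and reads the installed version from package metadata:

# Quick sanity check: import the library and print the installed version.
import langextract as lx
from importlib.metadata import version

print(version("langextract"))  # raises PackageNotFoundError if not installed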
2. Setting API Keys (Cloud Models)
LangExtract itself is free and open source, but if you use a cloud-hosted LLM (such as Google Gemini or OpenAI GPT models), you must provide an API key. You can set the LANGEXTRACT_API_KEY environment variable or keep it in a .env file in your working directory. For example:
export LANGEXTRACT_API_KEY="YOUR_API_KEY_HERE"
or add it to a .env file:
cat >> .env << 'EOF'
LANGEXTRACT_API_KEY=your-api-key-here
EOF
echo '.env' >> .gitignore
On-device LLMs run through Ollama or other local backends do not require an API key. To use OpenAI models, run pip install langextract[openai], set your OPENAI_API_KEY, and pass an OpenAI model_id. For Vertex AI (enterprise users), service account authentication is supported.
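If you keep the key in a .env file, one common pattern is to load it into the environment before calling LangExtract. The sketch below uses the separate python-dotenv package (not part of LangExtract) purely as an illustration:

# Load environment variables from the .env file in the working directory.
# Requires: pip install python-dotenv
import os
from dotenv import load_dotenv

load_dotenv()  # reads .env and populates os.environ
api_key = os.environ.get("LANGEXTRACT_API_KEY")
print("API key loaded:", bool(api_key))  # avoid printing the key itself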
3. Defining the Extraction Task
LangExtract works by having you tell it what information to extract. You do this by writing a short, clear prompt description and providing one or more ExampleData objects that show what a correct output looks like on sample text. For example, to extract characters, emotions, and relationships from a passage of Shakespeare, you would write:
import langextract as lx

prompt = """
Extract characters, emotions, and relationships in order of appearance.
Use exact text for extractions. Do not paraphrase or overlap entities.
Provide meaningful attributes for each entity to add context."""

examples = [
    lx.data.ExampleData(
        text="ROMEO. But soft! What light through yonder window breaks? ...",
        extractions=[
            lx.data.Extraction(
                extraction_class="character",
                extraction_text="ROMEO",
                attributes={"emotional_state": "wonder"}
            ),
            lx.data.Extraction(
                extraction_class="emotion",
                extraction_text="But soft!",
                attributes={"feeling": "gentle awe"}
            )
        ]
    )
]
These examples (taken from LangExtract's README) tell the model what kind of structured output to expect. You can create similar examples for your own domain.
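For instance, if your documents were clinical notes rather than plays, a similar example could target medications and conditions. The snippet below is a hypothetical illustration that reuses the same lx.data.ExampleData and lx.data.Extraction classes shown above; the entity classes and attributes are ones you would choose for your own task:

# Hypothetical domain example: extract medications and conditions from clinical text.
clinical_examples = [
    lx.data.ExampleData(
        text="Patient was started on Lisinopril 10 mg daily for hypertension.",
        extractions=[
            lx.data.Extraction(
                extraction_class="medication",
                extraction_text="Lisinopril",
                attributes={"dosage": "10 mg", "frequency": "daily"}
            ),
            lx.data.Extraction(
                extraction_class="condition",
                extraction_text="hypertension",
                attributes={"status": "being treated"}
            )
        ]
    )
]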
4. Running the Extraction
Once your prompt and examples are defined, you simply call lx.extract(). The key arguments are:
- text_or_documents: your input text, a list of texts, or a URL string (LangExtract can download and process text from Project Gutenberg or other URLs).
- prompt_description: the extraction instructions (a string).
- examples: a list of ExampleData objects showing the desired output.
- model_id: the LLM to use (e.g. "gemini-2.5-flash" for Google Gemini Flash, an Ollama model such as "gemma2:2b", or an OpenAI model such as "gpt-4o").
- Other parameters are optional: extraction_passes (runs multiple passes for higher recall on long documents), max_workers (parallel processing of chunks), fence_output, use_schema_constraints, and so on.
For example:
input_text=""'JULIET. O Romeo, Romeo! wherefore art thou Romeo?
Deny thy father and refuse thy name;
Or, if thou wilt not, be but sworn my love,
And I'll no longer be a Capulet.
ROMEO. Shall I hear more, or shall I speak at this?
JULIET. 'Tis but thy name that is my enemy;
Thou art thyself, though not a Montague.
What’s in a name? That which we call a rose
By any other name would smell as sweet.'''
result = lx.extract(
text_or_documents=input_text,
prompt_description=prompt,
examples=examples,
model_id="gemini-2.5-flash"
)
This sends your prompt, examples, and text to the selected LLM and returns a result object. LangExtract automatically handles splitting long texts into chunks, parallelizing the calls, and merging the results.
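For longer inputs, such as a full book downloaded from a URL, you can combine the optional parameters listed above. The values below are illustrative choices, not recommendations:

# Illustrative call for a long document: multiple passes plus parallel chunk processing.
result = lx.extract(
    text_or_documents="https://www.gutenberg.org/files/1513/1513-0.txt",
    prompt_description=prompt,
    examples=examples,
    model_id="gemini-2.5-flash",
    extraction_passes=3,   # re-run extraction to improve recall on long texts
    max_workers=10         # process chunks in parallel
)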
5. Output Handling and Visualization
The result of lx.extract() is a Python object (often named result) that contains the extracted entities and their attributes. You can inspect it directly or save it for later. LangExtract also provides helper functions for saving results: for example, you can write the results to a JSONL (JSON Lines) file (one document per line) and generate an interactive HTML visualization. For example:
lx.io.save_annotated_documents([result], output_name="extraction_results.jsonl", output_dir=".")
html = lx.visualize("extraction_results.jsonl")
with open("viz.html", "w") as f:
f.write(html if isinstance(html, str) else html.data)
This writes an extraction_results.jsonl file and an interactive viz.html file. The JSONL format is well suited to large datasets and streaming processing, and the HTML file highlights each extracted span in its original context (color-coded by class).
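You can also inspect the result directly in Python. The sketch below assumes the result object exposes an extractions list whose items carry the extraction_class, extraction_text, and attributes fields used in the examples earlier:

# Print each extracted entity with its class and attributes.
for extraction in result.extractions:
    print(extraction.extraction_class, "->", extraction.extraction_text)
    print("  attributes:", extraction.attributes)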

6. Supported Input Formats
LangExtract is flexible about its input. You can provide:
- Plain text strings: any text you load into Python (e.g., from a file or database) can be processed.
- URLs: as shown above, you can pass a URL (e.g., a Project Gutenberg link) as text_or_documents="https://www.gutenberg.org/files/1513/1513-0.txt". LangExtract will download the document and extract from it.
- Multiple documents: pass a Python list of strings to process several documents in a single call (see the sketch after this list).
- Rich text or Markdown: since LangExtract works at the text level, you can also feed it Markdown or HTML, as long as you treat it as raw text. (LangExtract itself does not accept PDFs or images; you need to extract the text first.)
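For example, to process several documents in one call, you can pass a plain Python list of strings, as the article describes. This is a minimal sketch reusing the prompt and examples defined earlier:

# Process multiple documents in a single call by passing a list of strings.
documents = [
    "ROMEO. But soft! What light through yonder window breaks?",
    "JULIET. O Romeo, Romeo! wherefore art thou Romeo?",
]

results = lx.extract(
    text_or_documents=documents,
    prompt_description=prompt,
    examples=examples,
    model_id="gemini-2.5-flash"
)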
7. Conclusion
LangExtract makes it easy to convert unstructured text into structured data. With high accuracy, clear source grounding, and easy customization, it works well where conventional methods fall short. It is especially useful for complex or domain-specific extraction tasks. While there is room for improvement, LangExtract is already a solid tool for structured information extraction in 2025.
Kanwal Mehreen is a machine learning engineer and technical writer with a strong interest in data science and the intersection of AI and medicine. She authored the eBook "Maximizing Productivity with ChatGPT". As a Google Generation Scholar 2022 for APAC, she champions diversity and academic excellence. She has also been recognized as a Teradata Diversity in Tech Scholar, a Mitacs Globalink Research Scholar, and a Harvard WeCode Scholar. Kanwal is a passionate advocate for change, having founded FEMCodes to empower women in STEM.



