GliNER2: Extracting Structured Information from Text

For a long time, we had spaCy, the de facto NLP library for both beginners and advanced users. It made it easy to dip your toes into NLP, even if you weren't a deep learning expert. However, with the rise of ChatGPT and other LLMs, it seems to have been pushed aside.
While LLMs like Claude or Gemini can handle all kinds of NLP tasks out of the box, you don't always want to bring a rocket launcher to a fist fight. GliNER is leading the charge to bring small, focused models back to classic NLP tasks like entity and relation extraction. It's simple enough to run on a CPU, yet powerful enough to have built a thriving community around it.
Released earlier this year, GliNER2 is a remarkable advance. Where the original GliNER focused on entity recognition (giving rise to various spin-offs such as GLiREL for relations and GLiClass for classification), GliNER2 unifies entity recognition, text classification, relation extraction, and structured data extraction in a single framework.
A key change in GliNER2 is its schema-driven approach, which allows you to define extraction requirements declaratively and perform multiple tasks with a single call. Despite these extended capabilities, the model remains CPU-efficient, making it an ideal solution for converting messy, unstructured text into clean data without the overhead of a large language model.
As a knowledge graph enthusiast at Neo4j, I was particularly drawn to the new structured data extraction via the extract_json method. While extracting entities and relationships is important in its own right, the ability to define a schema and pull structured JSON directly from a document is what excites me the most. It is a natural fit for knowledge graph construction, where structured, consistent output is essential.
In this blog post, we will explore the capabilities of GliNER2, specifically the model fastino/gliner2-big-v1, with a focus on how it can help us build clean, structured knowledge graphs.
The code is available on GitHub.
Selecting a dataset
We're not using official benchmarks here, just a quick vibe check to see what GliNER2 can do. Here is our test text, taken from Ada Lovelace's Wikipedia page:
Augusta Ada King, Countess of Lovelace (10 December 1815 – 27 November 1852), also known as Ada Lovelace, was an English mathematician and writer best known for her work on Charles Babbage's proposed general-purpose computer, the Analytical Engine. She was the first to realize that the machine had applications beyond pure arithmetic. Lovelace is often considered the first computer programmer. Lovelace was the only legitimate child of poet Lord Byron and reformer Anne Isabella Milbanke. All of her half-siblings, Lord Byron's other children, were born out of wedlock to other women. Lord Byron separated from his wife a month after Ada was born and left England for good. He died in Greece during the Greek War of Independence, when Ada was eight years old. Lady Byron was concerned about her daughter's upbringing and encouraged Lovelace's interest in mathematics and logic to prevent her from developing her father's supposed insanity. Despite this, Lovelace remained interested in her father, naming one son Byron and the other, after her father's middle name, Gordon. At her request, Lovelace was buried next to her father. Although often ill in childhood, Lovelace pursued her studies with determination. She married William King in 1835. King was a baron, and was created Viscount Ockham and 1st Earl of Lovelace in 1838. The name Lovelace was chosen because Ada was a descendant of the extinct Barons Lovelace. The title given to her husband made Ada the Countess of Lovelace.
At 322 tokens, it's a solid chunk of text to work with. Let's dive in.
Entity extraction
Let's start with entity extraction. At its core, named entity recognition is the process of automatically identifying and classifying important elements within a text, like people, locations, organizations, or technical concepts. The original GliNER already handled this well, but GliNER2 goes a step further by allowing you to add definitions to entity types, giving you even finer control over the output.
entities = extractor.extract_entities(
    text,
    {
        "Person": "Names of people, including nobility titles.",
        "Location": "Countries, cities, or geographic places.",
        "Invention": "Machines, devices, or technological creations.",
        "Event": "Historical events, wars, or conflicts."
    }
)
The results are as follows:

Providing custom definitions for each entity type helps resolve ambiguity and improves output accuracy. This is especially useful for broad categories such as Event, where the model may not otherwise know whether to include wars, festivals, or significant personal events. Adding "Historical events, wars, or conflicts" as the definition narrows the target range.
Relation extraction
Relation extraction identifies relationships between pairs of entities in a document. For example, in the sentence "Steve Jobs founded Apple", a relation extraction model will identify a founded relationship between the person Steve Jobs and the organization Apple.
With GLiNER2, you only define the relation types you want to extract since you cannot enforce which entity types are allowed as the head or tail of each relation. This simplifies the interface but may require post-processing to filter out unwanted pairings.
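Since entity types can't be enforced on relation heads and tails, one workable pattern is to cross-check the relation pairs against the entity output yourself. The sketch below assumes relations come back as a dict of (head, tail) pairs per relation type; filter_relations and allowed are hypothetical names for illustration, not part of the GliNER2 API:

```python
# Hypothetical post-processing: relation extraction cannot constrain
# head/tail entity types, so we filter the pairs ourselves using the
# entity-extraction output.
def filter_relations(relations, entities, allowed):
    """Keep only (head, tail) pairs whose entity types match `allowed`.

    relations: {rel_type: [(head, tail), ...]}
    entities:  {entity_type: [name, ...]}
    allowed:   {rel_type: (head_type, tail_type)}
    """
    # Invert the entity output into a name -> type lookup.
    type_of = {name: etype for etype, names in entities.items() for name in names}
    filtered = {}
    for rel_type, pairs in relations.items():
        head_type, tail_type = allowed.get(rel_type, (None, None))
        filtered[rel_type] = [
            (head, tail) for head, tail in pairs
            if type_of.get(head) == head_type and type_of.get(tail) == tail_type
        ]
    return filtered

entities = {
    "Person": ["Ada Lovelace", "Charles Babbage"],
    "Invention": ["Analytical Engine"],
}
relations = {
    "invented": [
        ("Charles Babbage", "Analytical Engine"),
        ("Analytical Engine", "Charles Babbage"),  # wrong direction, filtered out
    ]
}
allowed = {"invented": ("Person", "Invention")}
filtered = filter_relations(relations, entities, allowed)
# filtered == {"invented": [("Charles Babbage", "Analytical Engine")]}
```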
Here, I ran a small experiment by including both alias and same_as relations with identical definitions.
relations = extractor.extract_relations(
    text,
    {
        "parent_of": "A person is the parent of another person",
        "married_to": "A person is married to another person",
        "worked_on": "A person contributed to or worked on an invention",
        "invented": "A person created or proposed an invention",
        "alias": "Entity is an alias, nickname, title, or alternate reference for another entity",
        "same_as": "Entity is an alias, nickname, title, or alternate reference for another entity"
    }
)
The results are as follows:

The model clearly identified the important relationships: Lord Byron and Anne Isabella Milbanke as Ada's parents, her marriage to William King, Babbage as the inventor of the Analytical Engine, and Ada's work on it. Remarkably, the model found Augusta Ada King as an alias of Ada Lovelace, but never produced same_as, despite its identical definition. The choice does not appear to be random, as the model consistently fills the alias relation and never same_as. This highlights how sensitive relation extraction is to the label wording itself, not just the definitions.
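Since the model reliably fills the alias relation, those pairs can double as a deduplication signal before further processing. Here is a minimal, hypothetical cleanup helper (not part of the GliNER2 API), assuming alias pairs come back as (alias, canonical) tuples:

```python
# Hypothetical cleanup step: use extracted alias pairs to map every
# mention to one canonical name before deduplicating entities.
def canonicalise(name, alias_pairs):
    """Follow alias -> canonical links until a fixed point (cycle-safe)."""
    mapping = dict(alias_pairs)  # alias name -> canonical name
    seen = set()
    while name in mapping and name not in seen:
        seen.add(name)
        name = mapping[name]
    return name

alias_pairs = [("Augusta Ada King", "Ada Lovelace")]
canonicalise("Augusta Ada King", alias_pairs)  # -> "Ada Lovelace"
canonicalise("Charles Babbage", alias_pairs)   # unchanged
```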
Conveniently, GLiNER2 allows you to combine multiple task types in one call, extracting entity types alongside relation types in a single pass. However, the tasks are independent: entity extraction does not filter or restrict which entities appear in the extracted relations, and vice versa. Think of it as running both extractions in parallel rather than as a pipeline.
schema = (extractor.create_schema()
    .entities({
        "Person": "Names of people, including nobility titles.",
        "Location": "Countries, cities, or geographic places.",
        "Invention": "Machines, devices, or technological creations.",
        "Event": "Historical events, wars, or conflicts."
    })
    .relations({
        "parent_of": "A person is the parent of another person",
        "married_to": "A person is married to another person",
        "worked_on": "A person contributed to or worked on an invention",
        "invented": "A person created or proposed an invention",
        "alias": "Entity is an alias, nickname, title, or alternate reference for another entity"
    })
)
results = extractor.extract(text, schema)
The results are as follows:

The combined output now gives us entity types, separated by color. However, several nodes appear isolated (Greece, England, Greek War of Independence), as not all the extracted entities participate in the extracted relations.
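If those isolated nodes are unwanted, a small post-processing step can prune them. The helper below is a hypothetical sketch (not part of the GliNER2 API), assuming entities come back as a dict of names per type and relations as (head, tail) pairs per type:

```python
# Optional cleanup (hypothetical helper): drop entities that don't take
# part in any extracted relation, so the graph has no isolated nodes.
def drop_isolated(entities, relations):
    connected = {name for pairs in relations.values()
                 for pair in pairs for name in pair}
    return {etype: [n for n in names if n in connected]
            for etype, names in entities.items()}

entities = {"Person": ["Ada Lovelace", "William King"],
            "Location": ["Greece", "England"]}
relations = {"married_to": [("Ada Lovelace", "William King")]}
drop_isolated(entities, relations)
# {"Person": ["Ada Lovelace", "William King"], "Location": []}
```

Whether to prune depends on your data model: isolated entities can still be valuable as MENTIONS targets for retrieval, even without relationships.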
Structured JSON output
Perhaps the most powerful feature is structured data extraction with extract_json. This mimics the structured output functionality of LLMs like ChatGPT or Gemini but runs entirely on a CPU. Unlike entity and relation extraction, it allows you to define custom fields and pull them into structured records. The syntax follows a field_name::type::description pattern, where type is str or list.
results = extractor.extract_json(
    text,
    {
        "person": [
            "name::str",
            "gender::str::male or female",
            "alias::str",
            "description::str::brief summary of the available information about the person",
            "birth_date::str",
            "death_date::str",
            "parent_of::str",
            "married_to::str"
        ]
    }
)
Here we deliberately overlap with the earlier tasks: alias, parent_of, and married_to could also be modeled as relations. It's worth testing which method works best for your use case. Another interesting addition is the description field, which pushes the boundaries a bit: it is closer to summary generation than to pure extraction.
The results are as follows:
{
  "person": [
    {
      "name": "Augusta Ada King",
      "gender": null,
      "alias": "Ada Lovelace",
      "description": "English mathematician and writer",
      "birth_date": "10 December 1815",
      "death_date": "27 November 1852",
      "parent_of": "Ada Lovelace",
      "married_to": "William King"
    },
    {
      "name": "Charles Babbage",
      "gender": null,
      "alias": null,
      "description": null,
      "birth_date": null,
      "death_date": null,
      "parent_of": "Ada Lovelace",
      "married_to": null
    },
    {
      "name": "Lord Byron",
      "gender": null,
      "alias": null,
      "description": "reformer",
      "birth_date": null,
      "death_date": null,
      "parent_of": "Ada Lovelace",
      "married_to": null
    },
    {
      "name": "Anne Isabella Milbanke",
      "gender": null,
      "alias": null,
      "description": "reformer",
      "birth_date": null,
      "death_date": null,
      "parent_of": "Ada Lovelace",
      "married_to": null
    },
    {
      "name": "William King",
      "gender": null,
      "alias": null,
      "description": null,
      "birth_date": null,
      "death_date": null,
      "parent_of": "Ada Lovelace",
      "married_to": null
    }
  ]
}
The results reveal some limitations. All gender fields are null; although Ada is explicitly referred to as a daughter, the model doesn't infer that she's a woman. The description field captures only high-level phrases ("English mathematician and writer", "reformer") instead of generating meaningful summaries, which makes it less useful for workflows like Microsoft GraphRAG that rely on rich entity descriptions. There are also obvious errors: Charles Babbage and William King are incorrectly labeled as parent_of Ada, and Lord Byron is described as a reformer (that's Anne Isabella). These parent_of errors did not come up during relation extraction, so that may be the better approach here. Overall, the results suggest that the model is very good at extraction but struggles with inference, which may be a trade-off of its compact size.
Additionally, all attributes are optional, which makes sense and keeps things simple. However, you should be careful, as sometimes the name attribute will be null, making the record invalid. Finally, we can use something like Pydantic to validate the results, cast values to appropriate types like floats or dates, and handle invalid records.
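As a minimal stand-in for that validation step, here is a stdlib-only sketch (Pydantic would work just as well). clean_person is a hypothetical helper, and the "%d %B %Y" date format is an assumption based on the dates shown above:

```python
from datetime import datetime

# Minimal validation pass (stdlib stand-in for Pydantic): drop records
# without a name and parse date strings such as "10 December 1815".
def clean_person(record):
    if not record.get("name"):
        return None  # a record without a name is unusable
    cleaned = dict(record)
    for field in ("birth_date", "death_date"):
        value = cleaned.get(field)
        if value:
            try:
                cleaned[field] = datetime.strptime(value, "%d %B %Y").date()
            except ValueError:
                cleaned[field] = None  # keep the record, discard the bad date
    return cleaned

records = [
    {"name": "Augusta Ada King", "birth_date": "10 December 1815",
     "death_date": "27 November 1852"},
    {"name": None, "birth_date": "1838"},  # dropped: no name
]
valid = [c for c in (clean_person(r) for r in records) if c is not None]
# valid[0]["birth_date"].isoformat() == "1815-12-10"
```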
Creating knowledge graphs
Since GLiNER2 allows multiple types of extraction in a single pass, we can combine all of the above methods to build a knowledge graph. Rather than using separate pipelines for entity, relation, and structured data extraction, a single schema definition handles all three. This makes it easy to go from raw text to a rich, linked representation.
schema = (extractor.create_schema()
    .entities({
        "Person": "Names of people, including nobility titles.",
        "Location": "Countries, cities, or geographic places.",
        "Invention": "Machines, devices, or technological creations.",
        "Event": "Historical events, wars, or conflicts."
    })
    .relations({
        "parent_of": "A person is the parent of another person",
        "married_to": "A person is married to another person",
        "worked_on": "A person contributed to or worked on an invention",
        "invented": "A person created or proposed an invention",
    })
    .structure("person")
    .field("name", dtype="str")
    .field("alias", dtype="str")
    .field("description", dtype="str")
    .field("birth_date", dtype="str")
)
results = extractor.extract(text, schema)
How you map these results to your graph (nodes, relationships, properties) depends on your data model. In this example, we use the following data model:

What you will notice is that we include part of the original text in the graph as well, which allows us to retrieve and reference the source of information when querying the graph, enabling more accurate and traceable answers. The import Cypher looks like this:
import_cypher_query = """
// Create Chunk node from text
CREATE (c:Chunk {text: $text})
// Create Person nodes with properties
WITH c
CALL (c) {
  UNWIND $data.person AS p
  WITH p
  WHERE p.name IS NOT NULL
  MERGE (n:__Entity__ {name: p.name})
  SET n.description = p.description,
      n.birth_date = p.birth_date
  MERGE (c)-[:MENTIONS]->(n)
  WITH p, n WHERE p.alias IS NOT NULL
  MERGE (m:__Entity__ {name: p.alias})
  MERGE (n)-[:ALIAS_OF]->(m)
}
// Create entity nodes dynamically with __Entity__ base label + dynamic label
CALL (c) {
  UNWIND keys($data.entities) AS label
  UNWIND $data.entities[label] AS entityName
  MERGE (n:__Entity__ {name: entityName})
  SET n:$(label)
  MERGE (c)-[:MENTIONS]->(n)
}
// Create relationships dynamically
CALL (c) {
  UNWIND keys($data.relation_extraction) AS relType
  UNWIND $data.relation_extraction[relType] AS rel
  MATCH (a:__Entity__ {name: rel[0]})
  MATCH (b:__Entity__ {name: rel[1]})
  MERGE (a)-[:$(toUpper(relType))]->(b)
}
RETURN DISTINCT 'import completed' AS result
"""
The Cypher query takes the GliNER2 output and stores it in Neo4j. We could also add embeddings of the text chunks, entities, and more.
Summary
GliNER2 is a step in the right direction for structured data extraction. With the rise of LLMs, it's easy to reach for ChatGPT or Claude whenever you need to extract information from text, but that's often overkill. Using a multi-billion-parameter model to extract a few entities and relationships feels like a waste when smaller, specialized tools can do the job on a CPU.
GliNER2 combines named entity recognition, relational extraction, and structured JSON output into a single framework. It is well suited for tasks such as knowledge graph construction, where you need consistent, schema-driven output rather than open source generation.
The model has its limitations, though. It is more effective at direct extraction than at inference or speculation, and the results are not always consistent. But the progress from the original GliNER to GliNER2 is encouraging, and hopefully we'll see further advances in this space. For many use cases, a fixed, schema-driven model beats an LLM that does more than you need.
The code is available on GitHub.



