Docling: The Document Alchemist

Why are we still fighting with documents in 2025?
Walk into any sizeable organization and you will meet a host of PDFs, Word files, PowerPoints, images with embedded text, handwritten notes, and CSVs downloaded from a SharePoint folder. Business analysts and data scientists spend many hours converting, cleaning, and reshaping those formats before their pipelines can consume them. Even the latest generative-AI stacks struggle when the underlying text is baked into a graphic or sprawled across flattened grids.
Docling was born to resolve that pain. Released as an open-source project by IBM Research Zurich, it is now hosted under the LF AI & Data Foundation and is available through both a Python API and a CLI.
Although Docling supports processing HTML files, MS Office files, image formats, and more, we will focus mainly on processing PDF files.
As a data scientist or ML engineer, why should I care about Docling?
In most projects, the real bottleneck is not the model. We spend a large percentage of our time fighting data, and nothing stalls a project faster than important data locked inside a 100-page PDF. Docling attacks exactly this problem, building a bridge from messy real-world documents to clean Markdown, JSON, or a pandas DataFrame.
But its usefulness reaches beyond plain data extraction, directly into modern development workflows. Consider extracting the text from the HTML page of some API documentation: Docling turns a complex web page into a clean, structured document, ready to feed as context into AI coding assistants such as Cursor, ChatGPT, or Claude.
Where Docling comes from
The project came out of an IBM Research group that needed robust retrieval-augmented generation (RAG) pipelines. They open-sourced the core under the MIT License in late 2024 and have been shipping releases almost weekly ever since. The growing community is held together by one unifying idea: the DoclingDocument model, a Pydantic object that stores text, images, tables, formulas, and layout metadata together, so downstream tools such as LangChain, LlamaIndex, or Haystack do not have to guess the reading order.
Today, Docling includes visual language models (VLMs), such as SmolDocling, for understanding images. It also supports OCR engines like Tesseract and EasyOCR for extracting text from pictures, plus utilities for chunking, embedding, and vector-store ingestion. In other words: point it at a folder, and you get Markdown, HTML, CSV, pandas objects, or JSON, without the usual scaffolding code.
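Docling ships its own chunking utilities for the RAG ingestion step. Purely to illustrate what that stage does, here is a minimal hand-rolled sketch (plain Python, not Docling's API) that splits an exported Markdown string on `## ` headings into retrieval-sized pieces:

```python
# Illustrative only: a naive Markdown chunker that splits on "## " headings.
# Docling provides real chunkers for this job; this just shows the idea of
# turning one exported Markdown string into retrieval-sized pieces.

def chunk_markdown(md_text: str) -> list[str]:
    """Split a Markdown string into chunks, one per '## ' section."""
    chunks, current = [], []
    for line in md_text.splitlines():
        if line.startswith("## ") and current:
            chunks.append("\n".join(current).strip())
            current = []
        current.append(line)
    if current:
        chunks.append("\n".join(current).strip())
    return chunks

sample = "## Revenue\nNumbers here.\n## Risks\nBoilerplate here."
for chunk in chunk_markdown(sample):
    print(chunk.splitlines()[0])  # print the heading of each chunk
```

A real pipeline would embed each chunk and push it into a vector store; Docling's own chunkers also preserve layout metadata, which this toy version throws away.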
What we will do
To showcase Docling, we will install it and use it in three different examples that demonstrate its versatility and usefulness as a document parser and processor. Note that Docling is fairly compute-hungry, so it helps if you have access to a GPU on your system.
However, before we begin writing code, we need to set up our environment.
Setting up our development environment
I will be using the uv package manager, but feel free to use whichever tooling you are most comfortable with. Note that I am working under WSL2 Ubuntu on Windows and running my code in a Jupyter notebook.
Be warned: even using uv, the code below took a few minutes to complete on my system, as Docling pulls in a large set of dependencies.
$ uv init docling
Initialized project `docling` at `/home/tom/docling`
$ cd docling
$ uv venv
Using CPython 3.11.10 interpreter at: /home/tom/miniconda3/bin/python
Creating virtual environment at: .venv
Activate with: source .venv/bin/activate
$ source .venv/bin/activate
(docling) $ uv pip install docling pandas jupyter
Now type the command:
(docling) $ jupyter notebook
You should then see a notebook open in your browser. If that doesn't happen automatically, you will likely see some output after running the jupyter notebook command. Near the bottom of it, you will find a URL to copy and paste into your browser to launch the Jupyter notebook.
Your URL will be different from mine, but it should look something like this:
Example 1: Convert any PDF or DOCX to Markdown or JSON
This is the simplest use case and the one you will probably use most of the time: converting a source document into Markdown.
For most of our examples, we will use a PDF that I downloaded beforehand: a copy of Tesla's 10-Q SEC filing from September 2023. It is around fifty pages long and contains financial information relating to Tesla. The full document is publicly available on the Securities and Exchange Commission (SEC) website, where it can be viewed and downloaded.
Here is an image of the first page of that document for your reference.
Let's review the simple code we need to convert it into Markdown. It sets the PDF file path, runs Docling's DocumentConverter on it, and exports the parsed result to Markdown format for easy reading, editing, or analysis.
from docling.document_converter import DocumentConverter
from pathlib import Path

inpath = "/mnt/d//tesla"
infile = "tesla_q10_sept_23.pdf"

data_folder = Path(inpath)
doc_path = data_folder / infile

converter = DocumentConverter()
result = converter.convert(doc_path)  # returns a ConversionResult

# Export the parsed document to Markdown
markdown_text = result.document.export_to_markdown()
print(markdown_text)
This is the output we receive from the code above (just the first page).
## UNITED STATES SECURITIES AND EXCHANGE COMMISSION
Washington, D.C. 20549 FORM 10-Q
(Mark One)
- x QUARTERLY REPORT PURSUANT TO SECTION 13 OR 15(d) OF THE SECURITIES EXCHANGE ACT OF 1934
For the quarterly period ended September 30, 2023
OR
- o TRANSITION REPORT PURSUANT TO SECTION 13 OR 15(d) OF THE SECURITIES EXCHANGE ACT OF 1934
For the transition period from _________ to _________
Commission File Number: 001-34756
## Tesla, Inc.
(Exact name of registrant as specified in its charter)
Delaware
(State or other jurisdiction of incorporation or organization)
1 Tesla Road Austin, Texas
(Address of principal executive offices)
## (512) 516-8177
(Registrant's telephone number, including area code)
## Securities registered pursuant to Section 12(b) of the Act:
| Title of each class | Trading Symbol(s) | Name of each exchange on which registered |
|-----------------------|---------------------|---------------------------------------------|
| Common stock | TSLA | The Nasdaq Global Select Market |
Indicate by check mark whether the registrant (1) has filed all reports required to be filed by Section 13 or 15(d) of the Securities Exchange Act of 1934 ('Exchange Act') during the preceding 12 months (or for such shorter period that the registrant was required to file such reports), and (2) has been subject to such filing requirements for the past 90 days. Yes x No o
Indicate by check mark whether the registrant has submitted electronically every Interactive Data File required to be submitted pursuant to Rule 405 of Regulation S-T (§232.405 of this chapter) during the preceding 12 months (or for such shorter period that the registrant was required to submit such files). Yes x No o
Indicate by check mark whether the registrant is a large accelerated filer, an accelerated filer, a non-accelerated filer, a smaller reporting company, or an emerging growth company. See the definitions of 'large accelerated filer,' 'accelerated filer,' 'smaller reporting company' and 'emerging growth company' in Rule 12b-2 of the Exchange Act:
Large accelerated filer
x
Accelerated filer
Non-accelerated filer
o
Smaller reporting company
Emerging growth company
o
If an emerging growth company, indicate by check mark if the registrant has elected not to use the extended transition period for complying with any new or revised financial accounting standards provided pursuant to Section 13(a) of the Exchange Act. o
Indicate by check mark whether the registrant is a shell company (as defined in Rule 12b-2 of the Exchange Act). Yes o No x
As of October 16, 2023, there were 3,178,921,391 shares of the registrant's common stock outstanding.
With the rise of AI editors and the use of LLMs in general, this capability is increasingly relevant. The effectiveness of LLMs and code editors can be greatly improved by giving them the right context. Usually this means a text representation of source documents, API references, and code examples.
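One practical wrinkle when pasting such exports into an LLM prompt is size. A crude way to respect a context budget (character-based here; a real setup would count tokens) is to cut at a paragraph boundary. This is a hypothetical helper, not part of Docling or any LLM SDK:

```python
# Illustrative helper (not part of Docling): trim a Markdown export to a rough
# character budget, cutting at a paragraph boundary so the prompt stays coherent.

def fit_to_budget(md_text: str, max_chars: int) -> str:
    if len(md_text) <= max_chars:
        return md_text
    # Cut at the last blank line before the budget, if there is one.
    cut = md_text.rfind("\n\n", 0, max_chars)
    return md_text[: cut if cut != -1 else max_chars].rstrip()

doc = "Intro paragraph.\n\nSecond paragraph.\n\nThird paragraph."
print(fit_to_budget(doc, 40))
```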
Converting the result to JSON is equally direct. Just add these two lines of code. You may run into limits on the JSON output size, so adjust the print statement accordingly.
json_blob = result.document.model_dump_json(indent=2)
print(json_blob[:10000], "…")
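Once you have the JSON, you can work with it using nothing but the standard library. The exact schema depends on your Docling version; top-level keys such as `texts` and `tables` appear in the versions I have used, but treat the field names below as an assumption. The tiny mock blob simply stands in for a real export:

```python
import json

# Mock stand-in for a real Docling JSON export. Real files are far larger, and
# the exact schema (key names like "texts"/"tables") may vary by version.
json_blob = json.dumps({
    "texts": [{"text": "FORM 10-Q"}, {"text": "Tesla, Inc."}],
    "tables": [{"num_rows": 8}],
})

# Load it back and take a quick inventory of the document's elements.
doc = json.loads(json_blob)
print(len(doc["texts"]), "text items;", len(doc["tables"]), "tables")
```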
Example 2: Extract complex tables from a PDF
Many PDF parsers treat tables as chunks of plain text or, worse, as flat images. Docling's table-structure model reconstructs rows, columns, and spanning cells, handing you a pandas DataFrame or a CSV ready for storage. Our test PDF has multiple tables. Look, for example, at page 11 of the PDF, and you will see the table below.

Let's see if we can get that data out. The code is a little more complex than in our first example, but it does similar work. The PDF is converted using Docling's DocumentConverter, producing a structured document. After that, for each table, it converts the table into a pandas DataFrame and retrieves the table's page number from the document metadata. If the table appears on page 11, it is printed in Markdown format and the loop breaks (so only the first matching table is shown).
import pandas as pd
from docling.document_converter import DocumentConverter
from time import time
from pathlib import Path

inpath = "/mnt/d//tesla"
infile = "tesla_q10_sept_23.pdf"

data_folder = Path(inpath)
input_doc_path = data_folder / infile

doc_converter = DocumentConverter()

start_time = time()
conv_res = doc_converter.convert(input_doc_path)

# Export the table from page 11
for table_ix, table in enumerate(conv_res.document.tables):
    page_number = table.prov[0].page_no if table.prov else "Unknown"
    if page_number == 11:
        table_df: pd.DataFrame = table.export_to_dataframe()
        print(f"## Table {table_ix} (Page {page_number})")
        print(table_df.to_markdown())
        break

end_time = time() - start_time
print(f"Document converted and tables exported in {end_time:.2f} seconds.")
And the output is not too shabby.
## Table 10 (Page 11)
| | | Three Months Ended September 30,.2023 | Three Months Ended September 30,.2022 | Nine Months Ended September 30,.2023 | Nine Months Ended September 30,.2022 |
|---:|:---------------------------------------|:----------------------------------------|:----------------------------------------|:---------------------------------------|:---------------------------------------|
| 0 | Automotive sales | $ 18,582 | $ 17,785 | $ 57,879 | $ 46,969 |
| 1 | Automotive regulatory credits | 554 | 286 | 1,357 | 1,309 |
| 2 | Energy generation and storage sales | 1,416 | 966 | 4,188 | 2,186 |
| 3 | Services and other | 2,166 | 1,645 | 6,153 | 4,390 |
| 4 | Total revenues from sales and services | 22,718 | 20,682 | 69,577 | 54,854 |
| 5 | Automotive leasing | 489 | 621 | 1,620 | 1,877 |
| 6 | Energy generation and storage leasing | 143 | 151 | 409 | 413 |
| 7 | Total revenues | $ 23,350 | $ 21,454 | $ 71,606 | $ 57,144 |
Document converted and tables exported in 33.43 seconds.
To retrieve all the tables from the PDF, you just need to remove the `if page_number == 11:` check and the `break` line from my code.
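If you do want every table persisted, the natural endpoint is pandas' `DataFrame.to_csv`. Here is a sketch using a stand-in DataFrame (in the real loop you would get each frame from `table.export_to_dataframe()` as in the example above and write one file per table):

```python
import io
import pandas as pd

# Stand-in for the DataFrame returned by table.export_to_dataframe();
# the values are hand-typed here purely for illustration.
table_df = pd.DataFrame(
    {"Revenue line": ["Automotive sales", "Total revenues"],
     "Q3 2023 ($M)": [18582, 23350]}
)

# Write CSV to an in-memory buffer; in the real loop you would write to a
# file such as f"table_{table_ix}.csv" instead.
buf = io.StringIO()
table_df.to_csv(buf, index=False)
print(buf.getvalue())
```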
One thing I've noticed with Docling is that it is not particularly quick. As shown above, it took nearly 34 seconds to extract that one table from a 50-page document.
Example 3: Run OCR on an image
For this example, I scanned a random page from the Tesla 10-Q PDF and saved it as a PNG file. Let's see how well Docling does at reading that image and converting what it finds into Markdown. Here is my scanned image.

And our code. We use Tesseract as our OCR engine (several are available).
from pathlib import Path
import time

import pandas as pd

from docling.document_converter import DocumentConverter, ImageFormatOption
from docling.models.tesseract_ocr_cli_model import TesseractCliOcrOptions


def main():
    inpath = "/mnt/d//tesla"
    infile = "10q-image.png"
    input_doc_path = Path(inpath) / infile

    # Configure OCR for image input
    image_options = ImageFormatOption(
        ocr_options=TesseractCliOcrOptions(force_full_page_ocr=True),
        do_table_structure=True,
        table_structure_options={"do_cell_matching": True},
    )

    converter = DocumentConverter(
        format_options={"image": image_options}
    )

    start_time = time.time()
    conv_res = converter.convert(input_doc_path).document

    # Print all tables as Markdown
    for table_ix, table in enumerate(conv_res.tables):
        table_df: pd.DataFrame = table.export_to_dataframe(doc=conv_res)
        page_number = table.prov[0].page_no if table.prov else "Unknown"
        print(f"\n--- Table {table_ix + 1} (Page {page_number}) ---")
        print(table_df.to_markdown(index=False))

    # Print the full document text as Markdown
    print("\n--- Full Document (Markdown) ---")
    print(conv_res.export_to_markdown())

    elapsed = time.time() - start_time
    print(f"\nProcessing completed in {elapsed:.2f} seconds")


if __name__ == "__main__":
    main()
Here is our result.
--- Table 1 (Page 1) ---
| | Three Months Ended September J0,. | Three Months Ended September J0,.2022 | Nine Months Ended September J0,.2023 | Nine Months Ended September J0,.2022 |
|:-------------------------|------------------------------------:|:----------------------------------------|:---------------------------------------|:---------------------------------------|
| Cost ol revenves | 181 | 150 | 554 | 424 |
| Research an0 developrent | 189 | 124 | 491 | 389 |
| | 95 | | 2B3 | 328 |
| Total | 465 | 362 | 1,328 | 1,141 |
--- Full Document (Markdown) ---
## Note 8 Equity Incentive Plans
## Other Pertormance-Based Grants
("RSUs") und stock optlons unrecognized stock-based compensatian
## Summary Stock-Based Compensation Information
| | Three Months Ended September J0, | Three Months Ended September J0, | Nine Months Ended September J0, | Nine Months Ended September J0, |
|--------------------------|------------------------------------|------------------------------------|-----------------------------------|-----------------------------------|
| | | 2022 | 2023 | 2022 |
| Cost ol revenves | 181 | 150 | 554 | 424 |
| Research an0 developrent | 189 | 124 | 491 | 389 |
| | 95 | | 2B3 | 328 |
| Total | 465 | 362 | 1,328 | 1,141 |
## Note 9 Commitments and Contingencies
## Operating Lease Arrangements In Buffalo, New York and Shanghai, China
## Legal Proceedings
Between september 1 which 2021 pald has
Processing completed in 7.64 seconds
If you compare this output to the original image, the results are disappointing. A lot of the text in the image is either missing or garbled. This is where a dedicated product like AWS Textract earns its keep when you need to extract text reliably from a variety of sources.
However, Docling offers several OCR options, so if you get poor results from one engine, you can always switch to another.
I tried the same task using the EasyOCR engine; here is the code.
from pathlib import Path
import time

import pandas as pd

from docling.document_converter import DocumentConverter, ImageFormatOption
from docling.models.easyocr_model import EasyOcrOptions  # Import EasyOCR options


def main():
    inpath = "/mnt/d//tesla"
    infile = "10q-image.png"
    input_doc_path = Path(inpath) / infile

    # Configure the image pipeline with EasyOCR
    image_options = ImageFormatOption(
        ocr_options=EasyOcrOptions(force_full_page_ocr=True),  # use EasyOCR
        do_table_structure=True,
        table_structure_options={"do_cell_matching": True},
    )

    converter = DocumentConverter(
        format_options={"image": image_options}
    )

    start_time = time.time()
    conv_res = converter.convert(input_doc_path).document

    # Print all tables as Markdown
    for table_ix, table in enumerate(conv_res.tables):
        table_df: pd.DataFrame = table.export_to_dataframe(doc=conv_res)
        page_number = table.prov[0].page_no if table.prov else "Unknown"
        print(f"\n--- Table {table_ix + 1} (Page {page_number}) ---")
        print(table_df.to_markdown(index=False))

    # Print the full document text as Markdown
    print("\n--- Full Document (Markdown) ---")
    print(conv_res.export_to_markdown())

    elapsed = time.time() - start_time
    print(f"\nProcessing completed in {elapsed:.2f} seconds")


if __name__ == "__main__":
    main()
Summary
The generative-AI boom has re-exposed an old truth: garbage in, garbage out. LLMs can only perform at their best when they are given coherent input. Docling provides a way (most of the time) to turn almost any document your stakeholders can throw at you into clean, structured text, and it does so locally and reproducibly.
Its uses extend well beyond the AI world, too. Consider the vast number of paper documents stored in sectors such as banking, government offices, and insurance companies worldwide. If these ever need to be digitised, Docling can provide part of the solution.
Its most serious weakness is probably optical character recognition of text within images. I tried both Tesseract and EasyOCR, and both sets of results were disappointing. You will probably need a dedicated product such as AWS Textract if you want to reliably extract text from these kinds of sources.
It can also be a little slow. I have a reasonably high-spec desktop PC with a GPU, and it still took its time on many jobs. However, if your inputs are primarily digital PDF documents, Docling can be a valuable addition to your text-processing toolbox.
I have only scratched the surface of what Docling can do, and I encourage you to visit its home page to learn more.



