
Docling: The Document Alchemist | Towards Data Science

Why are we still fighting with documents in 2025?

Walk into almost any organization and you will meet a zoo of PDFs, Word files, PowerPoints, images of scanned pages, handwritten notes, and CSV files downloaded from a SharePoint folder. Business analysts and data scientists spend many hours converting, splitting, and cleaning those formats before their pipelines can use them. Even the latest AI stacks stumble when the underlying text is embedded in a graphic or scattered across ragged grids.

Docling was born to solve that pain. Released as an open-source research project by IBM Research Zurich, it is now hosted under the Linux Foundation AI & Data Foundation and ships as a Python library with both an API and a CLI.

Although Docling supports processing HTML files, MS Office files, image formats, and more, we will focus mainly on processing PDF files.

As a data scientist or ML engineer, why should I care about Docling?

Usually, the real bottleneck isn't the model, it's the data. We spend a large percentage of our time fighting data, and nothing slows a project down faster than important data locked inside page 100 of a PDF. Docling tackles exactly this problem, building a bridge from messy real-world documents to the sanity of Markdown, JSON, or a Pandas DataFrame.

But its power reaches beyond simple data extraction, straight into modern development workflows. Consider extracting the text from the HTML pages of some API documentation; Docling turns a complex web format into a clean, structured representation, which is perfect context to feed directly into AI coding assistants such as Cursor, ChatGPT, or Claude.

Where Docling comes from

The project came out of IBM Research, which originally built it to improve retrieval-augmented generation (RAG) pipelines. They open-sourced it under the MIT License late in 2024 and have been shipping releases roughly weekly ever since. The secret sauce is its unified DoclingDocument model, a Pydantic object that stores text, pictures, tables, formulas, and layout metadata together, so downstream tools such as LangChain, LlamaIndex, or Haystack don't have to guess the reading order.

Today, Docling includes visual language models (VLMs), such as SmolDocling, for image understanding. It also supports the Tesseract and EasyOCR engines for pulling text out of images, and ships utilities for chunking, embedding, and vector-store ingestion. In other words: point it at a folder, and you get back Markdown, HTML, CSV, JSON, or Python document objects, with no scaffolding code required.

What we'll cover

To show off Docling, we will install it and work through three different examples that demonstrate its versatility and usefulness as a document parser and processor. Please note that Docling is compute-intensive, so it helps if you have access to a GPU on your system.

Before we dive into code, though, we need to set up our environment.

Setting up a development environment

I'll be using the uv package manager, but feel free to use whichever tooling you are most comfortable with. Note that I'll be working under WSL2 on Windows and running my code in a Jupyter notebook.

Be aware that, even using uv, the code below took a few minutes to complete on my system, as Docling pulls in a sizeable set of dependencies.

$ uv init docling
Initialized project `docling` at `/home/tom/docling`
$ cd docling
$ uv venv
Using CPython 3.11.10 interpreter at: /home/tom/miniconda3/bin/python
Creating virtual environment at: .venv
Activate with: source .venv/bin/activate
$ source .venv/bin/activate
(docling) $ uv pip install docling pandas jupyter

Now type this command:

(docling) $ jupyter notebook

You should see a notebook open in your browser. If that doesn't happen automatically, you will probably see a screenful of information after running the jupyter notebook command. Near the bottom, you will find a URL to copy and paste into your browser to launch the Jupyter notebook.

Your URL will be different from mine, but it should look something like this:

Example 1: Convert any PDF or DOCX to Markdown or JSON

The simplest use case, and the one you will probably use most of the time: converting a PDF into Markdown.

For most of our examples, we will use the same input PDF: a copy of Tesla's 10-Q SEC filing from September 2023. It is around fifty pages long and contains financial information relating to Tesla. The full document is publicly available on the Securities and Exchange Commission (SEC) website and can be viewed or downloaded using this link.

Here is an image of the first page of that document for your reference.

Picture from Tesla 10-Q PDF

Let's review the Docling code we need to convert it to Markdown. It sets the path to the PDF file, runs Docling's DocumentConverter on it, and exports the parsed result to Markdown format for easy reading, editing, or analysis.

from pathlib import Path

from docling.document_converter import DocumentConverter

inpath = "/mnt/d//tesla"
infile = "tesla_q10_sept_23.pdf"

data_folder = Path(inpath)
doc_path = data_folder / infile

converter = DocumentConverter()
result = converter.convert(doc_path)  # returns a ConversionResult

# Export the parsed document as Markdown
markdown_text = result.document.export_to_markdown()
print(markdown_text)

Here is the output we get from the code above (just the first page).

## UNITED STATES SECURITIES AND EXCHANGE COMMISSION

Washington, D.C. 20549 FORM 10-Q

(Mark One)

- x QUARTERLY REPORT PURSUANT TO SECTION 13 OR 15(d) OF THE SECURITIES EXCHANGE ACT OF 1934

For the quarterly period ended September 30, 2023

OR

- o TRANSITION REPORT PURSUANT TO SECTION 13 OR 15(d) OF THE SECURITIES EXCHANGE ACT OF 1934

For the transition period from _________ to _________

Commission File Number: 001-34756

## Tesla, Inc.

(Exact name of registrant as specified in its charter)

Delaware

(State or other jurisdiction of incorporation or organization)

1 Tesla Road Austin, Texas

(Address of principal executive offices)

## (512) 516-8177

(Registrant's telephone number, including area code)

## Securities registered pursuant to Section 12(b) of the Act:

| Title of each class   | Trading Symbol(s)   | Name of each exchange on which registered   |
|-----------------------|---------------------|---------------------------------------------|
| Common stock          | TSLA                | The Nasdaq Global Select Market             |

Indicate by check mark whether the registrant (1) has filed all reports required to be filed by Section 13 or 15(d) of the Securities Exchange Act of 1934 ('Exchange Act') during the preceding 12 months (or for such shorter period that the registrant was required to file such reports), and (2) has been subject to such filing requirements for the past 90 days. Yes x No o

Indicate by check mark whether the registrant has submitted electronically every Interactive Data File required to be submitted pursuant to Rule 405 of Regulation S-T (§232.405 of this chapter) during the preceding 12 months (or for such shorter period that the registrant was required to submit such files). Yes x No o

Indicate by check mark whether the registrant is a large accelerated filer, an accelerated filer, a non-accelerated filer, a smaller reporting company, or an emerging growth company. See the definitions of 'large accelerated filer,' 'accelerated filer,' 'smaller reporting company' and 'emerging growth company' in Rule 12b-2 of the Exchange Act:

Large accelerated filer

x

Accelerated filer

Non-accelerated filer

o

Smaller reporting company

Emerging growth company

o

If an emerging growth company, indicate by check mark if the registrant has elected not to use the extended transition period for complying with any new or revised financial accounting standards provided pursuant to Section 13(a) of the Exchange Act. o

Indicate by check mark whether the registrant is a shell company (as defined in Rule 12b-2 of the Exchange Act). Yes o No x

As of October 16, 2023, there were 3,178,921,391 shares of the registrant's common stock outstanding.

With the rise of AI editors and the use of LLMs generally, this capability has become very important and relevant. The effectiveness of LLMs and code editors can be greatly improved by giving them the relevant context. Often, that means supplying a text representation of specification documents, API references, and code examples.
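In practice, handing that context to an AI assistant can be as simple as saving the Markdown next to your project and attaching the file. A minimal sketch; the file name is my own choice, and the short string stands in for what `result.document.export_to_markdown()` returns:

```python
from pathlib import Path

# Stand-in for the Docling output so the sketch is self-contained;
# in the real pipeline this comes from result.document.export_to_markdown().
markdown_text = (
    "## UNITED STATES SECURITIES AND EXCHANGE COMMISSION\n\n"
    "Washington, D.C. 20549 FORM 10-Q\n"
)

# Save the Markdown so it can be attached as context in an AI editor
out_path = Path("tesla_10q.md")
out_path.write_text(markdown_text, encoding="utf-8")
print(f"Wrote {out_path} ({out_path.stat().st_size} bytes)")
```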

Converting the parsed PDF to JSON is equally direct. Just add these two lines of code. You may run into size limits when printing out large JSON blobs, so adjust the print statement accordingly.

json_blob = result.document.model_dump_json(indent=2)

print(json_blob[:10000], "…")
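Once you have the JSON string, the standard library's json module is enough to explore it programmatically. Below is a sketch using a small hand-made stand-in blob; the top-level keys shown ("texts", "tables", "pictures") match the DoclingDocument schema as I understand it, so verify them against your Docling version:

```python
import json

# Hand-made stand-in for result.document.model_dump_json(); the real blob
# is far larger but shares these top-level keys (check your Docling version).
json_blob = """
{
  "name": "tesla_q10_sept_23",
  "texts": [{"text": "UNITED STATES SECURITIES AND EXCHANGE COMMISSION"}],
  "tables": [{"label": "table"}],
  "pictures": []
}
"""

doc = json.loads(json_blob)
for key in ("texts", "tables", "pictures"):
    print(f"{key}: {len(doc.get(key, []))} item(s)")
```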

Example 2: Extract complex tables from a PDF

Many PDFs keep tables as chunks of plain text or, worse still, as flat images. Docling's table-structure model reconstructs the rows, columns, and spanning cells, handing you a Pandas DataFrame or CSV ready for storage. Our test PDF has multiple tables. Look, for example, at page 11 of the PDF, where we can see the table below.

Picture from Tesla 10-Q PDF

Let's see if we can get that data out. The code is a little more involved than our first example, but does much the same thing. The PDF is converted using Docling's DocumentConverter, producing a structured document. After that, for each table found, it converts the table into a Pandas DataFrame and also retrieves the table's page number from the document metadata. If the table appears on page 11, it prints it in Markdown format and breaks out of the loop (so only the first matching table is shown).

import pandas as pd
from docling.document_converter import DocumentConverter
from time import time
from pathlib import Path

inpath = "/mnt/d//tesla"
infile = "tesla_q10_sept_23.pdf"
data_folder = Path(inpath)
input_doc_path = data_folder / infile

doc_converter = DocumentConverter()
start_time = time()
conv_res = doc_converter.convert(input_doc_path)

# Export table from page 11
for table_ix, table in enumerate(conv_res.document.tables):
    page_number = table.prov[0].page_no if table.prov else "Unknown"
    if page_number == 11:
        table_df: pd.DataFrame = table.export_to_dataframe()
        print(f"## Table {table_ix} (Page {page_number})")
        print(table_df.to_markdown())
        break

end_time = time() - start_time
print(f"Document converted and tables exported in {end_time:.2f} seconds.")

And the output is not too shabby.

## Table 10 (Page 11)
|    |                                        | Three Months Ended September 30,.2023   | Three Months Ended September 30,.2022   | Nine Months Ended September 30,.2023   | Nine Months Ended September 30,.2022   |
|---:|:---------------------------------------|:----------------------------------------|:----------------------------------------|:---------------------------------------|:---------------------------------------|
|  0 | Automotive sales                       | $ 18,582                                | $ 17,785                                | $ 57,879                               | $ 46,969                               |
|  1 | Automotive regulatory credits          | 554                                     | 286                                     | 1,357                                  | 1,309                                  |
|  2 | Energy generation and storage sales    | 1,416                                   | 966                                     | 4,188                                  | 2,186                                  |
|  3 | Services and other                     | 2,166                                   | 1,645                                   | 6,153                                  | 4,390                                  |
|  4 | Total revenues from sales and services | 22,718                                  | 20,682                                  | 69,577                                 | 54,854                                 |
|  5 | Automotive leasing                     | 489                                     | 621                                     | 1,620                                  | 1,877                                  |
|  6 | Energy generation and storage leasing  | 143                                     | 151                                     | 409                                    | 413                                    |
|  7 | Total revenues                         | $ 23,350                                | $ 21,454                                | $ 71,606                               | $ 57,144                               |
Document converted and tables exported in 33.43 seconds.

To retrieve all tables from the PDF, simply remove the if page_number == 11: check and the break statement from my code.
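One caveat before you do any arithmetic on the result: the cell values in the exported DataFrame arrive as strings such as "$ 18,582". A quick cleanup sketch, using a small hand-built frame whose values are copied from the output above:

```python
import pandas as pd

# Small stand-in for the DataFrame Docling exports; the values are
# strings copied from the table above, not numbers.
table_df = pd.DataFrame({
    "Line item": ["Automotive sales", "Automotive regulatory credits"],
    "Three Months Ended September 30, 2023": ["$ 18,582", "554"],
    "Three Months Ended September 30, 2022": ["$ 17,785", "286"],
})

# Strip dollar signs, commas, and spaces, then convert to floats
value_cols = table_df.columns[1:]
table_df[value_cols] = (
    table_df[value_cols]
    .replace({r"[$,\s]": ""}, regex=True)
    .astype(float)
)

print(table_df)
```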

One thing I've noticed about Docling is that it is not quick. As shown above, it took nearly 34 seconds to extract that one table from a 50-page document.

Example 3: Perform OCR on an image

For this example, I scanned a random page from the Tesla 10-Q PDF and saved it as a PNG file. Let's see how Docling does at reading that image and converting what it finds to Markdown. Here is my scanned image.

Picture from Tesla 10-Q PDF

And here is our code. We use Tesseract as our OCR engine (several are available).

from pathlib import Path
import time
import pandas as pd

from docling.document_converter import DocumentConverter, ImageFormatOption
from docling.datamodel.pipeline_options import TesseractCliOcrOptions


def main():
    inpath = "/mnt/d//tesla"
    infile = "10q-image.png"

    input_doc_path = Path(inpath) / infile

    # Configure OCR for image input
    image_options = ImageFormatOption(
        ocr_options=TesseractCliOcrOptions(force_full_page_ocr=True),
        do_table_structure=True,
        table_structure_options={"do_cell_matching": True},
    )

    converter = DocumentConverter(
        format_options={"image": image_options}
    )

    start_time = time.time()

    conv_res = converter.convert(input_doc_path).document

    # Print all tables as Markdown
    for table_ix, table in enumerate(conv_res.tables):
        table_df: pd.DataFrame = table.export_to_dataframe(doc=conv_res)
        page_number = table.prov[0].page_no if table.prov else "Unknown"
        print(f"\n--- Table {table_ix+1} (Page {page_number}) ---")
        print(table_df.to_markdown(index=False))

    # Print full document text as Markdown
    print("\n--- Full Document (Markdown) ---")
    print(conv_res.export_to_markdown())

    elapsed = time.time() - start_time
    print(f"\nProcessing completed in {elapsed:.2f} seconds")


if __name__ == "__main__":
    main()

Here are our results.

--- Table 1 (Page 1) ---
|                          |   Three Months Ended September J0,. | Three Months Ended September J0,.2022   | Nine Months Ended September J0,.2023   | Nine Months Ended September J0,.2022   |
|:-------------------------|------------------------------------:|:----------------------------------------|:---------------------------------------|:---------------------------------------|
| Cost ol revenves         |                                 181 | 150                                     | 554                                    | 424                                    |
| Research an0 developrent |                                 189 | 124                                     | 491                                    | 389                                    |
|                          |                                  95 |                                         | 2B3                                    | 328                                    |
| Total                    |                                 465 | 362                                     | 1,328                                  | 1,141                                  |

--- Full Document (Markdown) ---
## Note 8 Equity Incentive Plans

## Other Pertormance-Based Grants

("RSUs") und stock optlons unrecognized stock-based compensatian

## Summary Stock-Based Compensation Information

|                          | Three Months Ended September J0,   | Three Months Ended September J0,   | Nine Months Ended September J0,   | Nine Months Ended September J0,   |
|--------------------------|------------------------------------|------------------------------------|-----------------------------------|-----------------------------------|
|                          |                                    | 2022                               | 2023                              | 2022                              |
| Cost ol revenves         | 181                                | 150                                | 554                               | 424                               |
| Research an0 developrent | 189                                | 124                                | 491                               | 389                               |
|                          | 95                                 |                                    | 2B3                               | 328                               |
| Total                    | 465                                | 362                                | 1,328                             | 1,141                             |

## Note 9 Commitments and Contingencies

## Operating Lease Arrangements In Buffalo, New York and Shanghai, China

## Legal Proceedings

Between september 1 which 2021 pald has

Processing completed in 7.64 seconds

If you compare this output with the original image, the results are disappointing. A lot of the text in the image was misread or missed altogether. This is where commercial text-extraction products like AWS Textract earn their keep, as they reliably pull text from a wide variety of sources.
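To put a rough number on the damage, you can score an OCR line against the known-good text with the standard library's difflib. The two strings below are taken from the output above and the source page:

```python
from difflib import SequenceMatcher

ocr_line = "Cost ol revenves"  # what Tesseract produced
truth = "Cost of revenues"     # what the page actually says

# ratio() returns a similarity score between 0.0 (no match) and 1.0 (identical)
ratio = SequenceMatcher(None, ocr_line, truth).ratio()
print(f"similarity: {ratio:.2f}")
```

Run this over a handful of spot-checked lines and you get a quick, repeatable way to compare OCR engines on your own documents.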

However, Docling offers several OCR options, so if you get poor results from one engine, you can always switch to another.

I tried the same task using the EasyOCR engine; here is the code.

from pathlib import Path
import time
import pandas as pd

from docling.document_converter import DocumentConverter, ImageFormatOption
from docling.datamodel.pipeline_options import EasyOcrOptions  # EasyOCR options


def main():
    inpath = "/mnt/d//tesla"
    infile = "10q-image.png"

    input_doc_path = Path(inpath) / infile

    # Configure image pipeline with EasyOCR
    image_options = ImageFormatOption(
        ocr_options=EasyOcrOptions(force_full_page_ocr=True),  # use EasyOCR
        do_table_structure=True,
        table_structure_options={"do_cell_matching": True},
    )

    converter = DocumentConverter(
        format_options={"image": image_options}
    )

    start_time = time.time()

    conv_res = converter.convert(input_doc_path).document

    # Print all tables as Markdown
    for table_ix, table in enumerate(conv_res.tables):
        table_df: pd.DataFrame = table.export_to_dataframe(doc=conv_res)
        page_number = table.prov[0].page_no if table.prov else "Unknown"
        print(f"\n--- Table {table_ix+1} (Page {page_number}) ---")
        print(table_df.to_markdown(index=False))

    # Print full document text as Markdown
    print("\n--- Full Document (Markdown) ---")
    print(conv_res.export_to_markdown())

    elapsed = time.time() - start_time
    print(f"\nProcessing completed in {elapsed:.2f} seconds")


if __name__ == "__main__":
    main()

Summary

The generative AI boom has resurfaced an old truth: garbage in, garbage out. LLMs can only perform well when they are given coherent input. Docling provides a conversion layer that copes (most of the time) with almost any document format your stakeholders can throw at you, and it does so locally and reproducibly.

Document conversion has uses far beyond the AI world, anyway. Consider the vast number of documents stored by banks, government offices, and insurance companies worldwide. Wherever those need to be processed at scale, Docling can offer at least part of the solution.

Its most serious weakness is probably optical character recognition of text within images. I tried both the Tesseract and EasyOCR engines, and both sets of results were disappointing. You will probably need a commercial product such as AWS Textract if you want to reliably extract text from these kinds of sources.

It can also be a little slow. I have a reasonably high-spec desktop PC with a GPU, and it still took a while for many jobs. However, if your input documents are primarily digital-native PDFs, Docling can be a valuable addition to your text-processing toolbox.

I've only scratched the surface of what Docling can do, and I encourage you to visit its home page, available via the following link, to learn more.
