
Most fund-ops runbooks start the same way: a file lands in storage (an invoice, a bank statement, a rent roll, a trial balance) and the runbook needs to turn its bytes into something it can reason about. ntro.capabilities.files does that turn.

Install

pip install 'ntro[workflow]'
The capability is bundled with the workflow extra — runbooks always have it.

The API

One public coroutine:
from ntro.capabilities import files

grid = await files.parse(content=<bytes>, format="pdf" | "xlsx")
Returns a CellGrid-shaped object with two key surfaces:
  • grid.cells (list[list[Cell]]) — 2-D cell grid preserving the source's row / column structure. Each cell knows its position, its value, and, for PDFs, its bounding box. Useful when you need tabular layout to interpret the data (rent rolls, trial balances).
  • grid.plain_text (str) — the same content flattened to plain text in reading order. Useful when you only care about the prose — invoice line items, narrative summaries, anything you'll feed straight to AI extraction.
Both fields are populated by both formats. The downstream AI extraction step typically reads plain_text and passes cells as structured_context so the model can disambiguate when layout matters.

PDF parsing — format="pdf"

Backed by pdfplumber. Best for:
  • Scanned-and-OCR’d documents (invoices, statements, contracts)
  • Form-style documents with key-value pairs
  • Documents with tables that have visible borders
Lifted from the document-ingest runbook:
from temporalio import activity

from ntro.capabilities import files
from ntro.data import get_data_plane


@activity.defn(name="document_ingest.parse_pdf")
async def parse_pdf(submitted: DocumentSubmissionPayload) -> RawDocument:
    """Parse the submitted PDF bytes into a structured cell grid + plain text."""
    db = await get_data_plane(submitted.tenant_slug)
    row = await db.fetchrow(
        "SELECT data_bytes FROM submitted_documents WHERE id = $1",
        submitted.document_ref,
    )
    grid = await files.parse(content=bytes(row["data_bytes"]), format="pdf")
    return RawDocument(
        document_ref=submitted.document_ref,
        filename=submitted.filename,
        cell_grid=grid.cells,
        plain_text=grid.plain_text,
    )
Two things to notice:
  • The bytes come from the tenant data plane (Postgres), not the activity payload. Signals carry only the document_ref so payloads stay small.
  • Both grid.cells and grid.plain_text flow into the RawDocument so the next step (AI extraction) has both.
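Because the bytes are fetched by reference, the lookup can come back empty (a stale document_ref, an upload that never finished). A hedged sketch of a guard worth running before files.parse — the helper name and error messages here are my own, not part of ntro:

```python
def require_document_bytes(row, document_ref: str) -> bytes:
    # fetchrow returns None when no row matches the document_ref,
    # and data_bytes may be empty if the upload never completed.
    if row is None:
        raise ValueError(f"no submitted document found for ref {document_ref!r}")
    data = row["data_bytes"]
    if not data:
        raise ValueError(f"document {document_ref!r} has no stored bytes")
    return bytes(data)
```

In the activity above this would slot in between the fetchrow call and the parse: grid = await files.parse(content=require_document_bytes(row, submitted.document_ref), format="pdf").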

Excel parsing — format="xlsx"

Backed by openpyxl. Best for:
  • Trial balances exported from Xero / SAP / Sage
  • Investor registers, capital call schedules, NAV templates
  • Anything where preserving sheet / cell coordinates matters
Lifted from the nav-monthly-journals runbook:
from temporalio import activity

from ntro.capabilities import ai, files
from ntro.data import get_data_plane


@activity.defn(name="nav_monthly_journals.parse_starting_tb")
async def parse_starting_tb(ctx: NavMonthlyJournalsContext) -> TrialBalance:
    db = await get_data_plane(ctx.tenant_slug)

    row = await db.fetchrow(
        "SELECT id, data_bytes FROM submitted_documents "
        "WHERE entity_slug = $1 AND source = $2 "
        "ORDER BY uploaded_at DESC LIMIT 1",
        ctx.entity_slug,
        ctx.tb_source,
    )

    grid = await files.parse(content=bytes(row["data_bytes"]), format="xlsx")

    # Cell-grid context helps the LLM disambiguate columns
    result = await ai.extract(
        content=grid.plain_text,
        schema_slug=ctx.tb_schema,
        structured_context={"cell_grid": grid.cells},
    )
    return TrialBalance.from_extraction(result, period=ctx.period)
The pattern is the same: parse → feed the LLM both the prose and the structured cells → produce a typed model.

Choosing between cells and plain_text

  • Free-text extraction (invoice descriptions, prose paragraphs) → plain_text
  • Tabular extraction where row / column position carries meaning → plain_text for the prompt, cells as structured_context
  • Bounding-box-based document classification (PDF only) → cells (each cell has .bbox)
  • Cheap "give me everything as one string" → plain_text
When you hand the result to AI extraction, passing both as in the examples above is the safe default — it costs nothing and gives the model the most signal.
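If a runbook receives mixed uploads, a small dispatcher can pick the format argument from the filename. This helper is a convenience sketch, not part of the library — files.parse expects an explicit "pdf" or "xlsx", and the extension map below is an assumption:

```python
from pathlib import Path

# Map common extensions to the two formats files.parse() accepts.
_FORMATS = {".pdf": "pdf", ".xlsx": "xlsx"}


def infer_format(filename: str) -> str:
    """Map a filename's extension to the format string files.parse() expects."""
    suffix = Path(filename).suffix.lower()
    try:
        return _FORMATS[suffix]
    except KeyError:
        raise ValueError(f"unsupported file type: {filename!r}") from None
```

Usage would look like: grid = await files.parse(content=data, format=infer_format(submitted.filename)).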

Private AI

The natural next step — ai.extract() consumes what files.parse() produces.

Data

Where parsed documents typically come from (storage.read or the data plane).