
Most fund-ops runbooks start the same way: a file lands in storage (an invoice, a bank statement, a rent roll, a trial balance) and the runbook needs to turn its bytes into something it can reason about. ntro.capabilities.files does that turn.

Install

pip install 'ntro[workflow]'
The capability is bundled with the workflow extra — runbooks always have it.

The API

One public coroutine:
from ntro.capabilities import files

grid = await files.parse(content=<bytes>, format="pdf" | "xlsx")
Returns a CellGrid-shaped object with two key surfaces:
  • grid.cells (list[list[Cell]]) — 2-D cell grid preserving the source's row / column structure. Each cell knows its position, its value, and, for PDFs, its bounding box. Useful when you need tabular layout to interpret the data (rent rolls, trial balances).
  • grid.plain_text (str) — the same content flattened to plain text in reading order. Useful when you only care about the prose — invoice line items, narrative summaries, anything you'll feed straight to AI extraction.
Both fields are populated by both formats. The downstream AI extraction step typically reads plain_text and passes cells as structured_context so the model can disambiguate when layout matters.

PDF parsing — format="pdf"

Backed by pdfplumber. Best for:
  • Scanned-and-OCR’d documents (invoices, statements, contracts)
  • Form-style documents with key-value pairs
  • Documents with tables that have visible borders
Lifted from the document-ingest runbook:
from temporalio import activity

from ntro.capabilities import files
from ntro.data import get_data_plane


@activity.defn(name="document_ingest.parse_pdf")
async def parse_pdf(submitted: DocumentSubmissionPayload) -> RawDocument:
    """Parse the submitted PDF bytes into a structured cell grid + plain text."""
    db = await get_data_plane(submitted.tenant_slug)
    row = await db.fetchrow(
        "SELECT data_bytes FROM submitted_documents WHERE id = $1",
        submitted.document_ref,
    )
    grid = await files.parse(content=bytes(row["data_bytes"]), format="pdf")
    return RawDocument(
        document_ref=submitted.document_ref,
        filename=submitted.filename,
        cell_grid=grid.cells,
        plain_text=grid.plain_text,
    )
Two things to notice:
  • The bytes come from the tenant data plane (Postgres), not the activity payload. Signals carry only the document_ref so payloads stay small.
  • Both grid.cells and grid.plain_text flow into the RawDocument so the next step (AI extraction) has both.
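Because the bytes are fetched by reference, the lookup can come back empty (a stale document_ref, an upload that never finished). A hedged sketch of a guard worth running before files.parse — the helper name and error messages here are my own, not part of ntro:

```python
def require_document_bytes(row, document_ref: str) -> bytes:
    # fetchrow returns None when no row matches the document_ref,
    # and data_bytes may be empty if the upload never completed.
    if row is None:
        raise ValueError(f"no submitted document found for ref {document_ref!r}")
    data = row["data_bytes"]
    if not data:
        raise ValueError(f"document {document_ref!r} has no stored bytes")
    return bytes(data)
```

In the activity above this would slot in between the fetchrow call and the parse: grid = await files.parse(content=require_document_bytes(row, submitted.document_ref), format="pdf").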

Excel parsing — format="xlsx"

Backed by openpyxl. Best for:
  • Trial balances exported from Xero / SAP / Sage
  • Investor registers, capital call schedules, NAV templates
  • Anything where preserving sheet / cell coordinates matters
Lifted from the nav-monthly-journals runbook:
from temporalio import activity

from ntro.capabilities import ai, files
from ntro.data import get_data_plane


@activity.defn(name="nav_monthly_journals.parse_starting_tb")
async def parse_starting_tb(ctx: NavMonthlyJournalsContext) -> TrialBalance:
    db = await get_data_plane(ctx.tenant_slug)

    row = await db.fetchrow(
        "SELECT id, data_bytes FROM submitted_documents "
        "WHERE entity_slug = $1 AND source = $2 "
        "ORDER BY uploaded_at DESC LIMIT 1",
        ctx.entity_slug,
        ctx.tb_source,
    )

    grid = await files.parse(content=bytes(row["data_bytes"]), format="xlsx")

    # Cell-grid context helps the LLM disambiguate columns
    result = await ai.extract(
        content=grid.plain_text,
        schema_slug=ctx.tb_schema,
        structured_context={"cell_grid": grid.cells},
    )
    return TrialBalance.from_extraction(result, period=ctx.period)
The pattern is the same: parse → feed the LLM both the prose and the structured cells → produce a typed model.

Choosing between cells and plain_text

  • Free-text extraction (invoice descriptions, prose paragraphs) → plain_text
  • Tabular extraction where row / column position carries meaning → plain_text for the prompt, cells as structured_context
  • Bounding-box-based document classification (PDF only) → cells (each cell has .bbox)
  • Cheap "give me everything as one string" → plain_text
When you hand the result to AI extraction, passing both as in the examples above is the safe default — it costs nothing and gives the model the most signal.
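If a runbook receives mixed uploads, a small dispatcher can pick the format argument from the filename. This helper is a convenience sketch, not part of the library — files.parse expects an explicit "pdf" or "xlsx", and the extension map below is an assumption:

```python
from pathlib import Path

# Map common extensions to the two formats files.parse() accepts.
_FORMATS = {".pdf": "pdf", ".xlsx": "xlsx"}


def infer_format(filename: str) -> str:
    """Map a filename's extension to the format string files.parse() expects."""
    suffix = Path(filename).suffix.lower()
    try:
        return _FORMATS[suffix]
    except KeyError:
        raise ValueError(f"unsupported file type: {filename!r}") from None
```

Usage would look like: grid = await files.parse(content=data, format=infer_format(submitted.filename)).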

Private AI

The natural next step — ai.extract() consumes what files.parse() produces.

Data

Where parsed documents typically come from (storage.read or the data plane).