Integrate NVIDIA Retriever Extraction with Langflow
The NVIDIA Retriever Extraction component integrates with the NVIDIA nv-ingest microservice for data ingestion, processing, and extraction of text files.
The nv-ingest
service supports multiple extraction methods for PDF, DOCX, and PPTX file types, and includes pre- and post-processing services like splitting, chunking, and embedding generation. The extractor service's High Resolution mode uses the nemoretriever-parse
extraction method for better quality extraction from scanned PDF documents. This feature is only available for PDF files.
The NVIDIA Retriever Extraction component imports the NVIDIA Ingestor
client, ingests files with requests to the NVIDIA ingest endpoint, and outputs the processed content as a list of Data
objects. Ingestor
accepts additional configuration options for data extraction from other text formats. To configure these options, see the component parameters.
NVIDIA Retriever Extraction is also known as NV-Ingest and NeMo Retriever Extraction.
Prerequisites
-
An NVIDIA Ingest endpoint. For more information on setting up an NVIDIA Ingest endpoint, see the NVIDIA Ingest quickstart.
-
The NVIDIA Retriever Extraction component requires the installation of additional dependencies to your Langflow environment. To install the dependencies in a virtual environment, run the following commands.
- If you have the Langflow repository cloned and installed from source:
_10source **YOUR_LANGFLOW_VENV**/bin/activate_10uv sync --extra nv-ingest_10uv run langflow run- If you are installing Langflow from the Python Package Index:
_10source **YOUR_LANGFLOW_VENV**/bin/activate_10uv pip install --prerelease=allow 'langflow[nv-ingest]'_10uv run langflow run
Use the NVIDIA Retriever Extraction component in a flow
The NVIDIA Retriever Extraction component accepts Message
inputs, and then outputs Data
. The component calls an NVIDIA Ingest microservice's endpoint to ingest a local file and extract the text.
To use the NVIDIA Retriever Extraction component in your flow, follow these steps:
-
Add the NVIDIA Retriever Extraction component to your flow.
-
In the Base URL field, enter the URL of the NVIDIA Ingest endpoint. You can also store the URL as a global variable to reuse it in multiple components and flows.
-
Click Select Files to select a file to ingest.
-
Select which text type to extract from the file: text, charts, tables, images, or infographics.
-
Optional: For PDF files, enable High Resolution Mode for better quality extraction from scanned documents.
-
Select whether to split the text into chunks.
You can modify the splitting parameters and other hidden settings through the Controls in the component's header menu.
-
Click Run component to ingest the file, and then open the Logs pane to confirm the component ingested the file.
-
To store the processed data in a vector database, add a Vector Store component to your flow, and then connect the NVIDIA Retriever Extraction component's
Data
output to the Vector Store component's input.When you run the flow with a Vector Store component, the processed data is stored in the vector database. You can query your database to retrieve the uploaded data.
NVIDIA Retriever Extraction component parameters
The NVIDIA Retriever Extraction component has the following parameters.
For more information, see the NV-Ingest documentation.
Inputs
Name | Display Name | Info |
---|---|---|
base_url | NVIDIA Ingestion URL | The URL of the NVIDIA Ingestion API. |
path | Path | File path to process. |
extract_text | Extract Text | Extract text from documents. Default: true. |
extract_charts | Extract Charts | Extract text from charts. Default: false. |
extract_tables | Extract Tables | Extract text from tables. Default: true. |
extract_images | Extract Images | Extract images from document. Default: true. |
extract_infographics | Extract Infographics | Extract infographics from document. Default: false. |
text_depth | Text Depth | The level at which text is extracted. Options: 'document', 'page', 'block', 'line', 'span'. Default: page . |
split_text | Split Text | Split text into smaller chunks. Default: true. |
chunk_size | Chunk Size | The number of tokens per chunk. Default: 500 . Make sure the chunk size is compatible with your embedding model. For more information, see Tokenization errors due to chunk size. |
chunk_overlap | Chunk Overlap | Number of tokens to overlap from previous chunk. Default: 150 . |
filter_images | Filter Images | Filter images (see advanced options for filtering criteria). Default: false. |
min_image_size | Minimum Image Size Filter | Minimum image width/length in pixels. Default: 128 . |
min_aspect_ratio | Minimum Aspect Ratio Filter | Minimum allowed aspect ratio (width / height). Default: 0.2 . |
max_aspect_ratio | Maximum Aspect Ratio Filter | Maximum allowed aspect ratio (width / height). Default: 5.0 . |
dedup_images | Deduplicate Images | Filter duplicated images. Default: true. |
caption_images | Caption Images | Generate captions for images using the NVIDIA captioning model. Default: true. |
high_resolution | High Resolution (PDF only) | Process PDF in high-resolution mode for better quality extraction from scanned PDF. Default: false. |
Outputs
The NVIDIA Retriever Extraction component outputs a list of Data
objects where each object contains:
text
: The extracted content.- For text documents: The extracted text content.
- For tables and charts: The extracted table/chart content.
- For images: The image caption.
- For infographics: The extracted infographic content.
file_path
: The source file name and path.document_type
: The type of the document, which can betext
,structured
, orimage
.description
: Additional description of the content.
The output varies based on the document_type
:
-
Documents with
document_type: "text"
contain:- Raw text content extracted from documents, for example, paragraphs from PDFs or DOCX files.
- Content stored directly in the
text
field. - Content extracted using the
extract_text
parameter.
-
Documents with
document_type: "structured"
contain:- Text extracted from tables, charts, and infographics and processed to preserve structural information.
- Content extracted using the
extract_tables
,extract_charts
, andextract_infographics
parameters. - Content stored in the
text
field after being processed from thetable_content
metadata.
-
Documents with
document_type: "image"
contain:- Image content extracted from documents.
- Caption text stored in the
text
field whencaption_images
is enabled. - Content extracted using the
extract_images
parameter.