PDF Connector
The PDF connector extracts text from PDF documents. It supports local files matched by glob pattern, as well as single documents loaded from a file path or a remote URL.
Import
import { pdf, pdfFile } from '@deepagents/retrieval/connectors';
Two Variants
The package provides two PDF connectors:
pdf(pattern) - Glob pattern matching for multiple PDFs
pdfFile(source) - Single PDF from file path or URL
PDF Pattern Matching
Ingest multiple PDFs using glob patterns:
import { pdf } from '@deepagents/retrieval/connectors';
const connector = pdf('**/*.pdf');
Basic Usage
import { pdf } from '@deepagents/retrieval/connectors';
import { ingest, fastembed, SqliteStore } from '@deepagents/retrieval';
import Database from 'better-sqlite3';
const db = new Database('./vectors.db');
const store = new SqliteStore(db, 384);
const embedder = fastembed();
// Ingest all PDFs in a directory
await ingest({
  connector: pdf('docs/**/*.pdf'),
  store,
  embedder,
});
Pattern Examples
// All PDFs recursively
pdf('**/*.pdf')
// PDFs in specific directory
pdf('research/**/*.pdf')
// PDFs in current directory only
pdf('*.pdf')
// Multiple directories
pdf('{docs,papers}/**/*.pdf')
Source ID
const connector = pdf('**/*.pdf');
console.log(connector.sourceId);
// "pdf:**/*.pdf"
Excluded Directories
These directories are automatically excluded:
**/node_modules/**
**/.git/**
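To make the effect of these two patterns concrete, here is a minimal sketch of an equivalent path check. The `isExcludedPath` helper is hypothetical and is not part of the package; it only illustrates which paths the patterns above would skip.

```typescript
// Hypothetical helper (not part of @deepagents/retrieval): mirrors the
// "**/node_modules/**" and "**/.git/**" exclusions with a segment check.
const EXCLUDED_DIRECTORIES = new Set(['node_modules', '.git']);

function isExcludedPath(filePath: string): boolean {
  // Split on both POSIX and Windows separators.
  const segments = filePath.split(/[\\/]/);
  // Only directory segments count; the final segment is the file name.
  return segments.slice(0, -1).some((segment) => EXCLUDED_DIRECTORIES.has(segment));
}

const samplePaths = [
  'docs/manual.pdf',
  'node_modules/some-package/readme.pdf',
  'vendor/.git/objects/notes.pdf',
];
const keptPaths = samplePaths.filter((p) => !isExcludedPath(p));
console.log(keptPaths);
```

Only `docs/manual.pdf` survives the filter in this sample.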
Single PDF File
Ingest a single PDF from a file path or URL:
import { pdfFile } from '@deepagents/retrieval/connectors';
const connector = pdfFile('./manual.pdf');
Local File
import { pdfFile } from '@deepagents/retrieval/connectors';
import { ingest } from '@deepagents/retrieval';
// Relative path
const relative = pdfFile('./docs/manual.pdf');
// Absolute path
const absolute = pdfFile('/Users/you/documents/paper.pdf');
// store and embedder are created as in Basic Usage
await ingest({ connector: relative, store, embedder });
Remote URL
import { pdfFile } from '@deepagents/retrieval/connectors';
const connector = pdfFile('https://example.com/whitepaper.pdf');
await ingest({ connector, store, embedder });
Source ID
// Local file
const connector1 = pdfFile('./manual.pdf');
console.log(connector1.sourceId);
// "pdf:file:./manual.pdf"
// Remote URL
const connector2 = pdfFile('https://example.com/paper.pdf');
console.log(connector2.sourceId);
// "pdf:url:https://example.com/paper.pdf"
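The two source ID formats above can be reproduced with a small helper, which is handy when you need to predict an ID without constructing a connector. This is a sketch only: `describeSourceId` is a hypothetical name, and treating any `http://` or `https://` prefix as remote is an assumption about how the library tells URLs from paths.

```typescript
// Hypothetical helper (not part of @deepagents/retrieval).
// Assumption: a source is remote when it starts with http:// or https://.
function isRemoteSource(source: string): boolean {
  return /^https?:\/\//i.test(source);
}

function describeSourceId(source: string): string {
  return isRemoteSource(source) ? `pdf:url:${source}` : `pdf:file:${source}`;
}

console.log(describeSourceId('./manual.pdf'));
console.log(describeSourceId('https://example.com/paper.pdf'));
```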
Both connectors use the unpdf library for text extraction:
import { readFile } from 'node:fs/promises';
import { extractText, getDocumentProxy } from 'unpdf';
const buffer = await readFile(path);
const pdf = await getDocumentProxy(new Uint8Array(buffer));
const { text } = await extractText(pdf, { mergePages: true });
Merged Pages
Pages are automatically merged into a single text document (the mergePages: true option passed to extractText). This creates cohesive content for better embedding quality.
Extracted text is ingested as-is:
[Page 1 text]
[Page 2 text]
[Page 3 text]
...
All pages are combined into a single document.
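As a rough illustration of what combining pages means, the sketch below joins per-page strings into one document. The `mergePageTexts` function is hypothetical, and the newline separator and the skipping of blank pages are assumptions; the exact output of unpdf may differ.

```typescript
// Hypothetical illustration of page merging; not the unpdf implementation.
// Assumptions: pages are joined with a newline and blank pages are dropped.
function mergePageTexts(pages: string[]): string {
  return pages
    .map((page) => page.trim())
    .filter((page) => page.length > 0)
    .join('\n');
}

const mergedDocument = mergePageTexts(['Page 1 text', '   ', 'Page 2 text\n']);
console.log(mergedDocument);
```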
Examples
Research Papers
import { pdf } from '@deepagents/retrieval/connectors';
import { ingest, similaritySearch } from '@deepagents/retrieval';
const connector = pdf('research/**/*.pdf');
// Ingest all papers
await ingest({ connector, store, embedder });
// Search
const results = await similaritySearch(
  'What methodology was used for the experiment?',
  { connector, store, embedder }
);
console.log(results[0].content);
User Manual
import { pdfFile } from '@deepagents/retrieval/connectors';
import { ingest, similaritySearch } from '@deepagents/retrieval';
const connector = pdfFile('./docs/user-manual.pdf');
await ingest({ connector, store, embedder });
const results = await similaritySearch(
  'How do I configure authentication?',
  { connector, store, embedder }
);
Remote PDF
import { pdfFile } from '@deepagents/retrieval/connectors';
import { ingest, similaritySearch } from '@deepagents/retrieval';
const connector = pdfFile(
  'https://arxiv.org/pdf/2103.00020.pdf'
);
await ingest({ connector, store, embedder });
const results = await similaritySearch(
  'What are the main contributions?',
  { connector, store, embedder }
);
Multiple PDFs
const pdfs = [
  pdfFile('./docs/manual.pdf'),
  pdfFile('./docs/guide.pdf'),
  pdfFile('https://example.com/whitepaper.pdf'),
];
for (const connector of pdfs) {
  await ingest({ connector, store, embedder });
  console.log(`Ingested: ${connector.sourceId}`);
}
File Validation
Only .pdf files are processed:
if (!path.toLowerCase().endsWith('.pdf')) continue;
Non-PDF files are skipped.
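The check is case-insensitive, so `REPORT.PDF` passes while `notes.pdf.bak` does not. A small standalone sketch of the same test applied to a list of candidate paths (the `isPdfPath` name and the sample paths are illustrative only):

```typescript
// Same extension test as above, wrapped in a hypothetical helper.
function isPdfPath(path: string): boolean {
  return path.toLowerCase().endsWith('.pdf');
}

const candidatePaths = ['docs/manual.pdf', 'docs/REPORT.PDF', 'docs/readme.md', 'docs/notes.pdf.bak'];
const acceptedPaths = candidatePaths.filter(isPdfPath);
console.log(acceptedPaths);
```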
Memory Usage
PDFs are loaded into memory for processing:
const buffer = await readFile(path);
const pdf = await getDocumentProxy(new Uint8Array(buffer));
Because the whole file is held in a buffer before parsing, memory use grows with file size, and large PDFs may consume significant memory.
Network Requests
Remote PDFs are downloaded completely:
const response = await fetch(url);
const buffer = new Uint8Array(await response.arrayBuffer());
Error Handling
Invalid PDFs
try {
  await ingest({
    connector: pdfFile('./corrupted.pdf'),
    store,
    embedder,
  });
} catch (error) {
  console.error('PDF processing failed:', error);
}
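When ingesting several PDFs in a loop, one corrupted file would otherwise abort the whole batch. The sketch below wraps each ingestion in its own try/catch and collects the failures. `ingestEach` and its result shape are hypothetical; the ingestion call is passed in as a callback so the helper makes no assumptions about the `ingest` signature beyond what this page shows.

```typescript
// Hypothetical batch helper; only the sourceId field shown on this page is used.
interface HasSourceId {
  sourceId: string;
}

interface BatchResult {
  succeeded: string[];
  failed: { sourceId: string; reason: string }[];
}

async function ingestEach<T extends HasSourceId>(
  connectors: T[],
  ingestOne: (connector: T) => Promise<unknown>,
): Promise<BatchResult> {
  const result: BatchResult = { succeeded: [], failed: [] };
  for (const connector of connectors) {
    try {
      await ingestOne(connector);
      result.succeeded.push(connector.sourceId);
    } catch (error) {
      // Record the failure and continue with the remaining connectors.
      const reason = error instanceof Error ? error.message : String(error);
      result.failed.push({ sourceId: connector.sourceId, reason });
    }
  }
  return result;
}
```

With the connectors from this page, the callback would be something like `(connector) => ingest({ connector, store, embedder })`.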
HTTP Errors
const response = await fetch(url);
if (!response.ok) {
  throw new Error(`HTTP ${response.status}: ${response.statusText}`);
}
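The download and the status check can be combined into one helper. The following is a sketch under stated assumptions, not the connector's actual code: `downloadPdfBytes` is a hypothetical name, and the optional `fetchImpl` parameter exists only so the helper can be exercised without a network.

```typescript
// Hypothetical helper combining the fetch and the response.ok check.
type FetchLike = (url: string) => Promise<Response>;

async function downloadPdfBytes(url: string, fetchImpl: FetchLike = fetch): Promise<Uint8Array> {
  const response = await fetchImpl(url);
  if (!response.ok) {
    // Fail before reading the body so error pages are never parsed as PDFs.
    throw new Error(`HTTP ${response.status}: ${response.statusText}`);
  }
  // The full body is buffered in memory, as described under Network Requests.
  return new Uint8Array(await response.arrayBuffer());
}
```

Callers can catch the thrown error to distinguish an unreachable URL from a PDF that downloaded but failed to parse.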
File Not Found
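try {
  const buffer = await readFile(path);
} catch (error) {
  // ENOENT means the path does not exist; rethrow anything else (e.g. EACCES)
  if ((error as NodeJS.ErrnoException).code === 'ENOENT') {
    console.error('File not found:', path);
  } else {
    throw error;
  }
}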
Document IDs
Pattern Matching
Document IDs are file paths:
const connector = pdf('docs/**/*.pdf');
for await (const doc of connector.sources()) {
  console.log(doc.id);
  // "/Users/you/project/docs/manual.pdf"
  // "/Users/you/project/docs/guide.pdf"
}
Single File
Document ID is the source:
const connector = pdfFile('./manual.pdf');
for await (const doc of connector.sources()) {
  console.log(doc.id);
  // "./manual.pdf"
}
Remote URL
Document ID is the URL:
const connector = pdfFile('https://example.com/paper.pdf');
for await (const doc of connector.sources()) {
  console.log(doc.id);
  // "https://example.com/paper.pdf"
}
Text extraction quality depends on the PDF:
Good Quality
Text-based PDFs (searchable)
Well-structured documents
Standard fonts
Poor Quality
Scanned images (requires OCR, not supported)
Complex layouts
Heavy graphics
No OCR Support
The connector does not perform OCR on scanned PDFs. Only text-based PDFs are supported.
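Since scanned PDFs yield little or no text, a simple heuristic can flag them before they are ingested. The sketch below is an assumption-laden illustration: `looksScanned` is a hypothetical helper, the threshold of 50 visible characters per page is arbitrary, and the page count is assumed to be available to the caller.

```typescript
// Hypothetical heuristic (not part of the connector): flags documents whose
// extracted text is too sparse to have come from a text-based PDF.
function looksScanned(extractedText: string, totalPages: number, minCharsPerPage = 50): boolean {
  if (totalPages <= 0) return true; // nothing to extract from
  const visibleChars = extractedText.replace(/\s+/g, '').length;
  return visibleChars / totalPages < minCharsPerPage;
}

console.log(looksScanned('', 12));
console.log(looksScanned('Lorem ipsum dolor sit amet. '.repeat(200), 5));
```

A flagged file can be routed to an external OCR step or skipped, rather than being embedded as a near-empty document.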
Chunking
PDF text is chunked using the default text splitter:
import { MarkdownTextSplitter } from 'langchain/text_splitter';
const splitter = new MarkdownTextSplitter();
const chunks = await splitter.splitText(pdfText);
For custom chunking:
import { RecursiveCharacterTextSplitter } from 'langchain/text_splitter';
const customSplitter = async (id: string, content: string) => {
  const splitter = new RecursiveCharacterTextSplitter({
    chunkSize: 1000,
    chunkOverlap: 200,
  });
  return await splitter.splitText(content);
};
await ingest({
  connector: pdf('**/*.pdf'),
  store,
  embedder,
  splitter: customSplitter,
});
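To show what chunkSize and chunkOverlap control, here is a dependency-free sketch of a fixed-window splitter with the same `(id, content)` shape as `customSplitter` above. It is a simplified stand-in, not what RecursiveCharacterTextSplitter does: that splitter also tries to break on separators, while this one cuts purely by character count. The names `overlappingChunks` and `fixedWindowSplitter` are hypothetical.

```typescript
// Hypothetical fixed-window splitter: each chunk starts
// (chunkSize - chunkOverlap) characters after the previous one.
function overlappingChunks(content: string, chunkSize: number, chunkOverlap: number): string[] {
  if (chunkSize <= 0 || chunkOverlap < 0 || chunkOverlap >= chunkSize) {
    throw new RangeError('chunkSize must be positive and larger than chunkOverlap');
  }
  const chunks: string[] = [];
  const step = chunkSize - chunkOverlap;
  for (let start = 0; start < content.length; start += step) {
    chunks.push(content.slice(start, start + chunkSize));
    // Stop once a chunk has reached the end of the content.
    if (start + chunkSize >= content.length) break;
  }
  return chunks;
}

// Same call shape as customSplitter above; the id is unused here.
const fixedWindowSplitter = async (_id: string, content: string): Promise<string[]> =>
  overlappingChunks(content, 1000, 200);

console.log(overlappingChunks('abcdefghij', 4, 2));
// → [ 'abcd', 'cdef', 'efgh', 'ghij' ]
```

Each chunk repeats the last two characters of the previous one, which is the overlap that keeps sentences spanning a boundary retrievable from either side.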
Best Practices
Validate PDFs
Ensure PDFs are text-based, not scanned images.
Use Specific Patterns
Be specific to avoid processing unnecessary files:
pdf('research/papers/**/*.pdf') // Good
pdf('**/*.pdf') // May include unwanted files
Handle Large PDFs
Large PDFs may need custom chunking:
import { RecursiveCharacterTextSplitter } from 'langchain/text_splitter';
const splitter = new RecursiveCharacterTextSplitter({
  chunkSize: 2000,
  chunkOverlap: 400,
});
Cache Remote PDFs
Download and cache remote PDFs locally for faster re-ingestion.
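One way to do this is to store each download under a file name derived from a hash of its URL, then pass the cached path to pdfFile. The sketch below is a minimal illustration using Node built-ins; `cachePathFor`, `cacheRemotePdf`, and the `./.pdf-cache` directory are hypothetical names, and no cache expiry is handled.

```typescript
import { createHash } from 'node:crypto';
import { access, mkdir, writeFile } from 'node:fs/promises';
import { join } from 'node:path';

// Hypothetical helper: maps a URL to a stable local file name.
function cachePathFor(url: string, cacheDir: string): string {
  const digest = createHash('sha256').update(url).digest('hex');
  return join(cacheDir, `${digest}.pdf`);
}

// Hypothetical helper: downloads the PDF once and reuses the local copy afterwards.
async function cacheRemotePdf(url: string, cacheDir = './.pdf-cache'): Promise<string> {
  const target = cachePathFor(url, cacheDir);
  try {
    await access(target);
    return target; // cache hit: the file already exists
  } catch {
    // cache miss: fall through and download
  }
  const response = await fetch(url);
  if (!response.ok) {
    throw new Error(`HTTP ${response.status}: ${response.statusText}`);
  }
  await mkdir(cacheDir, { recursive: true });
  await writeFile(target, new Uint8Array(await response.arrayBuffer()));
  return target;
}
```

The returned path can then be used as `pdfFile(await cacheRemotePdf(url))`. Note that the source ID then reflects the local path rather than the original URL.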
Check HTTP Status
Validate remote URLs before ingestion:
const response = await fetch(url, { method: 'HEAD' });
if (!response.ok) {
  console.error(`URL not accessible: ${url}`);
}
Limitations
No OCR
Scanned PDFs require OCR, which is not supported.
Memory Usage
Large PDFs are loaded entirely into memory.
Layout Preservation
Complex layouts (for example multi-column pages, tables, or sidebars) may not extract well, and the extracted text order may not match the reading order.
Images and Graphics
Images are ignored. Only text is extracted.
Comparison
| Feature | pdf(pattern) | pdfFile(source) |
| --- | --- | --- |
| Multiple files | Yes | No |
| Glob patterns | Yes | No |
| Local files | Yes | Yes |
| Remote URLs | No | Yes |
| Excluded dirs | Yes | No |
| Use case | Batch processing | Single document |
Next Steps
Local Files: Work with local files
Linear Connector: Ingest Linear issues
Ingestion: Learn about ingestion