Document Ingestion
Ingestion is the process of loading content from a connector, splitting it into chunks, generating embeddings, and storing them for later retrieval.
Basic Ingestion
import { ingest, fastembed, SqliteStore } from '@deepagents/retrieval';
import { local } from '@deepagents/retrieval/connectors';
import Database from 'better-sqlite3';
const db = new Database('./vectors.db');
const store = new SqliteStore(db, 384);
const embedder = fastembed();
await ingest({
  connector: local('**/*.md'),
  store,
  embedder,
});
Ingestion Process
The ingestion pipeline performs these steps:
1. Fetch Content - Connector yields documents with id, content, and metadata
2. Content Hashing - Generate SHA-256 hash (CID) to detect changes
3. Skip Unchanged - Skip documents with matching CID (no changes)
4. Split into Chunks - Use text splitter to break content into smaller pieces
5. Generate Embeddings - Create vector embeddings for each chunk
6. Store Vectors - Save embeddings and metadata to SQLite
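The steps above can be sketched end to end with in-memory stand-ins. This is a minimal illustration under stated assumptions, not the library's implementation: `Doc`, `MemoryStore`, `ingestSketch`, `splitParagraphs`, and `toyEmbed` are hypothetical names, and a hex SHA-256 digest stands in for the real CID format.

```typescript
import { createHash } from 'node:crypto';

// Hypothetical shapes for illustration; the real types live in @deepagents/retrieval.
interface Doc { id: string; content: string; }
interface StoredChunk { docId: string; text: string; vector: number[]; }

type SplitFn = (id: string, content: string) => Promise<string[]>;
type EmbedFn = (texts: string[]) => Promise<number[][]>;

// In-memory stand-in for a vector store: remembers one content hash per document.
class MemoryStore {
  hashes = new Map<string, string>();
  chunks: StoredChunk[] = [];
}

// Hex SHA-256 digest used here as a simplified content ID.
const sha256 = (text: string): string =>
  createHash('sha256').update(text).digest('hex');

// Runs the six steps for each document and returns the IDs it actually processed.
async function ingestSketch(
  docs: Iterable<Doc>,
  store: MemoryStore,
  split: SplitFn,
  embed: EmbedFn,
): Promise<string[]> {
  const processed: string[] = [];
  for (const doc of docs) {                          // 1. fetch content
    const hash = sha256(doc.content);                // 2. content hashing
    if (store.hashes.get(doc.id) === hash) continue; // 3. skip unchanged
    const pieces = await split(doc.id, doc.content); // 4. split into chunks
    const vectors = await embed(pieces);             // 5. generate embeddings
    // 6. store vectors, replacing chunks left over from an older version
    store.chunks = store.chunks.filter((c) => c.docId !== doc.id);
    pieces.forEach((text, i) =>
      store.chunks.push({ docId: doc.id, text, vector: vectors[i] ?? [] }),
    );
    store.hashes.set(doc.id, hash);
    processed.push(doc.id);
  }
  return processed;
}

// Toy splitter and embedder so the sketch runs without external packages.
const splitParagraphs: SplitFn = async (_id, content) =>
  content.split(/\n{2,}/).filter((p) => p.trim().length > 0);
const toyEmbed: EmbedFn = async (texts) => texts.map((t) => [t.length, 1]);
```

Running `ingestSketch` twice over the same documents returns every ID on the first call and an empty array on the second, because the stored hashes already match.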
Configuration Options
export interface IngestionConfig {
  connector: Connector; // Source of documents
  store: Store; // Vector storage backend
  embedder: Embedder; // Embedding function
  splitter?: Splitter; // Optional custom text splitter
}
Connector
Any connector that implements the Connector interface:
await ingest({
  connector: local('**/*.md'),
  // ... other config
});
See Connectors for available options.
Store
The vector store where embeddings are saved:
import { SqliteStore } from '@deepagents/retrieval';
import Database from 'better-sqlite3';
const db = new Database('./vectors.db');
const store = new SqliteStore(db, 384); // Must match embedder dimensions
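Because the store's dimension count must match the embedder's output, a small guard can surface a mismatch before any vectors are written. The helper below is a hypothetical sketch, not part of `@deepagents/retrieval`; it only checks the length of a sample vector.

```typescript
// Hypothetical guard: throws when a vector's length differs from the store's dimensions.
function assertDimensions(vector: readonly number[], expected: number): void {
  if (vector.length !== expected) {
    throw new Error(
      `Embedding has ${vector.length} dimensions, but the store expects ${expected}`,
    );
  }
}

// 384 is the dimension count this page uses for both the store and the embedder.
const STORE_DIMENSIONS = 384;
assertDimensions(new Array<number>(384).fill(0), STORE_DIMENSIONS); // passes silently
```

One way to use it is to embed a single short string at startup and check the result, so a model change fails fast instead of corrupting the index.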
Embedder
Function that converts text to vector embeddings:
import { fastembed } from '@deepagents/retrieval';
const embedder = fastembed({
  model: 'BGESmallENV15', // 384 dimensions
});
Splitter (Optional)
Custom text splitting function:
import { splitTypeScript } from '@deepagents/retrieval';
await ingest({
  connector: local('**/*.ts'),
  store,
  embedder,
  splitter: splitTypeScript, // TypeScript-aware splitting
});
Text Splitting
By default, ingestion uses MarkdownTextSplitter from LangChain:
// Default splitter
import { MarkdownTextSplitter } from 'langchain/text_splitter';
function split(id: string, content: string) {
  const splitter = new MarkdownTextSplitter();
  return splitter.splitText(content); // Resolves to an array of chunk strings
}
TypeScript Splitting
For code files, use language-aware splitting:
import { splitTypeScript } from '@deepagents/retrieval';
const splitter = splitTypeScript;
await ingest({
  connector: local('src/**/*.ts'),
  store,
  embedder,
  splitter,
});
The TypeScript splitter:
Uses recursive character splitting with 512 character chunks
Includes 100 character overlap between chunks
Preserves code structure and context
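To make the chunk size and overlap figures concrete, here is a minimal sketch of overlapping chunks. It uses plain fixed-size windows rather than the recursive character splitting the library relies on, and `chunkWithOverlap` is a hypothetical helper, not a library export.

```typescript
// Hypothetical helper: fixed-size windows with overlap (not the recursive splitter).
function chunkWithOverlap(text: string, chunkSize = 512, overlap = 100): string[] {
  if (chunkSize <= 0 || overlap < 0 || overlap >= chunkSize) {
    throw new RangeError('Require chunkSize > 0 and 0 <= overlap < chunkSize');
  }
  const step = chunkSize - overlap; // each window starts this far after the previous one
  const chunks: string[] = [];
  for (let start = 0; start < text.length; start += step) {
    chunks.push(text.slice(start, start + chunkSize));
    if (start + chunkSize >= text.length) break; // the last window reached the end
  }
  return chunks;
}
```

With the defaults, each chunk repeats the final 100 characters of the previous one, so a statement that straddles a boundary still appears whole in at least one chunk if it is shorter than the overlap.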
Custom Splitting
Create your own splitter:
import { RecursiveCharacterTextSplitter } from 'langchain/text_splitter';
const customSplitter = async (id: string, content: string) => {
  const splitter = new RecursiveCharacterTextSplitter({
    chunkSize: 1000,
    chunkOverlap: 200,
  });
  return await splitter.splitText(content);
};
await ingest({
  connector: local('**/*.txt'),
  store,
  embedder,
  splitter: customSplitter,
});
Change Detection
Ingestion automatically detects content changes using SHA-256 hashing:
import { cid } from '@deepagents/retrieval';
// The content ID (CID) is derived from a SHA-256 hash of the content
const contentId = cid('file content here');
// "bafkreih..."
When a document is ingested:
Calculate CID from content
Compare with stored CID
Skip if CID matches (no changes)
Re-process if CID differs (content changed)
This ensures efficient re-ingestion:
// First run: processes all files
await ingest({ connector, store, embedder });
// Second run: only processes changed files
await ingest({ connector, store, embedder });
Ingestion Strategies
Connectors can specify when to ingest using ingestWhen:
contentChanged (Default)
const connector = local('**/*.md', {
  ingestWhen: 'contentChanged', // Re-ingest if content changed
});
Always attempts ingestion. Skips unchanged documents via CID comparison.
never
const connector = local('**/*.md', {
  ingestWhen: 'never', // Only ingest if source doesn't exist
});
Only ingests if the source has never been ingested before.
expired
const connector = local('**/*.md', {
  ingestWhen: 'expired',
  expiresAfter: 24 * 60 * 60 * 1000, // 24 hours in milliseconds
});
Only ingests if the source doesn’t exist or has expired.
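The three strategies can be summarized as a single decision function. This is a sketch of the behavior described above, not the library's code: `SourceState` and `shouldIngest` are hypothetical names, and treating a source as expired once its age is greater than or equal to `expiresAfter` is an assumption.

```typescript
type IngestWhen = 'contentChanged' | 'never' | 'expired';

// Hypothetical record of a previous ingestion run for one source.
interface SourceState {
  lastIngestedAt: number; // epoch milliseconds
}

// Decides whether a source should be ingested at time `now`.
function shouldIngest(
  when: IngestWhen,
  state: SourceState | undefined, // undefined means the source was never ingested
  now: number,
  expiresAfter = 24 * 60 * 60 * 1000,
): boolean {
  // contentChanged always attempts; unchanged documents are skipped later by CID.
  if (when === 'contentChanged') return true;
  // Both remaining strategies ingest a source that does not exist yet.
  if (state === undefined) return true;
  // never: an existing source is left alone.
  if (when === 'never') return false;
  // expired: re-ingest only once the source is old enough (assumed >= comparison).
  return now - state.lastIngestedAt >= expiresAfter;
}
```

In this sketch `never` behaves as "ingest once", which suits static sources, while `expired` suits feeds that should be refreshed on a schedule.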
Batching
Ingestion automatically batches embeddings to control memory usage:
const batchSize = 40; // Default batch size
for (let i = 0; i < chunks.length; i += batchSize) {
  const batch = chunks.slice(i, i + batchSize);
  const { embeddings } = await embedder(batch);
  // Store batch...
}
This prevents memory issues when processing large documents.
Progress Tracking
Track ingestion progress with a callback:
await ingest(
  {
    connector: local('**/*.md'),
    store,
    embedder,
  },
  (documentId) => {
    console.log(`Processing: ${documentId}`);
  }
);
The callback receives the document ID for each processed document.
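For long jobs, a running count can be more useful than a bare log line. The sketch below builds a callback with the same `(documentId) => void` shape shown above; `createProgressTracker` is a hypothetical helper, and the logger is injectable so the output can be redirected.

```typescript
// Hypothetical helper: returns a progress callback plus the list of IDs seen so far.
function createProgressTracker(log: (message: string) => void = console.log) {
  const seen: string[] = [];
  const onProgress = (documentId: string): void => {
    seen.push(documentId);
    log(`[${seen.length}] Processing: ${documentId}`);
  };
  return { onProgress, seen };
}

// Intended use (not executed here): pass tracker.onProgress as the second argument to ingest.
// const tracker = createProgressTracker();
// await ingest({ connector, store, embedder }, tracker.onProgress);
```

After the run, `tracker.seen.length` gives the number of documents the callback was invoked for.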
Multiple Sources
Ingest from multiple connectors:
import { github, local, rss } from '@deepagents/retrieval/connectors';
const sources = [
  github.file('facebook/react/README.md'),
  local('docs/**/*.md'),
  rss('https://blog.example.com/feed.xml'),
];
for (const connector of sources) {
  await ingest({ connector, store, embedder });
  console.log(`Ingested: ${connector.sourceId}`);
}
Each connector has a unique sourceId for tracking.
Error Handling
try {
  await ingest({
    connector: local('**/*.md'),
    store,
    embedder,
  });
  console.log('Ingestion complete');
} catch (error) {
  console.error('Ingestion failed:', error);
}
Ingestion skips empty files automatically:
if (!content.trim()) {
  continue; // Skip empty files
}
Best Practices
Choose Appropriate Chunk Sizes
Use smaller chunks (around 512 characters) for code and larger chunks (1,000 characters or more) for prose.
Use Language-Aware Splitting
For code files, use language-specific splitters like splitTypeScript.
Batch Large Jobs
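Ingestion batches embeddings automatically (40 chunks per batch by default). For large jobs, you can also split the work across several connectors and ingest them one at a time.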
Track Progress
Use the progress callback for long-running ingestion jobs.
Handle Errors Gracefully
Wrap ingestion in try-catch and log failures without stopping the entire job.
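One way to keep a multi-source job running after a failure is to wrap each source in its own try/catch and collect the errors. In this sketch `Job`, `Failure`, and `runAllJobs` are hypothetical names; each `run` function stands in for a call such as `() => ingest({ connector, store, embedder })`.

```typescript
// Hypothetical shapes: a named unit of work and a record of what went wrong.
interface Job { name: string; run: () => Promise<void>; }
interface Failure { name: string; error: unknown; }

// Runs jobs one after another; a failing job is logged and recorded, not rethrown.
async function runAllJobs(jobs: Job[]): Promise<Failure[]> {
  const failures: Failure[] = [];
  for (const job of jobs) {
    try {
      await job.run();
    } catch (error) {
      failures.push({ name: job.name, error });
      console.error(`Ingestion failed for ${job.name}:`, error);
    }
  }
  return failures;
}
```

The returned list lets the caller decide afterwards whether to retry the failed sources or report them.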
Next Steps
Connectors: Explore available data connectors
Search: Search ingested content
Embeddings: Learn about embedding models