Duplicate documents inflate vector database size, waste storage, increase token usage, and reduce retrieval quality. Many organizations unknowingly index duplicate content across large document collections, so retrieval surfaces redundant passages instead of distinct, useful information.
Titan-Purge uses MinHash-based similarity analysis to identify duplicate and near-duplicate content before ingestion. By filtering your dataset offline, you ensure that your vector store remains lean, efficient, and highly relevant.
$ titan-purge -in ./documents
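To illustrate the general idea behind MinHash-based near-duplicate detection, here is a minimal Python sketch. It is not Titan-Purge's actual implementation. The function names, the three-word shingle size, the 64-value signature length, and the 0.8 similarity threshold are all assumptions chosen for illustration. Each document is split into overlapping word shingles, a fixed-length signature is built from the minimum hash value under each seeded hash function, and the fraction of matching signature positions estimates the Jaccard similarity between two documents.

```python
import hashlib
import re

# All three values are illustrative assumptions, not Titan-Purge defaults.
NUM_HASHES = 64     # signature length (more hashes = better estimate)
SHINGLE_SIZE = 3    # words per shingle
THRESHOLD = 0.8     # estimated similarity at which a document counts as a duplicate


def shingles(text, k=SHINGLE_SIZE):
    """Return the set of overlapping k-word shingles in lowercased text."""
    words = re.findall(r"\w+", text.lower())
    if len(words) < k:
        return {" ".join(words)}
    return {" ".join(words[i:i + k]) for i in range(len(words) - k + 1)}


def _seeded_hash(shingle, seed):
    """Deterministic 64-bit hash of a shingle; the seed selects the hash function."""
    data = f"{seed}:{shingle}".encode("utf-8")
    return int.from_bytes(hashlib.blake2b(data, digest_size=8).digest(), "big")


def minhash_signature(text, num_hashes=NUM_HASHES):
    """For each seeded hash function, keep the minimum hash over all shingles."""
    doc_shingles = shingles(text)
    return [
        min(_seeded_hash(s, seed) for s in doc_shingles)
        for seed in range(num_hashes)
    ]


def estimated_similarity(sig_a, sig_b):
    """Fraction of matching positions, which estimates Jaccard similarity."""
    matches = sum(a == b for a, b in zip(sig_a, sig_b))
    return matches / len(sig_a)


def filter_near_duplicates(docs, threshold=THRESHOLD):
    """Keep each document only if it is not too similar to one already kept."""
    kept = []  # list of (document, signature) pairs
    for doc in docs:
        sig = minhash_signature(doc)
        if all(estimated_similarity(sig, kept_sig) < threshold for _, kept_sig in kept):
            kept.append((doc, sig))
    return [doc for doc, _ in kept]
```

Calling `filter_near_duplicates` on a list of document strings returns the list with later near-duplicates dropped, keeping the first occurrence of each. This sketch compares every new document against every kept one, which is fine for small collections; for large ones, MinHash is commonly paired with locality-sensitive hashing so that only likely matches are compared.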