Similar Document Detection

Available from version 11.48.0 — Requires OPENSEARCH_ENABLED license.

What does this script do?

Uses vector-based similarity search to find documents that are semantically similar to the current one. If any of them is more than 95% similar, the script marks the invoice number field (invoice_id) as invalid, flagging the invoice as a potential duplicate or fraud attempt.

Trigger

AFTER_FORMATTING on document type INVOICE

Full Script

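# ID of the document currently being processed
doc_id = document_json["doc_id"]
# Fetch the 5 nearest neighbors in the embedding space
similar = vector_search(doc_id, k=5)

for doc in similar:
    # Flag only near-identical documents (similarity above 95%)
    if doc["similarity_percent"] > 95:
        set_field_as_invalid(
            document_data, "invoice_id",
            f"95%+ similar to: {doc['name']} (Score: {doc['similarity_percent']}%)"
        )
        # Stop after the first match; one validation error is enough
        break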

Step-by-Step Explanation

1. Get the current document ID: read doc_id from document_json.
2. Find similar documents: vector_search() returns the 5 nearest neighbors (k=5).
3. Check the similarity threshold: the loop stops at the first document whose similarity exceeds 95%.
4. Mark as invalid: invoice_id receives a validation error that names the similar document and its similarity score.
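
To try the control flow of the steps above outside the platform, the sketch below runs the same loop against stand-ins. The stubbed vector_search() and set_field_as_invalid(), the sample results, and the wrapper function check_similar() are assumptions made for illustration; on the platform these functions and the document_json / document_data objects are supplied by the scripting environment, and the real return values may carry more fields than name and similarity_percent.

```python
# Hypothetical stand-ins: the real vector_search() and set_field_as_invalid()
# are provided by the platform. Only the keys used by the script are modeled.
SAMPLE_RESULTS = [
    {"name": "invoice_0417.pdf", "similarity_percent": 97.2},
    {"name": "invoice_0388.pdf", "similarity_percent": 96.1},
    {"name": "invoice_0102.pdf", "similarity_percent": 61.5},
]


def vector_search(doc_id, k=5):
    """Stub: return at most k canned neighbors, regardless of doc_id."""
    return SAMPLE_RESULTS[:k]


def set_field_as_invalid(document_data, field_name, message):
    """Stub: record the validation message on a plain dict."""
    document_data.setdefault("errors", {})[field_name] = message


def check_similar(document_json, document_data, threshold=95):
    """Same loop as the script, wrapped in a function so it can be tested."""
    similar = vector_search(document_json["doc_id"], k=5)
    for doc in similar:
        if doc["similarity_percent"] > threshold:
            set_field_as_invalid(
                document_data, "invoice_id",
                f"{threshold}%+ similar to: {doc['name']} "
                f"(Score: {doc['similarity_percent']}%)",
            )
            break  # only one match is reported
    return document_data


result = check_similar({"doc_id": "doc-1"}, {})
print(result["errors"]["invoice_id"])
```

Two of the sample results exceed the threshold, but only one error is recorded because the loop breaks on the first hit.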

How Vector Search Works

Each document's OCR text is converted to a 384-dimensional vector embedding when indexed. vector_search() finds the nearest neighbors in this vector space using k-NN (k-Nearest Neighbors), returning documents whose content is semantically similar — even if the exact words differ.
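
The nearest-neighbor lookup described above can be illustrated with a small brute-force sketch. It assumes cosine similarity as the distance measure and a simple scaling to a percentage; the platform's actual metric, score mapping, and index structure are not stated here, and production systems usually rely on an approximate index rather than comparing against every document. The names cosine_similarity, nearest_neighbors, and the toy 3-dimensional vectors are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def nearest_neighbors(query_id, index, k=5):
    """Brute-force k-NN: score every other document, keep the top k."""
    query = index[query_id]
    scored = [
        {
            "name": name,
            # Assumed mapping from cosine similarity to a percentage
            "similarity_percent": round(100 * cosine_similarity(query, vec), 1),
        }
        for name, vec in index.items()
        if name != query_id  # never compare a document with itself
    ]
    scored.sort(key=lambda d: d["similarity_percent"], reverse=True)
    return scored[:k]


# Toy 3-dimensional "embeddings"; the real ones have 384 dimensions.
index = {
    "invoice_a": [1.0, 0.0, 0.0],
    "invoice_b": [0.9, 0.1, 0.0],
    "invoice_c": [0.0, 1.0, 0.0],
}
neighbors = nearest_neighbors("invoice_a", index, k=2)
print(neighbors)
```

Here invoice_b points in almost the same direction as invoice_a and ranks first, while invoice_c is orthogonal to it and scores zero.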

Use cases:

Fraud detection (nearly identical invoices from different "vendors")
Duplicate detection that goes beyond exact text matching
Finding related documents across different formats or languages
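
For the fraud-detection use case above, one possible refinement is to report only near-identical documents that claim a different vendor. The sketch below works on plain dictionaries and assumes each search result carries a vendor key; that key, the function name cross_vendor_matches, and the sample data are hypothetical and not confirmed parts of the vector_search() result.

```python
def cross_vendor_matches(current_vendor, similar, threshold=95):
    """Return near-identical documents that name a different vendor."""
    return [
        doc for doc in similar
        if doc["similarity_percent"] > threshold
        and doc.get("vendor") is not None  # "vendor" is an assumed field
        and doc["vendor"] != current_vendor
    ]


# Hypothetical search results for an invoice issued by "Acme GmbH"
similar = [
    {"name": "inv_17.pdf", "similarity_percent": 98.0, "vendor": "Acme GmbH"},
    {"name": "inv_18.pdf", "similarity_percent": 97.0, "vendor": "Globex Ltd"},
    {"name": "inv_19.pdf", "similarity_percent": 70.0, "vendor": "Initech"},
]
suspicious = cross_vendor_matches("Acme GmbH", similar)
print([doc["name"] for doc in suspicious])
```

A same-vendor match such as inv_17.pdf is more likely an ordinary duplicate, which is why it is left out of this list.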

Functions Used

vector_search() — Find semantically similar documents
set_field_as_invalid() — Show validation error
