Vector database

You can enable semantic search in Neptune DXP - Open Edition. Semantic search finds suitable results by comparing vector representations of meaning (embeddings) rather than by matching exact keywords.

Specifically, provided that the prerequisites below are met, you can make any table definition searchable by semantic similarity.

Vector search can be a useful technical approach to provide your AI agent with context. However, it is not always the best approach.

Do use vectorization when you want to search a large body of unstructured data.

Do not use vectorization to search structured data. If you have a table that contains structured data (for example, sales orders, containing data about customer, product, dates, etc.), use an AI tool of type Table Definition.

Prerequisites

  • You are using PostgreSQL or Microsoft SQL Server as a database and have installed and enabled the pgvector extension (PostgreSQL) or the experimental vector feature (Microsoft SQL Server). Search online for instructions on how to enable the vector extension in the database of your choice.

  • You have set up an embedding model in the Model tool.

Enable vectorization of a table definition

  • To enable the vectorization of a given table, open the Table Definition tool and go to Properties. Enter edit mode and select Vector Store.

    Result: A dialog opens that lets you configure the vectorization of the table.

    • Select Enable Vectorization of this table to add vectorization to the table. If you do this, a custom column with the name rowVector is added to the table and a custom routine is hooked to CREATE/UPDATE operations on the table, to perform the vectorization.

    • Vectorizer: Select an AI model that performs the actual vectorization, turning text input into a semantic vector representation. You can only select models with output type vector.

    • Columns to vectorize: You can customize which columns are part of the vectorization. Typically, you would select those columns that contain text.

    • Result template: If the table is selected as a knowledge source in an agent, the result template defines how each matching row is rendered for the agent.

Example

You have created a table that contains the text of instruction manuals. To prepare the data, you parsed the manuals and split the plain text into overlapping chunks of 1024 characters each. The table has the following fields:

  • filename: The name of the instruction manual

  • productName: The name of the product that the manual describes

  • pageNr: The page a specific chunk is found on

  • chunkIdx: The index of the chunk

  • chunkTxt: The text of the chunk
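
The preparation step described above could be sketched as follows. This is a minimal illustration, not part of Neptune DXP: the helper name chunkText and the overlap of 128 characters are assumptions (the example only fixes the chunk length at 1024 characters), and the mapping of chunks to page numbers is left out.

```javascript
// Hypothetical helper: split plain text into overlapping fixed-size chunks.
// chunkSize matches the example (1024 characters); overlap is an assumed value.
function chunkText(text, chunkSize = 1024, overlap = 128) {
    if (overlap < 0 || overlap >= chunkSize) {
        throw new Error("overlap must be non-negative and smaller than chunkSize");
    }
    const step = chunkSize - overlap; // distance between two chunk starts
    const chunks = [];
    for (let start = 0, idx = 0; start < text.length; start += step, idx++) {
        chunks.push({
            chunkIdx: idx,
            chunkTxt: text.slice(start, start + chunkSize),
        });
        // Stop once the current chunk already reaches the end of the text.
        if (start + chunkSize >= text.length) break;
    }
    return chunks;
}
```

Each returned object carries the chunkIdx and chunkTxt values for one table row. The other fields (filename, productName, pageNr) would come from your own parsing step.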

To use this data as a data source in an agent, you set up the vectorization as follows:

  • Columns to vectorize: productName, chunkTxt

  • Result template:

=== {filename} / Instruction manual for {productName}
Found on page nr. {pageNr}
{chunkTxt}

With this setup, the semantic representation captures both the product and the content of each chunk. When the agent receives a result, it has the metadata it needs not only to give the user context, but also to point the user to the right file and page.
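
To make the effect of the result template concrete, the following sketch fills the placeholders of the template above with the values of one row. It only illustrates the idea; how Neptune DXP renders the template internally is not documented here, and the function renderTemplate and the sample row are hypothetical.

```javascript
// Illustrative only: replace {fieldName} placeholders with values from a row.
// Placeholders without a matching field are left unchanged.
function renderTemplate(template, row) {
    return template.replace(/\{(\w+)\}/g, (match, field) =>
        Object.prototype.hasOwnProperty.call(row, field) ? String(row[field]) : match
    );
}

const template = [
    "=== {filename} / Instruction manual for {productName}",
    "Found on page nr. {pageNr}",
    "{chunkTxt}",
].join("\n");

// Hypothetical sample row, using the fields of the example table.
const row = {
    filename: "mixer-x200.pdf",
    productName: "Mixer X200",
    pageNr: 12,
    chunkIdx: 3,
    chunkTxt: "To clean the bowl, remove it first.",
};

console.log(renderTemplate(template, row));
```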

Add an index on a vector column

If the embedding model that you are using has the vector dimension set, and you are using PostgreSQL with pgvector, you can add an index on the rowVector column to enable fast but approximate semantic search. If you add such an index through the Table Definition tool, the HNSW index type is used internally.

While an index can speed up the similarity search significantly, use it with caution together with WHERE clauses. The filter is applied after the approximate search, so a query can return fewer rows than expected, or none at all, even though matching rows exist.

Refer to the pgvector documentation for more details. If you need to combine WHERE clauses with an index on the vector column, consider enabling an iterative scan by running SET hnsw.iterative_scan = strict_order in your database.
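
The effect of filtering after the search can be shown with a toy model. The sketch below does not call pgvector: it uses an exact sort in place of the approximate index scan, and the parameter candidateCount is a stand-in for the limited number of candidates that the index returns. All names and data are hypothetical.

```javascript
// Toy model: the index hands back only the nearest candidates,
// and the WHERE condition is applied to those candidates afterwards.
function postFilteredSearch(rows, candidateCount, predicate, limit) {
    const candidates = [...rows]
        .sort((a, b) => a.distance - b.distance)
        .slice(0, candidateCount);
    return candidates.filter(predicate).slice(0, limit);
}

// For comparison: filter first, then rank all matching rows (no index).
function preFilteredSearch(rows, predicate, limit) {
    return rows
        .filter(predicate)
        .sort((a, b) => a.distance - b.distance)
        .slice(0, limit);
}

// Hypothetical rows with precomputed distances to the query.
const rows = [
    { id: 1, productName: "A", distance: 0.10 },
    { id: 2, productName: "A", distance: 0.12 },
    { id: 3, productName: "B", distance: 0.15 },
    { id: 4, productName: "A", distance: 0.20 },
    { id: 5, productName: "B", distance: 0.30 },
    { id: 6, productName: "B", distance: 0.35 },
];
const onlyB = (r) => r.productName === "B";

// Only one "B" row is among the three nearest candidates.
console.log(postFilteredSearch(rows, 3, onlyB, 3).map((r) => r.id));
// Filtering first finds all three "B" rows.
console.log(preFilteredSearch(rows, onlyB, 3).map((r) => r.id));
```

An iterative scan addresses this by continuing to scan the index until enough rows pass the filter.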

Perform a semantic search in a script

In the Script Editor, you can perform a semantic search on vector-enabled tables. To do so, add the table to the script and search using the following syntax:

const data = await entities.testtable.findSimilar({
    search: "My query",
    select: ["id"], // standard TypeORM syntax: select the fields you need
    where: { id: Not("test") }, // standard TypeORM filter syntax
    // additional TypeORM options
});

Use a property of type vector

When adding a new property to a table, you can select vector as its type. Unlike enabling vectorization as described above, this does not provide any out-of-the-box functionality.

However, you can use it to build a bespoke solution that involves vector similarity search. For instance, you may implement a use case in which items are represented by both an image and a text, and you want to provide semantic similarity search based on both modalities.

With appropriate embedding models set up, you can store both the text embedding and the image embedding in the table, and write a custom script that searches the table based on image or text similarity.

Adding a vector property on your table doesn’t automatically give you the vectorization capabilities as described above. You must implement your own similarity search.
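
As a starting point for such a custom search, the following sketch ranks rows by cosine similarity in memory. It is a minimal illustration under stated assumptions: the property names textVector and imageVector, the helper functions, and the sample data are hypothetical, and the query vector is expected to come from the same embedding model as the stored vectors. For large tables, you would typically run the similarity computation in the database instead.

```javascript
// Cosine similarity between two vectors of equal dimension.
function cosineSimilarity(a, b) {
    if (a.length !== b.length) {
        throw new Error("vectors must have the same dimension");
    }
    let dot = 0;
    let normA = 0;
    let normB = 0;
    for (let i = 0; i < a.length; i++) {
        dot += a[i] * b[i];
        normA += a[i] * a[i];
        normB += b[i] * b[i];
    }
    if (normA === 0 || normB === 0) return 0; // avoid division by zero
    return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// Return the k rows whose vector in vectorField is most similar to queryVector.
function findMostSimilar(rows, queryVector, vectorField, k) {
    return rows
        .map((row) => ({ row, score: cosineSimilarity(row[vectorField], queryVector) }))
        .sort((x, y) => y.score - x.score)
        .slice(0, k);
}

// Hypothetical rows with one embedding per modality (tiny 2-dimensional vectors).
const items = [
    { id: "a", textVector: [1, 0], imageVector: [0, 1] },
    { id: "b", textVector: [0, 1], imageVector: [1, 0] },
    { id: "c", textVector: [1, 1], imageVector: [1, 1] },
];

// Search by text similarity, then by image similarity, with the same query vector.
console.log(findMostSimilar(items, [1, 0], "textVector", 2).map((r) => r.row.id));
console.log(findMostSimilar(items, [1, 0], "imageVector", 2).map((r) => r.row.id));
```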