Java and the OpenAI API: How to Persist Embeddings With Milvus

If you're reading this, that probably means you already know how to retrieve Embeddings with the OpenAI API. And now you want to persist them in your Java application.

In this guide, I'll show you how to do that with the aid of Milvus.

Happily enough, Milvus also offers a Java API that makes it easy for you to persist those Embeddings. Even better, the Milvus API lets you index and search the Embeddings you persisted.

So feel free to follow along and learn more.

Prereqs

There are prereqs to this tutorial.

For starters, this guide is part of a series. And I'm going to assume you've followed along in the series so that you know how to interact with the OpenAI API in your Java application.

If not, you should go back to the beginning of the series and start there.

Also, you need Milvus. And for that, you need your own server.

The good news is that you can install Milvus on your server and start playing around for free.

The bad news is... well, there ain't no bad news, actually. It's pretty cool that you can do that.

I could walk you through how to install Milvus but truth be told I can't do any better than the guide that's on the company website. So go follow that guide and come on back here.

Once you've got Milvus up and running, you're ready for action.

Oh Yeah: What Is Milvus?

Maybe at this point you're asking yourself: "What, exactly, is Milvus?"

In a nutshell: it's a database.

In a bigger nutshell: it's a vector database. That means it's a database designed specifically to store those Embeddings you get back from OpenAI.

That also means that it's a database that handles unstructured data.

Recall from your DB 101 days that relational databases support structured data. And I think it's safe to say that document-oriented databases like MongoDB do the same.

But the whole point of AI and Embeddings is that you're working with unstructured data: plain text in English that's out there in cyberspace.

So you need a database that supports vectors. And Milvus is a great choice.

What Happened to That Flat File?

If you were here for the Embeddings tutorial, you might recall that we grabbed Embeddings via the OpenAI API and then stored them in a flat file.

So what's wrong with that solution? Nothing for small data sets.

But when you're looking at an encyclopedia of info, then it becomes problematic.

Why? Because when it's time for searching, you'd be doing a brute-force search. You'd step through each item in the encyclopedia to see how well it matches the search term.

In other words: you ain't got no indexing with a flat file.

And that's where Milvus shines. It's got a comprehensive indexing solution that will get you results in (relatively) short order.
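To make the brute-force problem concrete, here's a minimal sketch of what searching a flat file boils down to once the Embeddings are loaded into memory. The class name BruteForceSearch and the choice of cosine similarity as the score are my assumptions for illustration. The flat-file tutorial may well score matches differently.

```java
import java.util.List;

// Hypothetical illustration: not part of the tutorial's code base.
final class BruteForceSearch {

    // Cosine similarity between two vectors of equal length.
    static double cosineSimilarity(final List<Float> a, final List<Float> b) {
        if (a.size() != b.size()) {
            throw new IllegalArgumentException("Vectors must have the same dimension");
        }
        double dot = 0.0;
        double normA = 0.0;
        double normB = 0.0;
        for (int i = 0; i < a.size(); i++) {
            dot += a.get(i) * b.get(i);
            normA += a.get(i) * a.get(i);
            normB += b.get(i) * b.get(i);
        }
        return dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }

    // Steps through every stored embedding, so the cost grows linearly with the data set.
    static int indexOfBestMatch(final List<Float> query, final List<List<Float>> embeddings) {
        int bestIndex = -1;
        double bestScore = Double.NEGATIVE_INFINITY;
        for (int i = 0; i < embeddings.size(); i++) {
            final double score = cosineSimilarity(query, embeddings.get(i));
            if (score > bestScore) {
                bestScore = score;
                bestIndex = i;
            }
        }
        return bestIndex;
    }
}
```

Every search touches every row. That's fine for a few hundred reviews, and painful for an encyclopedia.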

On to Some Code

It's all Java from here on out. You can take a look at the complete code on GitHub.

First things first: update your application.properties file. Give it the host name where you installed Milvus.

milvus.host=[your host here]
milvus.user=[your user name]
milvus.password=[your password]

IMPORTANT: If you just followed the basic instructions for how to install Milvus, you won't need to worry about the name and password. Just leave those out.

All you probably need for now is milvus.host.

And remember: you won't find application.properties on GitHub because I don't share that stuff with everybody who reads my content.
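Since the user name and password are optional, it can help to read them in a way that doesn't blow up when they're missing. Here's a minimal sketch of that idea using plain java.util.Properties. The MilvusSettings class is hypothetical: the series uses its own AiProperties helper, and I'm not claiming that helper works this way.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.Optional;
import java.util.Properties;

// Hypothetical stand-in for the series' AiProperties helper.
final class MilvusSettings {
    private final Properties props = new Properties();

    MilvusSettings(final Reader source) {
        try {
            props.load(source);
        } catch (IOException e) {
            throw new UncheckedIOException("Could not read Milvus settings", e);
        }
    }

    // The host is mandatory, so fail fast when it's missing or blank.
    String host() {
        final String host = props.getProperty("milvus.host");
        if (host == null || host.isBlank()) {
            throw new IllegalStateException("milvus.host is not set");
        }
        return host.trim();
    }

    // User name and password are optional on a basic install.
    Optional<String> user() {
        return Optional.ofNullable(props.getProperty("milvus.user"));
    }

    Optional<String> password() {
        return Optional.ofNullable(props.getProperty("milvus.password"));
    }

    public static void main(final String[] args) {
        // Simulates an application.properties file that only sets the host.
        final MilvusSettings settings = new MilvusSettings(new StringReader("milvus.host=localhost\n"));
        System.out.println(settings.host() + " has credentials: " + settings.user().isPresent());
    }
}
```

In a real application you'd hand it a reader over your actual application.properties file instead of a StringReader.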

Next, update your POM file with this dependency:

        <dependency>
            <groupId>io.milvus</groupId>
            <artifactId>milvus-sdk-java</artifactId>
            <version>2.2.9</version>
        </dependency>

That's the Milvus SDK. You might need to use a different version if you're reading this article well after its publication date.

Now it's time to move on to coding a new helper class.

A New Helper Class

Call it MilvusServiceClientHelper.

public class MilvusServiceClientHelper {

    //make sure to set the correct properties in application.properties
    public static MilvusServiceClient getClient() {
        final String host = AiProperties.get("milvus.host");

        final MilvusServiceClient milvusClient = new MilvusServiceClient(
                ConnectParam.newBuilder()
                        .withHost(host)
                        .withPort(19530)
                        .build()
        );

        return milvusClient;
    }
}

That class is fairly straightforward.

It starts off by getting the host IP address or name from the application.properties file.

Then it instantiates MilvusServiceClient by connecting to that host at port 19530 (one of two ports that Milvus uses).

Finally, it returns the newly created client.

Start With Persistence

Now that you've got that additional support class in place, it's time to start using it to persist your Embeddings.

Create a new class called PersistEmbeddings. Start populating it with necessary support methods.

Keep in mind: you'll work here with the same data that you used in the previous tutorial. So you should have two plain text files:

amazon-food-reviews.txt
amazon-food-reviews-embeddings.txt

The second file is just the Embeddings generated from the first file.

With that in mind, define some constants for your class:

    private static final String INPUT_FILE = "./amazon-food-reviews.txt";
    private static final String EMBEDDINGS_FILE = "./amazon-food-reviews-embeddings.txt";

    private static final String COLLECTION_NAME = "amazon_food_reviews";
    private static final String CONTENT_FIELD = "content";
    private static final String EMBEDDING_FIELD = "content_embedding";
    private static final String ID_FIELD = "review_id";

The first two constants I just explained.

But what is that COLLECTION_NAME all about?

Well if you're familiar with MongoDB then you're probably already familiar with collections. They're similar to tables in a relational database.

And Milvus uses collections, too. They're groupings of related info.

For this application, the collection you'll persist is called "amazon_food_reviews".

Next up: CONTENT_FIELD. That's the name of the field where you'll persist the unstructured data.

You can think of CONTENT_FIELD as a column in a relational database. And you can think of each line in amazon-food-reviews.txt as a separate row where the data is stored in the column identified by CONTENT_FIELD.

So we're putting the content from each line in amazon-food-reviews.txt in a field called "content".

Then there's EMBEDDING_FIELD. You won't be shocked to learn that's the name of the field where you'll persist the embeddings.

Finally, there's ID_FIELD. It's necessary because each entry in a Milvus collection needs a primary ID.

Defining the Fields

I've talked a lot about fields in the previous section. But where do you define them?

Here:

    private static void createCollection(final MilvusServiceClient client) {
        final FieldType primaryIdType = FieldType.newBuilder()
                .withName(ID_FIELD)
                .withDataType(DataType.Int64)
                .withPrimaryKey(true)
                .withAutoID(true)
                .build();

        final FieldType contentType = FieldType.newBuilder()
                .withName(CONTENT_FIELD)
                .withDataType(DataType.VarChar)
                .withMaxLength(2048)
                .build();

        final FieldType embeddingsType = FieldType.newBuilder()
                .withName(EMBEDDING_FIELD)
                .withDataType(DataType.FloatVector)
                .withDimension(1536)
                .build();

        final CreateCollectionParam createCollectionReq = CreateCollectionParam.newBuilder()
                .withCollectionName(COLLECTION_NAME)
                .withDescription("Amazon Reviews")
                .withShardsNum(2)
                .addFieldType(primaryIdType)
                .addFieldType(contentType)
                .addFieldType(embeddingsType)
                .withEnableDynamicField(true)
                .build();

        client.createCollection(createCollectionReq);
    }

If you look carefully, you'll see that the code above does more than define fields: it creates the entire collection. The method name kind of gives that game away.

But it also defines fields. And it does so by defining data types for each field.

You've almost certainly seen that before. When designing relational tables, for example, you may define one column as a number (for example, BIGINT) and another column as a string (for example, VARCHAR).

Same thing here.

The first field type, primaryIdType, defines the primary key field. Happily enough, Milvus offers a withAutoID() function that creates the primary ID automagically.

The second field type, contentType, defines the field where you'll store the unstructured content. As you can see, the content type is VarChar. Because that's the kind of content you'll find in Amazon food reviews.

The third field type, embeddingsType, defines the field where you'll persist the Embeddings. That's why the data type is FloatVector.

Note that the last field includes a property called "dimension." That's a number between 1 and 32,768. Here it's 1536 because that's how many values each Embedding in the file holds.

It's the dimension of the vector. If you get it wrong, Milvus will be kind enough to not only tell you that you got it wrong, but to also tell you what the right number is.
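If you'd rather not wait for Milvus to complain, you could check the dimension yourself before inserting anything. This is just a sketch: the DimensionCheck class is hypothetical, and the 1536 constant is there only because it matches the withDimension(1536) call above.

```java
import java.util.List;

// Hypothetical pre-insert check: not part of the tutorial's code base.
final class DimensionCheck {
    // Must match the value passed to withDimension() in createCollection().
    static final int EXPECTED_DIMENSION = 1536;

    // Returns the index of the first embedding with the wrong size, or -1 if all of them match.
    static int firstMismatch(final List<List<Float>> embeddings, final int expectedDimension) {
        for (int i = 0; i < embeddings.size(); i++) {
            if (embeddings.get(i).size() != expectedDimension) {
                return i;
            }
        }
        return -1;
    }
}
```

Call DimensionCheck.firstMismatch(embeddings, DimensionCheck.EXPECTED_DIMENSION) once the Embeddings are loaded. Anything other than -1 tells you which line of the Embeddings file to go look at.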

Finally, that method creates the collection. In this case, you can use 2 shards. That's the default for Milvus collections so let's just roll with it.

The whole concept of sharding, by the way, streamlines write operations for Milvus collections.

Partitioning, on the other hand, streamlines read operations.

Also, you're creating the collection with a dynamic data model. That's why withEnableDynamicField() is set to true.

Once that method completes, you've got a collection with nothing in it.

But You Want Something In It

You need to populate that collection with two data sets:

The text in amazon-food-reviews.txt in the "content" field
The embeddings in amazon-food-reviews-embeddings.txt in the "content_embedding" field

First, read the lines from the first file:

    private static List<String> loadLinesFromFile() throws IOException {
        final List<String> list = new ArrayList<>();

        try (Stream<String> stream = Files.lines(Path.of(INPUT_FILE))) {
            stream.forEach(list::add);
        }

        return list;
    }

That will give you a List of String objects where each object corresponds to a line in the file.

Next, read the lines from the Embeddings file:

    private static List<List<Float>> populateEmbeddings() throws IOException {
        final List<List<Float>> embeddingsFromFile = EmbeddingsHelper.loadFromFileAsFloats(EMBEDDINGS_FILE);
        return embeddingsFromFile;
    }

You're using an existing support class for that one.

And you might have noticed that you're now reading the Embeddings as Float objects instead of Double objects.

Why? Because the Milvus API treats them as Float objects whereas my main man Theo Kanning treats them as Double objects.

That's why.

In any case, that method returns a list of a list. The inner list corresponds to a single line in the document. The outer list represents all the lines in the whole document.
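If you already have Embeddings sitting around as Double objects, converting them is simple enough. Here's a minimal sketch. The FloatConversion class is hypothetical, and I'm not claiming this is how EmbeddingsHelper.loadFromFileAsFloats() does it under the hood.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical conversion helper: not part of the tutorial's code base.
final class FloatConversion {

    // Narrows each Double to a Float. Some precision is lost along the way.
    static List<Float> toFloats(final List<Double> doubles) {
        final List<Float> floats = new ArrayList<>(doubles.size());
        for (final Double value : doubles) {
            floats.add(value.floatValue());
        }
        return floats;
    }

    // Converts every row, keeping the outer list in the same order as the input.
    static List<List<Float>> toFloatRows(final List<List<Double>> rows) {
        final List<List<Float>> converted = new ArrayList<>(rows.size());
        for (final List<Double> row : rows) {
            converted.add(toFloats(row));
        }
        return converted;
    }
}
```

Keeping the row order intact matters here, because each row of Embeddings has to line up with the matching line of review text.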

Putting It All Together

Now it's up to the main() method to do all the heavy lifting.

    public static void main(String[] args) {
        try {
            final MilvusServiceClient client = MilvusServiceClientHelper.getClient();

            createCollection(client);

            final List<String> lines = loadLinesFromFile();
            final List<List<Float>> embeddings = populateEmbeddings();

            final List<InsertParam.Field> fields = new ArrayList<>();
            fields.add(new InsertParam.Field(CONTENT_FIELD, lines));
            fields.add(new InsertParam.Field(EMBEDDING_FIELD, embeddings));

            final InsertParam insertParam = InsertParam.newBuilder()
                    .withCollectionName(COLLECTION_NAME)
                    .withFields(fields)
                    .build();

            final R<MutationResult> result = client.insert(insertParam);
            System.out.println(result);

            client.close();
        } catch (Exception e) {
            LOG.error("Problem persisting embeddings!", e);
        }
    }

The first line in that try block creates the Milvus client. You've already seen the code that does that.

The second line creates the collection. Again, you've already seen the code that does that.

The next two lines of code grab the lines from the plain text file and the Embeddings file, respectively. Once more, with feeling: you've already seen that code.

Now it's time to populate the collection with the content and the Embeddings.

The next few lines create a List of InsertParam.Field objects.

An InsertParam.Field object associates a field with its content. You can see from the code above that it's associating the "content" field with the lines from the plain text file. It's associating the "content_embedding" field with the lines from the Embeddings file.

Both of those InsertParam.Field objects go into the list.
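The two fields get matched up by position: the first line of text goes with the first Embedding, the second with the second, and so on. So both lists need the same number of entries. Here's a sketch of a sanity check you could run before building the insert request. The InsertPreflight class is hypothetical and isn't in the GitHub code.

```java
import java.util.List;

// Hypothetical pre-insert check: not part of the tutorial's code base.
final class InsertPreflight {

    // Throws if the content and the embeddings can't be paired up row by row.
    static void requireSameRowCount(final List<String> lines, final List<List<Float>> embeddings) {
        if (lines.isEmpty()) {
            throw new IllegalArgumentException("Nothing to insert: the content list is empty");
        }
        if (lines.size() != embeddings.size()) {
            throw new IllegalArgumentException(
                    "Row count mismatch: " + lines.size() + " lines but "
                            + embeddings.size() + " embeddings");
        }
    }
}
```

A mismatch usually means one of the two files is stale or got truncated, so it's worth catching before anything goes over the wire.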

Next, the code creates an InsertParam object that ties those InsertParam.Field objects to the Milvus collection you created.

In other words, this is the request that actually handles the inserts.

Finally, the code invokes client.insert() with that InsertParam object to formally do the inserts.

And it returns a generic type called R.

Yes, R.

I gotta admit, I kind of like that.

Anyway, the specific type defined here is MutationResult. As the name implies, that will give you some info about the insert operations you just attempted to perform.

The code then prints out the result of your operation. If you see a "success" message, then you're in good shape.

Finally, the code closes the client.

Go ahead and run the code. I think you'll get the success message.

Default Database

One point of order: the code here uses the default database for everything.

You can create your own database, but I had problems getting search to work with anything other than the default database. I think there might be a problem with the Java API (as of this writing, anyway).

Wrapping It Up

Good job. You've now persisted Embeddings for the first time.

But there's plenty of room for improvement here. For example, there are still support classes with method names that include the word "persist." But they don't persist anything to Milvus.

So that can be a bit confusing. Feel free to clean it up.

Also, check out the complete code when you get a chance.

Have fun!
