Encode data into a vector database
To encode data into a vector database, complete these tasks:
- Encode documents: Vectorize document data and store the vectors in Milvus collections. You can then use the documents in later stages of the LLM-RAG process.
- Encode data from the Splunk platform: Vectorize Splunk log data and store the vectors in Milvus collections. You can then use the data in later stages of the LLM-RAG process.
- Conduct vector search: Run vector similarity search on Splunk log data.
- Manage and explore your vector database: List, explore, and delete collections in your vector database.
Encode documents
Go to the Encode documents page:
- In the Splunk App for Data Science and Deep Learning (DSDL), go to Assistants.
- Select LLM-RAG, then Encoding Data to Vector Database, and then Encode documents.
Parameters
The Encode documents page has the following parameters:
Parameter | Description |
---|---|
`data_path` | The directory on the Docker volume where you store the raw document data. Subdirectories are read automatically, and CSV, PDF, TXT, DOCX, XML, and IPYNB files are encoded. |
`collection_name` | A unique name for the collection that stores the vectors. The name must start with a letter or a number and contain no spaces. If you are adding data to an existing collection, make sure to use the same embedder model. |
`vectordb_service` | Type of VectorDB service. Choose from `milvus`, `pinecone`, and `alloydb`. |
`embedder_service` | Type of embedding service. Choose from `huggingface`, `ollama`, `azure_openai`, `openai`, `bedrock`, and `gemini`. |
`embedder_name` | Name of the embedding model. Optional if configured on the Setup LLM-RAG page. |
`embedder_dimension` | Output dimensionality of the embedding model. Optional if configured on the Setup LLM-RAG page. |
Run the fit or compute command
Use the following syntax to run the `fit` command or the `compute` command:

- Run the `fit` command:

  ```
  | makeresults | fit MLTKContainer algo=llm_rag_document_encoder data_path="/srv/notebooks/data/Buttercup" vectordb_service=milvus embedder_service=azure_openai collection_name="document_collection_example" _time into app:llm_rag_document_encoder as Encoding
  ```

- Run the `compute` command:

  ```
  | makeresults | compute algo:llm_rag_document_encoder data_path:"/srv/notebooks/data/Buttercup/" vectordb_service:"milvus" embedder_service:"azure_openai" collection_name:"document_collection_example" _time
  ```
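If `embedder_name` and `embedder_dimension` are not configured on the Setup LLM-RAG page, you can pass them explicitly in the command. The following `fit` variant is an illustrative sketch: the Hugging Face model `all-MiniLM-L6-v2` and its 384-dimension output are example values, not requirements:

```
| makeresults | fit MLTKContainer algo=llm_rag_document_encoder data_path="/srv/notebooks/data/Buttercup" vectordb_service=milvus embedder_service=huggingface embedder_name="all-MiniLM-L6-v2" embedder_dimension=384 collection_name="document_collection_example" _time into app:llm_rag_document_encoder as Encoding
```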
Dashboard view
The following image shows the dashboard view for the Encode documents page:
The dashboard view includes the following components:
Dashboard component | Description |
---|---|
Data Path | The directory on the Docker volume where you store the raw document data. Subdirectories are read automatically, and files with extensions .CSV, .PDF, .TXT, .DOCX, .XML, and .IPYNB are encoded. |
Collection Name | A unique name for the collection that stores the vectors. The name must start with a letter or a number and contain no spaces. If you are adding data to an existing collection, make sure to use the same embedder model. |
VectorDB Service | Type of VectorDB service. |
Embedding Service | Type of Embedding service. |
Encode | Select to start encoding after finishing all the inputs. |
Conduct Vector Search | Jump to Vector search dashboard. |
RAG-based LLM | Jump to RAG-based LLM dashboard. |
Create a New Encoding | Reset all the tokens on this dashboard. |
Return to Menu | Return to the main menu. |
Encode data from the Splunk platform
In the Splunk App for Data Science and Deep Learning (DSDL), navigate to Assistants, then LLM-RAG, then Encoding Data to Vector Database, and then select Encode data from Splunk.
Append the command to a search pipeline that produces a table containing only the field of log data you want to encode, as well as any metadata fields you want to add. Avoid using `embeddings` or `label` as field names in the table, because these two field names are used in the vector database by default.
Encoding too much data at once can cause a failure. Keep the cardinality of logs under 30,000.
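As an illustrative example, a pipeline like the following produces a table with one log field (`_raw`) to encode plus two metadata fields. The index, filter, and event limit are assumptions for the sketch; note that the field names avoid `embeddings` and `label`, and the `head` limit keeps the result well under 30,000 events:

```
index=_internal error | head 1000 | table _raw sourcetype source
```

You can then append the `fit` or `compute` encoding command to the end of this pipeline.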
Parameters
The Encode data from Splunk page has the following parameters:
Parameter | Description |
---|---|
`label_field_name` | Name of the field you want to encode. All the other fields in the table are treated as metadata in the collection. |
`collection_name` | A unique name for the collection that stores the vectors. The name must start with a letter or a number and contain no spaces. If you are adding data to an existing collection, make sure to use the same embedder model. |
`vectordb_service` | Type of VectorDB service. Choose from `milvus`, `pinecone`, and `alloydb`. |
`embedder_service` | Type of embedding service. Choose from `huggingface`, `ollama`, `azure_openai`, `openai`, `bedrock`, and `gemini`. |
`embedder_name` | Name of the embedding model. Optional if configured on the Setup LLM-RAG page. |
`embedder_dimension` | Output dimensionality of the embedding model. Optional if configured on the Setup LLM-RAG page. |
Run the fit or compute command
Use the following syntax to run the `fit` command or the `compute` command:

- Run the `fit` command:

  ```
  index=_internal error | table _raw sourcetype source | head 100 | fit MLTKContainer algo=llm_rag_log_encoder collection_name="test" vectordb_service=milvus embedder_service=azure_openai embedder_dimension=3072 label_field_name=_raw * into app:llm_rag_log_encoder as Encode
  ```

- Run the `compute` command:

  ```
  index=_internal error | table _raw sourcetype source | head 100 | compute algo:llm_rag_log_encoder collection_name:"test" vectordb_service:milvus embedder_service:azure_openai embedder_dimension:3072 label_field_name:_raw sourcetype source
  ```

The wildcard character (`*`) is not supported in the `compute` command. You must specify all the input fields within the command.
Dashboard view
The following image shows the dashboard view for the Encode data from Splunk page:
The dashboard view includes the following components:
Dashboard component | Description |
---|---|
Search bar | Search Splunk log data to encode. This search produces a table containing only a field of the log data you want to encode, as well as other fields of metadata you want to add. |
Target Field Name | The name of the field you want to encode. All the other fields in the table are treated as metadata in the collection. |
Collection Name | A unique name for the collection that stores the vectors. The name must start with a letter or a number and contain no spaces. If you are adding data to an existing collection, make sure to use the same embedder model. |
VectorDB Service | Type of VectorDB service. |
Embedding Service | Type of Embedding service. |
Encode | Select to start encoding after finishing all the inputs. |
Conduct Vector Search | Jump to Vector search dashboard. |
RAG-based LLM | Jump to RAG-based LLM dashboard. |
Create a New Encoding | Reset all the tokens on this dashboard. |
Return to Menu | Return to the main menu. |
Conduct vector search
In the Splunk App for Data Science and Deep Learning (DSDL), navigate to Assistants, then LLM-RAG, then Encoding Data to Vector Database, and then select Conduct Vector Search on log data.
Append the command to a search pipeline that produces a table containing only the field of log data you want to run similarity search on. Rename the field to `text`.
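For example, assuming you want to search over internal Splunk error logs, a pipeline like the following produces the single `text` field that the search command expects:

```
index=_internal error | head 100 | rename _raw as text | table text
```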
Parameters
The Conduct vector search on log data page has the following parameters:
Parameter | Description |
---|---|
`collection_name` | The existing collection to conduct similarity search on. |
`vectordb_service` | Type of VectorDB service. Choose from `milvus`, `pinecone`, and `alloydb`. |
`embedder_service` | Type of embedding service. Choose from `huggingface`, `ollama`, `azure_openai`, `openai`, `bedrock`, and `gemini`. |
`embedder_name` | Name of the embedding model. Optional if configured on the Setup LLM-RAG page. |
`embedder_dimension` | Output dimensionality of the embedding model. Optional if configured on the Setup LLM-RAG page. |
`top_k` | Number of top results to return. |
Run the fit or compute command
Use the following syntax to run the `fit` command or the `compute` command:

- Run the `fit` command:

  ```
  | search ... | table text | fit MLTKContainer algo=llm_rag_milvus_search collection_name=test embedder_service=huggingface vectordb_service=milvus top_k=5 text into app:llm_rag_milvus_search
  ```

- Run the `compute` command:

  ```
  | search ... | table text | compute algo:llm_rag_milvus_search collection_name:test embedder_service:huggingface vectordb_service:milvus top_k:5 text
  ```
Dashboard view
The following image shows the dashboard view for the Conduct vector search on log data page:
The dashboard view includes the following components:
Dashboard component | Description |
---|---|
Collection Name | The existing collection to conduct similarity search on. Make sure to use the same embedder model that was used to encode the collection. |
VectorDB Service | Type of VectorDB service. |
Embedding Service | Type of Embedding service. |
Submit | Select after finishing all the inputs. |
Search bar | Search Splunk log data to conduct similarity search. This search produces a table containing only a field of the log data you want to search on. Select the specific log message to kick off vector search. |
Conduct a New Vector Search | Reset all the tokens on this dashboard. |
Return to Menu | Return to the main menu. |
Manage and explore your vector database
In the Splunk App for Data Science and Deep Learning (DSDL), navigate to Assistants, then LLM-RAG, then Encoding Data to Vector Database, and then select Manage and Explore your vector database.
Parameters
Manage and explore your vector database with the following parameters:
Parameter | Description |
---|---|
`task` | The management task to run. Use `list_collections` to list all the existing collections, `delete_collection` to delete a specific collection, `show_schema` to print the schema of a specific collection, and `show_rows` to print the number of vectors within a collection. |
`collection_name` | The specific collection name. Required for all tasks except `list_collections`. |
Run the fit or compute command
Use the following syntax to run the `fit` command or the `compute` command:

- Run the `fit` command:

  ```
  | makeresults | fit MLTKContainer algo=llm_rag_milvus_management task=delete_collection collection_name=document_collection_example _time into app:llm_rag_milvus_management as RAG
  ```

- Run the `compute` command:

  ```
  | makeresults | compute algo:llm_rag_milvus_management task:delete_collection collection_name:document_collection_example _time
  ```
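The `delete_collection` examples above generalize to the other management tasks. The following `compute` sketches assume the same syntax, with `document_collection_example` as a placeholder collection name. To list all existing collections (no `collection_name` needed):

```
| makeresults | compute algo:llm_rag_milvus_management task:list_collections _time
```

To print the schema of a specific collection, or the number of vectors it contains:

```
| makeresults | compute algo:llm_rag_milvus_management task:show_schema collection_name:document_collection_example _time
```

```
| makeresults | compute algo:llm_rag_milvus_management task:show_rows collection_name:document_collection_example _time
```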
Dashboard view
The following image shows the dashboard view for the Manage and explore your vector database page:
The dashboard view includes the following components:
Dashboard component | Description |
---|---|
Collection to delete | The specific collection name you want to delete. |
Submit | Select to delete the input collection. |
Refresh | Refresh the list of collections. |
Return to Menu | Return to the main menu. |
Next step
After pulling the LLM model to your local Docker container and encoding document or log data into the vector database, you can run inference using the LLM. See Query LLM with vector data.
This documentation applies to the following versions of Splunk® App for Data Science and Deep Learning: 5.2.1