Python操作向量数据库Milvus

有勇气的牛排 1035 AI大模型 2025-07-25 23:43:13

1 前言

Milvus 是一个 开源的向量数据库，用于存储、索引和检索高维向量数据，特别适用于AI 应用场景，如：

向量检索（以图搜图、语义搜索）
自然语言处理（文本相似度、QA 系统）
图像、音频、生物特征匹配等

由 Zilliz 公司 开发，支持百万甚至十亿级向量的高性能相似度检索。

1.1 环境

pip install pymilvus==2.6.0

2 Collection 操作

2.1 创建 Collection

在定义字段的时候，需要提前规划好需要使用到的字段，Milvus不支持动态新增字段。

# -*- coding: utf-8 -*-
from pymilvus import (
    connections, FieldSchema, CollectionSchema, DataType,
    Collection, utility
)

# ----------------- 连接 Milvus -----------------
connections.connect(db_name="cs_blog", host='192.168.56.20', port='19530')

# ----------------- 创建集合 -----------------
collection_name = "article"
if utility.has_collection(collection_name):
    print("集合已存在")
    Collection(collection_name).drop()

fields = [
    FieldSchema(name="id", dtype=DataType.INT64, is_primary=True, auto_id=False),
    FieldSchema(name="title", dtype=DataType.VARCHAR, max_length=255),
    FieldSchema(name="content", dtype=DataType.VARCHAR, max_length=65535),
    FieldSchema(name="embedding", dtype=DataType.FLOAT_VECTOR, dim=384),
]

schema = CollectionSchema(fields, description="博客文章集合")
collection = Collection(name=collection_name, schema=schema)

# 建立索引
collection.create_index(field_name="embedding", index_params={
    "index_type": "IVF_FLAT",
    "metric_type": "L2",
    "params": {"nlist": 128}
})
collection.load()

3 获取Client

# -*- coding: utf-8 -*-
from pymilvus import connections, Collection

# 连接客户端
connections.connect(db_name="cs_blog", host='192.168.56.20', port='19530')

# 获取集合对象（等同于实例化一个操作集合的客户端对象）
collection = Collection("article")

print(collection)

4 Doc 操作

4.1 插入数据

# -*- coding: utf-8 -*-
import numpy as np

# 以字典形式表示文章
from 获取Client import collection

article = {
    "id": 1,
    "title": "Milvus向量数据库介绍",
    "author": "有勇气的牛排",
    "content": "Milvus 是一个开源的向量数据库，专门用于处理海量高维向量数据的相似度搜索。",
    "embedding": np.random.rand(384).astype("float32").tolist()
}

# 插入（注意：字段必须按集合 schema 顺序构造 list）
data = [
    [article["id"]],
    [article["title"]],
    [article["author"]],
    [article["content"]],
    [article["embedding"]]
]
collection.insert(data)
collection.flush()
print("✅ 插入成功")

4.2 删除数据

# -*- coding: utf-8 -*-
from 获取Client import collection

article_id = 1
expr = f"id=={article_id}"
collection.delete(expr)
collection.flush()
print(f"🗑️ 已删除 ID={article_id} 的文章")

4.3 相似搜索

# -*- coding: utf-8 -*-
import numpy as np

# 以字典形式表示文章
from AI.ai_code_test.Milvus案例.获取Client import collection

# 文本向量
query_vec = np.random.rand(384).astype("float32").tolist()
top_k = 3

collection.load()
results = collection.search(
    data=[query_vec],
    anns_field="embedding",
    param={"metric_type": "L2", "params": {"nprobe": 10}},
    limit=top_k,
    output_fields=["id", "title", "content"]
)

for hits in results:
    for hit in hits:
        print("\t", hit.entity.get("id"), hit.entity.get("title"), hit.entity.get("content"), hit.distance)

<h2><a id="1__0"></a>1 前言</h2> Milvus 是一个 开源的向量数据库，用于存储、索引和检索高维向量数据，特别适用于AI 应用场景，如： <ul> <li>向量检索（以图搜图、语义搜索）</li> <li>自然语言处理（文本相似度、QA 系统）</li> <li>图像、音频、生物特征匹配等</li> </ul> 由 Zilliz 公司 开发，支持百万甚至十亿级向量的高性能相似度检索。 <h3><a id="11__10"></a>1.1 环境</h3> <pre><div class="hljs"><code class="lang-shell">pip install pymilvus==2.6.0 </code></div></pre> <h2><a id="2_Collection__16"></a>2 Collection 操作</h2> <h3><a id="21__Collection_18"></a>2.1 创建 Collection</h3> 在定义字段的时候，需要提前规划好需要使用到的字段，Milvus不支持动态新增字段。 <pre><div class="hljs"><code class="lang-python"># -*- coding: utf-8 -*- from pymilvus import ( connections, FieldSchema, CollectionSchema, DataType, Collection, utility ) # ----------------- 连接 Milvus ----------------- connections.connect(db_name="cs_blog", host='192.168.56.20', port='19530') # ----------------- 创建集合 ----------------- collection_name = "article" if utility.has_collection(collection_name): print("集合已存在") Collection(collection_name).drop() fields = [ FieldSchema(name="id", dtype=DataType.INT64, is_primary=True, auto_id=False), FieldSchema(name="title", dtype=DataType.VARCHAR, max_length=255), FieldSchema(name="content", dtype=DataType.VARCHAR, max_length=65535), FieldSchema(name="embedding", dtype=DataType.FLOAT_VECTOR, dim=384), ] schema = CollectionSchema(fields, description="博客文章集合") collection = Collection(name=collection_name, schema=schema) # 建立索引 collection.create_index(field_name="embedding", index_params={ "index_type": "IVF_FLAT", "metric_type": "L2", "params": {"nlist": 128} }) collection.load() </code></div></pre> <h2><a id="3_Client_57"></a>3 获取Client</h2> <pre><div class="hljs"><code class="lang-python"># -*- coding: utf-8 -*- from pymilvus import connections, Collection # 连接客户端 connections.connect(db_name="cs_blog", host='192.168.56.20', port='19530') # 获取集合对象（等同于实例化一个操作集合的客户端对象） collection = Collection("article") print(collection) </code></div></pre> <h2><a id="4_Doc__72"></a>4 Doc 操作</h2> <h3><a id="41__74"></a>4.1 插入数据</h3> <pre><div class="hljs"><code class="lang-python"># -*- coding: utf-8 -*- import numpy as np # 以字典形式表示文章 from 获取Client import collection article = { "id": 1, "title": "Milvus向量数据库介绍", "author": "有勇气的牛排", "content": "Milvus 是一个开源的向量数据库，专门用于处理海量高维向量数据的相似度搜索。", "embedding": np.random.rand(384).astype("float32").tolist() } # 插入（注意：字段必须按集合 schema 顺序构造 list） data = [ [article["id"]], [article["title"]], [article["author"]], [article["content"]], [article["embedding"]] ] collection.insert(data) collection.flush() print("✅ 插入成功") </code></div></pre> <img src="https://www.couragesteak.com/tcos/article/5c2fe7505074af044e086330a0c10d73.png" alt="image.png" /> <h3><a id="42__106"></a>4.2 删除数据</h3> <pre><div class="hljs"><code class="lang-python"># -*- coding: utf-8 -*- from 获取Client import collection article_id = 1 expr = f"id=={article_id}" collection.delete(expr) collection.flush() print(f"🗑️ 已删除 ID={article_id} 的文章") </code></div></pre> <h3><a id="43__119"></a>4.3 相似搜索</h3> <pre><div class="hljs"><code class="lang-python"># -*- coding: utf-8 -*- import numpy as np # 以字典形式表示文章 from AI.ai_code_test.Milvus案例.获取Client import collection # 文本向量 query_vec = np.random.rand(384).astype("float32").tolist() top_k = 3 collection.load() results = collection.search( data=[query_vec], anns_field="embedding", param={"metric_type": "L2", "params": {"nprobe": 10}}, limit=top_k, output_fields=["id", "title", "content"] ) for hits in results: for hit in hits: print("\t", hit.entity.get("id"), hit.entity.get("title"), hit.entity.get("content"), hit.distance) </code></div></pre> <img src="https://www.couragesteak.com/tcos/article/3080b8512528cc38bfdf13520b90a772.png" alt="image.png" />