
sentence-transformers: Text Vectorization | Semantic Search | Similarity Computation


1 Introduction

1.1 What is sentence-transformers

sentence-transformers is a Python library built on BERT and its variant models for generating high-quality sentence embeddings. These vectors can be used in scenarios such as:

  • Text similarity computation
  • Semantic search
  • Text clustering (a minimal sketch follows this list)
  • Multi-turn question answering
  • Vector database indexing (e.g. FAISS, Milvus)
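
Most of these scenarios are walked through below; clustering is not, so here is a minimal sketch. It assumes scikit-learn is installed and reuses the same local all-MiniLM-L6-v2 model path as the later examples (both are assumptions, not part of the original post).

# -*- coding: utf-8 -*-
from sentence_transformers import SentenceTransformer
from sklearn.cluster import KMeans

# Assumption: the model has been downloaded to this local directory
model = SentenceTransformer("./tmp/model/all-MiniLM-L6-v2")

sentences = ["韩立突破金丹期", "韩立炼制了丹药", "王林在苍茫星修炼"]
embeddings = model.encode(sentences)

# Group the sentences into 2 clusters based on their embeddings
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)
labels = kmeans.fit_predict(embeddings)

for sentence, label in zip(sentences, labels):
    print(f"Cluster {label}: {sentence}")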

1.2 Environment Setup

Python 3.10

pip install sentence-transformers==4.1.0
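
After installing, you can quickly confirm the version in use (the package exposes a standard __version__ attribute):

# -*- coding: utf-8 -*-
import sentence_transformers

# Should print the pinned version, e.g. 4.1.0
print(sentence_transformers.__version__)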

2 Encoding Sentences

Load a pretrained model and encode sentences into vectors.

# -*- coding: utf-8 -*-
from sentence_transformers import SentenceTransformer

model_path = "./tmp/model/all-MiniLM-L6-v2"

# Load the model
model = SentenceTransformer(model_path)

# Encode sentences into vectors
sentences = ["韩立飞升仙界", "修仙是一种漫长的旅途"]
embeddings = model.encode(sentences)

for sentence, embedding in zip(sentences, embeddings):
    print(f"Sentence: {sentence}")
    print(f"Vector: {embedding[:5]}...")  # show only the first 5 dimensions
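
The code above loads the model from a local directory. If it has not been downloaded yet, the model can also be loaded by its Hugging Face name, and you can check the embedding size directly (a small sketch; all-MiniLM-L6-v2 produces 384-dimensional vectors):

# -*- coding: utf-8 -*-
from sentence_transformers import SentenceTransformer

# Loading by name downloads the weights from the Hugging Face Hub on first use
model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")

# Dimensionality of the sentence vectors this model produces
print(model.get_sentence_embedding_dimension())  # 384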


3 Similarity Computation

# -*- coding: utf-8 -*-
from sentence_transformers import SentenceTransformer, util

# Load the model
model_path = "./tmp/model/all-MiniLM-L6-v2"
model = SentenceTransformer(model_path)

s1 = "韩立突破金丹期"
s2 = "韩立进入金丹阶段"

emb1 = model.encode(s1)
emb2 = model.encode(s2)

# Cosine similarity between the two sentence vectors
similarity = util.pytorch_cos_sim(emb1, emb2)
print("Sentence similarity:", similarity.item())
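
To compare many sentences at once, encode them as a batch and compute the full pairwise matrix; since sentence-transformers 3.x the model also exposes a similarity() helper. A sketch, assuming the same local model path:

# -*- coding: utf-8 -*-
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("./tmp/model/all-MiniLM-L6-v2")

sentences = ["韩立突破金丹期", "韩立进入金丹阶段", "王林在苍茫星修炼"]
embeddings = model.encode(sentences)

# similarity() returns an N x N matrix of pairwise cosine similarities
scores = model.similarity(embeddings, embeddings)
print(scores)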


4 Semantic Search

# -*- coding: utf-8 -*-
from sentence_transformers import SentenceTransformer, util

# Load the model
model_path = "E:/blog_article/tmp/model/all-MiniLM-L6-v2"
model = SentenceTransformer(model_path)

# Build the corpus
corpus = [
    "韩立突破金丹期",
    "韩立在黑风山修炼",
    "韩立炼制了丹药",
    "王林在苍茫星修炼",
]
corpus_embedding = model.encode(corpus)

# Query sentence
query = "韩立结丹"
query_embedding = model.encode(query)

# Compute similarity scores and keep the top 3 matches
hits = util.semantic_search(query_embedding, corpus_embedding, top_k=3)
for hit in hits[0]:
    print(f"Score: {hit['score']:.4f}, sentence: {corpus[hit['corpus_id']]}")
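
When the same corpus serves many queries, it is worth encoding it once and keeping it as a tensor; util.semantic_search also accepts a batch of queries. A sketch using the same corpus as above:

# -*- coding: utf-8 -*-
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("E:/blog_article/tmp/model/all-MiniLM-L6-v2")

corpus = ["韩立突破金丹期", "韩立在黑风山修炼", "韩立炼制了丹药", "王林在苍茫星修炼"]
# Encode the corpus once and keep it as a tensor for reuse across queries
corpus_embeddings = model.encode(corpus, convert_to_tensor=True)

queries = ["韩立结丹", "王林修炼"]
query_embeddings = model.encode(queries, convert_to_tensor=True)

# One result list per query, each holding the top_k best matches
hits = util.semantic_search(query_embeddings, corpus_embeddings, top_k=2)
for query, query_hits in zip(queries, hits):
    print(f"Query: {query}")
    for hit in query_hits:
        print(f"  score: {hit['score']:.4f}, sentence: {corpus[hit['corpus_id']]}")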


5 Building a Vector Retrieval System with FAISS

Install the FAISS dependency first:

pip install faiss-cpu

# -*- coding: utf-8 -*-
import faiss
import numpy as np
from sentence_transformers import SentenceTransformer

# Load the model
model_path = "E:/blog_article/tmp/model/all-MiniLM-L6-v2"
model = SentenceTransformer(model_path)

# Build the corpus
corpus = ["修仙者", "炼丹炉", "灵根测试", "灵石交易"]
corpus_embeddings = model.encode(corpus)

# Convert to float32, as required by FAISS
corpus_embeddings = np.array(corpus_embeddings).astype("float32")

# Create a FAISS index over L2 (Euclidean) distance
index = faiss.IndexFlatL2(corpus_embeddings.shape[1])
index.add(corpus_embeddings)

# Query vector
query = "灵石购买"
query_embedding = model.encode([query]).astype("float32")

# Search for the top_k nearest neighbors
top_k = 2
distances, indices = index.search(query_embedding, top_k)
for idx, distance in zip(indices[0], distances[0]):
    print(f"Distance: {distance:.4f}, match: {corpus[idx]}")
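
The IndexFlatL2 index above ranks results by Euclidean distance (smaller is better). To rank by cosine similarity instead, a common choice is to L2-normalize the embeddings and use an inner-product index; a sketch under that assumption:

# -*- coding: utf-8 -*-
import faiss
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("E:/blog_article/tmp/model/all-MiniLM-L6-v2")

corpus = ["修仙者", "炼丹炉", "灵根测试", "灵石交易"]
corpus_embeddings = np.array(model.encode(corpus)).astype("float32")

# Normalize in place so that inner product equals cosine similarity
faiss.normalize_L2(corpus_embeddings)

index = faiss.IndexFlatIP(corpus_embeddings.shape[1])
index.add(corpus_embeddings)

query_embedding = model.encode(["灵石购买"]).astype("float32")
faiss.normalize_L2(query_embedding)

# Higher score = more similar
scores, indices = index.search(query_embedding, 2)
for idx, score in zip(indices[0], scores[0]):
    print(f"Cosine similarity: {score:.4f}, match: {corpus[idx]}")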

