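1 Introduction
SnowNLP is a Python library for processing Chinese text. Its features include:
- Word segmentation
- Sentiment analysis
- Keyword extraction
- Text classification
- Pinyin conversion
- Traditional-to-Simplified Chinese conversion
- Word similarity computation, and more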
snownlp==0.12.3
Test environment: Python 3.10.9
from snownlp import SnowNLP    # main class
from snownlp import seg        # word segmentation
from snownlp import sentiment  # sentiment analysis
from snownlp import normal     # stop-word handling
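Results can differ between releases, so it may help to confirm that the installed version matches the pinned one before running the examples. This is a minimal sketch using only the standard library; the helper name `installed_version` is made up for illustration.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(package):
    """Return the installed version string of `package`, or None if absent."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


found = installed_version("snownlp")
if found is None:
    print("snownlp is not installed; try: pip install snownlp==0.12.3")
elif found != "0.12.3":
    print(f"snownlp {found} is installed; this article was tested with 0.12.3")
else:
    print("snownlp 0.12.3 is installed")
```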
2 Word segmentation
Chinese word segmentation (Character-Based Generative Model)
2.1 Basic segmentation
from snownlp import SnowNLP
text = "有勇气的牛排是编程领域的博主"
s = SnowNLP(text)
print(s.words)
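Character-based segmenters commonly tag each character as the beginning (b), middle (m), or end (e) of a word, or as a single-character word (s), and then join characters according to those tags. The sketch below shows only that joining step. The tags are written by hand rather than predicted by SnowNLP, and `decode_bmes` is a hypothetical helper, not part of the library.

```python
def decode_bmes(chars, tags):
    """Join characters into words according to b/m/e/s tags."""
    words, current = [], ""
    for ch, tag in zip(chars, tags):
        if tag in ("b", "s"):
            # A new word starts here; flush anything left over first.
            if current:
                words.append(current)
            current = ch
            if tag == "s":
                # Single-character word: emit it immediately.
                words.append(current)
                current = ""
        elif tag == "e":
            # End of a multi-character word.
            words.append(current + ch)
            current = ""
        else:
            # "m": the current word continues.
            current += ch
    if current:
        words.append(current)
    return words


# Hand-written tags for illustration only; a real tagger would predict them.
chars = "编程领域的博主"
tags = ["b", "e", "b", "e", "s", "b", "e"]
print(decode_bmes(chars, tags))
```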

2.2 Custom segmentation dictionary
This feature has not been tested successfully yet.
4 Keyword extraction (TextRank algorithm)
from snownlp import SnowNLP
text = "有勇气的牛排写的文章通俗易懂,爱了爱了"
s = SnowNLP(text)
print(s.keywords(3))
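For intuition about TextRank, here is a toy keyword ranker: words that occur near each other are linked in a graph, and a PageRank-style iteration scores each word by how well connected it is. This is a simplified sketch under my own assumptions (window size, damping factor, hand-written tokens), not SnowNLP's implementation, and `textrank_keywords` is a made-up name.

```python
from collections import defaultdict


def textrank_keywords(words, top_k=3, window=2, damping=0.85, iterations=50):
    """Rank words by a PageRank-style score over a co-occurrence graph."""
    # Link every pair of distinct words that appear within `window` positions.
    neighbors = defaultdict(set)
    for i, word in enumerate(words):
        for other in words[i + 1:i + 1 + window]:
            if other != word:
                neighbors[word].add(other)
                neighbors[other].add(word)

    # Start with equal scores and repeatedly redistribute them along edges.
    scores = {word: 1.0 for word in neighbors}
    for _ in range(iterations):
        updated = {}
        for word in neighbors:
            incoming = sum(scores[n] / len(neighbors[n]) for n in neighbors[word])
            updated[word] = (1 - damping) + damping * incoming
        scores = updated

    # Highest score first; ties are broken alphabetically for determinism.
    ranked = sorted(scores, key=lambda w: (-scores[w], w))
    return ranked[:top_k]


# Pre-segmented tokens, written by hand for illustration.
tokens = ["文章", "通俗", "易懂", "文章", "作者", "文章", "清晰"]
print(textrank_keywords(tokens, top_k=1))
```

In this toy input, "文章" co-occurs with every other token, so it ends up ranked first.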

5 Summary extraction (TextRank algorithm)
from snownlp import SnowNLP
text = "有勇气的牛排写的文章通俗易懂,爱了爱了"
s = SnowNLP(text)
print(s.summary(3))

6 Pinyin conversion
Original work by 有勇气的牛排
https://www.couragesteak.com/article/456
from snownlp import SnowNLP
text = "有勇气的牛排"
s = SnowNLP(text)
print(s.pinyin)

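7 Text classification (sentiment analysis)
7.1 Definition
Text classification here uses SnowNLP's sentiment analysis model.
7.2 Sentiment analysis (default model)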
from snownlp import SnowNLP
text = "有勇气的牛排写的文章通俗易懂,爱了爱了"
s = SnowNLP(text)
print(s.sentiments)
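`s.sentiments` is a score between 0 and 1, where values closer to 1 indicate more positive text. One way to turn it into a label is to threshold it. The cut-offs 0.4 and 0.6 below are arbitrary assumptions, not SnowNLP defaults, and `label_sentiment` is a hypothetical helper.

```python
def label_sentiment(score, negative_below=0.4, positive_above=0.6):
    """Map a 0-1 positivity score to a coarse label."""
    if not 0.0 <= score <= 1.0:
        raise ValueError("score must be between 0 and 1")
    if score < negative_below:
        return "negative"
    if score > positive_above:
        return "positive"
    return "neutral"


# Example scores chosen by hand, not produced by SnowNLP.
for score in (0.12, 0.5, 0.93):
    print(score, label_sentiment(score))
```

Leaving a neutral band in the middle avoids forcing borderline scores into one class; where to put the cut-offs depends on the data.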

7.3 Sentiment analysis (training a model)
7.3.1 Data files
pos.txt (positive-sentiment sentences)
这家餐厅的菜很好吃
我非常喜欢这本书
这个产品质量非常好
neg.txt (negative-sentiment sentences)
这家餐厅的服务很差
我不喜欢这部电影
这个产品质量很差
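Each file holds one example per line. Saving them as UTF-8 should avoid decoding problems with Chinese text, since SnowNLP appears to read the training files as UTF-8. The files can also be created from Python; the temporary output directory and the helper `write_examples` below are chosen only for illustration.

```python
import tempfile
from pathlib import Path

positive = ["这家餐厅的菜很好吃", "我非常喜欢这本书", "这个产品质量非常好"]
negative = ["这家餐厅的服务很差", "我不喜欢这部电影", "这个产品质量很差"]


def write_examples(path, lines):
    """Write one example per line, encoded as UTF-8."""
    Path(path).write_text("\n".join(lines) + "\n", encoding="utf-8")


# A temporary directory keeps the sketch from overwriting real files.
data_dir = Path(tempfile.mkdtemp())
write_examples(data_dir / "pos.txt", positive)
write_examples(data_dir / "neg.txt", negative)
print(sorted(p.name for p in data_dir.iterdir()))
```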
7.3.2 Training the model
main.py
from snownlp import sentiment
# Train the model (negative file first, then positive file)
sentiment.train('neg.txt', 'pos.txt')
# Save the model
sentiment.save('sentiment.marshal')
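To see why the training data matters so much, here is a toy character-level naive Bayes classifier trained on the six sentences from 7.3.1. It only illustrates the general idea of a Bayesian sentiment classifier. It is not SnowNLP's implementation, and the function names and the character-level features are my own simplifications.

```python
import math
from collections import Counter

positive = ["这家餐厅的菜很好吃", "我非常喜欢这本书", "这个产品质量非常好"]
negative = ["这家餐厅的服务很差", "我不喜欢这部电影", "这个产品质量很差"]


def train(docs_by_label):
    """Count how often each character occurs under each label."""
    return {label: Counter("".join(docs)) for label, docs in docs_by_label.items()}


def positive_probability(model, text):
    """Return P(pos | text) with add-one smoothing and equal priors."""
    vocab = set().union(*model.values())
    log_scores = {}
    for label, counts in model.items():
        total = sum(counts.values()) + len(vocab)
        # Characters never seen in training are ignored.
        log_scores[label] = sum(
            math.log((counts[ch] + 1) / total) for ch in text if ch in vocab
        )
    # Convert the two log scores into a probability for the "pos" label.
    return 1 / (1 + math.exp(log_scores["neg"] - log_scores["pos"]))


model = train({"pos": positive, "neg": negative})
print(round(positive_probability(model, "服务很差"), 3))  # expected below 0.5
print(round(positive_probability(model, "非常好吃"), 3))  # expected above 0.5
```

With only three sentences per class, the classifier can do little more than memorize which characters appeared on which side, so a real model needs far more data.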

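7.3.3 Using the model for sentiment analysis
from snownlp import sentiment
from snownlp import SnowNLP
# Load the trained model
sentiment.load('sentiment.marshal')
# Use the trained model
text = "这个产品很糟糕,我很不满意。"
s = SnowNLP(text)
print(s.sentiments)  # print the sentiment score
Without loading the model trained above, the score for this sentence is 0.669, which is very inaccurate for such a clearly negative review.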

8 Traditional-to-Simplified Chinese conversion
from snownlp import SnowNLP
sentence_fan = "知識改變世界"
jian_ti = SnowNLP(sentence_fan)
print(jian_ti.han)

9 Computing word similarity (BM25)
from snownlp import SnowNLP
text = "有勇气的牛排是编程领域的博主"
s = SnowNLP(text)
print(len(s.words), s.words)
print(len(s.sim("的牛排")), s.sim("的牛排"))
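As far as I can tell, SnowNLP's README passes a list of tokenized documents for this feature, such as `SnowNLP([['这篇', '文章'], ['那篇', '论文'], ['这个']])`, and queries it with a token list like `s.sim(['文章'])`. When a plain string is passed, as above, each character seems to be treated as its own document. To make the scoring itself concrete, here is a generic BM25 sketch in pure Python. It is not SnowNLP's code, and the defaults `k1=1.5` and `b=0.75` are assumptions.

```python
import math


def bm25_scores(docs, query, k1=1.5, b=0.75):
    """Score every tokenized document in `docs` against the `query` tokens."""
    n_docs = len(docs)
    avg_len = sum(len(doc) for doc in docs) / n_docs

    # Document frequency: in how many documents does each token appear?
    doc_freq = {}
    for doc in docs:
        for token in set(doc):
            doc_freq[token] = doc_freq.get(token, 0) + 1

    scores = []
    for doc in docs:
        score = 0.0
        for token in query:
            tf = doc.count(token)
            if tf == 0:
                continue
            # Rare tokens get a higher inverse document frequency.
            idf = math.log((n_docs - doc_freq[token] + 0.5) / (doc_freq[token] + 0.5))
            # Longer-than-average documents are penalized through `b`.
            norm = tf + k1 * (1 - b + b * len(doc) / avg_len)
            score += idf * tf * (k1 + 1) / norm
        scores.append(score)
    return scores


docs = [["这篇", "文章"], ["那篇", "论文"], ["这个"]]
print(bm25_scores(docs, ["文章"]))
```

The result has one score per document, and only the first document, which contains the query token, gets a non-zero score. With this form of IDF, a token that appears in more than half of the documents gets a negative weight.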
