forked from DA-southampton/NLP_ability
1 parent 26805f3, commit b142505
Showing 5 changed files with 34 additions and 2 deletions.
Binary file not shown.
Binary file not shown.
@@ -0,0 +1,30 @@
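I came across this question on Zhihu: "How do you incorporate keyword information into a text classification task?" Here is a quick summary of my own experience, for reference.
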
First, nearly every team now maintains its own keyword lexicon, and the ways of building one are all much the same.

The simple approach is TF-IDF filtering; the more involved one is to engineer mined features and train a binary keyword classifier.

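To make the simple route concrete, here is a minimal sketch of TF-IDF filtering in plain Python. It assumes the documents are already tokenized (word segmentation is out of scope here), and the function name `tfidf_keywords`, the `top_k` parameter, and the toy corpus are all made up for illustration; a real lexicon pipeline would likely add stop-word filtering and frequency thresholds.

```python
import math
from collections import Counter

def tfidf_keywords(docs, top_k=2):
    """Return the top_k highest-scoring tokens for each tokenized document.

    Assumes every document is a non-empty list of tokens.
    """
    n_docs = len(docs)
    # Document frequency: in how many documents does each token appear?
    df = Counter(tok for doc in docs for tok in set(doc))
    results = []
    for doc in docs:
        tf = Counter(doc)
        # Plain TF * IDF with no smoothing (an illustrative choice).
        scores = {
            tok: (count / len(doc)) * math.log(n_docs / df[tok])
            for tok, count in tf.items()
        }
        # Sort by descending score, breaking ties by token for determinism.
        ranked = sorted(scores, key=lambda t: (-scores[t], t))
        results.append(ranked[:top_k])
    return results

# Hypothetical, already-segmented toy corpus.
docs = [
    ["今天", "出去", "旅游", "吗"],
    ["今天", "天气", "不错"],
    ["今天", "股票", "涨", "了", "吗"],
]
keywords = tfidf_keywords(docs, top_k=2)
```

A token such as 今天 that appears in every document gets an IDF of zero, so it never makes it into the candidate list.
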
On top of that, teams usually add new-word discovery and entity mining to expand the candidate lexicon.

Now, on to how keyword information can be incorporated into a text classification task.

If the keyword's category is unknown, which is uncommon but does happen, there are generally two ways to handle it.

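One is to append the keyword directly to the end of the text to reinforce the signal. This is very common.

For example, if the text is 【今天出去旅游吗】 ("Are you going traveling today?") and the keyword is 【旅游】 ("travel"), the model input becomes 【今天出去旅游吗旅游】.
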
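The concatenation step can be sketched in a few lines. The helper name `append_keywords` and the tiny lexicon are hypothetical, and plain substring matching stands in for whatever matcher your keyword lexicon actually uses.

```python
def append_keywords(text, lexicon):
    """Append every lexicon keyword found in the text to the end of the text."""
    # Substring matching is an assumption made for brevity.
    matched = [kw for kw in lexicon if kw in text]
    return text + "".join(matched)

# Hypothetical lexicon; the sentence is the example from the text above.
lexicon = ["旅游", "股票"]
augmented = append_keywords("今天出去旅游吗", lexicon)
```

With the example sentence, `augmented` comes out as 今天出去旅游吗旅游. Whether to put a separator between the text and the appended keywords depends on the model, so treat that as a tuning choice.
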
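The other is to encode the keywords as sparse features and add them to the text representation; the drawback is that the dimensionality gets fairly high.
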
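One way to picture the sparse-feature option is a multi-hot vector with one slot per lexicon entry. This is only a sketch under that assumption; `keyword_multi_hot` and the three-word lexicon are illustrative names, not part of any library.

```python
def keyword_multi_hot(text, lexicon):
    """Return a 0/1 vector with one position per lexicon keyword."""
    return [1 if kw in text else 0 for kw in lexicon]

# Hypothetical lexicon; a real one may contain tens of thousands of entries.
lexicon = ["旅游", "股票", "天气"]
vec = keyword_multi_hot("今天出去旅游吗", lexicon)
```

The vector is as long as the lexicon, which is exactly where the high dimensionality comes from. In practice such a vector would typically be concatenated with the text encoder's output before the classification layer, though how to combine them is a design choice.
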
The case where the keyword's category is known is the more common one.

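A quick digression first: when mining a corpus, keyword matching is a very common technique, but it tends to produce data that is too simple and homogeneous, and also fairly noisy. So for a cold start you can mine the corpus with keywords, but it is better to follow up with a batch of manual annotation.
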
When the keyword's category is known, there are again two ways to incorporate it into the text classification task.

The first is to abstract the keyword upward into its corresponding category, then feed that category into the network as a feature alongside the text.

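A rough sketch of the first approach: map each matched keyword to its category and emit one feature per category rather than one per keyword. The mapping `KEYWORD_TO_CATEGORY`, the label list `CATEGORIES`, and the function name are all invented for this example.

```python
# Hypothetical keyword-to-category mapping and label set.
KEYWORD_TO_CATEGORY = {"旅游": "travel", "酒店": "travel", "股票": "finance"}
CATEGORIES = ["travel", "finance"]

def category_features(text):
    """Return a 0/1 vector with one position per category hit by a keyword."""
    hit = {cat for kw, cat in KEYWORD_TO_CATEGORY.items() if kw in text}
    return [1 if cat in hit else 0 for cat in CATEGORIES]

features = category_features("今天出去旅游吗")
```

Compared with one feature per keyword, the vector here only has as many slots as there are categories, so it stays small no matter how large the lexicon grows.
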
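The second, which is the one I use more often, is to run keyword matching on the text after classification and boost the score of the matching category; in short, add rules. The part that is hard to control is deciding how large the boost should be.
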
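The rule-based boost could look something like the sketch below. Everything here is an assumption for illustration: the function name `boost_scores`, the fixed additive `boost=0.2`, the renormalization step, and the example scores. As noted above, picking the boost value is the tricky part and would need tuning on real data.

```python
def boost_scores(text, scores, keyword_to_category, boost=0.2):
    """Add a fixed boost to each category hit by a keyword, then renormalize."""
    adjusted = dict(scores)
    hit = {cat for kw, cat in keyword_to_category.items() if kw in text}
    for cat in hit:
        if cat in adjusted:
            adjusted[cat] += boost
    # Renormalize so the scores still sum to 1 (assumes a positive total).
    total = sum(adjusted.values())
    return {cat: s / total for cat, s in adjusted.items()}

# Hypothetical classifier output in which the model slightly prefers "finance".
scores = {"travel": 0.40, "finance": 0.45, "other": 0.15}
adjusted = boost_scores(
    "今天出去旅游吗", scores, {"旅游": "travel", "股票": "finance"}
)
best = max(adjusted, key=adjusted.get)
```

In this toy case the keyword 旅游 tips the decision from "finance" to "travel". Because the rule is a single visible number on top of the model score, it is easy to explain why a given text ended up in a given class.
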
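So why do I like it? The biggest reason is that it makes it easy to win when reasoning (read: arguing) with the operations team. Works every time~~~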