2022-11-23
LDA-Based Topic Extraction from E-commerce Reviews
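About the author
@贾少华 (Jia Shaohua)
Master's degree from the School of Computer Science, Inner Mongolia University; formerly a data mining engineer at an IT company;
currently a senior resource planning specialist at a dairy company, responsible for bringing data into business operations;
at present absorbed in the intersection of economics and computer science, and convinced that interpretable neural networks will bring greater market demand and academic progress;
an unrepentant anime fan;
member of the "数据人创作者联盟" (Data People Creators Alliance).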
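1 Theory
LDA (Latent Dirichlet Allocation) was proposed by Blei et al. in a 2003 paper. The model builds on LSA (Latent Semantic Analysis) and pLSI (probabilistic Latent Semantic Indexing, also known as pLSA) and is a more complete and mature probabilistic topic model. By introducing hyperparameters, that is, Dirichlet priors on its distributions, LDA is more fully probabilistic than pLSI and forms a three-level Bayesian network. Figure 1 shows the LDA graphical model.
Figure 1. The standard LDA model
LDA aims to uncover the latent topics contained in a document. The number of latent topics is usually chosen using perplexity or log-likelihood. A document typically has several parts, each made up of many words. In other words, groups of words form topics, and topics in turn make up the document.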
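To make the perplexity criterion concrete, here is a minimal sketch that uses one common definition: the exponential of the negative average per-token log-likelihood. The function name `perplexity` and the list-of-lists layout for `theta` and `phi` are assumptions made for illustration, not the original article's code.

```python
import math

def perplexity(docs, theta, phi):
    """Perplexity of docs under fixed document-topic and topic-word distributions.

    docs  : list of documents, each a list of integer word ids
    theta : theta[d][k] = P(topic k | document d)
    phi   : phi[k][w]   = P(word w | topic k)
    """
    log_likelihood = 0.0
    n_tokens = 0
    for d, doc in enumerate(docs):
        for w in doc:
            # Marginalise over topics: p(w | d) = sum_k theta[d][k] * phi[k][w]
            p_w = sum(theta[d][k] * phi[k][w] for k in range(len(phi)))
            log_likelihood += math.log(p_w)
            n_tokens += 1
    # Lower perplexity means the model assigns higher probability to the text.
    return math.exp(-log_likelihood / n_tokens)
```

In practice one would typically fit models with several candidate topic counts, compute this value on held-out documents, and prefer the count with the lowest perplexity.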
For each document w in the corpus D, LDA assumes the following generative process:
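As a hedged illustration, the standard smoothed LDA generative process can be simulated as follows: draw a word distribution φ for each topic from a Dirichlet prior, draw a topic distribution θ for each document from a Dirichlet prior, then for every word position draw a topic from θ and a word from that topic's φ. The function names, the symmetric priors `alpha` and `beta`, and the fixed document length are assumptions of this sketch (the original formulation draws the length from a Poisson distribution).

```python
import random

def sample_dirichlet(alphas, rng):
    # A Dirichlet draw is a vector of independent Gamma(alpha_i, 1) draws, normalised.
    draws = [rng.gammavariate(a, 1.0) for a in alphas]
    total = sum(draws)
    return [x / total for x in draws]

def sample_index(probs, rng):
    # Draw one index from a discrete distribution.
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]

def generate_corpus(n_docs, doc_len, n_topics, vocab_size, alpha, beta, seed=0):
    rng = random.Random(seed)
    # phi[k] ~ Dirichlet(beta): one word distribution per topic.
    phi = [sample_dirichlet([beta] * vocab_size, rng) for _ in range(n_topics)]
    docs, thetas = [], []
    for _ in range(n_docs):
        # theta ~ Dirichlet(alpha): topic mixture of this document.
        theta = sample_dirichlet([alpha] * n_topics, rng)
        doc = []
        for _ in range(doc_len):
            z = sample_index(theta, rng)    # topic of this word position
            w = sample_index(phi[z], rng)   # word id drawn from that topic
            doc.append(w)
        docs.append(doc)
        thetas.append(theta)
    return docs, thetas, phi
```

For example, `generate_corpus(3, 10, 2, 6, alpha=0.5, beta=0.5)` returns three synthetic documents of ten word ids each, together with the θ and φ that produced them.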
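LDA has two parameters to estimate, θ and φ: the document-topic probability distribution and the topic-word probability distribution. Because estimating θ and φ with EM is hard to implement in code, later work on learning and implementing the model usually estimates both with Gibbs sampling.
2 Data Preparation
This demo uses the text of a subset of Yelp reviews. Some of the reviews are genuine and some are fake, and topics are extracted separately for each group. Table 1 shows one raw review and its cleaned version.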
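The Gibbs sampling approach can be sketched as a collapsed Gibbs sampler: each token's topic is resampled from its conditional distribution given all other assignments, and θ and φ are then read off the final counts. This is a minimal illustration under assumed symmetric priors `alpha` and `beta`. The function name and default values are hypothetical, and the sketch is not the article's original code.

```python
import random

def gibbs_lda(docs, n_topics, vocab_size, alpha=0.1, beta=0.01, n_iter=200, seed=0):
    """docs is a list of documents, each a list of integer word ids."""
    rng = random.Random(seed)
    n_dk = [[0] * n_topics for _ in docs]                # topic counts per document
    n_kw = [[0] * vocab_size for _ in range(n_topics)]   # word counts per topic
    n_k = [0] * n_topics                                 # total tokens per topic
    z = []                                               # topic assignment of every token

    # Random initial assignment.
    for d, doc in enumerate(docs):
        z_d = []
        for w in doc:
            k = rng.randrange(n_topics)
            z_d.append(k)
            n_dk[d][k] += 1
            n_kw[k][w] += 1
            n_k[k] += 1
        z.append(z_d)

    for _ in range(n_iter):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                k = z[d][i]
                # Remove the current token from all counts.
                n_dk[d][k] -= 1
                n_kw[k][w] -= 1
                n_k[k] -= 1
                # Conditional weight of topic t, proportional to
                # (n_dk + alpha) * (n_kw + beta) / (n_k + V * beta).
                weights = [
                    (n_dk[d][t] + alpha) * (n_kw[t][w] + beta) / (n_k[t] + vocab_size * beta)
                    for t in range(n_topics)
                ]
                k = rng.choices(range(n_topics), weights=weights, k=1)[0]
                z[d][i] = k
                n_dk[d][k] += 1
                n_kw[k][w] += 1
                n_k[k] += 1

    # Point estimates from the final counts.
    theta = [
        [(n_dk[d][k] + alpha) / (len(doc) + n_topics * alpha) for k in range(n_topics)]
        for d, doc in enumerate(docs)
    ]
    phi = [
        [(n_kw[k][w] + beta) / (n_k[k] + vocab_size * beta) for w in range(vocab_size)]
        for k in range(n_topics)
    ]
    return theta, phi
```

Sorting each row of `phi` in descending order and mapping the leading word ids back to words gives the top words per topic.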
Table 1. Raw data and cleaned data
reviewContent | Clean_review
Service was impeccable. Experience and presentation was cool. Eating a balloon was fun. Trying to make a reservation was ridiculous. Food was not mouth watering, tasted like it it was made in a lab. I appreciate delicious food, so I don't get the hype here. | service, impeccable, experience, presentation, cool, eating, balloon, fun, reservation, ridiculous, food, mouth, watering, tasted, lab, delicious, food, dont, hype
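A cleaning step of this kind might look like the sketch below: lowercase the text, drop apostrophes, keep alphabetic tokens, and filter out stop words. The stop-word list is an assumption, hand-picked so that the sketch reproduces the Table 1 row. The list actually used for the experiment is not given in the article.

```python
import re

# Hypothetical stop-word list chosen for this one example only.
STOP_WORDS = {
    "a", "and", "appreciate", "get", "here", "i", "in", "it", "like",
    "made", "make", "not", "so", "the", "to", "trying", "was",
}

def clean_review(text):
    # Lowercase and drop apostrophes so that "don't" becomes "dont".
    text = text.lower().replace("'", "")
    # Keep runs of letters only; punctuation and digits are discarded.
    tokens = re.findall(r"[a-z]+", text)
    return [t for t in tokens if t not in STOP_WORDS]
```

Calling `clean_review` on the raw review in Table 1 returns the token list shown in the Clean_review column.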
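With little data, the topics LDA extracts are not very representative. To enlarge the number of words available for modelling, all genuine reviews are therefore merged into one document and all fake reviews into another, and LDA is fitted to each separately. The extracted topics are shown in Tables 2 and 3.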
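The merging step could be sketched as follows, assuming each cleaned review is available as a (label, tokens) pair. The labels "real" and "fake" and the function name are hypothetical and do not come from the original data set.

```python
def merge_by_label(reviews):
    """Concatenate the token lists of all reviews that share a label.

    reviews: iterable of (label, tokens) pairs
    returns: dict mapping each label to one merged token list
    """
    merged = {}
    for label, tokens in reviews:
        merged.setdefault(label, []).extend(tokens)
    return merged
```

Each merged token list can then be passed to a topic model as a single document, one per review class.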
Table 2. Topics extracted from genuine reviews
Topic1 | Topic2 | Topic3 | Topic4 | Topic5
promise | park | stones | writing | cube
quality | comments | discarded | reserve | parings
pushy | ramps | split | injure | shined
rationalize | edge | eavesdrop | damn | pomp
podium | cliff | strict | autographed | bamboo
decorated | spray | breadth | hate | heroin
peeled | shots | settle | zealand | absurd
gulped | care | swirling | olfactory | unsalted
Table 3. Topics extracted from fake reviews
Topic1 | Topic2 | Topic3 | Topic4 | Topic5
extremely | decadent | confirmed | prospect | collective
burnt | entertain | duke | eaten | smiled
vaguely | hiccup | warm | previous | cultural
arrives | successor | pour | night | mystery
content | troubles | laugh | dish | smothering
unstuck | mustards | transmogrify | completely | observing
twists | brighter | care | recognizable | kindle
redefining | responds | school | notable | tire
Appendix: Code (Python 3.6, Jupyter Notebook)