Sharing Source Code for the LDA Topic Model
Published: 2019-06-11



Latent Dirichlet Allocation (LDA) is one of the most widely used machine learning methods in industry. This project provides a C++ implementation, as-lda, that uses an asymmetric prior: as the number of topics grows, the topic distribution stays more stable than under the conventional model, avoiding the flood of tiny niche topics that a large topic count otherwise produces (see "Rethinking LDA: Why Priors Matter"). The code directory includes Chinese test data.

Code repository:


Asymmetric-prior Latent Dirichlet Allocation (LDA) in C++

Usually, a symmetric Dirichlet prior is used in implementations of LDA. In "Rethinking LDA: Why Priors Matter", the authors show that an asymmetric prior yields better results and a topic distribution that stays stable as the number of topics increases, so this project adopts that approach.
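For intuition, the key difference from the usual collapsed Gibbs sampler is that the scalar alpha becomes a per-topic vector. Below is a minimal sketch of the per-token sampling step under such an asymmetric prior; the function and count-table names (sample_topic, ndk, nkw, nk) are illustrative and not taken from the as-lda source:

    // Sketch of one collapsed Gibbs sampling step with an asymmetric
    // document-topic prior: a per-topic alpha[k] replaces the usual
    // scalar alpha. Illustrative only; names are not from as-lda.
    #include <cstdlib>
    #include <vector>

    // ndk[d][k]: tokens in doc d assigned to topic k (current token excluded)
    // nkw[k][w]: times word w is assigned to topic k (current token excluded)
    // nk[k]:     total tokens assigned to topic k (current token excluded)
    int sample_topic(int d, int w,
                     const std::vector<std::vector<int> >& ndk,
                     const std::vector<std::vector<int> >& nkw,
                     const std::vector<int>& nk,
                     const std::vector<double>& alpha,  // asymmetric prior
                     double beta, int vocab_size) {
        const int K = (int)alpha.size();
        std::vector<double> p(K);
        double sum = 0.0;
        for (int k = 0; k < K; ++k) {
            // alpha[k] is where the asymmetric prior enters the update
            p[k] = (ndk[d][k] + alpha[k])
                 * (nkw[k][w] + beta) / (nk[k] + vocab_size * beta);
            sum += p[k];
        }
        // inverse-CDF draw from the unnormalized distribution p
        double u = sum * ((double)std::rand() / RAND_MAX);
        double acc = 0.0;
        for (int k = 0; k < K; ++k) {
            acc += p[k];
            if (u <= acc) return k;
        }
        return K - 1;
    }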

other features:

- easy to use, easy to understand
- small memory footprint


ML tools source code:

as-lda:
gbdt:
adaboost:

--------how to use it-----------

Usage:
    -c        corpus file, default './corpus.txt'
    -v        vocab file, default './vocab.txt'
    -e or -i  act type (e for estimate, i for inference)
    -m        model files dir, default './models'
    -z        pre model assignment file (inference)
    -a        hyperparameter alpha, default 500/topic_num
    -b        hyperparameter beta, default 0.1
    -k        topic number, default 100
    -n        max iteration number, default 1000

Examples:

estimate:  ./as_lda -e -c ./corpus.txt -v ./vocab.txt -n 2000
inference: ./as_lda -i -n 100 -c corpus.txt.test -v vocab.txt -z ./models/model.z

--------input format------------

For corpus:

    one line per doc; each number is a word id

    example:
    2699\t10608\t52656\t17781\t17781\t7900\t24007

For vocab:

    one line per word; the word id is the line number
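
For reference, a minimal loader for these two formats could look like the sketch below (assuming zero-based word ids equal to vocab line numbers; names and structure are illustrative, not taken from the repository):

    // Sketch of a loader for the corpus/vocab formats described above.
    // Assumes tab-separated word ids and zero-based line-number ids;
    // not taken from the as-lda source.
    #include <fstream>
    #include <iostream>
    #include <sstream>
    #include <string>
    #include <vector>

    int main() {
        // vocab: one word per line, word id = line number
        std::vector<std::string> vocab;
        std::ifstream vfs("./vocab.txt");
        for (std::string word; std::getline(vfs, word); )
            vocab.push_back(word);

        // corpus: one doc per line, tab-separated word ids
        std::vector<std::vector<int> > docs;
        std::ifstream cfs("./corpus.txt");
        for (std::string line; std::getline(cfs, line); ) {
            std::vector<int> doc;
            std::istringstream ss(line);
            for (std::string tok; std::getline(ss, tok, '\t'); )
                if (!tok.empty()) doc.push_back(std::stoi(tok));
            docs.push_back(doc);
        }
        std::cout << docs.size() << " docs, "
                  << vocab.size() << " vocab words\n";
        return 0;
    }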


Reposted from: https://www.cnblogs.com/snake-hand/archive/2013/06/07/3125046.html
