Mahout | 贝叶斯算法

贝叶斯

基本思想

  • 已知类条件概率密度参数表达式和先验概率
  • 利用贝叶斯公式转换成后验概率
  • 根据后验概率大小进行决策分类

实例

/usr/local/mahout/examples/bin/classify-20newsgroups.sh

if [[ -z "$MAHOUT_WORK_DIR" ]]; then
  WORK_DIR=/tmp/mahout-work-${USER}
else
  WORK_DIR=$MAHOUT_WORK_DIR
fi

mkdir -p ${WORK_DIR}
curl http://people.csail.mit.edu/jrennie/20Newsgroups/20news-bydate.tar.gz \
-o ${WORK_DIR}/20news-bydate.tar.gz
mkdir -p ${WORK_DIR}/20news-bydate
cd ${WORK_DIR}/20news-bydate && tar xzf ../20news-bydate.tar.gz && cd .. && cd ..

mkdir ${WORK_DIR}/20news-all
cp -R ${WORK_DIR}/20news-bydate/*/* ${WORK_DIR}/20news-all

hdfs dfs -mkdir -p ${WORK_DIR}
hdfs dfs -mkdir ${WORK_DIR}/20news-all
hdfs dfs -put ${WORK_DIR}/20news-all ${WORK_DIR}/

# Creating sequence files from 20newsgroups data
mahout seqdirectory \
-i ${WORK_DIR}/20news-all \
-o ${WORK_DIR}/20news-seq -ow

# Converting sequence files to vectors
mahout seq2sparse \
-i ${WORK_DIR}/20news-seq \
-o ${WORK_DIR}/20news-vectors  -lnorm -nv  -wt tfidf

# Creating training and holdout set with \
# a random 80-20 split of the generated vector dataset
mahout split \
-i ${WORK_DIR}/20news-vectors/tfidf-vectors \
--trainingOutput ${WORK_DIR}/20news-train-vectors \
--testOutput ${WORK_DIR}/20news-test-vectors  \
--randomSelectionPct 40 --overwrite --sequenceFiles -xm sequential

# Training Naive Bayes model
mahout trainnb \
-i ${WORK_DIR}/20news-train-vectors \
-o ${WORK_DIR}/model \
-li ${WORK_DIR}/labelindex \
-ow 

# Self testing on training set
mahout testnb \
-i ${WORK_DIR}/20news-train-vectors\
-m ${WORK_DIR}/model \
-l ${WORK_DIR}/labelindex \
-ow -o ${WORK_DIR}/20news-testing

# Testing on holdout set
mahout testnb \
-i ${WORK_DIR}/20news-test-vectors\
-m ${WORK_DIR}/model \
-l ${WORK_DIR}/labelindex \
-ow -o ${WORK_DIR}/20news-testing 
最后编辑于
©著作权归作者所有,转载或内容合作请联系作者
【社区内容提示】社区部分内容疑似由AI辅助生成,浏览时请结合常识与多方信息审慎甄别。
平台声明:文章内容(如有图片或视频亦包括在内)由作者上传并发布,文章内容仅代表作者本人观点,简书系信息发布平台,仅提供信息存储服务。

相关阅读更多精彩内容

友情链接更多精彩内容