Counting word frequencies on Linux

References:
https://blog.csdn.net/herecles/article/details/8152054
https://www.cnblogs.com/standby/p/8309994.html

The sample text is as follows:

cat words.txt
The Zen of Python, by Tim Peters
 
Beautiful is better than ugly.
Explicit is better than implicit.
Simple is better than complex.
Complex is better than complicated.
Flat is better than nested.
Sparse is better than dense.
Readability counts.
Special cases aren't special enough to break the rules.
Although practicality beats purity.
Errors should never pass silently.
Unless explicitly silenced.
In the face of ambiguity, refuse the temptation to guess.
There should be one-- and preferably only one --obvious way to do it.
Although that way may not be obvious at first unless you're Dutch.
Now is better than never.
Although never is often better than *right* now.
If the implementation is hard to explain, it's a bad idea.
If the implementation is easy to explain, it may be a good idea.
Namespaces are one honking great idea -- let's do more of those!

1. Counting word frequencies with AWK

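 awk '{ for (i = 1; i <= NF; i++) {
          if ($i ~ /\w/) valid++      # field has at least one word character
          count[$i]++ } }             # every field is tallied, valid or not
      END { print "valid words:" valid "\n"
            for (j in count) print j, count[j] }' words.txt
# The if is meant to keep only "word" fields, but the result is not ideal: \w (a GNU awk extension) only requires one word character somewhere in the field, so tokens such as "ugly." or "one--" still count as valid.
# In END, the for loop prints every entry of the associative array count.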

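valid words:143
# Perl reports a clearly different number for the same file. The two filters are not equivalent: the awk test accepts any field that contains a word character (a standalone "--", for example, fails it), while the Perl command in section 2 rejects any token that contains a non-word character.
# The overwritten entries in the listing (" 1unts.", " 1nse.", ...) are most likely caused by Windows line endings (CRLF) in words.txt: the carriage return stays attached to the last word of each line, so the terminal prints the count over the start of that word.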

-- 1
 19
 hard 1
 1unts.
one 2
only 1
is 10
it 1
If 2
 1nse.
special 1
aren't 1
are 1
ambiguity, 1
honking 1
Readability 1
way 2
of 3
In 1
 1w.
easy 1
one-- 1
than 8
Special 1
*right* 1
refuse 1
preferably 1
that 1
be 3
Errors 1
Sparse 1
Complex 1
explain, 2
 1ver.
 1tch.
 1rity.
bad 1
you're 1
Beautiful 1
There 1
 1sted.
do 2
Unless 1
by 1
cases 1
better 8
Now 1
Explicit 1
face 1
often 1
unless 1
not 1
more 1
a 2
 1ters
implementation 2
Tim 1
obvious 1
Although 3
let's 1
 1.
 1lently.
practicality 1
Namespaces 1
should 2
 1mplex.
those! 1
great 1
 2ea.
it's 1
Simple 1
 1les.
enough 1
idea 1
explicitly 1
 1lenced.
pass 1
Zen 1
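
One way to get a cleaner count out of awk is to normalize each field before tallying it. The sketch below is only an illustration under stated assumptions: the function name count_words is hypothetical, and the normalization rules (lowercasing, then trimming leading and trailing non-alphanumeric characters, which also removes a trailing carriage return) are my own choice, not part of the original command. It avoids \w so that it does not depend on GNU awk.

```shell
#!/bin/sh
# Hypothetical helper (not from the original post): count normalized words.
# Reads the files given as arguments, or stdin when there are none.
count_words() {
    awk '
    {
        for (i = 1; i <= NF; i++) {
            w = tolower($i)
            # Trim leading and trailing non-alphanumerics.
            # A trailing carriage return from CRLF input is removed here too.
            sub(/^[^a-z0-9]+/, "", w)
            sub(/[^a-z0-9]+$/, "", w)
            if (w == "") continue   # nothing left, e.g. a bare "--"
            count[w]++
        }
    }
    END {
        for (w in count) print count[w], w
    }' "$@" | LC_ALL=C sort -k1,1nr -k2,2   # highest count first, then by word
}

# Example call on two CRLF-terminated lines of the sample text.
printf 'Now is better than never.\r\nAlthough never is often better than *right* now.\r\n' | count_words
```

With this normalization "never." and "never" fall into the same bucket, and "*right*" is counted as "right". Internal punctuation such as the apostrophe in "aren't" is kept. On the real file the call would be count_words words.txt.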

2. Counting word frequencies with Perl

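Perl seems to handle this task somewhat better. One point took me a long time to work out: the command uses two -e fragments, and they play different roles. The first runs under -alne, where -n wraps the code in an implicit while (<>) loop, so it visits every word in words.txt. The second only prints the hash count with foreach, so it must not run once per input line; putting it in an END block makes it run a single time, after the loop has finished. Perl joins all -e fragments into one program, so the END block could equally be written inside the first fragment.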

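 perl -alne 'foreach (split) { $total++; next if /\W/;
     $valid++; $count{$_}++ }' \
   -e 'END { print "total:$total words,valid:$valid words\n";
     foreach $word (sort keys %count) { print " $word ==> $count{$word}\n" } }' words.txt
# Line breaks inside the single quotes need no trailing backslash; in Perl a backslash at the end of a line would become part of the code.
# Because -l makes print append a newline, the explicit \n in each print produces the blank line between entries. -a is not required, since split is called explicitly.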

total:144 words,valid:113 words

 Although ==> 3

 Beautiful ==> 1

 Complex ==> 1

 Errors ==> 1

 Explicit ==> 1

 Flat ==> 1

 If ==> 2

  There ==> 1

 Tim ==> 1

 Unless ==> 1

 Zen ==> 1

 a ==> 2

 and ==> 1

 are ==> 1

 at ==> 1

 bad ==> 1

 enough ==> 1

 explicitly ==> 1

 face ==> 1

 first ==> 1

 good ==> 1

 great ==> 1

 hard ==> 1

 honking ==> 1

 idea ==> 1

 implementation ==> 2

 is ==> 10

 it ==> 1

 may ==> 2

 more ==> 1

 never ==> 2

 not ==> 1

 obvious ==> 1

 of ==> 3

 often ==> 1

 one ==> 2

 only ==> 1

 pass ==> 1

 practicality ==> 1

 preferably ==> 1

 refuse ==> 1