Tool use: counting the English words in an article

English word counting tool V1.0


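I had originally planned to dig into web crawlers. But while reading English this morning, I realized how handy it would be to have a simple program that collects the unfamiliar words from my reading onto a single sheet, so that I can then memorize them. It is also a chance to practice the regular-expression skills that writing crawlers requires.

Regular expressions are something I have wanted to learn for a long time, but I never found the opportunity.
So let's begin. Today we first write a fairly simple program that counts English words. Then we modify it to see whether we can do finer-grained processing and make the input handling more robust.

First, we need a txt document that contains nothing but English text; then we run the program.
As usual for a small program, the code comes first and the explanation after.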

# -*- coding: utf-8 -*-
# Usage: save the text with ANSI encoding.
# Save the article as input.txt in the root of drive C:, which keeps things simple.
import re

# Output file
output_file = open("C:\\result.txt", "w")
# Input text file
input_file = open("C:\\input.txt", "r")
strs = input_file.read()
# Use a regular expression to extract the words, all converted to lowercase
s = re.findall(r"\w+", strs.lower())
# re.findall returns a list

# Remove duplicates from the list and sort it
l = sorted(set(s))

for i in l:
    m = re.search(r"\d+", i)
    n = re.search(r"\W+", i)
    # Keep tokens that contain no digits and no non-word characters
    # and that are longer than 4 characters
    if not m and not n and len(i) > 4:
        output_file.write(i + " : " + str(s.count(i)) + "\n")
input_file.close()
output_file.close()
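
The listing above calls `s.count(i)` once per distinct word, which rescans the whole word list each time. As a minimal sketch of the same filtering done in memory, the version below tallies every word in a single pass with `collections.Counter`. The function name `count_long_words` and the `min_length` parameter are made up for this illustration and are not part of the original program.

```python
import re
from collections import Counter

WORD_RE = re.compile(r"\w+")   # same token pattern as the listing above
DIGIT_RE = re.compile(r"\d")   # used to drop tokens that contain digits


def count_long_words(text, min_length=5):
    """Return sorted (word, count) pairs for digit-free tokens
    that have at least min_length characters."""
    counts = Counter(WORD_RE.findall(text.lower()))
    return sorted(
        (word, n)
        for word, n in counts.items()
        if len(word) >= min_length and not DIGIT_RE.search(word)
    )


# Hypothetical sample string, only to show the call
pairs = count_long_words("Regex regex REGEX is handy; regex2 is not a word.")
for word, n in pairs:
    print(word + " : " + str(n))
```

`min_length=5` corresponds to the `len(i) > 4` test in the original loop. The `\W+` check is left out here because a token matched by `\w+` can never contain a non-word character.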

Now copy the following text and save it in the root of drive C: under the file name input.txt

The first metacharacters we’ll look at are [ and ]. They’re used for specifying a character class, which is a set of characters that you wish to match. Characters can be listed individually, or a range of characters can be indicated by giving two characters and separating them by a '-'. For example, [abc] will match any of the characters a, b, or c; this is the same as [a-c], which uses a range to express the same set of characters. If you wanted to match only lowercase letters, your RE would be [a-z].

Metacharacters are not active inside classes. For example, [akm$] will match any of the characters 'a', 'k', 'm', or '$'; '$' is usually a metacharacter, but inside a character class it’s stripped of its special nature.

You can match the characters not listed within the class by complementing the set. This is indicated by including a '^' as the first character of the class; '^' outside a character class will simply match the '^' character. For example, [^5] will match any character except '5'.

Perhaps the most important metacharacter is the backslash, \. As in Python string literals, the backslash can be followed by various characters to signal various special sequences. It’s also used to escape all the metacharacters so you can still match them in patterns; for example, if you need to match a [ or \, you can precede them with a backslash to remove their special meaning: \[ or \\.

Some of the special sequences beginning with '\' represent predefined sets of characters that are often useful, such as the set of digits, the set of letters, or the set of anything that isn’t whitespace.

Run the program.
The result is as follows:
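
To give a rough idea of the output format only (this is not the author's actual result for the full input file), the sketch below applies the same steps to a single sentence taken from the sample text. The helper name `word_report` is hypothetical, and it returns the lines instead of writing them to C:\result.txt.

```python
import re


def word_report(text):
    """Build 'word : count' lines using the same rules as the program:
    lowercase, split with \\w+, drop tokens with digits, keep length > 4."""
    words = re.findall(r"\w+", text.lower())
    lines = []
    for word in sorted(set(words)):
        if not re.search(r"\d+", word) and len(word) > 4:
            lines.append(word + " : " + str(words.count(word)))
    return lines


sample = ("Characters can be listed individually, or a range of characters "
          "can be indicated by giving two characters and separating them by a '-'.")
report = word_report(sample)
print("\n".join(report))
# characters : 3
# giving : 1
# indicated : 1
# individually : 1
# listed : 1
# range : 1
# separating : 1
```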

