1 Email content
Assume the email is saved in a file named "1.txt" with the following content:
From: Justin-Bieber@entertain.org on behalf of Bieber
Leader [leader@hello.org]
Sent: 2017-07-01 12:48
To: 'staff@hello.org'; custom@hello.org;
Willim Johnson; John Snow
Subject: The battlefield in Winterfell
I have just met then. More details as soon as possible. So far, so good.
Sent via iPhone 7 plus
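2 Extraction approach
- The goal is to extract the header information from the email. The fields needed are:
- Sender (From:), sent time (Sent:), recipients (To:), and subject (Subject:).
- As a first pass, it is enough to extract the content of the lines that carry these fields.
- Use a single extraction function: put the four keywords in a list and match each one with a regular expression.
- Each of the four fields has a global counter. Whenever a field is matched, its counter is incremented by 1 to mark the field as seen.
- If one field has already been matched and the next field has not been matched yet, the current line is a wrapped continuation of the earlier field, so it must be output as well.
- If the extraction function returns None, the line is skipped.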
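The regular-expression step in the list above can be tried on single lines first. The following is only a minimal sketch: the helper name `match_keyword` is hypothetical, and `re.escape` is an added precaution that the original pattern does not use.

```python
import re


def match_keyword(line, keyword):
    """Return the text from `keyword` to the end of `line`, or None."""
    # re.escape guards against keywords that contain regex metacharacters.
    match_obj = re.match(".*({0}.*)".format(re.escape(keyword)), line)
    return match_obj.group(1) if match_obj else None


print(match_keyword("Sent: 2017-07-01 12:48", "Sent:"))  # Sent: 2017-07-01 12:48
print(match_keyword("Sent via iPhone 7 plus", "Sent:"))  # None
```

Because the leading `.*` is greedy, a line that contains the keyword more than once is captured from its last occurrence.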
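# coding: utf-8
import re

# Global counters: how many times each header keyword has been matched so far.
from_count = 0
sent_count = 0
to_count = 0
subject_count = 0


def inspect_string(string):
    """Return the header text found in one line, or None for other lines."""
    global from_count
    global sent_count
    global to_count
    global subject_count
    # Update the counters once per line.
    if re.match(".*(From:.*)", string):
        from_count += 1
    if re.match(".*(Sent:.*)", string):
        sent_count += 1
    if re.match(".*(To:.*)", string):
        to_count += 1
    if re.match(".*(Subject:.*)", string):
        subject_count += 1
    # A line that contains a keyword is returned from the keyword onward.
    keyword_list = ['From:', 'Sent:', 'To:', 'Subject:']
    for keyword in keyword_list:
        regex_str = ".*({0}.*)".format(keyword)
        match_obj = re.match(regex_str, string)
        if match_obj:
            return match_obj.group(1)
    # No keyword: keep the line only if it continues a field that has
    # already been matched while the next field has not appeared yet.
    if from_count > 0 and sent_count < 1:
        return string
    if sent_count > 0 and to_count < 1:
        return string
    if to_count > 0 and subject_count < 1:
        return string
    return None


# Read in text mode (Python 3) and strip the line ending so that
# continuation lines do not print with an extra blank line.
with open('1.txt', 'r', encoding='utf-8') as f:
    for line in f:
        result = inspect_string(line.rstrip('\r\n'))
        if result is None:
            continue
        print(result)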
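The same idea can also be written without global counters. The sketch below is a simplified variant, not the script above: it assumes the headers appear in the order From, Sent, To, Subject, and the names `extract_header_lines`, `KEYWORDS`, and `seen` are hypothetical.

```python
import re

KEYWORDS = ["From:", "Sent:", "To:", "Subject:"]


def extract_header_lines(lines):
    """Yield header lines, including wrapped continuation lines."""
    seen = set()  # keywords matched so far (replaces the global counters)
    for raw in lines:
        line = raw.rstrip("\r\n")
        for keyword in KEYWORDS:
            match_obj = re.match(".*({0}.*)".format(re.escape(keyword)), line)
            if match_obj:
                seen.add(keyword)
                yield match_obj.group(1)
                break
        else:
            # No keyword on this line: keep it only while the header block
            # has started and Subject: has not been reached yet.
            if seen and "Subject:" not in seen:
                yield line


sample = [
    "From: Justin-Bieber@entertain.org on behalf of Bieber\n",
    "Leader [leader@hello.org]\n",
    "Sent: 2017-07-01 12:48\n",
    "To: 'staff@hello.org'; custom@hello.org;\n",
    "Willim Johnson; John Snow\n",
    "Subject: The battlefield in Winterfell\n",
    "I have just met then. More details as soon as possible. So far, so good.\n",
    "Sent via iPhone 7 plus\n",
]
for header_line in extract_header_lines(sample):
    print(header_line)
```

Keeping the state inside the function means it can be called on several emails in a row without resetting anything in between.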
3 Output
From: Justin-Bieber@entertain.org on behalf of Bieber
Leader [leader@hello.org]
Sent: 2017-07-01 12:48
To: 'staff@hello.org'; custom@hello.org;
Willim Johnson; John Snow
Subject: The battlefield in Winterfell
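The output still contains wrapped lines. As a possible next step, they could be folded into one value per field. This is only a sketch under assumptions: the function name `fold_headers` is hypothetical, each extracted line is assumed to start with its keyword, and joining wrapped lines with a single space is a guess at the intended format.

```python
KEYWORDS = ("From:", "Sent:", "To:", "Subject:")


def fold_headers(header_lines):
    """Group extracted lines into {field name: value}, joining wrapped lines."""
    fields = {}
    current = None  # name of the field seen most recently
    for line in header_lines:
        for keyword in KEYWORDS:
            if line.startswith(keyword):
                current = keyword.rstrip(":")
                fields[current] = line[len(keyword):].strip()
                break
        else:
            if current is not None:
                # Continuation line: append it to the most recent field.
                fields[current] += " " + line.strip()
    return fields


extracted = [
    "From: Justin-Bieber@entertain.org on behalf of Bieber",
    "Leader [leader@hello.org]",
    "Sent: 2017-07-01 12:48",
    "To: 'staff@hello.org'; custom@hello.org;",
    "Willim Johnson; John Snow",
    "Subject: The battlefield in Winterfell",
]
for name, value in fold_headers(extracted).items():
    print(name, "=>", value)
```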