Python Cookbook reading notes: Strings and Text (Part 1)

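I plan to finish this chapter within a week, since I dragged my feet for too long last week. Why I take notes: my memory is poor and I forget things easily, notes are easy to look up later, and typing the code out once gives me a chance to think it through.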

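You need to split a string into fields, but the delimiters (and the whitespace around them) are not consistent.

Use re.split(). The str.split() method only handles simple cases: it does not allow multiple delimiters or a variable amount of whitespace around them.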

import re
line = 'asdf fjdk; afed, fjek,asdf,foo'
# \s matches any whitespace (space, tab, newline); \S matches any non-whitespace character
print(re.split(r'[;,\s]\s*', line))
# ['asdf', 'fjdk', 'afed', 'fjek', 'asdf', 'foo']
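
To make the difference concrete, here is a small sketch that contrasts str.split() with re.split() on the same line. The capture-group variant is an extra I added for illustration: when the delimiter pattern is wrapped in parentheses, re.split() also returns the matched delimiters. The variable names are my own.

```python
import re

line = 'asdf fjdk; afed, fjek,asdf,foo'

# str.split() takes one literal separator, so the fields come out uneven.
simple = line.split(',')

# With a capture group, re.split() keeps the delimiters in the result:
# values sit at even indexes, delimiters at odd indexes.
fields = re.split(r'(;|,|\s)\s*', line)
values = fields[::2]
delimiters = fields[1::2]

print(simple)
print(values)
print(delimiters)
```

Keeping the delimiters is handy if the original line has to be rebuilt later.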

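Matching text at the start or end of a string

You need to check the start or end of a string against specific text patterns, such as filename extensions or URL schemes.
The solution is str.startswith() or str.endswith().

For example: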

filename = "spam.txt"
print(filename.endswith('.txt')) # True
print(filename.startswith('file:')) # False

Filtering the files in the current directory:
import os
filenames = os.listdir('.')
pyfiles = [file for file in filenames if file.endswith('.py')]
print(pyfiles)  # on my machine: ['test_pytest.py', 'auto_deploy.py', 'shard_diff.py', 'yml_and_json.py', 'api_test.py']
print(any(name.endswith('.py') for name in filenames))  # True
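
Both methods also accept a tuple of candidates, which saves chaining several checks with `or`. A minimal sketch, using made-up filenames and a sample URL rather than a real directory listing:

```python
# Hypothetical file names, used only for illustration.
filenames = ['Makefile', 'foo.c', 'bar.py', 'spam.c', 'spam.h']

# endswith() accepts a tuple of suffixes (a list raises TypeError).
c_files = [name for name in filenames if name.endswith(('.c', '.h'))]

url = 'https://www.python.org'
is_remote = url.startswith(('http:', 'https:', 'ftp:'))

print(c_files)
print(is_remote)
```

If the choices are stored in a list or set, convert them with tuple() first.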

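Matching strings with shell wildcards

You want to match text using the wildcard patterns common in Unix shells (such as *.py, Dat[0-9]*.csv, and so on).
The solution is the two functions in the fnmatch module: fnmatch() and fnmatchcase().
Note: if your code actually needs to match filenames on disk, use the glob module instead.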

from fnmatch import fnmatch, fnmatchcase
print(fnmatch('foo.txt', '*.txt'))  # True
print(fnmatch('foo.txt', '?oo.txt'))  # True
print(fnmatch('Dat45.csv', 'Dat[0-9]*'))  # True
names = ['Dat1.csv', 'Dat2.csv', 'config.ini', 'foo.py']
print([name for name in names if fnmatch(name, 'Dat*.csv')])  # ['Dat1.csv', 'Dat2.csv']

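# fnmatch() follows the case-sensitivity rules of the underlying operating system, so results vary by platform
print(fnmatch('foo.txt', '*.TXT'))  # False on Linux and macOS, True on Windows
# fnmatchcase() always compares case-sensitively, exactly as the pattern is written
addresses = ['5412 N CLARK ST', '1060 W ADDISON ST', '1039 W GRANVILLE AVE', '2122 N CLARK ST', '4802 N BROADWAY']
print([addr for addr in addresses if fnmatchcase(addr, '* ST')])  # ['5412 N CLARK ST', '1060 W ADDISON ST', '2122 N CLARK ST']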
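
The list comprehension over names can also be written with fnmatch.filter(), which the same module provides. A short sketch (the upper-case file name is a made-up test value):

```python
import fnmatch

names = ['Dat1.csv', 'Dat2.csv', 'config.ini', 'foo.py']

# filter() returns the names that match the pattern, in their original order.
csv_files = fnmatch.filter(names, 'Dat*.csv')

# fnmatchcase() never normalizes case, whatever the platform.
same_case = fnmatch.fnmatchcase('Dat1.CSV', 'Dat*.csv')

print(csv_files)
print(same_case)
```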

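Matching and searching for text patterns

You want to match or search text for a specific pattern.
For literal text, str.find(), str.startswith(), and str.endswith() are enough; for more complicated patterns, use re.match(), re.search(), and related functions.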

text1 = '11/27/2012'
text2 = 'Nov 27, 2012'
# match() always tries to match at the start of the string
if re.match(r'\d+/\d+/\d+', text1):
    print('yes')
else:
    print('no~')

# Precompile the pattern into a pattern object so it can be reused for many matches
datepat = re.compile(r'\d+/\d+/\d+')
if datepat.match(text1):
    print('yes')
# To find every occurrence anywhere in the text, use findall() instead
text = 'Today is 11/27/2012. PyCon starts 3/13/2013.'
print(datepat.findall(text))  # ['11/27/2012', '3/13/2013']

# Regular expressions commonly use parentheses to define capture groups
datepat2 = re.compile(r'(\d+)/(\d+)/(\d+)')
print(datepat2.findall(text))  # [('11', '27', '2012'), ('3', '13', '2013')]
m = datepat2.match('11/27/2012')
print(m.group(0))  # 11/27/2012
print(m.group(2))  # 27
print(m.group(3))  # 2012
print(m.groups())  # ('11', '27', '2012')
month, day, year = m.groups()
for month, day, year in datepat2.findall(text):
    print('{}-{}-{}'.format(month, day, year))  # prints 11-27-2012, then 3-13-2013
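
Because match() only anchors at the start of the string, it will happily accept trailing junk. The sketch below compares it with fullmatch(), search(), and finditer(), which all exist on compiled patterns; the sample strings are my own.

```python
import re

datepat = re.compile(r'(\d+)/(\d+)/(\d+)')

# match() checks only the beginning, so extra text after the date is ignored.
loose = datepat.match('11/27/2012abcdef')

# fullmatch() requires the entire string to match, so this returns None.
strict = datepat.fullmatch('11/27/2012abcdef')

# search() returns the first match found anywhere in the string.
first = datepat.search('Today is 11/27/2012. PyCon starts 3/13/2013.')

# finditer() yields match objects one at a time instead of building a list.
dates = [m.groups() for m in
         datepat.finditer('Today is 11/27/2012. PyCon starts 3/13/2013.')]

print(loose.group(), strict, first.group(), dates)
```

For an exact match, fullmatch() (or a trailing $ in the pattern) is usually the safer choice.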

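Searching and replacing text

You want to search for and replace a text pattern in a string.
For simple literal text use str.replace(); for more complicated patterns use re.sub().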

text = 'yeah, but no, but yeah, but no, but yeah'
print(text.replace('yeah', 'yjoqm'))  # yjoqm, but no, but yjoqm, but no, but yjoqm
text2 = 'Today is 11/27/2012. PyCon starts 3/13/2013.'
# re.sub() takes the pattern first and the replacement second; a backslash followed by a digit, such as \3, refers to that capture group in the pattern
print(re.sub(r'(\d+)/(\d+)/(\d+)', r'\3-\1-\2', text2))  # Today is 2012-11-27. PyCon starts 2013-3-13.
datepat3 = re.compile(r'(\d+)/(\d+)/(\d+)')
print(datepat3.sub(r'\3-\1-\2', text2))  # Today is 2012-11-27. PyCon starts 2013-3-13.

For more complicated substitutions, pass a callback function as the replacement instead of a string.
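
A minimal sketch of such a callback: it receives each match object and returns the replacement text. Here it zero-pads the month and day, which a plain \3-\1-\2 template cannot do. The function name to_iso and the sample text are my own choices.

```python
import re

datepat = re.compile(r'(\d+)/(\d+)/(\d+)')

def to_iso(m):
    # m is the match object for one date; build a zero-padded YYYY-MM-DD string.
    month, day, year = m.groups()
    return '{}-{:02d}-{:02d}'.format(year, int(month), int(day))

text = 'Today is 11/27/2012. PyCon starts 3/13/2013.'
result = datepat.sub(to_iso, text)

# subn() does the same substitution and also reports how many were made.
new_text, count = datepat.subn(to_iso, text)

print(result)
print(count)
```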

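Shortest-match (non-greedy) patterns

Regular expression quantifiers are greedy by default, but in most cases like the one below a non-greedy match is what you actually want.
For example, a greedy match returns more text than intended: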

text1 = 'Computer says "no."'
str_pat = re.compile(r'\"(.*)\"')
print(str_pat.findall(text1))  # ['no.']
text2 = 'Computer says "no." Phone says "yes."'
print(str_pat.findall(text2))  # ['no." Phone says "yes.']

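Changing the compiled pattern to str_pat = re.compile(r'\"(.*?)\"') gives the intended results: ['no.'] and ['no.', 'yes.']. The ? after * makes the quantifier non-greedy, so it matches as little text as possible.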
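
Here are the greedy and non-greedy patterns side by side as a runnable sketch. The third pattern, a negated character class, is an alternative I added that is not in the notes above; it gets the same result without relying on laziness.

```python
import re

text = 'Computer says "no." Phone says "yes."'

# Greedy: .* runs to the last quote in the text.
greedy = re.findall(r'"(.*)"', text)

# Non-greedy: .*? stops at the first closing quote it can.
lazy = re.findall(r'"(.*?)"', text)

# Alternative: match anything except a quote between the quotes.
negated = re.findall(r'"([^"]*)"', text)

print(greedy)
print(lazy)
print(negated)
```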

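Multiline patterns

You are trying to match a block of text with a regular expression, and the match needs to span multiple lines.
The usual trap is using a dot (.) to match any character and forgetting that the dot does not match newlines.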

text2 ='''/*this isa 
multiline comment */
'''

comment = re.compile(r'/\*(.*?)\*/')
print(comment.findall(text2))  # [] because . does not match the newline inside the comment

comment2 = re.compile(r'/\*((?:.|\n)*?)\*/')
print(comment2.findall(text2))  # ['this isa \nmultiline comment ']

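In this pattern, (?:.|\n) specifies a non-capturing group: a group that is used only for matching and is not captured separately or given a number.

I prefer the following approach: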

comment2 = re.compile(r'/\*(.*?)\*/', re.DOTALL)
print(comment2.findall(text2))  # ['this isa \nmultiline comment ']

re.compile() accepts a flag argument, re.DOTALL, which is very useful here: it makes the dot (.) in a regular expression match any character, including newlines.

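Stripping unwanted characters from strings

strip() removes characters from both the start and end of a string, while lstrip() and rstrip() strip from the left and right respectively. They remove whitespace by default, but you can pass the characters to remove as an argument, for example strip('---').
To deal with characters in the middle of a string, use replace() or a regular expression.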

s = ' hello    world\n'
s = s.replace(' ', '')  # removes the spaces but keeps the trailing newline
print(repr(s))  # 'helloworld\n'
print(re.sub(r'\s+', '', s))  # helloworld
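
A quick sketch of the strip family with an explicit character argument. One thing worth remembering: the argument is treated as a set of characters, not as a prefix or suffix. The sample strings are made up.

```python
t = '-----hello====='

left = t.lstrip('-')     # strips '-' from the left only
right = t.rstrip('=')    # strips '=' from the right only
both = t.strip('-=')     # strips any mix of '-' and '=' from both ends

# The argument is a set of characters, so order and repeats do not matter.
mixed = 'xyxhelloyx'.strip('xy')

# strip() never touches whitespace in the middle of the string.
inner = '  hello   world  \n'.strip()

print(left, right, both, mixed, repr(inner))
```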

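A good pattern when processing files is to strip each line lazily with a generator expression:

with open(filename) as f:  # filename is the path of the text file to read
    lines = (line.strip() for line in f)
    for line in lines:
        print(line)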

Aligning text strings

You want to format text with some kind of alignment.
Basic alignment can be done with the ljust(), rjust(), and center() string methods.

text = 'Hello World'
print(text.ljust(20))
print(text.rjust(20))
print(text.center(20))

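For formatting, prefer the built-in format() function or the str.format() method, since they work with any value and not only with strings.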
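
A minimal sketch of the same alignment done with format(): `<`, `>`, and `^` mean left, right, and center, followed by the field width, with an optional fill character in front. The variable names are my own.

```python
text = 'Hello World'

left = format(text, '<20')       # left-aligned in 20 columns
right = format(text, '>20')      # right-aligned in 20 columns
centered = format(text, '*^20')  # centered, padded with '*'

# The same codes work inside str.format() and with numbers.
pair = '{:>10s} {:>10s}'.format('Hello', 'World')
number = format(1.2345, '>10.2f')

print(repr(left))
print(repr(right))
print(centered)
print(repr(pair))
print(repr(number))
```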

Combining and concatenating strings

You want to combine several small strings into one larger string.
In most cases, join() is the way to go.

parts = ['Is', 'Chicago', 'Not', 'Chicago?']
print(' '.join(parts))  # Is Chicago Not Chicago?
print(':'.join(parts))  # Is:Chicago:Not:Chicago?

a = 'Is Chicago'
b = 'Not Chicago?'
print(a + ' ' + b)  # for simple cases, + works fine
print('{} {}'.format(a, b))
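
join() only accepts strings, so mixed data has to be converted first. A short sketch with made-up data, using a generator expression so no temporary list is needed, plus the usual "collect the pieces, join once" pattern for loops:

```python
# Hypothetical record with mixed types.
data = ['ACME', 50, 91.1]

# Convert each item to str on the fly while joining.
joined = ','.join(str(d) for d in data)

# When building a string in a loop, collect the pieces and join once
# instead of repeatedly using += on a growing string.
pieces = []
for i in range(3):
    pieces.append('item{}'.format(i))
combined = ' '.join(pieces)

print(joined)
print(combined)
```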

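Interpolating variables in strings

You want to create a string with embedded variable names that are replaced by string representations of their values.
For example, using the format() method: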

s = '{name} has {n} messages.'
print(s.format(name='yjoqm', n=37))  # yjoqm has 37 messages.
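
Two related options, sketched here under the assumption of Python 3.6 or later: format_map() takes the values from a mapping, and an f-string reads them straight from variables in scope. The SafeSub class is a small helper written for this example (it is not a standard library class); it leaves unknown placeholders untouched instead of raising KeyError.

```python
s = '{name} has {n} messages.'

# format_map() looks the placeholders up in a mapping.
values = {'name': 'yjoqm', 'n': 37}
from_map = s.format_map(values)

# An f-string interpolates variables directly (Python 3.6+).
name, n = 'yjoqm', 37
from_fstring = f'{name} has {n} messages.'

class SafeSub(dict):
    """Illustrative helper: return the placeholder itself for missing keys."""
    def __missing__(self, key):
        return '{' + key + '}'

partial = s.format_map(SafeSub(name='yjoqm'))

print(from_map)
print(from_fstring)
print(partial)
```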