Python regular expressions
2022-07-13 12:43:09
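What regular expressions are used for
- In many text editors, regular expressions are used to search for and replace text that matches a pattern [1]
- Web crawlers
Examples
- 'sa{1,3}s' matches an s, then the letter a repeated one to three times, then an s: sas, saas, or saaas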
In [1]: import re
In [3]: key = "saas and sas and saaas"
In [9]: re.findall('sa{1,3}s', key)
Out[9]: ['saas', 'sas', 'saaas']
In [10]: re.findall('sa{1,9}s', key)
Out[10]: ['saas', 'sas', 'saaas']
In [11]: re.findall('sa{1}s', key)
Out[11]: ['sas']
In [12]: re.findall('sa{3}s', key)
Out[12]: ['saaas']
In [13]: re.findall('sa{2}s', key)
Out[13]: ['saas']
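The same quantifier behaviour can be checked outside IPython with a short script. This is only a sketch: the candidate word list and the variable names are illustrative and not part of the session above, and `re.fullmatch` is used so that each word is tested on its own.

```python
import re

# 'a{1,3}' means the preceding 'a' repeated between 1 and 3 times.
pattern = re.compile(r"sa{1,3}s")

candidates = ["ss", "sas", "saas", "saaas", "saaaas"]
matched = [word for word in candidates if pattern.fullmatch(word)]
print(matched)  # ['sas', 'saas', 'saaas']

# {n} is the exact-count form: 'a{2}' matches exactly two 'a' characters.
exact_two = re.findall(r"sa{2}s", "saas and sas and saaas")
print(exact_two)  # ['saas']
```

Words with zero or four a characters fall outside the 1 to 3 range, so they are rejected.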
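- Matching everything in between: .+ is greedy, so it extends to the last position where the rest of the pattern can still match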
In [23]: re.findall('sa.+saaas', key)
Out[23]: ['saas and sas and saaas']
In [24]: re.findall('sa.+as', key)
Out[24]: ['saas and sas and saaas']
In [25]: re.findall('sa.+an', key)
Out[25]: ['saas and sas an']
In [26]: key = "[email protected]"  # address hidden by the hosting site; the outputs below show the domain is hit.edu.cn
In [27]: re.findall('@.+.', key)
Out[27]: ['@hit.edu.cn']
In [28]: re.findall(r'@.+\.', key)
Out[28]: ['@hit.edu.']
In [31]: re.findall(r'@.+?\.', key)
Out[31]: ['@hit.']
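The difference between the three patterns above comes down to the unescaped dot and to greedy versus lazy repetition. A minimal sketch follows; the local part `user` is a made-up placeholder, since only the domain `hit.edu.cn` is visible in the outputs above.

```python
import re

# Hypothetical address: only the domain 'hit.edu.cn' comes from the session above.
address = "user@hit.edu.cn"

# An unescaped '.' matches any character, so the pattern runs to the end of the string.
any_char = re.findall(r"@.+.", address)

# '\.' is a literal dot; greedy '.+' backtracks only as far as the LAST dot.
greedy = re.findall(r"@.+\.", address)

# Lazy '.+?' stops as soon as the FIRST literal dot can be matched.
lazy = re.findall(r"@.+?\.", address)

print(any_char)  # ['@hit.edu.cn']
print(greedy)    # ['@hit.edu.']
print(lazy)      # ['@hit.']
```

Writing the patterns as raw strings (the `r` prefix) keeps `\.` from being treated as a string escape sequence.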
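Matching without capturing (lookaround) [2]
Python behaves slightly differently from the referenced article: when a pattern contains capturing groups, re.findall returns the groups instead of the whole match, so the extra parentheses inside the lookarounds have to be dropped.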
In [1]: key = '<br/><a target=_blank href="www.baidu.com">百度一下</a>百度才知道'
In [2]: import re
In [3]: re.findall('(?<=(href=")).{1,200}(?=(">))', key)
Out[3]: [('href="', '">')]
In [4]: re.findall('(?<=href=").{1,200}(?=">)', key)
Out[4]: ['www.baidu.com']
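The two results above can be reproduced, and compared with a plain capturing group, in a short script. This is a sketch under assumptions: the `html` string is a shortened stand-in for the session's `key`, and the variable names are illustrative.

```python
import re

# Shortened stand-in for the anchor tag used in the session above.
html = '<a target=_blank href="www.baidu.com">link</a>'

# Lookbehind and lookahead only assert the surrounding context without consuming it,
# so the match itself is just the URL.
by_lookaround = re.findall(r'(?<=href=").{1,200}?(?=">)', html)

# A single capturing group gives the same result: findall returns the group.
by_group = re.findall(r'href="(.{1,200}?)">', html)

# With groups inside the lookarounds, findall would return tuples of those groups,
# but re.search still exposes the whole match as group(0).
m = re.search(r'(?<=(href=")).{1,200}?(?=(">))', html)
whole = m.group(0) if m else None
groups = m.groups() if m else None

print(by_lookaround)  # ['www.baidu.com']
print(by_group)       # ['www.baidu.com']
print(whole, groups)
```

Which form to prefer is a matter of taste; the capturing-group version is often easier to read, while lookarounds keep the match free of its surrounding context.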
Example: extracting bounding-box coordinates from an annotation file
In [25]: key
Out[25]: '<annotation>\n\t<folder>02085620</folder>\n\t<filename>n02085620_10621</filename>\n\t<source>\n\t\t<database>ImageNet database</database>\n\t</source>\n\t<size>\n\t\t<width>500</width>\n\t\t<height>298</height>\n\t\t<depth>3</depth>\n\t</size>\n\t<segment>0</segment>\n\t<object>\n\t\t<name>Chihuahua</name>\n\t\t<pose>Unspecified</pose>\n\t\t<truncated>0</truncated>\n\t\t<difficult>0</difficult>\n\t\t<bndbox>\n\t\t\t<xmin>142</xmin>\n\t\t\t<ymin>43</ymin>\n\t\t\t<xmax>335</xmax>\n\t\t\t<ymax>250</ymax>\n\t\t</bndbox>\n\t</object>\n</annotation>'
In [26]: re.findall('(?<=<xmin>)[0-9]+?(?=</xmin>)', key)
Out[26]: ['142']
In [27]: xmin = int(re.findall('(?<=<xmin>)[0-9]+?(?=</xmin>)', key)[0])
...: xmax = int(re.findall('(?<=<xmax>)[0-9]+?(?=</xmax>)', key)[0])
...: ymin = int(re.findall('(?<=<ymin>)[0-9]+?(?=</ymin>)', key)[0])
...: ymax = int(re.findall('(?<=<ymax>)[0-9]+?(?=</ymax>)', key)[0])
In [28]: xmin
Out[28]: 142
In [29]: ymin
Out[29]: 43
In [30]: xmax
Out[30]: 335
In [31]: ymax
Out[31]: 250
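The four near-identical findall calls above can be folded into one helper. The sketch below is an assumption-laden illustration rather than the author's code: `extract_int` is a hypothetical name, and `annotation` is a shortened copy of the bndbox part of the XML shown earlier.

```python
import re


def extract_int(tag, text):
    """Return the integer inside <tag>...</tag>, or None if the tag is missing."""
    name = re.escape(tag)
    # Lookarounds assert the opening and closing tags without including them in the match.
    m = re.search(rf"(?<=<{name}>)[0-9]+(?=</{name}>)", text)
    return int(m.group(0)) if m else None


# Shortened copy of the <bndbox> element from the annotation above.
annotation = (
    "<bndbox>\n\t<xmin>142</xmin>\n\t<ymin>43</ymin>\n"
    "\t<xmax>335</xmax>\n\t<ymax>250</ymax>\n</bndbox>"
)

bbox = {tag: extract_int(tag, annotation) for tag in ("xmin", "ymin", "xmax", "ymax")}
print(bbox)  # {'xmin': 142, 'ymin': 43, 'xmax': 335, 'ymax': 250}
```

Returning None for a missing tag avoids the IndexError that `re.findall(...)[0]` raises on an empty result. For annotation files with several objects, a real XML parser such as the standard library's xml.etree.ElementTree is usually the more robust choice.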