
Python Regular Expressions

程序员文章站 2022-07-13 12:43:09

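Regular expressions (Python)

What regular expressions are used for

  • In many text editors, regular expressions are used to search for, and replace, text that matches a given pattern [1]
  • Web crawlers (extracting fields from downloaded pages)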
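
As a rough illustration of the search-and-replace use case, here is a minimal sketch with Python's built-in re module. The sample text and the pattern are made up for this illustration and do not come from the original post.

```python
import re

# Hypothetical sample text, made up for this illustration.
text = "color colour colr"

# Search: "colo", then an optional "u", then "r".
found = re.findall(r"colou?r", text)
print(found)  # ['color', 'colour']

# Replace: rewrite every match to a single spelling.
normalized = re.sub(r"colou?r", "color", text)
print(normalized)  # color color colr
```

The misspelled "colr" is left alone because it does not match the pattern.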

Examples

  • 'sa{1,3}s' matches an s, followed by one to three a's, followed by an s: that is, sas, saas, or saaas
In [1]: import re
In [3]: key = "saas and sas and saaas"
In [9]: re.findall('sa{1,3}s', key)                                             
Out[9]: ['saas', 'sas', 'saaas']

In [10]: re.findall('sa{1,9}s', key)                                            
Out[10]: ['saas', 'sas', 'saaas']

In [11]: re.findall('sa{1}s', key)                                              
Out[11]: ['sas']

In [12]: re.findall('sa{3}s', key)                                              
Out[12]: ['saaas']

In [13]: re.findall('sa{2}s', key)                                              
Out[13]: ['saas']
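
The quantifier session above can be condensed into a short script. This is only a sketch: it uses raw strings (r"..."), which the session does not, and it adds one pattern that is not in the original, the open-ended form a{2,} ("two or more"). The names patterns and results are chosen for illustration.

```python
import re

key = "saas and sas and saaas"

# a{m,n} matches between m and n repetitions of "a" (inclusive);
# a{n} matches exactly n repetitions; a{n,} matches n or more.
patterns = [r"sa{1,3}s", r"sa{1}s", r"sa{2}s", r"sa{3}s", r"sa{2,}s"]

# Map each pattern to the list of matches it finds in key.
results = {p: re.findall(p, key) for p in patterns}
for pattern, matches in results.items():
    print(pattern, matches)
```

The braces apply only to the character immediately before them, so 'sa{2}s' means s, a, a, s and not "sa" repeated twice.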
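  • Matching everything between two markers: .+ is greedy, so the match extends to the last place where the rest of the pattern still fits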
In [23]: re.findall('sa.+saaas', key)                                           
Out[23]: ['saas and sas and saaas']

In [24]: re.findall('sa.+as', key)                                              
Out[24]: ['saas and sas and saaas']

In [25]: re.findall('sa.+an', key)                                              
Out[25]: ['saas and sas an']
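
To make the greedy behaviour in the last result easier to see, the sketch below runs the same pattern next to a lazy variant (.+?). The lazy pattern is added here for comparison and is not part of the session above; the variable names are illustrative.

```python
import re

key = "saas and sas and saaas"

# Greedy: ".+" grabs as much as possible, then backs off to the LAST "an".
greedy = re.findall(r"sa.+an", key)
print(greedy)  # ['saas and sas an']

# Lazy: ".+?" grabs as little as possible, stopping at the FIRST "an".
lazy = re.findall(r"sa.+?an", key)
print(lazy)  # ['saas an', 'sas an']
```

The lazy version finds two shorter matches because each one ends as soon as an "an" becomes available.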


In [26]: key = "[email protected]"   # address obscured by the hosting page; the outputs below imply it ends in "@hit.edu.cn"
In [27]: re.findall('@.+.', key)                                                
Out[27]: ['@hit.edu.cn']

In [28]: re.findall('@.+\.', key)                                               
Out[28]: ['@hit.edu.']


In [31]: re.findall('@.+?\.', key)                                              
Out[31]: ['@hit.']
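
The three patterns above differ only in how the dot is treated. The sketch below replays them as a script. It assumes a placeholder address: only the "@hit.edu.cn" part is implied by the outputs in the session, and the local part "user" is made up.

```python
import re

# Hypothetical address: the local part "user" is an assumption,
# only "@hit.edu.cn" is implied by the session outputs.
key = "user@hit.edu.cn"

# Unescaped "." matches any character, so the final "." consumes the "n".
any_char = re.findall(r"@.+.", key)
print(any_char)  # ['@hit.edu.cn']

# "\." matches a literal dot; greedy ".+" runs to the LAST dot.
greedy_dot = re.findall(r"@.+\.", key)
print(greedy_dot)  # ['@hit.edu.']

# Lazy ".+?" stops at the FIRST literal dot.
lazy_dot = re.findall(r"@.+?\.", key)
print(lazy_dot)  # ['@hit.']
```

Writing the patterns as raw strings keeps the backslash in "\." from being interpreted by Python's string parser before it reaches the regex engine.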



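Matching without capturing (lookaround) [2]
Python behaves slightly differently from what that reference describes: when a pattern contains capture groups, re.findall returns the groups instead of the whole match, so the lookarounds should not wrap their contents in extra parentheses.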

In [1]: key = '<br/><a target=_blank href="www.baidu.com">百度一下</a>百度才知道'

In [2]: import re                                                               

In [3]: re.findall('(?<=(href=")).{1,200}(?=(">))', key)                        
Out[3]: [('href="', '">')]

In [4]: re.findall('(?<=href=").{1,200}(?=">)', key)                            
Out[4]: ['www.baidu.com']
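
The difference between the two calls above comes from how re.findall treats capture groups. The sketch below puts both forms side by side and adds a third option, a single capture group around the wanted text, which is not in the original session. The HTML string is a shortened, hypothetical version of the one used above.

```python
import re

# Shortened, hypothetical version of the HTML snippet used above.
key = '<a target=_blank href="www.baidu.com">Baidu</a>'

# Capture groups inside the lookarounds: findall returns the groups,
# not the text between them.
with_groups = re.findall(r'(?<=(href=")).{1,200}(?=(">))', key)
print(with_groups)  # [('href="', '">')]

# Plain lookarounds, no capture groups: findall returns the whole match.
lookaround = re.findall(r'(?<=href=").{1,200}(?=">)', key)
print(lookaround)  # ['www.baidu.com']

# Alternative: match the delimiters normally and capture only the middle.
captured = re.findall(r'href="(.+?)">', key)
print(captured)  # ['www.baidu.com']
```

Lookarounds check the surrounding text without consuming it, which is why the delimiters do not appear in the second result.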

Example: extracting bounding-box coordinates from an XML annotation

In [25]: key                                                                    
Out[25]: '<annotation>\n\t<folder>02085620</folder>\n\t<filename>n02085620_10621</filename>\n\t<source>\n\t\t<database>ImageNet database</database>\n\t</source>\n\t<size>\n\t\t<width>500</width>\n\t\t<height>298</height>\n\t\t<depth>3</depth>\n\t</size>\n\t<segment>0</segment>\n\t<object>\n\t\t<name>Chihuahua</name>\n\t\t<pose>Unspecified</pose>\n\t\t<truncated>0</truncated>\n\t\t<difficult>0</difficult>\n\t\t<bndbox>\n\t\t\t<xmin>142</xmin>\n\t\t\t<ymin>43</ymin>\n\t\t\t<xmax>335</xmax>\n\t\t\t<ymax>250</ymax>\n\t\t</bndbox>\n\t</object>\n</annotation>'
In [26]: re.findall('(?<=<xmin>)[0-9]+?(?=</xmin>)', key)                       
Out[26]: ['142']


In [27]: xmin = int(re.findall('(?<=<xmin>)[0-9]+?(?=</xmin>)', key)[0]) 
    ...: xmax = int(re.findall('(?<=<xmax>)[0-9]+?(?=</xmax>)', key)[0]) 
    ...: ymin = int(re.findall('(?<=<ymin>)[0-9]+?(?=</ymin>)', key)[0]) 
    ...: ymax = int(re.findall('(?<=<ymax>)[0-9]+?(?=</ymax>)', key)[0])        

In [28]: xmin                                                                   
Out[28]: 142

In [29]: ymin                                                                   
Out[29]: 43

In [30]: xmax                                                                   
Out[30]: 335

In [31]: ymax                                                                   
Out[31]: 250
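
The four nearly identical findall calls above could also be folded into a single pattern. The following is a sketch of one possible approach, not the author's method: it uses an alternation for the tag name plus a backreference (\1) so that the closing tag has to match the opening one. The string is a trimmed, hypothetical excerpt of the annotation, and the names pattern and bbox are illustrative.

```python
import re

# Trimmed, hypothetical excerpt of the annotation string shown above.
key = (
    "<bndbox>\n\t<xmin>142</xmin>\n\t<ymin>43</ymin>\n"
    "\t<xmax>335</xmax>\n\t<ymax>250</ymax>\n</bndbox>"
)

# Group 1 captures the tag name, group 2 the digits; the backreference \1
# requires the closing tag to repeat the same name.
pattern = re.compile(r"<(xmin|ymin|xmax|ymax)>([0-9]+)</\1>")

# findall returns (tag, digits) tuples because the pattern has two groups.
bbox = {tag: int(value) for tag, value in pattern.findall(key)}
print(bbox)  # {'xmin': 142, 'ymin': 43, 'xmax': 335, 'ymax': 250}
```

Here the tuple-returning behaviour of findall is used on purpose: each tuple becomes one key and value in the dictionary.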


  1. https://www.cnblogs.com/chuxiuhong/p/5885073.html

  2. https://blog.csdn.net/z69183787/article/details/81740803