regular expression

base

re_page = re.compile(r'<dd class="name">.*?<h2>(.*?)</h2>.*?<dd class="week">(.*?)</dd>.*?<span>.*?<b>(.*?)</b>(.*?)</span>',re.S)

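Parentheses () capture the matched substring. A pattern yields one captured string per pair of parentheses, so wrap whatever data you want to extract in ().

For details, see the post "Python获取天气并email通知" (fetching the weather with Python and sending an email notification).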
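As a rough illustration of how the four pairs of parentheses in re_page turn into four extracted strings, here is a minimal sketch. The HTML fragment is made up for this example and only shaped to fit the pattern; it is not taken from any real weather page.

```python
import re

re_page = re.compile(
    r'<dd class="name">.*?<h2>(.*?)</h2>.*?<dd class="week">(.*?)</dd>'
    r'.*?<span>.*?<b>(.*?)</b>(.*?)</span>',
    re.S,  # let . match newlines too, since the HTML spans several lines
)

# Hypothetical sample input, invented for this sketch.
sample_html = """
<dl>
  <dd class="name"><h2>Beijing</h2></dd>
  <dd class="week">Monday</dd>
  <dd class="wea"><span><b>Sunny</b> 3~12C</span></dd>
</dl>
"""

match = re_page.search(sample_html)
fields = match.groups()   # one string per pair of parentheses
print(fields)             # ('Beijing', 'Monday', 'Sunny', ' 3~12C')
```

Note that the fourth group keeps the leading space that follows </b>; call .strip() on it if that matters.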

group

In [14]: result=re.match(r"<h1>(.*)</h1>","<h1>匹配分组</h1>")

In [15]: result.group()
Out[15]: '<h1>匹配分组</h1>'

In [16]: result.group(1)
Out[16]: '匹配分组'

In [17]: result=re.match(r"(<h1>).*(</h1>)","<h1>匹配分组</h1>")

In [18]: result.group(1)
Out[18]: '<h1>'

In [19]: result.group(2)
Out[19]: '</h1>'

In [20]: result.group(0)
Out[20]: '<h1>匹配分组</h1>'

In [21]: result.group()
Out[21]: '<h1>匹配分组</h1>'

groups
In [22]: result.groups()
Out[22]: ('<h1>', '</h1>')

In [23]: result.groups()[0]
Out[23]: '<h1>'
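One thing the sessions above do not show is what happens when the pattern does not match: re.match returns None, and calling .group() on it raises AttributeError. A minimal sketch of guarding against that follows; the helper name extract_tags is hypothetical and exists only for this example.

```python
import re

def extract_tags(text):
    """Return the opening and closing tag as a tuple, or None if no match."""
    m = re.match(r"(<h1>).*(</h1>)", text)
    if m is None:           # re.match returns None when nothing matches
        return None
    return m.group(1, 2)    # several group numbers at once give a tuple

print(extract_tags("<h1>匹配分组</h1>"))   # ('<h1>', '</h1>')
print(extract_tags("no tags here"))        # None
```

Passing several numbers to group() gives the same tuple that groups() would return here, since the pattern has exactly two groups.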

\num

import re

In [24]: s = "<html><h1>itit</h1></html>"

In [25]: re.match(r"<.+><.+>.+</.+></.+>", s)
Out[25]: <re.Match object; span=(0, 26), match='<html><h1>itit</h1></html>'>

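# \2 must match the same text that the second () captured, and \1 likewise the first, so the closing tags no longer have to be written as .+ like above. I did not fully get this from the book; it only clicked after watching a video.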
In [28]: re.match(r"<(.+)><(.+)>.+</\2></\1>", s)
Out[28]: <re.Match object; span=(0, 26), match='<html><h1>itit</h1></html>'>
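The practical difference between the two patterns shows up on input whose closing tags do not match the opening ones. The sketch below uses a deliberately broken string (invented for this example) to compare them.

```python
import re

loose = re.compile(r"<.+><.+>.+</.+></.+>")        # any tag names accepted
strict = re.compile(r"<(.+)><(.+)>.+</\2></\1>")   # closing tags must repeat groups 2 and 1

good = "<html><h1>itit</h1></html>"
bad = "<html><h1>itit</h2></html>"   # </h2> does not close <h1>

print(loose.match(bad) is not None)    # True: the loose pattern accepts mismatched tags
print(strict.match(bad))               # None: \2 requires "h1" here, but finds "h2"
print(strict.match(good).groups())     # ('html', 'h1')
```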

findall

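match and search each perform a single match: they return as soon as one result is found. Most of the time, though, we need to scan the whole string and collect every match.
The findall method of a compiled pattern is called like this:
findall(string[, pos[, endpos]])
Here string is the string to search, and pos and endpos are optional arguments giving the start and end positions of the search; they default to 0 and len(string) respectively.

findall returns all non-overlapping matching substrings as a list, or an empty list if nothing matches.
Some examples:

First, search: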
In [12]: s="linuxsa</h1></html>ops</h1>"
In [13]: re.search(r"\w+</h1>",s).group()
Out[13]: 'linuxsa</h1>'

And here is findall:
In [15]: re.findall(r"\w+</h1>",s)
Out[15]: ['linuxsa</h1>', 'ops</h1>']




import re
pattern = re.compile(r'\d+')   # match runs of digits
result1 = pattern.findall('hello 123456 789')
result2 = pattern.findall('one1two2three3four4', 0, 10)   # search only positions 0 to 9

print(result1)
print(result2)
Output:
['123456', '789']
['1', '2']


re.findall(pattern, string[, flags])
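Finds every substring of the string that the regular expression matches and returns them as a list. A compiled pattern object (RegexObject) has the equivalent method: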
findall(string[, pos[, endpos]])
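The two call forms take different optional arguments: the module-level function accepts flags, while the compiled pattern's method accepts pos and endpos (its flags are fixed at compile time). A small sketch, with a sample string chosen only for illustration:

```python
import re

text = "Apple apple APPLE"

# Module-level call: the pattern and flags are passed on every call.
module_level = re.findall(r"apple", text, flags=re.I)

# Compiled pattern: flags are given once to compile(); findall takes pos/endpos.
compiled = re.compile(r"apple", re.I)
from_pos_6 = compiled.findall(text, 6)   # start searching at index 6

print(module_level)   # ['Apple', 'apple', 'APPLE']
print(from_pos_6)     # ['apple', 'APPLE']
```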

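\S	matches any non-whitespace character

\w	matches a letter, digit, or underscore

\W	matches any character that is not a letter, digit, or underscore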
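To see the three classes side by side, here is a short sketch that runs each one over the same string (the string itself is made up for this example):

```python
import re

text = "user_1 @ home!"

non_space = re.findall(r"\S+", text)   # runs of non-whitespace characters
word = re.findall(r"\w+", text)        # runs of letters, digits, underscores
non_word = re.findall(r"\W+", text)    # runs of everything else

print(non_space)   # ['user_1', '@', 'home!']
print(word)        # ['user_1', 'home']
print(non_word)    # [' @ ', '!']
```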



>>> import re
>>> s = "adfad asdfasdf asdfas asdfawef asd adsfas "

>>> reObj1 = re.compile(r'((\w+)\s+\w+)')  # lowercase \w
>>> reObj1.findall(s)
[('adfad asdfasdf', 'adfad'), ('asdfas asdfawef', 'asdfas'), ('asd adsfas', 'asd')]


In [9]: s = "a b c d   e f" 

In [10]: reObj1 = re.compile(r'((\w+)\s+\w+)')

In [11]: reObj1.findall(s)
Out[11]: [('a b', 'a'), ('c d', 'c'), ('e f', 'e')]

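# Explanation: when the regular expression contains more than one pair of parentheses, each list element is a tuple of strings. The tuple holds one string per pair of parentheses, each string is the text matched by the corresponding group, and they are ordered by where each opening parenthesis appears.


In [12]: reObj1 = re.compile(r'(\w+\s+\w+)')

In [13]: reObj1.findall(s)
Out[13]: ['a b', 'c d', 'e f']
# When the regular expression contains exactly one pair of parentheses, each list element is a string holding the text matched by that group (not the text matched by the whole expression).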
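The "not the whole expression" point is easiest to see when the single group covers only part of the pattern. The sketch below compares zero, one, and two groups on the same input string used above:

```python
import re

s = "a b c d   e f"

no_group = re.findall(r"\w+\s+\w+", s)        # no (): whole matches are returned
one_group = re.findall(r"(\w+)\s+\w+", s)     # one (): only the group's text
two_groups = re.findall(r"(\w+)\s+(\w+)", s)  # two (): a tuple per match

print(no_group)     # ['a b', 'c d', 'e f']
print(one_group)    # ['a', 'c', 'e']
print(two_groups)   # [('a', 'b'), ('c', 'd'), ('e', 'f')]
```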

The result of re.compile(...).findall(data) is a list. To get a str that is easier to work with, either index into the list or combine its elements with str.join().
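Both options look roughly like this; the input string and the comma separator are arbitrary choices for the example:

```python
import re

data = "hello 123456 789"
found = re.compile(r"\d+").findall(data)   # a list of str

first = found[0]           # index into the list to pick one match
joined = ",".join(found)   # or merge all matches into a single str

print(first)    # 123456
print(joined)   # 123456,789
```

Indexing raises IndexError when the list is empty, so check that found is non-empty first if a match is not guaranteed.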

Reference: "Python 正则re模块之compile()和findall()详解" (a detailed look at compile() and findall() in Python's re module)