Pjack: Python 2.7 Standard Library Notes -- Regular Expression

Tutorial link
http://docs.python.org/howto/regex.html#regex-howto

re Module
http://docs.python.org/library/re.html

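\d is [0-9]; \D is [^0-9]
\s is [ \t\n\r\f\v]; \S is [^ \t\n\r\f\v]
\w is [a-zA-Z0-9_]; \W is [^a-zA-Z0-9_]
[] is a set of characters; the pattern matches if any one character in the set appears
* repeats 0 to unlimited times
+ repeats 1 to unlimited times
? repeats 0 or 1 time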
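
A small sketch of the character classes and quantifiers listed above. The sample strings and variable names are made up for illustration; only re.findall from the standard library is used.

```python
import re

# Hypothetical sample text, chosen only to exercise the classes above.
text = "Order 66 shipped on 2011-11-20 to user_42"

# \d+ : one or more digits
digits = re.findall(r"\d+", text)
# ['66', '2011', '11', '20', '42']

# \w+ : runs of letters, digits and underscore
words = re.findall(r"\w+", text)
# ['Order', '66', 'shipped', 'on', '2011', '11', '20', 'to', 'user_42']

# \s : a single whitespace character
spaces = re.findall(r"\s", "a b\tc\n")
# [' ', '\t', '\n']

# ? : the 'u' may appear 0 or 1 time
optional = re.findall(r"colou?r", "color colour colr")
# ['color', 'colour']

# * : the 'b' may appear 0 or more times
star = re.findall(r"ab*", "a ab abbb")
# ['a', 'ab', 'abbb']

# [] : any single character from the set
vowels = re.findall(r"[aeiou]", "regex")
# ['e', 'e']
```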

re.compile returns a pattern object. The same pattern object can be reused to pick a piece of text apart again and again, which is pretty handy.

class SRE_Pattern(__builtin__.object)
 |  Compiled regular expression objects
 |
 |  Methods defined here:
 |
 |  findall(...)
 |      findall(string[, pos[, endpos]]) --> list.
 |      Return a list of all non-overlapping matches of pattern in string.
 |
 |  finditer(...)
 |      finditer(string[, pos[, endpos]]) --> iterator.
 |      Return an iterator over all non-overlapping matches for the
 |      RE pattern in string. For each match, the iterator returns a
 |      match object.
 |
 |  match(...)
 |      match(string[, pos[, endpos]]) --> match object or None.
 |      Matches zero or more characters at the beginning of the string
 |
 |  scanner(...)
 |
 |  search(...)
 |      search(string[, pos[, endpos]]) --> match object or None.
 |      Scan through string looking for a match, and return a corresponding
 |      MatchObject instance. Return None if no position in the string matches.
 |
 |  split(...)
 |      split(string[, maxsplit = 0])  --> list.
 |      Split string by the occurrences of pattern.
 |
 |  sub(...)
 |      sub(repl, string[, count = 0]) --> newstring
 |      Return the string obtained by replacing the leftmost non-overlapping
 |      occurrences of pattern in string by the replacement repl.
 |
 |  subn(...)
 |      subn(repl, string[, count = 0]) --> (newstring, number of subs)
 |      Return the tuple (new_string, number_of_subs_made) found by replacing
 |      the leftmost non-overlapping occurrences of pattern with the
 |      replacement repl.
 |
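
The methods in the listing above can be tried out roughly as follows. This is only a sketch: the pattern and the sample text are invented for illustration, and the undocumented scanner() method is left out.

```python
import re

# Hypothetical pattern and text, used only to illustrate the methods.
p = re.compile(r"[a-z]+")
text = "tempo 120 allegro 96"

# match() only looks at the beginning of the string.
first = p.match(text).group()            # 'tempo'
no_match = p.match("120 allegro")        # None, the string starts with digits

# search() scans forward until some position matches.
found = p.search("120 allegro").group()  # 'allegro'

# findall() returns the matched strings, finditer() yields match objects.
all_words = p.findall(text)              # ['tempo', 'allegro']
spans = [m.span() for m in p.finditer(text)]  # [(0, 5), (10, 17)]

# split() cuts the string wherever the pattern matches.
pieces = re.compile(r"\s+").split(text, maxsplit=2)
# ['tempo', '120', 'allegro 96']

# sub() returns the new string, subn() also reports how many replacements were made.
replaced = p.sub("X", text)                  # 'X 120 X 96'
replaced_once = p.subn("X", text, count=1)   # ('X 120 allegro 96', 1)
```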

Watch out for the difference between raw strings and ordinary strings. Raw strings are normally used directly because they are more intuitive.

# Option 1: with an ordinary string, the regex escape \n has to be written as "\\n"
>>> p = re.compile('\\n', re.IGNORECASE)
>>> p.findall("\np")
['\n']
>>> p.findall("\\np")
[]
# Option 2: with a raw string, just write "\n" directly; the result is the same
>>> p = re.compile(r'\n', re.IGNORECASE)
>>> p.findall("\np")
['\n']
>>> p.findall("\\np")
[]
>>> print p.findall("\np")[0]

Compilation Flags

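IGNORECASE: matching ignores letter case, which is convenient and saves having to spell out both cases in the pattern
MULTILINE: ^ and $ also match at the start and end of every line, so each line can be matched separately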
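
A short sketch of what the two flags change. The three-line sample text is an assumption made up for the example.

```python
import re

# Hypothetical three-line text.
text = "first line\nSecond line\nthird LINE"

# Without MULTILINE, ^ matches only at the start of the whole string.
starts_default = re.findall(r"^\w+", text)
# ['first']

# With MULTILINE, ^ also matches right after each newline.
starts_multi = re.findall(r"^\w+", text, re.MULTILINE)
# ['first', 'Second', 'third']

# IGNORECASE lets 'line' match 'LINE' as well.
any_case = re.findall(r"line", text, re.IGNORECASE)
# ['line', 'line', 'LINE']

# Flags are combined with |. Here $ matches at the end of every line.
line_ends = re.findall(r"line$", text, re.IGNORECASE | re.MULTILINE)
# ['line', 'line', 'LINE']
```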

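A | B matches either A or B, where A and B are each an RE
^A requires the start of the string to match A, where A is an RE

Inside a set [], however, a leading ^ means negation

A$ requires the end of the string to match A, where A is an RE
\bS\b requires a word boundary on both sides of S, so S must not be directly preceded or followed by a word character (\w)
\BS\B is the opposite of \b: S must not sit at a word boundary on either side
() marks off a group. Groups let you capture several sub-patterns in one match, and the groups can even be nested or refer to each other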

>>> p = re.compile('(a(b)c)d')
>>> m = p.match('abcd')
>>> m.group(0)
'abcd'
>>> m.group(1)
'abc'
>>> m.group(2)
'b'
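
Alternation, the ^ and $ anchors, set negation and the \b / \B boundaries described earlier can be sketched in the same way. All sample strings here are invented for illustration.

```python
import re

# A | B : either alternative may match.
pets = re.findall(r"cat|dog", "cat, dog, bird")
# ['cat', 'dog']

# ^ anchors the match to the start of the string.
at_start = re.search(r"^From", "From here") is not None              # True
not_at_start = re.search(r"^From", "Reciting From memory") is None   # True

# $ anchors the match to the end of the string.
at_end = re.search(r"\}$", "{block}") is not None                    # True

# Inside a set, a leading ^ negates it: runs of non-digits.
non_digits = re.findall(r"[^0-9]+", "a1b22c")
# ['a', 'b', 'c']

# \b needs a word boundary on both sides, \B forbids one on both sides.
sentence = "no class at all; the declassified subclass"
whole_word = [m.start() for m in re.finditer(r"\bclass\b", sentence)]
# [3]   (the standalone word 'class')
inside_word = [m.start() for m in re.finditer(r"\Bclass\B", sentence)]
# [23]  (the 'class' inside 'declassified')
```

Note that 'subclass' matches neither pattern: the end of the string counts as a word boundary after the final 's', and there is no boundary before the 'c'.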

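(?P<name>...) gives the group a name; (?P=name) is a backreference that matches the same text the named group matched
(?:...) is a non-capturing group: it groups the pattern for matching, but the matched text is not captured, so it does not show up in groups()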

>>> m = re.match("([abc])+", "abc")
>>> m.groups()
('c',)
>>> m = re.match("(?:[abc])+", "abc")
>>> m.groups()
()
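
Named groups can be sketched like this. The group names year, month and word and the sample strings are assumptions chosen for the example; the doubled-word pattern follows the one used in the regex HOWTO linked at the top.

```python
import re

# (?P<name>...) captures like a normal group but can also be read by name.
m = re.match(r"(?P<year>\d+)-(?P<month>\d+)", "2011-11-20")
by_name = m.group("year")     # '2011'
by_number = m.group(1)        # '2011', named groups are still numbered
as_dict = m.groupdict()       # {'year': '2011', 'month': '11'}

# (?P=word) must match the same text that the group named 'word' matched,
# so this pattern finds a word that is immediately repeated.
dup = re.search(r"(?P<word>\b\w+)\s+(?P=word)", "Paris in the the spring")
repeated = dup.group("word")  # 'the'
whole = dup.group()           # 'the the'
```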

(...)\1 is a backreference by group number: \1 matches the same text that group 1 matched
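
A minimal sketch of numbered backreferences, with made-up sample strings:

```python
import re

# (\w)\1 : a word character followed by the same character again.
# findall() returns the contents of group 1 for each match.
doubled_letters = re.findall(r"(\w)\1", "bookkeeper")
# ['o', 'k', 'e']

# (\d+)-\1 : a number, a hyphen, then the very same number.
same_pair = re.search(r"(\d+)-\1", "10-20 30-30 40-50").group()
# '30-30'
```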

Splitting Strings

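An RE can also be used to split a string: the matched text is dropped and the string is cut at those points
To keep the matched text in the result, wrap the pattern in a group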

>>> p = re.compile(r'\W+')
>>> p2 = re.compile(r'(\W+)')
>>> p.split('This... is a test.')
['This', 'is', 'a', 'test', '']
>>> p2.split('This... is a test.')
['This', '... ', 'is', ' ', 'a', ' ', 'test', '.', '']

Search and Replace

Matched groups can be used in the replacement rule. These three forms mean the same thing when group 1 is named name: \1 = \g<1> = \g<name>
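
The three replacement forms can be compared side by side. This sketch is adapted from the section{...} example in the regex HOWTO; the variable names are chosen here for illustration.

```python
import re

# Group 1 is also named 'name', so all three references point at it.
p = re.compile(r"section\{(?P<name>[^}]*)\}")
source = "section{First}"

by_backslash = p.sub(r"subsection{\1}", source)
by_g_number = p.sub(r"subsection{\g<1>}", source)
by_g_name = p.sub(r"subsection{\g<name>}", source)
# all three give 'subsection{First}'

# \g<1> is needed when a literal digit follows the reference:
# r"\10" would be read as group 10, r"\g<1>0" is group 1 followed by '0'.
times_ten = re.sub(r"(\d)", r"\g<1>0", "7")
# '70'
```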

*?, +?, and ?? are the non-greedy versions of *, +, and ?: they match as little text as possible
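
The difference between greedy and non-greedy matching shows up clearly on a string with several closing brackets. The sample strings are illustrative only.

```python
import re

html = "<html><head><title>Title</title>"

# Greedy: .* grabs as much as possible, so the match runs to the last '>'.
greedy = re.match(r"<.*>", html).group()
# '<html><head><title>Title</title>'

# Non-greedy: .*? stops at the first '>' that lets the match succeed.
lazy = re.match(r"<.*?>", html).group()
# '<html>'

# +? still needs at least one repetition, ?? prefers zero.
one_a = re.match(r"a+?", "aaa").group()   # 'a'
zero_a = re.match(r"a??", "aaa").group()  # ''
```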

Sunday, November 20, 2011