\chapter{Regular Expressions}\label{s:regexp}
A \gref{g:regular-expression}{regular expression} is a pattern for matching text.
Most languages implement them in libraries,
but they are built into JavaScript,
and are written with a \texttt{/} before and after the pattern instead of single or double quotes.
\begin{itemize}
\item
Letters and digits match themselves,
so the regular expression \texttt{/enjoy/} matches the word ``enjoy'' wherever it appears in a string.
\item
A dot matches any character,
so \texttt{/../} matches any two consecutive characters
and \texttt{/en..y/} matches ``enjoy'', ``gently'', and ``brightens your day''.
\item
The asterisk \texttt{*} means ``match zero or more occurrences of what comes immediately before'',
so \texttt{/en*/} matches ``ten'' and ``penny''.
It also matches ``feet'' (which has an ``e'' followed by zero occurrences of ``n'').
\item
The plus sign \texttt{+} means ``match \emph{one} or more occurrences of what comes immediately before'',
so \texttt{/en+/} still matches ``ten'' and ``penny'' but doesn't match ``feet''
(because there isn't an ``n'').
\item
Parentheses create groups just as they do in mathematics,
so \texttt{/(an)+/} matches all of ``anan'' in ``banana'' but only the first ``an'' in ``annual''
(where \texttt{/an+/} would match ``ann'').
\item
The pipe character \texttt{|} means ``either/or'',
so \texttt{/b|c/} matches either a single ``b'' or a single ``c'',
and \texttt{/either|or/} matches either ``either'' or ``or''.
(The pipe applies to everything on each side of it,
so parentheses are needed only to narrow its scope:
\texttt{/eithe(r|o)r/} matches ``eitherr'' or ``eitheor''.)
\item
The shorthand notation \texttt{[a-z]} means ``any one character in this range'',
and is easier to write and read than \texttt{a|b|c|{\ldots}|y|z}.
\item
The characters \texttt{\textasciicircum} and \texttt{\$} are called anchors:
they match the beginning and end of the string
(or of each line if the ``m'' modifier is used)
without matching any actual characters.
\item
If we want to put a special character like \texttt{.}, \texttt{*}, \texttt{+}, or \texttt{|}
in a regular expression,
we have to \gref{g:escape-sequence}{escape} it with a backslash \texttt{\textbackslash}.
This means that \texttt{/stop{\textbackslash}./} only matches ``stop.'',
while \texttt{/stop./} matches ``stops'' as well.
\end{itemize}
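The rules above can be checked directly in JavaScript.
The sketch below is only illustrative:
\texttt{firstMatch} is a hypothetical helper written for this example,
and it relies on the string method \texttt{match},
which is described later in this chapter.
It reports the first substring that each pattern matches,
or \texttt{null} if there is no match.

```javascript
// Hypothetical helper: return the first substring of `text` that
// `pattern` matches, or null if the pattern does not match at all.
const firstMatch = (pattern, text) => {
  const result = text.match(pattern)
  return result === null ? null : result[0]
}

// A dot matches any character, including a space.
console.log(firstMatch(/en..y/, 'brightens your day')) // ens y

// '*' allows zero occurrences, '+' requires at least one.
console.log(firstMatch(/en*/, 'penny')) // enn
console.log(firstMatch(/en*/, 'feet')) // e
console.log(firstMatch(/en+/, 'feet')) // null

// Parentheses make '+' apply to the whole group.
console.log(firstMatch(/(an)+/, 'banana')) // anan

// A range matches any one of its characters.
console.log(firstMatch(/[a-z]+/, 'ABC def')) // def

// '^' anchors the match to the start of the text.
console.log(firstMatch(/^b/, 'abc')) // null

// An escaped dot matches only a literal '.'.
console.log(firstMatch(/stop\./, 'stops')) // null
console.log(firstMatch(/stop./, 'stops')) // stops
```

Looking at the matched substring rather than a yes/no answer makes it easier to see how far operators such as \texttt{*} and \texttt{+} extend a match.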
\begin{longtable}{llll}
Text
& Pattern
& Match
& Explanation
\\
\texttt{abc}
& /b/
& yes
& character matches itself
\\
& /b*/
& yes
& matches zero or more b's
\\
& /z*/
& yes
& text contains zero z's, so pattern matches
\\
& /z+/
& no
& text does not contain one or more z's
\\
& /a.c/
& yes
& '.' matches the 'b'
\\
& /{\textasciicircum}b/
& no
& text does not start with 'b'
\\
\texttt{abc123}
& /[a-z]+/
& yes
& contains one or more consecutive lower-case letters
\\
& /{\textasciicircum}[a-z]+\$/
& no
& digits in string prevent a match
\\
& /{\textasciicircum}[a-z0-9]+\$/
& yes
& whole string is lower-case letters or digits
\\
\texttt{Dr.\ Toby}
& /(Dr|Prof){\textbackslash}./
& yes & contains either ``Dr'' or ``Prof'' followed by literal '.'
\\
\caplbl{Regular Expression Matches}{t:regexp-examples}
\end{longtable}
This is a lot to digest,
so \tblref{t:regexp-examples} shows a few examples.
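
The rows of \tblref{t:regexp-examples} can also be checked mechanically.
The following is a minimal sketch rather than part of the original examples:
the array \texttt{rows} and the name \texttt{mismatches} are introduced here for illustration,
and each check uses \texttt{test}, which is a method of the regular expression.

```javascript
// Each row is [text, pattern, expected result], copied from the table.
const rows = [
  ['abc', /b/, true],
  ['abc', /b*/, true],
  ['abc', /z*/, true],
  ['abc', /z+/, false],
  ['abc', /a.c/, true],
  ['abc', /^b/, false],
  ['abc123', /[a-z]+/, true],
  ['abc123', /^[a-z]+$/, false],
  ['abc123', /^[a-z0-9]+$/, true],
  ['Dr. Toby', /(Dr|Prof)\./, true]
]

// Keep only the rows where the actual result differs from the table.
const mismatches = rows.filter(
  ([text, pattern, expected]) => pattern.test(text) !== expected
)
console.log(`${mismatches.length} mismatches`) // 0 mismatches
```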
Regular expressions can match an intimidating number of other patterns,
but are fairly easy to use in programs.
Like strings and arrays,
they are objects with methods:
if \texttt{pattern} is a regular expression,
then \texttt{pattern.test(string)} returns \texttt{true} if the pattern matches the string
and \texttt{false} if it does not,
while \texttt{string.match(pattern)} returns an array containing the first match
(and the text matched by any parenthesized groups),
or \texttt{null} if there is no match.
If we add the modifier ``g'' after the closing slash of the regular expression to make it ``global'',
then \texttt{string.match(pattern)} returns \emph{all} of the matching substrings:
\begin{minted}{js}
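// Sample lines that may contain email addresses.
const Tests = [
  'Jamie: james@geneinfo.org',
  'Zara: zetsure@bio123.edu',
  'Hong and Andrzej: hchui@euphoric.edu and aszego@euphoric.edu'
]
// Lower-case letters, '@', lower-case letters, a literal '.', lower-case letters.
const pattern = /[a-z]+@[a-z]+\.[a-z]+/g
console.log(`pattern is ${pattern}`)
for (const test of Tests) {
  console.log(`tested against ${test}`)
  // With "g", match returns every matching substring, or null if there are none.
  const matches = test.match(pattern)
  if (matches === null) {
    console.log('-no matches-')
  } else {
    for (const m of matches) {
      console.log(m)
    }
  }
}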
\end{minted}
\begin{minted}{text}
pattern is /[a-z]+@[a-z]+\.[a-z]+/g
tested against Jamie: james@geneinfo.org
james@geneinfo.org
tested against Zara: zetsure@bio123.edu
-no matches-
tested against Hong and Andrzej: hchui@euphoric.edu and aszego@euphoric.edu
hchui@euphoric.edu
aszego@euphoric.edu
\end{minted}
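To make the effect of the ``g'' modifier concrete,
the sketch below runs the same pattern with and without it.
The variable names are chosen for this example only.
It uses a pattern without ``g'' for \texttt{test},
since a global pattern remembers where its last match ended between calls.

```javascript
const text = 'hchui@euphoric.edu and aszego@euphoric.edu'

// Without "g": only the first match, plus extra details such as its position.
const first = text.match(/[a-z]+@[a-z]+\.[a-z]+/)
console.log(first[0]) // hchui@euphoric.edu
console.log(first.index) // 0

// With "g": every matching substring.
const all = text.match(/[a-z]+@[a-z]+\.[a-z]+/g)
console.log(all.length) // 2

// When nothing matches, the result is null in both cases.
const none = 'no address here'.match(/[a-z]+@[a-z]+\.[a-z]+/g)
console.log(none) // null

// test is called on the pattern and only reports whether there is a match.
const found = /[a-z]+@[a-z]+\.[a-z]+/.test(text)
console.log(found) // true
```

Because \texttt{match} returns \texttt{null} rather than an empty array when nothing matches,
code that loops over its result needs the explicit check shown earlier.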
As powerful as they are,
there are things that \hreffoot{https://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454\#1732454}{regular expressions can't do}.
When it comes to pulling information out of text,
though,
they are easier to use and more efficient than long chains of substring tests.
They can also be used to replace substrings and to split strings into pieces:
please see \hreffoot{https://developer.mozilla.org/en-US/docs/Web/JavaScript/Guide/Regular\_Expressions}{the documentation} for more information.
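
As a brief illustration of replacing and splitting,
the sketch below uses the standard string methods \texttt{replace} and \texttt{split}.
The sample strings and the placeholder \texttt{<hidden>} are made up for this example.

```javascript
const line = 'Hong and Andrzej: hchui@euphoric.edu and aszego@euphoric.edu'
const address = /[a-z]+@[a-z]+\.[a-z]+/g

// With "g", replace substitutes every match, not just the first.
const redacted = line.replace(address, '<hidden>')
console.log(redacted) // Hong and Andrzej: <hidden> and <hidden>

// split accepts a pattern as the separator: here, a comma or semicolon
// with optional whitespace on either side.
const fields = 'alpha, beta;gamma ,delta'.split(/\s*[,;]\s*/)
console.log(fields) // [ 'alpha', 'beta', 'gamma', 'delta' ]
```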