Regular Expressions
Regular expressions are a system for matching patterns in text data, which are widely used in UNIX systems, and occasionally on personal computers as well. They provide a very powerful, but also rather obtuse, set of tools for finding particular words or combinations of characters in strings.
On first reading, this all seems particularly complicated and not of much use over and above the standard string matching provided in the Edit Filters dialog (Word matching, for example). In actual fact, in these cases MT-NewsWatcher converts your string matching criteria into a regular expression when applying filters to articles.
However, you can use some of the simpler matching criteria with ease (some examples are suggested below), and gradually build up the complexity of the regular expressions that you use.
One point to note is that regular expressions are not wildcards. The
regular expression 'c*t
' does not mean 'match "cat", "cot"' etc.
In this case, it means 'match zero or more 'c' characters followed by
a t', so it would match 't', 'ct', 'cccct' etc.
Information Sources
The information here is an amalgamation of the documentation of regular expressions in the Metrowerks CodeWarrior IDE, and of a chapter in the book UNIX Power Tools (Peek, O'Reilly & Loukides). Online information (often the man pages for UNIX utilities) is available by using one of the search engines (e.g. InfoSeek) to search for 'regular expressions'.
MT-NewsWatcher now uses Perl-compatible regular expressions, so any regular expressions that you'd use in Perl should work in MT-NW.
RegExp Basics
Matching simple expressions
Most characters match themselves. The only exceptions are called special characters:
Description | Symbol | Meaning |
---|---|---|
asterisk | * | match zero or more times |
plus sign | + | match one or more times |
question mark | ? | match zero or one time |
backslash | \ | escape the following character |
period | . | match any one character |
caret | ^ | negate a match in [], or start of line |
square brackets | [ and ] | character class |
parentheses | ( and ) | grouping |
dollar sign | $ | end of line |
ampersand | & | and |
or sign | | | or |
To match a special character, precede it with a backslash, like this \*
.
For example,
This expression... | matches this... | but not this... |
---|---|---|
a | a | b |
\.\* | .* | dog |
100 | 100 | ABCDEFG |
Matching any character
A period (.) matches any character except a newline character.
This expression... | matches this... | but not this... |
---|---|---|
.art | dart | art |
cart | hurt | |
tart | dark |
Matching certain types of characters
Some character types can be matched by a shorthand notation:
This expression... | matches |
---|---|
\d | digits (0-9) |
\D | non-digits |
\w | "word" characters (alphanumeric and underscore) |
\W | non-word characters |
\s | whitespace characters (space, tab) |
\S | non-whitespace characters |
Repeating expressions
You can repeat expressions with an asterisk or plus sign.
A regular expression followed by an asterisk (*
) matches zero or more
occurrences of the regular expression. If there is any choice, the
first matching string in a line is used.
A regular expression followed by a plus sign (+
) matches one or more
occurrences of the one-character regular expression. If there is any
choice, the first matching string in a line is used.
A regular expression followed by a question mark (?
) matches zero or
one occurrence of the one-character regular expression.
For example:
This expression... | matches this... | but not this... |
---|---|---|
a+b | ab | b |
aaab | baa | |
a*b | b | daa |
ab | ||
aaab | ||
.*cat | cat | dog |
9393cat | ||
the old cat | ||
c7sb@#puiercat | ||
a[n]? h | a herb | ann hat |
an herb |
So to match any series of zero or more characters, use ".*
". On
its own this isn't much use, but in the middle of a longer regular
expression, it can be.
Grouping expressions
If an expression is enclosed in parentheses ((
and )
), the editor
treats it as one expression and applies any asterisk (*
) or plus (+
)
to the whole expression.
For example
This expression... | matches this... | but not this... |
---|---|---|
(ab)*c | abc | ababab |
ababababc | ababd | |
(.a)+b | xab | b |
ra5afab | aagb |
Choosing one character from many
A string of characters enclosed in square brackets ([]
) matches any one character
in that string. If the first character in the brackets is a caret (^
), it matches
any character except those in the string. For example, [abc]
matches a, b, or c,
but not x, y, or z. However, [^abc]
matches x, y, or z, but not a, b, or c.
A minus sign (-) within square brackets indicates a range of consecutive ASCII
characters. For example, [0-9]
is the same as [0123456789]
. The minus sign loses
its special meaning if it's the first (after an initial ^, if any) or last character
in the string.
If a right square bracket is immediately after a left square bracket, it does
not terminate the string but is considered to be one of the characters to match.
If any special character, such as backslash (\
), asterisk (*
), or plus sign (+
),
is immediately after the left square bracket, it doesn't have its special meaning
and is considered to be one of the characters to match.
This expression... | matches this... | but not this... |
---|---|---|
[aeiou][0-9] | a6 | ex |
i3 | 9a | |
u2 | $6 | |
[^cfl]og | dog | cog |
bog | fog | |
END[.] | END. | END; |
END DO | ||
ENDIAN |
Matching the beginning or end of a line
You can specify that a regular expression match only the beginning or end of the line. In MT-NewsWatcher, a line is the whole field that is being matched, for example the author or subject fields. These are called anchor characters:
If a caret (^
) is at the beginning of the entire regular expression,
it matches the beginning of a line.
If a dollar sign ($
) is at the end of the entire regular expression,
it matches the end of a line.
If an entire regular expression is enclosed by a caret and dollar sign
(^like this$
), it matches an entire line.
This expression... | matches this... | but not this... |
---|---|---|
^(the cat).+ | the cat runs | see the cat run |
.+(the cat)$ | watch the cat | the cat eats |
So, to match all strings containing just one characters, use "^.$
".
RegExp Extensions
Matching words
You can specify that a regular expression match parts of words with \<
(match the start of a word) and \>
(match the end of a word).
An expression like "\<app
" will match "apple" and "application", while
"ing\>
" will match all words ending in -ing. To match a whole word,
using an expression like "\<this\>
".
MT-NewsWatcher provides facilities for doing words matches (which use these expressions internally), but if you want more flexibility, these come in useful. For example, you might want
M.*\<Excel\>
to match
MS Excel, Microsoft Excel, Microsquish Excel etc. To remind you, the .*
mean 'zero or more (*) of any character (.)'.
Alternatives
You can define an expression like (cash|money)
to match
strings which contain either the word 'cash', or the word 'money', or both.
Note that the parentheses around the expression are required.
RegExp Examples
Here are some example regular expressions that create filters useful for a variety of common situations.
Examples
- Kill if 'subject' matches the reg. exp. "
(cash|money)
" -
This kills articles with 'cash' or 'money' in the subject. This should be a case-insensitive match.
- Kill if 'subject' matches the reg. exp. "
^\[?F.?S.?
" -
This kills 'For Sale' articles, which have a subject line that starts ('
^
') with either FS, F.S., [FS] or [F.S.]. Here the '[
' needs to be escaped to '\[
', and the '?
' means 'match zero or one instance of'. - Kill if 'subject' matches the reg. exp. "
[[$%|_\*!][[$%|_\*!][[$%|_\*!]
" -
This is a nifty one that kills those posts with subjects like "$$$blah blah" or "_______this..." which are almost surely not worth reading. The regular expression reads like this. It repeats the range of characters
[[$%|_\*!]
three times, meaning that any of the characters in the [] will be matched. ([ is normally interpreted as starting a group like this unless it is the first character after a [, hence its position here.) This grouping is then repeated three times, to match subjects like $_* or *** or !_!. You could prepend a ^ to force the match a the beginning of the line. - Kill if 'Xref' matches the reg. exp. "
[^ ]+ [^ ]+ [^ ]+ [^ ]+
" -
This kills articles which have been cross-posted to four or more groups, and works by looking for runs of non-space characters (the
[^ ]
) separated by spaces. - Hilite if 'subject' matches the reg. exp. "
News ?Watcher
(ignore case)" -
This will match "MT-NewsWatcher", "MT-NewsWatcher", "News Watcher", "news Watcher" and so on. The '?' means match zero or one space.
- Hilite if 'subject' matches the reg. exp. "
Kaleid[aeio]scope
" -
This will match "Kaleidoscope", as well as all the misspellings that are common, the
[]
meaning match any of the alternatives within the square brackets. - Hilite if 'subject' matches the reg. exp. "
^\[?A[Nn][Nn]
" -
This is useful for catching announcement posts, where the subject line starts with [ANN] or Ann or [Ann. The first "
^
" forces a match at the beginning of the line. Then it looks for zero or one (the meaning of the "?
") "[" characters, but since this is a reserved character, it has to be escaped to "\[
". Then we look for a "A", followed by either "N" or "n" and then one or more "N" or "n" characters.
Filtering Junk Messages
Read on to find out how to use filters to kill junk messages.
Table of Contents
- Preface
- Table of Contents
- Introduction
- Features
- Advanced features
- The Interface
- Appendices