
Regular Expressions

Regular expressions are a system for matching patterns in text. They are widely used on UNIX systems, and occasionally on personal computers as well. They provide a very powerful, but also rather obscure, set of tools for finding particular words or combinations of characters in strings.

On first reading, this all seems particularly complicated and not of much use over and above the standard string matching provided in the Edit Filters dialog (Word matching, for example). In actual fact, in these cases MT-NewsWatcher converts your string matching criteria into a regular expression when applying filters to articles.

However, you can use some of the simpler matching criteria with ease (some examples are suggested below), and gradually build up the complexity of the regular expressions that you use.

One point to note is that regular expressions are not wildcards. The regular expression 'c*t' does not mean 'match "cat", "cot"' etc. In this case, it means 'match zero or more 'c' characters followed by a t', so it would match 't', 'ct', 'cccct' etc.
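The difference from wildcards can be checked with a short sketch. It uses Python's re module as a stand-in for MT-NewsWatcher's Perl-compatible engine (an assumption; both treat these two patterns the same way), and the helper name whole_string_matches is invented for illustration. The helper tests the whole string, because an unanchored search for 'c*t' would find the lone 't' inside "cat".

```python
import re

def whole_string_matches(pattern, text):
    """True only if the entire text matches the pattern (hypothetical helper)."""
    return re.fullmatch(pattern, text) is not None

# 'c*t' is zero or more 'c' characters followed by a 't'.
star_results = {text: whole_string_matches(r"c*t", text)
                for text in ["t", "ct", "cccct", "cat", "cot"]}

# The wildcard-style meaning ("c, any one character, t") is written 'c.t'.
dot_results = {text: whole_string_matches(r"c.t", text)
               for text in ["cat", "cot", "ct"]}

print(star_results)
print(dot_results)
```

If a wildcard-style match is what you want, the period is the character to reach for, not the asterisk on its own.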

Information Sources

The information here is an amalgamation of the documentation of regular expressions in the Metrowerks CodeWarrior IDE, and of a chapter in the book UNIX Power Tools (Peek, O'Reilly & Loukides). Online information (often the man pages for UNIX utilities) is available by using one of the search engines (e.g. InfoSeek) to search for 'regular expressions'.

MT-NewsWatcher now uses Perl-compatible regular expressions, so any regular expressions that you'd use in Perl should work in MT-NW.


RegExp Basics

Matching simple expressions

Most characters match themselves. The only exceptions are called special characters:

Description      Symbol     Meaning
asterisk         *          match zero or more times
plus sign        +          match one or more times
question mark    ?          match zero or one time
backslash        \          escape the following character
period           .          match any one character
caret            ^          negate a match in [], or start of line
square brackets  [ and ]    character class
parentheses      ( and )    grouping
dollar sign      $          end of line
ampersand        &          and
or sign          |          or

To match a special character, precede it with a backslash, like this \*. For example,

This expression...   matches this...   but not this...
a                    a                 b
\.\*                 .*                dog
100                  100               ABCDEFG
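A minimal sketch of escaping, using Python's re module as a stand-in for MT-NewsWatcher's engine (an assumption). The variable names are illustrative only. It contrasts the escaped pattern \.\* with the unescaped .* and shows re.escape, a Python convenience that adds the backslashes for you.

```python
import re

# Escaped, '.' and '*' stand for themselves.
literal_dot_star = re.compile(r"\.\*")
found_in_literal = literal_dot_star.search("use .* here") is not None
found_in_dog = literal_dot_star.search("dog") is not None

# Unescaped, the same two characters mean "any run of characters",
# which matches even an ordinary word.
unescaped_found_in_dog = re.search(r".*", "dog") is not None

# re.escape adds the backslashes for arbitrary literal text.
escaped = re.escape(".*")

print(found_in_literal, found_in_dog, unescaped_found_in_dog, escaped)
```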

Matching any character

A period (.) matches any character except a newline character.

This expression...   matches this...   but not this...
.art                 dart              art
                     cart              hurt
                     tart              dark
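The table above can be reproduced with a few lines of Python (used here as a stand-in for MT-NewsWatcher's engine, which is an assumption). Note that "art" fails because the period must match exactly one character before "art".

```python
import re

any_then_art = re.compile(r".art")
results = {text: any_then_art.search(text) is not None
           for text in ["dart", "cart", "tart", "art", "hurt", "dark"]}
print(results)

# By default the period does not match a newline character.
newline_matches = re.search(r".", "\n") is not None
print(newline_matches)
```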

Matching certain types of characters

Some character types can be matched by a shorthand notation:

This expression...   matches
\d                   digits (0-9)
\D                   non-digits
\w                   "word" characters (alphanumeric and underscore)
\W                   non-word characters
\s                   whitespace characters (space, tab)
\S                   non-whitespace characters
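As an illustration, the shorthand classes can be tried on a made-up subject line. This is a sketch in Python rather than in MT-NewsWatcher itself; the sample string is hypothetical, and the re.ASCII flag is an assumption made so that Python's classes keep to the ASCII meanings listed in the table.

```python
import re

sample = "Re: 3 new_items\tfor sale"  # hypothetical subject line

# re.ASCII keeps \d, \w and \s to the ASCII meanings listed in the table;
# without it Python also accepts other Unicode digits and letters.
digits = re.findall(r"\d", sample, re.ASCII)
words = re.findall(r"\w+", sample, re.ASCII)
gaps = re.findall(r"\s", sample, re.ASCII)
non_word = re.findall(r"\W", sample, re.ASCII)

print(digits)
print(words)
print(gaps)
print(non_word)
```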

Repeating expressions

You can repeat expressions with an asterisk or plus sign.

A regular expression followed by an asterisk (*) matches zero or more occurrences of the regular expression. If there is any choice, the first matching string in a line is used.

A regular expression followed by a plus sign (+) matches one or more occurrences of that expression. If there is any choice, the first matching string in a line is used.

A regular expression followed by a question mark (?) matches zero or one occurrence of that expression.

For example:

This expression...   matches this...    but not this...
a+b                  ab                 b
                     aaab               baa
a*b                  b                  daa
                     ab
                     aaab
.*cat                cat                dog
                     9393cat
                     the old cat
                     c7sb@#puiercat
a[n]? h              a herb             ann hat
                     an herb

So to match any series of zero or more characters, use ".*". On its own this isn't much use, but in the middle of a longer regular expression, it can be.
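The repetition examples can be checked with a small sketch. Python's re module stands in for MT-NewsWatcher's engine (an assumption), and found is a hypothetical helper that reports whether the pattern occurs anywhere in the text, which is how a filter would apply it.

```python
import re

def found(pattern, text):
    """True if the pattern matches anywhere in text (hypothetical helper)."""
    return re.search(pattern, text) is not None

table = {
    r"a+b": ["ab", "aaab", "b", "baa"],
    r"a*b": ["b", "ab", "aaab", "daa"],
    r".*cat": ["cat", "the old cat", "dog"],
    r"a[n]? h": ["a herb", "an herb", "ann hat"],
}

results = {pattern: {text: found(pattern, text) for text in texts}
           for pattern, texts in table.items()}

for pattern, outcome in results.items():
    print(pattern, outcome)
```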

Grouping expressions

If an expression is enclosed in parentheses (( and )), it is treated as one expression, and any asterisk (*), plus sign (+) or question mark (?) that follows applies to the whole expression.

For example

This expression...   matches this...   but not this...
(ab)*c               abc               ababab
                     ababababc         ababd
(.a)+b               xab               b
                     ra5afab           aagb
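A short sketch of what the parentheses change, written in Python as a stand-in for MT-NewsWatcher's engine (an assumption). The helper found is hypothetical. The last two lines compare the grouped and ungrouped forms against the whole string "ababc".

```python
import re

def found(pattern, text):
    """True if the pattern matches anywhere in text (hypothetical helper)."""
    return re.search(pattern, text) is not None

grouped = {text: found(r"(ab)*c", text)
           for text in ["abc", "ababababc", "ababab", "ababd"]}
repeated_pair = {text: found(r"(.a)+b", text)
                 for text in ["xab", "ra5afab", "b", "aagb"]}

# Without parentheses the asterisk applies to the 'b' alone,
# so 'ab*c' cannot account for the repeated "ab".
ungrouped_whole = re.fullmatch(r"ab*c", "ababc") is not None
grouped_whole = re.fullmatch(r"(ab)*c", "ababc") is not None

print(grouped)
print(repeated_pair)
print(ungrouped_whole, grouped_whole)
```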

Choosing one character from many

A string of characters enclosed in square brackets ([]) matches any one character in that string. If the first character in the brackets is a caret (^), it matches any character except those in the string. For example, [abc] matches a, b, or c, but not x, y, or z. However, [^abc] matches x, y, or z, but not a, b, or c.

A minus sign (-) within square brackets indicates a range of consecutive ASCII characters. For example, [0-9] is the same as [0123456789]. The minus sign loses its special meaning if it's the first (after an initial ^, if any) or last character in the string.

If a right square bracket is immediately after a left square bracket, it does not terminate the string but is considered to be one of the characters to match. Special characters such as asterisk (*) or plus sign (+) do not have their special meaning inside square brackets and are considered to be characters to match. Under Perl-compatible rules the backslash (\) is an exception: inside square brackets it still escapes the character that follows it.

This expression...   matches this...   but not this...
[aeiou][0-9]         a6                ex
                     i3                9a
                     u2                $6
[^cfl]og             dog               cog
                     bog               fog
END[.]               END.              END;
                                       END DO
                                       ENDIAN
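The character-class table can be verified with the following sketch, again using Python's re module as an assumed stand-in for MT-NewsWatcher's engine and a hypothetical found helper.

```python
import re

def found(pattern, text):
    """True if the pattern matches anywhere in text (hypothetical helper)."""
    return re.search(pattern, text) is not None

vowel_digit = {text: found(r"[aeiou][0-9]", text)
               for text in ["a6", "i3", "u2", "ex", "9a", "$6"]}
not_cfl = {text: found(r"[^cfl]og", text)
           for text in ["dog", "bog", "cog", "fog"]}
# Inside brackets the period is an ordinary character.
literal_dot = {text: found(r"END[.]", text)
               for text in ["END.", "END;", "END DO", "ENDIAN"]}

print(vowel_digit)
print(not_cfl)
print(literal_dot)
```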

Matching the beginning or end of a line

You can specify that a regular expression match only at the beginning or end of the line. In MT-NewsWatcher, a line is the whole field that is being matched, for example the author or subject field. The caret (^) and dollar sign ($) used for this are called anchor characters:

If a caret (^) is at the beginning of the entire regular expression, it matches the beginning of a line.

If a dollar sign ($) is at the end of the entire regular expression, it matches the end of a line.

If an entire regular expression is enclosed by a caret and dollar sign (^like this$), it matches an entire line.

This expression...   matches this...   but not this...
^(the cat).+         the cat runs      see the cat run
.+(the cat)$         watch the cat     the cat eats

So, to match all strings containing just one character, use "^.$".
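A minimal sketch of the anchors, assuming Python's re module behaves like MT-NewsWatcher's engine for these simple single-line fields. The helper found is hypothetical.

```python
import re

def found(pattern, text):
    """True if the pattern matches anywhere in text (hypothetical helper)."""
    return re.search(pattern, text) is not None

starts_with = {text: found(r"^(the cat).+", text)
               for text in ["the cat runs", "see the cat run"]}
ends_with = {text: found(r".+(the cat)$", text)
             for text in ["watch the cat", "the cat eats"]}
# Anchored at both ends, the pattern must account for the entire field.
single_char = {text: found(r"^.$", text) for text in ["x", "xy", ""]}

print(starts_with)
print(ends_with)
print(single_char)
```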


RegExp Extensions

Matching words

You can specify that a regular expression match parts of words with \< (match the start of a word) and \> (match the end of a word). An expression like "\<app" will match "apple" and "application", while "ing\>" will match all words ending in -ing. To match a whole word, use an expression like "\<this\>".

MT-NewsWatcher provides facilities for doing word matches (which use these expressions internally), but if you want more flexibility, these come in useful. For example, you might want

M.*\<Excel\>

to match MS Excel, Microsoft Excel, Microsquish Excel etc. To remind you, the .* means 'zero or more (*) of any character (.)'.
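The word-matching idea can be sketched in Python, with one substitution that should be stated plainly: Python's re module does not accept \< and \>, so the sketch uses \b, which matches at either edge of a word. That swap is an assumption made for illustration, not a description of how MT-NewsWatcher handles these expressions. The sample strings are made up.

```python
import re

# \b (any word boundary) stands in for both \< and \> here.
excel = re.compile(r"M.*\bExcel\b")
results = {text: excel.search(text) is not None
           for text in ["MS Excel", "Microsoft Excel", "Microsquish Excel",
                        "My Excellent idea", "MSExcel"]}

# Words starting with "app", and words ending in "ing".
starts_app = re.findall(r"\bapp\w*", "apple application snapped")
ends_ing = re.findall(r"\w*ing\b", "singing rings ringing")

print(results)
print(starts_app)
print(ends_ing)
```

The last two subjects in the first test fail because "Excel" is not a whole word in either of them.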

Alternatives

You can define an expression like (cash|money) to match strings which contain either the word 'cash', or the word 'money', or both. Note that the parentheses around the expression are required.
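A sketch of alternatives in Python (an assumed stand-in for MT-NewsWatcher's engine; the subject lines are invented). The second half illustrates one reason the parentheses matter once the alternatives sit inside a longer expression: without them, the | splits the entire pattern in two.

```python
import re

cash_or_money = re.compile(r"(cash|money)", re.IGNORECASE)
results = {text: cash_or_money.search(text) is not None
           for text in ["Make MONEY fast", "Cash prizes", "Monetary policy"]}

# With parentheses: "free" followed by one of the two words.
grouped = re.search(r"free (cash|money)", "money talks") is not None
# Without parentheses: either "free cash", or "money" on its own.
ungrouped = re.search(r"free cash|money", "money talks") is not None

print(results)
print(grouped, ungrouped)
```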


RegExp Examples

Here are some example regular expressions that create filters useful for a variety of common situations.

Examples

Kill if 'subject' matches the reg. exp. "(cash|money)"

This kills articles with 'cash' or 'money' in the subject. This should be a case-insensitive match.

Kill if 'subject' matches the reg. exp. "^\[?F.?S.?"

This kills 'For Sale' articles, which have a subject line that starts ('^') with either FS, F.S., [FS] or [F.S.]. Here the '[' needs to be escaped to '\[', and the '?' means 'match zero or one instance of'.
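The For Sale filter can be tried against a few made-up subject lines. This is a sketch in Python, assumed to behave like MT-NewsWatcher's engine for this pattern. Because the periods in the original expression are unescaped, they match any character, so a subject such as "FAST modem" is also caught. The stricter variant below, with the periods escaped, is a suggestion for comparison and does not come from the original filter.

```python
import re

original = re.compile(r"^\[?F.?S.?")
stricter = re.compile(r"^\[?F\.?S\.?")  # hypothetical variant: literal periods only

subjects = ["FS: PowerBook", "F.S. PowerBook", "[FS] PowerBook",
            "[F.S.] PowerBook", "WTB: FS list", "FAST modem"]

original_hits = {s: original.search(s) is not None for s in subjects}
stricter_hits = {s: stricter.search(s) is not None for s in subjects}

print(original_hits)
print(stricter_hits)
```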

Kill if 'subject' matches the reg. exp. "[[$%|_\*!][[$%|_\*!][[$%|_\*!]"

This is a nifty one that kills those posts with subjects like "$$$blah blah" or "_______this..." which are almost surely not worth reading. The regular expression reads like this. The character class [[$%|_\*!] matches any one of the characters listed inside the outer square brackets. ([ is normally interpreted as starting a group like this unless it is the first character after a [, hence its position here.) The class is written three times in a row, so the expression matches three consecutive characters from the list, as in $_* or *** or !_!. You could prepend a ^ to force the match at the beginning of the line.
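A sketch of the same filter in Python. Two liberties are taken and should be read as assumptions: the [ inside the class is escaped as \[ (Python warns about an unescaped [ there), and the three copies of the class are written once with a {3} repeat count, which is standard in Perl-compatible expressions but is not used in the original. The subject lines are invented.

```python
import re

# One class of "junk" punctuation, required three times in a row.
junk_run = re.compile(r"[\[$%|_*!]{3}")

subjects = ["$$$blah blah", "_______this...", "!_! wow",
            "Price: $5 off", "**bold** claim"]
hits = {s: junk_run.search(s) is not None for s in subjects}

print(hits)
```

The last two subjects pass through because neither contains three listed characters in a row.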

Kill if 'Xref' matches the reg. exp. "[^ ]+ [^ ]+ [^ ]+ [^ ]+"

This kills articles which have been cross-posted to four or more groups, and works by looking for runs of non-space characters (the [^ ]) separated by spaces.
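The counting logic can be sketched as follows, in Python as an assumed stand-in. The field values are hypothetical: exactly what MT-NewsWatcher places in the field it matches against (for instance, whether a server name comes first) is not stated here, so the samples simply contain three, four and five space-separated entries.

```python
import re

four_runs = re.compile(r"[^ ]+ [^ ]+ [^ ]+ [^ ]+")

# Hypothetical field contents with 3, 4 and 5 entries.
fields = [
    "alt.a:1 alt.b:2 alt.c:3",
    "alt.a:1 alt.b:2 alt.c:3 alt.d:4",
    "alt.a:1 alt.b:2 alt.c:3 alt.d:4 alt.e:5",
]
hits = {f: four_runs.search(f) is not None for f in fields}

for field, hit in hits.items():
    print(hit, field)
```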

Hilite if 'subject' matches the reg. exp. "News ?Watcher" (ignore case)

This will match "NewsWatcher", "MT-NewsWatcher", "News Watcher", "news Watcher" and so on. The '?' means match zero or one space.

Hilite if 'subject' matches the reg. exp. "Kaleid[aeio]scope"

This will match "Kaleidoscope", as well as all the misspellings that are common, the [] meaning match any of the alternatives within the square brackets.

Hilite if 'subject' matches the reg. exp. "^\[?A[Nn][Nn]"

This is useful for catching announcement posts, where the subject line starts with [ANN] or Ann or [Ann. The first "^" forces a match at the beginning of the line. Then it looks for zero or one (the meaning of the "?") "[" characters, but since this is a reserved character, it has to be escaped to "\[". Then we look for an "A", followed by two characters, each of which is either "N" or "n".
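A quick check of the announcement filter against invented subject lines, using Python as an assumed stand-in for MT-NewsWatcher's engine. The last sample shows a side effect worth knowing about: any subject that begins with "Ann", such as "Annual report", is highlighted as well.

```python
import re

announce = re.compile(r"^\[?A[Nn][Nn]")

subjects = ["[ANN] Widget 2.0", "Ann: Widget 2.0", "[Ann Widget 2.0",
            "Re: [ANN] Widget 2.0", "Another thing", "Annual report"]
hits = {s: announce.search(s) is not None for s in subjects}

print(hits)
```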

Filtering Junk Messages

Read on to find out how to use filters to kill junk messages.
