Regular expressions - An introduction

Regular expressions are a pattern matching standard for string parsing and replacement. They are used on a wide range of platforms and programming environments. Originally missing in Visual Basic, regular expressions are now available for most VB and VBA versions.

Regular expressions, or regexes for short, are a way to match text with patterns. They are a powerful way to find and replace strings that take a defined format. For example, regular expressions can be used to parse dates, URLs and email addresses, log files, configuration files, command line switches or programming scripts.

Since regexes are language independent, this article tries to stay as language independent as possible. Note, however, that not all regex implementations are the same. The text below is based on Perl 5.0, which is also the format that RegExpr for VB/VBA uses. Other implementations may not handle all expressions the same way.

Regex syntax

In its simplest form, a regular expression is a string of symbols to match "as is".

Regex   Matches
abc     abcabcabc
234     12345

That's not very impressive yet. But you can see that regexes match the first case found, once, anywhere in the input string.
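As a minimal sketch of the "first case found, anywhere in the input" behaviour, here is how the two examples above might look in Python's re module. Python is only an assumption made for illustration here; the article itself is language independent, and other engines expose the same idea through different calls.

```python
import re

# re.search scans the whole input and returns the first (leftmost) match.
m = re.search(r"abc", "abcabcabc")
first_span = m.span()        # start and end position of the first match

# The match does not have to start at the beginning of the input.
m2 = re.search(r"234", "12345")
found_text = m2.group(0)     # the text that was matched
found_at = m2.start()        # where in the input it was found

print(first_span, found_text, found_at)
```

Only the first occurrence of abc is reported, even though the input contains three of them.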

Quantifiers

So what if you want to match several characters? You need to use a quantifier. The most important quantifiers are *, ? and +. They may look familiar to you from, say, the dir command of DOS, but they're not exactly the same.
* matches any number of what's before it, from zero to infinity.
? matches zero or one.
+ matches one or more.

Regex   Matches
23*4    1245, 12345, 123345
23?4    1245, 12345
23+4    12345, 123345
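The quantifier table can be checked with a short script. This is a sketch in Python (an assumption for illustration; the helper name matching_inputs is made up for this example), testing each regex against the same three inputs.

```python
import re

inputs = ["1245", "12345", "123345"]

def matching_inputs(pattern):
    """Return the inputs in which the pattern is found somewhere."""
    return [s for s in inputs if re.search(pattern, s)]

star = matching_inputs(r"23*4")   # zero or more 3's between 2 and 4
opt = matching_inputs(r"23?4")    # zero or one 3 between 2 and 4
plus = matching_inputs(r"23+4")   # one or more 3's between 2 and 4

print(star)
print(opt)
print(plus)
```

Note that the quantifier applies only to the character right before it (the 3), not to the whole string 23.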

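By default, quantifiers are greedy: they take as many characters as possible. In the next example, the regex matches all four 2's rather than stopping after the first one. Note that 2* would not do the same here: because it also accepts zero characters, it is already satisfied by the empty string at the very start of the input.

Regex   Matches
2+      122223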

There is also stingy (non-greedy) matching available, which matches as few characters as possible, but we'll leave it for another time. There are also more quantifiers than those mentioned here.
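For the curious, the difference between greedy and stingy matching can be sketched in a few lines. The example assumes Python's re module, where a ? placed after a quantifier makes it stingy; Perl uses the same notation, but other implementations may differ.

```python
import re

text = "122223"

# Greedy: + takes as many 2's as it can.
greedy = re.search(r"2+", text).group(0)

# Stingy: +? takes as few 2's as it can, which is one.
stingy = re.search(r"2+?", text).group(0)

# A bare 2* accepts zero characters, so the leftmost match is the
# empty string in front of the 1.
empty = re.search(r"2*", text).group(0)

print(repr(greedy), repr(stingy), repr(empty))
```

The last line is a reminder that * can succeed without consuming anything, which is easy to overlook when a regex is not anchored.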

Special characters

A lot of special characters are available for regex building. Here are some of the more usual ones.

.     The dot matches any single character.
\n    Matches a newline character (or CR+LF combination).
\t    Matches a tab (ASCII 9).
\d    Matches a digit [0-9].
\D    Matches a non-digit.
\w    Matches an alphanumeric character (letter, digit or underscore).
\W    Matches a non-alphanumeric character.
\s    Matches a whitespace character.
\S    Matches a non-whitespace character.
\     Use \ to escape special characters. For example, \. matches a dot, and \\ matches a backslash.
^     Match at the beginning of the input string.
$     Match at the end of the input string.

Here are some likely uses for the special characters.

Regex      Matches
1.3        123, 1z3, 133
1.*3       13, 123, 1zdfkj3
\d\d       01, 02, 99, ..
\w+@\w+    a@a, email@company.com

^ and $ are important to regexes. Without them, regexes match anywhere in the input. With ^ and $ you can make sure to match only a full string, the beginning of the input, or the end of the input.

Regex     Matches             Does not match
^1.*3$    13, 123, 1zdfkj3    x13, 123x, x1zdfkj3x
^\d\d     01abc               a01abc
\d\d$     xyz01               xyz01x
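The effect of ^ and $ is easiest to see by running the same inputs with and without them. The following is a sketch in Python (chosen for illustration only; the helper found is a made-up name), using rows from the table above.

```python
import re

def found(pattern, text):
    """True if the pattern matches somewhere in text."""
    return re.search(pattern, text) is not None

# Without anchors, the pattern may match anywhere in the input.
unanchored = found(r"1.*3", "x1zdfkj3x")

# With ^ and $, the whole input has to fit the pattern.
whole_ok = found(r"^1.*3$", "1zdfkj3")
whole_bad = found(r"^1.*3$", "x1zdfkj3x")

# ^ alone ties the match to the start, $ alone to the end.
starts_ok = found(r"^\d\d", "01abc")
starts_bad = found(r"^\d\d", "a01abc")
ends_ok = found(r"\d\d$", "xyz01")
ends_bad = found(r"\d\d$", "xyz01x")

print(unanchored, whole_ok, whole_bad, starts_ok, starts_bad, ends_ok, ends_bad)
```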

Character classes

You can group characters by putting them between square brackets. This way, any character in the class will match one character in the input.

[abc]     Match any of a, b, and c.
[a-z]     Match any character between a and z. (ASCII order)
[^abc]    A caret ^ at the beginning indicates "not". In this case, match anything other than a, b, or c.
[+*?.]    Most special characters have no meaning inside the square brackets. This expression matches any of +, *, ? or the dot.

Here are some sample uses.

Regex                    Matches                 Does not match
[^ab]                    c, d, z                 ab
^[1-9][0-9]*$            Any positive integer    Zero, negative or decimal numbers
^[0-9]*[,.]?[0-9]+$      .1, 1, 1.2, 100,000     12.
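Here is a small sketch of character classes in action, again assuming Python's re module for illustration. The number pattern is anchored with ^ and $ in the code so that an input such as 12. is rejected as a whole instead of being matched partially.

```python
import re

# Positive integers: first digit 1-9, then any number of digits 0-9.
positive_int = re.compile(r"^[1-9][0-9]*$")
int_results = {s: bool(positive_int.search(s))
               for s in ["7", "120", "0", "-5", "1.5"]}

# Numbers with an optional comma or dot, tested against the whole input.
number = re.compile(r"^[0-9]*[,.]?[0-9]+$")
number_results = {s: bool(number.search(s))
                  for s in [".1", "1", "1.2", "100,000", "12."]}

# Inside square brackets, + * ? and the dot are ordinary characters.
punctuation = re.findall(r"[+*?.]", "a+b*c?d.e")

# A leading caret negates the class: everything except a and b.
not_ab = re.findall(r"[^ab]", "abcabd")

print(int_results, number_results, punctuation, not_ab)
```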

Grouping and alternatives

It's often necessary to group things together with parentheses ( and ).

Regex         Matches               Does not match
^(ab)+$       ab, abab, ababab      aabb
^(aa|bb)+$    aa, bbaa, aabbaaaa    abab
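Grouping and alternation can be tried out with the sketch below. It assumes Python's re module and uses re.fullmatch, which requires the whole input to fit the pattern, much like wrapping the pattern in ^ and $. The helper name whole is made up for this example.

```python
import re

def whole(pattern, text):
    """True if the entire text fits the pattern."""
    return re.fullmatch(pattern, text) is not None

# The + applies to the whole group, so ab may repeat.
group_ok = whole(r"(ab)+", "ababab")
group_bad = whole(r"(ab)+", "aabb")

# | picks either alternative on each repetition.
alt_ok = whole(r"(aa|bb)+", "aabbaaaa")
alt_bad = whole(r"(aa|bb)+", "abab")

# An unanchored search still finds the ab hidden inside aabb.
inner = re.search(r"(ab)+", "aabb").group(0)

print(group_ok, group_bad, alt_ok, alt_bad, inner)
```

The last line shows why anchoring matters when a table row says "does not match": without anchors, (ab)+ happily finds a match in the middle of aabb.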

Notice the | operator. This is the Or operator that takes any of the alternatives.

With parentheses, you can also define subexpressions to remember after the match has happened. In the examples below, the part of the input that is matched by what is between ( and ) gets stored.

Regex            Matches    Stores
a(\d+)a          a12a       12
(\d+)\.(\d+)     1.2        1 and 2

In these examples, what is matched by (\d+) gets stored. The regex engine will allow you to retrieve the stored value by a successive call. The implementation of the call varies. In RegExpr for VB/VBA, you call RegExprResult(1) to get the first stored value, RegExprResult(2) to get the second one, and so on. This way you can retrieve fields for further processing.
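To illustrate how stored values are retrieved, here is a sketch in Python (an assumption for illustration; it is not the RegExpr API). In Python's re module the match object's group(n) method plays roughly the role that the text describes for RegExprResult(n).

```python
import re

m = re.search(r"a(\d+)a", "a12a")
whole_match = m.group(0)   # everything the regex matched
stored = m.group(1)        # what the first pair of parentheses matched

m2 = re.search(r"(\d+)\.(\d+)", "1.2")
integer_part = m2.group(1)
decimal_part = m2.group(2)
all_parts = m2.groups()    # all stored values at once, as a tuple

print(whole_match, stored, integer_part, decimal_part, all_parts)
```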

Case sensitivity

So are regexes case sensitive? That depends on how you write the regex call in the programming language. Most implementations match case-sensitively by default and offer an option for case-insensitive matching. Refer to the documentation of your programming language or regex implementation on how to write the calls.
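As one example of how the call decides case sensitivity, here is a sketch in Python, where the re.IGNORECASE flag switches to case-insensitive matching. Other languages and libraries use different options, so treat this as an illustration rather than a general rule.

```python
import re

text = "2002-Nov-14"

# Default call: case sensitive, so lower-case nov is not found.
sensitive = re.search(r"nov", text)

# Same regex with the IGNORECASE flag: now Nov is found.
insensitive = re.search(r"nov", text, re.IGNORECASE)
matched_text = insensitive.group(0)

print(sensitive, matched_text)
```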

Advanced syntax

The above is in no way a complete description of regexes. There are more ways to write them, more special characters, and more quantifiers available. What's available depends also on the implementation. Some regex engines don't implement all of the possibilities, rendering them not so usable for every purpose. In case you're interested in learning a more complete set of regexes, see the help file of RegExpr for VB/VBA. It's available for free download.

Regex examples

Here are a few practical examples of regular expressions. They are provided for learning purposes. In real applications, you should carefully design your regexes to match the exact use.

Email matching

It's often necessary to check if a string is an email address or not. Here's one way to do it.

^[A-Za-z0-9_\.-]+@[A-Za-z0-9_\.-]+[A-Za-z0-9_][A-Za-z0-9_]$

Explanation

^[A-Za-z0-9_\.-]+             Match one or more acceptable characters at the start of the string.
@                             Match the @ sign.
[A-Za-z0-9_\.-]+              Match any domain name, including a dot.
[A-Za-z0-9_][A-Za-z0-9_]$     Match two acceptable characters but not a dot. This ensures that the address does not end in a dot or a dash, as in .xx, .xxx, .xxxx etc. Note that the regex does not actually require a dot in the domain part.

This example works for most cases but is not written based on any standard. It may accept non-working email addresses and reject working ones. Fine-tuning is required.
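The email regex can be exercised with a few sample strings. This is a sketch in Python (chosen for illustration; looks_like_email is a made-up helper name), and the sample addresses are invented for the test.

```python
import re

EMAIL_RE = re.compile(
    r"^[A-Za-z0-9_\.-]+@[A-Za-z0-9_\.-]+[A-Za-z0-9_][A-Za-z0-9_]$"
)

def looks_like_email(text):
    """True if text fits the email pattern from the article."""
    return EMAIL_RE.search(text) is not None

ok = looks_like_email("email@company.com")
no_at = looks_like_email("no-at-sign.com")
trailing_dot = looks_like_email("user@host.")

# A limitation: the pattern never insists on a dot in the domain.
no_dot_in_domain = looks_like_email("a@bcd")

print(ok, no_at, trailing_dot, no_dot_in_domain)
```

The last case is accepted even though it has no top-level domain, which is one of the places where fine-tuning would be needed.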

Parsing dates

Date strings are difficult to parse because there are so many variations. You can't always trust VB's own date conversion functions as the date may come in an unexpected or unsupported format. Let's assume we have a date string in the following format: 2002-Nov-14.

^\d\d\d\d-[A-Z][a-z][a-z]-\d\d$

Explanation

^\d\d\d\d          Match four digits that make up the year.
-                  Match the separator dash.
[A-Z][a-z][a-z]    Match a 3-letter month name. The first letter is in upper case.
-                  Match the separator dash.
\d\d$              Match two digits that make up the day.

If a match is found, you can be sure that the input string is formatted like a date. However, this regex is not able to verify that it's a real date. For example, it could as well be 5400-Qui-32. This doesn't look like an acceptable date to most applications. If you want to prepare yourself for the stranger dates, you'll have to write a more restrictive expression:

^20\d\d-(Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)-(0[1-9]|[1-2][0-9]|3[01])$

Explanation

^20\d\d                                              Match four digits that make up the year. The year must be between 2000 and 2099. No other dates please!
-                                                    Match the separator dash.
(Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)    Match the month abbreviation in English. Now you don't accept the date in any other language.
-                                                    Match the separator dash.
(0[1-9]|[1-2][0-9]|3[01])$                           Match two digits that make up the day. This accepts numbers from 01 to 09, 10 to 29 and 30 to 31. What if the user gives 2003-Feb-31? There are limitations to what regexes can do. If you want to validate the string further, you need to use other techniques than regexes.
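One way to combine the regex with "other techniques" is sketched below in Python (an assumption for illustration). The regex checks the format, and the standard datetime.date constructor then rejects days that do not exist in the calendar. Compared with the regex above, an extra pair of parentheses has been added around the year so that it can be retrieved; parse_date and MONTHS are made-up names.

```python
import re
from datetime import date

MONTHS = ["Jan", "Feb", "Mar", "Apr", "May", "Jun",
          "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"]

# Same pattern as in the article, with the year captured as group 1.
DATE_RE = re.compile(
    r"^(20\d\d)-(Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)"
    r"-(0[1-9]|[1-2][0-9]|3[01])$"
)

def parse_date(text):
    """Return a date for a valid YYYY-Mon-DD string, otherwise None."""
    m = DATE_RE.search(text)
    if m is None:
        return None                      # wrong format
    year = int(m.group(1))
    month = MONTHS.index(m.group(2)) + 1
    day = int(m.group(3))
    try:
        return date(year, month, day)    # raises ValueError for Feb 31 etc.
    except ValueError:
        return None

print(parse_date("2002-Nov-14"))
print(parse_date("2003-Feb-31"))
print(parse_date("5400-Qui-32"))
```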

Web logs

Web server logs come in several formats. This is a typical line in a log file.

144.18.39.44 - - [01/Sep/2002:00:03:20 -0700] "GET /resources.html HTTP/1.1" 200 3458 "http://www.aivosto.com/" "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)"

The line consists of several fields, each with its own format. A human-readable way to describe the fields is:

host - - [date] "GET URL HTTP/1.1" status size "ref" "agent"

The fields are host (visitor's Internet address), date and time (enclosed in square brackets), an HTTP request with the file to retrieve (enclosed in quotation marks), numeric status code, numeric size of file, referer field (enclosed in quotation marks), and agent (browser) name (enclosed in quotation marks).

To programmatically parse the line, you need to split it into fields, then look at each field. This is a sample regex that will split the fields.

^(\S*) - - \[(.*) .....\] \"....? (\S*) .*\" (\d*) ([-0-9]*) (\"([^"]+)\")?

Explanation

^(\S*)                Match any number of non-space characters at the start of the line.
- -                   Match the two dashes. They are actually empty fields that might have content in another log file.
\[(.*) .....\]        Match the date inside square brackets. The date consists of a datetime string, a space, and a 5-character time zone indication. To actually use the date you'd need to write a more detailed regex to separate the year, month, day, hour, minute, and second.
\"....? (\S*) .*\"    Match the HTTP request inside quotation marks. First there is a 3 to 4-character verb, such as GET, POST or HEAD. (\S*) matches the actual file that is being retrieved. At the end, .* matches HTTP/1.1 or whatever protocol was used to retrieve the file.
(\d*)                 Match a numeric status code.
([-0-9]*)             Match a numeric file size, or - if no number is present.
(\"([^"]+)\")?        Match the "ref" field. It's anything enclosed in quotation marks.

In this example, we've left "agent" unmatched. That does no harm because $ is not used to match the end-of-line. We can leave "agent" unmatched if we're not interested in the field.

This example has been taken from a web log file parser script. To use it in your own code, you have to fine-tune it to suit your log file format. The regex assumes that lines come only in the presented format. If, say, a field is missing or the file contains garbage lines, the regex may fail. This results in an unparsed line.
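As a rough sketch of how the regex might be used in code, the following Python example (Python is an assumption for illustration, and this is not the original parser script) applies it to the sample line and puts the stored values into a dictionary. The function name parse_line and the dictionary keys are made up for this example.

```python
import re

# The regex from the article, written as a raw string.
LOG_RE = re.compile(
    r'^(\S*) - - \[(.*) .....\] \"....? (\S*) .*\" (\d*) ([-0-9]*) (\"([^"]+)\")?'
)

def parse_line(line):
    """Split one log line into fields, or return None if it does not fit."""
    m = LOG_RE.search(line)
    if m is None:
        return None                      # unparsed line
    # Group 6 is the ref field with its quotation marks, group 7 without.
    host, when, url, status, size, _quoted_ref, ref = m.groups()
    return {"host": host, "date": when, "url": url,
            "status": status, "size": size, "ref": ref}

sample = ('144.18.39.44 - - [01/Sep/2002:00:03:20 -0700] '
          '"GET /resources.html HTTP/1.1" 200 3458 '
          '"http://www.aivosto.com/" '
          '"Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)"')

fields = parse_line(sample)
print(fields)
print(parse_line("garbage"))
```

All values come back as strings; converting the status and size to numbers, and handling a size of -, is left to the caller.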

Regular expressions in Visual Basic

Earlier Visual Basic versions (from 1.0 to 6.0) didn't come with regular expressions. Neither did VBA. The .NET framework has regular expressions available.

For non-.NET programming, VB developers have to use a 3rd-party solution. Aivosto RegExpr is a solution that adds comprehensive support for regular expressions. Available as a pure source code module, it is an ideal way to add regexes to Visual Basic 5.0, 6.0 and VBA without any additional run-time file requirements.

©Aivosto Oy