Regular expressions - An introduction

Regular expressions are a pattern matching standard for string parsing and replacement. They are used on a wide range of platforms and programming environments. Originally missing in Visual Basic, regular expressions are now available for most VB and VBA versions.

Regular expressions, or regexes for short, are a way to match text with patterns. They are a powerful way to find and replace strings that take a defined format. For example, regular expressions can be used to parse dates, URLs and email addresses, log files, configuration files, command line switches or programming scripts.

Because regexes are language independent, this article stays as language independent as possible. Note, however, that not all regex implementations are the same. The text below is based on the regex syntax of Perl 5.0, which is also the format that RegExpr for VB/VBA uses. Some implementations may not handle all expressions the same way.

Regex syntax

In its simplest form, a regular expression is a string of symbols to match "as is".

Regex    Matches
abc      abcabcabc
234      12345

That's not very impressive yet. But you can see that regexes match the first case found, once, anywhere in the input string.
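
As a minimal sketch of "first match, anywhere in the input", the two rows above can be tried with Python's re module. Python is an assumption of this example, not something the article prescribes; its regex syntax follows the Perl style closely enough for these patterns.

```python
import re

# re.search scans the input and returns the first match found anywhere in it.
first_abc = re.search(r"abc", "abcabcabc").span()   # only the first "abc"
first_234 = re.search(r"234", "12345").span()       # found in the middle

# A literal pattern that does not occur in the input gives no match at all.
no_match = re.search(r"234", "1245")

print(first_abc, first_234, no_match)  # (0, 3) (1, 4) None
```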

Quantifiers

So what if you want to match several characters? You need to use a quantifier. The most important quantifiers are *, ? and +. They may look familiar from the wildcards of, say, the dir command on the Windows command line, but they don't work in exactly the same way.

*    Match zero or more times, as many as possible.
+    Match one or more times, as many as possible.
?    Match one if possible, none if not possible.

Quantifiers take the preceding character as argument and attempt to match it zero, one or more times. So, x* will match zero or more x's, x+ will match one or more x's and x? will match exactly one x or none at all. Here are some examples:

Regex    Matches
23*4     1245, 12345, 123345, 1233345
23+4     12345, 123345, 1233345
23?4     1245, 12345
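
The quantifier examples can be checked with a short script. This is a sketch in Python's re module (an assumption, since the article is not tied to a language); the helper name matching is made up for the illustration.

```python
import re

inputs = ["1245", "12345", "123345", "1233345"]

def matching(pattern):
    """Return the inputs in which the pattern is found (hypothetical helper)."""
    return [s for s in inputs if re.search(pattern, s)]

star = matching(r"23*4")      # 2, then zero or more 3's, then 4
plus = matching(r"23+4")      # 2, then one or more 3's, then 4
optional = matching(r"23?4")  # 2, then at most one 3, then 4

print(star)      # all four inputs
print(plus)      # everything except 1245
print(optional)  # only 1245 and 12345
```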

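By default, regexes are greedy. They take as many characters as possible. In the next example, you can see that the regex matches as many 2's as there are. (The example uses + rather than *, because 2* is also satisfied by zero 2's and would therefore succeed with an empty match at the very start of the input.)

Regex    Matches
2+       122223 (the matched part is 2222)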
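
Greedy behaviour is easy to observe by printing what a quantified item actually captured. The sketch below assumes Python's re module and uses the + quantifier, so that at least one character has to be present.

```python
import re

# Greedy: + takes every consecutive 2 it can, not just the first one.
greedy = re.search(r"2+", "122223").group()

# The same holds for any quantified item, for example a run of digits.
digits = re.search(r"\d+", "abc 2002 def").group()

print(greedy, digits)  # 2222 2002
```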

There is also stingy (non-greedy) matching, which matches as few characters as possible, but we'll leave it aside this time. There are also more quantifiers than those mentioned, but this introduction doesn't go any deeper into them.

Special characters

A lot of special characters are available for regex building. Here are some of the more common ones.

.     The dot matches any single character. In Perl, a newline is excluded by default.
\n    Matches a newline character (in some implementations, also a CR+LF combination).
\t    Matches a tab (ASCII 9).
\d    Matches a digit [0-9].
\D    Matches a non-digit.
\w    Matches an alphanumeric character or an underscore.
\W    Matches any character that \w doesn't match.
\s    Matches a whitespace character.
\S    Matches a non-whitespace character.
\     Use \ to escape special characters. For example, \. matches a dot and \\ matches a backslash.
^     Match at the beginning of the input string.
$     Match at the end of the input string.

Here are some likely uses for the special characters.

Regex      Matches
1.3        123, 1z3, 133
1.*3       13, 123, 1zdfkj3
\d\d       01, 02, 99, ..
\w+@\w+    a@a, email@company.com
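
The rows above can be verified in a few lines. This sketch assumes Python's re module. It also prints which part of the longer address \w+@\w+ really covers, since \w doesn't match a dot.

```python
import re

# Each pattern from the table with the inputs it is said to match.
checks = {
    r"1.3": ["123", "1z3", "133"],
    r"1.*3": ["13", "123", "1zdfkj3"],
    r"\d\d": ["01", "02", "99"],
    r"\w+@\w+": ["a@a", "email@company.com"],
}
results = {
    pattern: [re.search(pattern, s) is not None for s in samples]
    for pattern, samples in checks.items()
}

# The match stops at the dot because \w covers letters, digits and underscore only.
partial = re.search(r"\w+@\w+", "email@company.com").group()

print(results)
print(partial)  # email@company
```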

^ and $ are important to regexes. Without them, regexes match anywhere in the input. With ^ and $ you can make sure to match only a full string, the beginning of the input, or the end of the input.

Regex     Matches             Doesn't match
^1.*3$    13, 123, 1zdfkj3    x13, 123x, x1zdfkj3x
^\d\d     01abc               a01abc
\d\d$     xyz01               xyz01x
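
Here is the same table as a runnable sketch, assuming Python's re module. The helper name found is hypothetical.

```python
import re

def found(pattern, text):
    """True if the pattern matches somewhere in text (hypothetical helper)."""
    return re.search(pattern, text) is not None

# ^ and $ together: the whole input has to fit the pattern.
whole = [found(r"^1.*3$", s)
         for s in ["13", "123", "1zdfkj3", "x13", "123x", "x1zdfkj3x"]]

# ^ alone pins the match to the start, $ alone pins it to the end.
at_start = [found(r"^\d\d", "01abc"), found(r"^\d\d", "a01abc")]
at_end = [found(r"\d\d$", "xyz01"), found(r"\d\d$", "xyz01x")]

print(whole)     # [True, True, True, False, False, False]
print(at_start)  # [True, False]
print(at_end)    # [True, False]
```

One implementation detail to be aware of: in Python, $ also matches just before a newline at the very end of the string. Other engines may differ in such corner cases.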

Character classes

You can group characters by putting them between square brackets. This way, any character in the class will match one character in the input.

[abc]     Match any of a, b, and c.
[a-z]     Match any character between a and z. (ASCII order)
[^abc]    A caret ^ at the beginning indicates "not". In this case, match anything other than a, b, or c.
[+*?.]    Most special characters have no meaning inside the square brackets. This expression matches any of +, *, ? or the dot.

Here are some sample uses.

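Regex                  Matches                 Doesn't match
[^ab]                  c, d, z                 ab
^[1-9][0-9]*$          Any positive integer    Zero, negative or decimal numbers
^[0-9]*[,.]?[0-9]+$    .1, 1, 1.2, 100,000     12.

The last expression is anchored with ^ and $. Without the anchors it would still find the digits 12 in the input "12." and report a match.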
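
A sketch of the character class examples, assuming Python's re module. The number pattern is written with ^ and $ here so that the whole input has to fit. The last line shows what happens when the anchors are left out.

```python
import re

# [^ab] needs at least one character other than a or b.
not_ab = [re.search(r"[^ab]", s) is not None for s in ["c", "d", "z", "ab"]]

# Positive integers: a first digit 1-9, then any number of digits.
positive_int = re.compile(r"^[1-9][0-9]*$")
ints = [positive_int.search(s) is not None
        for s in ["1", "42", "100", "0", "-5", "1.5"]]

# Optional digits, an optional comma or dot, then at least one digit.
number = re.compile(r"^[0-9]*[,.]?[0-9]+$")
nums = [number.search(s) is not None
        for s in [".1", "1", "1.2", "100,000", "12."]]

# Unanchored, the same pattern still finds a partial match inside "12."
unanchored = re.search(r"[0-9]*[,.]?[0-9]+", "12.").group()

print(not_ab)      # [True, True, True, False]
print(ints)        # [True, True, True, False, False, False]
print(nums)        # [True, True, True, True, False]
print(unanchored)  # 12
```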

Grouping and alternatives

It's often necessary to group things together with parentheses ( and ).

Regex       Matches               Doesn't match
(ab)+       ab, abab, ababab      aa bb
(aa|bb)+    aa, bbaa, aabbaaaa    abab

Notice the | operator. This is the Or operator that takes any of the alternatives.
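
The effect of the parentheses is easiest to see by comparing a grouped and an ungrouped quantifier. The following is a sketch in Python's re module, which is an assumption of this example.

```python
import re

# With parentheses, + repeats the whole group "ab".
group_hits = [re.search(r"(ab)+", s) is not None
              for s in ["ab", "abab", "ababab", "aa bb"]]

# | picks either alternative on each repetition.
alt_hits = [re.search(r"(aa|bb)+", s) is not None
            for s in ["aa", "bbaa", "aabbaaaa", "abab"]]

# Without parentheses, + applies to the preceding character only.
grouped = re.search(r"(ab)+", "abbb").group()
ungrouped = re.search(r"ab+", "abbb").group()

print(group_hits)          # [True, True, True, False]
print(alt_hits)            # [True, True, True, False]
print(grouped, ungrouped)  # ab abbb
```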

With parentheses, you can also define subexpressions to remember after the match has happened. In the example below, the part of the string that matches between the parentheses (…) gets stored.

Regex           Matches    Stores
a(\d+)a         a12a       12
(\d+)\.(\d+)    1.2        1 and 2

In these examples, what was matched by (\d+) got stored. The regex engine allows you to retrieve the stored value with a subsequent call. The implementation of the call varies. In RegExpr for VB/VBA, you call RegExprResult(1) to get the first stored value, RegExprResult(2) to get the second one, and so on. In languages such as Perl, you refer to the stored values with special variables such as $1 and $2 instead. This way you can retrieve fields for further processing.
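
As one more illustration of how the retrieval call varies, here is a sketch in Python, where the stored values are read from the match object with group(1), group(2) and so on. Python is an assumption of this example. It isn't the RegExpr or Perl interface described above.

```python
import re

# One stored subexpression: the digits between the two a's.
m = re.search(r"a(\d+)a", "a12a")
stored = m.group(1)

# Two stored subexpressions: the digits before and after the dot.
m2 = re.search(r"(\d+)\.(\d+)", "1.2")
first, second = m2.group(1), m2.group(2)

print(stored, first, second)  # 12 1 2
print(m2.groups())            # ('1', '2')
```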

Case sensitivity

So are regexes case sensitive? Yes, by default they are. This means a and A are different characters that won't match each other.

You can also run a case insensitive match. In it, a and A are treated as if they were the same. The way you run a case insensitive match depends on the programming language. Refer to the documentation of your programming language or regex implementation on how to write the calls.
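
For instance, in Python (an assumed environment, used here only for illustration) a case insensitive match is requested with the re.IGNORECASE flag or with the inline (?i) modifier at the start of the pattern.

```python
import re

# Default: case sensitive, so abc does not match ABC.
sensitive = re.search(r"abc", "ABC")

# The same pattern with the ignore-case flag.
insensitive = re.search(r"abc", "ABC", re.IGNORECASE)

# The inline modifier (?i) has the same effect.
inline = re.search(r"(?i)abc", "ABC")

print(sensitive, insensitive.group(), inline.group())  # None ABC ABC
```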

Advanced syntax

The above is in no way a complete description of regexes. There are more ways to write them, more special characters, and more quantifiers available. What's available also depends on the implementation. Some regex engines don't implement all of the possibilities, which makes them less suitable for some purposes. In case you're interested in learning a more complete set of regexes, see the help file of RegExpr for VB/VBA. It's available for free download.

Regex examples

Here are a few practical examples of regular expressions. They are provided for learning purposes. In real applications, you should carefully design your regexes to match the exact use.

Email matching

It's often necessary to check if a string is an email address or not. Here's one way to do it.

^[A-Za-z0-9_\.-]+@[A-Za-z0-9_\.-]+[A-Za-z0-9_][A-Za-z0-9_]$

Explanation:

^[A-Za-z0-9_\.-]+            Match one or more acceptable characters at the start of the string.
@                            Match the @ sign.
[A-Za-z0-9_\.-]+             Match the domain name. Dots are allowed here but not required.
[A-Za-z0-9_][A-Za-z0-9_]$    Match two acceptable characters, neither of which may be a dot or a hyphen. This ensures that the email address ends with at least two such characters, as in .xx, .xxx, .xxxx etc.

This example works for most cases but is not written based on any standard. It may accept non-working email addresses and reject working ones. Fine-tuning is required.
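
To get a feeling for what the expression accepts and rejects, it can be tried on a few inputs. The sketch below assumes Python's re module; the function name looks_like_email and the sample addresses other than email@company.com are made up for the illustration.

```python
import re

EMAIL = re.compile(
    r"^[A-Za-z0-9_\.-]+@[A-Za-z0-9_\.-]+[A-Za-z0-9_][A-Za-z0-9_]$"
)

def looks_like_email(text):
    """True if text fits the pattern above (hypothetical helper)."""
    return EMAIL.search(text) is not None

samples = ["email@company.com", "user@example.c", "no-at-sign.com", "a@bcd"]
verdicts = {s: looks_like_email(s) for s in samples}

for sample, ok in verdicts.items():
    print(sample, ok)
```

The last sample is accepted although its domain has no dot at all, which is one example of the fine-tuning the expression would still need.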

Parsing dates

Date strings are difficult to parse because there are so many variations. You can't always trust VB's own date conversion functions as the date may come in an unexpected or unsupported format. Let's assume we have a date string in the following format: 2002-Nov-14.

^\d\d\d\d-[A-Z][a-z][a-z]-\d\d$

Explanation:

^\d\d\d\d          Match four digits that make up the year (2002).
-                  Match the first separator dash.
[A-Z][a-z][a-z]    Match a 3-letter month name (Nov). The first letter must be in upper case.
-                  Match the second separator dash.
\d\d$              Match the two digits that make up the day (14).

If a match is found, you can be sure that the input string is formatted like a date. However, a regex is not able to verify that it's a real date. For example, it could as well be 5400-Qui-32. This doesn't look like an acceptable date to most applications. If you want to prepare yourself for the stranger dates, you'll have to write a more limiting expression:

^20\d\d-(Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)-(0[1-9]|[1-2][0-9]|3[01])$

Explanation:

^20\d\d                                              Match four digits that make up the year. The year must be between 2000 and 2099. No other dates please!
-                                                    Match the first separator dash.
(Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)    Match the month abbreviation in English. Other languages are not accepted here.
-                                                    Match the second separator dash.
(0[1-9]|[1-2][0-9]|3[01])$                           Match the two digits that make up the day. This accepts numbers from 01 to 09, 10 to 29 and 30 to 31.

What if the user gives 2003-Feb-31? There are limitations to what regexes can do. If you want to validate the string further, you need to use other techniques than regexes.
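
One possible way to combine the two steps is sketched below in Python (an assumption; the article doesn't prescribe a language). The regex checks the format, and a calendar-aware date type then checks that the day exists. The names LOOSE, STRICT, MONTHS and is_real_date are hypothetical.

```python
import re
from datetime import date

LOOSE = re.compile(r"^\d\d\d\d-[A-Z][a-z][a-z]-\d\d$")
STRICT = re.compile(
    r"^20\d\d-(Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)"
    r"-(0[1-9]|[1-2][0-9]|3[01])$"
)
MONTHS = ["Jan", "Feb", "Mar", "Apr", "May", "Jun",
          "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"]

def is_real_date(text):
    """Format check with the strict regex, then a calendar check."""
    m = STRICT.search(text)
    if m is None:
        return False
    year = int(text[:4])
    month = MONTHS.index(m.group(1)) + 1
    day = int(m.group(2))
    try:
        date(year, month, day)  # raises ValueError for days that don't exist
    except ValueError:
        return False
    return True

print(LOOSE.search("5400-Qui-32") is not None)   # True: right shape only
print(STRICT.search("5400-Qui-32") is not None)  # False
print(STRICT.search("2003-Feb-31") is not None)  # True: the regex can't know
print(is_real_date("2003-Feb-31"))               # False
print(is_real_date("2002-Nov-14"))               # True
```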

Web logs

Web server logs come in several formats. This is a typical line in a log file.

144.18.39.44 - - [01/Sep/2002:00:03:20 -0700] "GET /resources.html HTTP/1.1" 200 3458 "http://www.aivosto.com/" "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)"

The line contains several fields of different kinds, and together they follow a fairly complex format. Here is a human-readable way to describe the fields:

host - - [date] "GET URL HTTP/1.1" status size "ref" "agent"

The fields are host (visitor's Internet address), date and time (enclosed in square brackets), an HTTP request with the file to retrieve (enclosed in quotation marks), numeric status code, numeric size of file, referer field (enclosed in quotation marks), and agent or browser name (enclosed in quotation marks).

To programmatically parse the line, you need to split it into fields, then look at each field. This is a sample regex that will split the fields.

^(\S*) - - \[(.*) .....\] \"....? (\S*) .*\" (\d*) ([-0-9]*) (\"([^"]+)\")?

Explanation:

^(\S*)                Match any number of non-space characters at the start of the line.
 - -                  Match the two dashes. They are actually empty fields that might have content in another log file.
\[(.*) .....\]        Match the date inside square brackets. The date consists of a datetime string, a space, and a 5-character time zone indication. To actually use the date you'd need to write a more detailed regex to separate the year, month, day, hour, minute, and second.
\"....? (\S*) .*\"    Match the HTTP request inside quotation marks. First there is a 3 to 4-character verb, such as GET, POST or HEAD. (\S*) matches the actual file that is being retrieved. At the end, .* matches HTTP/1.1 or whatever protocol was used to retrieve the file.
(\d*)                 Match a numeric status code.
([-0-9]*)             Match a numeric file size or a dash (-) if no number is present.
(\"([^"]+)\")?        Match the "ref" field. It's anything enclosed in quotation marks.

In this example, we've left "agent" unmatched. That does no harm because $ is not used to match the end-of-line. We can leave "agent" unmatched if we're not interested in the field.

This example has been taken from a web log file parser script. To use it in your own code, you have to fine-tune it to suit your log file format. The regex assumes that lines come only in the presented format. If, say, a field is missing or the file contains garbage lines, the regex may fail. This results in an unparsed line.
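
The following sketch applies the regex to the sample line and picks the stored fields out of the match. It assumes Python's re module and is not the parser script mentioned above. The names LOG_LINE and parse and the dictionary keys are hypothetical.

```python
import re

LOG_LINE = re.compile(
    r'^(\S*) - - \[(.*) .....\] \"....? (\S*) .*\" (\d*) ([-0-9]*) (\"([^"]+)\")?'
)

line = ('144.18.39.44 - - [01/Sep/2002:00:03:20 -0700] '
        '"GET /resources.html HTTP/1.1" 200 3458 '
        '"http://www.aivosto.com/" '
        '"Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)"')

def parse(text):
    """Split one log line into fields, or return None for an unparsed line."""
    m = LOG_LINE.search(text)
    if m is None:
        return None
    # Group 6 is the referer with its quotation marks, group 7 without them.
    host, when, url, status, size, _quoted_ref, ref = m.groups()
    return {"host": host, "date": when, "url": url,
            "status": status, "size": size, "ref": ref}

fields = parse(line)
garbage = parse("not a log line")

print(fields)
print(garbage)  # None
```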

Regular expressions in Visual Basic

Earlier Visual Basic versions (from 1.0 to 6.0) didn't come with regular expressions. Neither did VBA. The .NET framework has regular expressions available.

For non-.NET programming, VB developers have to use a 3rd-party solution. Aivosto RegExpr is a solution that adds comprehensive support for regular expressions. Available as a pure source code module, it is an ideal way to add regexes to Visual Basic 5.0, 6.0 and VBA without any additional run-time file requirements.