Optimize string handling in VB6 - Part II
Make your Visual Basic apps process text fast as lightning.
Part II of this article dives deep into the performance of the VB6 String
functions. We learn the functions to use and the ones to avoid.
We also learn how to call the Windows Unicode API functions and
how to build really, really huge strings without crashing our VB6 apps.
Part I | Part II | Part III
As explained in Part I, you can use many
tricks to make VB6 process strings faster. We are now going deeper into
the details of fast and robust string programming.
VB6 functions in this article:
Asc, AscB, AscW,
Chr$, ChrB$, ChrW$, CDbl, CInt, CStr, InStr, InStrRev, LCase$, Left$,
Len, LenB, LTrim$, Mid$, Replace, Right$, RTrim$, Str$, StrPtr, Trim$,
StrComp, StrConv, UCase$, Val, VarPtr.
Memory layout of VB6 strings
To get some background, let's see how strings are stored in RAM. VB6
stores strings in Unicode format. In COM terminology, a VB String is a
BSTR. A String requires six overhead bytes plus 2 bytes for
each character. Thus, you spend
6 + Len(string)*2 bytes for each string.
The string starts with a 4-byte length prefix for the size of the
string. It's not the character length, though. This 32-bit
integer counts the number of bytes in the string (not
counting the terminating 2 zero bytes). After the length prefix comes
the actual text data, 2 bytes for each character. The last 2 bytes are
zeros, denoting a NULL terminator (a Unicode null character).
Memory layout of a VB6 String
Let's see how a sample string "Aivosto" is stored:
Memory layout of "Aivosto"
NULL character note. It is perfectly valid to store a NULL character
in the string. As the string length is stored in the first 4 bytes, you
can store NULLs in the string data. They will not be treated as the terminating NULL as in C/C++.
BSTR note. BSTR is the COM datatype for a string
pointer. The pointer points to the first character of the datastring,
not to the length prefix.
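The layout described above can be checked from code. The following is a minimal sketch, not part of the original measurements: it assumes the conventional CopyMemory alias for the kernel32 function RtlMoveMemory, and the helper name LengthPrefix is made up for illustration.

```vb
Option Explicit

' Assumed declaration: CopyMemory is the usual VB6 alias for RtlMoveMemory.
Private Declare Sub CopyMemory Lib "kernel32" Alias "RtlMoveMemory" _
    (Destination As Any, Source As Any, ByVal Length As Long)

' Hypothetical helper: reads the 4-byte length prefix that sits
' immediately before the first character of the string data.
Public Function LengthPrefix(ByRef S As String) As Long
    Dim Address As Long
    Dim Prefix As Long
    If StrPtr(S) <> 0 Then              ' a null string has no data area
        Address = StrPtr(S) - 4         ' the prefix precedes the datastring
        CopyMemory Prefix, ByVal Address, 4
    End If
    LengthPrefix = Prefix
End Function

Public Sub DemoLayout()
    Dim S As String
    S = "Aivosto"
    Debug.Print Len(S)                  ' 7 characters
    Debug.Print LenB(S)                 ' 14 bytes of text data
    Debug.Print LengthPrefix(S)         ' 14, the byte count held in the prefix
End Sub
```

For "Aivosto" the prefix holds 14, the same value LenB returns, which is why LenB has so little work to do.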
Performance of simple string functions
We are now going to measure how fast the various built-in string functions of VB6 really are. For this purpose we compiled a little .exe, which called each function 100 million times. The test was run on a typical Pentium 4 processor (2.8 GHz). We found out that Len and LenB are the fastest functions. So we compared the rest of the functions to Len/LenB and put the results in the chart and table below.
Relative performance of VB6 string functions
Len(S) returns the number of characters in string S.
(Note: works differently with UDTs)
LenB(S) returns the number of bytes in S.
AscW(S) returns the Unicode value of the first character in S.
AscB(S) returns the first byte in S.
Asc(S) returns the ANSI value of the first character in S.
ChrW$(U) returns a string containing the Unicode character U.
ChrB$(B) returns a byte string containing the character B.
Chr$(A) returns a string containing the ANSI character A.
Left$(S,x) returns x characters from the start of S.
Right$(S,x) returns x characters from the end of S.
Mid$(S,x,y) returns y characters from S, starting at the xth character.
CStr(x) returns the string representation of x (localized).
Str$(x) returns the string representation of x (not localized).
How to read the table: Calling Len takes 1 unit of time, while Asc takes 7 units. You can call Len 7 times in the time it takes Asc to execute just once.
Len and LenB. The fastest functions are Len and LenB. These are lightning fast functions that simply read the 4-byte length prefix at the start of the string area. Len is implemented in 6 assembly instructions in the VB runtime.
LenB is even shorter: it runs just 5 instructions. In principle, LenB should run faster. In practice, this is not the case. Their performance is equal on today's processors.
AscW, AscB and Asc. This group of functions is very fast as well. Note how Asc takes 7 times the time of AscB. This is because AscW and AscB simply return the first byte(s) of the string, while Asc needs to convert the value to an ANSI character code. You can squeeze more performance out of your program by replacing Asc with AscW. Since Asc and AscW return different values for many characters (other than ASCII 0–127), you need to know what you are doing before choosing the other function. Read Part III for more on this topic.
ChrW$, ChrB$ and Chr$. Performance degrades as we move to functions that create strings. This group of functions creates a string out of a numeric character code. Note how Chr$ takes about 40% more time to run than ChrW$. This is because Chr$ converts from ANSI to Unicode while ChrW$ works with pure Unicode values. You can squeeze more performance out of your program by replacing Chr$ with ChrW$. Since they return different characters for many values (other than 0–127), you need to know what you are doing before choosing the other function. Read Part III for more on this topic.
Left$, Right$ and Mid$. Performance stays at the degraded level with this group of functions. These functions create new strings by copying some characters in the input string. These are the only functions that can access the individual characters in a string. As you can see, Mid$ is slower than Left$ and Right$. This means you should use Left$ and Right$ when possible and only resort to Mid$ when you really need to access characters in the middle.
Tip. To access the first character in a string, call
AscW(S) instead of
Left$(S,1). This will save you 95% of the time of running
Left$. There is a caveat, though. S must not be empty. If S is empty, you will trigger a run-time error. Therefore, you need to make sure S is not empty by testing it with
LenB(S) first. This is the proper way:
If LenB(S) Then x = AscW(S)
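Keep in mind that AscW returns a number, not a string, so the comparison has to be made against a character code. As a small sketch of the tip (the function name StartsWithChar is an assumption for illustration), a first-character test could look like this:

```vb
Option Explicit

' Hypothetical helper: tests the first character of S without creating
' a temporary one-character string the way Left$(S, 1) would.
' CharCode is the Unicode value of the character, e.g. 60 for "<".
Public Function StartsWithChar(ByRef S As String, ByVal CharCode As Long) As Boolean
    If LenB(S) Then
        ' AscW returns an Integer, which is negative for codes above 32767.
        ' Masking with &HFFFF& maps the result to the range 0..65535.
        StartsWithChar = ((AscW(S) And &HFFFF&) = CharCode)
    End If
End Function
```

Calling StartsWithChar("<html>", 60) returns True, and an empty S returns False instead of raising a run-time error.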
CStr and Str$. These slow functions are used to convert other data types to a string. You typically use them to convert a numeric value into a string. CStr is much faster than Str$. (Tested for integer input value 32.)
You can save time by replacing calls to Str$ with CStr. This is not a straightforward task, though, because CStr and Str$ return different values. CStr returns a localized string while Str$ returns a non-localized one. What is more, Str$ prefixes positive values with a space. As an example, CStr(1.2) returns "1,2" in several European locales. Str$(1.2) always returns " 1.2". Thus, you can trust that Str$ always works the same way, while CStr works differently in different locales. If you simply replace calls to Str$ with CStr, your program may fail later if it fails to interpret the resulting localized string. The following table compares Str$ and CStr in the Finnish locale. The results will look similar in several other non-English locales.
Str$ and CStr in the Finnish locale
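When a number has to survive a round trip through text regardless of the user's locale, the non-localized functions are the safe pair. The following is a minimal sketch under that assumption; the function names NumberToText and TextToNumber are made up for illustration.

```vb
Option Explicit

' Hypothetical helper: locale-independent number formatting.
' Str$ always uses "." as the decimal separator. LTrim$ drops the
' leading space that Str$ puts in front of non-negative values.
Public Function NumberToText(ByVal Value As Double) As String
    NumberToText = LTrim$(Str$(Value))
End Function

' Hypothetical helper: the matching non-localized reader.
Public Function TextToNumber(ByRef Text As String) As Double
    TextToNumber = Val(Text)
End Function
```

NumberToText(1.2) gives "1.2" in every locale, and TextToNumber reads it back the same way. CStr followed by CDbl would also round-trip, but only as long as both calls run under the same locale.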
Performance of string functions, group 2
The next group consists of functions whose performance varies based on
what you feed them as input. Generally speaking, the longer the input
string, the slower the call. Performance drops considerably with longer strings.
The following chart reports timings for two input strings S defined as follows:
- Short input: S consists of 21 characters: 10 spaces, "1" and 10 spaces.
  S = "          1          "
- Long input: S consists of 201 characters: 100 spaces, "1" and 100 spaces.
So, for the call
Replace(text), the performance was 513 using the short input
string and 968 using the long input. The table below presents the
numeric performance values for the short input.
Relative performance of VB6 string functions
LTrim$(S) returns a copy of S without leading spaces.
RTrim$(S) returns a copy of S without trailing spaces.
Trim$(S) returns a copy of S without leading or trailing spaces.
Val(S) returns the numeric value contained in S (non-localized).
CInt(S) returns the integer value contained in S (localized, rounded).
CDbl(S) returns the double floating point value contained in S (localized).
LCase$(S) returns a copy of S with lower case letters.
UCase$(S) returns a copy of S with upper case letters.
StrConv(S, vbLowerCase) returns a copy of S with lower case letters.
StrConv(S, vbUpperCase) returns a copy of S with upper case letters.
StrComp(S1, S2, vbBinaryCompare) compares string S1 and S2 based on their Unicode values.
StrComp(S1, S2, vbTextCompare) compares string S1 and S2 in a locale-dependent, case-insensitive way.
InStr(x, S1, S2, vbBinaryCompare) looks for S2 in S1, starting at position x, in a locale-independent, case-sensitive way.
InStr(x, S1, S2, vbTextCompare) looks for S2 in S1, starting at position x, using text comparison.
InStrRev(S1, S2, x, vbBinaryCompare) looks for S2 in S1, from position x back to 1, in a locale-independent, case-sensitive way.
InStrRev(S1, S2, x, vbTextCompare) looks for S2 in S1, from position x back to 1, using text comparison.
Replace(S1, S2, S3, x, y, vbBinaryCompare) replaces S2 with S3 in S1, starting at position x, max y times, in a locale-independent, case-sensitive way.
Replace(S1, S2, S3, x, y, vbTextCompare) replaces S2 with S3 in S1, starting at position x, max y times, using text comparison.
LTrim$, RTrim$ and Trim$. This useful group of functions removes spaces from the start or the end of the input string, or both. The more spaces there are, the slower they run. These functions return a copy of the input string.
Tip 1. Never nest like LTrim$(RTrim$(S)). Simply call Trim$(S).
Tip 2. LTrim$ is the fastest alternative to test whether string S contains any non-space characters: trim S with LTrim$ and test whether the result is empty.
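One way to write the test in Tip 2 is sketched below. The function name HasNonSpace is an assumption, and the exact syntax the measurements were made with may have differed.

```vb
Option Explicit

' Hypothetical helper: True if S contains anything besides spaces.
' LTrim$ strips leading spaces only, so the result is empty exactly
' when S is empty or consists of nothing but spaces.
Public Function HasNonSpace(ByRef S As String) As Boolean
    HasNonSpace = (LenB(LTrim$(S)) <> 0)
End Function
```

Note that only the space character counts as a space here. A string holding a TAB or a line break is reported as having content.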
Val, CInt and CDbl. This is a group of relatively slow conversion functions. They look for a number inside a string.
Val is a non-localized version that is best used together with Str$.
CInt and CDbl are localized versions that are compatible with CStr.
Val is a safe function to call, while CInt and CDbl raise error 13 when conversion fails.
It is best to avoid costly conversion where possible. Always pass numeric values in numeric data types, not as strings. By converting numbers to strings and strings to numbers you spend extra CPU cycles and also risk error 13 if your code is not designed the right way.
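If you do need to read localized input with CDbl, error 13 has to be handled somewhere. A minimal sketch of one way to do it follows; the function name TryParseLocal is hypothetical.

```vb
Option Explicit

' Hypothetical helper: converts localized text to a Double.
' Returns False instead of raising error 13 (Type mismatch)
' when S does not contain a valid number.
Public Function TryParseLocal(ByRef S As String, ByRef Result As Double) As Boolean
    On Error GoTo Failed
    Result = CDbl(S)
    TryParseLocal = True
    Exit Function
Failed:
    TryParseLocal = False
End Function
```

Val needs no such wrapper: Val("abc") simply returns 0, which is convenient but also means you cannot tell a bad input from a real zero.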
LCase$, UCase$, StrConv(vbLowerCase) and StrConv(vbUpperCase). This group of functions performs case conversion.
LCase$ is the faster alternative to StrConv(vbLowerCase).
UCase$ is the faster alternative to StrConv(vbUpperCase).
Keep in mind that while LCase$ and UCase$ have full Unicode support,
StrConv(vbLowerCase) and StrConv(vbUpperCase) don't support all
Unicode characters. In fact, calls to StrConv(vbLowerCase) and
StrConv(vbUpperCase) can remove diacritic marks from Latin characters
and convert unsupported characters to "?". As an example,
StrConv(vbLowerCase) converts all Greek and Cyrillic characters to
garbage on a Western system.
Tip. Avoid StrConv(vbLowerCase) and StrConv(vbUpperCase). There is no
reason to use these slow and limited calls. If you see them used,
replace them with LCase$ and UCase$. If there is a chance of a Null value
in the input, use LCase and UCase. The dollar versions LCase$ and
UCase$ will fail if the input is Null.
StrComp. The StrComp function compares two strings. The
vbBinaryCompare option is faster than vbTextCompare.
vbTextCompare should only be used for sorting.
vbTextCompare is often used to test for a case-insensitive match (test if "ABC" is equal to "abc"), but it is not suitable for that. This is because the result depends on the current locale. Certain character combinations can produce false matches. For example,
StrComp("ss", "ß", vbTextCompare) returns 0. This means "ss" and "ß" are a match, which may not be what you intended to do. For a real case insensitive test (without extra effects), use something like
LCase$(S1) = LCase$(S2).
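As a sketch of such a test (the function name EqualsIgnoreCase is made up for illustration), the comparison can be wrapped so that it does not depend on the module's Option Compare setting:

```vb
Option Explicit

' Hypothetical helper: case-insensitive equality without the extra
' matches of vbTextCompare. Both strings are lowercased and then
' compared by their Unicode values.
Public Function EqualsIgnoreCase(ByRef S1 As String, ByRef S2 As String) As Boolean
    EqualsIgnoreCase = (StrComp(LCase$(S1), LCase$(S2), vbBinaryCompare) = 0)
End Function
```

With this function "ABC" matches "abc", while "ss" and "ß" are treated as different strings. The plain = operator would give the same result only in a module that uses the default Option Compare Binary.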
StrComp(vbTextCompare) can produce other surprises as well. In the following table you can see how
StrComp sorts certain characters in the Finnish locale (as an example). This serves as a proof of why
vbTextCompare is not simply the case insensitive counterpart of
vbBinaryCompare. It affects much more than just the case.
| vbBinaryCompare (all locales) | vbTextCompare (Finnish locale) |
| a < ae < z < ä < æ | a < ae = æ < z < ä |
| va < vb < wa | va < wa < vb |
| AE < ae < Æ < æ | AE = ae = Æ = æ |
| OE < oe < Œ < œ | OE = oe = Œ = œ |
| ss < ß | ss = ß |
| TH < th < Þ < þ | TH = th = Þ = þ |
| d < e < Ð < ð | d < Ð = ð < e |
Option Compare Text is the equivalent of vbTextCompare.
Option Compare Binary is the equivalent of vbBinaryCompare.
It is also the default setting.
InStr(,,vbBinaryCompare) is a quick function. Its counterpart
InStr(,,vbTextCompare) is very slow, on the other hand. This is where it really pays off to use a binary comparison.
InStr(,,vbTextCompare) is full of
surprises. To name an example,
InStr(1,"Straße","ss",vbTextCompare) locates "ß" at
position 5. You searched for 2 characters, but InStr found just one!
What is this? The character "ß" stands for "ss" in the German language.
But, if your program was expecting the 2 characters "ss" or "SS", it can
fail now. Similarly, a text search for "th" can locate the single character
"Þ". This happens because the character "Þ" stands for "th"
in Icelandic. Again, you searched for 2 characters, but InStr found just
one. These are not the only examples. You get the same behavior for
æ and œ. Don't use vbTextCompare unless this is
what you really want.
There is a separate article on the InStr
function with detailed information about text comparison.
Tip. Quick uses for InStr
| Locate a substring | InStr(S, x) | Locates x in S |
| See if string contains a substring | InStr(S, x) <> 0 | Tells us if S contains at least one x |
| See if character is one of alternatives | InStr("ABC", S) <> 0 | Tells us if S is one of "A", "B" or "C" (Len(S) must be 1) |
| See if string is one of alternatives | InStr("-AB-CD-EF-", "-" & S & "-") <> 0 | Tells us if S is one of "AB", "CD" or "EF" (S must not contain "-") |
InStr has much better performance with vbBinaryCompare than with vbTextCompare.
InStrRev is a lot slower than a regular InStr. This difference is
especially big with the vbBinaryCompare option. Searching for the middle
character in our 21-character short input string,
InStrRev spent 5 times as long as InStr.
InStrRev's performance also drops
considerably when the input string is long, that is, when there is a lot
to search in. It seems that InStrRev has a bad implementation.
Tip. Avoid InStrRev because of the bad performance. It may make sense to
replace calls to InStrRev with InStr at times. Example: If the input is
known to contain two TABs and you wish to find the latter one, it may be
faster to call InStr twice rather than InStrRev once.
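The two-TAB example could be sketched as follows. The function name SecondTabPos is an assumption, and whether this actually beats InStrRev for your data is something to measure rather than take for granted.

```vb
Option Explicit

' Hypothetical helper: finds the second TAB in S with two forward
' searches instead of one backward search. Returns 0 if S holds
' fewer than two TABs.
Public Function SecondTabPos(ByRef S As String) As Long
    Dim FirstPos As Long
    FirstPos = InStr(1, S, vbTab, vbBinaryCompare)
    If FirstPos <> 0 Then
        SecondTabPos = InStr(FirstPos + 1, S, vbTab, vbBinaryCompare)
    End If
End Function
```

When the input really contains exactly two TABs, the result equals InStrRev(S, vbTab).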
Replace is a slow function. As with the other text functions,
vbBinaryCompare has better performance than vbTextCompare.
If a replace is unlikely to occur, you can add performance by not
calling Replace unnecessarily. First test with
InStr whether a replace
is required, and then only call
Replace if a replacement is about to
happen. Thus, if you need to replace "£" with "€", first test with
InStr if there really is a "£" in the input string. Only call
Replace if there was.
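The test-first idea can be wrapped in a small helper. This is only a sketch; the function and parameter names are made up for illustration.

```vb
Option Explicit

' Hypothetical helper: calls Replace only when FindText really occurs in S.
' When there is nothing to replace, only the quick binary InStr runs.
Public Function ReplaceIfFound(ByRef S As String, ByRef FindText As String, _
                               ByRef NewText As String) As String
    If InStr(1, S, FindText, vbBinaryCompare) <> 0 Then
        ReplaceIfFound = Replace(S, FindText, NewText, 1, -1, vbBinaryCompare)
    Else
        ReplaceIfFound = S
    End If
End Function
```

For the currency example you would call ReplaceIfFound(S, ChrW$(163), ChrW$(8364)), where 163 is the code of "£" and 8364 the code of "€". The extra InStr call only pays off when most inputs need no replacement.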
Empty string vs. null string
As you probably know having read this far, there are two ways to represent a zero-length string:
- the null string: vbNullString
- the empty string: ""
As far as pure VB6 programming is concerned, they work in exactly
the same way. The only differences are that vbNullString is
faster and it saves 6 bytes of memory.
What exactly makes
vbNullString different from
""? It's the way they are stored.
The empty string ("") consumes 6 bytes of memory. You can see the memory layout in the table below. All of the bytes are overhead bytes. There are no data bytes as there are no characters to store. The Datastring field, which normally appears between the Length prefix and Terminator fields, is missing. Its length is zero.
Memory layout of "", the empty string
| Field | Length prefix | Terminator |
| Number of bytes | 4 | 2 |
What is the memory layout of vbNullString? The answer:
nothing! There is nothing to store, because vbNullString is
simply a NULL pointer. Besides a zero pointer, there is nothing else to
store. For the empty string, there is a non-zero pointer and a real
6-byte string allocated in RAM.
Now let's see what happens when you store a null string and an empty
string in variable S. The null string is stored as a null pointer. The
only thing to store in S is a zero. To store the empty string
"", VB needs to do more. VB will allocate an empty string
in the memory and set S to point to it. Thus, we consume both the
pointer and the data area. You can see the difference in the table below.
Null string vs. empty string
| | S = vbNullString | S = "" | S = "a" |
| Pointer stored in S (at address 1308564) | 0 | 1568888 | 1568888 |
| Data area at address 1568888 | nothing | 6 bytes | 8 bytes |
In the table, variable S is stored as a pointer at memory address 1308564.
- When you let S = vbNullString, the pointer value is set to zero. Nothing else is stored.
- When you let S = "", the value 1568888 is stored at address 1308564. Now S points to the data area at 1568888. As it happens, this area contains the 6 overhead bytes of the empty string.
- Compare the empty string to a regular non-empty string. When you let S = "a", it is stored exactly the same way as "". Again, S points to the data area at 1568888. In this area you can find the actual string "a", which takes 8 bytes.
There is one practical difference between "" and
vbNullString, and it shows up in API calls. Some API calls may accept only one of the two. Check out the API documentation before switching to
vbNullString. VB itself doesn't make this distinction.
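You can observe the difference between the two zero-length strings with StrPtr. The following sketch only prints what VB already knows about each value; the Sub name is made up for illustration.

```vb
Option Explicit

Public Sub DemoNullVsEmpty()
    Dim S As String

    S = vbNullString
    Debug.Print StrPtr(S) = 0       ' True: a null pointer, no data area
    Debug.Print LenB(S) = 0         ' True

    S = ""
    Debug.Print StrPtr(S) = 0       ' False: points to a real 6-byte string
    Debug.Print LenB(S) = 0         ' True

    ' To VB code itself the two values are interchangeable.
    Debug.Print S = vbNullString    ' True
End Sub
```

An API function that receives the value of StrPtr(S) sees either a NULL pointer or a pointer to an empty string, which is exactly where the two can behave differently.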
Getting string pointers
VB6 supports two functions for getting pointers to strings.
VarPtr(S) returns a pointer to the variable
S. That memory location contains a pointer (BSTR) to the
actual string data. This means
VarPtr essentially returns a pointer to a pointer.
StrPtr(S) returns a pointer to the actual
string data currently stored in S. This is what you need when
passing the string to Unicode API calls. The pointer you get points to
the Datastring field, not the Length prefix field. In COM terminology,
StrPtr returns the value of the BSTR pointer.
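The relationship between the two functions can be shown by reading the 4 bytes stored at VarPtr(S). This is a minimal sketch that assumes the conventional CopyMemory alias for RtlMoveMemory; the Sub name is hypothetical.

```vb
Option Explicit

' Assumed declaration: CopyMemory is the usual VB6 alias for RtlMoveMemory.
Private Declare Sub CopyMemory Lib "kernel32" Alias "RtlMoveMemory" _
    (Destination As Any, Source As Any, ByVal Length As Long)

Public Sub DemoPointers()
    Dim S As String
    Dim Stored As Long

    S = "Aivosto"
    ' Read the 4-byte pointer held in the variable S itself.
    CopyMemory Stored, ByVal VarPtr(S), 4

    Debug.Print Stored = StrPtr(S)  ' True: the variable holds the BSTR pointer
End Sub
```

In other words, VarPtr tells you where the variable lives, and StrPtr tells you what the variable currently points to.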
API calls with Unicode strings
A lot of Windows API procedures come in two versions: ANSI and Unicode. The ANSI versions end in 'A' while the Unicode versions end in 'W'. The W stands for wide (wide characters).
VB6 supports the ANSI 'A' versions. When calling a
Declare Sub or
Declare Function, VB automatically converts
String parameters from Unicode to ANSI. This means you can safely use
As String parameters in your
Declare statements as long as you work with ANSI functions.
Declare Sub MySub Lib "x" Alias "MySubA" (Text As String)
Unfortunately VB6 doesn't directly support the Unicode 'W' functions. You can pass a Unicode string quite easily, though. This is how you declare the Unicode version:
Declare Sub MySub Lib "x" Alias "MySubW" (ByVal Text As Long)
This is how you pass the Unicode string:
MySub StrPtr(Text)
You simply wrap the String parameters in
StrPtr. That's how simple it really is! The API procedure gets a pointer to a Unicode string, not a copy of the string. By passing your strings as a pointer, you avoid the costly automated conversion to ANSI. What is more, your code is not restricted to the current ANSI character set, but can use the full range of Unicode characters. This is important with truly international applications.
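For a concrete case, here is a sketch that calls lstrlenW, the Unicode string length function in kernel32. The function was picked for this illustration only because it is easy to check; it is not one of the functions measured in this article.

```vb
Option Explicit

' Unicode 'W' function: the string is passed as a pointer (ByVal ... As Long).
Private Declare Function lstrlenW Lib "kernel32" (ByVal lpString As Long) As Long

Public Sub DemoUnicodeCall()
    Dim S As String
    S = "ab" & vbNullChar & "cd"

    Debug.Print Len(S)                  ' 5: VB counts the embedded NULL
    Debug.Print lstrlenW(StrPtr(S))     ' 2: the API stops at the first NULL
End Sub
```

The example also shows the NULL character issue from the start of this article: VB relies on the length prefix, while a C-style API function relies on the terminator.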
Building huge strings
VB6 lacks a
StringBuilder class for the creation of large strings.
Consider building a big string in a loop. An example is when you load a text
file line by line:
Do Until EOF(FileNr)
    Line Input #FileNr, Line$
    Big$ = Big$ & Line$ & vbNewline
Loop
You repeatedly copy stuff to the end of the string with the
& operator. The bad news is, VB is not
optimized for this. As you grow the string, VB repeatedly copies Big$
over and over again. This really degrades performance when repeated. VB
constantly allocates new space and performs a copy.
The problem gets worse as the string exceeds 64K in size. Small
strings are stored in a 64K string cache. As the string becomes larger
than the cache, performance drops considerably.
Order of concatenation
Big$ = Big$ & "abc" & "def"
Big$ = Big$ & ("abc" & "def")
On line 1 above, VB will first join Big$ and
"abc". After this, it will perform
another copy to add
"def" to the end.
If the variable Big$ is big, the same is better rewritten with parentheses around
the shorter strings. The trick here is that you avoid copying the
contents of Big$ twice. On line 2, VB will combine
"abc" & "def" first before making a single slow copy of Big$.
You can avoid the slow run-time concatenation by using constants:
Const TWO_NEWLINES = vbNewline & vbNewline
String constants are stored in the executable in their entirety.
There will be no concatenation when your program executes.
Sometimes you cannot use a constant. You can still initialize a
variable when your program starts and use the value where required:
Public EscapeSequence As String
EscapeSequence = ChrW$(27) & vbCr
Building large strings with CString
A good approach to building large strings is provided in a class called CString, originally published in Francesco Balena's article in Visual Basic Programmer's Journal. Designed as a "string builder" class, CString is a nice replacement for big VB strings. It is easy to use and its performance is great.
- CString original article, VBPJ, January 1999.
- The CString source code has been published online, but you may need to search for it.
CString stores the string in a byte array. As you add text to the string, CString allocates more space from time to time. Because it allocates far less often than VB, CString performs much better than a regular string.
The bad news is, CString works in ANSI. It stores the string in a byte array, one byte per character. While this saves memory, it doesn't work well with all international characters. Being ANSI means CString will silently convert your characters to something else. This isn't what you want if your strings are going to contain any characters outside of the current codepage. This means, CString doesn't handle international text.
Fortunately, it's possible to make CString work in Unicode. You need to change the byte array storage to an integer array, meaning two bytes per character. While this doubles the memory requirement, it ensures CString will work with all possible characters and not cause any nasty effects.
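To illustrate the idea of allocating rarely, here is a simplified Unicode-safe builder. It is not Balena's CString and it does not use a byte or integer array. It swaps in a different technique: a preallocated String buffer that is filled with the Mid$ statement. All names are made up for this sketch, and it is written as standard module code rather than a class to keep it short.

```vb
Option Explicit

Private mBuffer As String   ' preallocated storage, Unicode like any VB String
Private mLength As Long     ' number of characters actually in use

' Appends Text to the end of the builder.
Public Sub BuilderAppend(ByRef Text As String)
    Dim Needed As Long
    If LenB(Text) = 0 Then Exit Sub

    Needed = mLength + Len(Text)
    If Needed > Len(mBuffer) Then
        ' Grow by the needed size, which at least doubles the buffer,
        ' so the slow reallocation happens only now and then.
        mBuffer = mBuffer & Space$(Needed)
    End If

    ' The Mid$ statement overwrites characters in place without
    ' creating a new string.
    Mid$(mBuffer, mLength + 1, Len(Text)) = Text
    mLength = Needed
End Sub

' Returns the text collected so far.
Public Function BuilderValue() As String
    BuilderValue = Left$(mBuffer, mLength)
End Function

' Empties the builder and releases the buffer.
Public Sub BuilderClear()
    mBuffer = vbNullString
    mLength = 0
End Sub
```

Usage is a matter of calling BuilderAppend in the loop and reading BuilderValue once at the end. Because the buffer is a regular String, no character conversion takes place.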
Building a huge string in a file
As your strings grow really, really huge, CString will not be enough.
Memory allocation will eventually fail, even if there is free RAM
available. The limit varies but there is a point where you get Error
#7: Out of memory and your application is likely to crash. Maybe it's
50 MB or 300 MB, but Out of memory can hit your application as well.
This can happen if your program produces large reports, logs, web pages,
XML or other textual data files by building the data in a regular string variable before
writing it to a text file. What can you do to keep your program running?
Save directly to a text file. Stop keeping the string in
RAM. Write your string data to a file instead. If you are creating a report
or an HTML page, for example, it's better to write the text directly to
a file rather than building the text in a string variable and writing to disk afterwards.
Temporary text file. Writing text directly to a file
isn't always an option, especially when existing code should be
rewritten and you want to avoid a rewrite like the plague. This is where
you can build a helper class. Let's call the class FileString. The class
will use CString (or any other similar solution) as the default storage,
the regular RAM string. As you append text to FileString, the
class will store the text in the RAM string. If the string grows larger
than a certain limit (say 1 MB), the FileString class creates a
temporary file and dumps the RAM string in it. It then clears
the RAM string to accept more stuff. You go on appending stuff to the
RAM string. Every now and then FileString will store the RAM string into
the temporary file.
Once you're ready, you can have FileString save the resulting huge
string to a file. This is as simple as copying the temporary file into
the final file. If there is any text left in the RAM string, that text is
simply appended to the end of the final file.
This way you can build virtually unlimited strings, even exceeding
the amount of available RAM.
The FileString solution is especially nice when you can't tell if
the string will be small or large. When the string is small, it will be
built in RAM. When the string is large, it will go onto the disk. There
is no limit to the string data. You can be certain your code will run
as long as there's free space on the disk.
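The FileString idea could be sketched roughly as follows. This is an illustration under several assumptions, not a finished class: it is written as standard module code, it uses a plain String as the RAM string instead of CString, the 1 MB limit and the temporary file name are arbitrary, and it flushes the RAM string before copying instead of appending it to the final file. Note also that Print # writes the text as ANSI.

```vb
Option Explicit

' Assumed limit: 524288 characters equals 1 MB of String data.
Private Const FLUSH_LIMIT As Long = 524288

Private mRam As String          ' the RAM string
Private mTempFile As String     ' empty until the first flush

' Appends Text. Moves the RAM string to disk when it grows too large.
Public Sub FileStringAppend(ByRef Text As String)
    mRam = mRam & Text
    If Len(mRam) >= FLUSH_LIMIT Then FlushToTemp
End Sub

' Dumps the RAM string at the end of the temporary file and clears it.
Private Sub FlushToTemp()
    Dim F As Integer
    If LenB(mRam) = 0 Then Exit Sub

    F = FreeFile
    If LenB(mTempFile) = 0 Then
        ' Hypothetical temp file name; use your own naming scheme.
        mTempFile = Environ$("TEMP") & "\filestring.tmp"
        Open mTempFile For Output As #F     ' first flush: create or truncate
    Else
        Open mTempFile For Append As #F
    End If
    Print #F, mRam;                         ' semicolon: no extra line break
    Close #F
    mRam = vbNullString
End Sub

' Saves the complete string to FileName and resets the storage.
Public Sub FileStringSaveAs(ByRef FileName As String)
    Dim F As Integer
    FlushToTemp
    If LenB(mTempFile) = 0 Then
        ' Nothing was ever appended: create an empty file.
        F = FreeFile
        Open FileName For Output As #F
        Close #F
    Else
        FileCopy mTempFile, FileName
        Kill mTempFile
        mTempFile = vbNullString
    End If
End Sub
```

Existing code that builds a report with Big$ = Big$ & ... would call FileStringAppend instead and finish with FileStringSaveAs, so the change stays local to the places where the string is appended and saved.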