Optimize string handling in VB6 - Part II
Make your Visual Basic apps process text fast as lightning. Part II of this article dives deep into the performance of the VB6 String functions. We learn the functions to use and the ones to avoid. We also learn how to call the Windows Unicode API functions and how to build really, really huge strings without crashing our VB6 apps.
As told in Part I, you can use many tricks to make VB6 process strings faster. We are now going deeper into the details of fast and robust string programming.
VB6 functions in this article: Asc, AscB, AscW, Chr$, ChrB$, ChrW$, CDbl, CInt, CStr, InStr, InStrRev, LCase$, Left$, Len, LenB, LTrim$, Mid$, Replace, Right$, RTrim$, Str$, StrPtr, Trim$, StrComp, StrConv, UCase$, Val, VarPtr.
Memory layout of VB6 strings
To get some background, let's see how strings are stored in RAM. VB6
stores strings in Unicode format. In COM terminology, a VB String is a
BSTR. A String requires six overhead bytes plus 2 bytes for
each character. Thus, you spend 6 + 2 × Len(S) bytes on a string S.
The string starts with a 4-byte length prefix. This 32-bit integer does not hold the character count; it counts the number of bytes in the string, excluding the 2 terminating zero bytes. After the length prefix comes the actual text data, 2 bytes for each character. The last 2 bytes are zeros, forming a NULL terminator (a Unicode null character).
Let's see how a sample string "Aivosto" is stored:
NULL character note. It is perfectly valid to store a NULL character in the string. As the string length is stored in the first 4 bytes, you can store NULLs in the string data. They will not be treated as the terminating NULL as in C/C++.
BSTR note. BSTR is the COM datatype for a string pointer. The pointer points to the first character of the string data, not to the length prefix.
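The layout described above can be inspected from VB6 itself. The following is a minimal sketch, not part of the original test code: DumpString is a hypothetical helper name, and CopyMemory is the usual alias for the RtlMoveMemory API in kernel32. It copies the length prefix, the text data and the terminator into a byte array and returns them as hex.

```vb
Option Explicit

' RtlMoveMemory copies raw bytes; the alias CopyMemory is a common convention.
Private Declare Sub CopyMemory Lib "kernel32" Alias "RtlMoveMemory" _
    (ByRef Destination As Any, ByVal Source As Long, ByVal Length As Long)

' Hypothetical helper: returns the raw bytes of S as a hex dump.
Public Function DumpString(ByRef S As String) As String
    Dim Bytes() As Byte
    Dim Total As Long
    Dim i As Long
    Dim Result As String

    ' A null string has no memory block to read.
    If StrPtr(S) = 0 Then Exit Function

    ' 4-byte length prefix + text data + 2-byte terminator
    Total = 4 + LenB(S) + 2
    ReDim Bytes(0 To Total - 1)

    ' StrPtr points at the first character, so the prefix starts 4 bytes earlier.
    CopyMemory Bytes(0), StrPtr(S) - 4, Total

    For i = 0 To Total - 1
        Result = Result & Right$("0" & Hex$(Bytes(i)), 2) & " "
    Next
    DumpString = RTrim$(Result)
End Function
```

For "Aivosto" this should give 0E 00 00 00 (the byte length, 14), followed by 41 00 69 00 76 00 6F 00 73 00 74 00 6F 00 (the seven characters) and 00 00 (the terminator).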
Performance of simple string functions
We are now going to measure how fast the various built-in string functions of VB6 really are. For this purpose we compiled a little .exe, which called each function 100 million times. The test was run on a typical Pentium 4 processor (2.8 GHz). We found out that Len and LenB are the fastest functions. So we compared the rest of the functions to Len/LenB and put the results in the chart and table below.
How to read the table: Calling Len takes 1 unit of time, while Asc takes 7 units. You can call Len 7 times in the same time you call Asc once.
Len and LenB. The fastest functions are Len and LenB. These are lightning fast functions that simply read the 4-byte length prefix at the start of the string area. Len is implemented in 6 assembly instructions in the VB runtime. LenB is even shorter: it runs just 5 instructions. In principle, LenB should run faster. In practice, this is not the case. Their performance is equal on today's processors.
AscW, AscB and Asc. This group of functions is very fast as well. Note how Asc takes 7 times the time of AscW and AscB. This is because AscW and AscB simply return the first byte(s) of the string. Asc needs to convert the value to an ANSI character code. You can squeeze more performance out of your program by replacing Asc with AscW. Since Asc and AscW return different values for many characters (other than ASCII 0-127), you need to know what you are doing before choosing the other function. Read Part III for more on this topic.
ChrW$, ChrB$ and Chr$. Performance degrades as we move to functions that create strings. This group of functions creates a string out of a numeric character code. Note how Chr$ takes about 40% more time to run than ChrW$. This is because Chr$ converts from ANSI to Unicode while ChrW$ works with pure Unicode values. You can squeeze more performance out of your program by replacing Chr$ with ChrW$. Since they return different characters for many values (other than 0-127), you need to know what you are doing before choosing the other function. Read Part III for more on this topic.
Left$, Right$ and Mid$. Performance stays at the degraded level with this group of functions. These functions create new strings by copying some characters from the input string. These are the only functions that can access the individual characters in a string. As you can see, Mid$ is slower than Left$ or Right$. This means you should use Left$ and Right$ when possible and only resort to Mid$ when you really need to access characters in the middle.
Tip. To access the first character in a string, call AscW(S) instead of Left$(S,1). This will save you 95% of the time of running Left$. There is a caveat, though. S must not be empty. If S is empty, you will trigger a run-time error. Therefore, you need to make sure S is not empty by calling Len(S) or LenB(S) first. This is the proper way to do it:
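A minimal sketch of the guarded call described in the tip. StartsWithChar is a hypothetical helper name, not something from the original article. AscW returns a signed Integer, so the sketch masks the result to get a character code in the range 0 to 65535.

```vb
Option Explicit

' Hypothetical helper: True if S begins with the character CharCode (0-65535).
Public Function StartsWithChar(ByRef S As String, ByVal CharCode As Long) As Boolean
    ' AscW raises a run-time error on an empty string, so test the length first.
    If LenB(S) <> 0 Then
        ' Mask with &HFFFF& because AscW is negative for codes above 32767.
        StartsWithChar = ((AscW(S) And &HFFFF&) = CharCode)
    End If
End Function
```

A call such as StartsWithChar(S, 65) then replaces the slower test Left$(S, 1) = "A".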
CStr and Str$. These slow functions are used to convert other data types to a string. You typically use them to convert a numeric value into a string. CStr is much faster than Str$. (Tested for integer input value 32.)
You can save time by replacing calls to Str$ with CStr. This is not a straightforward task, though, because CStr and Str$ return different values. CStr returns a localized string, while Str$ returns a non-localized one. What is more, Str$ prefixes positive values with a space. As an example, CStr(1.2) returns "1,2" in several European locales. Str$(1.2) always returns " 1.2". Thus, you can trust that Str$ always works the same way, while CStr works differently in different locales. If you simply replace calls to Str$ with CStr, your program may fail later if it fails to interpret the resulting localized string. The following table compares Str and CStr in the Finnish locale. The results will look similar in several non-English locales.
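If you keep Str$ for its predictable output, the leading space is easy to remove. The sketch below assumes you want a locale-independent string that Val can read back. NumToStr is a hypothetical name.

```vb
Option Explicit

' Hypothetical helper: non-localized number-to-string conversion
' without the leading space that Str$ adds to positive values.
Public Function NumToStr(ByVal Value As Double) As String
    NumToStr = LTrim$(Str$(Value))
End Function

Public Sub DemoNumToStr()
    Debug.Print "[" & Str$(1.2) & "]"      ' [ 1.2] in every locale
    Debug.Print "[" & NumToStr(1.2) & "]"  ' [1.2]
    Debug.Print "[" & CStr(1.2) & "]"      ' locale-dependent, e.g. [1,2]
    Debug.Print Val(NumToStr(1.2)) = 1.2   ' True: Val reads the Str$ format back
End Sub
```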
Performance of string functions, group 2
The next group consists of functions whose performance varies based on what you feed them as input. Generally speaking, the longer the input string, the slower the call. Performance drops considerably with longer strings.
The following chart reports timings for two input strings S defined as follows:
So, for the call Replace(text), the performance was 513 using the short input string and 968 using the long input. The table below presents the numeric performance values for the short input.
LTrim$, RTrim$ and Trim$. This useful group of functions removes spaces from the start or the end of the input string, or both. The more spaces there are, the slower they run. These functions return a copy of the input string.
Tip 1. Never nest like LTrim$(RTrim$(S)). Simply call Trim$(S) instead.
Tip 2. LTrim$ is the fastest alternative to test whether string S contains any non-space characters. Use this syntax:
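One way to write the test from Tip 2 is sketched below. It assumes that combining LTrim$ with LenB is what the tip intends. HasText is a hypothetical helper name.

```vb
Option Explicit

' Hypothetical helper: True if S contains anything other than spaces.
Public Function HasText(ByRef S As String) As Boolean
    ' LTrim$ only strips leading spaces, which is enough for this test.
    HasText = (LenB(LTrim$(S)) <> 0)
End Function
```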
Val, CInt, CDbl. This is a group of relatively slow conversion functions. They look for a number inside a string. Val is a non-localized version that is best used together with Str$. CInt and CDbl are localized versions that are compatible with CStr. Val is a safe function to call, while CInt and CDbl raise error 13 when conversion fails.
It is best to avoid costly conversions where possible. Always pass numeric values in numeric data types, not as strings. By converting numbers to strings and strings to numbers you spend extra CPU cycles and also risk error 13 if your code is not designed the right way.
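When a localized conversion cannot be avoided, error 13 can be trapped. The following is a minimal sketch. TryCDbl is a hypothetical helper name, not a built-in function.

```vb
Option Explicit

' Hypothetical helper: localized conversion that reports failure
' instead of raising error 13 (Type mismatch).
Public Function TryCDbl(ByRef Text As String, ByRef Result As Double) As Boolean
    On Error GoTo Failed
    Result = CDbl(Text)
    TryCDbl = True
    Exit Function
Failed:
    TryCDbl = False
End Function

Public Sub DemoConversions()
    Dim D As Double
    Debug.Print Val("12abc")         ' 12: Val stops at the first non-numeric character
    Debug.Print Val("abc")           ' 0: Val never raises an error here
    Debug.Print TryCDbl("abc", D)    ' False: CDbl raised error 13 internally
End Sub
```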
LCase$, UCase$, StrConv(vbLowerCase), StrConv(vbUpperCase). This group of functions performs case conversion. LCase$ is the faster alternative to StrConv(vbLowerCase). UCase$ is the faster alternative to StrConv(vbUpperCase).
Keep in mind that LCase$ and UCase$ have full Unicode support, whereas StrConv(vbLowerCase) and StrConv(vbUpperCase) don't support all Unicode characters. In fact, calls to StrConv(vbLowerCase) and StrConv(vbUpperCase) can remove diacritic marks from Latin characters and convert unsupported characters to "?". As an example, StrConv(vbLowerCase) converts all Greek and Cyrillic characters to garbage on a Western system.
Tip. Avoid StrConv(vbLowerCase) and StrConv(vbUpperCase). There is no reason to use these slow and limited calls. If you see them used, replace them with LCase$ or UCase$. If there is a chance of a Null value in the input, use LCase or UCase, because the dollar versions LCase$ and UCase$ fail if the input is Null.
StrComp. The StrComp function compares two strings. The
StrComp(vbTextCompare) can produce other surprises as well. In the following table you can see how StrComp sorts certain characters in the Finnish locale (as an example). This serves as a proof of why
InStr. InStr(,,vbBinaryCompare) is a quick function. Its counterpart InStr(,,vbTextCompare) is very slow, on the other hand. This is where it really pays off to use a binary comparison.
As with StrComp,
There is a separate article on the InStr function with detailed information about text comparison.
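The two comparison modes also return different results. A short sketch with illustrative strings of my own choosing:

```vb
Option Explicit

Public Sub DemoInStrCompare()
    Const Text As String = "Hello World"

    ' Binary comparison is case-sensitive: "world" is not found.
    Debug.Print InStr(1, Text, "world", vbBinaryCompare)   ' 0

    ' Text comparison ignores case: the match starts at position 7.
    Debug.Print InStr(1, Text, "world", vbTextCompare)     ' 7
End Sub
```

Passing the compare argument explicitly also makes the call independent of any Option Compare Text setting in the module.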
InStrRev. InStrRev also performs much better with vbBinaryCompare than with vbTextCompare.
InStrRev is a lot slower than a regular InStr. The difference is especially big with the vbBinaryCompare option. Searching for the middle character in our 21-character short input string, InStrRev took 5 times as long as InStr.
InStrRev's performance also drops considerably when the input string is long, that is, when there is a lot to search in. It seems that InStrRev has a bad implementation.
Tip. Avoid InStrRev because of the bad performance. It may make sense to replace calls to InStrRev with InStr at times. Example: If the input is known to contain two TABs and you wish to find the latter one, it may be faster to call InStr twice rather than InStrRev once.
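The two-TAB example could be sketched as follows. SecondTabPos is a hypothetical helper name. Whether this actually beats InStrRev depends on the input, so measure before relying on it.

```vb
Option Explicit

' Hypothetical helper: position of the second TAB in S, or 0 if there is none.
Public Function SecondTabPos(ByRef S As String) As Long
    Dim First As Long

    First = InStr(1, S, vbTab, vbBinaryCompare)
    If First <> 0 Then
        ' Continue the search right after the first TAB.
        SecondTabPos = InStr(First + 1, S, vbTab, vbBinaryCompare)
    End If
End Function
```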
Replace. Replace is a slow function. As with the other text functions, vbBinaryCompare has better performance than vbTextCompare.
If a replace is unlikely to occur, you can add performance by not calling Replace unnecessarily. First test with InStr whether a replace is required, and then only call Replace if a replacement is about to happen. Thus, if you need to replace "£" with "€", first test with InStr if there really is a "£" in the input string. Only call Replace if there was.
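A minimal sketch of the test-first approach. ReplaceIfFound is a hypothetical wrapper name.

```vb
Option Explicit

' Hypothetical wrapper: calls Replace only when Find actually occurs in S.
Public Function ReplaceIfFound(ByRef S As String, ByRef Find As String, _
                               ByRef ReplaceWith As String) As String
    If InStr(1, S, Find, vbBinaryCompare) <> 0 Then
        ReplaceIfFound = Replace(S, Find, ReplaceWith, , , vbBinaryCompare)
    Else
        ' Nothing to replace: return the input without calling Replace.
        ReplaceIfFound = S
    End If
End Function
```

The extra InStr call costs a little when a replacement does occur, so the wrapper pays off only when matches are rare, as the text says.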
Empty string vs. null string
As you probably know having read this far, there are two ways to represent a zero-length string:
As far as pure VB6 programming is concerned, they work in exactly
the same way. The only differences are that
What exactly makes
What is the memory layout of
Now let's see what happens when you store a null string and an empty
string in variable S. The null string is stored as a null pointer. The
only thing to store in S is a zero. To store the empty string
In the table, variable S is stored as a pointer at memory address 1308564.
There is a difference between using "" and vbNullString, and it shows up in API calls. Some API calls accept only one of the two. Check the API documentation before switching to vbNullString. VB itself makes no such distinction.
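The difference is easy to see with StrPtr. The following sketch only prints values and makes no assumptions beyond the built-in functions:

```vb
Option Explicit

Public Sub DemoNullVsEmpty()
    ' The null string is a null pointer: no memory is allocated for it.
    Debug.Print StrPtr(vbNullString)          ' 0

    ' The empty string points to a real, zero-length string in memory.
    Debug.Print StrPtr("") <> 0               ' True

    ' To VB code the two look identical.
    Debug.Print Len(vbNullString), Len("")    ' 0  0
    Debug.Print (vbNullString = "")           ' True
End Sub
```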
Getting string pointers
VB6 supports two functions for getting pointers to strings.
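The two functions in question are presumably StrPtr and VarPtr, both of which appear in this article's function list. As a sketch of how they relate: VarPtr(S) is the address of the variable S itself, and the 4 bytes stored there hold the same pointer that StrPtr(S) returns. CopyMemory below is the usual alias for the RtlMoveMemory API.

```vb
Option Explicit

Private Declare Sub CopyMemory Lib "kernel32" Alias "RtlMoveMemory" _
    (ByRef Destination As Any, ByVal Source As Long, ByVal Length As Long)

Public Sub ShowPointers()
    Dim S As String
    Dim Stored As Long

    S = "Aivosto"

    ' Read the 4-byte pointer that the variable S holds.
    CopyMemory Stored, VarPtr(S), 4

    Debug.Print VarPtr(S)             ' address of the variable
    Debug.Print StrPtr(S)             ' address of the first character
    Debug.Print (Stored = StrPtr(S))  ' True
End Sub
```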
API calls with Unicode strings
A lot of Windows API procedures come in two versions: ANSI and Unicode. The ANSI versions end in 'A' while the Unicode versions end in 'W'.
VB6 supports the ANSI 'A' versions. When calling a
Declare Sub MySub Lib "x" Alias "MySubA" (Text As String)
Unfortunately VB6 doesn't directly support the Unicode 'W' Declares. You can pass a Unicode string quite easily, though. This is how you declare the Unicode version:
Declare Sub MySub Lib "x" Alias "MySubW" (ByVal Text As Long)
This is how you pass the Unicode string:
You simply wrap the String parameters in StrPtr. That's how simple it really is! The API procedure gets a pointer to a Unicode string, not a copy of the string. By passing your strings as a pointer, you avoid the costly automated conversion to ANSI. What is more, your code is not restricted to the current ANSI character set, but can use the full range of Unicode characters. This is important with truly international applications.
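As a concrete sketch of the same pattern, here is the MessageBoxW function from user32 declared and called with StrPtr. ShowUnicodeMessage and its sample text are hypothetical, chosen only for illustration.

```vb
Option Explicit

' Unicode version of MessageBox: string parameters are declared ByVal ... As Long.
Private Declare Function MessageBoxW Lib "user32" _
    (ByVal hWnd As Long, ByVal lpText As Long, _
     ByVal lpCaption As Long, ByVal uType As Long) As Long

Public Sub ShowUnicodeMessage()
    Dim Text As String
    Dim Caption As String

    ' Greek alpha, beta, gamma: outside the Western ANSI code page.
    Text = ChrW$(&H3B1) & ChrW$(&H3B2) & ChrW$(&H3B3)
    Caption = "Unicode test"

    ' StrPtr passes a pointer to the Unicode data, so no ANSI conversion takes place.
    MessageBoxW 0, StrPtr(Text), StrPtr(Caption), vbOKOnly
End Sub
```

The strings must stay alive for the duration of the call, which is the case for local variables as used here.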
Building huge strings
VB6 lacks a StringBuilder class for the creation of large strings. Consider building a big string in a loop. An example is when you load a text file line by line:
Do Until EOF(FileNr)
    Line Input #FileNr, Line$
    Big$ = Big$ & Line$ & vbNewline ' Bad!
Loop
You repeatedly copy stuff to the end of the string with the
The problem gets worse as the string exceeds 64K in size. Small strings are stored in a 64K string cache. As the string becomes larger than the cache, performance drops considerably.
Order of concatenation
Big$ = Big$ & "abc" & "def"   ' 1 - Default order
Big$ = Big$ & ("abc" & "def") ' 2 - Short strings first
On line 1 above, VB will first join Big$ and "abc". After this, it will perform another copy to add "def" to the end.
If Big$ is big, the same is better rewritten with parentheses around the shorter strings. The trick here is that you avoid copying the contents of Big$ twice. On line 2, VB will combine "abc" & "def" first before making a copy of Big$.
You can avoid the slow run-time concatenation by using constants:
Const TWO_NEWLINES = vbNewline & vbNewline
String constants are stored in the executable in their entirety. There will be no concatenation when your program executes.
Sometimes you cannot use a constant. You can still initialize a variable when your program starts and use the value where required:
Public EscapeSequence As String
' Run once at startup, for example in Sub Main:
EscapeSequence = ChrW$(27) & vbCr ' ESC CR
Building large strings with CString
A good approach to building large strings is provided in a class called CString, originally published in Francesco Balena's article in Visual Basic Programmer's Journal. Designed as a "string builder" class, CString is a nice replacement for big VB strings. It is easy to use and its performance is great.
CString stores the string in a byte array. As you add text to the string, CString allocates more space from time to time. Because it allocates far less often than VB, CString performs much better than a regular String.
The bad news is that CString works in ANSI. It stores the string in a byte array, one byte per character. While this saves memory, it means CString will silently convert any character outside the current codepage to something else. In short, CString does not handle international text.
Fortunately, it's possible to make CString work in Unicode. You need to change the byte array storage to an integer array, meaning two bytes per character. While this requires twice the memory, it ensures CString will work with all possible characters and not cause any nasty effects.
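To illustrate the idea of allocating in large steps, here is a minimal string builder sketch. It is not Balena's CString: instead of a byte or integer array it uses a preallocated String buffer and the Mid$ statement, which keeps the text in Unicode. The class name CStringSketch and its members are hypothetical.

```vb
' Class module: CStringSketch (hypothetical name)
Option Explicit

Private m_Buffer As String   ' preallocated storage, padded with spaces
Private m_Length As Long     ' number of characters actually in use

' Appends Text, growing the buffer only when it runs out of room.
Public Sub Append(ByRef Text As String)
    Dim TextLen As Long
    Dim Required As Long
    Dim NewSize As Long

    TextLen = Len(Text)
    If TextLen = 0 Then Exit Sub

    Required = m_Length + TextLen
    If Required > Len(m_Buffer) Then
        ' Grow to at least double the size so reallocations stay rare.
        NewSize = Len(m_Buffer) * 2
        If NewSize < Required Then NewSize = Required
        If NewSize < 1024 Then NewSize = 1024
        m_Buffer = m_Buffer & Space$(NewSize - Len(m_Buffer))
    End If

    ' The Mid$ statement overwrites characters in place without reallocating.
    Mid$(m_Buffer, m_Length + 1, TextLen) = Text
    m_Length = Required
End Sub

' Returns the text built so far.
Public Property Get Value() As String
    Value = Left$(m_Buffer, m_Length)
End Property

Public Property Get Length() As Long
    Length = m_Length
End Property

Public Sub Clear()
    m_Buffer = vbNullString
    m_Length = 0
End Sub
```

In the file-reading loop shown earlier, you would call Builder.Append Line$ and Builder.Append vbNewLine inside the loop and read Builder.Value once at the end.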
Building a huge string in a file
As your strings grow really, really huge, CString will not be enough. Memory allocation will eventually fail, even if there is free RAM available. The limit varies but there is a point where you get Error #7: Out of memory and your application is likely to crash. Maybe it's 50 MB or 300 MB, but Out of memory can hit your application as well. This can happen if your program produces large reports, logs, web pages, XML or other textual data files by building the data in a String before writing it to a text file. What can you do to keep your program running?
Save directly to a text file. Stop keeping the string in RAM. Write your string data to a file instead. If you are creating a report or an HTML page, for example, it's better to write the text directly to a file rather than building the text in a String and writing to disk afterwards.
Temporary text file. Writing text directly to a file isn't always an option, especially when existing code should be rewritten and you want to avoid a rewrite like the plague. This is where you can build a helper class. Let's call the class FileString. The class will use CString (or any other similar solution) as the default storage, the regular RAM string. As you append text to FileString, the class will store the text in the RAM string. If the string grows larger than a certain limit (say 1 MB), the FileString class creates a temporary file and dumps the RAM string in it. It then clears the RAM string to accept more stuff. You go on appending stuff to the RAM string. Every now and then FileString will store the RAM string into the temporary file.
Once you're ready, you can have FileString save the resulting huge string to a file. This is as simple as copying the temporary file into the final file. If there is any text left in the RAM string, that text is simply appended to the end of the final file.
This way you can build virtually unlimited strings, even exceeding the amount of available RAM.
The FileString solution is especially nice when you can't tell if the string will be small or large. When the string is small, it will be built in RAM. When the string is large, it will go onto the disk. There is no limit to the string data. You can be certain your code will run as long as there's free space on the disk.
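The FileString idea could be sketched roughly as below. This is an illustration under several assumptions, not a finished class. A plain String stands in for CString. The temporary file name is built from the TEMP environment variable in a simplistic way. Print # writes ANSI text, so a Unicode-safe version would have to write the bytes differently. The member names are hypothetical.

```vb
' Class module: FileString (sketch, all member names hypothetical)
Option Explicit

' Flush threshold in characters: about 1 MB of 2-byte characters.
Private Const FLUSH_LIMIT As Long = 524288

Private m_Ram As String        ' RAM string (stands in for CString)
Private m_TempFile As String   ' stays empty until the first flush

Public Sub Append(ByRef Text As String)
    m_Ram = m_Ram & Text
    If Len(m_Ram) > FLUSH_LIMIT Then FlushToDisk
End Sub

' Moves the RAM string to the end of the temporary file.
Private Sub FlushToDisk()
    Dim FileNr As Integer

    If LenB(m_TempFile) = 0 Then
        ' Assumption: TEMP is set and this name is unique enough for a sketch.
        m_TempFile = Environ$("TEMP") & "\filestring_" & Hex$(Timer * 100) & ".tmp"
    End If

    FileNr = FreeFile
    Open m_TempFile For Append As #FileNr
    Print #FileNr, m_Ram;      ' trailing semicolon: no extra line break
    Close #FileNr
    m_Ram = vbNullString
End Sub

' Writes the complete text to FileName.
Public Sub SaveAs(ByRef FileName As String)
    Dim FileNr As Integer

    If LenB(m_TempFile) <> 0 Then
        ' Copy what is already on disk, then append the remainder.
        FileCopy m_TempFile, FileName
        FileNr = FreeFile
        Open FileName For Append As #FileNr
    Else
        ' Small string: everything is still in RAM.
        FileNr = FreeFile
        Open FileName For Output As #FileNr
    End If
    Print #FileNr, m_Ram;
    Close #FileNr
End Sub

Private Sub Class_Terminate()
    ' Clean up the temporary file, ignoring errors if it is already gone.
    On Error Resume Next
    If LenB(m_TempFile) <> 0 Then Kill m_TempFile
End Sub
```

Existing code that builds a String with & can then switch to Append calls and a single SaveAs at the end, without otherwise changing its structure.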