Algorithm flowcharted: EscapeHTML

This article uses flow charts to explain an algorithm that escapes special characters for HTML. The algorithm converts a string so that it is ready to be saved in a plain ASCII .html file. Our algorithm is written in Visual Basic 6, but the principle is not language dependent.

Theory

In HTML, certain characters are reserved. The 4 reserved characters and their escaped representations are:

Character   Escaped
&           &amp;
<           &lt;
>           &gt;
"           &quot;

If HTML is saved in a regular plain ASCII file, any non-ASCII characters need to be escaped as well. They are represented by a numerical character reference &#entity_number; where entity_number is the Unicode value of the character in decimal. Thus, the character Ø, which has a Unicode value of 216, is represented as &#216; and so on.
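As a small illustration of the numeric character reference rule, here is a sketch in Python rather than the article's Visual Basic 6. The helper name numeric_reference is made up for this example, and it assumes it is given exactly one character.

```python
def numeric_reference(ch: str) -> str:
    """Return the decimal numeric character reference for one character."""
    # ord() gives the Unicode value of the character as a decimal integer.
    return "&#" + str(ord(ch)) + ";"


print(numeric_reference("\u00d8"))  # Ø (U+00D8) -> &#216;
print(numeric_reference("\u00fc"))  # ü (U+00FC) -> &#252;
```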

Algorithm

This is the escape algorithm in Visual Basic 6. You call it like this: EscapeHTML string. As the procedure returns, the string is escaped, ready for saving into a .html file.

Public Sub EscapeHTML(ByRef Text As String)
' Escape Text to represent it in HTML in pure ASCII
' Usage: EscapeHTML string

Dim i As Long, Unicode As Long, Replacement As String

' Replace reserved characters with corresponding HTML entity
Text = Replace(Text, "&", "&amp;")
Text = Replace(Text, "<", "&lt;")
Text = Replace(Text, ">", "&gt;")
Text = Replace(Text, """", "&quot;")

' Replace all non-ASCII characters by &#entity_number;
i = 1

' Loop through all characters in input
Do While i <= Len(Text)
  ' Determine Unicode value of character
  Unicode = AscW(Mid$(Text, i, 1))
  Select Case Unicode
    Case 9, 10, 13, 32 To 126
      ' Printable ASCII or TAB, LF, CR
      ' Keep "as is", move to next character
      i = i + 1
    Case Else
      ' Non-ASCII or control character
      ' Replace by decimal &#entity_number;
      
      If Unicode < 0 Then
        ' Unicode value is negative.
        ' Get positive number by adding 65536.
        Unicode = Unicode + 65536
      End If
      
      ' Make up &#entity_number;
      Replacement = "&#" & Unicode & ";"
      
      ' Replace character by &#entity_number;
      Text = Left$(Text, i - 1) & Replacement & Mid$(Text, i + 1)
      
      ' Skip Replacement, move to next character
      i = i + Len(Replacement)
  End Select
Loop

' Ready.
' Return value is parameter Text.

End Sub

Flow chart

In flow chart form, the algorithm looks like this:

Full flow chart of Sub EscapeHTML

The flow charts on this page were created by Visustin, a flow chart tool that converts source code to flow charts.

Flow chart step by step

As we start, we first replace the reserved characters with their escaped version. This is really straightforward in Visual Basic.

Flow chart of the start of Sub EscapeHTML
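The order of these replacements matters: the ampersand has to be handled first, because the other three replacements insert new ampersands. The following Python sketch (a port for illustration, not the article's VB6 code; both function names are hypothetical) shows the correct order next to a deliberately wrong one.

```python
def escape_reserved(text: str) -> str:
    """Replace the 4 reserved characters, ampersand first."""
    text = text.replace("&", "&amp;")
    text = text.replace("<", "&lt;")
    text = text.replace(">", "&gt;")
    text = text.replace('"', "&quot;")
    return text


def escape_reserved_wrong_order(text: str) -> str:
    """Deliberately wrong: '<' is replaced before '&'."""
    text = text.replace("<", "&lt;")
    # This second step also escapes the '&' that was just inserted.
    text = text.replace("&", "&amp;")
    return text


print(escape_reserved("a < b & c"))          # a &lt; b &amp; c
print(escape_reserved_wrong_order("a < b"))  # a &amp;lt; b
```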

Done with that, let's deal with the non-ASCII characters. We have a loop that runs through the input string character by character.

Flow chart of a loop

Inside the loop, we examine the Unicode value of the character. The VB function AscW() combined with Mid$() gives us the Unicode value of the character at position i.

If it's a regular ASCII character, we can just skip it. Regular ASCII characters are in Unicode range 32 to 126, which is the range for printable ASCII characters. Additional regular characters are 9, 10 or 13, which are the usual Tab, Linefeed and Carriage Return characters. None of these characters need any escaping. They will appear "as is". We just advance to the next character by letting i = i + 1.

All other characters will be turned into &#entity_number;. This is the "Case Else" branch we will be looking into next.

Flow chart of Select Case
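The decision made by the Select Case can be sketched as a single test. This is a Python illustration, not the original VB6; the function name needs_escaping is an assumption made for this example.

```python
def needs_escaping(code: int) -> bool:
    """Return True if a character with this Unicode value must be escaped."""
    # Tab (9), linefeed (10), carriage return (13) and printable ASCII
    # (32 to 126) are kept as is; everything else is escaped.
    keep_as_is = code in (9, 10, 13) or 32 <= code <= 126
    return not keep_as_is


for ch in ("A", "\t", "\x00", "\x7f", "\u00fc"):
    print(ord(ch), needs_escaping(ord(ch)))
```

Note that 127 (DEL) and the control characters below 32, other than the three listed, fall into the escape branch just like non-ASCII characters.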

Now, how do we convert those special characters into &#entity_number;?

First, we need to have the correct entity_number, which varies from 0 to 65535 in this little example. We have that number in the variable Unicode. Well, almost. The VB function AscW we used above returns a signed 16-bit Integer, so for values of 32768 and above it returns a negative value from -32768 to -1. We need to fix the value by adding 65536. This pushes any negative numbers up so that we end up with a correct entity_number that always falls in the range 0 to 65535.
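The arithmetic of this fix can be tried out in Python. Python's own ord() never returns a negative number, so the sketch below uses a hypothetical stand-in, ascw_like, that mimics the signed 16-bit result of AscW for characters up to 65535. Both function names are assumptions for illustration only.

```python
def ascw_like(ch: str) -> int:
    """Mimic a signed 16-bit result for a character in the range 0..65535."""
    code = ord(ch)
    if code > 0xFFFF:
        raise ValueError("this sketch only covers values 0 to 65535")
    # Values 32768..65535 wrap around to -32768..-1 in a signed 16-bit integer.
    return code - 65536 if code >= 32768 else code


def fix_negative(value: int) -> int:
    """Add 65536 to a negative value, as the article's algorithm does."""
    return value + 65536 if value < 0 else value


signed = ascw_like("\ufffd")           # U+FFFD has the value 65533
print(signed, fix_negative(signed))    # -3 65533
print(fix_negative(ascw_like("\u20ac")))  # 8364, already positive
```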

Now that we have the correct entity_number in the variable Unicode, all we need to do is create the &#entity_number; string and replace the special character with it. That's what the rest of the algorithm does.

Flow chart of the end of the algorithm
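The splice-and-skip step can be sketched on its own. The Python below is an illustration, not the article's VB6: the name replace_char_at is hypothetical, and the index is 0-based, whereas i in the VB6 code is 1-based.

```python
def replace_char_at(text: str, i: int, replacement: str):
    """Replace the single character at index i and return (new_text, next_i)."""
    # Everything before i, then the replacement, then everything after i.
    new_text = text[:i] + replacement + text[i + 1:]
    # Skip over the replacement so that it is not examined again.
    next_i = i + len(replacement)
    return new_text, next_i


text, i = replace_char_at("a\u00d8b", 1, "&#216;")
print(text)        # a&#216;b
print(i, text[i])  # 7 b
```

Advancing by the length of the replacement, rather than by 1, means the loop continues with the character that originally followed the escaped one.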

So here we are. When the loop has finished running, we have escaped all the required characters in the variable Text. As the algorithm ends, this variable is passed back to the caller with freshly escaped content.

Output

Text = "München-København < 1000 km"
EscapeHTML Text
' Text is now: M&#252;nchen-K&#248;benhavn &lt; 1000 km
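For readers without Visual Basic 6, the example can be reproduced with a short Python port. This is a sketch under assumptions, not the article's code: the name escape_html is made up, it builds the result piece by piece instead of splicing the string in place, and it needs no negative-value fix because Python's ord() returns the Unicode value directly.

```python
def escape_html(text: str) -> str:
    """Escape text so it can be stored in a plain ASCII .html file."""
    # Reserved characters first; "&" must come before the others.
    for char, entity in (("&", "&amp;"), ("<", "&lt;"),
                         (">", "&gt;"), ('"', "&quot;")):
        text = text.replace(char, entity)

    pieces = []
    for ch in text:
        code = ord(ch)
        if code in (9, 10, 13) or 32 <= code <= 126:
            pieces.append(ch)                     # keep as is
        else:
            pieces.append("&#" + str(code) + ";")  # numeric reference
    return "".join(pieces)


print(escape_html("München-København < 1000 km"))
# M&#252;nchen-K&#248;benhavn &lt; 1000 km
```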
