
Searching with Tokens

Within a regular expression, parentheses are used to group characters or expressions. These grouped expressions are called tokens. Tokens can be used in matching text. For example, the regular expression and(y|rew) matches the text andy or andrew. Tokens also remember what they matched so that you can recall and reuse the found text with a special variable for searching or replacing.
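As a small illustration of grouping and recall, here is a minimal sketch using Python's `re` module. The choice of Python is an assumption made only for illustration, since token behavior varies between regular expression engines. The sample text and the replacement format are hypothetical.

```python
import re

# The group (y|rew) lets the pattern match either "andy" or "andrew".
pattern = re.compile(r"and(y|rew)")

# Hypothetical sample text, chosen only for illustration.
text = "andy met andrew"

# For each match, record the whole match and what the group remembered.
matches = [(m.group(0), m.group(1)) for m in pattern.finditer(text)]
print(matches)  # [('andy', 'y'), ('andrew', 'rew')]

# Recall the remembered text in a replacement string with \1.
replaced = pattern.sub(r"AND[\1]", text)
print(replaced)  # AND[y] met AND[rew]
```

In this sketch the group serves both purposes described above: it controls what is matched, and its captured text is reused in the replacement.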

Here is an example of how tokens are assigned values. Suppose that you are going to search the following text:

You choose to search the above text with the following search pattern:

This pattern contains three parenthetical expressions that generate tokens. When you perform the search, the following tokens are generated for each match.

Match     Token 1   Token 2
andy      y
ted       t         d
andrew    rew
andy      y
ted       t         d
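The token assignments in the table can be reproduced with a short Python sketch. The pattern and the sample text below are assumptions, picked only because they contain three parenthetical expressions and yield the same matches and tokens as the table. They are not taken from the original example. Python numbers all three groups for every match and reports unused ones as `None`, so the sketch keeps only the groups that took part in each match.

```python
import re

# Hypothetical pattern and text, chosen to be consistent with the table.
pattern = re.compile(r"and(y|rew)|(t)e(d)")
text = "andy ted bob jim andrew andy ted mark"

rows = []
for m in pattern.finditer(text):
    # Keep only the groups that participated in this match, left to right.
    tokens = [g for g in m.groups() if g is not None]
    rows.append((m.group(0), tokens))

for match, tokens in rows:
    print(match, tokens)
```

Each printed row pairs a match with its tokens: one token for andy and andrew, two tokens for ted.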

Only the outermost (highest-level) parentheses generate tokens. For example, if the search pattern and(y|rew) finds the text andrew, token 1 is assigned the value rew. However, if the search pattern (and(y|rew)) is used, token 1 is assigned the value andrew.
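The effect of wrapping the whole pattern in an outer pair of parentheses can be checked with the sketch below, again using Python's `re` module as an assumed stand-in. Note one difference: Python also assigns a number to the nested group, whereas the rule described above uses only the outermost parentheses.

```python
import re

text = "andrew"  # hypothetical input

inner_only = re.search(r"and(y|rew)", text)
outer = re.search(r"(and(y|rew))", text)

print(inner_only.group(1))  # rew
print(outer.group(1))       # andrew

# Python-specific behavior: the nested group is numbered as well.
print(outer.group(2))       # rew
```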

The variables that allow you to use tokens in your search pattern have the form \1, \2, ..., \n (n < 17). They are assigned left to right from the parenthetical expressions in the search string.
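To illustrate a token variable used inside the search pattern itself, the following sketch finds a word that is immediately repeated. It uses Python's `re` module as an assumption. The sample text is hypothetical, and the limit on the number of token variables in Python differs from the one stated above.

```python
import re

# (\w+) captures a word; \1 then requires the same word to appear again.
repeated_word = re.compile(r"\b(\w+) \1\b")

# Hypothetical sample text containing two accidental repetitions.
text = "this is is a test of the the pattern"

found = [m.group(1) for m in repeated_word.finditer(text)]
print(found)  # ['is', 'the']
```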

As an example, suppose that you are searching an HTML file for table entries. Generally, HTML lines with table entries have the following form:

You can use search pattern tokens to search and find all table entries with the following search pattern:

This search pattern finds the following:

  1. The expression <(\w*)> finds any number of word characters enclosed by angle brackets. Because of the parentheses around \w*, the word characters matched are placed in token 1. In our example, <(\w*)> finds the string <TD> and places the string TD in token 1.
  2. The expression (.*?) contains the expression .*?, which finds the minimum number of any characters. Using the expression (.*) instead would find the maximum number of any characters and could consume the entire file before finding the angle bracket characters sought by the next expression.
  3. The parentheses in the expression (.*?) place the matched characters in token 2. Although this is not necessary for the search, placing these characters in a token allows you to reference them in replacement text.
  4. The expression <\\\1> uses the variable \1 for token 1, preceded by two backslashes. Because the backslash is the regular expression escape character, two backslashes specify a single backslash as a search character. In the HTML example, since the first expression, <(\w*)>, finds the string <TD> and places the string TD in token 1, the expression <\\\1> finds the following <\TD> string.
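The steps above can be tried out with the sketch below, which assembles the three expressions into one pattern and compares the minimal (.*?) form with the maximal (.*) form. Python's `re` module is an assumed stand-in, and the sample line is hypothetical, written with a backslash in the closing tag to match the description in the last step.

```python
import re

# Hypothetical line with two table entries, closing tags written as <\TD>.
line = r"<TD>first entry<\TD><TD>second entry<\TD>"

# Minimal match: stops at the first closing tag that repeats token 1.
lazy = re.compile(r"<(\w*)>(.*?)<\\\1>")
# Maximal match: runs on to the last closing tag on the line.
greedy = re.compile(r"<(\w*)>(.*)<\\\1>")

lazy_hits = [(m.group(1), m.group(2)) for m in lazy.finditer(line)]
greedy_hits = [(m.group(1), m.group(2)) for m in greedy.finditer(line)]

print(lazy_hits)    # two entries, one per table cell
print(greedy_hits)  # one entry that swallows both cells

# Token 2 can be reused in replacement text.
stripped = lazy.sub(r"[\2]", line)
print(stripped)  # [first entry][second entry]
```

With the minimal form, each table entry is found separately. With the maximal form, the single match spans both entries. For input whose closing tags use a forward slash, as in `</TD>`, the final expression in this Python sketch would be written `</\1>` instead.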
