
Chapter 7 Pattern Matching

CONTENTS

Introduction
The Match Operators
- Match-Operator Precedence
Special Characters in Patterns
Pattern-Matching Options
The Substitution Operator
The Translation Operator
- Options for the Translation Operator
Extended Pattern-Matching
Summary
Q&A
Workshop
- Quiz
- Exercises

This chapter describes the pattern-matching features of Perl. You learn about the following:

How pattern matching works
The pattern-matching operators
Special characters supported in pattern matching
Pattern-matching options
Pattern substitution
Translation
Extended pattern-matching features

Introduction

A pattern is a sequence of characters to be searched for in a character string. In Perl, patterns are normally enclosed in slash characters:


/def/

This represents the pattern def.

If the pattern is found, a match occurs. For example, if you search the string redefine for the pattern /def/, the pattern matches the third, fourth, and fifth characters.


redefine

You already have seen a simple example of pattern matching in the library function split.


@array = split(/ /, $line);

Here the pattern / / matches a single space, which splits a line into words.

The Match Operators

Perl defines special operators that test whether a particular pattern appears in a character string.

The =~ operator tests whether a pattern is matched, as shown in the following:


$result = $var =~ /abc/;

The result of the =~ operation is one of the following:

A nonzero value, or true, if the pattern is found in the string
0, or false, if the pattern is not matched

In this example, the value stored in the scalar variable $var is searched for the pattern abc. If abc is found, $result is assigned a nonzero value; otherwise, $result is set to zero.

The !~ operator is similar to =~, except that it checks whether a pattern is not matched.


$result = $var !~ /abc/;

Here, $result is set to 0 if abc appears in the string assigned to $var, and to a nonzero value if abc is not found.
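As a quick check of the two operators, the following minimal sketch stores the results of =~ and !~ and tests them. The variable names $found and $missing are illustrative and do not appear in the text.

```perl
use strict;
use warnings;

my $var = "xyzabcdef";

# =~ yields true when the pattern is found in the string.
my $found = $var =~ /abc/;

# !~ yields true when the pattern is NOT found in the string.
my $missing = $var !~ /abc/;

print "abc was found\n"       if $found;
print "abc is not missing\n"  unless $missing;
```

Both print statements run here, because abc appears in the string stored in $var.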

Because =~ and !~ produce either true or false as their result, these operators are ideally suited for use in conditional expressions. Listing 7.1 is a simple program that uses the =~ operator to test whether a particular sequence of characters exists in a character string.

Listing 7.1. A program that illustrates the use of the matching operator.


1:  #!/usr/local/bin/perl

2:  

3:  print ("Ask me a question politely:\n");

4:  $question = <STDIN>;

5:  if ($question =~ /please/) {

6:          print ("Thank you for being polite!\n");

7:  } else {

8:          print ("That was not very polite!\n");

9:  }


$ program7_1

Ask me a question politely:

May I have a glass of water, please?

Thank you for being polite!

$

Line 5 is an example of the use of the match operator =~ in a conditional expression. The following expression is true if the value stored in $question contains the word please, and it is false if it does not:


$question =~ /please/

Match-Operator Precedence

Like all operators, the match operators have a defined precedence. By definition, the =~ and !~ operators have higher precedence than multiplication and division, and lower precedence than the exponentiation operator **.

For a complete list of Perl operators and their precedence, see Chapter 4, "More Operators."

Special Characters in Patterns

Perl supports a variety of special characters inside patterns, which enables you to match any of a number of character strings. These special characters are what make patterns useful.

The `+` Character

The special character + means "one or more of the preceding characters." For example, the pattern /de+f/ matches any of the following:


def

deef

deeef

deeeeeeef

NOTE

Patterns containing + always try to match as many characters as possible. For example, if the pattern

/ab+/

is searching in the string

abbc

it matches abb, not ab.
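One way to see this behavior for yourself is to print the text that was actually matched. The sketch below assumes the built-in variable $&, which holds the text matched by the last successful pattern match; it has not been introduced in this chapter and is used here only for illustration.

```perl
use strict;
use warnings;

my $string  = "abbc";
my $matched = "";

if ($string =~ /ab+/) {
    # $& holds the text matched by the last successful match.
    $matched = $&;
}

print "matched: $matched\n";   # prints "matched: abb"
```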

The + special character makes it possible to define a better way to split lines into words. So far, the sample programs you have seen have used


@words = split (/ /, $line);

to break an input line into words. This works well if there is exactly one space between words. However, if an input line contains more than one space between words, as in


Here's  multiple  spaces.

the call to split produces the following list:


("Here's", "", "multiple", "", "spaces.")

The pattern / / tells split to start a new word whenever it sees a space. Because there are two spaces between each word, split starts a word when it sees the first space, and then starts another word when it sees the second space. This means that there are now "empty words" in the line.

The + special character gets around this problem. Suppose the call to split is changed to this:


@array = split (/ +/, $line);

Because the pattern / +/ tries to match as many blank characters as possible, the line


Here's  multiple  spaces.

produces the following list:


("Here's", "multiple", "spaces.")
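The difference between the two patterns can be sketched as follows. The array names @single and @plus are illustrative; the input line has two spaces between each pair of words.

```perl
use strict;
use warnings;

my $line = "Here's  multiple  spaces.";   # two spaces between words

my @single = split(/ /,  $line);   # every space is a separator
my @plus   = split(/ +/, $line);   # every run of spaces is one separator

print scalar(@single), " fields with / /\n";    # prints "5 fields with / /"
print scalar(@plus),   " fields with / +/\n";   # prints "3 fields with / +/"
```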

Listing 7.2 shows how you can use the / +/ pattern to produce a count of the number of words in a file.

Listing 7.2. A word-count program that handles multiple spaces between words.


1:  #!/usr/local/bin/perl

2:  

3:  $wordcount = 0;

4:  $line = <STDIN>;

5:  while ($line ne "") {

6:          chop ($line);

7:          @words = split(/ +/, $line);

8:          $wordcount += @words;

9:          $line = <STDIN>;

10: }

11: print ("Total number of words: $wordcount\n");


$ program7_2

Here   is  some input.

Here are   some   more words.

Here      is my  last  line.

^D

Total number of words: 14

$

This is the same word-count program you saw in Listing 5.15, with only one change: The pattern / +/ is being used to break the line into words. As you can see, this handles spaces between words properly.

You might have noticed the following problems with this word-count program:

Spaces at the beginning of a line are counted as a word, because split always starts a new word when it sees a space.
Tab characters are counted as a word.

For an example of the first problem, take a look at the following input line:


    This line contains leading spaces.

The call to split in line 7 breaks the preceding into the following list:


("", "This", "line", "contains", "leading", "spaces.")

This yields a word count of 6, not the expected 5.

There can be at most one empty word produced from a line, no matter how many leading spaces there are, because the pattern / +/ matches as many spaces as possible. Note also that the program can distinguish between lines containing words and lines that are blank or contain just spaces. If a line is blank or contains only spaces, the line


@words = split(/ +/, $line);

assigns the empty list to @words. Because of this, you can fix the problem of leading spaces in lines by modifying line 8 as follows:


$wordcount += (@words > 0 && $words[0] eq "" ? 

               @words-1 : @words);

This checks for lines containing leading spaces; if a line contains leading spaces, the first "word" (which is the empty string) is not added to the word count.
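To try the adjusted expression outside the full program, it can be wrapped in a small subroutine. The name count_words is hypothetical and is not part of Listing 7.2; this is a sketch of the same logic, not a replacement for the listing.

```perl
use strict;
use warnings;

# Hypothetical helper: count the words on one line (newline already removed).
sub count_words {
    my ($line) = @_;
    my @words = split(/ +/, $line);
    # Leading spaces produce an empty first "word"; do not count it.
    return (@words > 0 && $words[0] eq "") ? @words - 1 : scalar(@words);
}

my $plain   = count_words("This line contains leading spaces.");
my $leading = count_words("    This line contains leading spaces.");
my $blank   = count_words("     ");

print "$plain $leading $blank\n";   # prints "5 5 0"
```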

To find out how to modify the program to deal with tab characters as well as spaces, see the following section.

The `[]` Special Characters

The [] special characters enable you to define patterns that match one of a group of alternatives. For example, the following pattern matches def or dEf:


/d[eE]f/

You can specify as many alternatives as you like.


/a[0123456789]c/

This matches a, followed by any digit, followed by c.

You can combine [] with + to match a sequence of characters of any length.


/d[eE]+f/

This matches all of the following:


def

dEf

deef

dEef

dEEEeeeEef

Any combination of E and e, in any order, is matched by [eE]+.

You can use [] and + together to modify the word-count program you've just seen to accept either tab characters or spaces. Listing 7.3 shows how you can do this.

Listing 7.3. A word-count program that handles multiple spaces and tabs between words.


1:  #!/usr/local/bin/perl

2:  

3:  $wordcount = 0;

4:  $line = <STDIN>;

5:  while ($line ne "") {

6:          chop ($line);

7:          @words = split(/[\t ]+/, $line);

8:          $wordcount += @words;

9:          $line = <STDIN>;

10: }

11: print ("Total number of words: $wordcount\n");


$ program7_3

Here is some input.

Here are some more words.

Here is my last line.

^D

Total number of words: 14

$

This program is identical to Listing 7.2, except that the pattern is now /[\t ]+/.

The \t special-character sequence represents the tab character, and this pattern matches any combination or quantity of spaces and tabs.

NOTE

Any escape sequence that is supported in double-quoted strings is supported in patterns. See Chapter 3, "Understanding Scalar Values," for a list of the escape sequences that are available.

The `*` and `?` Special Characters

As you have seen, the + character matches one or more occurrences of a character. Perl also defines two other special characters that match a varying number of characters: * and ?.

The * special character matches zero or more occurrences of the preceding character. For example, the pattern


/de*f/

matches df, def, deef, and so on.

This character can also be used with the [] special character.


/[eE]*/

This matches the empty string as well as any combination of E or e in any order.

Be sure not to confuse the * special character with the + special character. If you use the wrong special character, you might not get the results that you want.

For example, suppose that you modify Listing 7.3 to call split as follows:

@words = split (/[\t ]*/, $line);

This matches zero or more occurrences of the space or tab character. When you run this with the input

a line

here's the list that is assigned to @words:

("a", "l", "i", "n", "e")

Because the pattern /[\t ]*/ matches on zero occurrences of the space or tab character, it matches after every character. This means that split starts a word after every character that is not a space or tab. (It skips spaces and tabs because /[\t ]*/ matches them.)
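The effect is easy to reproduce in isolation. This short sketch splits the same input with both patterns; the array names @star and @plus are illustrative.

```perl
use strict;
use warnings;

my $line = "a line";

my @star = split(/[\t ]*/, $line);   # can match zero characters
my @plus = split(/[\t ]+/, $line);   # needs at least one space or tab

print join("|", @star), "\n";   # prints "a|l|i|n|e"
print join("|", @plus), "\n";   # prints "a|line"
```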

The best way to avoid problems such as this one is to use the * special character only when there is another character appearing in the pattern. Patterns such as

/b*[c]/

never match the null string, because the matched sequence has to contain at least the character c.

The ? character matches zero or one occurrence of the preceding character. For example, the pattern


/de?f/

matches either df or def. Note that it does not match deef, because the ? character does not match two occurrences of a character.
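The following sketch compares *, +, and ? side by side on the same three strings. The hash %result is an illustrative name used only to collect the answers.

```perl
use strict;
use warnings;

my %result;
for my $s ("df", "def", "deef") {
    my $star = ($s =~ /de*f/) ? "yes" : "no";   # zero or more e
    my $plus = ($s =~ /de+f/) ? "yes" : "no";   # one or more e
    my $opt  = ($s =~ /de?f/) ? "yes" : "no";   # zero or one e
    $result{$s} = "$star $plus $opt";
    print "$s: * $star, + $plus, ? $opt\n";
}
```

Only df fails with +, and only deef fails with ?; all three strings match with *.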

Escape Sequences for Special Characters

If you want your pattern to include a character that is normally treated as a special character, precede the character with a backslash \. For example, to check for one or more occurrences of * in a string, use the following pattern:


/\*+/

The backslash preceding the * tells the Perl interpreter to treat the * as an ordinary character, not as the special character meaning "zero or more occurrences."

To include a backslash in a pattern, specify two backslashes:


/\\+/

This pattern tests for one or more occurrences of \ in a string.

If you are running Perl 5, another way to tell Perl that a special character is to be treated as a normal character is to precede it with the \Q escape sequence. When the Perl interpreter sees \Q, every character following the \Q is treated as a normal character until \E is seen. This means that the pattern


/\Q^ab*/

matches any occurrence of the string ^ab*, and the pattern


/\Q^ab\E*/

matches ^a followed by zero or more occurrences of b.
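Here is a small sketch of both patterns at work. It assumes the built-in variable $& (the text matched by the last successful match) to show what the second pattern picks up; the test strings are made up for illustration.

```perl
use strict;
use warnings;

# \Q with no \E: every following character is taken literally.
my $literal    = ("x^ab*y" =~ /\Q^ab*/) ? 1 : 0;   # contains the text ^ab*
my $no_literal = ("xabby"  =~ /\Q^ab*/) ? 1 : 0;   # does not contain ^ab*

# \Q ... \E: only ^ab is literal, so the * still applies to the b.
my $mixed = "";
$mixed = $& if "x^abbby" =~ /\Q^ab\E*/;

print "$literal $no_literal $mixed\n";   # prints "1 0 ^abbb"
```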

For a complete list of special characters in patterns that require \ to be given their natural meaning, see the section titled "Special-Character Precedence," which contains a table that lists them.

TIP

In Perl, any character that is not a letter or a digit can be preceded by a backslash. If the character isn't a special character in Perl, the backslash is ignored.

If you are not sure whether a particular character is a special character, preceding it with a backslash will ensure that your pattern behaves the way you want it to.

Matching Any Letter or Number

As you have seen, the pattern


/a[0123456789]c/

matches a, followed by any digit, followed by c. Another way of writing this is as follows:


/a[0-9]c/

Here, the range [0-9] represents any digit between 0 and 9. This pattern matches a0c, a1c, a2c, and so on up to a9c.

Similarly, the range [a-z] matches any lowercase letter, and the range [A-Z] matches any uppercase letter. For example, the pattern


/[A-Z][A-Z]/

matches any two uppercase letters.

To match any uppercase letter, lowercase letter, or digit, use the following range:


/[0-9a-zA-Z]/

Listing 7.4 provides an example of the use of ranges with the [] special characters. This program checks whether a given input line contains a legal Perl scalar, array, or file-variable name. (Note that this program handles only simple input lines. Later examples will solve this problem in a better way.)

Listing 7.4. A simple variable-name validation program.


1:  #!/usr/local/bin/perl

2:  

3:  print ("Enter a variable name:\n");

4:  $varname = <STDIN>;

5:  chop ($varname);

6:  if ($varname =~ /\$[A-Za-z][_0-9a-zA-Z]*/) {

7:          print ("$varname is a legal scalar variable\n");

8:  } elsif ($varname =~ /@[A-Za-z][_0-9a-zA-Z]*/) {

9:          print ("$varname is a legal array variable\n");

10: } elsif ($varname =~ /[A-Za-z][_0-9a-zA-Z]*/) {

11:         print ("$varname is a legal file variable\n");

12: } else {

13:         print ("I don't understand what $varname is.\n");

14: }


$ program7_4

Enter a variable name:

$result

$result is a legal scalar variable

$

Line 6 checks whether the input line contains the name of a legal scalar variable. Recall that a legal scalar variable consists of the following:

A $ character
An uppercase or lowercase letter
Zero or more letters, digits, or underscore characters

Each part of the pattern tested in line 6 corresponds to one of the conditions just listed. The first part of the pattern, \$, ensures that the pattern matches only if it begins with a $ character.

NOTE

The $ is preceded by a backslash, because $ is a special character in patterns. See the following section, "Anchoring Patterns," for more information on the $ special character.

The second part of the pattern,


[A-Za-z]

matches exactly one uppercase or lowercase letter. The final part of the pattern,


[_0-9a-zA-Z]*

matches zero or more underscores, digits, or letters in any order.

The patterns in line 8 and line 10 are very similar to the one in line 6. The only difference in line 8 is that the pattern there matches a string whose first character is @, not $. In line 10, this first character is omitted completely.

The pattern in line 8 corresponds to the definition of a legal array-variable name, and the pattern in line 10 corresponds to the definition of a legal file-variable name.

Anchoring Patterns

Although Listing 7.4 can determine whether a line of input contains a legal Perl variable name, it cannot determine whether there is extraneous input on the line. For example, it can't tell the difference between the following three lines of input:


$result

junk$result

$result#junk

In all three cases, the pattern


/\$[a-zA-Z][_0-9a-zA-Z]*/

finds the string $result and matches successfully; however, only the first line is a legal Perl variable name.
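A short sketch makes the problem concrete: all three inputs are accepted by the unanchored pattern. The array names are illustrative.

```perl
use strict;
use warnings;

my @inputs = ('$result', 'junk$result', '$result#junk');

# The unanchored pattern succeeds wherever $result appears in the string.
my @accepted = grep { /\$[a-zA-Z][_0-9a-zA-Z]*/ } @inputs;

print scalar(@accepted), " of ", scalar(@inputs), " inputs matched\n";
# prints "3 of 3 inputs matched"
```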

To fix this problem, you can use pattern anchors. Table 7.1 lists the pattern anchors defined in Perl.

Table 7.1. Pattern anchors in Perl.

Anchor	Description
`^` or `\A`	Match at beginning of string only
`$` or `\Z`	Match at end of string only
`\b`	Match on word boundary
`\B`	Match inside word

These pattern anchors are described in the following sections.

The `^` and `$` Pattern Anchors

The pattern anchors ^ and $ ensure that the pattern is matched only at the beginning or the end of a string. For example, the pattern


/^def/

matches def only if these are the first three characters in the string. Similarly, the pattern


/def$/

matches def only if these are the last three characters in the string.

You can combine ^ and $ to force matching of the entire string, as follows:


/^def$/

This matches only if the string is def.

In most cases, the escape sequences \A and \Z (defined in Perl 5) are equivalent to ^ and $, respectively:


/\Adef\Z/

This also matches only if the string is def.

NOTE

\A and \Z behave differently from ^ and $ when the multiple-line pattern-matching option is specified. Pattern-matching options are described later in this chapter.

Listing 7.5 shows how you can use pattern anchors to ensure that a line of input is, in fact, a legal Perl scalar-, array-, or file-variable name.

Listing 7.5. A better variable-name validation program.


1:  #!/usr/local/bin/perl

2:  

3:  print ("Enter a variable name:\n");

4:  $varname = <STDIN>;

5:  chop ($varname);

6:  if ($varname =~ /^\$[A-Za-z][_0-9a-zA-Z]*$/) {

7:          print ("$varname is a legal scalar variable\n");

8:  } elsif ($varname =~ /^@[A-Za-z][_0-9a-zA-Z]*$/) {

9:          print ("$varname is a legal array variable\n");

10: } elsif ($varname =~ /^[A-Za-z][_0-9a-zA-Z]*$/) {

11:         print ("$varname is a legal file variable\n");

12: } else {

13:         print ("I don't understand what $varname is.\n");

14: }


$ program7_5

Enter a variable name:

x$result

I don't understand what x$result is.

$

The only difference between this program and the one in Listing 7.4 is that this program uses the pattern anchors ^ and $ in the patterns in lines 6, 8, and 10. These anchors ensure that a valid pattern consists of only those characters that make up a legal Perl scalar, array, or file variable.

In the sample output given here, the input


x$result

is rejected, because the pattern in line 6 is matched only when the $ character appears at the beginning of the line.
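The same three anchored tests can be sketched as a subroutine so that several inputs can be tried at once. The name classify is hypothetical and not part of Listing 7.5, and the @ is written as \@ here as a precaution against array interpolation in the pattern.

```perl
use strict;
use warnings;

# Hypothetical helper: report which kind of variable name, if any, was given.
sub classify {
    my ($name) = @_;
    return "scalar"  if $name =~ /^\$[A-Za-z][_0-9a-zA-Z]*$/;
    return "array"   if $name =~ /^\@[A-Za-z][_0-9a-zA-Z]*$/;
    return "file"    if $name =~ /^[A-Za-z][_0-9a-zA-Z]*$/;
    return "unknown";
}

for my $name ('$result', '@list', 'MYFILE', 'x$result', '$result#junk') {
    print "$name: ", classify($name), "\n";
}
```

With the anchors in place, only the first three names are accepted; x$result and $result#junk are reported as unknown.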

Word-Boundary Pattern Anchors

The word-boundary pattern anchors, \b and \B, specify whether a matched pattern must be on a word boundary or inside a word. (A word boundary is the beginning or end of a word.)

The \b pattern anchor specifies that the pattern must be on a word boundary. For example, the pattern


/\bdef/

matches only if def is the beginning of a word. This means that def and defghi match but abcdef does not.

You can also use \b to indicate the end of a word. For example,


/def\b/

matches def and abcdef, but not defghi. Finally, the pattern


/\bdef\b/

matches only the word def, not abcdef or defghi.

NOTE

A word is assumed to contain letters, digits, and underscore characters, and nothing else. This means that

/\bdef/

matches $defghi: because $ is not assumed to be part of a word, def is the beginning of the word defghi, and /\bdef/ matches it.

The \B pattern anchor is the opposite of \b. \B matches only if the pattern is contained in a word. For example, the pattern


/\Bdef/

matches abcdef, but not def. Similarly, the pattern


/def\B/

matches defghi, and


/\Bdef\B/

matches cdefg or abcdefghi, but not def, defghi, or abcdef.
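The four combinations described above can be checked with a short sketch. The word list and array names are chosen for illustration.

```perl
use strict;
use warnings;

my @words = ("def", "defghi", "abcdef", "abcdefghi");

my @starts = grep { /\bdef/ }   @words;   # def begins a word
my @ends   = grep { /def\b/ }   @words;   # def ends a word
my @whole  = grep { /\bdef\b/ } @words;   # def is the whole word
my @inside = grep { /\Bdef\B/ } @words;   # def is strictly inside a word

print "starts a word: @starts\n";   # prints "starts a word: def defghi"
print "ends a word:   @ends\n";     # prints "ends a word:   def abcdef"
print "whole word:    @whole\n";    # prints "whole word:    def"
print "inside a word: @inside\n";   # prints "inside a word: abcdefghi"
```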

The \b and \B pattern anchors enable you to search for words in an input line without having to break up the line using split. For example, Listing 7.6 uses \b to count the number of lines of an input file that contain the word the.

Listing 7.6. A program that counts the number of input lines containing the word the.


1:  #!/usr/local/bin/perl

2:  

3:  $thecount = 0;

4:  print ("Enter the input here:\n");

5:  $line = <STDIN>;

6:  while ($line ne "") {

7:          if ($line =~ /\bthe\b/) {

8:                  $thecount += 1;

9:          }

10:         $line = <STDIN>;

11: }

12:  print ("Number of lines containing 'the': $thecount\n");


$ program7_6

Enter the input here:

Now is the time

for all good men

to come to the aid

of the party.

^D

Number of lines containing 'the': 3

$

This program checks each line in turn to see if it contains the word the, and then prints the total number of lines that contain the word.

Line 7 performs the actual checking by trying to match the pattern


/\bthe\b/

If this pattern matches, the line contains the word the, because the pattern checks for word boundaries at either end.

Note that this program doesn't check whether the word the appears on a line more than once. It is not difficult to modify the program to do this; in fact, you can do it in several different ways.

The most obvious but most laborious way is to break up lines that you know contain the into words, and then check each word, as follows:


if ($line =~ /\bthe\b/) {

        @words = split(/[\t ]+/, $line);

        $count = 1;

        while ($count <= @words) {

                if ($words[$count-1] eq "the") {

                        $thecount += 1;

                }

                $count++;

        }

}

A cute way to accomplish the same thing is to use the pattern itself to break the line into words:


if ($line =~ /\bthe\b/) {

        @words = split(/\bthe\b/, $line);

        $thecount += @words - 1;

}

In fact, you don't even need the if statement.


@words = split(/\bthe\b/, $line);

$thecount += @words - 1;

Here's why this works: Every time split sees the word the, it starts a new word. Therefore, the number of occurrences of the is equal to one less than the number of elements in @words. If there are no occurrences of the, @words has the length 1, and $thecount is not changed.

This trick works only if you know that there is at least one word on the line.

Consider the following code, which tries to use the aforementioned trick on a line that has had its newline character removed using chop:

$line = <STDIN>;

chop ($line);

@words = split(/\bthe\b/, $line);

$thecount += @words - 1;

This code actually subtracts 1 from $thecount if the line is blank or consists only of the word the, because in these cases @words is the empty list and the length of @words is 0.

Leaving off the call to chop protects against this problem, because there will always be at least one "word" in every line (consisting of the newline character).
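The difference between the two cases can be sketched with a small subroutine. The name count_the is hypothetical; it applies the split trick to whatever string it is given, with or without the trailing newline.

```perl
use strict;
use warnings;

# Hypothetical helper: apply the split trick to one line.
sub count_the {
    my ($line) = @_;
    my @words = split(/\bthe\b/, $line);
    return @words - 1;
}

my $typical       = count_the("the cat saw the dog\n");  # newline kept
my $with_newline  = count_the("the\n");                  # newline kept
my $chopped       = count_the("the");                    # newline removed
my $blank_chopped = count_the("");                       # blank line, newline removed

print "$typical $with_newline $chopped $blank_chopped\n";   # prints "2 1 -1 -1"
```

The last two results are -1 because split discards trailing empty fields, leaving @words empty.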

Variable Substitution in Patterns

If you like, you can use the value of a scalar variable in a pattern. For example, the following code splits the line $line into words:


$pattern = "[\\t ]+";

@words = split(/$pattern/, $line);
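A runnable version of this fragment might look like the following sketch; the sample input line is made up for illustration. Note the doubled backslash, which is needed so that the string stored in $pattern contains \t.

```perl
use strict;
use warnings;

my $pattern = "[\\t ]+";                      # the string [\t ]+
my $line    = "one\ttwo   three \t four";     # mixed tabs and spaces

# The value of $pattern is substituted into the pattern before matching.
my @words = split(/$pattern/, $line);

print scalar(@words), " words: @words\n";   # prints "4 words: one two three four"
```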

Because you can use a scalar variable in a pattern, there is nothing to stop you from reading the pattern from the standard input file. Listing 7.7 reads a search pattern from the standard input file and then searches for the pattern in the input files listed on the command line. If it finds the pattern, it prints the filename and line number of the match; at the end, it prints the total number of matches.

This example assumes that two files exist, file1 and file2. Each file contains the following:


This is a line of input.

This is another line of input.

If you run this program with command-line arguments file1 and file2 and search for the pattern another, you get the output shown.

Listing 7.7. A simple pattern-search program.


1:  #!/usr/local/bin/perl

2:  

3:  print ("Enter the search pattern:\n");

4:  $pattern = <STDIN>;

5:  chop ($pattern);

6:  $filename = $ARGV[0];

7:  $linenum = $matchcount = 0;

8:  print ("Matches found:\n");

9:  while ($line = <>) {

10:         $linenum += 1;

11:         if ($line =~ /$pattern/) {

12:                 print ("$filename, line $linenum\n");

13:                 @words = split(/$pattern/, $line);

14:                 $matchcount += @words - 1;

15:         }

16:         if (eof) {

17:                 $linenum = 0;

18:                 $filename = $ARGV[0];

19:         }

20:  }

21:  if ($matchcount == 0) {

22:          print ("No matches found.\n");

23:  } else {

24:          print ("Total number of matches: $matchcount\n");

25:  }


$ program7_7 file1 file2

Enter the search pattern:

another

Matches found:

file1, line 2

file2, line 2

Total number of matches: 2

$

This program uses the following scalar variables to keep track of information:

$pattern contains the search pattern read in from the standard input file.
$filename contains the file currently being searched.
$linenum contains the line number of the line currently being searched.
$matchcount contains the total number of matches found to this point.

Line 6 sets the current filename, which corresponds to the first element in the built-in array variable @ARGV. This array variable lists the arguments supplied on the command line. (To refresh your memory on how @ARGV works, refer back to Chapter 6, "Reading from and Writing to Files.") This current filename needs to be stored in a scalar variable, because the <> operator in line 9 shifts @ARGV and destroys this name.

Line 9 reads from each of the files on the command line in turn, one line at a time. The current input line is stored in the scalar variable $line. Once the line is read, line 10 adds 1 to the current line number.

Lines 11-15 handle the matching process. Line 11 checks whether the pattern stored in $pattern is contained in the input line stored in $line. If a match is found, line 12 prints out the current filename and line number. Line 13 then splits the line into "words," using the trick described in the earlier section, "Word-Boundary Pattern Anchors." Because the number of elements of the list stored in @words is one larger than the number of times the pattern is matched, the expression @words - 1 is equivalent to the number of matches; its value is added to $matchcount.

Line 16 checks whether the <> operator has reached the end of the current input file. If it has, line 17 resets the current line number to 0. This ensures that the next pass through the loop will set the current line number to 1 (to indicate that the program is on the first line of the next file). Line 18 sets the filename to the next file mentioned on the command line, which is currently stored in $ARGV[0].

Lines 21-25 either print the total number of matches or indicate that no matches were found.

NOTE

Make sure that you remember to include the enclosing / characters when you use a scalar-variable name in a pattern. The Perl interpreter does not complain when it sees the following, for example, but the result might not be what you want:

@words = split($pattern, $line);

Excluding Alternatives

As you have seen, when the special characters [] appear in a pattern, they specify a set of alternatives to choose from. For example, the pattern


/d[eE]f/

matches def or dEf.

When the ^ character appears as the first character after the [, it indicates that the pattern is to match any character except the ones displayed between the [ and ]. For example, the pattern


/d[^eE]f/

matches any three-character sequence that satisfies the following criteria:

The first character is d.
The second character is anything other than e or E.
The last character is f.

NOTE

To include a ^ character in a set of alternatives, precede it with a backslash, as follows:

/d[\^eE]f/

This pattern matches d^f, def, or dEf.
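The two forms can be compared with a brief sketch; the candidate strings are chosen for illustration.

```perl
use strict;
use warnings;

my @candidates = ("daf", "d^f", "def", "dEf");

# ^ first inside []: match anything except e or E.
my @negated = grep { /d[^eE]f/ } @candidates;

# \^ inside []: ^ is just another alternative.
my @escaped = grep { /d[\^eE]f/ } @candidates;

print "negated: @negated\n";   # prints "negated: daf d^f"
print "escaped: @escaped\n";   # prints "escaped: d^f def dEf"
```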

Character-Range Escape Sequences

In the section titled "Matching Any Letter or Number" earlier in this chapter, you learned that you can represent consecutive letters or numbers inside the [] special characters by specifying ranges. For example, in the pattern


/a[1-3]c/

the [1-3] matches any of 1, 2, or 3.

Some ranges occur frequently enough that Perl defines special escape sequences for them. For example, instead of writing


/[0-9]/

to indicate that any digit is to be matched, you can write


/\d/

The \d escape sequence means "any digit."

Table 7.2 lists the character-range escape sequences, what they match, and their equivalent character ranges.

Table 7.2. Character-range escape sequences.

Escape sequence	Description	Range
`\d`	Any digit	`[0-9]`
`\D`	Anything other than a digit	`[^0-9]`
`\w`	Any word character	`[_0-9a-zA-Z]`
`\W`	Anything not a word character	`[^_0-9a-zA-Z]`
`\s`	White space	`[ \r\t\n\f]`
`\S`	Anything other than white space	`[^ \r\t\n\f]`

These escape sequences can be used anywhere ordinary characters are used. For example, the following pattern matches any digit or lowercase letter:


/[\da-z]/

NOTE

The definition of word boundary as used by the \b and \B special characters corresponds to the definition of word character used by \w and \W.

If the pattern /\w\W/ matches a particular pair of characters, the first character is part of a word and the second is not; this means that the first character is the end of a word, and that a word boundary exists between the first and second characters matched by the pattern.

Similarly, if /\W\w/ matches a pair of characters, the first character is not part of a word and the second character is. This means that the second character is the beginning of a word. Again, a word boundary exists between the first and second characters matched by the pattern.
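The sketch below checks, for a handful of ASCII characters, that each escape sequence agrees with the range shown in Table 7.2, and then uses /\w\W/ to locate the end of a word. It assumes the built-in variable $& (the text matched by the last successful match). Newer versions of Perl treat these escapes somewhat more broadly, for example with Unicode text, so the comparison is only meant for simple input like this.

```perl
use strict;
use warnings;

# Compare each escape sequence with the range listed in Table 7.2.
my @chars = ("7", "x", "_", " ", "\t", "\n", "-");
my $agree = 1;
for my $c (@chars) {
    $agree = 0 if ($c =~ /\d/ xor $c =~ /[0-9]/);
    $agree = 0 if ($c =~ /\w/ xor $c =~ /[_0-9a-zA-Z]/);
    $agree = 0 if ($c =~ /\s/ xor $c =~ /[ \r\t\n\f]/);
}
print "escapes agree with ranges: ", ($agree ? "yes" : "no"), "\n";

# /\w\W/ matches a word character followed by a non-word character,
# which is where a word ends.
my $pair = "";
$pair = $& if "ab cd" =~ /\w\W/;
print "pair: '$pair'\n";   # prints "pair: 'b '"
```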

Matching Any Character

Another special character supported in patterns is the period (.) character, which matches any character except the newline character. For example, the following pattern matches d, followed by any non-newline character, followed by f:


/d.f/

The . character is often used in conjunction with the * character. For example, the following pattern matches any string that contains the character d preceding the character f:


/d.*f/

Normally, the .* special-character combination tries to match as much as possible. For example, if the string banana is searched using the following pattern, the pattern matches banana, not ba or bana:


/b.*a/

NOTE

There is one exception to the preceding rule: The .* character only matches the longest possible string that enables the pattern match as a whole to succeed.

For example, suppose the string Mississippi is searched using the pattern

/M.*i.*pi/

Here, the first .* in /M.*i.*pi/ matches the six characters that follow the M:

ississ

If it tried to go further and match

ississi

or even

ississipp

there would not be enough left for the rest of the pattern to match.

When the first .* match is limited to

ississ

the rest of the pattern, i.*pi, matches ippi, and the pattern as a whole succeeds.
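One way to confirm what each .* matched in the Mississippi example is to wrap each one in parentheses and inspect $1 and $2 afterward. Parentheses and these variables have not been introduced at this point in the chapter, so treat this as an illustrative sketch only.

```perl
use strict;
use warnings;

my ($first, $second) = ("", "");

if ("Mississippi" =~ /M(.*)i(.*)pi/) {
    # $1 and $2 hold the text matched inside the first and second
    # pairs of parentheses.
    ($first, $second) = ($1, $2);
}

print "first .* matched:  $first\n";    # prints "first .* matched:  ississ"
print "second .* matched: $second\n";   # prints "second .* matched: p"
```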

Matching a Specified Number of Occurrences

Several special characters in patterns that you have seen enable you to match a specified number of occurrences of a character. For example, + matches one or more occurrences of a character, and ? matches zero or one occurrence.

Perl enables you to define how many occurrences of a character constitute a match. To do this, use the special characters { and }.

For example, the pattern


/de{1,3}f/

matches d, followed by one, two, or three occurrences of e, followed by f. This means that def, deef, and deeef match, but df and deeeef do not.

To specify an exact number of occurrences, include only one value between the { and the }.


/de{3}f/

This specifies exactly three occurrences of e, which means this pattern only matches deeef.

To specify a minimum number of occurrences, leave off the upper bound.


/de{3,}f/

This matches d, followed by at least three es, followed by f.

Finally, to specify a maximum number of occurrences, use 0 as the lower bound.


/de{0,3}f/

This matches d, followed by no more than three es, followed by f.

NOTE

You can use { and } with character ranges or any other special character, as follows:

/[a-z]{1,3}/

This matches one, two, or three lowercase letters.

/.{3}/

This matches any three characters.
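The four forms of { and } described in this section can be compared with a short sketch; the candidate strings and array names are chosen for illustration.

```perl
use strict;
use warnings;

my @candidates = ("df", "def", "deef", "deeef", "deeeef");

my @one_to_three   = grep { /de{1,3}f/ } @candidates;   # 1 to 3 e's
my @exactly_three  = grep { /de{3}f/ }   @candidates;   # exactly 3 e's
my @at_least_three = grep { /de{3,}f/ }  @candidates;   # 3 or more e's
my @at_most_three  = grep { /de{0,3}f/ } @candidates;   # 0 to 3 e's

print "{1,3}: @one_to_three\n";     # prints "{1,3}: def deef deeef"
print "{3}:   @exactly_three\n";    # prints "{3}:   deeef"
print "{3,}:  @at_least_three\n";   # prints "{3,}:  deeef deeeef"
print "{0,3}: @at_most_three\n";    # prints "{0,3}: df def deef deeef"
```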

Specifying Choices

The special character | enables you to specify two or more alternatives to choose from when matching a pattern. For example, the pattern


/def|ghi/

matches either def or ghi. The pattern


/[a-z]+|[0-9]+/

matches one or more lowercase letters or one or more digits.

Listing 7.8 is a simple example of a program that uses the | special character. It reads a number and checks whether it is a legitimate Perl integer.

Listing 7.8. A simple integer-validation program.


1:  #!/usr/local/bin/perl

2:  

3:  print ("Enter a number:\n");

4:  $number = <STDIN>;

5:  chop ($number);

6:  if ($number =~ /^-?\d+$|^-?0[xX][\da-fA-F]+$/) {

7:          print ("$number is a legal integer.\n");

8:  } else {

9:          print ("$number is not a legal integer.\n");

10: }


$ program7_8

Enter a number:

0x3ff1

0x3ff1 is a legal integer.

$

Recall that Perl integers can be in any of three forms:

Standard base-10 notation, as in 123
Base-8 (octal) notation, indicated by a leading 0, as in 0123
Base-16 (hexadecimal) notation, indicated by a leading 0x or 0X, as in 0X1ff

Line 6 checks whether a number is a legal Perl integer. The first alternative in the pattern,


^-?\d+$

matches a string consisting of one or more digits, optionally preceded by a -. (The ^ and $ characters ensure that this is the only string that matches.) This takes care of integers in standard base-10 notation and integers in octal notation.

The second alternative in the pattern,


^-?0[xX][\da-fA-F]+$

matches integers in hexadecimal notation. Take a look at this pattern one piece at a time:

The ^ matches the beginning of the line. This ensures that lines containing leading spaces or extraneous characters are not treated as valid hexadecimal integers.
The -? matches a - if it is present. This ensures that negative numbers are matched.
The 0 matches the leading 0.
The [xX] matches the x or X that follows the leading 0.
The [\da-fA-F] matches any digit, any letter between a and f, or any letter between A and F. Recall that these are precisely the characters that are allowed to appear in hexadecimal numbers.
The + indicates that the pattern is to match one or more hexadecimal digits.
The closing $ indicates that the pattern is to match only if there are no extraneous characters following the hexadecimal integer.
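The two alternatives can be tried against several inputs at once. This is a sketch only: is_legal_integer is a hypothetical helper name that does not appear in Listing 7.8, the sample strings are invented, and the hexadecimal class is written as [\da-fA-F].

```perl
use strict;
use warnings;

# Hypothetical helper: returns 1 if the text is a legal integer, 0 otherwise.
sub is_legal_integer {
    my ($text) = @_;
    # First alternative: decimal or octal digits.
    # Second alternative: 0x or 0X followed by hexadecimal digits.
    return $text =~ /^-?\d+$|^-?0[xX][\da-fA-F]+$/ ? 1 : 0;
}

foreach my $candidate ("123", "-0123", "0x3ff1", "0X1FF", "12ab", "0x") {
    my $verdict = is_legal_integer($candidate) ? "legal" : "not legal";
    print "$candidate is $verdict\n";
}
```

The last two strings are rejected: 12ab has letters but no 0x prefix, and 0x has a prefix with no digits after it.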

Beware that the following pattern matches either x or one or more of y, not one or more of x or y:

/x|y+/

See the section called "Special-Character Precedence" later in this chapter for details on how to specify special-character precedence in patterns.

Reusing Portions of Patterns

Suppose that you want to write a pattern that matches the following:

One or more digits or lowercase letters
Followed by a colon or semicolon
Followed by another group of one or more digits or lowercase letters
Another colon or semicolon
Yet another group of one or more digits or lowercase letters

One way to indicate this pattern is as follows:


/[\da-z]+[:;][\da-z]+[:;][\da-z]+/

This pattern is somewhat complicated and is quite repetitive.

Perl provides an easier way to specify patterns that contain multiple repetitions of a particular sequence. When you enclose a portion of a pattern in parentheses, as in


([\da-z]+)

Perl stores the matched sequence in memory. To retrieve a sequence from memory, use the special character \n, where n is an integer representing the nth sequence stored in memory.

For example, a pattern that reuses the first group can be written as


/([\da-z]+)[:;]\1[:;]\1/

Here, the text matched by [\da-z]+ is stored in memory. When the Perl interpreter sees the escape sequence \1, it matches that same text again, not the subpattern [\da-z]+. This pattern therefore matches only when all three groups are identical, as in abc:abc;abc. It does not match abc:def;ghi, which the longer pattern accepts.

You also can store the sequence [:;] in memory, and write this pattern as follows:


/([\da-z]+)([:;])\1\2\1/

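Pattern sequences are stored in memory from left to right, so \1 represents the text matched by [\da-z]+ and \2 represents the text matched by [:;]. Because \2 repeats whatever the first separator was, this pattern requires the same separator character in both places.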

Pattern-sequence memory is often used when you want to match the same character in more than one place but don't care which character you match. For example, if you are looking for a date in dd-mm-yy format, you might want to match


/\d{2}([\W])\d{2}\1\d{2}/

This matches two digits, a non-word character, two more digits, the same non-word character, and two more digits. This means that the following strings all match:


12-05-92

26.11.87

07 04 92

However, the following string does not match:


21-05.91

This is because the pattern is looking for a - between the 05 and the 91, not a period.

Beware that the pattern

/\d{2}([\W])\d{2}\1\d{2}/

is not the same as the pattern

/(\d{2})([\W])\1\2\1/

In the first pattern, any digit can appear anywhere. The second pattern matches any two digits as the first two characters, but then only matches the same two digits again. This means that

17-17-17

matches, but the following does not:

17-05-91
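The difference between these two patterns can be seen by running both against the same strings. The following is a rough sketch; the helper names same_separator and same_digits are invented for this example, and the ^ and $ anchors are an added assumption so that the whole string is tested.

```perl
use strict;
use warnings;

# Only the separator has to repeat; the digits may differ.
sub same_separator {
    my ($text) = @_;
    return $text =~ /^\d{2}([\W])\d{2}\1\d{2}$/ ? 1 : 0;
}

# The first two digits and the separator both have to repeat.
sub same_digits {
    my ($text) = @_;
    return $text =~ /^(\d{2})([\W])\1\2\1$/ ? 1 : 0;
}

foreach my $date ("17-17-17", "17-05-91", "21-05.91") {
    printf "%-8s  separator repeats: %d  digits repeat: %d\n",
        $date, same_separator($date), same_digits($date);
}
```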

Pattern-Sequence Scalar Variables

Note that pattern-sequence memory is preserved only for the length of the pattern. This means that if you define the following pattern (which, incidentally, matches a simple decimal number without an exponent, such as 25.11):


/-?(\d+)\.?(\d+)/

you cannot then define another pattern, such as the following:


/\1/

and expect the Perl interpreter to remember that \1 refers to the first \d+ (the digits before the decimal point).

To get around this problem, Perl defines special built-in variables that remember the value of patterns matched in parentheses. These special variables are named $n, where n is the nth set of parentheses in the pattern.

For example, consider the following:


$string = "This string contains the number 25.11.";

$string =~ /-?(\d+)\.?(\d+)/;

$integerpart = $1;

$decimalpart = $2;

In this case, the pattern


/-?(\d+)\.?(\d+)/

matches 25.11, and the subpattern in the first set of parentheses matches 25. This means that 25 is stored in $1 and is later assigned to $integerpart. Similarly, the second set of parentheses matches 11, which is stored in $2 and later assigned to $decimalpart.

The values stored in $1, $2, and so on, are overwritten by the next successful pattern match. If you need these values, be sure to assign them to other scalar variables first.

There is also one other built-in scalar variable, $&, which contains the entire matched pattern, as follows:


$string = "This string contains the number 25.11.";

$string =~ /-?(\d+)\.?(\d+)/;

$number = $&;

Here, the pattern matched is 25.11, which is stored in $& and then assigned to $number.
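Because the built-in variables are overwritten by later matches, it is usually safest to copy them immediately. The sketch below shows one way to do this; the second match against "abc" and the variable $later are invented purely to show that the saved copies survive.

```perl
use strict;
use warnings;

my $string = "This string contains the number 25.11.";
my ($integerpart, $decimalpart, $number) = ("", "", "");

# Copy $1, $2, and $& only if the match actually succeeded.
if ($string =~ /-?(\d+)\.?(\d+)/) {
    ($integerpart, $decimalpart, $number) = ($1, $2, $&);
}

# A later successful match replaces the contents of $1 ...
my $later = "";
if ("abc" =~ /(b)/) {
    $later = $1;
}

# ... but the copies made earlier are unaffected.
print "integer part: $integerpart\n";
print "decimal part: $decimalpart\n";
print "whole match:  $number\n";
print "later \$1:     $later\n";
```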

Special-Character Precedence

Perl defines rules of precedence to determine the order in which special characters in patterns are interpreted. For example, the pattern


/x|y+/

matches either x or one or more occurrences of y, because + has higher precedence than | and is therefore interpreted first.

Table 7.3 lists the special characters that can appear in patterns in order of precedence (highest to lowest). Special characters with higher precedence are always interpreted before those of lower precedence.

Table 7.3. The precedence of pattern-matching special characters.

Special character	Description
`()`	Pattern memory
`+ * ? {}`	Number of occurrences
`^ $ \b \B`	Pattern anchors
`|`	Alternatives

Because the pattern-memory special characters () have the highest precedence, you can use them to force other special characters to be evaluated first. For example, the pattern


(ab|cd)+

matches one or more occurrences of either ab or cd. This matches, for example, abcdab.

Remember that when you use parentheses to force the order of precedence, you also are storing into pattern memory. For example, in the sequence

/(ab|cd)+(.)(ef|gh)+\1/

the \1 refers to what ab|cd matched, not to what the . special character matched.
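The effect of precedence is easy to check by matching the same string with and without parentheses. This is a small sketch; the test string yyxx and the variable names are made up for illustration.

```perl
use strict;
use warnings;

my $text = "yyxx";
my ($without_parens, $with_parens) = ("", "");

# + binds to y only, so this is "x, or one or more y".
$without_parens = $& if $text =~ /x|y+/;

# The parentheses make + apply to the whole choice between x and y.
$with_parens = $& if $text =~ /(x|y)+/;

print "x|y+   matched: $without_parens\n";
print "(x|y)+ matched: $with_parens\n";
```

The first pattern stops after the run of y characters, while the second continues through the x characters as well.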

Now that you know all of the special-pattern characters and their precedence, look at a program that does more complex pattern matching. Listing 7.9 uses the various special-pattern characters, including the parentheses, to check whether a given input string is a valid twentieth-century date.

Listing 7.9. A date-validation program.


1:  #!/usr/local/bin/perl

2:  

3:  print ("Enter a date in the format YYYY-MM-DD:\n");

4:  $date = <STDIN>;

5:  chop ($date);

6:  

7:  # Because this pattern is complicated, we split it

8:  # into parts, assign the parts to scalar variables,

9:  # then substitute them in later.

10: 

11: # handle 31-day months

12: $md1 = "(0[13578]|1[02])\\2(0[1-9]|[12]\\d|3[01])";

13: # handle 30-day months

14: $md2 = "(0[469]|11)\\2(0[1-9]|[12]\\d|30)";

15: # handle February, without worrying about whether it's

16: # supposed to be a leap year or not

17: $md3 = "02\\2(0[1-9]|[12]\\d)";

18: 

19: # check for a twentieth-century date

20: $match = $date =~ /^(19)?\d\d(.)($md1|$md2|$md3)$/;

21: # check for a valid but non-20th century date

22: $olddate = $date =~ /^(\d{1,4})(.)($md1|$md2|$md3)$/;

23: if ($match) {

24:         print ("$date is a valid date\n");

25: } elsif ($olddate) {

26:         print ("$date is not in the 20th century\n");

27: } else {

28:         print ("$date is not a valid date\n");

29: }


$ program7_9

Enter a date in the format YYYY-MM-DD:

1991-04-31

1991-04-31 is not a valid date

$

Don't worry: this program is a lot less complicated than it looks! Basically, this program does the following:

It checks whether the date is in the format YYYY-MM-DD. (It allows YY-MM-DD, and also enables you to use a character other than a hyphen to separate the year, month, and day.)
It checks whether the year is in the twentieth century or not.
It checks whether the month is between 01 and 12.
Finally, it checks whether the day field is legal for that month. Legal day fields are between 01 and either 29, 30, or 31, depending on the number of days in that month.

If the date is legal, the program tells you so. If the date is not a twentieth-century date but is legal, the program informs you of this also.

Because the pattern to be matched is too long to fit on one line, this program breaks it into pieces and assigns the pieces to scalar variables. This is possible because scalar-variable substitution is supported in patterns.

Line 12 is the pattern to match for months with 31 days. Note that the escape sequences (such as \d) are preceded by another backslash (producing \\d). This is because the program actually wants to store a backslash in the scalar variable. (Recall that a backslash in a double-quoted string introduces an escape sequence.) The pattern


(0[13578]|1[02])\2(0[1-9]|[12]\d|3[01])

which is assigned to $md1, consists of the following components:

The sequence (0[13578]|1[02]), which matches the month values 01, 03, 05, 07, 08, 10, and 12 (the 31-day months)
\2, which matches the character that separates the day, month, and year
The sequence (0[1-9]|[12]\d|3[01]), which matches any two-digit number between 01 and 31

Note that \2 matches the separator character because the separator character will eventually be the second pattern sequence stored in memory (when the pattern is finally assembled).

Line 14 is similar to line 12 and handles 30-day months. The only differences between this subpattern and the one in line 12 are as follows:

The month values accepted are 04, 06, 09, and 11.
The valid date fields are 01 through 30, not 01 through 31.

Line 17 is another similar pattern that checks whether the month is 02 (February) and the date field is between 01 and 29.

Line 20 does the actual pattern match that checks whether the date is a valid twentieth-century date. This pattern is divided into three parts.

^(19)?\d\d, which matches any two-digit number at the beginning of a line, or any four-digit number starting with 19
The separator character, which is the second item in parentheses (and therefore the second item stored in memory) and thus can be retrieved using \2
($md1|$md2|$md3)$, which matches any of the valid month-day combinations defined in lines 12, 14, and 17, provided it appears at the end of the line

The result of the pattern match, either true or false, is stored in the scalar variable $match.

Line 22 checks whether the date is a valid date in any century. The only difference between this pattern and the one in line 20 is that the year can be any one-to-four-digit number. The result of the pattern match is stored in $olddate.

Lines 23-29 check whether either $match or $olddate is true and print the appropriate message.

As you can see, the pattern-matching facility in Perl is quite powerful. This program is less than 30 lines long, including comments; the equivalent program in almost any other programming language would be substantially longer and much more difficult to write.
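One variation on Listing 7.9 is to build the pieces with single-quoted strings, in which \2 and \d are stored as written and need no doubled backslash. The following is a sketch under that assumption, not the listing itself; the helper name is_20th_century_date and the sample dates are invented for illustration.

```perl
use strict;
use warnings;

# Single-quoted strings keep \2 and \d exactly as written.
my $md1 = '(0[13578]|1[02])\2(0[1-9]|[12]\d|3[01])';   # 31-day months
my $md2 = '(0[469]|11)\2(0[1-9]|[12]\d|30)';           # 30-day months
my $md3 = '02\2(0[1-9]|[12]\d)';                       # February

# Hypothetical helper wrapping the pattern from line 20 of the listing.
sub is_20th_century_date {
    my ($date) = @_;
    return $date =~ /^(19)?\d\d(.)($md1|$md2|$md3)$/ ? 1 : 0;
}

foreach my $date ("1991-04-30", "1991-04-31", "91/12/31", "1991-02-30") {
    my $verdict = is_20th_century_date($date) ? "valid" : "not valid";
    print "$date is $verdict\n";
}
```

As in the listing, \2 refers to the separator only because the separator is the second parenthesized group once the pieces are assembled.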

Specifying a Different Pattern Delimiter

So far, all the patterns you have seen have been enclosed by / characters.


/de*f/

These / characters are known as pattern delimiters.

Because / is the pattern-delimiter character, you must use \/ to include a / character in a pattern. This can become awkward if you are searching for a directory such as, for example, /u/jqpublic/perl/prog1.


/\/u\/jqpublic\/perl\/prog1/

To make it easier to write patterns that include / characters, Perl enables you to use any pattern-delimiter character you like. The following pattern also matches the directory /u/jqpublic/perl/prog1:


m!/u/jqpublic/perl/prog1!

Here, the m indicates the pattern-matching operation. If you are using a pattern delimiter other than /, you must include the m.

There are two things you should watch out for when you use other pattern delimiters.

First, if you use the ' character as a pattern delimiter, the Perl interpreter does not substitute for scalar-variable names.

m'$var'

Here the pattern is taken exactly as written: the Perl interpreter does not replace $var with the current value of the scalar variable $var.

Second, if you use a pattern delimiter that is normally a special-pattern character, you will not be able to use that special character in your pattern. For example, if you want to match the pattern ab?c (which matches a, optionally followed by b, followed by c) you cannot use the ? character as a pattern delimiter. The pattern

m?ab?c?

produces a syntax error, because the Perl interpreter assumes that the ? after the b is a pattern delimiter. You can still use

m?ab\?c?

but this pattern won't match what you want. Because the ? inside the pattern is escaped, the Perl interpreter assumes that you want to match the actual ? character, and the pattern matches the sequence ab?c.
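To see that a different delimiter changes only how the pattern is written, not what it matches, you can run both forms against the same string. This is a brief sketch with made-up variable names.

```perl
use strict;
use warnings;

my $path = "/u/jqpublic/perl/prog1";

# The same pattern written with escaped slashes and with ! delimiters.
my $with_slashes = ($path =~ /\/u\/jqpublic\/perl\/prog1/) ? 1 : 0;
my $with_bangs   = ($path =~ m!/u/jqpublic/perl/prog1!)    ? 1 : 0;

# With ! as the delimiter, ? keeps its usual meaning inside the pattern.
my $optional_b = ("ac" =~ m!ab?c!) ? 1 : 0;

print "slash delimiters: $with_slashes\n";
print "bang delimiters:  $with_bangs\n";
print "ab?c against ac:  $optional_b\n";
```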

Pattern-Matching Options

When you specify a pattern, you also can supply options that control how the pattern is to be matched. Table 7.4 lists these pattern-matching options.

Table 7.4. Pattern-matching options.

Option	Description
`g`	Match all possible patterns
`i`	Ignore case
`m`	Treat string as multiple lines
`o`	Only evaluate once
`s`	Treat string as single line
`x`	Ignore white space in pattern

All pattern options are included immediately after the pattern. For example, the following pattern uses the i option to ignore case:


/ab*c/i

You can specify as many of the options as you like, and the options can be in any order.

Matching All Possible Patterns

The g option tells the Perl interpreter to match all the possible patterns in a string. For example, if you search the string balata using the pattern


/.a/g

which matches any character followed by a, the pattern matches ba, la, and ta.

If a pattern with the g option specified appears as an assignment to an array variable, the array variable is assigned a list consisting of all the patterns matched. For example,


@matches = "balata" =~ /.a/g;

assigns the following list to @matches:


("ba", "la", "ta")

Now, consider the following statement:


$match = "balata" =~ /.a/g;

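Because the assignment is to a scalar variable, the match returns only true or false; the text that was matched is available in the built-in variable $&. The first time this statement is executed, the pattern matches ba and $match is assigned a nonzero value. When the statement is executed again (for example, in a loop), the search resumes where the previous match ended, so the next match is la, and so on until the pattern runs out of matches.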

This means that you can use patterns with the g option in loops. Listing 7.10 shows how this works.

Listing 7.10. A program that loops using a pattern.


1:  #!/usr/local/bin/perl

2:  

3:  while ("balata" =~ /.a/g) {

4:          $match = $&;

5:          print ("$match\n");

6:  }


$ program7_10

ba

la

ta

$

The first time through the loop, $match has the value of the first pattern matched, which is ba. (The system variable $& always contains the last pattern matched; this pattern is assigned to $match in line 4.) When the loop is executed for a second time, $match has the value la. The third time through, $match has the value ta. After this, the loop terminates; because the pattern doesn't match anything else, the conditional expression is now false.
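The two ways of using the g option can be placed next to each other. The sketch below is illustrative only: the variable names are invented, and it stores the string in a scalar variable rather than using a string constant.

```perl
use strict;
use warnings;

my $word = "balata";

# Assigned to an array: every match is returned at once.
my @matches = $word =~ /.a/g;

# Used as a loop condition: each evaluation finds the next match and
# returns true or false; the matched text is read from $&.
my @seen;
while ($word =~ /.a/g) {
    push @seen, $&;
}

# Assigned to a scalar: the result is a true value, not the matched text.
my $result = $word =~ /.a/g;
my $first  = $&;

print "array assignment: @matches\n";
print "loop:             @seen\n";
print "scalar result:    $result (matched text: $first)\n";
```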

Determining the Match Location

If you need to know how much of a string has been searched by the pattern matcher when the g option is specified, use the pos function.


$offset = pos($string);

This returns the position at which the next pattern match will be started.

You can reposition the pattern matcher by putting pos() on the left side of an assignment.


pos($string) = $newoffset;

This tells the Perl interpreter to start the next pattern match at the position specified by $newoffset.

If you change the string being searched, the match position is reset to the beginning of the string.

NOTE

The pos function is not available in Perl version 4.
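The following sketch (for Perl 5) records the value of pos after each match and then repositions the pattern matcher by assigning to pos. The variable names are made up for illustration.

```perl
use strict;
use warnings;

my $string = "balata";
my @offsets;

# Record where the next search would begin after each match.
while ($string =~ /.a/g) {
    push @offsets, pos($string);
}

# Start the next search at offset 2, just past the first match.
pos($string) = 2;
my $next = ($string =~ /.a/g) ? $& : "";

print "offsets after each match: @offsets\n";
print "match found from offset 2: $next\n";
```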

Ignoring Case

The i option enables you to specify that a matched letter can either be uppercase or lowercase. For example, the following pattern matches de, dE, De, or DE:


/de/i

Patterns that match either uppercase or lowercase letters are said to be case-insensitive.

Treating the String as Multiple Lines

The m option tells the Perl interpreter that the string to be matched contains multiple lines of text. When the m option is specified, the ^ special character matches either the start of the string or the start of any new line. For example, the pattern


/^The/m

matches the word The in


This pattern matches\nThe first word on the second line

The m option also specifies that the $ special character is to match the end of any line. This means that the pattern


/line.$/m

is matched in the following string:


This is the end of the first line.\nHere's another line.

NOTE

The m option is defined only in Perl 5. To treat a string as multiple lines when you run Perl 4, set the $* system variable, described in Chapter 17, "System Variables."
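A quick way to see what the m option changes is to run the same patterns with and without it. This is a minimal sketch for Perl 5 that reuses the two example strings from this section; the variable names are invented.

```perl
use strict;
use warnings;

my $text = "This pattern matches\nThe first word on the second line";

# Without m, ^ matches only at the start of the string.
my $without_m = ($text =~ /^The/)  ? 1 : 0;
# With m, ^ also matches at the start of the second line.
my $with_m    = ($text =~ /^The/m) ? 1 : 0;

my $lines = "This is the end of the first line.\nHere's another line.";

# With m, $ matches at the end of each line, so both lines qualify.
my @with_m_ends    = $lines =~ /line.$/mg;
my @without_m_ends = $lines =~ /line.$/g;

print "^The without m: $without_m, with m: $with_m\n";
print "line endings found with m: ", scalar(@with_m_ends), "\n";
print "line endings found without m: ", scalar(@without_m_ends), "\n";
```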

Evaluating a Pattern Only Once

The o option enables you to tell the Perl interpreter that a pattern is to be evaluated only once. For example, consider the following:


$var = 1;

$line = <STDIN>;

while ($var < 10) {

        $result = $line =~ /$var/o;

        $line = <STDIN>;

        $var++;

}

The first time the Perl interpreter sees the pattern /$var/, it replaces the name $var with the current value of $var, which is 1; this means that the pattern to be matched is /1/.

Because the o option is specified, the pattern to be matched remains /1/ even when the value of $var changes. If the o option had not been specified, the pattern would have been /2/ the next time through the loop.

TIP

There's no real reason to use the o option for patterns unless you are keen on efficiency. Here's an easier way to do the same thing:

$var = <STDIN>;

chop ($var);  # remove the trailing newline so it is not part of the pattern

$matchval = $var;

$line = <STDIN>;

while ($var < 10) {

        $result = $line =~ /$matchval/;

        $line = <STDIN>;

        $var++;

}

The value of $matchval never changes, so the o option is not necessary.

Treating the String as a Single Line

The s option specifies that the string to be matched is to be treated as a single line of text. In this case, the . special character matches every character in a string, including the newline character. For example, the pattern /a.*bc/s is matched successfully in the following string:


axxxxx \nxxxxbc

If the s option is not specified, this pattern does not match, because the . character does not match the newline.

NOTE

The s option is defined only in Perl 5.
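The effect of the s option on the example string can be confirmed with a short Perl 5 sketch such as the following, in which the variable names are made up.

```perl
use strict;
use warnings;

my $text = "axxxxx \nxxxxbc";

# Without s, . stops at the newline, so the pattern cannot reach bc.
my $without_s = ($text =~ /a.*bc/)  ? 1 : 0;
# With s, . also matches the newline.
my $with_s    = ($text =~ /a.*bc/s) ? 1 : 0;

print "without s: $without_s\n";
print "with s:    $with_s\n";
```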

Using White Space in Patterns

One problem with patterns in Perl is that they can become difficult to follow. For example, consider this pattern, which you saw earlier:


/\d{2}([\W])\d{2}\1\d{2}/

Patterns such as this are difficult to follow, because there are a lot of backslashes, braces, and brackets to sort out.

Perl 5 makes life a little easier by supplying the x option. This tells the Perl interpreter to ignore white space in a pattern unless it is preceded by a backslash. This means that the preceding pattern can be rewritten as the following, which is much easier to follow:


/\d{2} ([\W]) \d{2} \1 \d{2}/x

Here is an example of a pattern containing an actual blank space:


/[A-Z] [a-z]+ \ [A-Z] [a-z]+ /x

This matches a name in the standard first-name/last-name format (such as John Smith). Normally, you won't want to use the x option if you're actually trying to match white space, because you wind up with the backslash problem all over again.

NOTE

The x option is defined only in Perl 5.
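The sketch below, written for Perl 5, checks that the spaced-out form behaves like the compact one, and shows what happens when the space in the name pattern is not escaped. The test strings and variable names are assumptions made for this example.

```perl
use strict;
use warnings;

# The compact and spaced-out date patterns accept the same string.
my $compact = ("26.11.87" =~ /\d{2}([\W])\d{2}\1\d{2}/)      ? 1 : 0;
my $spaced  = ("26.11.87" =~ /\d{2} ([\W]) \d{2} \1 \d{2}/x) ? 1 : 0;

# The escaped space is part of the pattern, so "John Smith" matches.
my $escaped   = ("John Smith" =~ /[A-Z] [a-z]+ \ [A-Z] [a-z]+/x) ? 1 : 0;
# Without the backslash every space is ignored, so the pattern expects
# the two names to be run together.
my $unescaped = ("John Smith" =~ /[A-Z] [a-z]+ [A-Z] [a-z]+/x)   ? 1 : 0;

print "compact: $compact, spaced: $spaced\n";
print "escaped space: $escaped, unescaped space: $unescaped\n";
```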

The Substitution Operator

Perl enables you to replace part of a string using the substitution operator, which has the following syntax:


s/pattern/replacement/

The Perl interpreter searches for the pattern specified by the placeholder pattern. If it finds pattern, it replaces it with the string represented by the placeholder replacement. For example:


$string = "abc123def";

$string =~ s/123/456/;

Here, 123 is replaced by 456, which means that the value stored in $string is now abc456def.

You can use any of the pattern special characters in the substitution operator. For example,


s/[abc]+/0/

searches for a sequence consisting of one or more occurrences of the letters a, b, and c (in any order) and replaces the sequence with 0.

If you just want to delete a sequence of characters rather than replace it, leave out the replacement string as in the following example, which deletes the first occurrence of the pattern abc:


s/abc//

Using Pattern-Sequence Variables in Substitutions

You can use pattern-sequence variables to include a matched pattern in the replacement string. The following is an example:


s/(\d+)/[$1]/

This matches a sequence of one or more digits. Because this sequence is enclosed in parentheses, it is stored in the scalar variable $1. In the replacement string, [$1], the scalar variable name $1 is replaced by its value, which is the matched pattern.

NOTE

Because the replacement string in the substitution operator is a string, not a pattern, the pattern special characters, such as [], *, and +, do not have a special meaning. For example, in the substitution

s/abc/[def]/

the replacement string is [def] (including the square brackets).
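The substitutions described so far can be tried on small strings. This is a short sketch; the strings and variable names are invented for illustration.

```perl
use strict;
use warnings;

# Wrap the first run of digits in square brackets.
my $bracketed = "abc123def";
$bracketed =~ s/(\d+)/[$1]/;

# An empty replacement deletes the first occurrence only.
my $deleted = "xxabcyyabc";
$deleted =~ s/abc//;

# Square brackets in the replacement are ordinary characters.
my $literal = "abc";
$literal =~ s/abc/[def]/;

print "$bracketed\n";
print "$deleted\n";
print "$literal\n";
```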

Options for the Substitution Operator

The substitution operator supports several options, which are listed in Table 7.5.

Table 7.5. Options for the substitution operator.

Option	Description
`g`	Change all occurrences of the pattern
`i`	Ignore case in pattern
`e`	Evaluate replacement string as expression
`m`	Treat string to be matched as multiple lines
`o`	Evaluate only once
`s`	Treat string to be matched as single line
`x`	Ignore white space in pattern

As with pattern matching, options are appended to the end of the operator. For example, to change all occurrences of abc to def, use the following:


s/abc/def/g

Global Substitution

The g option changes all occurrences of a pattern in a particular string. For example, the following substitution puts parentheses around any number in the string:


s/(\d+)/($1)/g

Listing 7.11 is an example of a program that uses global substitution. It examines each line of its input, removes all extraneous leading spaces and tabs, and replaces multiple spaces and tabs between words with a single space.

Listing 7.11. A simple white space cleanup program.


1:  #!/usr/local/bin/perl

2:  

3:  @input = <STDIN>;

4:  $count = 0;

5:  while ($input[$count] ne "") {

6:          $input[$count] =~ s/^[ \t]+//;

7:          $input[$count] =~ s/[ \t]+\n$/\n/;

8:          $input[$count] =~ s/[ \t]+/ /g;

9:          $count++;

10: }

11: print ("Formatted text:\n");

12: print (@input);


$ program7_11

This is   a  line   of    input.

  Here   is another line.  

This     is my  last line of   input.

^D

Formatted text:

This is a line of input.

Here is another line.

This is my last line of input.

$

This program performs three substitutions on each line of its input. The first substitution, in line 6, checks whether there are any spaces or tabs at the beginning of the line. If any exist, they are removed.

Similarly, line 7 checks whether there are any spaces or tabs at the end of the line (before the trailing newline character). If any exist, they are removed. To do this, line 7 replaces the following pattern (one or more spaces and tabs, followed by a newline character, followed by the end of the line) with a newline character:


/[ \t]+\n$/

Line 8 uses a global substitution to remove extra spaces and tabs between words. The following pattern matches one or more spaces or tabs, in any order; these spaces and tabs are replaced by a single space:


/[ \t]+/
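The three substitutions from Listing 7.11 can also be gathered into a subroutine that cleans up one line at a time. The following is a sketch only; tidy_line is a hypothetical name, and the sample line is taken from the listing's input.

```perl
use strict;
use warnings;

# Hypothetical helper applying the three substitutions to one line.
sub tidy_line {
    my ($line) = @_;
    $line =~ s/^[ \t]+//;         # remove leading spaces and tabs
    $line =~ s/[ \t]+\n$/\n/;     # remove trailing spaces and tabs
    $line =~ s/[ \t]+/ /g;        # squeeze the runs that remain
    return $line;
}

print tidy_line("  Here   is another line.  \n");
```

The order matters: if the global substitution ran first, the leading and trailing runs would each be reduced to a single space instead of being removed.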

Ignoring Case

The i option ignores case when substituting. For example, the following substitution replaces all occurrences of the words no, No, NO, and nO with NO. (Recall that the \b escape character specifies a word boundary.)


s/\bno\b/NO/gi

Replacement Using an Expression

The e option treats the replacement string as an expression, which it evaluates before replacing. For example, consider the following:


$string = "0abc1";

$string =~ s/[a-zA-Z]+/$& x 2/e;

The substitution shown here is a quick way to duplicate part of a string. Here's how it works:

The pattern /[a-zA-Z]+/ matches abc, which is stored in the built-in variable $&.
The e option indicates that the replacement string, $& x 2, is to be treated as an expression. This expression is evaluated, producing the result abcabc.
abcabc is substituted for abc in the string stored in $string. This means that the new value of $string is 0abcabc1.
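The steps above can be reproduced with a few lines of code. This sketch also adds the g option to apply an expression to every number in a line; the sample sentence and the variable name $line are made up for illustration.

```perl
use strict;
use warnings;

# Duplicate the first run of letters.
my $string = "0abc1";
$string =~ s/[a-zA-Z]+/$& x 2/e;

# With e and g together, every run of digits is doubled.
my $line = "This contains the number 26 and the number 1.";
$line =~ s/\d+/$& * 2/eg;

print "$string\n";
print "$line\n";
```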

Listing 7.12 is another example that uses the e option in a substitution. This program takes every integer in a list of input files and multiplies it by 2, leaving the rest of the contents unchanged. (For the sake of simplicity, the program assumes that there are no floating-point numbers in the file.)

Listing 7.12. A program that multiplies every integer in a file by 2.


1:  #!/usr/local/bin/perl

2:  

3:  $count = 0;

4:  while ($ARGV[$count] ne "") {

5:          open (FILE, "$ARGV[$count]");

6:          @file = <FILE>;

7:          $linenum = 0;

8:          while ($file[$linenum] ne "") {

9:                  $file[$linenum] =~ s/\d+/$& * 2/eg;

10:                 $linenum++;

11:         }

12:         close (FILE);

13:         open (FILE, ">$ARGV[$count]");

14:         print FILE (@file);

15:         close (FILE);

16:         $count++;

17: }

If a file named foo contains the text


This contains the number 1.

This contains the number 26.

and the name foo is passed as a command-line argument to this program, the file foo becomes


This contains the number 2.

This contains the number 52.

This program uses the built-in variable @ARGV to retrieve filenames from the command line. Note that the program cannot use <>, because the following statement reads the entire contents of all the files into a single array:


@file = <>;

Line 6 reads the file into @file, and lines 8-11 process it one line at a time. Line 9 performs the actual substitution as follows:

The pattern \d+ matches a sequence of one or more digits, which is automatically assigned to $&.
The value of $& is substituted into the replacement string.
The e option indicates that this replacement string is to be treated as an expression. This expression multiplies the matched integer by 2.
The result of the multiplication is then substituted into the file in place of the original integer.
The g option indicates that every integer on the line is to be substituted for.

After all the lines in the file have been read, the file is closed and reopened for writing. The call to print in line 14 takes the list stored in @file (the contents of the current file) and writes it back out to the file, overwriting the original contents.

Evaluating a Pattern Only Once

As with the match operator, the o option to the substitution operator tells the Perl interpreter to replace a scalar variable name in the pattern with its value only once. The o option applies to the pattern only; a scalar variable in the replacement string is replaced with its current value each time the substitution is performed. For example, the following statement substitutes the current value of $var for its name, producing the pattern to search for:


$string =~ s/$var/abc/o;

This pattern then never changes, even if the value of $var changes. For example:


$var = 17;

while ($var > 0) {

        $string = <STDIN>;

        $string =~ s/$var/abc/o;

        print ($string);

        $var--;  # the pattern is still /17/

}

Again, as with the match operator, there is no real reason to use the o option.

Treating the String as Single or Multiple Lines

As in the pattern-matching operator, the s and m options specify that the string to be matched is to be treated as a single line or as multiple lines, respectively.

The s option ensures that the newline character \n is matched by the . special character.


$string = "This is a\ntwo-line string.";

$string =~ s/a.*o/one/s;

# $string now contains "This is one-line string."

If the m option is specified, ^ and $ match the beginning and end of any line.


$string = "The The first line\nThe The second line";

$string =~ s/^The//gm;

# $string now contains " The first line\n The second line"

# (the space that followed each deleted The remains)

$string =~ s/e$/k/gm;

# $string now contains " The first link\n The second link"

The \A and \Z escape sequences (defined in Perl 5) always match only the beginning and end of the string, respectively. (This is the only case where \A and \Z behave differently from ^ and $.)

NOTE

The m and s options are defined only in Perl 5. To treat a string as multiple lines when you run Perl 4, set the $* system variable, described in Chapter 17.

Using White Space in Patterns

The x option tells the Perl interpreter to ignore all white space unless preceded by a backslash. As with the pattern-matching operator, ignoring white space makes complicated string patterns easier to read.


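$string =~ s/(\d{2}) ([\W]) (\d{2}) \2 (\d{2})/$1-$3-$4/x;

This converts a day-month-year string such as 14/04/95 to the dd-mm-yy format. Each two-digit field is enclosed in its own parentheses, so $1, $3, and $4 hold the day, month, and year, and $2 holds the separator.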

NOTE

Even if the x option is specified, spaces in the replacement string are not ignored. For example, the following replaces 14/04/95 with 14 - 04 - 95, not 14-04-95:

$string =~ s/(\d{2}) ([\W]) (\d{2}) \2 (\d{2})/$1 - $3 - $4/x;

Also note that the x option is defined only in Perl 5.
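The following Perl 5 sketch shows the conversion with each two-digit field in its own set of parentheses, so that $1, $3, and $4 hold the three fields and \2 repeats the separator. The variable names and the sample date are assumptions made for this example.

```perl
use strict;
use warnings;

# White space in the pattern is ignored because of the x option.
my $date = "14/04/95";
$date =~ s/(\d{2}) ([\W]) (\d{2}) \2 (\d{2})/$1-$3-$4/x;

# White space in the replacement string is kept.
my $spaced = "14/04/95";
$spaced =~ s/(\d{2}) ([\W]) (\d{2}) \2 (\d{2})/$1 - $3 - $4/x;

print "$date\n";
print "$spaced\n";
```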

Specifying a Different Delimiter

You can specify a different delimiter to separate the pattern and replacement string in the substitution operator. For example, the following substitution operator replaces /u/bin with /usr/local/bin:


s#/u/bin#/usr/local/bin#

The search and replacement strings can be enclosed in parentheses or angle brackets.


s(/u/bin)(/usr/local/bin)

s</u/bin>/\/usr\/local\/bin/

NOTE

As with the match operator, you cannot use a special character both as a delimiter and in a pattern.

s.a.c.def.

This substitution will be flagged as containing an error because the . character is being used as the delimiter. The substitution

s.a\.c.def.

does work, but it substitutes def for a.c, where . is an actual period and not the pattern special character.
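The three delimiter styles shown in this section produce the same result, which the following sketch checks. The starting string /u/bin/perl and the variable names are invented for illustration.

```perl
use strict;
use warnings;

# Three copies of the same starting string.
my ($one, $two, $three) = ("/u/bin/perl") x 3;

$one   =~ s#/u/bin#/usr/local/bin#;         # a single delimiter character
$two   =~ s(/u/bin)(/usr/local/bin);        # a pair of parentheses for each part
$three =~ s</u/bin>/\/usr\/local\/bin/;     # angle brackets, then slashes

print "$one\n$two\n$three\n";
```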

The Translation Operator

Perl also provides another way to substitute one group of characters for another: the tr translation operator. This operator uses the following syntax:


tr/string1/string2/

Here, string1 contains a list of characters to be replaced, and string2 contains the characters that replace them. The first character in string1 is replaced by the first character in string2, the second character in string1 is replaced by the second character in string2, and so on.

Here is a simple example:


$string = "abcdefghicba";

$string =~ tr/abc/def/;

Here, the characters a, b, and c are to be replaced as follows:

All occurrences of the character a are to be replaced by the character d.
All occurrences of the character b are to be replaced by the character e.
All occurrences of the character c are to be replaced by the character f.

After the translation, the scalar variable $string contains the value defdefghifed.

NOTE

If the string listing the characters to be replaced is longer than the string containing the replacement characters, the last character of the replacement string is repeated. For example:

$string = "abcdefgh";

$string =~ tr/efgh/abc/;

Here, there is no character corresponding to h in the replacement list, so c, the last character in the replacement list, replaces h. This translation sets the value of $string to abcdabcc.

Also note that if the same character appears more than once in the list of characters to be replaced, the first replacement is used:


$string =~ tr/AAA/XYZ/;   # replaces A with X
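Both rules from this note can be tried out in a small sketch. The sample strings and the names $padded and $dup are assumptions chosen for illustration.

```perl
use strict;
use warnings;

# Shorter replacement list: e->a, f->b, g->c, and h->c because the
# last replacement character is repeated.
my $padded = "abcdefgh";
$padded =~ tr/efgh/abc/;

# Repeated search character: only the first pairing, A->X, is used.
my $dup = "BANANA";
$dup =~ tr/AAA/XYZ/;

print "$padded\n";   # abcdabcc
print "$dup\n";      # BXNXNX
```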

The most common use of the translation operator is to convert alphabetic characters from uppercase to lowercase or vice versa. Listing 7.13 provides an example of a program that converts a file to all lowercase characters.

Listing 7.13. An uppercase-to-lowercase conversion program.


1:  #!/usr/local/bin/perl

2:  

3:  while ($line = <STDIN>) {

4:          $line =~ tr/A-Z/a-z/;

5:          print ($line);

6:  }


$ program7_13

THIS LINE IS IN UPPER CASE.

this line is in upper case.

ThiS LiNE Is iN mIxED cASe.

this line is in mixed case.

^D

$

This program reads a line at a time from the standard input file, terminating at end of file (which you signal from the keyboard by typing Ctrl+D).

Line 4 performs the translation operation. As in the other pattern-matching operations, the range character (-) indicates a range of characters to be included. Here, the range a-z refers to all the lowercase characters, and the range A-Z refers to all the uppercase characters.

NOTE

There are two things you should note about the translation operator:

The pattern special characters are not supported by the translation operator. For example, \d does not mean "any digit" in a translation; use the range 0-9 instead.

You can use y in place of tr if you want.

$string =~ y/a-z/A-Z/;

Options for the Translation Operator

The translation operator supports three options, which are listed in Table 7.6.

The c option (c is for "complement") translates all characters that are not specified. For example, the statement


$string =~ tr/0-9/ /c;

replaces everything that is not a digit with a space.

Table 7.6. Options for the translation operator.

Option	Description
`c`	Translate all characters not specified
`d`	Delete all specified characters
`s`	Replace multiple identical output characters with a single character

The d option deletes every specified character.


$string =~ tr/\t //d;

This deletes all the tabs and spaces from $string.

The s option (for "squeeze") checks the output from the translation. If two or more consecutive characters translate to the same output character, only one output character is actually used. For example, the following replaces everything that is not a digit with a space and outputs only one space between groups of digits:


$string =~ tr/0-9/ /cs;
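To compare the three options side by side, here is a minimal sketch that applies each one to a copy of the same string. The sample input and variable names are assumptions made for this example.

```perl
use strict;
use warnings;

my $input = "tel:  555-1234 ext\t42";

# c: every character that is not a digit becomes a space.
(my $spaces = $input) =~ tr/0-9/ /c;

# c and s together: each run of those spaces is squeezed to one space.
(my $squeezed = $input) =~ tr/0-9/ /cs;

# d: every tab and space is deleted; other characters are untouched.
(my $deleted = $input) =~ tr/\t //d;

print "[$spaces]\n";
print "[$squeezed]\n";   # [ 555 1234 42]
print "[$deleted]\n";    # [tel:555-1234ext42]
```

Note that the result of the c option alone has the same length as the input, because each non-digit is replaced one for one.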

Listing 7.14 is a simple example of a program that uses some of these translation options. It reads a number from the standard input file, and it gets rid of every input character that is not actually a digit.

Listing 7.14. A program that ensures that a string consists of nothing but digits.


1:  #!/usr/local/bin/perl

2:  

3:  $string = <STDIN>;

4:  $string =~ tr/0-9//cd;

5:  print ("$string\n");


$ program7_14

The number 45 appears in this string.

45

$

Line 4 of this program performs the translation. The c option indicates that the operation applies to every character not in the list, and the d option indicates that those characters are to be deleted. Therefore, this translation deletes every character in the string that is not a digit. Note that the trailing newline character is not a digit, so it is one of the characters deleted.

Extended Pattern-Matching

Perl 5 provides some additional pattern-matching capabilities not found in Perl 4 or in standard UNIX pattern-matching operations.

Extended pattern-matching capabilities employ the following syntax:


(?<c>pattern)

Here, <c> is a placeholder for a single character representing the extended pattern-matching capability being used (you do not type the angle brackets), and pattern is the pattern or subpattern to be affected.

The following extended pattern-matching capabilities are supported by Perl 5:

Parenthesizing subpatterns without saving them in memory
Embedding options in patterns
Positive and negative look-ahead conditions
Comments

Parenthesizing Without Saving in Memory

In Perl, when a subpattern is enclosed in parentheses, the subpattern is also stored in memory. If you want to enclose a subpattern in parentheses without storing it in memory, use the ?: extended pattern-matching feature. For example, consider this pattern:


/(?:a|b|c)(d|e)f\1/

This matches the following:

One of a, b, or c
One of d or e
f
Whichever of d or e was matched earlier

Here, \1 matches either d or e, because the subpattern a|b|c was not stored in memory. Compare this with the following:


/(a|b|c)(d|e)f\1/

Here, the subpattern a|b|c is stored in memory, and one of a, b, or c is matched by \1.
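The difference between the two patterns can be seen in a brief sketch. The test strings adfd and adfa are assumptions picked so that each pattern accepts one and rejects the other.

```perl
use strict;
use warnings;

# With (?:...), the first group is not stored, so \1 refers to (d|e).
my $noncap_adfd = ("adfd" =~ /(?:a|b|c)(d|e)f\1/) ? 1 : 0;   # matches
my $noncap_adfa = ("adfa" =~ /(?:a|b|c)(d|e)f\1/) ? 1 : 0;   # no match

# With ordinary parentheses, \1 refers to (a|b|c) instead.
my $cap_adfa = ("adfa" =~ /(a|b|c)(d|e)f\1/) ? 1 : 0;        # matches
my $cap_adfd = ("adfd" =~ /(a|b|c)(d|e)f\1/) ? 1 : 0;        # no match

print "non-capturing: adfd=$noncap_adfd adfa=$noncap_adfa\n";
print "capturing:     adfd=$cap_adfd adfa=$cap_adfa\n";
```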

Embedding Pattern Options

Perl 5 provides a way of specifying a pattern-matching option within the pattern itself. For example, the following patterns are equivalent:


/[a-z]+/i

/(?i)[a-z]+/

In both cases, the pattern matches one or more alphabetic characters; the i option indicates that case is to be ignored when matching.

The syntax for embedded pattern options is


(?option)

where option is one of the options shown in Table 7.7.

Table 7.7. Options for embedded patterns.

Option	Description
`i`	Ignore case in pattern
`m`	Treat the string being matched as multiple lines
`s`	Treat the string being matched as a single line
`x`	Ignore white space in pattern

The g and o options are not supported as embedded pattern options.

Embedded pattern options give you more flexibility when you are matching patterns. For example:


$pattern1 = "[a-z0-9]+";

$pattern2 = "(?i)[a-z]+";

if ($string =~ /$pattern1|$pattern2/) {

        ...

}

Here, the i option is specified for some, but not all, of a pattern. (This pattern matches either any collection of lowercase letters mixed with digits, or any collection of letters.)
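A short sketch can show that the embedded i option affects only the second alternative. The test strings, the %matched hash, and the added ^(?:...)$ wrapper (used so that the whole string must match) are assumptions introduced for this example.

```perl
use strict;
use warnings;

my $pattern1 = "[a-z0-9]+";     # case is significant here
my $pattern2 = "(?i)[a-z]+";    # case is ignored from (?i) onward

my %matched;
foreach my $string ("abc123", "HELLO", "Hello1", "!!!") {
    # Anchor both ends so that the entire string has to match.
    $matched{$string} = ($string =~ /^(?:$pattern1|$pattern2)$/) ? 1 : 0;
    print "$string: ", ($matched{$string} ? "match" : "no match"), "\n";
}
```

Here abc123 matches the first alternative and HELLO matches the second. Hello1 matches neither: the first alternative rejects the uppercase H, and the second rejects the digit.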

Positive and Negative Look-Ahead

Perl 5 enables you to use the ?= feature to define a boundary condition that must be matched in order for the pattern to match. For example, the following pattern matches abc only if it is followed by def:


/abc(?=def)/

This is known as a positive look-ahead condition.

NOTE

The positive look-ahead condition is not part of the pattern matched. For example, consider these statements:

$string = "25abc8";

$string =~ /abc(?=[0-9])/;

$matched = $&;

Here, as always, $& contains the matched pattern, which in this case is abc, not abc8.

Similarly, the ?! feature defines a negative look-ahead condition, which is a boundary condition that must not be present if the pattern is to match. For example, the pattern /abc(?!def)/ matches any occurrence of abc unless it is followed by def.
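Both kinds of look-ahead can be exercised with a small sketch. The strings abcxyz and abcdef and the variable names are assumptions chosen for illustration.

```perl
use strict;
use warnings;

my $string = "25abc8";

# Positive look-ahead: abc must be followed by a digit, but the digit
# is not part of the matched text.
my $positive = ($string =~ /abc(?=[0-9])/) ? $& : undef;

# Negative look-ahead: abc must not be followed by def.
my $kept    = ("abcxyz" =~ /abc(?!def)/) ? 1 : 0;   # matches
my $refused = ("abcdef" =~ /abc(?!def)/) ? 1 : 0;   # does not match

print "$positive\n";              # abc
print "kept=$kept refused=$refused\n";
```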

Pattern Comments

Perl 5 enables you to add comments to a pattern using the ?# feature. For example:

if ($string =~ /(?i)[a-z]{2,3}(?# match two or three alphabetic characters)/) {
...
}

Adding comments makes it easier to follow complicated patterns.

Summary

Perl enables you to search for sequences of characters using patterns. If a pattern is found in a string, the pattern is said to be matched.

Patterns often are used in conjunction with the pattern-match operators, =~ and !~. The =~ operator returns true if the pattern matches, and the !~ operator returns true if the pattern does not match.

Special-pattern characters enable you to search for a string that meets one of a variety of conditions.

The + character matches one or more occurrences of a character.
The * character matches zero or more occurrences of a character.
The [] characters enclose a set of characters, any one of which matches.
The ? character matches zero or one occurrences of a character.
The ^ and $ characters match the beginning and end of a line, respectively.
The \b and \B characters match a word boundary or somewhere other than a word boundary, respectively.
The {} characters specify the number of occurrences of a character.
The | character specifies alternatives, either of which match.

To match a special character literally in a pattern, precede it with a backslash (\).

Enclosing a part of a pattern in parentheses stores the matched subpattern in memory; this stored subpattern can be recalled using the character sequence \n, and stored in a scalar variable using the built-in scalar variable $n. The built-in scalar variable $& stores the entire matched pattern.

You can substitute for scalar-variable names in patterns, specify different pattern delimiters, or supply options that match every possible pattern, ignore case, or perform scalar-variable substitution only once.

The substitution operator, s, enables you to replace a matched pattern with a specified string. Options to the substitution operator enable you to replace every matched pattern, ignore case, treat the replacing string as an expression, or perform scalar-variable substitution only once.

The translation operator, tr, enables you to translate one set of characters into another set. Options exist that enable you to perform translation on everything not in the list, to delete characters in the list, or to reduce runs of identical output characters to a single character.

Perl 5 provides extended pattern-matching capabilities not provided in Perl 4. To use one of these extended pattern features on a subpattern, put (? at the beginning of the subpattern and ) at the end of the subpattern.

Q&A

Q:	How many subpatterns can be stored in memory using \1, \2, and so on?
A:	Basically, as many as you like. After you store more than nine patterns, you can retrieve the later patterns using two-digit numbers preceded by a backslash, such as `\10`.
Q:	Why does pattern-memory variable numbering start with 1, whereas subscript numbering starts with 0?
A:	Subscript numbering starts with 0 to remain compatible with the C programming language. There is no such thing as pattern memory in C, so there is no need to be compatible with it.
Q:	What happens when the replacement string in the translate command is left out, as in `tr/abc//`?
A:	If the replacement string is omitted, a copy of the first string is used. This means that `tr/abc//` does not do anything, because it is the same as `tr/abc/abc/`. If the replacement string is omitted in the substitute command, as in `s/abc//`, the pattern matched (in this case, `abc`) is deleted.
Q:	Why does Perl use characters such as `+`, `*`, and `?` as pattern special characters?
A:	These special characters usually correspond to special characters used in other UNIX applications, such as `vi` and `csh`. Some of the special characters, such as `+`, are used in formal syntax description languages.
Q:	Why does Perl use both `\1` and `$1` to store pattern memory?
A:	To enable you to distinguish between a subpattern matched earlier in the current pattern (which is recalled with `\1`) and a subpattern from the most recent successful match (which is available afterward in `$1`).
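The distinction in the last answer can be sketched as follows. The sample text and the variable names are assumptions made for the example.

```perl
use strict;
use warnings;

my $text = "hello world";

# \1 is used inside the pattern: it requires the same letter twice in a row.
# In list context, the match returns what the parentheses captured.
my ($doubled) = $text =~ /([a-z])\1/;
print "doubled letter: $doubled\n";   # doubled letter: l

# $1 and $2 are used after the pattern has matched, for example in the
# replacement string of a substitution.
(my $swapped = $text) =~ s/(\w+) (\w+)/$2 $1/;
print "$swapped\n";                   # world hello
```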

Workshop

The Workshop provides quiz questions to help you solidify your understanding of the material covered and exercises to give you experience in using what you've learned. Try to understand the quiz and exercise answers before you go on to tomorrow's lesson.

Quiz

What do the following patterns match?
a.   /a|bc*/
b.   /[\d]{1,3}/
c.   /\bc[aou]t\b/
d.   /(xy+z)\.\1/
e.   /^$/
Write patterns that match the following:
a.   Five or more lowercase letters (a-z).
b.   Either the number 1 or the string one.
c.   A string of digits optionally containing a decimal point.
d.   Any letter, followed by any vowel, followed by the same letter again.
e.   One or more + characters.
Suppose the variable $var has the value abc123. Indicate whether the following conditional expressions return true or false.
a.   $var =~ /./
b.   $var =~ /[A-Z]*/
c.   $var =~ /\w{4-6}/
d.   $var =~ /(\d)2(\1)/
e.   $var =~ /abc$/
f.   $var =~ /1234?/
Suppose the variable $var has the value abc123abc. What is the value of $var after the following substitutions?
a.   $var =~ s/abc/def/;
b.   $var =~ s/[a-z]+/X/g;
c.   $var =~ s/B/W/i;
d.   $var =~ s/(.)\d.*\1/d/;
e.   $var =~ s/(\d+)/$1*2/e;
Suppose the variable $var has the value abc123abc. What is the value of $var after the following translations?
a.   $var =~ tr/a-z/A-Z/;
b.   $var =~ tr/123/456/;
c.   $var =~ tr/231/564/;
d.   $var =~ tr/123/ /s;
e.   $var =~ tr/123//cd;

Exercises

Write a program that reads all the input from the standard input file, converts all the vowels (except y) to uppercase, and prints the result on the standard output file.
Write a program that counts the number of times each digit appears in the standard input file. Print the total for each digit and the sum of all the totals.
Write a program that reverses the order of the first three words of each input line (from the standard input file) using the substitution operator. Leave the spacing unchanged, and print each resulting line.
Write a program that adds 1 to every number in the standard input file. Print the results.
BUG BUSTER: What is wrong with the following program?
#!/usr/local/bin/perl

while ($line = <STDIN>) {

        # put quotes around each line of input

        $line =~ /^.*$/"\1"/;

        print ($line);

}
BUG BUSTER: What is wrong with the following program?
#!/usr/local/bin/perl

while ($line = <STDIN>) {

        if ($line =~ /[\d]*/) {

                print ("This line contains the digits '$&'\n");

        }

}