Chapter 7
Pattern Matching
This lesson describes the pattern-matching features of Perl. In this chapter,
you learn about the following:
- How pattern matching works
- The pattern-matching operators
- Special characters supported in pattern matching
- Pattern-matching options
- Pattern substitution
- Translation
- Extended pattern-matching features
A pattern is a sequence of characters to be searched for
in a character string. In Perl, patterns are normally enclosed
in slash characters:
/def/
This represents the pattern def.
If the pattern is found, a match occurs. For example, if you search
the string redefine for the pattern /def/, the
pattern matches the third, fourth, and fifth characters.
redefine
You already have seen a simple example of pattern matching in
the library function split.
@array = split(/ /, $line);
Here the pattern / / matches a single space, which splits
a line into words.
Perl defines special operators that test whether a particular
pattern appears in a character string.
The =~ operator tests whether a pattern is matched, as
shown in the following:
$result = $var =~ /abc/;
The result of the =~ operation is one of the following:
- A nonzero value, or true, if the pattern is found in the string
- A false value (the empty string, which is treated as 0 in numeric contexts) if the pattern is not matched
In this example, the value stored in the scalar variable $var
is searched for the pattern abc. If abc is found,
$result is assigned a nonzero value; otherwise, $result
is set to a false value.
The !~ operator is similar to =~, except that
it checks whether a pattern is not matched.
$result = $var !~ /abc/;
Here, $result is set to a false value if abc appears
in the string assigned to $var, and to a nonzero value
if abc is not found.
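The behavior of the two operators can be checked with a short sketch. The sample string and the variable names $found and $missing are chosen here purely for illustration.

```perl
use strict;
use warnings;

my $var = "xyzabc123";   # sample string chosen for illustration

# =~ is true when the pattern is found; !~ is true when it is not.
my $found   = ($var =~ /abc/) ? "yes" : "no";
my $missing = ($var !~ /abc/) ? "yes" : "no";

print "abc found:   $found\n";     # prints "abc found:   yes"
print "abc missing: $missing\n";   # prints "abc missing: no"
```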
Because =~ and !~ produce either true or false
as their result, these operators are ideally suited for use in
conditional expressions. Listing 7.1 is a simple program that
uses the =~ operator to test whether a particular sequence
of characters exists in a character string.
Listing 7.1. A program that illustrates the use of the matching
operator.
1: #!/usr/local/bin/perl
2:
3: print ("Ask me a question politely:\n");
4: $question = <STDIN>;
5: if ($question =~ /please/) {
6: print ("Thank you for being polite!\n");
7: } else {
8: print ("That was not very polite!\n");
9: }
$ program7_1
Ask me a question politely:
May I have a glass of water, please?
Thank you for being polite!
$

Line 5 is an example of the use of the match
operator =~ in a conditional expression. The following
expression is true if the value stored in $question contains
the word please, and it is false if it does not:
$question =~ /please/
Like all operators, the match operators have a defined precedence.
By definition, the =~ and !~ operators have
higher precedence than multiplication and division, and lower
precedence than the exponentiation operator **.
For a complete list of Perl operators and their precedence, see
Chapter 4, "More Operators."
Perl supports a variety of special characters inside patterns,
which enables you to match any of a number of character strings.
These special characters are what make patterns useful.
The special character + means "one or more of the
preceding characters." For example, the pattern /de+f/
matches any of the following:
def
deef
deeef
deeeeeeef
| NOTE |
Patterns containing + always try to match as many characters as possible. For example, if the pattern
/ab+/
is searching in the string
abbc
it matches abb, not ab.
|
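The greedy behavior described in the note can be observed with a short sketch. It assumes the use of parentheses and the built-in variable $1 to capture the matched text, a feature not yet introduced at this point in the chapter; the variable names are illustrative.

```perl
use strict;
use warnings;

my $string  = "abbc";
my $matched = "";

# The parentheses capture whatever ab+ matched into $1.
if ($string =~ /(ab+)/) {
    $matched = $1;
}

print "matched: $matched\n";   # prints "matched: abb"
```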
The + special character makes it possible to define a
better way to split lines into words. So far, the sample programs
you have seen have used
@words = split (/ /, $line);
to break an input line into words. This works well if there is
exactly one space between words. However, if an input line contains
more than one space between words, as in
Here's  multiple  spaces.
the call to split produces the following list:
("Here's", "", "multiple", "", "spaces.")
The pattern / / tells split to start a new word
whenever it sees a space. Because there are two spaces between
each word, split starts a word when it sees the first
space, and then starts another word when it sees the second space.
This means that there are now "empty words" in the line.
The + special character gets around this problem. Suppose
the call to split is changed to this:
@array = split (/ +/, $line);
Because the pattern / +/ tries to match as many blank
characters as possible, the line
Here's  multiple  spaces.
produces the following list:
("Here's", "multiple", "spaces.")
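As a quick check of the difference between the two patterns, the following sketch splits the same line both ways and counts the resulting fields. The sample line contains two spaces between words; the array names are illustrative.

```perl
use strict;
use warnings;

my $line = "Here's  multiple  spaces.";   # two spaces between words

my @single = split(/ /, $line);    # a new field starts at every space
my @plus   = split(/ +/, $line);   # a new field starts at every run of spaces

print scalar(@single), " fields with / /\n";    # prints "5 fields with / /"
print scalar(@plus),   " fields with / +/\n";   # prints "3 fields with / +/"
```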
Listing 7.2 shows how you can use the / +/ pattern to
produce a count of the number of words in a file.
Listing 7.2. A word-count program that handles multiple spaces
between words.
1: #!/usr/local/bin/perl
2:
3: $wordcount = 0;
4: $line = <STDIN>;
5: while ($line ne "") {
6: chop ($line);
7: @words = split(/ +/, $line);
8: $wordcount += @words;
9: $line = <STDIN>;
10: }
11: print ("Total number of words: $wordcount\n");
$ program7_2
Here is some input.
Here are some more words.
Here is my last line.
^D
Total number of words: 14
$

This is the same word-count program you saw
in Listing 5.15, with only one change: The pattern / +/
is being used to break the line into words. As you can see, this
handles spaces between words properly.
You might have noticed the following problems with this word-count
program:
- Spaces at the beginning of a line are counted as a word, because
split always starts a new word when it sees a space.
- Tab characters are counted as a word.
For an example of the first problem, take a look at the following
input line:
   This line contains leading spaces.
The call to split in line 7 breaks the preceding into
the following list:
("", "This", "line", "contains", "leading", "spaces.")
This yields a word count of 6, not the expected 5.
There can be at most one empty word produced from a line, no matter
how many leading spaces there are, because the pattern / +/
matches as many spaces as possible. Note also that the program
can distinguish between lines containing words and lines that
are blank or contain just spaces. If a line is blank or contains
only spaces, the line
@words = split(/ +/, $line);
assigns the empty list to @words. Because of this, you
can fix the problem of leading spaces in lines by modifying line
8 as follows:
$wordcount += (@words > 0 && $words[0] eq "" ?
@words-1 : @words);
This checks for lines containing leading spaces; if a line contains
leading spaces, the first "word" (which is the empty
string) is not added to the word count.
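The effect of the modified expression can be seen in isolation with a small sketch. The sample line and the variable $count are illustrative; the conditional expression is the one shown above.

```perl
use strict;
use warnings;

my $line  = "   This line contains leading spaces.";
my @words = split(/ +/, $line);   # the first element is the empty string

# Leave the leading empty "word" out of the count.
my $count = (@words > 0 && $words[0] eq "" ? @words - 1 : @words);

print "fields: ", scalar(@words), ", words: $count\n";   # prints "fields: 6, words: 5"
```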
To find out how to modify the program to deal with tab characters
as well as spaces, see the following section.
The [] special characters enable you to define patterns
that match one of a group of alternatives. For example, the following
pattern matches def or dEf:
/d[eE]f/
You can specify as many alternatives as you like.
/a[0123456789]c/
This matches a, followed by any digit, followed by c.
You can combine [] with + to match a sequence
of characters of any length.
/d[eE]+f/
This matches all of the following:
def
dEf
deef
dEef
dEEEeeeEef
Any combination of E and e, in any order, is
matched by [eE]+.
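A short sketch can confirm which strings this pattern accepts. It uses the built-in grep function to filter a list of candidate strings; the list itself is made up for illustration.

```perl
use strict;
use warnings;

my @candidates = ("def", "dEf", "dEEEeeeEef", "df", "dxf");

# Keep only the strings in which d is followed by one or more e/E and then f.
my @matched = grep(/d[eE]+f/, @candidates);

print "matched: @matched\n";   # prints "matched: def dEf dEEEeeeEef"
```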
You can use [] and + together to modify the
word-count program you've just seen to accept either tab characters
or spaces. Listing 7.3 shows how you can do this.
Listing 7.3. A word-count program that handles multiple spaces
and tabs between words.
1: #!/usr/local/bin/perl
2:
3: $wordcount = 0;
4: $line = <STDIN>;
5: while ($line ne "") {
6: chop ($line);
7: @words = split(/[\t ]+/, $line);
8: $wordcount += @words;
9: $line = <STDIN>;
10: }
11: print ("Total number of words: $wordcount\n");
$ program7_3
Here is some input.
Here are some more words.
Here is my last line.
^D
Total number of words: 14
$

This program is identical to Listing 7.2, except
that the pattern is now /[\t ]+/.
The \t special-character sequence represents the tab
character, and this pattern matches any combination or quantity
of spaces and tabs.
| NOTE |
Any escape sequence that is supported in double-quoted strings is supported in patterns. See Chapter 3, "Understanding Scalar Values," for a list of the escape sequences that are available.
|
As you have seen, the + character matches one or more
occurrences of a character. Perl also defines two other special
characters that match a varying number of characters: *
and ?.
The * special character matches zero or more occurrences
of the preceding character. For example, the pattern
/de*f/
matches df, def, deef, and so on.
This character can also be used with the [] special character.
/[eE]*/
This matches the empty string as well as any combination of E
or e in any order.
 |
Be sure not to confuse the * special character with the + special character. If you use the wrong special character, you might not get the results that you want.
For example, suppose that you modify Listing 7.3 to call split as follows:
@words = split (/[\t ]*/, $line);
This matches zero or more occurrences of the space or tab character. When you run this with the input
a line
here's the list that is assigned to @words:
("a", "l", "i", "n", "e")
Because the pattern /[\t ]*/ matches on zero occurrences of the space or tab character, it matches after every character. This means that split starts a word after every character that is not a space or tab. (It skips spaces and tabs
because /[\t ]*/ matches them.)
The best way to avoid problems such as this one is to use the * special character only when there is another character appearing in the pattern. Patterns such as
/b*[c]/
never match the null string, because the matched sequence has to contain at least the character c.
|
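The difference between the two special characters in a call to split can be demonstrated with a minimal sketch. The input line is the one used above; the array names are illustrative.

```perl
use strict;
use warnings;

my $line = "a line";

my @star = split(/[\t ]*/, $line);   # also matches the empty string between letters
my @plus = split(/[\t ]+/, $line);   # needs at least one space or tab

print join("|", @star), "\n";   # prints "a|l|i|n|e"
print join("|", @plus), "\n";   # prints "a|line"
```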
The ? character matches zero or one occurrence of the
preceding character. For example, the pattern
/de?f/
matches either df or def. Note that it does
not match deef, because the ? character does
not match two occurrences of a character.
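The following sketch compares * and ? on the same three strings. The list of strings and the array names are illustrative.

```perl
use strict;
use warnings;

my @strings = ("df", "def", "deef");

my @star     = grep(/de*f/, @strings);   # zero or more e characters
my @question = grep(/de?f/, @strings);   # zero or one e character

print "de*f matches: @star\n";       # prints "de*f matches: df def deef"
print "de?f matches: @question\n";   # prints "de?f matches: df def"
```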
If you want your pattern to include a character that is normally
treated as a special character, precede the character with a backslash
\. For example, to check for one or more occurrences
of * in a string, use the following pattern:
/\*+/
The backslash preceding the * tells the Perl interpreter
to treat the * as an ordinary character, not as the special
character meaning "zero or more occurrences."
To include a backslash in a pattern, specify two backslashes:
/\\+/
This pattern tests for one or more occurrences of \ in
a string.
If you are running Perl 5, another way to tell Perl that a special
character is to be treated as a normal character is to precede
it with the \Q escape sequence. When the Perl interpreter
sees \Q, every character following the \Q is
treated as a normal character until \E is seen. This
means that the pattern
/\Q^ab*/
matches any occurrence of the string ^ab*, and the pattern
/\Q^ab\E*/
matches ^a followed by zero or more occurrences of b.
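The effect of \Q and \E can be checked with a short sketch. The test strings and variable names are invented for illustration, and the parentheses with $1 are used only to show which text was matched.

```perl
use strict;
use warnings;

# Everything after \Q is taken literally, so this looks for the text ^ab*
my $literal = ("x^ab*y" =~ /\Q^ab*/) ? "match" : "no match";

# Without \Q, ^ and * keep their special meanings, so the literal text is not found.
my $special = ("x^ab*y" =~ /^ab*/) ? "match" : "no match";

# \E ends the literal run, so the * after it applies to b again.
my $mixed = "";
if ("x^abbby" =~ /(\Q^ab\E*)/) {
    $mixed = $1;
}

print "with \\Q:    $literal\n";    # prints "with \Q:    match"
print "without \\Q: $special\n";    # prints "without \Q: no match"
print "with \\E:    $mixed\n";      # prints "with \E:    ^abbb"
```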
For a complete list of special characters in patterns that require
\ to be given their natural meaning, see the section
titled "Special-Character Precedence," which contains
a table that lists them.
| TIP |
In Perl, any character that is not a letter or a digit can be preceded by a backslash. If the character isn't a special character in Perl, the backslash is ignored.
If you are not sure whether a particular character is a special character, preceding it with a backslash will ensure that your pattern behaves the way you want it to.
|
As you have seen, the pattern
/a[0123456789]c/
matches a, followed by any digit, followed by c.
Another way of writing this is as follows:
/a[0-9]c/
Here, the range [0-9] represents any digit between 0
and 9. This pattern matches a0c, a1c, a2c,
and so on up to a9c.
Similarly, the range [a-z] matches any lowercase letter,
and the range [A-Z] matches any uppercase letter. For
example, the pattern
/[A-Z][A-Z]/
matches any two uppercase letters.
To match any uppercase letter, lowercase letter, or digit, use
the following range:
/[0-9a-zA-Z]/
Listing 7.4 provides an example of the use of ranges with the
[] special characters. This program checks whether a
given input line contains a legal Perl scalar, array, or file-variable
name. (Note that this program handles only simple input lines.
Later examples will solve this problem in a better way.)
Listing 7.4. A simple variable-name validation program.
1: #!/usr/local/bin/perl
2:
3: print ("Enter a variable name:\n");
4: $varname = <STDIN>;
5: chop ($varname);
6: if ($varname =~ /\$[A-Za-z][_0-9a-zA-Z]*/) {
7: print ("$varname is a legal scalar variable\n");
8: } elsif ($varname =~ /@[A-Za-z][_0-9a-zA-Z]*/) {
9: print ("$varname is a legal array variable\n");
10: } elsif ($varname =~ /[A-Za-z][_0-9a-zA-Z]*/) {
11: print ("$varname is a legal file variable\n");
12: } else {
13: print ("I don't understand what $varname is.\n");
14: }
$ program7_4
Enter a variable name:
$result
$result is a legal scalar variable
$

Line 6 checks whether the input line contains
the name of a legal scalar variable. Recall that a legal scalar
variable consists of the following:
- A $ character
- An uppercase or lowercase letter
- Zero or more letters, digits, or underscore characters
Each part of the pattern tested in line 6 corresponds to one of
the conditions just listed. The first part of the pattern,
\$, ensures that the pattern matches only if it begins
with a $ character.
| NOTE |
The $ is preceded by a backslash, because $ is a special character in patterns. See the following section, "Anchoring Patterns," for more information on the $ special character.
|
The second part of the pattern,
[A-Za-z]
matches exactly one uppercase or lowercase letter. The final part
of the pattern,
[_0-9a-zA-Z]*
matches zero or more underscores, digits, or letters in any order.
The patterns in line 8 and line 10 are very similar to the one
in line 6. The only difference in line 8 is that the pattern there
matches a string whose first character is @, not $.
In line 10, this first character is omitted completely.
The pattern in line 8 corresponds to the definition of a legal
array-variable name, and the pattern in line 10 corresponds to
the definition of a legal file-variable name.
Although Listing 7.4 can determine whether a line of input contains
a legal Perl variable name, it cannot determine whether there
is extraneous input on the line. For example, it can't tell the
difference between the following three lines of input:
$result
junk$result
$result#junk
In all three cases, the pattern
/\$[a-zA-Z][_0-9a-zA-Z]*/
finds the string $result and matches successfully; however,
only the first line is a legal Perl variable name.
To fix this problem, you can use pattern anchors. Table
7.1 lists the pattern anchors defined in Perl.
Table 7.1. Pattern anchors in Perl.
| Anchor | Description |
| ^ or \A | Match at beginning of string only |
| $ or \Z | Match at end of string only |
| \b | Match on word boundary |
| \B | Match inside word |
These pattern anchors are described in the following sections.
The ^ and $ Pattern Anchors
The pattern anchors ^ and $ ensure that the
pattern is matched only at the beginning or the end of a string.
For example, the pattern
/^def/
matches def only if these are the first three characters
in the string. Similarly, the pattern
/def$/
matches def only if these are the last three characters
in the string.
You can combine ^ and $ to force matching of
the entire string, as follows:
/^def$/
This matches only if the string is def.
In most cases, the escape sequences \A and \Z
(defined in Perl 5) are equivalent to ^ and $,
respectively:
/\Adef\Z/
This also matches only if the string is def.
| NOTE |
\A and \Z behave differently from ^ and $ when the multiple-line pattern-matching option is specified. Pattern-matching options are described later in this chapter.
|
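The following sketch applies the anchored patterns to a few sample strings to show which ones each pattern accepts. The list of strings and the array names are illustrative.

```perl
use strict;
use warnings;

my @strings = ("def", "defghi", "abcdef", "abcdefghi");

my @starts = grep(/^def/, @strings);      # def must begin the string
my @ends   = grep(/def$/, @strings);      # def must end the string
my @whole  = grep(/^def$/, @strings);     # the string must be exactly def
my @whole5 = grep(/\Adef\Z/, @strings);   # same test using the Perl 5 anchors

print "/^def/     : @starts\n";   # prints "/^def/     : def defghi"
print "/def\$/     : @ends\n";    # prints "/def$/     : def abcdef"
print "/^def\$/    : @whole\n";   # prints "/^def$/    : def"
print "/\\Adef\\Z/  : @whole5\n"; # prints "/\Adef\Z/  : def"
```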
Listing 7.5 shows how you can use pattern anchors to ensure that
a line of input is, in fact, a legal Perl scalar-, array-, or
file-variable name.
Listing 7.5. A better variable-name validation program.
1: #!/usr/local/bin/perl
2:
3: print ("Enter a variable name:\n");
4: $varname = <STDIN>;
5: chop ($varname);
6: if ($varname =~ /^\$[A-Za-z][_0-9a-zA-Z]*$/) {
7: print ("$varname is a legal scalar variable\n");
8: } elsif ($varname =~ /^@[A-Za-z][_0-9a-zA-Z]*$/) {
9: print ("$varname is a legal array variable\n");
10: } elsif ($varname =~ /^[A-Za-z][_0-9a-zA-Z]*$/) {
11: print ("$varname is a legal file variable\n");
12: } else {
13: print ("I don't understand what $varname is.\n");
14: }
$ program7_5
Enter a variable name:
x$result
I don't understand what x$result is.
$

The only difference between this program and
the one in Listing 7.4 is that this program uses the pattern anchors
^ and $ in the patterns in lines 6, 8, and 10.
These anchors ensure that a valid pattern consists of only those
characters that make up a legal Perl scalar, array, or file variable.
In the sample output given here, the input
x$result
is rejected, because the pattern in line 6 is matched only when
the $ character appears at the beginning of the line.
Word-Boundary Pattern Anchors
The word-boundary pattern anchors, \b and \B,
specify whether a matched pattern must be on a word boundary or
inside a word. (A word boundary is the beginning or end
of a word.)
The \b pattern anchor specifies that the pattern must
be on a word boundary. For example, the pattern
/\bdef/
matches only if def is the beginning of a word. This
means that def and defghi match but abcdef
does not.
You can also use \b to indicate the end of a word. For
example,
/def\b/
matches def and abcdef, but not defghi.
Finally, the pattern
/\bdef\b/
matches only the word def, not abcdef or defghi.
| NOTE |
A word is assumed to contain letters, digits, and underscore characters, and nothing else. This means that
/\bdef/
matches $defghi: because $ is not assumed to be part of a word, def is the beginning of the word defghi, and /\bdef/ matches it.
|
The \B pattern anchor is the opposite of \b.
\B matches only if the pattern is contained in a word.
For example, the pattern
/\Bdef/
matches abcdef, but not def. Similarly, the
pattern
/def\B/
matches defghi, and
/\Bdef\B/
matches cdefg or abcdefghi, but not def,
defghi, or abcdef.
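The word-boundary examples above can be verified together in one sketch. The strings are the ones discussed in this section; the array names are illustrative.

```perl
use strict;
use warnings;

my @strings = ("def", "abcdef", "defghi", "abcdefghi", '$defghi');

my @word_start = grep(/\bdef/, @strings);     # def begins a word
my @word_end   = grep(/def\b/, @strings);     # def ends a word
my @whole_word = grep(/\bdef\b/, @strings);   # def is a word by itself
my @inside     = grep(/\Bdef\B/, @strings);   # def is strictly inside a word

print "\\bdef   : @word_start\n";   # prints "\bdef   : def defghi $defghi"
print "def\\b   : @word_end\n";     # prints "def\b   : def abcdef"
print "\\bdef\\b : @whole_word\n";  # prints "\bdef\b : def"
print "\\Bdef\\B : @inside\n";      # prints "\Bdef\B : abcdefghi"
```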
The \b and \B pattern anchors enable you to
search for words in an input line without having to break up the
line using split. For example, Listing 7.6 uses \b
to count the number of lines of an input file that contain the
word the.
Listing 7.6. A program that counts the number of input lines
containing the word the.
1: #!/usr/local/bin/perl
2:
3: $thecount = 0;
4: print ("Enter the input here:\n");
5: $line = <STDIN>;
6: while ($line ne "") {
7: if ($line =~ /\bthe\b/) {
8: $thecount += 1;
9: }
10: $line = <STDIN>;
11: }
12: print ("Number of lines containing 'the': $thecount\n");
$ program7_6
Enter the input here:
Now is the time
for all good men
to come to the aid
of the party.
^D
Number of lines containing 'the': 3
$

This program checks each line in turn to see
if it contains the word the, and then prints the total
number of lines that contain the word.
Line 7 performs the actual checking by trying to match the pattern
/\bthe\b/
If this pattern matches, the line contains the word the,
because the pattern checks for word boundaries at either end.
Note that this program doesn't check whether the word the
appears on a line more than once. It is not difficult to modify
the program to do this; in fact, you can do it in several different
ways.
The most obvious but most laborious way is to break up lines that
you know contain the into words, and then check each
word, as follows:
if ($line =~ /\bthe\b/) {
    @words = split(/[\t ]+/, $line);
    $count = 1;
    while ($count <= @words) {
        if ($words[$count-1] eq "the") {
            $thecount += 1;
        }
        $count++;
    }
}
A cute way to accomplish the same thing is to use the pattern
itself to break the line into words:
if ($line =~ /\bthe\b/) {
    @words = split(/\bthe\b/, $line);
    $thecount += @words - 1;
}
In fact, you don't even need the if statement.
@words = split(/\bthe\b/, $line);
$thecount += @words - 1;
Here's why this works: Every time split sees the word
the, it starts a new word. Therefore, the number of occurrences
of the is equal to one less than the number of elements
in @words. If there are no occurrences of the,
@words has the length 1, and $thecount is not
changed.
 |
This trick works only if you know that there is at least one word on the line.
Consider the following code, which tries to use the aforementioned trick on a line that has had its newline character removed using chop:
$line = <STDIN>;
chop ($line);
@words = split(/\bthe\b/, $line);
$thecount += @words - 1;
This code actually subtracts 1 from $thecount if the line is blank or consists only of the word the, because in these cases @words is the empty list and the length of @words is 0.
Leaving off the call to chop protects against this problem, because there will always be at least one "word" in every line (consisting of the newline character).
|
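The pitfall described above can be reproduced with a short sketch that applies the split trick to lines whose newline has already been removed. The sample lines and the @counts array are invented for illustration.

```perl
use strict;
use warnings;

# Lines as they would look after chop has removed the newline.
my @lines  = ("the cat and the dog", "no match here", "", "the");
my @counts = ();

foreach my $line (@lines) {
    my @words = split(/\bthe\b/, $line);
    push(@counts, @words - 1);   # number of elements minus one
}

print "counts: @counts\n";   # prints "counts: 2 0 -1 -1"
```

The last two results are -1 because split returns the empty list for a blank line and for a line consisting only of the word the.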
If you like, you can use the value of a scalar variable in a pattern.
For example, the following code splits the line $line
into words:
$pattern = "[\\t ]+";
@words = split(/$pattern/, $line);
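As a minimal runnable version of this idea, the following sketch splits a sample line that mixes spaces and tabs. The sample line is invented for illustration.

```perl
use strict;
use warnings;

my $pattern = "[\\t ]+";                 # the string [\t ]+ stored in a scalar
my $line    = "alpha \t beta  gamma";    # words separated by spaces and a tab

# The value of $pattern is substituted into the pattern before matching.
my @words = split(/$pattern/, $line);

print join("|", @words), "\n";   # prints "alpha|beta|gamma"
```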
Because you can use a scalar variable in a pattern, there is nothing
to stop you from reading the pattern from the standard input file.
Listing 7.7 accepts a search pattern from a file and then searches
for the pattern in the input files listed on the command line.
If it finds the pattern, it prints the filename and line number
of the match; at the end, it prints the total number of matches.
This example assumes that two files exist, file1 and
file2. Each file contains the following:
This is a line of input.
This is another line of input.
If you run this program with command-line arguments file1
and file2 and search for the pattern another,
you get the output shown.
Listing 7.7. A simple pattern-search program.
1: #!/usr/local/bin/perl
2:
3: print ("Enter the search pattern:\n");
4: $pattern = <STDIN>;
5: chop ($pattern);
6: $filename = $ARGV[0];
7: $linenum = $matchcount = 0;
8: print ("Matches found:\n");
9: while ($line = <>) {
10: $linenum += 1;
11: if ($line =~ /$pattern/) {
12: print ("$filename, line $linenum\n");
13: @words = split(/$pattern/, $line);
14: $matchcount += @words - 1;
15: }
16: if (eof) {
17: $linenum = 0;
18: $filename = $ARGV[0];
19: }
20: }
21: if ($matchcount == 0) {
22: print ("No matches found.\n");
23: } else {
24: print ("Total number of matches: $matchcount\n");
25: }
$ program7_7 file1 file2
Enter the search pattern:
another
Matches found:
file1, line 2
file2, line 2
Total number of matches: 2
$

This program uses the following scalar variables
to keep track of information:
- $pattern contains the search pattern read in from
the standard input file.
- $filename contains the file currently being searched.
- $linenum contains the line number of the line currently
being searched.
- $matchcount contains the total number of matches
found to this point.
Line 6 sets the current filename, which corresponds to the first
element in the built-in array variable @ARGV. This array
variable lists the arguments supplied on the command line. (To
refresh your memory on how @ARGV works, refer back to
Chapter 6, "Reading from and Writing to Files.") This current
filename needs to be stored in a scalar variable, because the
<> operator in line 9 shifts @ARGV and
destroys this name.
Line 9 reads from each of the files on the command line in turn,
one line at a time. The current input line is stored in the scalar
variable $line. Once the line is read, line 10 adds 1
to the current line number.
Lines 11-15 handle the matching process. Line 11 checks whether
the pattern stored in $pattern is contained in the input
line stored in $line. If a match is found, line 12 prints
out the current filename and line number. Line 13 then splits
the line into "words," using the trick described in
the earlier section, "Word-Boundary Pattern Anchors."
Because the number of elements of the list stored in @words
is one larger than the number of times the pattern is matched,
the expression @words - 1 is equivalent to the number
of matches; its value is added to $matchcount.
Line 16 checks whether the <> operator has reached
the end of the current input file. If it has, line 17 resets the
current line number to 0. This ensures that the next pass through
the loop will set the current line number to 1 (to indicate that
the program is on the first line of the next file). Line 18 sets
the filename to the next file mentioned on the command line, which
is currently stored in $ARGV[0].
Lines 21-25 either print the total number of matches or indicate
that no matches were found.
| NOTE |
Make sure that you remember to include the enclosing / characters when you use a scalar-variable name in a pattern. The Perl interpreter does not complain when it sees the following, for example, but the result might not be what you want:
@words = split($pattern, $line);
|
As you have seen, when the special characters [] appear
in a pattern, they specify a set of alternatives to choose from.
For example, the pattern
/d[eE]f/
matches def or dEf.
When the ^ character appears as the first character after
the [, it indicates that the pattern is to match any
character except the ones displayed between the [
and ]. For example, the pattern
/d[^eE]f/
matches any string that satisfies the following criteria:
- The first character is d.
- The second character is anything other than e or
E.
- The last character is f.
| NOTE |
To include a ^ character in a set of alternatives, precede it with a backslash, as follows:
/d[\^eE]f/
This pattern matches d^f, def, or dEf.
|
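The following sketch contrasts a negated set of alternatives with a set that contains a literal ^ character. The test strings and array names are illustrative.

```perl
use strict;
use warnings;

my @strings = ("dxf", "def", "dEf", "d^f");

my @negated = grep(/d[^eE]f/, @strings);    # middle character is anything but e or E
my @literal = grep(/d[\^eE]f/, @strings);   # middle character is ^, e, or E

print "d[^eE]f  : @negated\n";   # prints "d[^eE]f  : dxf d^f"
print "d[\\^eE]f : @literal\n";  # prints "d[\^eE]f : def dEf d^f"
```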
In the section titled "Matching Any Letter or Number"
earlier in this chapter, you learned that you can represent consecutive
letters or numbers inside the [] special characters by
specifying ranges. For example, in the pattern
/a[1-3]c/
the [1-3] matches any of 1, 2, or 3.
Some ranges occur frequently enough that Perl defines special
escape sequences for them. For example, instead of writing
/[0-9]/
to indicate that any digit is to be matched, you can write
/\d/
The \d escape sequence means "any digit."
Table 7.2 lists the character-range escape sequences, what they
match, and their equivalent character ranges.
Table 7.2. Character-range escape sequences.
| Escape sequence | Description | Range |
| \d | Any digit | [0-9] |
| \D | Anything other than a digit | [^0-9] |
| \w | Any word character | [_0-9a-zA-Z] |
| \W | Anything not a word character | [^_0-9a-zA-Z] |
| \s | White space | [ \r\t\n\f] |
| \S | Anything other than white space | [^ \r\t\n\f] |
These escape sequences can be used anywhere ordinary characters
are used. For example, the following pattern matches any digit
or lowercase letter:
/[\da-z]/
| NOTE |
The definition of word boundary as used by the \b and \B special characters corresponds to the definition of word character used by \w and \W.
If the pattern /\w\W/ matches a particular pair of characters, the first character is part of a word and the second is not; this means that the first character is the end of a word, and that a word boundary exists between the first and second
characters matched by the pattern.
Similarly, if /\W\w/ matches a pair of characters, the first character is not part of a word and the second character is. This means that the second character is the beginning of a word. Again, a word boundary exists between the first and second
characters matched by the pattern.
|
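A brief sketch shows several of these escape sequences at work on one line of text. The sample text and the array names are invented for illustration.

```perl
use strict;
use warnings;

my $text = "item_42 costs\t7 dollars";

my @fields  = split(/\s+/, $text);          # split on runs of white space
my @numbers = grep(/^\d+$/, @fields);       # fields made up entirely of digits
my @words   = grep(/^\w+$/, @fields);       # fields made up entirely of word characters
my @mixed   = grep(/[\da-z]/, ("A", "7", "q", "#"));   # a digit or lowercase letter

print "fields:  @fields\n";    # prints "fields:  item_42 costs 7 dollars"
print "numbers: @numbers\n";   # prints "numbers: 7"
print "mixed:   @mixed\n";     # prints "mixed:   7 q"
```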
Another special character supported in patterns is the period
(.) character, which matches any character except the
newline character. For example, the following pattern matches
d, followed by any non-newline character, followed by
f:
/d.f/
The . character is often used in conjunction with the
* character. For example, the following pattern matches
any string that contains the character d preceding the
character f:
/d.*f/
Normally, the .* special-character combination tries
to match as much as possible. For example, if the string banana
is searched using the following pattern, the pattern matches banana,
not ba or bana:
/b.*a/
| NOTE |
There is one exception to the preceding rule: The .* combination only matches the longest possible string that enables the pattern match as a whole to succeed.
For example, suppose the string Mississippi is searched using the pattern
/M.*i.*pi/
Here, the first .* in /M.*i.*pi/ matches
ississ
If it tried to go further and match
ississi
or even
ississippi
there would be nothing left for the rest of the pattern to match.
When the first .* match is limited to
ississ
the rest of the pattern, i.*pi, matches ippi, and the pattern as a whole succeeds.
|
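The amount of text matched by each .* can be inspected with a short sketch. It assumes the use of parentheses to capture what each .* matched; the variable names are illustrative.

```perl
use strict;
use warnings;

# Capture what each .* matches in the Mississippi example.
my ($first, $second) = ("Mississippi" =~ /M(.*)i(.*)pi/);

# Capture what .* matches between the b and the last a of banana.
my ($inner) = ("banana" =~ /b(.*)a/);

print "first .*:  $first\n";    # prints "first .*:  ississ"
print "second .*: $second\n";   # prints "second .*: p"
print "banana .*: $inner\n";    # prints "banana .*: anan"
```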
Several special characters in patterns that you have seen enable
you to match a specified number of occurrences of a character.
For example, + matches one or more occurrences of a character,
and ? matches zero or one occurrence.
Perl enables you to define how many occurrences of a character
constitute a match. To do this, use the special characters {
and }.
For example, the pattern
/de{1,3}f/
matches d, followed by one, two, or three occurrences
of e, followed by f. This means that def,
deef, and deeef match, but df and deeeef
do not.
To specify an exact number of occurrences, include only one value
between the { and the }.
/de{3}f/
This specifies exactly three occurrences of e, which
means this pattern only matches deeef.
To specify a minimum number of occurrences, leave off the upper
bound.
/de{3,}f/
This matches d, followed by at least three es,
followed by f.
Finally, to specify a maximum number of occurrences, use 0 as
the lower bound.
/de{0,3}f/
This matches d, followed by no more than three es,
followed by f.
| NOTE |
You can use { and } with character ranges or any other special character, as follows:
/[a-z]{1,3}/
This matches one, two, or three lowercase letters.
/.{3}/
This matches any three characters.
|
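The four forms of the { and } special characters can be compared side by side. The following sketch runs each pattern against the same list of strings; the list and the array names are illustrative.

```perl
use strict;
use warnings;

my @strings = ("df", "def", "deef", "deeef", "deeeef");

my @bounded  = grep(/de{1,3}f/, @strings);   # one to three e characters
my @exact    = grep(/de{3}f/,   @strings);   # exactly three
my @at_least = grep(/de{3,}f/,  @strings);   # three or more
my @at_most  = grep(/de{0,3}f/, @strings);   # no more than three

print "{1,3}: @bounded\n";    # prints "{1,3}: def deef deeef"
print "{3}:   @exact\n";      # prints "{3}:   deeef"
print "{3,}:  @at_least\n";   # prints "{3,}:  deeef deeeef"
print "{0,3}: @at_most\n";    # prints "{0,3}: df def deef deeef"
```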
The special character | enables you to specify two or
more alternatives to choose from when matching a pattern. For
example, the pattern
/def|ghi/
matches either def or ghi. The pattern
/[a-z]+|[0-9]+/
matches one or more lowercase letters or one or more digits.
Listing 7.8 is a simple example of a program that uses the |
special character. It reads a number and checks whether it is
a legitimate Perl integer.
Listing 7.8. A simple integer-validation program.
1: #!/usr/local/bin/perl
2:
3: print ("Enter a number:\n");
4: $number = <STDIN>;
5: chop ($number);
6: if ($number =~ /^-?\d+$|^-?0[xX][\da-fA-F]+$/) {
7: print ("$number is a legal integer.\n");
8: } else {
9: print ("$number is not a legal integer.\n");
10: }
$ program7_8
Enter a number:
0x3ff1
0x3ff1 is a legal integer.
$

Recall that Perl integers can be in any of
three forms:
- Standard base-10 notation, as in 123
- Base-8 (octal) notation, indicated by a leading 0,
as in 0123
- Base-16 (hexadecimal) notation, indicated by a leading 0x
or 0X, as in 0X1ff
Line 6 checks whether a number is a legal Perl integer. The first
alternative in the pattern,
^-?\d+$
matches a string consisting of one or more digits, optionally
preceded by a -. (The ^ and $ characters
ensure that this is the only string that matches.) This takes
care of integers in standard base-10 notation and integers in
octal notation.
The second alternative in the pattern,
^-?0[xX][\da-fA-F]+$
matches integers in hexadecimal notation. Take a look at this
pattern one piece at a time:
- The ^ matches the beginning of the line. This ensures
that lines containing leading spaces or extraneous characters
are not treated as valid hexadecimal integers.
- The -? matches a - if it is present. This
ensures that negative numbers are matched.
- The 0 matches the leading 0.
- The [xX] matches the x or X that
follows the leading 0.
- The [\da-fA-F] matches any digit, any letter between
a and f, or any letter between A and
F. Recall that these are precisely the characters that
are allowed to appear as hexadecimal digits.
- The + indicates that the pattern is to match one
or more hexadecimal digits.
- The closing $ indicates that the pattern is to match
only if there are no extraneous characters following the hexadecimal
integer.
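The two alternatives can be tried out with a short sketch. The subroutine name is_legal_integer and the sample values are assumptions made for illustration; they are not part of Listing 7.8.

```perl
use strict;
use warnings;

# Hypothetical helper: returns 1 if $text looks like a Perl integer
# (decimal, octal, or hexadecimal notation), and 0 otherwise.
sub is_legal_integer {
    my ($text) = @_;
    return $text =~ /^-?\d+$|^-?0[xX][\da-fA-F]+$/ ? 1 : 0;
}

# Sample values chosen for illustration only.
foreach my $sample ("123", "-0123", "0x3ff1", "0X1FF", "0x", "12ab") {
    my $verdict = is_legal_integer($sample) ? "legal" : "not legal";
    print "$sample is $verdict\n";
}
```

The last two samples are rejected: 0x has no hexadecimal digits after the prefix, and 12ab contains letters without the 0x prefix.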
 |
Beware that the following pattern matches either x or one or more of y, not one or more of x or y:
/x|y+/
See the section called "Special-Character Precedence" later in this chapter for details on how to specify special-character precedence in patterns.
|
Suppose that you want to write a pattern that matches the following:
- One or more digits or lowercase letters
- Followed by a colon or semicolon
- Followed by another group of one or more digits or lowercase
letters
- Another colon or semicolon
- Yet another group of one or more digits or lowercase letters
One way to indicate this pattern is as follows:
/[\da-z]+[:;][\da-z]+[:;][\da-z]+/
This pattern is somewhat complicated and is quite repetitive.
Perl provides an easier way to specify patterns that contain multiple
repetitions of a particular sequence. When you enclose a portion
of a pattern in parentheses, as in
([\da-z]+)
Perl stores the matched sequence in memory. To retrieve a sequence
from memory, use the special character \n, where n
is an integer representing the nth pattern stored in
memory.
For example, if all three groups must contain the same text, the
aforementioned pattern can be written as
/([\da-z]+)[:;]\1[:;]\1/
Here, the text matched by [\da-z]+ is stored in memory.
When the Perl interpreter sees the escape sequence \1,
it matches that same text again (not just any text that fits
the subpattern).
You also can store the text matched by [:;] in memory, and write
this pattern as follows:
/([\da-z]+)([:;])\1\2\1/
Pattern sequences are stored in memory from left to right, so
\1 represents the text matched by [\da-z]+
and \2 represents the text matched by [:;].
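A small sketch shows that \1 and \2 repeat the text that was actually matched, not the subpattern itself. The sample strings, the ^ and $ anchors, and the use of qr to store the pattern are assumptions added for the demonstration.

```perl
use strict;
use warnings;

# Anchored so that the whole string must fit the pattern.
my $pattern = qr/^([\da-z]+)([:;])\1\2\1$/;

# Sample strings are illustrative only.
my $same      = "ab1:ab1:ab1" =~ $pattern ? "match" : "no match";
my $different = "ab1:cd2:ef3" =~ $pattern ? "match" : "no match";
my $mixed_sep = "ab1:ab1;ab1" =~ $pattern ? "match" : "no match";

print "$same, $different, $mixed_sep\n";   # match, no match, no match
```

Only the first string matches, because it repeats both the same group of characters and the same separator.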
Pattern-sequence memory is often used when you want to match the
same character in more than one place but don't care which character
you match. For example, if you are looking for a date in dd-mm-yy
format, you might want to match
/\d{2}([\W])\d{2}\1\d{2}/
This matches two digits, a non-word character, two more digits,
the same non-word character, and two more digits. This means that
the following strings all match:
12-05-92
26.11.87
07 04 92
However, the following string does not match:
21-05.91
This is because the pattern is looking for a - between
the 05 and the 91, not a period.
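The four strings above can be checked with a short loop. This is only a sketch; the array name @results is an assumption made for the example.

```perl
use strict;
use warnings;

my @results;
foreach my $candidate ("12-05-92", "26.11.87", "07 04 92", "21-05.91") {
    # \1 must repeat whichever separator ([\W]) matched the first time.
    my $verdict = $candidate =~ /\d{2}([\W])\d{2}\1\d{2}/
        ? "matches"
        : "does not match";
    push @results, "$candidate $verdict";
}
print join("\n", @results), "\n";
```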
 |
Beware that the pattern
/\d{2}([\W])\d{2}\1\d{2}/
is not the same as the pattern
/(\d{2})([\W])\1\2\1/
In the first pattern, any digit can appear anywhere. The second pattern matches any two digits as the first two characters, but then only matches the same two digits again. This means that
17-17-17
matches, but the following does not:
17-05-91
|
Note that pattern-sequence memory is preserved only for the length
of the pattern. This means that if you define the following pattern
(which, incidentally, matches a floating-point number, such as 25.11, that does
not contain an exponent):
/-?(\d+)\.?(\d+)/
you cannot then define another pattern, such as the following:
/\1/
and expect the Perl interpreter to remember that \1 refers
to the first \d+ (the digits before the decimal point).
To get around this problem, Perl defines special built-in variables
that remember the value of patterns matched in parentheses. These
special variables are named $n, where n is the
nth set of parentheses in the pattern.
For example, consider the following:
$string = "This string contains the number 25.11.";
$string =~ /-?(\d+)\.?(\d+)/;
$integerpart = $1;
$decimalpart = $2;
In this case, the pattern
/-?(\d+)\.?(\d+)/
matches 25.11, and the subpattern in the first set of
parentheses matches 25. This means that 25 is
stored in $1 and is later assigned to $integerpart.
Similarly, the second set of parentheses matches 11,
which is stored in $2 and later assigned to $decimalpart.
 |
The values stored in $1, $2, and so on, are overwritten when another pattern match succeeds. If you need these values, be sure to assign them to other scalar variables.
|
There is also one other built-in scalar variable, $&,
which contains the entire matched pattern, as follows:
$string = "This string contains the number 25.11.";
$string =~ /-?(\d+)\.?(\d+)/;
$number = $&;
Here, the pattern matched is 25.11, which is stored in
$& and then assigned to $number.
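In practice it is safest to copy these variables only after checking that the match succeeded, because a failed match leaves them holding values from an earlier match. The following sketch, which assumes the same sample string as above, combines the three variables in one test:

```perl
use strict;
use warnings;

my $string = "This string contains the number 25.11.";
my ($integerpart, $decimalpart, $number);

if ($string =~ /-?(\d+)\.?(\d+)/) {
    # Copy the values right away; a later successful match overwrites them.
    ($integerpart, $decimalpart, $number) = ($1, $2, $&);
}

print "integer part: $integerpart, decimal part: $decimalpart, ",
      "whole match: $number\n";
```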
Perl defines rules of precedence to determine the order in which
special characters in patterns are interpreted. For example, the
pattern
/x|y+/
matches either x or one or more occurrences of y,
because + has higher precedence than | and is
therefore interpreted first.
Table 7.3 lists the special characters that can appear in patterns
in order of precedence (highest to lowest). Special characters
with higher precedence are always interpreted before those of
lower precedence.
Table 7.3. The precedence of pattern-matching special
characters.
| Special character | Description |
| () | Pattern memory |
| + * ? {} | Number of occurrences |
| ^ $ \b \B | Pattern anchors |
| | | Alternatives |
Because the pattern-memory special characters () have
the highest precedence, you can use them to force other special
characters to be evaluated first. For example, the pattern
(ab|cd)+
matches one or more occurrences of either ab or cd.
This matches, for example, abcdab.
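To see the effect of precedence, compare what each pattern actually matches. The sketch below uses $& to capture the matched text; the variable names and sample strings are assumptions made for the example.

```perl
use strict;
use warnings;

my ($single_x, $run_of_y, $grouped);

# + has higher precedence than |, so only y is repeated.
$single_x = $& if "xxx" =~ /x|y+/;    # matches just the first x
$run_of_y = $& if "yyy" =~ /x|y+/;    # matches the whole run of y

# Parentheses make + apply to the entire alternation.
$grouped = $& if "abcdab" =~ /(ab|cd)+/;

print "$single_x $run_of_y $grouped\n";   # x yyy abcdab
```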
 |
Remember that when you use parentheses to force the order of precedence, you also are storing into pattern memory. For example, in the sequence
/(ab|cd)+(.)(ef|gh)+\1/
the \1 refers to what ab|cd matched, not to what the . special character matched.
|
Now that you know all of the special-pattern characters and their
precedence, look at a program that does more complex pattern matching.
Listing 7.9 uses the various special-pattern characters, including
the parentheses, to check whether a given input string is a valid
twentieth-century date.
Listing 7.9. A date-validation program.
1: #!/usr/local/bin/perl
2:
3: print ("Enter a date in the format YYYY-MM-DD:\n");
4: $date = <STDIN>;
5: chop ($date);
6:
7: # Because this pattern is complicated, we split it
8: # into parts, assign the parts to scalar variables,
9: # then substitute them in later.
10:
11: # handle 31-day months
12: $md1 = "(0[13578]|1[02])\\2(0[1-9]|[12]\\d|3[01])";
13: # handle 30-day months
14: $md2 = "(0[469]|11)\\2(0[1-9]|[12]\\d|30)";
15: # handle February, without worrying about whether it's
16: # supposed to be a leap year or not
17: $md3 = "02\\2(0[1-9]|[12]\\d)";
18:
19: # check for a twentieth-century date
20: $match = $date =~ /^(19)?\d\d(.)($md1|$md2|$md3)$/;
21: # check for a valid but non-20th century date
22: $olddate = $date =~ /^(\d{1,4})(.)($md1|$md2|$md3)$/;
23: if ($match) {
24: print ("$date is a valid date\n");
25: } elsif ($olddate) {
26: print ("$date is not in the 20th century\n");
27: } else {
28: print ("$date is not a valid date\n");
29: }
$ program7_9
Enter a date in the format YYYY-MM-DD:
1991-04-31
1991-04-31 is not a valid date
$

Don't worry: this program is a lot less complicated
than it looks! Basically, this program does the following:
- It checks whether the date is in the format YYYY-MM-DD.
(It allows YY-MM-DD, and also enables you to use a character
other than a hyphen to separate the year, month, and date.)
- It checks whether the year is in the twentieth century or
not.
- It checks whether the month is between 01 and 12.
- Finally, it checks whether the date field is a legal date
for that month. Legal date fields are between 01 and
either 29, 30, or 31, depending on
the number of days in that month.
If the date is legal, the program tells you so. If the date is
not a twentieth-century date but is legal, the program informs
you of this also.
Because the pattern to be matched is too long to fit on one line,
this program breaks it into pieces and assigns the pieces to scalar
variables. This is possible because scalar-variable substitution
is supported in patterns.
Line 12 is the pattern to match for months with 31 days. Note
that the escape sequences (such as \d) are preceded by
another backslash (producing \\d). This is because the
program actually wants to store a backslash in the scalar variable.
(Recall that backslashes in double-quoted strings are treated
as escape sequences.) The pattern
(0[13578]|1[02])\2(0[1-9]|[12]\d|3[01])
which is assigned to $md1, consists of the following
components:
- The sequence (0[13578]|1[02]), which matches the
month values 01, 03, 05, 07,
08, 10, and 12 (the 31-day months)
- \2, which matches the character that separates the
day, month, and year
- The sequence (0[1-9]|[12]\d|3[01]), which matches
any two-digit number between 01 and 31
Note that \2 matches the separator character because
the separator character will eventually be the second pattern
sequence stored in memory (when the pattern is finally assembled).
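The following sketch shows what the doubled backslashes produce and how \2 is resolved once the string has been substituted into a complete pattern. It reuses the February subpattern from line 17; the two sample dates are assumptions made for the example.

```perl
use strict;
use warnings;

# The doubled backslashes leave a single backslash in the stored string.
my $md3 = "02\\2(0[1-9]|[12]\\d)";
print "$md3\n";    # 02\2(0[1-9]|[12]\d)

# $md3 is substituted into the pattern before matching, so \2 refers
# to the separator stored by the second set of parentheses, (.).
my $valid   = "1991-02-28" =~ /^(19)?\d\d(.)($md3)$/ ? 1 : 0;
my $invalid = "1991-02/28" =~ /^(19)?\d\d(.)($md3)$/ ? 1 : 0;

print "$valid $invalid\n";    # 1 0
```

The second date fails because its two separators differ, so \2 cannot match the second one.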
Line 14 is similar to line 12 and handles 30-day months. The only
differences between this subpattern and the one in line 12 are
as follows:
- The month values accepted are 04, 06, 09,
and 11.
- The valid date fields are 01 through 30,
not 01 through 31.
Line 17 is another similar pattern that checks whether the month
is 02 (February) and the date field is between 01
and 29.
Line 20 does the actual pattern match that checks whether the
date is a valid twentieth-century date. This pattern is divided
into three parts.
- ^(19)?\d\d, which matches any two-digit number at
the beginning of a line, or any four-digit number starting with
19
- The separator character, which is the second item in parentheses
(the second item stored in memory) and thus can be retrieved using \2
- ($md1|$md2|$md3)$, which matches any of the valid
month-day combinations defined in lines 12, 14, and 17, provided
it appears at the end of the line
The result of the pattern match, either true or false, is stored
in the scalar variable $match.
Line 22 checks whether the date is a valid date in any century.
The only difference between this pattern and the one in line 20
is that the year can be any one-to-four-digit number. The result
of the pattern match is stored in $olddate.
Lines 23-29 check whether either $match or $olddate
is true and print the appropriate message.
As you can see, the pattern-matching facility in Perl is quite
powerful. This program is less than 30 lines long, including comments;
the equivalent program in almost any other programming language
would be substantially longer and much more difficult to write.
So far, all the patterns you have seen have been enclosed by /
characters.
/de*f/
These / characters are known as pattern delimiters.
Because / is the pattern-delimiter character, you must
use \/ to include a / character in a pattern.
This can become awkward if you are searching for a directory such
as, for example, /u/jqpublic/perl/prog1.
/\/u\/jqpublic\/perl\/prog1/
To make it easier to write patterns that include / characters,
Perl enables you to use any pattern-delimiter character you like.
The following pattern also matches the directory /u/jqpublic/perl/prog1:
m!/u/jqpublic/perl/prog1!
Here, the m indicates the pattern-matching operation.
If you are using a pattern delimiter other than /, you
must include the m.
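As a quick check, the sketch below matches the same directory name with three different delimiters. The m{...} form, which uses a pair of braces as delimiters, is not shown in the text above and is included here as one more commonly used choice.

```perl
use strict;
use warnings;

my $path = "/u/jqpublic/perl/prog1";    # the directory name used above

my $with_slashes = $path =~ /\/u\/jqpublic\/perl\/prog1/ ? 1 : 0;
my $with_bangs   = $path =~ m!/u/jqpublic/perl/prog1!    ? 1 : 0;
my $with_braces  = $path =~ m{/u/jqpublic/perl/prog1}    ? 1 : 0;

print "$with_slashes $with_bangs $with_braces\n";    # 1 1 1
```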
 |
There are two things you should watch out for when you use other pattern delimiters.
First, if you use the ' character as a pattern delimiter, the Perl interpreter does not substitute for scalar-variable names.
m'$var'
This matches the string $var, not the current value of the scalar variable $var.
Second, if you use a pattern delimiter that is normally a special-pattern character, you will not be able to use that special character in your pattern. For example, if you want to match the pattern ab?c (which matches a, optionally
followed by b, followed by c) you cannot use the ? character as a pattern delimiter. The pattern
m?ab?c?
produces a syntax error, because the Perl interpreter assumes that the ? after the b is a pattern delimiter. You can still use
m?ab\?c?
but this pattern won't match what you want. Because the ? inside the pattern is escaped, the Perl interpreter assumes that you want to match the actual ? character, and the pattern matches the sequence ab?c.
|
When you specify a pattern, you also can supply options that control
how the pattern is to be matched. Table 7.4 lists these pattern-matching
options.
Table 7.4. Pattern-matching options.
| Option | Description |
| g | Match all possible patterns |
| i | Ignore case |
| m | Treat string as multiple lines |
| o | Only evaluate once |
| s | Treat string as single line |
| x | Ignore white space in pattern |
All pattern options are included immediately after the pattern.
For example, the following pattern uses the i option
to ignore case:
/ab*c/i
You can specify as many of the options as you like, and the options
can be in any order.
The g option tells the Perl interpreter to match all
the possible patterns in a string. For example, if you search
the string balata using the pattern
/.a/g
which matches any character followed by a, the pattern
matches ba, la, and ta.
If a pattern with the g option specified appears as an
assignment to an array variable, the array variable is assigned
a list consisting of all the patterns matched. For example,
@matches = "balata" =~ /.a/g;
assigns the following list to @matches:
("ba", "la", "ta")
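One handy consequence is that you can count the matches by assigning the array to a scalar variable. A minimal sketch, in which the variable $count is an assumption made for the example:

```perl
use strict;
use warnings;

my @matches = "balata" =~ /.a/g;
my $count   = @matches;    # an array assigned to a scalar yields its length

print "found $count matches: @matches\n";    # found 3 matches: ba la ta
```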
Now, consider the following statement:
$match = "balata" =~ /.a/g;
Here the match is evaluated in a scalar context, so $match is
assigned a true or false value rather than the matched text. The first
time this statement is executed, the pattern matches ba, which is
available in $&. If the statement is executed again, the search
resumes where the previous match ended and matches la,
and so on until the pattern runs out of matches.
This means that you can use patterns with the g option
in loops. Listing 7.10 shows how this works.
Listing 7.10. A program that loops using a pattern.
1: #!/usr/local/bin/perl
2:
3: while ("balata" =~ /.a/g) {
4: $match = $&;
5: print ("$match\n");
6: }
$ program7_10
ba
la
ta
$

The first time through the loop, $match
has the value of the first pattern matched, which is ba.
(The system variable $& always contains the last
pattern matched; this pattern is assigned to $match in
line 4.) When the loop is executed for a second time, $match
has the value la. The third time through, $match
has the value ta. After this, the loop terminates; because
the pattern doesn't match anything else, the conditional expression
is now false.
Determining the Match Location
If you need to know how much of a string has been searched by
the pattern matcher when the g option is specified,
use the pos function.
$offset = pos($string);
This returns the position at which the next pattern match will
be started.
You can reposition the pattern matcher by putting pos()
on the left side of an assignment.
pos($string) = $newoffset;
This tells the Perl interpreter to start the next pattern match
at the position specified by $newoffset.
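The following sketch records the value of pos after each match and then repositions the pattern matcher by hand. The variable names and the offset 3 are assumptions chosen for the example.

```perl
use strict;
use warnings;

my $string = "balata";
my @offsets;

while ($string =~ /.a/g) {
    push @offsets, pos($string);    # offset just past the current match
}
print "@offsets\n";    # 2 4 6

# Restart the search at offset 3, in the middle of the string.
pos($string) = 3;
my $resumed = $string =~ /.a/g ? $& : "";
print "$resumed\n";    # ta
```

Starting from offset 3, the first place where a character is followed by a is the final ta, so la is skipped.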
 |
If you change the string being searched, the match position is reset to the beginning of the string.
|
| NOTE |
The pos function is not available in Perl version 4.
|
The i option enables you to specify that a matched letter
can either be uppercase or lowercase. For example, the following
pattern matches de, dE, De, or DE:
/de/i
Patterns that match either uppercase or lowercase letters are
said to be case-insensitive.
The m option tells the Perl interpreter that the string
to be matched contains multiple lines of text. When the m
option is specified, the ^ special character matches
either the start of the string or the start of any new line. For
example, the pattern
/^The/m
matches the word The in
This pattern matches\nThe first word on the second line
The m option also specifies that the $ special
character is to match the end of any line. This means that the
pattern
/line.$/m
is matched in the following string:
This is the end of the first line.\nHere's another line.
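To see the difference the m option makes, try the same pattern with and without it. This sketch uses the first sample string shown above; the variable names are assumptions.

```perl
use strict;
use warnings;

my $text = "This pattern matches\nThe first word on the second line";

# Without m, ^ matches only at the very start of the string.
my $without_m = $text =~ /^The/  ? "match" : "no match";
# With m, ^ also matches just after the embedded newline.
my $with_m    = $text =~ /^The/m ? "match" : "no match";

print "without m: $without_m, with m: $with_m\n";
```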
| NOTE |
The m option is defined only in Perl 5. To treat a string as multiple lines when you run Perl 4, set the $* system variable, described in Chapter 17, "System Variables."
|
The o option enables you to tell the Perl interpreter
that a pattern is to be evaluated only once. For example, consider
the following:
$var = 1;
$line = <STDIN>;
while ($var < 10) {
$result = $line =~ /$var/o;
$line = <STDIN>;
$var++;
}
The first time the Perl interpreter sees the pattern /$var/,
it replaces the name $var with the current value of $var,
which is 1; this means that the pattern to be matched
is /1/.
Because the o option is specified, the pattern to be
matched remains /1/ even when the value of $var
changes. If the o option had not been specified, the
pattern would have been /2/ the next time through the
loop.
| TIP |
There's no real reason to use the o option for patterns unless you are keen on efficiency. Here's an easier way to do the same thing:
$var = 1;
$matchval = $var;
$line = <STDIN>;
while ($var < 10) {
$result = $line =~ /$matchval/;
$line = <STDIN>;
$var++;
}
The value of $matchval never changes, so the o option is not necessary.
|
The s option specifies that the string to be matched
is to be treated as a single line of text. In this case, the .
special character matches every character in a string, including
the newline character. For example, the pattern /a.*bc/s
is matched successfully in the following string:
axxxxx \nxxxxbc
If the s option is not specified, this pattern does not
match, because the . character does not match the newline.
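A short sketch confirms this behavior, using the sample string shown above (the variable names are assumptions):

```perl
use strict;
use warnings;

my $text = "axxxxx \nxxxxbc";

# Without s, . stops at the newline, so .* cannot reach bc.
my $without_s = $text =~ /a.*bc/  ? "match" : "no match";
# With s, . also matches the newline.
my $with_s    = $text =~ /a.*bc/s ? "match" : "no match";

print "without s: $without_s, with s: $with_s\n";
```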
| NOTE |
The s option is defined only in Perl 5.
|
One problem with patterns in Perl is that they can become difficult
to follow. For example, consider this pattern, which you saw earlier:
/\d{2}([\W])\d{2}\1\d{2}/
Patterns such as this are difficult to follow, because there are
a lot of backslashes, braces, and brackets to sort out.
Perl 5 makes life a little easier by supplying the x
option. This tells the Perl interpreter to ignore white space
in a pattern unless it is preceded by a backslash. This means
that the preceding pattern can be rewritten as the following,
which is much easier to follow:
/\d{2} ([\W]) \d{2} \1 \d{2}/x
Here is an example of a pattern containing an actual blank space:
/[A-Z] [a-z]+ \ [A-Z] [a-z]+ /x
This matches a name in the standard first-name/last-name format
(such as John Smith). Normally, you won't want to use
the x option if you're actually trying to match white
space, because you wind up with the backslash problem all over
again.
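The x option also lets you spread a pattern over several lines, and Perl 5 treats a # inside such a pattern as the start of a comment that runs to the end of the line. The sketch below goes slightly beyond what the text describes; the qr operator and the variable names are assumptions made for the example.

```perl
use strict;
use warnings;

my $date_pattern = qr/
    \d{2}      # two digits
    ([\W])     # a separator, stored in memory
    \d{2}      # two more digits
    \1         # the same separator again
    \d{2}      # the final two digits
/x;

my $ok  = "12-05-92" =~ $date_pattern ? 1 : 0;
my $bad = "21-05.91" =~ $date_pattern ? 1 : 0;

print "$ok $bad\n";    # 1 0
```

Take care not to use the pattern delimiter (here, /) inside such a comment, because it would end the pattern early.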
| NOTE |
The x option is defined only in Perl 5.
|
Perl enables you to replace part of a string using the substitution
operator, which has the following syntax:
s/pattern/replacement/
The Perl interpreter searches for the pattern specified by the
placeholder pattern. If it finds pattern, it
replaces it with the string represented by the placeholder replacement.
For example:
$string = "abc123def";
$string =~ s/123/456/;
Here, 123 is replaced by 456, which means that
the value stored in $string is now abc456def.
You can use any of the pattern special characters in the substitution
operator. For example,
s/[abc]+/0/
searches for a sequence consisting of one or more occurrences
of the letters a, b, and c (in any
order) and replaces the sequence with 0.
If you just want to delete a sequence of characters rather than
replace it, leave out the replacement string as in the following
example, which deletes the first occurrence of the pattern abc:
s/abc//
You can use pattern-sequence variables to include a matched pattern
in the replacement string. The following is an example:
s/(\d+)/[$1]/
This matches a sequence of one or more digits. Because this sequence
is enclosed in parentheses, it is stored in the scalar variable
$1. In the replacement string, [$1], the scalar
variable name $1 is replaced by its value, which is the
matched pattern.
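The substitutions described so far can be tried out as follows. This is a sketch only; the variable names and the string xcabbay are assumptions made for the example.

```perl
use strict;
use warnings;

my $bracketed = "abc123def";
$bracketed =~ s/(\d+)/[$1]/;     # wrap the digits in brackets

my $deleted = "abc123def";
$deleted =~ s/abc//;             # empty replacement deletes the match

my $collapsed = "xcabbay";
$collapsed =~ s/[abc]+/0/;       # a run of a, b, and c becomes 0

print "$bracketed $deleted $collapsed\n";    # abc[123]def 123def x0y
```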
| NOTE |
Because the replacement string in the substitution operator is a string, not a pattern, the pattern special characters, such as [], *, and +, do not have a special meaning. For example, in the substitution
s/abc/[def]/
the replacement string is [def] (including the square brackets).
|
The substitution operator supports several options, which are
listed in Table 7.5.
Table 7.5. Options for the substitution operator.
| Option | Description |
| g | Change all occurrences of the pattern |
| i | Ignore case in pattern |
| e | Evaluate replacement string as expression |
| m | Treat string to be matched as multiple lines |
| o | Evaluate only once |
| s | Treat string to be matched as single line |
| x | Ignore white space in pattern |
As with pattern matching, options are appended to the end of the
operator. For example, to change all occurrences of abc
to def, use the following:
s/abc/def/g
Global Substitution
The g option changes all occurrences of a pattern in
a particular string. For example, the following substitution puts
parentheses around any number in the string:
s/(\d+)/($1)/g
Listing 7.11 is an example of a program that uses global substitution.
It examines each line of its input, removes all extraneous leading
spaces and tabs, and replaces multiple spaces and tabs between
words with a single space.
Listing 7.11. A simple white space cleanup program.
1: #!/usr/local/bin/perl
2:
3: @input = <STDIN>;
4: $count = 0;
5: while ($input[$count] ne "") {
6: $input[$count] =~ s/^[ \t]+//;
7: $input[$count] =~ s/[ \t]+\n$/\n/;
8: $input[$count] =~ s/[ \t]+/ /g;
9: $count++;
10: }
11: print ("Formatted text:\n");
12: print (@input);
$ program7_11
This is a line of input.
Here is another line.
This is my last line of input.
^D
Formatted text:
This is a line of input.
Here is another line.
This is my last line of input.
$

This program performs three substitutions on
each line of its input. The first substitution, in line 6, checks
whether there are any spaces or tabs at the beginning of the line.
If any exist, they are removed.
Similarly, line 7 checks whether there are any spaces or tabs
at the end of the line (before the trailing newline character).
If any exist, they are removed. To do this, line 7 replaces the
following pattern (one or more spaces and tabs, followed by a
newline character, followed by the end of the line) with a newline
character:
/[ \t]+\n$/
Line 8 uses a global substitution to remove extra spaces and tabs
between words. The following pattern matches one or more spaces
or tabs, in any order; these spaces and tabs are replaced by a
single space:
/[ \t]+/
Ignoring Case
The i option ignores case when substituting. For example,
the following substitution replaces all occurrences of the words
no, No, NO, and nO with NO.
(Recall that the \b escape character specifies a word
boundary.)
s/\bno\b/NO/gi
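The substitution operator returns the number of substitutions it made, which is a convenient way to check what the g and i options did. In this sketch the sample sentence and the variable names are assumptions.

```perl
use strict;
use warnings;

my $text  = "No, no, nO: not now";
my $count = ($text =~ s/\bno\b/NO/gi);

print "$count substitutions: $text\n";    # 3 substitutions: NO, NO, NO: not now
```

The words not and now are left alone because \b requires a word boundary immediately after the o.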
Replacement Using an Expression
The e option treats the replacement string as an expression,
which it evaluates before replacing. For example, consider the
following:
$string = "0abc1";
$string =~ s/[a-zA-Z]+/$& x 2/e;
The substitution shown here is a quick way to duplicate part of
a string. Here's how it works:
- The pattern /[a-zA-Z]+/ matches abc, which
is stored in the built-in variable $&.
- The e option indicates that the replacement string,
$& x 2, is to be treated as an expression. This expression
is evaluated, producing the result abcabc.
- abcabc is substituted for abc in the string
stored in $string. This means that the new value of $string
is 0abcabc1.
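The steps above can be run as follows. The second substitution, which doubles every number in a sentence, is an additional illustration; its sample string is an assumption.

```perl
use strict;
use warnings;

my $string = "0abc1";
$string =~ s/[a-zA-Z]+/$& x 2/e;     # repeat the matched letters twice
print "$string\n";                   # 0abcabc1

my $order = "3 apples and 12 pears";
$order =~ s/(\d+)/$1 * 2/eg;         # e evaluates $1 * 2 for each match
print "$order\n";                    # 6 apples and 24 pears
```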
Listing 7.12 is another example that uses the e option
in a substitution. This program takes every integer in a list
of input files and multiplies them by 2, leaving the rest of the
contents unchanged. (For the sake of simplicity, the program assumes
that there are no floating-point numbers in the file.)
Listing 7.12. A program that multiplies every integer in a
file by 2.
1: #!/usr/local/bin/perl
2:
3: $count = 0;
4: while ($ARGV[$count] ne "") {
5: open (FILE, "$ARGV[$count]");
6: @file = <FILE>;
7: $linenum = 0;
8: while ($file[$linenum] ne "") {
9: $file[$linenum] =~ s/\d+/$& * 2/eg;
10: $linenum++;
11: }
12: close (FILE);
13: open (FILE, ">$ARGV[$count]");
14: print FILE (@file);
15: close (FILE);
16: $count++;
17: }
If a file named foo contains the text
This contains the number 1.
This contains the number 26.
and the name foo is passed as a command-line
argument to this program, the file foo becomes
This contains the number 2.
This contains the number 52.

This program uses the built-in variable @ARGV to retrieve
filenames from the command line. Note that the program cannot
use <>, because the following statement reads the
entire contents of all the files into a single array:
@file = <>;
Lines 8-11 step through the lines read from the file one at a time. Line
9 performs the actual substitution as follows:
- The pattern \d+ matches a sequence of one or more
digits, which is automatically assigned to $&.
- The value of $& is substituted into the replacement
string.
- The e option indicates that this replacement string
is to be treated as an expression. This expression multiplies
the matched integer by 2.
- The result of the multiplication is then substituted into
the file in place of the original integer.
- The g option indicates that every integer on the
line is to be substituted for.
After all the lines in the file have been read, the file is closed
and reopened for writing. The call to print in line 14
takes the list stored in @file (the contents of the current
file) and writes it back out to the file, overwriting the original
contents.
As with the match operator, the o option to the substitution
operator tells the Perl interpreter to replace a scalar variable
name in the pattern with its value only once. For example, the following
statement substitutes the current value of $var for its name, producing
the pattern to search for:
$string =~ s/$var/abc/o;
This pattern then never changes, even if the value
of $var changes. (The o option does not affect the replacement
string, which is evaluated each time the substitution is performed.)
For example:
$var = 17;
while ($var > 0) {
$string = <STDIN>;
$string =~ s/$var/abc/o;
print ($string);
$var--; # the pattern is still /17/
}
Again, as with the match operator, there is no real reason to
use the o option.
As in the pattern-matching operator, the s and m
options specify that the string to be matched is to be treated
as a single line or as multiple lines, respectively.
The s option ensures that the newline character \n
is matched by the . special character.
$string = "This is a\ntwo-line string.";
$string =~ s/a.*o/a one/s;
# $string now contains "This is a one-line string."
If the m option is specified, ^ and $
match the beginning and end of any line.
$string = "The The first line\nThe The second line";
$string =~ s/^The //gm;
# $string now contains "The first line\nThe second line"
$string =~ s/e$/k/gm;
# $string now contains "The first link\nThe second link"
 |
The \A and \Z escape sequences (defined in Perl 5) always match only the beginning and end of the string, respectively. (This is the only case where \A and \Z behave differently from ^ and $.)
|
| NOTE |
The m and s options are defined only in Perl 5. To treat a string as multiple lines when you run Perl 4, set the $* system variable, described in Chapter 17.
|
The x option tells the Perl interpreter to ignore all
white space unless preceded by a backslash. As with the pattern-matching
operator, ignoring white space makes complicated string patterns
easier to read.
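$string =~ s/(\d{2}) ([\W]) (\d{2}) \2 (\d{2})/$1-$3-$4/x;
This converts a day-month-year string to the dd-mm-yy
format. Each group of digits is enclosed in parentheses so that the
replacement string can refer to it; the separator is the second item
stored in memory, so it is retrieved using \2.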
| NOTE |
Even if the x option is specified, spaces in the replacement string are not ignored. For example, the following replaces 14/04/95 with 14 - 04 - 95, not 14-04-95:
$string =~ s/(\d{2}) ([\W]) (\d{2}) \2 (\d{2})/$1 - $3 - $4/x;
Also note that the x option is defined only in Perl 5.
|
You can specify a different delimiter to separate the pattern
and replacement string in the substitution operator. For example,
the following substitution operator replaces /u/bin with
/usr/local/bin:
s#/u/bin#/usr/local/bin#
The search and replacement strings can be enclosed in parentheses
or angle brackets.
s(/u/bin)(/usr/local/bin)
s</u/bin>/\/usr\/local\/bin/
| NOTE |
As with the match operator, you cannot use a special character both as a delimiter and in a pattern.
s.a.c.def.
This substitution will be flagged as containing an error because the . character is being used as the delimiter. The substitution
s.a\.c.def.
does work, but it substitutes def for a.c, where . is an actual period and not the pattern special character.
|
Perl also provides another way to substitute one group of characters
for another: the tr translation operator. This operator
uses the following syntax:
tr/string1/string2/
Here, string1 contains a list of characters to be replaced,
and string2 contains the characters that replace them.
The first character in string1 is replaced by the first
character in string2, the second character in string1
is replaced by the second character in string2, and so
on.
Here is a simple example:
$string = "abcdefghicba";
$string =~ tr/abc/def/;
Here, the characters a, b, and c are
to be replaced as follows:
- All occurrences of the character a are to be replaced
by the character d.
- All occurrences of the character b are to be replaced
by the character e.
- All occurrences of the character c are to be replaced
by the character f.
After the translation, the scalar variable $string contains
the value defdefghifed.
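The translation operator also returns the number of characters it translated, which makes it easy to check the example above. In this sketch the variable $changed is an assumption made for the example.

```perl
use strict;
use warnings;

my $string  = "abcdefghicba";
my $changed = ($string =~ tr/abc/def/);    # number of characters translated

print "$string ($changed characters translated)\n";
# defdefghifed (6 characters translated)
```

Only the three characters at each end of the string are counted; the d, e, and f in the middle are not in the list of characters to be replaced.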
| NOTE |
If the string listing the characters to be replaced is longer than the string containing the replacement characters, the last character of the replacement string is repeated. For example:
$string = "abcdefgh";
$string =~ tr/efgh/abc/;
Here, there is no character corresponding to h in the replacement list, so c, the last character in the replacement list, replaces h. This translation sets the value of $string to abcdabcc.
Also note that if the same character appears more than once in the list of characters to be replaced, the first replacement is used:
$string =~ tr/AAA/XYZ/; # replaces A with X
|
The most common use of the translation operator is to convert
alphabetic characters from uppercase to lowercase or vice versa.
Listing 7.13 provides an example of a program that converts a
file to all lowercase characters.
Listing 7.13. An uppercase-to-lowercase conversion program.
1: #!/usr/local/bin/perl
2:
3: while ($line = <STDIN>) {
4: $line =~ tr/A-Z/a-z/;
5: print ($line);
6: }
$ program7_13
THIS LINE IS IN UPPER CASE.
this line is in upper case.
ThiS LiNE Is iN mIxED cASe.
this line is in mixed case.
^D
$

This program reads a line at a time from the
standard input file, terminating when it reaches the end of the file
(which you signal from the keyboard by pressing Ctrl+D).
Line 4 performs the translation operation. As in the other pattern-matching
operations, the range character (-) indicates a range
of characters to be included. Here, the range a-z refers
to all the lowercase characters, and the range A-Z refers
to all the uppercase characters.
| NOTE |
There are two things you should note about the translation operator:
The pattern special characters are not supported by the translation operator.
You can use y in place of tr if you want.
$string =~ y/a-z/A-Z/;
|
The translation operator supports three options, which are listed
in Table 7.6.
The c option (c is for "complement")
translates all characters that are not specified. For example,
the statement
$string =~ tr/0-9/ /c;
replaces everything that is not a digit with a space.
Table 7.6. Options for the translation operator.
| Option | Description |
| c | Translate all characters not specified |
| d | Delete all specified characters |
| s | Replace multiple identical output characters with a single character |
The d option deletes every specified character.
$string =~ tr/\t //d;
This deletes all the tabs and spaces from $string.
The s option (for "squeeze") checks the output
from the translation. If two or more consecutive characters translate
to the same output character, only one output character is actually
used. For example, the following replaces everything that is not
a digit and outputs only one space between digits:
$string =~ tr/0-9/ /cs;
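The three options can be compared side by side. The sample strings and variable names in this sketch are assumptions made for the example.

```perl
use strict;
use warnings;

my $complemented = "a1b22c";
$complemented =~ tr/0-9/ /c;     # every non-digit becomes a space

my $deleted = "a b\tc";
$deleted =~ tr/\t //d;           # tabs and spaces are removed

my $squeezed = "ab12cde3";
$squeezed =~ tr/0-9/ /cs;        # runs of non-digits become one space

print "[$complemented] [$deleted] [$squeezed]\n";
# [ 1 22 ] [abc] [ 12 3]
```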
Listing 7.14 is a simple example of a program that uses some of
these translation options. It reads a number from the standard
input file, and it gets rid of every input character that is not
actually a digit.
Listing 7.14. A program that ensures that a string consists
of nothing but digits.
1: #!/usr/local/bin/perl
2:
3: $string = <STDIN>;
4: $string =~ tr/0-9//cd;
5: print ("$string\n");
$ program7_14
The number 45 appears in this string.
45
$

Line 4 of this program performs the translation.
The d option indicates that the translated characters
are to be deleted, and the c option indicates that every
character not in the list is to be deleted. Therefore, this translation
deletes every character in the string that is not a digit. Note
that the trailing newline character is not a digit, so it is one
of the characters deleted.
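If you would rather keep the trailing newline, one possible variation is to add \n to the list of characters, so that it is no longer part of the complement. This is a sketch only; it uses a fixed string in place of the standard input file so that it can be run as is.

```perl
#!/usr/local/bin/perl

# Illustrative sketch: a fixed string stands in for a line read from STDIN.
$string = "The number 45 appears in this string.\n";

# Delete everything that is neither a digit nor a newline.
$string =~ tr/0-9\n//cd;

print ($string);    # prints 45 followed by the original newline
```

Escape sequences such as \n and \t are accepted by the translation operator, even though pattern special characters are not.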
Perl 5 provides some additional pattern-matching capabilities
not found in Perl 4 or in standard UNIX pattern-matching operations.
Extended pattern-matching capabilities employ the following syntax:
(?<c>pattern)
<c> is a single character representing the extended
pattern-matching capability being used, and pattern is
the pattern or subpattern to be affected.
The following extended pattern-matching capabilities are supported
by Perl 5:
- Parenthesizing subpatterns without saving them in memory
- Embedding options in patterns
- Positive and negative look-ahead conditions
- Comments
In Perl, when a subpattern is enclosed in parentheses, the subpattern
is also stored in memory. If you want to enclose a subpattern
in parentheses without storing it in memory, use the ?:
extended pattern-matching feature. For example, consider this
pattern:
/(?:a|b|c)(d|e)f\1/
This matches the following:
- One of a, b, or c
- One of d or e
- f
- Whichever of d or e was matched earlier
Here, \1 matches either d or e, because
the subpattern a|b|c was not stored in memory. Compare
this with the following:
/(a|b|c)(d|e)f\1/
Here, the subpattern a|b|c is stored in memory, and one
of a, b, or c is matched by \1.
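The difference between the two patterns can be seen by checking what ends up in $1. The following sketch uses two sample strings chosen for this illustration; $without_memory and $with_memory are hypothetical variable names.

```perl
#!/usr/local/bin/perl

# Illustrative sketch: the sample strings are assumptions for this example.

# With ?:, the first set of parentheses is not stored,
# so $1 and \1 both refer to (d|e).
$string = "adfd";
if ($string =~ /(?:a|b|c)(d|e)f\1/) {
    $without_memory = $1;
}

# Without ?:, the first set of parentheses is stored,
# so $1 and \1 refer to (a|b|c), and (d|e) becomes $2.
$string = "adfa";
if ($string =~ /(a|b|c)(d|e)f\1/) {
    $with_memory = "$1 $2";
}

print ("$without_memory\n");    # d
print ("$with_memory\n");       # a d
```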
Perl 5 provides a way of specifying a pattern-matching option
within the pattern itself. For example, the following patterns
are equivalent:
/[a-z]+/i
/(?i)[a-z]+/
In both cases, the pattern matches one or more alphabetic characters;
the i option indicates that case is to be ignored when
matching.
The syntax for embedded pattern options is
(?option)
where option is one of the options shown in Table 7.7.
Table 7.7. Options for embedded patterns.
| Option | Description |
| i | Ignore case in pattern |
| m | Treat the string being matched as multiple lines |
| s | Treat the string being matched as a single line |
| x | Ignore white space in pattern |
The g and o options are not supported as embedded
pattern options.
Embedded pattern options give you more flexibility when you are
matching patterns. For example:
$pattern1 = "[a-z0-9]+";
$pattern2 = "(?i)[a-z]+";
if ($string =~ /$pattern1|$pattern2/) {
...
}
Here, the i option is specified for some, but not all,
of a pattern. (This pattern matches either any collection of lowercase
letters mixed with digits, or any collection of letters.)
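As a rough check of how embedded options behave, the sketch below tests an uppercase word against a trailing i option, an embedded (?i), no option at all, and a pattern held in a scalar variable. The variable names and the sample word are assumptions for this example.

```perl
#!/usr/local/bin/perl

# Illustrative sketch: $word and the result variables are example names.
$word = "HELLO";

# Trailing option and embedded option behave the same way here.
$trailing = ($word =~ /^[a-z]+$/i)    ? "yes" : "no";
$embedded = ($word =~ /^(?i)[a-z]+$/) ? "yes" : "no";

# With no option, the uppercase word does not match.
$plain = ($word =~ /^[a-z]+$/) ? "yes" : "no";

# An embedded option travels with a pattern stored in a variable.
$pattern = "(?i)[a-z]+";
$from_variable = ($word =~ /^$pattern$/) ? "yes" : "no";

print ("trailing: $trailing\n");            # trailing: yes
print ("embedded: $embedded\n");            # embedded: yes
print ("plain: $plain\n");                  # plain: no
print ("from variable: $from_variable\n");  # from variable: yes
```

The last case is where the embedded form is most useful: the option is part of the pattern text, so it does not need to be repeated wherever the variable is used.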
Perl 5 enables you to use the ?= feature to define a
boundary condition that must be matched in order for the pattern
to match. For example, the following pattern matches abc
only if it is followed by def:
/abc(?=def)/
This is known as a positive look-ahead condition.
| NOTE |
The positive look-ahead condition is not part of the pattern matched. For example, consider these statements:
$string = "25abc8";
$string =~ /abc(?=[0-9])/;
$matched = $&;
Here, as always, $& contains the matched pattern, which in this case is abc, not abc8.
|
Similarly, the ?! feature defines a negative look-ahead
condition, which is a boundary condition that must not be
present if the pattern is to match. For example, the pattern /abc(?!def)/
matches any occurrence of abc unless it is followed by
def.
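The two look-ahead conditions can be compared on one string. This sketch assumes a sample string containing abc twice, once followed by def and once not; the variable names are chosen for the illustration.

```perl
#!/usr/local/bin/perl

# Illustrative sketch: the sample string is an assumption for this example.
$string = "abcdef abcxyz";

# Positive look-ahead: def must follow, but is not part of the match.
$string =~ /abc(?=def)/;
$matched = $&;

# Replace only the abc that is followed by def.
($positive = $string) =~ s/abc(?=def)/ABC/g;

# Replace only the abc that is not followed by def.
($negative = $string) =~ s/abc(?!def)/ABC/g;

print ("$matched\n");     # abc
print ("$positive\n");    # ABCdef abcxyz
print ("$negative\n");    # abcdef ABCxyz
```

In both substitutions the text examined by the look-ahead is left in place, because it was never part of the matched pattern.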
Perl 5 enables you to add comments to a pattern using the ?#
feature. For example:
if ($string =~ /(?i)[a-z]{2,3}(?# match two or three alphabetic characters)/) {
...
}
Adding comments makes it easier to follow complicated patterns.
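Comments can also be attached to individual parts of a pattern. The following is a small sketch, assuming a made-up string; note that the text of a comment cannot itself contain a closing parenthesis, because that character ends the comment.

```perl
#!/usr/local/bin/perl

# Illustrative sketch: the sample string is an assumption for this example.
$string = "Version 12a";

if ($string =~ /[0-9]+(?# one or more digits)[a-z](?# then one lowercase letter)/) {
    $found = $&;
}

print ("$found\n");    # 12a
```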
Perl enables you to search for sequences of characters using patterns.
If a pattern is found in a string, the pattern is said to be matched.
Patterns often are used in conjunction with the pattern-match
operators, =~ and !~. The =~ operator
returns true if the pattern matches, and the !~ operator
returns true if the pattern does not match.
Special-pattern characters enable you to search for a string that
meets one of a variety of conditions.
- The + character matches one or more occurrences of
a character.
- The * character matches zero or more occurrences
of a character.
- The [] characters enclose a set of characters, any
one of which matches.
- The ? character matches zero or one occurrences of
a character.
- The ^ and $ characters match the beginning
and end of a line, respectively. The \b and \B
characters match a word boundary or somewhere other than a word
boundary, respectively.
- The {} characters specify the number of occurrences
of a character.
- The | character specifies alternatives, either of
which match.
To give a special character its natural meaning in a pattern,
precede it with a backslash \.
Enclosing a part of a pattern in parentheses stores the matched
subpattern in memory; this stored subpattern can be recalled within
the pattern using the character sequence \n (where n is the number
of the subpattern), and used after the match through the built-in
scalar variable $n. The built-in scalar variable $& stores the
entire matched pattern.
You can substitute for scalar-variable names in patterns, specify
different pattern delimiters, or supply options that match every
possible pattern, ignore case, or perform scalar-variable substitution
only once.
The substitution operator, s, enables you to replace
a matched pattern with a specified string. Options to the substitution
operator enable you to replace every matched pattern, ignore case,
treat the replacing string as an expression, or perform scalar-variable
substitution only once.
The translation operator, tr, enables you to translate
one set of characters into another set. Options exist that enable
you to perform translation on everything not in the list, to delete
characters in the list, or to replace runs of identical output
characters with a single character.
Perl 5 provides extended pattern-matching capabilities not provided
in Perl 4. To use one of these extended pattern features on a
subpattern, put (? at the beginning of the subpattern
and ) at the end of the subpattern.
| Q: | How many subpatterns can be stored in memory using \1, \2, and so on?
|
| A: | Basically, as many as you like. After you store more than nine patterns, you can retrieve the later patterns using two-digit numbers preceded by a backslash, such as \10.
|
| Q: | Why does pattern-memory variable numbering start with 1, whereas subscript numbering starts with 0?
|
| A: | Subscript numbering starts with 0 to remain compatible with the C programming language. There is no such thing as pattern memory in C, so there is no need to be compatible with it.
|
| Q: | What happens when the replacement string in the translate command is left out, as in tr/abc//?
|
| A: | If the replacement string is omitted, a copy of the first string is used. This means that
tr/abc//
does not change the string, because it is the same as
tr/abc/abc/
If the replacement string is omitted in the substitute command, as in
s/abc//
the pattern matched (in this case, abc) is deleted.
|
| Q: | Why does Perl use characters such as +, *, and ? as pattern special characters?
|
| A: | These special characters usually correspond to special characters used in other UNIX applications, such as vi and csh. Some of the special characters, such as +, are used in formal
syntax description languages.
|
| Q: | Why does Perl use both \1 and $1 to store pattern memory?
|
| A: | To enable you to distinguish between a subpattern matched in the current pattern (which is stored in \1) and a subpattern matched in the previous statement (which is stored in $1).
|
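The answer above about an omitted replacement string can be checked with a short sketch. Because tr/abc// leaves the string alone but still returns the number of characters it found, it is often used for counting. The sample string and variable names are assumptions for this example.

```perl
#!/usr/local/bin/perl

# Illustrative sketch: the sample string is an assumption for this example.
$string = "abcabc xyz";

# Empty replacement list in tr: the string is unchanged,
# and the return value counts the a, b, and c characters.
($translated = $string);
$count = ($translated =~ tr/abc//);

# Empty replacement string in s: the first abc is deleted.
($substituted = $string) =~ s/abc//;

print ("$translated ($count characters counted)\n");  # abcabc xyz (6 characters counted)
print ("$substituted\n");                             # abc xyz
```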
The Workshop provides quiz questions to help you solidify your
understanding of the material covered and exercises to give you
experience in using what you've learned. Try to understand the
quiz and exercise answers before you go on to tomorrow's lesson.
- What do the following patterns match?
a. /a|bc*/
b. /[\d]{1,3}/
c. /\bc[aou]t\b/
d. /(xy+z)\.\1/
e. /^$/
- Write patterns that match the following:
a. Five or more lowercase letters (a-z).
b. Either the number 1 or the string one.
c. A string of digits optionally containing a decimal
point.
d. Any letter, followed by any vowel, followed
by the same letter again.
e. One or more + characters.
- Suppose the variable $var has the value abc123.
Indicate whether the following conditional expressions return
true or false.
a. $var =~ /./
b. $var =~ /[A-Z]*/
c. $var =~ /\w{4-6}/
d. $var =~ /(\d)2(\1)/
e. $var =~ /abc$/
f. $var =~ /1234?/
- Suppose the variable $var has the value abc123abc.
What is the value of $var after the following substitutions?
a. $var =~ s/abc/def/;
b. $var =~ s/[a-z]+/X/g;
c. $var =~ s/B/W/i;
d. $var =~ s/(.)\d.*\1/d/;
e. $var =~ s/(\d+)/$1*2/e;
- Suppose the variable $var has the value abc123abc.
What is the value of $var after the following translations?
a. $var =~ tr/a-z/A-Z/;
b. $var =~ tr/123/456/;
c. $var =~ tr/231/564/;
d. $var =~ tr/123/ /s;
e. $var =~ tr/123//cd;
- Write a program that reads all the input from the standard
input file, converts all the vowels (except y) to uppercase,
and prints the result on the standard output file.
- Write a program that counts the number of times each digit
appears in the standard input file. Print the total for each digit
and the sum of all the totals.
- Write a program that reverses the order of the first three
words of each input line (from the standard input file) using
the substitution operator. Leave the spacing unchanged, and print
each resulting line.
- Write a program that adds 1 to every number in the standard
input file. Print the results.
- BUG BUSTER: What is wrong with the following program?
#!/usr/local/bin/perl
while ($line = <STDIN>) {
# put quotes around each line of input
$line =~ /^.*$/"\1"/;
print ($line);
}
- BUG BUSTER: What is wrong with the following program?
#!/usr/local/bin/perl
while ($line = <STDIN>) {
if ($line =~ /[\d]*/) {
print ("This line contains the digits '$&'\n");
}
}
