
CHAPTER 10 - REGULAR EXPRESSIONS


What are regular expressions?

A regular expression (RE) is a string of characters that describes a set of character strings to be matched. For example, a global search for every occurrence of the word "and", whatever its capitalisation, has to find "and", "And", "AnD", "AND", etc. Without regular expressions this would take eight separate searches, one for each combination of upper and lower case letters. With an RE the search can be done with one command.

Regular expressions are used by many Unix utilities, including:

ed

ex

vi

grep

sed

awk
(The awk utility interprets a special-purpose programming language that makes it possible to handle simple data-reformatting jobs easily with just a few lines of code. Awk is not covered in this course, but the GAWK Manual is a good guide to its use.)

Regular expressions are used in searches and substitutions.

Character strings

The simplest regular expression is a character string, which matches itself. For example:

/hello/			- matches 'hello'
s/hello/goodbye/	- matches 'hello' and makes a substitution

Matching single characters

The '.' character matches any single character. For example:

/p.t/	- matches 'p' and 't' separated by a single character, e.g. 'pit', 'put', 'pot', etc.
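
The pattern can be tried from the shell with grep, which prints the lines that contain a match. This is only a small sketch; the word list is invented for the illustration.

```shell
# Sample words (made up for this illustration), one per line.
words=$(printf '%s\n' pit put pot pt part)

# '.' stands for exactly one character, so 'pt' (no character between
# 'p' and 't') and 'part' (two characters between them) are not matched.
matched=$(printf '%s\n' "$words" | grep 'p.t')
printf '%s\n' "$matched"
```

Only pit, put and pot are printed.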

Sets of characters

The expression /[RE]/ matches any one of a set of characters in a single character position. For example:

/x[ab2X]y/	- matches any of the following:
xay
xby
x2y
xXy

In the expression /[RE]/ a range of characters can be specified. For example:

[a-z]	- matches any single lower case letter
[0-9]	- matches any single digit

Note however:

[0-57]	- matches any one of the following:
0 1 2 3 4 5 7

i.e. the range 0-5 and the single digit 7, not the numbers 0 to 57. Sets of characters can be combined:

[a-d5-8X-Z]	- matches any one of the following:
a b c d 5 6 7 8 X Y Z

It is possible to specify a set of characters which are not to be matched in the RE. For example:

[^0-9]	- matches any single character which is not a digit
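
As a sketch of how sets answer the "and" problem from the start of the chapter, one bracketed set per character position covers all eight spellings in a single search. The sample lines are invented, and grep -c is used here simply to count the matching lines.

```shell
# Invented sample text: four lines contain some spelling of "and".
text=$(printf '%s\n' 'salt and pepper' 'And then' 'AND gate' 'no match here' 'aNd so on')

# One set per character position: [Aa] then [Nn] then [Dd].
count=$(printf '%s\n' "$text" | grep -c '[Aa][Nn][Dd]')
echo "$count"

# A negated set: print only lines containing a character that is not a digit.
nondigit=$(printf '%s\n' '12345' '12a45' | grep '[^0-9]')
echo "$nondigit"
```

This prints 4 and then 12a45. Note that the pattern also matches "and" inside longer words such as "band", because nothing in it says where the word starts or ends.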

Anchors

An anchor ties an RE to a particular position in the line. For example:

/^RE/	- matches RE at the start of a line
/RE$/	- matches RE at the end of a line
/^RE$/	- matches RE as the whole line

Note that there are two separate uses of the '^' operator. One is as the start of line anchor, and the other is as the 'logical not' operator. The latter function only applies inside square brackets, and only when '^' is the first character after the '['.
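
The three anchored forms can be compared on a few sample lines. This is a minimal sketch with made-up input, again using grep -c to count the matching lines.

```shell
# Invented sample lines containing 'cat' in different positions.
lines=$(printf '%s\n' 'cat' 'cat food' 'tomcat' 'a cat sat')

start=$(printf '%s\n' "$lines" | grep -c '^cat')   # matches 'cat' and 'cat food'
end=$(printf '%s\n' "$lines" | grep -c 'cat$')     # matches 'cat' and 'tomcat'
whole=$(printf '%s\n' "$lines" | grep -c '^cat$')  # matches 'cat' only
echo "$start $end $whole"
```

The output is "2 2 1". The line 'a cat sat' is never counted, since its 'cat' is neither at the start nor at the end of the line.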

Repetitions

Multiple occurrences of REs can be specified. For example:

a*	- matches 0 or more occurrences of 'a'
aa*	- matches 1 or more occurrences of 'a'
.*	- matches any string of characters
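
The difference between "0 or more" and "1 or more" is easy to miss, so here is a small sketch with invented input. A trailing 'b' is added to both patterns so that the effect of the repetition is visible.

```shell
# 'a*b' allows zero 'a's, so the line 'b' matches as well as 'ab' and 'aab'.
zero_or_more=$(printf '%s\n' 'b' 'ab' 'aab' | grep -c 'a*b')

# 'aa*b' needs at least one 'a' before the 'b', so the line 'b' does not match.
one_or_more=$(printf '%s\n' 'b' 'ab' 'aab' | grep -c 'aa*b')

echo "$zero_or_more $one_or_more"
```

The output is "3 2".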

Remembered regular expressions

A null RE stands for the last RE. For example:

:/[Tt]he.*car/p
The blue car exploded with a roar.
:s//(The blue car)/p
(The blue car) exploded with a roar.

The '&' character in a replacement string stands for the most recently matched string. For example:

 :/[Tt]he.*car/p
 The blue car exploded with a roar.
 :s//(&)/p
 (The blue car) exploded with a roar.
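
The same two ideas carry over to sed, which is introduced later in this chapter. The following is a sketch rather than a transcript of an editor session: the first command uses '&' directly, and the second selects the line with an address and then reuses that RE by leaving the search part of the substitution empty.

```shell
line='The blue car exploded with a roar.'

# '&' in the replacement stands for whatever the RE matched.
result=$(printf '%s\n' "$line" | sed 's/[Tt]he.*car/(&)/')
echo "$result"

# The empty RE in s//.../ stands for the last RE used, here the address.
result2=$(printf '%s\n' "$line" | sed -n '/[Tt]he.*car/s//(&)/p')
echo "$result2"
```

Both commands print "(The blue car) exploded with a roar."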

Sub-expressions

Part of an RE can be marked as a sub-expression and referred to later, either in the RE itself or in the replacement string.

\(string\)	- defines an RE sub-expression
\n	- refers to the nth RE sub-expression

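NOTE The backslash is the escape character for REs. Placed before a special character it neutralises that character's special meaning. In \( \) and \n it does the opposite, and gives an ordinary character a special meaning. For example: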

:p
A line of text
:s/\(line\).*\(text\)/\2 \1/p
A text line
:*
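
Sub-expressions work the same way in sed. The sketch below repeats the swap non-interactively and adds a second, invented example that reverses a "surname, forename" pair. Note the space between \2 and \1 in the replacement: without it the two words would be joined together.

```shell
# Swap 'line' and 'text'; everything matched between them is dropped.
swapped=$(printf '%s\n' 'A line of text' | sed 's/\(line\).*\(text\)/\2 \1/')
echo "$swapped"

# Invented example: \1 is the text before ', ' and \2 is the text after it.
name=$(printf '%s\n' 'Smith, John' | sed 's/\(.*\), \(.*\)/\2 \1/')
echo "$name"
```

This prints "A text line" and "John Smith".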

Repetition

An exact number of occurrences, or a range, can be specified with \{ \}. For example:

c\{4\}		matches exactly 4 c's
c\{4,\}		matches 4 or more c's
c\{2,4\}		matches between 2 and 4 c's

For example, to find a line containing 5 consecutive digits:

/[0-9]\{5\}/
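
A point worth checking is that this pattern is not anchored, so it also matches lines in which the run of digits is longer than five. The sketch below uses invented input and shows how anchors restrict the match to a line made up of exactly five digits.

```shell
# Invented sample lines.
codes=$(printf '%s\n' '1234' '12345' 'ab123456' '12 345')

# Five digits in a row somewhere in the line: '12345' and 'ab123456'.
five=$(printf '%s\n' "$codes" | grep -c '[0-9]\{5\}')

# The whole line is exactly five digits: '12345' only.
exact=$(printf '%s\n' "$codes" | grep -c '^[0-9]\{5\}$')

echo "$five $exact"
```

The output is "2 1".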

A summary of special characters

Special characters in the search string

^ start of line anchor (or NOT operator inside [ ])

$ end of line anchor

. any character

* preceding character repeated zero or more times

\ escape character

[ ] encloses a set or range of characters

Special characters in the replacement string

& string matched in search string

\ escape character

Note that any regular expression can be used with grep. (It gets its name from the editor command g/RE/p, which means 'globally search for RE and print the lines that contain it'.) This opens up many new possibilities for the use of grep. Unix commands that use regular expressions often make the use of an editor unnecessary.


PRACTICE

Obtain a listing of the members of your group from the password file using grep.


Introduction to sed

sed is a non-interactive stream editor: it reads text line by line, applies editing commands to each line, and writes the result to standard output. The command to invoke sed is:

sed [-n] [-e command] [-f edfile] [input_file]

For example:

sed "s/UNIX/Unix/g" thesis > thesis.new

This will process the file thesis line by line, outputting each line to the file thesis.new and replacing each occurrence of the string "UNIX" with "Unix".

In the above example every line of thesis is output to thesis.new, whether or not it has been changed. This is because by default sed outputs every line of its input. The -n option suppresses this default output, so that only explicitly specified lines are output. For example, the following command outputs no lines at all:

sed -n "s/UNIX/Unix/g" thesis > thesis.new

since a change but no output has been specified. If a print command is added, as follows:

sed  -n  "s/UNIX/Unix/gp"  thesis  >  thesis.new

then only those lines in which "UNIX" had been changed to "Unix" would be output.
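
The three variants can be compared side by side without creating any files. This is a sketch with a few invented lines standing in for the file thesis.

```shell
# Invented stand-in for the file 'thesis'.
input=$(printf '%s\n' 'UNIX is old' 'nothing here' 'UNIX and UNIX')

all=$(printf '%s\n' "$input" | sed 's/UNIX/Unix/g')          # every line is output
none=$(printf '%s\n' "$input" | sed -n 's/UNIX/Unix/g')      # nothing is output
changed=$(printf '%s\n' "$input" | sed -n 's/UNIX/Unix/gp')  # only changed lines

printf '%s\n' "$changed"
```

Here $all holds all three lines, $none is empty, and $changed holds "Unix is old" and "Unix and Unix" but not "nothing here".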

As the examples above show, the -e option is not necessary when there is only one editor command. It is possible to specify more than one command, and in this case each must be preceded by -e. For example:

%  sed  -e  "s/a/A/"  -e  "s/b/B/"  file1  >  file2

This command carries out both substitutions on each line of file1. Since neither command has the g flag, only the first 'a' and the first 'b' on each line are changed.

The -f option enables the user to use a file containing editor commands, instead of typing out a series of commands with the -e option.
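
As a sketch of the two approaches, the commands below apply the same pair of substitutions first with -e and then from a command file. The input line is invented, and the command file is a temporary file created with mktemp, which is assumed to be available on the system.

```shell
# Two commands given on the command line with -e.
with_e=$(printf '%s\n' 'abc cab' | sed -e 's/a/A/' -e 's/b/B/')

# The same two commands stored one per line in a file and read with -f.
script=$(mktemp)
printf '%s\n' 's/a/A/' 's/b/B/' > "$script"
with_f=$(printf '%s\n' 'abc cab' | sed -f "$script")
rm -f "$script"

echo "$with_e"
echo "$with_f"
```

Both forms print "ABc cab". The second 'a' and 'b' on the line are left alone because neither substitution uses the g flag.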

sed examples

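In a long listing the line for an ordinary file begins with '-' (a directory's begins with 'd'), so a sed command to list only files (excluding directories) is:

%  ls  -l  |  sed  -n  "/^-/p"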
-rw------- 1 lnp5jb      1765 mbox
-rw------- 1 lnp5jb       320 example1

The sed command to extract a list of usernames from the password file is:

%  sed  "s/:.*//"  /etc/passwd  |  more

This deletes everything from the first ':' to the end of each line of the password file, leaving only the first field, which is the username.
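
To see why the first ':' is the one that counts, the command can be tried on a couple of lines in password-file format. The entries below are made up for the illustration and are not taken from any real system.

```shell
# Two invented lines in the format of /etc/passwd.
entries=$(printf '%s\n' 'root:x:0:0:root:/root:/bin/sh' 'lnp5jb:x:1001:100:J B:/home/lnp5jb:/bin/csh')

# ':' matches at the earliest possible position, and '.*' then takes
# everything up to the end of the line.
names=$(printf '%s\n' "$entries" | sed 's/:.*//')
printf '%s\n' "$names"
```

This prints root and lnp5jb, one per line.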


Exercises

1. Reproduce the effects of the above sed examples using grep instead. Note that grep is generally better for searches, such as this, while sed can be used to make changes to files.

2. Find the system's games directory and type quiz function ed-command to do the ed commands quiz. Don't worry if there are a couple of things that you haven't come across. Try it again and see if you improve your score.

