Regular Expressions — a 2007 intro

When @odd mentioned he should learn about Regular Expressions, I recalled that I once wrote an introductory tutorial for Mac users. Here it is in all its 2007 splendour, including poor-quality screenshots!

Imagine you have a list of reversed names like this:

  • Janeway, Kathryn
  • Summers, Buffy
  • Carter, Samantha

Here we have a last name, followed by a first name, separated by a comma and space. Now suppose you'd like those names to be in first-last order, like this:

  • Kathryn Janeway
  • Buffy Summers
  • Samantha Carter

Pattern search

You can either do a lot of copying and pasting, or you can use a clever find and replace routine, called grep.

Grep is able to look for patterns: check each line for a group of letters followed by a comma and space, and then another group of letters. Replace that whole thing with the second group of letters, a space, then the first group.

Suitable software

For the following instructions I use the free Tex-Edit Plus text editor. Other software supports grep too, but the exact steps and symbols may differ from the instructions below.

The regular expression (grep)

Using grep in Tex-Edit Plus.

Paste the list of reversed names into an empty Tex-Edit Plus document, then use Command F to call up the Find and Replace dialog box.

In the Find text box put exactly this (I explain it below). Note the comma and space in the middle:

([a-zA-Z]+), ([a-zA-Z]+)

In the Replace text box put this (note the space in the middle):

^2 ^1

Check the box labelled Regular expression (grep), then click the Replace All button. The names should now be in first-last order.

Screenshot 1: Using grep in Tex-Edit Plus, setting the replacement to be in a different colour, so it's easier to see what was changed.

Tip: Tex-Edit Plus is able to use different colours when it replaces text. Set the colour to something like red, so you can quickly glance at your document to see what was changed and spot any potential problems.

An explanation of terms

The round brackets () create groupings — in this case groups of letters. In the replacement, ^2 refers to the second grouping, and ^1 refers to the first grouping.

The square brackets [] contain the possibilities of what we're looking for. We could write out all the letters of the alphabet inside the square brackets, but I've used a shorthand above: a-zA-Z. That means: find any lower case letter or any upper case letter between a and z.

On its own, that sequence would find only a single letter; the + tells the program to find one or more letters in a row.

So, in English, we could say: work line by line to look for a group consisting of one or more letters, followed by a comma and a space, followed by another group of letters. Replace that pattern with the second group, a space, and the first group.
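If you'd like to try the same swap outside Tex-Edit Plus, here is a minimal sketch using Python's re module. Python is my stand-in here, not part of the original instructions, and the variable names are made up for illustration. The Find pattern is unchanged, but Python writes the group references as \2 and \1 where Tex-Edit Plus uses ^2 and ^1.

```python
import re

# The list of reversed names from the start of the article.
reversed_names = """Janeway, Kathryn
Summers, Buffy
Carter, Samantha"""

# Same pattern as the Find box: a group of letters, a comma and space,
# then another group of letters.
pattern = re.compile(r"([a-zA-Z]+), ([a-zA-Z]+)")

# Python uses \2 and \1 for the groups (Tex-Edit Plus uses ^2 and ^1).
fixed_names = pattern.sub(r"\2 \1", reversed_names)

print(fixed_names)
```

Like Replace All, sub() swaps every match it finds, so all three lines change in one pass.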

Strip out numbers

Perhaps you have a list of DVDs you bought, with the price beside each. You'd like to send the list to Aunty Flo, but want to remove the prices. More tedious deleting by hand? Not if you use grep.

Here's the start of the list you have in Tex-Edit Plus:

  • Buffy The Vampire Slayer S5 $59.99
  • Chicken Run $25
  • Star Trek Voyager: S3 $69.69
  • Noddy goes Wild $4

Hmmm, the pattern seems to be: a bunch of letters, numbers, spaces and things, followed by a space, a dollar sign, some numbers (with or without a dot) and a return. We must be able to do something with that.

We can look for numbers with the pattern: [0-9], and we can indicate how many items by putting a minimum and maximum in braces like this: {1,2}. Tex-Edit Plus indicates a return character like this: ^c.

We have a small problem: the dot is used in grep as a wildcard to mean: "any character". Here we need it for the decimal point. To show that we mean 'dot' and not 'wildcard' we need to add a backslash in front of it. Similarly, the $ needs a backslash in front, as it is a 'reserved' character in grep: on its own it normally stands for the end of a line.

What's more, the Noddy DVD doesn't include a dot in the price, so we need to use a ? to show that the dot is optional: it may not appear at all.
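To see what the backslash and the ? each do, here is a small sketch in Python's re module (again an assumed stand-in for Tex-Edit Plus, with sample strings and names invented for the demonstration). It compares an unescaped dot and dollar sign with their backslashed versions, then checks that an optional dot copes with all four prices.

```python
import re

text = "59.99 and 9x9"

# An unescaped dot is a wildcard: it matches any single character.
wildcard_hits = re.findall(r"9.9", text)

# With a backslash in front, the dot matches only a real dot.
literal_hits = re.findall(r"9\.9", text)

# \$ matches a real dollar sign; a bare $ means "end of line" instead,
# so nothing can follow it and the second search finds nothing.
escaped_dollar = re.search(r"\$4", "Noddy goes Wild $4")
bare_dollar = re.search(r"$4", "Noddy goes Wild $4")

# \.? makes the dot optional, so prices with and without cents both match.
number = re.compile(r"[0-9]{1,2}\.?[0-9]{0,2}")
all_match = all(number.fullmatch(p) for p in ["59.99", "25", "69.69", "4"])

print(wildcard_hits, literal_hits)
print(escaped_dollar is not None, bare_dollar is None, all_match)
```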

Let's try this pattern:

Find: \$[0-9]{1,2}\.?[0-9]{0,2}^c

Replace: ^c

Remove prices from an asset list.

The replacement this time is just a return. The pattern finds each price together with the return that follows it, so we need to put a return back, or all the lines will run together.
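The price-stripping step can be sketched the same way in Python's re module. This is an assumption-laden translation rather than what Tex-Edit Plus does internally: I've written the return as \n where Tex-Edit Plus uses ^c, and assumed every line, including the last, ends with a return.

```python
import re

# The DVD list, with a return at the end of every line.
dvd_list = (
    "Buffy The Vampire Slayer S5 $59.99\n"
    "Chicken Run $25\n"
    "Star Trek Voyager: S3 $69.69\n"
    "Noddy goes Wild $4\n"
)

# A dollar sign, 1-2 digits, an optional dot, 0-2 more digits, then a return.
price_then_return = re.compile(r"\$[0-9]{1,2}\.?[0-9]{0,2}\n")

# Replace each price-plus-return with just a return so the lines stay separate.
cleaned = price_then_return.sub("\n", dvd_list)

print(cleaned)
```

The pattern starts at the dollar sign, so the space in front of each price is left at the end of its line. The S5 and S3 in the titles are untouched because no dollar sign comes before them.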

Screenshot 2: Using grep in Tex-Edit Plus to remove prices from an asset list.

Grep looks complicated, and it is, a bit. But it is the start of something immensely powerful.

Tip: When you download and install Tex-Edit Plus it includes a file called Grep? that contains useful information about what regular expressions are and how to construct them for use within Tex-Edit Plus.

This article was first published in [New Zealand] Macguide magazine Issue #31 January / February 2007 and may have been modified from the original.

Miraz Jordan @Miraz
