Regular Expressions: Syntax in Detail | by David Farrugia | Oct, 2022

By Jessie Hobb On Oct 8, 2022

PROGRAMMING | PATTERN MATCHING | COMPUTER SCIENCE

A deeper dive into regular expressions and their syntax

In our last post, we introduced and discussed the paradigm of regular expressions (regex). Regex is a powerful tool that allows us to perform string pattern matching, replacement, and other manipulation operations.

As a worked example, we built a regex to validate a hexadecimal colour value.

You can find the introductory material here:

The purpose of this post is to serve as a scrapbook/cheat sheet style guide of some more advanced concepts in regex. When I’m learning these types of skills (skills that require practice to truly master), I find it significantly better to have concise guidelines and a practice arena where I can put the said guidelines to the test.

I typically use the following online tool to practice and test my regex. There are other good online tools. Pick one you like and as always — practice, practice, and then practice some more.

As such, this post is going to be slightly different from my typical articles. It is going to be minimal and direct. I would also love to hear your feedback on this style of guidance and whether or not you prefer a more detailed and hands-on writing style.

Experimentation is the essence of growing

Before we get started, let us first refresh our memory with regards to the terminology.

Delimiters are used to indicate the start and end of a regex.

In between the delimiters, we write our regex. The regex is the actual pattern that we want to match.

After the closing delimiter, we can also use modifiers. But, more on this later!
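Not every language writes the delimiters out. As a rough sketch, Python's re module takes the pattern as a plain string and the modifiers as a separate flags argument, so something like /hello/i could be written as below. The pattern and the sample text are made up purely for illustration.

```python
import re

# In delimiter notation this would be /hello/i: the pattern sits between
# the slashes and the trailing i is the case-insensitive modifier.
# Python has no delimiters, so the modifier is passed as a flag instead.
pattern = re.compile(r"hello", re.IGNORECASE)

match = pattern.search("Well, HELLO there")
matched_text = match.group(0) if match else None
print(matched_text)  # HELLO
```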

Two main types: Ordinary and Special

Ordinary

Ordinary characters are the simplest form of regex because they match themselves (i.e., their literal character). By matching themselves we mean that if we type the character A in a regex, it will actually look for an A in the string. Some other examples include the numbers between 0 and 9, and the remaining letters of the alphabet.

Given a string abc123, a matching regex would simply be /abc123/.
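Here is a minimal sketch of that literal match in Python (the sample strings are my own). Python does not use the / delimiters, so only the pattern itself appears.

```python
import re

# Every character in the pattern is ordinary, so each one matches itself.
literal = re.compile(r"abc123")

found = literal.search("xx abc123 yy")
missing = literal.search("abc 123")  # the space breaks the literal sequence

print(found.group(0))  # abc123
print(missing)         # None
```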

Control

Control characters (or escape sequences) are sequences of characters that represent other elements. For example, the control sequence \n represents a new line. Below, we show a number of commonly used escape sequences.

WARNING: different regex engines might have different representations — so it’s always best to double-check with the documentation of your regex flavour.
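To make this concrete, here is a small sketch of a few escape sequences as Python's re module understands them. The selection (\n, \t, \d and \.) and the sample text are my own assumptions for illustration, and other flavours may differ.

```python
import re

# The sample text contains a real tab and a real newline.
text = "name\tvalue\nhost\t10.0.0.1"

# \n matches a newline and \t matches a tab.
lines = re.split(r"\n", text)
fields = re.split(r"\t", lines[1])

# \d is shorthand for a digit, and \. is a literal dot.
octets = re.findall(r"\d+", fields[1])
has_dot = re.search(r"\.", fields[1]) is not None

print(fields)  # ['host', '10.0.0.1']
print(octets)  # ['10', '0', '0', '1']
```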

Special/Meta Characters

Character Class

Character classes list one or more characters. Using a character class is essentially saying that any one of the listed characters is a match. We write a class by enclosing the characters in square brackets. We can also specify ranges using the - operator.

For example, /[abc123]/ means that if any one of the characters between the square brackets exists in the string, then it will be a match.

Similarly, /[a-f]/ means that any letter from a to f can be matched (i.e., a, b, c, d, e, f).
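A quick sketch of both classes in Python, using sample strings I made up. Note that each match is a single character, because a class on its own stands for exactly one character.

```python
import re

# [abc123] matches any ONE of the listed characters.
listed = re.findall(r"[abc123]", "a-z-1-9")

# [a-f] matches any one character from a through f.
ranged = re.findall(r"[a-f]", "deadbeef-xyz")

print(listed)  # ['a', '1']
print(ranged)  # ['d', 'e', 'a', 'd', 'b', 'e', 'e', 'f']
```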

One important thing to keep in mind when using ranges is that a regex range is based on ASCII codes. So if our regex is /[A-z]/, the range will also match some extra symbols, such as the \ and the square brackets. Some other examples include:

[9-0] is a reversed range; depending on the engine it is either treated as empty or rejected as an error
[],\[\ — invalid and will fail to compile

Have a look at the ASCII codes below to understand better what I’m talking about.
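If you don't have an ASCII table to hand, the gap between Z and a can be printed directly. The sketch below uses Python and a sample string of my own; the reversed-range check shows how Python's re module in particular reacts, and other engines may behave differently.

```python
import re

# Characters whose ASCII codes fall between 'Z' (90) and 'a' (97).
between = "".join(chr(code) for code in range(ord("Z") + 1, ord("a")))
print(between)  # [\]^_`

# [A-z] therefore accepts these symbols as well as the letters.
extras = re.findall(r"[A-z]", "[ok]_^")
print(extras)  # ['[', 'o', 'k', ']', '_', '^']

# [A-Za-z] is the class that matches letters only.
letters = re.findall(r"[A-Za-z]", "[ok]_^")
print(letters)  # ['o', 'k']

# Python refuses to compile the reversed range [9-0].
try:
    re.compile(r"[9-0]")
    reversed_range_compiles = True
except re.error:
    reversed_range_compiles = False
print(reversed_range_compiles)  # False
```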

These examples are referred to as positive classes because the regex is expressing what it should match. On the other hand, we also have negative classes, which express what the regex should not match. This is done by placing the ^ operator at the start of the character class, right after the opening square bracket.

For example, given the regex /[^abc123]/ and the string abchello123, the regex engine skips every character listed in the square brackets and therefore matches only the characters of the hello part of the string.
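The same example as a short Python sketch. On its own the class matches one character at a time, so I have also added a + variant (my own addition) to collect the whole run in one match.

```python
import re

# [^abc123] matches any single character that is NOT a, b, c, 1, 2 or 3.
singles = re.findall(r"[^abc123]", "abchello123")

# Adding + collects the run of non-listed characters into a single match.
run = re.findall(r"[^abc123]+", "abchello123")

print(singles)  # ['h', 'e', 'l', 'l', 'o']
print(run)      # ['hello']
```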

So what if we want to actually match a -, or even the ^ character?

The syntax depends on the character in question. For example, the - can be matched literally by placing it at the very beginning or very end of the character class (e.g., [A-Z_-]). To start a range with a dash, the range needs to be the first one in the character class (e.g., [--/A-Z]); to end a range with a dash, the range needs to be the last one (e.g., [A-Z+--]).

As for the ^ character, it works quite similarly. It will count as a literal if it is not the first character in the class.

For [ and ], the best way to escape them is by using the \ operator like \[ or \].
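Here is a small sketch of these three rules in Python. The sample strings are made up, and the exact rules can vary between flavours, so treat it as an illustration rather than a reference.

```python
import re

# A trailing - is a literal dash, not a range operator.
dash = re.findall(r"[A-Z_-]", "A_b-c")

# ^ is a literal when it is not the first character of the class.
caret = re.findall(r"[a^]", "a^b")

# Square brackets inside a class are escaped with a backslash.
brackets = re.findall(r"[\[\]]", "x[0]")

print(dash)      # ['A', '_', '-']
print(caret)     # ['a', '^']
print(brackets)  # ['[', ']']
```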

The dot (.) matches any character except a newline. With the dotall modifier it matches \n as well (some flavours of regex engines swap the newline for the null byte). Check with your regex engine!

When used in a character class, the dot loses its power and is matched as a literal. A wildcard would be pointless there anyway, since the purpose of a class is to list exactly which characters may match.

WARNING: using a dot with a quantifier (e.g., .+) can become VERY SLOW, because the engine may have to backtrack heavily.
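A short sketch of the dot's behaviour in Python, where the dotall modifier is spelled re.DOTALL. The sample text is my own.

```python
import re

text = "one\ntwo"

# By default the dot stops at the newline, so we get one match per line.
default = re.findall(r".+", text)

# With the DOTALL flag the dot matches \n too, so one match spans both lines.
dotall = re.findall(r".+", text, flags=re.DOTALL)

# Inside a character class the dot is just a literal full stop.
literal_dot = re.findall(r"[.]", "a.b")

print(default)      # ['one', 'two']
print(dotall)       # ['one\ntwo']
print(literal_dot)  # ['.']
```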

Quantifiers are used to indicate repetition. There are 4 main repetition quantifiers.

? — repeat zero or one time

+ — repeat one or more times (unlimited)

* — repeat zero or more times (unlimited)

{} allows us to specify the exact repetition we want. We can also pass in min and max values: {n,m} or {n,} or {,m}

{,m} is not supported by every regex flavour, so it is best to use {0,m}
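The four quantifiers can be tried side by side. This is only a sketch in Python with sample words I picked myself; re.fullmatch requires the whole string to match, which makes the differences easy to see.

```python
import re

samples = ["clr", "color", "colour", "colouur"]

# ? allows zero or one 'u'.
optional = [s for s in samples if re.fullmatch(r"colou?r", s)]

# + requires one or more 'u'.
one_or_more = [s for s in samples if re.fullmatch(r"colou+r", s)]

# * allows zero or more 'u'.
zero_or_more = [s for s in samples if re.fullmatch(r"colou*r", s)]

# {n,m} allows between n and m repetitions (here 2 to 3 digits).
bounded = re.findall(r"\d{2,3}", "1 22 333 4444")

# {0,m} is spelled out here instead of the less portable {,m}.
up_to_two = re.fullmatch(r"a{0,2}", "aa") is not None

print(optional)      # ['color', 'colour']
print(one_or_more)   # ['colour', 'colouur']
print(zero_or_more)  # ['color', 'colour', 'colouur']
print(bounded)       # ['22', '333', '444']
```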

Quantifiers are called ‘greedy’ because they will always favour a match over a non-match. Quantifiers will also try to match as often as possible.

Let’s have a quick look at an example.

We can see that our regex matched 2 groups. The first group (light blue in the example) consists of the first 5 characters, while the second group (darker blue) is made up of the remaining 4 characters.

Greedy means that it will only stop once the condition cannot be satisfied any longer.

Lazy will stop as soon as the condition is satisfied. We make a quantifier lazy by appending the ? operator to it (e.g., +? or *?). This makes it more reluctant: it will still favour a match, but it will repeat the fewest number of times possible while still producing a match, and that can be 0!

For example, the greedy h.+l matches 'hell' in 'hello' but the lazy h.+?l matches 'hel'.
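The hello example can be checked directly in Python. The last line is my own addition, showing that a lazy * is happy to repeat zero times.

```python
import re

# Greedy: .+ grabs as much as it can, then gives back just enough for the l.
greedy = re.search(r"h.+l", "hello").group(0)

# Lazy: .+? takes one character at a time and stops at the first l it can use.
lazy = re.search(r"h.+?l", "hello").group(0)

# A lazy * can settle for zero repetitions, giving an empty match.
lazy_star = re.search(r"a*?", "aaa").group(0)

print(greedy)           # hell
print(lazy)             # hel
print(repr(lazy_star))  # ''
```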

To leverage this and make the regex faster, we must be precise and use negation (it limits backtracking). A negated character class is almost always better than a wildcard.
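As a rough illustration of negation versus a wildcard, consider pulling quoted strings out of a sentence. The sample text is hypothetical, and this sketch only shows the difference in what gets matched; it does not measure speed.

```python
import re

text = 'say "one" then "two"'

# The greedy wildcard runs to the end of the text and backtracks to the
# LAST quote, so both quoted strings end up in a single match.
wild = re.findall(r'".+"', text)

# The negated class cannot run past a quote, so each string is matched
# on its own with far less work for the engine.
precise = re.findall(r'"[^"]+"', text)

print(wild)     # ['"one" then "two"']
print(precise)  # ['"one"', '"two"']
```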

The whole idea of alternation is essentially to create an alternate branch. We can do this via the | operator.

The | is not special in a character class; thus, it will be matched as a literal there. So beware. Also, the branch ordering is important (the left-most branch is preferred, but it won't block the other branches).

The | also has the lowest precedence, so it's probably wiser to use grouping operators to indicate where the alternation starts and ends.
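These points about | can be sketched in Python as follows. The words cat, dog and catalog are just sample inputs of mine, and the ordering behaviour shown is that of Python's backtracking engine.

```python
import re

# | has the lowest precedence, so this reads as (^cat) OR (dog$).
loose = re.compile(r"^cat|dog$")
# Grouping limits the alternation, so both anchors apply to each branch.
strict = re.compile(r"^(cat|dog)$")

loose_hit = loose.search("catalog") is not None    # starts with cat
strict_hit = strict.search("catalog") is not None  # is neither cat nor dog

# Inside a character class, | is matched as a literal.
pipe = re.findall(r"[a|b]", "a|b")

# The left-most branch that can match is preferred...
ordered = re.search(r"a|ab", "ab").group(0)
# ...but the other branch is still tried when the first cannot finish the job.
fallback = re.fullmatch(r"a|ab", "ab").group(0)

print(loose_hit, strict_hit)  # True False
print(pipe)                   # ['a', '|', 'b']
print(ordered, fallback)      # a ab
```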

Groups, as the name implies, group things together. A group is specified using the ( ) characters.

For example, let us say that we want to match the word hello in full, repeated an unlimited number of times. We can use the regex /(hello)+/ to specify that we want to match the entire group (in this case, hello) one or more times (via the +).
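A quick sketch of why the parentheses matter, using Python and sample strings of my own. Without the group, the + only applies to the final o.

```python
import re

# (hello)+ repeats the whole word.
grouped = re.fullmatch(r"(hello)+", "hellohello") is not None

# hello+ repeats only the last character, so a second hello does not fit.
ungrouped = re.fullmatch(r"hello+", "hellohello") is not None

# What hello+ does accept is a run of extra o characters.
trailing_os = re.fullmatch(r"hello+", "hellooo") is not None

print(grouped)      # True
print(ungrouped)    # False
print(trailing_os)  # True
```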

In this post, we've gone over the regex syntax details that are most commonly used. At first glance, regex can seem like just one more tool to learn but, speaking from experience, in reality it's super useful.

I kept this post purposefully concise because regex is one of those tools that you have to play around with to properly get a feel for it. I highly urge you to give the regex learning journey a try. You’ll become a 10x more efficient developer if you do — guaranteed!
