How to Get Started With Regular Expressions in the Linux Terminal

5 hours ago

9 minutes read

Searching your file system can be tricky. For example, do you sometimes find it difficult to be specific or exact? Or perhaps it’s too noisy? Regex can solve these issues and more. It’s powerful, universal, and flexible, and the basics will carry you a very long way.

What is Regex?

Regex is a pattern-matching language; it’s a way to expressively describe patterns that match strings (e.g., words or sentences). For example, say you’re searching your hard drive for an image called foo, but you cannot remember if it’s a JPEG or a PNG. We can use regex with fd like this: fd 'foo\.(jpg|png)'.

Many utilities make use of regex for searching, transforming, and interacting with text. For example, grep -E [regex], find -regex [regex], or fd [regex]. Using regex means that you can be very precise.

Regex is used everywhere. Most websites on the internet use it in one form or another. Regex is also common in utilities and applications, like ripgrep, Vim, Neovim, Emacs, and lots more.

The Different Flavors of Regex

Regex comes in different flavors, which essentially means different rules (aka syntax). There are many flavors, but they differ only in small ways. If you stick to the concepts that I cover later, they will work across most flavors and Linux utilities. You don’t need to think too hard about it.

PCRE (Perl Compatible Regular Expressions) is the most full-featured flavor. All the examples try to be compatible with PCRE.

When you become a pro, you can refer to the Wikipedia article that compares regex flavors or a comprehensive regex comparison table. Keep in mind that the concepts that you will learn should apply everywhere.

A Quick Overview of Regex to Dispel Any Mystery

The following list will introduce you gently to common regex features:

Concept	Description
Character classes	A list of specific characters that you want to match, e.g., [abc].
Match groups	Brackets around related parts of the expression, like brackets in mathematics, e.g., (foo).
Modifiers	Change how the expression functions, e.g., case sensitivity.
Anchors	Define the start and end of a string, e.g., ^foo$.
Quantifiers	Indicate quantity, e.g., foo+, foo{3}, etc.
Alternation	Simply an or statement, e.g., foo|bar.
DOTALL metacharacter	Match any single character, like a wildcard; it’s just a single period.

You will mix and match these to describe a pattern.

These features only apply across the board if you use them with the command flags mentioned later. For example, grep -P foo.

The DOTALL metacharacter is like a wildcard because it matches any single character (by default, any character except a newline). It’s simply a period. You will use this often in places where anything should match.

A web page that displays two boxes: pattern and results. In the pattern box is the dotall metacharacter. The pattern matches and highlights several lines in the results box.
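As a minimal sketch of the period in action, assuming GNU grep is installed: the -o flag prints each match on its own line, and the sample words here are made up for illustration.

```shell
# The period stands in for exactly one character, so "f.o" matches
# foo, f0o, and f-o, but not fo (there is nothing for the period to match).
dot_matches=$(printf 'foo f0o f-o fo\n' | grep -oE 'f.o')
printf '%s\n' "$dot_matches"
```

This should print foo, f0o, and f-o, one per line.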

Character Classes: Match Specific Characters in Any Order

Character classes are a list of characters, enclosed in square brackets, that you wish to match. For example, the following expression matches a, b, z, 1, 2, or 9:

[abz129]

This matches any alphanumeric character, upper or lowercase:

[a-zA-Z0-9]

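The hyphen (-) has a special meaning in a character class, so if you want to match it literally, you must place it first [-a-z] or escape it [a-z\-]. Escaping works in PCRE-style flavors; POSIX flavors treat a backslash inside a character class as a literal character, so placing the hyphen first is the portable choice.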

It’s important to understand that a character class matches exactly one character unless you use a quantifier (covered later).

A web page that displays two boxes: pattern and results. In the pattern box is a character class with the characters o, a, and 2. The results box has several highlighted lines that correspond to the pattern.

In the results box, you can see multiple characters highlighted. Each match corresponds to one of the characters in the character class.

If you look closely at the results box in the image, you will see that a single character class matches multiple characters. Global mode (g) is responsible for this. The global mode means that regex does not stop at the first match but instead keeps going and creates multiple matches.
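To try this in a terminal, here is a small sketch that assumes GNU grep. Its -o flag prints every match on a separate line, which behaves much like the global mode described above. The input strings are invented for the example.

```shell
# Each match is a single character from the class: o, a, or 2.
class_matches=$(printf 'boat 42\n' | grep -oE '[oa2]')
printf '%s\n' "$class_matches"

# A leading hyphen inside the class is treated as a literal hyphen.
hyphen_match=$(printf 'a-b\n' | grep -oE '[-a-z]+')
printf '%s\n' "$hyphen_match"
```

The first command should print o, a, and 2 on separate lines; the second should print a-b as a single match.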

Match Groups: Draw Boundaries Around Sub Expressions

In some ways, match groups are similar to brackets in mathematics. For example, when you write a mathematical expression like 1 + (2 / 2), it differs from (1 + 2) / 2. The calculation begins with the innermost brackets, which alters the result.

Brackets in regex work like boundaries; they group parts of the expression together. For example, foo(bar|baz) is not the same as foobar|baz, because the former will match foobar or foobaz; the latter will match foobar or baz.
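The difference is easy to see side by side. This sketch assumes GNU grep and uses a made-up input line:

```shell
input='foobar foobaz baz'

# With the group, alternation applies only to bar|baz after foo.
grouped=$(printf '%s\n' "$input" | grep -oE 'foo(bar|baz)')
printf 'grouped:\n%s\n' "$grouped"

# Without the group, the alternatives are the whole of "foobar" or "baz".
ungrouped=$(printf '%s\n' "$input" | grep -oE 'foobar|baz')
printf 'ungrouped:\n%s\n' "$ungrouped"
```

The grouped pattern should report foobar and foobaz. The ungrouped one should report foobar, then baz twice (once inside foobaz and once on its own).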

Quantifiers: How to Specify Exact and Variable Amounts

Quantifiers allow us to define quantities. When we match a character with DOTALL or character classes, we use quantifiers to say how many. We can also apply quantifiers to match groups, so we can define quantities for entire expressions.

Match Zero or More Things With the Asterisk

The asterisk (*) metacharacter will match zero or more things. The following matches any run of a, b, or z characters (e.g., abba or zzz), or an empty string:

[abz]*

Match One or More Things With the Plus Sign

The plus (+) metacharacter will match one or more things. The following matches one or more a, b, or z characters:

[abz]+

Make Things Optional With the Question Mark

The question mark (?) metacharacter makes the previous item optional. The following will match exactly one a, b, z, or nothing at all:

[abz]?

Define Exact Quantities With Curly Brackets

Curly brackets allow us to define an exact number. For example, the following will match exactly two consecutive characters, each of which is a, b, or z (e.g., ab or zz):

[abz]{2}

The following will match a, b, or z between 2 and 4 times:

[abz]{2,4}

A Summary of Quantifiers

?: Optional.
*: Zero or more (zero means an empty string).
+: One or more.
{n}: Match exactly n items.
{n,m}: Match between n and m items.

The plus (+) and question mark (?) metacharacters don’t work with most Linux utilities unless you use appropriate command flags. Flags are covered later.
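Here is a rough sketch of each quantifier, assuming GNU grep with the -E flag (which enables +, ?, and curly brackets without backslashes). The literal x and y around the character class are only there to make the matches easy to read; they are not part of the earlier examples.

```shell
# ? makes the class optional: matches xy and xay, but not xaby.
optional=$(printf 'xy xay xaby\n' | grep -oE 'x[abz]?y')

# * allows zero or more: matches xy as well as xabzy.
zero_or_more=$(printf 'xy xabzy\n' | grep -oE 'x[abz]*y')

# + requires at least one: matches zab and b, and skips x.
one_or_more=$(printf 'zab x b\n' | grep -oE '[abz]+')

# {2} takes exactly two characters per match; {2,4} takes two to four.
exactly_two=$(printf 'abzab\n' | grep -oE '[abz]{2}')
two_to_four=$(printf 'abzab\n' | grep -oE '[abz]{2,4}')

printf '%s\n' "$optional" "$zero_or_more" "$one_or_more" "$exactly_two" "$two_to_four"
```

Note that [abz]{2} splits abzab into ab and za and leaves the final b unmatched, while [abz]{2,4} greedily takes abza in one go.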

Match as Much or as Little as You Want With Lazy and Greedy Quantifiers

Some quantifiers allow us to define an unspecified amount. For example, the plus sign (+) means one or more—anything greater than 0. The plus (+) and asterisk (*) metacharacters are greedy by default, which means that they try to match as much as possible. In contrast, we can make them lazy so that they match as little as possible.

Appending a question mark (?) makes them lazy. For example, the following will match a, b, or z, but each match stops after the first character because that is the least it can take (it’s lazy).

[abz]+?

The asterisk (*) metacharacter is similar except for one small detail: it matches zero or more items. The laziest possible match is zero characters, so on its own the following will match only an empty string.

[abz]*?

Making a quantifier lazy can be more performant because it doesn’t need to process the entire string. If you’re searching millions of strings, matching only the first few characters can save a lot of time and resources.
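Lazy quantifiers need a PCRE-style engine, so this sketch assumes a GNU grep build that supports the -P flag. The HTML-like sample string is my own illustration, not something from the examples above.

```shell
# Greedy: + grabs the longest possible run in one match.
greedy=$(printf 'abzab\n' | grep -oP '[abz]+')

# Lazy: +? takes one character per match, then grep -o moves on to the next.
lazy=$(printf 'abzab\n' | grep -oP '[abz]+?')

# The difference matters most when something follows the quantifier.
greedy_tag=$(printf '<b>bold</b>\n' | grep -oP '<.+>')
lazy_tag=$(printf '<b>bold</b>\n' | grep -oP '<.+?>')

printf '%s\n' "$greedy" "$lazy" "$greedy_tag" "$lazy_tag"
```

The greedy tag pattern should swallow the whole line, while the lazy one should stop at the first closing angle bracket and report the two tags separately.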

Anchors Match the Start and End of Lines

Anchors are simple to understand. There are two, one that indicates the start of a line (^) and one that indicates the end ($). The following pattern matches foo exactly and nothing else:

^foo$

A browser window displays two boxes, one for the pattern and one for the result. The pattern is the word foo, with start and end anchors around it. The result has two lines with foo and foobar; it highlights only foo.
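The same comparison can be reproduced in a terminal. This is a small sketch assuming GNU grep, which tests the pattern against one line at a time:

```shell
# With anchors, only the line that is exactly "foo" matches.
anchored=$(printf 'foo\nfoobar\n' | grep -E '^foo$')

# Without anchors, any line that contains "foo" matches.
unanchored=$(printf 'foo\nfoobar\n' | grep -E 'foo')

printf 'anchored:\n%s\nunanchored:\n%s\n' "$anchored" "$unanchored"
```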

Modifiers: Flags That Change How Regex Works

Modifiers are a way to change how regex works. For example, we can make it case-insensitive. We’ve already looked at the global modifier, but it’s worth restating that they are flags that typically live at the end of an expression.

A terminal window displays an echo command that outputs a word to sed. The word is two joined lowercase and uppercase foos. The sed command also has a modifier: i. The sed command replaces both uppercase and lowercase foos with bar.

The Global Modifier

The global modifier (g) allows regex to continue searching for matches after it finds the first one, resulting in multiple matches. In contrast, disabling the global modifier causes regex to stop after finding the first match.

A web page that displays two boxes: pattern and results. In the pattern box is a character class that defines a range of uppercase characters and a range of numbers. The results box has several highlighted lines that correspond to the pattern.

This expression matches individual uppercase letters and numbers. It ignores lowercase letters. Because the global modifier (g) is active, it matches multiple items.

The Case Insensitivity Modifier

The case-insensitivity modifier (i) will match against both uppercase and lowercase when active.

A web page displays two boxes: pattern and result. At the end of the pattern box is the modifier: i. A red arrow points to it. The pattern is lowercase foo, and the highlighted result is an uppercase foo.
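In the terminal, the modifier usually shows up either as a flag on the s command in sed or as a command-line option. The sketch below assumes GNU sed (the i flag on s is a GNU extension) and GNU grep; the input words are made up.

```shell
# g replaces every match on the line; i ignores case, so foo and FOO both match.
replaced=$(echo 'fooFOO' | sed 's/foo/bar/gi')
printf '%s\n' "$replaced"

# grep exposes the same idea as the -i command-line flag.
found=$(printf 'FOO\n' | grep -i 'foo')
printf '%s\n' "$found"
```

The sed command should print barbar, and grep should print the uppercase FOO line.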

The Multiline Modifier

Anchors define the start (^) and end ($) of a string. When the multiline modifier (m) is active, the anchors match the start and end of each line. When it’s inactive, they match only the start and end of the entire string, even if it contains several lines.

This is how the anchors behave when multiline mode is active:

^foo$
^bar$

And when it’s inactive:

^foo
bar$

To evaluate all lines when the multiline modifier (m) is active, you must also enable the global modifier (g). But remember that doing so will also create multiple matches.

Putting It All Together: Using Regex With Commands

So now for the grand finale: how do we put what we’ve learned to good use? As mentioned at the beginning, find, fd, grep, ripgrep, and sed all support regex. Pay attention to the command flags that I use; I chose them so that they use similar flavors.

For each command, I will use the following expression:

^.+/[fo]+\.(jpg|png)$

This pattern matches a POSIX path for a JPG or PNG file. For example:

/foo/bar/baz/foo.jpg

This expression covers everything that we’ve learned: anchors, character classes, quantifiers, alternation, match groups, and the DOTALL metacharacter. Here’s a summary of the expression (in order of appearance):

Segment	Note
^	Match the start of the line.
.+	Match any character, one or more times.
/	Match a forward slash just before the file name. This literal slash defines a clear boundary for our file name, and the previous DOTALL will match all other path characters (including slashes).
[fo]+	Match f or o one or more times, e.g., fo, foo, ffoo, fffooo, fofofof.
\.	Match a literal period (not a DOTALL).
(jpg|png)	A match group; I’ve used it here to group these two patterns together. The pipe is called alternation, and using it like this means jpg or png.
$	Match the end of the line.

I will match all the expressions against a file (called examples) with the following contents:

/foo/bar/baz/foo.jpg
/one/two/thr/foo.png

/this/should/not/match.jpg

For the find commands, I will create such files in my file system and search for them.

Using grep With Regex

For grep, we must use the -P flag, which enables its (limited) PCRE engine. PCRE is the most extensive flavor and supports almost every feature you can think of.

A terminal displays the grep command, which matches two file paths using the PCRE flag.
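A command along these lines should reproduce the result. It is a sketch that assumes a GNU grep build with -P support, and it writes the sample lines to a temporary file rather than a file literally named examples.

```shell
# Recreate the sample file (including its blank line) in a temporary location.
examples=$(mktemp)
printf '%s\n' '/foo/bar/baz/foo.jpg' '/one/two/thr/foo.png' '' \
  '/this/should/not/match.jpg' > "$examples"

# -P selects the PCRE engine; single quotes stop the shell from touching the pattern.
grep_matches=$(grep -P '^.+/[fo]+\.(jpg|png)$' "$examples")
printf '%s\n' "$grep_matches"

rm -f "$examples"
```

Only the first two paths should be printed, because match.jpg does not consist solely of the letters f and o.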

If you’re unfamiliar with grep, see this detailed guide on how to use it.

Using find With Regex

The find command supports multiple regex flavors. You can see a list of them with the following command:

find -regextype help

The flavor that most closely matches PCRE is posix-extended, aka POSIX ERE (Extended Regular Expressions). POSIX ERE is missing many advanced features, but it supports all the features that we’ve covered.

A terminal displays the find command, which matches two searched file paths using the regex and regextype flags.
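Here is a sketch of what such a command can look like, assuming GNU find (the -regextype option is not available in every find implementation). The temporary directory and the files inside it are created only for this example.

```shell
# Build a small directory tree that mirrors the sample paths.
workdir=$(mktemp -d)
mkdir -p "$workdir/foo/bar/baz" "$workdir/one/two/thr" "$workdir/this/should/not"
touch "$workdir/foo/bar/baz/foo.jpg" "$workdir/one/two/thr/foo.png" \
  "$workdir/this/should/not/match.jpg"

# -regextype must come before -regex; -regex is tested against the whole path.
find_matches=$(find "$workdir" -regextype posix-extended \
  -regex '^.+/[fo]+\.(jpg|png)$' | sort)
printf '%s\n' "$find_matches"

rm -rf "$workdir"
```

The output should list foo.jpg and foo.png with their full paths, but not match.jpg.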

If this command seems long-winded, then you should probably use an alias to set the regex options by default.

Using fd With Regex

The fd command uses the regex Rust crate (a Rust package). The regex crate is nearly compatible with PCRE, so we can use it without much concern. However, by default, fd only matches against file names, so we must use the --full-path flag if we want to match against the entire file path.

A terminal displays results from the fd command, which matches two searched file paths.
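The following sketch shows one way to run the search. It assumes fd is installed; some distributions ship the binary as fdfind, so the script checks for both names and skips the search if neither exists. The temporary directory tree is made up for the example.

```shell
# Build a small directory tree that mirrors the sample paths.
workdir=$(mktemp -d)
mkdir -p "$workdir/foo/bar/baz" "$workdir/one/two/thr" "$workdir/this/should/not"
touch "$workdir/foo/bar/baz/foo.jpg" "$workdir/one/two/thr/foo.png" \
  "$workdir/this/should/not/match.jpg"

# Locate the fd binary under either of its common names.
fd_bin=$(command -v fd || command -v fdfind || true)

if [ -n "$fd_bin" ]; then
  # --full-path matches the pattern against the whole path, not just the file name.
  fd_matches=$("$fd_bin" --full-path '^.+/[fo]+\.(jpg|png)$' "$workdir" | sort)
  printf '%s\n' "$fd_matches"
else
  echo 'fd is not installed; skipping' >&2
fi

rm -rf "$workdir"
```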

Using ripgrep With Regex

Using regex with the ripgrep command is straightforward because it uses the regex Rust crate. As mentioned earlier, the Rust crate closely matches PCRE.

A terminal displays results from the ripgrep command. It searched a file and matched two file paths.
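As a sketch, and assuming ripgrep is installed under its usual binary name rg, the search looks much like the grep one. The sample lines go into a temporary file rather than a file literally named examples, and the script skips the search if rg is missing.

```shell
# Recreate the sample file (including its blank line) in a temporary location.
examples=$(mktemp)
printf '%s\n' '/foo/bar/baz/foo.jpg' '/one/two/thr/foo.png' '' \
  '/this/should/not/match.jpg' > "$examples"

if command -v rg >/dev/null 2>&1; then
  # -N suppresses line numbers so that only the matching lines are printed.
  rg_matches=$(rg -N '^.+/[fo]+\.(jpg|png)$' "$examples")
  printf '%s\n' "$rg_matches"
else
  echo 'ripgrep is not installed; skipping' >&2
fi

rm -f "$examples"
```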

Using sed With Regex

The sed command is different from the others; it allows us to search and replace strings. If you’re unfamiliar with it, you should check out this great guide on how to use the sed command. The general form of the command is as follows:

s/pattern/replacement/

It’s not necessary to use forward slashes, because you can use any character that you wish. In the example below, I’ve used the pipe character instead of forward slashes so that I do not confuse sed. Using a pipe means that I do not need to escape the forward slashes in the provided replacement value (a path).

For sed, we need to use the GNU ERE regex flavor, and we do that with the -E flag. Everything that works in GNU ERE will work in PCRE, which is good for us.

A terminal displays results from the sed command. It searched a file and replaced two file paths with a provided value—another path.

Number 1 is the pattern. Number 2 is the desired value to replace matched lines with. Number 3 shows matched lines replaced with the new value. Number 4 is a line from the file that did not match and is left unmodified.

In the example, sed printed the result to the terminal and left the file untouched, similar to how cat and grep work, but sed also supports in-place editing via the -i flag.

So that’s it; those are the basics of regex. The basics will carry you very far. The important thing to remember is that many utilities use different regex flavors, and so, if you stray from the mentioned flags, you may find that some features do not work.

In addition to that, regex is something that you need to get your hands on before it truly makes sense. You can practice your skills, get insights, and learn more via regex101, a powerful online playground that provides tips and guidance.
