Delving the depths of computing,
hoping not to get eaten by a wumpus

By Timm Murray

How to write regexes that are almost readable

2022-06-06


Let’s start with a moderately simple regex:

/\A(?:\d{5})(?:-(?:\d{4}))?\z/

Some of you might be smacking your forehead at the thought of this being “simple”, but bear with me. Let’s break down what it does piece by piece: match five digits, then optionally a dash followed by four digits, all in non-capturing groups and anchored to the beginning and end of the string. That tells you what it does, but not what it’s for.

Spelling out the details in plain English, as in the paragraph above, doesn’t help anyone understand its purpose. What does help is good variable naming and commenting, such as:

# Matches US zip codes with optional extensions
my $us_zip_code_re = qr/\A(?:\d{5})(?:-(?:\d{4}))?\z/;

As with any other code, we convey contextual clues about its purpose through variable naming and comments. In Perl, qr// gives you a precompiled regex that you can carry around and match against like this:

if( $incoming_data =~ $us_zip_code_re ) { ... }

Some other languages handle this with a Regex object that you can store in a variable.
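As a quick sanity check, here is a minimal sketch of that precompiled regex run against a handful of inputs. The sample strings and the %expected hash are made up for illustration and don't come from any real data set.

```perl
use strict;
use warnings;

# Matches US zip codes with optional extensions
my $us_zip_code_re = qr/\A(?:\d{5})(?:-(?:\d{4}))?\z/;

# Hypothetical sample inputs, mapped to whether they should match
my %expected = (
    '12345'      => 1,
    '12345-6789' => 1,
    '1234'       => 0,
    '12345-678'  => 0,
    "12345\n"    => 0, # \z rejects a trailing newline; $ would allow it
);

for my $input (sort keys %expected) {
    my $matched = ( $input =~ $us_zip_code_re ) ? 1 : 0;

    # Make the newline visible when printing
    (my $shown = $input) =~ s/\n/\\n/g;

    printf "%-12s %s (%s)\n",
        $shown,
        $matched ? 'matches' : 'does not match',
        $matched == $expected{$input} ? 'as expected' : 'UNEXPECTED';
}
```

The last input is there because \A and \z anchor to the whole string, so a stray trailing newline causes a mismatch rather than slipping through.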

There are various proposals for improved regex syntax, but a different syntax wouldn’t help with this more essential complexity. It could help with overall readability, except that Perl implemented a feature for that a long time ago, one that doesn’t change the approach so drastically: the /x modifier. It lets you put in whitespace and comments, which means you can indent things:

my $us_zip_code_re = qr/\A
    (?:
        \d{5} # First five digits required
    )
    (?:
        # Dash and next four digits are optional
        -
        (?:
            \d{4}
        )
    )?
\z/x;

That admittedly still isn’t perfect, but it gives you some hope of maintainability. Your eyes don’t immediately get lost in the punctuation.
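One way to gain some confidence when reformatting a regex with /x is to check that the compact and expanded forms agree on the same inputs. This is only a sketch; the @inputs list and the variable names $compact_re and $expanded_re are invented for illustration.

```perl
use strict;
use warnings;

# The original one-line form
my $compact_re = qr/\A(?:\d{5})(?:-(?:\d{4}))?\z/;

# The same pattern spread out with /x
my $expanded_re = qr/\A
    (?:
        \d{5} # First five digits required
    )
    (?:
        # Dash and next four digits are optional
        -
        (?:
            \d{4}
        )
    )?
\z/x;

# Hypothetical inputs, some valid and some not
my @inputs = ( '12345', '12345-6789', '12345-', '123456', 'abcde', '' );

my $disagreements = 0;
for my $input (@inputs) {
    my $compact  = ( $input =~ $compact_re )  ? 1 : 0;
    my $expanded = ( $input =~ $expanded_re ) ? 1 : 0;
    $disagreements++ if $compact != $expanded;
}

print $disagreements
    ? "The two forms disagree on $disagreements input(s)\n"
    : "Both forms agree on all inputs\n";
```

Keep in mind that under /x an unescaped space or # in the pattern is treated as formatting, so a literal one has to be escaped (for example `\ ` or `\#`).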

I’ve used arrays and join() to implement a similar style in other languages, but it isn’t quite the same:

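// JavaScript has no \A or \z; without the m flag, ^ and $ anchor to the whole string
let us_zip_code_re = new RegExp( [
    "^",
    "(?:",
        "\\d{5}", // First five digits required (backslash doubled inside a string)
    ")",
    "(?:",
        // Dash and next four digits are optional
        "-",
        "(?:",
            "\\d{4}",
        ")",
    ")?",
    "$",
].join( '' ) );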

That helps, but auto-indenting text editors try to be smart with it and fail. Perl having the // syntax for regexes also means editors can handle syntax highlighting inside the regex, which doesn’t work when it’s just a bunch of strings.

More languages should implement the /x modifier.



Copyright © 2024 Timm Murray
CC BY-NC

Opinions expressed are solely my own and do not express the views or opinions of my employer.