KKMIN

Deciphering Wizard Runes: Variable Delimiter in Regex

I remember seeing Regular Expressions for the first time while I was still a junior in university and being really intimidated by it. The seemingly random amalgamation of symbols and brackets resulting in some kind of coherent pattern matching seemed like a kind of black magic that that would deter any beginner from believing they could comprehend it.

(?<=\. ).*?it(?=\?)\?

An example of a regex that can terrify freshmen.

A few years later, I've had to use regex in a few projects and I've come to appreciate not just their power, but their simplicity and elegance. If we are able to get past the intimidating look of the syntax, we can see the different components that constitute the regex and understand how it works.

Context

In one of my recent projects, we have to analyse an SQL query by mapping parts of it into its equivalent relational algebra operator tree/syntax tree (which in turn has its own physical database operations). The idea is that certain parts of the SQL query will be mapped to certain operators in the operator tree. To do so, we need to be able to partition the SQL query meaningfully into its constituent tree nodes. For example, the query:


SELECT * FROM A, B WHERE A.key=B.key AND A.value > 10

can be partitioned into the following segments:

  • SELECT * FROM A, B
  • A.key=B.key
  • A.value > 10

Where each segment maps to an operator. Notice that certain keywords like WHERE, AND are not included in the segments, as they are not part of the nodes in an operator tree. How can we meaningfully split such queries into segments?

Regular Expression

Since regex is used to match patterns in strings, we can tell it to match all segments that are between the start of the string, WHERE, AND and the end of the string. This is how it looks like:

(?<=^|WHERE|AND).*?(?=$|WHERE|AND)

Relax, this is not as scary as it looks. Let's break it down into its components. There are 3 main parts to this regex:

  • (?<=^|WHERE|AND) - This is the lookbehind part of the regex.
  • .*? - This is the pattern part of the regex, the content we want.
  • (?=$|WHERE|AND) - This is the lookahead part of the regex.

Lookbehind / Lookahead

Lookbehind and Lookahead expressions essentially tell regex that we want our pattern to be matched only if it is preceded or followed by whatever pattern we specify. In our case, we want our pattern to be matched only if it is preceded by the start of the string, WHERE or AND and followed by the end of the string, WHERE or AND.

For example, the string SELECT * FROM A, B is preceded by the start of the string and followed by the word WHERE in our SQL query which is SELECT * FROM A, B WHERE A.key=B.key AND A.value > 10.

Another example: A.key=B.key is preceded by WHERE and followed by AND in our SQL query.

(?<=^|WHERE|AND) just means we perform a lookbehind where the lookbehind value can be the start of the string (^), WHERE or AND. Similarly, (?=$|WHERE|AND) just means we perform a lookahead where the lookahead value can be the end of the string ($), WHERE or AND.

Pattern

The .*? part of the regex is the pattern we want to match, i.e. whatever is between the lookbehind/lookahead. The dot . means we can match any character, and * means we can match that any character as many times as possible. However, if we only use .*, consider this:

SELECT * FROM A, B WHERE A.key=B.key AND A.value > 10 is now a valid match because it satisfies lookbehind (start of string) and lookahead (end of string).

A.key=B.key AND A.value > 10 is also a valid match because it satisfies lookbehind (WHERE) and lookahead (end of string).

Lazy Matching

By default, regex will adopt a greedy approach which will match the pattern as much as possible from left to right. The * operator will end up matching the entire string since the end of the string will satisfy the lookahead condition. This is not what we want.

The ? at the end of the pattern means we want to match the pattern lazily, which means we want to match the pattern as few times as possible. This is important because our lookbehind and lookahead values include the start and end of the strings, but we may encounter another keyword earlier, such as AND. I found this article which explains it really well.

And that's it! We can now use this regex to split our SQL query into segments, and add as many keywords we want to the lookbehind and lookahead such as OR, GROUP BY, ORDER BY etc. That wasn't so hard, was it?

Delimiter

You are probably familiar with the concept of delimiters that can be used to split a string into different parts. For example, this Java code:


String s = "a,b,c";
String[] parts = s.split(",");

will result in the array ["a", "b", "c"]. The delimiter is the comma , that is used to split the string into different parts. In the case of our use case with the SQL query, what we are really trying to do is to use a variable delimiter where keywords like WHERE and AND are candidates to split the query into different segments.

As it turns out, instead of passing a character, we can also pass a regex itself as a delimiter to the split function. This actually makes our lives much easier rather than crafting an expression with lookbehind and lookahead. The following code:


String s = "SELECT * FROM A, B WHERE A.key=B.key AND A.value > 10";
String[] parts = s.split("( WHERE | AND )");

will result in the array ["SELECT * FROM A, B", "A.key=B.key", "A.value > 10"], just like the matches of the regex we crafted earlier. "(WHERE | AND)" tells the function that either the keyword WHERE or AND can act as a delimiter.

This is definitely a much simpler way of doing things for my use case, and I am embarrassed that I didn't think of this earlier. Instead, it took me a roundtrip with regex to finally see the problem as that of multiple delimiters, which of course can be done with regex itself, but it can also be done with a simpler regex inside a split function.

Closing Thoughts

I've just barely scratched the surface of regex with this small use case; it has many more features that can be used in other scenarios, and I am still learning. I hope this post provided some insights and unraveled some of the mysteries of regex. It is not difficult if we put in a little bit of time to understand it from its components.

If you've read this far, why not scroll up to the top of the page and see what that first regex is trying to match? I'll give you a hint: it's on this page :)

← Back to home

Comments