Regular Expressions Illustrated (prototype)

Hi all!

This is a prototype of an upcoming programming book. I would be very, very happy if you could send me an email with some feedback. What would make this book more useful for you?


Note that...
Best Regards // Staffan, staffan.noteberg@rekursiv.se




Scala extractor

An extractor in Scala is an object that has a member method called unapply. This method matches a value and takes it apart. We want the following test to pass:

expect(Some("http", "co.tr/jo")) { Url.unapply("http://co.tr/jo") }

The unapply() method reverses the construction process. It takes a URL string and tries to return two strings: the scheme and the domain + path of the URL.

object Url {
  // Optional injection method
  def apply(scheme: String, domainAndPath: String) = scheme + "://" + domainAndPath

  // Mandatory extraction method
  def unapply(urlString: String): Option[(String, String)] = {
    val parts = urlString split "://"
    if (parts.length == 2) Some(parts(0), parts(1)) else None
  }
}

So, what do we need this method for? Well, we can use the extractor to match selector expressions. For example, here are two isUrl() tests:

assert(isUrl("http://po.dk/yt"))
assert(!isUrl("fr.go/hg"))

These tests will pass, with help from our unapply() method, if we add the following selector code:

def isUrl(s: String) = s match {
  case Url(_, _) => // invoke Url.unapply(), not Url.apply
    true
  case _ => 
    false
}

The reason I explained all the Scala features above is that one particularly useful application area of extractors is regular expressions. The following tests will pass if we create a regex extractor:

val Temperature(degrees, scale) = "98°F"
expect("98") { degrees }
expect("F") { scale }

"37°C" match {
  case Temperature(degrees, scale) => {
    expect(37) { degrees.toInt }
    expect("C") { scale }
  }
}

To compile a Regex in Scala, you just invoke the r method of a string. Also note that the three double quotes """ make the string a raw string, meaning that it’s interpreted verbatim. E.g. backslashes will remain backslashes instead of control characters. Here’s the Regex that makes the tests above pass:

val Temperature = """(\d+)\s*°\s*(C|De|F|K|N|Ré|Rø|R)""".r
(\d+) at least one digit, as many as possible, will be captured in group #1
\s* optional white spaces, as many as possible
° literally, the degree sign
\s* optional white spaces, as many as possible
(C|De|F|K|N|Ré|Rø|R) either C (Celsius), De (Delisle), F (Fahrenheit), K (Kelvin), N (Newton), Ré (Réaumur), Rø (Rømer), or R (Rankine) will be captured in group #2

Note that Ré and Rø come before R in the alternation, since a non-deterministic finite automaton (NFA) normally chooses the leftmost alternative – with R first, Ré and Rø could never win.
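
We can sanity-check the ordering in IRB – a quick Ruby sketch on the side, since Ruby’s engine is also a leftmost-alternative NFA:

'98°Ré'.match /(\d+)\s*°\s*(C|De|F|K|N|R|Ré|Rø)/ #=> #<MatchData "98°R" 1:"98" 2:"R"> – R steals the match
'98°Ré'.match /(\d+)\s*°\s*(C|De|F|K|N|Ré|Rø|R)/ #=> #<MatchData "98°Ré" 1:"98" 2:"Ré"> – Ré is tried first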

We can also use extractors with for loops to capture repeated patterns. We have an authlog that needs to be parsed. It shows all system authentication events. Here are the tests:

val logRecords = parseLogRecords("""Oct 14 17:15:01 nin session opened for user staffan by (uid=0)
Oct 14 17:15:02 nin session closed for user staffan
Oct 14 17:17:01 nin session opened for user root by (uid=0)
Oct 14 17:17:01 nin session closed for user root""")

expect(("Oct", "14", "17:15:01", "nin", "session opened for user staffan by (uid=0)")) { logRecords(0) }
expect(("Oct", "14", "17:15:02", "nin", "session closed for user staffan")) { logRecords(1) }
expect(("Oct", "14", "17:17:01", "nin", "session opened for user root by (uid=0)")) { logRecords(2) }
expect(("Oct", "14", "17:17:01", "nin", "session closed for user root")) { logRecords(3) }

The method parseLogRecords() invoked above looks like this:

def parseLogRecords(authlogFromFile: String): List[(String,String,String,String,String)] = {
  val LogEntry = """(\p{L}+)\s+(\d+)\s+([\d:]+)\s+(\S+)\s*(.*)""".r 
  var logRecords = List.empty[(String,String,String,String,String)]

  // iterate over all log entries
  for (logEntry <- LogEntry findAllIn authlogFromFile) logEntry match {
    case LogEntry(month, day, time, host, message) => logRecords +:= ((month, day, time, host, message))
  }
  logRecords.reverse // new log entries were added to the top, so we must reverse the order
}

This regex is straightforward:

(\p{L}+) capture at least one letter, as many as possible, in group #1
\s+ match, but don’t capture, at least one whitespace, as many as possible
(\d+) capture at least one digit, as many as possible, in group #2
([\d:]+) capture at least one digit or colon :, as many as possible, in group #3
(\S+) capture at least one non-whitespace character, as many as possible, in group #4
\s* match, but don’t capture, zero or more whitespaces, as many as possible
(.*) capture zero or more of any character except line breaks, as many as possible, in group #5

Evil anti-munging spider

As you awoke one morning from uneasy dreams, you found yourself transformed in your bed into a gigantic email spammer. In order to send unsolicited electronic mails, you write a spider – a piece of software that collects and extracts email addresses from the websites around the world. The regex you use is quite simple. Here are a few rules you want to implement:

  1. The local part consists of hyphens, pluses, and word characters, with at most one dot between any two of them.
  2. The @ sign mustn’t be preceded by a dot.
  3. The domain part consists of hyphens, English letters, and digits, with at most one dot between any two of them, a dot never followed by a hyphen, and no dot at the very end.

There are other – less used – rules that you decide to ignore. Here’s a Ruby test you want to pass:

assert_equal ['ab@fe.com', 'er@op', 'yo@po.tr', 'with+symbol@example.com', '4@y'],
    extract_emails('ab@fe.com @ge er@op yo@po.tr. Abc.example.com ' +
        'Abc.@example.com with+symbol@example.com 4@y');

In the local part you look for repetitions of allowed characters succeeded by at most one dot. Then there’s the @ which mustn’t be preceded by a dot. In the domain name part you try to find repetitions of allowed characters succeeded by at most one dot. The dot can’t be succeeded by a hyphen. And the whole thing mustn’t end in a dot character.

def extract_emails(html)
  html.scan(/(([-+\w]\.?)+(?<!\.)@([-a-zA-Z\d](\.(?![-.]))?)+(?<!\.))/)
    .map {|matched_groups| matched_groups[0]}
end
([-+\w]\.?)+ local part is hyphens -, pluses +, or word characters \w (a-z, A-Z, digits or underscores _) succeeded by at most one dot – repeated at least once, as many times as possible
(?<!\.)@ the @ sign mustn’t be preceded by a dot
([-a-zA-Z\d](\.(?![-.]))?)+ domain part is hyphens -, English letters, or digits succeeded by at most one dot that isn’t succeeded by another dot or hyphen – repeated at least once, as many times as possible
(?<!\.) negative lookbehind makes sure the final character isn’t a dot

Sometimes you capture substrings of incorrect email addresses. For example, this test:

assert_equal ['abc@example.com'],
    extract_emails('abc..123@example.com (abc@example.com) A@bb@c@example.com')

 1) Failure:
<["abc@example.com"]>
expected but was <["123@example.com", "abc@example.com", "A@bb", "c@example.com"]>.

You have to specify what comes before and after an email address with negative lookarounds:

  html.scan(/((?<![-+\w@.])([-+\w]\.?)+(?<!\.)@([-a-zA-Z\d](\.(?![-.]))?)+(?<!\.)(?![-+\w@]))/)

The new part is in the beginning and the end:

(?<![-+\w@.]) negative lookbehind – email addresses mustn’t be preceded by -, +, \w, @, or a dot .
(?![-+\w@]) negative lookahead – email address mustn’t be succeeded by -, +, \w, or @.

You almost forgot about munged email addresses. Address munging is when people modify their addresses so that email sent to that address will not be delivered. The Jargon File defines ‘mung’ as “Mash Until No Good.” Why would anyone do that? Well, if they mash their addresses in a way that only can be decrypted by humans, then our evil spiders won’t recognize them. What can we do about it? Of course, we can add decryption to our spider. Here are some conventions:

  1. Characters written as HTML character references, decimal &#109; or hexadecimal &#x006d;.
  2. A dummy suffix appended, e.g. .invalid, NOSPAM, or REMOVEME.
  3. The @ and the dots spelled out as at and dot, possibly in brackets or parentheses.
  4. HTML tags wrapped around the @ and the dots.
  5. White-space inserted between all the characters.

We want these tests to pass:

assert_equal ['abc@example.com'], extract_emails('abc@exa&#109;ple.com')
assert_equal ['abc@example.com'], extract_emails('abc@exa&#x006d;ple.com')
assert_equal ['abc@example.com'], extract_emails('abc@example.com.invalid')
assert_equal ['abc@example.com'], extract_emails('abc at example (dot) com')
assert_equal ['abc@example.com'], extract_emails('abc<td>@<td>example<td>.<td>com')
assert_equal ['abc@example.com'], extract_emails('abc<i>@</i>example<i>.</i>com')
assert_equal ['abc@example.com'], extract_emails('a b c @ e x a m p l e . c o m')
assert_equal ['abc@example.com'], extract_emails('abc@exampleNOSPAM.com')
assert_equal ['abc@example.com'], extract_emails('abc@exampleREMOVEME.com')

We add an anti-mung function remove_mung(html) and invoke it from extract_emails(html) before the address search begins:

def remove_mung(html)
  html
    .gsub(/&#(\d+);/) {|codepoint| $1.to_i.chr}
    .gsub(/&#x([\da-fA-F]+);/) {|codepoint| $1.to_i(16).chr}
    .gsub(/\.invalid|NOSPAM|REMOVEME/, '')
    .gsub(/\s+[\p{Pi}\p{Ps}]?\s*at\s*[\p{Pe}\p{Pf}]?\s+/, '@')
    .gsub(/\s+[\p{Pi}\p{Ps}]?\s*dot\s*[\p{Pe}\p{Pf}]?\s+/, '.')
    .gsub(/<([^>]+)>([@.])<\/?\1>/, '\2')
    .gsub(/([-+\w.] )+@( [-+\w.])+/) {|spaced_email| spaced_email.gsub(' ', '')}
end

def extract_emails(html)
  remove_mung(html)
    .scan(/((?<![-+\w@.])([-+\w]\.?)+(?<!\.)@([-a-zA-z\d](\.(?![-.]))?)+(?<!\.)(?![-+\w@]))/)
        .map {|matched_groups| matched_groups[0]}
end

And here are some explanations of the anti-mung code:

&#(\d+); &# literally, then capture at least one digit – as many as possible – in group #1, and finally semicolon ;
&#x([\da-fA-F]+); same, but the digits captured in group #1 are hexadecimal
\.invalid|NOSPAM|REMOVEME .invalid, NOSPAM, or REMOVEME literally – note the escaped dot
\s+ at least one white-space, as many as possible
[\p{Pi}\p{Ps}]? none or one character that is either member of Punctuation Initial quote or Punctuation Open Unicode categories
\s* zero or more white-spaces, as many as possible
[\p{Pe}\p{Pf}]? none or one character that is either member of Punctuation Final quote or Punctuation Close Unicode categories
<([^>]+)> HTML open tag, capture tag id in group #1
<\/?\1> the same tag again, with an optional slash – matching both a repeated open tag and a close tag
[-+\w.] hyphen, plus, a-z, A-Z, digit, underscore, or full stop

Normalize chemical data files

“Chemists are a very imaginative group. They keep thinking of new file formats.” Those are the first two sentences in the Open Babel documentation. Open Babel is a project to facilitate the interconversion of chemical data from one file format to another. It’s a free and open-source project that supports over 100 chemical data file formats.

We have a service where chemists keep posting chemical data files in various data file formats. These files should be converted into the Protein Data Bank format (pdb), and we want the result to reside in the /var/spool/chemical directory for further processing. Not only that, we must also add hydrogens appropriate for pH 4.

The architecture is a PHP function that generates a system call, invoking the Open Babel command line tool obabel. Our mission is to take the path of a new file as input and construct a conversion command. The usage and options we’ll use:

obabel <inputfile> -O <outputfile> -p <pH>

-O write the converted result to this output file
-p add hydrogens appropriate for this pH value

Here are PHPUnit tests:

$this->assertEquals('obabel /tmp/tr.mol -O /var/spool/chemical/tr.pdb -p 4',
                    chemCommand('/tmp/tr.mol'));
$this->assertEquals('obabel /tmp/local/es.xyz -O /var/spool/chemical/es.pdb -p 4',
                    chemCommand('/tmp/local/es.xyz'));
$this->assertEquals('obabel /tmp.local/ru.mol -O /var/spool/chemical/ru.pdb -p 4',
                    chemCommand('/tmp.local/ru.mol'));

PHP uses a regex engine called PCRE. This acronym may be read as Perl Compatible Regular Expressions – even though it isn’t always compatible with Perl. It’s the same library that’s used e.g. in the Apache HTTP Server and the R scripting language. Our solution benefits from greediness:

function chemCommand($path) {
  return preg_replace('%(.*/(.*)\..*)%',
                      'obabel \1 -O /var/spool/chemical/\2.pdb -p 4',
                      $path);
}

Note that three of the dots . in this regex mean any character except line breaks, but one of them matches a literal dot .. The latter is of course the one that is escaped with backslash \. These are the details unveiled:

( start to capture group #1
.* a sequence of any characters except line breaks, as many as possible
/ literally slash /
(.*) capture a sequence of any characters except line breaks, as many as possible, in group #2
\. literally dot .
.* a sequence of any characters except line breaks, as many as possible
) end of capture in group #1

I said that this regex benefits from greediness. Why is it that the / always matches the last slash and \. always matches the last dot? It’s because the leftmost .* in this regex will match as much as possible, only giving back from the end of the string – as little as needed – to make the whole regex match. In /tmp.local/ru.mol for example, .* will first match the entire path, but / and \. will complain. Then .* will give back /ru.mol and that will make / and \. happy.
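
We can watch the give-back from Ruby’s IRB as well – just a sketch, but Ruby backtracks the same way PCRE does:

'/tmp.local/ru.mol'.match %r{(.*/(.*)\..*)} #=> #<MatchData "/tmp.local/ru.mol" 1:"/tmp.local/ru.mol" 2:"ru">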

In some cases we may have local input files; they don’t have any slash / in their path and will not work with our regex above. This test fails now:

$this->assertEquals('obabel fr.sdf -O /var/spool/chemical/fr.pdb -p 4',
                    chemCommand('fr.sdf'));

Failed asserting that two strings are equal.
--- Expected
+++ Actual
@@ @@
-'obabel fr.sdf -O /var/spool/chemical/fr.pdb -p 4'
+'fr.sdf'

We have to make the part from the beginning up to and including the last slash preferred but optional. The optional operator is written ? and its operand is what is to the left.

function chemCommand($path) {
  return preg_replace('%((.*/)?(.*)\..*)%',
                      'obabel \1 -O /var/spool/chemical/\3.pdb -p 4',
                      $path);
}

We add an extra group (pair of parentheses) that captures the part from the beginning up to and including the last slash, and matches that part preferably once but accepts no match. The group containing the file name without dot and extension – that we previously referred to as \2 – will now be referred to as \3. Capturing groups are always numbered in the order of their opening parentheses.
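
The renumbering is easy to see from Ruby’s IRB – a sketch of the same pattern, not the PHP code itself:

'fr.sdf'.match %r{((.*/)?(.*)\..*)} #=> #<MatchData "fr.sdf" 1:"fr.sdf" 2:nil 3:"fr">
'/tmp/tr.mol'.match %r{((.*/)?(.*)\..*)} #=> #<MatchData "/tmp/tr.mol" 1:"/tmp/tr.mol" 2:"/tmp/" 3:"tr">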


Verify bank card numbers

You must verify bank card numbers at client-side. The rules of the numbers are specified in ISO/IEC 7812. The first six digits are called the issuer identification number (IIN). They identify the institution that issued the card. The rest of the number is allocated by the issuer and complies with rules that vary between issuers.

First, we want to clean the input from any non-digits. These JavaScript tests should pass:

assert.equal(eraseNonDigits('5105 1051 0510 5100'), '5105105105105100');
assert.equal(eraseNonDigits('3714-496353-98431'), '371449635398431');

We can search-and-replace any non-digit with the empty string '':

function eraseNonDigits(input) { return input.replace(/\D/g, ''); }
\D anything but a digit

The flag g means globally, i.e. replace all non-digits with the empty string.

Let’s start with MasterCard. The following rules need to be implemented:

  1. The number starts with an IIN in the range 51–55.
  2. The number is exactly 16 digits long.
  3. The number passes the Luhn check.

We wait a moment with the Luhn requirement and start with the other two. These tests must pass:

assert.ok(isMasterCard('5111005111051128'));
assert.ok(isMasterCard('5112345112345114'));
assert.ok(!isMasterCard('4012888888881881')); // visa
assert.ok(!isMasterCard('511234511234511')); // too few digits

The first digit should be 5, the second in the range 1 to 5, and then there must be another 14 digits.

function isMasterCard(cardNumber) {
  return /^5[1-5]\d{14}$/.test(eraseNonDigits(cardNumber));
}
^ from start of input
5 literally match 5
[1-5] character class – match exactly one character in the range 1 to 5
\d{14} match 14 digits
$ and nothing more before end of input

What about the Luhn check digit? Hans Peter Luhn was a German computer scientist who worked for IBM in the US. And guess what – he created the Luhn algorithm. Here’s a test for that:

assert.ok(!isMasterCard('5111005111051123')); // luhn error

It’s not possible to implement Luhn in Regex, at least not in a simple way. For reference, I add a compact imperative implementation of the algorithm here. Then we can invoke it from isMasterCard().

function isLuhn(input) {
  var s = 0; var l = input.length; var p = l % 2;
  for (var i = 0; i < l; i++) {
    var d = parseInt(input.charAt(i), 10); // current digit
    if (i % 2 == p) d *= 2;                // double every second digit, counted from the right
    if (d > 9) d -= 9;                     // same as adding the two digits of d
    s += d;
  }
  return (s % 10) == 0;                    // valid when the digit sum is divisible by 10
}

function isMasterCard(cardNumber) {
  cardNumber = eraseNonDigits(cardNumber);
  return /^5[1-5]\d{14}$/.test(cardNumber) && isLuhn(cardNumber);
}

All tests above pass now and they invoke the isMasterCard() that checks Luhn.

MasterCard isn’t the only bank card type. We want to support all types of cards. Next up is Visa: the number starts with a 4, is 13 or 16 digits long, and must pass the Luhn check.

Our multi-card function is called isBankCard() and should be comfortable with the following tests:

assert.ok(isBankCard('5111005111051128')); // mastercard
assert.ok(isBankCard('5112345112345114')); // mastercard
assert.ok(isBankCard('4012888888881881')); // visa
assert.ok(!isBankCard('511234511234511')); // too few digits
assert.ok(!isBankCard('5111005111051123')); // luhn error

The isBankCard() function checks that it’s either a Visa or a MasterCard number, while isVisa() implements the Visa rules stated above.

function isVisa(cardNumber) {
  cardNumber = eraseNonDigits(cardNumber);
  return /^4\d{12}(\d{3})?$/.test(cardNumber) && isLuhn(cardNumber);
}

function isBankCard(cardNumber) {
    return isVisa(cardNumber) || isMasterCard(cardNumber);
}
^ from the start of the input
4 match 4 literally
\d{12} match 12 digits
(\d{3}) match 3 digits…
? …optionally – zero or once, preferably once
$ and nothing more before end of input

Note that (\d{3})? matches zero or three digits, preferably three. Question mark ? is the optional operator here. If we remove the parentheses and get \d{3}?, then the question mark is the reluctant quantifier modifier, meaning that the quantifier {3} is lazy. So \d{3}? matches exactly three digits – laziness makes no difference when the count is exact.
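
Ruby’s IRB shows the difference – a quick sketch:

'12345'.match /\d{3}?/ #=> #<MatchData "123"> – laziness can't shrink an exact count
'12'.match /(\d{3})?/ #=> #<MatchData "" 1:nil> – the optional group is happy to match nothing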

I leave it to the reader to implement the other bank cards. Below are the rules, and after the table follows a sketch of one of them to get you started.

Issuing network | IIN ranges | Length | Validation
American Express | 34, 37 | 15 | Luhn
Bankcard | 5610, 560221-560225 | 16 | Luhn
China UnionPay | 62 | 16-19 | none
Diners Club Carte Blanche | 300-305 | 14 | Luhn
Diners Club enRoute | 2014, 2149 | 15 | none
Diners Club International | 36 | 14 | Luhn
Diners Club United States & Canada | 54, 55 | 16 | Luhn
Discover Card | 6011, 622126-622925, 644-649, 65 | 16 | Luhn
InstaPayment | 637-639 | 16 | Luhn
JCB | 3528-3589 | 16 | Luhn
Laser | 6304, 6706, 6771, 6709 | 16-19 | Luhn
Maestro | 5018, 5020, 5038, 6304, 6759, 6761, 6762, 6763, 0604 | 12-19 | Luhn
MasterCard | 51-55 | 16 | Luhn
Solo | 6334, 6767 | 16, 18, 19 | Luhn
Switch | 4903, 4905, 4911, 4936, 564182, 633110, 6333, 6759 | 16, 18, 19 | Luhn
Visa | 4 | 13, 16 | Luhn
Visa Electron | 4026, 417500, 4508, 4844, 4913, 4917 | 16 | Luhn
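
To give you a running start, here is what one of them could look like – sketched in Ruby rather than JavaScript, and assuming a hypothetical is_luhn, a Ruby port of the isLuhn() function above:

# American Express: IIN 34 or 37, 15 digits in total, Luhn checked
def american_express?(card_number)
  number = card_number.gsub(/\D/, '') # erase non-digits first
  number =~ /\A3[47]\d{13}\z/ && is_luhn(number)
end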

Independence of attribute order

Users post HTML code to your site and you want to find all hyperlinks. Not only do you need the href URL in the <a> tags, but also the value of the rel attribute, if set. The posted code can of course have more than one hyperlink. We start with some simple Python tests:

self.assertEqual(extractHyperlink("<a href='http://tu.com' rel='prev'>ji</a>"),
                 [("http://tu.com", "prev")])
self.assertEqual(extractHyperlink("pi <a href='http://ap.com' rel='next'>hu</a> te"),
                 [("http://ap.com", "next")])
self.assertEqual(extractHyperlink("<a href='http://ut.com' rel='prefetch'>ji</a> " +
                                  "<a href='http://et.com' rel='noreferrer'>oh</a>"),
                 [("http://ut.com", "prefetch"), ("http://et.com", "noreferrer")])

We save the hyperlinks in a list of pairs. The first item of each pair is the href, and the second is the rel value. The solution uses capture groups. The leftmost left parenthesis in our regex starts the URL capture, and the second leftmost starts the rel value:

import re

def extractHyperlink(html):
  result = []
  regex = "<a href='(.*?)' rel='(.*?)'>"
  for hyperlink in re.finditer(regex, html):
    result.append((hyperlink.group(1), hyperlink.group(2)))
  return result
<a href=' match literally
( start group #1
. match any character, except for line breaks…
*? …at least zero times (closure operator *) as few as possible (reluctant quantifier modifier ?)
) close group #1
' rel=' match literally
(.*?) capture the rel attribute value in group #2
'> match literally

Tests pass. That’s fine. Unfortunately, we’ve ignored a few special cases:

  1. There may be other attributes in the <a> tag, e.g. target='_blank'.
  2. The attributes rel and href may come in reverse order. Remember that it’s user entered input.
  3. There can be any positive number of consecutive white-spaces. We also have to keep in mind that dot . match any character except line breaks. If we want dot to match line breaks as well, then we need to add a dot-match-all flag. The latter is (?s) in Python, which stands for single line mode.

Here are three tests that exemplify the bullets above:

self.assertEqual(
    extractHyperlink("ug <a target='_blank' href='http://ik.com' rel='author'>if</a> ru"),
    [("http://ik.com", "author")])
self.assertEqual(
    extractHyperlink("gu <a rel='search' href='http://la.com'>be</a> ko"),
    [("http://la.com", "search")])
self.assertEqual(
    extractHyperlink("<a href='http://al.com' \n rel='bookmark'</a>"),
    [("http://al.com", "bookmark")])

We’ll use no less than two positive lookaheads (?=...) – one to capture the href attribute and one to capture the rel attribute. This is the trick to read the same input many times. Lookarounds don’t consume the input, they only look around. Reading the same input once to capture href and another time to capture rel makes our regex independent of their order. Here’s the Python solution:

  regex = "(?s)<a (?=[^>]*href='(.*?)')(?=[^>]*rel='(.*?)').*?>"
(?s) flag that tells the Python regex engine that dot . matches every character, even line breaks
<a match literally
(?= start lookahead, but don’t advance the current pointer
[^>] negative character class – any character but the greater-than sign >, and…
* …repeat at least zero times, as many times as possible
href=' match literally, we’re still in the first lookahead
(.*?) capture the href attribute value in group #1
') match single quotation mark ' literally, end first lookahead, and continue reading from where this lookahead started
(?=[^>]*rel='(.*?)') the second lookahead captures the rel attribute value in group #2 and continue reading from where this lookahead started
.*? consume all content: match at least zero of any character – even line breaks – as few times as possible
> match literally

Great! Now, when the light is green – all tests pass – we can refactor the code with confidence. The Python crew was first to implement the named capture feature. Instead of referring to groups with index numbers, like 1 and 2 above, we can name the captured groups with (?P<name>...). The captured text can then be referred to with match.group('name'). This is what the code looks like after that refactoring:

def extractHyperlink(html):
  result = []
  regex = "(?s)<a (?=[^>]*href='(?P<href>.*?)')(?=[^>]*rel='(?P<rel>.*?)').*?>"
  for hyperlink in re.finditer(regex, html):
    result.append((hyperlink.group('href'), hyperlink.group('rel')))
  return result

Don’t split escaped delimiter

You’ve saved your spreadsheet as a text file, where each line consists of fields separated by comma, i.e. you’ve created a CSV file. When we split the lines of a comma separated file, we have to deal with escaped commas. If there’s a backslash \ right before the comma, then the comma isn’t a delimiter. Unless, of course, the backslash itself is escaped. A backslash can be escaped by adding another backslash right before the backslash. In other words, if there’s an odd number of backslashes before a comma, then the comma isn’t a delimiter – it’s an escaped comma character. To start with, we want these C# tests to pass:

CollectionAssert.AreEqual(new [] { "go", @"ji\,eh", @"pa\\",  "ky" },
    CommaSplit(@"go,ji\,eh,pa\\,ky"));
CollectionAssert.AreEqual(new [] { @"\\", "yg", "" }, CommaSplit(@"\\,yg,"));
CollectionAssert.AreEqual(new [] { "as", @"mi\\\," }, CommaSplit(@"as,mi\\\,"));
CollectionAssert.AreEqual(new [] { "du", @" si\\", "" }, CommaSplit(@"du, si\\,"));

What are all these @ signs doing there? Well, C# doesn’t only support regular string literals. It also has verbatim string literals, where almost every character is interpreted verbatim. In particular, backslashes \ are not interpreted as escape characters. Verbatim string literals start with an @ sign. We must still escape backslashes once in our regexes, to make the regex engine understand.

Our approach is to split on commas, unless they are preceded by an odd number of backslashes. The latter can be checked with a negative lookbehind (?<!...) like this:

string [] CommaSplit(string line) {
  return Regex.Split(line, @"(?<![^\\](\\\\)*\\),");
}
(?<! negative lookbehind – check that the following sequence doesn’t precede the comma delimiter…
[^\\] negative character class – any character, except backslash \
(\\\\)* two backslashes, repeated zero or more times, as many as possible – i.e. an even number of backslashes
\\ and finally one backslash
, split on literal comma

Yes, the tests above pass now, but we have forgotten an important case. The sub pattern [^\\] always matches exactly one character. In light of that, we realize that the code above doesn’t allow a line that starts with an escaped comma, i.e. where the first field starts with an odd number of backslashes followed by a comma. So, we’d better add some more tests:

CollectionAssert.AreEqual(new [] { "", "pu" }, CommaSplit( ",pu"));
CollectionAssert.AreEqual(new [] { @"\,pu" }, CommaSplit(@"\,pu"));
CollectionAssert.AreEqual(new [] { @"\\", "pu" }, CommaSplit(@"\\,pu"));
CollectionAssert.AreEqual(new [] { "", "", "pu" }, CommaSplit( ",,pu"));

The first thing our negative lookbehind pattern must check is that the sequence of backslashes is preceded by either a non-backslash character or else the start of the line anchor:

return Regex.Split(line, @"(?<!(^|[^\\])(\\\\)*\\),");
^ either the start of the line anchor…
| …or…
[^\\] …a non-backslash character

Now, the problem is that most regex engines – .NET is an exception – don’t support infinite quantifiers like positive closure + and closure * in lookbehinds. How can you solve this then? Instead of splitting on comma, we can match the tokens – switching to Ruby here:

def comma_split(line)
  line.scan(/(,|\A)(([^,\\]|\\.)*)/).map { |matched_groups| matched_groups[1] }
end
,|\A either we start with a delimiter comma or else, we are at the first token at the start of input anchor \A
(( start group #2, which is the token, and group #3, which matches a single character of it…
[^,\\] either any character apart from comma or backslash
| …or…
\\. …any escaped character
)* close group #3 and repeat it zero or more times, as many as possible – group #2 captures the whole token

It’s not working, though; the following test fails:

assert_equal ['', 'pu' ], comma_split(',pu')

Failure: <["", "pu"]> expected but was <["pu"]>.

It doesn’t help to change the order in the first alternation:

line.scan(/(\A|,)(([^,\\]|\\.)*)/).map { |matched_groups| matched_groups[1] }

Failure: <["", "pu"]> expected but was <[""]>.

Neither can we use positive lookahead to match twice at position 0 of the line:

line.scan(/(\A(?=,)|,|\A)(([^,\\]|\\.)*)/).map { |matched_groups| matched_groups[1] }

Failure: <["", "pu"]> expected but was <[""]>.

Note that alternation | always prefers the leftmost alternative, if possible. Now, take a few minutes to analyze the last three regexes and make sure you really understand why these tests fail.

Should we give up then? No, we need to think outside the box. Infinite quantifiers like positive closure + and closure * are not allowed in lookbehind in most regex engines. However, they are always allowed in lookahead. Can we reverse the input, use lookahead, and then reverse the result back? Of course we can:

line.reverse.split(/,(?!\\(\\\\)*([^\\]|\z))/, -1).map {
    |reversed_field| reversed_field.reverse }.reverse

The -1 is a Ruby technical thing, called the limit parameter. Without it, trailing empty fields won’t be captured. Here’s the doc for it: “If the limit parameter is omitted, trailing null fields are suppressed. If limit is a positive number, at most that number of fields will be returned (if limit is 1, the entire string is returned as the only entry in an array). If negative, there is no limit to the number of fields returned, and trailing null fields are not suppressed.”
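
You can see the limit parameter in action directly in IRB:

',yg,'.split(',') #=> ["", "yg"] – the trailing empty field is suppressed
',yg,'.split(',', -1) #=> ["", "yg", ""] – a negative limit keeps it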

Also note that first the line is reversed, then the fields are captured, then each field is reversed back and finally the order of the fields is reversed back.

, split on comma, unless…
(?! negative lookahead
\\(\\\\)*([^\\]|\z) an odd number of backslashes, followed by either any non-backslash [^\\] or end of string \z

Replace something outside shortcodes

The blogging tool WordPress supports something called the Shortcode API, a simple set of functions for creating macro codes for use in post content. There are self-closing shortcode macros such as [myshortcode]. The API also supports enclosing shortcodes such as [myshortcode]content[/myshortcode]. A shortcode is hooked to a PHP handler function. We want to search and replace a text, but only outside shortcodes and shortcode content. In order to not clutter this book too much, let’s pretend that shortcode ids only consist of letters, i.e. \p{L}. We want these Ruby tests to pass:

assert_equal '[fe]bo ne[/fe]', replace_outside_shortcodes('[fe]bo ne[/fe]', 'e', 'i')
assert_equal 'ar[fe]bo ne[/fe]mi', replace_outside_shortcodes('ar[fe]bo ne[/fe]me', 'e', 'i')
assert_equal 'ri [ gi', replace_outside_shortcodes('re [ ge', 'e', 'i')
assert_equal 'ar[fe]bo ne[/fe]mi di[be]po li[/be]gi',
    replace_outside_shortcodes('ar[fe]bo ne[/fe]me di[be]po li[/be]ge', 'e', 'i')

We’ll do this in two steps. First, we break up the input into tokens that are either shortcodes with content (inside tokens) or else text between two different shortcodes (outside tokens). Second, we check every token and perform search-and-replace only in the outside tokens.

def replace_outside_shortcodes(input, search, replace)
  input.scan(/(\[(\p{L}+)\].*?\[\/\2\]|((?!\[\p{L}+\]).)+)/m).map { |matched_groups|
    token = matched_groups[0]
    token =~ /\A\[\p{L}+\]/ ? token : token.gsub(search, replace)
  }.join
end

The first step is a Regex alternation | – “either an inside token or else an outside token.” In the second step, we use the Ruby match operator =~ to see if it’s an inside token (leave as is) or an outside token (search and replace).

Inside token:

\[ literally left bracket [
\p{L}+ at least one Unicode letter, as many as possible, and capture in group #2
\] literally right bracket ]
.*? shortcode content – zero or more of any character, as few as possible because of the reluctant quantifier modifier ?
\[\/\2\] what was captured above in group #2 should come here as an end tag, enclosed in [/ and ]

Outside token:

(?!\[\p{L}+\]) negative lookahead, make sure we don’t stand in front of a shortcode start tag
. match any character, even line breaks since we have the Ruby m flag
+ repeat as long as we don’t stand in front of a shortcode start tag

Check if previously matched token is of inside type:

\A anchored at the start of input…
\[\p{L}+\] …there’s a shortcode start tag

But, hey, we totally forgot about the self-closing shortcodes. These are shortcodes without content and an end tag. Here are a few more tests that might not pass in the solution above:

assert_equal 'ar[fe]bo ne[/fe]mi di[be]po li[/be]gi[he]vi',
    replace_outside_shortcodes('ar[fe]bo ne[/fe]me di[be]po li[/be]ge[he]ve', 'e', 'i')
assert_equal '[we]fe[te]ge[/we]', replace_outside_shortcodes('[we]fe[te]ge[/we]', 'e', 'i')
assert_equal '[me] ri [be] se [/be]', replace_outside_shortcodes('[me] re [be] se [/be]', 'e', 'i')

The solution is very simple. We just make the content and end tag part optional by enclosing it in parentheses and follow it with the optional quantifier ?. Note that the optional quantifier is greedy, as all quantifiers without the reluctant modifier. This means that it prefers a match – in this case of content and end tag – over not matching.

def replace_outside_shortcodes(input, search, replace)
  input.scan(/(\[(\p{L}+)\](.*?\[\/\2\])?|((?!\[\p{L}+\]).)+)/m).map { |matched_groups|
    token = matched_groups[0]
    token =~ /\A\[\p{L}+\]/ ? token : token.gsub(search, replace)
  }.join
end

Section in Windows INI file

The INI file format used to be the way to configure applications in Microsoft Windows. Even though Microsoft nowadays officially supports the Windows registry and XML, it’s still used, and not only in Windows. The simplest case has sections of name=value pairs. Each pair resides on its own line, and new sections are marked with section names enclosed in brackets, e.g. [ui].

With the following input, we want to parse triplets of <section, name, value>.

[owner]
name=Pi Do
organization=Ar Inc.
 
[database]
server=192.0.4.26
port=5678

We want the following C# test to pass:

 string iniFileContent = "[owner]\nname=Pi Do\norganization=Ar Inc.\n\n[database]\n" +
     "server=192.0.4.26\nport=5678";
 List<Tuple<string, string, string>> expected = new List<Tuple<string, string, string>> {
  Tuple.Create("owner", "name", "Pi Do"),
  Tuple.Create("owner", "organization", "Ar Inc."),
  Tuple.Create("database", "server", "192.0.4.26"),
  Tuple.Create("database", "port", "5678")
};
CollectionAssert.AreEqual(expected, ParseIni(iniFileContent));

We’ll match all name=value pairs, and then associate each of them with the nearest preceding section declaration.

List<Tuple<string, string, string>> ParseIni(string input) {
  List<Tuple<string, string, string>> result = new List<Tuple<string, string, string>>();
  Regex regex = new Regex(@"(?<=^\[(.+)\]$[\s\S]+?)(.+)=(.+)", RegexOptions.Multiline);
  for (Match properties = regex.Match(input); properties.Success; properties = properties.NextMatch()) {
    result.Add(Tuple.Create(properties.Groups[1].Value, properties.Groups[2].Value,
        properties.Groups[3].Value));
  } 
  return result;
}

Before getting into details of the regex, the regex modes need to be explained. First, the flag RegexOptions.Multiline is set and it means that the anchors ^ and $ match at line breaks – i.e. in the beginning and end of each line – and not only in the beginning and the end of the whole input. Second, dot . will match any character except line breaks. That’s the way dot normally works, unless a flag for Dot match all is set.

The main part of the regex is (.+)=(.+). It matches a string, as long as possible. The matched string consists of some characters, followed by an equal sign =, followed by some characters. The characters can not be any kind of line breaks. This main part is preceded by a positive lookbehind that finds the nearest preceding section declaration. Most regex engines don’t support lookbehind with infinite quantifiers like positive closure + and closure *, but C# and .NET do. Kudos to Microsoft for that. The idiom [\s\S] should be thought of as “either whitespace or non-whitespace” – i.e. any character, including line breaks. Here’s a fine-grained explanation:

(?<= positive lookbehind, capture the following pattern from the already consumed input in group #1
^ at the beginning of a line…
\[(.+)\] match at least one non-line break character, as many as possible and literally enclosed by brackets [ and ]
$ …at the end of a line
[\s\S]+? match at least one character (whitespace or non-whitespace), as few as possible because of the reluctant quantifier modifier ?
(.+)=(.+) match a name (at least one character, as many as possible) and value (at least one character, as many as possible) pair, with equal sign = in the middle

No one said that there always needs to be a section. The first name and value pair may appear before the first section declaration. In those cases, the section string in the Tuple should be the empty string. Here’s a test where a section-less top pair has been added:

string iniFileContent = "top_name=top_value\n[owner]\nname=Pi Do\n";
List<Tuple<string, string, string>> expected = new List<Tuple<string, string, string>> {
  Tuple.Create("", "top_name", "top_value"),
  Tuple.Create("owner", "name", "Pi Do")
};
CollectionAssert.AreEqual(expected, ParseIni(iniFileContent));

The optional quantifier ? allows the preceding pattern to be repeated once or not at all – preferably once. We add a non-capturing group around the section declaration match, and add the optional quantifier ? right after. We use a non-capturing group so as not to change the group indexes 1, 2, and 3 used above.

Regex regex = new Regex(@"(?<=(?:^\[(.+)\]$[\s\S]+?)?)(.+)=(.+)", RegexOptions.Multiline);
(?: start non capturing group – creates no group index
^\[(.+)\]$[\s\S]+? match a section name enclosed in brackets [ and ]
)? …if there is one available

Markdown->html: un-htmlize inside pairs of back tick `

In our Markdown to HTML generator, we want to replace all Less Than signs < with &lt; when they appear inside programming code. Browsers may otherwise render parts of our code as HTML tags. Programming code can be enclosed by back ticks ` in Markdown. Here are some Ruby tests we want to pass:

assert_equal "`&lt;`", replaceLt("`<`")
assert_equal "<`&lt;`", replaceLt("<`<`")
assert_equal "`&lt;`<", replaceLt("`<`<")
assert_equal "mi `na` ug<html> da", replaceLt("mi `na` ug<html> da")
assert_equal "`en&lt;html>ny` `so` <tag> po `&lt;html>`",
    replaceLt("`en<html>ny` `so` <tag> po `<html>`")
assert_equal "if `&lt;tag>` av", replaceLt("if `<tag>` av")

In our first attempt, we interpret “is enclosed in back ticks” as “has an odd number of back ticks to the right”. If the < is followed by an even number of `, mixed with other characters, then it’s outside the code blocks. To find an even number of repetitions, we can write something like ((...){2})* – repeat zero or more, as many as possible, and twice inside every repetition.

def replaceLt(input)
  return input.gsub(/<(?!(([^`]*`){2})*[^`]*\z)/, '&lt;')
end
< match less than literally
(?! negative lookahead, assert that < isn’t followed by…
[^`]*` …at least zero non-back ticks, as many as possible, followed by exactly one back tick
){2})* repeat the preceding twice and repeat the repetition at least zero times, as many as possible – i.e. an even number of times
[^`]*\z finally repeat zero or more, as many as possible, non-back ticks before end of input \z

The next step is of course to handle code spans with back ticks inside them. The rule here is that code spans can actually be enclosed by any number of consecutive back ticks, as long as it’s the same number of back ticks prefixing and suffixing the code. The following test must pass:

assert_equal  "ox `&lt;tag>` <tag> `` yu &lt;tag> `mu &lt;tag> ``",
    replaceLt("ox `<tag>` <tag> `` yu <tag> `mu <tag> ``")

We can no longer use the even-number-of-delimiters strategy, since the delimiter before and after need to be the same. Most regex engines apart from .NET don’t support infinite repetition in lookbehind, e.g. closure * or positive closure +. We choose a strategy of first tokenizing the input into either back-ticked code spans or else single characters, like this: ((?<!`)(`+)(?!`).+?(?<!`)\2(?!`)|.).

(?<!`) negative lookbehind – find a place where the preceding character isn’t a back tick
(`+) match at least one back tick, as many as possible and capture this series in group #2
(?!`) negative lookahead – make sure the next character isn’t a back tick
.+? match at least one arbitrary character (with dot match all mode), as few as possible – this is the code inside the back tick closure
(?<!`) negative lookbehind – make sure the last character of the code isn’t a back tick
\2 match the opening series of back ticks captured in group #2 once again
(?!`) negative lookahead – make sure there’s no back tick following the closing back tick series
|. this is the most interesting part: if the pattern above can be matched, then this is a code span token, but else we create a token of a single character

If we put that in a Ruby scan() and feed it with the example above:

p "ox `<tag>` <tag> `` yu <tag> `mu <tag> ``".scan(/((?<!`)(`+)(?!`).+?(?<!`)\2(?!`)|.)/m)

…then we’ll get an array of arrays like this:

[["o", nil], ["x", nil], [" ", nil], ["`<tag>`", "`"], \
[" ", nil], ["<", nil], ["t", nil], ["a", nil], \
["g", nil], [">", nil], [" ", nil], ["`` yu <tag> `mu <tag> ``", "``"]]

Note that the first entry in each inner array is either a code span or a single character. They are our tokens. For each of these tokens, we want to test if it’s a code span – in which all less than signs < should be replaced by &lt; – or if it’s just a single character that must be untouched. All tokens can then be joined to a string:

def replace_lt(input)
  input.scan(/((?<!`)(`+)(?!`).+?(?<!`)\2(?!`)|.)/m).map { |matched_groups|
    token = matched_groups[0]
    token.length == 1 ? token : token.gsub(/</, '&lt;')
  }.join
end

This solution is actually a regex idiom. You have two levels of parsing, where each match on the first level can have multiple matches at the second level. In this case we first wanted to find all code spans enclosed by back ticks of various lengths – first level. Then we wanted to replace every less than sign < in each of the first level matches – second level.

Two more things about grouping. Group #2 is what’s inside the pair of parentheses started by the second leftmost left parenthesis of the regex. And the whole regex is enclosed by parentheses for Ruby technical reasons. The scan() method saves the captured groups in an array. Enclosing the whole regex in parentheses makes the whole match be captured in group #1.
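
A minimal IRB sketch of how scan() decides what to save:

'ab'.scan /ab/ #=> ["ab"] – no groups: the whole match is saved
'ab'.scan /(a)(b)/ #=> [["a", "b"]] – groups present: only the groups are saved
'ab'.scan /((a)(b))/ #=> [["ab", "a", "b"]] – outer parentheses put the whole match in group #1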


Make URLs clickable

Visitors don’t want to cut and paste URLs that other people wrote in our web guest-book. They want them to be clickable and linked. There are many rules about URLs, but as a starter we wish to have the most common http and https URLs rendered as links. We also wish to strip the URL protocol from the rendered text. Here are some Java tests that we want to pass:

assertEquals("is <a href='http://pi'>pi</a> ok", linkifyUrls("is http://pi ok"));
assertEquals("<a href='http://up.do/ae'>up.do/ae</a>", linkifyUrls("http://up.do/ae"));
assertEquals("me <a href='https://ma.da/ph.html'>ma.da/ph.html</a> la",
    linkifyUrls("me https://ma.da/ph.html la"));
assertEquals("(<a href='http://ny/oo'>ny/oo</a>)", linkifyUrls("(http://ny/oo)"));
assertEquals("<a href='http://of/xi+'>of/xi+</a>", linkifyUrls("http://of/xi+"));

Let’s allow any non-whitespace character in the domain name and path. A trailing full stop, right parenthesis, or quotation mark is probably part of the surrounding text and must not be part of what we figure is a URL.

String linkifyUrls(String input) throws Exception {
  return input.replaceAll("(https?://)(\\S+)(?<!\\p{P})", "<a href='$0'>$2</a>");
}
http match http literally
s? preferably, but optionally, literally match s
:// match :// literally and capture everything so far in group #1
\\S+ match at least one non-whitespace and as many as possible – capture them in group #2
(?<! Negative lookbehind…
\\p{P} …assert that the preceding character isn’t a member of the Unicode punctuation category, i.e. the URL doesn’t end with any punctuation

The captured groups used in the replacement text:

$0 The whole match, e.g. http://up.do/ae
$2 URL without protocol, e.g. up.do/ae

It’s too strict to ban any type of trailing punctuation. We can accept that a URL ends in a number sign, percent sign, commercial at, or hyphen-minus. The plus sign is a permitted URL end character, since it’s a member of the Unicode Math Symbol category. Parentheses of any kind, colon, full stop, and all sorts of quotes are characters we don’t want. We can benefit from the sub categories of the Unicode Punctuation category \p{P}.

We also want to render fragments that start with www. as clickable links. We believe that they are http references.

assertEquals("io <a href='http://www.ko/mo'>www.ko/mo</a> op. <a href='http://www.fy'>www.fy</a>",
    linkifyUrls("io www.ko/mo op. www.fy"));
assertEquals("<a href='http://of/xi-'>of/xi-</a> <a href='http://of/xi#'>of/xi#</a>",
    linkifyUrls("http://of/xi- http://of/xi#"));
assertEquals("<a href='http://of/xi?'>of/xi?</a>", linkifyUrls("http://of/xi?"));
assertEquals("<a href='http://www.hi/ho'>www.hi/ho</a>", linkifyUrls("http://www.hi/ho"));

And the solution:

return input.replaceAll("(http(s)?://|(www\\.))(\\S+)(?<![\\p{Pe}\\p{Pf}\".:>])",
    "<a href='http$2://$3$4'>$3$4</a>");
http(s)?:// Either http://, https://, or…
(www\\.) …in worst case www. – alternation operator is left-to-right
(?<! Negative lookbehind…
[ …assert that the following character class isn’t present in the end of the URL…
\\p{Pe} …Unicode Close Punctuation category, e.g. ) and ]
\\p{Pf} …Unicode Final Punctuation category, e.g. closing quotes like ’ and »
\".: …quotation mark, full stop, colon, or greater-than sign
] …end of disallowed character class

The captured groups used in the replacement text:

$2 Either s from https or nothing
$3 Either www. or nothing
$4 URL without protocol, e.g. up.do/ae, or without www.

How can you know the index of the captured groups? Count the left parentheses ( in the regex. The leftmost left parenthesis starts group #1, the second leftmost left parenthesis starts group #2 etc.
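
Counting is easy to verify in IRB – a quick Ruby sketch of the same pattern, minus the lookbehind (the backslashes are single here, since Ruby regex literals don’t need the double escaping of Java strings):

m = 'http://up.do/ae'.match %r{(http(s)?://|(www\.))(\S+)}
m[1] #=> "http://" – group #1 starts at the leftmost left parenthesis
m[2] #=> nil – no s in plain http
m[3] #=> nil – the www. alternative wasn't used
m[4] #=> "up.do/ae"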



Part 2: Understand Regex



Matching One Character

An alphabet is a finite, nonempty set of symbols. Here are some examples of alphabets:

Did you by any chance know that the word alphabet comes from alpha and beta, the first two letters of the Greek alphabet – in turn borrowed from the 3000 year old Phoenician letters aleph and beth – and that aleph meant ox and beth meant house? However, the order of symbols does not make an alphabet unique. Thus, {a, b} is the same alphabet as {b, a}. The empty set – the set composed of zero symbols – is by definition not an alphabet.

A string is a finite (ordered) sequence of symbols chosen from an alphabet. Here are some examples of strings:

The very same symbol can occur multiple times in a string. The order is significant; gof is not the same string as fog.

A language is a countable set of strings from a fixed alphabet. The restriction is that the alphabet is finite. The language may well include an infinite number of strings. Here are some examples of languages:

Our focus in this book is on regular expressions. A regular expression is an algebraic description of an entire language. Not all languages can be described by regular expressions. No regular expression can for example describe the language consisting of all palindromes created with symbols from ASCII. I will try to convince you that regular expressions are ridiculously simple and amazingly powerful. Regular expressions are based on two basic rules:

(1) Empty String Language. The empty string ε is a regular expression. It describes a language that has one single string, namely the empty string.

(2) Single Symbol Language. If a symbol a is a member of an alphabet, then a is a regular expression. It describes a language that has the string "a" as its only member.

That was easy, wasn’t it? In addition to the two basic rules, we have three operators that we’ll catch in four more rules. Out of pure laziness, we have conventions for precedence between the three operators, but more on that later. First, you should try the two basic rules in IRB. We write regular expressions between two slashes. You can e.g. write the regular expression a as /a/. IRB can help us to figure out if a string is in a language described by a particular regular expression:

'a'.match /a/ #=> #<MatchData "a"> – the string a is matched by the regular expression /a/
''.match /a/ #=> nil – the empty string is not matched by the regular expression /a/
'b'.match /a/ #=> nil – the string b is not matched by the regular expression /a/
'a'.match // #=> #<MatchData ""> – only the empty prefix of a is matched by the empty regular expression ε

Only when IRB returns the entire string is the whole string matched by the regular expression.


Four more rules for Regular Expressions

Now that we have two basic rules, we would like to add four more. Then we can build regular expressions recursively from small regular expressions. The first three of these rules describe the only three necessary operators in regular expressions. The fourth rule deals with parentheses:

(3) Alternation. If p and q are regular expressions, then p|q is also a regular expression. The expression p|q matches the union of the strings matched by p and q. Think of it as: either p or q.

(4) Concatenation. If p and q are two regular expressions, then pq is also a regular expression. Note that the symbol for concatenation is invisible. Some literature uses × for concatenation, as in p×q. The expression pq denotes a language consisting of all strings with a prefix matched by p, directly followed by a suffix matched by q.

(5) Closure. If p is a regular expression, then p* is also a regular expression. This is the closure of concatenation with the expression itself. The expression p* matches all strings that can be completely divided into zero or more substrings, each of which is matched by p.

(6) Parentheses. If p is a regular expression, then (p) will be a regular expression as well. This means that we can enclose an expression in parentheses without altering its meaning.

In addition to these rules we will shortly add some convenience rules for operator precedence. They are not necessary, but allow us to write shorter and more readable regular expressions. Quite soon you will also see real regular expression examples with these three operators. These six rules are sufficient for writing all possible regular expressions. All other regex operators you’ve seen are just abstractions and syntactic sugar. Or are they?

Do you remember George Bernard Shaw’s “The golden rule is that there are no golden rules.” and Mark Twain’s “It is a good idea to obey all the rules when you’re young just so you’ll have the strength to break them when you’re old.”? This is exactly what we should do. For now, these rules are all we need. However, in modern regex engines, there are features like back references and lookarounds. To implement these, we need more rules. But until we’re there, it will be very useful to think of regular expressions as a system consisting of only these six rules.

Regular expressions are thus a mathematical theory, and modern regex engines are based on a superset of that theory. With the help of the theory, we can prove the following: for every regular expression there is a finite automaton that solves the same problem, and for every finite automaton there is a regular expression that solves the same problem.

Solving a problem here means to determine whether a string is part of a language. The proofs mentioned above are not reproduced in this book, but they are easily accessed as they exist in every textbook on automata theory. The beauty of this analogy between regular expressions and finite automata is that I can explain several key features of regular expressions for you, with the help of graphs of finite automata. And as a matter of fact, a regex engine is really just a compiler that translates our hand-written regular expressions to computer friendly finite automata, or possibly the more advanced pushdown automata.


Operator #1: Concatenation

Using Rule number 2 and Rule number 4, we can create regular expressions that consist of any sequence of symbols from our alphabet. Rule number 2 said that if the symbol a is in the alphabet, then a is a regular expression. Rule number 4 said that if p and q are two regular expressions, then the concatenation pq is a regular expression as well. The concatenation symbol itself is invisible. Just write the two regular expressions right after each other:

'moda'.match /m/ #=> #<MatchData "m"> – we found the substring m in the string moda
'moda'.match /o/ #=> #<MatchData "o"> - /mo/ is /m/ concatenated with /o/
'moda'.match /mo/ #=> #<MatchData "mo">
'moda'.match /da/ #=> #<MatchData "da">
'moda'.match /moda/ #=> #<MatchData "moda"> - /moda/ is /mo/ concatenated with /da/
'moda'.match /mado/ #=> nil – no match, since the order was changed

There are some handy terms we usually use for substrings:

For any regular expression p, it’s true that εp = pε = p, thus we say that the empty string ε is the identity under concatenation. Concatenation is not commutative, since pq is not equal to qp, but it’s associative, since for any regular expressions p, q, and r it’s true that p(qr) = (pq)r.
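
IRB can illustrate both laws – here the non-capturing group (?:...) serves purely as a pair of brackets:

'moda'.match /m(?:oda)/ #=> #<MatchData "moda"> – p(qr), with p = m, q = od, r = a …
'moda'.match /(?:mod)a/ #=> #<MatchData "moda"> – … describes the same language as (pq)r
'moda'.match /moda(?:)/ #=> #<MatchData "moda"> – concatenating the empty string ε changes nothing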

If we think of concatenation as a product, then regular expressions also support exponentiation. We write the exponent enclosed in braces to the right of the regular expression:

'aaa'.match /aaa/ #=> #<MatchData "aaa">
'aaa'.match /a{3}/ #=> #<MatchData "aaa"> – yes, the string includes 3 concatenated a
'aaa'.match /a{4}/ #=> nil – no, the string doesn't include 4 a

This is obviously just syntactic sugar. All regular expressions that we can write using the exponential operator, can also be unfolded. There are more shortcuts for finite repeated concatenations:

'aa'.match /a?/ #=> #<MatchData "a"> – the optional operator written as question mark
'b'.match /a?/ #=> #<MatchData ""> – zero repeats of a matches the empty string
'aa'.match /a{,2}/ #=> #<MatchData "aa"> – at most two a
'aa'.match /a{1,2}/ #=> #<MatchData "aa"> – at least one a and at most two a
'a'.match /a{1,2}/ #=> #<MatchData "a">

We will soon see that the concatenation of two regular expressions is not the same as the concatenation of two strings. Remember that a regular expression corresponds to a set of strings. For example, if p = {a, b} and q = {c, d}, then pq = {ac, ad, bc, bd}.
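
Peeking ahead to the alternation operator | from the next section, IRB confirms the product:

'ad'.match /(?:a|b)(?:c|d)/ #=> #<MatchData "ad">
'bc'.match /(?:a|b)(?:c|d)/ #=> #<MatchData "bc">
'ca'.match /(?:a|b)(?:c|d)/ #=> nil – c can't come first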


Operator #2: Alternation

From rule number 2 and rule number 3 we can define paradigms – a number of possible patterns. This means that we add two or more languages by applying the set operator union to them. The union of the sets {a, b} and {c, d} is {a, b, c, d}. Hence, it’s all the elements that are in one or more of the sets. In boolean logic, we call this the inclusive or. In regular expressions, it is called alternation and is written with a vertical bar |. Here are some examples:

'a'.match /a|b/ #=> #<MatchData "a"> - a is either a or b
'ab'.match /a|b/ #=> #<MatchData "a"> - leftmost chosen
'ba'.match /a|b/ #=> #<MatchData "b"> - leftmost chosen
'c'.match /a|b/ #=> nil - here we found neither a nor b

Note that when more than one alternative is correct, most regex engines select the leftmost alternative. There are exceptions to this rule. A regex engine based on a DFA or a POSIX NFA selects the longest alternative. Most regex engines are basic NFA and select the leftmost.

Can you write a regular expression that matches all binary strings of length one? The binary alphabet is {0, 1}. Since there aren’t a huge number of binary strings of length one, you can pretty quickly list them: {0, 1}. The regular expression with alternation then becomes 0|1:

'0'.match /0|1/ #=> #<MatchData "0">
'1'.match /0|1/ #=> #<MatchData "1">
'2'.match /0|1/ #=> nil
'10'.match /0|1/ #=> #<MatchData "1">

There are four binary strings of length two – {00, 01, 10, 11}. We can capture them with 00|10|01|11:

'10'.match /00|10|01|11/ #=> #<MatchData "10">
'01'.match /00|10|01|11/ #=> #<MatchData "01">
'12'.match /00|10|01|11/ #=> nil
'11'.match /00|10|01|11/ #=> #<MatchData "11">
'1210'.match /00|10|01|11/ #=> #<MatchData "10">

Maybe you didn’t notice, but we used concatenation in the regular expression above (can you see the invisible concatenation symbol between the two binary digits in the regular expression; if not, maybe you should make an appointment with an optometrist; or maybe not; not even an optometrist can help you see invisible symbols). Each of the binary strings of length two are made up of two concatenated binary strings of length one. Since concatenation has higher precedence than alternation, we didn’t need any parentheses.

Alternation is commutative: for two regular expressions p and q it holds that p|q = q|p. It is also associative: p|(q|r) = (p|q)|r. An interesting and very useful fact is that concatenation distributes over alternation. This means that for all regular expressions p, q and r it’s true that p(q|r) = pq|pr and (p|q)r = pr|qr. The consequence of that is that (0|1)(0|1) = (0|1)0|(0|1)1 = 00|10|01|11. So another way to match any binary string of length two is:

'10'.match /(0|1)(0|1)/ #=> #<MatchData "10">
'01'.match /(0|1)(0|1)/ #=> #<MatchData "01">
'12'.match /(0|1)(0|1)/ #=> nil
'11'.match /(0|1)(0|1)/ #=> #<MatchData "11">
'1210'.match /(0|1)(0|1)/ #=> #<MatchData "10">

The parentheses were needed, of course, because concatenation has higher precedence than alternation. We can also have the empty string ε as one of our alternatives:

'moda'.match /moda|/ #=> #<MatchData "moda"> - either moda or nothing is moda
'moda'.match /mado|/ #=> #<MatchData ""> - either mado or nothing is nothing

This gives me a way to convince you that every regular expression has an infinite number of synonyms. A trivial example: (a) = (a|) = (a||) etc.
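In irb, all three synonyms behave the same:

'a'.match /a/ #=> #<MatchData "a">
'a'.match /a|/ #=> #<MatchData "a">
'a'.match /a||/ #=> #<MatchData "a">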


Operator #3: Kleene Star

All finite languages can be described by regular expressions. You can simply list the strings as an alternation: string1|string2|string3 etc. But there are also languages with an infinite number of strings that can be described by regular expressions. To achieve that, we use an operator called the Kleene star, after the American mathematician Stephen Cole Kleene. If p is a regular expression, then p* is the smallest superset of p that contains ε (the empty string) and is closed under the concatenation operation. It is the set of finite-length strings that can be created by concatenating strings that match the expression p. If p can match any string other than ε, then p* will match an infinite number of possible strings.
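A first taste in irb:

'aaa'.match /a*/ #=> #<MatchData "aaa"> - as many instances as possible
'b'.match /a*/ #=> #<MatchData ""> - zero instances is also a match
''.match /a*/ #=> #<MatchData "">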

The real name of the typographic symbol (glyph) used to denote the Kleene star is asterisk. It’s a Greek (not geek) word and, logically enough, it means little star. Normally, it has five or six arms, but in its original purpose – to mark birth dates in a family tree – it had seven arms. This is a very popular symbol. In Unicode, there are a score of different variants, and many communities have chosen their own meaning of the asterisk. In typography it marks a footnote. In musical notation, it may mean that the sustain pedal of the piano should be lifted. On our cell phones, we use the asterisk to navigate menus in touch-tone systems. So there is no world-wide interpretation of *. However, in this book it’ll always mean the Kleene star operator.

Do you want to see a simple example? The concatenation closure of one single symbol, such as a, is a* = { ε, a, aa, aaa,... }. Want to see a more academic example? The concatenation closure of the set consisting solely of the empty string is that same set: ε* = ε. Want to see a more complicated example? (1|0)* = { ε, 0, 1, 00, 01, 10, 11, 000,... }. Different matches of the repeated expression may thus be concatenated. Can you write a regular expression that matches all binary strings that contain at least one zero? Or all binary strings with an even number of ones?

The operator Kleene star – pronounced /ˈkleɪniː stɑ:r/ klay-nee star – is unary, i.e. it only takes one operand. The operand is the regular expression to the left, which makes it a postfix operator. It has the highest priority of all operators and it is associative. The latter means that if two operators of the same priority compete, the operator closest to the operand wins. Since p** = p*, we say that the Kleene star is idempotent. And I want to emphasize again that p* = (p|)*, i.e. the empty string ε is always present in a closure. We’ll see later on that it is a very common – and nonetheless disastrous – mistake to forget this very fact.

Here are some possible answers to the questions above:

'110'.match /(0|1)*0(0|1)*/ #=> #<MatchData "110"> - all strings with at least one zero
'1111'.match /(0|1)*0(0|1)*/ #=> nil
'1001'.match /((10*1)|0*)*/ #=> #<MatchData "1001"> - all strings with an even number of ones
'11001'.match /((10*1)|0*)*/ #=> #<MatchData "1100">
''.match /((10*1)|0*)*/ #=> #<MatchData ""> - even the empty string has an even number of ones
'1001'.match /((10*1)|0)*/ #=> #<MatchData "1001"> - another way to express 'an even number of ones'
'11001'.match /((10*1)|0)*/ #=> #<MatchData "1100">
''.match /((10*1)|0)*/ #=> #<MatchData "">
'1'.match /((10*1)|0)*/ #=> #<MatchData "">
'010'.match /((10*1)|0)*/ #=> #<MatchData "0">
'01'.match /((10*1)|0)*/ #=> #<MatchData "0">

The positive closure operator + and the at-least-n operator {n,} are shorthands for expressing unbounded repetition. We’ll soon explore them in more detail.


Parentheses

You might be tempted to read the following regular expression as third or fifth row:

'fifth row'.match /third|fifth row/ #=> #<MatchData "fifth row">
'third row'.match /third|fifth row/ #=> #<MatchData "third">

But unfortunately, as you can see, it’s more like either third (alone) or else fifth row. This is due to something called order of operations or operator precedence. The invisible concatenation operator has higher precedence than the alternation operator |.

To oil these wheels, we now add parentheses to our three operators. In a regular expression, the subexpression enclosed in parentheses gets the highest priority:

'fifth row'.match /(third|fifth) row/ #=> #<MatchData "fifth row">
'third row'.match /(third|fifth) row/ #=> #<MatchData "third row">

Note that the parentheses are meta-characters, not literals. They won’t match anything in the subject string. And of course it’s possible to nest parentheses:

'third row'.match /(third|(four|fif)th) row/ #=> #<MatchData "third row">
'fourth row'.match /(third|(four|fif)th) row/ #=> #<MatchData "fourth row">
'fifth row'.match /(third|(four|fif)th) row/ #=> #<MatchData "fifth row">

There are three things we need to remember to know in what order and with what operands the regular expression engine will execute the operators: precedence, position, and associativity.

Here is the table for the operators we have studied so far. Later on, there’s a complete table of all regex operators.

Operator       Symbol       Precedence  Position  Associativity
Kleene star    *            1           Postfix   Associative
Concatenation  (invisible)  2           Infix     Left-associative
Alternation    |            3           Infix     Left-associative

If you think this is hard to remember, then try to memorize the mnemonic SCA. It stands for Star-Concat-Alter, i.e. the order of precedence in regular expressions.
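A quick illustration of SCA:

'abbb'.match /ab*/ #=> #<MatchData "abbb"> - the star binds to b, not to ab
'abab'.match /ab*/ #=> #<MatchData "ab">
'abab'.match /(ab)*/ #=> #<MatchData "abab"> - parentheses override the precedence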


Some examples with *, | and concatenation

Now we have three operators and a small framework. After all this theory, you might wonder whether we can solve any real problems with them. Of course we can. Here are some examples:

All binary strings with no more than one zero:

'01101'.match /1*(0|)1*/ #=> #<MatchData "011">
'0111'.match /1*(0|)1*/ #=> #<MatchData "0111">
'1101'.match /1*(0|)1*/ #=> #<MatchData "1101">
'11010'.match /1*(0|)1*/ #=> #<MatchData "1101">

All binary strings with at least one pair of consecutive zeroes:

'101001'.match /(1|0)*00(1|0)*/ #=> #<MatchData "101001">
'10101'.match /(1|0)*00(1|0)*/ #=> nil
'1010100'.match /(1|0)*00(1|0)*/ #=> #<MatchData "1010100">

All binary strings that have no pair of consecutive zeros:

'1010100'.match /1*(011*)*(0|)/ #=> #<MatchData "101010">
'101001'.match /1*(011*)*(0|)/ #=> #<MatchData "1010">
'0010101'.match /1*(011*)*(0|)/ #=> #<MatchData "0">
'0110101'.match /1*(011*)*(0|)/ #=> #<MatchData "0110101">

All binary strings ending in 01:

'110101'.match /(0|1)*01/ #=> #<MatchData "110101">
'11010'.match /(0|1)*01/ #=> #<MatchData "1101">
'1'.match /(0|1)*01/ #=> nil
'01'.match /(0|1)*01/ #=> #<MatchData "01">

All binary strings not ending in 01:

'010'.match /(0|1)*(0|11)|1|0|/ #=> #<MatchData "010">
'011'.match /(0|1)*(0|11)|1|0|/ #=> #<MatchData "011">
''.match /(0|1)*(0|11)|1|0|/ #=> #<MatchData "">
'1'.match /(0|1)*(0|11)|1|0|/ #=> #<MatchData "1">
'01'.match /(0|1)*(0|11)|1|0|/ #=> #<MatchData "0">
'101'.match /(0|1)*(0|11)|1|0|/ #=> #<MatchData "10">

All binary strings that have every pair of consecutive zeroes before every pair of consecutive ones:

'0110101'.match /0*(100*)*1*(011*)*(0|)/ #=> #<MatchData "0110101">
'00101100'.match /0*(100*)*1*(011*)*(0|)/ #=> #<MatchData "0010110">
'11001011'.match /0*(100*)*1*(011*)*(0|)/ #=> #<MatchData "110">
'1100'.match /0*(100*)*1*(011*)*(0|)/ #=> #<MatchData "110">
'0011'.match /0*(100*)*1*(011*)*(0|)/ #=> #<MatchData "0011">

See if you can find even better regular expressions that solve these problems. Remember that there are an infinite number of synonyms for each regular expression.


Regex-to-FA

For each regular expression – and by regular expression I mean the strict sense: the three operators and the six recursive rules – there is a finite automaton that accepts exactly the same strings. Since this is not a university book in mathematics, I’ll show you inductive reasoning about this rather than a formal proof.

The hypothesis is thus that for an arbitrary regular expression p, we can create a finite automaton that has exactly one start state, no paths into the start state, no paths out of the acceptance state, and that accepts exactly the strings that are matched by p.

All three finite automata above have two states: one is the start state and the other is the acceptance state. The difference is that the first one has an ε-path from start to acceptance, the second one has no path, and the third one has a b-path. Now we’ll continue. Imagine that we have two regular expressions p and q corresponding to finite automata s and t respectively.

Look at the pictures above. Then take a deep breath and feel whether you can translate an arbitrary regular expression to a finite automaton. Finally assess the last picture, where the regular expression (w|bb)* is depicted as a graph using the method described above. Does it feel reasonable?
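If you prefer code to pictures, here is a rough Ruby sketch of the same inductive construction – my own illustration, not a transcription of the figures. Each operator combines sub-automata with ε-transitions (written as nil), and every step preserves the invariants from the hypothesis: one start state, one accept state, no paths into the former or out of the latter:

class NFA
  attr_reader :start, :accept, :edges  # edges: [[from, symbol_or_nil, to], ...]

  def initialize(start, accept, edges)
    @start, @accept, @edges = start, accept, edges
  end

  # One symbol: start --ch--> accept
  def self.literal(ch)
    s, a = Object.new, Object.new
    new(s, a, [[s, ch, a]])
  end

  # Concatenation pq: p's accept is wired to q's start with an ε-edge
  def self.concat(p, q)
    new(p.start, q.accept, p.edges + q.edges + [[p.accept, nil, q.start]])
  end

  # Alternation p|q: a new start forks into both, both accepts join a new accept
  def self.union(p, q)
    s, a = Object.new, Object.new
    new(s, a, p.edges + q.edges +
        [[s, nil, p.start], [s, nil, q.start],
         [p.accept, nil, a], [q.accept, nil, a]])
  end

  # Kleene star p*: skip p entirely, or loop back from p's accept to p's start
  def self.star(p)
    s, a = Object.new, Object.new
    new(s, a, p.edges +
        [[s, nil, a], [s, nil, p.start],
         [p.accept, nil, p.start], [p.accept, nil, a]])
  end

  def accepts?(subject)
    states = eps_closure([start])
    subject.each_char do |ch|
      moved = states.flat_map do |st|
        edges.select { |from, sym, _| from == st && sym == ch }
             .map { |_, _, to| to }
      end
      states = eps_closure(moved)
    end
    states.include?(accept)
  end

  private

  # Follow ε-edges until no new state is reachable
  def eps_closure(states)
    todo, seen = states.dup, states.dup
    until todo.empty?
      st = todo.pop
      edges.each do |from, sym, to|
        next unless from == st && sym.nil? && !seen.include?(to)
        seen << to
        todo << to
      end
    end
    seen
  end
end

# The (w|bb)* automaton from the last picture:
machine = NFA.star(NFA.union(NFA.literal('w'),
                             NFA.concat(NFA.literal('b'), NFA.literal('b'))))
machine.accepts?('wbbw') #=> true
machine.accepts?('wbw')  #=> false
machine.accepts?('')     #=> true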


Regular Regex?

A formal language is a set of finite strings. The strings are finite, but the number of strings in the language may be infinite. A subclass of the formal languages is the regular languages. They are crucial in this book because they can be mapped to finite automata and to regular expressions created with the original three operators: alternation, concatenation, and Kleene closure. These three live in equivalence:

Regular languages can be defined like this: the empty language is regular; the language {ε}, holding only the empty string, is regular; for each symbol a in the alphabet, the language {a} is regular; and if P and Q are regular languages, then so are the union P|Q, the concatenation PQ, and the closure P*. No other languages are regular.

If you think this definition is tricky, that’s OK. However, since regular expressions have become popular as extensions to imperative programming languages, new features have been added. These features are not regular and can’t be implemented by a finite automaton. To distinguish these features from the formal regular expressions, I will use the word regex instead of regular expression.

Verifying prime numbers and writing recursive patterns are two types of problems that can be solved with regexes, even though they go beyond what a regular language can describe.

With the (non-regular) feature back-reference, it’s possible to create a regex that matches exactly the strings of identical characters whose length is not a prime number – so a string of identical characters has prime length precisely when the regex fails to match:

irb(main):001:0> r = /^.?$|^((.)\2+?)\1+$/
=> /^.?$|^((.)\2+?)\1+$/
irb(main):002:0> r.match "22"
=> nil
irb(main):003:0> r.match "333"
=> nil
irb(main):004:0> r.match "4444"
=> #<MatchData "4444" 1:"44" 2:"4">
irb(main):005:0> r.match "55555"
=> nil
irb(main):006:0> r.match "tttttt"
=> #<MatchData "tttttt" 1:"tt" 2:"t">

If we also add the (non-regular) feature recursion, we can match all palindromes:

irb(main):001:0> p =
/\A(?<palindrome>|.|(?:(?<prefix>.)\g<palindrome>\k<prefix+0>))\z/
=> /\A(?<palindrome>|.|(?:(?<prefix>.)\g<palindrome>\k<prefix+0>))\z/
irb(main):002:0> p.match "otto"
=> #<MatchData "otto" palindrome:"otto" prefix:"t">
irb(main):003:0> p.match "dallas sallad"
=> #<MatchData "dallas sallad" palindrome:"dallas sallad" prefix:"s">
irb(main):004:0> p.match "rais air"
=> nil
irb(main):005:0> p.match "1010"
=> nil

Pushdown Automata

Formal regular expressions can be described by a finite automaton, but modern regex engines support non-regular operators. The problem with finite automata is that they don’t have any memory: once they are in a state, they have no idea how they got there. Wise guys invented the stack. When we add a stack to a finite automaton it becomes very powerful, but also quite complex. And of course, it’s not a finite automaton anymore – it’s a pushdown automaton. Why, then, do I describe regex engines as finite automata when pushdown automata are a superset of finite automata?

The finite automaton reads a tape serially. Every new symbol read from the tape initiates a transition in the automaton from the current state to a new state (which may be the same as the current state). If we are in an accept state when we have read all of the tape, then we have a match. In a pushdown automaton, we add a stack. In every iteration, the pushdown automaton may read a symbol from the tape, pop a symbol from the stack, or both. Depending on the current state, the symbol read from the tape and/or the symbol popped from the stack, the pushdown automaton can then make a state transition, push a symbol to the stack, or both. If we end up in an accept state when all of the tape is read and the stack at that point is empty, then we have a match.

The figure above shows a pushdown automaton that matches any input with the same number of blue and white dots – something that is impossible to describe with a finite automaton. This automaton has two states: a start state and an accept state. The four icons in the middle represent that a dot is popped from or pushed to the stack. Suppose that the input is blue-white-white-blue: the first blue dot is pushed, since the stack is empty; the first white dot cancels it, popping the blue dot; the second white dot is pushed; and the final blue dot pops it. The tape is consumed, the stack is empty, and the input is accepted.
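Here is a rough Ruby sketch of the same machine – my own code, not a transcription of the figure – using 'b' for a blue dot and 'w' for a white one. The stack remembers the surplus of one color, and an empty stack after the last symbol means the counts were equal:

def equal_dots?(tape)
  stack = []
  tape.each_char do |dot|
    opposite = dot == 'b' ? 'w' : 'b'
    if stack.last == opposite
      stack.pop        # cancel one surplus dot of the other color
    else
      stack.push(dot)  # remember a surplus dot
    end
  end
  stack.empty?         # accept: tape consumed and stack empty
end

equal_dots?('bwwb') #=> true
equal_dots?('bwb')  #=> false
equal_dots?('')     #=> true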

You can probably feel that the pushdown automaton gives us a tool for all kinds of recursive implementations. Now that you understand how, I’ll go back to finite automata in all explanations where possible. It’s important for you to understand e.g. what backtracking and greediness do to performance and to what gets matched, and you don’t need a complicated pushdown automaton to learn that. We’ll stick to finite automata in this book and at the same time remember that many operators in modern regex engines rely on a stack.

By the way: Did you notice the corner case bug in the pushdown automaton above? Doesn’t an empty input string have the same number of blue and white dots?


Workflow

A workflow is a sequence of connected steps that we have to perform in order. Sometimes we want to execute them as a batch, without caring about intermediate results. And sometimes we want to step forward manually, perhaps to re-iterate at some point. The difference is important, because batching gives us abstraction and stepping gives us transparency. In regex, the workflow is always the same – regardless of whether it’s in the editor, at the command line or in a program. The three steps are compiling, applying and gathering. Either we batch them or we execute them one by one.

Regexes are text. Text is an excellent interface between computers and humans. It’s not perfect for humans and it’s not perfect for computers, but it’s a good compromise. The computer easily translates text into a more computer friendly format – in this case a finite automaton or, in advanced regex cases, a pushdown automaton. This translation is called compilation. In Ruby, it can look like this:

r = Regexp.compile "a|ba*"

Regexp.compile returns an object that wraps the just created finite automaton, and we store it in a variable called r. The wrapper has utility functions that we use in the apply step, right after the compilation – i.e. now.

We now have a compiled regex pattern. The second step is to apply one of the functions in the wrapper object. Note that this object is reusable: we can use r to search some text and later use the same r to validate some other text. Filtering, replacing, and parsing are other applications. In any case, we send a text to r to search in. The text might be the phone directory of Istanbul, William Shakespeare’s tragedy Romeo and Juliet or something as simple as the string "a". Actually, it can be any text, and generally we call it the subject. If we want to search the string "a ba b bb" for patterns that match r, we can write like this in Ruby:

resultset = "a ba b bb".scan r

The variable resultset is an array that consists of "a", "ba", "b", "b", and "b". We can pick the third match with resultset[2], since Ruby arrays are zero-indexed. Gathering is the third and final step in the regex workflow. How and what we gather depends on which function we applied. Applying a verify function returns true or false, and a search-and-replace function returns a modified copy of the subject text. The following Ruby statement prints the string “ceed”:

puts "cabad".gsub r, "e"

We can combine these three steps – compiling, applying and gathering – in one batch. The Emacs key combination C-M-s launches the regex search dialog. If we enter the regex \<\(\w+\)\s-+\1\> there, the cursor immediately travels to the first occurrence of a duplicated word – compiling, applying and gathering in one command. We’ll find all lines that contain either "Romeo" or "Juliet" or both with the following grep command:

grep -E "Romeo|Juliet" romeo_juliet.txt

Batching the regex workflow can of course be done in Ruby, Java, C# or any other programming language as well. However, it’ll always consist of these three steps: compiling, applying, and gathering.


Quantifiers

Sometimes we want to repeat an expression, to make it match more than one instance. If it’s a fixed number of repetitions, we can just repeat the whole expression: the expression LaLaLa is the expression La repeated three times. Beyond fixed repetition, any kind of variable number of repeats is possible with the three original operators. However, it’s verbose to write (La|LaLa|LaLaLa|LaLaLaLa) if we want to express between one and four instances of La. That’s why there are quantifier operators. They don’t add any new functionality to regular expressions – they add something else: crispness.

The two most common repetition requirements are to match at least one instance, or at most one instance. At least one means one, two, three or any other number greater than zero. What will happen if we replace the first instance of a* with e in the subject “caalery”? The answer is “ecaalery”, because “caalery” starts with zero instances of a, and a* says that zero instances is a perfectly fine match. Our intention was to replace the repetition of a’s with a single e. The handy positive closure operator, written as a plus sign, is the solution. The expression a+ should be read: match at least one a. Replacing a+ with e in “caalery” gives us “celery” – a delicious vegetable.
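In Ruby, where sub replaces only the first match:

"caalery".sub /a*/, "e" #=> "ecaalery" - a* is happy with the empty match at position 0
"caalery".sub /a+/, "e" #=> "celery" - a+ demands at least one a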

The optional operator – written as a question mark ? – matches at most one instance of an expression, i.e. either zero or one instance. Say we want to match the noun “chickpea” in both singular and plural. It’s easy: the expression chickpeas? matches “chickpea” as well as “chickpeas”. The quantifier ? binds to the s and makes the s optional.
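In irb:

'chickpea'.match /chickpeas?/ #=> #<MatchData "chickpea">
'chickpeas'.match /chickpeas?/ #=> #<MatchData "chickpeas">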

There’s also a more generic quantifier operator. With braces, we can express any kind of repetition. The normal form has a lowest and a highest acceptable number of matches, separated with a comma. Remember that a quantifier binds only to the expression immediately to the left, so we need parentheses to repeat both letters: the expression (La){1,4} matches at least one La and at most four La. If we replace (La){1,4} with oh in the subject "LaLaLaLaLaLa", we’ll get “ohLaLa”. The first number can be omitted: the expression (La){,4} matches at most four La – it may be zero instances. If we keep the first number, we can instead omit the second: the expression (La){1,} matches at least one La. It’s equivalent to (La)+. The final version of the brace quantifier is without comma and with just one number: the expression (La){14} matches exactly 14 instances of La – nothing more and nothing less.
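Some Ruby examples – the parentheses make the quantifiers repeat the whole group La:

'LaLaLaLaLaLa'.sub /(La){1,4}/, 'oh' #=> "ohLaLa" - greedy: four La replaced
'LaLaLa'.match /(La){2}/ #=> #<MatchData "LaLa">
'LaLaLa'.match /(La){4,}/ #=> nil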

Quantifiers are unary, left associative and they have higher precedence than alternation and concatenation. Consider why the following is true:

The horizontal axis in the figure above shows how many instances a quantifier matches. It starts with zero, which means matching the empty string, and it ends in infinity, i.e. there’s no limit to how many times an instance can be repeated and still match our expression. The first column of dots (to the left) shows that the Kleene closure *, the optional operator ?, and the brace operator with the first number omitted {,m} match zero instances. The same goes for braces with minimum zero {0,m}. The second column of dots shows that the Kleene closure *, the positive closure +, and braces with the second number omitted {n,} have no upper limit on the number of possible matches.


Quantifier equations

The notation of regular expressions, as Stephen Cole Kleene defined it, has only three operators – concatenation, alternation, and Kleene closure. This is enough to define shorthand operators, like the quantifiers. The shorthands are abbreviated symbolic notations that make our expressions more readable and maintainable. To convince you that the quantifiers really are just shorthands for the three base operators, I’ll give you what I call regular equations. An equation is a mathematical statement that two expressions are equal.

Let’s start with the optional quantifier, written as a question mark. Like all quantifiers, it binds to the expression to the left. In the figure above, you can see that saying that the blue dot is optional is equivalent to using alternation to say: either we match nothing, or else we match one blue dot.
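We can test this first equation, p? = (p|), in irb:

'colour'.match /colou?r/ #=> #<MatchData "colour">
'color'.match /colou?r/ #=> #<MatchData "color">
'colour'.match /colo(u|)r/ #=> #<MatchData "colour"> - u? is shorthand for (u|)
'color'.match /colo(u|)r/ #=> #<MatchData "color">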

The positive closure operator is written as a plus sign. Positive closure means that we must match at least one instance of the expression to the left. In the second equation in the figure above, I claim that the positive closure of a blue dot is equivalent to one mandatory blue dot concatenated with the Kleene closure (written as a star) of the blue dot. One blue dot is mandatory, and then follow zero-to-many blue dots. That wasn’t a very provocative proposition, was it?
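The second equation, p+ = pp*, in irb:

'caalery'.match /a+/ #=> #<MatchData "aa">
'caalery'.match /aa*/ #=> #<MatchData "aa"> - a+ is shorthand for aa*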

Did you ever say ‘curly brackets’? People in the US call them braces, in the UK they say squiggly brackets, and in India flower brackets. In Sweden, we say gull wings. In regex, however, we use them as another shorthand quantifier. They hold a pair of numbers that defines the minimum and the maximum number of repetitions of the expression to the left. The third equation in the figure above states that it’s equivalent to a) repeating the blue dot at least two and at most four times, and b) matching two blue dots concatenated with the alternation of zero, one, or two blue dots.
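And the third equation in irb – here the alternatives are ordered longest first, so that a leftmost-biased engine picks the same match as the greedy brace quantifier:

'aaaaa'.match /a{2,4}/ #=> #<MatchData "aaaa">
'aaaaa'.match /aa(aa|a|)/ #=> #<MatchData "aaaa"> - a{2,4} is shorthand for aa(aa|a|)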

Here are some more equations. Scrutinize them to make sure you agree with me that they are true:


Greedy Quantifiers


Backtracking (NFA)


Lazy Quantifiers

Because of backtracking, we might sometimes match more than we hoped for. The task below is to catch all the div tags in an HTML document and put them in a vector. Our naïve solution gives the wrong answer:

'<div>a</div><span>c</span><div>b</div>'.scan /<div>.*<\/div>/ #=> \
  ["<div>a</div><span>c</span><div>b</div>"]

Quantifiers are greedy in regex: they devour as much as they can. The dot in the idiom .* matches anything except newline (remember Barbapapa?), and the star means that this anything is repeated as many times as possible. Since the string ends with a "</div>", the entire substring "a</div><span>c</span><div>b" will be consumed by .*. Problems may arise from greed.

There are many stories about greedy people who claim more than they need. As Louis Blanc wrote as early as 1840 in The Organization of Work: “From each according to his abilities, to each according to his needs.” We don’t think that the star and the dot in the example above need to consume more than the substring "<div>a</div>". Alexander Pushkin describes in The Tale of the Fisherman and the Fish how a magic fish promises to fulfill any of the fisherman’s wishes. The fisherman’s wife starts asking the fish for bigger and better things – and she gets them – until she eventually wants to become Ruler of the Sea. Then the magic fish takes back everything he gave the fisherman’s wife.

As a complement to backtracking, regex also provides forward tracking. Here the quantifier first claims as little as it can and then checks whether the whole expression can be matched. If the entire expression can’t be matched, the quantifier captures one more character and checks again. The method is repeated until either we can match everything or we can confirm that there’s no possible way to create a match. Does it matter whether we use forward tracking or backtracking? Well, the strings that can be matched with backtracking can also be matched with forward tracking and vice versa. But when there are multiple possible matches, forward tracking will sometimes choose a different match than backtracking does.

There is no special symbol for forward tracking quantifiers. Instead, there’s a modifier symbol – written as a question mark – that can be placed right after any quantifier. While * means repeat as many times as possible, *? means repeat as few times as possible. Similarly, we can modify all the other quantifiers:

*? repeat zero or more times, as few as possible
+? repeat one or more times, as few as possible
?? repeat zero or one time, preferably zero
{n,m}? repeat at least n and at most m times, as few as possible
{n,}? repeat at least n times, as few as possible
{,m}? repeat at most m times, as few as possible

Note that the question mark that modifies quantifiers isn’t the same question mark that’s used as the optional quantifier. They can even be used in conjunction, as ?? shows above.

Here are some examples. The first one is the solution to the div tag problem:

'<div>a</div><span>c</span><div>b</div>'.scan /<div>.*?<\/div>/ #=> ["<div>a</div>", "<div>b</div>"]
'aa'.match /a?/ #=> #<MatchData "a">
'aa'.match /a??/ #=> #<MatchData "">
'aaaaa'.match /a{2,4}/ #=> #<MatchData "aaaa"> - at least 2, at most 4, as much as possible
'aaaaa'.match /a{2,4}?/ #=> #<MatchData "aa"> - at least 2, at most 4, as little as possible
'aaaaa'.match /a{2,}/ #=> #<MatchData "aaaaa">
'aaaaa'.match /a{2,}?/ #=> #<MatchData "aa">
'aaaaa'.match /a{,4}/ #=> #<MatchData "aaaa">
'aaaaa'.match /a{,4}?/ #=> #<MatchData "">

DFA


Glob


History


grep -E and egrep


Functions


Quant Algebra


Possessive Quant


Meta chars


Dot


Shrthnds


Unicode


Escape


Character Class


Char Class Escape


Canonical


Non-printable


Capture & Back ref


Non-capturing


Named Captures


Atomic Group


Assertions


Anchors


Modes


Lookahead


Lookbehind


Capture Cond


Named Cond


Lookaround Cond


The scent of regex requisite

Did you know that the famous quote “Some people, when confronted with a problem, think: I know, I’ll use regular expressions. Now they have two problems.” dates all the way back to 1997? Still, most programmers agree that regex has its time and place. But how can we know when to use regex? It’s really simple: we must use our nose and detect the scent of regex requisite. Below is a list of five scents that should put the R-word in our working memory.

Some of these five scents partly overlap, but each of them is well worth remembering. Facing a programming problem, if you can’t detect any of them, then you can be pretty sure that there are better tools than regex.


Multiple files search-and-replace

How can you do a custom refactoring in millions of lines of code? I faced that problem when I was maintaining a cash register system. The code had been developed over ten years by at least as many developers – not together, but in a relay race, one developer after another. It started as a Java 1.3 project and ten years later it was still on 1.3, even though Java’s version number had increased to 6 during that decade. One of several reasons for the product owner to migrate the code to Java 6 was to go from console printing of technical messages to the logging library that has been available in Java ever since version 1.4. Could this be done with regex?

Why even care about replacing system out printing with logging? After all, the cash register consisted of 3000 Java files, some with up to 10000 lines of code each. Such a refactoring is complicated, demands a lot of effort, and risks destroying working software.

Because logging is low-tech debugging. It’s not a substitute for testing; it’s a complement. Properly used, it’ll give us detailed information about failures in production, as well as about nasty behaviors during development. There are many things we can control with Java’s standard logging that are impossible with system out printing. We manage the logging behavior from a configuration file, with no need to recompile the binaries. For example, we can easily change the log level, from the finest debug messages that trace internal variables, to coarse-grained information when an error occurs. It’s even possible to keep a circular buffer in runtime memory with the last few hundred fine-grained low-level programmer messages. Whenever a severe error happens at a customer site, this buffer can be sent to our development department. It gives us the detailed history of what happened before the failure in production. And that’s another advantage: we can send our log messages to file, network, console, database, or whatever is most suitable right now.

I used a search-and-replace tool that works on multiple files. Of course, it supported regex. I decided to replace every invocation of System.out.println with logger.fine and every System.err.println with logger.warning. That was the easy part. I also needed to add an import statement in those – and only those – files that now had logging. It’s a line like this:

import java.util.logging.Logger;

Finally, I had to add a line inside the class declaration that looks something like this:

private static Logger theLogger = Logger.getLogger(HelloWorld.class.getName());

Unfortunately, the HelloWorld part differs in every file: it’s the name of that particular class. So in each file I caught the class name, located the first line of the class body, and inserted this customized line – all with a search-and-replace regex.
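The actual regexes from that project aren’t reproduced here, but a minimal Ruby sketch of the three rewrites could look like this – assuming one top-level class per file, a hypothetical file name, the variable name theLogger from the declaration above, and none of the real-world edge cases:

# Hypothetical single file; the real job looped over 3000 of them.
source = File.read("HelloWorld.java")

# Rewrites 1 and 2: console printing becomes logging
source.gsub!(/System\.out\.println/, "theLogger.fine")
source.gsub!(/System\.err\.println/, "theLogger.warning")

if source.include?("theLogger.")
  # Rewrite 3a: add the import above the first existing import line
  source.sub!(/^import .*$/) { |line| "import java.util.logging.Logger;\n#{line}" }

  # Rewrite 3b: capture the class name and insert the logger declaration
  # right after the opening brace of the class declaration
  source.sub!(/^(\s*(?:public\s+)?class\s+(\w+)[^{]*\{)/) do
    "#{$1}\n  private static Logger theLogger = " \
    "Logger.getLogger(#{$2}.class.getName());"
  end
end

File.write("HelloWorld.java", source)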

It took me one and a half days to figure out and test these three regexes. Then it took another 10 minutes to execute them on 3000 Java files. The alternative, to go through all these files manually, would have taken me at least a month. Working manually, I would certainly also have made many mistakes. And during this manual month, some files would have evolved because of the normal development of new features. This is an example of how you can solve a complicated refactoring cleanly, quickly and correctly with three regexes.


TDD