Ad

(This is a draft for a potential Kata series, left as a Kumite for now)

Regex engine from scratch

Throughout this series you'll be gradually piecing together a regex engine, with each successive Kata adding an additional piece of syntax into the mix.

You may wish to copy code you've developed in earlier parts of the series for later parts so that you can gradually build up the engine as you go, or you can cater each solution to the specific syntax proposed. The latter might be better for the first few steps as they're very simple.

More or less the same

Following on from the previous part, we'll now be adding support for the simple quantifiers: *, +, ?. Support for braced quantifiers such as {3} will not be added, so braces should be treated like letters.

In part 1 you implemented the dot syntax, so you must implement that in this part too.

All other syntax should be treated as letters (and just matched on), even if they would be valid regex syntax in a real regex parser.

For example, even though [a] would typically be a character class containing the letter a, we haven't implemented character classes yet, so it should be treated as \[a\]. This also means you can't just import re, it won't work!

Your function will be provided with a pattern as a string. This pattern should be handled somehow to produce a function that can be used to match on a string. You should expect this prepared function to be called multiple times with different inputs, it can't match just once.

The produced match function should behave as if it had anchors on either end, so it should match the entire string.

Examples

match('.*')('cat') => True # .* should match anything
match('c.?t')('ct') => True # the dot is made optional by the ?
match('c.?t')('cat') => True
match('c[a]t')('cat') => False # We haven't implemented character classes
match('c[a]t')('c[a]t') => True
match('c{2}t')('cct') => False # We also won't be implementing braced quantifiers
match('c{2}t')('c{2}t') => True
# This is a k=1 solution and I'm not too happy with it.
# I was hoping to try and find an okay balance for something
# that could be written and understood in a Kata solution without being a full
# blown automata, but it's kind of unwieldy.

class PeekableIterator(object):
    def __init__(self, iterator):
        self.iterator = iterator
        self.ahead = None
        
    def peek(self):
        if self.ahead:
            return self.ahead
        
        self.ahead = self.read()
        return self.ahead
    
    def read(self):
        if (result := self.ahead):
            self.ahead = None
            return result
        
        try:
            return next(self.iterator)
        except StopIteration:
            return ''
            

def match(pattern):
    ops = []
    for part in pattern:
        if part == '?':
            op = ops.pop()
            ops.append(('optional', op))
        elif part == '*':
            op = ops.pop()
            ops.append(('many_or_none', op))
        elif part == '+':
            op = ops.pop()
            ops.append(('many', op))
        else:
            ops.append(('match', part))
    
    def do_op(stream, op, component):
        if op == 'match':
            return (component == '.' and stream.peek()) or stream.peek() == component
        elif op == 'optional':
            op, component = component
            if do_op(stream, op, component) and op == 'match':
                stream.read()
            return True
        elif op == 'many_or_none' or op == 'many':
            original_op = op
            op, component = component
            i = 0
            while do_op(stream, op, component):
                if op == 'match':
                    stream.read()
                i += 1
            return original_op == 'many_or_none' or i > 0
        raise "Huh?"
    
    def matcher(string):
        stream = PeekableIterator(iter(string))
        
        for op, component in ops:
            if not do_op(stream, op, component):
                return False
            
            if op == 'match':
                stream.read()
        return stream.read() == ''
    return matcher
Code
Diff
  • def match(pattern):
        def matcher(string):
            return len(pattern) == len(string) and all(a == '.' or a == b for a, b in zip(pattern, string))
        return matcher
    • def match(pattern):
    • def matcher(string):
    • return len(pattern) == len(string) and all(a == b if a != '.' else True for a, b in zip(pattern, string))
    • return len(pattern) == len(string) and all(a == '.' or a == b for a, b in zip(pattern, string))
    • return matcher

(This is a draft for a potential Kata series, left as a Kumite for now)

Regex engine from scratch

Throughout this series you'll be gradually piecing together a regex engine, with each successive Kata adding an additional piece of syntax into the mix.

You may wish to copy code you've developed in earlier parts of the series for later parts so that you can gradually build up the engine as you go, or you can cater each solution to the specific syntax proposed. The latter might be better for the first few steps as they're very simple.

It all begins with a dot.

For the first step we're going to implement regex's . syntax.

To keep things simple, every other symbol except for a period should be treated as if it was inside a character set, even if the symbol would be treated as syntax by a real regex engine.

For example, the input [.], even though it looks like a character class, should be handled as if it was \[.\] because we're only implementing the dot metacharacter for now. This also means you can't cheat and just import re!

Your function will be provided a pattern as a string. This pattern should be handled somehow to produce a function that can then be used to match on a string. You should expect this prepared function to be called multiple times with different inputs, it can't match just once.

The produced match function should behave as if it had anchors on either end, so it should match the entire string.

Examples

match('c.t')("cat") => True
match('c.t')("cbt") => True
match('c.t')("catastrophic") => False (must match entire string)
match('c.t')("abc") => False (not a match)
match('[.]')('[a]') => True (we don't support character classes, treat them like letters)
match('[a]')('a') => False (we've not implemented character classes yet)
def match(pattern):
    def matcher(string):
        return len(pattern) == len(string) and all(a == b if a != '.' else True for a, b in zip(pattern, string))
    return matcher
Code
Diff
  • # A whopping 2 character saving.
    cipher_breaker=lambda s:"".join({c:chr(219-ord(c))," ":c}[c]for c in s)
    • cipher_breaker=lambda s:"".join(c==" "and" "or chr(219-ord(c))for c in s)
    • # A whopping 2 character saving.
    • cipher_breaker=lambda s:"".join({c:chr(219-ord(c))," ":c}[c]for c in s)
Code
Diff
  • # Shorter form of the last version.
    add=lambda s,o:sum(filter(lambda n:not o or~(n+o)&1,map(int,s)))
    • # 1 & 1 is 1, 10 & 1 is 0, we know that the least significant bit for any integer is 1 for odd and 0 for even
    • # so we can just compare the least significant bit of both values to answer the question
    • add=lambda s,o:sum(filter(lambda n: n & 1 == o & 1 if o and o < 3 else n, map(int, s)))
    • # Ever so slightly shorter
    • add=lambda s,o:sum(filter(lambda n: ~(n + o) & 1 if o and o < 3 else n, map(int, s)))
    • # Shorter form of the last version.
    • add=lambda s,o:sum(filter(lambda n:not o or~(n+o)&1,map(int,s)))
Code
Diff
  • # 1 & 1 is 1, 10 & 1 is 0, we know that the least significant bit for any integer is 1 for odd and 0 for even
    # so we can just compare the least significant bit of both values to answer the question
    add=lambda s,o:sum(filter(lambda n: n & 1 == o & 1 if o and o < 3 else n, map(int, s)))
    
    # Ever so slightly shorter
    add=lambda s,o:sum(filter(lambda n: ~(n + o) & 1 if o and o < 3 else n, map(int, s)))
    • add=lambda s,o:sum(filter((id,(1).__and__,lambda x:~x&1)[o%3],map(int,s)))
    • # 1 & 1 is 1, 10 & 1 is 0, we know that the least significant bit for any integer is 1 for odd and 0 for even
    • # so we can just compare the least significant bit of both values to answer the question
    • add=lambda s,o:sum(filter(lambda n: n & 1 == o & 1 if o and o < 3 else n, map(int, s)))
    • # Ever so slightly shorter
    • add=lambda s,o:sum(filter(lambda n: ~(n + o) & 1 if o and o < 3 else n, map(int, s)))
Fundamentals
Numbers
Data Types
Integers
Code
Diff
  • public static class Kata
    {
        public static bool IsOdd(int input, int div)
        {
            return input % div == 0;
        }
    }
    • public static class Kata
    • {
    • public static bool IsOdd(int input, int div)
    • {
    • if (input % div == 0)
    • return true;
    • return false;
    • return input % div == 0;
    • }
    • }