Week 11

The part of assignment 2 part 2 that I had the biggest problem with was all_regex_permutations.

Not because it’s hard to generate permutations and filter them through a working is_regex. That’s easy; we saw it in class with searching for anagrams. But we were discouraged from using a model like that. Why? Because the number of permutations to generate and filter grows factorially: a string of length n has n! orderings. It wasn’t necessary to speed it up, as zingarod said on the board, but I really wanted to.
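For reference, the discouraged generate-and-filter model looks roughly like this. It is only a sketch: the function name is made up for illustration, and is_valid is a stand-in parameter for a working is_regex, which isn't shown here.

```python
from itertools import permutations
from math import factorial


def brute_force_permutations(s, is_valid):
    '''
    Return the set of orderings of s that is_valid accepts.
    is_valid is a placeholder for a working is_regex (an assumption).
    '''
    # permutations() produces len(s)! tuples, duplicates included.
    return {''.join(p) for p in permutations(s) if is_valid(''.join(p))}


# How many candidates get generated for strings of length 3, 5 and 10.
candidate_counts = [factorial(n) for n in (3, 5, 10)]
```

Even a ten-character string means millions of candidates, and most of them get thrown away by the filter.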

The first thing I knew was that there are some very easy cases. Base case is that your string is a leaf, in which case it has only one regex permutation, itself. And the star is also easy. Just subtract a star from the string, recursively make all the permutations you can make out of what’s left, and stick a star on the end of each one. Because this was going to be my approach in a couple of cases, I wrote a helper function called complement:

def complement(whole: str, s: str) -> str:
    '''
    Helper function for all_regex_permutations and all_combinations.
    Return a copy of whole with one occurrence removed for each
    character in s.
    '''

    for char in s:
        whole = whole.replace(char, '', 1)
    return whole
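A few sample calls show what complement does with repeated characters. The function is repeated so the snippet runs on its own, and the input strings are just illustrative.

```python
def complement(whole: str, s: str) -> str:
    # Remove one occurrence from whole for each character in s.
    for char in s:
        whole = whole.replace(char, '', 1)
    return whole


star_removed = complement('(0|1)*', '*')      # '(0|1)'
brackets_removed = complement('(0|1)', '()')  # '0|1'
one_zero_removed = complement('0.0', '0')     # '.0', only the first '0' goes
```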

Now—and not mutually exclusively, that is, not with an elif but with an if—I had to deal with bracket variations.

My approach started off the same: use complement to subtract the brackets from the string, and then, with each operator, subtract it and make valid regexes out of what’s left.

The difficulty is that you have to split the string across the operator in every possible way in order to find every possible valid regex arrangement of them. (I’m sure there’s a better way, but I haven’t figured it out yet.) And that sounds like… yup… permutations.

Well, it’s actually a little better than permutations. It’s combinations, which means the order doesn’t matter. For example, permutations '01' and '10' are just a single combination, '01'. So you do have many fewer options.

And specifically, I needed combinations of every length. That is, a base string '123' should yield length 1: '1', '2', '3'; length 2: '12', '13', '23'; length 3: '123'. Then, I could put each possible combination on the left side and use my complement function again to find what would go on the right side for each combination.
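The left/right idea can be sketched like this. To keep it short it uses itertools.combinations from the standard library instead of a hand-written generator, and left_right_splits is a made-up name. Both sides are kept non-empty here, which is an assumption of the sketch.

```python
from itertools import combinations


def complement(whole: str, s: str) -> str:
    # Remove one occurrence from whole for each character in s.
    for char in s:
        whole = whole.replace(char, '', 1)
    return whole


def left_right_splits(s: str) -> set:
    '''
    Return a set of (left, right) pairs, where left is a combination of
    the characters of s and right is whatever is left over.
    '''
    splits = set()
    for n in range(1, len(s)):
        for combo in combinations(s, n):
            left = ''.join(combo)
            splits.add((left, complement(s, left)))
    return splits


splits_of_123 = left_right_splits('123')
```

Each side would then still need to be turned into valid regexes recursively. With doubled characters the same combination can still show up in a different order, which is the problem the signature trick further down deals with.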

So the question was: how could I efficiently find all combinations?

And that was what was so hard, for some reason. I kept trying to build it through iterators. My idea was: take every length 1 string, then add each remaining character once. Take all these length 2 strings, and add each remaining character once. Take all these length 3 strings… etc. But I wasn’t quite sure how to do that. I had multiple iterators iterating through it with inner loops. But I realized that this only worked for strings of certain lengths. Once the string was long enough, I needed another iterator to capture everything.

Now… full disclosure: I hadn’t written complement yet, in fact! I had been replicating its effects with longer code, but it was through this process that I eventually realized a lot of what I was doing was solvable with complement.

I actually grabbed someone else’s code from stackoverflow.com, since I’d seen on the board that we can use others’ code if we cite it. I found one that generated all combinations of length n:

def _combinations_of_length(s: str, n: int) -> 'generator':
    '''
    Yield all combinations of s of length n.
    '''

    ## Combinations generation adapted from code found at
    ## http://stackoverflow.com/a/2837693
    for i in range(len(s)):
        if n == 1:
            yield s[i]
        else:
            for rest in _combinations_of_length(s[i+1:len(s)], n-1):
                yield s[i] + rest

Then I built a wrapper function to feed every length into this function and return a set of all of them:

def all_combinations_theirs(s: str) -> set:
    '''
    Helper function for all_regex_permutations.
    Return a set of all combinations of s of len 1 <= len <= len(s).
    '''

    combinations = []

    # Start at 1: a length of 0 yields nothing and only wastes recursion.
    for n in range(1, len(s)+1):
        combinations.extend(_combinations_of_length(s, n))

    return set(combinations)

This works wonderfully fast, but I don’t fully understand what it does. Mostly because we haven’t been taught what yield is. (It has something to do with a generator object… it’s odd. If you look at the actual object returned by a function that yields instead of returns, you get <generator …>. But if you convert it to a list, you get what you’d expect. Need to do more reading.)
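For what it's worth, here is a tiny example of the behaviour described above. count_up_to is a made-up function and has nothing to do with the assignment.

```python
import types


def count_up_to(n):
    '''Yield 1, 2, ..., n one value at a time.'''
    i = 1
    while i <= n:
        yield i   # hand back one value, then pause here until asked again
        i += 1


gen = count_up_to(3)
is_generator = isinstance(gen, types.GeneratorType)
first = next(gen)       # runs the body up to the first yield
remaining = list(gen)   # list() keeps asking until the function ends
exhausted = list(gen)   # a generator can only be consumed once
```

So calling the function doesn't run its body at all; it hands back a generator object, and the body runs a bit at a time as values are requested.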

Later on, I felt like I could in fact understand what the other person’s code was doing, and I realized something: in order to generate every combination of length n, it’s recursively generating everything up to that length anyway! But it’s throwing it out because the person designed that function only to produce those of length n. So I monkeyed around with it a bit, trying to alter it to retain the smaller lengths as well. But I couldn’t do it, and (again) I’m pretty sure it relates to yield. So I decided to write my own version, because I felt like if I could capture those smaller lengths as I generate them, I could actually be faster.
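In hindsight, one way the generator might be altered to keep the shorter lengths is to yield each prefix as soon as it is built, instead of only at the bottom of the recursion. This is a sketch, not the version I ended up using, and combinations_up_to is a hypothetical name.

```python
def combinations_up_to(s, n):
    '''
    Yield all combinations of s of length 1...n.
    '''
    for i in range(len(s)):
        # Yield the short combination right away...
        yield s[i]
        # ...then extend it with the characters that come after it.
        if n > 1:
            for rest in combinations_up_to(s[i+1:], n-1):
                yield s[i] + rest


every_length = list(combinations_up_to('123', 3))
up_to_two = list(combinations_up_to('123', 2))
```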

Well, I wasn’t quite right. I can’t fight the yield. But I did produce my own version at last, after having spent days on it and then given up. I wrote complement in order to do more or less what I described above, and produced this:


def all_combinations(s: str) -> set:
    '''
    Helper function for all_regex_permutations.
    Return a set of all combinations of s of every length up to len(s),
    including the empty string.
    '''

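    def _helper(s: str, n: int, combinations: list=None) -> list:
        '''
        Return a list of lists of combinations of s of length 0...n.
        '''

        # A mutable default ([['']]) would be shared between calls and
        # leak results from one string into the next, so start fresh.
        if combinations is None:
            combinations = [['']]
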
        while len(combinations) < n+1:
            new_length = []
            for combo in _helper(s, n-1, combinations)[-1]:

                for new_char in complement(s, combo):

                    if not any(''.join(sorted(combo+new_char)) ==
                      ''.join(sorted(old)) for old in new_length):
                        new_length.append(combo+new_char)

            combinations.append(new_length)

        return combinations

    result = set()
    for length in _helper(s, len(s)):
        result.update(length)
    return result

Not too pretty but it gets the job done.

A couple of clarifications. One, note that the embedded helper here separates the overall list into sublists, one for each length. This is just to help identify the anchors to build the combinations that are one character longer than the previous one. Then they’re all joined in the last bit of code in the main function.

Two, note this “any” business. The problem with this code is that it essentially overgenerates, almost generating all permutations (which is bad! bad! bad!). If a character is doubled, as you will often have in regexes, you’ll have variations of the same combination, i.e. permutations. So I used the signature technique zingarod showed us in lecture: I alphabetize the combination and see if any other combinations already generated match it when alphabetized, and I only add it if it’s unique. At first this might just seem like a filter that doesn’t increase efficiency. It would be — if we only ran it at the end. But if at each stage we cut the permutations down to the unique combinations, then we won’t have as many in our anchor and therefore won’t generate as many permutations in the next stage, so the total number of permutations generated and filtered will be far fewer.
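The signature check could also be done with a set of signatures already seen, so each candidate is looked up once instead of being compared against every earlier combination. A sketch of a single stage, with next_length as a made-up name and complement repeated so it runs on its own:

```python
def complement(whole: str, s: str) -> str:
    # Remove one occurrence from whole for each character in s.
    for char in s:
        whole = whole.replace(char, '', 1)
    return whole


def next_length(s: str, previous: list) -> list:
    '''
    Return the unique combinations of s that are one character longer
    than the combinations in previous.
    '''
    new_length = []
    seen = set()
    for combo in previous:
        for new_char in complement(s, combo):
            # The alphabetized string is the signature of the combination.
            signature = ''.join(sorted(combo + new_char))
            if signature not in seen:
                seen.add(signature)
                new_length.append(combo + new_char)
    return new_length


length_1 = next_length('010', [''])
length_2 = next_length('010', length_1)
length_3 = next_length('010', length_2)
```

With the doubled '0', each stage stays small because the repeats are dropped before they can seed the next stage.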

Since I had a working alternative to the borrowed code, I decided to ask zingarod if it was okay that I was using this much cited code. He said he’d prefer if I used my own. So I had to suck up the fact that my code was not fast. But at least it was my own solution!

There was one point of slowdown that I could address, though. After generating all these combinations, remember that my solution for binary regexes in all_regex_permutations (oh yeah… all the way back up there!) required that I find all regex permutations of each combination and of its complement, and stick them on the left and right sides of the operator. Running all_regex_permutations recursively on a large list of combinations was going to be very slow. So I did a final optimization by putting many short-circuits in all_regex_permutations, i.e. quick heuristics that allow me to predict when a string will never make a good regex no matter how many permutations I make of it. These included: Does it lack at least one leaf? Is the number of leaves anything but the number of binary operators plus one? Is the number of brackets not exactly double the number of binary operators? And so on. This did save a lot of time.
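The three checks listed above might look something like this. The alphabet is an assumption on my part for the sketch (leaves '0', '1', '2', 'e'; binary operators '|' and '.'), and could_be_regex is a made-up name. Passing the check doesn't mean a valid regex exists, only that the string hasn't been ruled out yet.

```python
LEAVES = '012e'   # assumed leaf symbols
BINARY = '|.'     # assumed binary operators


def could_be_regex(s: str) -> bool:
    '''
    Return False if no arrangement of the characters of s can be a
    valid regex, judging by character counts alone.
    '''
    leaves = sum(s.count(c) for c in LEAVES)
    binary = sum(s.count(c) for c in BINARY)
    if leaves == 0:
        return False
    if leaves != binary + 1:
        return False
    # Each binary operator needs exactly one '(' and one ')'.
    if s.count('(') != binary or s.count(')') != binary:
        return False
    return True
```

For example, could_be_regex(')0(|1') is True because the characters can be rearranged into '(0|1)', while could_be_regex('01') is False because two leaves need a binary operator between them.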

So… long story short: my solution does essentially involve generating lots and lots of permutations, despite my best efforts, but it does cut down on them notably by (1) focusing on combinations rather than permutations, (2) using a signature to eliminate extras, and (3) short-circuiting most of the recursions of all_regex_permutations on the final list of combinations. Yay, compromise!

P.S. Sorry no week 10. But I did so much reading for week 9’s list of slogs I like, that, you know…. ah, what am I saying. There’s no excuse.

Slogs I like

I just finished reading one or two posts of every slog (okay, a couple I skimmed). Here are the ones that, for one reason or another, I particularly like: they’re funny, they’re well-designed, they’re informative, they’re well-written, etc.

http://sajidcsc148.blogspot.ca/

http://ilovecsc148.wordpress.com/

http://ddc-csc148slog.blogspot.ca/

http://compscikitty.wordpress.com/

http://whatshappeningcsc148.blogspot.ca/

http://killerapp.4nh.pw/

http://fatmothinmycode.wordpress.com/

http://thegreatmoth.blogspot.ca/

http://pythonmenthol.tumblr.com/

http://csc148sharon.wordpress.com/

http://pythonbivittatus.wordpress.com/

http://pcsc148slog.wordpress.com/

http://hannay148.wordpress.com/

On the other hand, repeating keywords rules a slog out for me. Not updating in over a month also makes it a non-visit, even if that first and only post was brilliant. :/

Also, observation: All poor slogs are alike; each good slog is good in its own way.

Hope the list leads you to an enjoyable read!

Week 8

Assignment 2 part 1 took me forever.

Not because the coding itself was hard. It wasn’t. Nor because I couldn’t figure out how to do it at a design level. I could. It was because no particular design seemed better than any other. Or it did but it contradicted the instructions.

My first idea was to make three nodes: BinaryNode, UnaryNode, and LeafNode. Each took its appropriate number of children and a head. They each had their own methods. There was no inheritance at all, actually, until I wanted error-checking and then I made an empty Node class that any child parameter had to be an instance of.

But typing in these classes in the shell to construct them was so annoying. Three different classes to type and you had to remember how many children for each one?

So I decided to make a RegexTree class that moderated them all, taking a proper regex string as its one parameter and parsing it, distributing it to the various Node classes. It was beautiful! Then I realized that this was part of part 2 and abandoned it.

For a while I continued with that RegexTree wrapper. I just made it take a head and two child parameters, made optional by giving them a default value of None. Then, based on how many children it was passed, it would create BinaryNodes, UnaryNodes, and LeafNodes, like it had done before.

But I realized at this point that RegexTree wasn’t really serving the purpose it had served when I made it parse a full regex string. Now it was just kind of an extra layer. So I returned it to just having those three nodes.

I still wasn’t happy. Remember that I didn’t like how you had to type them so carefully into the shell. Then I realized: wait, if RegexTree was able to sort out its parameters and make the proper Nodes… why not just have it be all those Nodes? That is, why not just make it behave like a BinaryNode if both children are given to it, like a UnaryNode if only one child is given to it, and like a LeafNode if no children are given to it? Make it adaptable. Also, I could have all my error-checking in one place, right in the __init__ method, rather than scattered through various classes’ __init__ methods.

So I implemented that, and it worked fine. And now you just wrote RegexTree('0') or RegexTree('*', RegexTree('0')) or RegexTree('.', RegexTree('0'), RegexTree('1')), and it sorted everything out for you. Much easier typing in the shell! I even considered renaming it ReTree to save time.
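A stripped-down sketch of that adaptable class, without the error-checking on heads. The parameter names left and right and the kind method are assumptions for illustration, not what I submitted.

```python
class RegexTree:
    '''
    A node that acts as a leaf, unary or binary node depending on how
    many children it is given.
    '''

    def __init__(self, head, left=None, right=None):
        if left is None and right is not None:
            raise ValueError('a single child must be the first child')
        self.head = head
        # Keep only the children that were actually passed in.
        self.children = [c for c in (left, right) if c is not None]

    def kind(self):
        # 0 children -> leaf, 1 -> unary, 2 -> binary.
        return ('leaf', 'unary', 'binary')[len(self.children)]


leaf = RegexTree('0')
star = RegexTree('*', RegexTree('0'))
dot = RegexTree('.', RegexTree('0'), RegexTree('1'))
```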

That was what I submitted first.

But then I read the thread that said you shouldn’t use just a single class, even if it gets the job done. I was dismayed. I didn’t know why this wasn’t preferable. (Since then, zingarod has pointed out that building the behaviours of three classes into one class isn’t too tricky, but what if there were more basic Node types? Having one class adapt to handle all of them would start to look very confusing.)

At first I resisted it and thought, “Eh, let them do with it what they want.” But then I decided I wanted the marks, and I wanted to understand why more than one class was better, so I sorted through all my regex_design, regex_design_old, regex_design_old_2, regex_design_tentative, etc. files and reinstated BinaryNode, UnaryNode, and LeafNode.

I also remembered zingarod saying that we should use inheritance, which I’d used none of except that empty Node class I described above. So I took a leap of faith and decided to give that Node class something to do.

The first thing was realizing that __init__ was very similar for all three of them. But in order to justify having more than one class (i.e., finding a reason not to just turn Node into the RegexTree class again), I decided to implement a new __init__ in each child class that nevertheless benefited from Node’s __init__. So I had Node’s __init__ be able to take both optional children. And then I made each child class’s __init__ call Node’s __init__ inside it, passing it None for each child parameter I didn’t want it to have. That way, every node still has a slot for each possible child, even the children it doesn’t have, but each class still behaves uniquely and appropriately.

Having the same number of children, even if some were None, made __eq__ totally inheritable. It just required that instead of isinstance(other, …), I used type(self) == type(other).
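Roughly, the inheritance described in the last two paragraphs could be sketched as below. The attribute names left and right are assumptions, and the error-checking is left out.

```python
class Node:
    def __init__(self, head, left=None, right=None):
        self.head = head
        self.left = left
        self.right = right

    def __eq__(self, other):
        # Comparing exact types means a LeafNode never equals a plain
        # Node, even when the head and (missing) children match.
        return (type(self) == type(other)
                and self.head == other.head
                and self.left == other.left
                and self.right == other.right)


class LeafNode(Node):
    def __init__(self, head):
        Node.__init__(self, head, None, None)


class UnaryNode(Node):
    def __init__(self, head, child):
        Node.__init__(self, head, child, None)


class BinaryNode(Node):
    def __init__(self, head, left, right):
        Node.__init__(self, head, left, right)
```

Because every node has both child attributes, the one inherited __eq__ works for all three classes.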

__repr__ could easily have been inheritable too, except that I decided I didn’t want the output to have 'None' where there were no children. That is to say, if Node had the only __repr__ and it was inherited, the outputs would have looked like “BinaryNode('.', LeafNode('0'), LeafNode('1'))” but also “UnaryNode('*', LeafNode('0'), None)” and “LeafNode('0', None, None)”, which is hideous. So I just gave each one a unique __repr__.

I also instituted a method called “checker” that gets called during __init__. The method makes sure that the Node has the right kind of head for its number of children and that all children are nodes. It handles errors less specifically than my previous means did, but I could still distinguish a HeadError and a ChildError, and most importantly it could now be inherited.

The final note is that zingarod explained to me the design that he’ll post soon and that we’ll use as our base for part 2. He made a separate Node class for every possible regex item! The advantage of mine is that it can be scaled. I have three constants: BINARY, UNARY, and LEAVES, all tuples that list the possible heads of each node type. If we extended our alphabet, all you’d have to do is add the new values to my tuples. For example, if we were given a second unary operator, you wouldn’t have to create a new class for it; you’d just add it to UNARY and UnaryNode would handle it.
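To make the scaling point concrete, here is a cut-down sketch of the constants and the checker together. The contents of the tuples and the allowed_heads attribute are assumptions for the sketch, and only two of the three node classes are shown.

```python
BINARY = ('|', '.')            # assumed binary heads
UNARY = ('*',)                 # add a new unary operator here
LEAVES = ('0', '1', '2', 'e')  # assumed leaf heads


class HeadError(Exception):
    pass


class ChildError(Exception):
    pass


class Node:
    allowed_heads = ()

    def __init__(self, head, left=None, right=None):
        self.head, self.left, self.right = head, left, right
        self.checker()

    def checker(self):
        # Inherited as-is; each subclass only says which heads it allows.
        if self.head not in self.allowed_heads:
            raise HeadError(repr(self.head))
        for child in (self.left, self.right):
            if child is not None and not isinstance(child, Node):
                raise ChildError(repr(child))


class LeafNode(Node):
    allowed_heads = LEAVES

    def __init__(self, head):
        Node.__init__(self, head)


class UnaryNode(Node):
    allowed_heads = UNARY

    def __init__(self, head, child):
        Node.__init__(self, head, child)
```

A new unary operator would mean changing only the UNARY tuple; UnaryNode itself stays untouched.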

The advantage of zingarod’s is, I’m guessing, that each separate Node can have very specific behaviour, which I think will help with matching strings. That’d be kind of hard to do right now with my design—you’d have to know which binary operator you want, not just that your operator is binary, which is all my class tells you. So in terms of simplicity of class expansion, my approach is easier, but when you actually want to use these regexes for something, zingarod’s is going to be much better.

Looking forward to part 2. Especially since I’ve got the parsing down already. I just have to rescue it from regex_design_variation_3_old.py.