Week 11

The part of Assignment 2, Part 2, that gave me the biggest problem was all_regex_permutations.

Not because it's hard to generate permutations and filter them through a working is_regex. That's easy; we saw the same idea in class when searching for anagrams. But we were discouraged from using a model like that. Why? Because the number of permutations to generate and filter grows factorially (n! for a string of length n). It wasn't necessary to speed it up, as zingarod said on the board, but I really wanted to.

The first thing I knew was that there are some very easy cases. The base case is when your string is a single leaf, in which case it has exactly one regex permutation: itself. The star case is also easy: subtract one star from the string, recursively make all the regex permutations you can out of what's left, and stick a star on the end of each one. Because subtracting characters was going to be my approach in a couple of cases, I wrote a helper function called complement:

def complement(whole: str, s: str) -> str:
    '''
    Helper function for all_regex_permutations and all_combinations.
    Return a string equal to whole with all characters in s removed.
    '''

    for char in s:
        whole = whole.replace(char, '', 1)
    return whole
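
Just to make those easy cases concrete, here's a rough sketch of the base and star branches. (Only a sketch: I'm assuming the assignment's leaves are the single characters '0', '1', '2' and 'e', and I'm leaving the bracketed cases for below.)

def all_regex_permutations_sketch(s: str) -> set:
    '''
    Sketch of the easy cases of all_regex_permutations.
    '''

    # Base case: a lone leaf has exactly one regex permutation: itself.
    if len(s) == 1:
        return {s} if s in '012e' else set()

    perms = set()
    # Star case: pull one '*' out of the string, permute what's left,
    # and stick a '*' on the end of each result.
    if '*' in s:
        for perm in all_regex_permutations_sketch(complement(s, '*')):
            perms.add(perm + '*')
    # ... the bracketed cases would be handled here as well ...
    return perms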

Now (and not mutually exclusively, that is, with an if rather than an elif) I had to deal with the bracketed variations.

My approach started off the same: use complement to subtract the brackets from the string, and then, with each operator, subtract it and make valid regexes out of what’s left.

The difficulty is that you have to split the string across the operator in every possible way in order to find every possible valid regex arrangement of them. (I’m sure there’s a better way, but I haven’t figured it out yet.) And that sounds like… yup… permutations.

Well, it's actually a little better than permutations. It's combinations, which means the order doesn't matter. For example, the permutations '01' and '10' are just a single combination, '01'. So there are far fewer options to deal with.

And specifically, I needed combinations of every length. That is, a base string '123' should yield, at length 1: '1', '2', '3'; at length 2: '12', '13', '23'; and at length 3: '123'. Then, I could put each possible combination on the left side and use my complement function again to find what would go on the right side for each combination.
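
To make that left/right idea concrete, here's a tiny illustration. (I'm using itertools.combinations purely for display here; it wasn't part of my solution, and finding the combinations myself is the whole story below.)

from itertools import combinations

s = '123'
for n in range(1, len(s) + 1):
    for left in combinations(s, n):
        left = ''.join(left)
        right = complement(s, left)
        print(left, right)
# prints: 1 23, 2 13, 3 12, 12 3, 13 2, 23 1, and finally 123 with an empty right side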

So the question was: how could I efficiently find all combinations?

And that was what was so hard, for some reason. I kept trying to build it through iterators. My idea was: take every length-1 string, then add each remaining character once. Take all of these length-2 strings and add each remaining character once. Take all of these length-3 strings… and so on. But I wasn't quite sure how to do that. I had multiple iterators running through the string with inner loops, but I realized that this only worked for strings of certain lengths: once the string was long enough, I needed yet another iterator to capture everything.

Now… full disclosure: I hadn’t written complement yet, in fact! I had been replicating its effects with longer code, but it was through this process that I eventually realized a lot of what I was doing was solvable with complement.

I actually grabbed someone else's code from stackoverflow.com, since I'd seen on the board that we can use others' code as long as we cite it. I found an answer that generates all combinations of a given length n:

def _combinations_of_length(s: str, n: int) -> 'generator':
    '''
    Yield all combinations of s of length n.
    '''

    ## Combinations generation adapted from code found at
    ## http://stackoverflow.com/a/2837693
    for i in range(len(s)):
        if n == 1:
            yield s[i]
        else:
            for rest in _combinations_of_length(s[i+1:len(s)], n-1):
                yield s[i] + rest

Then I built a wrapper function to feed every length into this function and return a set of all of them:

def all_combinations_theirs(s: str) -> set:
    '''
    Helper function for all_regex_permutations.
    Return a set of all combinations of s of length 1 to len(s).
    '''

    combinations = []

    # Start at length 1: asking the generator for length 0 yields nothing anyway.
    for n in range(1, len(s)+1):
        combinations.extend(_combinations_of_length(s, n))

    return set(combinations)
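
A quick check of what it gives back (sets are unordered, so I'm sorting for display):

print(sorted(all_combinations_theirs('123')))
# ['1', '12', '123', '13', '2', '23', '3']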

This works wonderfully fast, but I don't fully understand what it does, mostly because we haven't been taught what yield is. (It has something to do with a generator object… it's odd. If you look at the actual object returned by a function that yields instead of returning, you get <generator …>. But if you convert that to a list, you get what you'd expect. Need to do more reading.)
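
Here's the kind of minimal experiment I mean, plain Python with nothing to do with the assignment:

def count_up(n: int) -> 'generator':
    '''
    Yield the numbers 0 through n-1, one at a time.
    '''

    for i in range(n):
        yield i

print(count_up(3))        # <generator object count_up at 0x...>
print(list(count_up(3)))  # [0, 1, 2]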

Later on, I felt like I could in fact understand what the other person’s code was doing, and I realized something: in order to generate every combination of length n, it’s recursively generating everything up to that length anyway! But it’s throwing it out because the person designed that function only to produce those of length n. So I monkeyed around with it a bit, trying to alter it to retain the smaller lengths as well. But I couldn’t do it, and (again) I’m pretty sure it relates to yield. So I decided to write my own version, because I felt like if I could capture those smaller lengths as I generate them, I could actually be faster.

Well, I wasn’t quite right. I can’t fight the yield. But I did produce my own version at last, after having spent days on it and then given up. I wrote complement in order to do more or less what I described above, and produced this:


def all_combinations(s: str) -> set:
    '''
    Helper function for all_regex_permutations.
    Return a set of all combinations of s of length 1 to len(s).
    '''

    def _helper(s: str, n: int, combinations: list=[['']]) -> list:
        '''
        Return a list of lists of combinations of s of length 1...n.
        '''

        # Build up one sublist per length: extend each combination from the
        # previous length by one character it doesn't already use.
        while len(combinations) < n+1:
            new_length = []
            for combo in _helper(s, n-1, combinations)[-1]:

                for new_char in complement(s, combo):

                    # Signature check: keep the new combination only if no
                    # already-kept one has the same characters when alphabetized.
                    if not any(''.join(sorted(combo+new_char)) ==
                      ''.join(sorted(old)) for old in new_length):
                        new_length.append(combo+new_char)

            combinations.append(new_length)

        return combinations

    result = set()
    for length in _helper(s, len(s)):
        for combo in length:
            if combo:    # skip the empty seed string
                result.add(combo)
    return result

Not too pretty but it gets the job done.
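
A quick sanity check, and it matches what the borrowed version produces for the same string:

print(sorted(all_combinations('112')))
# ['1', '11', '112', '12', '2']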

A couple of clarifications. One, note that the embedded helper here separates the overall list into sublists, one for each length. This is just to help identify the anchors to build the combinations that are one character longer than the previous one. Then they’re all joined in the last bit of code in the main function.

Two, note this “any” business. The problem with this code is that it essentially overgenerates, almost generating all permutations (which is bad! bad! bad!). If a character is doubled, as you will often have in regexes, you’ll have variations of the same combination, i.e. permutations. So I used the signature technique zingarod showed us in lecture: I alphabetize the combination and see if any other combinations already generated match it when alphabetized, and I only add it if it’s unique. At first this might just seem like a filter that doesn’t increase efficiency. It would be — if we only ran it at the end. But if at each stage we cut the permutations down to the unique combinations, then we won’t have as many in our anchor and therefore won’t generate as many permutations in the next stage, so the total number of permutations generated and filtered will be far fewer.
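
Here's the signature idea on its own, stripped of the surrounding loops (using a set instead of the any() scan, just to show the principle):

seen = set()
for candidate in ['12*', '21*', '*12']:
    signature = ''.join(sorted(candidate))
    if signature not in seen:
        seen.add(signature)
        print(candidate)
# only '12*' is printed: all three are the same combination in different orders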

Since I had a working alternative to the borrowed code, I decided to ask zingarod if it was okay that I was using this much cited code. He said he’d prefer if I used my own. So I had to suck up the fact that my code was not fast. But at least it was my own solution!

There was one point of slowdown that I could address, though. After generating all these combinations, remember that my solution for binary regexes in all_regex_permutations (oh yeah… all the way back up there!) required that I find all permutations and stick them on the left and right sides of the operator. Running all_regex_permutations recursively on a large list of combinations was going to be very slow. So I did a final optimization by putting many short-circuits in all_regex_permutations, i.e. quick heuristics that allow me to predict when a string will never make a good regex no matter how many permutations I make of it. These included: Does it lack at least one leaf? Is the number of leaves anything but the number of operators plus one? Is the number of brackets not exactly double the number of operators? And so on. This did save a lot of time.
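
In rough form, those short-circuits looked something like the checks below. (The particular leaf and operator characters are my reading of the assignment, '0', '1', '2' and 'e' for leaves and '|' and '.' for the binary operators, so treat this as a sketch rather than my exact submitted code.)

def could_become_regex(s: str) -> bool:
    '''
    Quick heuristic: return False if no permutation of s can be a valid regex.
    '''

    leaves = sum(s.count(c) for c in '012e')
    operators = sum(s.count(c) for c in '|.')
    brackets = s.count('(') + s.count(')')

    if leaves < 1:                   # a regex needs at least one leaf
        return False
    if leaves != operators + 1:      # each binary operator joins two pieces
        return False
    if brackets != 2 * operators:    # each binary operator needs one bracket pair
        return False
    return True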

So… long story short: my solution does essentially involve generating lots and lots of permutations, despite my best efforts, but it does cut down on them notably by (1) focusing on combinations rather than permutations, (2) using a signature to eliminate extras, and (3) short-circuiting most of the recursions of all_regex_permutations on the final list of combinations. Yay, compromise!

 

P.S. Sorry no week 10. But I did so much reading for week 9’s list of slogs I like, that, you know…. ah, what am I saying. There’s no excuse.


3 thoughts on “Week 11”

  1. That's a nice approach to solving all_regex_permutations, although I prefer shorter code to hardcoding some exceptions. You can also check the way I solved this problem at my blog.
