Removing latex commands using Python “re” module

Recently I had to sanitize lines in a .tex file where a \textcolor command had been used.
The command was being used the following way: {\textcolor{some_color}{text to color}}.

The main problem was that the command could have appeared any number of times in a line, so I couldn’t apply the command a set number of times.
Also, given any color could have been used, a simple “blind replace” was clearly not a good weapon in this case.

I therefore resorted to applying a reg ex recursively until the line was cleaned of any \textcolor command.

In a nutshell:

def discolor(line):
    regex = re.compile('(.*?){\textcolor\{.*?\}(\{.*?\})\}(.*)')
    while True:
        try:
            line = ''.join(re.search(regex, line).groups())
        except AttributeError:
            return line

The key part here is that we match not only the text inside the \textcolor command, but also what comes before and after (the two (.*?) blocks). We return them all until there are any left: when that happens, accessing .groups() will raise an AttributeError, which we catch and use as sentinel to know when to return.

sort -k

TIL, when you call sort -k3, you’re not just sorting by the third field, but by whatever the value between the third field up to the end of the line is.
Not only that, in the case of ties, by default it will use also the first field.

Consider this example.

$ cat data
theta AAA 2
gamma AAA 2
alpha BBB 2
alpha AAA 3

Sorting with -k2 gives:

$ sort data -k2 --debug
sort: using simple byte comparison
gamma AAA 2
     ______
___________
theta AAA 2
     ______
___________
alpha AAA 3
     ______
___________
alpha BBB 2 
     _______
____________

Notice I’ve also added --debug, to show which parts are used in the comparisons.
So, first comes “AAA 2”, then “AAA 3”.
Also, for the two lines that have “AAA 2”, the first field is used, so “gamma” comes before “theta”.

Forget about the ties for now.
To consider field 2 only, rather than field 2 and all following fields, you need to specify a stop. This is done by adding “,2” to the -k switch. More in general, -km,n means “sort by field m up to n, boundaries included”.

$ sort data -k2,2 --debug
sort: using simple byte comparison
alpha AAA 3
     ____
___________
gamma AAA 2
     ____
___________
theta AAA 2
     ____
___________
alpha BBB 2 
     ____
____________

As you can see, field 2 only is taken into account at first.
“AAA 3” comes before “AAA 2” because, being a tie, the first field is used as a second comparison.

Taking this a step further, to actually only consider field 2 and resort to the original order in case of ties, that is, to have a stable sort, you need to pass the -s switch.

$ sort data -k2,2 -s --debug
sort: using simple byte comparison
theta AAA 2
     ____
gamma AAA 2
     ____
alpha AAA 3
     ____
alpha BBB 2 
     ____

This look similar to the first snippet, but actually the first two lines in the output are swapped. Here they appear in the original order.

Timezones and DST in Python

It’s incredible how fiddly it is to work with timezones.

Today, 14th of June—and this is important—I was trying to convert a made-up datetime from “Europe/London” to UTC.

I instinctively tried out this:
>>> almostMidnight = datetime.now().replace(hour=23, minute=59, second=59, microsecond=999999, tzinfo=pytz.timezone('Europe/London'))
>>> almostMidnight
datetime.datetime(2017, 6, 14, 23, 59, 59, 999999, tzinfo=<DstTzInfo 'Europe/London' GMT0:00:00 STD>)

At this point you will notice it didn’t take into account the DST offset (it should read BST).

As a further confirmation, converting to UTC keeps the same time:
>>> pytz.UTC.normalize(almostMidnight)
datetime.datetime(2017, 6, 14, 23, 59, 59, 999999, tzinfo=<UTC>)

Notice this result would be fine during the winter, so depending how much attention you devote and when you write the code you might miss out on this bug – which is why I love having the same suite of tests always running on a system that lives right after the upcoming DST change.

Even more subtler, if you were to try and convert to a different timezone, a geographical timezone that observes DST, you would see this:
>>> almostMidnight.astimezone(pytz.timezone('Europe/Rome'))
datetime.datetime(2017, 6, 15, 1, 59, 59, 999999, tzinfo=<DstTzInfo 'Europe/Rome' CEST+2:00:00 DST>)

Interesting. Now DST is accounted for. So converting to geographical timezones might also mask the problem.

Long story short, the correct way *I believe* to convert the timezone of a datetime object to UTC is to create a naive datetime object (no timezone info attached) representing localtime, and then call the “localize” of the timezone of interest. In code:
>>> almostMidnight = datetime.now().replace(hour=23, minute=59, second=59, microsecond=999999)
>>> almostMidnight
datetime.datetime(2017, 6, 14, 23, 59, 59, 999999)
>>> pytz.timezone('Europe/London').localize(almostMidnight).astimezone(pytz.UTC)
datetime.datetime(2017, 6, 14, 22, 59, 59, 999999, tzinfo=<UTC>)

There’s a very nice read on timezones by Armin Ronacher, which I recommend.

Asserting that an exception has not been raised

A quick way to test that some method does not raise an exception is to try and raise that exception. Sounds logic – and it is – but I keep forgetting it. So here it is for my future me.

Suppose you have a method that raises an exception if it can’t open a file.

def read_first_line(self, filename):
    try:
        with open(filename, 'r') as f:
            self.first_line = f.readline()
    except IOError:
        print("No such file {0}".format(filename), file=sys.stderr)

Testing that it works correctly when a file does exist is quite simple at this point:

def test_no_exception_is_raised_if_file_exists(self):
    # suppose your instance is in self.sut
    try:
       sut.read_first_line('file_that_exists.txt')
    except IOError:
        self.fail("IOError raised but should not have")

Graceful Harakiri

In the past few days I’ve been trying to overcome a problem we saw on our CI environment: tests being abruptly cut if they hang.

If a build takes too long, our CI tool stops it by sending it a SIGTERM signal. That’s fine in general, but if it’s a test (run by the nosetests driver) that’s taking too long to finish, a SIGTERM would cause it to immediately stop, without leaving any trace on the output where it hanged.

What I coded was a plugin, Graceful Harakiri, that intercepts the SIGTERM signal, converts it to an interrupt and, in the process, prints out the current frame, giving away some information about where the test got stuck.

The code is on GitHub – have a look at the description and use it. Feedback is most welcome.

Pdb cheat sheet

I often need a table with all the commands for the Python debugger (Pdb) but so far I couldn’t find a comprehensive one. So I wrote one. Enjoy!

h(elp) or ? print available commands
h(elp) command print help about command
q(uit) quit debugger
! stmt or exec stmt execute the one-line statement stmt
p(rint) expr print the value of expression
pp expr pretty-print the value of expression
whatis obj print the type of obj
bt or w(here) print stack trace from oldest frame to newest
l(ist) list 11 lines of source code around the current line
l m, n list from line m to line n
args print the arguments of the current function
alias [name [command [parameter parameter ...] ]] create an alias called name that executes command
unalias name remove alias
<ENTER> repeat the last command entered
n(ext) execute the current statement (step over)
s(tep) step into function
j(ump) n set line n the next line to be executed
r(eturn) continue execution until the current function returns
c(ontinue) continue execution until a breakpoints is encountered
unt(il) continue execution until reaching the line greater than the current one or the current frame returns
run or restart (re-)run the program
u(p) move one level up in the stack trace
d(own) move one level down in the stack trace
b(reak) show breakpoints
tb(reak) [filename:]n set a temporary breakpoint at line n of file filename
b [filename:]n set a breakpoint at line n of file filename
b fun set a breakpoint at function fun
disable bn1 [bn2, ...] disable breakpoint(s) bn1, bn2, …
enable bn1 [bn2, ...] enable breakpoint(s) bn1, bn2, …
clear [bn] clear breakpoint number bn; if no number is specified, clear them all
commands [bn] specify commands to run at breakpoint bn
condition bn expr expression expr must evaluate to true in order for breakpoint bn to be hit. If no expression is specified, the breakpoint is made unconditional

String-building performance – a small experiment

I wanted to create a collection of all ASCII letters (both upper- and lowercase) and single digits in Python.

The first (IMO most Pythonic) way that I came up with uses the standard string concatenation operator:

import string
a = string.ascii_letters + string.digits

Then I wondered, would it be faster to use the usual methods to format strings?
Say, a = "%s%s" % (string.ascii_letters, string.digits) or a = "{0}{1}".format(string.ascii_letters, string.digits)?

Only timeit.timeit can tell.

>>> timeit("a = string.ascii_letters + string.digits", setup="import string", number=10000000)
1.4662139415740967
>>> timeit("a = \"%s%s\" % (string.ascii_letters, string.digits)", setup="import string", number=10000000)
2.8996360301971436
>>> timeit("a = \"{0}{1}\".format(string.ascii_letters, string.digits)", setup="import string", number=10000000)
12.925632953643799

Interesting, innit. Now, a word of caution: these results are limited to this particular example. Results may vary (and probably do) if the “domain of the experiment” changes – that is, if you use different strings in terms of number, length, characters, etc.

Nosetests and the docstrings

For some reason when you use the -v switch in nosetests, instead of using the test name, the runner uses the docstring (and only first line, for that matter).

Solution: install the “nose-ignore-docstring” plugin:
sudo easy_install nose-ignore-docstring
and then enable it by adding:
[nosetests]
with-ignore-docstrings=1

to ~/.noserc

For additional information, I’ll point you to the plugin web page on PyPy.

QA bugs

A couple of days ago, I was discussing a bug raised by a colleague of mine with a developer while at some point the product owner came by and asserted that “It’s a QA bug”: the triage session had started.

It had been a while since I last heard that name and, to be honest, it caught me of guard.
The definition varies depending on who you ask, but more or less a “QA bug” can be regarded as a bug that only a tester can find, either because the steps are too convoluted (many and complex) and/or the issue is not severe. As I see it, it could be the scratch on the inside of your new car, well hidden by the seat belt and only visible at times: it’s not the end of the world, but you know it’s there and it bothers you.
While I think it’s OK not to make a fuss about such kind of bugs, I also believe it’s stupid to dismiss them as “useless defects that will never be found by clients” or that are simply “too minor to be bothered fixing them”. I’m always keen on digging deeper on and around them. To me they’re alarm bells: warnings that something is just not about right and that even worse issues could be present in the same area. In my experience, this (luckily) often was not the case, but the opposite happened a few times as well.
Next time you triage bugs, speak up against who naively wants this type of bugs not to be fixed and suggest that QAs can spend some time analysing the whole situation a bit more in depth. It will be time well spent.