Thursday, September 4, 2014

Regex and backward searching

In the seemingly endless process of getting the Find/Replace panel done (now 3 weeks into its 1-week planned schedule) I have finally gotten to the part where I code some actual finding, and today had a bit of a surprise.

The Find panel offers four buttons, First, Next, Prior and Last. This is my design response to whether find should "wrap around" at end of document. To me it makes much more sense to find the First match and proceed to the Next, and the Next..., or to find the Next after the current cursor and then the Next, and the Next..., or to find the very Last match and walk backward to the Prior, and Prior...

These have key equivalents, and in my most common use case, it's click Next and then (because the focus is pushed to the document on a match) ctl-g, ctl-g, ctl-g... walking forward through the document match by match. Or, click Last and ctl-shift-g, ctl-shift-g, backward through the document. And not infrequently, alternating ctl-g and ctl-shift-g to bounce backward and forward between two (or more) matches.

It wasn't hard to get this working for non-regex finds. This just uses the QTextDocument find() method. This takes a string to match and a starting point (expressed as a QTextCursor), and optionally a flag word with optional Ignore Case, Whole Word, and Backward bits. It searches forward or backward from that cursor's position. This was pretty easy to set up, and it works nicely. I can set the Respect Case and Whole Word check-boxes, click First, and then hold down ctl-g and it rattles through the document from match to match as fast as the key repeats. Same for Last and ctl-shift-g.

Adding regex was a bit more complex. If the Regular Expression checkbox is on, the current value of the Find text field gets compiled as a regex while it is being edited. So when starting a search, if that regex is not None, the find string is a syntactically-valid regex. But the backward part...

As I mentioned some time back, I'm using the superb regex module, not the Python default re module and not Qt5's new QRegularExpression class. One advantage of regex is that is supports backward searching, which neither re nor QRegularExpression do.

However, it supports it as a global flag, similar to regex.IGNORECASE. That means it needs to be specified at compile time. But the compile is happening while the user is editing the pattern string. By the time she clicks Last or Prior, asking for a backward search, the compile is finished. The code that recompiles the regex every time the user edits it (in order to turn the input field pink if the syntax is bad) cannot know if the regex will be used in the forward direction, or reverse.

So in the actual find code, in the case that the regex checkbox is on, I have to ask, is this a reverse search? And if so, recompile the regex with the REVERSE flag. Well, that would be a waste when Prior is being clicked or ctl-shift-g pressed repeatedly. So I had to cache the recompiled regex and only recompile it if needed.

But the first time I hit the Prior button (search backward from the current location), I was surprised that the match occurred at the end of the document, instead of just left of the cursor. As if I'd clicked Last, but I hadn't. What?

The problem was that regex does something reasonable but not what I expected. The QTextDocument find() method, if asked to go backward, goes backward from the position of the given cursor. The equivalent argument for a regex.search() is the pos argument: regex.search(string,pos). That works for a forward search; the search starts at offset pos in string and goes forward. I supposed that an re compiled with REVERSE would be like find(), it would go backward from pos.

Nunh-unh. A reverse search goes backward from the end of the string. Which is why a "Prior" search acted like a "Last" search. Well, how to make it do what I wanted?

Careful reading of the re documentation reminds one that the regex.search() method has two optional arguments, pos and endpos. Not unreasonably, regex doing a backward search starts at the endpos offset (default end of string), and scans backward toward pos (default start of string). So my regex find code ends up with this:

                if flag & FindPanel.SEARCH_BACKWARD :
                    # reverse search begins at "endpos"
                    fp.match = re.search(text,0,start_tc.position())
                else : # forward search begins at "pos" argument
                    fp.match = re.search(text,start_tc.position())
                if fp.match : # is not None, we have a match...

I started to file an Issue with the developer, but thought better of it. What he'd doing makes sense. It merely failed to match my preconceptions.

No comments: