Tuesday, April 29, 2014

Faster, faster!

In the preceding post I described the fairly naive algorithm I've been using to find the white borders of a scanned page's image, in order to automatically scale it to fill the image display window. The time taken to scan about a million white pixels was rather distressingly long. In fact it was worse than I described there.

A New Baseline

I was testing the find_image_margins() function by calling it under the profiler, cProfile.run('find_image_margins(qimage)', 'profdata'), but I hadn't actually looked at the margins it was returning. When I did look, I discovered that the left margin was only 2, when to the eye it should have been at least an inch's worth, 150 pixels or more. So I looked closer at the test image and found a little patch of black at the extreme lower left corner. The test image originally had some black crud on the left side; I'd cleaned it up in Photoshop but had missed this little patch.

As a result, at the end of the first inner loop, from the middle of the left side down, it found an unrealistically small left margin of 2. Then the second inner loop, from the top to the middle, never looked past pixel 2, which made it unrealistically fast.

After erasing the speck on the image, I made one logical change to the inner_loop code. When it stops, it has found a black patch 3 pixels wide, and its margin variable indexes the innermost pixel of that patch. It was returning that value, but it ought to return the index of the outermost pixel of the three. So it now read:

        for row in row_range:
            pa, pb = 255, 255 # virtual white outside the column
            for col in range(col_start, margin, col_step):
                pc = color_table[ ord(bytes_ptr[row+col]) ]
                if (pa + pb + pc) < 24 : # black or dark gray trio
                    margin = col # new, narrower, margin
                    break # no need to look further on this row
                pa, pb = pb, pc # else shift 3-pixel window
        return margin - (2*col_step) # allow for window width
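
To see the window logic in miniature, here is a toy sketch in plain Python (no Qt, and the pixel values are made up) that scans one synthetic row the same way:

    # One synthetic row: white (255) pixels with a dark patch at columns 5-7.
    row_pixels = [255]*5 + [0, 0, 0] + [255]*4

    def scan_row(pixels, margin, col_step=1):
        pa, pb = 255, 255 # virtual white outside the column
        for col in range(0, margin, col_step):
            pc = pixels[col]
            if (pa + pb + pc) < 24 : # black or dark gray trio found
                return col - 2*col_step # outermost pixel of the trio
            pa, pb = pb, pc # else shift 3-pixel window
        return margin # no dark trio on this row

    print(scan_row(row_pixels, len(row_pixels))) # prints 5, the patch's outer edge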

With that change and a realistic test image, cProfile now returned the following numbers:

         1171608 function calls in 2.348 seconds
   Ordered by: internal time
   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        4    2.282    0.571    2.348    0.587 imagetest.py:20(inner_loop)
  1171591    0.066    0.000    0.066    0.000 {built-in method ord}
        1    0.000    0.000    2.348    2.348 imagetest.py:12(find_image_margins)

Bottom line: 2.35 seconds to examine 1.17 million pixels.
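
For the record, these reports are ordinary cProfile/pstats output; something like the following, with profdata as the stats file, regenerates them:

    import cProfile, pstats
    # Profile the call, saving raw stats to the file 'profdata'.
    cProfile.run('find_image_margins(qimage)', 'profdata')
    # Load the stats and print the top entries by internal time.
    pstats.Stats('profdata').sort_stats('time').print_stats(10)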

Getting rid of ord()

One thing that bugged me about the above code is the need to take the ord() of the pixel byte, in order to use it as a list index. This is because Python, for reasons best known to itself, gives an error if you try to use a byte value as an index (and not an IndexError, either, but a TypeError; try it: [1,2][b'0']). Well, what structure will accept a byte as an index? A dictionary.
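
Here is the quirk in an interactive session (this is Python 3.4; the exact message varies a little across Python 3 versions):

    >>> [1,2][b'0']
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
    TypeError: list indices must be integers, not bytes
    >>> {b'0': 42}[b'0']
    42

So I changed the list comprehension that created the color table into a dict comprehension: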

    color_table = { bytes([c]): int((image.color(c) >> 8) & 255)
                     for c in range(image.colorCount()) }

The bytes() constructor, given a bare integer n, creates n zero bytes; to get a single byte with the value c you must pass an iterable, hence bytes([c]), wrapping the scalar integer c in a list so that bytes() makes it into a single byte. But whatever; the extra code at this point is executed only once per color, and its overhead is trivial compared to code that is executed once per pixel.
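
To illustrate the difference:

    >>> bytes(3)      # a bare int asks for that many zero bytes
    b'\x00\x00\x00'
    >>> bytes([3])    # an iterable of ints gives those byte values
    b'\x03'

The per-pixel code could now read: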

                pc = color_table[ bytes_ptr[row+col] ]

This uses the byte returned by the voidptr directly as the dictionary key. Did it save any time?

  ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        4    2.179    0.545    2.179    0.545 imagetestB.py:20(inner_loop)
        1    0.000    0.000    2.179    2.179 imagetestB.py:12(find_image_margins)
        1    0.000    0.000    2.179    2.179 {built-in method exec}

Yes, a little: total time dropped from 2.348 to 2.179 seconds, a saving of about 7%. I thought and thought about a better way to code the three-pixel window scan, and could not find one. If I've missed something, please tell me in a comment! But now I turned my attention to the other attack: reducing the number of pixels examined.

Skipping rows

To look at single pixels is to examine a page image in extremely fine detail. Is there a valid character less than three pixels tall? No. So why am I looking at every row? Looking at every row of pixels means sampling every line of characters at least four times, more likely eight or ten times. Let's skip a few!

It turned out to be trivially easy to look at every second row of the image. Recall that the call to the inner_loop passes a range iterator:

    left_margin = inner_loop(
                    range(int(rows/2)*stride, (rows-1)*stride, stride),
                    0, int(cols/2), 1 )

The first argument to range is the starting value, in this case the byte offset to the middle row of the image. The second is the stop value, which the range output never reaches; in this example, that's the offset to the last row of the image. The third is the step, the number of bytes from one row of pixels to the next. To look at only every second row, I added two characters to that statement:

    left_margin = inner_loop(
                    range(int(rows/2)*stride, (rows-1)*stride, stride*2),
                    0, int(cols/2), 1 )

This had a good effect on the cProfile stats:

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        4    1.125    0.281    1.125    0.281 imagetestB2.py:20(inner_loop)
        1    0.000    0.000    1.125    1.125 imagetestB2.py:12(find_image_margins)
        1    0.000    0.000    1.125    1.125 {built-in method exec}

From 2.179 seconds down to 1.125, a reduction of about 48%. Not a surprise: do half the work, take half the time; but still, nice. And the returned margin values were almost the same.

Shrinking the image

It would be easy to try skipping three of every four rows, but that might result in missing something like a wide horizontal rule. Instead, I thought, what about scaling the image down by half, using a smooth transformation? That would have something like the effect on the eye of holding the page at arm's length: shrink it and blur it but retain the outline. Running the inner loops on a half-size image would mean looking at one-fourth the pixels (half the columns of half the rows). The returned margins could be scaled up again.

I added the following code to the setup:

    scale_factor = 2
    orig_rows = image.height() # number of pixels high
    orig_cols = image.width() # number of logical pixels across
    image = image.scaled(
        QSize(int(orig_cols/scale_factor),int(orig_rows/scale_factor)),
        Qt.KeepAspectRatio, Qt.SmoothTransformation)
    image = image.convertToFormat(QImage.Format_Indexed8,Qt.ColorOnly)

I found that the QImage.scaled() method could change the format from the Indexed8 it started with, so it was necessary to add the convertToFormat() call to restore the expected one-byte-per-pixel format. (Which meant it was no longer necessary to enforce that format before calling find_image_margins.)
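
If you're feeling paranoid, a pair of assertions (my own sanity check, nothing Qt requires) guards the one-byte-per-pixel assumption:

    # Guard the one-byte-per-pixel assumption before scanning.
    assert image.format() == QImage.Format_Indexed8
    assert image.bytesPerLine() >= image.width()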

The rest of the setup was just as before, setting the row and column counts and the stride, but for this reduced image. The final line was no longer return left_margin, right_margin but this:

    return left_margin*scale_factor-scale_factor, right_margin*scale_factor+scale_factor
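
For completeness, here, condensed, is roughly how the whole revised function fits together. This is a sketch assuming PyQt5; the voidptr sizing line and the exact form of the four inner_loop calls are shown in spirit rather than verbatim:

    from PyQt5.QtCore import QSize, Qt
    from PyQt5.QtGui import QImage

    def find_image_margins(image, scale_factor=4):
        # Shrink and blur the page, then force one byte per pixel.
        orig_rows = image.height() # number of pixels high
        orig_cols = image.width() # number of logical pixels across
        image = image.scaled(
            QSize(int(orig_cols/scale_factor), int(orig_rows/scale_factor)),
            Qt.KeepAspectRatio, Qt.SmoothTransformation)
        image = image.convertToFormat(QImage.Format_Indexed8, Qt.ColorOnly)
        rows, cols = image.height(), image.width()
        stride = image.bytesPerLine() # bytes per row of pixels
        bytes_ptr = image.bits()
        bytes_ptr.setsize(image.byteCount()) # size the voidptr for indexing
        # Map each single-byte pixel value to the green byte of its color.
        color_table = { bytes([c]): int((image.color(c) >> 8) & 255)
                        for c in range(image.colorCount()) }

        def inner_loop(row_range, col_start, margin, col_step):
            for row in row_range:
                pa, pb = 255, 255 # virtual white outside the column
                for col in range(col_start, margin, col_step):
                    pc = color_table[ bytes_ptr[row+col] ]
                    if (pa + pb + pc) < 24 : # black or dark gray trio
                        margin = col # new, narrower, margin
                        break # no need to look further on this row
                    pa, pb = pb, pc # else shift 3-pixel window
            return margin - (2*col_step) # allow for window width

        # Left margin: middle row down, then top row to middle,
        # scanning every second row, left to right.
        left_margin = inner_loop(
            range(int(rows/2)*stride, (rows-1)*stride, stride*2),
            0, int(cols/2), 1 )
        left_margin = inner_loop(
            range(0, int(rows/2)*stride, stride*2),
            0, left_margin, 1 )
        # Right margin: the same two sweeps, scanning right to left.
        right_margin = inner_loop(
            range(int(rows/2)*stride, (rows-1)*stride, stride*2),
            cols-1, int(cols/2), -1 )
        right_margin = inner_loop(
            range(0, int(rows/2)*stride, stride*2),
            cols-1, right_margin, -1 )
        # Scale the margins back up, padding outward a little.
        return (left_margin*scale_factor - scale_factor,
                right_margin*scale_factor + scale_factor)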

This worked, and returned almost the same margin values as before, different by only a couple of pixels, much less than 1%. And the total execution time was now 0.333 seconds, distributed as follows:

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        4    0.294    0.074    0.294    0.074 imagetestC.py:20(inner_loop)
        1    0.032    0.032    0.032    0.032 {built-in method scaled}
        1    0.006    0.006    0.006    0.006 {built-in method convertToFormat}
        1    0.000    0.000    0.333    0.333 imagetestC.py:12(find_image_margins)

That was such a success, running in 30% of the previous best time, that I tried increasing the scale_factor to 4, with this result:

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        4    0.167    0.042    0.167    0.042 imagetestC.py:20(inner_loop)
        1    0.031    0.031    0.031    0.031 {built-in method scaled}
        1    0.001    0.001    0.001    0.001 {built-in method convertToFormat}
        1    0.001    0.001    0.001    0.001 imagetestC.py:52(<dictcomp>)
        1    0.000    0.000    0.200    0.200 imagetestC.py:12(find_image_margins)

Total time of 0.2 seconds. That's a reduction of only about 40% from the scale-factor-2 time, even though the loops examine a quarter as many pixels, so we are clearly into diminishing returns. But the total is now only 8% of the execution time of the original algorithm. Not too shabby! The returned margins were still the same. The total time is now small enough that the contribution of the dictionary comprehension for the color table is noticeable. And the largest component, after the inner loop, is the QImage.scaled() call.

This is fast enough that there should be no objectionable delay on clicking the To Width button, even on slow hardware. I will now integrate this into the imageview module and its unit test. When that's done, I can move on to a preliminary version of the main window!
