Monday, April 28, 2014

Scanning the pixels

Last post I described how you can get a "voidptr" through which you can PEEK the bytes of an image. This is needed to implement zoom-to-width and zoom-to-height buttons on the imageview. The code that I wrote for this in V.1 never felt right. It's a pure CPU-burning FORTRAN-style loop process; even more so than the character or word census of a large text. The scan image for a typical page is several megapixels (the test image used below is about 1800x2700, or 5Mpx) and the only process I could find for determining the margin has to access a good fraction of them. Clicking "To Width" or "To Height" on an older laptop incurs a noticeable pause of a second or more. I want to shorten that.

(Note that in what follows, the code for to-width and to-height is virtually the same, just substituting "top" and "bottom" for "left" and "right". I'm writing about to-width; whatever lessons I learn are immediately applicable to to-height.)

Starting point

Let's look at the code as it is now. The point is to find out how much of the left and right margins of the image are all-white, then scale and center the image to exclude them. To find the left margin, I look at the pixels of each row from left to right, stopping at a black spot. Once I've found a black spot, I never have to scan farther to the right than that point. I keep going to successive rows, hoping to find a black spot even more to the left. After looking at all rows I know the pixel count to the left of the leftmost black spot. The same logic applies to the right margin: look at all rows from right to left, in order to find the rightmost black spot.

An early complication was that finding a single black pixel threw up false positives. Scan images often have one-pixel, even two-pixel "fly specks" outside the text area. So I modified the code to stop only on a dark spot of at least 3 pixels, which added some overhead. I added a heuristic. Noting that many pages have large white areas at the top and bottom, I started the scan with the middle row, hoping to quickly find some black pixels nearer the left margin.

The following code is heavily rewritten and refactored. You can view the original if you like, but don't bother. The original has four nearly-identical loops for scanning the left margin, middle to end then top to middle, then the right margin the same. These loops—I realized yesterday—differ only in the start, stop and step values of their controlling ranges. So I factored the inner loops out to this:

    def inner_loop(row_range, col_start, margin, col_step):
        '''
        Perform inner loop over columns of current margin within rows of
        row_range. Look for a better margin and return that value.
        '''
        for row in row_range:
            pa, pb = 255, 255 # virtual white pixels outside the row
            for col in range(col_start, margin, col_step):
                pc = color_table[ ord(bytes_ptr[row+col]) ]
                if (pa + pb + pc) < 24 : # black or dark gray trio
                    margin = col # new, narrower, margin
                    break # no need to look further on this row
                pa, pb = pb, pc # else shift 3-pixel window
        return margin

The key statement is pc = color_table[ ord(bytes_ptr[row+col]) ]. From inside, out: bytes_ptr[row+col] peeks the next pixel value, which is an index into the color table. ord() is needed because indexing the voidptr returns a one-byte bytes object, and Python will not accept a bytes object as a list index, only an integer. The color_table is a list of the possible GG values from RRGGBB, smaller numbers being darker. I'll talk about it in a minute.
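To see the lookup in miniature: here a one-byte bytes object stands in for what indexing the voidptr returns, and the five-entry color table is made up for illustration.

```python
pixel = b'\x03'                        # pixel value 3, an index into the color table
color_table = [0, 64, 128, 192, 255]   # hypothetical GG values, dark to light
# color_table[pixel] would raise TypeError: list indices must be integers.
print(color_table[ord(pixel)])         # -> 192
```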

The operation of this code should be fairly clear: slide a 3-pixel window along the pixels of one row up to a limit; when they comprise a dark spot, break the loop and set a new, more stringent limit. Return the final limit value.
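The window logic can be tried standalone with a plain Python list standing in for a pixel row; this is a sketch for illustration, not the app's code.

```python
def find_margin_in_row(row, margin):
    '''Scan one row left to right, up to margin; return the column of
    the first 3-pixel dark spot, or margin if none found.'''
    pa = pb = 255                  # virtual white pixels outside the row
    for col in range(0, margin):
        pc = row[col]
        if pa + pb + pc < 24:      # three consecutive dark pixels
            return col             # column of the third dark pixel
        pa, pb = pb, pc            # slide the window one pixel
    return margin                  # no dark spot before the limit

row = [255] * 100
row[5] = 0                         # a one-pixel fly speck: ignored
row[40:45] = [0] * 5               # a genuine dark spot
print(find_margin_in_row(row, 100))    # -> 42
```

Note the one-pixel speck at column 5 never makes a dark trio, so the scan runs on to the real text at column 40 and stops at the third dark pixel.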

Now let's look at the code that sets up for and calls that inner loop.

def find_image_margins(image):
    '''
    Determine the left and right margin widths that are (effectively)
    all-white in an image, returning the tuple (left,right). The image is
    presumed to be a scanned book page in the Indexed-8 format, one byte per
    pixel, where a pixel value is an index into a color table with 32-bit
    entries 0xAARRGGBB (AA=alpha channel).
    '''
    rows = image.height() # number of pixels high
    cols = image.width() # number of logical pixels across
    stride = (cols + 3) & (-4) # scan-line width in bytes, rounded up to a multiple of 4
    bytes_ptr = image.bits() # uchar * a_bunch_o_pixels
    bytes_ptr.setsize(stride * rows) # make the pointer indexable
    # Get a reduced version of the color table by extracting just
    # the GG values of each entry. If the image is PNG-1, this
    # gives [0,255] but it could have 8, 16, even 256 elements.
    color_table = [ int((image.color(c) >> 8) & 255)
                     for c in range(image.colorCount()) ]

The first five lines of code set up the voidptr to the image bytes. The stride value is used because the Qt documentation notes that every pixel row occupies a whole number of 32-bit words.

Note the remark about the "presumed" image format? In early testing I forgot to force the image to Indexed-8, so it was still a 1-bit-per-pixel image whose buffer was much smaller than the roughly 5MB that the calculation stride * rows told setsize() to expect. The result was that the first time the inner loop tried to access a byte of a middle row—Python seg-faulted. Yes, you can bring down, not just your own app, but the whole interpreter, by mis-using a voidptr.
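Some hypothetical arithmetic shows the size of the mismatch, assuming the test image's rough dimensions of 1800x2700:

```python
rows, cols = 2700, 1800    # approximate dimensions of the test image
# Indexed-8: one byte per pixel, scan lines padded to 32-bit words:
stride8 = (cols + 3) & -4
print(stride8 * rows)      # -> 4860000 bytes, about 5MB
# But a 1-bit-per-pixel image packs 8 pixels into each byte:
stride1 = ((cols + 7) // 8 + 3) & -4
print(stride1 * rows)      # -> 615600 bytes
# setsize() claiming ~5MB over a ~600KB buffer lets the inner loop
# index far past the real data: hence the seg-fault.
```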

If the image file is really monochrome and stored as PNG-1, the color table will have two entries. But I can't require that or assume that. It will be a PNG but it might be a PNG-8 with actual colors. So make a color table of just the GG values (Green is commonly used as a proxy for pixel brightness) and index it by the pixel value.
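For example, with a made-up four-entry gray-scale table in the 0xAARRGGBB form that image.color() returns:

```python
# Hypothetical Indexed-8 color table entries, 0xAARRGGBB (AA = alpha):
entries = [0xFF000000, 0xFF404040, 0xFFC0C0C0, 0xFFFFFFFF]
# Keep only the GG byte of each entry as a brightness proxy:
color_table = [(e >> 8) & 255 for e in entries]
print(color_table)   # -> [0, 64, 192, 255]
```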

Now, to work:

    # Some pages start with many lines of white pixels so in hopes of
    # establishing a narrow margin quickly, start at the middle, go to
    # the end, then do the top half. Begin: left side from the middle down.
    left_margin = inner_loop(
                    range(int(rows/2)*stride, rows*stride, stride),
                    0, int(cols/2), 1
                    )
    # With hopefully narrower margin, scan from the top to the middle:
    left_margin = inner_loop(
                    range(0, int(rows/2)*stride, stride),
                    0, left_margin, 1
                    )
    # Now do exactly the same but for the right margin, taking columns
    # from the rightmost, inward.
    right_margin = inner_loop(
                    range(int(rows/2)*stride, rows*stride, stride),
                    cols-1, int(cols/2), -1
                    )
    right_margin = inner_loop(
                    range(0, int(rows/2)*stride, stride),
                    cols-1, right_margin, -1
                    )
    return left_margin, right_margin

Each call to the inner loop passes a range of rows to examine—in the form of pre-calculated byte offsets—and the three factors for a range of columns. The end result is the pair of column numbers of the leftmost and rightmost dark spots, which bound the white margins of the page.
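To show where these numbers go, here is a sketch (the function and its parameter names are illustrative, not the app's code) of the To Width scale calculation, treating the two results as the columns of the leftmost and rightmost dark spots:

```python
def zoom_to_width(view_width, cols, left_col, right_col):
    '''Scale factor that makes the non-white span fill the viewport.'''
    text_width = right_col - left_col + 1   # columns holding actual text
    return view_width / text_width

# e.g. a 900-pixel viewport and text spanning columns 150..1649 of 1800:
print(zoom_to_width(900, 1800, 150, 1649))   # -> 0.6
```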

Timings of this code

When I use the cProfile module to time one application of find_image_margins() to a representative page image, I get these not awfully helpful stats:

         871938 function calls in 1.921 seconds
   Ordered by: internal time
   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        4    1.815    0.454    1.921    0.480 imagetest.py:20(inner_loop)
   871922    0.106    0.000    0.106    0.000 {built-in method ord}
        1    0.000    0.000    1.921    1.921 {built-in method exec}
        1    0.000    0.000    1.921    1.921 imagetest.py:12(find_image_margins)

The remaining lines are all-zero times. So: running under cProfile, this takes almost two seconds to run. On a 2.8GHz CPU, that's over five billion cycles. Wow! This would have been completely infeasible on, say, my old Mac Plus, where it would have taken something like 20 minutes to complete on its 8 megahertz CPU. On my current Macbook Pro, it's a minor annoyance.

The count of 871,922 calls to ord tells how many pixels were examined in the inner loop. The total time of 1.815 tells how much time the inner loop spent on those pixels: about 1,815,000 microseconds over roughly 872,000 pixels, or about 2 microseconds per pixel examined.
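The per-pixel figure falls straight out of the profile numbers:

```python
pixels = 871922           # ord() calls, one per pixel examined
seconds = 1.815           # inner-loop tottime from the profile
print(round(seconds / pixels * 1e6, 2))   # microseconds per pixel -> 2.08
```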

Optimizing

I see two approaches to optimizing this program. One: reduce the roughly 2μsec spent in the inner loop on each pixel. Two: reduce the number of pixels examined in the inner loop. Reducing either number by half would halve the program execution time.

I believe I will spend a bit of time exploring both approaches, and report in the next post.
