Blog

PDF Collating/Merging

At home I have a wonderful Fujitsu ScanSnap duplex sheet-fed scanner. I can put a stack of double-sided pages into it, and it creates a perfect PDF.

At my office I have an HP all-in-one printer/scanner/copier. It's not quite so nice. It does have a sheet-fed scanner, but it can only do single-sided scans. If I wanted to scan any double-sided documents, I more or less had to take them home and do it there.

I got sick of doing this and decided to write a little Python script that takes two PDF files and collates/merges them. The first file holds all the odd-numbered pages in the stack of paper, and the second holds the even-numbered pages (i.e. the backs of the odd-numbered pages). My process now is to scan the stack of paper to produce the odd-numbered-pages PDF. Then I flip the stack over and scan again, generating the even-numbered-pages PDF. This PDF is in reverse order (when you flip the stack over, the last page gets scanned first), so my program knows to go backwards through the even-numbered document. I pass these two files to my program and it collates them and outputs a new PDF. The end result is the same as if I had used my ScanSnap at home, just with a little more effort. It beats schlepping the documents back and forth from the office every time.
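To make the ordering concrete, here is a minimal sketch of just the collation step, working on plain lists rather than real PDF pages. It is not the code from my script; the function name collate_pages and the equal-length check are assumptions made for illustration.

```python
def collate_pages(front_pages, back_pages_reversed):
    """Interleave front sides with back sides that were scanned in reverse.

    front_pages: pages from the first scan, in reading order (1, 3, 5, ...).
    back_pages_reversed: pages from the second scan, last back side first.
    """
    # Assumption: every sheet was scanned once per side, so both scans
    # contain the same number of pages.
    if len(front_pages) != len(back_pages_reversed):
        raise ValueError("both scans must contain the same number of pages")

    # Undo the reversal caused by flipping the stack over.
    backs = list(reversed(back_pages_reversed))

    collated = []
    for front, back in zip(front_pages, backs):
        collated.append(front)
        collated.append(back)
    return collated


# Three sheets: fronts are pages 1, 3, 5; backs arrive as 6, 4, 2.
ordered = collate_pages([1, 3, 5], [6, 4, 2])
```

With a real PDF library, the list items would be page objects read from the two input files, and the collated list would be written out as the new document.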

You can find it in all its glory here: https://github.com/parlarjb/PDF-Merge

Starting cProfile in the middle of a run

I recently asked on Twitter if it was possible to tell cProfile to start recording in the middle of a run, rather than when a function first starts.

In my particular case, I have a function (generator, actually) that runs beautifully fast for the first 2.3 million items it returns, but then hits some strange bottleneck. I tried debugging with pdb, but on data sets as large and as complicated as I'm dealing with, I couldn't quite figure out what was going on.

My friend Chris pointed out that the runcall method of Profile calls enable and disable methods. I had never seen these before, and to be frank, I had never directly used the Profile class. The run function of the cProfile module had always been good enough for me.

With a bit of playing around, I was able to use these methods to do what I needed. In this example, chains is a generator. A freshly created Profile does not record anything until enable() is called, so I simply create it up front and enable it at the point where I know things go wrong.

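from cProfile import Profile

profiler = Profile()

def write_chains():
    # chains is the generator being profiled (defined elsewhere)
    chain_count = 0
    for chain in chains:
        chain_count += 1
        print(chain_count)
        # Start recording just before the slowdown begins
        if chain_count == 2305838:
            profiler.enable()

try:
    write_chains()
except KeyboardInterrupt:
    # Ctrl-C once things have slowed down, then save what was recorded
    profiler.dump_stats('prof_output2')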

Update:

Chris pointed out that I don't need to wrap write_chains() in a call to run(), which I was originally doing. He also noted that dump_stats() causes disable() to be called, so I don't need to worry about that either.
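For anyone who wants to try the technique without a 2.3-million-item data set, here is a small self-contained sketch of the same idea. The names slow_step, items, and profile_after are hypothetical stand-ins, not part of my real code, and the report is read back with pstats instead of being dumped to a file.

```python
from cProfile import Profile
import io
import pstats


def slow_step(n):
    # Hypothetical stand-in for the work done on each item.
    return sum(i * i for i in range(n))


def items(count):
    # Hypothetical generator standing in for `chains`.
    for i in range(count):
        yield i


def profile_after(threshold, count):
    """Process `count` items, recording profile data only from item `threshold` on."""
    profiler = Profile()  # records nothing until enable() is called
    for index, item in enumerate(items(count), start=1):
        if index == threshold:
            profiler.enable()
        slow_step(item)
    profiler.disable()
    return profiler


profiler = profile_after(threshold=900, count=1000)

# Render the collected stats into a string, sorted by cumulative time.
stream = io.StringIO()
pstats.Stats(profiler, stream=stream).sort_stats("cumulative").print_stats(5)
report = stream.getvalue()
print(report)
```

Only the calls made after the threshold show up in the report, which keeps the fast early part of the run from drowning out the slow part.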