Big Numbers in Perl

A simple numerical overflow had been causing an error in my code for ages, and I just found it.  The problem arose from using numbers greater than 2^32 in Perl. I’ve become so used to 64-bit systems that I’d forgotten to check for it.

I use powers of 2 to encode values such as file formats into a single numerical field in MySQL (which briefly was SunSQL and is now OracleSQL?).  So I have code that looks like:
our %cat_formats = (
  2**1  => ['DICOM'],
  2**2  => ['NEMA'],
  2**3  => ['Analyze'],
  # ... and so on, up through 2**36 ...
);
Adding the numbers together gives me a unique value encoding any combination of formats, which I can store in a BIGINT field in MySQL.  This was fine while I was listing 32 file formats or fewer, but I now list 36.  The test I had been using for matching a format-encoding value against the stored MySQL value looked something like this, testing for a value of 2:
  $ret = (($readfmt * 1) & 2) ? 1 : 0;
I multiply the value by 1 to be sure it’s in a numerical context, and I pedantically set the return value to 1 or 0 for consistency with cases where I might want to set it to something else. 
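A quick worked example: a program that reads both DICOM and Analyze is stored as 2 + 8 = 10, and the same bit test pulls the individual formats back out (format names from the table above):

my $stored = 2**1 + 2**3;          # DICOM + Analyze = 10
print(($stored & 2**1) ? 1 : 0);   # 1: the DICOM bit is set
print(($stored & 2**2) ? 1 : 0);   # 0: the NEMA bit is not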
Problem was, this test was returning true whenever the stored value was greater than 2^32, and I use the value 2^34 to denote NIFTI format.  So whenever I was testing for a value of 2 (denoting DICOM, common), I was getting true for any program that could read NIFTI format (somewhat rarer).  Which led to the FSL program being listed as the second-highest-ranked DICOM viewing program.  Now FSL is a fine program, but sadly it cannot read DICOM.
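If I understand the failure correctly, on a perl built with 32-bit integers 2^34 is held as a floating-point value, and when & forces it to an integer it saturates at the largest unsigned value, 0xFFFFFFFF, which has every bit set.  A sketch of the symptom (it only misbehaves on a 32-bit-integer perl):

my $readfmt = 2**34;                     # the NIFTI flag
my $ret = (($readfmt * 1) & 2) ? 1 : 0;  # 1 on a 32-bit-int perl: wrong!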
All I had to do to fix it was enable big numbers, and all was well.
use bignum;
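With bignum in effect, numeric literals become exact Math::BigInt values, and Math::BigInt overloads & so the bit test keeps working past 2^32.  A minimal sketch of the repaired test:

use bignum;
my $readfmt = 2**34;                     # NIFTI, now an exact big integer
my $ret = (($readfmt * 1) & 2) ? 1 : 0;  # 0, as it should be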

Sed Cleverness

I got sed to do something clever today, though sadly the cleverness was not mine.  I tried to solve the problem myself, and although I learned a great deal about sed in the process, I had to resort to copying the answer.

I want to add Google Analytics code to my sister’s website, which she’s writing using iWeb.  Analytics is enabled by including in your HTML a JavaScript snippet that Google gives you.  iWeb does a really nice job, but you have to do things their way, and that means no JavaScript.  Fair enough, I guess: Apple wants to ensure that websites produced using their software will always work, and introducing a programming language pretty much ensures that things frequently won’t.


The website is hosted on a Linux server, not an Apple account, so to publish the site we export its contents to a directory, then FTP the directory contents to the server.  Initially we used FileZilla, but it does not have incremental directory synchronization, so we switched to the awesome Cyberduck.
So my first thought was: perhaps we can modify the HTML files before they leave her Mac.  There are a couple of approaches available.  One is an iWeb add-on, which looked more complex than needed, and you have to buy it.  Another is a downloadable Automator action that inserts the Google JavaScript into the HTML files.  That sounded good, but it was one more action to perform, and I’ve never used Automator.
So I thought, I’ll knock up a little script that the web server can run as an hourly cron job, and edit any HTML files that don’t contain the Google code.  Ha!  Little script though it is, it took a while.  I got a lot of help from Bruce Barnett’s sed guide, as I don’t have a good shell book with me right now.  I should buy O’Reilly’s ‘sed & awk’, a classic if ever there was one, which I believe may even have been their first ever book published.  I remember it in print in the early 90s, and they even had a T-shirt of the cover, which I dearly wish I’d bought.


The tricky part was that the Google code is supposed to be included immediately before the </body> tag in each HTML page.  Two problems: I couldn’t be sure that the </body> tag would be on a line by itself, and sed’s file-insertion command acts after the matched pattern, not before.
Problem 1 was addressed using a simple substitution with newlines:

sed -e 's|</body>|\
&\
|'

I used pipe-character delimiters.  The substitution target is the string </body>, and the newlines are inserted literally, so the line-continuation backslashes carry the replacement across lines.  The ampersand stands for the matched string, so this substitution puts a newline before and after the </body> tag, to ensure it’s on its own line.
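To see what that pass does, here it is run against a made-up one-line page:

$ printf '<p>hi</p></body></html>\n' | sed -e 's|</body>|\
&\
|'
<p>hi</p>
</body>
</html>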
Problem 2 was harder.  I didn’t know about the file-insertion command till today, though I figured that sed would have one.  It does, but it inserts after the matched pattern.  My initial approach to inserting the file of Google Analytics JavaScript code, ga.js, was:

sed -e '
/<\/body>/ {
r ga.js
}'

But this inserted the file after the </body> tag, which wasn’t allowed.  
Next thought was to take a two-way approach.  I’d print every line not matching the </body> pattern, and in a separate rule matching the </body> pattern, delete the pattern, insert the file, then print the pattern.

sed -e '
/<\/body>/ !{
p
}
/<\/body>/ {
r ga.js
d
p
}'

Not to be.  The d command ends the processing cycle, as explained by Barnett, so the print command is never executed.
At this point after some hours of learning sed, I looked for an answer to inserting a file before a sed pattern, and found one.  At least by this point I knew enough to understand it (sort of).  This post by Tapani Tarvainen gave me a very succinct answer for the second pattern action:

/<\/body>/ {
r ga.js
' -e N -e '}'


OK I wouldn’t have thought of that one.  As he explains,
 

The ‘r’ command actually outputs the file just before reading a new line to the pattern buffer (or at EOF). That can be forced in mid-script by ‘n’ or ‘N’, though ‘n’ will also print the buffer before ‘r’ does its thing.
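A micro-demo convinced me (assuming ga.js exists in the current directory): once </body> is alone on a line, the queued file is flushed by N before the </body> line itself prints.

$ printf 'one\n</body>\ntwo\n' | sed -e '/<\/body>/ {
r ga.js
' -e N -e '}'
one
<contents of ga.js>
</body>
two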

He goes on to cover more general cases.  There is also another approach described in the same forum that uses the hold command, h, and the exchange command, x.
Having tested the code that does the file insertion, I put it into a shell script that finds all *.html files, checks to see if they have the GA code in them already (you’re only allowed to put it in once), and if not, performs the insertion.
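Roughly, that script looks like the sketch below; the marker string and the loop are my reconstruction, and ga.js is the snippet file named above:

#!/bin/sh
# Insert ga.js before </body> in any HTML file that does not already
# contain the Analytics snippet.  MARKER is assumed to be a string
# unique to the snippet.
MARKER='google-analytics.com'
find . -name '*.html' | while read -r f; do
    grep -q "$MARKER" "$f" && continue   # already has the code
    # pass 1 puts </body> on its own line; pass 2 is the r/N trick
    sed -e 's|</body>|\
&\
|' "$f" | sed -e '/<\/body>/ {
r ga.js
' -e N -e '}' > "$f.tmp" && mv "$f.tmp" "$f"
done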
Next problem…the ISP’s server doesn’t offer cron!  I need something to run the sed script on the HTML files without my having to ssh in and run it.  Aargh.  Neither can I execute an arbitrary command on the server using my FTP client.  I might try making a passwordless ssh setup (using id_rsa.pub keys) and then see if I can get Cyberduck (FTP client) to run the script after it performs the transfer.
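The passwordless-ssh part is standard OpenSSH housekeeping; something like this, where the server name and script path are placeholders:

ssh-keygen -t rsa                    # accept the defaults, empty passphrase
ssh-copy-id user@webserver           # installs id_rsa.pub on the server
ssh user@webserver './insert_ga.sh'  # the command Cyberduck would need to run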
That’s enough for one day, though.  I learned a lot about sed, and a lot about how much I don’t know.

Perl Hashes Ate My Workstation

Perl is not noted for its leanness, but today I finally ran some little tests to see just how much memory it was devouring.  I use some OO Perl code to process image files: there is a base class Image::Med from which Image::Med::DICOM, Image::Med::Analyze, and a few others are derived.  I store each DICOM element in an object instantiated as a hash; it’s of class Image::Med::DICOM::DICOM_element, which is derived from a base class Image::Med::Med_element.  The inheritance works quite well and I’m able to move most of the functionality into the base classes, so adding new subclasses for different file formats is reasonably easy.
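The shape of it, as a hedged sketch (class and field names are from above; the constructor bodies are invented for illustration):

package Image::Med::Med_element;
sub new {
    my ($class, %args) = @_;
    # the ten base fields live in one hash per element
    my $self = { map { $_ => $args{$_} } qw(name parent length offset) };
    return bless $self, $class;
}

package Image::Med::DICOM::DICOM_element;
our @ISA = ('Image::Med::Med_element');
sub new {
    my ($class, %args) = @_;
    my $self = $class->SUPER::new(%args);
    # the three DICOM-specific fields are added in the subclass
    @$self{qw(code group element)} = @args{qw(code group element)};
    return $self;
}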

Perl hashes are seductive: it’s so easy to add elements, and things tend to just work.  So my derived DICOM element class ends up having 13 keys in its hash, of which 10 are in the base class ('name', 'parent', 'length', 'offset' and so on) and three are added in the derived class ('code', 'group', 'element') as being DICOM-specific.
As mentioned, I never claim Perl is svelte (or fast), but today I was sorting about 2,000 DICOM files.  I like to keep them all in memory, for convenience and sorting, before writing or moving to disk.  Heck, we’re only talking about a few thousand things here and computers work in the billions…all too easy to forget about memory usage.
I was unpleasantly surprised to find that each time I read in a DICOM file of just over 32 kB (PET scans are small: 128 x 128 x 2 bytes), I was consuming over 300 kB of memory.  So my full dataset of only 70 MB was using up almost a GB of RAM.  And that was for only 2,100 files, whereas I have one scanner that generates over 6,500 DICOM files per study.  I have the RAM to handle it, but my inner CS grad has a problem with a tenfold inflation of memory.
I used the Perl module Devel::Size to measure the size of hashes, and the answers aren’t pretty: on my 64-bit Linux workstation each hash element consumes 64 bytes in overhead.  Crikey!  So 64 bytes, times 13 fields per DICOM element, times 200-odd DICOM elements per object: that’s on the order of 200 kB per DICOM object before I even put any data into it.
On my 64-bit Mac with perl 5.8.8 it’s not much better, at 39 bytes per minimal element.  I compared it with an array, which turned out to use 16 bytes per minimal element.
#! /usr/local/bin/perl -w
use Devel::Size 'total_size';
my %h = ();
print "0 hash elements, size = " . total_size(\%h) . "\n";
$h{'a'} = 1;
print "1 hash elements, size = " . total_size(\%h) . "\n";
$h{'b'} = 2;
print "2 hash elements, size = " . total_size(\%h) . "\n";
my @a = ();
print "0 array elements, size = " . total_size(\@a) . "\n";
$a[0] = 1;
print "1 array elements, size = " . total_size(\@a) . "\n";
$a[1] = 2;
print "2 array elements, size = " . total_size(\@a) . "\n";

% ~/tmp/hashsize.pl
0 hash elements, size = 92
1 hash elements, size = 131
2 hash elements, size = 170
0 array elements, size = 56
1 array elements, size = 88
2 array elements, size = 104

I know the answer is: don’t use giant hashes in Perl.  Or perhaps it is: don’t use Perl when you’re manipulating 2,000 x 200 x 13 elements.  But I like Perl, it’s so convenient.  Perhaps I’ll reimplement the whole thing as an array (ugh), and/or cut down the number of fields per DICOM element (indexing a 13-element array: not fun).
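If I do go the array route, constants would take some of the pain out of the indexing.  A sketch of the idea (field names from the hash version; the positions are arbitrary):

use constant {
    NAME    => 0,
    PARENT  => 1,
    LENGTH  => 2,
    OFFSET  => 3,
    # ... remaining base fields at 4..9 ...
    CODE    => 10,
    GROUP   => 11,
    ELEMENT => 12,
};
my @el;
$el[NAME]  = 'PatientName';   # reads almost like the hash version
$el[GROUP] = 0x0010;
print $el[NAME], "\n";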