Sed Cleverness

I got sed to do something clever today, though sadly the cleverness was not mine.  I tried to solve the problem myself and although in the process I learned a great deal about sed, I had to resort to copying the answer.

I want to add Google Analytics code to my sister’s website, which she’s writing using iWeb.  Analytics is enabled by including in your HTML a JavaScript snippet that Google gives you.  iWeb does a really nice job, but you have to do things their way, and that means no JavaScript.  Fair enough, I guess: Apple want to ensure that websites produced using their software will always work, and introducing a programming language pretty much ensures that things frequently won’t work.


The website is hosted on a Linux server, not an Apple account, so to publish the site we export its contents to a directory, then FTP the directory contents to the server.  Initially we used Filezilla but it does not have incremental directory synchronization, so we switched to the awesome Cyberduck.
So my first thought was: perhaps we can modify the HTML files before they leave her Mac.  There are a couple of approaches available.  One is an iWeb add-on, which looked more complex than needed, and you have to buy it.  Another is an Automator action you can download that will insert the Google JavaScript into the HTML files.  That sounded good but was one more action to perform, and I’ve never used Automator.
So I thought, I’ll knock up a little script that the web server can run as an hourly cron job to edit any HTML files that don’t contain the Google code.  Ha!  Little script though it is, it took a while.  I got a lot of help from Bruce Barnett’s sed guide, as I don’t have a good shell book with me right now.  I should buy O’Reilly’s ‘Sed & Awk’ book, a classic if ever there was one, and I believe it may even have been the first book they ever published.  I remember it in print in the early ’90s, and they even had a T-shirt of the cover, which I dearly wish I’d bought.


The tricky part was, the Google code is supposed to be included immediately before the </body> tag in each HTML page.  Two problems: I couldn’t be sure that the </body> tag would be on a line by itself, and sed file inclusion acts after the matched pattern, not before.
Problem 1 was addressed using a simple substitution with newlines:

sed -e '
s|</body>|\
&\
|
'

I used pipe characters as the delimiters, so the slash in </body> needs no escaping.  The substitution matches the string </body>, and the newlines are inserted literally: each backslash at the end of a line escapes the newline, so it becomes part of the replacement text rather than ending the command.  The ampersand stands for the matched string, so this substitution puts a newline before and after the </body> tag, to ensure it’s on its own line.
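
As a quick check, the substitution can be tried on a made-up one-line page.  This is only a sketch: the sample HTML and the variable names are my own placeholders, not anything iWeb actually produces.

```shell
# Hypothetical one-line sample page; real iWeb output will look different.
sample='<html><body><p>Hi</p></body></html>'

# Replace </body> with newline + matched text + newline.
# Each trailing backslash escapes the newline inside the replacement.
split_out=$(printf '%s\n' "$sample" | sed -e '
s|</body>|\
&\
|
')

# The closing tag should now sit on a line by itself.
printf '%s\n' "$split_out"
```

If the tag is already on its own line, the same command simply leaves an empty line above and below it, which is harmless in HTML.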
Problem 2 was harder.  I didn’t know about the file-insertion command, r, till today, though I figured that sed would have one.  It does, but it inserts after the matched line.  My initial approach to inserting the Google Analytics JavaScript code, kept in the file ga.js, was:

sed -e '
/<\/body>/ {
r ga.js
}'

But this inserted the file after the </body> tag, not before it as required.
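
The effect is easy to reproduce in a scratch directory.  In this sketch the contents of ga.js and page.html are placeholders I made up; only the sed script matches the attempt above.

```shell
# Scratch files standing in for the real ones (contents are invented).
work=$(mktemp -d "${TMPDIR:-/tmp}/sed-r-demo.XXXXXX")
printf '%s\n' '<script>/* GA snippet */</script>' > "$work/ga.js"
printf '%s\n' '<p>Hi</p>' '</body>' '</html>' > "$work/page.html"

# r only queues ga.js; the file is written out at the end of the cycle,
# after the matching line itself has already been printed.
after_out=$(cd "$work" && sed -e '
/<\/body>/ {
r ga.js
}' page.html)

printf '%s\n' "$after_out"
rm -rf "$work"
```

The snippet comes out on the line following </body>, which is exactly the problem.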
My next thought was to take a two-pronged approach.  I’d print every line not matching the </body> pattern and, in a separate rule matching the </body> pattern, delete the line, insert the file, then print the line.

sed -e '
/<\/body>/ !{
p
}
/<\/body>/ {
r ga.js
d
p
}'

Not to be.  As Barnett explains, the d command deletes the pattern space and immediately starts the next cycle, so the print command is never executed.
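
A tiny experiment shows the p after d being skipped.  The two input lines here are arbitrary test data of my own, not part of the real site.

```shell
# Two made-up input lines; the second one matches the address.
d_out=$(printf '%s\n' 'one' 'two' | sed -e '
/two/ {
d
p
}')

# If p ever ran, "two" would appear in the output. It does not.
printf '%s\n' "$d_out"
```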
At this point after some hours of learning sed, I looked for an answer to inserting a file before a sed pattern, and found one.  At least by this point I knew enough to understand it (sort of).  This post by Tapani Tarvainen gave me a very succinct answer for the second pattern action:

/<\/body>/ {
r ga.js
' -e N -e '}'


OK I wouldn’t have thought of that one.  As he explains,
 

The ‘r’ command actually outputs the file just before reading a new line to the pattern buffer (or at EOF). That can be forced in mid-script by ‘n’ or ‘N’, and while ‘n’ will also print the buffer before ‘r’ does its thing.
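
Put together as a complete command, the trick can be sketched like this.  The file contents are placeholders of mine, and I am assuming the </body> tag is already on its own line and is not the last line of the file.

```shell
# Scratch files standing in for the real ones (contents are invented).
work=$(mktemp -d "${TMPDIR:-/tmp}/sed-n-demo.XXXXXX")
printf '%s\n' '<script>/* GA snippet */</script>' > "$work/ga.js"
printf '%s\n' '<p>Hi</p>' '</body>' '</html>' > "$work/page.html"

# r queues ga.js; N then reads the next input line, which flushes the
# queued file to the output while </body> is still waiting, unprinted,
# in the pattern space.
before_out=$(cd "$work" && sed -e '
/<\/body>/ {
r ga.js
' -e N -e '}' page.html)

printf '%s\n' "$before_out"
rm -rf "$work"
```

The first -e has to end with a newline because r treats everything up to the end of the line as the file name.  If </body> were the very last line, N would have nothing left to read, and sed implementations differ in what they do then, so that case would need separate handling.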

He goes on to cover more general cases.  The same forum also describes another approach, which uses the hold command, h, and the exchange command, x.
Having tested the code that does the file insertion, I put it into a shell script that finds all *.html files, checks to see if they have the GA code in them already (you’re only allowed to put it in once), and if not, performs the insertion.
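
A script along those lines might look roughly like the sketch below.  It is not the script I actually ran: the function name add_ga, the marker string used to detect an existing snippet, and the choice to run the two sed steps as a pipeline are all assumptions made for illustration.

```shell
#!/bin/sh
# Sketch only. Assumptions: the snippet is stored in a file, and the
# marker string below appears in the snippet and nowhere else in a page.
GA_MARKER='google-analytics.com'

# add_ga SITE_DIR SNIPPET_FILE (hypothetical helper)
add_ga() {
    site_dir=$1
    snippet=$2
    find "$site_dir" -type f -name '*.html' | while IFS= read -r page; do
        # The code may only be included once, so skip pages that have it.
        if grep -q "$GA_MARKER" "$page"; then
            continue
        fi
        tmp=$(mktemp "${TMPDIR:-/tmp}/add-ga.XXXXXX") || continue
        # Pass 1 puts </body> on its own line; pass 2 inserts the
        # snippet in front of it using the r + N trick.
        sed -e 's|</body>|\
&\
|' "$page" | sed -e '/<\/body>/ {
r '"$snippet"'
' -e N -e '}' > "$tmp" && cat "$tmp" > "$page"
        rm -f "$tmp"
    done
}

# Demo on a throwaway directory with made-up content.
demo=$(mktemp -d "${TMPDIR:-/tmp}/add-ga-demo.XXXXXX")
printf '%s\n' '<script>/* placeholder for the google-analytics.com snippet */</script>' > "$demo/ga.js"
printf '%s\n' '<html><body><p>Hi</p></body></html>' > "$demo/index.html"

add_ga "$demo" "$demo/ga.js"
add_ga "$demo" "$demo/ga.js"   # a second run should change nothing

inserted=$(grep -c "$GA_MARKER" "$demo/index.html")
result=$(cat "$demo/index.html")
rm -rf "$demo"
printf '%s\n' "$result"
```

Two passes are used because after the substitution the whole original line, embedded newlines included, is still a single pattern space, so doing the insertion in the same pass would put the snippet ahead of the entire line rather than just ahead of the tag.  Writing back with cat rather than mv keeps the page's existing permissions.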
Next problem… the ISP’s server doesn’t offer cron!  I need it to run the sed script on the HTML files without my having to ssh in and run it by hand.  Aargh.  Neither can I execute an arbitrary command on the server using my FTP client.  I might try setting up passwordless ssh (using id_rsa.pub keys) with a small script, and then see if I can get Cyberduck to run that after it performs the transfer.
That’s enough for one day, though.  I learned a lot about sed, and a lot about how much I don’t know.
