Wednesday, December 31, 2008

Using SGMLParser With IronPython


Mark Pilgrim's excellent Dive Into Python has a section on using SGMLParser. Having seen nothing similar (and imagining its many uses!), I thought I'd give it a whirl in IronPython. A good proof of concept, I figured, would be building a database out of link-heavy sites. Since I visit Arts & Letters Daily every so often and the closet intellectual in me likes to hang onto what I find there, I thought I'd target it:

import urllib2
from sgmllib import SGMLParser

import clr
clr.AddReference("System.Data")  # needed before System.Data can be imported
from System import *
from System.Data import *
from System.Net import *

class AlReader(SGMLParser):
    def reset(self):
        # let the base class set up its own state first
        SGMLParser.reset(self)
        self.urls = []
        self.pieces = []
        self.track = 0
        self.prePend = "No Category"
        self.counter = 0

    def start_a(self, attrs):
        href = [v for k, v in attrs if k == "href"]
        key = [v for k, v in attrs if k == "name"]
        if href:
            # a real link: remember its url and start tracking its text
            self.urls.extend(href)
            self.track = 1
        elif key:
            # a named anchor marks the start of a new category
            self.prePend = key[0]

    def handle_data(self, text):
        if self.track:
            self.pieces.append("|".join([self.prePend, text]))
            self.counter = self.counter + 1

    def end_a(self):
        self.track = 0

    def get_links(self):
        links = []
        for i in range(0, len(self.urls)):
            links.append("|".join([self.pieces[i], self.urls[i]]))
        #print "%s %s" % (self.counter, "Total links")
        return links

    def get_link_datatable(self):
        d = DataTable()
        d.Columns.Add(DataColumn("Category", Type.GetType("System.String")))
        d.Columns.Add(DataColumn("Site", Type.GetType("System.String")))
        d.Columns.Add(DataColumn("Url", Type.GetType("System.String")))

        for text in self.get_links():
            newRow = d.NewRow()
            newRow["Category"], newRow["Site"], newRow["Url"] = text.split("|")
            d.Rows.Add(newRow)

        return d

response = urllib2.urlopen("")
a = AlReader()
# feed the page to the parser so the handlers above actually run
a.feed(response.read())
a.close()
linkdata = a.get_link_datatable()
# write it out to prove we got it.
ds = DataSet()
ds.Tables.Add(linkdata)
ds.WriteXml("c:\\temp\\arts and letters links.xml")
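For anyone trying this outside IronPython: sgmllib was removed in Python 3, so the following is only a rough sketch of the same idea using the standard library's html.parser instead. The class name LinkCollector and the sample markup are made up for illustration, and there is no DataTable here, just a list of tuples.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect (category, link text, url) triples from anchor tags."""

    def __init__(self):
        super().__init__()
        self.category = "No Category"  # last <a name="..."> seen
        self.current_url = None        # href of the anchor being read, if any
        self.text_parts = []           # text chunks inside the current anchor
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        attrs = dict(attrs)
        if attrs.get("href"):
            # a real link: start gathering its text
            self.current_url = attrs["href"]
            self.text_parts = []
        elif attrs.get("name"):
            # a named anchor starts a new category
            self.category = attrs["name"]

    def handle_data(self, data):
        if self.current_url is not None:
            self.text_parts.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self.current_url is not None:
            text = "".join(self.text_parts).strip()
            self.links.append((self.category, text, self.current_url))
            self.current_url = None


# Made-up sample markup standing in for a downloaded page.
SAMPLE = """
<a name="Articles"></a>
<a href="http://example.com/one">First <b>essay</b></a>
<a name="Books"></a>
<a href="http://example.com/two">A review</a>
"""

collector = LinkCollector()
collector.feed(SAMPLE)
collector.close()
for row in collector.links:
    print(row)
```

Gathering the text chunks and joining them when the anchor closes keeps one row per link even when the link text contains nested tags.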

If you find this interesting, do make sure you look at Pilgrim's chapter on HTML Processing.


Saturday, December 20, 2008

Parameterized IN Queries


I haven't listened to the podcast yet, but I saw a cool trick from Joel Spolsky for approaching parameterized IN queries. Purists will bemoan its lack of premature optimization, but I think it's novel enough to study because of the approach: applying the SQL LIKE operator to your parameter rather than to a field, which is what people like me are used to. There's code on the StackOverflow post, but I thought I'd paste some of the poking around I did in SQL Server Management Studio:

-- setup
CREATE TABLE Person (
    FirstName VARCHAR(50) NULL
)

-- some data
INSERT INTO Person VALUES('Jonathan')
INSERT INTO Person VALUES('David')
INSERT INTO Person VALUES('Trilby')

-- here's the magic
DECLARE @FirstName VARCHAR(500)
SET @FirstName = '|David|Trilby|'
SELECT * FROM Person WHERE @FirstName LIKE '%|' + FirstName + '|%'
GO

-- ported to a proc
CREATE PROC uspPersonSelector
    @FirstNames VARCHAR(500)
AS
SELECT * FROM Person WHERE @FirstNames LIKE '%|' + FirstName + '|%'
GO

-- showing it works
EXEC uspPersonSelector '|David|Trilby|'

-- somewhere in the netherworld of C#:
string[] names = {"David", "Trilby"};
SqlCommand cmd = GetACommandFromSomewhere();
// the list needs a leading and trailing pipe so every name is delimited
cmd.Parameters.AddWithValue("@FirstNames", "|" + String.Join("|", names) + "|");

-- cleanup
DROP TABLE Person
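To poke at the same trick without a SQL Server handy, here is a minimal sketch using Python's built-in sqlite3 module. It is an adaptation, not the code from the StackOverflow post: SQLite concatenates with || where T-SQL uses +, the helper name pack is made up, and the three rows are sample data.

```python
import sqlite3


def pack(names):
    """Join names into the |a|b| form the query expects (hypothetical helper)."""
    return "|" + "|".join(names) + "|"


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Person (FirstName TEXT)")
conn.executemany(
    "INSERT INTO Person VALUES (?)",
    [("Jonathan",), ("David",), ("Trilby",)],  # sample rows
)

# The parameter sits on the left of LIKE; the column builds the pattern.
query = """
    SELECT FirstName FROM Person
    WHERE ? LIKE ('%|' || FirstName || '|%')
    ORDER BY FirstName
"""
matches = [row[0] for row in conn.execute(query, (pack(["David", "Trilby"]),))]
print(matches)
conn.close()
```

Two caveats with this approach: a stored name containing % or _ acts as a wildcard inside the pattern, and the delimiter must be a character that never appears in the names themselves.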



Friday, December 19, 2008

Programmers as Goalkeepers


The 8th annual New York Times Magazine Year in Ideas featured a section on Goalkeeper Science, profiling a paper by some Israeli scientists called Action bias among elite soccer goalkeepers: The case of penalty kicks. Looking at how keepers approached some 286 penalty kicks, they found that although keepers dived to the right or left 94 percent of the time, the chances of stopping the kick were highest when the goalie stayed in the center. The researchers theorized that keepers behave this way because they are afraid of appearing to do nothing.

Immediately I remembered a passage from Paul Graham's essay What Business Can Learn from Open Source, where he describes a similar dynamic for programmers:

"The other problem with pretend work is that it often looks better than real work. When I'm writing or hacking I spend as much time just thinking as I do actually typing. Half the time I'm sitting drinking a cup of tea, or walking around the neighborhood. This is a critical phase-- this is where ideas come from-- and yet I'd feel guilty doing this in most offices, with everyone else looking busy."

I wonder what Paul would say about the IBM commercial on ideating, which concludes that people should "start doing" after showing an image of people lying inert on an office floor: a stark portrayal of how a manager at IBM might see someone like Paul Graham.

As programmers, much of what we should do may not look like work to the nonprogrammer, and as a result many of us end up doing it at home. I spend a lot of time at home exploring different technologies in a tangential way that wouldn't look like "working" at work, but my best ideas and solutions often come from there. I also spend a lot of time reading technical books and blogs.

I'm wondering what it would look like if we could step back and look in a quantitative way at the performance deficits resulting from the desire to look busy at work. What would the workday look like? I'm wondering what an hour for reading, a few hours for exploratory/research programming, and the rest as project time would do for my own productivity.

If programming were goalkeeping, Edwin van der Sar would be quite the Python hacker.


Thursday, December 18, 2008

Windows Forms + Web, WIB part II


A while back I made the case for applications that put together the strengths of Windows Forms and Web technologies (I thought of the catchy "WIB" as a name for this approach). The example I'd given then was a Windows Forms hosted web browser for annotating local images, which leveraged Windows for local file storage and a Web technology like jQuery for transitions in the user interface.

Today I thought of another use for this approach, one that fit nicely into a tool I've been using for some time to download mp3s from a given website [1][2]. I call the tool "Fortinbras" and if you find it useful I'd be delighted.

So how was Fortinbras changed?

Parsing the HTML for mp3 files was a little tricky. My initial approach was to run a regular expression against the text of the document, which, truth be told, is brittle. Part of why I never trumpeted the tool was that I never completely perfected this tactic (though it worked well enough for me personally). My code looked as follows:

WebClient wc = new WebClient();
string pageText = wc.DownloadString(browser.Url.ToString());
Regex re = new Regex("href=\"(?<url>.+?mp3)\"", RegexOptions.IgnoreCase);
// search the downloaded page text for the first mp3 link
Match mp3Matches = re.Match(pageText);
while (mp3Matches.Success)
{
    string matchUrl = mp3Matches.Groups["url"].Value;
    AddMp3(browser.Url.ToString(), data, matchUrl);
    mp3Matches = mp3Matches.NextMatch();
}

Today my epiphany was that I didn't need to use a regular expression when I could use the DOM from Windows Forms to pull out the anchors that have mp3 destinations. Here's what that looks like:

// wait until the browser control has loaded the document body
while (browser.Document.Body == null) Application.DoEvents();
HtmlElementCollection anchors = browser.Document.Body.GetElementsByTagName("a");
foreach (HtmlElement anch in anchors)
{
    string linkUrl = anch.GetAttribute("href");
    if (linkUrl.ToLower().EndsWith("mp3"))
    {
        AddMp3(browser.Url.ToString(), data, linkUrl);
    }
}
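To see why the regular expression is brittle, here is a small Python sketch that runs both approaches over the same made-up markup. It is only an illustration: Python's html.parser stands in for the WebBrowser control's DOM, and the sample HTML and the class name Mp3Anchors are invented for the example.

```python
import re
from html.parser import HTMLParser

# Made-up markup: a single-quoted link, an upper-case extension,
# and a non-mp3 link whose text happens to contain mp3".
HTML = (
    "<a href='song1.mp3'>one</a>\n"
    '<a class="x" href="song2.MP3">two</a>\n'
    '<a href="page.html">mp3" trap</a>\n'
)

# Regex approach, using the same pattern as the earlier snippet.
regex_hits = re.findall(r'href="(?P<url>.+?mp3)"', HTML, re.IGNORECASE)


class Mp3Anchors(HTMLParser):
    """Collect href values of <a> tags that end in 'mp3'."""

    def __init__(self):
        super().__init__()
        self.urls = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href") or ""
            if href.lower().endswith("mp3"):
                self.urls.append(href)


parser = Mp3Anchors()
parser.feed(HTML)
parser.close()

print("regex :", regex_hits)
print("parser:", parser.urls)
```

On this sample the regular expression misses the single-quoted link and picks up a bogus match that runs past the end of the page.html attribute, while the parser-based version returns just the two mp3 links.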

As usual feedback is welcome - you can download a copy of the Fortinbras project here.

[1] I am aware of the Firefox extensions that do this, but someday (imagine a pie in the sky look on my face) I hope to incorporate a "favorites" list with URLs / locations so that this would be a one stop shop for downloading and organizing my podcasts. My goal here is embarrassment driven development, so I'll probably be bummed enough about the code I've just posted to put in some enhancements as time permits.

[2] My friend's music blog is a great stop; try a couple of tracks at The Look Back.


Wednesday, December 17, 2008

Growing Open Source Community


Kevin Dangoor recorded an interesting screencast on some of the essentials of getting an open source project more widely circulated. Essentially Dangoor explains that having a successful open source project is not just about code, it's about good product management. I wanted to title this post with Dangoor's most quotable quote: "Rails is not where it is because of great code." But there are a lot of people who would take that as an opener for a religious war without seeing his real intent of highlighting marketing and management of a project.

I won't rehash it; it's worth watching for yourself.