Tuesday, February 19, 2008

Iron Python DSL for ETL

{

Hanselman wrote about "fluent" interfaces recently which confirmed something I'd been thinking for a while. What if rather than massive tools we just had a language to express ourselves when we wanted to move data around?  What if we could say something like:

database("connection string").table("table").copyto.csv("C:\\spam\\eggs.csv")
or
database("connetion string").table("table").copyto.database("connection string 2")

In the spirit of releasing early and often, here's an implementation of the above language for sql server with Python. I'll be extending the flexibility quite a bit, and my Python skills are still those of a learner, but here it is for comment.  Before I post the whole script, here are a few goals:
1. XML data:
database("connection string").table("table").copyto.xml("C:\\spam\\eggs.xml")
2. Transforms:
transformDictionary = {'column1':'upper', 'column2':'some regex'} #these could become a fluent interface of their own.
database("connection string').table("table").withTransform(transformDictionary).copyto.csv("C:\\spam\\eggs.csv")

So here's that rather unrefined code download the latest implementation

Below was just some scratchwork from a Friday afternoon when I started sketching:

# BORRE DATABASE OBJECT
# 02/08/2008 - David Seruyange - @FRIDAY BURNOUT
import clr
clr.AddReference("System.Data")
from System import *
from System.Data import *
from System.Data.SqlClient import *
from System.IO import *

class registrar:
tableRegister = ""
databaseRegister = ""

class database:
"Model of a database somewhere"
def __init__(self, connectString):
"Initialize the rectangle"
self.connect = connectString
registrar.databaseRegister = connectString
self.copyto = copyto()

def table(self, t):
registrar.tableRegister = t
return self

class copyto:
"Model of a copy of some object"
def csv(self, path):
""" Save the result to a CSV file """
cn = SqlConnection(registrar.databaseRegister)
cn.Open()
cmd = SqlCommand("select * from " + registrar.tableRegister, cn)
r = cmd.ExecuteReader()
fileOut = open(path, 'w')
while r.Read():
sOut = ""
for i in range(r.FieldCount):
sOut += str(r[i]) + ","
sOut += "\n"
fileOut.write(sOut)
cn.Close()
fileOut.close()

def database(self, connect):
(ddl, colInfo) = buildDDL(registrar.tableRegister)
cnTarget = SqlConnection(connect)
cnTarget.Open()
cmdBuildTargetTable = SqlCommand(ddl, cnTarget)
cmdBuildTargetTable.ExecuteNonQuery() #builds the target table
cnSource = SqlConnection(registrar.databaseRegister)
cnSource.Open()
cmdSourceIter = SqlCommand("select * from " + registrar.tableRegister, cnSource)
sourceReader = cmdSourceIter.ExecuteReader()
insertCode = ""
while(sourceReader.Read()):
insertTemplate = "INSERT INTO " + registrar.tableRegister + " VALUES("
for i in range(sourceReader.FieldCount):
insertTemplate += "'" + str(sourceReader[i]) + "',"
insertCode += insertTemplate[:-1] + ")\n"
cmdExecuteInsert = SqlCommand(insertCode, cnTarget)
cmdExecuteInsert.ExecuteNonQuery()
# "save: %s\n table: %s\n to path: %s" % (registrar.databaseRegister, registrar.tableRegister, connect)

def buildDDL(tName):
query = """
select
c.[name], c.max_length, c.[precision], c.scale,
ty.[name] as typename
from
sys.columns c
join sys.tables t on c.object_id = t.object_id
join sys.types ty on c.system_type_id = ty.system_type_id
where
t.[name] = '__table__'
"""
query = query.replace('__table__', tName)
cn = SqlConnection(registrar.databaseRegister)
cn.Open()
cmd = SqlCommand(query, cn)
r = cmd.ExecuteReader()

ddl = """
CREATE TABLE __xyz__(
__columns__)
"""

columnsDefinition = ""
colTypeInfo = []
while r.Read():
#todo: support for other column types such as numeric or decimal
columnTemplate = " COLUMN_NAME DATATYPE LENGTH NULL,"
columnTemplate = columnTemplate.replace('COLUMN_NAME', str(r[0]))
columnTemplate = columnTemplate.replace('DATATYPE', str(r[4]))
colTypeInfo.append(str(r[4]))
if str(r[4]) == "varchar":
columnTemplate = columnTemplate.replace('LENGTH', "(" + str(r[1]) + ")")
elif str(r[4]) == "char":
columnTemplate = columnTemplate.replace('LENGTH', "(" + str(r[1]) + ")")
else:
columnTemplate = columnTemplate.replace('LENGTH', "")
columnsDefinition += columnTemplate + "\n"

r.Close()
cn.Close()
ddl = ddl.replace('__xyz__', tName).replace('__columns__', columnsDefinition).replace(",\n)", "\n\t)")
return [ddl, colTypeInfo]


cn = "Data Source=.\\sqlexpress;Initial Catalog=MVCBaby;Integrated Security=SSPI;"
cn2 = "Data Source=.\\sqlexpress;Initial Catalog=Irvine;Integrated Security=SSPI;"
# jepp
#database(cn).table("Photos").copyto.csv("c:\\foo.csv")
#print "---"
database(cn).table("Photos").copyto.database(cn2)

}

3 comments:

Michael Foord said...

You could get a lot of mileage with judicious use of '__getattr__' for some of these methods...

AJ said...

Nice David! I thought the fluent stuff was pretty interesting too, I wrote about my experiments with it here.

David Seruyange said...

@fuzzyman - that must be the "method missing" type functionality in Python? I'll have to dig into that for sure.

@aj - that is cool stuff, i wonder if the java compiler is smarter when it comes to chaining methods over the C# one?