Sunday, July 29, 2007

Remove Duplicate Lines In Python

{

I had posted earlier about Python's set type with some questions. All that changed today when I wrote a little script to remove duplicate lines from a file. Calling set() on a list builds a collection that keeps only one copy of each item, so duplicates disappear automatically (the original order is not kept). Very useful for situations like this:


#!/usr/bin/env python

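# Read all lines; set() keeps one copy of each (order is not preserved).
# splitlines() avoids the stray empty entry that split("\n") leaves at the end.
with open("c:\\temp\\Original.txt") as f:
    uniquelines = set(f.read().splitlines())

# Write each unique line back out, one per line.
with open("c:\\temp\\Unique.txt", "w") as f2:
    f2.write("".join(line + "\n" for line in uniquelines))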
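
A set has no notion of order, so the lines in Unique.txt usually come out in a different order than they went in. If the order matters, one option is to use the set only to answer "have I seen this line before?" and write the lines out as they are read. A minimal sketch, where unique_lines and dedupe_file are names made up for this example:

```python
def unique_lines(lines):
    """Yield each line the first time it appears, keeping the input order."""
    seen = set()  # lines already emitted; lookups are O(1) on average
    for line in lines:
        if line not in seen:
            seen.add(line)
            yield line


def dedupe_file(src_path, dst_path):
    """Copy src_path to dst_path, skipping lines that already appeared."""
    with open(src_path) as src, open(dst_path, "w") as dst:
        # Strip the line ending first so a final line without a trailing
        # newline still matches an earlier identical line.
        for line in unique_lines(raw.rstrip("\n") for raw in src):
            dst.write(line + "\n")


print(list(unique_lines(["b", "a", "b", "c", "a"])))  # ['b', 'a', 'c']
```

Calling dedupe_file("c:\\temp\\Original.txt", "c:\\temp\\Unique.txt") would then do the same job as the script above, but with the first occurrence of each line left in place. It also reads the file line by line instead of loading it all at once.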


}

4 comments:

Unknown said...

Superb!

I had a database in FileMaker with 250,000 records and had written a script to delete duplicates. I was looking at around 24 hours to run that in FileMaker.

I exported it as a CSV and ran your code in Python. It took less than 3 seconds and spat out a CSV that I imported back into FileMaker.

Thank you!

G.T. Rajpurohit said...

It is great,
but it alters the order of the lines in the file.

Covert Assassin said...

Hi. I just used this one to eliminate duplicates in my file. I would like to know more about this set function. I'm seriously left wondering how so few lines of code do that perfectly. Any thoughts?

Ramen said...

But the problem is that the set does not keep the original order of the lines once the duplicates are removed. What if we don't want to change the order?