PostgreSQL下查找与去除重复数据
有时候难免会导入重复的数据(一行所有的列都重复),问题是如何找出来呢?
假设有表 tbl,结构如下,
CREATE TABLE tbl ( id serial, href text NULL, title text NULL, image text NULL ) WITH ( OIDS=FALSE );
如何判断是否存在id, href,title和image都重复的数据呢(这种情况一般发生在导入重复)?
select * from ( SELECT id, ROW_NUMBER() OVER(PARTITION BY id, href, title, image) AS Row FROM tbl ) dups where dups.Row > 1
如何判断是否存在重复href,title和image, 但id不同的数据呢(这种一般是OLTP的新增重复)?
select * from ( SELECT id, ROW_NUMBER() OVER(PARTITION BY href, title, image ORDER BY id asc) AS Row FROM tbl ) dups where dups.Row > 1
接下来就是这样的数据如何只保留最早插入的数据,而删除后面重复插入的数据?
DELETE FROM tbl WHERE id IN (SELECT id FROM (SELECT id, ROW_NUMBER() OVER (partition BY href, title, image ORDER BY id) AS rnum FROM tbl) t WHERE t.rnum > 1);
reference,
https://wiki.postgresql.org/wiki/Deleting_duplicates