PostgreSQL下查找与去除重复数据

有时候难免会导入重复的数据(一行所有的列都重复),问题是如何找出来呢?

假设有表 tbl,结构如下,

CREATE TABLE tbl (
 id serial,
 href text NULL,
 title text NULL,
 image text NULL
)
WITH (
 OIDS=FALSE
);

如何判断是否存在id, href,title和image都重复的数据呢(这种情况一般发生在导入重复)?

select * from (
  SELECT id,
  ROW_NUMBER() OVER(PARTITION BY id, href, title, image) AS Row
  FROM tbl
) dups
where 
dups.Row > 1

如何判断是否存在重复href,title和image, 但id不同的数据呢(这种一般是OLTP的新增重复)?

select * from (
  SELECT id,
  ROW_NUMBER() OVER(PARTITION BY  href, title, image ORDER BY id asc) AS Row
  FROM tbl
) dups
where 
dups.Row > 1

接下来就是这样的数据如何只保留最早插入的数据,而删除后面重复插入的数据?

DELETE FROM tbl
WHERE id IN (SELECT id
              FROM (SELECT id,
                             ROW_NUMBER() OVER (partition BY href, title, image ORDER BY id) AS rnum
                     FROM tbl) t
              WHERE t.rnum > 1);

 

 

reference,

https://wiki.postgresql.org/wiki/Deleting_duplicates