PostgreSQL下查找与去除重复数据

October 13, 2016

alitrack PostgreSQL ETL, 数据去重

PostgreSQL下查找与去除重复数据

有时候难免会导入重复的数据（一行所有的列都重复），问题是如何找出来呢？

假设有表 tbl，结构如下，

CREATE TABLE tbl (
 id serial,
 href text NULL,
 title text NULL,
 image text NULL
)
WITH (
 OIDS=FALSE
);

如何判断是否存在id, href，title和image都重复的数据呢(这种情况一般发生在导入重复）？

select * from (
  SELECT id,
  ROW_NUMBER() OVER(PARTITION BY id, href, title, image) AS Row
  FROM tbl
) dups
where 
dups.Row > 1

如何判断是否存在重复href，title和image，但id不同的数据呢（这种一般是OLTP的新增重复）？

select * from (
  SELECT id,
  ROW_NUMBER() OVER(PARTITION BY  href, title, image ORDER BY id asc) AS Row
  FROM tbl
) dups
where 
dups.Row > 1

接下来就是这样的数据如何只保留最早插入的数据，而删除后面重复插入的数据？

DELETE FROM tbl
WHERE id IN (SELECT id
              FROM (SELECT id,
                             ROW_NUMBER() OVER (partition BY href, title, image ORDER BY id) AS rnum
                     FROM tbl) t
              WHERE t.rnum > 1);

reference,

https://wiki.postgresql.org/wiki/Deleting_duplicates

PostgreSQL下查找与去除重复数据

PostgreSQL下查找与去除重复数据

Written by:

alitrack

Leave a Reply
Cancel reply

PostgreSQL下查找与去除重复数据

Written by:

alitrack

Leave a Reply Cancel reply

Leave a Reply
Cancel reply