mysql删除重复数据，保留一条

场景

mysql数据表中有多条重复数据记录，现在想删除删除部分重复数据，保留最后一条更新或者插入的数据。

以学生表为例，我们创建一个简单的数据表来做实验：

create table student(
    id int ,
    user_name varchar(20),
    insert_time datetime
);

往表里面插入一些实验数据：

INSERT INTO `student`(`id`, `user_name`, `insert_time`) VALUES (1, '张三', '2021-04-08 11:57:46');
INSERT INTO `student`(`id`, `user_name`, `insert_time`) VALUES (1, '张三', '2021-04-08 12:57:46');
INSERT INTO `student`(`id`, `user_name`, `insert_time`) VALUES (1, '张三', '2021-04-08 13:57:46');
INSERT INTO `student`(`id`, `user_name`, `insert_time`) VALUES (1, '张三', '2021-04-08 14:57:46');
INSERT INTO `student`(`id`, `user_name`, `insert_time`) VALUES (1, '张三', '2021-04-08 15:57:46');
INSERT INTO `student`(`id`, `user_name`, `insert_time`) VALUES (2, '李四', '2021-04-08 12:57:46');
INSERT INTO `student`(`id`, `user_name`, `insert_time`) VALUES (2, '李四', '2021-04-08 13:57:46');
INSERT INTO `student`(`id`, `user_name`, `insert_time`) VALUES (2, '李四', '2021-04-08 14:57:46');
INSERT INTO `student`(`id`, `user_name`, `insert_time`) VALUES (2, '李四', '2021-04-08 15:57:46');
INSERT INTO `student`(`id`, `user_name`, `insert_time`) VALUES (3, '王五', '2021-04-08 11:57:46');
INSERT INTO `student`(`id`, `user_name`, `insert_time`) VALUES (3, '王五', '2021-04-08 15:57:46');
INSERT INTO `student`(`id`, `user_name`, `insert_time`) VALUES (4, '赵六', '2021-04-08 15:57:46');

image.png

解决思路一

我们可以根据分组查询先将重复数据查询出来，同时也可以获取到最后的更新时间，然后再与原表联表查询小于最大时间的数据，将查询出来的数据删除。
------先来慢慢消化-------

1、分组统计重复数据

在做删除前，我们可以先看看有哪些数据是有重复的：

SELECT id,user_name,COUNT(1) FROM `student` GROUP BY id,user_name HAVING(COUNT(1)>1);

image.png

可以看到张三，李四，王五的数据是有重复的，赵六没有重复，下面我们查找最后更新的记录。

2、查询最后更新的重复记录

SELECT id,user_name,MAX(insert_time) as max_time FROM `student` 
    GROUP BY id,user_name HAVING(COUNT(1)>1);

image.png

我们已经获取到最后更新的重复记录了，那么只要把除去这一部分的数据之外的记录删除即可，通过与原表关联查询，取出old.insert_time<new.max_time的数据进行删除即可。

3、先查询出待删除数据

SELECT a.* FROM student a ,(
    SELECT id,user_name,MAX(insert_time) as max_time FROM `student` 
        GROUP BY id,user_name HAVING(COUNT(1)>1)
) b WHERE a.id=b.id AND a.user_name=b.user_name AND a.insert_time<b.max_time;

image.png

可以看到，最后更新的数据为15:57:46的记录没有在结果中。

4、执行删除

DELETE a.* FROM student a ,(
    SELECT id,user_name,MAX(insert_time) as max_time FROM `student` 
        GROUP BY id,user_name HAVING(COUNT(1)>1)
) b WHERE a.id=b.id AND a.user_name=b.user_name AND a.insert_time<b.max_time;

image.png

可以看到重复记录已经被清理掉。
假如有两行记录是完全一样的，这个方法就不可行了，往表里面在跑一次数据插入：

image.png

执行删除计划：

image.png

可以看到表中完全一样的记录还在，这种情况下，该方法显然不足以满足需求。

解决思路二

创建一个临时表存放最后插入的一条数据（包含重复与没有重复的），然后清空原表，再将临时表的数据复制到原表中，最后把临时表删除。
这个很好理解，相当于ctrl+c，ctrl+v的操作，数据表如下：

image.png

1、创建临时表

CREATE TABLE student_temp LIKE student;

2、查询原表中存放的最后一条数据

SELECT id,user_name,MAX(insert_time) as max_time FROM `student` GROUP BY id,user_name;

image.png

3、将最后一条数据插入临时表中

INSERT INTO student_temp 
    SELECT id,user_name,MAX(insert_time) as max_time FROM `student` GROUP BY id,user_name;

image.png

4、清空原表

truncate student;

image.png

5、将临时表的数据复制到原表中

INSERT INTO student SELECT * FROM student_temp;

image.png

6、将临时表删除

DROP TABLE student_temp CASCADE;

这样数据去重就完成了，需要注意的是，如果表数据量很大，注意在group by 里面的字段建立索引，同时，生产环境注意好先进行数据备份操作。