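As shown in the figure, the curve of one day's weather readings has a value at a single point in time that is suddenly very high, and it skews the statistics. How can this kind of anomalous data be filtered out in Java?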
package org.jeecg.modules.data.analysis;

import cn.hutool.core.util.ObjectUtil;
import com.alibaba.fastjson.JSON;
import lombok.extern.slf4j.Slf4j;
import org.apache.commons.math3.stat.descriptive.rank.Median;

import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

/**
 * Cleans list data with the boxplot (box plot) rule.
 * https://blog.csdn.net/zhongkaigood/article/details/113887879
 *
 * @author: Haiming Yu
 * @createDate:2022/12/22
 * @description:
 */
@Slf4j
public class BoxPlotFilter {
    /**
     * A box plot is also known as a box-and-whisker plot. The sample must contain at least 4 values.
     *
     * @param data          values to filter
     * @param multiplierMax IQR multiplier for the upper bound; it can be set larger on its own so that only the highest values are filtered
     * @param multiplierMin IQR multiplier for the lower bound
     * @return a copy of data in its original order, with each outlier replaced by null
     */
    public static List<Double> boxPlotFilterListDouble(List<Double> data, Double multiplierMax, Double multiplierMin) {
        List<Double> returnList = new ArrayList<>(data);
        try {
            // Sort a copy so that the caller's list keeps its order
            log.debug("data before sorting:" + data);
            List<Double> sorted = new ArrayList<>(data);
            Collections.sort(sorted);
            log.debug("data after sorting:" + sorted);
            double[] dataA = new double[sorted.size()];
            for (int i = 0; i < sorted.size(); i++) {
                dataA[i] = sorted.get(i);
            }
            // Lower quartile (Q1)
            double q1 = dataA[dataA.length / 4 - 1] / 4 + dataA[dataA.length / 4] * 3 / 4;
            // Median (Q2)
            Median median = new Median();
            double q2 = median.evaluate(dataA);
            // Upper quartile (Q3)
            double q3 = dataA[(dataA.length - dataA.length / 4) - 1] * 3 / 4 + dataA[(dataA.length - dataA.length / 4)] / 4;
            // Interquartile range (IQR)
            double iqr = q3 - q1;
            // The usual multiplier is 1.5; it removed too many normal values, so the default here is 1.7
            if (ObjectUtil.isEmpty(multiplierMax)) {
                multiplierMax = 1.7;
            }
            // Default multiplier: 1.5
            if (ObjectUtil.isEmpty(multiplierMin)) {
                multiplierMin = 1.5;
            }
            double max = q3 + multiplierMax * iqr;
            double min = q1 - multiplierMin * iqr;
            log.debug("\nlower quartile:" + q1 + " median:" + q2 + " upper quartile:" + q3 + " \nupper bound:" + max + " lower bound:" + min);
            List<Double> errorData = new ArrayList<>();
            for (Double vo : sorted) {
                if (vo.compareTo(min) < 0 || vo.compareTo(max) > 0) {
                    double zero = 0.00;
                    // Skip the case where most of the values are zero.
                    // min, max, q1 and q2 are primitive doubles, so they are compared with Double.compare
                    boolean compTo1 = Double.compare(min, max) == 0 && Double.compare(min, zero) == 0;
                    boolean compTo2 = Double.compare(q1, q2) == 0 && Double.compare(q2, zero) == 0;
                    if (!(compTo1 || compTo2)) {
                        returnList.set(returnList.indexOf(vo), null);
                        errorData.add(vo);
                    }
                }
            }
            if (errorData.size() > 0) {
                log.info("Filtered with the boxplot rule; the values " + JSON.toJSONString(errorData) + " in this list are outliers! \n" + data);
            }
        } catch (Exception e) {
            // Fewer than 4 values (or a null element) ends up here; the unfiltered copy is returned
            log.error("error data:" + data, e);
        }
        return returnList;
    }

    public static void main(String[] args) {
        // Original data
        // double[] data =new double[] {1,2,8,10,8,5,2,4,6,11,15,1,2,8,10,8,5,2,4,6,11,15,1000,1000};
        // double[] data =new double[] {1,2,8,10,8,5,2,4,6,11,15,1,2,8,10,8,5,2,4,6,11,15,10,10};
        double[] data = new double[]{0, 0, 0, 10, 0};
        List<Double> dataList = new ArrayList<>();
        for (double vo : data) {
            dataList.add(vo);
        }
        List<Double> outliersList = boxPlotFilterListDouble(dataList, null, null);
        log.debug(JSON.toJSONString(outliersList));
    }
}
11:14:03.359 [main] DEBUG org.jeecg.modules.data.analysis.BoxPlotFilter - data before sorting:[]
11:14:03.418 [main] DEBUG org.jeecg.modules.data.analysis.BoxPlotFilter -
lower quartile:4.5 median:7.7 upper quartile:11.2
upper bound:38.0 lower bound:-122.79999999999998
11:14:03.556 [main] INFO org.jeecg.modules.data.analysis.BoxPlotFilter - Filtered with the boxplot rule; the values [38.8] in this list are outliers!
11:14:03.872 [main] INFO org.jeecg.modules.data.analysis.WeatherBaseAnalysis - weather:{"clean":"F","cloud":"2","dataTime":"2022-11-17 18:49:02","dataType":"LGH","elavation":0.0,"gustSpeed":0.0,"huminity":15.0,"light":105.0,"pressure":90442.54,"rainFall":0.0,"remark":"{\"clean\":\"F\",\"cloud\":\"2\",\"dataTime\":\"2022-11-17 18:49:02\",\"dataType\":\"LGH\",\"elavation\":0.0,\"gustSpeed\":0.0,\"huminity\":15.0,\"light\":105.0,\"pressure\":90442.54,\"rainFall\":0.0,\"syncStatus\":1,\"temperature\":38.8,\"uv\":0.0,\"uvi\":0,\"windSpeed\":0.42}","syncStatus":3,"uv":0.0,"uvi":0,"windSpeed":0.42}
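Usage: the run above successfully filtered out the readings above 38 degrees. Set multiplierMax to control the upper bound and multiplierMin to control the lower bound.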
List<Double> listTemperature = dataWeatherList.stream().map(DataWeather::getTemperature).collect(Collectors.toList());
List<Double> reTemperature = null;
if (temperature) {
    reTemperature = BoxPlotFilter.boxPlotFilterListDouble(listTemperature, 4.0, 19.0);
}
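The list returned by boxPlotFilterListDouble keeps the original positions and holds null wherever an outlier was removed, so any statistic computed afterwards has to skip those entries. A minimal sketch of that step, using only the JDK; FilteredStatsSketch and its two helpers are hypothetical names that do not exist in the original project, and the sample list merely stands in for a filtered result:

```java
import java.util.Arrays;
import java.util.List;
import java.util.Objects;
import java.util.OptionalDouble;

/** Hypothetical helper for illustration; not part of the original project. */
final class FilteredStatsSketch {

    /** Average of the non-null entries; empty when every entry is null. */
    static OptionalDouble averageIgnoringNulls(List<Double> filtered) {
        return filtered.stream()
                .filter(Objects::nonNull)
                .mapToDouble(Double::doubleValue)
                .average();
    }

    /** Largest non-null entry; empty when every entry is null. */
    static OptionalDouble maxIgnoringNulls(List<Double> filtered) {
        return filtered.stream()
                .filter(Objects::nonNull)
                .mapToDouble(Double::doubleValue)
                .max();
    }

    public static void main(String[] args) {
        // Stand-in for a filtered result: the null marks a removed outlier.
        List<Double> reTemperature = Arrays.asList(10.0, null, 20.0);
        System.out.println(averageIgnoringNulls(reTemperature));
        System.out.println(maxIgnoringNulls(reTemperature));
    }
}
```

Returning OptionalDouble makes the "every reading was filtered out" case explicit instead of silently producing 0 or NaN.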
Maven dependency
<!-- math3 -->
<dependency>
    <groupId>org.apache.commons</groupId>
    <artifactId>commons-math3</artifactId>
    <version>3.6.1</version>
</dependency>
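BoxPlotFilter uses commons-math3 only for the median. If adding the dependency is not an option, the same quartile and bound arithmetic can be sketched with the JDK alone. This is a rough sketch rather than the author's class: BoxPlotFenceSketch, fences and withoutOutliers are hypothetical names, outliers are dropped instead of being replaced with null, and the mostly-zero special case is left out.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical, dependency-free sketch of the boxplot rule; not the author's class. */
final class BoxPlotFenceSketch {

    /** Returns {lower bound, upper bound}; needs at least 4 values. */
    static double[] fences(List<Double> values, double multiplierMin, double multiplierMax) {
        if (values.size() < 4) {
            throw new IllegalArgumentException("need at least 4 values");
        }
        double[] sorted = values.stream().mapToDouble(Double::doubleValue).sorted().toArray();
        int n = sorted.length;
        // Same interpolation as BoxPlotFilter: weights 1/4 and 3/4 around the quartile position.
        double q1 = sorted[n / 4 - 1] / 4 + sorted[n / 4] * 3 / 4;
        double q3 = sorted[n - n / 4 - 1] * 3 / 4 + sorted[n - n / 4] / 4;
        double iqr = q3 - q1;
        return new double[] {q1 - multiplierMin * iqr, q3 + multiplierMax * iqr};
    }

    /** Keeps the input order and drops every value outside the bounds. */
    static List<Double> withoutOutliers(List<Double> values, double multiplierMin, double multiplierMax) {
        double[] f = fences(values, multiplierMin, multiplierMax);
        List<Double> kept = new ArrayList<>();
        for (Double v : values) {
            if (v >= f[0] && v <= f[1]) {
                kept.add(v);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        List<Double> data = Arrays.asList(1.0, 2.0, 8.0, 10.0, 8.0, 5.0, 2.0, 4.0, 6.0, 11.0, 15.0, 1000.0);
        System.out.println(Arrays.toString(fences(data, 1.5, 1.5)));
        System.out.println(withoutOutliers(data, 1.5, 1.5));
    }
}
```

For this sample the bounds come out as -6.625 and 20.375, so only 1000.0 is dropped.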
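Reference
https://blog.csdn.net/zhongkaigood/article/details/113887879
The method here is an optimized version of the one in the reference: the upper and lower ranges can be controlled separately.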
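As a worked check of how the two multipliers set the range, the quartiles from the log (Q1 = 4.5, Q3 = 11.2) combined with the multipliers from the usage call (multiplierMax = 4.0, multiplierMin = 19.0) give:

```latex
\begin{aligned}
\mathrm{IQR} &= Q_3 - Q_1 = 11.2 - 4.5 = 6.7 \\
\text{upper bound} &= Q_3 + \text{multiplierMax} \times \mathrm{IQR} = 11.2 + 4.0 \times 6.7 = 38.0 \\
\text{lower bound} &= Q_1 - \text{multiplierMin} \times \mathrm{IQR} = 4.5 - 19.0 \times 6.7 = -122.8
\end{aligned}
```

The reading 38.8 lies just above the upper bound of 38.0, which is why it was flagged. The log prints the lower bound as -122.79999999999998 because of floating-point rounding. With the default multiplierMax of 1.7 the upper bound would instead be 11.2 + 1.7 × 6.7 = 22.59.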