Java: cleaning list data with the box-plot (boxplot) rule to filter out extreme high and low values

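[Figure: weather curve over one day]

As the figure shows, the weather curve for one day has a sudden, very high value at one point in time, which distorts the statistics. How can such outliers be filtered out in Java?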

package org.jeecg.modules.data.analysis;

import cn.hutool.core.util.ObjectUtil;
import com.alibaba.fastjson.JSON;
import lombok.extern.slf4j.Slf4j;
import org.apache.commons.math3.stat.descriptive.rank.Median;

import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

/**
 * Cleans list data using the box-plot (boxplot) rule
 * https://blog.csdn.net/zhongkaigood/article/details/113887879
 *
 * @author: Haiming Yu
 * @createDate:2022/12/22
 * @description:
 */
@Slf4j
public class BoxPlotFilter {

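    /**
     * Box-plot filter. A box plot is also called a box-and-whisker plot. Requires at least 4 samples.
     * @param data values to check
     * @param multiplierMax IQR multiplier for the upper bound (defaults to 1.7 when null); can be set larger on its own when only the highest values should be filtered
     * @param multiplierMin IQR multiplier for the lower bound (defaults to 1.5 when null)
     * @return a copy of data in its original order, with each outlier replaced by null
     */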
    public static List<Double> boxPlotFilterListDouble(List<Double> data, Double multiplierMax, Double multiplierMin) {
        List<Double> returnList = new ArrayList<>(data);
        try {
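            // Sort a copy so the caller's list is not reordered
            log.debug("data before sorting:" + data);
            List<Double> sorted = new ArrayList<>(data);
            Collections.sort(sorted);
            log.debug("data after sorting:" + sorted);

            double[] dataA = new double[sorted.size()];
            for (int i = 0; i < sorted.size(); i++) {
                dataA[i] = sorted.get(i);
            }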

            // Lower quartile (Q1), weighted between two neighbouring sorted values
            double q1 = dataA[dataA.length / 4 - 1] / 4 + dataA[dataA.length / 4] * 3 / 4;
            // Median (Q2)
            Median median = new Median();
            double q2 = median.evaluate(dataA);
            // Upper quartile (Q3)
            double q3 = dataA[(dataA.length - dataA.length / 4) - 1] * 3 / 4 + dataA[(dataA.length - dataA.length / 4)] / 4;
            // Interquartile range (IQR)
            double iqr = q3 - q1;
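            // The usual multiplier is 1.5; it removed too many normal values, so the upper default is 1.7
            if (ObjectUtil.isEmpty(multiplierMax)) {
                multiplierMax = 1.7;
            }
            // Lower default multiplier is 1.5
            if (ObjectUtil.isEmpty(multiplierMin)) {
                multiplierMin = 1.5;
            }
            // max and min are the upper and lower bounds; values outside them count as outliers
            double max = q3 + multiplierMax * iqr;
            double min = q1 - multiplierMin * iqr;
            log.debug("\nlower quartile:" + q1 + " median:" + q2 + " upper quartile:" + q3 + " \nmax:" + max + " min:" + min);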
            List<Double> errorData = new ArrayList<>();
            // Skip filtering when zeros dominate the data: the bounds, or the lower quartile
            // and the median, collapse to 0 and every non-zero value would be flagged.
            // min, max, q1 and q2 are primitive doubles, so compare them with ==, not compareTo
            boolean compTo1 = min == max && min == 0.0;
            boolean compTo2 = q1 == q2 && q2 == 0.0;
            for (Double vo : data) {
                if ((vo < min || vo > max) && !(compTo1 || compTo2)) {
                    // Null out the first remaining occurrence so positions stay aligned with the input
                    returnList.set(returnList.indexOf(vo), null);
                    errorData.add(vo);
                }
            }
            if (errorData.size() > 0) {
                log.info("boxplot filter: values " + JSON.toJSONString(errorData) + " in this list are outliers! \n" + data);
            }
        } catch (Exception e) {
            // Pass the exception to the logger instead of calling printStackTrace()
            log.error("error data:" + data, e);
        }
        return returnList;
    }

    public static void main(String[] args) {
        // Raw input data
//        double[] data =new double[] {1,2,8,10,8,5,2,4,6,11,15,1,2,8,10,8,5,2,4,6,11,15,1000,1000};
//        double[] data =new double[] {1,2,8,10,8,5,2,4,6,11,15,1,2,8,10,8,5,2,4,6,11,15,10,10};
        double[] data = new double[]{0, 0, 0, 10, 0};
        List<Double> dataList = new ArrayList<>();
        for (double vo : data) {
            dataList.add(vo);
        }
        List<Double> outliersList = boxPlotFilterListDouble(dataList, null, null);
        log.debug(JSON.toJSONString(outliersList));
    }
}
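
For readers who want to try the rule without Lombok, Hutool, fastjson or commons-math3, here is a minimal dependency-free sketch of the same idea. The class name IqrFences and its method names are hypothetical and not part of the original code. It reuses the quartile interpolation from boxPlotFilterListDouble above, assumes at least 4 non-null samples, and leaves out the zero-heavy special case.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical helper: box-plot bounds with separate upper and lower multipliers. */
class IqrFences {

    /** Returns {lowerBound, upperBound}; requires at least 4 non-null values. */
    static double[] fences(List<Double> values, double multiplierMax, double multiplierMin) {
        if (values.size() < 4) {
            throw new IllegalArgumentException("need at least 4 samples");
        }
        // Work on a sorted copy so the caller's list keeps its order
        double[] s = values.stream().mapToDouble(Double::doubleValue).sorted().toArray();
        int n = s.length;
        // Same weighted interpolation as in boxPlotFilterListDouble
        double q1 = s[n / 4 - 1] / 4 + s[n / 4] * 3 / 4;
        double q3 = s[n - n / 4 - 1] * 3 / 4 + s[n - n / 4] / 4;
        double iqr = q3 - q1;
        return new double[]{q1 - multiplierMin * iqr, q3 + multiplierMax * iqr};
    }

    /** Copies the input and replaces every value outside the bounds with null. */
    static List<Double> filter(List<Double> values, double multiplierMax, double multiplierMin) {
        double[] f = fences(values, multiplierMax, multiplierMin);
        List<Double> result = new ArrayList<>(values.size());
        for (Double v : values) {
            result.add(v < f[0] || v > f[1] ? null : v);
        }
        return result;
    }

    public static void main(String[] args) {
        // Illustrative sample: Q1 = 3.5, Q3 = 10.25, IQR = 6.75
        List<Double> data = Arrays.asList(1.0, 2.0, 8.0, 10.0, 8.0, 5.0, 2.0, 4.0, 6.0, 11.0, 15.0, 1000.0);
        // Bounds are about -6.625 and 21.725, so only 1000.0 is replaced by null
        System.out.println(filter(data, 1.7, 1.5));
    }
}
```

With the multipliers 1.7 and 1.5, the value 15.0 stays inside the bounds while 1000.0 is removed. Raising multiplierMax widens the upper bound and keeps more high readings.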

Connected to the target VM, address: '127.0.0.1:58441', transport: 'socket'
11:14:03.359 [main] DEBUG org.jeecg.modules.data.analysis.BoxPlotFilter - data before sorting:[]
11:14:03.418 [main] DEBUG org.jeecg.modules.data.analysis.BoxPlotFilter - 
lower quartile:4.5 median:7.7 upper quartile:11.2 
max:38.0 min:-122.79999999999998
11:14:03.556 [main] INFO org.jeecg.modules.data.analysis.BoxPlotFilter - boxplot filter: values [38.8] in this list are outliers! 

11:14:03.872 [main] INFO org.jeecg.modules.data.analysis.WeatherBaseAnalysis - weather:{"clean":"F","cloud":"2","dataTime":"2022-11-17 18:49:02","dataType":"LGH","elavation":0.0,"gustSpeed":0.0,"huminity":15.0,"light":105.0,"pressure":90442.54,"rainFall":0.0,"remark":"{\"clean\":\"F\",\"cloud\":\"2\",\"dataTime\":\"2022-11-17 18:49:02\",\"dataType\":\"LGH\",\"elavation\":0.0,\"gustSpeed\":0.0,\"huminity\":15.0,\"light\":105.0,\"pressure\":90442.54,\"rainFall\":0.0,\"syncStatus\":1,\"temperature\":38.8,\"uv\":0.0,\"uvi\":0,\"windSpeed\":0.42}","syncStatus":3,"uv":0.0,"uvi":0,"windSpeed":0.42}
Disconnected from the target VM, address: '127.0.0.1:58441', transport: 'socket'

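Usage: the run above successfully filtered out the reading above 38 degrees (38.8). Set multiplierMax to control the upper bound and multiplierMin to control the lower bound.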
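
The two bounds printed in the log can be checked by hand from the logged quartiles. The sketch below assumes the log was produced with the multipliers 4.0 and 19.0 from this article's usage call, which matches the logged numbers. The class FenceArithmetic and its methods are hypothetical names used only for this check.

```java
/** Hypothetical helper that reproduces the bound arithmetic from the log. */
class FenceArithmetic {

    /** Upper bound: Q3 + multiplierMax * IQR. */
    static double upperFence(double q1, double q3, double multiplierMax) {
        return q3 + multiplierMax * (q3 - q1);
    }

    /** Lower bound: Q1 - multiplierMin * IQR. */
    static double lowerFence(double q1, double q3, double multiplierMin) {
        return q1 - multiplierMin * (q3 - q1);
    }

    public static void main(String[] args) {
        double q1 = 4.5;   // lower quartile from the log
        double q3 = 11.2;  // upper quartile from the log
        // IQR = 11.2 - 4.5 = 6.7
        double max = upperFence(q1, q3, 4.0);   // 11.2 + 4 * 6.7, about 38.0
        double min = lowerFence(q1, q3, 19.0);  // 4.5 - 19 * 6.7, about -122.8
        System.out.println("max=" + max + " min=" + min);
        // The logged temperature 38.8 lies above the upper bound, so it is flagged
        System.out.println("38.8 is outlier: " + (38.8 > max));
    }
}
```

The large lower multiplier of 19.0 pushes the lower bound far below any realistic temperature, so in practice this call only removes high spikes.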

            List<Double> listTemperature = dataWeatherList.stream().map(DataWeather::getTemperature).collect(Collectors.toList());
            List<Double> reTemperature = null;
            if (temperature) {
                reTemperature = BoxPlotFilter.boxPlotFilterListDouble(listTemperature, 4.0,19.0);
            }
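
Because boxPlotFilterListDouble returns a list of the same length as its input, with each flagged value replaced by null, the positions still line up with dataWeatherList, but any statistic computed afterwards has to skip the nulls. A minimal sketch of that step follows. FilteredAverage is a hypothetical helper and the sample values are made up for illustration.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Objects;

/** Hypothetical helper for consuming a filtered list that contains nulls. */
class FilteredAverage {

    /** Averages the non-null entries; returns NaN when nothing is left. */
    static double averageIgnoringNulls(List<Double> filtered) {
        return filtered.stream()
                .filter(Objects::nonNull)          // drop the positions flagged as outliers
                .mapToDouble(Double::doubleValue)
                .average()
                .orElse(Double.NaN);
    }

    public static void main(String[] args) {
        // Stand-in for reTemperature: the second reading was flagged and set to null
        List<Double> reTemperature = Arrays.asList(7.5, null, 8.0, 8.5);
        System.out.println(averageIgnoringNulls(reTemperature)); // prints 8.0
    }
}
```

Note that in the usage code above reTemperature itself stays null when the temperature flag is false, so callers would need to check for that before passing it on.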

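Maven dependency (commons-math3, used for Median; the class also imports Hutool, fastjson and Lombok, which are not listed here)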

        <!-- math3 -->
        <dependency>
            <groupId>org.apache.commons</groupId>
            <artifactId>commons-math3</artifactId>
            <version>3.6.1</version>
        </dependency>

Reference
https://blog.csdn.net/zhongkaigood/article/details/113887879
This version improves on that method so that the upper and lower bound ranges can be controlled separately.
