Calling R Markdown Dynamic Documents from Java via Rserve (Final Draft)

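  This is the final article in the series on installing and testing a GPU Linux virtual host server. It demonstrates how a Java application can call an R Markdown dynamic document through Rserve to generate the PDF analysis report from the earlier Melbourne house price analysis Shiny APP. The Rmd script that generates the report is reused unchanged. Only the data preparation code differs slightly from the reactive structure of the Shiny APP: it has the sequential structure of a conventional program, which is easier to read. The structure of the Java Web APP is basically the same as in the earlier Weibo word cloud example, and the data preprocessing reuses the results from the previous articles, so this example took only one day to finish. Simple as it is, it is quite typical and shows a common scenario for integrating R into the Java platform through Rserve.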

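  An application such as invoice goods-and-services name recognition runs on a heavyweight pretrained deep learning model that consumes a lot of memory and is slow to load, so it is not feasible to load the whole underlying model on the server once for every browser client. First, the server could not sustain the concurrency, especially when running on a GPU; second, the waiting time would be long and the user experience poor. The earlier invoice name recognition Shiny APP was mainly a demonstration; in real use there is no need to display the syntax tree and so on, only to return the recognized text. The way to handle concurrency and response time is to initialize a fixed number of Rserve connections at system startup, preload a few model instances on the back end, and build a connection pool. Java clients request a connection from the pool through a queue, push the recognition request into a queue, and then wait for the result. This forms a buffered queue structure in which the number of clients n >> the number of queues m = the number of running model instances k, which addresses both concurrency and response time.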
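The pooling idea can be sketched in a few lines. This is only a minimal illustration in Python, not the Java implementation the article has in mind: `ConnectionPool`, `FakeModelConnection` and `run_clients` are hypothetical names, and the fake connection merely upper-cases its input in place of a real Rserve connection with a preloaded model.

```python
import queue
import threading
from contextlib import contextmanager


class ConnectionPool:
    """Holds a fixed number of pre-created connections (k model instances)."""

    def __init__(self, create_connection, size):
        self._idle = queue.Queue(maxsize=size)
        for _ in range(size):
            # Created once at startup, so no client pays the model loading cost.
            self._idle.put(create_connection())

    @contextmanager
    def connection(self, timeout=None):
        # Blocks while all connections are busy; this is the buffering queue.
        conn = self._idle.get(timeout=timeout)
        try:
            yield conn
        finally:
            self._idle.put(conn)  # always hand the connection back


class FakeModelConnection:
    """Hypothetical stand-in for an Rserve connection with a preloaded model."""

    def recognize(self, text):
        return text.upper()


def run_clients(pool, requests):
    """Simulate n concurrent clients sharing the k pooled connections."""
    results = [None] * len(requests)

    def client(i, text):
        with pool.connection() as conn:
            results[i] = conn.recognize(text)

    threads = [threading.Thread(target=client, args=(i, t))
               for i, t in enumerate(requests)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results
```

With a pool of size 2 and five clients, at most two requests run at any moment and the other clients wait in `connection()` until a connection is returned.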

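  So Rserve integration has its own suitable scenarios. Lightweight applications can simply embed a Shiny APP in an IFRAME, which gives users a better interactive experience. Applications that need no immediate user interaction, or heavyweight ones such as invoice name recognition, are better integrated through Rserve in production. All roads lead to Rome, and a road network with alternatives is more stable and reliable, and better suited to production use.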

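  Rserve integration executes on the server side, and the client is not limited to a browser. Shiny APP IFRAME embedding executes on the browser side and is connected through JavaScript inter-window communication. That is the difference between the two.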

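  The strength of R Markdown dynamic documents is that they bring with them the data analysis, plotting and connectivity capabilities of the whole R ecosystem, which is very powerful. Let's look at the details.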

I. Results in Action

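1. Input parameters. The outlier threshold is the percentage by which a model's predicted value may deviate from the actual value; a sample whose deviation exceeds the threshold is judged an outlier. The best of these models already scores above 90% on the validation set (the score computed by the Python script is R², the coefficient of determination), which is very high.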

Selecting the house price regression algorithm and the outlier percentage threshold
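Because the models predict the logarithm of the price, the report's R function applies the percentage threshold in log space: the band around an actual value is `origin + log(1 ± threshold/100)`. A minimal sketch of that rule follows; the function names `outlier_band` and `is_outlier` and the requirement that the threshold lie in [0, 100) are assumptions made for illustration.

```python
import math


def outlier_band(log_actual, threshold_pct):
    """Return (lower, upper) bounds in log space for a percentage threshold."""
    # Assumption: the threshold must be below 100, otherwise log(1 - t/100)
    # is undefined.
    if not 0 <= threshold_pct < 100:
        raise ValueError("threshold_pct must be in [0, 100)")
    lower = log_actual + math.log(1 - threshold_pct / 100)
    upper = log_actual + math.log(1 + threshold_pct / 100)
    return lower, upper


def is_outlier(log_actual, log_predicted, threshold_pct):
    """True when the prediction falls outside the band around the actual value."""
    lower, upper = outlier_band(log_actual, threshold_pct)
    return log_predicted < lower or log_predicted > upper
```

For example, with a 30% threshold and an actual price of 1,000,000, a prediction of 1,350,000 is flagged while 1,250,000 is not, since the band in the original scale runs from 700,000 to 1,300,000.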

2. Call the R function, the R Markdown script and the Python models to generate the PDF analysis report.

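  This intermediate result page exists to better demonstrate the calling process; in practice the PDF report could be returned directly and opened in the browser. The call takes fairly long because loading the Python script computes all the models in one go and generates the performance comparison data for every model.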

The download link for the analysis report and the elapsed time of the call are returned

3. Open the analysis report.

  With the nearly 20,000 packages of the whole R ecosystem available, how the report is written is wide open.

The analysis report combines text and figures, and its content is generated dynamically from the parameters passed in.

II. The R Part

1. The R function, melbourneReport.R.

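  This function is what Java clients and other callers invoke. It calls the Python models to do the model calculations, because the house price regression study was originally done in Python.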

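# This package provides the bridge to Python
library(reticulate)

# Location of the Python script that defines the regression models
path<- "/home/ubuntu/scripts/Melbourne_Regress.py"
# Source the script. Using the preprocessed data set, it loads all regression models
# and fits them on the training and validation sets in one go at load time.
# Hyperparameter tuning is done separately in other Python scripts.
# All of this could be written in R, but it already existed in Python, so it is simply called through reticulate.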
t3<- proc.time()
source_python(path)
t4<- proc.time()
# user time, system time, elapsed time, ..., ...
# An object of class "proc_time" which is a numeric vector of length 5, 
# containing the user, system, and total elapsed times for the currently running R process,
# and the cumulative sum of user and system times of any child processes
# spawned by it on which it has waited.
seconds<-t4-t3
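# Training and validation results of each model, wrapped as data frames
train<- PredictTrain()
valid<- PredictValid()
# Performance data of each model
perf<-  performance()
# List of regression algorithms
algos<-names(train)
# Algorithm names shown in the drop-down list for selection. They are the names of the
# named vector; its values are the data frame column names.
# The selectable algorithms start from the 2nd item.
names(algos)<-c("Origin","SVM","RandomForest","GBR","XGB","LigthGBM","CatBoost","Blend")

# Default values of the function's two parameters
algo<-"CatBoost"
threshold<-30

# The function exposed to external callers.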
pdfReport<-function(algo, threshold){
  # Fitted data for the training set
  colName<- algos[algo]
  data<- data.frame(origin = train["origin"], predict = train[colName])
  names(data)<-c("origin","predict")
  data<-data[order(data["origin"]),]
  data["index"]<-1:nrow(data)
  data["upper"]<-data["origin"]+log(1+threshold/100)
  data["lower"]<-data["origin"]+log(1-threshold/100)
  trainSelected<-data
  
  # Fitted data for the validation set
  colName<- algos[algo]
  data<- data.frame(origin = valid["origin"], predict = valid[colName])
  names(data)<-c("origin","predict")
  data<-data[order(data["origin"]),]
  data["index"]<-1:nrow(data)
  data["upper"]<-data["origin"]+log(1+threshold/100)
  data["lower"]<-data["origin"]+log(1-threshold/100)
  validSelected<-data
  
  # Outlier data.
  colName<- algos[algo]
  data<- data.frame(origin = valid["origin"], predict = valid[colName])
  names(data)<-c("origin","predict")
  data["outlier"]<- FALSE
  data[which(data["predict"] < data["origin"]+log(1-threshold/100)),"outlier"] <-TRUE
  data[which(data["predict"] > data["origin"]+log(1+threshold/100)),"outlier"] <-TRUE
  valid_X3<- valid_X2[,c("Lattitude","Longtitude","Distance","BuildingArea","Landsize","YearBuilt","Year")]
  valid_X3["Type"]<-valid_X2["Type_h"]*4+valid_X2["Type_t"]*2+valid_X2["Type_u"]
  valid_X3[which(valid_X3["Type"]==4),"Type"]<-"H"
  valid_X3[which(valid_X3["Type"]==2),"Type"]<-"T"
  valid_X3[which(valid_X3["Type"]==1),"Type"]<-"U"
  # Convert back to the original scale
  valid_X3["Origin"]<-round(exp(data["origin"]),2)
  valid_X3["Predict"]<-round(exp(data["predict"]),2)
  outliers<-valid_X3[which(data["outlier"]==TRUE),]
  outliers["SE"]<-round((outliers["Predict"] - outliers["Origin"])/outliers["Origin"],3)
  #outliers["ID"]<- row.names(outliers)
  outliers<-outliers[order(outliers["SE"]),]
  
  # Build the parameter list passed to the report.
  params <- list(
    algo = algo,
    threshold = threshold,
    rows = "",
    perf = perf,
    train = trainSelected,
    valid = validSelected,
    outliers = outliers
  )
  
  # Write to a temporary file on disk
  fn<<-paste(getwd(),"/melbourneReport",as.character(as.numeric(Sys.time())),".pdf",sep="")
  # Pass in the parameters and call rmarkdown to render the PDF analysis report
  rmarkdown::render("/home/ubuntu/scripts/report.Rmd", 
                    output_file = fn,
                    params = params,
                    envir = new.env(parent = globalenv())
  )

  # Return the report file path for download
  results <<- list(file = fn)
  return(results)
  
}              

# Test call of the function
# pdfReport(algo, threshold)

2. The R Markdown script, report.Rmd.

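  Because this article itself uses markdown syntax to display a markdown script, a backslash has been inserted into the code chunk marker ``` to keep the display from getting garbled; remember to delete the backslash before running the script yourself. The inline R code marker has also been changed to R, because otherwise R Markdown rendering would fail; change it back to r before running. The data is passed in by the function above. See the reference material.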

---
title: "墨尔本房价分析报告"
author: "Jean"
date: "`r Sys.Date()`"
header-includes:
  - \usepackage{ctex}
output: 
  pdf_document:
    latex_engine: xelatex
params:
  algo: NA
  threshold: NA
  rows: NA
  perf: NA
  train: NA
  valid: NA
  outliers: NA
---

# 各回归算法性能数据:

`\``{r, echo=FALSE, message=FALSE, warning=FALSE}
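perf<-params$perf                # Passed-in params are locked and cannot be modified, so copy to another variable.
perf[,-1]<- round(perf[,-1],3)   # Round every column except the first (algorithm name) to 3 decimal places.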
names(perf)<-c("算法","训练集","验证集","耗时(秒)")
knitr::kable(perf)
`\``

# 算法:`R params$algo` ,异常值阀值:`R params$threshold`%。

## 训练集拟合效果

`\``{r fig.showtext = TRUE, echo=FALSE, message=FALSE, warning=FALSE}
library(ggplot2)
library(showtext)
# R Markdown installation and configuration for Chinese text in the PDF body and in figures:
# https://blog.csdn.net/weixin_46128755/article/details/125825935
# Call showtext_auto() to enable Chinese text in ggplot2.
showtext_auto() 

data<-params$train
ggplot(data = data)+
      geom_point(mapping = aes(x = index, y = predict, color = "blue"), size=0.1)+
      geom_line(mapping = aes(x = index, y = origin, color = "red"))+
      geom_ribbon(aes(index, ymax=upper, ymin=lower, fill="yellow"), alpha=0.25)+
      ggtitle("训练集拟合效果图")+
      xlab("样本")+
      ylab("ln(房价)")+
      scale_fill_identity(name = '异常值区间', guide = 'legend',labels = c('正常值')) +
      scale_colour_manual(name = '颜色', 
            values =c('blue'='blue','red'='red'), labels = c('预测值','实际值'))
`\``

## 验证集拟合效果

`\``{r fig.showtext = TRUE, echo=FALSE, message=FALSE, warning=FALSE}
data<-params$valid
ggplot(data = data)+
      geom_point(mapping = aes(x = index, y = predict, color = "blue"), size=0.1)+
      geom_line(mapping = aes(x = index, y = origin, color = "red"))+
      geom_ribbon(aes(index, ymax=upper, ymin=lower, fill="yellow"), alpha=0.25)+
      ggtitle("验证集拟合效果图")+
      xlab("样本")+
      ylab("ln(房价)")+
      scale_fill_identity(name = '异常值区间', guide = 'legend',labels = c('正常值')) +
      scale_colour_manual(name = '颜色', 
                          values =c('blue'='blue','red'='red'), labels = c('预测值','实际值'))

`\``

## 异常值列表

`\``{r, echo=FALSE, message=FALSE, warning=FALSE}
outliers<-head(params$outliers,10) # For demonstration only: print just the first 10 rows.
names(outliers)<-c("纬度","经度","距离","建筑面积","占地面积","建成","交易","类型",
                   "价格","预测","偏差")
knitr::kable(outliers)
`\``

## 选中的异常值行号:`R params$rows`。

III. The Python Part

  Melbourne_Regress.py. For the data, the preprocessing and the hyperparameter tuning, see my articles 《房价预测模型:集成回归与深度学习》 and 《房价预测回归模型之超参数调整》.

  These models are finally wrapped in three functions, PredictTrain(), PredictValid() and performance(), which melbourneReport.R calls.

# Imports
# Ignore Warnings 
import warnings
warnings.filterwarnings('ignore')

# Basic Imports 
import numpy as np
import pandas as pd
import time

# Preprocessing
from sklearn.model_selection import train_test_split, KFold, cross_val_score

# Metrics 
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score

# ML Models
from lightgbm import LGBMRegressor 
from xgboost import XGBRegressor 
from catboost import CatBoostRegressor
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor
from sklearn import svm
from mlxtend.regressor import StackingCVRegressor

# Reading a CSV File
# 9015
df_NN = pd.read_csv("/home/ubuntu/data/Melbourne_housing_pre.csv",  encoding="utf-8")

X=df_NN[['Year','YearBuilt','Distance','Lattitude','Longtitude','Propertycount',
          'Landsize','BuildingArea', 'Rooms','Bathroom', 'Car','Type_h','Type_t','Type_u']]
y=df_NN['LogPrice']
train_X, valid_X, train_y, valid_y = train_test_split(X,y, test_size = .20, random_state=42)

train_X2 = train_X.copy()
valid_X2 = valid_X.copy()

# Data standardization
mean = train_X.mean(axis=0)
train_X -= mean
std = train_X.std(axis=0)
train_X /= std
valid_X -= mean
valid_X /= std

##% evaluateRegressor
# from sklearn.metrics import mean_squared_error, mean_absolute_error
def evaluateRegressor(true,predicted,message = "Test set"):
    R2 = r2_score(true,predicted)
    print(message)
    print("R2 :", R2)
    

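# The settings below are the tuned hyperparameters; CatBoost performs best on this data set.
# CatBoost, LightGBM, GBR, XGB, Stack and Blend all score above 0.90 (R2) on the validation set.

# The dictionaries below record each model's performance and predictions so they can be read from R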
train_acc = {}
valid_acc = {}
predict_train = {}
predict_valid = {}
timetake = {}

# SVM best---------------------------------------------------------------------------------------
t1 = time.time()
params =  {'C': 6.673350889023755, 'gamma': 0.05106238973376298}
svr_best = svm.SVR(**params) 
svr_model_best = svr_best.fit(train_X, train_y)
train_acc["SVM"] = svr_model_best.score(train_X,train_y)
valid_acc["SVM"] = svr_model_best.score(valid_X,valid_y)
predict_train["SVM"] = svr_model_best.predict(train_X)
predict_valid["SVM"] = svr_model_best.predict(valid_X)
t2 = time.time()
timetake["SVM"] = t2-t1
print("SVM: ",t2-t1, "\n")

# Random forest best ----------------------------------------------------------------------------
t1 = time.time()
params =  {'max_depth': 26, 'max_features': 4, 'max_leaf_nodes': 1781, 'max_samples': 0.9703002612349153, 'min_samples_leaf': 0, 'min_samples_split': 0, 'n_estimators': 381}
params['max_depth'] = params['max_depth']+3
params['min_samples_split'] = params['min_samples_split']+2
params['max_leaf_nodes'] = params['max_leaf_nodes']+2
params['min_samples_leaf'] = params['min_samples_leaf']+1
params['n_estimators'] = params['n_estimators']+50
params['max_features'] = params['max_features']+3
rf_best = RandomForestRegressor( random_state = 0,verbose=0, **params)
rf_model_best = rf_best.fit(train_X, train_y)
train_acc["RF"] = rf_model_best.score(train_X,train_y)
valid_acc["RF"] = rf_model_best.score(valid_X,valid_y)
predict_train["RF"] = rf_model_best.predict(train_X)
predict_valid["RF"] = rf_model_best.predict(valid_X)
t2 = time.time()
timetake["RF"] = t2-t1
print("Random forest: ",t2-t1, "\n")

# GBR best ------------------------------------------------------------------------------------
t1 = time.time()
params = {'alpha': 0.9014933457984278, 'learning_rate': 0.035668343067947715, 'max_depth': 6, 'max_features': 5, 'max_leaf_nodes': 943, 'min_impurity_decrease': 0.015183480929538314, 'min_samples_leaf': 2, 'min_samples_split': 30, 'n_estimators': 440, 'subsample': 0.6109167820106534}
params['n_estimators'] = params['n_estimators']+50
params['min_samples_split'] = params['min_samples_split']+2
params['min_samples_leaf'] = params['min_samples_leaf']+1
params['max_depth'] = params['max_depth']+3
params['max_features'] = params['max_features']+3
params['max_leaf_nodes'] = params['max_leaf_nodes']+2
gbr_best = GradientBoostingRegressor(random_state=0,verbose=0, **params)
gbr_model_best = gbr_best.fit(train_X, train_y)
train_acc["GBR"] = gbr_model_best.score(train_X,train_y)
valid_acc["GBR"] = gbr_model_best.score(valid_X,valid_y)
predict_train["GBR"] = gbr_model_best.predict(train_X)
predict_valid["GBR"] = gbr_model_best.predict(valid_X)
t2 = time.time()
timetake["GBR"] = t2-t1
print("GBR: ",t2-t1, "\n")

# XGB best --------------------------------------------------------------------------------------
t1 = time.time()
params =  {'colsample_bytree': 0.8413894273173292, 'gamma': 0.008478702584417519, 'learning_rate': 0.05508679239402551, 'max_bin': 4, 'max_depth': 5, 'min_child_weight': 24.524635200338793, 'n_estimators': 578, 'reg_alpha': 0.809791155072757, 'reg_lambda': 1.4490119256389808, 'subsample': 0.8429852720715357}
params['max_bin'] = params['max_bin']+50
params['max_depth'] = params['max_depth']+3
params['n_estimators'] = params['n_estimators']+100
xgb_best = XGBRegressor(objective ='reg:squarederror', seed = 0,verbosity=0, **params)  # CPU 4.96s/trial
xgb_model_best = xgb_best.fit(train_X, train_y)
train_acc["XGB"] = xgb_model_best.score(train_X,train_y)
valid_acc["XGB"] = xgb_model_best.score(valid_X,valid_y)
predict_train["XGB"] = xgb_model_best.predict(train_X)
predict_valid["XGB"] = xgb_model_best.predict(valid_X)
t2 = time.time()
timetake["XGB"] = t2-t1
print("XGB: ",t2-t1, "\n")

# LightGBM best -------------------------------------------------------------------------------
t1 = time.time()
params = {'colsample_bytree': 0.5142540541056978, 'learning_rate': 0.014284678929509775, 'max_bin': 161, 'max_depth': 4, 'min_child_samples': 5, 'min_child_weight': 4.534457967283932, 'min_split_gain': 0.0006363777341674458, 'n_estimators': 2006, 'num_leaves': 93, 'reg_alpha': 0.0037820689583625278, 'reg_lambda': 2.947360470949046, 'subsample': 0.9448608935296047, 'subsample_freq': 2}
params['max_bin'] = params['max_bin']+50
params['max_depth'] = params['max_depth']+3
params['num_leaves'] = params['num_leaves']+20
params['min_child_samples'] = params['min_child_samples']+10
params['subsample_freq'] = params['subsample_freq']+1
params['n_estimators'] = params['n_estimators']+1000
lgbm_best = LGBMRegressor(seed=0, **params)
lgb_model_best = lgbm_best.fit(train_X, train_y)
train_acc["LGBM"] = lgbm_best.score(train_X,train_y)
valid_acc["LGBM"] = lgbm_best.score(valid_X,valid_y)
predict_train["LGBM"] = lgbm_best.predict(train_X)
predict_valid["LGBM"] = lgbm_best.predict(valid_X)
t2 = time.time()
timetake["LGBM"] = t2-t1
print("LightGBM: ",t2-t1, "\n")

# CatBoost best --------------------------------------------------------------------------------------
t1 = time.time()
params = {'bagging_temperature': 0.5402870554069704, 'border_count': 183, 'depth': 5, 'fold_len_multiplier': 4.43906516804156, 'iterations': 899, 'l2_leaf_reg': 8.334167765336101, 'learning_rate': 0.0997818676941431, 'random_strength': 6.564979609549752, 'rsm': 0.8975065545697877, 'subsample': 0.857395221266925}
params['border_count'] = params['border_count']+150
params['depth'] = params['depth']+2
params['iterations'] = params['iterations']+500
cat_best = CatBoostRegressor(task_type='CPU',
                             random_seed=0,
                             leaf_estimation_iterations=1,
                             max_ctr_complexity=0,                             
                             verbose=False, **params)  # CPU 44.64s/trial
cat_model_best = cat_best.fit(train_X, train_y)
train_acc["Cat"] = cat_model_best.score(train_X,train_y)
valid_acc["Cat"] = cat_model_best.score(valid_X,valid_y)
predict_train["Cat"] = cat_model_best.predict(train_X)
predict_valid["Cat"] = cat_model_best.predict(valid_X)
t2 = time.time()
timetake["Cat"] = t2-t1
print("CatBoost: ",t2-t1, "\n")

# Blend models in order to make the final predictions more robust to overfitting
t1 = time.time()

def blended_predictions(X):
    return (
            (0.20 * cat_model_best.predict(X)) + \
            (0.20 * lgb_model_best.predict(X)) + \
            (0.20 * xgb_model_best.predict(X)) + \
            (0.20 * gbr_model_best.predict(X)) + \
            (0.20 * rf_model_best.predict(X))) 

train_acc["Blend"] = r2_score(train_y, blended_predictions(train_X))
valid_acc["Blend"] = r2_score(valid_y, blended_predictions(valid_X))
predict_train["Blend"] = blended_predictions(train_X)
predict_valid["Blend"] = blended_predictions(valid_X)
t2 = time.time()
timetake["Blend"] = t2-t1
print("Blend: ",t2-t1, "\n")

def PredictTrain():
    df = pd.DataFrame({"origin":train_y, "svm":predict_train["SVM"], "rf":predict_train["RF"],\
         "gbr":predict_train["GBR"], "xgb":predict_train["XGB"],"lgbm":predict_train["LGBM"],\
         "cat":predict_train["Cat"],"blend":predict_train["Blend"]})
    return(df)


def PredictValid():
    df = pd.DataFrame({"origin":valid_y, "svm":predict_valid["SVM"], "rf":predict_valid["RF"],\
         "gbr":predict_valid["GBR"], "xgb":predict_valid["XGB"],"lgbm":predict_valid["LGBM"],\
         "cat":predict_valid["Cat"],"blend":predict_valid["Blend"]})
    return(df)


def performance():
    df = pd.DataFrame({"algo":list(train_acc.keys()),"train":list(train_acc.values()), \
         "valid":list(valid_acc.values()), "time":list(timetake.values())})
    return(df)

#df1 = PredictTrain()
#df2 = PredictValid()
#df3 = performance()

IV. The Java Part

  This is the version that runs on Tomcat 7. To run it on Tomcat 10, change the javax.servlet references in the Servlets to jakarta.servlet.

1. algo.jsp takes the algorithm parameters as input.

<%@ page language="java" contentType="text/html; charset=GBK"
    pageEncoding="GBK"%>
<!DOCTYPE html>
<html>
<head>
<meta charset="GBK">
<title>墨尔本房价分析报告</title>
</head>
<body>
<center>
 <form action="/testR/MelbourneRserve" method="get">
 返回墨尔本房价分析报告<br/>
  算法: <select name="algo">
  <option>SVM</option>
  <option>RandomForest</option>
  <option>GBR</option>
  <option>XGB</option>
  <option>LigthGBM</option>
  <option selected>CatBoost</option>
  <option>Blend</option>
  </select>
  异常值阀值:<input type="number" name="threshold" value=30><br/>
 <input type="submit" value="生成分析报告">
 </form>
 </center>
</body>
</html>

2. The MelbourneRserve Servlet calls the R function.

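  When rc.voidEval("source(fn)") loads melbourneReport.R, the parts of the source outside the function body are executed: loading the data, running the model calculations and so on. Here the parameters are passed in first with rc.eval("algo<-'"+algo+"'"), and then the R function is called with x = (REXP)rc.eval("pdfReport(algo, threshold)"). This is the more robust approach, although you can of course also concatenate a statement that contains the parameter values and call it directly.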

  The REngine package is the interface package for calling Rserve from Java; the Rsession package provides an easier-to-use wrapper around it.

package test;

import java.io.IOException;
import javax.servlet.ServletException;
import javax.servlet.annotation.WebServlet;
import javax.servlet.http.HttpServlet;
import javax.servlet.http.HttpServletRequest;
import javax.servlet.http.HttpServletResponse;
import javax.servlet.http.HttpSession;

import org.math.R.Rsession;
import org.rosuda.REngine.REXP;
import org.rosuda.REngine.RList;

/**
 * Servlet implementation class MelbourneRserve
 */
@WebServlet("/MelbourneRserve")
public class MelbourneRserve extends HttpServlet {
    private static final long serialVersionUID = 1L;
       
    /**
     * @see HttpServlet#HttpServlet()
     */
    public MelbourneRserve() {
        super();
        // TODO Auto-generated constructor stub
    }

    /**
     * @see HttpServlet#doGet(HttpServletRequest request, HttpServletResponse response)
     */
    protected void doGet(HttpServletRequest request, HttpServletResponse response) throws ServletException, IOException {
        // TODO Auto-generated method stub
        // response.getWriter().append("Served at: ").append(request.getContextPath());
        
        callRServe(request, response);
        
    }

    /**
     * @see HttpServlet#doPost(HttpServletRequest request, HttpServletResponse response)
     */
    protected void doPost(HttpServletRequest request, HttpServletResponse response) throws ServletException, IOException {
        // TODO Auto-generated method stub
        doGet(request, response);
    }

    public void callRServe(HttpServletRequest request, HttpServletResponse response) {
        HttpSession session = request.getSession();
//      if (session.getAttribute("rs") == null)
            try {
                String algo=request.getParameter("algo");
                String threshold=request.getParameter("threshold");
                
                long t1 = System.currentTimeMillis();
                Rsession rc =RServeHelper.getRsessionInstance();
                String source = RServeHelper.prefix+"melbourneReport.R";            
                System.out.println(source);
                rc.set("fn",source);
                //rc.eval("source(fn)") always fails when loading the source file, whereas rc.voidEval("source(fn)") works
                rc.voidEval("source(fn)");
                rc.eval("algo<-'"+algo+"'");
                rc.eval("threshold<-"+threshold);
                REXP x ;
                RList excels=null;
                try{
                    x =  (REXP)rc.eval("pdfReport(algo, threshold)");
                    excels = x.asList();
                } catch(Exception e){
                    e.printStackTrace();
                }
                // parse the result returned; excels stays null if the R call failed
                if (excels != null) {
                    for (int i = 0; i < excels.size(); i++) {
                        String gf = excels.at(i).asString();
                        System.out.println(gf);
                    }
                }
                
                long t2 = System.currentTimeMillis();
                long t= (t2 - t1) / 1000 ;
                System.out.println("耗时:" +t+ "秒");

                RServeHelper.endRsession(rc);

                session.setAttribute("rs", excels);
                session.setAttribute("time", t);
                
            } catch (Exception e) {
                e.printStackTrace();
            }   
        try {
            System.out.println("显示结果");
            response.sendRedirect("./melbourne/report.jsp");
        } catch (IOException e) {
            // TODO Auto-generated catch block
            e.printStackTrace();
        }
    }
    
}
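The servlet above splices the request parameters directly into R statements, so a crafted `algo` or `threshold` value would be evaluated as R code. Below is a minimal sketch, written in Python for brevity, of checking both values before any statement is built. The whitelist is taken from the option list in algo.jsp (including its spelling of LigthGBM); the function name `build_r_statements` and the accepted range of 0 to 100 are assumptions made for illustration.

```python
import math

# Algorithm names offered by the <select> in algo.jsp.
ALLOWED_ALGOS = {"SVM", "RandomForest", "GBR", "XGB", "LigthGBM", "CatBoost", "Blend"}


def build_r_statements(algo, threshold):
    """Validate the request parameters, then return the R statements to evaluate."""
    if algo not in ALLOWED_ALGOS:
        raise ValueError(f"unknown algorithm: {algo!r}")
    try:
        value = float(threshold)
    except (TypeError, ValueError):
        raise ValueError(f"threshold is not a number: {threshold!r}")
    # Assumption: a percentage threshold only makes sense in [0, 100).
    if not math.isfinite(value) or not 0 <= value < 100:
        raise ValueError(f"threshold out of range: {threshold!r}")
    return [
        f"algo<-'{algo}'",          # safe: algo is one of the whitelisted names
        f"threshold<-{value:g}",    # safe: formatted from a parsed float
        "pdfReport(algo, threshold)",
    ]
```

The same two checks carry over to the servlet: compare `algo` against a fixed set and parse `threshold` into a number before either value reaches `rc.eval`.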

3. report.jsp displays the intermediate result.

<%@ page language="java" contentType="text/html; charset=GBK"
    pageEncoding="GBK"%>
<%@ page import="org.rosuda.REngine.RList,java.util.List"%>
<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN" "http://www.w3.org/TR/html4/loose.dtd">
<html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=GBK">
<title>墨尔本房价分析报告</title>
</head>
<body>
<center>
墨尔本房价分析报告<br/>
<%

    if (session.getAttribute("rs")!=null){
        //RList rs=(RList)session.getAttribute("rs");
        RList rs=(RList)session.getAttribute("rs");
        for (int i = 0; i < rs.size(); i++) {
            String gf = rs.at(i).asString();
            
            System.out.println(gf);
        }
%>
        <a href="/testR/ServePDF?keep=true&filename=<%=rs.at("file").asString()%>">墨尔本房价分析报告</a><br/>   
        <br/>
        <center>耗时: <%=session.getAttribute("time") %> 秒</center>
<%} %>
</center>
</body>
</html>

4. The ServePDF Servlet returns the PDF analysis report.

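  With a small change, the MIME type of the returned file can be determined automatically from the suffix of the temporary file, so that a single Servlet handles downloads of every file type. For the list of MIME types, see the reference material.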

package test;

import java.io.IOException;
import javax.servlet.ServletException;
import javax.servlet.annotation.WebServlet;
import javax.servlet.http.HttpServlet;
import javax.servlet.http.HttpServletRequest;
import javax.servlet.http.HttpServletResponse;

import java.io.OutputStream;

import org.math.R.Rsession;

/**
 * Servlet implementation class ServePDF
 */
@WebServlet("/ServePDF")
public class ServePDF extends HttpServlet {
    private static final long serialVersionUID = 1L;
       
    /**
     * @see HttpServlet#HttpServlet()
     */
    public ServePDF() {
        super();
        // TODO Auto-generated constructor stub
    }

    /**
     * @see HttpServlet#doGet(HttpServletRequest request, HttpServletResponse response)
     */
    protected void doGet(HttpServletRequest request, HttpServletResponse response) throws ServletException, IOException {
        // TODO Auto-generated method stub
        // response.getWriter().append("Served at: ").append(request.getContextPath());
        String  pdf= "application/pdf;charset=GBK";
        String gf = request.getParameter("filename");
        String keep = request.getParameter("keep");


        OutputStream out = response.getOutputStream();// obtain the output stream
        response.setContentType(pdf);// set the content type of the output

        callRServe(gf, out, keep);

        out.flush();
        out.close();        
    }

    public void callRServe(String fn, OutputStream out, String keep) throws IOException {
        Rsession rc =RServeHelper.getRsessionInstance();
        try {
            System.out.println(fn);
            rc.getFile(out, fn);
            if(keep==null||!keep.equalsIgnoreCase("true")) {
                // fn must be defined in this R session before unlink() can use it
                rc.set("fn", fn);
                rc.voidEval("unlink(fn)");
            }
            RServeHelper.endRsession(rc);       
        } catch (Exception e) {
            e.printStackTrace();
            try{            
                RServeHelper.endRsession(rc);
            }catch(Exception ex){
                ex.printStackTrace();
            }
        }
    }

    /**
     * @see HttpServlet#doPost(HttpServletRequest request, HttpServletResponse response)
     */
    protected void doPost(HttpServletRequest request, HttpServletResponse response) throws ServletException, IOException {
        // TODO Auto-generated method stub
        doGet(request, response);
    }

}
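The suffix-based MIME lookup mentioned for ServePDF can be sketched as follows. This is a Python illustration of the idea rather than the servlet change itself; the helper name `content_type_for` and the fallback to `application/octet-stream` are assumptions.

```python
import mimetypes


def content_type_for(filename, default="application/octet-stream"):
    """Guess the MIME type from the file suffix, falling back to a generic type."""
    guessed, _encoding = mimetypes.guess_type(filename)
    return guessed or default
```

A report file ending in `.pdf` maps to `application/pdf` and a `.png` plot to `image/png`, while an unknown suffix falls back to the generic binary type, which browsers treat as a download.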

5. RServeHelper.java.

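  This helper class hides the differences between Rserve on Windows and on Linux during development and testing. Rserve on Windows supports only one connection and is generally not recommended; for details see the installation section of the documentation on the Rserve home page.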

package test;

import java.io.IOException;
import java.util.Properties;

import org.math.R.RserveSession;
import org.math.R.RserverConf;
import org.math.R.Rsession;

public class RServeHelper {
    // private static boolean isWindows = false;
    private static Rsession rsession = null;
    private static String host = "124.223.110.20";
    //public static String host="127.0.0.1";
    public static String prefix = "../../../../home/jean/R/";
    //public static String prefix="C:/Users/lenovo/Documents/Rscripts/";
    public static String rpics = "/tmp/Rpics";
    //public static String rpics="C:/Users/lenovo/Documents/Rpics";

    /**
     * Initialize RServe through Rsession
     * 
     * @return
     * @throws IOException
     */
    public static Rsession initRserve() throws IOException {
        // Read the Rserve settings (IP, user name, password) from a configuration file
        // Properties prop = PropertieHelper.getPropInstance("ssh.properties");
        // String hostname = prop.getProperty("host");
        // String username = prop.getProperty("username");
        // String password = prop.getProperty("password");
        // RserverConf rconf=new RserverConf(host,6311,username,password,new
        // Properties());
        Properties prop = new Properties();
        prop.setProperty("tls", "true");
        RserverConf rconf = new RserverConf(host, 6311, "user", "password", prop);
        rsession = RserveSession.newInstanceTry(System.out, rconf);
        return rsession;
    }

    /**
     * Create the Rsession singleton
     * 
     * @return
     * @throws IOException
     */
    public static Rsession getRsessionInstance() throws IOException {
        if (rsession == null) {
            rsession = initRserve();
        }
        return rsession;
    }

    public static void endRsession(Rsession rs) {
        rs.end();
        rsession = null;
    }
}

V. Further Discussion

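  In this example the Python models were first developed and tested with JupyterHub + Jupyter Lab, and the R function was then developed and tested with RStudio Linux Server; for a Shiny APP, Shiny Server is added as well. This is the part of in-depth data analysis that requires interactive exploration and the development and testing of script code.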

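  The result is then handed to a system integrator, who integrates it into an application system with a web interface. That is the Java part of this example.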

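  What finally reaches the business user is the generated PDF analysis report: enter the parameters, get the report. It is a foolproof system that hides the complexity of the implementation.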

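  This example shows that in-depth data analysis applications are multi-tier applications that combine many technologies and integrate several platforms. It uses four languages, HTML, Java, R and Python, and in fact also C/C++, Fortran and others, because many of the underlying algorithms are written in them. In the data layer, Python implements the data analysis models, and the JupyterHub/Jupyter Lab development platform is integrated. This layer could also be counted as part of the middle tier; because this example does not access data sources such as cloud databases, relational databases or graph databases, it takes the place of the data layer. The middle tier generally implements the business components: here R implements the calls to the models and the generation of the analysis report, and the RStudio Linux Server development platform is integrated. In the application layer, a Tomcat Web APP written in Java connects users to the middle-tier business components and supports the business process; it is developed in Eclipse, which integrates Java EE platforms such as Tomcat. The presentation layer is the human-computer interaction UI, which runs HTML on the general-purpose browser platform. This example has only two very simple web pages, but the layer can become as complex as needed: a Shiny APP uses many JavaScript libraries, supported by the browser's Document Object Model (DOM), to generate web pages fully dynamically from user input, which is how it implements its reactive programming mechanism. Another case is the earlier airport route network cycle analysis example, where the data layer connects to the Neo4j graph database platform, adding the Cypher and SQL languages for operating on network data and Neo4j Browser for developing Cypher queries, and the presentation layer uses the three.js and 3d-force-graph libraries to visualize the network, adding JavaScript and CSS.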

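  No single platform can complete all of these tasks on its own, and it is hard to make effective use of all these resources with a single language. The role of system integration is to connect the platforms and the technical capabilities they carry, to bring the good pieces together, and to deliver the applications people actually need. That makes a sensible combination of technologies and languages essential, especially for in-depth data analysis applications that require programming.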

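  Whether it is system integration or in-depth data analysis, programming is indispensable; the only difference is the amount of code. There was never a genie of the lamp in this world; it is only that enough programmers stayed up all night, and so the genie came to be. So let this sentence close the article once more: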

  In depth we code.

