









































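The author, Martin Heller, is a contributing editor and reviewer for InfoWorld. Formerly a web and Windows programming consultant, he developed databases, software, and websites from 1986 to 2010. More recently, he has served as vice president of technology and education at Alpha Software and as chairman and CEO of Tubifi.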


In order to create effective machine learning and deep learning

models, you need copious amounts of data, a way to clean the data and

perform feature engineering on it, and a way to train models on your

data in a reasonable amount of time. Then you need a way to deploy your

models, monitor them for drift over time, and retrain them as needed.

You can do all of that on-premises if you have invested in compute

resources and accelerators such as GPUs, but you may find that if your

resources are adequate, they are also idle much of the time. On the

other hand, it can sometimes be more cost-effective to run the entire

pipeline in the cloud, using large amounts of compute resources and

accelerators as needed, and then releasing them.

The major cloud providers — and a number of minor clouds too — have

put significant effort into building out their machine learning

platforms to support the complete machine learning lifecycle, from

planning a project to maintaining a model in production. How do you

determine which of these clouds will meet your needs? Here are 12

capabilities every end-to-end machine learning platform should provide,

with notes on which clouds provide them.


Be close to your data

If you have the large amounts of data needed to build precise models,

you don’t want to ship it halfway around the world. The issue here

isn’t distance, however, it’s time: Data transmission latency is

ultimately limited by the speed of light, even on a perfect network with

infinite bandwidth. Long distances mean latency.
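
To put a number on that limit, here is a minimal sketch that computes the theoretical minimum round-trip time over a given distance. The 2/3 slowdown factor for optical fiber and the 12,000 km path length are illustrative assumptions, not measurements of any particular route.

```python
# Lower bound on network round-trip time imposed by the speed of light.
SPEED_OF_LIGHT_KM_S = 299_792.458  # speed of light in vacuum
FIBER_FACTOR = 2 / 3  # assumption: light in optical fiber travels at roughly 2/3 of c


def min_round_trip_ms(distance_km: float, in_fiber: bool = True) -> float:
    """Return the theoretical minimum round-trip time in milliseconds."""
    if distance_km < 0:
        raise ValueError("distance_km must be non-negative")
    speed = SPEED_OF_LIGHT_KM_S * (FIBER_FACTOR if in_fiber else 1.0)
    return 2 * distance_km / speed * 1000


if __name__ == "__main__":
    # 12,000 km is a hypothetical long-haul path length.
    print(f"fiber:  {min_round_trip_ms(12_000):.1f} ms")
    print(f"vacuum: {min_round_trip_ms(12_000, in_fiber=False):.1f} ms")
```

Real links add routing, queuing, and protocol overhead on top of this floor, so measured latency will be higher.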

The ideal case for very large data sets is to build the model where

the data already resides, so that no mass data transmission is needed. A

number of databases support that.

The next best case is for the data to be on the same high-speed

network as the model-building software, which typically means within the

same data center. Even moving the data from one data center to another

within a cloud availability zone can introduce a significant delay if

you have terabytes (TB) or more. You can mitigate this by doing

incremental updates.
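
A rough estimate shows why incremental updates help. The sketch below computes transfer time from data size and link bandwidth; the 10 TB data set, the 1 Gbps link, and the 1% daily change rate are hypothetical figures chosen only for illustration.

```python
def transfer_hours(size_tb: float, bandwidth_gbps: float, efficiency: float = 1.0) -> float:
    """Estimate hours needed to move size_tb terabytes over a link.

    efficiency is the assumed fraction of nominal bandwidth actually achieved.
    """
    if bandwidth_gbps <= 0 or not 0 < efficiency <= 1:
        raise ValueError("bandwidth must be positive and efficiency in (0, 1]")
    bits = size_tb * 1e12 * 8  # decimal terabytes to bits
    seconds = bits / (bandwidth_gbps * 1e9 * efficiency)
    return seconds / 3600


if __name__ == "__main__":
    full = transfer_hours(10, 1)           # hypothetical full copy: 10 TB over 1 Gbps
    delta = transfer_hours(10 * 0.01, 1)   # hypothetical daily delta: 1% of the data
    print(f"full copy: {full:.1f} h, incremental update: {delta:.2f} h")
```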

The worst case would be if you have to move big data long distances

over paths with constrained bandwidth and high latency. The

trans-Pacific cables going to Australia are particularly egregious in

this respect.


The major cloud providers have been addressing this issue in multiple

ways. One is to add machine learning and deep learning to their

database services. For example, Amazon Redshift ML is designed to make

it easy for SQL users to create, train, and deploy machine learning

models using SQL commands against Amazon Redshift, a managed,

petabyte-scale data warehouse service. BigQuery ML lets you create and

execute machine learning models in BigQuery, Google Cloud’s managed,

petabyte-scale data warehouse, also using SQL queries.

IBM Db2 Warehouse on Cloud includes a wide set of in-database SQL

analytics that includes some basic machine learning functionality, plus

in-database support for R and Python. Microsoft SQL Server Machine

Learning Services supports R, Python, Java, the PREDICT T-SQL command,

the sp_rxPredict stored procedure in the SQL Server RDBMS, and Spark MLlib

in SQL Server Big Data Clusters. Oracle Cloud Infrastructure (OCI)

Data Science is a managed and serverless platform for data science

teams to build, train, and manage machine learning models using Oracle

Cloud Infrastructure including Oracle Autonomous Database and Oracle

Autonomous Data Warehouse.

Another way cloud providers have addressed this issue is to bring

their cloud services to customer data centers as well as to satellite

points of presence (often in large metropolitan areas) that are closer

to customers than full-blown availability zones. AWS calls these AWS

Outposts and AWS Local Zones; Microsoft Azure calls them Azure Stack

Edge nodes and Azure Arc; Google Cloud Platform calls them network edge

locations, Google Distributed Cloud Virtual, and Anthos on-prem.

Support an ETL or ELT pipeline

ETL (extract, transform, and load) and ELT (extract, load, and

transform) are two data pipeline configurations that are common in the

database world. Machine learning and deep learning amplify the need for

these, especially the transform portion. ELT gives you more flexibility

when your transformations need to change, as the load phase is usually

the most time-consuming for big data.

In general, data in the wild is noisy, and that noise needs to be filtered out.

Additionally, data in the wild has varying ranges: One variable might

have a maximum in the millions, while another might have a range of -0.1

to -0.001. For machine learning, variables must be transformed to

standardized ranges to keep the ones with large ranges from dominating

the model. Exactly which standardized range depends on the algorithm

used for the model.
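
As an illustration, the sketch below shows two common ways to bring variables into standardized ranges: min-max scaling to a fixed interval and z-score standardization. It uses only the Python standard library; the sample values are made up to mirror the ranges mentioned above, and which transform fits a given algorithm is left to the reader.

```python
from statistics import mean, pstdev


def min_max_scale(values, low=0.0, high=1.0):
    """Linearly rescale values so the minimum maps to low and the maximum to high."""
    lo, hi = min(values), max(values)
    if hi == lo:  # constant column: avoid division by zero
        return [low for _ in values]
    return [low + (v - lo) * (high - low) / (hi - lo) for v in values]


def standardize(values):
    """Rescale values to zero mean and unit (population) standard deviation."""
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:  # constant column: avoid division by zero
        return [0.0 for _ in values]
    return [(v - mu) / sigma for v in values]


if __name__ == "__main__":
    large = [120_000, 3_500_000, 9_800_000]  # made-up variable with a maximum in the millions
    small = [-0.1, -0.05, -0.001]            # made-up variable with a tiny range
    print(min_max_scale(large))
    print(standardize(small))
```

After either transform, both variables occupy comparable ranges, so neither dominates purely because of its units.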

AWS Glue is an Apache Spark-based serverless ETL engine; AWS also

offers Amazon EMR, a big data platform that can run Apache Spark, and

Amazon Redshift Spectrum, which supports ELT from an Amazon S3-based

data lake. Azure Data Factory and Azure Synapse can do both ETL and ELT.

Google Cloud Data Fusion, Dataflow, and Dataproc are useful for ETL and

ELT. Third-party self-service ETL/ELT products such as Trifacta can

also be used on the clouds.

Support an online environment for model building

The conventional wisdom used to be that you should import your data

to your desktop for model building. The sheer quantity of data needed to

build good machine learning and deep learning models changes the

picture: You can download a small sample of data to your desktop for

exploratory data analysis and model building, but for production models

you need to have access to the full data.

Web-based development environments such as Jupyter Notebooks,

JupyterLab, and Apache Zeppelin are well suited for model building. If

your data is in the same cloud as the notebook environment, you can

bring the analysis to the data, minimizing the time-consuming movement

of data. Notebooks can also be used for ELT as part of the pipeline.

Amazon SageMaker allows you to build, train, and deploy machine

learning and deep learning models for any use case with fully managed

infrastructure, tools, and workflows. SageMaker Studio is based on


Microsoft Azure Machine Learning is an end-to-end, scalable, trusted

AI platform with experimentation and model management; Azure Machine

Learning Studio includes Jupyter Notebooks, a drag-and-drop machine

learning pipeline designer, and an AutoML facility. Azure Databricks is

an Apache Spark-based analytics platform; Azure Data Science Virtual

Machines make it easy for advanced data scientists to set up machine

learning and deep learning development environments.

Google Cloud Vertex AI allows you to build, deploy, and scale machine

learning models faster, with pre-trained models and custom tooling

within a unified artificial intelligence platform. Through Vertex AI

Workbench, Vertex AI is natively integrated with BigQuery, Dataproc, and

Spark. Vertex AI also integrates with widely used open source

frameworks such as TensorFlow, PyTorch, and Scikit-learn, and supports

all machine learning frameworks and artificial intelligence branches via

custom containers for training and prediction.

Support scale-up and scale-out training

The compute and memory requirements of notebooks are generally

minimal, except for training models. It helps a lot if a notebook can

spawn training jobs that run on multiple large virtual machines or

containers. It also helps a lot if the training can access accelerators

such as GPUs, TPUs, and FPGAs; these can turn days of training into


Amazon SageMaker supports a wide range of VM sizes; GPUs and other

accelerators including NVIDIA A100s, Habana Gaudi, and AWS Trainium; a

model compiler; and distributed training using either data parallelism

or model parallelism. Azure Machine Learning supports a wide range of VM

sizes; GPUs and other accelerators including NVIDIA A100s and Intel

FPGAs; and distributed training using either data parallelism or model

parallelism. Google Cloud Vertex AI supports a wide range of VM sizes;

GPUs and other accelerators including NVIDIA A100s and Google TPUs; and

distributed training using either data parallelism or model parallelism,

with an optional reduction server.
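
The idea behind data parallelism can be sketched in a few lines. In this toy example, each "worker" computes the gradient of mean squared error on its own equal-sized shard of the data, and the shard gradients are then averaged, which is the step that an all-reduce or a reduction server would perform across machines. The one-parameter model y = w * x and the function names are assumptions made for illustration; this is not how any of the platforms above implement it.

```python
def mse_gradient(w, xs, ys):
    """Gradient of mean squared error with respect to w for the model y = w * x."""
    n = len(xs)
    return sum(2 * (w * x - y) * x for x, y in zip(xs, ys)) / n


def data_parallel_gradient(w, xs, ys, num_workers):
    """Compute per-shard gradients on equal shards, then average them."""
    if num_workers <= 0 or len(xs) % num_workers != 0:
        raise ValueError("data must split evenly across a positive number of workers")
    shard = len(xs) // num_workers
    grads = [
        mse_gradient(w, xs[i * shard:(i + 1) * shard], ys[i * shard:(i + 1) * shard])
        for i in range(num_workers)  # in a real system each shard runs on its own device
    ]
    return sum(grads) / num_workers  # the averaging (reduction) step


if __name__ == "__main__":
    xs, ys = [1.0, 2.0, 3.0, 4.0], [2.0, 4.0, 6.0, 8.0]
    print(mse_gradient(0.5, xs, ys))
    print(data_parallel_gradient(0.5, xs, ys, num_workers=2))
```

With equal-sized shards, the averaged gradient matches the full-batch gradient, so adding workers changes how long a step takes rather than what the step computes.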

Support AutoML and automated feature engineering

Not everyone is good at picking machine learning models, selecting

features (the variables that are used by the model), and engineering new

features from the raw observations. Even if you’re good at those tasks,

they are time-consuming and can be automated to a large extent.

AutoML systems often try many models to see which result in the best

objective function values, for example the minimum squared error for

regression problems. The best AutoML systems can also perform feature

engineering, and use their resources effectively to pursue the best

possible models with the best possible sets of features.
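
A toy version of that model search is sketched below: fit several candidate models, score each on held-out data by mean squared error, and keep the one with the lowest score. The two candidates (a constant predictor and a one-variable least-squares line) and all function names are assumptions chosen to keep the example self-contained; real AutoML systems search far larger spaces and also tune hyperparameters.

```python
from statistics import mean


def fit_mean(xs, ys):
    """Baseline candidate: always predict the mean of the training targets."""
    mu = mean(ys)
    return lambda x: mu


def fit_linear(xs, ys):
    """Candidate: one-variable least-squares line y = slope * x + intercept."""
    mx, my = mean(xs), mean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    if sxx == 0:  # all x identical: fall back to the baseline
        return fit_mean(xs, ys)
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx
    intercept = my - slope * mx
    return lambda x: slope * x + intercept


def mse(model, xs, ys):
    """Mean squared error of a fitted model on the given data."""
    return mean([(model(x) - y) ** 2 for x, y in zip(xs, ys)])


def pick_best(candidates, train, valid):
    """Fit every candidate on train, score on valid, return (best_name, scores)."""
    scores = {name: mse(fit(*train), *valid) for name, fit in candidates.items()}
    return min(scores, key=scores.get), scores


if __name__ == "__main__":
    train = ([0, 1, 2, 3], [1, 3, 5, 7])  # made-up training data
    valid = ([4, 5], [9, 11])             # made-up held-out data
    print(pick_best({"mean": fit_mean, "linear": fit_linear}, train, valid))
```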

Amazon SageMaker Autopilot provides AutoML and hyperparameter tuning,

which can use Hyperband as a search strategy. Azure Machine Learning

and Azure Databricks both provide AutoML, as does Apache Spark in Azure

HDInsight. Google Cloud Vertex AI supplies AutoML, and so do Google’s

specialized AutoML services for structured data, sight, and language,

although Google tends to lump AutoML in with transfer learning in some


DataRobot, Dataiku, and H2O.ai Driverless AI all offer AutoML with automated feature engineering and hyperparameter tuning.

Support the best machine learning and deep learning frameworks

Most data scientists have favorite frameworks and programming

languages for machine learning and deep learning. For those who prefer

Python, Scikit-learn is often a favorite for machine learning, while

TensorFlow, PyTorch, Keras, and MXNet are often top picks for deep

learning. In Scala, Spark MLlib tends to be preferred for machine

learning. In R, there are many native machine learning packages, and a

good interface to Python. In Java, H2O.ai rates highly, as do Java-ML

and Deep Java Library.

The cloud machine learning and deep learning platforms tend to have

their own collection of algorithms, and they often support external

frameworks in at least one language or as containers with specific entry

points. In some cases you can integrate your own algorithms and

statistical methods with the platform’s AutoML facilities, which is

quite convenient.

Some cloud platforms also offer their own tuned versions of major

deep learning frameworks. For example, AWS has an optimized version of

TensorFlow that it claims can achieve nearly linear scalability for deep

neural network training. Similarly, Google Cloud offers TensorFlow


Offer pre-trained models and support transfer learning

Not everyone wants to spend the time and compute resources to train

their own models — nor should they, when pre-trained models are

available. For example, the ImageNet dataset is huge, and training a

state-of-the-art deep neural network against it can take weeks, so it

makes sense to use a pre-trained model for it when you can.

On the other hand, pre-trained models may not always identify the

objects you care about. Transfer learning can help you customize the

last few layers of the neural network for your specific data set without

the time and expense of training the full network.

All major deep learning frameworks and cloud service providers

support transfer learning at some level. There are differences; one

major difference is that Azure can customize some kinds of models with

tens of labeled exemplars, versus hundreds or thousands for some of the

other platforms.

Offer tuned, pre-trained AI services

The major cloud platforms offer robust, tuned AI services for many

applications, not just image identification. Examples include language

translation, speech to text, text to speech, forecasting, and


These services have already been trained and tested on more data than

is usually available to businesses. They are also already deployed on

service endpoints with enough computational resources, including

accelerators, to ensure good response times under worldwide load.

The differences among the services offered by the big three tend to

be down in the weeds. One area of active development is services for the

edge, including machine learning that resides on devices such as

cameras and communicates with the cloud.

Manage your experiments

The only way to find the best model for your data set is to try

everything, whether manually or using AutoML. That leaves another

problem: Managing your experiments.

A good cloud machine learning platform will give you a way to

see and compare the objective function values of each experiment for

both the training sets and the test data, as well as the size of the

model and the confusion matrix. Being able to graph all of that is a

definite plus.
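
To make the idea concrete, here is a minimal sketch of an experiment log that records training and test objective values plus model size, then ranks the runs. The Experiment record, its field names, and the sample numbers are hypothetical; the platforms and third-party tools mentioned in this section provide much richer versions of the same comparison.

```python
from dataclasses import dataclass


@dataclass
class Experiment:
    """One training run and the metrics recorded for it (hypothetical schema)."""
    name: str
    train_loss: float
    test_loss: float
    model_size_mb: float


def rank_experiments(experiments):
    """Return experiments sorted from best (lowest) to worst test loss."""
    return sorted(experiments, key=lambda e: e.test_loss)


def overfit_gap(experiment):
    """Test loss minus training loss; a large gap suggests overfitting."""
    return experiment.test_loss - experiment.train_loss


def format_table(experiments):
    """Render a plain-text comparison table, best run first."""
    header = f"{'name':<12}{'train':>8}{'test':>8}{'size_mb':>10}"
    rows = [
        f"{e.name:<12}{e.train_loss:>8.3f}{e.test_loss:>8.3f}{e.model_size_mb:>10.1f}"
        for e in rank_experiments(experiments)
    ]
    return "\n".join([header, *rows])


if __name__ == "__main__":
    runs = [  # made-up results for illustration
        Experiment("deep_net", 0.10, 0.50, 240.0),
        Experiment("boosted", 0.20, 0.30, 12.5),
    ]
    print(format_table(runs))
```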

In addition to the experiment tracking built into Amazon SageMaker,

Azure Machine Learning, and Google Cloud Vertex AI, you can use

third-party products such as Neptune.ai, Weights & Biases, Sacred

plus Omniboard, and MLflow. Most of these are free for at least personal

use, and some are open source.

Support model deployment for prediction

Once you have a way of picking the best experiment given your

criteria, you also need an easy way to deploy the model. If you deploy

multiple models for the same purpose, you’ll also need a way to

apportion traffic among them for A/B testing.
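
One simple way to apportion traffic is weighted random routing, sketched below. The model names and the 90/10 split are hypothetical, and managed endpoints on the major clouds typically handle this through their own configuration rather than application code like this.

```python
import random
from collections import Counter


def choose_model(weights, rng=random):
    """Pick a model name at random, in proportion to its traffic weight."""
    if not weights or any(w < 0 for w in weights.values()) or sum(weights.values()) <= 0:
        raise ValueError("weights must be non-negative and sum to a positive value")
    names = list(weights)
    return rng.choices(names, weights=[weights[n] for n in names], k=1)[0]


if __name__ == "__main__":
    split = {"model_a": 0.9, "model_b": 0.1}  # hypothetical 90/10 traffic split
    rng = random.Random(42)                   # seeded so the demo is repeatable
    counts = Counter(choose_model(split, rng) for _ in range(10_000))
    print(counts)
```

Because each request is routed independently, a single user may see both models; a production setup would often hash a stable user or session ID instead so that assignments stay sticky.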

One sticking point is the cost of deploying an endpoint, as discussed under

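The main content of this article is by the original author, Martin Heller, and is provided for readers' reference only. If it infringes your intellectual property or other rights, please contact me with evidence and I will remove it.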

