想免费用谷歌资源训练神经网络？Colab 详细使用教程 —— Jinkey 原创

原文链接 https://jinkey.ai/post/tech/xiang-mian-fei-yong-gu-ge-zi-yuan-xun-lian-shen-jing-wang-luo-colab-xiang-xi-shi-yong-jiao-cheng
本文作者 Jinkey（微信公众号 jinkey-love，官网 https://jinkey.ai）
文章允许非篡改署名转载，删除或修改本段版权信息转载的，视为侵犯知识产权，我们保留追求您法律责任的权利，特此声明！

1 简介

Colab 是谷歌内部类 Jupyter Notebook 的交互式 Python 环境，免安装快速切换 Python 2和 Python 3 的环境，支持Google全家桶(TensorFlow、BigQuery、GoogleDrive等)，支持 pip 安装任意自定义库。
网址：
https://colab.research.google.com

2 库的安装和使用

Colab 自带了 Tensorflow、Matplotlib、Numpy、Pandas 等深度学习基础库。如果还需要其他依赖，如 Keras，可以新建代码块，输入

# 安装最新版本Keras
# https://keras.io/
!pip install keras
# 指定版本安装
!pip install keras==2.0.9
# 安装 OpenCV
# https://opencv.org/
!apt-get -qq install -y libsm6 libxext6 && pip install -q -U opencv-python
# 安装 Pytorch
# http://pytorch.org/
!pip install -q http://download.pytorch.org/whl/cu75/torch-0.2.0.post3-cp27-cp27mu-manylinux1_x86_64.whl torchvision
# 安装 XGBoost
# https://github.com/dmlc/xgboost
!pip install -q xgboost
# 安装 7Zip
!apt-get -qq install -y libarchive-dev && pip install -q -U libarchive
# 安装 GraphViz 和 PyDot
!apt-get -qq install -y graphviz && pip install -q pydot

3 Google Drive 文件操作

授权登录

对于同一个 notebook，登录操作只需要进行一次，然后才可以进度读写操作。

# 安装 PyDrive 操作库，该操作每个 notebook 只需要执行一次
!pip install -U -q PyDrive
from pydrive.auth import GoogleAuth
from pydrive.drive import GoogleDrive
from google.colab import auth
from oauth2client.client import GoogleCredentials

# 授权登录，仅第一次的时候会鉴权
auth.authenticate_user()
gauth = GoogleAuth()
gauth.credentials = GoogleCredentials.get_application_default()
drive = GoogleDrive(gauth)

执行这段代码后，会打印以下内容，点击连接进行授权登录，获取到 token 值填写到输入框，按 Enter 继续即可完成登录。

遍历目录

# 列出根目录的所有文件
# "q" 查询条件教程详见：https://developers.google.com/drive/v2/web/search-parameters
file_list = drive.ListFile({'q': "'root' in parents and trashed=false"}).GetList()
for file1 in file_list:
  print('title: %s, id: %s, mimeType: %s' % (file1['title'], file1['id'], file1["mimeType"]))

可以看到控制台打印结果

title: Colab 测试, id: 1cB5CHKSdL26AMXQ5xrqk2kaBv5LSkIsJ8HuEDyZpeqQ, mimeType: application/vnd.google-apps.document

title: Colab Notebooks, id: 1U9363A12345TP2nSeh2K8FzDKSsKj5Jj, mimeType: application/vnd.google-apps.folder

其中 id 是接下来的教程获取文件的唯一标识。根据 mimeType 可以知道 Colab 测试 文件为 doc 文档，而 Colab Notebooks 为文件夹（也就是 Colab 的 Notebook 储存的根目录），如果想查询 Colab Notebooks 文件夹下的文件，查询条件可以这么写：

# '目录 id' in parents
file_list = drive.ListFile({'q': "'1cB5CHKSdL26AMXQ5xrqk2kaBv5LBkIsJ8HuEDyZpeqQ' in parents and trashed=false"}).GetList()

读取文件内容

目前测试过可以直接读取内容的格式为 .txt（mimeType: text/plain），读取代码：

file = drive.CreateFile({'id': "替换成你的 .txt 文件 id"}) 
file.GetContentString()

而 .csv 如果用GetContentString()只能打印第一行的数据，要用``

file = drive.CreateFile({'id': "替换成你的 .csv 文件 id"}) 
#这里的下载操作只是缓存，不会在你的Google Drive 目录下多下载一个文件
file.GetContentFile('iris.csv', "text/csv") 

# 直接打印文件内容
with open('iris.csv') as f:
  print f.readlines()
# 用 pandas 读取
import pandas
pd.read_csv('iris.csv', index_col=[0,1], skipinitialspace=True)

Colab 会直接以表格的形式输出结果（下图为截取 iris 数据集的前几行）， iris 数据集地址为 http://aima.cs.berkeley.edu/data/iris.csv ，学习的同学可以执行上传到自己的 Google Drive。

写文件操作

# 创建一个文本文件
uploaded = drive.CreateFile({'title': '示例.txt'})
uploaded.SetContentString('测试内容')
uploaded.Upload()
print('创建后文件 id 为 {}'.format(uploaded.get('id')))

更多操作可查看 http://pythonhosted.org/PyDrive/filemanagement.html

4 Google Sheet 电子表格操作

授权登录

对于同一个 notebook，登录操作只需要进行一次，然后才可以进度读写操作。

!pip install --upgrade -q gspread
from google.colab import auth
auth.authenticate_user()

import gspread
from oauth2client.client import GoogleCredentials

gc = gspread.authorize(GoogleCredentials.get_application_default())

读取

把 iris.csv 的数据导入创建一个 Google Sheet 文件来做演示，可以放在 Google Drive 的任意目录

worksheet = gc.open('iris').sheet1

# 获取一个列表[
# [第1行第1列, 第1行第2列, ... , 第1行第n列], ... ,[第n行第1列, 第n行第2列, ... , 第n行第n列]]
rows = worksheet.get_all_values()
print(rows)

#  用 pandas 读取
import pandas as pd
pd.DataFrame.from_records(rows)

打印结果分别为

[['5.1', '3.5', '1.4', '0.2', 'setosa'], ['4.9', '3', '1.4', '0.2', 'setosa'], ...

写入

sh = gc.create('谷歌表')

# 打开工作簿和工作表
worksheet = gc.open('谷歌表').sheet1
cell_list = worksheet.range('A1:C2')

import random
for cell in cell_list:
  cell.value = random.randint(1, 10)
worksheet.update_cells(cell_list)

5 下载文件到本地

from google.colab import files
with open('example.txt', 'w') as f:
  f.write('测试内容')
files.download('example.txt')

6 实战

这里以我在 Github 的开源LSTM 文本分类项目为例子https://github.com/Jinkeycode/keras_lstm_chinese_document_classification
把 master/data 目录下的三个文件存放到 Google Drive 上。该示例演示的是对健康、科技、设计三个类别的标题进行分类。

新建

在 Colab 上新建 Python2 的笔记本

安装依赖

!pip install keras
!pip install jieba
!pip install h5py

import h5py
import jieba as jb
import numpy as np
import keras as krs
import tensorflow as tf
from sklearn.preprocessing import LabelEncoder

加载数据

授权登录

# 安装 PyDrive 操作库，该操作每个 notebook 只需要执行一次
!pip install -U -q PyDrive
from pydrive.auth import GoogleAuth
from pydrive.drive import GoogleDrive
from google.colab import auth
from oauth2client.client import GoogleCredentials

def login_google_drive():
  # 授权登录，仅第一次的时候会鉴权
  auth.authenticate_user()
  gauth = GoogleAuth()
  gauth.credentials = GoogleCredentials.get_application_default()
  drive = GoogleDrive(gauth)
  return drive

列出 GD 下的所有文件

def list_file(drive):
  file_list = drive.ListFile({'q': "'root' in parents and trashed=false"}).GetList()
  for file1 in file_list:
    print('title: %s, id: %s, mimeType: %s' % (file1['title'], file1['id'], file1["mimeType"]))
    

drive = login_google_drive()
list_file(drive)

缓存数据到工作环境

def cache_data():
  # id 替换成上一步读取到的对应文件 id
  health_txt = drive.CreateFile({'id': "117GkBtuuBP3wVjES0X0L4wVF5rp5Cewi"}) 
  tech_txt = drive.CreateFile({'id': "14sDl4520Tpo1MLPydjNBoq-QjqOKk9t6"})
  design_txt = drive.CreateFile({'id': "1J4lndcsjUb8_VfqPcfsDeOoB21bOLea3"})
  #这里的下载操作只是缓存，不会在你的Google Drive 目录下多下载一个文件
  
  health_txt.GetContentFile('health.txt', "text/plain")
  tech_txt.GetContentFile('tech.txt', "text/plain")
  design_txt.GetContentFile('design.txt', "text/plain")
  
  print("缓存成功")
  
cache_data()

读取工作环境的数据

def load_data():
    titles = []
    print("正在加载健康类别的数据...")
    with open("health.txt", "r") as f:
        for line in f.readlines():
            titles.append(line.strip())

    print("正在加载科技类别的数据...")
    with open("tech.txt", "r") as f:
        for line in f.readlines():
            titles.append(line.strip())


    print("正在加载设计类别的数据...")
    with open("design.txt", "r") as f:
        for line in f.readlines():
            titles.append(line.strip())

    print("一共加载了 %s 个标题" % len(titles))

    return titles
  
titles = load_data()

加载标签

def load_label():
    arr0 = np.zeros(shape=[12000, ])
    arr1 = np.ones(shape=[12000, ])
    arr2 = np.array([2]).repeat(7318)
    target = np.hstack([arr0, arr1, arr2])
    print("一共加载了 %s 个标签" % target.shape)

    encoder = LabelEncoder()
    encoder.fit(target)
    encoded_target = encoder.transform(target)
    dummy_target = krs.utils.np_utils.to_categorical(encoded_target)

    return dummy_target
  
target = load_label()

文本预处理

max_sequence_length = 30
embedding_size = 50

# 标题分词
titles = [".".join(jb.cut(t, cut_all=True)) for t in titles]

# �word2vec 词袋化
vocab_processor = tf.contrib.learn.preprocessing.VocabularyProcessor(max_sequence_length, min_frequency=1)
text_processed = np.array(list(vocab_processor.fit_transform(titles)))

# 读取词标签
dict = vocab_processor.vocabulary_._mapping
sorted_vocab = sorted(dict.items(), key = lambda x : x[1])

构建神经网络

这里使用 Embedding 和 lstm 作为前两层，通过 softmax 激活输出结果

# 配置网络结构
def build_netword(num_vocabs):
    # 配置网络结构
    model = krs.Sequential()
    model.add(krs.layers.Embedding(num_vocabs, embedding_size, input_length=max_sequence_length))
    model.add(krs.layers.LSTM(32, dropout=0.2, recurrent_dropout=0.2))
    model.add(krs.layers.Dense(3))
    model.add(krs.layers.Activation("softmax"))
    model.compile(loss="categorical_crossentropy", optimizer="adam", metrics=["accuracy"])

    return model
  
num_vocabs = len(dict.items())
model = build_netword(num_vocabs=num_vocabs)

import time
start = time.time()
# 训练模型
model.fit(text_processed, target, batch_size=512, epochs=10, )
finish = time.time()
print("训练耗时：%f 秒" %(finish-start))

预测样本

sen 可以换成你自己的句子，预测结果为[健康类文章概率, 科技类文章概率, 设计类文章概率], 概率最高的为那一类的文章，但最大概率低于 0.8 时判定为无法分类的文章。

sen = "做好商业设计需要学习的小技巧"
sen_prosessed = " ".join(jb.cut(sen, cut_all=True))
sen_prosessed = vocab_processor.transform([sen_prosessed])
sen_prosessed = np.array(list(sen_prosessed))
result = model.predict(sen_prosessed)

catalogue = list(result[0]).index(max(result[0]))
threshold=0.8
if max(result[0]) > threshold:
    if catalogue == 0:
        print("这是一篇关于健康的文章")
    elif catalogue == 1:
        print("这是一篇关于科技的文章")
    elif catalogue == 2:
        print("这是一篇关于设计的文章")
    else:
        print("这篇文章没有可信分类")

最后编辑于：2018.01.16 12:18:43

人面猴
序言：七十年代末，一起剥皮案震惊了整个滨河市，随后出现的几起案子，更是在滨河造成了极大的恐慌，老刑警刘岩，带你破解...
沈念sama阅读 221,548评论 6赞 515
死咒
序言：滨河连续发生了三起死亡事件，死亡现场离奇诡异，居然都是意外死亡，警方通过查阅死者的电脑和手机，发现死者居然都...
沈念sama阅读 94,497评论 3赞 399
救了他两次的神仙让他今天三更去死
文/潘晓璐我一进店门，熙熙楼的掌柜王于贵愁眉苦脸地迎上来，“玉大人，你说我怎么就摊上这事。” “怎么了？”我有些...
开封第一讲书人阅读 167,990评论 0赞 360
道士缉凶录：失踪的卖姜人
文/不坏的土叔我叫张陵，是天一观的道长。经常有香客问我，道长，这世上最难降的妖魔是什么？我笑而不...
开封第一讲书人阅读 59,618评论 1赞 296
港岛之恋（遗憾婚礼）
正文为了忘掉前任，我火速办了婚礼，结果婚礼上，老公的妹妹穿的比我还像新娘。我一直安慰自己，他们只是感情好，可当我...
茶点故事阅读 68,618评论 6赞 397
恶毒庶女顶嫁案：这布局不是一般人想出来的
文/花漫我一把揭开白布。她就那样静静地躺着，像睡着了一般。火红的嫁衣衬着肌肤如雪。梳的纹丝不乱的头发上，一...
开封第一讲书人阅读 52,246评论 1赞 308
城市分裂传说
那天，我揣着相机与录音，去河边找鬼。笑死，一个胖子当着我的面吹牛，可吹牛的内容都是我干的。我是一名探鬼主播，决...
沈念sama阅读 40,819评论 3赞 421
双鸳鸯连环套：你想象不到人心有多黑
文/苍兰香墨我猛地睁开眼，长吁一口气：“原来是场噩梦啊……” “哼！你这毒妇竟也来了？” 一声冷哼从身侧响起，我...
开封第一讲书人阅读 39,725评论 0赞 276
万荣杀人案实录
序言：老挝万荣一对情侣失踪，失踪者是张志新（化名）和其女友刘颖，没想到半个月后，有当地人在树林里发现了一具尸体，经...
沈念sama阅读 46,268评论 1赞 320
护林员之死
正文独居荒郊野岭守林人离奇死亡，尸身上长有42处带血的脓包…… 初始之章·张勋以下内容为张勋视角年9月15日...
茶点故事阅读 38,356评论 3赞 340
白月光启示录
正文我和宋清朗相恋三年，在试婚纱的时候发现自己被绿了。大学时的朋友给我发了我未婚夫和他白月光在一起吃饭的照片。...
茶点故事阅读 40,488评论 1赞 352
活死人
序言：一个原本活蹦乱跳的男人离奇死亡，死状恐怖，灵堂内的尸体忽然破棺而出，到底是诈尸还是另有隐情，我是刑警宁泽，带...
沈念sama阅读 36,181评论 5赞 350
日本核电站爆炸内幕
正文年R本政府宣布，位于F岛的核电站，受9级特大地震影响，放射性物质发生泄漏。R本人自食恶果不足惜，却给世界环境...
茶点故事阅读 41,862评论 3赞 333
男人毒药：我在死后第九天来索命
文/蒙蒙一、第九天我趴在偏房一处隐蔽的房顶上张望。院中可真热闹，春花似锦、人声如沸。这庄子的主人今日做“春日...
开封第一讲书人阅读 32,331评论 0赞 24
一桩弑父案，背后竟有这般阴谋
文/苍兰香墨我抬头看了看天上的太阳。三九已至，却和暖如春，着一层夹袄步出监牢的瞬间，已是汗流浃背。一阵脚步声响...
开封第一讲书人阅读 33,445评论 1赞 272
情欲美人皮
我被黑心中介骗来泰国打工，没想到刚下飞机就差点儿被人妖公主榨干…… 1. 我叫王不留，地道东北人。一个月前我还...
沈念sama阅读 48,897评论 3赞 376
代替公主和亲
正文我出身青楼，却偏偏与公主长得像，于是被迫代替她去往敌国和亲。传闻我的和亲对象是个残疾皇子，可洞房花烛夜当晚...
茶点故事阅读 45,500评论 2赞 359

想免费用谷歌资源训练神经网络？Colab 详细使用教程 —— Jinkey 原创

1 简介

2 库的安装和使用

3 Google Drive 文件操作

授权登录

遍历目录

读取文件内容

写文件操作

4 Google Sheet 电子表格操作

授权登录

读取

写入

5 下载文件到本地

6 实战

新建

安装依赖

加载数据

文本预处理

构建神经网络

预测样本

推荐阅读更多精彩内容