Building Regression Models with Spark (Part 1)

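  • To illustrate the concepts in this chapter, we use the bike sharing dataset for our experiments. It records the
    number of bike rentals per hour in a bike sharing system, along with related information such as date, time, weather, season, and holidays.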

    [hadoop@master spark]$ wget http://archive.ics.uci.edu/ml/machine-learning-databases/00275/Bike-Sharing-Dataset.zip
    [hadoop@master spark]$ unzip Bike-Sharing-Dataset.zip
    
    [hadoop@master spark]$ sed 1d hour.csv > hour_noheader.csv
    [hadoop@master spark]$ hdfs dfs -put hour_noheader.csv ML/
    [hadoop@master spark]$ cat Readme.txt 
    
    Dataset characteristics
    =========================================   
    Both hour.csv and day.csv have the following fields, except hr which is not available in day.csv
        
        - instant: record index
        - dteday : date
        - season : season (1:springer, 2:summer, 3:fall, 4:winter)
        - yr : year (0: 2011, 1:2012)
        - mnth : month ( 1 to 12)
        - hr : hour (0 to 23)
        - holiday : weather day is holiday or not (extracted from http://dchr.dc.gov/page/holiday-schedule)
        - weekday : day of the week
        - workingday : if day is neither weekend nor holiday is 1, otherwise is 0.
        + weathersit : 
            - 1: Clear, Few clouds, Partly cloudy, Partly cloudy
            - 2: Mist + Cloudy, Mist + Broken clouds, Mist + Few clouds, Mist
            - 3: Light Snow, Light Rain + Thunderstorm + Scattered clouds, Light Rain + Scattered clouds
            - 4: Heavy Rain + Ice Pallets + Thunderstorm + Mist, Snow + Fog
        - temp : Normalized temperature in Celsius. The values are divided to 41 (max)
        - atemp: Normalized feeling temperature in Celsius. The values are divided to 50 (max)
        - hum: Normalized humidity. The values are divided to 100 (max)
        - windspeed: Normalized wind speed. The values are divided to 67 (max)
        - casual: count of casual users
        - registered: count of registered users
        - cnt: count of total rental bikes including both casual and registered
        
    =========================================
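
The normalization described in the Readme can be undone by multiplying each value by its stated maximum. The sketch below takes the Readme text above at face value (plain division by 41, 50, 100 and 67); the `SCALE` dict and the `denormalize` helper are illustrative names, not part of the dataset or of Spark.

```python
# Scale factors as stated in the Readme excerpt above (assumed to be plain divisors).
SCALE = {"temp": 41.0, "atemp": 50.0, "hum": 100.0, "windspeed": 67.0}

def denormalize(name, value):
    """Convert a normalized reading back to its approximate raw unit."""
    return float(value) * SCALE[name]

# Numeric fields of the first record in hour.csv: temp, atemp, hum, windspeed.
first_numeric = {"temp": "0.24", "atemp": "0.2879", "hum": "0.81", "windspeed": "0"}
raw = {name: denormalize(name, v) for name, v in first_numeric.items()}
print(raw)
```

Under that assumption, the first record's temp of 0.24 corresponds to roughly 9.84 degrees Celsius.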
    
    
  • Loading and inspecting the dataset

    export PYSPARK_DRIVER_PYTHON=/usr/local/program/python2.7/bin/python
    export PYSPARK_PYTHON=/usr/local/program/python2.7/bin/python
    [hadoop@master ~]$ pyspark --master yarn --driver-memory 4G
    >>> raw_data = sc.textFile("ML/hour_noheader.csv")
    >>> num_data = raw_data.count()
    >>> records = raw_data.map(lambda x: x.split(","))                              
    >>> first = records.first()
    >>> print first
    [u'1', u'2011-01-01', u'1', u'0', u'1', u'0', u'0', u'6', u'0', u'1', u'0.24', u'0.2879', u'0.81', u'0', u'3', u'13', u'16']
    >>> print num_data
    17379
    >>> records.cache()
    PythonRDD[4] at RDD at PythonRDD.scala:48
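
To make the column positions used in the rest of this post easier to follow, the sketch below pairs each field of the first record with its column name from the Readme. It runs without Spark; the `COLUMNS` list and the `line` string (rebuilt by joining the fields printed above) are introduced here for illustration only.

```python
# Column names in the order listed by the dataset Readme.
COLUMNS = ["instant", "dteday", "season", "yr", "mnth", "hr", "holiday",
           "weekday", "workingday", "weathersit", "temp", "atemp", "hum",
           "windspeed", "casual", "registered", "cnt"]

# First line of hour_noheader.csv, rebuilt from the parsed record printed above.
line = "1,2011-01-01,1,0,1,0,0,6,0,1,0.24,0.2879,0.81,0,3,13,16"

fields = line.split(",")            # the same split the RDD map applies
named = dict(zip(COLUMNS, fields))  # pair each field with its column name

for index, name in enumerate(COLUMNS):
    print(index, name, named[name])
```

Reading the indices off this listing: columns 2 through 9 are the categorical variables, columns 10 through 13 are the numerical ones, and column 16 (cnt) is the target.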
    
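  • To represent the categorical features in binary-encoded form, we map each feature value to the position of the nonzero entry in a binary vector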

    
    def get_mapping(rdd, idx):
        # Map each distinct value in column idx to a unique index
        return rdd.map(lambda fields: fields[idx]).distinct().zipWithIndex().collectAsMap()
        
    >>> print "Mapping of first categorical feature column: %s" % get_mapping(records, 2)
    Mapping of first categorical feature column: {u'1': 0, u'3': 1, u'2': 2, u'4': 3}
    
    # 8 categorical variables (columns 2 through 9)
    mappings = [get_mapping(records, i) for i in range(2,10)]
    >>> print mappings
    [{u'1': 0, u'3': 1, u'2': 2, u'4': 3}, {u'1': 0, u'0': 1}, {u'11': 0, u'10': 6, u'12': 7, u'1': 1, u'3': 2, u'2': 8, u'5': 3, u'4': 9, u'7': 4, u'6': 10, u'9': 5, u'8': 11}, {u'20': 2, u'21': 14, u'22': 4, u'23': 15, u'1': 6, u'0': 18, u'3': 7, u'2': 19, u'5': 8, u'4': 20, u'7': 9, u'6': 21, u'9': 10, u'8': 22, u'11': 0, u'10': 12, u'13': 1, u'12': 13, u'15': 11, u'14': 23, u'17': 3, u'16': 17, u'19': 5, u'18': 16}, {u'1': 0, u'0': 1}, {u'1': 0, u'0': 3, u'3': 1, u'2': 4, u'5': 2, u'4': 5, u'6': 6}, {u'1': 0, u'0': 1}, {u'1': 0, u'3': 1, u'2': 2, u'4': 3}]
    
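    # With the mapping for each variable computed, work out the total length of the final binary vector
    # len here is the built-in function len, applied to each mapping
    cat_len = sum(map(len, mappings))
    # The 4 numerical features (temp, atemp, hum, windspeed) are in columns 10 through 13
    num_len = len(records.first()[10:14])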
    total_len = num_len + cat_len
    
    >>> print "Feature vector length for categorical features: %d" % cat_len
    Feature vector length for categorical features: 57
    >>> print "Feature vector length for numerical features: %d" % num_len
    Feature vector length for numerical features: 4
    >>> print "Total feature vector length: %d" % total_len
    Total feature vector length: 61
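
The idea behind the mapping can be tried without a cluster. The following is a minimal local sketch, assuming plain Python lists in place of an RDD; `get_mapping_local` and `one_hot` are hypothetical helper names, and the toy `rows` are made up. It sorts the distinct values so the result is deterministic, whereas the index order produced by `distinct().zipWithIndex()` in Spark is not guaranteed to be sorted (as the printed mappings above show).

```python
def get_mapping_local(rows, idx):
    """Map each distinct value in column idx to a unique index (local stand-in for get_mapping)."""
    distinct = sorted(set(row[idx] for row in rows))
    return {value: i for i, value in enumerate(distinct)}

def one_hot(value, mapping):
    """Return a binary vector with a single 1 at the index assigned to value."""
    vec = [0.0] * len(mapping)
    vec[mapping[value]] = 1.0
    return vec

# Toy rows with a season-like value in column 0 (made-up data for illustration).
rows = [["1"], ["3"], ["2"], ["1"], ["4"]]
season_map = get_mapping_local(rows, 0)
print(season_map)
print(one_hot("3", season_map))
```

Each categorical column contributes as many slots as it has distinct values, which is why the categorical part of the vector has length 4 + 2 + 12 + 24 + 2 + 7 + 2 + 4 = 57.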
    
    
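  • Creating feature vectors for the linear model
    We use the mappings above to convert every categorical feature into a binary-encoded feature. To make it easy to extract the features and the label from each record, we define two helper functions, extract_features and extract_label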

    from pyspark.mllib.regression import LabeledPoint
    import numpy as np
    
    def extract_features(record):
        cat_vec = np.zeros(cat_len)
        i = 0
        step = 0
        # Columns 2 through 9 hold the 8 categorical variables
        for field in record[2:10]:
            # m is the dict that maps this column's values to indices
            m = mappings[i]
            idx = m[field]
            # step shifts idx past the slots used by the previous columns
            cat_vec[idx + step] = 1
            i = i + 1
            step = step + len(m)
        # Columns 10 through 13 hold the 4 numerical features
        num_vec = np.array([float(field) for field in record[10:14]])
        return np.concatenate((cat_vec, num_vec))
    
    def extract_label(record):
        # A negative index counts from the end of a Python list, so -1 selects the last field (cnt)
        return float(record[-1])
        
    data = records.map(lambda r: LabeledPoint(extract_label(r), extract_features(r)))
    first_point = data.first()
    >>> print "Raw data: " + str(first[2:])
    Raw data: [u'1', u'0', u'1', u'0', u'0', u'6', u'0', u'1', u'0.24', u'0.2879', u'0.81', u'0', u'3', u'13', u'16']
    >>> print "Label: " + str(first_point.label)
    Label: 16.0
    >>> print "Linear Model feature vector:\n" + str(first_point.features)
    Linear Model feature vector:
    [1.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0,1.0,0.0,0.0,0.0,0.24,0.2879,0.81,0.0]
    >>> print "Linear Model feature vector length: " + str(len(first_point.features))
    Linear Model feature vector length: 61
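
The running offset (`step`) in extract_features is what places each column's one-hot block at the right position in the concatenated vector. Here is a small sketch of that bookkeeping on its own, using the sizes of the 8 mappings printed earlier; `block_offsets` is a hypothetical helper name, and the code runs without Spark or numpy.

```python
# Sizes of the 8 categorical mappings printed earlier, in column order:
# season, yr, mnth, hr, holiday, weekday, workingday, weathersit.
sizes = [4, 2, 12, 24, 2, 7, 2, 4]

def block_offsets(sizes):
    """Return the start position of each column's block and the total length."""
    offsets, step = [], 0
    for size in sizes:
        offsets.append(step)
        step += size
    return offsets, step

offsets, cat_len = block_offsets(sizes)
print(offsets)   # [0, 4, 6, 18, 42, 44, 51, 53]
print(cat_len)   # 57

# hr = '0' has index 18 in the hr mapping above, and hr is the 4th categorical column.
hr_position = offsets[3] + 18
print(hr_position)  # 36
```

Position 36 is indeed one of the entries set to 1.0 in the feature vector printed above.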
    
    
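  • Creating feature vectors for the decision tree
    A decision tree model can work on the raw data directly (categorical features do not need to be binary-encoded). So we only need a function that converts every value to a float and wraps the result in a numpy array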

    def extract_features_dt(record):
        return np.array(map(float, record[2:14]))
    data_dt = records.map(lambda r: LabeledPoint(extract_label(r),extract_features_dt(r)))
    first_point_dt = data_dt.first()
    
    >>> print "Decision Tree feature vector: " + str(first_point_dt.features)
    Decision Tree feature vector: [1.0,0.0,1.0,0.0,0.0,6.0,0.0,1.0,0.24,0.2879,0.81,0.0]
    >>> print "Decision Tree feature vector length: " + str(len(first_point_dt.features))
    Decision Tree feature vector length: 12
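
The session above runs on Python 2.7, where `map` returns a list. On Python 3, `map` returns a lazy iterator, so handing it straight to `np.array` would not produce a numeric array. A minimal Python 3 friendly sketch of the same conversion follows; `extract_features_dt_py3` is a hypothetical name, and it returns a plain list so the example runs without numpy (wrapping the list in `np.array` is the assumed next step in a Spark job).

```python
def extract_features_dt_py3(record):
    """Convert columns 2 through 13 to floats without relying on Python 2's map()."""
    # Build the list explicitly; it could then be wrapped with np.array(...).
    return [float(field) for field in record[2:14]]

# First record of the dataset, as printed earlier in the session.
first = ["1", "2011-01-01", "1", "0", "1", "0", "0", "6", "0", "1",
         "0.24", "0.2879", "0.81", "0", "3", "13", "16"]

features = extract_features_dt_py3(first)
print(features)
print(len(features))  # 12
```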
    
    