[Spark] (task4) SparkML Basics (Data Encoding)

Reader submission 237 2022-09-22


Study Summary

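Contents

Study Summary
0. Introduction
1. Building an ML Pipeline machine learning workflow
    1.1 ML Pipeline construction workflow
    1.2 ML Pipeline components
2. Data encoding
    2.1 Learn the data encoding modules in Spark ML
    2.2 Read Pokemon.csv and understand the fields
    2.3 Apply OneHotEncoder to the categorical attributes
    2.4 Apply MinMaxScaler to the numeric attributes
    2.5 Reduce the encoded attributes with PCA (the dimension is your choice)
3. Multi-hot encoding
4. Bucketing and normalizing numeric features
Reference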

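0. Introduction

[Introduction] Spark is a fast, general-purpose big data engine. Loosely speaking, it is a distributed big data processing framework that lets users deploy Spark on a large number of inexpensive machines to form a cluster. Spark is implemented in Scala and provides APIs for Java, Python, R, and other languages. This task (task4) covers SparkML basics (data encoding, classification, clustering models, and so on).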

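Recap of the first three tasks:

[Spark] (task1) PySpark basic data processing

Connect to the Spark environment from Python
Create DataFrame data
Run the following logic with Spark: find the number of rows and columns
Use Spark to select the samples whose class is 1
Use Spark to select the samples with language > 90 or math > 90

[Spark] (task2) PySpark data statistics and group aggregation

1. Data statistics

Read a file
Save the data that was read
Analyze the type and the number of distinct values of each column
Check whether each column contains missing values

2. Group aggregation

Learn to use groupby for group aggregation
Learn to use agg for group aggregation
Use transform

[Spark] (task3) SparkSQL basics

Spark SQL works much like MySQL, except that the queries run against a connected Spark cluster environment. Pay attention to column names and types.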

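1. Building an ML Pipeline machine learning workflow

If the sample size is small, you can build the ML model directly in Python. When the dataset is large, Spark can be used for distributed in-memory computation. Spark's native language is Scala, but if you write Python you can use PySpark.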

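1.1 ML Pipeline construction workflow

Spark's machine learning library is MLlib, and its original RDD-based API is more complex to work with than the DataFrame-based ML Pipeline API. Let's first take a rough look at how ML Pipeline builds a machine learning workflow:

Data preparation: organize the feature values and the prediction target into a DataFrame.
Build the machine learning Pipeline:
    StringIndexer: converts string categorical features into numeric indices
    OneHotEncoder: converts numeric categorical features into sparse vectors
    VectorAssembler: combines all feature columns into a single Vector column
    DecisionTreeClassifier: trains and produces the model
Training: call pipeline.fit() on the training set to produce a pipelineModel.
Prediction: call pipelineModel.transform() on the test set to produce the predictions.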
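The fit/transform flow described above can be illustrated without Spark. The following is a minimal toy sketch in plain Python, not the pyspark implementation: every class name in it (`ToyIndexer`, `ToyIndexerModel`, `ToyPipeline`, `ToyPipelineModel`) is hypothetical, and rows are plain dictionaries rather than a DataFrame.

```python
# Toy, Spark-free illustration of the Estimator / Transformer contract.
# All class names here are hypothetical; none of them are pyspark classes.

class ToyIndexerModel:
    """Transformer: maps each label to the index learned during fit()."""
    def __init__(self, column, mapping):
        self.column = column
        self.mapping = mapping

    def transform(self, rows):
        # Return new rows; the input rows are left untouched.
        return [{**row, self.column + "_idx": self.mapping[row[self.column]]}
                for row in rows]


class ToyIndexer:
    """Estimator: fit() scans the data and returns a Transformer."""
    def __init__(self, column):
        self.column = column

    def fit(self, rows):
        labels = sorted({row[self.column] for row in rows})
        mapping = {label: i for i, label in enumerate(labels)}
        return ToyIndexerModel(self.column, mapping)


class ToyPipelineModel:
    """Fitted pipeline: applies every fitted stage in order."""
    def __init__(self, stages):
        self.stages = stages

    def transform(self, rows):
        for stage in self.stages:
            rows = stage.transform(rows)
        return rows


class ToyPipeline:
    """Chains stages: Estimators are fitted, Transformers are used directly."""
    def __init__(self, stages):
        self.stages = stages

    def fit(self, rows):
        fitted = []
        for stage in self.stages:
            model = stage.fit(rows) if hasattr(stage, "fit") else stage
            rows = model.transform(rows)  # output feeds the next stage
            fitted.append(model)
        return ToyPipelineModel(fitted)


rows = [{"type": "Grass"}, {"type": "Fire"}, {"type": "Grass"}]
model = ToyPipeline([ToyIndexer("type")]).fit(rows)
result = model.transform(rows)
print(result)
```

The point of the split is that `fit()` learns state from the training data once, and the returned model can then be applied to any other data with `transform()`.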

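1.2 ML Pipeline components

Note: some PySpark components are not quite the same as their namesakes in the Python ecosystem:

DataFrame: the data format handled by the Spark ML API. It can be created from text files, RDDs, or Spark SQL. It is conceptually close to the pandas DataFrame, but its methods are completely different.
Transformer: its .transform method converts one DataFrame into another DataFrame.
Estimator: its .fit method takes a DataFrame and produces a Transformer.
Pipeline: chains multiple Transformers and Estimators into an ML workflow.
Parameter: the parameter API shared by all of the Transformers and Estimators above.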

2. Data encoding

2.1 Learn the data encoding modules in Spark ML

2.2 Read Pokemon.csv and understand the fields

import pandas as pd
from pyspark.sql import SparkSession
from pyspark import SparkFiles

# Create the Spark application
spark = SparkSession.builder.appName('SparkTest').getOrCreate()

# The URL passed to addFile is truncated in the source and is left as-is here
spark.sparkContext.addFile('...')
pokemon = spark.read.csv(path=SparkFiles.get('Pokemon.csv'), header=True, inferSchema=True)

# Rename the columns
pokemon = pokemon.withColumnRenamed('Sp. Atk', 'Sp Atk')
pokemon = pokemon.withColumnRenamed('Sp. Def', 'Sp Def')
pokemon.show(5)

The pokemon table after creating the Spark session and renaming the columns:

+--------------------+------+------+-----+---+------+-------+------+------+-----+----------+---------+
|                Name|Type 1|Type 2|Total| HP|Attack|Defense|Sp Atk|Sp Def|Speed|Generation|Legendary|
+--------------------+------+------+-----+---+------+-------+------+------+-----+----------+---------+
|           Bulbasaur| Grass|Poison|  318| 45|    49|     49|    65|    65|   45|         1|    false|
|             Ivysaur| Grass|Poison|  405| 60|    62|     63|    80|    80|   60|         1|    false|
|            Venusaur| Grass|Poison|  525| 80|    82|     83|   100|   100|   80|         1|    false|
|VenusaurMega Venu...| Grass|Poison|  625| 80|   100|    123|   122|   120|   80|         1|    false|
|          Charmander|  Fire|  null|  309| 39|    52|     43|    60|    50|   65|         1|    false|
+--------------------+------+------+-----+---+------+-------+------+------+-----+----------+---------+
only showing top 5 rows

Column names and types:

pokemon.dtypes
"""
[('Name', 'string'), ('Type 1', 'string'), ('Type 2', 'string'), ('Total', 'int'),
 ('HP', 'int'), ('Attack', 'int'), ('Defense', 'int'), ('Sp Atk', 'int'),
 ('Sp Def', 'int'), ('Speed', 'int'), ('Generation', 'int'), ('Legendary', 'boolean')]
"""

# encoding=utf-8
from pyspark.sql import SparkSession
from pyspark import SparkFiles
from pyspark.ml import Pipeline
from pyspark.ml.feature import StringIndexer
from pyspark.ml.feature import OneHotEncoder
from pyspark.ml.feature import MinMaxScaler
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.feature import PCA

# Task 5: SparkML basics: data encoding

# Step 0: connect to the Spark cluster
spark = SparkSession.builder.appName('pyspark').getOrCreate()

# Step 1: learn the data encoding modules in Spark ML
# Step 2: read Pokemon.csv and understand the fields
# Step 2.1: read the file (part of this line is missing in the source;
# SparkFiles.get only works after the file was registered with spark.sparkContext.addFile)
path = "file://" + SparkFiles.get("Pokemon.csv")

# Step 2.2: load the file into a DataFrame, keeping the header row
df = spark.read.csv(path=path, header=True, inferSchema=True)
df = df.withColumnRenamed('Sp. Atk', 'SpAtk')
df = df.withColumnRenamed('Sp. Def', 'SpDef')
df = df.withColumnRenamed('Type 1', 'Type1')
df = df.withColumnRenamed('Type 2', 'Type2')
df.show(n=3)

# Categorical fields: Type1, Type2, Generation
# Numeric fields: Total, HP, Attack, Defense, SpAtk, SpDef, Speed

2.3 Apply OneHotEncoder to the categorical attributes

One-hot encode the categorical attributes. First, the parameters of OneHotEncoder:

class pyspark.ml.feature.OneHotEncoder(*, inputCols=None, outputCols=None, handleInvalid='error', dropLast=True, inputCol=None, outputCol=None)

# Step 3: apply OneHotEncoder to the categorical attributes
# Step 3.1: convert the string features to index features
# handleInvalid='skip' filters out rows with invalid values (such as a null Type2)
indexer = StringIndexer(
    inputCols=["Type1", "Type2"],
    outputCols=["Type1_idx", "Type2_idx"],
    handleInvalid='skip')
df = indexer.fit(df).transform(df)
df.show(n=3)

# Step 3.2: convert the index features to one-hot encoding
one_hot_encoder = OneHotEncoder(
    inputCols=['Type1_idx', 'Type2_idx', 'Generation'],
    outputCols=["Type1_vec", "Type2_vec", "Generation_vec"])
df = one_hot_encoder.fit(df).transform(df)
df.show(n=3)

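The result after converting the string columns to indices, followed by the result after converting the index features to one-hot vectors: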

+---------+-----+------+-----+---+------+-------+-----+-----+-----+----------+---------+---------+---------+
|     Name|Type1| Type2|Total| HP|Attack|Defense|SpAtk|SpDef|Speed|Generation|Legendary|Type1_idx|Type2_idx|
+---------+-----+------+-----+---+------+-------+-----+-----+-----+----------+---------+---------+---------+
|Bulbasaur|Grass|Poison|  318| 45|    49|     49|   65|   65|   45|         1|    false|      2.0|      2.0|
|  Ivysaur|Grass|Poison|  405| 60|    62|     63|   80|   80|   60|         1|    false|      2.0|      2.0|
| Venusaur|Grass|Poison|  525| 80|    82|     83|  100|  100|   80|         1|    false|      2.0|      2.0|
+---------+-----+------+-----+---+------+-------+-----+-----+-----+----------+---------+---------+---------+
only showing top 3 rows

+---------+-----+------+-----+---+------+-------+-----+-----+-----+----------+---------+---------+---------+--------------+--------------+--------------+
|     Name|Type1| Type2|Total| HP|Attack|Defense|SpAtk|SpDef|Speed|Generation|Legendary|Type1_idx|Type2_idx|     Type1_vec|     Type2_vec|Generation_vec|
+---------+-----+------+-----+---+------+-------+-----+-----+-----+----------+---------+---------+---------+--------------+--------------+--------------+
|Bulbasaur|Grass|Poison|  318| 45|    49|     49|   65|   65|   45|         1|    false|      2.0|      2.0|(17,[2],[1.0])|(17,[2],[1.0])| (6,[1],[1.0])|
|  Ivysaur|Grass|Poison|  405| 60|    62|     63|   80|   80|   60|         1|    false|      2.0|      2.0|(17,[2],[1.0])|(17,[2],[1.0])| (6,[1],[1.0])|
| Venusaur|Grass|Poison|  525| 80|    82|     83|  100|  100|   80|         1|    false|      2.0|      2.0|(17,[2],[1.0])|(17,[2],[1.0])| (6,[1],[1.0])|
+---------+-----+------+-----+---+------+-------+-----+-----+-----+----------+---------+---------+---------+--------------+--------------+--------------+
only showing top 3 rows
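To make the sparse notation in the output easier to read, here is a small Spark-free sketch of the two steps: assigning indices to labels by descending frequency, then turning an index into a `(size, indices, values)` triple with the last category dropped. The helper names `fit_string_index` and `one_hot_sparse` and the sample list are hypothetical; this only mimics the behaviour and is not the pyspark code.

```python
from collections import Counter


def fit_string_index(values):
    """Give index 0 to the most frequent label, 1 to the next, and so on.

    Ties are broken alphabetically so the result is deterministic.
    """
    counts = Counter(values)
    ordered = sorted(counts, key=lambda label: (-counts[label], label))
    return {label: idx for idx, label in enumerate(ordered)}


def one_hot_sparse(index, num_categories, drop_last=True):
    """Return (size, indices, values), like the sparse vectors printed above."""
    size = num_categories - 1 if drop_last else num_categories
    if index >= size:
        # The dropped last category is represented by an all-zero vector.
        return (size, [], [])
    return (size, [index], [1.0])


# Hypothetical sample of type labels
types = ["Water", "Water", "Water", "Normal", "Normal", "Grass"]
mapping = fit_string_index(types)
encoded = [one_hot_sparse(mapping[t], len(mapping)) for t in types]

print(mapping)
for label, vec in zip(types, encoded):
    print(label, vec)
```

Read this way, `(17,[2],[1.0])` is a vector of length 17 whose only non-zero entry is at position 2. A length of 17 is what one would expect from 18 distinct type labels with `dropLast=True`.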

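2.4 Apply MinMaxScaler to the numeric attributes

For numeric attributes we usually apply normalization. With the common min-max normalization, the formula is: Rescaled(e_i) = (e_i - E_min) / (E_max - E_min) * (max - min) + min. The parameters of MinMaxScaler are: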

pyspark.ml.feature.MinMaxScaler(*, min = 0.0, max = 1.0, inputCol = None, outputCol = None)
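To make the rescaling formula concrete, here is a minimal plain-Python sketch of it. The function name `min_max_rescale` is hypothetical, and the four HP values are just the sample rows shown earlier, so the results differ from the Spark output, where E_min and E_max come from the whole column.

```python
def min_max_rescale(values, new_min=0.0, new_max=1.0):
    """Rescale values into [new_min, new_max] with the min-max formula."""
    e_min, e_max = min(values), max(values)
    if e_max == e_min:
        # Constant column: every value maps to the midpoint of the target range.
        return [0.5 * (new_max + new_min) for _ in values]
    scale = (new_max - new_min) / (e_max - e_min)
    return [(v - e_min) * scale + new_min for v in values]


hp = [45, 60, 80, 39]  # HP of Bulbasaur, Ivysaur, Venusaur, Charmander
scaled = min_max_rescale(hp)
print(scaled)
```

Here E_min = 39 and E_max = 80, so 45 maps to (45 - 39) / 41, the smallest value maps to 0 and the largest to 1.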

# Step 4: apply MinMaxScaler to the numeric attributes
columns_to_scale = ["Total", "HP", "Attack", "Defense", "SpAtk", "SpDef", "Speed"]
assemblers, scalers = list(), list()
for col in columns_to_scale:
    # MinMaxScaler works on a Vector column, so wrap each numeric column first
    vec = VectorAssembler(inputCols=[col], outputCol=col + "_vec")
    assemblers.append(vec)
    sc = MinMaxScaler(inputCol=col + "_vec", outputCol=col + "_scaled")
    scalers.append(sc)
pipeline = Pipeline(stages=assemblers + scalers)
df = pipeline.fit(df).transform(df)
df.show(n=3)

The corresponding result:

+---------+-----+------+-----+---+------+-------+-----+-----+-----+----------+---------+---------+---------+--------------+--------------+--------------+---------+------+----------+-----------+---------+---------+---------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+
|     Name|Type1| Type2|Total| HP|Attack|Defense|SpAtk|SpDef|Speed|Generation|Legendary|Type1_idx|Type2_idx|     Type1_vec|     Type2_vec|Generation_vec|Total_vec|HP_vec|Attack_vec|Defense_vec|SpAtk_vec|SpDef_vec|Speed_vec|        Total_scaled|           HP_scaled|       Attack_scaled|      Defense_scaled|        SpAtk_scaled|        SpDef_scaled|        Speed_scaled|
+---------+-----+------+-----+---+------+-------+-----+-----+-----+----------+---------+---------+---------+--------------+--------------+--------------+---------+------+----------+-----------+---------+---------+---------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+
|Bulbasaur|Grass|Poison|  318| 45|    49|     49|   65|   65|   45|         1|    false|      2.0|      2.0|(17,[2],[1.0])|(17,[2],[1.0])| (6,[1],[1.0])|  [318.0]|[45.0]|    [49.0]|     [49.0]|   [65.0]|   [65.0]|   [45.0]|[0.21694915254237...|[0.2953020134228188]|[0.21666666666666...|[0.15813953488372...|[0.3235294117647059]|[0.2142857142857143]|[0.25806451612903...|
|  Ivysaur|Grass|Poison|  405| 60|    62|     63|   80|   80|   60|         1|    false|      2.0|      2.0|(17,[2],[1.0])|(17,[2],[1.0])| (6,[1],[1.0])|  [405.0]|[60.0]|    [62.0]|     [63.0]|   [80.0]|   [80.0]|   [60.0]|[0.3644067796610169]|[0.3959731543624161]|[0.2888888888888889]|[0.22325581395348...|[0.4117647058823529]|[0.28571428571428...|[0.3548387096774194]|
| Venusaur|Grass|Poison|  525| 80|    82|     83|  100|  100|   80|         1|    false|      2.0|      2.0|(17,[2],[1.0])|(17,[2],[1.0])| (6,[1],[1.0])|  [525.0]|[80.0]|    [82.0]|     [83.0]|  [100.0]|  [100.0]|   [80.0]|[0.5677966101694915]|[0.5302013422818792]|               [0.4]|[0.31627906976744...|[0.5294117647058824]| [0.380952380952381]|[0.4838709677419355]|
+---------+-----+------+-----+---+------+-------+-----+-----+-----+----------+---------+---------+---------+--------------+--------------+--------------+---------+------+----------+-----------+---------+---------+---------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+

2.5 Reduce the encoded attributes with PCA (the dimension is your choice)

PCA dimensionality reduction. Here I choose the dimension K=5.

# Step 5: reduce the encoded attributes with PCA (the dimension is your choice)
# encoded features: Type1_vec, Type2_vec, Generation_vec, Total_scaled, HP_scaled,
# Attack_scaled, Defense_scaled, SpAtk_scaled, SpDef_scaled, Speed_scaled
cols = ["Type1_vec", "Type2_vec", "Generation_vec", "Total_scaled", "HP_scaled",
        "Attack_scaled", "Defense_scaled", "SpAtk_scaled", "SpDef_scaled", "Speed_scaled"]
assembler = VectorAssembler(inputCols=cols, outputCol="features")
df = assembler.transform(df)
df.select("features").show(n=3)

pca = PCA(k=5, inputCol="features", outputCol="pca")
df = pca.fit(df).transform(df)
df.show(n=3)

rows = df.select("pca").collect()
print(rows[0].asDict())

spark.stop()

Result:

+--------------------+
|            features|
+--------------------+
|(47,[2,19,35,40,4...|
|(47,[2,19,35,40,4...|
|(47,[2,19,35,40,4...|
+--------------------+
only showing top 3 rows

+---------+-----+------+-----+---+------+-------+-----+-----+-----+----------+---------+---------+---------+--------------+--------------+--------------+---------+------+----------+-----------+---------+---------+---------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+
|     Name|Type1| Type2|Total| HP|Attack|Defense|SpAtk|SpDef|Speed|Generation|Legendary|Type1_idx|Type2_idx|     Type1_vec|     Type2_vec|Generation_vec|Total_vec|HP_vec|Attack_vec|Defense_vec|SpAtk_vec|SpDef_vec|Speed_vec|        Total_scaled|           HP_scaled|       Attack_scaled|      Defense_scaled|        SpAtk_scaled|        SpDef_scaled|        Speed_scaled|            features|                 pca|
+---------+-----+------+-----+---+------+-------+-----+-----+-----+----------+---------+---------+---------+--------------+--------------+--------------+---------+------+----------+-----------+---------+---------+---------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+
|Bulbasaur|Grass|Poison|  318| 45|    49|     49|   65|   65|   45|         1|    false|      2.0|      2.0|(17,[2],[1.0])|(17,[2],[1.0])| (6,[1],[1.0])|  [318.0]|[45.0]|    [49.0]|     [49.0]|   [65.0]|   [65.0]|   [45.0]|[0.21694915254237...|[0.2953020134228188]|[0.21666666666666...|[0.15813953488372...|[0.3235294117647059]|[0.2142857142857143]|[0.25806451612903...|(47,[2,19,35,40,4...|[0.34275937676253...|
|  Ivysaur|Grass|Poison|  405| 60|    62|     63|   80|   80|   60|         1|    false|      2.0|      2.0|(17,[2],[1.0])|(17,[2],[1.0])| (6,[1],[1.0])|  [405.0]|[60.0]|    [62.0]|     [63.0]|   [80.0]|   [80.0]|   [60.0]|[0.3644067796610169]|[0.3959731543624161]|[0.2888888888888889]|[0.22325581395348...|[0.4117647058823529]|[0.28571428571428...|[0.3548387096774194]|(47,[2,19,35,40,4...|[0.32329833337804...|
| Venusaur|Grass|Poison|  525| 80|    82|     83|  100|  100|   80|         1|    false|      2.0|      2.0|(17,[2],[1.0])|(17,[2],[1.0])| (6,[1],[1.0])|  [525.0]|[80.0]|    [82.0]|     [83.0]|  [100.0]|  [100.0]|   [80.0]|[0.5677966101694915]|[0.5302013422818792]|               [0.4]|[0.31627906976744...|[0.5294117647058824]| [0.380952380952381]|[0.4838709677419355]|(47,[2,19,35,40,4...|[0.29572580767124...|
+---------+-----+------+-----+---+------+-------+-----+-----+-----+----------+---------+---------+---------+--------------+--------------+--------------+---------+------+----------+-----------+---------+---------+---------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+
only showing top 3 rows

{'pca': DenseVector([0.3428, -0.8743, -0.6616, 0.0442, 0.7151])}
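As a rough illustration of what PCA computes, the following pure-Python sketch (no Spark) finds the first principal component of a tiny 2-D dataset by power iteration on the sample covariance matrix. The function names `first_principal_component` and `project` are hypothetical and the sample points are made up. The sketch centers the data before projecting, which is an assumption and not necessarily what Spark's PCA does internally. As a side note, the length 47 of the features vector above matches 17 + 17 + 6 one-hot slots plus 7 scaled columns.

```python
import math


def first_principal_component(points, iterations=100):
    """Top eigenvector of the sample covariance matrix via power iteration.

    Assumes the data is not constant and that the start vector (all ones)
    is not orthogonal to the top eigenvector.
    """
    n, dim = len(points), len(points[0])
    means = [sum(p[j] for p in points) / n for j in range(dim)]
    centered = [[p[j] - means[j] for j in range(dim)] for p in points]
    # Sample covariance matrix (dim x dim)
    cov = [[sum(row[a] * row[b] for row in centered) / (n - 1)
            for b in range(dim)] for a in range(dim)]
    vec = [1.0] * dim
    for _ in range(iterations):
        nxt = [sum(cov[a][b] * vec[b] for b in range(dim)) for a in range(dim)]
        norm = math.sqrt(sum(x * x for x in nxt))
        vec = [x / norm for x in nxt]
    return means, vec


def project(point, means, component):
    """Coordinate of a point along the component (a k=1 reduction)."""
    return sum((point[j] - means[j]) * component[j] for j in range(len(point)))


# Made-up points lying exactly on the line y = x
points = [(0.0, 0.0), (1.0, 1.0), (2.0, 2.0), (3.0, 3.0)]
means, component = first_principal_component(points)
reduced = [project(p, means, component) for p in points]
print(component)
print(reduced)
```

Because all the variance of these points lies along y = x, the component found points along that diagonal and the single reduced coordinate preserves the spacing between the points.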

3. Multi-hot encoding

See the multiHotEncoderExample function in the code example of the next section.
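Before the Spark version, the idea can be shown with a minimal plain-Python sketch: a movie that belongs to several genres gets a vector with a 1.0 in every matching position. The helper names `fit_genre_index` and `multi_hot` are hypothetical, and the alphabetical index order is an assumption chosen only to keep the example deterministic.

```python
def fit_genre_index(genre_strings, sep="|"):
    """Map every distinct genre to a position, in alphabetical order."""
    genres = sorted({g for s in genre_strings for g in s.split(sep)})
    return {g: i for i, g in enumerate(genres)}


def multi_hot(genre_string, index, sep="|"):
    """Dense multi-hot vector: 1.0 at the position of each genre present."""
    vec = [0.0] * len(index)
    for g in genre_string.split(sep):
        vec[index[g]] = 1.0
    return vec


# Genre strings in the same "A|B" format as movies.csv
movies = ["Adventure|Children", "Comedy|Romance", "Comedy"]
index = fit_genre_index(movies)
vectors = [multi_hot(m, index) for m in movies]

print(index)
for m, v in zip(movies, vectors):
    print(m, v)
```

One-hot encoding is the special case where exactly one position is set.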

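4. Bucketing and normalizing numeric features

This addresses two problems with numeric features: the scale of a feature and the distribution of a feature.

Feature scale: in movie recommendation, for example, a movie's rating count can be very large while its average rating is usually a small number, so the former can drown out the latter. We therefore usually pull the two features into the same range, which is what normalization does. In pyspark this is the MinMaxScaler interface.
Feature distribution: in the same movie recommendation example, most people tend to give middling-to-high scores, so many ratings are above 3.5. Such data is hard for a model to learn from because the feature discriminates poorly. We usually handle this with bucketing: sort the samples by the feature value, find the quantiles according to the number of buckets, assign each sample to its bucket, and use the bucket ID directly as the feature value.

Spark MLlib provides MinMaxScaler for normalization and QuantileDiscretizer for bucketing. They are used in the same way as the OneHotEncoderEstimator introduced earlier (renamed to OneHotEncoder after version 2.4, that is, in Spark 3.0): first call fit to preprocess the data, then call transform to perform the feature transformation.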
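The bucketing idea described above can be sketched in plain Python as follows. The helper names `quantile_splits` and `bucket_id` are hypothetical, and the sketch uses exact ranks of the sorted values, whereas QuantileDiscretizer works with approximate quantiles, so the split points need not match Spark's. The rating counts are the ten values that appear in the sample output of this section.

```python
import bisect


def quantile_splits(values, num_buckets):
    """Interior split points taken at evenly spaced ranks of the sorted values."""
    ordered = sorted(values)
    n = len(ordered)
    splits = []
    for k in range(1, num_buckets):
        candidate = ordered[(k * n) // num_buckets]
        if not splits or candidate > splits[-1]:  # keep splits strictly increasing
            splits.append(candidate)
    return splits


def bucket_id(value, splits):
    """Bucket i holds the values in [splits[i-1], splits[i])."""
    return bisect.bisect_right(splits, value)


rating_counts = [6, 20, 159, 174, 254, 259, 402, 788, 1609, 14616]
splits = quantile_splits(rating_counts, num_buckets=5)
buckets = [bucket_id(c, splits) for c in rating_counts]

print(splits)
print(buckets)
```

Each bucket ends up with two of the ten movies even though the raw counts range from 6 to 14616, which is the evening-out effect that bucketing is meant to provide.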

from pyspark import SparkConf
from pyspark.ml import Pipeline
from pyspark.ml.feature import OneHotEncoder, StringIndexer, QuantileDiscretizer, MinMaxScaler
from pyspark.ml.linalg import VectorUDT, Vectors
from pyspark.sql import SparkSession
from pyspark.sql.functions import *
from pyspark.sql.types import *
from pyspark.sql import functions as F


def oneHotEncoderExample(movieSamples):
    samplesWithIdNumber = movieSamples.withColumn("movieIdNumber", F.col("movieId").cast(IntegerType()))
    encoder = OneHotEncoder(inputCols=["movieIdNumber"], outputCols=['movieIdVector'], dropLast=False)
    oneHotEncoderSamples = encoder.fit(samplesWithIdNumber).transform(samplesWithIdNumber)
    oneHotEncoderSamples.printSchema()
    oneHotEncoderSamples.show(10)


# Helper used by the UDF below: turns a list of genre indices into a sparse vector
def array2vec(genreIndexes, indexSize):
    genreIndexes.sort()
    fill_list = [1.0 for _ in range(len(genreIndexes))]
    return Vectors.sparse(indexSize, genreIndexes, fill_list)


# Multi-hot encoding
def multiHotEncoderExample(movieSamples):
    # One row per (movie, genre) pair
    samplesWithGenre = movieSamples.select("movieId", "title", explode(
        split(F.col("genres"), "\\|").cast(ArrayType(StringType()))).alias('genre'))
    genreIndexer = StringIndexer(inputCol="genre", outputCol="genreIndex")
    StringIndexerModel = genreIndexer.fit(samplesWithGenre)
    genreIndexSamples = StringIndexerModel.transform(samplesWithGenre).withColumn(
        "genreIndexInt", F.col("genreIndex").cast(IntegerType()))
    # max here is pyspark.sql.functions.max (from the star import), not the builtin
    indexSize = genreIndexSamples.agg(max(F.col("genreIndexInt"))).head()[0] + 1
    processedSamples = genreIndexSamples.groupBy('movieId').agg(
        F.collect_list('genreIndexInt').alias('genreIndexes')).withColumn("indexSize", F.lit(indexSize))
    finalSample = processedSamples.withColumn(
        "vector", udf(array2vec, VectorUDT())(F.col("genreIndexes"), F.col("indexSize")))
    finalSample.printSchema()
    finalSample.show(10)


# Numeric feature processing
def ratingFeatures(ratingSamples):
    ratingSamples.printSchema()
    ratingSamples.show()
    # calculate average movie rating score and rating count
    movieFeatures = ratingSamples.groupBy('movieId').agg(
        F.count(F.lit(1)).alias('ratingCount'),
        F.avg("rating").alias("avgRating"),
        F.variance('rating').alias('ratingVar')) \
        .withColumn('avgRatingVec', udf(lambda x: Vectors.dense(x), VectorUDT())('avgRating'))
    movieFeatures.show(10)
    # Bucketing: split the rating count feature into 100 buckets
    ratingCountDiscretizer = QuantileDiscretizer(numBuckets=100, inputCol="ratingCount",
                                                 outputCol="ratingCountBucket")
    # Normalization: rescale the average rating
    ratingScaler = MinMaxScaler(inputCol="avgRatingVec", outputCol="scaleAvgRating")
    # Build a pipeline that runs the two feature processing steps in order
    pipelineStage = [ratingCountDiscretizer, ratingScaler]
    featurePipeline = Pipeline(stages=pipelineStage)
    movieProcessedFeatures = featurePipeline.fit(movieFeatures).transform(movieFeatures)
    movieProcessedFeatures.show(10)


if __name__ == '__main__':
    conf = SparkConf().setAppName('featureEngineering').setMaster('local')
    spark = SparkSession.builder.config(conf=conf).getOrCreate()
    file_path = 'file:///home/hadoop/development/RecSys'
    movieResourcesPath = file_path + "/data/movies.csv"
    movieSamples = spark.read.format('csv').option('header', 'true').load(movieResourcesPath)
    print("Raw Movie Samples:")
    movieSamples.show(10)
    movieSamples.printSchema()
    """
    print("OneHotEncoder Example:")
    oneHotEncoderExample(movieSamples)
    print("MultiHotEncoder Example:")
    multiHotEncoderExample(movieSamples)
    """
    print("Numerical features Example:")
    ratingsResourcesPath = file_path + "/data/ratings.csv"
    ratingSamples = spark.read.format('csv').option('header', 'true').load(ratingsResourcesPath)
    ratingFeatures(ratingSamples)

Result:

Raw Movie Samples:
+-------+--------------------+--------------------+
|movieId|               title|              genres|
+-------+--------------------+--------------------+
|      1|    Toy Story (1995)|Adventure|Animati...|
|      2|      Jumanji (1995)|Adventure|Childre...|
|      3|Grumpier Old Men ...|      Comedy|Romance|
|      4|Waiting to Exhale...|Comedy|Drama|Romance|
|      5|Father of the Bri...|              Comedy|
|      6|         Heat (1995)|Action|Crime|Thri...|
|      7|      Sabrina (1995)|      Comedy|Romance|
|      8| Tom and Huck (1995)|  Adventure|Children|
|      9| Sudden Death (1995)|              Action|
|     10|    GoldenEye (1995)|Action|Adventure|...|
+-------+--------------------+--------------------+
only showing top 10 rows

root
 |-- movieId: string (nullable = true)
 |-- title: string (nullable = true)
 |-- genres: string (nullable = true)

Numerical features Example:
root
 |-- userId: string (nullable = true)
 |-- movieId: string (nullable = true)
 |-- rating: string (nullable = true)
 |-- timestamp: string (nullable = true)

+------+-------+------+----------+
|userId|movieId|rating| timestamp|
+------+-------+------+----------+
|     1|      2|   3.5|1112486027|
|     1|     29|   3.5|1112484676|
|     1|     32|   3.5|1112484819|
|     1|     47|   3.5|1112484727|
|     1|     50|   3.5|1112484580|
|     1|    112|   3.5|1094785740|
|     1|    151|   4.0|1094785734|
|     1|    223|   4.0|1112485573|
|     1|    253|   4.0|1112484940|
|     1|    260|   4.0|1112484826|
|     1|    293|   4.0|1112484703|
|     1|    296|   4.0|1112484767|
|     1|    318|   4.0|1112484798|
|     1|    337|   3.5|1094785709|
|     1|    367|   3.5|1112485980|
|     1|    541|   4.0|1112484603|
|     1|    589|   3.5|1112485557|
|     1|    593|   3.5|1112484661|
|     1|    653|   3.0|1094785691|
|     1|    919|   3.5|1094785621|
+------+-------+------+----------+
only showing top 20 rows

+-------+-----------+------------------+------------------+--------------------+
|movieId|ratingCount|         avgRating|         ratingVar|        avgRatingVec|
+-------+-----------+------------------+------------------+--------------------+
|    296|      14616| 4.165606185002737|0.9615737413069365| [4.165606185002737]|
|    467|        174|3.4367816091954024|1.5075410271742742|[3.4367816091954024]|
|    829|        402|2.6243781094527363|1.4982072182727264|[2.6243781094527363]|
|    691|        254|3.1161417322834644|1.0842838691606238|[3.1161417322834644]|
|    675|          6|2.3333333333333335|0.6666666666666667|[2.3333333333333335]|
|    125|        788| 3.713197969543147|0.8598255922703321| [3.713197969543147]|
|    800|       1609|4.0447482908638905|0.8325734596130598|[4.0447482908638905]|
|    944|        259|3.8262548262548264|0.8534165394630511|[3.8262548262548264]|
|    853|         20|               3.5| 1.526315789473684|               [3.5]|
|    451|        159|  3.00314465408805|0.7800533397022531|  [3.00314465408805]|
+-------+-----------+------------------+------------------+--------------------+
only showing top 10 rows

+-------+-----------+------------------+------------------+--------------------+-----------------+--------------------+
|movieId|ratingCount|         avgRating|         ratingVar|        avgRatingVec|ratingCountBucket|      scaleAvgRating|
+-------+-----------+------------------+------------------+--------------------+-----------------+--------------------+
|    296|      14616| 4.165606185002737|0.9615737413069365| [4.165606185002737]|             99.0|[0.9170998054196596]|
|    467|        174|3.4367816091954024|1.5075410271742742|[3.4367816091954024]|             38.0|[0.7059538707722662]|
|    829|        402|2.6243781094527363|1.4982072182727264|[2.6243781094527363]|             54.0|[0.4705944962973248]|
|    691|        254|3.1161417322834644|1.0842838691606238|[3.1161417322834644]|             45.0|[0.6130620985364005]|
|    675|          6|2.3333333333333335|0.6666666666666667|[2.3333333333333335]|              4.0|[0.38627664627161...|
|    125|        788| 3.713197969543147|0.8598255922703321| [3.713197969543147]|             67.0|[0.7860337592595664]|
|    800|       1609|4.0447482908638905|0.8325734596130598|[4.0447482908638905]|             79.0|[0.8820863689021069]|
|    944|        259|3.8262548262548264|0.8534165394630511|[3.8262548262548264]|             46.0|[0.8187871768460151]|
|    853|         20|               3.5| 1.526315789473684|               [3.5]|             12.0|[0.7242687117592825]|
|    451|        159|  3.00314465408805|0.7800533397022531|  [3.00314465408805]|             37.0|[0.5803259992335382]|
+-------+-----------+------------------+------------------+--------------------+-----------------+--------------------+
only showing top 10 rows

Reference

https://spark.apache.org/
