[Spark] (Task 4) Spark ML Basics (Data Encoding) - APISpace

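Study Summary

Contents

- Study Summary
- 0. Introduction
- 1. Building a Machine Learning Workflow with ML Pipeline
    - 1.1 How an ML Pipeline Is Built
    - 1.2 ML Pipeline Components
- 2. Data Encoding
    - 2.1 The Data Encoding Modules in Spark ML
    - 2.2 Read Pokemon.csv and Understand Its Fields
    - 2.3 Apply OneHotEncoder to the Categorical Attributes
    - 2.4 Apply MinMaxScaler to the Numeric Attributes
    - 2.5 Reduce the Encoded Attributes with PCA (Choose the Dimension Yourself)
- 3. Multi-hot Encoding
- 4. Bucketing and Normalizing Numeric Features
- Reference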

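0. Introduction

Spark is a fast, general-purpose big data engine. Put simply, it is a distributed data processing framework that can be deployed on large numbers of inexpensive machines to form a cluster. Spark is implemented in Scala and provides APIs for Java, Python, R, and other languages. Task 4 covers the basics of Spark ML (data encoding, classification, clustering models, and so on).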

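A review of the first three tasks:

[Spark] (Task 1) Basic Data Processing with PySpark

- Connect to the Spark environment from Python
- Create DataFrame data
- Use Spark to find the number of rows and columns
- Use Spark to filter the samples whose class is 1
- Use Spark to filter the samples with language > 90 or math > 90

[Spark] (Task 2) Statistics and Grouped Aggregation with PySpark

Part 1: statistics

- Read a file
- Save the data that was read
- Analyze the type and number of distinct values of each column
- Check whether each column contains missing values

Part 2: grouped aggregation

- Learn how to use groupby
- Learn how to use agg
- Learn how to use transform

[Spark] (Task 3) Spark SQL Basics

Spark SQL works much like MySQL, except that queries run against a Spark cluster environment. Pay attention to column names and types.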

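1. Building a Machine Learning Workflow with ML Pipeline

With a small dataset you can build ML models directly in Python. For large-scale datasets, Spark offers distributed in-memory computation. Spark's native language is Scala, but Python users can work through PySpark.

1.1 How an ML Pipeline Is Built

Spark ships with the MLlib machine learning library. Its older RDD-based API (pyspark.mllib) is more cumbersome than the DataFrame-based ML Pipeline API (pyspark.ml), so let us first look at how an ML Pipeline builds a machine learning workflow: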

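- Data preparation: organize the feature values and the target variable into a DataFrame.
- Build the machine learning Pipeline:
    - StringIndexer: converts string categorical features to numeric indices
    - OneHotEncoder: converts numeric categorical features to sparse vectors
    - VectorAssembler: merges all feature columns into a single Vector column
    - DecisionTreeClassifier: trains and produces the model
- Training: call pipeline.fit() on the training set to produce a PipelineModel.
- Prediction: call pipelineModel.transform() on the test set to produce predictions.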
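To make the first pipeline stage more concrete, here is a minimal plain-Python sketch of what StringIndexer computes with its default ordering: labels are sorted by descending frequency, so the most frequent label gets index 0. This is an illustration only, not Spark's implementation. The function name `fit_string_indexer` and the sample labels are made up for this example, and ties are broken alphabetically here simply to keep the result deterministic.

```python
from collections import Counter


def fit_string_indexer(labels):
    """Map each distinct label to a float index, most frequent label first."""
    counts = Counter(labels)
    # Sort by descending count, then alphabetically to break ties.
    ordered = sorted(counts, key=lambda label: (-counts[label], label))
    return {label: float(i) for i, label in enumerate(ordered)}


# Hypothetical sample column: Water appears 3 times, Grass 2, Fire 1.
labels = ["Water", "Grass", "Water", "Fire", "Grass", "Water"]
mapping = fit_string_indexer(labels)
indexed = [mapping[label] for label in labels]

print(mapping)  # {'Water': 0.0, 'Grass': 1.0, 'Fire': 2.0}
print(indexed)  # [0.0, 1.0, 0.0, 2.0, 1.0, 0.0]
```

The learned mapping plays the role of the fitted model: it is computed once from the training data and then applied to any rows that follow.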

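1.2 ML Pipeline Components

Note: some PySpark components are not quite the same as the Python components that share their names:

- DataFrame: the data format handled by the Spark ML API. It can be created from text files, RDDs, or Spark SQL. It is close in concept to the pandas DataFrame in Python, but its methods are different.
- Transformer: its .transform method converts one DataFrame into another DataFrame.
- Estimator: its .fit method takes a DataFrame and produces a Transformer.
- Pipeline: chains multiple Transformers and Estimators into an ML workflow.
- Parameter: the parameter API shared by all of the Transformers and Estimators above.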
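The relationship between Estimator, Transformer, and Pipeline can be sketched in a few lines of plain Python. The classes below (`MeanCenterer`, `MeanCentererModel`, `ToyPipeline`, `ToyPipelineModel`) are hypothetical toys written for this illustration and are not part of PySpark. Rows are plain dictionaries rather than a DataFrame.

```python
class MeanCentererModel:
    """Transformer: holds fitted state and maps rows to new rows."""

    def __init__(self, column, mean):
        self.column = column
        self.mean = mean

    def transform(self, rows):
        out_col = self.column + "_centered"
        return [{**row, out_col: row[self.column] - self.mean} for row in rows]


class MeanCenterer:
    """Estimator: fit() learns from the data and returns a Transformer."""

    def __init__(self, column):
        self.column = column

    def fit(self, rows):
        values = [row[self.column] for row in rows]
        return MeanCentererModel(self.column, sum(values) / len(values))


class ToyPipelineModel:
    """Fitted pipeline: applies every fitted stage in order."""

    def __init__(self, models):
        self.models = models

    def transform(self, rows):
        for model in self.models:
            rows = model.transform(rows)
        return rows


class ToyPipeline:
    """Chains stages; estimators are fitted, transformers are used as-is."""

    def __init__(self, stages):
        self.stages = stages

    def fit(self, rows):
        models = []
        for stage in self.stages:
            model = stage.fit(rows) if hasattr(stage, "fit") else stage
            rows = model.transform(rows)  # later stages see earlier outputs
            models.append(model)
        return ToyPipelineModel(models)


rows = [{"hp": 45.0}, {"hp": 60.0}, {"hp": 75.0}]
pipeline_model = ToyPipeline([MeanCenterer("hp")]).fit(rows)
result = pipeline_model.transform(rows)
print(result)
```

The point of the split is that everything learned from the data (here, the mean) lives in the fitted model, so the same model can be reused on a test set without refitting.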

2. Data Encoding

2.1 The Data Encoding Modules in Spark ML

2.2 Read Pokemon.csv and Understand Its Fields

import pandas as pd
from pyspark.sql import SparkSession
from pyspark import SparkFiles

# Create the Spark application
spark = SparkSession.builder.appName('SparkTest').getOrCreate()

# Register the CSV file with Spark (the file URL is elided in the source)
spark.sparkContext.addFile('...')
pokemon = spark.read.csv(path=SparkFiles.get('Pokemon.csv'), header=True, inferSchema=True)

# Rename the columns whose names contain a dot
pokemon = pokemon.withColumnRenamed('Sp. Atk', 'Sp Atk')
pokemon = pokemon.withColumnRenamed('Sp. Def', 'Sp Def')
pokemon.show(5)

After creating the Spark session and renaming the columns, the pokemon table looks like this:

+--------------------+------+------+-----+---+------+-------+------+------+-----+----------+---------+
|                Name|Type 1|Type 2|Total| HP|Attack|Defense|Sp Atk|Sp Def|Speed|Generation|Legendary|
+--------------------+------+------+-----+---+------+-------+------+------+-----+----------+---------+
|           Bulbasaur| Grass|Poison|  318| 45|    49|     49|    65|    65|   45|         1|    false|
|             Ivysaur| Grass|Poison|  405| 60|    62|     63|    80|    80|   60|         1|    false|
|            Venusaur| Grass|Poison|  525| 80|    82|     83|   100|   100|   80|         1|    false|
|VenusaurMega Venu...| Grass|Poison|  625| 80|   100|    123|   122|   120|   80|         1|    false|
|          Charmander|  Fire|  null|  309| 39|    52|     43|    60|    50|   65|         1|    false|
+--------------------+------+------+-----+---+------+-------+------+------+-----+----------+---------+
only showing top 5 rows

Column names and types:

pokemon.dtypes
"""
[('Name', 'string'), ('Type 1', 'string'), ('Type 2', 'string'),
 ('Total', 'int'), ('HP', 'int'), ('Attack', 'int'), ('Defense', 'int'),
 ('Sp Atk', 'int'), ('Sp Def', 'int'), ('Speed', 'int'),
 ('Generation', 'int'), ('Legendary', 'boolean')]
"""

# encoding=utf-8
from pyspark.sql import SparkSession
from pyspark import SparkFiles
from pyspark.ml import Pipeline
from pyspark.ml.feature import StringIndexer
from pyspark.ml.feature import OneHotEncoder
from pyspark.ml.feature import MinMaxScaler
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.feature import PCA

# Task 5: Spark ML basics: data encoding

# Step 0: connect to the Spark cluster
spark = SparkSession.builder.appName('pyspark').getOrCreate()

# Step 1: learn the data encoding modules in Spark ML
# Step 2: read Pokemon.csv and understand its fields
# Step 2.1: read the file (assumes Pokemon.csv was already registered with addFile)
path = "file://" + SparkFiles.get("Pokemon.csv")

# Step 2.2: load the data and keep the header row
df = spark.read.csv(path=path, header=True, inferSchema=True)

# Remove dots and spaces from the column names
df = df.withColumnRenamed('Sp. Atk', 'SpAtk')
df = df.withColumnRenamed('Sp. Def', 'SpDef')
df = df.withColumnRenamed('Type 1', 'Type1')
df = df.withColumnRenamed('Type 2', 'Type2')
df.show(n=3)

# Categorical attributes: Type1, Type2, Generation
# Numeric attributes: Total, HP, Attack, Defense, SpAtk, SpDef, Speed

2.3 Apply OneHotEncoder to the Categorical Attributes

Apply one-hot encoding to the categorical attributes. First, the parameters of OneHotEncoder:

class pyspark.ml.feature.OneHotEncoder(*, inputCols=None, outputCols=None, handleInvalid='error', dropLast=True, inputCol=None, outputCol=None)

# Step 3: one-hot encode the categorical attributes
# Step 3.1: convert the string columns to index columns
# handleInvalid='skip' drops rows whose Type1 or Type2 is null
indexer = StringIndexer(
    inputCols=["Type1", "Type2"],
    outputCols=["Type1_idx", "Type2_idx"],
    handleInvalid='skip')
df = indexer.fit(df).transform(df)
df.show(n=3)

# Step 3.2: convert the index columns to one-hot vectors
one_hot_encoder = OneHotEncoder(
    inputCols=['Type1_idx', 'Type2_idx', 'Generation'],
    outputCols=["Type1_vec", "Type2_vec", "Generation_vec"])
df = one_hot_encoder.fit(df).transform(df)
df.show(n=3)

The results after converting the string columns to indices, and then the index columns to one-hot vectors:

+---------+-----+------+-----+---+------+-------+-----+-----+-----+----------+---------+---------+---------+
|     Name|Type1| Type2|Total| HP|Attack|Defense|SpAtk|SpDef|Speed|Generation|Legendary|Type1_idx|Type2_idx|
+---------+-----+------+-----+---+------+-------+-----+-----+-----+----------+---------+---------+---------+
|Bulbasaur|Grass|Poison|  318| 45|    49|     49|   65|   65|   45|         1|    false|      2.0|      2.0|
|  Ivysaur|Grass|Poison|  405| 60|    62|     63|   80|   80|   60|         1|    false|      2.0|      2.0|
| Venusaur|Grass|Poison|  525| 80|    82|     83|  100|  100|   80|         1|    false|      2.0|      2.0|
+---------+-----+------+-----+---+------+-------+-----+-----+-----+----------+---------+---------+---------+
only showing top 3 rows

+---------+-----+------+-----+---+------+-------+-----+-----+-----+----------+---------+---------+---------+--------------+--------------+--------------+
|     Name|Type1| Type2|Total| HP|Attack|Defense|SpAtk|SpDef|Speed|Generation|Legendary|Type1_idx|Type2_idx|     Type1_vec|     Type2_vec|Generation_vec|
+---------+-----+------+-----+---+------+-------+-----+-----+-----+----------+---------+---------+---------+--------------+--------------+--------------+
|Bulbasaur|Grass|Poison|  318| 45|    49|     49|   65|   65|   45|         1|    false|      2.0|      2.0|(17,[2],[1.0])|(17,[2],[1.0])| (6,[1],[1.0])|
|  Ivysaur|Grass|Poison|  405| 60|    62|     63|   80|   80|   60|         1|    false|      2.0|      2.0|(17,[2],[1.0])|(17,[2],[1.0])| (6,[1],[1.0])|
| Venusaur|Grass|Poison|  525| 80|    82|     83|  100|  100|   80|         1|    false|      2.0|      2.0|(17,[2],[1.0])|(17,[2],[1.0])| (6,[1],[1.0])|
+---------+-----+------+-----+---+------+-------+-----+-----+-----+----------+---------+---------+---------+--------------+--------------+--------------+
only showing top 3 rows
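The sparse vectors in the output, such as (17,[2],[1.0]), read as (vector size, non-zero positions, non-zero values). The plain-Python sketch below shows how such a triple can arise from a category index when dropLast=True removes the last category. It is an illustration, not Spark's code: the helper `one_hot` is hypothetical, and the category counts (18 for the type columns, 7 for Generation, where index 0 is unused) are assumptions inferred from the vector sizes 17 and 6 shown above.

```python
def one_hot(index, num_categories, drop_last=True):
    """Return (size, indices, values), like a printed sparse vector."""
    idx = int(index)
    if not 0 <= idx < num_categories:
        raise ValueError("index out of range")
    size = num_categories - 1 if drop_last else num_categories
    if idx < size:
        return (size, [idx], [1.0])
    # The dropped last category is encoded as an all-zero vector.
    return (size, [], [])


print(one_hot(2.0, 18))                # (17, [2], [1.0])
print(one_hot(1, 7))                   # (6, [1], [1.0])
print(one_hot(17.0, 18))               # (17, [], [])
print(one_hot(1, 7, drop_last=False))  # (7, [1], [1.0])
```

Dropping the last category keeps the encoded columns linearly independent, which matters for linear models; for tree models or when every category needs its own slot, dropLast=False is the usual choice.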

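2.4 Apply MinMaxScaler to the Numeric Attributes

Numeric attributes are usually normalized. With the common min-max normalization, the formula is: Rescaled(e_i) = (e_i - E_min) / (E_max - E_min) * (max - min) + min, where E_min and E_max are the minimum and maximum of the column and [min, max] is the target range. The parameters of MinMaxScaler are:

class pyspark.ml.feature.MinMaxScaler(*, min=0.0, max=1.0, inputCol=None, outputCol=None)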

# Step 4: apply MinMaxScaler to the numeric attributes
columns_to_scale = ["Total", "HP", "Attack", "Defense", "SpAtk", "SpDef", "Speed"]
assemblers, scalers = list(), list()
for col in columns_to_scale:
    # MinMaxScaler expects a vector column, so wrap each numeric column first
    vec = VectorAssembler(inputCols=[col], outputCol=col + "_vec")
    assemblers.append(vec)
    sc = MinMaxScaler(inputCol=col + "_vec", outputCol=col + "_scaled")
    scalers.append(sc)

# Run all assemblers, then all scalers, as one pipeline
pipeline = Pipeline(stages=assemblers + scalers)
df = pipeline.fit(df).transform(df)
df.show(n=3)

The result is:

+---------+-----+------+-----+---+------+-------+-----+-----+-----+----------+---------+---------+---------+--------------+--------------+--------------+---------+------+----------+-----------+---------+---------+---------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+
|     Name|Type1| Type2|Total| HP|Attack|Defense|SpAtk|SpDef|Speed|Generation|Legendary|Type1_idx|Type2_idx|     Type1_vec|     Type2_vec|Generation_vec|Total_vec|HP_vec|Attack_vec|Defense_vec|SpAtk_vec|SpDef_vec|Speed_vec|        Total_scaled|           HP_scaled|       Attack_scaled|      Defense_scaled|        SpAtk_scaled|        SpDef_scaled|        Speed_scaled|
+---------+-----+------+-----+---+------+-------+-----+-----+-----+----------+---------+---------+---------+--------------+--------------+--------------+---------+------+----------+-----------+---------+---------+---------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+
|Bulbasaur|Grass|Poison|  318| 45|    49|     49|   65|   65|   45|         1|    false|      2.0|      2.0|(17,[2],[1.0])|(17,[2],[1.0])| (6,[1],[1.0])|  [318.0]|[45.0]|    [49.0]|     [49.0]|   [65.0]|   [65.0]|   [45.0]|[0.21694915254237...|[0.2953020134228188]|[0.21666666666666...|[0.15813953488372...|[0.3235294117647059]|[0.2142857142857143]|[0.25806451612903...|
|  Ivysaur|Grass|Poison|  405| 60|    62|     63|   80|   80|   60|         1|    false|      2.0|      2.0|(17,[2],[1.0])|(17,[2],[1.0])| (6,[1],[1.0])|  [405.0]|[60.0]|    [62.0]|     [63.0]|   [80.0]|   [80.0]|   [60.0]|[0.3644067796610169]|[0.3959731543624161]|[0.2888888888888889]|[0.22325581395348...|[0.4117647058823529]|[0.28571428571428...|[0.3548387096774194]|
| Venusaur|Grass|Poison|  525| 80|    82|     83|  100|  100|   80|         1|    false|      2.0|      2.0|(17,[2],[1.0])|(17,[2],[1.0])| (6,[1],[1.0])|  [525.0]|[80.0]|    [82.0]|     [83.0]|  [100.0]|  [100.0]|   [80.0]|[0.5677966101694915]|[0.5302013422818792]|               [0.4]|[0.31627906976744...|[0.5294117647058824]| [0.380952380952381]|[0.4838709677419355]|
+---------+-----+------+-----+---+------+-------+-----+-----+-----+----------+---------+---------+---------+--------------+--------------+--------------+---------+------+----------+-----------+---------+---------+---------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+
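As a quick check of the min-max formula, the plain-Python sketch below recomputes the three HP_scaled values shown in the table. The bounds HP_MIN = 1 and HP_MAX = 150 are an assumption: they were inferred from the three displayed rows, not read from the dataset. The helper `min_max_scale` is written for this illustration; its handling of a constant column follows the rule given in the Spark documentation, 0.5 * (max + min).

```python
def min_max_scale(value, e_min, e_max, out_min=0.0, out_max=1.0):
    """Rescale value from [e_min, e_max] to [out_min, out_max]."""
    if e_max == e_min:
        # Constant column: fall back to the midpoint of the target range.
        return 0.5 * (out_max + out_min)
    return (value - e_min) / (e_max - e_min) * (out_max - out_min) + out_min


# Assumed bounds, inferred from the rows shown above.
HP_MIN, HP_MAX = 1, 150

scaled = {
    name: min_max_scale(hp, HP_MIN, HP_MAX)
    for name, hp in [("Bulbasaur", 45), ("Ivysaur", 60), ("Venusaur", 80)]
}
for name, value in scaled.items():
    print(name, round(value, 6))
```

With these bounds the results agree with the HP_scaled column to the displayed precision (about 0.295302, 0.395973, and 0.530201).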

2.5 Reduce the Encoded Attributes with PCA (Choose the Dimension Yourself)

For PCA dimensionality reduction, I chose k = 5 here.

# Step 5: reduce the encoded attributes with PCA (choose the dimension yourself)
# encoded features: Type1_vec, Type2_vec, Generation_vec, Total_scaled, HP_scaled,
# Attack_scaled, Defense_scaled, SpAtk_scaled, SpDef_scaled, Speed_scaled
cols = ["Type1_vec", "Type2_vec", "Generation_vec", "Total_scaled", "HP_scaled",
        "Attack_scaled", "Defense_scaled", "SpAtk_scaled", "SpDef_scaled", "Speed_scaled"]

# Merge all encoded columns into a single feature vector
assembler = VectorAssembler(inputCols=cols, outputCol="features")
df = assembler.transform(df)
df.select("features").show(n=3)

# Project the feature vector onto the top 5 principal components
pca = PCA(k=5, inputCol="features", outputCol="pca")
df = pca.fit(df).transform(df)
df.show(n=3)

# Print the PCA vector of the first row
rows = df.select("pca").collect()
print(rows[0].asDict())

spark.stop()

The result is:

+--------------------+
|            features|
+--------------------+
|(47,[2,19,35,40,4...|
|(47,[2,19,35,40,4...|
|(47,[2,19,35,40,4...|
+--------------------+
only showing top 3 rows

+---------+-----+------+-----+---+------+-------+-----+-----+-----+----------+---------+---------+---------+--------------+--------------+--------------+---------+------+----------+-----------+---------+---------+---------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+
|     Name|Type1| Type2|Total| HP|Attack|Defense|SpAtk|SpDef|Speed|Generation|Legendary|Type1_idx|Type2_idx|     Type1_vec|     Type2_vec|Generation_vec|Total_vec|HP_vec|Attack_vec|Defense_vec|SpAtk_vec|SpDef_vec|Speed_vec|        Total_scaled|           HP_scaled|       Attack_scaled|      Defense_scaled|        SpAtk_scaled|        SpDef_scaled|        Speed_scaled|            features|                 pca|
+---------+-----+------+-----+---+------+-------+-----+-----+-----+----------+---------+---------+---------+--------------+--------------+--------------+---------+------+----------+-----------+---------+---------+---------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+
|Bulbasaur|Grass|Poison|  318| 45|    49|     49|   65|   65|   45|         1|    false|      2.0|      2.0|(17,[2],[1.0])|(17,[2],[1.0])| (6,[1],[1.0])|  [318.0]|[45.0]|    [49.0]|     [49.0]|   [65.0]|   [65.0]|   [45.0]|[0.21694915254237...|[0.2953020134228188]|[0.21666666666666...|[0.15813953488372...|[0.3235294117647059]|[0.2142857142857143]|[0.25806451612903...|(47,[2,19,35,40,4...|[0.34275937676253...|
|  Ivysaur|Grass|Poison|  405| 60|    62|     63|   80|   80|   60|         1|    false|      2.0|      2.0|(17,[2],[1.0])|(17,[2],[1.0])| (6,[1],[1.0])|  [405.0]|[60.0]|    [62.0]|     [63.0]|   [80.0]|   [80.0]|   [60.0]|[0.3644067796610169]|[0.3959731543624161]|[0.2888888888888889]|[0.22325581395348...|[0.4117647058823529]|[0.28571428571428...|[0.3548387096774194]|(47,[2,19,35,40,4...|[0.32329833337804...|
| Venusaur|Grass|Poison|  525| 80|    82|     83|  100|  100|   80|         1|    false|      2.0|      2.0|(17,[2],[1.0])|(17,[2],[1.0])| (6,[1],[1.0])|  [525.0]|[80.0]|    [82.0]|     [83.0]|  [100.0]|  [100.0]|   [80.0]|[0.5677966101694915]|[0.5302013422818792]|               [0.4]|[0.31627906976744...|[0.5294117647058824]| [0.380952380952381]|[0.4838709677419355]|(47,[2,19,35,40,4...|[0.29572580767124...|
+---------+-----+------+-----+---+------+-------+-----+-----+-----+----------+---------+---------+---------+--------------+--------------+--------------+---------+------+----------+-----------+---------+---------+---------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+
only showing top 3 rows

{'pca': DenseVector([0.3428, -0.8743, -0.6616, 0.0442, 0.7151])}
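To give some intuition for what the PCA step computes, here is a small plain-Python sketch that finds only the first principal component (k = 1) of a toy 2-D dataset. It uses power iteration on the covariance matrix, a technique swapped in for brevity; it is not how Spark computes PCA. The data and the function names `covariance_matrix` and `top_component` are invented for this example, and the sketch assumes the starting vector is not orthogonal to the dominant direction.

```python
import math


def covariance_matrix(rows):
    """Sample covariance matrix of a list of equal-length rows."""
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    centered = [[r[j] - means[j] for j in range(d)] for r in rows]
    return [[sum(c[i] * c[j] for c in centered) / (n - 1) for j in range(d)]
            for i in range(d)]


def top_component(cov, iterations=100):
    """Approximate the dominant eigenvector of cov by power iteration."""
    d = len(cov)
    v = [1.0 / math.sqrt(d)] * d
    for _ in range(iterations):
        w = [sum(cov[i][j] * v[j] for j in range(d)) for i in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    return v


# Toy data lying exactly on the line y = 2x.
rows = [[0.0, 0.0], [1.0, 2.0], [2.0, 4.0], [3.0, 6.0]]
pc1 = top_component(covariance_matrix(rows))

# Project each row onto the first principal component.
projected = [sum(x * p for x, p in zip(row, pc1)) for row in rows]
print(pc1)
print(projected)
```

Because all points lie on one line, a single component captures all of the variance, and its direction is proportional to (1, 2). With k = 5, as in the article, the same idea is repeated for the five directions of largest variance.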

3. Multi-hot Encoding

See the multiHotEncoderExample function in the code of the next section.
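The idea behind multi-hot encoding is that one sample can belong to several categories at once, so its vector has a 1 at every matching position. Here is a minimal plain-Python sketch using genre strings in the `A|B|C` format of the movies data. The helpers `build_genre_index` and `multi_hot` are hypothetical, and indices are assigned in order of first appearance, which is a simplification compared with the frequency-based ordering of StringIndexer.

```python
def build_genre_index(genre_strings):
    """Assign an index to every distinct genre, in order of first appearance."""
    index = {}
    for genres in genre_strings:
        for genre in genres.split("|"):
            index.setdefault(genre, len(index))
    return index


def multi_hot(genres, index):
    """Return a dense 0/1 vector with a 1 at the position of each genre."""
    vector = [0.0] * len(index)
    for genre in genres.split("|"):
        vector[index[genre]] = 1.0
    return vector


# movieId -> genres, using three short examples
movies = {3: "Comedy|Romance", 4: "Comedy|Drama|Romance", 5: "Comedy"}
genre_index = build_genre_index(movies.values())
encoded = {movie_id: multi_hot(g, genre_index) for movie_id, g in movies.items()}

print(genre_index)  # {'Comedy': 0, 'Romance': 1, 'Drama': 2}
print(encoded)      # {3: [1.0, 1.0, 0.0], 4: [1.0, 1.0, 1.0], 5: [1.0, 0.0, 0.0]}
```

For a large genre vocabulary a sparse representation (size, positions, values) is more economical than the dense lists used here.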

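4. Bucketing and Normalizing Numeric Features

Bucketing and normalization address two problems of numeric features: the scale of a feature and the distribution of a feature.

- Feature scale: in movie recommendation, for example, the number of ratings a movie has received can be very large, while its average rating is a small number. The former can then drown out the effect of the latter, so we usually bring the two features into the same range. This is normalization, available in PySpark through the MinMaxScaler interface.
- Feature distribution: in the same movie recommendation example, most people tend to give middling-to-high scores, so many ratings are above 3.5. Such data is hard for a model to learn from because the feature does not discriminate well. The usual remedy is bucketing: sort the samples by the feature value, find the quantiles according to the number of buckets, assign each sample to its bucket, and use the bucket ID directly as the feature value.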
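The bucketing procedure described above can be sketched in plain Python as follows. This is an exact-quantile toy, whereas Spark's QuantileDiscretizer works with approximate quantiles, so treat it as an illustration of the idea only. The helpers `quantile_splits` and `bucketize` are hypothetical; the ten rating counts are those that appear in the sample output of this article, and 5 buckets are used instead of 100 to keep the example readable.

```python
from bisect import bisect_right


def quantile_splits(values, num_buckets):
    """Return num_buckets - 1 split points taken from the sorted values."""
    ordered = sorted(values)
    n = len(ordered)
    return [ordered[(n * k) // num_buckets] for k in range(1, num_buckets)]


def bucketize(value, splits):
    """Bucket i covers the half-open interval [splits[i-1], splits[i])."""
    return bisect_right(splits, value)


rating_counts = [14616, 174, 402, 254, 6, 788, 1609, 259, 20, 159]
splits = quantile_splits(rating_counts, 5)
buckets = [bucketize(v, splits) for v in rating_counts]

print(splits)   # [159, 254, 402, 1609]
print(buckets)  # [4, 1, 3, 2, 0, 3, 4, 2, 0, 1]
```

Although the raw counts span from 6 to 14616, every bucket receives two samples, so the bucket ID spreads the samples evenly and is far less dominated by the few very popular movies.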

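Spark MLlib provides MinMaxScaler for normalization and QuantileDiscretizer for bucketing. They are used in the same way as the OneHotEncoderEstimator introduced earlier (renamed to OneHotEncoder in Spark 3.0): first call fit to learn the required statistics from the data, then call transform to convert the features.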

from pyspark import SparkConf
from pyspark.ml import Pipeline
from pyspark.ml.feature import OneHotEncoder, StringIndexer, QuantileDiscretizer, MinMaxScaler
from pyspark.ml.linalg import VectorUDT, Vectors
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql.types import ArrayType, IntegerType, StringType


def oneHotEncoderExample(movieSamples):
    # movieId is read as a string, so cast it to an integer index first
    samplesWithIdNumber = movieSamples.withColumn("movieIdNumber", F.col("movieId").cast(IntegerType()))
    encoder = OneHotEncoder(inputCols=["movieIdNumber"], outputCols=['movieIdVector'], dropLast=False)
    oneHotEncoderSamples = encoder.fit(samplesWithIdNumber).transform(samplesWithIdNumber)
    oneHotEncoderSamples.printSchema()
    oneHotEncoderSamples.show(10)


# Used by the udf below: turn a list of genre indices into a sparse vector
def array2vec(genreIndexes, indexSize):
    genreIndexes.sort()
    fill_list = [1.0 for _ in range(len(genreIndexes))]
    return Vectors.sparse(indexSize, genreIndexes, fill_list)


# Multi-hot encoding
def multiHotEncoderExample(movieSamples):
    # One row per (movie, genre) pair
    samplesWithGenre = movieSamples.select(
        "movieId", "title",
        F.explode(F.split(F.col("genres"), "\\|").cast(ArrayType(StringType()))).alias('genre'))
    genreIndexer = StringIndexer(inputCol="genre", outputCol="genreIndex")
    stringIndexerModel = genreIndexer.fit(samplesWithGenre)
    genreIndexSamples = stringIndexerModel.transform(samplesWithGenre) \
        .withColumn("genreIndexInt", F.col("genreIndex").cast(IntegerType()))
    # Vector size = largest genre index + 1
    indexSize = genreIndexSamples.agg(F.max(F.col("genreIndexInt"))).head()[0] + 1
    # Collect the genre indices of each movie back into one row
    processedSamples = genreIndexSamples.groupBy('movieId') \
        .agg(F.collect_list('genreIndexInt').alias('genreIndexes')) \
        .withColumn("indexSize", F.lit(indexSize))
    finalSample = processedSamples.withColumn(
        "vector", F.udf(array2vec, VectorUDT())(F.col("genreIndexes"), F.col("indexSize")))
    finalSample.printSchema()
    finalSample.show(10)


# Numeric feature processing
def ratingFeatures(ratingSamples):
    ratingSamples.printSchema()
    ratingSamples.show()
    # calculate average movie rating score and rating count
    movieFeatures = ratingSamples.groupBy('movieId') \
        .agg(F.count(F.lit(1)).alias('ratingCount'),
             F.avg("rating").alias("avgRating"),
             F.variance('rating').alias('ratingVar')) \
        .withColumn('avgRatingVec', F.udf(lambda x: Vectors.dense(x), VectorUDT())('avgRating'))
    movieFeatures.show(10)
    # Bucketing: split the rating count into 100 buckets
    ratingCountDiscretizer = QuantileDiscretizer(numBuckets=100, inputCol="ratingCount",
                                                 outputCol="ratingCountBucket")
    # Normalization: rescale the average rating
    ratingScaler = MinMaxScaler(inputCol="avgRatingVec", outputCol="scaleAvgRating")
    # Create a pipeline that runs the two feature processing steps in order
    pipelineStage = [ratingCountDiscretizer, ratingScaler]
    featurePipeline = Pipeline(stages=pipelineStage)
    movieProcessedFeatures = featurePipeline.fit(movieFeatures).transform(movieFeatures)
    movieProcessedFeatures.show(10)


if __name__ == '__main__':
    conf = SparkConf().setAppName('featureEngineering').setMaster('local')
    spark = SparkSession.builder.config(conf=conf).getOrCreate()
    file_path = 'file:///home/hadoop/development/RecSys'
    movieResourcesPath = file_path + "/data/movies.csv"
    movieSamples = spark.read.format('csv').option('header', 'true').load(movieResourcesPath)
    print("Raw Movie Samples:")
    movieSamples.show(10)
    movieSamples.printSchema()
    # The two examples below are disabled in this run; uncomment to try them
    # print("OneHotEncoder Example:")
    # oneHotEncoderExample(movieSamples)
    # print("MultiHotEncoder Example:")
    # multiHotEncoderExample(movieSamples)
    print("Numerical features Example:")
    ratingsResourcesPath = file_path + "/data/ratings.csv"
    ratingSamples = spark.read.format('csv').option('header', 'true').load(ratingsResourcesPath)
    ratingFeatures(ratingSamples)

The result is:

Raw Movie Samples:
+-------+--------------------+--------------------+
|movieId|               title|              genres|
+-------+--------------------+--------------------+
|      1|    Toy Story (1995)|Adventure|Animati...|
|      2|      Jumanji (1995)|Adventure|Childre...|
|      3|Grumpier Old Men ...|      Comedy|Romance|
|      4|Waiting to Exhale...|Comedy|Drama|Romance|
|      5|Father of the Bri...|              Comedy|
|      6|         Heat (1995)|Action|Crime|Thri...|
|      7|      Sabrina (1995)|      Comedy|Romance|
|      8| Tom and Huck (1995)|  Adventure|Children|
|      9| Sudden Death (1995)|              Action|
|     10|    GoldenEye (1995)|Action|Adventure|...|
+-------+--------------------+--------------------+
only showing top 10 rows

root
 |-- movieId: string (nullable = true)
 |-- title: string (nullable = true)
 |-- genres: string (nullable = true)

Numerical features Example:
root
 |-- userId: string (nullable = true)
 |-- movieId: string (nullable = true)
 |-- rating: string (nullable = true)
 |-- timestamp: string (nullable = true)

+------+-------+------+----------+
|userId|movieId|rating| timestamp|
+------+-------+------+----------+
|     1|      2|   3.5|1112486027|
|     1|     29|   3.5|1112484676|
|     1|     32|   3.5|1112484819|
|     1|     47|   3.5|1112484727|
|     1|     50|   3.5|1112484580|
|     1|    112|   3.5|1094785740|
|     1|    151|   4.0|1094785734|
|     1|    223|   4.0|1112485573|
|     1|    253|   4.0|1112484940|
|     1|    260|   4.0|1112484826|
|     1|    293|   4.0|1112484703|
|     1|    296|   4.0|1112484767|
|     1|    318|   4.0|1112484798|
|     1|    337|   3.5|1094785709|
|     1|    367|   3.5|1112485980|
|     1|    541|   4.0|1112484603|
|     1|    589|   3.5|1112485557|
|     1|    593|   3.5|1112484661|
|     1|    653|   3.0|1094785691|
|     1|    919|   3.5|1094785621|
+------+-------+------+----------+
only showing top 20 rows

+-------+-----------+------------------+------------------+--------------------+
|movieId|ratingCount|         avgRating|         ratingVar|        avgRatingVec|
+-------+-----------+------------------+------------------+--------------------+
|    296|      14616| 4.165606185002737|0.9615737413069365| [4.165606185002737]|
|    467|        174|3.4367816091954024|1.5075410271742742|[3.4367816091954024]|
|    829|        402|2.6243781094527363|1.4982072182727264|[2.6243781094527363]|
|    691|        254|3.1161417322834644|1.0842838691606238|[3.1161417322834644]|
|    675|          6|2.3333333333333335|0.6666666666666667|[2.3333333333333335]|
|    125|        788| 3.713197969543147|0.8598255922703321| [3.713197969543147]|
|    800|       1609|4.0447482908638905|0.8325734596130598|[4.0447482908638905]|
|    944|        259|3.8262548262548264|0.8534165394630511|[3.8262548262548264]|
|    853|         20|               3.5| 1.526315789473684|               [3.5]|
|    451|        159|  3.00314465408805|0.7800533397022531|  [3.00314465408805]|
+-------+-----------+------------------+------------------+--------------------+
only showing top 10 rows

+-------+-----------+------------------+------------------+--------------------+-----------------+--------------------+
|movieId|ratingCount|         avgRating|         ratingVar|        avgRatingVec|ratingCountBucket|      scaleAvgRating|
+-------+-----------+------------------+------------------+--------------------+-----------------+--------------------+
|    296|      14616| 4.165606185002737|0.9615737413069365| [4.165606185002737]|             99.0|[0.9170998054196596]|
|    467|        174|3.4367816091954024|1.5075410271742742|[3.4367816091954024]|             38.0|[0.7059538707722662]|
|    829|        402|2.6243781094527363|1.4982072182727264|[2.6243781094527363]|             54.0|[0.4705944962973248]|
|    691|        254|3.1161417322834644|1.0842838691606238|[3.1161417322834644]|             45.0|[0.6130620985364005]|
|    675|          6|2.3333333333333335|0.6666666666666667|[2.3333333333333335]|              4.0|[0.38627664627161...|
|    125|        788| 3.713197969543147|0.8598255922703321| [3.713197969543147]|             67.0|[0.7860337592595664]|
|    800|       1609|4.0447482908638905|0.8325734596130598|[4.0447482908638905]|             79.0|[0.8820863689021069]|
|    944|        259|3.8262548262548264|0.8534165394630511|[3.8262548262548264]|             46.0|[0.8187871768460151]|
|    853|         20|               3.5| 1.526315789473684|               [3.5]|             12.0|[0.7242687117592825]|
|    451|        159|  3.00314465408805|0.7800533397022531|  [3.00314465408805]|             37.0|[0.5803259992335382]|
+-------+-----------+------------------+------------------+--------------------+-----------------+--------------------+
only showing top 10 rows

Reference

https://spark.apache.org/
