order by, sort by, distribute by, and cluster by in Hive: explanation and tests


Conclusions:

order by: global sort. It is the only one of the four sorting clauses whose terminal output is visibly globally sorted. It runs with a single reducer, which can make the reduce task very slow; in strict mode a limit clause is required.

sort by: can run with multiple reducers; each reducer sorts its own portion of the data, ascending by default.

distribute by: controls how map output is divided among the reducers. It is usually combined with sort by to sort on a specific column.

cluster by: if the distribute by column and the sort by column are the same, and the default ascending order is acceptable, cluster by can replace the distribute by and sort by pair; however, this strips sort by of its parallelism, and it still cannot guarantee that the overall output is ordered (this is what the tests below show).

1. order by global sort test:

set mapred.max.split.size=200;
set mapred.reduce.tasks=3;
select empno,ename,sal from emp order by sal asc limit 20;

hive (chavin)> set mapred.max.split.size=200;
hive (chavin)> set mapred.reduce.tasks=3;
hive (chavin)> select empno,ename,sal from emp order by sal asc limit 20;
Total jobs = 1
Launching Job 1 out of 1
Number of reduce tasks determined at compile time: 1
In order to change the average load for a reducer (in bytes):
   set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
   set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
   set mapreduce.job.reduces=<number>
Starting Job = job_1501198522682_0059, Tracking URL =
Kill Command = /opt/cdh-5.3.6/hadoop-2.5.0-cdh5.3.6/bin/hadoop job  -kill job_1501198522682_0059
Hadoop job information for Stage-1: number of mappers: 3; number of reducers: 1
2017-07-30 02:21:58,594 Stage-1 map = 0%,  reduce = 0%
2017-07-30 02:22:59,734 Stage-1 map = 0%,  reduce = 0%
2017-07-30 02:24:00,084 Stage-1 map = 0%,  reduce = 0%
2017-07-30 02:25:00,859 Stage-1 map = 0%,  reduce = 0%
2017-07-30 02:25:28,846 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 123.57 sec
2017-07-30 02:26:03,306 Stage-1 map = 100%,  reduce = 100%, Cumulative CPU 128.97 sec
MapReduce Total cumulative CPU time: 2 minutes 8 seconds 970 msec
Ended Job = job_1501198522682_0059
MapReduce Jobs Launched:
Stage-Stage-1: Map: 3  Reduce: 1   Cumulative CPU: 128.97 sec   HDFS Read: 2119 HDFS Write: 299 SUCCESS
Total MapReduce CPU Time Spent: 2 minutes 8 seconds 970 msec
OK
empno    ename    sal
7369    SMITH    800.0
7900    JAMES    950.0
7876    ADAMS    1100.0
7654    MARTIN    1250.0
7521    WARD    1250.0
7934    MILLER    1300.0
7844    TURNER    1500.0
7499    ALLEN    1600.0
7782    CLARK    2450.0
7698    BLAKE    2850.0
7566    JONES    2975.0
7788    SCOTT    3000.0
7902    FORD    3000.0
7839    KING    5000.0
8888    king    300000.0
8888    ChavinKing    300000.0
Time taken: 442.499 seconds, Fetched: 16 row(s)

From the job log we can see that this query launched 3 map tasks and 1 reduce task, and the output is sorted ascending on sal and globally ordered. Note that although we set the number of reducers to 3 before the job started, only one reduce task was actually launched, which shows that order by forces a single reducer in order to produce the global ordering. On a large dataset that single reduce task becomes the job's performance bottleneck.
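As a side check of the strict-mode rule mentioned in the conclusions, the minimal sketch below (assuming the same emp table, and a Hive version of this era where strict mode is controlled by hive.mapred.mode) shows that order by without limit is rejected while the limit variant runs:

set hive.mapred.mode=strict;
-- rejected at compile time: in strict mode, ORDER BY requires a LIMIT clause
select empno,ename,sal from emp order by sal asc;
-- accepted
select empno,ename,sal from emp order by sal asc limit 20;
set hive.mapred.mode=nonstrict;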
2. sort by test:

==============================================

hive (chavin)> set mapred.max.split.size=200;
hive (chavin)> set mapred.reduce.tasks=3;
hive (chavin)> select empno,ename,sal from emp sort by sal asc;
Total jobs = 1
Launching Job 1 out of 1
Number of reduce tasks not specified. Defaulting to jobconf value of: 3
In order to change the average load for a reducer (in bytes):
   set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
   set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
   set mapreduce.job.reduces=<number>
Starting Job = job_1501198522682_0063, Tracking URL =
Kill Command = /opt/cdh-5.3.6/hadoop-2.5.0-cdh5.3.6/bin/hadoop job  -kill job_1501198522682_0063
Hadoop job information for Stage-1: number of mappers: 3; number of reducers: 3
2017-07-30 02:40:32,898 Stage-1 map = 0%,  reduce = 0%
2017-07-30 02:41:18,531 Stage-1 map = 33%,  reduce = 0%, Cumulative CPU 5.05 sec
2017-07-30 02:41:27,062 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 17.9 sec
2017-07-30 02:42:15,867 Stage-1 map = 100%,  reduce = 67%, Cumulative CPU 28.39 sec
2017-07-30 02:42:18,655 Stage-1 map = 100%,  reduce = 78%, Cumulative CPU 30.31 sec
2017-07-30 02:42:20,045 Stage-1 map = 100%,  reduce = 100%, Cumulative CPU 33.92 sec
MapReduce Total cumulative CPU time: 33 seconds 920 msec
Ended Job = job_1501198522682_0063
MapReduce Jobs Launched:
Stage-Stage-1: Map: 3  Reduce: 3   Cumulative CPU: 40.36 sec   HDFS Read: 2119 HDFS Write: 299 SUCCESS
Total MapReduce CPU Time Spent: 40 seconds 360 msec
OK
empno    ename    sal
7654    MARTIN    1250.0
7844    TURNER    1500.0
8888    king    300000.0
8888    ChavinKing    300000.0
7900    JAMES    950.0
7521    WARD    1250.0
7934    MILLER    1300.0
7499    ALLEN    1600.0
7782    CLARK    2450.0
7566    JONES    2975.0
7902    FORD    3000.0
7788    SCOTT    3000.0
7839    KING    5000.0
7369    SMITH    800.0
7876    ADAMS    1100.0
7698    BLAKE    2850.0
Time taken: 136.972 seconds, Fetched: 16 row(s)

The sort by sal asc query launched 3 map tasks and 3 reduce tasks, matching the settings made beforehand. The output is ordered only within each reducer's portion, which indirectly confirms that sort by performs its sorting inside each reducer.
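One way to make the per-reducer ordering visible in files rather than on the terminal is to write the result out to a directory. A sketch (the HDFS path /tmp/emp_sort_by is a hypothetical choice, assuming write access):

set mapred.reduce.tasks=3;
insert overwrite directory '/tmp/emp_sort_by'
select empno,ename,sal from emp sort by sal asc;
-- the directory typically ends up holding one file per reducer
-- (000000_0, 000001_0, 000002_0), each internally sorted on sal;
-- their concatenation is not globally sorted.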
============================================

set mapred.max.split.size=200;
set mapred.reduce.tasks=3;
select empno,ename,sal from emp sort by sal desc;

hive (chavin)> set mapred.max.split.size=200;
hive (chavin)> set mapred.reduce.tasks=3;
hive (chavin)> select empno,ename,sal from emp sort by sal desc;
Total jobs = 1
Launching Job 1 out of 1
Number of reduce tasks not specified. Defaulting to jobconf value of: 3
In order to change the average load for a reducer (in bytes):
   set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
   set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
   set mapreduce.job.reduces=<number>
Starting Job = job_1501198522682_0064, Tracking URL =
Kill Command = /opt/cdh-5.3.6/hadoop-2.5.0-cdh5.3.6/bin/hadoop job  -kill job_1501198522682_0064
Hadoop job information for Stage-1: number of mappers: 3; number of reducers: 3
2017-07-30 02:43:57,741 Stage-1 map = 0%,  reduce = 0%
2017-07-30 02:44:41,415 Stage-1 map = 33%,  reduce = 0%, Cumulative CPU 4.27 sec
2017-07-30 02:44:44,267 Stage-1 map = 67%,  reduce = 0%, Cumulative CPU 9.69 sec
2017-07-30 02:44:46,779 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 14.64 sec
2017-07-30 02:45:27,003 Stage-1 map = 100%,  reduce = 22%, Cumulative CPU 17.61 sec
2017-07-30 02:45:28,551 Stage-1 map = 100%,  reduce = 44%, Cumulative CPU 22.5 sec
2017-07-30 02:45:29,834 Stage-1 map = 100%,  reduce = 56%, Cumulative CPU 24.54 sec
2017-07-30 02:45:32,660 Stage-1 map = 100%,  reduce = 78%, Cumulative CPU 29.26 sec
2017-07-30 02:45:35,291 Stage-1 map = 100%,  reduce = 90%, Cumulative CPU 35.0 sec
2017-07-30 02:45:37,542 Stage-1 map = 100%,  reduce = 100%, Cumulative CPU 36.92 sec
MapReduce Total cumulative CPU time: 36 seconds 920 msec
Ended Job = job_1501198522682_0064
MapReduce Jobs Launched:
Stage-Stage-1: Map: 3  Reduce: 3   Cumulative CPU: 36.92 sec   HDFS Read: 2119 HDFS Write: 299 SUCCESS
Total MapReduce CPU Time Spent: 36 seconds 920 msec
OK
empno    ename    sal
8888    king    300000.0
8888    ChavinKing    300000.0
7844    TURNER    1500.0
7654    MARTIN    1250.0
7839    KING    5000.0
7788    SCOTT    3000.0
7902    FORD    3000.0
7566    JONES    2975.0
7782    CLARK    2450.0
7499    ALLEN    1600.0
7934    MILLER    1300.0
7521    WARD    1250.0
7900    JAMES    950.0
7698    BLAKE    2850.0
7876    ADAMS    1100.0
7369    SMITH    800.0
Time taken: 132.663 seconds, Fetched: 16 row(s)

The sort by sal desc query likewise launched 3 map tasks and 3 reduce tasks, matching the settings, and the output is again ordered only within each reducer's portion, confirming that sort by sorts inside each reducer.
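sort by also accepts multiple keys with mixed directions, still applied independently inside each reducer. A sketch (this assumes emp has a deptno column, as in the classic scott emp schema; the column is not shown in the tests above):

select empno,ename,deptno,sal
from emp
sort by deptno asc, sal desc;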
======================================================================

3. distribute by test:

set mapred.max.split.size=200;
set mapred.reduce.tasks=3;
select empno,ename,sal from emp distribute by sal sort by sal asc;

hive (chavin)> set mapred.max.split.size=200;
hive (chavin)> set mapred.reduce.tasks=3;
hive (chavin)> select empno,ename,sal from emp distribute by sal sort by sal asc;
Total jobs = 1
Launching Job 1 out of 1
Number of reduce tasks not specified. Defaulting to jobconf value of: 3
In order to change the average load for a reducer (in bytes):
   set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
   set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
   set mapreduce.job.reduces=<number>
Starting Job = job_1501198522682_0067, Tracking URL =
Kill Command = /opt/cdh-5.3.6/hadoop-2.5.0-cdh5.3.6/bin/hadoop job  -kill job_1501198522682_0067
Hadoop job information for Stage-1: number of mappers: 3; number of reducers: 3
2017-07-30 02:52:14,833 Stage-1 map = 0%,  reduce = 0%
2017-07-30 02:53:04,015 Stage-1 map = 33%,  reduce = 0%, Cumulative CPU 6.0 sec
2017-07-30 02:53:05,393 Stage-1 map = 67%,  reduce = 0%, Cumulative CPU 11.35 sec
2017-07-30 02:53:06,657 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 11.72 sec
2017-07-30 02:53:46,542 Stage-1 map = 100%,  reduce = 22%, Cumulative CPU 20.83 sec
2017-07-30 02:53:47,981 Stage-1 map = 100%,  reduce = 44%, Cumulative CPU 23.77 sec
2017-07-30 02:53:50,909 Stage-1 map = 100%,  reduce = 70%, Cumulative CPU 27.81 sec
2017-07-30 02:53:54,581 Stage-1 map = 100%,  reduce = 100%, Cumulative CPU 37.24 sec
MapReduce Total cumulative CPU time: 37 seconds 240 msec
Ended Job = job_1501198522682_0067
MapReduce Jobs Launched:
Stage-Stage-1: Map: 3  Reduce: 3   Cumulative CPU: 39.56 sec   HDFS Read: 2119 HDFS Write: 299 SUCCESS
Total MapReduce CPU Time Spent: 39 seconds 560 msec
OK
empno    ename    sal
7876    ADAMS    1100.0
7654    MARTIN    1250.0
7521    WARD    1250.0
7782    CLARK    2450.0
7566    JONES    2975.0
8888    king    300000.0
8888    ChavinKing    300000.0
7934    MILLER    1300.0
7499    ALLEN    1600.0
7698    BLAKE    2850.0
7788    SCOTT    3000.0
7902    FORD    3000.0
7839    KING    5000.0
7369    SMITH    800.0
7900    JAMES    950.0
7844    TURNER    1500.0
Time taken: 129.0 seconds, Fetched: 16 row(s)

The distribute by sal sort by sal asc query launched 3 map tasks and 3 reduce tasks, matching the settings. The output is ordered within each reducer's portion, again indicating that sort by sorts inside each reducer. We also see that rows with the same sal value are emitted side by side, which shows that distribute by sal routes rows to reducers according to the value of sal.
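The more typical use of this pair is to partition on one column and sort on another, for example sending each department to one reducer and ordering by salary inside it. A sketch (again assuming a hypothetical deptno column on emp; note that distribute by must be written before sort by, or the query is rejected):

select deptno,empno,ename,sal
from emp
distribute by deptno
sort by deptno asc, sal desc;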
============================================================

set mapred.max.split.size=200;
set mapred.reduce.tasks=3;
select empno,ename,sal from emp distribute by sal sort by sal desc;

hive (chavin)> set mapred.max.split.size=200;
hive (chavin)> set mapred.reduce.tasks=3;
hive (chavin)> select empno,ename,sal from emp distribute by sal sort by sal desc;
Total jobs = 1
Launching Job 1 out of 1
Number of reduce tasks not specified. Defaulting to jobconf value of: 3
In order to change the average load for a reducer (in bytes):
   set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
   set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
   set mapreduce.job.reduces=<number>
Starting Job = job_1501198522682_0068, Tracking URL =
Kill Command = /opt/cdh-5.3.6/hadoop-2.5.0-cdh5.3.6/bin/hadoop job  -kill job_1501198522682_0068
Hadoop job information for Stage-1: number of mappers: 3; number of reducers: 3
2017-07-30 02:55:49,989 Stage-1 map = 0%,  reduce = 0%
2017-07-30 02:56:29,911 Stage-1 map = 67%,  reduce = 0%, Cumulative CPU 4.4 sec
2017-07-30 02:56:37,401 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 16.42 sec
2017-07-30 02:57:17,118 Stage-1 map = 100%,  reduce = 22%, Cumulative CPU 19.05 sec
2017-07-30 02:57:21,673 Stage-1 map = 100%,  reduce = 24%, Cumulative CPU 21.07 sec
2017-07-30 02:57:24,397 Stage-1 map = 100%,  reduce = 33%, Cumulative CPU 23.17 sec
2017-07-30 02:57:25,844 Stage-1 map = 100%,  reduce = 56%, Cumulative CPU 26.7 sec
2017-07-30 02:57:27,134 Stage-1 map = 100%,  reduce = 78%, Cumulative CPU 30.89 sec
2017-07-30 02:57:28,508 Stage-1 map = 100%,  reduce = 81%, Cumulative CPU 32.52 sec
2017-07-30 02:57:29,728 Stage-1 map = 100%,  reduce = 83%, Cumulative CPU 34.62 sec
2017-07-30 02:57:32,582 Stage-1 map = 100%,  reduce = 100%, Cumulative CPU 39.17 sec
MapReduce Total cumulative CPU time: 39 seconds 170 msec
Ended Job = job_1501198522682_0068
MapReduce Jobs Launched:
Stage-Stage-1: Map: 3  Reduce: 3   Cumulative CPU: 39.17 sec   HDFS Read: 2119 HDFS Write: 299 SUCCESS
Total MapReduce CPU Time Spent: 39 seconds 170 msec
OK
empno    ename    sal
8888    king    300000.0
8888    ChavinKing    300000.0
7566    JONES    2975.0
7782    CLARK    2450.0
7654    MARTIN    1250.0
7521    WARD    1250.0
7876    ADAMS    1100.0
7839    KING    5000.0
7788    SCOTT    3000.0
7902    FORD    3000.0
7698    BLAKE    2850.0
7499    ALLEN    1600.0
7934    MILLER    1300.0
7844    TURNER    1500.0
7900    JAMES    950.0
7369    SMITH    800.0
Time taken: 135.076 seconds, Fetched: 16 row(s)

The distribute by sal sort by sal desc variant likewise launched 3 map tasks and 3 reduce tasks, matching the settings, and each reducer's portion is sorted descending. Rows sharing a sal value again appear side by side: distribute by sal sends rows to reducers according to the value of sal.
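distribute by can also stand alone. A sketch: without the accompanying sort by it only controls which reducer each row goes to, and no ordering within a reducer's output is guaranteed:

set mapred.reduce.tasks=3;
select empno,ename,sal from emp distribute by sal;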
=========================================================

4. cluster by test:

set mapred.max.split.size=200;
set mapred.reduce.tasks=3;
select empno,ename,sal from emp cluster by sal;

hive (chavin)> set mapred.max.split.size=200;
hive (chavin)> set mapred.reduce.tasks=3;
hive (chavin)> select empno,ename,sal from emp cluster by sal;
Total jobs = 1
Launching Job 1 out of 1
Number of reduce tasks not specified. Defaulting to jobconf value of: 3
In order to change the average load for a reducer (in bytes):
   set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
   set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
   set mapreduce.job.reduces=<number>
Starting Job = job_1501198522682_0069, Tracking URL =
Kill Command = /opt/cdh-5.3.6/hadoop-2.5.0-cdh5.3.6/bin/hadoop job  -kill job_1501198522682_0069
Hadoop job information for Stage-1: number of mappers: 3; number of reducers: 3
2017-07-30 02:59:19,969 Stage-1 map = 0%,  reduce = 0%
2017-07-30 03:00:00,105 Stage-1 map = 33%,  reduce = 0%, Cumulative CPU 5.02 sec
2017-07-30 03:00:02,559 Stage-1 map = 67%,  reduce = 0%, Cumulative CPU 9.55 sec
2017-07-30 03:00:03,736 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 14.01 sec
2017-07-30 03:00:40,134 Stage-1 map = 100%,  reduce = 22%, Cumulative CPU 18.03 sec
2017-07-30 03:00:41,421 Stage-1 map = 100%,  reduce = 44%, Cumulative CPU 21.22 sec
2017-07-30 03:00:44,508 Stage-1 map = 100%,  reduce = 81%, Cumulative CPU 27.56 sec
2017-07-30 03:00:45,937 Stage-1 map = 100%,  reduce = 92%, Cumulative CPU 31.57 sec
2017-07-30 03:00:47,208 Stage-1 map = 100%,  reduce = 100%, Cumulative CPU 33.51 sec
MapReduce Total cumulative CPU time: 33 seconds 510 msec
Ended Job = job_1501198522682_0069
MapReduce Jobs Launched:
Stage-Stage-1: Map: 3  Reduce: 3   Cumulative CPU: 33.51 sec   HDFS Read: 2119 HDFS Write: 299 SUCCESS
Total MapReduce CPU Time Spent: 33 seconds 510 msec
OK
empno    ename    sal
7876    ADAMS    1100.0
7654    MARTIN    1250.0
7521    WARD    1250.0
7782    CLARK    2450.0
7566    JONES    2975.0
8888    king    300000.0
8888    ChavinKing    300000.0
7934    MILLER    1300.0
7499    ALLEN    1600.0
7698    BLAKE    2850.0
7788    SCOTT    3000.0
7902    FORD    3000.0
7839    KING    5000.0
7369    SMITH    800.0
7900    JAMES    950.0
7844    TURNER    1500.0
Time taken: 119.103 seconds, Fetched: 16 row(s)

cluster by, which under the right conditions can replace the distribute by and sort by combination, launched 3 map tasks and 3 reduce tasks here, matching the settings. The output is ordered only within each reducer's portion, and is identical to the distribute by sal sort by sal asc result above.

Test conclusion: of the four sorting clauses, only order by actually produces a global sort. To turn sort by's partial ordering into a global ordering, the result still has to go through an additional order by pass. distribute by and cluster by bring identical key values together, but, contrary to what is sometimes claimed, they do not yield a globally sorted result. There may be other means of achieving that, but certainly not these two clauses by themselves.
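For completeness, a sketch of the equivalence behind cluster by and of the extra pass mentioned in the conclusion (cluster by accepts no asc/desc modifier, which is why it cannot express the descending tests above):

-- cluster by sal is shorthand for:
select empno,ename,sal from emp distribute by sal sort by sal;

-- a final single-reducer order by pass turns sort by's partial ordering
-- into a global one (the subquery alias t is required by Hive):
select empno,ename,sal
from (select empno,ename,sal from emp sort by sal asc) t
order by sal asc;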
