Explanation and Testing of order by, sort by, distribute by, and cluster by in Hive
Conclusions:
order by: a global sort, and the only one of the four methods whose terminal output is visibly globally ordered. It runs with a single reducer, so the reduce phase can take very long; in strict mode a limit clause is mandatory.
sort by: can run with multiple reducers; each reducer sorts its own output, ascending by default.
distribute by: controls how map output is partitioned among the reducers. It is usually combined with sort by to sort within each partition on a chosen column.
cluster by: when the distribute by column and the sort by column are the same and the default ascending order is wanted, cluster by can replace the distribute by + sort by pair. Doing so is said to strip sort by of its parallelism, yet even then it does not guarantee globally ordered output (that is what the tests below show).
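To make the comparison concrete before the tests, here is a minimal side-by-side sketch of the four clauses on the same emp table used below:

select empno,ename,sal from emp order by sal;                  -- 1 reducer; output globally ordered
select empno,ename,sal from emp sort by sal;                   -- N reducers; each reducer's output ordered
select empno,ename,sal from emp distribute by sal sort by sal; -- equal sal values sent to the same reducer, then sorted there
select empno,ename,sal from emp cluster by sal;                -- shorthand for the previous line (ascending only)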
1. order by (global sort) test:
set mapred.max.split.size=200;
set mapred.reduce.tasks=3;
select empno,ename,sal from emp order by sal asc limit 20;
hive (chavin)> set mapred.max.split.size=200;
hive (chavin)> set mapred.reduce.tasks=3;
hive (chavin)> select empno,ename,sal from emp order by sal asc limit 20;
Total jobs = 1
Launching Job 1 out of 1
Number of reduce tasks determined at compile time: 1
In order to change the average load for a reducer (in bytes):
  set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
  set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
  set mapreduce.job.reduces=<number>
Starting Job = job_1501198522682_0059, Tracking URL =
Kill Command = /opt/cdh-5.3.6/hadoop-2.5.0-cdh5.3.6/bin/hadoop job -kill job_1501198522682_0059
Hadoop job information for Stage-1: number of mappers: 3; number of reducers: 1
2017-07-30 02:21:58,594 Stage-1 map = 0%, reduce = 0%
2017-07-30 02:22:59,734 Stage-1 map = 0%, reduce = 0%
2017-07-30 02:24:00,084 Stage-1 map = 0%, reduce = 0%
2017-07-30 02:25:00,859 Stage-1 map = 0%, reduce = 0%
2017-07-30 02:25:28,846 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 123.57 sec
2017-07-30 02:26:03,306 Stage-1 map = 100%, reduce = 100%, Cumulative CPU 128.97 sec
MapReduce Total cumulative CPU time: 2 minutes 8 seconds 970 msec
Ended Job = job_1501198522682_0059
MapReduce Jobs Launched:
Stage-Stage-1: Map: 3  Reduce: 1  Cumulative CPU: 128.97 sec  HDFS Read: 2119 HDFS Write: 299 SUCCESS
Total MapReduce CPU Time Spent: 2 minutes 8 seconds 970 msec
OK
empno ename sal
7369 SMITH 800.0
7900 JAMES 950.0
7876 ADAMS 1100.0
7654 MARTIN 1250.0
7521 WARD 1250.0
7934 MILLER 1300.0
7844 TURNER 1500.0
7499 ALLEN 1600.0
7782 CLARK 2450.0
7698 BLAKE 2850.0
7566 JONES 2975.0
7788 SCOTT 3000.0
7902 FORD 3000.0
7839 KING 5000.0
8888 king 300000.0
8888 ChavinKing 300000.0
Time taken: 442.499 seconds, Fetched: 16 row(s)
The log output shows that this job launched 3 map tasks and 1 reduce task, and the result is sorted ascending by sal and globally ordered. Note that we set the number of reducers to 3 before the job, yet only one reducer was actually launched: order by forces a single reducer in order to perform the global sort. When the data set is large, that single reducer becomes the job's performance bottleneck.
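As an aside, the strict-mode requirement from the conclusions can be illustrated as follows (a sketch; the exact error text varies across Hive versions):

set hive.mapred.mode=strict;
select empno,ename,sal from emp order by sal;          -- rejected: in strict mode, order by requires a limit clause
select empno,ename,sal from emp order by sal limit 20; -- accepted: limit caps the single reducer's output
set hive.mapred.mode=nonstrict;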
2. sort by test:
==============================================
hive (chavin)> set mapred.max.split.size=200;
hive (chavin)> set mapred.reduce.tasks=3;
hive (chavin)> select empno,ename,sal from emp sort by sal asc;
Total jobs = 1
Launching Job 1 out of 1
Number of reduce tasks not specified. Defaulting to jobconf value of: 3
In order to change the average load for a reducer (in bytes):
  set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
  set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
  set mapreduce.job.reduces=<number>
Starting Job = job_1501198522682_0063, Tracking URL =
Kill Command = /opt/cdh-5.3.6/hadoop-2.5.0-cdh5.3.6/bin/hadoop job -kill job_1501198522682_0063
Hadoop job information for Stage-1: number of mappers: 3; number of reducers: 3
2017-07-30 02:40:32,898 Stage-1 map = 0%, reduce = 0%
2017-07-30 02:41:18,531 Stage-1 map = 33%, reduce = 0%, Cumulative CPU 5.05 sec
2017-07-30 02:41:27,062 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 17.9 sec
2017-07-30 02:42:15,867 Stage-1 map = 100%, reduce = 67%, Cumulative CPU 28.39 sec
2017-07-30 02:42:18,655 Stage-1 map = 100%, reduce = 78%, Cumulative CPU 30.31 sec
2017-07-30 02:42:20,045 Stage-1 map = 100%, reduce = 100%, Cumulative CPU 33.92 sec
MapReduce Total cumulative CPU time: 33 seconds 920 msec
Ended Job = job_1501198522682_0063
MapReduce Jobs Launched:
Stage-Stage-1: Map: 3  Reduce: 3  Cumulative CPU: 40.36 sec  HDFS Read: 2119 HDFS Write: 299 SUCCESS
Total MapReduce CPU Time Spent: 40 seconds 360 msec
OK
empno ename sal
7654 MARTIN 1250.0
7844 TURNER 1500.0
8888 king 300000.0
8888 ChavinKing 300000.0
7900 JAMES 950.0
7521 WARD 1250.0
7934 MILLER 1300.0
7499 ALLEN 1600.0
7782 CLARK 2450.0
7566 JONES 2975.0
7902 FORD 3000.0
7788 SCOTT 3000.0
7839 KING 5000.0
7369 SMITH 800.0
7876 ADAMS 1100.0
7698 BLAKE 2850.0
Time taken: 136.972 seconds, Fetched: 16 row(s)
With sort by sal asc, the job launched 3 map tasks and 3 reduce tasks, matching the settings we made up front. The output is only ordered within segments, which confirms that sort by sorts within each reducer rather than globally.
============================================
set mapred.max.split.size=200;
set mapred.reduce.tasks=3;
select empno,ename,sal from emp sort by sal desc;
hive (chavin)> set mapred.max.split.size=200;
hive (chavin)> set mapred.reduce.tasks=3;
hive (chavin)> select empno,ename,sal from emp sort by sal desc;
Total jobs = 1
Launching Job 1 out of 1
Number of reduce tasks not specified. Defaulting to jobconf value of: 3
In order to change the average load for a reducer (in bytes):
  set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
  set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
  set mapreduce.job.reduces=<number>
Starting Job = job_1501198522682_0064, Tracking URL =
Kill Command = /opt/cdh-5.3.6/hadoop-2.5.0-cdh5.3.6/bin/hadoop job -kill job_1501198522682_0064
Hadoop job information for Stage-1: number of mappers: 3; number of reducers: 3
2017-07-30 02:43:57,741 Stage-1 map = 0%, reduce = 0%
2017-07-30 02:44:41,415 Stage-1 map = 33%, reduce = 0%, Cumulative CPU 4.27 sec
2017-07-30 02:44:44,267 Stage-1 map = 67%, reduce = 0%, Cumulative CPU 9.69 sec
2017-07-30 02:44:46,779 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 14.64 sec
2017-07-30 02:45:27,003 Stage-1 map = 100%, reduce = 22%, Cumulative CPU 17.61 sec
2017-07-30 02:45:28,551 Stage-1 map = 100%, reduce = 44%, Cumulative CPU 22.5 sec
2017-07-30 02:45:29,834 Stage-1 map = 100%, reduce = 56%, Cumulative CPU 24.54 sec
2017-07-30 02:45:32,660 Stage-1 map = 100%, reduce = 78%, Cumulative CPU 29.26 sec
2017-07-30 02:45:35,291 Stage-1 map = 100%, reduce = 90%, Cumulative CPU 35.0 sec
2017-07-30 02:45:37,542 Stage-1 map = 100%, reduce = 100%, Cumulative CPU 36.92 sec
MapReduce Total cumulative CPU time: 36 seconds 920 msec
Ended Job = job_1501198522682_0064
MapReduce Jobs Launched:
Stage-Stage-1: Map: 3  Reduce: 3  Cumulative CPU: 36.92 sec  HDFS Read: 2119 HDFS Write: 299 SUCCESS
Total MapReduce CPU Time Spent: 36 seconds 920 msec
OK
empno ename sal
8888 king 300000.0
8888 ChavinKing 300000.0
7844 TURNER 1500.0
7654 MARTIN 1250.0
7839 KING 5000.0
7788 SCOTT 3000.0
7902 FORD 3000.0
7566 JONES 2975.0
7782 CLARK 2450.0
7499 ALLEN 1600.0
7934 MILLER 1300.0
7521 WARD 1250.0
7900 JAMES 950.0
7698 BLAKE 2850.0
7876 ADAMS 1100.0
7369 SMITH 800.0
Time taken: 132.663 seconds, Fetched: 16 row(s)
The sort by sal desc run likewise launched 3 map tasks and 3 reduce tasks, matching our settings, and again the output is ordered (descending this time) only within each reducer's segment, confirming that sort by sorts per reducer.
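One way to observe the per-reducer ordering directly is to write the result out to a directory: Hive emits one file per reducer, each sorted on its own. A sketch, using a hypothetical local path /tmp/sortby_demo:

insert overwrite local directory '/tmp/sortby_demo'
select empno,ename,sal from emp sort by sal asc;
-- with mapred.reduce.tasks=3, /tmp/sortby_demo should hold three files
-- (000000_0, 000001_0, 000002_0), each internally ordered by sal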
======================================================================
3. distribute by test:
set mapred.max.split.size=200;
set mapred.reduce.tasks=3;
select empno,ename,sal from emp distribute by sal sort by sal asc;
hive (chavin)> set mapred.max.split.size=200;
hive (chavin)> set mapred.reduce.tasks=3;
hive (chavin)> select empno,ename,sal from emp distribute by sal sort by sal asc;
Total jobs = 1
Launching Job 1 out of 1
Number of reduce tasks not specified. Defaulting to jobconf value of: 3
In order to change the average load for a reducer (in bytes):
  set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
  set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
  set mapreduce.job.reduces=<number>
Starting Job = job_1501198522682_0067, Tracking URL =
Kill Command = /opt/cdh-5.3.6/hadoop-2.5.0-cdh5.3.6/bin/hadoop job -kill job_1501198522682_0067
Hadoop job information for Stage-1: number of mappers: 3; number of reducers: 3
2017-07-30 02:52:14,833 Stage-1 map = 0%, reduce = 0%
2017-07-30 02:53:04,015 Stage-1 map = 33%, reduce = 0%, Cumulative CPU 6.0 sec
2017-07-30 02:53:05,393 Stage-1 map = 67%, reduce = 0%, Cumulative CPU 11.35 sec
2017-07-30 02:53:06,657 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 11.72 sec
2017-07-30 02:53:46,542 Stage-1 map = 100%, reduce = 22%, Cumulative CPU 20.83 sec
2017-07-30 02:53:47,981 Stage-1 map = 100%, reduce = 44%, Cumulative CPU 23.77 sec
2017-07-30 02:53:50,909 Stage-1 map = 100%, reduce = 70%, Cumulative CPU 27.81 sec
2017-07-30 02:53:54,581 Stage-1 map = 100%, reduce = 100%, Cumulative CPU 37.24 sec
MapReduce Total cumulative CPU time: 37 seconds 240 msec
Ended Job = job_1501198522682_0067
MapReduce Jobs Launched:
Stage-Stage-1: Map: 3  Reduce: 3  Cumulative CPU: 39.56 sec  HDFS Read: 2119 HDFS Write: 299 SUCCESS
Total MapReduce CPU Time Spent: 39 seconds 560 msec
OK
empno ename sal
7876 ADAMS 1100.0
7654 MARTIN 1250.0
7521 WARD 1250.0
7782 CLARK 2450.0
7566 JONES 2975.0
8888 king 300000.0
8888 ChavinKing 300000.0
7934 MILLER 1300.0
7499 ALLEN 1600.0
7698 BLAKE 2850.0
7788 SCOTT 3000.0
7902 FORD 3000.0
7839 KING 5000.0
7369 SMITH 800.0
7900 JAMES 950.0
7844 TURNER 1500.0
Time taken: 129.0 seconds, Fetched: 16 row(s)
The distribute by sal sort by sal asc run launched 3 map tasks and 3 reduce tasks, matching our settings. The output is ordered within each reducer, again confirming that sort by sorts per reducer. Notice also that rows with equal sal values appear side by side: distribute by sal routes each distinct sal value to a single reducer.
============================================================
set mapred.max.split.size=200;
set mapred.reduce.tasks=3;
select empno,ename,sal from emp distribute by sal sort by sal desc;
hive (chavin)> set mapred.max.split.size=200;
hive (chavin)> set mapred.reduce.tasks=3;
hive (chavin)> select empno,ename,sal from emp distribute by sal sort by sal desc;
Total jobs = 1
Launching Job 1 out of 1
Number of reduce tasks not specified. Defaulting to jobconf value of: 3
In order to change the average load for a reducer (in bytes):
  set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
  set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
  set mapreduce.job.reduces=<number>
Starting Job = job_1501198522682_0068, Tracking URL =
Kill Command = /opt/cdh-5.3.6/hadoop-2.5.0-cdh5.3.6/bin/hadoop job -kill job_1501198522682_0068
Hadoop job information for Stage-1: number of mappers: 3; number of reducers: 3
2017-07-30 02:55:49,989 Stage-1 map = 0%, reduce = 0%
2017-07-30 02:56:29,911 Stage-1 map = 67%, reduce = 0%, Cumulative CPU 4.4 sec
2017-07-30 02:56:37,401 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 16.42 sec
2017-07-30 02:57:17,118 Stage-1 map = 100%, reduce = 22%, Cumulative CPU 19.05 sec
2017-07-30 02:57:21,673 Stage-1 map = 100%, reduce = 24%, Cumulative CPU 21.07 sec
2017-07-30 02:57:24,397 Stage-1 map = 100%, reduce = 33%, Cumulative CPU 23.17 sec
2017-07-30 02:57:25,844 Stage-1 map = 100%, reduce = 56%, Cumulative CPU 26.7 sec
2017-07-30 02:57:27,134 Stage-1 map = 100%, reduce = 78%, Cumulative CPU 30.89 sec
2017-07-30 02:57:28,508 Stage-1 map = 100%, reduce = 81%, Cumulative CPU 32.52 sec
2017-07-30 02:57:29,728 Stage-1 map = 100%, reduce = 83%, Cumulative CPU 34.62 sec
2017-07-30 02:57:32,582 Stage-1 map = 100%, reduce = 100%, Cumulative CPU 39.17 sec
MapReduce Total cumulative CPU time: 39 seconds 170 msec
Ended Job = job_1501198522682_0068
MapReduce Jobs Launched:
Stage-Stage-1: Map: 3  Reduce: 3  Cumulative CPU: 39.17 sec  HDFS Read: 2119 HDFS Write: 299 SUCCESS
Total MapReduce CPU Time Spent: 39 seconds 170 msec
OK
empno ename sal
8888 king 300000.0
8888 ChavinKing 300000.0
7566 JONES 2975.0
7782 CLARK 2450.0
7654 MARTIN 1250.0
7521 WARD 1250.0
7876 ADAMS 1100.0
7839 KING 5000.0
7788 SCOTT 3000.0
7902 FORD 3000.0
7698 BLAKE 2850.0
7499 ALLEN 1600.0
7934 MILLER 1300.0
7844 TURNER 1500.0
7900 JAMES 950.0
7369 SMITH 800.0
Time taken: 135.076 seconds, Fetched: 16 row(s)
The descending variant behaves the same way: 3 map tasks and 3 reduce tasks as configured, output ordered (descending) within each reducer, and rows with equal sal values kept together because distribute by sal sends each distinct sal value to one reducer.
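distribute by really pays off when the partition column differs from the sort column. A sketch, assuming emp carries the usual deptno column (not shown in the tests above):

select empno,ename,sal,deptno
from emp
distribute by deptno          -- all rows of one department go to one reducer
sort by deptno asc, sal desc; -- within that reducer, each department's salaries listed high to low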
=========================================================
4. cluster by test:
set mapred.max.split.size=200;
set mapred.reduce.tasks=3;
select empno,ename,sal from emp cluster by sal;
hive (chavin)> set mapred.max.split.size=200;
hive (chavin)> set mapred.reduce.tasks=3;
hive (chavin)> select empno,ename,sal from emp cluster by sal;
Total jobs = 1
Launching Job 1 out of 1
Number of reduce tasks not specified. Defaulting to jobconf value of: 3
In order to change the average load for a reducer (in bytes):
  set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
  set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
  set mapreduce.job.reduces=<number>
Starting Job = job_1501198522682_0069, Tracking URL =
Kill Command = /opt/cdh-5.3.6/hadoop-2.5.0-cdh5.3.6/bin/hadoop job -kill job_1501198522682_0069
Hadoop job information for Stage-1: number of mappers: 3; number of reducers: 3
2017-07-30 02:59:19,969 Stage-1 map = 0%, reduce = 0%
2017-07-30 03:00:00,105 Stage-1 map = 33%, reduce = 0%, Cumulative CPU 5.02 sec
2017-07-30 03:00:02,559 Stage-1 map = 67%, reduce = 0%, Cumulative CPU 9.55 sec
2017-07-30 03:00:03,736 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 14.01 sec
2017-07-30 03:00:40,134 Stage-1 map = 100%, reduce = 22%, Cumulative CPU 18.03 sec
2017-07-30 03:00:41,421 Stage-1 map = 100%, reduce = 44%, Cumulative CPU 21.22 sec
2017-07-30 03:00:44,508 Stage-1 map = 100%, reduce = 81%, Cumulative CPU 27.56 sec
2017-07-30 03:00:45,937 Stage-1 map = 100%, reduce = 92%, Cumulative CPU 31.57 sec
2017-07-30 03:00:47,208 Stage-1 map = 100%, reduce = 100%, Cumulative CPU 33.51 sec
MapReduce Total cumulative CPU time: 33 seconds 510 msec
Ended Job = job_1501198522682_0069
MapReduce Jobs Launched:
Stage-Stage-1: Map: 3  Reduce: 3  Cumulative CPU: 33.51 sec  HDFS Read: 2119 HDFS Write: 299 SUCCESS
Total MapReduce CPU Time Spent: 33 seconds 510 msec
OK
empno ename sal
7876 ADAMS 1100.0
7654 MARTIN 1250.0
7521 WARD 1250.0
7782 CLARK 2450.0
7566 JONES 2975.0
8888 king 300000.0
8888 ChavinKing 300000.0
7934 MILLER 1300.0
7499 ALLEN 1600.0
7698 BLAKE 2850.0
7788 SCOTT 3000.0
7902 FORD 3000.0
7839 KING 5000.0
7369 SMITH 800.0
7900 JAMES 950.0
7844 TURNER 1500.0
Time taken: 119.103 seconds, Fetched: 16 row(s)
cluster by, which under the conditions above can replace the distribute by + sort by combination, launched 3 map tasks and 3 reduce tasks, matching our settings. The output is ordered within each reducer and is identical to the distribute by sal sort by sal asc result.
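In other words, the two statements below should produce identical output, as the runs above confirm. Note that cluster by accepts no sort direction, so there is no descending counterpart:

select empno,ename,sal from emp cluster by sal;
select empno,ename,sal from emp distribute by sal sort by sal asc;
-- select empno,ename,sal from emp cluster by sal desc;  -- not valid HiveQL: cluster by takes no asc/desc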
Test conclusion: of the four sorting methods above, only order by truly delivers a global sort. To turn sort by's per-reducer ordering into a global ordering, its result still has to go through an order by pass. distribute by and cluster by group equal key values together, but, contrary to what is sometimes claimed, they do not yield globally ordered output. There may be other ways to achieve that, but not through these two clauses by themselves.
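The "order by pass over a sort by result" route mentioned above could look like the sketch below; the outer order by still funnels everything through one reducer, so it mainly helps when the inner query has already shrunk the data:

select empno,ename,sal
from (select empno,ename,sal from emp sort by sal asc) t
order by sal asc
limit 20;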