Hadoop Big Data: Total Sort in the MapReduce Sorting Mechanism


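There are four common ways to get a totally ordered (globally sorted) output from MapReduce:

(1) Use a single reduce task. The output is globally ordered, but the reduce phase has no parallelism and one node carries the whole load.
(2) Use a range partitioner with hand-picked key ranges and a matching number of reduce tasks. This also yields a global order, but it is hard to avoid an uneven data distribution (data skew): some reduce tasks are overloaded while others receive very little.
(3) Write a separate job that measures the key distribution, derive suitable ranges from it, and then sort with a range partitioner. This works, but it needs an extra job that runs over the whole data set, which is costly.
(4) Use Hadoop's built-in sampler to sample the data set and derive the ranges, then use Hadoop's built-in TotalOrderPartitioner to produce the globally sorted result.

The example below implements approach (4).

    import java.io.IOException;
    import java.net.URI;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.input.SequenceFileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
    import org.apache.hadoop.mapreduce.lib.partition.InputSampler;
    import org.apache.hadoop.mapreduce.lib.partition.TotalOrderPartitioner;

    /**
     * Total sort example.
     * @author zhangxueliang
     */
    public class TotalSort {

        // Identity mapper: the SequenceFile keys are already the sort keys.
        static class TotalSortMapper extends Mapper<Text, Text, Text, Text> {
            @Override
            protected void map(Text key, Text value, Context context)
                    throws IOException, InterruptedException {
                context.write(key, value);
            }
        }

        // Identity reducer: keys arrive sorted within each partition.
        static class TotalSortReducer extends Reducer<Text, Text, Text, Text> {
            @Override
            protected void reduce(Text key, Iterable<Text> values, Context context)
                    throws IOException, InterruptedException {
                for (Text v : values) {
                    context.write(key, v);
                }
            }
        }

        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            Job job = Job.getInstance(conf);
            job.setJarByClass(TotalSort.class);

            job.setMapperClass(TotalSortMapper.class);
            job.setReducerClass(TotalSortReducer.class);

            // Map and reduce both emit Text/Text; the default key class is LongWritable.
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(Text.class);

            // Input component that reads the source SequenceFile.
            job.setInputFormatClass(SequenceFileInputFormat.class);

            FileInputFormat.setInputPaths(job, new Path(args[0]));
            FileOutputFormat.setOutputPath(job, new Path(args[1]));

            // Run several reduce tasks in parallel. This must be set before
            // sampling: writePartitionFile writes (numReduceTasks - 1) split points.
            job.setNumReduceTasks(3);

            // Use Hadoop's built-in total-order partitioner.
            job.setPartitionerClass(TotalOrderPartitioner.class);

            // The built-in sampler samples the input keys as read by the InputFormat,
            // which is why the input here is a SequenceFile keyed by the sort key.
            // Arguments: sampling frequency, max number of samples, max splits sampled.
            InputSampler.Sampler<Text, Text> sampler =
                    new InputSampler.RandomSampler<>(0.1, 100, 10);
            InputSampler.writePartitionFile(job, sampler);

            // Locate the partition file produced by the sampler.
            Configuration conf2 = job.getConfiguration();
            String partitionFile = TotalOrderPartitioner.getPartitionFile(conf2);

            // Distribute the partition file to the local disk of every task node.
            job.addCacheFile(new URI(partitionFile));

            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }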
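To make the idea behind approach (4) more concrete, here is a minimal plain-Java sketch of sampling-based range partitioning: sort a sample of keys, pick evenly spaced split points, and route each key to a partition with a binary search. The class and method names (`SplitPointSketch`, `chooseSplitPoints`, `partitionFor`) are hypothetical and exist only for illustration. This is not Hadoop's implementation, only the general technique under the assumption that keys are plain strings compared lexicographically.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical illustration class; not part of Hadoop.
class SplitPointSketch {

    // Sort the sampled keys and pick (numPartitions - 1) evenly spaced split points.
    static String[] chooseSplitPoints(List<String> sample, int numPartitions) {
        String[] sorted = sample.toArray(new String[0]);
        Arrays.sort(sorted);
        String[] splits = new String[numPartitions - 1];
        for (int i = 1; i < numPartitions; i++) {
            splits[i - 1] = sorted[i * sorted.length / numPartitions];
        }
        return splits;
    }

    // Keys below the first split point go to partition 0; a key equal to a
    // split point goes to the partition on the upper side of that split point.
    static int partitionFor(String key, String[] splits) {
        int pos = Arrays.binarySearch(splits, key);
        // Not found: binarySearch returns -(insertionPoint) - 1.
        return pos >= 0 ? pos + 1 : -(pos + 1);
    }

    public static void main(String[] args) {
        List<String> sample = Arrays.asList("d", "a", "f", "b", "e", "c");
        String[] splits = chooseSplitPoints(sample, 3);
        System.out.println("split points: " + Arrays.toString(splits));
        for (String key : new String[] {"a", "c", "d", "e", "z"}) {
            System.out.println(key + " -> partition " + partitionFor(key, splits));
        }
    }
}
```

Because every key in partition i sorts before every key in partition i + 1, concatenating the sorted reducer outputs in partition order gives a globally sorted result. How evenly the load is spread depends on how representative the sample is.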

2022-11-24
