Hadoop Big Data: the MapReduce Secondary Sort Mechanism
The secondary sort mechanism

In short, it lets MapReduce sort by value for us.

Consider a scenario where we need, for each key group, the entry with the maximum value. Normally, the shuffle sorts only by key. To sort by value, we have to move the value into the key, where it combines with the original key into a composite key. But then the composite keys arrive at the reducer one by one, and emitting only the one with the maximum value is awkward: the reducer would output them all unless we build our own cache, buffer every arriving composite key, and keep only the first. (A "visited" flag will not work either, because a single reducer may receive composite keys for several different original keys, and the flag cannot tell them apart.)

This is exactly what secondary sort solves. The approach:

(1) a comparator that sorts the composite keys;
(2) a partitioner that distributes the load across parallel reduce tasks;
(3) a GroupingComparator that redefines how the value list is aggregated. This is the key piece: it groups records that share the same original key but have different composite keys. The reducer then receives the whole group in a single reduce() call, and the first composite key of the group, which is the one with the maximum value, becomes the key of the value-list iterator. Outputting that key directly gives us exactly what we need.

Example: output the record with the largest order amount for each item.

(1) Define a GroupingComparator
import org.apache.hadoop.io.WritableComparable;
import org.apache.hadoop.io.WritableComparator;

/**
 * Controls how the reduce side groups key-value pairs during the shuffle.
 * @author zhangxueliang
 */
public class ItemidGroupingComparator extends WritableComparator {

    protected ItemidGroupingComparator() {
        super(OrderBean.class, true);
    }

    @Override
    public int compare(WritableComparable a, WritableComparable b) {
        OrderBean abean = (OrderBean) a;
        OrderBean bbean = (OrderBean) b;
        // Beans with the same item_id compare as equal, so they are grouped together
        return abean.getItemid().compareTo(bbean.getItemid());
    }
}
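The effect of comparing only the item id can be illustrated with a plain-Java sketch, no Hadoop required. The `CompositeKey` record below is a hypothetical stand-in for OrderBean, not part of the job:

```java
import java.util.Comparator;

public class GroupingDemo {
    // Hypothetical stand-in for OrderBean: an (itemid, amount) composite key
    record CompositeKey(String itemid, double amount) {}

    // Grouping comparator: looks only at itemid, so keys with the same
    // itemid but different amounts compare as equal and land in one group
    static final Comparator<CompositeKey> GROUPING =
            Comparator.comparing(CompositeKey::itemid);

    public static void main(String[] args) {
        CompositeKey a = new CompositeKey("1001", 87.6);
        CompositeKey b = new CompositeKey("1001", 76.5);
        CompositeKey c = new CompositeKey("1002", 50.0);
        System.out.println(GROUPING.compare(a, b)); // same itemid -> 0
        System.out.println(GROUPING.compare(a, c) < 0); // "1001" sorts before "1002"
    }
}
```

Because the comparator reports the two "1001" keys as equal, the framework hands their values to the reducer as one group even though the full composite keys differ.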
(2) Define the order info bean
import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;

import org.apache.hadoop.io.DoubleWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.WritableComparable;

/**
 * Order info bean, implementing Hadoop's serialization mechanism.
 * @author zhangxueliang
 */
public class OrderBean implements WritableComparable<OrderBean> {

    private Text itemid;
    private DoubleWritable amount;

    public OrderBean() {
    }

    public OrderBean(Text itemid, DoubleWritable amount) {
        set(itemid, amount);
    }

    public void set(Text itemid, DoubleWritable amount) {
        this.itemid = itemid;
        this.amount = amount;
    }

    public Text getItemid() {
        return itemid;
    }

    public DoubleWritable getAmount() {
        return amount;
    }

    @Override
    public int compareTo(OrderBean o) {
        int cmp = this.itemid.compareTo(o.getItemid());
        if (cmp == 0) {
            // negate the comparison so amounts sort in descending order within an item_id
            cmp = -this.amount.compareTo(o.getAmount());
        }
        return cmp;
    }

    @Override
    public void write(DataOutput out) throws IOException {
        out.writeUTF(itemid.toString());
        out.writeDouble(amount.get());
    }

    @Override
    public void readFields(DataInput in) throws IOException {
        this.itemid = new Text(in.readUTF());
        this.amount = new DoubleWritable(in.readDouble());
    }

    @Override
    public String toString() {
        return itemid.toString() + "\t" + amount.get();
    }
}
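The compareTo ordering (item id ascending, amount descending) can be checked with a plain-Java sketch. The `Order` record below is a hypothetical stand-in for OrderBean that reproduces the same comparison logic:

```java
import java.util.Arrays;
import java.util.List;

public class SortOrderDemo {
    record Order(String itemid, double amount) implements Comparable<Order> {
        // Same logic as OrderBean.compareTo: itemid ascending, amount descending
        public int compareTo(Order o) {
            int cmp = this.itemid.compareTo(o.itemid);
            if (cmp == 0) {
                cmp = -Double.compare(this.amount, o.amount);
            }
            return cmp;
        }
    }

    static List<Order> sorted(List<Order> in) {
        return in.stream().sorted().toList();
    }

    public static void main(String[] args) {
        List<Order> result = sorted(Arrays.asList(
                new Order("1001", 76.5),
                new Order("1002", 50.0),
                new Order("1001", 87.6)));
        // Within each itemid, the largest amount now comes first
        System.out.println(result);
    }
}
```

This is the ordering the shuffle produces, which is what lets the reducer take the first key of each group as the maximum.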
(3) Define a custom Partitioner so that beans with the same item id go to the same reduce task
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.mapreduce.Partitioner;

public class ItemIdPartitioner extends Partitioner<OrderBean, NullWritable> {

    @Override
    public int getPartition(OrderBean key, NullWritable value, int numPartitions) {
        // Route beans with the same item_id to the same reduce task;
        // masking with Integer.MAX_VALUE keeps the result non-negative
        return (key.getItemid().hashCode() & Integer.MAX_VALUE) % numPartitions;
    }
}
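The masking trick in getPartition matters because String.hashCode() can be negative, and a plain `%` on a negative hash would produce an invalid (negative) partition index. ANDing with Integer.MAX_VALUE clears the sign bit. A small sketch (the `partition` method is a hypothetical stand-alone copy of the arithmetic above):

```java
public class PartitionDemo {
    // Same arithmetic as ItemIdPartitioner.getPartition
    static int partition(String itemid, int numPartitions) {
        return (itemid.hashCode() & Integer.MAX_VALUE) % numPartitions;
    }

    public static void main(String[] args) {
        // "polygenelubricants" is a well-known string with a negative hash code
        String s = "polygenelubricants";
        System.out.println(s.hashCode() < 0);  // true
        System.out.println(partition(s, 3));   // always in [0, 3)
        System.out.println(s.hashCode() % 3);  // negative: unusable as a partition index
    }
}
```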
(4) Define the main MR flow

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.DoubleWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

/**
 * Uses the secondary sort mechanism to output the record with the largest
 * order amount for each item.
 * @author zhangxueliang
 */
public class SecondarySort {

    static class SecondarySortMapper extends Mapper<LongWritable, Text, OrderBean, NullWritable> {

        OrderBean bean = new OrderBean();

        @Override
        protected void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
            String line = value.toString();
            String[] fields = line.split("\t");
            bean.set(new Text(fields[0]), new DoubleWritable(Double.parseDouble(fields[1])));
            context.write(bean, NullWritable.get());
        }
    }

    static class SecondarySortReducer extends Reducer<OrderBean, NullWritable, OrderBean, NullWritable> {

        // With the GroupingComparator set, the kv pairs arriving here look like:
        //   <1001 87.6>,null  <1001 76.5>,null  ...
        // The reduce() parameter `key` is the first kv pair's key: <1001 87.6>.
        // To output the record with the largest amount per item, just write that key.
        @Override
        protected void reduce(OrderBean key, Iterable<NullWritable> values, Context context) throws IOException, InterruptedException {
            context.write(key, NullWritable.get());
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf);
        job.setJarByClass(SecondarySort.class);

        job.setMapperClass(SecondarySortMapper.class);
        job.setReducerClass(SecondarySortReducer.class);

        job.setOutputKeyClass(OrderBean.class);
        job.setOutputValueClass(NullWritable.class);

        FileInputFormat.setInputPaths(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        // GroupingComparator used during the shuffle
        job.setGroupingComparatorClass(ItemidGroupingComparator.class);
        // Partitioner used during the shuffle
        job.setPartitionerClass(ItemIdPartitioner.class);
        job.setNumReduceTasks(3);

        job.waitForCompletion(true);
    }
}
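The end-to-end effect of the job (sort by composite key, group by item id, emit the first key of each group) can be simulated in plain Java. The `maxPerItem` method below is a hypothetical illustration of the data flow, not part of the Hadoop job:

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class SecondarySortSimulation {
    record Order(String itemid, double amount) {}

    // Simulates the pipeline: the shuffle sorts by (itemid asc, amount desc),
    // the grouping comparator merges equal itemids into one group, and the
    // reducer writes each group's first key, which carries the maximum amount.
    static Map<String, Double> maxPerItem(List<Order> input) {
        Map<String, Double> out = new LinkedHashMap<>();
        input.stream()
             .sorted((a, b) -> {
                 int cmp = a.itemid().compareTo(b.itemid());
                 return cmp != 0 ? cmp : -Double.compare(a.amount(), b.amount());
             })
             // putIfAbsent keeps only the first (largest) amount per itemid,
             // mirroring the reducer that writes only the group's first key
             .forEach(o -> out.putIfAbsent(o.itemid(), o.amount()));
        return out;
    }

    public static void main(String[] args) {
        List<Order> input = List.of(
                new Order("1001", 76.5),
                new Order("1001", 87.6),
                new Order("1002", 50.0),
                new Order("1002", 61.2));
        System.out.println(maxPerItem(input)); // {1001=87.6, 1002=61.2}
    }
}
```

The simulation makes the division of labor visible: the sort puts the maximum first, the grouping decides what one reduce() call sees, and the reducer itself does almost nothing.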