Hadoop Big Data: the MapReduce Secondary Sort Mechanism
The secondary sort mechanism

In short, it lets MapReduce sort by value for us.

Consider a scenario where we need, for each key group, the entry with the maximum value. Normally, the shuffle sorts only by key. To sort by value, we have to move the value into the key, where it combines with the original key into a composite key. But then the composite keys arrive at the reducer one by one, and emitting only the one with the maximum value is awkward: the reducer would output them all unless we build our own cache, buffer every arriving composite key, and keep only the first. (A "visited" flag will not work either, because a single reducer may receive composite keys for several different original keys, and the flag cannot tell them apart.)

This is exactly what secondary sort solves. The approach:

(1) a comparator that sorts the composite keys;
(2) a partitioner that distributes the load across parallel reduce tasks;
(3) a GroupingComparator that redefines how the value list is aggregated. This is the key piece: it groups records that share the same original key but have different composite keys. The reducer then receives the whole group in a single reduce() call, and the first composite key of the group, which is the one with the maximum value, becomes the key of the value-list iterator. Outputting that key directly gives us exactly what we need.

Example: output the record with the largest order amount for each item.

(1) Define a GroupingComparator
import org.apache.hadoop.io.WritableComparable;
import org.apache.hadoop.io.WritableComparator;

/**
 * Controls how the reduce side groups key-value pairs during the shuffle.
 * @author zhangxueliang
 */
public class ItemidGroupingComparator extends WritableComparator {

    protected ItemidGroupingComparator() {
        super(OrderBean.class, true);
    }

    @Override
    public int compare(WritableComparable a, WritableComparable b) {
        OrderBean abean = (OrderBean) a;
        OrderBean bbean = (OrderBean) b;
        // Beans with the same item_id compare as equal, so they are grouped together
        return abean.getItemid().compareTo(bbean.getItemid());
    }
}
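The effect of comparing only the item id can be illustrated with a plain-Java sketch, no Hadoop required. The `CompositeKey` record below is a hypothetical stand-in for OrderBean, not part of the job:

```java
import java.util.Comparator;

public class GroupingDemo {
    // Hypothetical stand-in for OrderBean: an (itemid, amount) composite key
    record CompositeKey(String itemid, double amount) {}

    // Grouping comparator: looks only at itemid, so keys with the same
    // itemid but different amounts compare as equal and land in one group
    static final Comparator<CompositeKey> GROUPING =
            Comparator.comparing(CompositeKey::itemid);

    public static void main(String[] args) {
        CompositeKey a = new CompositeKey("1001", 87.6);
        CompositeKey b = new CompositeKey("1001", 76.5);
        CompositeKey c = new CompositeKey("1002", 50.0);
        System.out.println(GROUPING.compare(a, b)); // same itemid -> 0
        System.out.println(GROUPING.compare(a, c) < 0); // "1001" sorts before "1002"
    }
}
```

Because the comparator reports the two "1001" keys as equal, the framework hands their values to the reducer as one group even though the full composite keys differ.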
(2) Define the order info bean
import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;

import org.apache.hadoop.io.DoubleWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.WritableComparable;

/**
 * Order info bean, implementing Hadoop's serialization mechanism.
 * @author zhangxueliang
 */
public class OrderBean implements WritableComparable<OrderBean> {

    private Text itemid;
    private DoubleWritable amount;

    public OrderBean() {
    }

    public OrderBean(Text itemid, DoubleWritable amount) {
        set(itemid, amount);
    }

    public void set(Text itemid, DoubleWritable amount) {
        this.itemid = itemid;
        this.amount = amount;
    }

    public Text getItemid() {
        return itemid;
    }

    public DoubleWritable getAmount() {
        return amount;
    }

    @Override
    public int compareTo(OrderBean o) {
        int cmp = this.itemid.compareTo(o.getItemid());
        if (cmp == 0) {
            // negate the comparison so amounts sort in descending order within an item_id
            cmp = -this.amount.compareTo(o.getAmount());
        }
        return cmp;
    }

    @Override
    public void write(DataOutput out) throws IOException {
        out.writeUTF(itemid.toString());
        out.writeDouble(amount.get());
    }

    @Override
    public void readFields(DataInput in) throws IOException {
        this.itemid = new Text(in.readUTF());
        this.amount = new DoubleWritable(in.readDouble());
    }

    @Override
    public String toString() {
        return itemid.toString() + "\t" + amount.get();
    }
}
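The compareTo ordering (item id ascending, amount descending) can be checked with a plain-Java sketch. The `Order` record below is a hypothetical stand-in for OrderBean that reproduces the same comparison logic:

```java
import java.util.Arrays;
import java.util.List;

public class SortOrderDemo {
    record Order(String itemid, double amount) implements Comparable<Order> {
        // Same logic as OrderBean.compareTo: itemid ascending, amount descending
        public int compareTo(Order o) {
            int cmp = this.itemid.compareTo(o.itemid);
            if (cmp == 0) {
                cmp = -Double.compare(this.amount, o.amount);
            }
            return cmp;
        }
    }

    static List<Order> sorted(List<Order> in) {
        return in.stream().sorted().toList();
    }

    public static void main(String[] args) {
        List<Order> result = sorted(Arrays.asList(
                new Order("1001", 76.5),
                new Order("1002", 50.0),
                new Order("1001", 87.6)));
        // Within each itemid, the largest amount now comes first
        System.out.println(result);
    }
}
```

This is the ordering the shuffle produces, which is what lets the reducer take the first key of each group as the maximum.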
(3) Define a custom Partitioner so that beans with the same item id go to the same reduce task
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.mapreduce.Partitioner;

public class ItemIdPartitioner extends Partitioner<OrderBean, NullWritable> {

    @Override
    public int getPartition(OrderBean key, NullWritable value, int numPartitions) {
        // Route beans with the same item_id to the same reduce task;
        // masking with Integer.MAX_VALUE keeps the result non-negative
        return (key.getItemid().hashCode() & Integer.MAX_VALUE) % numPartitions;
    }
}
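The masking trick in getPartition matters because String.hashCode() can be negative, and a plain `%` on a negative hash would produce an invalid (negative) partition index. ANDing with Integer.MAX_VALUE clears the sign bit. A small sketch (the `partition` method is a hypothetical stand-alone copy of the arithmetic above):

```java
public class PartitionDemo {
    // Same arithmetic as ItemIdPartitioner.getPartition
    static int partition(String itemid, int numPartitions) {
        return (itemid.hashCode() & Integer.MAX_VALUE) % numPartitions;
    }

    public static void main(String[] args) {
        // "polygenelubricants" is a well-known string with a negative hash code
        String s = "polygenelubricants";
        System.out.println(s.hashCode() < 0);  // true
        System.out.println(partition(s, 3));   // always in [0, 3)
        System.out.println(s.hashCode() % 3);  // negative: unusable as a partition index
    }
}
```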
(4) Define the main MR flow

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.DoubleWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

/**
 * Uses the secondary sort mechanism to output the record with the largest
 * order amount for each item.
 * @author zhangxueliang
 */
public class SecondarySort {

    static class SecondarySortMapper extends Mapper<LongWritable, Text, OrderBean, NullWritable> {

        OrderBean bean = new OrderBean();

        @Override
        protected void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
            String line = value.toString();
            String[] fields = line.split("\t");
            bean.set(new Text(fields[0]), new DoubleWritable(Double.parseDouble(fields[1])));
            context.write(bean, NullWritable.get());
        }
    }

    static class SecondarySortReducer extends Reducer<OrderBean, NullWritable, OrderBean, NullWritable> {

        // With the GroupingComparator set, the kv pairs arriving here look like:
        //   <1001 87.6>,null  <1001 76.5>,null  ...
        // The reduce() parameter `key` is the first kv pair's key: <1001 87.6>.
        // To output the record with the largest amount per item, just write that key.
        @Override
        protected void reduce(OrderBean key, Iterable<NullWritable> values, Context context) throws IOException, InterruptedException {
            context.write(key, NullWritable.get());
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf);
        job.setJarByClass(SecondarySort.class);

        job.setMapperClass(SecondarySortMapper.class);
        job.setReducerClass(SecondarySortReducer.class);

        job.setOutputKeyClass(OrderBean.class);
        job.setOutputValueClass(NullWritable.class);

        FileInputFormat.setInputPaths(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        // GroupingComparator used during the shuffle
        job.setGroupingComparatorClass(ItemidGroupingComparator.class);
        // Partitioner used during the shuffle
        job.setPartitionerClass(ItemIdPartitioner.class);
        job.setNumReduceTasks(3);

        job.waitForCompletion(true);
    }
}
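The end-to-end effect of the job (sort by composite key, group by item id, emit the first key of each group) can be simulated in plain Java. The `maxPerItem` method below is a hypothetical illustration of the data flow, not part of the Hadoop job:

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class SecondarySortSimulation {
    record Order(String itemid, double amount) {}

    // Simulates the pipeline: the shuffle sorts by (itemid asc, amount desc),
    // the grouping comparator merges equal itemids into one group, and the
    // reducer writes each group's first key, which carries the maximum amount.
    static Map<String, Double> maxPerItem(List<Order> input) {
        Map<String, Double> out = new LinkedHashMap<>();
        input.stream()
             .sorted((a, b) -> {
                 int cmp = a.itemid().compareTo(b.itemid());
                 return cmp != 0 ? cmp : -Double.compare(a.amount(), b.amount());
             })
             // putIfAbsent keeps only the first (largest) amount per itemid,
             // mirroring the reducer that writes only the group's first key
             .forEach(o -> out.putIfAbsent(o.itemid(), o.amount()));
        return out;
    }

    public static void main(String[] args) {
        List<Order> input = List.of(
                new Order("1001", 76.5),
                new Order("1001", 87.6),
                new Order("1002", 50.0),
                new Order("1002", 61.2));
        System.out.println(maxPerItem(input)); // {1001=87.6, 1002=61.2}
    }
}
```

The simulation makes the division of labor visible: the sort puts the maximum first, the grouping decides what one reduce() call sees, and the reducer itself does almost nothing.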