
Big Data: MapReduce

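MapReduce is a framework for writing batch-processing applications. Programs written with it can be submitted to a Hadoop cluster to process large data sets in parallel. A MapReduce job splits the input data set into independent chunks, which the map tasks process in parallel; the framework then sorts the map outputs and feeds them to the reduce tasks.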

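MapReduce originates from Google's MapReduce paper, published in December 2004. Hadoop MapReduce is a clone of Google's MapReduce.

Advantages of MapReduce: offline processing of massive data sets, easy to develop, easy to run.

Disadvantages of MapReduce: it is not suited to real-time or streaming computation.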

The MapReduce programming model

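When programming, we mainly care about how to do the Splitting and how to do the Reduce.

The MapReduce framework works exclusively on <key, value> pairs: it treats the input of a job as a set of <key, value> pairs and produces a set of <key, value> pairs as output.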

MapReduce splits a job into a Map phase and a Reduce phase.

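Taking word count as the example, the stages are:

input: read the text file.

splitting: split the file by line. This yields K1, which identifies the line (its line number in this illustration; Hadoop's default TextInputFormat actually uses the line's byte offset in the file), and V1, the text of that line.

mapping: split each line on spaces, in parallel. This yields List(K2, V2), where K2 is a word. Because this is a word-frequency count, V2 is always 1, meaning one occurrence.

shuffling: the mapping work may run in parallel on different machines, so shuffling sends all data with the same key to the same node to be merged; only then can the final result be computed. This yields K2, a word, and List(V2), an iterable collection whose elements are the V2 values from the mapping stage.

reducing: the example counts the total occurrences of each word, so reducing sums List(V2) and writes the final output.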
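The stages above can be sketched in plain Java without any Hadoop dependency. This is only an illustration under stated assumptions: the class `WordCountStages` and its methods `map`, `shuffle`, `reduce`, and `wordCount` are hypothetical names, everything runs in one process, and a `TreeMap` stands in for the framework's sorting and grouping.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical single-process sketch of the word count stages; not Hadoop code.
public class WordCountStages {

    // mapping: split one line on whitespace and emit a (word, 1) pair per word.
    static List<Map.Entry<String, Integer>> map(String line) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String word : line.trim().split("\\s+")) {
            if (!word.isEmpty()) {
                pairs.add(new SimpleEntry<>(word, 1));
            }
        }
        return pairs;
    }

    // shuffling: collect all values that share a key; TreeMap keeps the keys sorted.
    static Map<String, List<Integer>> shuffle(List<Map.Entry<String, Integer>> pairs) {
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (Map.Entry<String, Integer> pair : pairs) {
            grouped.computeIfAbsent(pair.getKey(), k -> new ArrayList<>()).add(pair.getValue());
        }
        return grouped;
    }

    // reducing: sum the list of values belonging to one key.
    static int reduce(List<Integer> values) {
        int sum = 0;
        for (int v : values) {
            sum += v;
        }
        return sum;
    }

    // Runs mapping, shuffling, and reducing over the input lines.
    static Map<String, Integer> wordCount(List<String> lines) {
        List<Map.Entry<String, Integer>> mapped = new ArrayList<>();
        for (String line : lines) {
            mapped.addAll(map(line));
        }
        Map<String, Integer> result = new TreeMap<>();
        for (Map.Entry<String, List<Integer>> e : shuffle(mapped).entrySet()) {
            result.put(e.getKey(), reduce(e.getValue()));
        }
        return result;
    }

    public static void main(String[] args) {
        // prints {Hadoop=3, Hive=1, Spark=2}
        System.out.println(wordCount(Arrays.asList("Hadoop Spark Hive", "Spark Hadoop Hadoop")));
    }
}
```

In a real job the mapping calls run on different machines and the shuffle moves data across the network; the sketch only mirrors the data flow between the stages.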

(input) <k1, v1> -> map -> <k2, v2> -> combine -> <k2, v2> -> reduce -> <k3, v3> (output)
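The combine step in this flow pre-aggregates the output of each map task locally, so fewer records have to be shuffled to the reducers. A minimal sketch of the idea for word count follows, assuming a hypothetical class `CombinerSketch` with methods `combine` and `reduce`; it uses no Hadoop types. Pre-aggregation is safe here because addition is associative and commutative.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical sketch of local pre-aggregation (combine); not Hadoop code.
public class CombinerSketch {

    // combine: turn the (word, 1) pairs of ONE map task into (word, partialSum) pairs.
    static Map<String, Integer> combine(List<String> wordsFromOneMapTask) {
        Map<String, Integer> local = new TreeMap<>();
        for (String word : wordsFromOneMapTask) {
            local.merge(word, 1, Integer::sum);
        }
        return local;
    }

    // reduce: add up the partial sums received from all map tasks.
    static Map<String, Integer> reduce(List<Map<String, Integer>> partials) {
        Map<String, Integer> total = new TreeMap<>();
        for (Map<String, Integer> partial : partials) {
            for (Map.Entry<String, Integer> e : partial.entrySet()) {
                total.merge(e.getKey(), e.getValue(), Integer::sum);
            }
        }
        return total;
    }

    public static void main(String[] args) {
        Map<String, Integer> task1 = combine(Arrays.asList("a", "b", "a", "a"));
        Map<String, Integer> task2 = combine(Arrays.asList("b", "c"));
        // 6 words were mapped, but only 4 records remain to be shuffled.
        System.out.println(task1.size() + task2.size()); // prints 4
        System.out.println(reduce(Arrays.asList(task1, task2))); // prints {a=3, b=2, c=1}
    }
}
```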

Mapper

//
// Source code recreated from a .class file by IntelliJ IDEA
// (powered by FernFlower decompiler)
//
package org.apache.hadoop.mapreduce;

import java.io.IOException;
import org.apache.hadoop.classification.InterfaceAudience.Public;
import org.apache.hadoop.classification.InterfaceStability.Stable;

@Public
@Stable
public class Mapper<KEYIN, VALUEIN, KEYOUT, VALUEOUT> {
    public Mapper() {
    }

    // Called once at the beginning of the task.
    protected void setup(Context context) throws IOException, InterruptedException {
    }

    // Called once per input key/value pair; the default is the identity function.
    @SuppressWarnings("unchecked")
    protected void map(KEYIN key, VALUEIN value, Context context) throws IOException, InterruptedException {
        context.write((KEYOUT) key, (VALUEOUT) value);
    }

    // Called once at the end of the task.
    protected void cleanup(Context context) throws IOException, InterruptedException {
    }

    // Drives the task: setup, then map for every record, then cleanup.
    public void run(Context context) throws IOException, InterruptedException {
        this.setup(context);

        try {
            while (context.nextKeyValue()) {
                this.map(context.getCurrentKey(), context.getCurrentValue(), context);
            }
        } finally {
            this.cleanup(context);
        }
    }

    public abstract class Context implements MapContext<KEYIN, VALUEIN, KEYOUT, VALUEOUT> {
        public Context() {
        }
    }
}
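The `run` method above is a template method: it always calls `setup`, then `map` once per record, then `cleanup`, and a job customizes behavior by overriding those hooks. The sketch below imitates that lifecycle with a hypothetical `MiniMapper` class that uses plain Java types instead of Hadoop's `Context`; only the hook names are taken from the source above, everything else is an assumption made for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical stand-in for the Mapper lifecycle; not Hadoop code.
public class MapperLifecycleSketch {

    static class MiniMapper {
        final List<String> output = new ArrayList<>();

        protected void setup() {
        }

        // Default behavior is the identity function, like Mapper.map.
        protected void map(long key, String value) {
            output.add(key + "=" + value);
        }

        protected void cleanup() {
        }

        // Same shape as Mapper.run: setup, map per record, cleanup in finally.
        public void run(List<String> records) {
            setup();
            try {
                long key = 0;
                for (String record : records) {
                    map(key++, record);
                }
            } finally {
                cleanup();
            }
        }
    }

    // A subclass overrides the hooks, the way a word count mapper overrides map.
    static class TokenizingMapper extends MiniMapper {
        @Override
        protected void setup() {
            output.add("setup");
        }

        @Override
        protected void map(long key, String value) {
            for (String word : value.trim().split("\\s+")) {
                if (!word.isEmpty()) {
                    output.add(word + "=1");
                }
            }
        }

        @Override
        protected void cleanup() {
            output.add("cleanup");
        }
    }

    static List<String> runTokenizer(List<String> lines) {
        TokenizingMapper mapper = new TokenizingMapper();
        mapper.run(lines);
        return mapper.output;
    }

    public static void main(String[] args) {
        // prints [setup, a=1, b=1, c=1, cleanup]
        System.out.println(runTokenizer(Arrays.asList("a b", "c")));
    }
}
```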

Reducer

//
// Source code recreated from a .class file by IntelliJ IDEA
// (powered by FernFlower decompiler)
//
package org.apache.hadoop.mapreduce;

import java.io.IOException;
import java.util.Iterator;
import org.apache.hadoop.classification.InterfaceAudience.Public;
import org.apache.hadoop.classification.InterfaceStability.Stable;
import org.apache.hadoop.mapreduce.ReduceContext.ValueIterator;
import org.apache.hadoop.mapreduce.task.annotation.Checkpointable;

@Checkpointable
@Public
@Stable
public class Reducer<KEYIN, VALUEIN, KEYOUT, VALUEOUT> {
    public Reducer() {
    }

    // Called once at the start of the task.
    protected void setup(Context context) throws IOException, InterruptedException {
    }

    // Called once per key with all of its values; the default is the identity function.
    @SuppressWarnings("unchecked")
    protected void reduce(KEYIN key, Iterable<VALUEIN> values, Context context) throws IOException, InterruptedException {
        for (VALUEIN value : values) {
            context.write((KEYOUT) key, (VALUEOUT) value);
        }
    }

    // Called once at the end of the task.
    protected void cleanup(Context context) throws IOException, InterruptedException {
    }

    // Drives the task: setup, then reduce for every key, then cleanup.
    public void run(Context context) throws IOException, InterruptedException {
        this.setup(context);

        try {
            while (context.nextKey()) {
                this.reduce(context.getCurrentKey(), context.getValues(), context);
                // If a backup store is used, reset it.
                Iterator<VALUEIN> iter = context.getValues().iterator();
                if (iter instanceof ValueIterator) {
                    ((ValueIterator<VALUEIN>) iter).resetBackupStore();
                }
            }
        } finally {
            this.cleanup(context);
        }
    }

    public abstract class Context implements ReduceContext<KEYIN, VALUEIN, KEYOUT, VALUEOUT> {
        public Context() {
        }
    }
}

MapReduce programming model: execution steps

1. Prepare the input data for map
2. Mapper processing
3. Shuffle
4. Reduce
5. Output the result

MapReduce programming model: core concepts

Split, InputFormat, OutputFormat, Combiner, Partitioner
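Of these concepts, the Partitioner decides which reduce task receives each key, which is what guarantees that all values for one key meet in the same place. The sketch below shows hash-and-modulo partitioning, modeled on the approach of Hadoop's default HashPartitioner; the class `HashPartitionSketch` and its `partition` method are hypothetical names and use no Hadoop types.

```java
import java.util.Arrays;

// Hypothetical sketch of hash partitioning; not Hadoop code.
public class HashPartitionSketch {

    // Clears the sign bit so the result is never negative, then takes the
    // remainder to choose one of numReduceTasks partitions.
    static int partition(Object key, int numReduceTasks) {
        return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }

    public static void main(String[] args) {
        int numReduceTasks = 3;
        for (String word : Arrays.asList("Hadoop", "Spark", "Hive", "Hadoop")) {
            // The same word always lands in the same partition.
            System.out.println(word + " -> partition " + partition(word, numReduceTasks));
        }
    }
}
```

Because the partition depends only on the key's hash code, every occurrence of a key is routed to the same reduce task regardless of which map task emitted it.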


224 2022-11-23
