Hadoop MapReduce 02: A Custom WordCount Example

Reader contribution · 241 · 2022-11-17


Creating the MapperTask

Mapper type parameters

Parameter : Description
K1        : type of the offset at which each line starts (input is read line by line by default)
V1        : type of each line that is read
K2        : type of the key returned after the user's processing
V2        : type of the value returned after the user's processing

Note: the data travels across the network, so it must be serialized.

Java type : Serializable (Writable) type
Integer   : IntWritable
Long      : LongWritable
Double    : DoubleWritable
Float     : FloatWritable
String    : Text
null      : NullWritable
Boolean   : BooleanWritable
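To see why plain Java types get wrapped, here is a minimal plain-JDK sketch of the serialization contract these Writable types follow (write to a binary stream, read back from one). The class name MiniIntWritable is hypothetical, not Hadoop's IntWritable; it only imitates the write/readFields pattern:

```java
import java.io.*;

// Hypothetical MiniIntWritable: sketches the Writable contract —
// a value serializes itself to a DataOutput and restores itself
// from a DataInput, which is what lets it cross the network.
public class MiniIntWritable {
    private int value;

    public MiniIntWritable(int value) { this.value = value; }

    public int get() { return value; }

    // serialize the value into a binary stream
    public void write(DataOutput out) throws IOException {
        out.writeInt(value);
    }

    // restore the value from a binary stream
    public void readFields(DataInput in) throws IOException {
        value = in.readInt();
    }

    public static void main(String[] args) throws IOException {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        new MiniIntWritable(42).write(new DataOutputStream(buf));

        MiniIntWritable copy = new MiniIntWritable(0);
        copy.readFields(new DataInputStream(
                new ByteArrayInputStream(buf.toByteArray())));
        System.out.println(copy.get()); // round-trips as 42
    }
}
```

Hadoop's real Writable interface works the same way, just with the framework driving the stream.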


package com.bobo.mr.wc;

import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

/**
 * Note: the data travels across the network, so it must be serializable.
 *
 * KEYIN:    offset of each line read by default (long -> LongWritable)
 * VALUEIN:  the line that is read (String -> Text)
 * KEYOUT:   key of the data returned after the user's processing (String -> Text)
 * VALUEOUT: value returned after the user's processing (int -> IntWritable)
 *
 * @author 波波烤鸭
 * dengpbs@163.com
 */
public class MyMapperTask extends Mapper<LongWritable, Text, Text, IntWritable> {

    /**
     * The map-phase business logic goes in the map method,
     * which by default is called once for every line read.
     *
     * @param key   the offset of the line read
     * @param value the line that was read
     */
    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        String line = value.toString();
        // split the line into words on spaces
        String[] words = line.split(" ");
        for (String word : words) {
            // emit the word as the key and 1 as the value,
            // so identical words can be grouped downstream
            context.write(new Text(word), new IntWritable(1));
        }
    }
}
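The map step boils down to "split the line, emit (word, 1)". A minimal plain-Java sketch of that logic, with no Hadoop dependency (the class name MapSketch is made up for illustration):

```java
import java.util.ArrayList;
import java.util.List;

// Plain-Java sketch of the map logic above: one input line in,
// a list of (word, "1") pairs out.
public class MapSketch {

    public static List<String[]> map(String line) {
        List<String[]> pairs = new ArrayList<>();
        for (String word : line.split(" ")) {
            // each word becomes a key paired with the count 1
            pairs.add(new String[] { word, "1" });
        }
        return pairs;
    }

    public static void main(String[] args) {
        for (String[] pair : map("hello java hello")) {
            System.out.println(pair[0] + "\t" + pair[1]);
        }
    }
}
```

In the real MapperTask the framework supplies the line via `value` and the pairs go out through `context.write` instead of a list.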

Creating the ReducerTask

Create a Java class that extends the Reducer base class.

Parameter : Description
KEYIN     : matches the map phase's KEYOUT
VALUEIN   : matches the map phase's VALUEOUT
KEYOUT    : output key type of the reduce logic
VALUEOUT  : output value type of the reduce logic

package com.bobo.mr.wc;

import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

/**
 * KEYIN and VALUEIN match the map phase's KEYOUT and VALUEOUT.
 *
 * KEYOUT:   output key type of the reduce logic
 * VALUEOUT: output value type of the reduce logic
 *
 * @author 波波烤鸭
 * dengpbs@163.com
 */
public class MyReducerTask extends Reducer<Text, IntWritable, Text, IntWritable> {

    /**
     * @param key     a key emitted by the map phase
     * @param values  all values the map phase emitted under that same key
     * @param context the job context
     */
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int count = 0;
        // add up the occurrences of this word
        for (IntWritable value : values) {
            count += value.get();
        }
        context.write(key, new IntWritable(count));
    }
}
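The reduce step is just "sum the 1s grouped under each word". The same idea in plain Java, without the framework's grouping (ReduceSketch is a made-up name for illustration):

```java
import java.util.Arrays;
import java.util.Map;
import java.util.TreeMap;

// Plain-Java sketch of the reduce logic: count how many times
// each word appears, mimicking what the reducer sees after the
// framework has grouped the map output by key.
public class ReduceSketch {

    public static Map<String, Integer> reduce(Iterable<String> words) {
        Map<String, Integer> counts = new TreeMap<>();
        for (String w : words) {
            // equivalent of count += value.get() per key
            counts.merge(w, 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        System.out.println(reduce(Arrays.asList("hello", "java", "hello")));
    }
}
```

The difference in the real job is that the framework hands the reducer one key at a time with an Iterable of its values, so the reducer only sums; the grouping happens in the shuffle.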

Creating the driver class

package com.bobo.mr.wc;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WcTest {

    public static void main(String[] args) throws Exception {
        // create the configuration object
        Configuration conf = new Configuration(true);
        // obtain the Job object
        Job job = Job.getInstance(conf);
        // register the jar that carries this job
        job.setJarByClass(WcTest.class);
        // set the classes that handle the map and reduce phases
        job.setMapperClass(MyMapperTask.class);
        job.setReducerClass(MyReducerTask.class);
        // set the map-phase output types
        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(IntWritable.class);
        // input and output paths for the job, passed in as arguments
        FileInputFormat.setInputPaths(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        // submit the job and wait for it to finish
        job.waitForCompletion(true);
    }
}

Packaging and deployment

Package the project as a jar with Maven.

Upload and test

Create a wordcount directory in HDFS and stage the test files:

hadoop fs -mkdir -p /hdfs/wordcount/input
hadoop fs -put a.txt b.txt /hdfs/wordcount/input/

Run the job

hadoop jar hadoop-demo-0.0.1-SNAPSHOT.jar com.bobo.mr.wc.WcTest /hdfs/wordcount/input /hdfs/wordcount/output/

The run succeeds:

[root@hadoop-node01 ~]# hadoop jar hadoop-demo-0.0.1-SNAPSHOT.jar com.bobo.mr.wc.WcTest /hdfs/wordcount/input /hdfs/wordcount/output/
19/04/03 16:56:43 INFO client.RMProxy: Connecting to ResourceManager at hadoop-node01/192.168.88.61:8032
19/04/03 16:56:46 WARN mapreduce.JobResourceUploader: Hadoop command-line option parsing not performed. Implement the Tool interface and execute your application with ToolRunner to remedy this.
19/04/03 16:56:48 INFO input.FileInputFormat: Total input paths to process : 2
19/04/03 16:56:49 INFO mapreduce.JobSubmitter: number of splits:2
19/04/03 16:56:51 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1554281786018_0001
19/04/03 16:56:52 INFO impl.YarnClientImpl: Submitted application application_1554281786018_0001
19/04/03 16:56:53 INFO mapreduce.Job: The url to track the job:
19/04/03 16:56:53 INFO mapreduce.Job: Running job: job_1554281786018_0001
19/04/03 16:57:14 INFO mapreduce.Job: Job job_1554281786018_0001 running in uber mode : false
19/04/03 16:57:14 INFO mapreduce.Job:  map 0% reduce 0%
19/04/03 16:57:38 INFO mapreduce.Job:  map 100% reduce 0%
19/04/03 16:57:56 INFO mapreduce.Job:  map 100% reduce 100%
19/04/03 16:57:57 INFO mapreduce.Job: Job job_1554281786018_0001 completed successfully
19/04/03 16:57:57 INFO mapreduce.Job: Counters: 50
	File System Counters
		FILE: Number of bytes read=181
		FILE: Number of bytes written=321388
		FILE: Number of read operations=0
		FILE: Number of large read operations=0
		FILE: Number of write operations=0
		HDFS: Number of bytes read=325
		HDFS: Number of bytes written=87
		HDFS: Number of read operations=9
		HDFS: Number of large read operations=0
		HDFS: Number of write operations=2
	Job Counters
		Launched map tasks=2
		Launched reduce tasks=1
		Data-local map tasks=1
		Rack-local map tasks=1
		Total time spent by all maps in occupied slots (ms)=46511
		Total time spent by all reduces in occupied slots (ms)=12763
		Total time spent by all map tasks (ms)=46511
		Total time spent by all reduce tasks (ms)=12763
		Total vcore-milliseconds taken by all map tasks=46511
		Total vcore-milliseconds taken by all reduce tasks=12763
		Total megabyte-milliseconds taken by all map tasks=47627264
		Total megabyte-milliseconds taken by all reduce tasks=13069312
	Map-Reduce Framework
		Map input records=14
		Map output records=14
		Map output bytes=147
		Map output materialized bytes=187
		Input split bytes=234
		Combine input records=0
		Combine output records=0
		Reduce input groups=10
		Reduce shuffle bytes=187
		Reduce input records=14
		Reduce output records=10
		Spilled Records=28
		Shuffled Maps =2
		Failed Shuffles=0
		Merged Map outputs=2
		GC time elapsed (ms)=1049
		CPU time spent (ms)=5040
		Physical memory (bytes) snapshot=343056384
		Virtual memory (bytes) snapshot=6182891520
		Total committed heap usage (bytes)=251813888
	Shuffle Errors
		BAD_ID=0
		CONNECTION=0
		IO_ERROR=0
		WRONG_LENGTH=0
		WRONG_MAP=0
		WRONG_REDUCE=0
	File Input Format Counters
		Bytes Read=91
	File Output Format Counters
		Bytes Written=87

Check the results

[root@hadoop-node01 ~]# hadoop fs -cat /hdfs/wordcount/output/part-r-00000
ajax	1
bobo烤鸭	1
hello	2
java	2
mybatis	1
name	1
php	1
shell	2
spring	2
springmvc	1

OK~

