Hadoop: A Summary of Errors and Solutions

Reader submission, 2022-11-20


1. Error: java.io.IOException: Incompatible clusterIDs (often appears after the namenode has been reformatted)

2014-04-29 14:32:53,877 FATAL org.apache.hadoop.hdfs.server.datanode.DataNode: Initialization failed for block pool Block pool BP-1480406410-192.168.1.181-1398701121586 (storage id DS-167510828-192.168.1.191-50010-1398750515421) service to hadoop-master/192.168.1.181:9000
java.io.IOException: Incompatible clusterIDs in /data/dfs/data: namenode clusterID = CID-d1448b9e-da0f-499e-b1d4-78cb18ecdebb; datanode clusterID = CID-ff0faa40-2940-4838-b321-98272eb0dee3
    at org.apache.hadoop.hdfs.server.datanode.DataStorage.doTransition(DataStorage.java:391)
    at org.apache.hadoop.hdfs.server.datanode.DataStorage.recoverTransitionRead(DataStorage.java:191)
    at org.apache.hadoop.hdfs.server.datanode.DataStorage.recoverTransitionRead(DataStorage.java:219)
    at org.apache.hadoop.hdfs.server.datanode.DataNode.initStorage(DataNode.java:837)
    at org.apache.hadoop.hdfs.server.datanode.DataNode.initBlockPool(DataNode.java:808)
    at org.apache.hadoop.hdfs.server.datanode.BPOfferService.verifyAndSetNamespaceInfo(BPOfferService.java:280)
    at org.apache.hadoop.hdfs.server.datanode.BPServiceActor.connectToNNAndHandshake(BPServiceActor.java:222)
    at org.apache.hadoop.hdfs.server.datanode.BPServiceActor.run(BPServiceActor.java:664)
    at java.lang.Thread.run(Thread.java:722)
2014-04-29 14:32:53,885 WARN org.apache.hadoop.hdfs.server.datanode.DataNode: Ending block pool service for: Block pool BP-1480406410-192.168.1.181-1398701121586 (storage id DS-167510828-192.168.1.191-50010-1398750515421) service to hadoop-master/192.168.1.181:9000
2014-04-29 14:32:53,889 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: Removed Block pool BP-1480406410-192.168.1.181-1398701121586 (storage id DS-167510828-192.168.1.191-50010-1398750515421)
2014-04-29 14:32:55,897 WARN org.apache.hadoop.hdfs.server.datanode.DataNode: Exiting Datanode

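Cause:

Every namenode format generates a new clusterID, while the datanode's data directory still holds the clusterID from the previous format. Formatting clears the namenode's own data but not the datanodes' data, so the datanode fails at startup. To avoid this, clear everything under the data directory before each format.

Solution: stop the cluster and delete everything under the data directory on the affected node, that is, the directory configured as dfs.data.dir in hdfs-site.xml (dfs.datanode.data.dir in 2.x). Then reformat the namenode.

A less disruptive alternative: stop the cluster, then edit the clusterID in /dfs/data/current/VERSION on the datanode so that it matches the namenode's.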
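The VERSION edit can also be scripted. The following is a minimal sketch, assuming both VERSION files are plain key=value text with a single clusterID line; the function names and the demo file names are hypothetical, and the real paths depend on your name and data directory settings.

```python
import re
import tempfile
from pathlib import Path


def read_cluster_id(version_file: Path) -> str:
    """Return the clusterID value from a key=value VERSION file."""
    for line in version_file.read_text().splitlines():
        key, _, value = line.partition("=")
        if key.strip() == "clusterID":
            return value.strip()
    raise ValueError(f"no clusterID entry in {version_file}")


def sync_cluster_id(namenode_version: Path, datanode_version: Path) -> str:
    """Rewrite the datanode's clusterID line to match the namenode's."""
    cluster_id = read_cluster_id(namenode_version)
    text = datanode_version.read_text()
    # Replace exactly one clusterID line; anything else is left untouched.
    new_text, count = re.subn(
        r"(?m)^clusterID=.*$", lambda _m: f"clusterID={cluster_id}", text
    )
    if count != 1:
        raise ValueError(f"expected one clusterID line in {datanode_version}, found {count}")
    datanode_version.write_text(new_text)
    return cluster_id


if __name__ == "__main__":
    # Demo on throwaway files using the two IDs from the log above.
    with tempfile.TemporaryDirectory() as tmp:
        nn = Path(tmp, "namenode_VERSION")
        dn = Path(tmp, "datanode_VERSION")
        nn.write_text("clusterID=CID-d1448b9e-da0f-499e-b1d4-78cb18ecdebb\n")
        dn.write_text("clusterID=CID-ff0faa40-2940-4838-b321-98272eb0dee3\n")
        print(sync_cluster_id(nn, dn))
```

Run it only while the cluster is stopped, and keep a copy of the original VERSION file.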

2. Error: org.apache.hadoop.yarn.exceptions.YarnException: Unauthorized request to start container

14/04/29 02:45:07 INFO mapreduce.Job: Job job_1398704073313_0021 failed with state FAILED due to: Application application_1398704073313_0021 failed 2 times due to Error launching appattempt_1398704073313_0021_000002. Got exception: org.apache.hadoop.yarn.exceptions.YarnException: Unauthorized request to start container.
This token is expired. current time is 1398762692768 found 1398711306590
    at sun.reflect.GeneratedConstructorAccessor30.newInstance(Unknown Source)
    at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
    at java.lang.reflect.Constructor.newInstance(Constructor.java:525)
    at org.apache.hadoop.yarn.api.records.impl.pb.SerializedExceptionPBImpl.instantiateException(SerializedExceptionPBImpl.java:152)
    at org.apache.hadoop.yarn.api.records.impl.pb.SerializedExceptionPBImpl.deSerialize(SerializedExceptionPBImpl.java:106)
    at org.apache.hadoop.yarn.server.resourcemanager.amlauncher.AMLauncher.launch(AMLauncher.java:122)
    at org.apache.hadoop.yarn.server.resourcemanager.amlauncher.AMLauncher.run(AMLauncher.java:249)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
    at java.lang.Thread.run(Thread.java:722)
. Failing the application.
14/04/29 02:45:07 INFO mapreduce.Job: Counters: 0

Cause: the clocks on the namenode and datanode hosts are out of sync.

Solution: synchronize the clock of every datanode with the namenode. On each server, run ntpdate time.nist.gov and confirm that the synchronization succeeded.

It is best to also add this line to /etc/crontab on every server:

0 2 * * * root ntpdate time.nist.gov && hwclock -w
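To gauge how far apart the clocks are, the two millisecond timestamps in the "This token is expired" message can be subtracted. This is a small sketch, assuming the message keeps the wording shown in the log above; the helper name is hypothetical, and the result is the distance between the checking host's clock and the token's expiry time, so it only approximates the real skew.

```python
import re

# Matches the two epoch-millisecond values in the YARN error message.
TOKEN_EXPIRED = re.compile(r"current time is\s*(\d+) found (\d+)")


def token_skew_ms(message: str) -> int:
    """Return (current time - token expiry) in milliseconds."""
    match = TOKEN_EXPIRED.search(message)
    if match is None:
        raise ValueError("message does not contain the two timestamps")
    current_ms, expiry_ms = (int(g) for g in match.groups())
    return current_ms - expiry_ms


if __name__ == "__main__":
    msg = "This token is expired. current time is 1398762692768 found 1398711306590"
    skew = token_skew_ms(msg)
    print(f"{skew} ms = {skew / 3_600_000:.2f} hours")  # 51386178 ms = 14.27 hours
```

A gap of many hours, as here, points to a clock or time zone problem rather than a slow job.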

3. Error: java.net.SocketTimeoutException: 480000 millis timeout while waiting for channel to be ready for write

2014-05-06 14:28:09,386 ERROR org.apache.hadoop.hdfs.server.datanode.DataNode: hadoop-datanode1:50010:DataXceiver error processing READ_BLOCK operation  src: /192.168.1.191:48854 dest: /192.168.1.191:50010
java.net.SocketTimeoutException: 480000 millis timeout while waiting for channel to be ready for write. ch : java.nio.channels.SocketChannel[connected local=/192.168.1.191:50010 remote=/192.168.1.191:48854]
    at org.apache.hadoop.net.SocketIOWithTimeout.waitForIO(SocketIOWithTimeout.java:246)
    at org.apache.hadoop.net.SocketOutputStream.waitForWritable(SocketOutputStream.java:172)
    at org.apache.hadoop.net.SocketOutputStream.transferToFully(SocketOutputStream.java:220)
    at org.apache.hadoop.hdfs.server.datanode.BlockSender.sendPacket(BlockSender.java:546)
    at org.apache.hadoop.hdfs.server.datanode.BlockSender.sendBlock(BlockSender.java:710)
    at org.apache.hadoop.hdfs.server.datanode.DataXceiver.readBlock(DataXceiver.java:340)
    at org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.opReadBlock(Receiver.java:101)
    at org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.processOp(Receiver.java:65)
    at org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:221)
    at java.lang.Thread.run(Thread.java:722)

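Cause: I/O timeout. The 480000 ms in the message is the default value of dfs.datanode.socket.write.timeout.

Solution:

Edit the Hadoop configuration file hdfs-site.xml and set the two properties dfs.datanode.socket.write.timeout and dfs.socket.timeout (the latter is named dfs.client.socket-timeout in 2.x, where the old name remains as a deprecated alias).

<property>
  <name>dfs.datanode.socket.write.timeout</name>
  <value>6000000</value>
</property>

<property>
  <name>dfs.socket.timeout</name>
  <value>6000000</value>
</property>

Note: the timeout values are in milliseconds. 0 means no limit.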
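Property edits like these can be applied with a short script instead of by hand. Below is a minimal sketch using Python's standard xml.etree.ElementTree, assuming the file has a <configuration> root holding <property> elements with <name> and <value> children; set_properties is a hypothetical helper, not part of Hadoop.

```python
import tempfile
import xml.etree.ElementTree as ET
from pathlib import Path
from typing import Mapping


def set_properties(site_xml: Path, values: Mapping[str, str]) -> None:
    """Add or update <property> entries in a Hadoop *-site.xml file."""
    tree = ET.parse(str(site_xml))
    root = tree.getroot()  # assumed to be <configuration>
    existing = {p.findtext("name"): p for p in root.findall("property")}
    for name, value in values.items():
        prop = existing.get(name)
        if prop is None:
            # Property not present yet: create <property><name>..</name></property>.
            prop = ET.SubElement(root, "property")
            ET.SubElement(prop, "name").text = name
        value_el = prop.find("value")
        if value_el is None:
            value_el = ET.SubElement(prop, "value")
        value_el.text = value
    tree.write(str(site_xml), encoding="utf-8", xml_declaration=True)


if __name__ == "__main__":
    # Demo on a throwaway file; point it at a copy of hdfs-site.xml in practice.
    with tempfile.TemporaryDirectory() as tmp:
        path = Path(tmp, "hdfs-site.xml")
        path.write_text("<configuration></configuration>")
        set_properties(path, {
            "dfs.datanode.socket.write.timeout": "6000000",
            "dfs.socket.timeout": "6000000",
        })
        print(path.read_text())
```

ElementTree drops XML comments when it parses, so work on a copy if the original file is annotated.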

4. Error: DataXceiver error processing WRITE_BLOCK operation

2014-05-06 15:21:30,378 ERROR org.apache.hadoop.hdfs.server.datanode.DataNode: hadoop-datanode1:50010:DataXceiver error processing WRITE_BLOCK operation  src: /192.168.1.193:34147 dest: /192.168.1.191:50010
java.io.IOException: Premature EOF from inputStream
    at org.apache.hadoop.io.IOUtils.readFully(IOUtils.java:194)
    at org.apache.hadoop.hdfs.protocol.datatransfer.PacketReceiver.doReadFully(PacketReceiver.java:213)
    at org.apache.hadoop.hdfs.protocol.datatransfer.PacketReceiver.doRead(PacketReceiver.java:134)
    at org.apache.hadoop.hdfs.protocol.datatransfer.PacketReceiver.receiveNextPacket(PacketReceiver.java:109)
    at org.apache.hadoop.hdfs.server.datanode.BlockReceiver.receivePacket(BlockReceiver.java:435)
    at org.apache.hadoop.hdfs.server.datanode.BlockReceiver.receiveBlock(BlockReceiver.java:693)
    at org.apache.hadoop.hdfs.server.datanode.DataXceiver.writeBlock(DataXceiver.java:569)
    at org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.opWriteBlock(Receiver.java:115)
    at org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.processOp(Receiver.java:68)
    at org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:221)
    at java.lang.Thread.run(Thread.java:722)

Cause: the file operation outlived its lease. In practice, the file was deleted while the data stream was still operating on it.

Solution:

Edit hdfs-site.xml (this applies to 2.x; in 1.x the property name should be dfs.datanode.max.xcievers):

<property>
  <name>dfs.datanode.max.transfer.threads</name>
  <value>8192</value>
</property>

Copy the file to every datanode and restart the datanodes.

5. Error: java.io.IOException: Failed to replace a bad datanode on the existing pipeline due to no more good datanodes being available to try.

2014-05-07 12:21:41,820 WARN [Thread-115] org.apache.hadoop.mapreduce.v2.app.MRAppMaster: Graceful stop failed
org.apache.hadoop.yarn.exceptions.YarnRuntimeException: java.io.IOException: Failed to replace a bad datanode on the existing pipeline due to no more good datanodes being available to try. (Nodes: current=[192.168.1.191:50010, 192.168.1.192:50010], original=[192.168.1.191:50010, 192.168.1.192:50010]). The current failed datanode replacement policy is DEFAULT, and a client may configure this via 'dfs.client.block.write.replace-datanode-on-failure.policy' in its configuration.
    at org.apache.hadoop.mapreduce.jobhistory.JobHistoryEventHandler.handleEvent(JobHistoryEventHandler.java:514)
    at org.apache.hadoop.mapreduce.jobhistory.JobHistoryEventHandler.serviceStop(JobHistoryEventHandler.java:332)
    at org.apache.hadoop.service.AbstractService.stop(AbstractService.java:221)
    at org.apache.hadoop.service.ServiceOperations.stop(ServiceOperations.java:52)
    at org.apache.hadoop.service.ServiceOperations.stopQuietly(ServiceOperations.java:80)
    at org.apache.hadoop.service.CompositeService.stop(CompositeService.java:159)
    at org.apache.hadoop.service.CompositeService.serviceStop(CompositeService.java:132)
    at org.apache.hadoop.service.AbstractService.stop(AbstractService.java:221)
    at org.apache.hadoop.mapreduce.v2.app.MRAppMaster.shutDownJob(MRAppMaster.java:548)
    at org.apache.hadoop.mapreduce.v2.app.MRAppMaster$JobFinishEventHandler$1.run(MRAppMaster.java:599)
Caused by: java.io.IOException: Failed to replace a bad datanode on the existing pipeline due to no more good datanodes being available to try. (Nodes: current=[192.168.1.191:50010, 192.168.1.192:50010], original=[192.168.1.191:50010, 192.168.1.192:50010]). The current failed datanode replacement policy is DEFAULT, and a client may configure this via 'dfs.client.block.write.replace-datanode-on-failure.policy' in its configuration.
    at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.findNewDatanode(DFSOutputStream.java:860)
    at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.addDatanode2ExistingPipeline(DFSOutputStream.java:925)
    at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.setupPipelineForAppendOrRecovery(DFSOutputStream.java:1031)
    at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.processDatanodeError(DFSOutputStream.java:823)
    at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.run(DFSOutputStream.java:475)

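Cause: the write cannot complete. My environment has 3 datanodes and a replication factor of 3, so a write goes through a pipeline of 3 machines. The default replace-datanode-on-failure.policy is DEFAULT: with 3 or more datanodes in the pipeline (replication of 3 or more), the client looks for another datanode to copy the data to. Since the cluster has only 3 machines, there is no spare node, so as soon as one datanode has a problem the write can never succeed.

Solution: edit hdfs-site.xml and add or change the following two properties:

<property>
  <name>dfs.client.block.write.replace-datanode-on-failure.enable</name>
  <value>true</value>
</property>

<property>
  <name>dfs.client.block.write.replace-datanode-on-failure.policy</name>
  <value>NEVER</value>
</property>

dfs.client.block.write.replace-datanode-on-failure.enable controls whether the client applies a replacement policy when a write fails. The default, true, is fine.

For dfs.client.block.write.replace-datanode-on-failure.policy, DEFAULT tries to swap in a new datanode when there are 3 or more replicas. With 2 replicas it does not replace the datanode and simply carries on writing. In a cluster of 3 datanodes, a single unresponsive node is enough to break writes, so the replacement can be turned off with NEVER.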
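The DEFAULT rule can be written out as a small predicate. This sketch paraphrases the description of DEFAULT in hdfs-default.xml as I understand it (add a datanode only if the replication r is at least 3 and either floor(r/2) >= n, or r > n and the block has been hflushed or appended, where n is the number of datanodes left in the pipeline). It is not the HDFS client code, and the function name is hypothetical.

```python
def default_policy_adds_datanode(replication: int, existing: int,
                                 hflushed_or_appended: bool) -> bool:
    """Paraphrase of the documented DEFAULT replace-datanode-on-failure rule.

    replication: configured replication factor r
    existing: number n of datanodes still alive in the write pipeline
    """
    if replication < 3:
        # With fewer than 3 replicas the client keeps writing to what is left.
        return False
    return (replication // 2 >= existing
            or (replication > existing and hflushed_or_appended))


if __name__ == "__main__":
    # r=3, one of three pipeline nodes lost, stream already hflushed:
    # the client wants a replacement, which a 3-node cluster cannot supply.
    print(default_policy_adds_datanode(3, 2, True))   # True
    print(default_policy_adds_datanode(3, 2, False))  # False
    print(default_policy_adds_datanode(2, 1, True))   # False
```

Under this reading, setting the policy to NEVER simply skips the check, so the write continues on the surviving datanodes.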

6. Error: org.apache.hadoop.util.DiskChecker$DiskErrorException: Could not find any valid local directory for

14/05/08 18:24:59 INFO mapreduce.Job: Task Id : attempt_1399539856880_0016_m_000029_2, Status : FAILED
Error: org.apache.hadoop.util.DiskChecker$DiskErrorException: Could not find any valid local directory for attempt_1399539856880_0016_m_000029_2_spill_0.out
    at org.apache.hadoop.fs.LocalDirAllocator$AllocatorPerContext.getLocalPathForWrite(LocalDirAllocator.java:398)
    at org.apache.hadoop.fs.LocalDirAllocator.getLocalPathForWrite(LocalDirAllocator.java:150)
    at org.apache.hadoop.fs.LocalDirAllocator.getLocalPathForWrite(LocalDirAllocator.java:131)
    at org.apache.hadoop.mapred.YarnOutputFiles.getSpillFileForWrite(YarnOutputFiles.java:159)
    at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.sortAndSpill(MapTask.java:1573)
    at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.flush(MapTask.java:1467)
    at org.apache.hadoop.mapred.MapTask$NewOutputCollector.close(MapTask.java:699)
    at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:769)
    at org.apache.hadoop.mapred.MapTask.run(MapTask.java:339)
    at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:162)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:415)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1491)
    at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:157)
Container killed by the ApplicationMaster.

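Cause: two possibilities. Either hadoop.tmp.dir or the data directory has run out of space.

Solution: my dfs status showed data usage below 40%, so I concluded that hadoop.tmp.dir was short of space and the job's temporary files could not be created. core-site.xml turned out to have no hadoop.tmp.dir setting, so the default location under /tmp (/tmp/hadoop-${user.name}) was in use. Data in that directory is lost whenever the server restarts, so the setting needs to change. Add:

<property>
  <name>hadoop.tmp.dir</name>
  <value>/data/tmp</value>
</property>

Then reformat the namenode, which discards the existing HDFS metadata: hadoop namenode -format

Restart.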

7. Error: java.io.IOException: Spill failed

2014-06-19 10:00:32,181 INFO [org.apache.hadoop.mapred.MapTask] - Ignoring exception during close for org.apache.hadoop.mapred.MapTask$NewOutputCollector@17bda0f2
java.io.IOException: Spill failed
    at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.checkSpillException(MapTask.java:1540)
    at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.flush(MapTask.java:1447)
    at org.apache.hadoop.mapred.MapTask$NewOutputCollector.close(MapTask.java:699)
    at org.apache.hadoop.mapred.MapTask.closeQuietly(MapTask.java:1997)
    at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:773)
    at org.apache.hadoop.mapred.MapTask.run(MapTask.java:339)
    at org.apache.hadoop.mapred.LocalJobRunner$Job$MapTaskRunnable.run(LocalJobRunner.java:235)
    at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
    at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:334)
    at java.util.concurrent.FutureTask.run(FutureTask.java:166)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1110)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:603)
    at java.lang.Thread.run(Thread.java:722)
Caused by: org.apache.hadoop.util.DiskChecker$DiskErrorException: Could not find any valid local directory for output/spill0.out
    at org.apache.hadoop.fs.LocalDirAllocator$AllocatorPerContext.getLocalPathForWrite(LocalDirAllocator.java:398)
    at org.apache.hadoop.fs.LocalDirAllocator.getLocalPathForWrite(LocalDirAllocator.java:150)
    at org.apache.hadoop.fs.LocalDirAllocator.getLocalPathForWrite(LocalDirAllocator.java:131)
    at org.apache.hadoop.mapred.MROutputFiles.getSpillFileForWrite(MROutputFiles.java:146)
    at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.sortAndSpill(MapTask.java:1573)
    at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.access$900(MapTask.java:852)
    at org.apache.hadoop.mapred.MapTask$MapOutputBuffer$SpillThread.run(MapTask.java:1510)

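Cause: the local disk, not HDFS, is out of space (I was debugging the program in MyEclipse and the local tmp directory had filled up).

Solution: clean up the disk or add more space.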
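A quick way to tell which local directory is short of space is to check each candidate with the standard library. This is a minimal sketch; the directories to pass in (for example the hadoop.tmp.dir location) are your own, and both function names and the 10% threshold are illustrative assumptions.

```python
import shutil
import tempfile
from typing import Dict, Iterable, List


def free_fraction(paths: Iterable[str]) -> Dict[str, float]:
    """Map each path to the free fraction (0.0 to 1.0) of its filesystem."""
    result = {}
    for path in paths:
        usage = shutil.disk_usage(path)
        if usage.total > 0:  # skip pseudo-filesystems that report no size
            result[path] = usage.free / usage.total
    return result


def nearly_full(paths: Iterable[str], min_free: float = 0.10) -> List[str]:
    """Return the paths whose filesystem has less than min_free space left."""
    return [p for p, frac in free_fraction(paths).items() if frac < min_free]


if __name__ == "__main__":
    # The system temp directory stands in for the real Hadoop local dirs.
    dirs = [tempfile.gettempdir()]
    for path, frac in free_fraction(dirs).items():
        print(f"{path}: {frac:.0%} free")
    print("nearly full:", nearly_full(dirs))
```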

8. Error: java.io.IOException: Spill failed (disks look far from full afterwards)

2014-06-23 10:21:01,479 INFO [IPC Server handler 3 on 45207] org.apache.hadoop.mapred.TaskAttemptListenerImpl: Progress of TaskAttempt attempt_1403488126955_0002_m_000000_0 is : 0.30801716
2014-06-23 10:21:01,512 FATAL [IPC Server handler 2 on 45207] org.apache.hadoop.mapred.TaskAttemptListenerImpl: Task: attempt_1403488126955_0002_m_000000_0 - exited : java.io.IOException: Spill failed
    at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.checkSpillException(MapTask.java:1540)
    at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.collect(MapTask.java:1063)
    at org.apache.hadoop.mapred.MapTask$NewOutputCollector.write(MapTask.java:691)
    at org.apache.hadoop.mapreduce.task.TaskInputOutputContextImpl.write(TaskInputOutputContextImpl.java:89)
    at org.apache.hadoop.mapreduce.lib.map.WrappedMapper$Context.write(WrappedMapper.java:112)
    at com.mediadc.hadoop.MediaIndex$SecondMapper.map(MediaIndex.java:180)
    at com.mediadc.hadoop.MediaIndex$SecondMapper.map(MediaIndex.java:1)
    at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:145)
    at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:763)
    at org.apache.hadoop.mapred.MapTask.run(MapTask.java:339)
    at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:162)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:415)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1491)
    at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:157)
Caused by: org.apache.hadoop.util.DiskChecker$DiskErrorException: Could not find any valid local directory for attempt_1403488126955_0002_m_000000_0_spill_53.out
    at org.apache.hadoop.fs.LocalDirAllocator$AllocatorPerContext.getLocalPathForWrite(LocalDirAllocator.java:398)
    at org.apache.hadoop.fs.LocalDirAllocator.getLocalPathForWrite(LocalDirAllocator.java:150)
    at org.apache.hadoop.fs.LocalDirAllocator.getLocalPathForWrite(LocalDirAllocator.java:131)
    at org.apache.hadoop.mapred.YarnOutputFiles.getSpillFileForWrite(YarnOutputFiles.java:159)
    at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.sortAndSpill(MapTask.java:1573)
    at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.access$900(MapTask.java:852)
    at org.apache.hadoop.mapred.MapTask$MapOutputBuffer$SpillThread.run(MapTask.java:1510)
2014-06-23 10:21:01,513 INFO [IPC Server handler 2 on 45207] org.apache.hadoop.mapred.TaskAttemptListenerImpl: Diagnostics report from attempt_1403488126955_0002_m_000000_0: Error: java.io.IOException: Spill failed
    at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.checkSpillException(MapTask.java:1540)
    at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.collect(MapTask.java:1063)
    at org.apache.hadoop.mapred.MapTask$NewOutputCollector.write(MapTask.java:691)
    at org.apache.hadoop.mapreduce.task.TaskInputOutputContextImpl.write(TaskInputOutputContextImpl.java:89)
    at org.apache.hadoop.mapreduce.lib.map.WrappedMapper$Context.write(WrappedMapper.java:112)
    at com.mediadc.hadoop.MediaIndex$SecondMapper.map(MediaIndex.java:180)
    at com.mediadc.hadoop.MediaIndex$SecondMapper.map(MediaIndex.java:1)
    at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:145)
    at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:763)
    at org.apache.hadoop.mapred.MapTask.run(MapTask.java:339)
    at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:162)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:415)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1491)
    at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:157)
Caused by: org.apache.hadoop.util.DiskChecker$DiskErrorException: Could not find any valid local directory for attempt_1403488126955_0002_m_000000_0_spill_53.out
    at org.apache.hadoop.fs.LocalDirAllocator$AllocatorPerContext.getLocalPathForWrite(LocalDirAllocator.java:398)
    at org.apache.hadoop.fs.LocalDirAllocator.getLocalPathForWrite(LocalDirAllocator.java:150)
    at org.apache.hadoop.fs.LocalDirAllocator.getLocalPathForWrite(LocalDirAllocator.java:131)
    at org.apache.hadoop.mapred.YarnOutputFiles.getSpillFileForWrite(YarnOutputFiles.java:159)
    at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.sortAndSpill(MapTask.java:1573)
    at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.access$900(MapTask.java:852)
    at org.apache.hadoop.mapred.MapTask$MapOutputBuffer$SpillThread.run(MapTask.java:1510)
2014-06-23 10:21:01,514 INFO [AsyncDispatcher event handler] org.apache.hadoop.mapreduce.v2.app.job.impl.TaskAttemptImpl: Diagnostics report from attempt_1403488126955_0002_m_000000_0: Error: java.io.IOException: Spill failed
    at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.checkSpillException(MapTask.java:1540)
    at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.collect(MapTask.java:1063)
    at org.apache.hadoop.mapred.MapTask$NewOutputCollector.write(MapTask.java:691)
    at org.apache.hadoop.mapreduce.task.TaskInputOutputContextImpl.write(TaskInputOutputContextImpl.java:89)
    at org.apache.hadoop.mapreduce.lib.map.WrappedMapper$Context.write(WrappedMapper.java:112)
    at com.mediadc.hadoop.MediaIndex$SecondMapper.map(MediaIndex.java:180)
    at com.mediadc.hadoop.MediaIndex$SecondMapper.map(MediaIndex.java:1)
    at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:145)
    at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:763)
    at org.apache.hadoop.mapred.MapTask.run(MapTask.java:339)
    at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:162)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:415)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1491)
    at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:157)
Caused by: org.apache.hadoop.util.DiskChecker$DiskErrorException: Could not find any valid local directory for attempt_1403488126955_0002_m_000000_0_spill_53.out
    at org.apache.hadoop.fs.LocalDirAllocator$AllocatorPerContext.getLocalPathForWrite(LocalDirAllocator.java:398)
    at org.apache.hadoop.fs.LocalDirAllocator.getLocalPathForWrite(LocalDirAllocator.java:150)
    at org.apache.hadoop.fs.LocalDirAllocator.getLocalPathForWrite(LocalDirAllocator.java:131)
    at org.apache.hadoop.mapred.YarnOutputFiles.getSpillFileForWrite(YarnOutputFiles.java:159)
    at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.sortAndSpill(MapTask.java:1573)
    at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.access$900(MapTask.java:852)
    at org.apache.hadoop.mapred.MapTask$MapOutputBuffer$SpillThread.run(MapTask.java:1510)
2014-06-23 10:21:01,516 INFO [AsyncDispatcher event handler] org.apache.hadoop.mapreduce.v2.app.job.impl.TaskAttemptImpl: attempt_1403488126955_0002_m_000000_0 TaskAttempt Transitioned from RUNNING to FAIL_CONTAINER_CLEANUP

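The error is clear: the disk ran out of space. The frustrating part was that when I logged in to each node, disk usage was below 40%, with plenty of room left.

It took a long time to work out what was happening. One map task produces a lot of output, and before the failure its disk usage climbs steadily until it reaches 100% and the error is raised. The task then fails, its space is released, and the task is reassigned to another node. Because the space has already been released, the disks show plenty of free space by the time you look, even though the error reported a shortage.

The lesson here is that monitoring while the job is running matters.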
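One way to catch a transient spike like this is to sample disk usage while the job runs and keep the peak. The following is a rough sketch using only the standard library; the function name, sample count, and interval are illustrative assumptions, and a real cluster would more likely rely on its existing monitoring system.

```python
import shutil
import tempfile
import time


def watch_peak_usage(path: str, samples: int, interval_s: float) -> float:
    """Sample the used fraction of path's filesystem and return the peak seen."""
    peak = 0.0
    for _ in range(samples):
        usage = shutil.disk_usage(path)
        used_fraction = usage.used / usage.total
        peak = max(peak, used_fraction)
        print(f"{time.strftime('%H:%M:%S')} {path} {used_fraction:.1%} used")
        time.sleep(interval_s)
    return peak


if __name__ == "__main__":
    # Three quick samples of the system temp directory as a stand-in.
    peak = watch_peak_usage(tempfile.gettempdir(), samples=3, interval_s=1.0)
    print(f"peak usage: {peak:.1%}")
```

Run on the node where the task executes, for the whole duration of the job, this would have shown usage approaching 100% just before the failure.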

9. Error: java.io.IOException: There appears to be a gap in the edit log

2015-04-07 23:12:39,837 INFO org.apache.hadoop.metrics2.impl.MetricsSystemImpl: NameNode metrics system shutdown complete.
2015-04-07 23:12:39,838 FATAL org.apache.hadoop.hdfs.server.namenode.NameNode: Exception in namenode join
java.io.IOException: There appears to be a gap in the edit log.  We expected txid 1, but got txid 41.
    at org.apache.hadoop.hdfs.server.namenode.MetaRecoveryContext.editLogLoaderPrompt(MetaRecoveryContext.java:94)
    at org.apache.hadoop.hdfs.server.namenode.FSEditLogLoader.loadEditRecords(FSEditLogLoader.java:184)
    at org.apache.hadoop.hdfs.server.namenode.FSEditLogLoader.loadFSEdits(FSEditLogLoader.java:112)
    at org.apache.hadoop.hdfs.server.namenode.FSImage.loadEdits(FSImage.java:733)
    at org.apache.hadoop.hdfs.server.namenode.FSImage.loadFSImage(FSImage.java:647)
    at org.apache.hadoop.hdfs.server.namenode.FSImage.recoverTransitionRead(FSImage.java:264)
    at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.loadFSImage(FSNamesystem.java:787)
    at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.loadFromDisk(FSNamesystem.java:568)
    at org.apache.hadoop.hdfs.server.namenode.NameNode.loadNamesystem(NameNode.java:443)
    at org.apache.hadoop.hdfs.server.namenode.NameNode.initialize(NameNode.java:491)
    at org.apache.hadoop.hdfs.server.namenode.NameNode.<init>(NameNode.java:684)
    at org.apache.hadoop.hdfs.server.namenode.NameNode.<init>(NameNode.java:669)
    at org.apache.hadoop.hdfs.server.namenode.NameNode.createNameNode(NameNode.java:1254)
    at org.apache.hadoop.hdfs.server.namenode.NameNode.main(NameNode.java:1320)
2015-04-07 23:12:39,842 INFO org.apache.hadoop.util.ExitUtil: Exiting with status 1

Cause: the namenode metadata is corrupted and needs to be repaired.

Solution: run the namenode recovery:

hadoop namenode -recover

Answer c (continue) at every prompt. That is usually enough.
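Before running the recovery, it can help to see which transactions are missing. The sketch below looks for gaps in a list of edit log file names, assuming finalized segments are named edits_<firstTxId>-<lastTxId> as in Hadoop 2.x name directories. That naming is an assumption of this example, as are the function name and the sample file names.

```python
import re
from typing import Iterable, List, Tuple

# Finalized edit log segments, e.g. edits_0000000000000000041-0000000000000000050
SEGMENT = re.compile(r"^edits_(\d+)-(\d+)$")


def find_gaps(names: Iterable[str], first_expected: int) -> List[Tuple[int, int]]:
    """Return (first, last) txid ranges not covered by any finalized segment."""
    matches = (SEGMENT.match(name) for name in names)
    segments = sorted((int(m.group(1)), int(m.group(2))) for m in matches if m)
    gaps = []
    expected = first_expected
    for start, end in segments:
        if start > expected:
            gaps.append((expected, start - 1))
        expected = max(expected, end + 1)
    return gaps


if __name__ == "__main__":
    # Hypothetical directory listing reproducing "expected txid 1, but got txid 41".
    listing = [
        "VERSION",
        "fsimage_0000000000000000000",
        "edits_0000000000000000041-0000000000000000050",
        "edits_inprogress_0000000000000000051",
    ]
    print(find_gaps(listing, first_expected=1))  # [(1, 40)]
```

In practice names would come from os.listdir on the current subdirectory of the name directory, and first_expected would be the fsimage txid plus one.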
