Hadoop NameNode fails to start: "got unexpected exception GC overhead limit exceeded"
Log from the failed NameNode startup:
2021-04-14 16:45:51,492 INFO org.apache.hadoop.hdfs.server.namenode.EditLogInputStream: Fast-forwarding stream '/data/namenode/current/edits_0000000000041009112-0000000000041009113' to transaction ID 21599421
2021-04-14 16:45:51,492 INFO org.apache.hadoop.hdfs.server.namenode.FSImage: Edits file /data/namenode/current/edits_0000000000041009112-0000000000041009113 of size 42 edits # 2 loaded in 0 seconds
2021-04-14 16:45:51,492 INFO org.apache.hadoop.hdfs.server.namenode.FSImage: Reading org.apache.hadoop.hdfs.server.namenode.RedundantEditLogInputStream@54dd2a3c expecting start txid #41009114
2021-04-14 16:45:51,492 INFO org.apache.hadoop.hdfs.server.namenode.FSImage: Start loading edits file /data/namenode/current/edits_0000000000041009114-0000000000041009115
2021-04-14 16:45:51,492 INFO org.apache.hadoop.hdfs.server.namenode.EditLogInputStream: Fast-forwarding stream '/data/namenode/current/edits_0000000000041009114-0000000000041009115' to transaction ID 21599421
2021-04-14 16:45:51,492 INFO org.apache.hadoop.hdfs.server.namenode.FSImage: Edits file /data/namenode/current/edits_0000000000041009114-0000000000041009115 of size 42 edits # 2 loaded in 0 seconds
2021-04-14 16:45:51,492 INFO org.apache.hadoop.hdfs.server.namenode.FSImage: Reading org.apache.hadoop.hdfs.server.namenode.RedundantEditLogInputStream@41b7f7c4 expecting start txid #41009116
2021-04-14 16:45:51,492 INFO org.apache.hadoop.hdfs.server.namenode.FSImage: Start loading edits file /data/namenode/current/edits_0000000000041009116-0000000000041009117
2021-04-14 16:45:51,493 INFO org.apache.hadoop.hdfs.server.namenode.EditLogInputStream: Fast-forwarding stream '/data/namenode/current/edits_0000000000041009116-0000000000041009117' to transaction ID 21599421
2021-04-14 16:45:51,493 INFO org.apache.hadoop.hdfs.server.namenode.FSImage: Edits file /data/namenode/current/edits_0000000000041009116-0000000000041009117 of size 42 edits # 2 loaded in 0 seconds
2021-04-14 16:45:51,493 INFO org.apache.hadoop.hdfs.server.namenode.FSImage: Reading org.apache.hadoop.hdfs.server.namenode.RedundantEditLogInputStream@5c83c349 expecting start txid #41009118
2021-04-14 16:45:51,493 INFO org.apache.hadoop.hdfs.server.namenode.FSImage: Start loading edits file /data/namenode/current/edits_0000000000041009118-0000000000041009264
2021-04-14 16:45:51,493 INFO org.apache.hadoop.hdfs.server.namenode.EditLogInputStream: Fast-forwarding stream '/data/namenode/current/edits_0000000000041009118-0000000000041009264' to transaction ID 21599421
2021-04-14 16:45:52,758 ERROR org.apache.hadoop.hdfs.server.namenode.FSImage: Error replaying edit log at offset 8612. Expected transaction ID was 41009182
Recent opcode offsets: 8175 8279 8439 8612
java.io.IOException: got unexpected exception GC overhead limit exceeded
at org.apache.hadoop.hdfs.server.namenode.FSEditLogOp$Reader.readOp(FSEditLogOp.java:4601)
at org.apache.hadoop.hdfs.server.namenode.EditLogFileInputStream.nextOpImpl(EditLogFileInputStream.java:203)
at org.apache.hadoop.hdfs.server.namenode.EditLogFileInputStream.nextOp(EditLogFileInputStream.java:250)
at org.apache.hadoop.hdfs.server.namenode.EditLogInputStream.readOp(EditLogInputStream.java:85)
at org.apache.hadoop.hdfs.server.namenode.RedundantEditLogInputStream.nextOp(RedundantEditLogInputStream.java:188)
at org.apache.hadoop.hdfs.server.namenode.EditLogInputStream.readOp(EditLogInputStream.java:85)
at org.apache.hadoop.hdfs.server.namenode.FSEditLogLoader.loadEditRecords(FSEditLogLoader.java:190)
at org.apache.hadoop.hdfs.server.namenode.FSEditLogLoader.loadFSEdits(FSEditLogLoader.java:143)
at org.apache.hadoop.hdfs.server.namenode.FSImage.loadEdits(FSImage.java:898)
at org.apache.hadoop.hdfs.server.namenode.FSImage.loadFSImage(FSImage.java:753)
at org.apache.hadoop.hdfs.server.namenode.FSImage.recoverTransitionRead(FSImage.java:329)
at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.loadFSImage(FSNamesystem.java:984)
at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.loadFromDisk(FSNamesystem.java:686)
at org.apache.hadoop.hdfs.server.namenode.NameNode.loadNamesystem(NameNode.java:586)
at org.apache.hadoop.hdfs.server.namenode.NameNode.initialize(NameNode.java:646)
at org.apache.hadoop.hdfs.server.namenode.NameNode.<init>(NameNode.java:820)
at org.apache.hadoop.hdfs.server.namenode.NameNode.<init>(NameNode.java:804)
at org.apache.hadoop.hdfs.server.namenode.NameNode.createNameNode(NameNode.java:1516)
at org.apache.hadoop.hdfs.server.namenode.NameNode.main(NameNode.java:1582)
Caused by: java.lang.OutOfMemoryError: GC overhead limit exceeded
at com.google.protobuf.CodedInputStream.<init>(CodedInputStream.java:573)
at com.google.protobuf.CodedInputStream.newInstance(CodedInputStream.java:55)
at com.google.protobuf.AbstractParser.parsePartialFrom(AbstractParser.java:199)
at com.google.protobuf.AbstractParser.parsePartialDelimitedFrom(AbstractParser.java:241)
at com.google.protobuf.AbstractParser.parseDelimitedFrom(AbstractParser.java:253)
at com.google.protobuf.AbstractParser.parseDelimitedFrom(AbstractParser.java:259)
at com.google.protobuf.AbstractParser.parseDelimitedFrom(AbstractParser.java:49)
at org.apache.hadoop.hdfs.protocol.proto.XAttrProtos$XAttrEditLogProto.parseDelimitedFrom(XAttrProtos.java:1282)
at org.apache.hadoop.hdfs.server.namenode.FSEditLogOp.readXAttrsFromEditLog(FSEditLogOp.java:414)
at org.apache.hadoop.hdfs.server.namenode.FSEditLogOp.access$3800(FSEditLogOp.java:144)
at org.apache.hadoop.hdfs.server.namenode.FSEditLogOp$AddCloseOp.readFields(FSEditLogOp.java:627)
at org.apache.hadoop.hdfs.server.namenode.FSEditLogOp$Reader.decodeOp(FSEditLogOp.java:4697)
at org.apache.hadoop.hdfs.server.namenode.FSEditLogOp$Reader.readOp(FSEditLogOp.java:4583)
... 18 more
2021-04-14 16:45:57,366 INFO org.apache.hadoop.hdfs.server.namenode.FSNamesystem: FSNamesystem write lock held for 258439 ms via
java.lang.Thread.getStackTrace(Thread.java:1559)
org.apache.hadoop.util.StringUtils.getStackTrace(StringUtils.java:950)
org.apache.hadoop.hdfs.server.namenode.FSNamesystemLock.writeUnlock(FSNamesystemLock.java:261)
org.apache.hadoop.hdfs.server.namenode.FSNamesystem.writeUnlock(FSNamesystem.java:1511)
org.apache.hadoop.hdfs.server.namenode.FSNamesystem.loadFSImage(FSNamesystem.java:1014)
org.apache.hadoop.hdfs.server.namenode.FSNamesystem.loadFromDisk(FSNamesystem.java:686)
org.apache.hadoop.hdfs.server.namenode.NameNode.loadNamesystem(NameNode.java:586)
org.apache.hadoop.hdfs.server.namenode.NameNode.initialize(NameNode.java:646)
org.apache.hadoop.hdfs.server.namenode.NameNode.<init>(NameNode.java:820)
org.apache.hadoop.hdfs.server.namenode.NameNode.<init>(NameNode.java:804)
org.apache.hadoop.hdfs.server.namenode.NameNode.createNameNode(NameNode.java:1516)
org.apache.hadoop.hdfs.server.namenode.NameNode.main(NameNode.java:1582)
Number of suppressed write-lock reports: 0
Longest write-lock held interval: 258439
2021-04-14 16:45:57,367 ERROR org.apache.hadoop.hdfs.server.namenode.NameNode: Failed to start namenode.
java.lang.OutOfMemoryError: GC overhead limit exceeded
at org.apache.hadoop.io.IOUtils.closeStream(IOUtils.java:262)
at org.apache.hadoop.hdfs.server.namenode.FSEditLog.closeAllStreams(FSEditLog.java:1640)
at org.apache.hadoop.hdfs.server.namenode.FSImage.loadEdits(FSImage.java:910)
at org.apache.hadoop.hdfs.server.namenode.FSImage.loadFSImage(FSImage.java:753)
at org.apache.hadoop.hdfs.server.namenode.FSImage.recoverTransitionRead(FSImage.java:329)
at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.loadFSImage(FSNamesystem.java:984)
at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.loadFromDisk(FSNamesystem.java:686)
at org.apache.hadoop.hdfs.server.namenode.NameNode.loadNamesystem(NameNode.java:586)
at org.apache.hadoop.hdfs.server.namenode.NameNode.initialize(NameNode.java:646)
at org.apache.hadoop.hdfs.server.namenode.NameNode.<init>(NameNode.java:820)
at org.apache.hadoop.hdfs.server.namenode.NameNode.<init>(NameNode.java:804)
at org.apache.hadoop.hdfs.server.namenode.NameNode.createNameNode(NameNode.java:1516)
at org.apache.hadoop.hdfs.server.namenode.NameNode.main(NameNode.java:1582)
2021-04-14 16:45:57,368 INFO org.apache.hadoop.util.ExitUtil: Exiting with status 1
2021-04-14 16:45:57,630 INFO org.apache.hadoop.hdfs.server.namenode.NameNode: SHUTDOWN_MSG:
/************************************************************
SHUTDOWN_MSG: Shutting down NameNode at h1/20.88.2.100
************************************************************/
The NameNode keeps all of its namespace metadata in memory, so an unexpected crash would wipe out everything held there.
To guard against that loss, every metadata change is first appended to an edit log on disk, and only then is an ack returned to the client confirming the write; after a crash, the state can be rebuilt from disk.
Hadoop appends operations to this log continuously, so the file keeps growing. After a set interval, or on restart, the NameNode rolls the log and begins writing to a new edits file.
Root cause
As the log keeps rolling, edits files pile up. If the NameNode is then restarted, it must replay every one of those edits files into memory, from the first to the last, to rebuild the namespace.
At that point the default heap allocated to the NameNode is no longer enough: replay keeps allocating as it works through the log, the heap fills up, and the JVM throws OutOfMemoryError: GC overhead limit exceeded.
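If the replay genuinely needs more memory than the default heap provides, one mitigation is to raise the NameNode heap in hadoop-env.sh before restarting. A minimal sketch; the 4g figure is an assumption, not a recommendation, and should be sized to your fsimage and edits backlog:

```shell
# hadoop-env.sh — raise the NameNode JVM heap so edit-log replay fits in memory.
# 4g is an example value; size it to your own namespace and edits backlog.
export HADOOP_NAMENODE_OPTS="-Xms4g -Xmx4g ${HADOOP_NAMENODE_OPTS}"
```

This only buys headroom for one restart; the underlying fix is still to restore checkpointing so the edits backlog stops growing.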
There should never be this many edits files. They can be found in the current directory under the path set by dfs.namenode.name.dir in hdfs-site.xml, which is where the NameNode stores its metadata.
Hadoop's developers anticipated this problem: the SecondaryNameNode exists to assist the NameNode. Its job is to read the accumulated edits files, replay them into its own memory, write a snapshot (checkpoint) of that state to disk, and place the resulting file back in the NameNode's storage directory.
So the NameNode's storage directory holds both edits files (the operation log) and fsimage files (the checkpoint snapshots produced by the SecondaryNameNode).
On startup, the NameNode first loads the latest fsimage, then replays only the edits produced after that checkpoint. It should never need to replay the entire log from the beginning.
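To see what the NameNode will have to replay, Hadoop ships offline viewers that dump a checkpoint or an edits segment to XML. The file names below are examples taken from this log, not fixed values:

```shell
# Dump the latest checkpoint to XML with the offline image viewer (hdfs oiv).
hdfs oiv -p XML -i /data/namenode/current/fsimage_0000000000041009111 -o fsimage.xml

# Dump one edits segment with the offline edits viewer (hdfs oev).
hdfs oev -p XML -i /data/namenode/current/edits_0000000000041009118-0000000000041009264 -o edits.xml

# Count how many edits segments have piled up since the last checkpoint.
ls /data/namenode/current/edits_* | wc -l
```

A healthy cluster keeps this segment count small; a count in the thousands is a sign that checkpointing has stopped.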
So the NameNode failed to start because there were too many edits files, and the edits piled up because my SecondaryNameNode had failed.
Its process was still running, but every checkpoint attempt to merge the image ended with an error:
2021-04-15 10:11:54,887 ERROR org.apache.hadoop.hdfs.server.namenode.SecondaryNameNode: Exception in doCheckpoint
java.io.IOException: Inconsistent checkpoint fields.
LV = -63 namespaceID = 1874539643 cTime = 0 ; clusterId = CID-8889c165-10b3-4895-a9bf-b037a4e0a860 ; blockpoolId = BP-1626093461-10.88.8.100-1589541600534.
Expecting respectively: -63; 1711940767; 0; CID-97709a04-7dd7-40d4-bc58-f196a19d9929; BP-2014788863-10.88.8.100-1587723332360.
at org.apache.hadoop.hdfs.server.namenode.CheckpointSignature.validateStorageInfo(CheckpointSignature.java:143)
at org.apache.hadoop.hdfs.server.namenode.SecondaryNameNode.doCheckpoint(SecondaryNameNode.java:531)
at org.apache.hadoop.hdfs.server.namenode.SecondaryNameNode.doWork(SecondaryNameNode.java:395)
at org.apache.hadoop.hdfs.server.namenode.SecondaryNameNode$1.run(SecondaryNameNode.java:361)
at org.apache.hadoop.security.SecurityUtil.doAsLoginUserOrFatal(SecurityUtil.java:415)
at org.apache.hadoop.hdfs.server.namenode.SecondaryNameNode.run(SecondaryNameNode.java:357)
at java.lang.Thread.run(Thread.java:748)
2021-04-15 10:12:54,893 ERROR org.apache.hadoop.hdfs.server.namenode.SecondaryNameNode: Exception in doCheckpoint
java.io.IOException: Inconsistent checkpoint fields.
LV = -63 namespaceID = 1874539643 cTime = 0 ; clusterId = CID-8889c165-10b3-4895-a9bf-b037a4e0a860 ; blockpoolId = BP-1626093461-10.88.8.100-1589541600534.
Expecting respectively: -63; 1711940767; 0; CID-97709a04-7dd7-40d4-bc58-f196a19d9929; BP-2014788863-10.88.8.100-1587723332360.
at org.apache.hadoop.hdfs.server.namenode.CheckpointSignature.validateStorageInfo(CheckpointSignature.java:143)
at org.apache.hadoop.hdfs.server.namenode.SecondaryNameNode.doCheckpoint(SecondaryNameNode.java:531)
at org.apache.hadoop.hdfs.server.namenode.SecondaryNameNode.doWork(SecondaryNameNode.java:395)
at org.apache.hadoop.hdfs.server.namenode.SecondaryNameNode$1.run(SecondaryNameNode.java:361)
at org.apache.hadoop.security.SecurityUtil.doAsLoginUserOrFatal(SecurityUtil.java:415)
at org.apache.hadoop.hdfs.server.namenode.SecondaryNameNode.run(SecondaryNameNode.java:357)
at java.lang.Thread.run(Thread.java:748)
This error is caused by a cluster-ID mismatch: the VERSION file under the dfs.namenode.name.dir directory carries one clusterID, while the VERSION file under Hadoop's cache directory (the path set by hadoop.tmp.dir) carries a different one. The fix is to copy the VERSION file from the dfs.namenode.name.dir directory over the one under hadoop.tmp.dir.
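The compare-and-copy step can be sketched as below. The demo uses a scratch directory with made-up clusterIDs so the commands are runnable as written; on a real deployment you would substitute your actual dfs.namenode.name.dir and hadoop.tmp.dir paths:

```shell
# Scratch layout standing in for dfs.namenode.name.dir and hadoop.tmp.dir.
mkdir -p demo/name/current demo/tmp/dfs/namesecondary/current
printf 'clusterID=CID-aaa\n' > demo/name/current/VERSION
printf 'clusterID=CID-bbb\n' > demo/tmp/dfs/namesecondary/current/VERSION

nn_id=$(grep '^clusterID=' demo/name/current/VERSION)
snn_id=$(grep '^clusterID=' demo/tmp/dfs/namesecondary/current/VERSION)

# A mismatch makes the SecondaryNameNode reject checkpoints;
# copy the NameNode's VERSION file over the stale one.
if [ "$nn_id" != "$snn_id" ]; then
  cp demo/name/current/VERSION demo/tmp/dfs/namesecondary/current/VERSION
fi
```

After the copy, restart the SecondaryNameNode and watch its log to confirm that doCheckpoint now completes without the "Inconsistent checkpoint fields" error.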
Checking the logs showed that this SecondaryNameNode failure had been going on for a long time (roughly half a year). After restarting Hadoop, the number of edits files under dfs.namenode.name.dir dropped by more than half almost immediately, and the cluster returned to normal.
————————————————
Copyright notice: this is an original article by CSDN blogger "荒-于嬉", licensed under CC 4.0 BY-SA. Please include the original link and this notice when reposting.
Original link: https://blog.csdn.net/weixin_44455125/article/details/116173459