Hadoop NameNode fails to start: "got unexpected exception GC overhead limit exceeded"

Log from the NameNode exiting abnormally during startup:


2021-04-14 16:45:51,492 INFO org.apache.hadoop.hdfs.server.namenode.EditLogInputStream: Fast-forwarding stream '/data/namenode/current/edits_0000000000041009112-0000000000041009113' to transaction ID 21599421

2021-04-14 16:45:51,492 INFO org.apache.hadoop.hdfs.server.namenode.FSImage: Edits file /data/namenode/current/edits_0000000000041009112-0000000000041009113 of size 42 edits # 2 loaded in 0 seconds

2021-04-14 16:45:51,492 INFO org.apache.hadoop.hdfs.server.namenode.FSImage: Reading org.apache.hadoop.hdfs.server.namenode.RedundantEditLogInputStream@54dd2a3c expecting start txid #41009114

2021-04-14 16:45:51,492 INFO org.apache.hadoop.hdfs.server.namenode.FSImage: Start loading edits file /data/namenode/current/edits_0000000000041009114-0000000000041009115

2021-04-14 16:45:51,492 INFO org.apache.hadoop.hdfs.server.namenode.EditLogInputStream: Fast-forwarding stream '/data/namenode/current/edits_0000000000041009114-0000000000041009115' to transaction ID 21599421

2021-04-14 16:45:51,492 INFO org.apache.hadoop.hdfs.server.namenode.FSImage: Edits file /data/namenode/current/edits_0000000000041009114-0000000000041009115 of size 42 edits # 2 loaded in 0 seconds

2021-04-14 16:45:51,492 INFO org.apache.hadoop.hdfs.server.namenode.FSImage: Reading org.apache.hadoop.hdfs.server.namenode.RedundantEditLogInputStream@41b7f7c4 expecting start txid #41009116

2021-04-14 16:45:51,492 INFO org.apache.hadoop.hdfs.server.namenode.FSImage: Start loading edits file /data/namenode/current/edits_0000000000041009116-0000000000041009117

2021-04-14 16:45:51,493 INFO org.apache.hadoop.hdfs.server.namenode.EditLogInputStream: Fast-forwarding stream '/data/namenode/current/edits_0000000000041009116-0000000000041009117' to transaction ID 21599421

2021-04-14 16:45:51,493 INFO org.apache.hadoop.hdfs.server.namenode.FSImage: Edits file /data/namenode/current/edits_0000000000041009116-0000000000041009117 of size 42 edits # 2 loaded in 0 seconds

2021-04-14 16:45:51,493 INFO org.apache.hadoop.hdfs.server.namenode.FSImage: Reading org.apache.hadoop.hdfs.server.namenode.RedundantEditLogInputStream@5c83c349 expecting start txid #41009118

2021-04-14 16:45:51,493 INFO org.apache.hadoop.hdfs.server.namenode.FSImage: Start loading edits file /data/namenode/current/edits_0000000000041009118-0000000000041009264

2021-04-14 16:45:51,493 INFO org.apache.hadoop.hdfs.server.namenode.EditLogInputStream: Fast-forwarding stream '/data/namenode/current/edits_0000000000041009118-0000000000041009264' to transaction ID 21599421

2021-04-14 16:45:52,758 ERROR org.apache.hadoop.hdfs.server.namenode.FSImage: Error replaying edit log at offset 8612.  Expected transaction ID was 41009182

Recent opcode offsets: 8175 8279 8439 8612

java.io.IOException: got unexpected exception GC overhead limit exceeded

at org.apache.hadoop.hdfs.server.namenode.FSEditLogOp$Reader.readOp(FSEditLogOp.java:4601)

at org.apache.hadoop.hdfs.server.namenode.EditLogFileInputStream.nextOpImpl(EditLogFileInputStream.java:203)

at org.apache.hadoop.hdfs.server.namenode.EditLogFileInputStream.nextOp(EditLogFileInputStream.java:250)

at org.apache.hadoop.hdfs.server.namenode.EditLogInputStream.readOp(EditLogInputStream.java:85)

at org.apache.hadoop.hdfs.server.namenode.RedundantEditLogInputStream.nextOp(RedundantEditLogInputStream.java:188)

at org.apache.hadoop.hdfs.server.namenode.EditLogInputStream.readOp(EditLogInputStream.java:85)

at org.apache.hadoop.hdfs.server.namenode.FSEditLogLoader.loadEditRecords(FSEditLogLoader.java:190)

at org.apache.hadoop.hdfs.server.namenode.FSEditLogLoader.loadFSEdits(FSEditLogLoader.java:143)

at org.apache.hadoop.hdfs.server.namenode.FSImage.loadEdits(FSImage.java:898)

at org.apache.hadoop.hdfs.server.namenode.FSImage.loadFSImage(FSImage.java:753)

at org.apache.hadoop.hdfs.server.namenode.FSImage.recoverTransitionRead(FSImage.java:329)

at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.loadFSImage(FSNamesystem.java:984)

at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.loadFromDisk(FSNamesystem.java:686)

at org.apache.hadoop.hdfs.server.namenode.NameNode.loadNamesystem(NameNode.java:586)

at org.apache.hadoop.hdfs.server.namenode.NameNode.initialize(NameNode.java:646)

at org.apache.hadoop.hdfs.server.namenode.NameNode.<init>(NameNode.java:820)

at org.apache.hadoop.hdfs.server.namenode.NameNode.<init>(NameNode.java:804)

at org.apache.hadoop.hdfs.server.namenode.NameNode.createNameNode(NameNode.java:1516)

at org.apache.hadoop.hdfs.server.namenode.NameNode.main(NameNode.java:1582)

Caused by: java.lang.OutOfMemoryError: GC overhead limit exceeded

at com.google.protobuf.CodedInputStream.<init>(CodedInputStream.java:573)

at com.google.protobuf.CodedInputStream.newInstance(CodedInputStream.java:55)

at com.google.protobuf.AbstractParser.parsePartialFrom(AbstractParser.java:199)

at com.google.protobuf.AbstractParser.parsePartialDelimitedFrom(AbstractParser.java:241)

at com.google.protobuf.AbstractParser.parseDelimitedFrom(AbstractParser.java:253)

at com.google.protobuf.AbstractParser.parseDelimitedFrom(AbstractParser.java:259)

at com.google.protobuf.AbstractParser.parseDelimitedFrom(AbstractParser.java:49)

at org.apache.hadoop.hdfs.protocol.proto.XAttrProtos$XAttrEditLogProto.parseDelimitedFrom(XAttrProtos.java:1282)

at org.apache.hadoop.hdfs.server.namenode.FSEditLogOp.readXAttrsFromEditLog(FSEditLogOp.java:414)

at org.apache.hadoop.hdfs.server.namenode.FSEditLogOp.access$3800(FSEditLogOp.java:144)

at org.apache.hadoop.hdfs.server.namenode.FSEditLogOp$AddCloseOp.readFields(FSEditLogOp.java:627)

at org.apache.hadoop.hdfs.server.namenode.FSEditLogOp$Reader.decodeOp(FSEditLogOp.java:4697)

at org.apache.hadoop.hdfs.server.namenode.FSEditLogOp$Reader.readOp(FSEditLogOp.java:4583)

... 18 more

2021-04-14 16:45:57,366 INFO org.apache.hadoop.hdfs.server.namenode.FSNamesystem: FSNamesystem write lock held for 258439 ms via

java.lang.Thread.getStackTrace(Thread.java:1559)

org.apache.hadoop.util.StringUtils.getStackTrace(StringUtils.java:950)

org.apache.hadoop.hdfs.server.namenode.FSNamesystemLock.writeUnlock(FSNamesystemLock.java:261)

org.apache.hadoop.hdfs.server.namenode.FSNamesystem.writeUnlock(FSNamesystem.java:1511)

org.apache.hadoop.hdfs.server.namenode.FSNamesystem.loadFSImage(FSNamesystem.java:1014)

org.apache.hadoop.hdfs.server.namenode.FSNamesystem.loadFromDisk(FSNamesystem.java:686)

org.apache.hadoop.hdfs.server.namenode.NameNode.loadNamesystem(NameNode.java:586)

org.apache.hadoop.hdfs.server.namenode.NameNode.initialize(NameNode.java:646)

org.apache.hadoop.hdfs.server.namenode.NameNode.<init>(NameNode.java:820)

org.apache.hadoop.hdfs.server.namenode.NameNode.<init>(NameNode.java:804)

org.apache.hadoop.hdfs.server.namenode.NameNode.createNameNode(NameNode.java:1516)

org.apache.hadoop.hdfs.server.namenode.NameNode.main(NameNode.java:1582)

Number of suppressed write-lock reports: 0

Longest write-lock held interval: 258439

2021-04-14 16:45:57,367 ERROR org.apache.hadoop.hdfs.server.namenode.NameNode: Failed to start namenode.

java.lang.OutOfMemoryError: GC overhead limit exceeded

at org.apache.hadoop.io.IOUtils.closeStream(IOUtils.java:262)

at org.apache.hadoop.hdfs.server.namenode.FSEditLog.closeAllStreams(FSEditLog.java:1640)

at org.apache.hadoop.hdfs.server.namenode.FSImage.loadEdits(FSImage.java:910)

at org.apache.hadoop.hdfs.server.namenode.FSImage.loadFSImage(FSImage.java:753)

at org.apache.hadoop.hdfs.server.namenode.FSImage.recoverTransitionRead(FSImage.java:329)

at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.loadFSImage(FSNamesystem.java:984)

at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.loadFromDisk(FSNamesystem.java:686)

at org.apache.hadoop.hdfs.server.namenode.NameNode.loadNamesystem(NameNode.java:586)

at org.apache.hadoop.hdfs.server.namenode.NameNode.initialize(NameNode.java:646)

at org.apache.hadoop.hdfs.server.namenode.NameNode.<init>(NameNode.java:820)

at org.apache.hadoop.hdfs.server.namenode.NameNode.<init>(NameNode.java:804)

at org.apache.hadoop.hdfs.server.namenode.NameNode.createNameNode(NameNode.java:1516)

at org.apache.hadoop.hdfs.server.namenode.NameNode.main(NameNode.java:1582)

2021-04-14 16:45:57,368 INFO org.apache.hadoop.util.ExitUtil: Exiting with status 1

2021-04-14 16:45:57,630 INFO org.apache.hadoop.hdfs.server.namenode.NameNode: SHUTDOWN_MSG: 

/************************************************************

SHUTDOWN_MSG: Shutting down NameNode at h1/20.88.2.100

************************************************************/


Hadoop keeps its data entirely in memory, so if the server crashed unexpectedly, everything held in memory would be lost.


To prevent that loss, after committing data to memory Hadoop first writes a copy of the operation to disk, and only then returns an ack to the client confirming the write. That way a server crash cannot destroy acknowledged data.


Hadoop keeps appending these operation records to disk, so the file grows and grows. After a set interval, or on a restart, Hadoop rolls the file and starts writing the log to a new edits file.


Root cause


Because Hadoop keeps rolling the log, edits files accumulate. If Hadoop is then restarted, the NameNode has to read every edits file into memory and replay it from start to finish.


At that point the memory allocated to Hadoop by default is no longer enough: startup keeps replaying the contents of the rolled logs until memory can no longer hold them, and an OutOfMemoryError is thrown.
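If the replay itself is what exhausts the heap, a first-aid measure is to give the NameNode a larger heap before restarting. A minimal sketch for hadoop-env.sh; the 8g figure is an assumption, not a value from this post, so size it to your own fsimage and edits volume:

    # hadoop-env.sh -- enlarge the NameNode heap so edit-log replay can complete.
    # 8g is an assumed value; tune it to your metadata size.
    export HADOOP_NAMENODE_OPTS="-Xms8g -Xmx8g ${HADOOP_NAMENODE_OPTS}"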


Hadoop should never have this many edits files. You can find the edits the NameNode stores in the current directory under the path configured by the dfs.namenode.name.dir parameter in hdfs-site.xml.
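To gauge how bad the backlog is, count the edits segments in that directory. A quick sketch, reusing the /data/namenode path that appears in the log above:

    # Count the edit-log segments under the NameNode storage directory.
    ls /data/namenode/current | grep -c '^edits_'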


The developers anticipated this, which is why there is also a SecondaryNameNode. The SecondaryNameNode assists the NameNode: it reads those edits files, loads them into its own memory, builds a snapshot of that in-memory state, saves the snapshot to disk, and places it in the NameNode's data storage directory.


So the NameNode's data storage directory holds both the edits files recording NameNode operations and fsimage files, which are the snapshots created by the SecondaryNameNode.
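How often the SecondaryNameNode takes these snapshots is tunable in hdfs-site.xml. A sketch of the two standard knobs; the values shown are the usual defaults, not settings taken from the original cluster:

    <property>
      <name>dfs.namenode.checkpoint.period</name>
      <value>3600</value> <!-- checkpoint at least once per hour -->
    </property>
    <property>
      <name>dfs.namenode.checkpoint.txns</name>
      <value>1000000</value> <!-- or after one million uncheckpointed transactions -->
    </property>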


When the NameNode starts, it should first load the fsimage snapshot and then apply only the edits files produced after the point where that fsimage stops. It does not need to replay all of the edits from the beginning.
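If you want to see what those files actually contain, Hadoop ships offline viewers for both. A sketch; the fsimage name below is a placeholder, so substitute files that exist in your own current directory (the edits segment name is taken from the log above):

    # Dump an fsimage snapshot to XML with the Offline Image Viewer.
    hdfs oiv -p XML -i /data/namenode/current/fsimage_0000000000041009113 -o /tmp/fsimage.xml
    # Dump one edits segment with the Offline Edits Viewer.
    hdfs oev -i /data/namenode/current/edits_0000000000041009114-0000000000041009115 -o /tmp/edits.xml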


So the NameNode failed to start because there were far too many edits files, and the edits files piled up because my SecondaryNameNode node had a fault.


Its process was still running normally, but every time it tried to merge the image, it reported an error:


2021-04-15 10:11:54,887 ERROR org.apache.hadoop.hdfs.server.namenode.SecondaryNameNode: Exception in doCheckpoint

java.io.IOException: Inconsistent checkpoint fields.

LV = -63 namespaceID = 1874539643 cTime = 0 ; clusterId = CID-8889c165-10b3-4895-a9bf-b037a4e0a860 ; blockpoolId = BP-1626093461-10.88.8.100-1589541600534.

Expecting respectively: -63; 1711940767; 0; CID-97709a04-7dd7-40d4-bc58-f196a19d9929; BP-2014788863-10.88.8.100-1587723332360.

at org.apache.hadoop.hdfs.server.namenode.CheckpointSignature.validateStorageInfo(CheckpointSignature.java:143)

at org.apache.hadoop.hdfs.server.namenode.SecondaryNameNode.doCheckpoint(SecondaryNameNode.java:531)

at org.apache.hadoop.hdfs.server.namenode.SecondaryNameNode.doWork(SecondaryNameNode.java:395)

at org.apache.hadoop.hdfs.server.namenode.SecondaryNameNode$1.run(SecondaryNameNode.java:361)

at org.apache.hadoop.security.SecurityUtil.doAsLoginUserOrFatal(SecurityUtil.java:415)

at org.apache.hadoop.hdfs.server.namenode.SecondaryNameNode.run(SecondaryNameNode.java:357)

at java.lang.Thread.run(Thread.java:748)

2021-04-15 10:12:54,893 ERROR org.apache.hadoop.hdfs.server.namenode.SecondaryNameNode: Exception in doCheckpoint

java.io.IOException: Inconsistent checkpoint fields.

LV = -63 namespaceID = 1874539643 cTime = 0 ; clusterId = CID-8889c165-10b3-4895-a9bf-b037a4e0a860 ; blockpoolId = BP-1626093461-10.88.8.100-1589541600534.

Expecting respectively: -63; 1711940767; 0; CID-97709a04-7dd7-40d4-bc58-f196a19d9929; BP-2014788863-10.88.8.100-1587723332360.

at org.apache.hadoop.hdfs.server.namenode.CheckpointSignature.validateStorageInfo(CheckpointSignature.java:143)

at org.apache.hadoop.hdfs.server.namenode.SecondaryNameNode.doCheckpoint(SecondaryNameNode.java:531)

at org.apache.hadoop.hdfs.server.namenode.SecondaryNameNode.doWork(SecondaryNameNode.java:395)

at org.apache.hadoop.hdfs.server.namenode.SecondaryNameNode$1.run(SecondaryNameNode.java:361)

at org.apache.hadoop.security.SecurityUtil.doAsLoginUserOrFatal(SecurityUtil.java:415)

at org.apache.hadoop.hdfs.server.namenode.SecondaryNameNode.run(SecondaryNameNode.java:357)

at java.lang.Thread.run(Thread.java:748)


The cause of this error is that the VERSION file under the dfs.namenode.name.dir directory holds a cluster ID that no longer matches the cluster ID in the VERSION file under Hadoop's cache directory (the one specified by the hadoop.tmp.dir parameter). The fix is to copy the VERSION file from the dfs.namenode.name.dir directory into the hadoop.tmp.dir directory.
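A minimal sketch of that fix. The hadoop.tmp.dir path below is an assumption (the SecondaryNameNode usually keeps its copy under dfs/namesecondary inside hadoop.tmp.dir); substitute the directories your own configuration points at:

    # Compare the two cluster IDs first.
    grep clusterID /data/namenode/current/VERSION
    grep clusterID /tmp/hadoop-hdfs/dfs/namesecondary/current/VERSION
    # If they differ, overwrite the stale copy with the NameNode's VERSION file.
    cp /data/namenode/current/VERSION /tmp/hadoop-hdfs/dfs/namesecondary/current/VERSION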


Checking the logs showed that this SecondaryNameNode fault had been going on for a long time (roughly half a year). After restarting Hadoop, the number of edits files under dfs.namenode.name.dir instantly dropped by more than half, and the Hadoop cluster returned to normal.
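To confirm checkpoints are flowing again after such a fix, watch for fresh fsimage files to appear; a quick check, reusing the path from earlier:

    # A recent mtime on the newest fsimage means checkpointing has resumed.
    ls -lt /data/namenode/current | grep fsimage | head -2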

————————————————

Copyright notice: this is an original article by CSDN blogger 「荒-于嬉」, licensed under the CC 4.0 BY-SA agreement. Please include the original source link and this notice when reposting.

Original link: https://blog.csdn.net/weixin_44455125/article/details/116173459

