Big Data Development Troubleshooting Notes
I. Big Data Environment Setup
1. After installing and configuring Hadoop locally on Windows, running "hadoop" from cmd reports: ERROR: JAVA_HOME is incorrectly set.
Solution: The JAVA_HOME path contains a space. Edit the set JAVA_HOME line in etc\hadoop\hadoop-env.cmd under the Hadoop install directory so it points at a space-free path.
e.g.: set JAVA_HOME=C:\PROGRA~1\Java\jdk1.8.0_161 (PROGRA~1 is the 8.3 short name for "Program Files", which avoids the space)
Reference: https://www.cnblogs.com/zlslch/p/8580446.html
2. After installing and configuring Hadoop locally on Windows, running "hadoop" from cmd reports: "The system cannot find the batch label specified - print_usage".
Solution: Open every .cmd file in Hadoop's bin directory with Notepad++ and convert the line endings to Windows format:
Open the file -> Edit -> EOL Conversion -> Windows (CR LF) -> Save
II. Hive
1. A Hive job fails during execution and the log shows it ran out of virtual memory.
Solution: A cluster node ran out of virtual memory. The simplest fix is to disable the virtual memory check: edit yarn-site.xml and add the following property:
<property>
<name>yarn.nodemanager.vmem-check-enabled</name>
<value>false</value>
</property>
Distribute the file to the other cluster nodes and restart the cluster, as sketched below.
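A minimal shell sketch of that step, assuming the node names and install path used elsewhere in these notes (node02/node03, /kkb/install/hadoop-3.1.4) and that Hadoop's sbin scripts are on the PATH:
# run on the node where yarn-site.xml was edited
cd /kkb/install/hadoop-3.1.4/etc/hadoop
scp yarn-site.xml node02:/kkb/install/hadoop-3.1.4/etc/hadoop/
scp yarn-site.xml node03:/kkb/install/hadoop-3.1.4/etc/hadoop/
# restart YARN so the new setting takes effect
stop-yarn.sh
start-yarn.sh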
2. Loading data into a bucketed table with load data local ... overwrite does not actually overwrite: the data is appended to the target table. To overwrite, use insert overwrite table target_table select * from source_table instead, as in the sketch below.
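A minimal sketch of that workaround, assuming the hive CLI is on the PATH and using the placeholder table names from the note above:
# overwrite the bucketed target table from the source/staging table instead of LOAD ... OVERWRITE
hive -e "insert overwrite table target_table select * from source_table;"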
III. HBase
1. Running an HBase jar on the cluster with hadoop jar fails (the error screenshot is not reproduced here).
Cause: The HADOOP_CLASSPATH environment variable is not set in hadoop-env.sh, so hadoop jar cannot find the jars the program depends on.
Solution: Edit hadoop-env.sh and add the HADOOP_CLASSPATH environment variable:
[hadoop@node02 hadoop]$ cd /kkb/install/hadoop-3.1.4/etc/hadoop
[hadoop@node02 hadoop]$ vim hadoop-env.sh
export HADOOP_CLASSPATH=/kkb/install/hbase-2.2.2/lib/*
# the trailing * is required, otherwise the error persists; do not replace the literal path with $HBASE_HOME
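A quick check (a verification step I'm adding, not part of the original notes): the hadoop launcher sources hadoop-env.sh itself, so a fresh invocation should now list the HBase jars on its classpath:
hadoop classpath | tr ':' '\n' | grep hbase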
2. After HBase on the cluster starts normally, the HRegionServer processes drop out on their own after a few minutes.
The RegionServer log shows the following error:
java.lang.NoClassDefFoundError: org/apache/htrace/SpanReceiver
Solution: Copy htrace-core4-4.2.0-incubating.jar from client-facing-thirdparty into HBase's lib directory:
[hadoop@node03 client-facing-thirdparty]$ pwd
/kkb/install/hbase-2.2.2/lib/client-facing-thirdparty
[hadoop@node03 client-facing-thirdparty]$ cp /kkb/install/hbase-2.2.2/lib/client-facing-thirdparty/htrace-core4-4.2.0-incubating.jar /kkb/install/hbase-2.2.2/lib/
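The copy has to be done on every node that runs a RegionServer, and the RegionServer must be restarted to pick up the jar; a sketch assuming the standard HBase daemon scripts:
cd /kkb/install/hbase-2.2.2
bin/hbase-daemon.sh stop regionserver
bin/hbase-daemon.sh start regionserver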
IV. Flume
1. Starting a Flume agent fails (the error screenshot is not reproduced here).
Cause: apache-flume-1.9.0-bin and hadoop-3.1.4 both ship a guava jar, but in different versions, which causes a conflict.
Solution: Replace Flume's older guava jar with the newer one from Hadoop:
cd /kkb/install/apache-flume-1.9.0-bin/lib
rm -f guava-11.0.2.jar
cp /kkb/install/hadoop-3.1.4/share/hadoop/common/lib/guava-27.0-jre.jar .
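A quick sanity check (an assumption, not part of the original write-up) that only the newer guava remains in Flume's lib directory:
ls /kkb/install/apache-flume-1.9.0-bin/lib | grep guava
# expected: guava-27.0-jre.jar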
V. MySQL
1. Exporting MySQL table data as CSV.
Syntax: select * from tablename into outfile "/directory/path/tablename.csv" fields terminated by ',' lines terminated by '\n';
Error: ERROR 1290 (HY000): The MySQL server is running with the --secure-file-priv option so it cannot execute this statement
Solution: Check the MySQL variable secure_file_priv and write the output file under the directory it points to.
mysql> select * from students into outfile "/tmp/students.csv" fields terminated by ',' lines terminated by '\n';
ERROR 1290 (HY000): The MySQL server is running with the --secure-file-priv option so it cannot execute this statement
mysql> show variables like '*file*';
Empty set (0.00 sec)
mysql> show variables like '*secure_file*';
Empty set (0.00 sec)
mysql> show variables like '%secure_file%';
+------------------+-----------------------+
| Variable_name | Value |
+------------------+-----------------------+
| secure_file_priv | /var/lib/mysql-files/ |
+------------------+-----------------------+
1 row in set (0.00 sec)
mysql> select * from students into outfile "/var/lib/mysql-files/students.csv" fields terminated by ',' lines terminated by '\n';
Query OK, 6 rows affected (0.01 sec)
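If the export really must land somewhere other than /var/lib/mysql-files/, secure_file_priv itself can be changed; a sketch assuming the config file is /etc/my.cnf and mysqld runs under systemd (the variable is read-only at runtime, so a server restart is required):
# add under the [mysqld] section of /etc/my.cnf:
#   secure_file_priv=/data/export/
sudo systemctl restart mysqld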
VI. Spark
1. The same jar, submitted in cluster deploy mode, runs successfully on YARN but fails on Spark standalone.
bin/spark-submit --master spark://node01:7077 \
--deploy-mode cluster \
--class com.kkb.spark.core.SparkCountCluster \
--executor-memory 1G \
--total-executor-cores 2 \
hdfs://node01:8020/original-spark-core-1.0-SNAPSHOT.jar \
hdfs://node01:8020/word.txt hdfs://node01:8020/output
Error:
Launch Command: "/kkb/install/jdk1.8.0_141/bin/java" "-cp" "/kkb/install/spark-2.3.3-bin-hadoop2.7/conf/:/kkb/install/spark-2.3.3-bin-hadoop2.7/jars/*:/kkb/install/hadoop-3.1.4/etc/hadoop/" "-Xmx1024M" "-Dspark.eventLog.enabled=true" "-Dspark.submit.deployMode=cluster" "-Dspark.yarn.historyServer.address=node01:4000" "-Dspark.app.name=com.kkb.spark.core.SparkCountCluster" "-Dspark.driver.supervise=false" "-Dspark.executor.memory=1g" "-Dspark.eventLog.dir=hdfs://node01:8020/spark_log" "-Dspark.master=spark://node01:7077" "-Dspark.driver.extraClassPath=/kkb/install/hadoop-3.1.4/share/hadoop/common/hadoop-lzo-0.4.20.jar" "-Dspark.eventLog.compress=true" "-Dspark.cores.max=2" "-Dspark.executor.extraClassPath=/kkb/install/hadoop-3.1.4/share/hadoop/common/hadoop-lzo-0.4.20.jar" "-Dspark.history.ui.port=4000" "-Dspark.rpc.askTimeout=10s" "-Dspark.jars=hdfs://node01:8020/original-spark-core-1.0-SNAPSHOT.jar" "org.apache.spark.deploy.worker.DriverWrapper" "spark://Worker@192.168.153.110:41049" "/kkb/install/spark-2.3.3-bin-hadoop2.7/work/driver-20210115100008-0000/original-spark-core-1.0-SNAPSHOT.jar" "com.kkb.spark.core.SparkCountCluster" "hdfs://node01:8020/word.txt" "hdfs://node01:8020/output"
========================================
Exception in thread "main" java.lang.reflect.InvocationTargetException
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.apache.spark.deploy.worker.DriverWrapper$.main(DriverWrapper.scala:65)
at org.apache.spark.deploy.worker.DriverWrapper.main(DriverWrapper.scala)
Caused by: java.lang.RuntimeException: Error in configuring object
at org.apache.hadoop.util.ReflectionUtils.setJobConf(ReflectionUtils.java:112)
at org.apache.hadoop.util.ReflectionUtils.setConf(ReflectionUtils.java:78)
at org.apache.hadoop.util.ReflectionUtils.newInstance(ReflectionUtils.java:136)
at org.apache.spark.rdd.HadoopRDD.getInputFormat(HadoopRDD.scala:187)
at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:200)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:253)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:251)
at scala.Option.getOrElse(Option.scala:121)
at org.apache.spark.rdd.RDD.partitions(RDD.scala:251)
at org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:46)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:253)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:251)
at scala.Option.getOrElse(Option.scala:121)
at org.apache.spark.rdd.RDD.partitions(RDD.scala:251)
at org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:46)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:253)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:251)
at scala.Option.getOrElse(Option.scala:121)
at org.apache.spark.rdd.RDD.partitions(RDD.scala:251)
at org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:46)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:253)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:251)
at scala.Option.getOrElse(Option.scala:121)
at org.apache.spark.rdd.RDD.partitions(RDD.scala:251)
at org.apache.spark.Partitioner$$anonfun$4.apply(Partitioner.scala:78)
at org.apache.spark.Partitioner$$anonfun$4.apply(Partitioner.scala:78)
at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
at scala.collection.immutable.List.foreach(List.scala:381)
at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
at scala.collection.immutable.List.map(List.scala:285)
at org.apache.spark.Partitioner$.defaultPartitioner(Partitioner.scala:78)
at org.apache.spark.rdd.PairRDDFunctions$$anonfun$reduceByKey$3.apply(PairRDDFunctions.scala:326)
at org.apache.spark.rdd.PairRDDFunctions$$anonfun$reduceByKey$3.apply(PairRDDFunctions.scala:326)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
at org.apache.spark.rdd.RDD.withScope(RDD.scala:363)
at org.apache.spark.rdd.PairRDDFunctions.reduceByKey(PairRDDFunctions.scala:325)
at com.kkb.spark.core.SparkCountCluster$.main(SparkCountCluster.scala:12)
at com.kkb.spark.core.SparkCountCluster.main(SparkCountCluster.scala)
... 6 more
Caused by: java.lang.reflect.InvocationTargetException
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.apache.hadoop.util.ReflectionUtils.setJobConf(ReflectionUtils.java:109)
... 45 more
Caused by: java.lang.IllegalArgumentException: Compression codec com.hadoop.compression.lzo.LzoCodec not found.
at org.apache.hadoop.io.compress.CompressionCodecFactory.getCodecClasses(CompressionCodecFactory.java:139)
at org.apache.hadoop.io.compress.CompressionCodecFactory.<init>(CompressionCodecFactory.java:180)
at org.apache.hadoop.mapred.TextInputFormat.configure(TextInputFormat.java:45)
... 50 more
Caused by: java.lang.ClassNotFoundException: Class com.hadoop.compression.lzo.LzoCodec not found
at org.apache.hadoop.conf.Configuration.getClassByName(Configuration.java:2101)
at org.apache.hadoop.io.compress.CompressionCodecFactory.getCodecClasses(CompressionCodecFactory.java:132)
... 52
Solution 1:
Point --master at port 6066, i.e. the standalone master's REST URL: spark://node01.kaikeba.com:6066 (cluster mode)
bin/spark-submit --master spark://node01:6066 \
--deploy-mode cluster \
--class com.kkb.spark.core.SparkCountCluster \
--executor-memory 1G \
--total-executor-cores 2 \
hdfs://node01:8020/original-spark-core-1.0-SNAPSHOT.jar \
hdfs://node01:8020/word.txt hdfs://node01:8020/output
Solution 2:
Edit spark-defaults.conf and add the setting spark.master spark://node01:7077,node02:7077, as sketched below.
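A sketch of that change, using the Spark install path that appears in the launch command above:
# append the standalone master URL(s) so cluster-mode submits resolve the master from spark-defaults.conf
echo "spark.master    spark://node01:7077,node02:7077" >> /kkb/install/spark-2.3.3-bin-hadoop2.7/conf/spark-defaults.conf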
Author: 扎西的德勒
Original: https://www.jianshu.com/p/bd56122b5094