Summary of Problems Encountered in Spark Development and Their Solutions


1.Exception in thread “main” java.lang.NoSuchMethodError: scala.collection.immutable.HashSet$.empty()Lscala/collection/immutable/HashSet;
Solution: the Scala versions do not match; switch the project from Scala 2.11 to Scala 2.10 so that it matches the Scala version the Spark artifacts were built against.
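This error is typical of mixing Scala binary versions on the classpath. As a quick sanity check (a minimal sketch, not part of the original fix), you can print the Scala version that is actually loaded at runtime and compare it with the _2.10 / _2.11 suffix of your Spark dependencies:

// Print the Scala version actually on the classpath.
object ScalaVersionCheck {
  def main(args: Array[String]): Unit = {
    println(scala.util.Properties.versionString) // e.g. "version 2.10.6"
  }
}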

2. Running a Spark program locally on Windows reports the following error
Failed to locate the winutils binary in the hadoop binary path java.io.IOException: Could not locate executable null\bin\winutils.exe in the Hadoop binaries.
Solution:
1. Download a Hadoop package that includes winutils.exe:
https://www.barik.net/archive/2015/01/19/172716/
2. Unzip it.
3. Set the HADOOP_HOME environment variable, e.g. HADOOP_HOME=E:\1-work\hadoop-2.6.0
4. Append %HADOOP_HOME%\bin (the directory that contains winutils.exe) to Path: Path = %Path%;%HADOOP_HOME%\bin
5. Restart the computer so the new environment variables take effect. (An in-code alternative is sketched below.)
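If you would rather not change system environment variables, a common workaround is to point Hadoop at the winutils installation from code before the SparkContext is created. This is a minimal sketch; the path is an assumption reusing the HADOOP_HOME example above:

// Point Hadoop at the local winutils installation before Spark starts.
System.setProperty("hadoop.home.dir", "E:\\1-work\\hadoop-2.6.0") // assumed path, adjust to your unzip location
val conf = new org.apache.spark.SparkConf().setAppName("local-test").setMaster("local[*]")
val sc = new org.apache.spark.SparkContext(conf)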

3. Adjusting the log4j log level when debugging Spark locally on Windows
Create a log4j.properties file in the project's source/resources root and put the following content in it (available levels include ERROR, DEBUG, INFO, WARN, ...; an in-code alternative follows the properties file):
log4j.rootCategory=ERROR, file
log4j.appender.file=org.apache.log4j.ConsoleAppender
# To write the log to a file instead, use FileAppender:
#log4j.appender.file=org.apache.log4j.FileAppender
#log4j.appender.file.file=spark.log

log4j.appender.file.append=false
log4j.appender.file.layout=org.apache.log4j.PatternLayout
log4j.appender.file.layout.ConversionPattern=%d{yy/MM/dd HH:mm:ss.SSS} %t %p %c{1}: %m%n

# Ignore messages below warning level from Jetty, because it's a bit verbose
log4j.logger.org.eclipse.jetty=ERROR
org.eclipse.jetty.LEVEL=ERROR
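Alternatively, the log level can be lowered directly in code via SparkContext.setLogLevel, which sets the root logger level at runtime. A minimal sketch:

// Silence Spark's console output from code instead of (or in addition to) log4j.properties.
val spark = org.apache.spark.sql.SparkSession.builder()
  .appName("local-debug")
  .master("local[*]")
  .getOrCreate()
spark.sparkContext.setLogLevel("ERROR") // valid levels: ALL, DEBUG, ERROR, FATAL, INFO, OFF, TRACE, WARN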

4. Spark job submission modes: standalone and YARN
Submitting a job in standalone mode:
spark-submit --master spark://vm-spider-94:7077 \
--class com.fangdd.daas.hbase.Hdfs2HBaseTable \
--total-executor-cores 16 \
--executor-memory 4g \
--conf spark.default.parallelism=16 \
--driver-memory 2g \
/root/hfile/hbase-data-service-2.0-SNAPSHOT-jar-with-dependencies.jar \
${debugFlag} ${renameFlag} ${hdfsPath} ${zkQuorum} ${hbaseMaster} ${hbaseZnode} ${partitions} ${writeBufferSize} ${tgtTableName} ${tmpTableName} ${repartitionNum}

yarn-client mode:
spark-submit --master yarn-client \
--class com.fangdd.daas.hbase.Hdfs2HBaseTable \
--total-executor-cores 16 \
--executor-memory 4g \
--conf spark.default.parallelism=16 \
--driver-memory 2g \
/root/hfile/hbase-data-service-2.0-SNAPSHOT-jar-with-dependencies.jar \
${debugFlag} ${renameFlag} ${hdfsPath} ${zkQuorum} ${hbaseMaster} ${hbaseZnode} ${partitions} ${writeBufferSize} ${tgtTableName} ${tmpTableName} ${repartitionNum}

yarn-cluster mode (how the positional arguments after the jar are consumed is sketched below):
spark-submit --master yarn-cluster \
--class com.fangdd.daas.hbase.Hdfs2HBaseTable \
--total-executor-cores 16 \
--executor-memory 4g \
--conf spark.default.parallelism=16 \
--driver-memory 2g \
/root/hfile/hbase-data-service-2.0-SNAPSHOT-jar-with-dependencies.jar \
${debugFlag} ${renameFlag} ${hdfsPath} ${zkQuorum} ${hbaseMaster} ${hbaseZnode} ${partitions} ${writeBufferSize} ${tgtTableName} ${tmpTableName} ${repartitionNum}
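The values after the jar path are plain positional arguments delivered to the driver's main method. A minimal sketch of how a main class such as Hdfs2HBaseTable could read them (this destructuring is an illustration based on the placeholder names above, not the actual implementation):

object Hdfs2HBaseTable {
  def main(args: Array[String]): Unit = {
    // Positional arguments, in the same order as in the spark-submit examples above.
    val Array(debugFlag, renameFlag, hdfsPath, zkQuorum, hbaseMaster, hbaseZnode,
              partitions, writeBufferSize, tgtTableName, tmpTableName, repartitionNum) = args
    // ... job logic goes here ...
  }
}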

5. Spark run locally from IDEA can only see Hive's default database and cannot access the other databases
1. The address of the Hive metastore you want to connect to must be configured. That configuration lives in hive-site.xml, so put the properly configured file into the resources directory.
2. The problem may be that the hive-site.xml under resources is not being read.
Solution: add the following resource configuration to the pom file:

<resources>
    <resource>
        <directory>src/main/resources/${package.environment}</directory>
    </resource>
    <resource>
        <directory>src/main/resources/</directory>
    </resource>
</resources>

When a Spark Streaming + Kafka job operates on Hive, the SparkSession declaration must include enableHiveSupport(); otherwise it, too, will only see Hive's default database:
val spark = SparkSession.builder.config(rdd.sparkContext.getConf).enableHiveSupport().getOrCreate()

If it still does not work, run a Maven clean from IDEA; that usually resolves it.
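For a plain (non-streaming) job the same rule applies. A minimal sketch of a Hive-enabled local session plus a quick check that more than the default database is visible (the app name and master are assumptions for local testing):

import org.apache.spark.sql.SparkSession

// hive-site.xml must be on the classpath (see the resources configuration above).
val spark = SparkSession.builder()
  .appName("hive-check")
  .master("local[*]")
  .enableHiveSupport()
  .getOrCreate()

spark.sql("show databases").show() // should list more than just the default database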

6. Spark run locally from IDEA reports the following error when connecting to Hive
Unable to open a test connection to the given database. JDBC url = jdbc:derby:;databaseName=metastore_db;create=true, username = APP. Terminating connection pool (set lazyInit to true if you expect to start your database after your app)
Solution: search for the metastore_db directory (for example with Everything) and delete it.

7. print statements inside map and other transformations produce no output
This is because Spark evaluates lazily: a transformation such as map is only recorded, so an action such as .collect() must follow it to actually trigger the computation, as in the sketch below.
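A minimal sketch of the difference, assuming an existing SparkContext named sc running in local mode:

val rdd = sc.parallelize(1 to 3)

// Nothing is printed here: map is a lazy transformation, so the function body has not run yet.
val mapped = rdd.map { x => println(s"mapping $x"); x * 2 }

// An action (collect, count, foreach, ...) triggers the computation and the prints appear.
mapped.collect().foreach(println)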

8. Running the jar locally on Windows reports "No FileSystem for scheme: hdfs"
Solution: copy the cluster's Hadoop conf/core-site.xml into the project's source root (i.e. under src), open the file, and add the following at the end:

	<property>
	<name>fs.hdfs.impl</name>
	<value>org.apache.hadoop.hdfs.DistributedFileSystem</value>
	<description>The FileSystem for hdfs: uris.</description>
	</property>

Refresh the project and rerun the program; the problem is resolved.
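An alternative that avoids editing core-site.xml is to set the same property on the SparkContext's Hadoop configuration. A minimal sketch, assuming an existing SparkSession named spark:

// Programmatic equivalent of the fs.hdfs.impl property above.
spark.sparkContext.hadoopConfiguration
  .set("fs.hdfs.impl", classOf[org.apache.hadoop.hdfs.DistributedFileSystem].getName)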

9. Running the jar locally on Windows and reading a MySQL JDBC DataFrame reports: Exception in thread "main" java.lang.ClassNotFoundException: Failed to find data source: jdbc. Please find packages at http://spark.apache.org/third-party-projects.html

Add the following dependency to the pom file:

<dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-sql_2.11</artifactId>
    <version>${spark.version}</version>
</dependency>

Usage in code (a hypothetical call is sketched after the function):
def getMysqlJdbcDF(spark: SparkSession, url: String, querySql: String, userName: String, passWord: String): DataFrame = {
  val readConnProperties4 = new Properties()
  readConnProperties4.put("driver", "com.mysql.jdbc.Driver")
  readConnProperties4.put("user", userName)
  readConnProperties4.put("password", passWord)
  readConnProperties4.put("fetchsize", "3")
  spark.read.jdbc(
    url,
    s"($querySql) t", // the parentheses and the table alias are required; the query can also filter the data
    readConnProperties4)
}
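A hypothetical call to the helper above; the URL, credentials, and query are placeholders, not values from the original article. Note that com.mysql.jdbc.Driver also requires the mysql-connector-java dependency on the classpath.

// Hypothetical usage; replace the connection details with real ones.
val df = getMysqlJdbcDF(
  spark,
  "jdbc:mysql://localhost:3306/test?useSSL=false",
  "select id, name from user where id > 100",
  "root",
  "secret")
df.show()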
