Stale partitions when Spark reads a table after its partitions were repaired with the Hive command msck repair table table_name

https://blog.csdn.net/weixin_40829577/article/details/109001268

Contents

1 Cause

2 Solution

1 Cause

To improve performance, Spark SQL caches table metadata. When an external system (here, Hive's msck repair) changes that metadata, Spark must invalidate and refresh its cached copy of the table before reading it again, as the Scaladoc for refreshTable explains; a short reproduction sketch follows the signature.

/**
 * Invalidates and refreshes all the cached data and metadata of the given table. For performance
 * reasons, Spark SQL or the external data source library it uses might cache certain metadata
 * about a table, such as the location of blocks. When those change outside of Spark SQL, users
 * should call this function to invalidate the cache.
 *
 * If this table is cached as an InMemoryRelation, drop the original cached version and make the
 * new version cached lazily.
 *
 * @param tableName is either a qualified or unqualified name that designates a table/view.
 *                  If no database identifier is provided, it refers to a temporary view or
 *                  a table/view in the current database.
 * @since 2.0.0
 */
def refreshTable(tableName: String): Unit
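
For context, here is a minimal sketch of the failure mode in spark-shell; the table name follows the post, while the partition column dt is a hypothetical stand-in:

// Assumes Hive has already run `MSCK REPAIR TABLE table_name;` to register
// partition directories written to HDFS outside of Spark. The partition
// column `dt` is a placeholder for illustration.
spark.table("table_name").select("dt").distinct().show() // may still miss the new partitions
spark.catalog.refreshTable("table_name")                 // invalidate the cached metadata
spark.table("table_name").select("dt").distinct().show() // now reflects the repaired partitions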

2 Solution

1. Start a spark-shell client

1) Allocate enough executor-memory/driver-memory; otherwise the job will run out of memory;

2) Keep the parallelism moderate; otherwise you may exceed the allowed number of concurrent accesses (see the sketch after the command below);

spark-shell \
  --name ShyTestError \
  --master yarn \
  --deploy-mode client \
  --num-executors 3 \
  --executor-memory 24G \
  --executor-cores 2 \
  --driver-memory 8G \
  --conf spark.dynamicAllocation.enabled=false \
  --conf spark.executor.memoryOverhead=4G \
  --conf spark.default.parallelism=12 \
  --conf spark.sql.shuffle.partitions=12
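
If several repaired tables need refreshing, a plain sequential loop in the shell keeps concurrent accesses low, per note 2) above; a minimal sketch with hypothetical table names:

// Hypothetical table names; refresh one at a time to stay within access limits.
val repaired = Seq("db.table_a", "db.table_b")
repaired.foreach(t => spark.catalog.refreshTable(t))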

2. Refresh the table's metadata

spark.catalog.refreshTable("table_name")
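
The same refresh can also be issued as a Spark SQL statement; either form accepts a qualified name such as db.table_name:

spark.sql("REFRESH TABLE table_name") // SQL equivalent of spark.catalog.refreshTable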
