A Look at the Commit Mechanism of Spark Jobs - Why Concurrent Spark Updates to an ORC Table Fail, and How to Fix It

Summary: a look at the commit mechanism of Spark jobs - why concurrent Spark updates to an ORC table fail, and how to fix it.

1 Problem Symptoms

When multiple Spark jobs concurrently update the same ORC table, some of the jobs may fail and exit because certain temporary files do not exist. A typical error log looks like this:

org.apache.spark.SparkException: Job aborted. Caused by: java.io.FileNotFoundException: File hdfs://kxc-cluster/user/hive/warehouse/hstest_dev.db/test_test_test/_temporary/0/task_202309041054037826600124725546762_0176_m_000002/g=2 does not exist. at org.apache.hadoop.hdfs.DistributedFileSystem.listStatusInternal(DistributedFileSystem.java:981)

2 Root Cause

The root cause is that Spark does not support concurrent updates to the same non-partitioned ORC/Parquet table, or to the same partition of a partitioned ORC/Parquet table; it does not even support concurrently updating different partitions of a partitioned ORC/Parquet table in static partition mode. The underlying details are tied to the algorithm behind Spark's two-phase job commit mechanism, described later in this article.

3 Solutions

  • Solution 1: for partitioned tables, prefer dynamic partition mode over static partition mode. For example, use insert overwrite table table1 partition (part_date) select client_id, 20230911 as part_date from table0 instead of insert overwrite table table1 partition (part_date=20230911) select client_id from table0. (In this mode each job has its own independent temporary directory, located under a directory such as .spark-staging-fc1a9f7a-5729-498e-b710-249e90217f66, so concurrent jobs do not conflict.)
  • Solution 2: configure Spark to use the Hive SerDe instead of the Spark built-in data source writer, i.e. set the parameters spark.sql.hive.convertInsertingPartitionedTable=false and spark.sql.hive.convertMetastoreOrc=false. (Spark then uses the Hive SerDe commit algorithm, and each job has its own independent temporary directory, located under a directory such as .hive-staging_hive_2023-09-08_17-35-01_497_4555303478309834157-59, so concurrent jobs do not conflict.)
  • Solution 3: configure the FileOutputCommitter not to clean up the temporary directory, i.e. set the Spark parameter spark.hadoop.mapreduce.fileoutputcommitter.cleanup.skipped=true. (A configuration sketch covering all three solutions follows this list.)
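For reference, the sketch below shows how the three solutions can be applied from a Spark (Scala) session. It is only an illustrative sketch: the table and column names (table0, table1, client_id, part_date) come from the examples above, and the scope at which each parameter can actually be set may vary with your Spark/Kyuubi version.

// Solution 1: dynamic partition mode - each job writes under its own
// .spark-staging-<jobId> directory, so concurrent jobs do not collide.
spark.sql(
  """INSERT OVERWRITE TABLE table1 PARTITION (part_date)
    |SELECT client_id, 20230911 AS part_date FROM table0""".stripMargin)

// Solution 2: fall back to the Hive SerDe write path instead of the
// Spark built-in data source writer (illustrated here at session level).
spark.conf.set("spark.sql.hive.convertInsertingPartitionedTable", "false")
spark.conf.set("spark.sql.hive.convertMetastoreOrc", "false")

// Solution 3: skip cleanup of the _temporary directory after job commit,
// so one job's cleanup cannot delete files another running job still needs.
// Typically passed at submit time as
// --conf spark.hadoop.mapreduce.fileoutputcommitter.cleanup.skipped=true
// The equivalent Hadoop-level setting inside a session would be:
spark.sparkContext.hadoopConfiguration
  .set("mapreduce.fileoutputcommitter.cleanup.skipped", "true")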

The limitations of each solution are as follows:

  • Solution 1 only applies to partitioned tables;
  • Solution 2 may limit Spark/Hive interoperability, i.e. whether Spark and Hive can correctly read and write data written by each other depends on whether the Spark and Hive versions are compatible (and on how certain related parameters are configured);
  • Solution 3 requires cleaning up the temporary directories manually and asynchronously; otherwise, over time, many empty directories (not empty files) accumulate under the temporary directory (a cleanup sketch follows this list);
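A minimal sketch of the asynchronous cleanup that Solution 3 requires is shown below, using the Hadoop FileSystem API from a Spark (Scala) session. The table location is the one from this article's test environment and is purely illustrative; in practice, only delete a table's _temporary directory once you are sure no job is still writing to that table.

import org.apache.hadoop.fs.{FileSystem, Path}

// Illustrative table location taken from this article's test environment.
val tableLocation = new Path("hdfs://nameservice1/user/hive/warehouse/test_liming.db/table1")
val fs = FileSystem.get(tableLocation.toUri, spark.sparkContext.hadoopConfiguration)

// With cleanup.skipped=true, only (mostly empty) directories remain here
// after all jobs have committed; delete them recursively.
val leftover = new Path(tableLocation, "_temporary")
if (fs.exists(leftover)) {
  fs.delete(leftover, true)
}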

4 Technical Background - Overview

Spark jobs use a two-phase commit mechanism that commits tasks and the job separately. The details are as follows (a path-layout sketch follows this list):

  • When a job starts, a temporary directory ${output.dir}/_temporary/${appAttemptId} is first created as the temporary output directory for this run, where ${output.dir} is the table's root storage path, e.g. /user/hive/warehouse/test.db/tableA;
  • When one of the job's tasks starts running, a further temporary directory ${output.dir}/_temporary/${appAttemptId}/_temporary/${taskAttemptId} is created as that task's temporary output directory;
  • When a task finishes, Spark checks whether the task needs to be committed (with speculative execution enabled, some tasks may not need to be committed); if it does, the output file ${output.dir}/_temporary/${appAttemptId}/_temporary/${taskAttemptId}/${fileName} is moved to ${output.dir}/_temporary/${appAttemptId}/${taskId}/${fileName};
  • After all tasks have finished, the job is committed: all output files under ${output.dir}/_temporary/${appAttemptId}/ are moved to the final directory ${output.dir};
  • After the job is committed, unless spark.hadoop.mapreduce.fileoutputcommitter.cleanup.skipped=true is explicitly configured, the temporary directory is cleaned up, i.e. ${output.dir}/_temporary is deleted;
  • When inserting into a partitioned table in dynamic partition mode, a staging directory is also used: the staging directory is ${output.dir}/.spark-staging-${jobId}, and the corresponding temporary directory is ${output.dir}/.spark-staging-${jobId}/_temporary, e.g. /user/hive/warehouse/test.db/tableA/.spark-staging-fc1a9f7a-5729-498e-b710-249e90217f66/_temporary; this staging directory is always deleted after the job is committed;
  • When mapreduce.fileoutputcommitter.algorithm.version=2 is explicitly configured, the task commit step described above differs slightly (the committer moves ${output.dir}/_temporary/${appAttemptId}/_temporary/${taskAttemptId}/${fileName} directly to ${output.dir});
  • It is precisely because of these task/job commit details that Spark does not support concurrent updates to the same non-partitioned ORC/Parquet table or to the same partition of a partitioned ORC/Parquet table, nor does it support concurrently updating different partitions of a partitioned ORC/Parquet table in static partition mode;
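The sketch below is not Spark source code; it merely spells out, as Scala strings, the directory layout the list above describes, so the rename steps of commit algorithm versions 1 and 2 are easier to follow. All concrete values are illustrative placeholders.

val outputDir     = "/user/hive/warehouse/test.db/tableA"   // ${output.dir}
val appAttemptId  = "0"                                      // ${appAttemptId}
val taskAttemptId = "attempt_20230908..._0038_m_000000_158"  // ${taskAttemptId}
val taskId        = "task_20230908..._0038_m_000000"         // ${taskId}
val fileName      = "part-00000-....zlib.orc"                // ${fileName}

// While a task attempt is running it writes here:
val taskTmpFile  = s"$outputDir/_temporary/$appAttemptId/_temporary/$taskAttemptId/$fileName"
// Task commit, algorithm version 1: rename into the job's temporary area.
val taskDoneFile = s"$outputDir/_temporary/$appAttemptId/$taskId/$fileName"
// Job commit, algorithm version 1: rename into the final table directory,
// then delete $outputDir/_temporary unless cleanup.skipped=true.
val finalFile    = s"$outputDir/$fileName"
// Algorithm version 2: task commit renames taskTmpFile straight to finalFile.
// Dynamic partition mode uses the same layout, but rooted at
// ${output.dir}/.spark-staging-${jobId} instead of directly at ${output.dir}.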

5 Technical Background - Relevant Source Code and Parameters

Relevant source code:

- org.apache.spark.internal.io.HadoopMapReduceCommitProtocol
- org.apache.spark.sql.execution.datasources.SQLHadoopMapReduceCommitProtocol
- org.apache.spark.internal.io.FileCommitProtocol
- org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter
- org.apache.hadoop.mapreduce.OutputCommitter
- org.apache.hadoop.mapreduce.lib.output.PathOutputCommitter
- org.apache.spark.sql.execution.datasources.FileFormatWriter
- org.apache.spark.sql.hive.execution.SaveAsHiveFile


  • Relevant parameters (a runtime-inspection sketch follows this list):
spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version
spark.hadoop.mapreduce.fileoutputcommitter.cleanup.skipped
spark.sql.sources.outputCommitterClass
spark.sql.sources.commitProtocolClass
mapreduce.fileoutputcommitter.algorithm.version 
mapreduce.fileoutputcommitter.cleanup.skipped
mapreduce.fileoutputcommitter.cleanup-failures.ignored
mapreduce.fileoutputcommitter.marksuccessfuljobs
mapreduce.fileoutputcommitter.task.cleanup.enabled
mapred.committer.job.setup.cleanup.needed/mapreduce.job.committer.setup.cleanup.needed
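The snippet below is a small diagnostic sketch showing how the effective values of a few of these parameters can be inspected from a running Spark (Scala) session; the exact set of registered keys and their defaults may differ between Spark and Hadoop versions.

val hadoopConf = spark.sparkContext.hadoopConfiguration

// Which FileOutputCommitter algorithm the session will use
// (falling back to "1" here, which matches the logs in the sections below).
val algoVersion = hadoopConf.get("mapreduce.fileoutputcommitter.algorithm.version", "1")
// Whether _temporary cleanup after job commit is skipped.
val skipCleanup = hadoopConf.getBoolean("mapreduce.fileoutputcommitter.cleanup.skipped", false)
// Which commit protocol class Spark SQL data source writes will use.
val protocol = spark.conf.get("spark.sql.sources.commitProtocolClass")

println(s"algorithm.version=$algoVersion cleanup.skipped=$skipCleanup protocol=$protocol")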

6 Technical Background - Spark Concurrently Inserting into a Non-Partitioned Table

  • During job/task execution, temporary files such as the following are generated: /user/hive/warehouse/test_liming.db/table1/_temporary/0/_temporary/attempt_202309080930006805722025796783378_0038_m_000000_158/part-00000-a1e1410f-6ca1-4d8b-92b6-78883c9e9a22-c000.zlib.orc (a reproduction sketch is given at the end of this section)
  • After task commit, a file such as the following is generated: hdfs://nameservice1/user/hive/warehouse/test_liming.db/table1/_temporary/0/task_202309080928591897349793317265177_0025_m_000000
  • After job commit, a file such as the following is generated: /user/hive/warehouse/test_liming.db/table1/part-00000-8448b8b5-01b1-4348-8f91-5d3acd682f81-c000.zlib.orc
  • Screenshot taken during execution:

(screenshot)


  • Key log lines:
Key log - a successful task:
23/09/08 09:26:29 INFO FileOutputCommitter: File Output Committer Algorithm version is 1
23/09/08 09:26:29 INFO FileOutputCommitter: FileOutputCommitter skip cleanup _temporary folders under output directory:false, ignore cleanup failures: false
23/09/08 09:26:29 INFO SQLHadoopMapReduceCommitProtocol: Using output committer class org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter
23/09/08 09:26:30 INFO HadoopShimsPre2_7: Can't get KeyProvider for ORC encryption from hadoop.security.key.provider.path.
23/09/08 09:26:30 INFO PhysicalFsWriter: ORC writer created for path: hdfs://nameservice1/user/hive/warehouse/test_test_test1/.hive-staging_hive_2023-09-08_09-26-26_158_278404270035841685-3/-ext-10000/_temporary/0/_temporary/attempt_202309080926277463773081868267263_0002_m_000000_2/part-00000-6c45455c-0201-4ad8-9459-fa8b77f37d0e-c000 with stripeSize: 67108864 blockSize: 268435456 compression: Compress: ZLIB buffer: 262144
23/09/08 09:26:30 INFO WriterImpl: ORC writer created for path: hdfs://nameservice1/user/hive/warehouse/test_test_test1/.hive-staging_hive_2023-09-08_09-26-26_158_278404270035841685-3/-ext-10000/_temporary/0/_temporary/attempt_202309080926277463773081868267263_0002_m_000000_2/part-00000-6c45455c-0201-4ad8-9459-fa8b77f37d0e-c000 with stripeSize: 67108864 options: Compress: ZLIB buffer: 262144
23/09/08 09:26:49 INFO FileOutputCommitter: Saved output of task 'attempt_202309080926277463773081868267263_0002_m_000000_2' to hdfs://nameservice1/user/hive/warehouse/test_test_test1/.hive-staging_hive_2023-09-08_09-26-26_158_278404270035841685-3/-ext-10000/_temporary/0/task_202309080926277463773081868267263_0002_m_000000
23/09/08 09:26:49 INFO SparkHadoopMapRedUtil: attempt_202309080926277463773081868267263_0002_m_000000_2: Committed. Elapsed time: 13 ms.
23/09/08 09:26:49 INFO Executor: Finished task 0.0 in stage 2.0 (TID 2). 2541 bytes result sent to driver
Key log - a failed task:
23/09/08 10:22:02 WARN DataStreamer: DataStreamer Exception
java.io.FileNotFoundException: File does not exist: /user/hive/warehouse/test_liming.db/table1/_temporary/0/_temporary/attempt_202309081021577566806638904497462_0003_m_000000_10/part-00000-211a80a3-2cce-4f25-8c10-bfa5ecbd421f-c000.zlib.orc (inode 21688384) Holder DFSClient_attempt_202309081021525253836824694806862_0001_m_000003_4_991233622_49 does not have any open files.
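As a reference only, the sketch below shows one way the collision in this section can be reproduced from a single Scala application, by firing two INSERT OVERWRITE statements against the same non-partitioned table concurrently. The tests in this article used separate applications, but the conflict on the shared _temporary/0 directory is the same; table and column names are the illustrative ones used earlier.

import scala.concurrent.{Await, Future}
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.duration.Duration

// Two concurrent INSERT OVERWRITEs into the same non-partitioned ORC table:
// both write under table1/_temporary/0, so whichever job commits second may
// fail with FileNotFoundException once the first job's cleanup has run.
val jobs = (1 to 2).map { _ =>
  Future {
    spark.sql("INSERT OVERWRITE TABLE table1 SELECT client_id FROM table0")
  }
}
jobs.foreach(job => Await.ready(job, Duration.Inf))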

7 Technical Background - Spark Concurrently Inserting into Different Partitions of a Partitioned Table in Static Partition Mode

  • During job/task execution, files such as the following are generated: /user/hive/warehouse/test_liming.db/table1_pt/_temporary/0/_temporary/attempt_202309081055288611730301255924365_0005_m_000000_20/g=22/part-00000-88afa539-25ba-4b1d-bd6d-df445863dd8d.c000.zlib.orc (the conflicting statement pair is sketched at the end of this section)
  • After task commit, a file such as the following is generated: hdfs://nameservice1/user/hive/warehouse/test_liming.db/table1_pt/_temporary/0/task_202309081054408080671360087873016_0001_m_000000
  • After job commit, the final file is generated: /user/hive/warehouse/test_liming.db/table1_pt/g=22/part-00000-0732dc56-ae0f-4c32-8347-012870ad7ab1.c000.zlib.orc
  • Screenshot taken during execution:

(screenshot)

  • Key log lines:
Key log - a successful task:
23/09/08 10:54:48 INFO FileOutputCommitter: File Output Committer Algorithm version is 1
23/09/08 10:54:48 INFO FileOutputCommitter: FileOutputCommitter skip cleanup _temporary folders under output directory:false, ignore cleanup failures: false
23/09/08 10:54:48 INFO SQLHadoopMapReduceCommitProtocol: Using output committer class org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter
23/09/08 10:54:48 INFO CodeGenerator: Code generated in 25.249011 ms
23/09/08 10:54:48 INFO CodeGenerator: Code generated in 14.669298 ms
23/09/08 10:54:48 INFO CodeGenerator: Code generated in 37.39972 ms
23/09/08 10:54:48 INFO HadoopShimsPre2_7: Can't get KeyProvider for ORC encryption from hadoop.security.key.provider.path.
23/09/08 10:54:48 INFO PhysicalFsWriter: ORC writer created for path: hdfs://nameservice1/user/hive/warehouse/test_liming.db/table1_pt/_temporary/0/_temporary/attempt_202309081054408080671360087873016_0001_m_000000_1/g=20/part-00000-92811aeb-309c-4c23-acdd-b8286feadcd4.c000.zlib.orc with stripeSize: 67108864 blockSize: 268435456 compression: Compress: ZLIB buffer: 262144
23/09/08 10:54:48 INFO WriterImpl: ORC writer created for path: hdfs://nameservice1/user/hive/warehouse/test_liming.db/table1_pt/_temporary/0/_temporary/attempt_202309081054408080671360087873016_0001_m_000000_1/g=20/part-00000-92811aeb-309c-4c23-acdd-b8286feadcd4.c000.zlib.orc with stripeSize: 67108864 options: Compress: ZLIB buffer: 262144
23/09/08 10:55:03 INFO FileOutputCommitter: Saved output of task 'attempt_202309081054408080671360087873016_0001_m_000000_1' to hdfs://nameservice1/user/hive/warehouse/test_liming.db/table1_pt/_temporary/0/task_202309081054408080671360087873016_0001_m_000000
23/09/08 10:55:03 INFO SparkHadoopMapRedUtil: attempt_202309081054408080671360087873016_0001_m_000000_1: Committed. Elapsed time: 9 ms.
23/09/08 10:55:03 INFO Executor: Finished task 0.0 in stage 1.0 (TID 1). 3255 bytes result sent to driver
Key log - job commit error:
23/09/08 10:55:22 ERROR FileFormatWriter: Aborting job 966601b8-2679-4dc3-86a1-cebc34d9b8c9.
java.io.FileNotFoundException: File hdfs://nameservice1/user/hive/warehouse/test_liming.db/table1_pt/_temporary/0 does not exist.
Key log - task commit error:
23/09/08 10:55:43 WARN DataStreamer: DataStreamer Exception
java.io.FileNotFoundException: File does not exist: /user/hive/warehouse/test_liming.db/table1_pt/_temporary/0/_temporary/attempt_202309081055288611730301255924365_0005_m_000000_20/g=22/part-00000-88afa539-25ba-4b1d-bd6d-df445863dd8d.c000.zlib.orc (inode 21689816) Holder DFSClient_NONMAPREDUCE_2024885185_46 does not have any open files.
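For clarity, the pair of statements below illustrates the scenario tested in this section: two applications writing different static partitions of the same table. With the Spark built-in writer, both still write under table1_pt/_temporary/0, which is why they can break each other even though the target partitions differ. Table, column and partition names mirror this article's test table and are illustrative.

// Application A:
spark.sql("INSERT OVERWRITE TABLE table1_pt PARTITION (g=21) SELECT client_id FROM table0")

// Application B, running at the same time:
spark.sql("INSERT OVERWRITE TABLE table1_pt PARTITION (g=22) SELECT client_id FROM table0")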

8 Technical Background - Spark Inserting into Different Partitions of a Partitioned Table in Dynamic Partition Mode

  • During job/task execution, files such as the following are generated: hdfs://nameservice1/user/hive/warehouse/test_liming.db/table1_pt/.spark-staging-fc1a9f7a-5729-498e-b710-249e90217f66/_temporary/0/_temporary/attempt_202309081121348303587356551291178_0001_m_000002_3/g=23/part-00002-fc1a9f7a-5729-498e-b710-249e90217f66.c000.zlib.orc
  • After task commit, a file such as the following is generated: hdfs://nameservice1/user/hive/warehouse/test_liming.db/table1_pt/.spark-staging-fc1a9f7a-5729-498e-b710-249e90217f66/_temporary/0/task_202309081121348303587356551291178_0001_m_000002
  • After job commit, a file such as the following is generated: /user/hive/warehouse/test_liming.db/table1_pt/g=23/part-00002-2587b707-7675-4547-8ffb-63e2114d1c9b.c000.zlib.orc
  • Screenshots taken during execution:

(screenshots)

  • Key log lines:
Key log - all tasks and all jobs succeeded:
23/09/08 11:21:45 INFO FileOutputCommitter: File Output Committer Algorithm version is 1
23/09/08 11:21:45 INFO FileOutputCommitter: File Output Committer Algorithm version is 1
23/09/08 11:21:45 INFO FileOutputCommitter: FileOutputCommitter skip cleanup _temporary folders under output directory:false, ignore cleanup failures: false
23/09/08 11:21:45 INFO FileOutputCommitter: FileOutputCommitter skip cleanup _temporary folders under output directory:false, ignore cleanup failures: false
23/09/08 11:21:45 INFO SQLHadoopMapReduceCommitProtocol: Using output committer class org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter
23/09/08 11:21:45 INFO SQLHadoopMapReduceCommitProtocol: Using output committer class org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter
23/09/08 11:21:45 INFO CodeGenerator: Code generated in 44.80136 ms
23/09/08 11:21:45 INFO CodeGenerator: Code generated in 16.168217 ms
23/09/08 11:21:45 INFO CodeGenerator: Code generated in 53.060559 ms
23/09/08 11:21:45 INFO HadoopShimsPre2_7: Can't get KeyProvider for ORC encryption from hadoop.security.key.provider.path.
23/09/08 11:21:45 INFO HadoopShimsPre2_7: Can't get KeyProvider for ORC encryption from hadoop.security.key.provider.path.
23/09/08 11:21:45 INFO PhysicalFsWriter: ORC writer created for path: hdfs://nameservice1/user/hive/warehouse/test_liming.db/table1_pt/.spark-staging-fc1a9f7a-5729-498e-b710-249e90217f66/_temporary/0/_temporary/attempt_202309081121348303587356551291178_0001_m_000002_3/g=23/part-00002-fc1a9f7a-5729-498e-b710-249e90217f66.c000.zlib.orc with stripeSize: 67108864 blockSize: 268435456 compression: Compress: ZLIB buffer: 262144
23/09/08 11:21:45 INFO PhysicalFsWriter: ORC writer created for path: hdfs://nameservice1/user/hive/warehouse/test_liming.db/table1_pt/.spark-staging-fc1a9f7a-5729-498e-b710-249e90217f66/_temporary/0/_temporary/attempt_202309081121344786622643944446305_0001_m_000000_1/g=21/part-00000-fc1a9f7a-5729-498e-b710-249e90217f66.c000.zlib.orc with stripeSize: 67108864 blockSize: 268435456 compression: Compress: ZLIB buffer: 262144
23/09/08 11:21:45 INFO WriterImpl: ORC writer created for path: hdfs://nameservice1/user/hive/warehouse/test_liming.db/table1_pt/.spark-staging-fc1a9f7a-5729-498e-b710-249e90217f66/_temporary/0/_temporary/attempt_202309081121344786622643944446305_0001_m_000000_1/g=21/part-00000-fc1a9f7a-5729-498e-b710-249e90217f66.c000.zlib.orc with stripeSize: 67108864 options: Compress: ZLIB buffer: 262144
23/09/08 11:21:45 INFO WriterImpl: ORC writer created for path: hdfs://nameservice1/user/hive/warehouse/test_liming.db/table1_pt/.spark-staging-fc1a9f7a-5729-498e-b710-249e90217f66/_temporary/0/_temporary/attempt_202309081121348303587356551291178_0001_m_000002_3/g=23/part-00002-fc1a9f7a-5729-498e-b710-249e90217f66.c000.zlib.orc with stripeSize: 67108864 options: Compress: ZLIB buffer: 262144
23/09/08 11:22:04 INFO FileOutputCommitter: Saved output of task 'attempt_202309081121344786622643944446305_0001_m_000000_1' to hdfs://nameservice1/user/hive/warehouse/test_liming.db/table1_pt/.spark-staging-fc1a9f7a-5729-498e-b710-249e90217f66/_temporary/0/task_202309081121344786622643944446305_0001_m_000000
23/09/08 11:22:04 INFO SparkHadoopMapRedUtil: attempt_202309081121344786622643944446305_0001_m_000000_1: Committed. Elapsed time: 18 ms.
23/09/08 11:22:04 INFO FileOutputCommitter: Saved output of task 'attempt_202309081121348303587356551291178_0001_m_000002_3' to hdfs://nameservice1/user/hive/warehouse/test_liming.db/table1_pt/.spark-staging-fc1a9f7a-5729-498e-b710-249e90217f66/_temporary/0/task_202309081121348303587356551291178_0001_m_000002
23/09/08 11:22:04 INFO SparkHadoopMapRedUtil: attempt_202309081121348303587356551291178_0001_m_000002_3: Committed. Elapsed time: 9 ms.
23/09/08 11:22:04 INFO Executor: Finished task 2.0 in stage 1.0 (TID 3). 3470 bytes result sent to driver

9 Technical Background - Multiple Spark Jobs Inserting into Different Partitions of a Partitioned Table, Some in Dynamic and Some in Static Partition Mode

  • Testing shows that as long as no more than two of the jobs insert data in static partition mode (any number of jobs may insert data in dynamic partition mode), no error occurs.
  • Screenshot taken during execution:

(screenshot)


10 Technical Background - Configuring Spark to Use the Hive SerDe Instead of the Spark Built-in Data Source Writer

  • Configure Spark to use the Hive SerDe instead of the Spark built-in data source writer, i.e. set the parameters spark.sql.hive.convertInsertingPartitionedTable=false and spark.sql.hive.convertMetastoreOrc=false (this can be done in kyuubi-defaults.conf or spark-defaults.conf, and also at the user/session level). Then run the tests separately against a non-partitioned table, a partitioned table in static partition mode, and a partitioned table in dynamic partition mode.
  • During job/task execution, files such as the following are generated: hdfs://nameservice1/user/hive/warehouse/test_liming.db/table1/.hive-staging_hive_2023-09-08_17-35-01_497_4555303478309834157-59/-ext-10000/_temporary/0/_temporary/attempt_20230908173501912656469073753420_0059_m_000000_59/part-00000-6d83cb93-228e-4717-bf77-83e36c10cbe8-c000
  • After task commit, a file such as the following is generated: hdfs://nameservice1/user/hive/warehouse/test_liming.db/table1/.hive-staging_hive_2023-09-08_17-34-58_485_4893366407663162793-58/-ext-10000/_temporary/0/task_202309081734587917020092949673358_0058_m_000000
  • After job commit, a file such as the following is generated: /user/hive/warehouse/test_liming.db/table1_pt/g=23/part-00002-6efd7b7b-9a44-410a-b15d-1c5ee49a523f.c000
  • Key log lines:
Key log - all jobs/tasks succeeded:
23/09/08 17:35:01 INFO FileOutputCommitter: File Output Committer Algorithm version is 1
23/09/08 17:35:01 INFO FileOutputCommitter: FileOutputCommitter skip cleanup _temporary folders under output directory:false, ignore cleanup failures: false
23/09/08 17:35:01 INFO SQLHadoopMapReduceCommitProtocol: Using output committer class org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter
23/09/08 17:35:01 INFO PhysicalFsWriter: ORC writer created for path: hdfs://nameservice1/user/hive/warehouse/test_liming.db/table1/.hive-staging_hive_2023-09-08_17-35-01_497_4555303478309834157-59/-ext-10000/_temporary/0/_temporary/attempt_20230908173501912656469073753420_0059_m_000000_59/part-00000-6d83cb93-228e-4717-bf77-83e36c10cbe8-c000 with stripeSize: 67108864 blockSize: 268435456 compression: Compress: ZLIB buffer: 262144
23/09/08 17:35:02 INFO FileOutputCommitter: Saved output of task 'attempt_202309081734587917020092949673358_0058_m_000000_58' to hdfs://nameservice1/user/hive/warehouse/test_liming.db/table1/.hive-staging_hive_2023-09-08_17-34-58_485_4893366407663162793-58/-ext-10000/_temporary/0/task_202309081734587917020092949673358_0058_m_000000
23/09/08 17:35:02 INFO SparkHadoopMapRedUtil: attempt_202309081734587917020092949673358_0058_m_000000_58: Committed. Elapsed time: 4 ms.
23/09/08 17:35:02 INFO Executor: Finished task 0.0 in stage 58.0 (TID 58). 2498 bytes result sent to driver
23/09/08 17:35:42 INFO FileUtils: Creating directory if it doesn't exist: hdfs://nameservice1/user/hive/warehouse/test_liming.db/table1_pt/.hive-staging_hive_2023-09-08_17-35-42_083_5954793858553566623-61
23/09/08 17:35:42 INFO FileOutputCommitter: File Output Committer Algorithm version is 1
23/09/08 17:35:42 INFO FileOutputCommitter: FileOutputCommitter skip cleanup _temporary folders under output directory:false, ignore cleanup failures: false
23/09/08 17:35:42 INFO SQLHadoopMapReduceCommitProtocol: Using output committer class org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter
23/09/08 17:49:00 INFO Hive: New loading path = hdfs://nameservice1/user/hive/warehouse/test_liming.db/table1_pt/.hive-staging_hive_2023-09-08_17-48-20_910_862844915956183505-137/-ext-10000/g=21 with partSpec {g=21}
23/09/08 17:49:00 INFO Hive: New loading path = hdfs://nameservice1/user/hive/warehouse/test_liming.db/table1_pt/.hive-staging_hive_2023-09-08_17-48-20_910_862844915956183505-137/-ext-10000/g=22 with partSpec {g=22}
23/09/08 17:49:00 INFO Hive: New loading path = hdfs://nameservice1/user/hive/warehouse/test_liming.db/table1_pt/.hive-staging_hive_2023-09-08_17-48-20_910_862844915956183505-137/-ext-10000/g=23 with partSpec {g=23}
23/09/08 17:49:00 INFO TrashPolicyDefault: Moved: 'hdfs://nameservice1/user/hive/warehouse/test_liming.db/table1_pt/g=21/part-00000-aac3aa0e-5de4-4b8d-ae29-725fb692ed1c.c000' to trash at: hdfs://nameservice1/user/hive/.Trash/Current/user/hive/warehouse/test_liming.db/table1_pt/g=21/part-00000-aac3aa0e-5de4-4b8d-ae29-725fb692ed1c.c000
23/09/08 17:49:00 INFO FileUtils: Creating directory if it doesn't exist: hdfs://nameservice1/user/hive/warehouse/test_liming.db/table1_pt/g=21
23/09/08 17:49:00 INFO TrashPolicyDefault: Moved: 'hdfs://nameservice1/user/hive/warehouse/test_liming.db/table1_pt/g=23/part-00002-aac3aa0e-5de4-4b8d-ae29-725fb692ed1c.c000' to trash at: hdfs://nameservice1/user/hive/.Trash/Current/user/hive/warehouse/test_liming.db/table1_pt/g=23/part-00002-aac3aa0e-5de4-4b8d-ae29-725fb692ed1c.c000
23/09/08 17:49:00 INFO FileUtils: Creating directory if it doesn't exist: hdfs://nameservice1/user/hive/warehouse/test_liming.db/table1_pt/g=23
23/09/08 17:49:01 INFO TrashPolicyDefault: Moved: 'hdfs://nameservice1/user/hive/warehouse/test_liming.db/table1_pt/g=22/part-00001-aac3aa0e-5de4-4b8d-ae29-725fb692ed1c.c000' to trash at: hdfs://nameservice1/user/hive/.Trash/Current/user/hive/warehouse/test_liming.db/table1_pt/g=22/part-00001-aac3aa0e-5de4-4b8d-ae29-725fb692ed1c.c000
23/09/08 17:49:01 INFO FileUtils: Creating directory if it doesn't exist: hdfs://nameservice1/user/hive/warehouse/test_liming.db/table1_pt/g=22
23/09/08 17:49:01 INFO Hive: Loaded 3 partitions
  • Screenshot taken during execution - non-partitioned table:

(screenshot)


  • Screenshots taken during execution - static partition mode:

(screenshots)

  • Screenshots taken during execution - dynamic partition mode:


(screenshots)


11 Technical Background - Configuring Spark Not to Clean Up Temporary Directories

  • Configure Spark not to clean up the temporary directories created during job execution, i.e. set the Spark parameter spark.hadoop.mapreduce.fileoutputcommitter.cleanup.skipped=true. Then run the tests separately against a non-partitioned table, a partitioned table in static partition mode, and a partitioned table in dynamic partition mode.
  • Note that with this setting a _temporary directory is left behind after the job finishes and must be cleaned up manually and asynchronously.
  • Screenshot taken during execution - non-partitioned table:

(screenshot)


  • Screenshot taken during execution - partitioned table, static partitions:

(screenshot)

  • Screenshots taken during execution - partitioned table, dynamic partitions. During execution the following is generated: /user/hive/warehouse/test_liming.db/table1_pt/.spark-staging-7809b23e-e675-42f4-93fd-97e6467ed5e4/_temporary/0, but when execution finishes /user/hive/warehouse/test_liming.db/table1_pt/.spark-staging-7809b23e-e675-42f4-93fd-97e6467ed5e4/ is cleaned up, so only the following remains:

(screenshots)
