
[KYUUBI #2642] Fix flaky test - JpsApplicationOperation with spark local mode #2669

Closed
wants to merge 5 commits into from

Conversation

turboFei
Member

@turboFei turboFei commented May 17, 2022

Why are the changes needed?

To close #2642

In this PR, we put `getApplicationInfoByTag` and `killApplicationByTag` into the same `eventually` block to try to fix this confusing flaky test.
Since we kill the application with SIGTERM, the process is not terminated immediately
(see SIGTERM vs SIGKILL: What's the Difference?).
So, after `killApplicationByTag`, we also have to wrap the following `getApplicationInfoByTag` block (which asserts that the application is finished) in `eventually`.

BTW, we also fix a potential issue where the `jps` command is not on the `PATH`.
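
The reason the post-kill assertion needs a retry loop can be sketched outside Scala. Below is a minimal Python sketch (a plain `sleep` child process stands in for the engine, which is an assumption for illustration only) showing that SIGTERM delivery is asynchronous, so an `eventually`-style poll is needed before asserting the process is gone:

```python
import signal
import subprocess
import time

# Hypothetical stand-in for the engine: a child process that just sleeps.
proc = subprocess.Popen(["sleep", "30"])

proc.send_signal(signal.SIGTERM)  # polite shutdown request, not instant death

# Poll like ScalaTest's `eventually`: retry until the process has exited.
state = "RUNNING"
for _ in range(50):
    if proc.poll() is not None:   # poll() is non-blocking; None means still alive
        state = "FINISHED"
        break
    time.sleep(0.1)

print(state)
```

A single check immediately after the kill could still observe the process alive, which is exactly the shape of the flakiness this PR works around.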

How was this patch tested?

  • Add some test cases that check the changes thoroughly including negative and positive cases if possible

  • Add screenshots for manual tests if appropriate

  • Run test locally before make a pull request

@pan3793
Member

pan3793 commented May 17, 2022

Also cc @cxzl25

@turboFei turboFei requested a review from cfmcgrady May 17, 2022 01:26
@turboFei turboFei self-assigned this May 17, 2022
@turboFei turboFei added this to the v1.6.0 milestone May 17, 2022
@turboFei
Member Author

It seems that this PR cannot fix the issue.

Confusingly, we can get the engine id when invoking `getApplicationInfoByTag`, but fail to get it when invoking `killApplicationByTag`.

    assert(jps.isSupported(builder.clusterManager()))
    eventually(Timeout(10.seconds)) {
      val desc1 = jps.getApplicationInfoByTag(id)
      assert(desc1.contains("id"))
      assert(desc1("name").contains(id))
      assert(desc1("state") === "RUNNING")
    }

    val response = jps.killApplicationByTag(id)
    assert(response._1, response._2)
    assert(response._2 startsWith "Succeeded to terminate:")

@codecov-commenter

codecov-commenter commented May 17, 2022

Codecov Report

Merging #2669 (14d2ddf) into master (a163f3a) will decrease coverage by 0.04%.
The diff coverage is 100.00%.

@@             Coverage Diff              @@
##             master    #2669      +/-   ##
============================================
- Coverage     64.29%   64.24%   -0.05%     
- Complexity       82       84       +2     
============================================
  Files           385      385              
  Lines         18656    18673      +17     
  Branches       2529     2531       +2     
============================================
+ Hits          11994    11996       +2     
- Misses         5515     5531      +16     
+ Partials       1147     1146       -1     
Impacted Files Coverage Δ
...apache/kyuubi/engine/JpsApplicationOperation.scala 83.87% <100.00%> (ø)
.../org/apache/kyuubi/jdbc/hive/KyuubiConnection.java 4.02% <0.00%> (-0.11%) ⬇️
.../org/apache/kyuubi/config/KyuubiReservedKeys.scala 0.00% <0.00%> (ø)
...e/kyuubi/server/KyuubiTBinaryFrontendService.scala 84.37% <0.00%> (ø)
...c/main/java/org/apache/kyuubi/jdbc/hive/Utils.java 36.58% <0.00%> (+0.40%) ⬆️
...he/kyuubi/spark/connector/tpcds/TPCDSCatalog.scala 36.36% <0.00%> (+3.03%) ⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update a163f3a...14d2ddf. Read the comment docs.

@ulysses-you
Contributor

seems still flaky ..

- JpsApplicationOperation with spark local mode *** FAILED ***
  response._1 was false APPLICATION_NOT_FOUND (JpsApplicationOperationSuite.scala:87)

@yaooqinn
Member

yaooqinn commented May 17, 2022

kill: (11656): No such process
- JpsApplicationOperation with spark local mode *** FAILED ***
  response._1 was false Failed to terminate: 11656 org.apache.spark.launcher.Main org.apache.spark.deploy.SparkSubmit --class org.apache.kyuubi.engine.spark.SparkSQLEngine --conf spark.kyuubi.engine.operation.log.dir.root=target/engine_operation_logs --conf spark.kyuubi.session.idle.timeout=PT3M --conf spark.abc=f41c3b50-1025-49c4-b1a4-3f4e01e91345 --conf spark.kyuubi.testing=true --conf spark.kyuubi.frontend.bind.host=localhost --conf spark.master=local --conf spark.kyuubi.metrics.json.location=/home/runner/work/incubator-kyuubi/incubator-kyuubi/kyuubi-server/target/metrics --conf spark.kyuubi.operation.log.dir.root=target/server_operation_logs --proxy-user runner /home/runner/work/incubator-kyuubi/incubator-kyuubi/externals/kyuubi-spark-sql-engine/target/kyuubi-spark-sql-engine_2.12-1.6.0-SNAPSHOT.jar, due to Nonzero exit code: 1 (JpsApplicationOperationSuite.scala:87)

@turboFei
Member Author

turboFei commented May 17, 2022

kill: (11656): No such process
- JpsApplicationOperation with spark local mode *** FAILED ***
  response._1 was false Failed to terminate: 11656 org.apache.spark.launcher.Main org.apache.spark.deploy.SparkSubmit --class org.apache.kyuubi.engine.spark.SparkSQLEngine --conf spark.kyuubi.engine.operation.log.dir.root=target/engine_operation_logs --conf spark.kyuubi.session.idle.timeout=PT3M --conf spark.abc=f41c3b50-1025-49c4-b1a4-3f4e01e91345 --conf spark.kyuubi.testing=true --conf spark.kyuubi.frontend.bind.host=localhost --conf spark.master=local --conf spark.kyuubi.metrics.json.location=/home/runner/work/incubator-kyuubi/incubator-kyuubi/kyuubi-server/target/metrics --conf spark.kyuubi.operation.log.dir.root=target/server_operation_logs --proxy-user runner /home/runner/work/incubator-kyuubi/incubator-kyuubi/externals/kyuubi-spark-sql-engine/target/kyuubi-spark-sql-engine_2.12-1.6.0-SNAPSHOT.jar, due to Nonzero exit code: 1 (JpsApplicationOperationSuite.scala:87)

It seems that the process should not have terminated itself before being killed.

@turboFei
Member Author

The UT passed.

Let me re-run them again.

@yaooqinn
Member

kill: (11656): No such process
- JpsApplicationOperation with spark local mode *** FAILED ***
  response._1 was false Failed to terminate: 11656 org.apache.spark.launcher.Main org.apache.spark.deploy.SparkSubmit --class org.apache.kyuubi.engine.spark.SparkSQLEngine --conf spark.kyuubi.engine.operation.log.dir.root=target/engine_operation_logs --conf spark.kyuubi.session.idle.timeout=PT3M --conf spark.abc=f41c3b50-1025-49c4-b1a4-3f4e01e91345 --conf spark.kyuubi.testing=true --conf spark.kyuubi.frontend.bind.host=localhost --conf spark.master=local --conf spark.kyuubi.metrics.json.location=/home/runner/work/incubator-kyuubi/incubator-kyuubi/kyuubi-server/target/metrics --conf spark.kyuubi.operation.log.dir.root=target/server_operation_logs --proxy-user runner /home/runner/work/incubator-kyuubi/incubator-kyuubi/externals/kyuubi-spark-sql-engine/target/kyuubi-spark-sql-engine_2.12-1.6.0-SNAPSHOT.jar, due to Nonzero exit code: 1 (JpsApplicationOperationSuite.scala:87)

It seems that the process should not have terminated itself before being killed.

22/05/17 05:00:38 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
2022-05-17 05:00:38.915 INFO util.SignalRegister: Registering signal handler for TERM
2022-05-17 05:00:38.917 INFO util.SignalRegister: Registering signal handler for HUP
2022-05-17 05:00:38.917 INFO util.SignalRegister: Registering signal handler for INT
2022-05-17 05:00:39.285 INFO conf.HiveConf: Found configuration file null
2022-05-17 05:00:39.543 INFO spark.SparkContext: Running Spark version 3.1.3
2022-05-17 05:00:39.621 INFO resource.ResourceUtils: ==============================================================
2022-05-17 05:00:39.621 INFO resource.ResourceUtils: No custom resources configured for spark.driver.
2022-05-17 05:00:39.622 INFO resource.ResourceUtils: ==============================================================
2022-05-17 05:00:39.622 INFO spark.SparkContext: Submitted application: org.apache.kyuubi.engine.spark.SparkSQLEngine
2022-05-17 05:00:39.680 INFO resource.ResourceProfile: Default ResourceProfile created, executor resources: Map(cores -> name: cores, amount: 1, script: , vendor: , memory -> name: memory, amount: 1024, script: , vendor: , offHeap -> name: offHeap, amount: 0, script: , vendor: ), task resources: Map(cpus -> name: cpus, amount: 1.0)
2022-05-17 05:00:39.683 INFO resource.ResourceProfile: Limiting resource is cpu
2022-05-17 05:00:39.690 INFO resource.ResourceProfileManager: Added ResourceProfile id: 0
2022-05-17 05:00:39.776 INFO spark.SecurityManager: Changing view acls to: runner
2022-05-17 05:00:39.776 INFO spark.SecurityManager: Changing modify acls to: runner
2022-05-17 05:00:39.776 INFO spark.SecurityManager: Changing view acls groups to: 
2022-05-17 05:00:39.777 INFO spark.SecurityManager: Changing modify acls groups to: 
2022-05-17 05:00:39.777 INFO spark.SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users  with view permissions: Set(runner); groups with view permissions: Set(); users  with modify permissions: Set(runner); groups with modify permissions: Set()
2022-05-17 05:00:40.202 INFO util.Utils: Successfully started service 'sparkDriver' on port 43437.
2022-05-17 05:00:40.269 INFO spark.SparkEnv: Registering MapOutputTracker
2022-05-17 05:00:40.375 INFO spark.SparkEnv: Registering BlockManagerMaster
2022-05-17 05:00:40.413 INFO storage.BlockManagerMasterEndpoint: Using org.apache.spark.storage.DefaultTopologyMapper for getting topology information
2022-05-17 05:00:40.414 INFO storage.BlockManagerMasterEndpoint: BlockManagerMasterEndpoint up
2022-05-17 05:00:40.419 INFO spark.SparkEnv: Registering BlockManagerMasterHeartbeat
2022-05-17 05:00:40.438 INFO storage.DiskBlockManager: Created local directory at /tmp/blockmgr-f7e1c27f-c2ec-4752-b9fd-5a02a842acf1
2022-05-17 05:00:40.472 INFO memory.MemoryStore: MemoryStore started with capacity 366.3 MiB
2022-05-17 05:00:40.503 INFO spark.SparkEnv: Registering OutputCommitCoordinator
2022-05-17 05:00:40.615 INFO util.log: Logging initialized @4628ms to org.sparkproject.jetty.util.log.Slf4jLog
2022-05-17 05:00:40.755 INFO server.Server: jetty-9.4.40.v20210413; built: 2021-04-13T20:42:42.668Z; git: b881a572662e1943a14ae12e7e1207989f218b74; jvm 1.8.0_332-b09
2022-05-17 05:00:40.851 INFO server.Server: Started @4864ms
2022-05-17 05:00:41.059 INFO server.AbstractConnector: Started ServerConnector@6bfdb014{HTTP/1.1, (http/1.1)}{localhost:42741}
2022-05-17 05:00:41.059 INFO util.Utils: Successfully started service 'SparkUI' on port 42741.
2022-05-17 05:00:41.102 INFO handler.ContextHandler: Started o.s.j.s.ServletContextHandler@367795c7{/jobs,null,AVAILABLE,@Spark}
2022-05-17 05:00:41.106 INFO handler.ContextHandler: Started o.s.j.s.ServletContextHandler@75699e35{/jobs/json,null,AVAILABLE,@Spark}
2022-05-17 05:00:41.109 INFO handler.ContextHandler: Started o.s.j.s.ServletContextHandler@4aeaadc1{/jobs/job,null,AVAILABLE,@Spark}
2022-05-17 05:00:41.112 INFO handler.ContextHandler: Started o.s.j.s.ServletContextHandler@7daa61f3{/jobs/job/json,null,AVAILABLE,@Spark}
2022-05-17 05:00:41.115 INFO handler.ContextHandler: Started o.s.j.s.ServletContextHandler@6e4ea0bd{/stages,null,AVAILABLE,@Spark}
2022-05-17 05:00:41.125 INFO handler.ContextHandler: Started o.s.j.s.ServletContextHandler@78f9ed3e{/stages/json,null,AVAILABLE,@Spark}
2022-05-17 05:00:41.126 INFO handler.ContextHandler: Started o.s.j.s.ServletContextHandler@b0964b2{/stages/stage,null,AVAILABLE,@Spark}
2022-05-17 05:00:41.128 INFO handler.ContextHandler: Started o.s.j.s.ServletContextHandler@64040287{/stages/stage/json,null,AVAILABLE,@Spark}
2022-05-17 05:00:41.128 INFO handler.ContextHandler: Started o.s.j.s.ServletContextHandler@6f89f665{/stages/pool,null,AVAILABLE,@Spark}
2022-05-17 05:00:41.129 INFO handler.ContextHandler: Started o.s.j.s.ServletContextHandler@4925f4f5{/stages/pool/json,null,AVAILABLE,@Spark}
2022-05-17 05:00:41.130 INFO handler.ContextHandler: Started o.s.j.s.ServletContextHandler@3a43d133{/storage,null,AVAILABLE,@Spark}
2022-05-17 05:00:41.130 INFO handler.ContextHandler: Started o.s.j.s.ServletContextHandler@5f2afe62{/storage/json,null,AVAILABLE,@Spark}
2022-05-17 05:00:41.131 INFO handler.ContextHandler: Started o.s.j.s.ServletContextHandler@28782602{/storage/rdd,null,AVAILABLE,@Spark}
2022-05-17 05:00:41.143 INFO handler.ContextHandler: Started o.s.j.s.ServletContextHandler@68105edc{/storage/rdd/json,null,AVAILABLE,@Spark}
2022-05-17 05:00:41.147 INFO handler.ContextHandler: Started o.s.j.s.ServletContextHandler@38b972d7{/environment,null,AVAILABLE,@Spark}
2022-05-17 05:00:41.147 INFO handler.ContextHandler: Started o.s.j.s.ServletContextHandler@3935e9a8{/environment/json,null,AVAILABLE,@Spark}
2022-05-17 05:00:41.148 INFO handler.ContextHandler: Started o.s.j.s.ServletContextHandler@5b56b654{/executors,null,AVAILABLE,@Spark}
2022-05-17 05:00:41.149 INFO handler.ContextHandler: Started o.s.j.s.ServletContextHandler@534243e4{/executors/json,null,AVAILABLE,@Spark}
2022-05-17 05:00:41.149 INFO handler.ContextHandler: Started o.s.j.s.ServletContextHandler@470a9030{/executors/threadDump,null,AVAILABLE,@Spark}
2022-05-17 05:00:41.151 INFO handler.ContextHandler: Started o.s.j.s.ServletContextHandler@27494e46{/executors/threadDump/json,null,AVAILABLE,@Spark}
2022-05-17 05:00:41.166 INFO handler.ContextHandler: Started o.s.j.s.ServletContextHandler@1e411d81{/static,null,AVAILABLE,@Spark}
2022-05-17 05:00:41.168 INFO handler.ContextHandler: Started o.s.j.s.ServletContextHandler@39ab59f8{/,null,AVAILABLE,@Spark}
2022-05-17 05:00:41.172 INFO handler.ContextHandler: Started o.s.j.s.ServletContextHandler@111610e6{/api,null,AVAILABLE,@Spark}
2022-05-17 05:00:41.179 INFO handler.ContextHandler: Started o.s.j.s.ServletContextHandler@4beddc56{/jobs/job/kill,null,AVAILABLE,@Spark}
2022-05-17 05:00:41.180 INFO handler.ContextHandler: Started o.s.j.s.ServletContextHandler@1b812421{/stages/stage/kill,null,AVAILABLE,@Spark}
2022-05-17 05:00:41.182 INFO ui.SparkUI: Bound SparkUI to localhost, and started at http://localhost:42741
2022-05-17 05:00:41.204 INFO spark.SparkContext: Added JAR file:/home/runner/work/incubator-kyuubi/incubator-kyuubi/externals/kyuubi-spark-sql-engine/target/kyuubi-spark-sql-engine_2.12-1.6.0-SNAPSHOT.jar at spark://localhost:43437/jars/kyuubi-spark-sql-engine_2.12-1.6.0-SNAPSHOT.jar with timestamp 1652763639524
2022-05-17 05:00:41.500 INFO executor.Executor: Starting executor ID driver on host localhost
2022-05-17 05:00:41.528 INFO executor.Executor: Using REPL class URI: spark://localhost:43437/classes
2022-05-17 05:00:41.555 INFO executor.Executor: Fetching spark://localhost:43437/jars/kyuubi-spark-sql-engine_2.12-1.6.0-SNAPSHOT.jar with timestamp 1652763639524
2022-05-17 05:00:41.619 INFO client.TransportClientFactory: Successfully created connection to localhost/127.0.0.1:43437 after 42 ms (0 ms spent in bootstraps)
2022-05-17 05:00:41.631 INFO util.Utils: Fetching spark://localhost:43437/jars/kyuubi-spark-sql-engine_2.12-1.6.0-SNAPSHOT.jar to /tmp/spark-fa12c304-8837-445c-864c-e2a3522e8f69/userFiles-1df90a08-5586-4f46-9271-562e051c5790/fetchFileTemp2804306972146769908.tmp
2022-05-17 05:00:41.737 INFO executor.Executor: Adding file:/tmp/spark-fa12c304-8837-445c-864c-e2a3522e8f69/userFiles-1df90a08-5586-4f46-9271-562e051c5790/kyuubi-spark-sql-engine_2.12-1.6.0-SNAPSHOT.jar to class loader
2022-05-17 05:00:41.761 INFO util.Utils: Successfully started service 'org.apache.spark.network.netty.NettyBlockTransferService' on port 40877.
2022-05-17 05:00:41.761 INFO netty.NettyBlockTransferService: Server created on localhost:40877
2022-05-17 05:00:41.763 INFO storage.BlockManager: Using org.apache.spark.storage.RandomBlockReplicationPolicy for block replication policy
2022-05-17 05:00:41.780 INFO storage.BlockManagerMaster: Registering BlockManager BlockManagerId(driver, localhost, 40877, None)
2022-05-17 05:00:41.785 INFO storage.BlockManagerMasterEndpoint: Registering block manager localhost:40877 with 366.3 MiB RAM, BlockManagerId(driver, localhost, 40877, None)
2022-05-17 05:00:41.792 INFO storage.BlockManagerMaster: Registered BlockManager BlockManagerId(driver, localhost, 40877, None)
2022-05-17 05:00:41.793 INFO storage.BlockManager: Initialized BlockManager: BlockManagerId(driver, localhost, 40877, None)
2022-05-17 05:00:42.341 INFO handler.ContextHandler: Started o.s.j.s.ServletContextHandler@776802b0{/metrics/json,null,AVAILABLE,@Spark}
2022-05-17 05:00:42.993 INFO internal.SharedState: Setting hive.metastore.warehouse.dir ('null') to the value of spark.sql.warehouse.dir ('file:/home/runner/work/incubator-kyuubi/incubator-kyuubi/kyuubi-server/target/work/runner/spark-warehouse/').
2022-05-17 05:00:42.994 INFO internal.SharedState: Warehouse path is 'file:/home/runner/work/incubator-kyuubi/incubator-kyuubi/kyuubi-server/target/work/runner/spark-warehouse/'.
2022-05-17 05:00:43.080 INFO handler.ContextHandler: Started o.s.j.s.ServletContextHandler@6c796cc1{/SQL,null,AVAILABLE,@Spark}
2022-05-17 05:00:43.081 INFO handler.ContextHandler: Started o.s.j.s.ServletContextHandler@1cb7936c{/SQL/json,null,AVAILABLE,@Spark}
2022-05-17 05:00:43.083 INFO handler.ContextHandler: Started o.s.j.s.ServletContextHandler@7123be6c{/SQL/execution,null,AVAILABLE,@Spark}
2022-05-17 05:00:43.086 INFO handler.ContextHandler: Started o.s.j.s.ServletContextHandler@77a2aa4a{/SQL/execution/json,null,AVAILABLE,@Spark}
2022-05-17 05:00:43.172 INFO handler.ContextHandler: Started o.s.j.s.ServletContextHandler@30c0d731{/static/sql,null,AVAILABLE,@Spark}
2022-05-17 05:00:50.995 INFO hive.HiveUtils: Initializing HiveMetastoreConnection version 2.3.7 using Spark classes.
2022-05-17 05:00:51.266 INFO conf.HiveConf: Found configuration file null
2022-05-17 05:00:52.449 INFO session.SessionState: Created local directory: /tmp/runner
2022-05-17 05:00:52.488 INFO session.SessionState: Created HDFS directory: /tmp/hive/runner/4b4bfd19-ddc4-456f-b07a-6f0905e56c93
2022-05-17 05:00:52.513 INFO session.SessionState: Created local directory: /tmp/runner/4b4bfd19-ddc4-456f-b07a-6f0905e56c93
2022-05-17 05:00:52.540 INFO session.SessionState: Created HDFS directory: /tmp/hive/runner/4b4bfd19-ddc4-456f-b07a-6f0905e56c93/_tmp_space.db
2022-05-17 05:00:52.576 INFO client.HiveClientImpl: Warehouse location for Hive client (version 2.3.7) is file:/home/runner/work/incubator-kyuubi/incubator-kyuubi/kyuubi-server/target/work/runner/spark-warehouse/
2022-05-17 05:00:54.465 WARN conf.HiveConf: HiveConf of name hive.stats.jdbc.timeout does not exist
2022-05-17 05:00:54.466 WARN conf.HiveConf: HiveConf of name hive.stats.retries.wait does not exist
2022-05-17 05:00:54.466 INFO metastore.HiveMetaStore: 0: Opening raw store with implementation class:org.apache.hadoop.hive.metastore.ObjectStore
2022-05-17 05:00:54.538 INFO metastore.ObjectStore: ObjectStore, initialize called
2022-05-17 05:00:54.947 INFO DataNucleus.Persistence: Property hive.metastore.integral.jdo.pushdown unknown - will be ignored
2022-05-17 05:00:54.952 INFO DataNucleus.Persistence: Property datanucleus.cache.level2 unknown - will be ignored
2022-05-17 05:00:58.994 INFO metastore.ObjectStore: Setting MetaStore object pin classes with hive.metastore.cache.pinobjtypes="Table,StorageDescriptor,SerDeInfo,Partition,Database,Type,FieldSchema,Order"
2022-05-17 05:01:03.444 INFO metastore.MetaStoreDirectSql: Using direct SQL, underlying DB is DERBY
2022-05-17 05:01:03.460 INFO metastore.ObjectStore: Initialized ObjectStore
2022-05-17 05:01:03.643 WARN metastore.ObjectStore: Version information not found in metastore. hive.metastore.schema.verification is not enabled so recording the schema version 2.3.0
2022-05-17 05:01:03.643 WARN metastore.ObjectStore: setMetaStoreSchemaVersion called but recording version is disabled: version = 2.3.0, comment = Set by MetaStore runner@10.1.0.178
2022-05-17 05:01:03.717 WARN metastore.ObjectStore: Failed to get database default, returning NoSuchObjectException
2022-05-17 05:01:04.185 INFO metastore.HiveMetaStore: Added admin role in metastore
2022-05-17 05:01:04.187 INFO metastore.HiveMetaStore: Added public role in metastore
2022-05-17 05:01:04.300 INFO metastore.HiveMetaStore: No user is added in admin role, since config is empty
2022-05-17 05:01:04.597 INFO metastore.HiveMetaStore: 0: get_all_functions
2022-05-17 05:01:04.598 INFO HiveMetaStore.audit: ugi=runner	ip=unknown-ip-addr	cmd=get_all_functions	
2022-05-17 05:01:04.732 INFO metastore.HiveMetaStore: 0: get_database: default
2022-05-17 05:01:04.732 INFO HiveMetaStore.audit: ugi=runner	ip=unknown-ip-addr	cmd=get_database: default	
2022-05-17 05:01:04.787 INFO metastore.HiveMetaStore: 0: get_databases: *
2022-05-17 05:01:04.787 INFO HiveMetaStore.audit: ugi=runner	ip=unknown-ip-addr	cmd=get_databases: *	
2022-05-17 05:01:05.401 INFO codegen.CodeGenerator: Code generated in 211.927261 ms
2022-05-17 05:01:05.467 INFO codegen.CodeGenerator: Code generated in 7.722799 ms
2022-05-17 05:01:05.518 INFO util.ThreadUtils: SparkSQLSessionManager-exec-pool: pool size: 100, wait queue size: 100, thread keepalive time: 60000 ms
2022-05-17 05:01:05.522 INFO operation.SparkSQLOperationManager: Service[SparkSQLOperationManager] is initialized.
2022-05-17 05:01:05.524 INFO session.SparkSQLSessionManager: Service[SparkSQLSessionManager] is initialized.
2022-05-17 05:01:05.524 INFO spark.SparkSQLBackendService: Service[SparkSQLBackendService] is initialized.
2022-05-17 05:01:05.565 INFO spark.SparkTBinaryFrontendService: Initializing SparkTBinaryFrontend on localhost:40031 with [9, 999] worker threads
2022-05-17 05:01:05.566 INFO spark.SparkTBinaryFrontendService: Service[SparkTBinaryFrontend] is initialized.
2022-05-17 05:01:05.566 INFO spark.SparkSQLEngine: Service[SparkSQLEngine] is initialized.
2022-05-17 05:01:05.580 INFO operation.SparkSQLOperationManager: Service[SparkSQLOperationManager] is started.
2022-05-17 05:01:05.580 INFO session.SparkSQLSessionManager: Service[SparkSQLSessionManager] is started.
2022-05-17 05:01:05.580 INFO spark.SparkSQLBackendService: Service[SparkSQLBackendService] is started.
2022-05-17 05:01:05.580 INFO spark.SparkTBinaryFrontendService: Service[SparkTBinaryFrontend] is started.
2022-05-17 05:01:05.580 INFO spark.SparkSQLEngine: Service[SparkSQLEngine] is started.
2022-05-17 05:01:05.594 INFO handler.ContextHandler: Started o.s.j.s.ServletContextHandler@13cc3984{/kyuubi,null,AVAILABLE,@Spark}
2022-05-17 05:01:05.594 INFO handler.ContextHandler: Started o.s.j.s.ServletContextHandler@5625ba2{/kyuubi/json,null,AVAILABLE,@Spark}
2022-05-17 05:01:05.597 INFO handler.ContextHandler: Started o.s.j.s.ServletContextHandler@4bd1b07d{/kyuubi/session,null,AVAILABLE,@Spark}
2022-05-17 05:01:05.598 INFO handler.ContextHandler: Started o.s.j.s.ServletContextHandler@5fafd099{/kyuubi/session/json,null,AVAILABLE,@Spark}
2022-05-17 05:01:05.613 INFO handler.ContextHandler: Started o.s.j.s.ServletContextHandler@2512155e{/kyuubi/stop,null,AVAILABLE,@Spark}
2022-05-17 05:01:05.616 INFO spark.SparkSQLEngine: 
    Spark application name: org.apache.kyuubi.engine.spark.SparkSQLEngine
          application ID:  local-1652763641274
          application web UI: http://localhost:42741
          master: local
          version: 3.1.3
          driver: [cpu: 1, mem: 1g]
          executor: [cpu: 2, mem: 1g, maxNum: 2]
    Start time: Tue May 17 05:00:39 UTC 2022
    
    User: runner (shared mode: USER)
    State: STARTED
    

The engine log looks pretty normal.

@turboFei
Member Author

SparkOperationProgressSuite:
2022-05-17 07:19:51,657 ScalaTest-main-running-DiscoverySuite ERROR Filters contains invalid attributes "onMatch", "onMismatch"
2022-05-17 07:19:51,660 ScalaTest-main-running-DiscoverySuite ERROR Filters contains invalid attributes "onMatch", "onMismatch"
- test operation progress *** FAILED ***
  The code passed to eventually never returned normally. Attempted 329 times over 25.002496456 seconds. Last failure message: checkFlag1 was true, but checkFlag2 was false. (SparkOperationProgressSuite.scala:55)

@ulysses-you
Contributor

SparkOperationProgressSuite:
2022-05-17 07:19:51,657 ScalaTest-main-running-DiscoverySuite ERROR Filters contains invalid attributes "onMatch", "onMismatch"
2022-05-17 07:19:51,660 ScalaTest-main-running-DiscoverySuite ERROR Filters contains invalid attributes "onMatch", "onMismatch"
- test operation progress *** FAILED ***
  The code passed to eventually never returned normally. Attempted 329 times over 25.002496456 seconds. Last failure message: checkFlag1 was true, but checkFlag2 was false. (SparkOperationProgressSuite.scala:55)

cc @wForget

@turboFei
Member Author

The UT passed, except for the SparkOperationProgressSuite failure above.


@turboFei
Member Author

Triggered a re-run again.

@yaooqinn
Member

please update the PR desc

@turboFei
Member Author

please update the PR desc

updated

@turboFei
Member Author

Re-ran again and it succeeded.

Thanks for your review, merging to master

@turboFei turboFei closed this in c1fb7bf May 17, 2022
@turboFei turboFei deleted the fix_flaky_test branch May 17, 2022 12:11
@@ -81,16 +81,18 @@ class JpsApplicationOperationSuite extends KyuubiFunSuite {
assert(desc1.contains("id"))
assert(desc1("name").contains(id))
assert(desc1("state") === "RUNNING")
val response = jps.killApplicationByTag(id)
Contributor

The root cause of this problem is that spark-submit spawns two processes:
it first starts `org.apache.spark.launcher.Main org.apache.spark.deploy.SparkSubmit` to generate the submission parameters,
and then starts `org.apache.spark.deploy.SparkSubmit` to execute the main logic.

So the call to `getApplicationInfoByTag` succeeds because it sees the first process, but sometimes `killApplicationByTag` happens to run exactly while the two processes are being switched. At that moment, `jps` cannot see any process with the relevant tag.

Now that it is wrapped in the `eventually` block, it seems to work OK.

Is there any possibility, such as the engine starting slowly (more than 10s), or the kill command failing?

However, after this investigation, the next time we encounter this case it should be quick to fix.
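
The two-process race described above can be illustrated with a sketch. The `jps -ml` snapshot below is hypothetical, and `find_by_tag` is a made-up helper rather than the actual `JpsApplicationOperation` code; it shows how matching on the exact main class skips the short-lived launcher wrapper:

```python
# Hypothetical snapshot of `jps -ml` while spark-submit is starting up:
# process 99427 is the launcher wrapper, 99416 is the real SparkSubmit.
jps_output = """\
99427 org.apache.spark.launcher.Main org.apache.spark.deploy.SparkSubmit --conf spark.abc=123
99416 org.apache.spark.deploy.SparkSubmit --conf spark.abc=123 spark-internal
"""

def find_by_tag(tag, main_class="org.apache.spark.deploy.SparkSubmit"):
    """Made-up helper: return (pid, cmdline) rows whose first token after
    the pid is exactly `main_class` and whose command line contains `tag`.
    Matching the exact main class excludes the launcher wrapper process."""
    hits = []
    for line in jps_output.strip().splitlines():
        pid, _, rest = line.partition(" ")
        if rest.split(" ", 1)[0] == main_class and tag in rest:
            hits.append((int(pid), rest))
    return hits

print(find_by_tag("spark.abc=123"))  # only the real SparkSubmit process matches
```

If the lookup instead matched any process carrying the tag, it could pick the launcher wrapper and then fail to kill it once it exits during the handover.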

Member Author

thanks for the comments @cxzl25
cc @cfmcgrady

Member Author

@turboFei turboFei May 17, 2022

Is there any possibility, such as the engine starting slowly (more than 10s), or the kill command failing?

It seems difficult for `JpsApplicationOperation` to handle that.

Contributor

@cfmcgrady cfmcgrady May 17, 2022

I think `JpsApplicationOperation.getApplicationInfoByTag` should make sure that the started JVM application is `org.apache.spark.deploy.SparkSubmit`.

Contributor

also cc @yaooqinn

pan3793 pushed a commit that referenced this pull request Nov 30, 2023
…cal mode

# 🔍 Description
## Issue References 🔗

This pull request fixes #2642

## Describe Your Solution 🔧

#2669 (comment)

```bash
jps -ml
```

```
99427 org.apache.spark.launcher.Main org.apache.spark.deploy.SparkSubmit --class org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver --conf spark.abc=123
```

```
99416 org.apache.spark.deploy.SparkSubmit --conf spark.abc=123 --class org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver spark-internal
```

## Types of changes 🔖

- [x] Bugfix (non-breaking change which fixes an issue)
- [ ] New feature (non-breaking change which adds functionality)
- [ ] Breaking change (fix or feature that would cause existing functionality to change)

## Test Plan 🧪

#### Behavior Without This Pull Request ⚰️
https://github.com/apache/kyuubi/actions/runs/6981210739/job/18997901268?pr=5773
```
- JpsApplicationOperation with spark local mode *** FAILED ***
  The code passed to eventually never returned normally. Attempted 41 times over 10.235822332 seconds. Last failure message: "23463" did not equal null. (JpsApplicationOperationSuite.scala:92)
```

#### Behavior With This Pull Request 🎉
GA

#### Related Unit Tests
JpsApplicationOperationSuite

"JpsApplicationOperation with spark local mode"

---

# Checklists
## 📝 Author Self Checklist

- [x] My code follows the [style guidelines](https://kyuubi.readthedocs.io/en/master/contributing/code/style.html) of this project
- [x] I have performed a self-review
- [ ] I have commented my code, particularly in hard-to-understand areas
- [x] I have made corresponding changes to the documentation
- [x] My changes generate no new warnings
- [ ] I have added tests that prove my fix is effective or that my feature works
- [ ] New and existing unit tests pass locally with my changes
- [x] This patch was not authored or co-authored using [Generative Tooling](https://www.apache.org/legal/generative-tooling.html)

## 📝 Committer Pre-Merge Checklist

- [ ] Pull request title is okay.
- [ ] No license issues.
- [ ] Milestone correctly set?
- [ ] Test coverage is ok
- [ ] Assignees are selected.
- [ ] Minimum number of approvals
- [ ] No changes are requested

**Be nice. Be informative.**

Closes #5785 from cxzl25/test_jps_local.

Closes #5785

bb456a6 [sychen] Fix flaky test - JpsApplicationOperation with spark local mode

Authored-by: sychen <sychen@ctrip.com>
Signed-off-by: Cheng Pan <chengpan@apache.org>

Successfully merging this pull request may close these issues.

Flaky test: JpsApplicationOperation with spark local mode
7 participants