Sigterm + Resume: documentation, renamings, refactorings #390
de-jcup committed Sep 26, 2024
1 parent 683f520 commit 4e2192e
Showing 29 changed files with 316 additions and 158 deletions.
Original file line number Diff line number Diff line change
@@ -15,7 +15,7 @@
import com.mercedesbenz.sechub.sharedkernel.messaging.JobMessage;
import com.mercedesbenz.sechub.sharedkernel.messaging.MessageDataKeys;
import com.mercedesbenz.sechub.sharedkernel.messaging.MessageID;
import com.mercedesbenz.sechub.sharedkernel.usecases.other.UseCaseSystemHandlesSIGTERM;
import com.mercedesbenz.sechub.sharedkernel.usecases.other.UseCaseSystemSuspendsJobsWhenSigTermReceived;

@Component
public class JobAdministrationMessageHandler implements AsynchronMessageHandler {
@@ -92,7 +92,7 @@ private void handleJobFailed(DomainMessage request) {
}

@IsReceivingAsyncMessage(MessageID.JOB_SUSPENDED)
@UseCaseSystemHandlesSIGTERM(@Step(number = 7, name = "Administration handles suspension", description = "Administration removes suspended listeners about job suspension"))
@UseCaseSystemSuspendsJobsWhenSigTermReceived(@Step(number = 7, name = "Administration handles suspended job", description = "Administration domain removes suspended job from its running job list"))
private void handleJobSuspended(DomainMessage request) {
JobMessage message = request.get(MessageDataKeys.JOB_SUSPENDED_DATA);
// we do drop job info - we only hold running and waiting jobs. The suspended
Original file line number Diff line number Diff line change
@@ -6,8 +6,8 @@
/**
* Represents the execution state of a scheduled SecHub job.
*
* Attention: never change the enum values because they are used for persistence
* as identifiers!
* Attention: never change existing enum values because they are used for
* persistence as identifiers! Only add new ones!
*
* @author Albert Tregnaghi
*
Original file line number Diff line number Diff line change
@@ -0,0 +1,64 @@
@startuml

'Hide empty parts:
hide empty fields
hide empty methods

'You can find more examples at https://plantuml.com/class-diagram

package com.mercedesbenz.sechub.domain.schedule {

class SchedulerJobBatchTriggerService {
void triggerExecutionOfNextJob()
}

class SchedulerNextJobResolver {
UUID resolveNextJobUUID();
}

class ScheduleJobMarkerService {
}

class ScheduleResumeJobService {
void resume(ScheduleSecHubJob sechubJob)
}

database DB {
entity ScheduleSecHubJob {
}
}

}


node EventBus {
}

node springcontainer as "Spring boot container" {
}

cloud restartProcess as "Restart job handling" {
}

SchedulerJobBatchTriggerService --> ScheduleJobMarkerService
ScheduleResumeJobService ...> EventBus: REQUEST_JOB_RESTART
restartProcess <. EventBus: REQUEST_JOB_RESTART
SchedulerNextJobResolver <-- ScheduleJobMarkerService
SchedulerNextJobResolver --> ScheduleSecHubJob
SchedulerJobBatchTriggerService --> ScheduleResumeJobService : when RESUMING
ScheduleJobMarkerService ..> ScheduleSecHubJob :updates execution state to RESUMING\nwhen job was in state SUSPENDED\n


springcontainer --[#darkgreen,bold]> SchedulerJobBatchTriggerService: scheduled


note top of SchedulerNextJobResolver
At first job uuids of
suspended jobs are resolved.

If no suspended job shall be executed,
the selected schedule strategy is used
to resolve the next job.
end note

@enduml
Original file line number Diff line number Diff line change
@@ -8,27 +8,23 @@ hide empty methods

package com.mercedesbenz.sechub.domain.scan {

class ScanProgressMonitor
class ScanProgressStateFetcher

class ScanJobExecutor {
}

class ScanJobExecutionRunnable{
class ScanJobExecutionRunnable {
}
class ScanMessageHandler
}



package com.mercedesbenz.sechub.domain.schedule {

class ScheduleMessageHandler {
handleJobRestartRequested()
}

class SynchronSecHubJobExecutor {
void suspend()
}

class SchedulerTerminationService {
@@ -40,6 +36,10 @@ package com.mercedesbenz.sechub.domain.schedule {
void triggerExecutionOfNextJob()
}

class SchedulerJobStatusRequestHandler {
DomainMessageSynchronousResult returnStatus();
}

database DB {
entity ScheduleSecHubJob {
}
@@ -48,25 +48,33 @@ package com.mercedesbenz.sechub.domain.schedule {
}

SchedulerJobBatchTriggerService ..> SchedulerTerminationService
SchedulerTerminationService -> ScheduleSecHubJob : persists with execution state\n`SUSPENDED`
SynchronSecHubJobExecutor -> ScheduleSecHubJob : persists with execution state\n`SUSPENDED`


node EventBus {
}

node springcontainer as "Spring boot container" {
}

springcontainer ..> SchedulerTerminationService: PreDestroy\ncalls terminate()
cloud OS {

}
OS -[#red,bold]> springcontainer: SIGTERM
springcontainer -[#red,bold]> SchedulerTerminationService: PreDestroy\ncalls terminate()

ScanProgressStateFetcher ...> EventBus : REQUEST_SCHEDULER_JOB_STATUS
EventBus ...> SchedulerJobStatusRequestHandler: REQUEST_SCHEDULER_JOB_STATUS
ScanJobExecutor -> ScanProgressStateFetcher
ScanJobExecutor --> ScanJobExecutionRunnable: suspends

SchedulerTerminationService --> SynchronSecHubJobExecutor
SchedulerJobStatusRequestHandler ... ScheduleSecHubJob : reads

note top of SchedulerJobBatchTriggerService
Blocks execution of any new jobs
inside the scheduler instance when
isTerminating() returns true
end note

note bottom of SchedulerTerminationService
Only the schedule domain is allowed to update
the job state! After the job state TODO
end note

@enduml
Original file line number Diff line number Diff line change
@@ -2,54 +2,4 @@
[[section-concepts]]
== Cross-cutting Concepts

=== Security tools
include::../shared/concepts/concept_modules_and_module_groups.adoc[]

=== Domain Driven Design
include::../shared/concepts/concept_simple_domain_driven_design.adoc[]

=== Resilience
include::../shared/concepts/concept_simple_resilience.adoc[]

=== Job restarts
include::../shared/concepts/concept_sechub_job_restart_handling.adoc[]

=== Deployment without scheduler stop
include::../shared/concepts/concept_sechub_deployment_without_scheduler_stop.adoc[]

=== Mappings
include::../shared/concepts/concept_mappings.adoc[]

// Product delegation server (headline in include - level3)
include::../shared/concepts/concept_sechub_point_of_view_for_pds.adoc[]

include::../shared/concepts/concept_archive_extraction.adoc[]



// False-positive handling (headline in include - level3)
include::../shared/concepts/concept_falsepositive_handling.adoc[]

// Product execution profiles and executor configurations (headline in include - level3)
include::../shared/concepts/execution-profiles/concept_execution_profiles_and_config.adoc[]

include::../shared/concepts/concept_product_results.adoc[]

include::../shared/concepts/concept_job_status.adoc[]

include::../shared/concepts/concept_job_cancellation.adoc[]

include::../shared/concepts/concept_auto_clean.adoc[]

include::../shared/concepts/pds-solutions/concept_pds_solution.adoc[]

=== Analytics
include::../shared/concepts/concept_analytic.adoc[]

=== Statistics
include::../shared/concepts/concept_statistic.adoc[]

=== Data encryption
include::../shared/concepts/concept_sechub_data_encryption.adoc[]


include::../shared/concepts/concept_include_all.adoc[]
Original file line number Diff line number Diff line change
@@ -0,0 +1,49 @@
// this will include all concepts at level 3. can be used by architecture and tech doc to have same info

=== Security tools
include::concept_modules_and_module_groups.adoc[]

=== Domain Driven Design
include::concept_simple_domain_driven_design.adoc[]

=== Resilience
include::concept_simple_resilience.adoc[]

=== Job restarts
include::concept_sechub_job_restart_handling.adoc[]

=== Deployment without scheduler stop
include::concept_sechub_deployment_without_scheduler_stop.adoc[]

=== Mappings
include::concept_mappings.adoc[]

// Product delegation server (headline in include - level3)
include::concept_sechub_point_of_view_for_pds.adoc[]

include::concept_archive_extraction.adoc[]

// False-positive handling (headline in include - level3)
include::concept_falsepositive_handling.adoc[]

// Product execution profiles and executor configurations (headline in include - level3)
include::execution-profiles/concept_execution_profiles_and_config.adoc[]

include::concept_product_results.adoc[]

include::concept_job_status.adoc[]

include::concept_job_cancellation.adoc[]

include::concept_auto_clean.adoc[]

include::pds-solutions/concept_pds_solution.adoc[]

=== Analytics
include::concept_analytic.adoc[]

=== Statistics
include::concept_statistic.adoc[]

=== Data encryption
include::concept_sechub_data_encryption.adoc[]
Original file line number Diff line number Diff line change
@@ -28,6 +28,20 @@ need to provide a `ProductDelegationServer` (in short form `PDS`).
- PDS server provides `auto unzipping` of uploaded resources when configured - see <<section-pds-server-config-file,PDS server configuration file>>
- When a PDS job fails or is done the resources inside job workspace location are *automatically removed*

===== Big picture
===== Communication between SecHub and PDS
The communication between SecHub server and PDS is very similar to the communication between SecHub client and SecHub server.
The `PDS adapter` performs the following steps from the {sechub} side - as a client of {pds}:

. creates a {pds} job
. _(Optional: Only necessary when {pds} does not reuse {sechub} storage)_ uploads sources and/or binaries to {pds}
. approves {pds} job
. waits until {pds} job has finished
. downloads {pds} report data

As shown in the next figure:

plantuml::diagrams/diagram_concept_product_delgation_server_bigpicture.puml[]
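
The adapter steps listed above can be sketched in plain Java. This is only an illustration under stated assumptions: `PdsClientSketch`, `PdsAdapterFlowSketch` and all method names are hypothetical and do not mirror the real PDS REST API or the real SecHub adapter classes. The sketch only shows the order of the calls, including the optional upload step.

```java
import java.util.UUID;

// Hypothetical client abstraction: method names are illustrative only and
// do not mirror the real PDS REST API.
interface PdsClientSketch {
    UUID createJob();
    void upload(UUID jobId, byte[] data);
    void approve(UUID jobId);
    boolean isFinished(UUID jobId);
    String downloadReport(UUID jobId);
}

// Hypothetical adapter flow, following the five steps from the text.
final class PdsAdapterFlowSketch {

    static String execute(PdsClientSketch pds, byte[] data,
                          boolean pdsReusesSecHubStorage, long pollMillis) {
        // 1. create a PDS job
        UUID jobId = pds.createJob();

        // 2. optional: upload only when PDS does not reuse SecHub storage
        if (!pdsReusesSecHubStorage) {
            pds.upload(jobId, data);
        }

        // 3. approve the PDS job so that PDS starts processing
        pds.approve(jobId);

        // 4. wait until the PDS job has finished (simple polling in this sketch)
        while (!pds.isFinished(jobId)) {
            try {
                Thread.sleep(pollMillis);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new IllegalStateException("Interrupted while waiting for PDS job", e);
            }
        }

        // 5. download the PDS report data
        return pds.downloadReport(jobId);
    }
}
```

A real adapter would additionally handle failures, timeouts and stored meta data for restarts; those aspects are left out here on purpose.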

==== Details about PDS
For more details please refer to the <<https://mercedes-benz.github.io/sechub/latest/sechub-product-delegation-server.html,PDS documentation>> available at

Original file line number Diff line number Diff line change
@@ -8,31 +8,45 @@ the deployment was triggered and after this the scheduler was enabled again and
started again.

This works always, but has a catch: If there are many running jobs it can take a while until all
of those running jobs are done. And also in the mean time no new jobs are started. This means that
when we have a great count of running jobs, the time gap between deployment and start of new
of those running jobs are done. In the meantime, no new jobs are started. This means that,
if we have a large number of running jobs, the time gap between deployment and start of new
jobs increases.

CI/CD builds or any other use of SecHub takes longer in the meantime, which can be unpleasant /
a bad user experience.

===== SIGTERM handling

plantuml::diagrams/diagram_sechub_sigterm_handling.puml[format=svg, title="SIGTERM handling"]
[[section-shared-concepts-stop-job-processing-on-sigterm]]
===== Stop job processing when SIGTERM received

K8s and other systems will send a `SIGTERM` signal to give the application the possibility to shut down
gracefully.

On a `SIGTERM` signal {sechub} temporarily suspends a job, allowing its {pds} instances to continue
processing it in the background. The next new SecHub server then reactivates the job and proceeds
with the results from the {pds} instances (or wait for them if still not already available).
On a `SIGTERM` signal a {sechub} server instance temporarily suspends a job, allowing its {pds}
instances to continue processing it in the background.

All running {sechub} jobs on the terminating instance are interrupted, marked with execution state
`SUSPENDED`, and get their `ENDED` timestamp set, as shown in the next figure:

plantuml::diagrams/diagram_sechub_sigterm_handling.puml[format=svg, title="SIGTERM handling"]
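
The suspension behavior can be illustrated with a minimal, framework-free Java sketch. All types here (`SketchExecutionState`, `SketchJob`, `SketchTerminationService`) are hypothetical simplifications and not the real SecHub classes. In the real system `terminate()` is called via `PreDestroy` by the Spring Boot container and the state is persisted in the database; the sketch keeps everything in memory and ignores the small race between the terminating check and job registration.

```java
import java.time.Instant;
import java.util.List;
import java.util.concurrent.CopyOnWriteArrayList;
import java.util.concurrent.atomic.AtomicBoolean;

// All types below are hypothetical simplifications, not the real SecHub classes.
enum SketchExecutionState { STARTED, SUSPENDED }

final class SketchJob {
    volatile SketchExecutionState state = SketchExecutionState.STARTED;
    volatile Instant ended; // set when the job is suspended
}

final class SketchTerminationService {

    private final AtomicBoolean terminating = new AtomicBoolean();
    private final List<SketchJob> runningJobs = new CopyOnWriteArrayList<>();

    boolean isTerminating() {
        return terminating.get();
    }

    /** Registers a job as running. Returns false when the instance is terminating. */
    boolean tryStart(SketchJob job) {
        if (isTerminating()) {
            return false; // a terminating instance does not start new jobs
        }
        runningJobs.add(job);
        return true;
    }

    /** Assumption: invoked once on shutdown (in a Spring container from a PreDestroy method). */
    void terminate(Instant now) {
        terminating.set(true);
        for (SketchJob job : runningJobs) {
            job.state = SketchExecutionState.SUSPENDED;
            job.ended = now; // ENDED timestamp is inspected later when resuming
        }
        runningJobs.clear();
    }
}
```

The current time is passed in as a parameter so that the behavior stays deterministic and easy to test.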

All running {sechub} jobs on terminating instance are marked with execution state `SUSPENDED` +
(similar to cancel) which will also update `ENDED` timestamp on job.
See also <<section-usecase-UC_079,UC-079>>

Other servers instances will restart `SUSPENDED` jobs by existing
<<section-shared-concepts-sechub-job-restart-handling,restart mechanism>> (but `SUSPENDED` will
be handled before `READY_TO_START`).
[NOTE]
====
The next new SecHub server will <<section-shared-concepts-resume-suspended-jobs,resume the suspended job>>
and proceed with the results from the {pds} instances (or wait for them if they are not yet available).
====

[[section-shared-concepts-resume-suspended-jobs]]
===== Resume suspended jobs
The batch trigger service triggers the resume operation, which leads to a `REQUEST_JOB_RESTART` event
that <<section-shared-concepts-sechub-job-restart-handling, restarts the job>>.

To prevent restarts that happen too quickly, the `ENDED` timestamp of a suspended {sechub} job is inspected:
the job is only fetched as next job when the time gap is greater than a defined (configurable) time period.
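
The time gap check can be sketched as follows. `SuspendedJobResumeCheck` and its parameter names are hypothetical and only illustrate the rule described above; the real configuration key and the real query logic are not shown in this section.

```java
import java.time.Duration;
import java.time.Instant;

// Hypothetical helper, not part of SecHub: decides whether a suspended job
// has been suspended long enough to be picked as the next job.
final class SuspendedJobResumeCheck {

    // Assumption: in a real setup this value would come from configuration.
    private final Duration minimumSuspendDuration;

    SuspendedJobResumeCheck(Duration minimumSuspendDuration) {
        this.minimumSuspendDuration = minimumSuspendDuration;
    }

    /** Returns true when the gap between ENDED and now is greater than the configured period. */
    boolean mayResume(Instant ended, Instant now) {
        if (ended == null) {
            return false; // no ENDED timestamp: treated as not resumable in this sketch
        }
        Duration gap = Duration.between(ended, now);
        return gap.compareTo(minimumSuspendDuration) > 0;
    }
}
```

With a minimum duration of one minute, a job suspended 30 seconds ago is skipped, while a job suspended 90 seconds ago is eligible to be resumed.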


plantuml::diagrams/diagram_sechub_resume_suspended_jobs.puml[format=svg, title="Resuming suspended jobs"]


See also <<section-usecase-UC_080,UC-080>>
Original file line number Diff line number Diff line change
@@ -9,7 +9,10 @@ similar: Each adapter is able to store meta data for the current job via callbac
is also responsible to handle existing meta data on restarts.

plantuml::diagrams/diagram_sechub_job_restart_handling.puml[format=svg, title="Job restart handling"]


The event `REQUEST_JOB_RESTART` is also triggered when the batch trigger service
<<section-shared-concepts-resume-suspended-jobs,resumes suspended jobs>>.

[TIP]
====
It is always a good idea to use {pds} instead of direct product handling (via dedicated