BF: CS-258 reformat qsub, sge_ckpt, sge_hostnameutils
ernst-bablick committed Sep 24, 2024
1 parent 26f0aa7 commit 4457f97
Showing 2 changed files with 89 additions and 138 deletions.
116 changes: 46 additions & 70 deletions doc/markdown/man/man1/gethost.include.md
@@ -1,127 +1,103 @@
# NAME

gethostname -
get local hostname.
`gethostname` - get local hostname.

gethostbyname -
get local host information for specified hostname.
`gethostbyname` - get local host information for a specified hostname.

gethostbyaddr -
get hostname via IP address.
`gethostbyaddr` - get hostname via IP address.

getservbyname -
get configured port number of service
`getservbyname` - get the configured port number of a service.

# SYNTAX

**gethostname \[-help\|-name\|-aname\|-all\]**
`gethostname` \[ `-help` \| `-name` \| `-aname` \| `-all` \]

**gethostbyname \[-help\|-name\|-aname\|-all\]** **\<name>**
`gethostbyname` \[ `-help` \| `-name` \| `-aname` \| `-all` \] \<name>

**gethostbyaddr \[-help\|-name\|-aname\|-all\]** **\<ip>**
`gethostbyaddr` \[ `-help` \| `-name` \| `-aname` \| `-all` \] \<ip>

**getservbyname \[-help\|-number\]** **\<service>**
`getservbyname` \[ `-help` \| `-number` \] \<service>

# DESCRIPTION

*gethostname* and *gethostbyname* are used to get the local resolved
host name. *gethostbyaddr* is used to get the hostname of a specified IP
address (dotted decimal notation). *getservbyname* can be used to get
the configured port number of a service (e.g. from /etc/services).
`gethostname` and `gethostbyname` are used to get a host name as resolved on the local host: `gethostname` for the
local host itself, `gethostbyname` for a specified host name. `gethostbyaddr` is used to get the hostname of a
specified IP address (dotted decimal notation). `getservbyname` can be used to get the configured port number of a
service (e.g. from /etc/services).

The hostname utils are primarily used by the xxQS_NAMExx installation
scripts. *gethostname* , *gethostbyname* and *gethostbyaddr* called
without any option will print out the hostname, all specified aliases,
and the IP address of the locally resolved hostname. Calling
*getservbyname* without any option will print out the full service
entry.
The hostname utils are primarily used by the xxQS_NAMExx installation scripts. Called without any option,
`gethostname`, `gethostbyname` and `gethostbyaddr` print the hostname, all specified aliases, and the IP address of
the locally resolved hostname. Called without any option, `getservbyname` prints the full service entry.

# OPTIONS

## **-help**
## -help

Prints a list of all options.

## **-name**
## -name

This option only reports the primary name of the host.
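A rough approximation of `-name` is sketched below. It assumes that the primary name corresponds to the canonical name
that `getaddrinfo(3)` returns when `AI_CANONNAME` is set; this is an assumption, and the utility may resolve names
differently. `local_primary_name` is a hypothetical helper.

```c
#include <netdb.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

/* Hypothetical helper: get the local host name and, if the resolver
 * knows it, replace it with the canonical (primary) name.
 * Returns 0 on success, -1 if gethostname() fails. */
int local_primary_name(char *buf, size_t len)
{
    struct addrinfo hints, *res = NULL;

    if (len == 0 || gethostname(buf, len) != 0)
        return -1;
    buf[len - 1] = '\0';           /* termination is unspecified on truncation */

    memset(&hints, 0, sizeof(hints));
    hints.ai_family = AF_UNSPEC;
    hints.ai_flags = AI_CANONNAME; /* ask the resolver for the canonical name */

    if (getaddrinfo(buf, NULL, &hints, &res) == 0) {
        if (res->ai_canonname != NULL) {
            strncpy(buf, res->ai_canonname, len - 1);
            buf[len - 1] = '\0';
        }
        freeaddrinfo(res);
    }
    /* If resolution fails, the unresolved local name is kept. */
    return 0;
}
```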

## **-aname**
## -aname

If this option is set, the xxQS_NAMExx host alias file is used for host
name resolving. It is necessary to set the environment variable
xxQS_NAME_Sxx_ROOT and, if more than one cell is defined, also
xxQS_NAME_Sxx_CELL.
If this option is set, the xxQS_NAMExx host alias file is used for host name resolution. The environment variable
\$xxQS_NAME_Sxx_ROOT must be set and, if more than one cell is defined, also \$xxQS_NAME_Sxx_CELL.

This option will print out the xxQS_NAMExx host name.
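The following sketch shows how the host alias file location could be derived from the environment. It assumes that
the product-specific variables are named `SGE_ROOT` and `SGE_CELL`, that the file lives at
`$SGE_ROOT/$SGE_CELL/common/host_aliases`, and that the cell is called `default` when `SGE_CELL` is unset. All of these
are assumptions, and `host_alias_path` is a hypothetical helper.

```c
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical helper: build the host alias file path used by -aname.
 * Assumes the layout $SGE_ROOT/$SGE_CELL/common/host_aliases and a cell
 * named "default" when SGE_CELL is unset.
 * Returns 0 on success, -1 if SGE_ROOT is unset or 'buf' is too small. */
int host_alias_path(char *buf, size_t len)
{
    const char *root = getenv("SGE_ROOT");
    const char *cell = getenv("SGE_CELL");
    int n;

    if (root == NULL || *root == '\0')
        return -1;              /* the root directory is mandatory */
    if (cell == NULL || *cell == '\0')
        cell = "default";       /* assumed default cell name */

    n = snprintf(buf, len, "%s/%s/common/host_aliases", root, cell);
    return (n < 0 || (size_t)n >= len) ? -1 : 0;
}
```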

## **-all**
## -all

By using the **-all** option all available host information will be
printed. This information includes the host name, the xxQS_NAMExx host
name, all host aliases, and the IP address of the host.
With the `-all` option, all available host information is printed. This information includes the host name,
the xxQS_NAMExx host name, all host aliases, and the IP address of the host.

## **-number**
## -number

This option will print out the port number of the specified service
name.
This option will print out the port number of the specified service name.

## **\<name>**
## \<name>

The host name for which the information is requested.

## **\<ip>**
## \<ip>

The IP address (dotted decimal notation) for which the information is
requested.
The IP address (dotted decimal notation) for which the information is requested.

## **\<service>**
## \<service>

The service name for which the information is requested (e.g. ftp,
sge_qmaster or sge_execd).
The service name for which the information is requested (e.g. ftp, sge_qmaster or sge_execd).

# EXAMPLES

The following example shows how to get the port number of the FTP
service:
The following example shows how to get the port number of the FTP service:

> >getservbyname -number ftp
> 21
> getservbyname -number ftp
21

The next example shows the output of gethostname -all when the host
alias file contains this line:
The next example shows the output of `gethostname -all` when the host alias file contains this line:

gridmaster extern_name extern_name.mydomain
gridmaster extern_name extern_name.mydomain

The local host resolving must also provide the alias name "gridmaster".
Each xxQS_NAMExx host that wants to use the cluster must be able to
resolve the host name "gridmaster".
Local host name resolution must also provide the alias name "gridmaster". Each xxQS_NAMExx host that wants to use
the cluster must be able to resolve the host name "gridmaster".

To setup an alias name, edit your /etc/hosts file or modify your NIS
setup to provide the alias for the NIS clients.
To set up an alias name, edit your /etc/hosts file or modify your NIS setup to provide the alias for the NIS clients.

The host alias file must be readable from each host (use e.g. NFS shared
file location).
The host alias file must be readable from each host (e.g. use a file location shared via NFS).

> >gethostname -all
> Hostname: extern_name.mydomain
> SGE name: gridmaster
> Aliases: loghost gridmaster
> Host Address(es): 192.168.143.99
> gethostname -all
Hostname: extern_name.mydomain
SGE name: gridmaster
Aliases: loghost gridmaster
Host Address(es): 192.168.143.99

# ENVIRONMENT VARIABLES

xxQS_NAME_Sxx_ROOT
Specifies the location of the xxQS_NAMExx standard configuration files.

xxQS_NAME_Sxx_CELL
If set, specifies the default xxQS_NAMExx cell.
For a complete list of common environment variables used by all xxQS_NAMExx commands, see xxqs_name_sxx_intro(1).

# SEE ALSO

*xxqs_name_sxx_intro*(1), *host_aliases*(5),
xxqs_name_sxx_intro(1), host_aliases(5)

# COPYRIGHT

See *xxqs_name_sxx_intro*(1) for a full statement of rights and
permissions.
See xxqs_name_sxx_intro(1) for a full statement of rights and permissions.
111 changes: 43 additions & 68 deletions doc/markdown/man/man1/sge_ckpt.md
@@ -8,84 +8,59 @@ date: __DATE__

# NAME

xxQS_NAMExx Checkpointing - the xxQS_NAMExx checkpointing mechanism and
checkpointing support
xxQS_NAMExx Checkpointing - the xxQS_NAMExx checkpointing mechanism and checkpointing support

# DESCRIPTION

xxQS_NAMExx supports two levels of checkpointing: the user level and a
operating system provided transparent level. User level checkpointing
refers to applications, which do their own checkpointing by writing
restart files at certain times or algorithmic steps and by properly
processing these restart files when restarted.

Transparent checkpointing has to be provided by the operating system and
is usually integrated in the operating system kernel.

Checkpointing jobs need to be identified to the xxQS_NAMExx system by
using the *-ckpt* option of the *qsub1*() command. The argument to this
flag refers to a so called checkpointing environment, which defines the
attributes of the checkpointing method to be used (see *checkpoint5*()
for details). Checkpointing environments are setup by the *qconf1*()
options *-ackpt*, *-dckpt*, *-mckpt* and *-sckpt*. The *qsub1*() option
*-c* can be used to overwrite the *when* attribute for the referenced
checkpointing environment.

If a queue is of the type CHECKPOINTING, jobs need to have the
checkpointing attribute flagged (see the **-ckpt** option to *qsub1*())
to be permitted to run in such a queue. As opposed to the behavior for
regular batch jobs, checkpointing jobs are aborted under conditions, for
which batch or interactive jobs are suspended or even stay unaffected.
These conditions are:

- Explicit suspension of the queue or job via *qmod1*() by the
cluster administration or a queue owner if the *x* occasion
specifier (see *qsub1*() *-c* and *checkpoint5*()) was assigned to
the job.

- A load average value exceeding the suspend threshold as configured
for the corresponding queues (see *queue_conf5*().)

- Shutdown of the xxQS_NAMExx execution daemon
*xxqs_name_sxx_execd8*() being responsible for the checkpointing
job.

After abortion, the jobs will migrate to other queues unless they were
submitted to one specific queue by an explicit user request. The
migration of jobs leads to a dynamic load balancing. **Note:** The
abortion of checkpointed jobs will free all resources (memory, swap
space) which the job occupies at that time. This is opposed to the
xxQS_NAMExx supports two levels of checkpointing: the user level and an operating-system-provided transparent level.
User level checkpointing refers to applications that do their own checkpointing by writing restart files at certain
times or algorithmic steps and by properly processing these restart files when restarted.
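User level checkpointing as described above can be sketched in C: the program periodically records its progress in a
restart file and, when restarted, resumes from the recorded step. The file format, the helper names (`load_restart`,
`save_restart`, `run_steps`) and the write-then-rename approach are illustrative assumptions, not part of xxQS_NAMExx.

```c
#include <stdio.h>

/* Returns the step to resume from (0 if no usable restart file exists). */
long load_restart(const char *path)
{
    long step = 0;
    FILE *f = fopen(path, "r");

    if (f != NULL) {
        if (fscanf(f, "%ld", &step) != 1)
            step = 0;           /* unreadable file: start over */
        fclose(f);
    }
    return step;
}

/* Writes the next step to start from; returns 0 on success, -1 on error.
 * Writing to a temporary file and renaming it means an aborted job never
 * leaves a half-written restart file behind (rename() replaces the target
 * atomically within one file system). */
int save_restart(const char *path, long next_step)
{
    char tmp[4096];
    FILE *f;
    int ok, n;

    n = snprintf(tmp, sizeof tmp, "%s.tmp", path);
    if (n < 0 || (size_t)n >= sizeof tmp)
        return -1;
    if ((f = fopen(tmp, "w")) == NULL)
        return -1;
    ok = fprintf(f, "%ld\n", next_step) > 0;
    if (fclose(f) != 0 || !ok)
        return -1;
    return rename(tmp, path) == 0 ? 0 : -1;
}

/* Runs steps [restart point, total), checkpointing every 'every' steps
 * ('every' must be > 0); returns the number of steps executed in this run. */
long run_steps(const char *path, long total, long every)
{
    long done = 0;

    for (long step = load_restart(path); step < total; step++) {
        /* ... one algorithmic step of real work would go here ... */
        done++;
        if ((step + 1) % every == 0)
            save_restart(path, step + 1); /* a real job should check errors */
    }
    return done;
}
```

If such a job is aborted and restarted on another host, it repeats at most `every - 1` steps, provided the restart file
is visible there (see RESTRICTIONS).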

Transparent checkpointing has to be provided by the operating system and is usually integrated into the operating
system kernel.

Checkpointing jobs need to be identified to the xxQS_NAMExx system by using the `-ckpt` option of the qsub(1) command.
The argument to this flag refers to a so-called checkpointing environment, which defines the attributes of the
checkpointing method to be used (see checkpoint(5) for details). Checkpointing environments are set up with the qconf(1)
options `-ackpt`, `-dckpt`, `-mckpt` and `-sckpt`. The qsub(1) option `-c` can be used to override the *when*
attribute of the referenced checkpointing environment.

If a queue is of the type *CHECKPOINTING*, jobs need to have the checkpointing attribute flagged (see the `-ckpt`
option to qsub(1)) to be permitted to run in such a queue. In contrast to regular batch jobs, checkpointing jobs are
aborted under conditions in which batch or interactive jobs are suspended or even remain unaffected. These
conditions are:

- Explicit suspension of the queue or job via qmod(1) by the cluster administrator or a queue owner if the *x*
occasion specifier (see qsub(1) `-c` and checkpoint(5)) was assigned to the job.

- A load average value exceeding the suspend threshold as configured for the corresponding queues (see queue_conf(5)).

- Shutdown of the xxQS_NAMExx execution daemon xxqs_name_sxx_execd(8) that is responsible for the checkpointing job.

After abortion, the jobs will migrate to other queues unless they were submitted to one specific queue by an explicit
user request. The migration of jobs leads to dynamic load balancing. Note: Aborting checkpointed jobs frees all
resources (memory, swap space) that the job occupies at that time. This is in contrast to suspended regular jobs,
which still occupy swap space.

# RESTRICTIONS

When a job migrates to a queue on another machine at present no files
are transferred automatically to that machine. This means that all files
which are used throughout the entire job including restart files,
executables and scratch files must be visible or transferred explicitly
(e.g. at the beginning of the job script).

There are also some practical limitations regarding use of disk space
for transparently checkpointing jobs. Checkpoints of a transparently
checkpointed application are usually stored in a checkpoint file or
directory by the operating system. The file or directory contains all
the text, data, and stack space for the process, along with some
additional control information. This means jobs which use a very large
virtual address space will generate very large checkpoint files. Also
the workstations on which the jobs will actually execute may have little
free disk space. Thus it is not always possible to transfer a
transparent checkpointing job to a machine, even though that machine is
idle. Since large virtual memory jobs must wait for a machine that is
both idle, and has a sufficient amount of free disk space, such jobs may
suffer long turnaround times.
At present, when a job migrates to a queue on another machine, no files are transferred automatically to that machine.
This means that all files used throughout the entire job, including restart files, executables and scratch files,
must be visible on that machine or transferred explicitly (e.g. at the beginning of the job script).

There are also some practical limitations regarding the use of disk space for transparently checkpointed jobs.
Checkpoints of a transparently checkpointed application are usually stored in a checkpoint file or directory by the
operating system. The file or directory contains all the text, data, and stack space for the process, along with some
additional control information. This means that jobs which use a very large virtual address space will generate very
large checkpoint files. Also, the workstations on which the jobs will actually execute may have little free disk
space. Thus it is not always possible to transfer a transparent checkpointing job to a machine, even though that
machine is idle. Since large virtual memory jobs must wait for a machine that is both idle and has a sufficient
amount of free disk space, such jobs may suffer long turnaround times.
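Whether a host has enough free disk space for a large checkpoint can be checked with POSIX `statvfs(3)`. The sketch
below is a hypothetical helper, `has_room_for_checkpoint`; the caller supplies the expected checkpoint size, for
example an estimate based on the job's virtual address space.

```c
#include <sys/statvfs.h>

/* Hypothetical check: does the file system holding 'dir' have at least
 * 'needed' bytes available to unprivileged users?
 * Returns 1 if yes, 0 if no, -1 if statvfs() fails. */
int has_room_for_checkpoint(const char *dir, unsigned long long needed)
{
    struct statvfs st;

    if (statvfs(dir, &st) != 0)
        return -1;
    /* f_bavail counts free blocks for unprivileged users, in units of f_frsize */
    unsigned long long avail =
        (unsigned long long)st.f_bavail * (unsigned long long)st.f_frsize;
    return avail >= needed ? 1 : 0;
}
```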

# SEE ALSO

*xxqs_name_sxx_intro1*(,) *qconf1*(,) *qmod1*(,) *qsub1*(,)
*checkpoint5*(,) *xxQS_NAMExx Installation and Administration Guide,*
*xxQS_NAMExx User's Guide*
xxqs_name_sxx_intro(1), qconf(1), qmod(1), qsub(1), checkpoint(5)

# COPYRIGHT

See *xxqs_name_sxx_intro1*() for a full statement of rights and
permissions.
See xxqs_name_sxx_intro(1) for a full statement of rights and permissions.
