Clarify SAM file encoding (mostly ASCII, some UTF-8 parts) #670

Merged 1 commit on May 15, 2023
12 changes: 10 additions & 2 deletions SAMv1.tex
@@ -67,8 +67,16 @@ \section{The SAM Format Specification}
BAM file may optionally specify the version being used via the
{\tt @HD VN} tag. For full version history see Appendix~\ref{sec:history}.

-Unless explicitly specified elsewhere, all fields are encoded using 7-bit US-ASCII \footnote{Charset ANSI\_X3.4-1968 as defined in RFC1345.} in using the POSIX / C locale.
-Regular expressions listed use the POSIX / IEEE Std 1003.1 extended syntax.
+SAM file contents are 7-bit US-ASCII, except for certain field values as individually specified which may contain other Unicode characters encoded in UTF-8.
+Alternatively and equivalently, SAM files are encoded in UTF-8 but non-ASCII characters are permitted only within certain field values as explicitly specified in the descriptions of those fields.%
+\footnote{Hence in particular SAM files must not begin with a byte order mark~(BOM) and lines of text are delimited by ASCII line terminator characters only.
+% Unicode identifies VT and FF as line break characters as well, but no one uses them in SAM.
+In addition to the local platform's text file line termination conventions, implementations may wish to support \textsc{lf} and \textsc{cr\>lf} for interoperability with other platforms.}
+
+Where it makes a difference, SAM file contents should be read and written using the POSIX\,/\,C locale.
+For example, floating-point values in SAM always use `{\tt .}' for the decimal-point character.
+
+The regular expressions in this specification are written using the POSIX\,/\,IEEE Std 1003.1 extended syntax.

\subsection{An example}\label{sec:example}
Suppose we have the following alignment with bases in lowercase
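
As an aside on the change above, the new encoding rules can be sketched as a byte-level check. This is a minimal Python sketch under stated assumptions, not part of the specification or of any SAM library: the function name `check_sam_encoding` is hypothetical, and because the diff does not list which fields permit UTF-8, the sketch only flags non-ASCII lines for follow-up instead of deciding whether they are allowed.

```python
import codecs


def check_sam_encoding(data: bytes) -> list[str]:
    """Return human-readable findings about the raw bytes of a SAM file.

    Hypothetical helper for illustration only.
    """
    findings = []

    # The footnote says a SAM file must not begin with a byte order mark.
    if data.startswith(codecs.BOM_UTF8):
        findings.append("file begins with a UTF-8 byte order mark")
        data = data[len(codecs.BOM_UTF8):]

    # Split on LF only and drop one trailing CR, so both LF and CR LF
    # terminators are accepted; no Unicode line separators are honoured.
    for number, line in enumerate(data.split(b"\n"), start=1):
        if line.endswith(b"\r"):
            line = line[:-1]
        if line.isascii():
            continue  # pure 7-bit US-ASCII, always acceptable
        try:
            line.decode("utf-8")
        except UnicodeDecodeError:
            findings.append(f"line {number}: not valid UTF-8")
        else:
            # Assumption: which fields permit UTF-8 is defined elsewhere
            # in the specification, so this is left to the caller.
            findings.append(
                f"line {number}: non-ASCII text; check that the field permits UTF-8"
            )
    return findings
```

An empty list means every line is plain ASCII and no BOM was found. Working on `bytes` and splitting on `b"\n"` by hand avoids both locale-dependent decoding and `str.splitlines()`, which would also break lines at VT, FF, and other Unicode separators.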