Update string.md #335

Open: wants to merge 1 commit into base: master

Conversation

mikethetexan

Description

Fixes # [Add the issue number]

Type of PR

Checklist:

Collaborator

@cefoo left a comment

Hi @mikethetexan!

Thank you so much for your edits and all your hard work!

I left some comments to your edits to try to align the article with our Style Guide and the structure of other articles in the repo.

Please let me know what you think. Let's work together! 😊

Thanks!


The computer does not know what letters are, only numbers.
So every character needs to be represented by a unique number (codepoint).
First, one has to decide how many characters we care to represent. That's what is called the **character set**. Each character is typically assigned a **codepoint**.
Collaborator

The structure of this sentence would make it hard to machine translate in the future. That can lead to quality issues.

Perhaps we can simplify it with titles and simple explanations beneath them:

### Character representation

#### Character count

Character counts determine the use of character sets.
Each character is typically assigned a codepoint.

#### Character encoding

...

What do you think?

Please note that, following other articles, only the subject of the article is written in bold format.

Also, sentences are to be separated by newlines. See spacing.

@@ -25,12 +25,14 @@ So a string can even be used for a paragraph or book.

## Encoding

In software, every character has a unique number.
Computers deal with numbers, not with characters. In order to represent a character, it has to be encoded into a series of bits. Since computers deal with bytes (8 bits), a character is typically encoded into one or several bytes of data.
Collaborator

Try to avoid parentheses. For Machine Translate, they are better turned into sentences.

Perhaps we can omit that part altogether, as it may be out of Machine Translate's scope.

How about...?

Computers deal with numbers, not with characters.
A character is typically encoded into one or several bytes of data.

Collaborator

I'd phrase it as:

At the low level, computers use numbers, not characters. So each character is actually represented by a unique number.

This part is optional, in my view:

For a character to be represented with a number, it is encoded as one or more bytes. Each byte is a sequence of bits. Each bit is 0 or 1.

Because the main point for this audience is that each character gets a number, like 65.

The fact that 65 in decimal can also be represented as 1000001 in binary or 41 in hex is irrelevant.

Because that's explaining how integers are encoded, which is out of scope.

In any case, we could add a column to the table to show the binary.
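
To illustrate the point above, here is a minimal Python sketch showing that one character has one number, and that binary and hex are only different ways of writing that same number. The variable names are made up for this example and are not part of the article.

```python
# Illustrative only: one character, one number, several notations.
char = "A"

codepoint = ord(char)                # the number assigned to the character
as_binary = format(codepoint, "b")   # the same number written in binary
as_hex = format(codepoint, "x")      # the same number written in hexadecimal

print(char, codepoint, as_binary, as_hex)  # A 65 1000001 41

# Going back from the number to the character.
assert chr(codepoint) == char
```

A binary column in the table would just show the `as_binary` value for each row.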

For example, the **ASCII** standard maps English and other particularly useful characters to numbers.
A character set can have one or several possible encodings.

The **ASCII** standard defines 128 characters and a single encoding. Here is an example of a few ASCII characters and their decimal byte values:
Collaborator

We also try to avoid words like here, below, etc. to avoid MT risk when translating the site.
Perhaps a title is better?

#### Example of ASCII characters and decimal byte values


In 1991, a new standard emerged called Unicode. Its goal was to represent as many characters as possible in a single character set and solve the interoperability issue. Its first version contained over 7000 characters. As of September 2022, Unicode is in version 15.0 and contains 149,186 characters or codepoints!

Unicode has many different encodings. The most common one is **UTF-8**, but others exist, like **UTF-16** or even **UTF-32**. UTF-8 is a variable length encoding, while UTF-32 is fixed length (as long as the character set doesn't try to represent more than 4 billion characters!). UTF-32 is a memory hog, but it is predictable since it's fixed length, and it has the advantage that the byte values are an exact match to the codepoint value. UTF-16 is also variable length, not fixed length; assuming otherwise is a common mistake among new developers. In UTF-16, in order to represent characters beyond the first plane of Unicode (the Basic Multilingual Plane), the encoding uses **surrogate pairs**. Don't ever assume that a character is always encoded with 2 bytes in UTF-16!
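
The variable versus fixed length claims can be checked with a short Python sketch. The sample characters and the helper name `encoded_lengths` are chosen only for illustration. The `-le` and `-be` codec variants are used so that Python does not prepend a byte order mark, which would otherwise inflate the counts.

```python
# Sample characters written as escapes so the snippet does not depend on file encoding.
SAMPLES = {
    "LATIN CAPITAL LETTER A": "A",
    "LATIN SMALL LETTER E WITH ACUTE": "\u00e9",
    "EURO SIGN": "\u20ac",
    "GRINNING FACE": "\U0001F600",  # outside the Basic Multilingual Plane
}


def encoded_lengths(char):
    """Return the number of bytes needed for one character in each encoding."""
    return {
        "utf-8": len(char.encode("utf-8")),
        "utf-16": len(char.encode("utf-16-le")),
        "utf-32": len(char.encode("utf-32-le")),
    }


for name, char in SAMPLES.items():
    print(name, encoded_lengths(char))

grinning = SAMPLES["GRINNING FACE"]

# In UTF-16 this character becomes two 16-bit units: a surrogate pair.
surrogate_pair_hex = grinning.encode("utf-16-be").hex()

# In UTF-32 the four bytes, read as one integer, equal the codepoint.
utf32_value = int.from_bytes(grinning.encode("utf-32-be"), "big")
```

In this sketch, UTF-8 needs between 1 and 4 bytes per sample, UTF-16 needs 2 bytes for the first three samples and 4 for the last one, and UTF-32 always needs 4.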
Collaborator

This information is super interesting, but I fear it might be a little out of scope for Machine Translate, as this article is about strings.
Please bear in mind that we try to keep content minimal.
Do you think we can summarize this part into fewer sentences, just to give minimal history?


The most common problem is when text is encoded with one standard, but decoded with another.
The result is often unreadable.
A common problem is that legacy character sets and their encodings are still supported by many operating systems (for compatibility reasons with old software), while newer systems may use a Unicode encoding. When text is encoded with one standard, but decoded with another, the result is often unreadable.
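
A minimal Python sketch of that mismatch, assuming the text was written as UTF-8 and then read as Latin-1 (ISO-8859-1). Latin-1 is picked here only as one example of a legacy encoding, and the sample word is made up.

```python
# The word "café", with the accented letter written as an escape.
original = "caf\u00e9"

# The sender encodes the text as UTF-8: the accented letter becomes two bytes.
data = original.encode("utf-8")

# The receiver wrongly assumes a legacy single-byte encoding.
wrong = data.decode("latin-1")

# Decoding with the encoding that was actually used restores the text.
right = data.decode("utf-8")

print(repr(wrong))  # each of the two bytes is shown as its own character
print(repr(right))
```

The wrongly decoded string is one character longer than the original, because the two bytes of the accented letter are each read as a separate character.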
Collaborator

I like giving more detail to the problem!
I would just put it into shorter sentences, to avoid potential MT problems in the future.

@bittlingmayer
Collaborator

Thanks for this knowledgeable contribution to a fundamental topic!

As Cecilia hinted, we have to keep the audience in mind:

People who:

  • don't know what a string is
  • are accessing this page via machine translation

So we only need to cover what is relevant about strings with regard to machine translation, not with regard to all computer science and computer engineering.

Basically, non-technical people use words like "segment" or "sentence", but they may hear the word "string", which is more precise, but doesn't necessarily map to one thing in the product/workflows they know.

So how can we help them understand just enough and get on their way?

3 participants