Update string.md #335
base: master
Conversation
Hi @mikethetexan!
Thank you so much for your edits and all your hard work!
I left some comments on your edits to try to align the article with our Style Guide and the structure of other articles in the repo.
Please let me know what you think. Let's work together! 😊
Thanks!
The computer does not know what letters are, only numbers.
So every character needs to be represented by a unique number (codepoint).
First, one has to decide how many characters we care to represent. That's what is called the **character set**. Each character is typically assigned a **codepoint**.
The structure of this sentence would make it hard to machine translate in the future, which can lead to quality issues.
Perhaps we can simplify it with titles and simple explanations beneath them:
### Character representation
#### Character count
Character counts determine the use of character sets.
Each character is typically assigned a codepoint.
#### Character encoding
...
What do you think?
Please note that, following other articles, only the subject of the article is written in bold format.
Also, sentences are to be separated by newlines. See spacing.
@@ -25,12 +25,14 @@ So a string can even be used for a paragraph or book.

## Encoding

In software, every character has a unique number.
Computers deal with numbers, not with characters. In order to represent a character, it has to be encoded into a series of bits. Since computers deal with bytes (8 bits), a character is typically encoded into one or several bytes of data.
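The idea in this paragraph can be shown with a minimal Python sketch (the characters chosen here are just illustrations):

```python
# A character is stored as one or more bytes.
# "A" has codepoint 65 and fits in a single UTF-8 byte.
data = "A".encode("utf-8")
print(list(data))  # [65]

# A character outside the ASCII range needs more than one byte.
print(list("é".encode("utf-8")))  # [195, 169]
```

So the same character can occupy a different number of bytes depending on which character it is and which encoding is used.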
Try to avoid parentheses. For Machine Translate, they are better turned into sentences.
Perhaps we can omit that part altogether, as it may be out of Machine Translate's scope.
How about...?
Computers deal with numbers, not with characters.
A character is typically encoded into one or several bytes of data.
I'd phrase it as:
At the low level, computers use numbers, not characters. So each character is actually represented by a unique number.
This part is optional, in my view:
For a character to be represented with a number, it is encoded as one or more bytes. Each byte is a sequence of bits. Each bit is 0 or 1.
Because the main point for this audience is that each character gets a number, like 65.
The fact that 65 in decimal can also be represented as 1000001 in binary or 41 in hex is irrelevant.
Because that's explaining how integers are encoded, which is out of scope.
In any case, we could add a column to the table to show the binary.
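The suggested binary column could be computed directly from the codepoint. A small Python sketch, using "A" as an illustrative character:

```python
# The same codepoint shown in decimal, binary, and hexadecimal.
codepoint = ord("A")
print(codepoint)               # 65
print(format(codepoint, "b"))  # 1000001
print(format(codepoint, "x"))  # 41
```

All three are just different notations for the same number; only the decimal column is strictly needed for this audience.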
For example, the **ASCII** standard maps English and other particularly useful characters to numbers.
A character set can have one or several possible encodings.

The **ASCII** standard defines 128 characters and a single encoding. Here is an example of a few ASCII characters and their decimal byte values:
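The table values can be generated with a short Python sketch (the characters picked are just examples):

```python
# Print a few ASCII characters with their decimal byte values.
for ch in "ABC abc":
    print(ch, ord(ch))
# A 65, B 66, C 67, (space) 32, a 97, b 98, c 99
```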
We also try to avoid words like here, below, etc. to avoid MT risk when translating the site.
Perhaps a title is better?
#### Example of ASCII characters and decimal byte values
In 1991, a new standard emerged called Unicode. Its goal was to represent as many characters as possible in a single character set and solve the interoperability issue. Its first version contained over 7000 characters. As of September 2022, Unicode is in version 15.0 and contains 149,186 characters or codepoints!

Unicode has many different encodings. The most common one is **UTF-8**, but others exist, like **UTF-16** or even **UTF-32**. UTF-8 is a variable-length encoding, while UTF-32 is fixed-length (as long as the character set doesn't try to represent more than 4 billion characters!). UTF-32 is a memory hog, but it is predictable since it's fixed-length, and it has the advantage that the byte values are an exact match to the codepoint value. UTF-16 is also variable-length, not fixed-length, which beginner developers often get wrong. In UTF-16, in order to represent characters beyond the first plane of Unicode (the Basic Multilingual Plane), the encoding uses **surrogate pairs**. Don't ever assume that a character is always encoded with 2 bytes in UTF-16!
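The variable- versus fixed-length claims can be checked in Python. The emoji is just an example of a character outside the Basic Multilingual Plane:

```python
# Compare byte lengths of the same characters in three Unicode encodings.
# "😀" (U+1F600) lies outside the Basic Multilingual Plane,
# so UTF-16 encodes it as a surrogate pair (4 bytes).
for ch in ("A", "é", "😀"):
    print(ch,
          len(ch.encode("utf-8")),      # variable: 1, 2, 4 bytes
          len(ch.encode("utf-16-le")),  # variable: 2, 2, 4 bytes
          len(ch.encode("utf-32-le")))  # fixed: always 4 bytes

# In UTF-32, the byte values are exactly the codepoint value.
assert int.from_bytes("😀".encode("utf-32-le"), "little") == ord("😀")
```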
This information is super interesting, but I fear it might be a little out of scope for Machine Translate, as this article is about strings.
Please bear in mind that we try to keep content minimal.
Do you think we can summarize this part into fewer sentences, just to give minimal history?
The most common problem is when text is encoded with one standard, but decoded with another.
The result is often unreadable.

A common problem is that legacy character sets and their encodings are still supported by many operating systems (for compatibility reasons with old software), while newer systems may use a Unicode encoding. When text is encoded with one standard, but decoded with another, the result is often unreadable.
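The mismatch can be reproduced in one line of Python; "café" is just an example word containing a non-ASCII character:

```python
# Text encoded as UTF-8 but decoded as Latin-1 becomes unreadable.
garbled = "café".encode("utf-8").decode("latin-1")
print(garbled)  # cafÃ©
```

This kind of garbling is often called mojibake.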
I like giving more detail to the problem!
I would just put it into shorter sentences, to avoid potential MT problems in the future.
Thanks for this knowledgeable contribution to a fundamental topic! As Cecilia hinted, we have to keep the audience in mind: People who:
So we only need to cover what is relevant about strings with regard to machine translation, not with regard to all computer science and computer engineering. Basically, non-technical people use words like "segment" or "sentence", but they may hear the word "string", which is more precise, but doesn't necessarily map to one thing in the product/workflows they know. So how can we help them understand just enough and get on their way?