Update string.md #335
base: master
Conversation
Hi @mikethetexan!
Thank you so much for your edits and all your hard work!
I left some comments on your edits to try to align the article with our Style Guide and the structure of other articles in the repo.
Please let me know what you think. Let's work together! 😊
Thanks!
The computer does not know what letters are, only numbers.
So every character needs to be represented by a unique number (codepoint).
First, one has to decide how many characters we care to represent. That's what is called the **character set**. Each character is typically assigned a **codepoint**.
The structure of this sentence would make it hard to machine translate in the future, which can lead to quality issues.
Perhaps we can simplify it with titles and simple explanations beneath them:
### Character representation
#### Character count
Character counts determine the use of character sets.
Each character is typically assigned a codepoint.
#### Character encoding
...
What do you think?
Please note that, following other articles, only the subject of the article is written in bold format.
Also, sentences are to be separated by newlines. See spacing.
@@ -25,12 +25,14 @@ So a string can even be used for a paragraph or book.

## Encoding

In software, every character has a unique number.
Computers deal with numbers, not with characters. In order to represent a character, it has to be encoded into a series of bits. Since computers deal with bytes (8 bits), a character is typically encoded into one or several bytes of data.
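The idea in this paragraph can be shown with a minimal Python sketch (the characters chosen here are just illustrations):

```python
# A character is stored as one or more bytes.
# "A" has codepoint 65 and fits in a single UTF-8 byte.
data = "A".encode("utf-8")
print(list(data))  # [65]

# A character outside the ASCII range needs more than one byte.
print(list("é".encode("utf-8")))  # [195, 169]
```

So the same character can occupy a different number of bytes depending on which character it is and which encoding is used.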
Try to avoid parentheses. For Machine Translate, they are better turned into sentences.
Perhaps we can omit that part altogether, as it may be out of Machine Translate's scope.
How about...?
Computers deal with numbers, not with characters.
A character is typically encoded into one or several bytes of data.
I'd phrase it as:
At the low level, computers use numbers, not characters. So each character is actually represented by a unique number.
This part is optional, in my view:
For a character to be represented with a number, it is encoded as one or more bytes. Each byte is a sequence of bits. Each bit is 0 or 1.
Because the main point for this audience is that each character gets a number, like 65.
The fact that 65 in decimal can also be represented as 1000001 in binary or 41 in hex is irrelevant.
Because that's explaining how integers are encoded, which is out of scope.
In any case, we could add a column to the table to show the binary.
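The suggested binary column could be computed directly from the codepoint. A small Python sketch, using "A" as an illustrative character:

```python
# The same codepoint shown in decimal, binary, and hexadecimal.
codepoint = ord("A")
print(codepoint)               # 65
print(format(codepoint, "b"))  # 1000001
print(format(codepoint, "x"))  # 41
```

All three are just different notations for the same number; only the decimal column is strictly needed for this audience.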
For example, the **ASCII** standard maps English and other particularly useful characters to numbers.
A character set can have one or several possible encodings.

The **ASCII** standard defines 128 characters and a single encoding. Here is an example of a few ASCII characters and their decimal byte values:
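The table values can be generated with a short Python sketch (the characters picked are just examples):

```python
# Print a few ASCII characters with their decimal byte values.
for ch in "ABC abc":
    print(ch, ord(ch))
# A 65, B 66, C 67, (space) 32, a 97, b 98, c 99
```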
We also try to avoid words like here, below, etc. to avoid MT risk when translating the site.
Perhaps a title is better?
#### Example of ASCII characters and decimal byte values
In 1991, a new standard emerged called Unicode. Its goal was to represent as many characters as possible in a single character set and solve the interoperability issue. Its first version contained over 7000 characters. As of September 2022, Unicode is in version 15.0 and contains 149,186 characters or codepoints!

Unicode has many different encodings. The most common one is **UTF-8**, but others exist, like **UTF-16** or even **UTF-32**. UTF-8 is a variable-length encoding, while UTF-32 is fixed-length (as long as the character set doesn't try to represent more than 4 billion characters!). UTF-32 is a memory hog, but it is predictable since it's fixed-length, and it has the advantage that the byte values are an exact match to the codepoint value. UTF-16 is also variable-length, not fixed-length, which beginner developers often get wrong. In UTF-16, in order to represent characters beyond the first plane of Unicode (the Basic Multilingual Plane), the encoding uses **surrogate pairs**. Don't ever assume that a character is always encoded with 2 bytes in UTF-16!
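The variable- versus fixed-length claims can be checked in Python. The emoji is just an example of a character outside the Basic Multilingual Plane:

```python
# Compare byte lengths of the same characters in three Unicode encodings.
# "😀" (U+1F600) lies outside the Basic Multilingual Plane,
# so UTF-16 encodes it as a surrogate pair (4 bytes).
for ch in ("A", "é", "😀"):
    print(ch,
          len(ch.encode("utf-8")),      # variable: 1, 2, 4 bytes
          len(ch.encode("utf-16-le")),  # variable: 2, 2, 4 bytes
          len(ch.encode("utf-32-le")))  # fixed: always 4 bytes

# In UTF-32, the byte values are exactly the codepoint value.
assert int.from_bytes("😀".encode("utf-32-le"), "little") == ord("😀")
```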
This information is super interesting, but I fear it might be a little out of scope for Machine Translate, as this article is about strings.
Please bear in mind that we try to keep content minimal.
Do you think we can summarize this part into fewer sentences, just to give minimal history?
The most common problem is when text is encoded with one standard, but decoded with another.
The result is often unreadable.

A common problem is that legacy character sets and their encodings are still supported by many operating systems (for compatibility reasons with old software), while newer systems may use a Unicode encoding. When text is encoded with one standard, but decoded with another, the result is often unreadable.
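The mismatch can be reproduced in one line of Python; "café" is just an example word containing a non-ASCII character:

```python
# Text encoded as UTF-8 but decoded as Latin-1 becomes unreadable.
garbled = "café".encode("utf-8").decode("latin-1")
print(garbled)  # cafÃ©
```

This kind of garbling is often called mojibake.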
I like giving more detail to the problem!
I would just put it into shorter sentences, to avoid potential MT problems in the future.
Thanks for this knowledgeable contribution to a fundamental topic! As Cecilia hinted, we have to keep the audience in mind: People who:
So we only need to cover what is relevant about strings with regard to machine translation, not with regard to all computer science and computer engineering. Basically, non-technical people use words like "segment" or "sentence", but they may hear the word "string", which is more precise, but doesn't necessarily map to one thing in the product/workflows they know. So how can we help them understand just enough and get on their way?