From 93e8df3f74497ce332d5812af6657b8c081fb424 Mon Sep 17 00:00:00 2001
From: Addison Phillips
Date: Wed, 21 Aug 2024 18:08:08 -0700
Subject: [PATCH] Address I18N-ACTION-90 by adding text from scroll-to-text-fragment#233

This PR includes borrowing text from an example by @hsivonen, which I intend to replace before merging with a better-adapted version. In addition, some of the text or comments from WICG/scroll-to-text-fragment#233 are being adapted into the prose of this document.

**_Submitting as draft. Not ready for review._**
---
 index.html | 75 +++++++++++++++++++++++++++++++++++++++++++++++------
 1 file changed, 66 insertions(+), 9 deletions(-)

diff --git a/index.html b/index.html
index f891aea..28fc4a9 100644
--- a/index.html
+++ b/index.html
@@ -44,7 +44,7 @@
-

This document describes string searching operations on the Web in order to allow greater interoperability. String searching refers to natural language string matching such as the "find" command in a Web browser. This document builds upon the concepts found in Character Model for the World Wide Web 1.0: Fundamentals [[CHARMOD]] and Character Model for the World Wide Web 1.0: String Matching [[CHARMOD-NORM]] to provide authors of specifications, software developers, and content developers the information they need to describe and implement search features suitable for global audiences.

+

This document describes string searching operations on the Web in order to allow greater interoperability. String searching refers to natural language string matching such as the "find" command in a Web browser. This document builds upon the concepts found in Character Model for the World Wide Web 1.0: Fundamentals [[CHARMOD]] and Character Model for the World Wide Web 1.0: String Matching [[CHARMOD-NORM]] to provide authors of specifications, software developers, and content developers the information they need to describe and implement search features suitable for global audiences.

@@ -58,7 +58,7 @@

Introduction

Goals and Scope

-

This document describes the problems, requirements, and considerations for specification or implementations of string searching operations. A common example of string searching is the "find" command in a Web browser, but there are many other forms of searching that a specification might wish to define.

+

This document describes the problems, requirements, and considerations for specifications or implementations of string searching operations. A common example of string searching is the "find" command in a Web browser, but there are many other forms of searching that a specification might wish to define.

This document builds on Character Model for the World Wide Web: Fundamentals [[CHARMOD]] and Character Model for the World Wide Web: String Matching [[CHARMOD-NORM]]. Understanding the concepts in those documents is important to being able to understand and apply this document successfully.

@@ -96,6 +96,8 @@

Terminology

Full-Text Search refers to searches that process the entire contents of the textual document or set of documents. Full-text queries perform linguistic searches against text data in full-text indexes by operating on words and phrases based on the rules of a particular language such as English or Japanese. Full-text queries can include simple words and phrases or multiple forms of a word or phrase.

Frequently this means that a full-text search employs indexes and natural language processing. When you are using a search engine, you are using a form of full text search. Full text search often breaks natural language text into words or phrases (this is called segmentation) and may apply complex processing to get at the semantic "root" values of words (this is called stemming). These processes are sensitive to language, context, and many other aspects of textual variation.

+

Natural Language Processing (NLP) refers to the domain of software designed to understand, process, and manipulate human languages (that is, natural language). This is a very wide-ranging term. It can cover relatively simple problems, such as word tokenization, or more complex behaviors, such as deriving "meaning" from text, recognizing parts of speech, performing accurate translation, and much else.

+
@@ -121,17 +123,23 @@

String Searching in Natural Language Content

--> -

Users of the Web often want to search documents for particular words or phrases within the natural language text of a given document. This is different from the sorts of programmatic matching needed by formal languages (such as markup languages such as [[HTML]]; style sheets [[CSS21]]; or data formats such as [[TURTLE]] or [[JSON-LD]]), and which are described by our document [[CHARMOD-NORM]].

+

Users of the Web often want to search documents for particular words or phrases within the natural language text of a given document. This is different from the sorts of programmatic matching needed by formal languages (for example, markup languages such as [[HTML]]; style sheets [[CSS21]]; or data formats such as [[TURTLE]] or [[JSON-LD]]), which are described in [[CHARMOD-NORM]].

-

There are different types of string searching. +

There are different types of string searching. A full text search is the type of searching most often found in applications such as a search engine (examples include Google, Bing, or DuckDuckGo). This type of searching is complex, can be resource intensive, and often depends on processes outside the scope of a given search request.

-

One limited form of full-text search—and the topic of this document—is sub-string matching. One familiar form of sub-string matching is the "find" feature of your browser. A sub-string match searches the body ("corpus") of a document with the user's input, seeking a match.

+

A more limited form of text search—and the topic of this document—is sub-string matching. One familiar form of sub-string matching is the "find" feature of browsers and other types of user-agent. A sub-string match searches the body ("corpus") of a document with the user's input, seeking a match. In browsers, this functionality is often accessed via a key combination such as Cmd+F or Ctrl+F. This might be exposed on the Web via the API window.find, which is currently not fully standardized, or features such as the proposed scroll-to-text-fragment.

+ +

Find operations can have options or implementation details, such as the addition or removal of case sensitivity, or whether the feature supports different aspects of a regular expression language or "wildcards".

+ +

One way that sub-string matching usually differs from full-text search is that, while it may use algorithms in an attempt to suppress or ignore textual variations, it usually does not produce matches that contain additional or unspecified character sequences, words, or phrases, such as would result from stemming or other NLP processes.

-

Find operations can have different options or implementation details, such as the addition or removal of case sensitivity, or whether the feature supports different aspects of a regular expression language or "wildcards".

+

Quite often, the user's input does not use a sequence of code points identical to that in the text being searched. This can happen for a variety of reasons. Sometimes it is because the corpus varies in ways the user cannot predict. In other cases it is because the user's keyboard or input method does not provide ready access to the textual variations needed, or because the user cannot be bothered to input the text accurately. In this section, we examine various common cases known to us.
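
The code point mismatch described above can be illustrated with a minimal sketch. It assumes Python's standard `unicodedata` module and a hypothetical helper name `same_text`; it is not the matching algorithm of any particular browser. The Finnish word "Hän" is spelled once with a precomposed "ä" and once with "a" followed by a combining diaeresis.

```python
import unicodedata

# Sample corpus (an assumption for illustration), using the precomposed form.
CORPUS = "Haen Han Solon. H\u00E4n on salakuljettaja."

composed = "H\u00E4n"      # "ä" as a single code point (U+00E4)
decomposed = "Ha\u0308n"   # "a" followed by U+0308 COMBINING DIAERESIS


def same_text(a: str, b: str, form: str = "NFC") -> bool:
    """Compare two strings after applying the same Unicode normalization form."""
    return unicodedata.normalize(form, a) == unicodedata.normalize(form, b)


# A raw code point comparison treats the two spellings as different strings,
# so a naive substring search for the decomposed form finds nothing.
naive_hit = decomposed in CORPUS

# Normalizing both the search term and the corpus restores the match.
normalized_hit = (
    unicodedata.normalize("NFC", decomposed)
    in unicodedata.normalize("NFC", CORPUS)
)
```

Normalizing both sides to the same form is only one of several adjustments a find feature might make; the choice of NFC here is illustrative.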

-

One way that sub-string matching usually differs from other types of full-text search is that, while it may use algorithms in an attempt to suppress or ignore textual variations, it usually does not produce matches that contain additional or unspecified character sequences, words, or phrases.

+

A significant issue with find operations is that the language of the corpus and the language of the search term can affect how the various processes mentioned elsewhere in this document are applied. For example, case folding is occasionally locale-affected. Similarly, throughout this document, the handling of accents, alternate scripts, or encoding is linked to the specific language of the text in question. It's important to emphasize that we mean language here, and not script, for different languages that share a script very often apply different processing or imply different expectations.
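
As a rough illustration of how case handling depends on language, the sketch below compares a language-neutral uppercase mapping with a hand-written Turkish-aware one. The helper `turkish_upper` is a hypothetical name introduced here, and its two-character mapping is a simplification rather than a complete implementation of Turkish casing.

```python
# Two distinct Turkish words: dotless "ılık" and dotted "ilik".
DOTLESS = "\u0131l\u0131k"   # ılık
DOTTED = "ilik"              # ilik

# Turkish pairs dotted i with İ (U+0130) and dotless ı (U+0131) with I.
_TURKISH_UPPER = str.maketrans({"i": "\u0130", "\u0131": "I"})


def turkish_upper(text: str) -> str:
    """Uppercase text using the Turkish dotted/dotless i pairing (simplified)."""
    return text.translate(_TURKISH_UPPER).upper()


# Python's str.upper() is not locale-aware: both words collapse to "ILIK",
# so a case-insensitive match built on it conflates them.
neutral_equal = DOTLESS.upper() == DOTTED.upper()

# With the Turkish pairing the two words stay distinct.
turkish_equal = turkish_upper(DOTLESS) == turkish_upper(DOTTED)
```

Whether the conflation is a bug or a convenience depends on the language of the text and on what the user expects, which is why a find feature needs some notion of language.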

-

Quite often, the user's input does not use a sequence of code points identical to that in the text being searched. This can happen for a variety of reasons. Sometimes it is because the text varies in ways the user cannot predict. In other cases it is because the user's keyboard or input method does not provide ready access to the textual variations needed—or because the user cannot be bothered to input the text accurately. In this section, we examine various common cases known to us.

+

Find features in user interfaces often have to guess what language the user intended based solely on the user's input or on readily available information (the operating environment locale, the user agent's localization, the language of the active keyboard). These hints are, at best, a proxy for the user's intent, particularly when the user is searching a document that doesn't match any of these or when the searched document contains more than one language.

+ +

Additional Types of Equivalence

@@ -148,7 +156,7 @@

Additional Types of Equivalence

Case Folding

-

A user might expect a term entered in lowercase to match uppercase equivalents (and perhaps vice-versa). Most sub-string matching feature, such as the browser "find" command, offer a user-selectable option for matching the case of the input to that of the text.

+

A user might expect a term entered in lowercase to match uppercase equivalents (and perhaps vice-versa). Sub-string matching features, such as the browser "find" command, often offer a user-selectable option for matching (or not) the case of the input to that of the text.

For a survey of case folding, see the discussion here in [[CHARMOD-NORM]].
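
A user-selectable case option might be sketched as follows. The function name `find_all` and its `match_case` parameter are assumptions made for illustration; the sketch uses Python's built-in `str.casefold()`, which applies language-neutral Unicode case folding, and does not describe how any browser implements "find".

```python
def find_all(corpus: str, term: str, match_case: bool = False) -> list[int]:
    """Return the start offsets of every occurrence of term in corpus.

    When match_case is False, both strings are case folded first. Offsets
    then refer to the folded corpus, whose length can differ from the
    original (for example, "ß" folds to "ss").
    """
    if not term:
        return []
    if not match_case:
        corpus = corpus.casefold()
        term = term.casefold()
    hits = []
    start = corpus.find(term)
    while start != -1:
        hits.append(start)
        start = corpus.find(term, start + 1)
    return hits


text = "Han Solo met HAN again"
insensitive = find_all(text, "han")                   # matches "Han" and "HAN"
sensitive = find_all(text, "han", match_case=True)    # no lowercase "han" present
```

Mapping folded offsets back to the original text, so that the right characters are highlighted, is left out of this sketch.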

@@ -708,5 +716,54 @@

Types of Search Option

Acknowledgements

The W3C Internationalization Working Group and Interest Group, as well as others, provided many comments and suggestions. The Working Group would like to thank: all of the contributors to the Character Model series of documents over the many years of their development.

+ +
+

Text fragment language

+

This section was borrowed from an example page by Henri Sivonen.

+

+ The root element of this section is tagged as German. The heading above
+ and this paragraph are tagged as English. The list of links at the end is
+ not language-tagged and, therefore, should count as German. Note that in
+ search collations in English (root) “ae” is primary-different from “a” and
+ “ä”, which in turn are primary-equal with each other; in German “a” is
+ primary-different from “ae” and “ä”; and in Finnish “a”, “ae”, and “ä” are
+ all primary-different from each other. Here is a Finnish sentence
+ language-tagged as Finnish within the English paragraph:
+ Haen Han Solon. Hän on salakuljettaja. (For the
+ curious, this translates to: I’ll go get Han Solo. He is a smuggler.)
+

+

+ Let’s try that again, this time with the substring Han Solo, excluding the “n”,
+ language-tagged as English:
+ Haen Han Solon. Hän on salakuljettaja.
+

+

+ And again without tagging “Han Solo” as English but in Normalization Form
+ D instead of Normalization Form C:
+ Haen Han Solon. Hän on salakuljettaja. Followed
+ by a paragraph language-tagged as Finnish:
+

+

Haen Han Solon. Hän on salakuljettaja.

+

+ Let’s try what I have been led to believe means “warm marrow” in Turkish,
+ tagged as Turkish: ılık ilik And as a paragraph:
+

+

ılık ilik

+

+ Finally, some fragment links to this page (untagged and, therefore, should
+ be considered German):
+

+ +