SSML support needs to be possible to feature detect #37

Open
foolip opened this issue Aug 20, 2018 · 23 comments

@foolip
Member

foolip commented Aug 20, 2018

In web-platform-tests/wpt#12568 and web-platform-tests/wpt#12689 I found that apparently SSML isn't supported anywhere.

If there is no immediate implementer interest, I suggest removing the SSML bits from the spec.

Update: it does work on Windows, see #37 (comment)

@foolip
Member Author

foolip commented Aug 27, 2018

foolip added a commit that referenced this issue Aug 27, 2018
Based on manual testing in web-platform-tests/wpt#12568,
support for SSML has not been implemented in any browser.

Fixes #37.
foolip added a commit that referenced this issue Aug 27, 2018
Tests: web-platform-tests/wpt#12689

Based on running the added test manually, it appears that support for
SSML has not been implemented in any browser.

Fixes #37.
@AdamSobieski

We could add “ssml” to the SpeechSynthesisUtterance interface. When “text” is set with a sentence of text, the getter for “ssml” could return SSML wrapping the text content as per: <speak><p><s>{{text}}</s></p></speak>. When “ssml” is set with valid SSML content, the getter for “text” could return the innerText of the SSML content.

We could also include a feature support interface. A feature support interface includes methods like hasFeature() or isSupported() [1][2]. This would allow JavaScript introspection of the features supported by an implementation.

An interface like DataTransfer could be utilized to support both “text/plain” and “application/ssml+xml” scenarios [3].

[1] https://www.w3.org/TR/DOM-Level-3-Core/core.html#DOMFeatures
[2] https://wicg.github.io/feature-policy/
[3] http://w3c.github.io/html/editing.html#the-datatransfer-interface
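
A minimal usage sketch of what a separate "ssml" attribute could look like from script; the "ssml" property here is an assumption for illustration and does not exist in any current implementation:

// Hypothetical sketch: the "ssml" attribute is assumed, not implemented anywhere.
var utterance = new SpeechSynthesisUtterance();
if ("ssml" in utterance) {
  // The separate attribute doubles as feature detection: if the property
  // exists, the engine is expected to interpret SSML.
  utterance.ssml = "<speak><p><s>Hello world</s></p></speak>";
} else {
  utterance.text = "Hello world";
}
window.speechSynthesis.speak(utterance);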

@AdamSobieski

I would also like to point to an idea from Dominic Mazzoni on the mailing list [1]: "perhaps there should be a way for voices to explicitly identify that they support SSML, that way clients would be able to safely include SSML only if the engine will be interpreting it". This suggests looking to the SpeechSynthesisVoice interface for indicating support for features.

[1] https://lists.w3.org/Archives/Public/public-speech-api/2018Aug/0004.html
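
A sketch of how that could look from script, assuming a hypothetical boolean "ssml" property on SpeechSynthesisVoice (no implementation exposes this today):

// Hypothetical sketch: voice.ssml is assumed, not an existing property.
var voices = window.speechSynthesis.getVoices();
var ssmlVoice = voices.find(voice => voice.ssml === true);
var utterance = new SpeechSynthesisUtterance(
  ssmlVoice
    ? "<speak><sub alias='World Wide Web Consortium'>W3C</sub></speak>"
    : "W3C");
if (ssmlVoice) {
  utterance.voice = ssmlVoice;
}
window.speechSynthesis.speak(utterance);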

@foolip
Member Author

foolip commented Aug 29, 2018

I believe that there is at least one important point of agreement so far, and that is that setting text to a string containing SSML shouldn't work. It doesn't work in any implementation today, and importantly it wouldn't be possible to feature detect support if it were added, so web developers couldn't know when it's safe to use.

I'm not an implementer in this discussion, and I believe there has to be implementer engagement when discussing how it should work.

@minorninth @andrenatal, do you think SSML support is on the roadmap for Chromium or Gecko?
@michaelchampion @othermaciej might you be able to identify contacts for Edge and WebKit?

@saschanaz
Contributor

saschanaz commented Aug 29, 2018

So the Edge bug says MSEdge implements SSML 1.0, and my installation in fact somehow supports it. (It at least does not speak the XML things.) The <phoneme> tag is functional in Edge.

Edit: Chrome (on Windows, version 68) also supports the sample attached in the bug. It does not speak the XML markup for either 1.0 or 1.1, although it doesn't support <phoneme>.

Edit 2: Try this fiddle: http://jsfiddle.net/saschanaz/8pyWZ/18/

@foolip
Member Author

foolip commented Aug 29, 2018

Ah, so there are platform differences here in addition to differences between browsers. That must be very frustrating for web developers trying to use SSML.

I can confirm that Edge and Chrome on Windows both somewhat support http://jsfiddle.net/saschanaz/8pyWZ/18/: Edge says "hello world javascript" while Chrome says "hello world phoneme failed". Firefox for Windows says "xml version one point zero ..."

That makes it harder to figure out what to do here. It already works in some cases, but is impossible to feature detect without trying to utter something and measuring how long it takes.

@saschanaz
Contributor

but is impossible to feature detect without trying to utter something and measuring how long it takes.

Yes... Maybe something like canPlayType in HTMLMediaElement would help.
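
A sketch of what a canPlayType-style check could look like; the canSpeakType() method name and its return values are assumptions modeled on HTMLMediaElement.canPlayType(), not an existing API:

// Hypothetical sketch: speechSynthesis.canSpeakType() does not exist today.
var support = "canSpeakType" in window.speechSynthesis
  ? window.speechSynthesis.canSpeakType("application/ssml+xml")
  : "";
var utterance = new SpeechSynthesisUtterance(
  support === "probably" || support === "maybe"
    ? "<speak>Hello world</speak>"
    : "Hello world");
window.speechSynthesis.speak(utterance);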

@GLRoylance

GLRoylance commented Aug 29, 2018

foolip said

It doesn't work in any implementation today

It works in Edge if one uses SSML 1.0. With any parse error (such as SSML 1.1 markup or a bad phoneme), Edge will speak the XML.

Edge has support beyond phonemes. The SSML can also use sub and say-as with character, date, and telephone formats.

<?xml version="1.0"?>
<speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis"
    xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
    xsi:schemaLocation="http://www.w3.org/2001/10/synthesis  http://www.w3.org/TR/speech-synthesis11/synthesis.xsd"
    xml:lang="en-US">
  Here are <say-as interpret-as="characters">SSML</say-as> samples.
  Here are some date tests.
  Try to say September tenth 1960: <say-as interpret-as="date" format="dmy" detail="2">10-9-1960</say-as>.
  Try to say October ninth 1960: <say-as interpret-as="date" format="ymd" detail="2">1960-10-09</say-as>.
  The safe's combination is <say-as interpret-as="characters" detail="2 1 2 1 2">10-24-65</say-as>.
  <say-as interpret-as="telephone">650-555-1234</say-as>.
  <sub alias="World Wide Web Consortium">W3C</sub>
</speak>

I oppose removing SSML from the spec.

However, the specification should change. Currently, SpeechSynthesisUtterance.text is a DOMString.

text attribute, of type DOMString

This attribute specifies the text to be synthesized and spoken for this utterance. This may be either plain text or a complete, well-formed SSML document. [SSML] For speech synthesis engines that do not support SSML, or only support certain tags, the user agent or speech engine must strip away the tags they do not support and speak the text. There may be a maximum length of the text, it may be limited to 32,767 characters.

Consequently, an implementation must look at the DOMString and guess what it is. Instead, the user should supply either a DOMString to be spoken or an SSML document (already parsed). It seems rather silly for me to build the document I want to speak and then convert it to a DOMString for SpeechSynthesisUtterance to re-parse.

The current spec allows an implementation to ignore the document markup and just speak the .textContent. I believe that is what Chrome does.
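
For illustration, a sketch of the round trip the current spec forces on authors: build the SSML with DOM APIs, serialize it back to a string for .text, and let the engine re-parse it (the namespace and content are taken from the sample above):

// Build an SSML document programmatically...
var SSML_NS = "http://www.w3.org/2001/10/synthesis";
var ssmlDocument = document.implementation.createDocument(SSML_NS, "speak", null);
ssmlDocument.documentElement.setAttribute("version", "1.0");
ssmlDocument.documentElement.setAttributeNS(
  "http://www.w3.org/XML/1998/namespace", "xml:lang", "en-US");
var sayAs = ssmlDocument.createElementNS(SSML_NS, "say-as");
sayAs.setAttribute("interpret-as", "characters");
sayAs.textContent = "SSML";
ssmlDocument.documentElement.appendChild(sayAs);
// ...then flatten it to a DOMString just so the utterance can re-parse it.
var utterance = new SpeechSynthesisUtterance(
  new XMLSerializer().serializeToString(ssmlDocument));
window.speechSynthesis.speak(utterance);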

@foolip
Member Author

foolip commented Aug 29, 2018

@GLRoylance how are you currently dealing with this only working on some browsers/platforms? Is UA sniffing the only way?

@foolip
Member Author

foolip commented Aug 29, 2018

Consequently, an implementation must look at the DOMString and guess what it is.

Indeed, this is #10

The current spec allows an implementation to ignore the document markup and just speak the .textContent. I believe that is what Chrome does.

That would be consistent with the behavior observed for Chrome on Windows, but on Linux it just reads the markup out loud.

@foolip foolip changed the title SSML support isn't implemented, remove it from spec? SSML support needs to be possible to feature detect Aug 29, 2018
@othermaciej

Safari also reads the markup out loud. If support for a speech-specific markup is to be added (no opinion on whether this is worthwhile or not), I think it would be better to make it a new property instead of overloading an existing property. It seems better to make this feature detectable, and that's the standard way to do it. I don't think a canPlayType-style solution is as good. That's a solution more suited to cases where the underlying types are relatively homogeneous and open-ended enough that it wouldn't make sense to have a distinct entry point for each. Video types meet that bar, but I am not sure the set of {plain text, SSML} does.

I will try to find relevant contacts on whether we are interested in supporting SSML in general.

@GLRoylance

I don't sniff the UA, but the text is only spoken if the user clicks a speaker icon after an IPA string. I scan the page for class="IPA" and add an appropriate onclick action if found. The phoneme string is not .textContent but rather an attribute of the phoneme element. The SSML body will be something like

<phoneme alphabet="ipa" ph="t&#x259;mei&#x325;&#x27E;ou&#x325;">Unavailable</phoneme>

On Edge, it speaks the IPA phonemes. On Chrome, it says "Unavailable". On Firefox, it speaks a mountain of XML. Not tried on Unix or Mac, but I don't expect any support there.

The bottom line is that implementations that follow the current spec and grab speak.textContent would have acceptable behavior (i.e., not speak the XML and quickly announce that the feature is not available). That would also be true for many other applications. Even if the UA ignores instructions about aliases, date formats, and telephone numbers, the result is often acceptable.
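
A minimal sketch of the approach described above, as an illustration only; the page markup, the speaker-icon handling, and the SSML template are assumptions, not GLRoylance's actual code:

// Sketch only: element structure and SSML template are assumptions.
document.querySelectorAll(".IPA").forEach(ipaElement => {
  var ssml = '<?xml version="1.0"?>' +
    '<speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis" xml:lang="en-US">' +
    '<phoneme alphabet="ipa" ph="' + ipaElement.textContent + '">Unavailable</phoneme>' +
    '</speak>';
  var speaker = document.createElement("button");
  speaker.textContent = "🔊";
  speaker.onclick = () => {
    // Edge speaks the phonemes; engines that fall back to textContent say
    // "Unavailable" instead of reading the XML aloud.
    window.speechSynthesis.speak(new SpeechSynthesisUtterance(ssml));
  };
  ipaElement.after(speaker);
});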

@othermaciej

I talked to the folks who work on Speech API at Apple. We are interested in SSML. We don't like overloading the .text attribute. We think it would be better to expose it via a separate attribute instead. Besides providing for feature detection, this avoids the need to sniff XML out of a string.

@guest271314

.02

FWIW, I cobbled together JavaScript code which parses SSML according to the specification, with tests for <break>, <p>, <prosody>, <s>, <say-as>, <sub>, and <voice> elements so far.

From my perspective, SSML support should be relatively straightforward to implement and ship in all browsers, provided there is the will to do so.

@GLRoylance

GLRoylance commented Sep 12, 2018

As far as SSML goes, I would have new SpeechSynthesisUtterance() take either a text DOMString or an SSML document.

I'd be OK with the utterance object having both a .text (DOMString only) and a .ssml (SSML document only) property rather than overloading .text. I would require the minimal support for SSML to be speaking the documentElement.textContent. I would not require the utterance to keep the two in sync. If there is a .ssml, then .text is ignored.
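
A sketch of that shape, with the .ssml property as an assumption (no implementation has it); the document is parsed once with DOMParser and handed over as-is:

// Hypothetical sketch: utterance.ssml does not exist in any implementation.
var ssmlDocument = new DOMParser().parseFromString(
  '<speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis">' +
  '<sub alias="World Wide Web Consortium">W3C</sub></speak>',
  "application/xml");
var utterance = new SpeechSynthesisUtterance();
utterance.text = "W3C";        // DOMString only
utterance.ssml = ssmlDocument; // hypothetical; if set, .text is ignored
// Minimal required SSML support would be to speak
// ssmlDocument.documentElement.textContent, i.e. "W3C".
window.speechSynthesis.speak(utterance);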

The Web Speech API should be able to specify every property in the CSS Speech Module. I want the box model, and I want properties such as azimuth and elevation in addition to volume, rate, and pitch. There's often a significant pause between two utterances, but the only way I can change the voice is to use two utterances. I may have only one voice, so separating them by azimuth can be helpful. Currently, the Web Speech API doesn't even let me determine a voice's gender (I need to guess from a name; is it Sean like Sean Penn or Sean like Sean Young?). A voice needs more attributes: gender, age, ....

@saschanaz
Contributor

saschanaz commented Sep 12, 2018

A voice needs more attributes: gender, age, ....

(Maybe off-topic, but I would prefer more physical attributes e.g. tone pitch rather than gender or age to not intensify stereotypes.)

@AdamSobieski

AdamSobieski commented Sep 13, 2018

We might want to consider detecting feature support for SSML 1.0 and SSML 1.1. That is, we might want to consider detecting the version of SSML supported, if any. There could also be an SSML 2.0 someday.

@GLRoylance , I agree about CSS Speech Module features (https://www.w3.org/TR/css3-speech/#ssml-rel). Also, azimuth, distance and elevation are interesting ideas; in general, 3D, binaural, or ambisonic audio. That's a bit ahead of the platform implementations of client-side speech synthesis such as MS SAPI, however.

@saschanaz , if avoiding stereotyping is a goal with respect to the SpeechSynthesisVoice interface properties, some ideas include articulatory synthesis parameters such as the length of the vocal cords, lung capacity, and other properties of the organs of articulation. That's also a bit ahead of platform implementations of client-side speech synthesis such as MS SAPI, however.

A topic which interests me is speech prosody including the prosody of mathematical expressions. I have considered, in some detail, the speech synthesis of mathematical expressions, how we vocalize symbols or variables as they occur multiple times in simple and complex mathematical expressions, that vocalization conveying meaning. We can consider, for instance, the quadratic equation and that the prosody, intonation and pauses are all important for a natural sounding synthesis of mathematics expressions. We can consider that natural sounding speech synthesis is important for purposes beyond aesthetics including attention, cognitive load and comprehension.

We can observe recent advancements to speech synthesis (e.g. WaveNet, https://cloud.google.com/text-to-speech/) and consider server-side speech synthesis scenarios. A client could receive a hypertext document which references an audio stream and utilize media overlays (see also: EPUB) to synchronize visual cues to the playback of an audio version of the hypertext document.

In theory, we could consider both client-side and server-side speech synthesis and speech recognition scenarios for a Web Speech API 2.0 . For both client-side and server-side speech synthesis scenarios, I consider features beyond text-to-speech, for instance hypertext-to-speech.

Is the group in a mode to correct miscellaneous errata of version 1.0, or are we warming up to discussing a version 1.1 or 2.0? Should we brainstorm on the mailing list or the GitHub issue tracker? Should we utilize GitHub editing processes more broadly to propose improvements?

@GLRoylance

I do not see male/female as a stereotyping issue. We are not saying doctor voices should be male and nurse voices female.

I am confronted with SSML and CSS Speech Module voice specifications, and they specify gender and age. If I want to do things comparable to them with Web Speech, then I need access to more characteristics of the Web Speech voice.

@guest271314

Since the resulting audio from a call to speak() will have a greater duration where SSML parsing is not supported (the markup is read aloud) than where it is supported, we can determine SSML support by averaging the duration of the utterance where the input is plain text and where it is SSML, and comparing the difference between the two sets of data. Here, the timeStamp properties of the start and end events of SpeechSynthesisUtterance are used to get the average duration of the audio output; other, more precise methods of measuring the audio output can be substituted for the timeStamp property:

var ssmlIsSupported = (say) => {
  // Durations (in milliseconds) of ten utterances per input type.
  var data = {
    text: [],
    ssml: []
  };
  var results = {
    ssml: 0,
    text: 0
  };
  // Speak the same input ten times, either as plain text or wrapped in <speak>,
  // recording how long each utterance takes from its start event to its end event.
  var test = (type, utterance, durations) => {
    return Array.from({
      length: 10
    }, () => new SpeechSynthesisUtterance(type === "ssml" ? `<speak>${utterance}</speak>` : utterance))
      .reduce((promise, next) =>
        promise.then(_ => new Promise(resolve => {
          next.onstart = (e) => {
            durations[type].push(e.timeStamp);
          };
          next.onend = (e) => {
            durations[type][durations[type].length - 1] = e.timeStamp - durations[type][durations[type].length - 1];
            resolve();
          };
          window.speechSynthesis.speak(next);
        })), Promise.resolve())
      .then(_ => durations);
  };
  return Object.keys(data).reduce((promise, key) => promise.then(_ => test(key, say, data)), Promise.resolve())
    .then(durations => {
      Object.keys(durations).forEach((key) => {
        // Average duration of the ten utterances for this input type.
        results[key] = durations[key].reduce((sum, next) => sum + next, 0) / durations[key].length;
      });
      // If the plain-text and SSML averages are close, the markup was not spoken
      // aloud and SSML is assumed supported; the 10 ms threshold is a heuristic.
      return Math.max.apply(Math, Object.values(results)) - Math.min.apply(Math, Object.values(results)) < 10;
    });
};

var ssmlSupport = ssmlIsSupported("hello").then(result => console.log(result));

@guest271314

@foolip Have you considered defining the requirement yourself instead of deferring to whatever the existing specification might say? Or does your suggestion to remove the SSML portions from the Web Speech API still stand? One route to shipping SSML support in browsers would be to utilize the open source project espeak-ng, and possibly speech-dispatcher (with appropriate settings), shipped with the browser. Conversely, there are enough existing issues with the various implementations of SSML, and there appears to be adequate interest here, to just start from scratch (both the specification and the code implementing the Web Speech API). What are you ultimately trying to achieve?

@AdamSobieski

Client-side, Server-side and Third-party Speech Recognition, Synthesis and Translation pertains to Issue 1 in the specification.

Issue 1: The group has discussed whether WebRTC might be used to specify selection of audio sources and remote recognizers. See Interacting with WebRTC, the Web Audio API and other external sources thread on public-speech-api@w3.org.

A question is how or whether topics raised in Client-side, Server-side and Third-party Speech Recognition, Synthesis and Translation and Issue 1 pertain to a version 1.0 or a version 2.0 of the Web Speech API.

Perhaps the Speech API Community Group could develop Web Speech API versions 1.0 and 2.0 simultaneously?

@guest271314

@foolip

That makes it harder to figure out what to do here. It already works in some cases, but is impossible to feature detect without trying to utter something and measuring how long it takes.

At least for Chromium/Chrome, if this patch were incorporated into the Chromium source code, we could at least be aware that SSML parsing was turned on by default for use with speech-dispatcher (speechd) and espeak-ng; see How to set SSML parsing to on at user configuration file?.

That is, properly configured copies of the open source GitHub repositories speechd and espeak-ng, shipped with the open source browsers Chromium and Firefox/Nightly, should provide SSML feature detection and parsing by default in open source browsers right now, as the Web Speech API does not perform any parsing in and of itself but relies entirely on software installed on the operating system.

Therefore, we should be able to use existing GitHub repositories to meet the requirement that this issue addresses. Otherwise, the issue will not resolve itself.
