SSML support needs to be possible to feature detect #37
Based on manual testing in web-platform-tests/wpt#12568, support for SSML has not been implemented in any browser. Fixes #37.
Tests: web-platform-tests/wpt#12689. Based on running the added test manually, it appears that support for SSML has not been implemented in any browser. Fixes #37.
We could add “ssml” to the SpeechSynthesisUtterance interface. When “text” is set with a sentence of text, the getter for “ssml” could return SSML wrapping the text content. We could also include a feature support interface, with methods like hasFeature() or isSupported() [1][2]; this would allow JavaScript introspection of the features supported by an implementation. An interface like DataTransfer could be utilized to support both “text/plain” and “application/ssml+xml” scenarios [3].

[1] https://www.w3.org/TR/DOM-Level-3-Core/core.html#DOMFeatures
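A minimal sketch of how such a getter might behave; the `ssml` attribute and the wrapping shown here are hypothetical, not part of the current spec:

```js
// Hypothetical "ssml" attribute: reading it returns the plain text
// wrapped in a minimal SSML document.
const utterance = new SpeechSynthesisUtterance();
utterance.text = "Hello world";
console.log(utterance.ssml);
// '<?xml version="1.0"?><speak version="1.1"
//  xmlns="http://www.w3.org/2001/10/synthesis"
//  xml:lang="en-US">Hello world</speak>'
```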
I would also like to relay an idea from Dominic Mazzoni on the mailing list [1]: "perhaps there should be a way for voices to explicitly identify that they support SSML, that way clients would be able to safely include SSML only if the engine will be interpreting it". This suggests looking to the SpeechSynthesisVoice interface for indicating support for features.

[1] https://lists.w3.org/Archives/Public/public-speech-api/2018Aug/0004.html
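A sketch of what voice-level detection might look like; the `ssmlSupported` flag is hypothetical, illustrating the idea rather than any shipped API:

```js
// Hypothetical "ssmlSupported" flag on SpeechSynthesisVoice.
const ssml = '<speak version="1.0" '
  + 'xmlns="http://www.w3.org/2001/10/synthesis" xml:lang="en-US">'
  + 'Hello</speak>';
const voice = speechSynthesis.getVoices().find(v => v.ssmlSupported);
if (voice) {
  const utterance = new SpeechSynthesisUtterance(ssml);
  utterance.voice = voice;
  speechSynthesis.speak(utterance);
} else {
  // Fall back to plain text so the markup is never read aloud.
  speechSynthesis.speak(new SpeechSynthesisUtterance("Hello"));
}
```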
Implementation bugs:
I believe that there is at least one important point of agreement so far. That said, I'm not an implementer in this discussion, and I believe there has to be implementer engagement when discussing how this should work. @minorninth @andrenatal, do you think SSML support is on the roadmap for Chromium or Gecko?
So the Edge bug says MSEdge implements SSML 1.0, and my installation in fact somehow supports it.

Edit: Chrome (on Windows, version 68) also supports the sample attached in the bug. It does not speak the XML tags for either 1.0 or 1.1, though it doesn't support the phoneme element.

Edit 2: Try this fiddle: http://jsfiddle.net/saschanaz/8pyWZ/18/
Ah, so there are platform differences here in addition to differences between browsers. That must be very frustrating for web developers trying to use SSML. I can confirm that Edge and Chrome on Windows both somewhat support http://jsfiddle.net/saschanaz/8pyWZ/18/: Edge says "hello world javascript" while Chrome says "hello world phoneme failed". Firefox for Windows says "xml version one point zero ...". That makes it harder to figure out what to do here. It already works in some cases, but is impossible to feature detect without trying to utter something and measuring how long it takes.
Yes... Maybe something like canPlayType in HTMLMediaElement would help. |
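For illustration, a canPlayType-style check might look like this; `canSpeakType` is hypothetical and does not exist on SpeechSynthesis today:

```js
// Hypothetical method modeled on HTMLMediaElement.canPlayType();
// it would return "", "maybe", or "probably".
const plain = "Hello world";
const ssml = '<speak version="1.0" '
  + 'xmlns="http://www.w3.org/2001/10/synthesis" xml:lang="en-US">'
  + 'Hello world</speak>';
const answer = speechSynthesis.canSpeakType("application/ssml+xml");
speechSynthesis.speak(new SpeechSynthesisUtterance(
  answer === "probably" || answer === "maybe" ? ssml : plain
));
```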
Replying to foolip: It works in Edge if one uses SSML 1.0. With any parse error (such as an SSML 1.1 document or a bad phoneme), Edge will speak the XML. Edge has support beyond phonemes.
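For reference, a minimal SSML 1.0 document of the kind Edge reportedly accepts (the exact markup in the fiddle above may differ, and the IPA here is illustrative):

```js
const ssml = `<?xml version="1.0"?>
<speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis"
       xml:lang="en-US">
  hello world
  <phoneme alphabet="ipa" ph="ˈdʒɑːvəˌskɹɪpt">JavaScript</phoneme>
</speak>`;
speechSynthesis.speak(new SpeechSynthesisUtterance(ssml));
```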
I oppose removing SSML from the spec. However, the specification should change. Currently, SpeechSynthesisUtterance.text is a DOMString.
Consequently, an implementation must look at the DOMString and guess what it is. Instead, the user should supply a DOMString to be spoken or supply an SSML document (already parsed). It seems rather silly for me to build the document I want to speak and then convert it to a DOMString for SpeechSynthesisUtterance to re-parse. The current spec allows an implementation to ignore the document markup and just speak the text content.
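A sketch of the parsed-document alternative; the Document-accepting constructor is hypothetical, since the current spec only takes a DOMString:

```js
// Build the SSML as a document rather than a string.
const doc = new DOMParser().parseFromString(
  '<speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis" '
    + 'xml:lang="en-US">Hello world</speak>',
  "application/xml"
);
// Hypothetical: pass the parsed document directly, so the implementation
// never has to guess whether a string is plain text or markup.
speechSynthesis.speak(new SpeechSynthesisUtterance(doc));
```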
@GLRoylance, how are you currently dealing with this only working on some browsers/platforms? Is UA sniffing the only way?
Indeed, this is #10
That would be consistent with the behavior observed for Chrome on Windows, but on Linux it just reads the markup out loud.
Safari also reads the markup out loud.

If support for a speech-specific markup is to be added (no opinion on whether this is worthwhile or not), I think it would be better to make it a new property instead of overloading an existing property. It seems better to make this feature detectable, and that's the standard way to do it.

I don't think a canPlayType-style solution is as good. That's a solution more suited to cases where the underlying types are relatively homogeneous and open-ended enough that it wouldn't make sense to have a distinct entry point for each. Video types meet that bar, but I am not sure the set of {plain text, SSML} does.

I will try to find relevant contacts on whether we are interested in supporting SSML in general.
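With a separate attribute, detection would follow the standard pattern for new API surface (the `ssml` property name here is hypothetical):

```js
// Hypothetical property name; the detection idiom itself is standard.
if ("ssml" in SpeechSynthesisUtterance.prototype) {
  // Safe to hand the engine SSML.
} else {
  // Fall back to plain text.
}
```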
I don't sniff the UA; the text is only spoken if the user clicks a speaker icon after an IPA string. I scan the page for that IPA markup.
On Edge, it speaks the IPA phonemes. On Chrome, it says "Unavailable". On Firefox, it speaks a mountain of XML. Not tried on Unix or Mac, but I don't expect any support there. The bottom line is that implementations that follow the current spec and grab speak.textContent would have acceptable behavior (i.e., not speak the XML, and quickly announce that the feature is not available). That would also be true for many other applications. Even if the UA ignores instructions about aliases, date formats, and telephone numbers, the result is often acceptable.
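The graceful-degradation point can be illustrated with markup whose character data reads sensibly on its own; a sketch, assuming the minimal behavior proposed in this thread (speak documentElement.textContent):

```js
const doc = new DOMParser().parseFromString(
  '<speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis" '
    + 'xml:lang="en-US">Call me at '
    + '<say-as interpret-as="telephone">555-0123</say-as>.</speak>',
  "application/xml"
);
// Even if an engine ignores the say-as instruction, speaking only the
// character data yields "Call me at 555-0123.", which is acceptable,
// unlike reading the XML tags aloud.
speechSynthesis.speak(
  new SpeechSynthesisUtterance(doc.documentElement.textContent)
);
```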
I talked to the folks who work on the Speech API at Apple. We are interested in SSML. We don't like overloading the .text attribute; we think it would be better to expose SSML via a separate attribute instead. Besides providing for feature detection, this avoids the need to sniff XML out of a string.
$0.02, FWIW: I cobbled together JavaScript code which parses SSML according to the specification, with tests. From my perspective, SSML support should be relatively straightforward to implement and ship in all browsers, provided the will to do so.
As far as SSML, I would have new SpeechSynthesisUtterance() take a text DOMString or an SSML document. I'd be OK with the utterance object having both a .text (DOMString only) and a .ssml (SSML document only) property rather than overloading .text. I would require the minimal support for SSML to be speaking documentElement.textContent. I would not require the utterance to keep the two in sync: if there is a .ssml, then .text is ignored. A usage sketch follows below.

The Web Speech API should be able to specify every property in CSS Speech. I want the box model, and I want properties such as azimuth and elevation in addition to volume, rate, and pitch. There's often a significant pause between two utterances, but the only way I can change the voice is to use two utterances. I may have only one voice, so separating them by azimuth can be helpful.

Currently, the Web Speech API does not even let me determine a voice's gender (I need to guess from a name; is it Sean like Sean Penn or Sean like Sean Young?). A voice needs more attributes: gender, age, ....
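A sketch of that proposal in use; both the `.ssml` property and its precedence over `.text` are hypothetical:

```js
const utterance = new SpeechSynthesisUtterance();
utterance.text = "Hello world"; // ignored once .ssml is set, per the proposal
utterance.ssml = new DOMParser().parseFromString(
  '<speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis" '
    + 'xml:lang="en-US">Hello <emphasis>world</emphasis></speak>',
  "application/xml"
);
// Minimal required support would be to speak
// utterance.ssml.documentElement.textContent.
speechSynthesis.speak(utterance);
```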
(Maybe off-topic, but I would prefer more physical attributes, e.g. voice pitch, rather than gender, to avoid stereotyping.)
We might want to consider detecting feature support for SSML 1.0 and SSML 1.1. That is, we might want to consider detecting the version of SSML supported, if any. There could also be an SSML 2.0 someday; see the sketch after this comment.

@GLRoylance, I agree about the CSS Speech Module features (https://www.w3.org/TR/css3-speech/#ssml-rel). Also, azimuth, distance and elevation are interesting ideas; in general, 3D, binaural, or ambisonic audio. That's a bit ahead of the platform implementations of client-side speech synthesis such as MS SAPI, however.

@saschanaz, if avoiding stereotyping is a goal with respect to the gender attribute, …

A topic which interests me is speech prosody, including the prosody of mathematical expressions. I have considered, in some detail, the speech synthesis of mathematical expressions: how we vocalize symbols or variables as they occur multiple times in simple and complex mathematical expressions, that vocalization conveying meaning. We can consider, for instance, the quadratic equation, and that the prosody, intonation and pauses are all important for a natural-sounding synthesis of mathematical expressions. We can consider that natural-sounding speech synthesis is important for purposes beyond aesthetics, including attention, cognitive load and comprehension.

We can observe recent advancements in speech synthesis (e.g. WaveNet, https://cloud.google.com/text-to-speech/) and consider server-side speech synthesis scenarios. A client could receive a hypertext document which references an audio stream and utilize media overlays (see also: EPUB) to synchronize visual cues to the playback of an audio version of the hypertext document. In theory, we could consider both client-side and server-side speech synthesis and speech recognition scenarios for a Web Speech API 2.0. For both client-side and server-side speech synthesis scenarios, I consider features beyond text-to-speech, for instance hypertext-to-speech.

Is the group in a mode to correct miscellaneous errata of version 1.0, or are we warming up to discussing a version 1.1 or 2.0? Should we brainstorm on the mailing list or in the GitHub issues tracker? Should we utilize GitHub editing processes more broadly to propose improvements?
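Version-level detection could be a small extension of the same idea; the `supportedSSMLVersions` attribute is hypothetical:

```js
// Hypothetical attribute listing the SSML versions the engine can parse.
const versions = speechSynthesis.supportedSSMLVersions; // e.g. ["1.0", "1.1"]
if (versions && versions.includes("1.1")) {
  // Safe to use SSML 1.1 markup.
}
```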
I do not see male/female as a stereotyping issue. We are not saying doctor voices should be male and nurse voices female. I am confronted with SSML and CSS Speech Module voice specifications, and they specify gender and age. If I want to do things comparable to them with Web Speech, then I need access to more characteristics of the Web Speech voice. |
Since the resulting audio from a call to speechSynthesis.speak() can be timed, one rough way to feature-detect SSML is to speak a short SSML document and measure how long the utterance takes, since speaking the raw XML lasts far longer than speaking the wrapped text:

```js
// Rough timing-based detection: if the engine reads the XML aloud, the
// utterance lasts much longer than the plain text alone would.
var ssmlIsSupported = (say) => new Promise((resolve) => {
  var ssml = '<?xml version="1.0"?>'
    + '<speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis" '
    + 'xml:lang="en-US">' + say + '</speak>';
  var utterance = new SpeechSynthesisUtterance(ssml);
  var start;
  utterance.onstart = () => { start = performance.now(); };
  utterance.onend = () => {
    // Threshold is a guess; tune for voice and rate.
    resolve(performance.now() - start < 3000);
  };
  speechSynthesis.speak(utterance);
});

var ssmlSupport = ssmlIsSupported("hello").then(result => console.log(result));
```
@foolip Have you considered defining the requirement yourself instead of deferring to what the existing specification might be? Or is your suggestion to remove the SSML portions from the Web Speech API? One route to ship SSML support in browsers would be to utilize the open-source project espeak-ng, and possibly speech-dispatcher (with appropriate settings), shipped with the browser. Conversely, there are enough existing issues with the various implementations of SSML, and there appears to be adequate interest here, to just start from scratch (specification and code implementing the Web Speech API). What are you ultimately trying to achieve?
Client-side, Server-side and Third-party Speech Recognition, Synthesis and Translation pertains to Issue 1 in the specification.
A question is how or whether topics raised in Client-side, Server-side and Third-party Speech Recognition, Synthesis and Translation and Issue 1 pertain to a version 1.0 or a version 2.0 of the Web Speech API. Perhaps the Speech API Community Group could develop Web Speech API versions 1.0 and 2.0 simultaneously?
At least for Chromium/Chrome, if this patch were incorporated into the Chromium source code, we could at least be aware that SSML parsing was turned on by default. That is, we should be able to use existing, properly configured open-source GitHub repositories to meet the requirement that this issue addresses. Else, the issue will not resolve itself.
In web-platform-tests/wpt#12568 and web-platform-tests/wpt#12689 I found that apparently SSML isn't supported anywhere. If there is no immediate implementer interest, I suggest removing the SSML bits from the spec.
Update: it does work on Windows, see #37 (comment)