CoolTTS - JavaScript TTS Player with SSML

ATTENTION! June 29, 2025: Google has fixed the speechSynthesis bug in Google Chrome browser. speechSynthesis and coolTTS will now work with Google voices again! (Chrome Version 138.0.7204.50)

CoolTTS Demonstration Test Box

Introduction

Summary: CoolTTS combines TTS and SSML using pure JavaScript.

Terms:
TTS = Text-to-speech
SSML = Speech Synthesis Markup Language
speechSynthesis = The JavaScript Interface built-in to many browsers for text to speech

Description: speechSynthesis has been part of many browsers for years. The original W3 specification said that speechSynthesis should work with SSML. However, to this day, no browser has built SSML support into the speechSynthesis interface. "CoolTTS JavaScript TTS Player" attempts to fix this problem by providing simple JavaScript functions for TTS and SSML, including the SSML <break> tag to add pauses to the speech. CoolTTS also supports the SSML <mark> tag.

The other problem with speechSynthesis is that each browser (or voice) leaves out some of the basic features of the speechSynthesis interface or has annoying bugs. CoolTTS attempts to work around these problems as well. Sadly, there isn't a lot you can do about some of the speechSynthesis limitations of browsers. In addition, browser updates often introduce new bugs.

Features

JavaScript speechSynthesis
	Microsoft Local Voices	Microsoft Online Voices (Edge for Desktop)	Google Voices (Chrome for Desktop)	iOS Voices	Android Voices
Avg Time between utterances	1100ms	900ms	800ms	500ms	500ms
Avg Time between sentences	1100ms	800ms	550ms	500ms	500ms
Word Boundary Event	✔	✔		✔
Sentence Boundary Event	✔
Pause Event	✔			✔
Resume Event	✔			✔
speechSynthesis.paused	✔			✔
SSML	❌

JavaScript speechSynthesis With CoolTTS
	Microsoft Local Voices	Microsoft Online Voices (Edge for Desktop)	Google Voices (Chrome for Desktop)	iOS Voices	Android Voices
Avg Time between utterances	1100ms	900ms	800ms	500ms	500ms
Avg Time between sentences	1100ms	800ms	550ms	500ms	500ms
Word Boundary Event	✔	✔		✔
Sentence Boundary Event	✔	✔	✔	✔	✔
Pause Event	✔	✔	✔	✔	✔
Resume Event	✔	✔	✔	✔	✔
cooltts.paused	✔	✔	✔	✔	✔
SSML	✔	✔	✔	✔	✔

Pricing

You may use this website for free to test CoolTTS. Make sure that you test in different browsers to see how speechSynthesis and CoolTTS have different voices and work differently in each browser.

To use CoolTTS on your own website, for a limited time, you can download the JavaScript file for the price of a donation. The suggested donation is $19.99 USD.

Download

Please thoroughly test CoolTTS using this web page in different browsers before downloading cooltts.js. Make sure that you understand the limitations and differences of speechSynthesis in different browsers and with different voices. CoolTTS uses ONLY the free voices that are built into the browser. It will not work with the voices that are available with subscription TTS services.

Please make a donation to reveal the download link.

For support for CoolTTS, please leave a comment below.

How To Use

Upload cooltts.js to your server in the same folder as your html file. Paste the following code in your html file to load cooltts.js:
<script type="text/javascript" src="cooltts.js"></script>

In your html file, add the cooltts class to every element that you want a CoolTTS Player to appear above. Example:

	<div class="cooltts">

The player controls will not appear above the elements until most of the page is loaded. So if the web page has a lot of external scripts, advertisements or images then it may take a while for the player controls to appear. (*Note: On iOS devices speechSynthesis can stop working on web pages with external resources such as Google Ads. It is best not to have speechSynthesis or CoolTTS on websites with external resources like Google Ads.)

You can also send a string or an element directly to cooltts.play(). Because of browser security policies, speech will probably not start playing unless a user gesture or interaction (such as a button click) invokes it.
To send a string: cooltts.play("Hello world!");
To send an element: cooltts.play(document.getElementById("speech_div"));
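
Because of the user gesture requirement, a typical setup calls cooltts.play() from a click handler. The following is a minimal sketch, not part of cooltts.js: the helper name playOnClick and the element id "speak_button" are made up for this example, and it assumes cooltts.js has already been loaded so that the global cooltts object exists.

```javascript
// Sketch: start speech from a click so the browser sees a user gesture.
// `tts` is expected to be the global `cooltts` object described above.
// `playOnClick` is a hypothetical helper name, not a CoolTTS function.
function playOnClick(button, tts, target) {
  function onClick() {
    // `target` may be a string or an element, as described above.
    tts.play(target);
  }
  button.addEventListener("click", onClick);
  return onClick;
}

// Browser usage, guarded so the snippet does nothing outside a web page
// or when cooltts.js has not been loaded.
if (typeof document !== "undefined" && typeof cooltts !== "undefined") {
  var speakButton = document.getElementById("speak_button"); // hypothetical id
  var speechDiv = document.getElementById("speech_div");
  if (speakButton && speechDiv) {
    playOnClick(speakButton, cooltts, speechDiv);
  }
}
```

Passing the cooltts object in as a parameter keeps the helper easy to try out with a stand-in object before wiring it to a real page.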

Other CoolTTS controls: cooltts.stop(); cooltts.pause(); cooltts.resume(); cooltts.rewind(); cooltts.fastforward();

CoolTTS also dispatches custom cooltts events that your web page can listen for with an EventListener: document.addEventListener('cooltts', function(event) {console.log(event);}, false);
See eventListener for more information about the 'cooltts' event.

<break>

The break tag inserts a pause in the text-to-speech. It can have one of two attributes: strength or time.
strength can be none, x-weak, weak, medium, strong or x-strong.
time can be in seconds or milliseconds. Examples: time="250ms" or time="3s"
W3 Specification
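
To make the time format concrete, here is a small sketch that converts a time value such as "250ms" or "3s" into milliseconds. It only illustrates the format; parseBreakTime is a hypothetical helper and is not how cooltts.js is known to parse the attribute.

```javascript
// Sketch: convert an SSML <break> time value to milliseconds.
// `parseBreakTime` is a hypothetical helper, not part of cooltts.js.
function parseBreakTime(value) {
  // A non-negative number followed by the unit "ms" or "s".
  var match = /^\s*(\d*\.?\d+)(ms|s)\s*$/.exec(String(value));
  if (!match) {
    return null; // not a "250ms" / "3s" style value
  }
  var amount = parseFloat(match[1]);
  return match[2] === "s" ? amount * 1000 : amount;
}

console.log(parseBreakTime("250ms")); // 250
console.log(parseBreakTime("3s"));    // 3000
```

Values that do not match the format return null, so a caller can fall back to a default pause.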

<audio>

The audio tag can be used to play an audio file during text to speech. When the audio tag is reached, Text-To-Speech pauses while the audio file plays. When the audio file ends, Text-To-Speech should resume. Text between the opening and closing audio tags will be spoken if the audio file fails to play for some reason.
W3 Specification
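
The fallback behaviour can be pictured with the sketch below. It is an illustration of the idea only, not the code inside cooltts.js: playAudioOrSpeak, speak and done are hypothetical names, and the audio argument is assumed to behave like a standard HTML audio element.

```javascript
// Sketch of "play the file, or speak the enclosed text if it fails".
// `audio` is an HTMLAudioElement-like object; `speak` and `done` are
// hypothetical callbacks supplied by the caller.
function playAudioOrSpeak(audio, fallbackText, speak, done) {
  var finished = false;
  function finish(failed) {
    if (finished) { return; } // continue only once
    finished = true;
    if (failed) { speak(fallbackText); }
    done();
  }
  audio.addEventListener("ended", function () { finish(false); });
  audio.addEventListener("error", function () { finish(true); });
  var result = audio.play();
  // Modern browsers return a promise that rejects if playback cannot start.
  if (result && typeof result.catch === "function") {
    result.catch(function () { finish(true); });
  }
}
```

The finished flag matters because a failed file can trigger both the error event and the rejected promise.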

Example of a possibly working audio file:

Example of missing audio file:

Captions

CoolTTS will display captions of the text-to-speech if the variable cooltts.captions=true
Or to display captions you can add the cooltts_captions class to any element.

Example:

<emphasis>

The emphasis element requests that the contained text be spoken with emphasis or stress. The optional level attribute indicates the strength of emphasis to be applied. Defined values are "strong", "moderate", "none" and "reduced". The default level is "moderate".
W3 Specification

eventListener

CoolTTS dispatches custom cooltts events that you can listen for with an EventListener: document.addEventListener('cooltts', function(event) {console.log(event);}, false);

The JavaScript speechSynthesis interface built-in to most browsers only dispatches events on SpeechSynthesis utterances. (Except for the "onvoiceschanged" event) A website maker has to add multiple event listeners to each of the utterances that are sent to the speechSynthesis.speak() queue. It dispatches an event when each utterance starts and ends, but it doesn't dispatch an event for the beginning and ending of the entire queue of utterances. CoolTTS tries to solve that issue. Also, many of the better quality voices in most browsers do not dispatch many of the utterance events.

statechange: CoolTTS sends an event with event.detail.type=="statechange" when the TTS player changes state. The value of event.detail.state can be: started, playing, paused, resumed, rewind, fastforward, ended, stopped. The variable cooltts.state can also be checked at any time to see the current state of text-to-speech.
JavaScript Code:
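
A minimal sketch of a statechange listener, using only the event.detail.type and event.detail.state fields described above. The helper name describeStateChange is made up for this example.

```javascript
// Returns a short status label for a 'cooltts' statechange event,
// or null for any other kind of event.
// `describeStateChange` is a hypothetical helper name.
function describeStateChange(event) {
  var detail = event && event.detail;
  if (!detail || detail.type !== "statechange") {
    return null;
  }
  return "CoolTTS state: " + detail.state;
}

// Browser usage, guarded so the snippet does nothing outside a web page.
if (typeof document !== "undefined") {
  document.addEventListener("cooltts", function (event) {
    var label = describeStateChange(event);
    if (label !== null) {
      console.log(label);
    }
  }, false);
}
```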

start or end: Whenever a speechSynthesis utterance starts or ends SpeechSynthesisUtterance sends a SpeechSynthesisEvent. CoolTTS also sends the event along with a "cooltts" event.
W3 Specification
JavaScript Code:

boundary: If you use the robotic sounding local Microsoft voices in Windows 11 in a browser (David, Mark and Zira), then the utterances dispatch "boundary" events for "sentence" and "word" boundaries. In Google Chrome, Google voices do not dispatch a "boundary" event. In Microsoft Edge, Microsoft Online (Natural) voices dispatch "boundary" events for "word" boundaries only. CoolTTS doesn't have a way to make a word "boundary" event for voices that don't support it. However, CoolTTS does make a pseudo sentence "boundary" event for voices like Google voices and Microsoft Online (Natural) voices: CoolTTS divides speechSynthesis utterances into sentences, and the "start" and "end" events that CoolTTS dispatches for each utterance provide an event.detail.sentenceIndex and an event.detail.sentenceLength variable. For voices that support the word "boundary" event, you can listen for the 'cooltts' event:
JavaScript Code:
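
For the pseudo sentence boundaries, a sketch of reading the two sentence fields might look like this. It assumes that event.detail.type carries "start" or "end" in the same way that it carries "statechange"; that is an assumption, so log the event first to confirm the field names. readSentenceInfo is a hypothetical helper.

```javascript
// Sketch: pull the sentence fields out of a 'cooltts' "start" or "end" event.
// ASSUMPTION: event.detail.type is "start" or "end" for these events.
// sentenceIndex and sentenceLength are the fields named above.
// `readSentenceInfo` is a hypothetical helper, not part of cooltts.js.
function readSentenceInfo(event) {
  var detail = event && event.detail;
  if (!detail || (detail.type !== "start" && detail.type !== "end")) {
    return null;
  }
  return {
    type: detail.type,
    sentenceIndex: detail.sentenceIndex,
    sentenceLength: detail.sentenceLength
  };
}

// Browser usage, guarded so the snippet does nothing outside a web page.
if (typeof document !== "undefined") {
  document.addEventListener("cooltts", function (event) {
    var info = readSentenceInfo(event);
    if (info !== null) {
      console.log(info.type, info.sentenceIndex, info.sentenceLength);
    }
  }, false);
}
```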

Boundary event example: Smiley face mouth speech movements

pause: JavaScript speechSynthesis has a "pause" event, however for most of the better quality voices the "pause" event is never dispatched. Also speechSynthesis.paused is often "false" in most browsers for most voices even when speechSynthesis is paused. CoolTTS fixes this problem by dispatching a "paused" event whenever the speechSynthesis is paused. Also you can check the variable cooltts.paused. If it is "true" then speechSynthesis is paused.
JavaScript Code:

resume: JavaScript speechSynthesis has a "resume" event, however for most of the better quality voices the "resume" event is never dispatched. Also speechSynthesis.paused is often "false" in most browsers for most voices even when speechSynthesis is paused. CoolTTS fixes this problem by dispatching a "resumed" event whenever the speechSynthesis is resumed. Also you can check the variable cooltts.paused. If it is "true" then speechSynthesis is paused.
JavaScript Code:
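
A minimal sketch of a pause/resume toggle that relies on the cooltts.paused variable together with cooltts.pause() and cooltts.resume(). The helper name togglePause and the element id "pause_button" are made up for this example.

```javascript
// Sketch: one button that pauses or resumes depending on cooltts.paused.
// `tts` is expected to be the global `cooltts` object.
// `togglePause` is a hypothetical helper name.
function togglePause(tts) {
  if (tts.paused) {
    tts.resume();
    return "resumed";
  }
  tts.pause();
  return "paused";
}

// Browser usage, guarded so the snippet does nothing outside a web page
// or when cooltts.js has not been loaded.
if (typeof document !== "undefined" && typeof cooltts !== "undefined") {
  var pauseButton = document.getElementById("pause_button"); // hypothetical id
  if (pauseButton) {
    pauseButton.addEventListener("click", function () {
      togglePause(cooltts);
    });
  }
}
```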

error: JavaScript speechSynthesis dispatches an "error" event when there is an error. It also dispatches an "error" event when speechSynthesis is stopped, with the error value "canceled" or "interrupted". CoolTTS passes the error event along as well.
W3 Specification
JavaScript Code:
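
Since stopping speech also produces an "error" event, it helps to separate those expected errors from real failures. The sketch below works on the error code string only ("canceled" and "interrupted" are standard speechSynthesis error codes). How the 'cooltts' event exposes that code is not shown here, so the way you read it from the event is left to you. isExpectedStopError is a hypothetical helper.

```javascript
// Error codes that simply mean "speech was stopped on purpose".
var EXPECTED_STOP_ERRORS = ["canceled", "interrupted"];

// Returns true when the error code is just the result of stopping speech.
// `isExpectedStopError` is a hypothetical helper, not part of cooltts.js.
function isExpectedStopError(errorCode) {
  return EXPECTED_STOP_ERRORS.indexOf(errorCode) !== -1;
}

console.log(isExpectedStopError("canceled")); // true
console.log(isExpectedStopError("network"));  // false
```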

mark: The W3 Specification for the JavaScript speechSynthesis interface says that a mark event should be fired when a mark tag is reached. Also speechSynthesisUtterance has an "onmark" event listener. But it is apparently never fired presumably because none of the browsers ever integrated SSML into the speechSynthesis interface. CoolTTS attempts to fix that by providing a "mark" event. See <mark> for how to use it.

Hidden elements

CoolTTS will speak elements that are not visible to the user such as elements with CSS display:none or visibility:hidden. If you do not want CoolTTS to speak these elements then you can add class="cooltts_skip" to the element.
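
If a page has many hidden elements, the class can be added with a short script instead of by hand. This is only a sketch: skipHiddenElements is a hypothetical helper, and it assumes that adding the class before playback starts is early enough for CoolTTS to honor it, which you should verify on your own page.

```javascript
// Sketch: add class="cooltts_skip" to every hidden element in a list.
// `getStyle` returns an object with `display` and `visibility`
// (for example the result of window.getComputedStyle).
// `skipHiddenElements` is a hypothetical helper, not part of cooltts.js.
function skipHiddenElements(elements, getStyle) {
  var skipped = 0;
  Array.prototype.forEach.call(elements, function (el) {
    var style = getStyle(el);
    if (style.display === "none" || style.visibility === "hidden") {
      el.classList.add("cooltts_skip");
      skipped += 1;
    }
  });
  return skipped; // number of elements that received the class
}

// Browser usage, guarded so the snippet does nothing outside a web page.
if (typeof document !== "undefined" && typeof window !== "undefined") {
  skipHiddenElements(
    document.querySelectorAll(".cooltts, .cooltts *"),
    function (el) { return window.getComputedStyle(el); }
  );
}
```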

Example:
The element for the player below has CSS display:none;

Example:
The element for the player below has CSS visibility:hidden;

xml:lang

xml:lang is a defined attribute for the speak, lang, desc, p, s, token, and w elements. It accepts a 2 letter language code and an optional 2 letter country code. CoolTTS will add the value of the xml:lang attribute to the utterance being spoken by the speechSynthesis interface. This may change the voice that is being used. However, if a Microsoft Online Multilingual voice is selected then the same voice will likely be used.
W3 Specification
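
To illustrate the accepted format, the sketch below splits a value such as "en-US" or "fr" into its language and country parts. parseLangTag is a hypothetical helper for checking your own markup; it is not the parser that cooltts.js uses.

```javascript
// Sketch: parse "ll" or "ll-CC" (2 letter language, optional 2 letter country).
// `parseLangTag` is a hypothetical helper, not part of cooltts.js.
function parseLangTag(tag) {
  var match = /^([a-z]{2})(?:-([a-z]{2}))?$/i.exec(String(tag).trim());
  if (!match) {
    return null; // not in the 2 letter / 2 letter form
  }
  return {
    language: match[1].toLowerCase(),
    country: match[2] ? match[2].toUpperCase() : null
  };
}

console.log(parseLangTag("en-us")); // { language: 'en', country: 'US' }
console.log(parseLangTag("fr"));    // { language: 'fr', country: null }
```

In the browser, a value like this is what ends up in the lang property of a SpeechSynthesisUtterance.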

Example:

<mark>

The mark tag can be used to place a mark for an event that you want to happen at that mark. Add an event listener for the mark. Each mark should have a name attribute, such as: <mark name="my_mark">

According to https://developer.mozilla.org/en-US/docs/Web/API/SpeechSynthesisUtterance/mark_event browsers are supposed to have a built-in "mark" eventListener for speechSynthesis utterances. However, it appears that the event is never fired in browsers because they do not correctly parse SSML. Therefore, CoolTTS has its own method for parsing mark tags and using an event listener for the "cooltts" event.

Sadly, with JavaScript speechSynthesis in the popular browsers there is a slight pause when a mark tag is reached because one utterance ends and another begins and the popular browsers (Google Chrome and Microsoft Edge) and quality voices have a slight pause between utterances. So if the mark tag is in the middle of a sentence it will sound strange.

You can listen for a mark event by adding an event listener for the "cooltts" event: document.addEventListener('cooltts', function(event) {console.log(event);}, false);
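
Once you know what the logged event looks like, you can route each mark name to its own function. The sketch below ASSUMES that a mark event arrives with event.detail.type equal to "mark" and the mark's name in event.detail.name. Both field names are guesses for illustration, so check the logged event and adjust them. makeMarkDispatcher is a hypothetical helper.

```javascript
// Sketch: call a different function for each <mark name="..."> that is reached.
// ASSUMPTION: event.detail.type === "mark" and event.detail.name holds the
// mark's name attribute. Confirm both by logging a real 'cooltts' event.
// `makeMarkDispatcher` is a hypothetical helper, not part of cooltts.js.
function makeMarkDispatcher(handlers) {
  return function (event) {
    var detail = event && event.detail;
    if (!detail || detail.type !== "mark") {
      return false;
    }
    if (!Object.prototype.hasOwnProperty.call(handlers, detail.name)) {
      return false; // no handler registered for this mark name
    }
    handlers[detail.name](detail);
    return true;
  };
}

// Browser usage, guarded so the snippet does nothing outside a web page.
if (typeof document !== "undefined") {
  document.addEventListener("cooltts", makeMarkDispatcher({
    my_mark: function () { console.log("my_mark reached"); }
  }), false);
}
```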

Example:

<p>

A p element represents a paragraph. An s element represents a sentence. Both elements can have a lang attribute.

<s>

An s element represents a sentence. s elements can have a lang attribute. If you use an <s> tag in a standard html document then it will be parsed by the html browser as a strikethrough tag. To stop that from happening you may want to add style="text-decoration: inherit;" to the tag.

<phoneme>

The phoneme element provides a phonemic/phonetic pronunciation for the contained text. Unfortunately, there seems to be no good method for doing phoneme pronunciations with the JavaScript speechSynthesis interface in today's browsers. It would always sound jerky and inaccurate. Therefore, the phoneme tag will probably never be part of CoolTTS.
W3 specification

<prosody>

The prosody element permits control of the pitch, speaking rate and volume of the speech output. All of the attributes are optional. Because of limitations in how browsers implement the speechSynthesis interface it is impossible to follow completely the W3 specification for prosody.
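
One concrete constraint is that a speechSynthesis utterance only accepts plain numbers in fixed ranges: rate from 0.1 to 10, pitch from 0 to 2 and volume from 0 to 1, each defaulting to 1. The sketch below clamps requested values to those ranges. It does not show how cooltts.js maps prosody attributes to numbers, and clampUtteranceSettings is a hypothetical helper.

```javascript
// Allowed ranges of the SpeechSynthesisUtterance rate, pitch and volume.
var UTTERANCE_LIMITS = {
  rate: { min: 0.1, max: 10, fallback: 1 },
  pitch: { min: 0, max: 2, fallback: 1 },
  volume: { min: 0, max: 1, fallback: 1 }
};

// Sketch: clamp requested values to what an utterance accepts.
// Missing or non-numeric values fall back to the default of 1.
// `clampUtteranceSettings` is a hypothetical helper, not part of cooltts.js.
function clampUtteranceSettings(settings) {
  var result = {};
  Object.keys(UTTERANCE_LIMITS).forEach(function (key) {
    var limit = UTTERANCE_LIMITS[key];
    var value = Number(settings[key]);
    if (settings[key] == null || isNaN(value)) {
      value = limit.fallback;
    }
    result[key] = Math.min(limit.max, Math.max(limit.min, value));
  });
  return result;
}

console.log(clampUtteranceSettings({ rate: 20, pitch: -1 }));
// { rate: 10, pitch: 0, volume: 1 }
```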

<say-as>

The say-as element allows the author to indicate information on the type of text construct contained within the element and to help specify the level of detail for rendering the contained text.
W3 Specification

Skip element

Example:

This is a feature of CoolTTS, not SSML. You can do something similar in SSML using the sub tag and alias="".
Example:

<sub>

The sub element is employed to indicate that the text in the alias attribute value substitutes the contained text for pronunciation. This allows a document to contain both a spoken and written form. The REQUIRED alias attribute specifies the string to be spoken instead of the enclosed string.
W3 Specification

Note: If you use the <sub> tag in an HTML document then the browser will treat it as the HTML subscript tag, which can have the undesirable effect of making the text smaller and lower than the surrounding text. To prevent that from happening you may want to add style="font-size: inherit; vertical-align: inherit;" to the element.

Example:

<voice>

The voice element allows you to attempt to change the voice by any combination of name, gender, age or language attributes.
W3 Specification
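
Browser voices only expose a few properties, notably name and lang, so any voice matching has to work from those. The sketch below shows the general idea of picking a voice by name and language from the list returned by speechSynthesis.getVoices(). It is not the matching logic of cooltts.js, and findVoice is a hypothetical helper.

```javascript
// Sketch: find the first voice whose name contains `wanted.name` and whose
// lang starts with `wanted.lang`. Both criteria are optional.
// `voices` is an array like the one returned by speechSynthesis.getVoices().
// `findVoice` is a hypothetical helper, not part of cooltts.js.
function findVoice(voices, wanted) {
  var name = wanted.name ? wanted.name.toLowerCase() : null;
  var lang = wanted.lang ? wanted.lang.toLowerCase() : null;
  for (var i = 0; i < voices.length; i += 1) {
    var voice = voices[i];
    // Some platforms report "en_US" instead of "en-US".
    var voiceLang = voice.lang.toLowerCase().replace("_", "-");
    var nameOk = !name || voice.name.toLowerCase().indexOf(name) !== -1;
    var langOk = !lang || voiceLang.indexOf(lang) === 0;
    if (nameOk && langOk) {
      return voice;
    }
  }
  return null;
}

// Browser usage. The voice list can be empty until the browser fires
// the "voiceschanged" event, so the result may be null at first.
if (typeof speechSynthesis !== "undefined") {
  console.log(findVoice(speechSynthesis.getVoices(), { lang: "en" }));
}
```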

Browser Limitations of JavaScript speechSynthesis

It seems that only two browsers have put any effort into the JavaScript speechSynthesis interface: Google Chrome and Microsoft Edge. Edge has put a little more effort into the programming and has a nice selection of voices. Other Chromium browsers (Opera, Brave, Vivaldi) and Firefox have not put much effort into speechSynthesis. They MIGHT have a few older, robotic sounding voices available that come with the operating system. In Windows, they might have older Microsoft voices such as Microsoft David, Mark and Zira. Please do not expect CoolTTS to work well with these browsers. If users want a good sounding Text-to-speech interface for free then they need to use either Google Chrome or Microsoft Edge. Note that JavaScript speechSynthesis in every popular browser can only play one utterance at a time. So if a new utterance is started, even in a different tab and with a different website, then the first utterance will stop playing.

Events: Microsoft local voices are available in most Windows browsers. Microsoft local voices are usually low quality but they dispatch more events than other voices including Google voices. They dispatch "pause", "resume", and word and sentence "boundary" events.
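
To see which voices in a given browser are local, you can check the standard localService property of each voice. A minimal sketch follows; splitVoicesByLocation is a hypothetical helper name.

```javascript
// Sketch: group voice names by whether the voice runs locally.
// `voices` is an array like the one returned by speechSynthesis.getVoices().
// `splitVoicesByLocation` is a hypothetical helper, not part of cooltts.js.
function splitVoicesByLocation(voices) {
  var result = { local: [], online: [] };
  for (var i = 0; i < voices.length; i += 1) {
    var voice = voices[i];
    if (voice.localService) {
      result.local.push(voice.name);
    } else {
      result.online.push(voice.name);
    }
  }
  return result;
}

// Browser usage. The voice list can be empty until the browser fires
// the "voiceschanged" event.
if (typeof speechSynthesis !== "undefined") {
  console.log(splitVoicesByLocation(speechSynthesis.getVoices()));
}
```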

Mobile device browsers on iPhones, iPads and Android devices are not very good at speechSynthesis either. The mobile browsers usually don't have the same quality voices as their desktop browser counterparts.

Future Development

If there is enough interest for this project then I will continue to work on it, fixing bugs and possibly adding new features.

There are no plans to make this script work with subscription TTS services. Those services have their own methods for processing JavaScript voices and SSML using their own APIs. Those subscription services can get expensive. The point of this project is to use the JavaScript speechSynthesis interface built-in to many modern day browsers and to provide a method to use it with SSML.

History

6/3/2025 - Version 1.1 - Improved applying settings changes while playing or paused.
