devjournal

=====================================================
2020/02/05 takeshi

Initial development done with:

# node -v
v10.15.3


mrcp_server draft:
  - if voice="dtmf" then do SR/SS locally using gstreamer (node-gir) or spandsp
  - the request starts with SIP INVITE that requests resource "speechsynth" or "speechrecog"
  - the response will contain the channel_identifier that the client app should use when issuing requests on the MRCP TCP connection
  - in the MRCP TCP connection the client will be able to send commands and get events related to the resource
  - the client might create more than one SIP/MRCP session at the same time for the same call. This doesn't matter: we get/send RTP packets from/to the ip:port specified in the SDP.

Once the MRCP support is ready, we can add support for SR/SS using the Google APIs.


To simplify development, let's use an actor library (Nact).


Actor hierarchy:

  - root_actor creates:
    - sip_server creates (SIP processing is simple and there is no need for a sip_session_handler)
    - mrcp_server creates:
      - mrcp_connection_handler (obs: the same connection can be used to issue commands for different channels)
    - http_server

(http_server will be for administration/monitoring)

Basic processing:
  - MRCP client sends SIP INVITE to sip_server
  - sip_server gets INVITE and extracts the call_id
  - sip_server creates sip_connection_handler passing the sip_stack as parameter (as it is needed to send messages).
  - sip_server extracts
      - call_id (will be set to channel_identifier in MRCP messages to simplify tracking)
      - resource (speechsynth|speechrecog)
      - remote rtp ip and port
  - sip_server informs mrcp_server of session creation (uses call_id as key) with [resource, local_rtp_ip, local_rtp_port, remote_rtp_ip, remote_rtp_port]
  - sip_server replies with '200 OK' to INVITE
  - MRCP client contacts mrcp_server sending channel_identifier (same as SIP call_id@resource)
  - mrcp_server confirms the session exists and creates mrcp_connection_handler with session data
  - mrcp_connection_handler checks the type of resource and processes it.
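
A minimal sketch of the extraction step above, assuming the INVITE carries an SDP offer laid out like the MRCPv2 examples in RFC 6787 (an "a=resource:" attribute plus an "m=audio" line). The names parseMrcpOffer and channelIdentifier are hypothetical and are not taken from the project code.

```javascript
// Hypothetical helper: pulls out what sip_server needs from the SDP offer.
// Assumes IPv4 "c=" lines and a single audio stream.
function parseMrcpOffer(sdp) {
  const result = { resource: null, remote_rtp_ip: null, remote_rtp_port: null }
  let media = null      // media section being parsed (null = session level)
  let sessionIp = null  // session-level connection address
  let audioIp = null    // connection address inside the audio section, if any

  for (const line of sdp.split(/\r?\n/)) {
    let m
    if ((m = /^m=(\S+) (\d+)/.exec(line))) {
      media = m[1]
      if (media === 'audio') result.remote_rtp_port = parseInt(m[2], 10)
    } else if ((m = /^c=IN IP4 (\S+)/.exec(line))) {
      if (media === null) sessionIp = m[1]
      else if (media === 'audio') audioIp = m[1]
    } else if ((m = /^a=resource:(\S+)/.exec(line))) {
      result.resource = m[1]
    }
  }

  // A media-level address overrides the session-level one.
  result.remote_rtp_ip = audioIp || sessionIp
  return result
}

// channel_identifier as described above: call_id@resource
function channelIdentifier(call_id, resource) {
  return `${call_id}@${resource}`
}

// Usage with an illustrative offer (addresses and ports are made up).
const sampleOffer = [
  'v=0',
  'o=client 1 1 IN IP4 192.168.1.10',
  's=-',
  'c=IN IP4 192.168.1.10',
  't=0 0',
  'm=application 9 TCP/MRCPv2 1',
  'a=setup:active',
  'a=connection:new',
  'a=resource:speechsynth',
  'a=cmid:1',
  'm=audio 49170 RTP/AVP 0',
  'a=rtpmap:0 PCMU/8000',
  'a=recvonly',
  'a=mid:1'
].join('\r\n')

const offer = parseMrcpOffer(sampleOffer)
console.log(offer, channelIdentifier('abc123', offer.resource))
```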

At this point we will only implement the basic MRCP processing, so we will handle requests locally with gstreamer via node-gir, as explained in the draft.
So mrcp_connection_handler will set up a gstreamer pipeline with the dtmf or dtmfsrc plugin, receive commands and notify events.


=====================================================
2020/02/06 takeshi

Regarding Google SpeechSynthesis we can use:
  audioEncoding: LINEAR16 (Uncompressed 16-bit signed little-endian samples (Linear PCM). Audio content returned as LINEAR16 also contains a WAV header.)
  sampleRateHertz: TO VERIFY
  https://cloud.google.com/text-to-speech/docs/reference/rest/v1/text/synthesize#AudioEncoding

=====================================================
2020/02/10 takeshi

We planned to use gstreamer via node-gir. However, node-gir is a fork of gir and neither is maintained anymore.
But there is node-gtk which includes Gst support.

  apt install libgirepository1.0-dev gstreamer1.0-tools gstreamer1.0-plugins-good gstreamer1.0-plugins-bad libgtk-3-dev

However, it will not work on non-desktop servers as currently it is necessary to use Gtk to run the main loop.
So, one alternative would be to use vala-object (https://github.com/antono/vala-object). However, it uses gir and so it will not work either:
  $ grep gir package.json 
    "gir": "*"

=====================================================
2020/02/11 takeshi

Converting wav to raw:

$ play yosemitesam.wav 
play WARN alsa: can't encode 0-bit Unknown or not applicable

yosemitesam.wav:

 File Size: 19.8k     Bit Rate: 64.2k
  Encoding: u-law         
  Channels: 1 @ 14-bit   
Samplerate: 8000Hz       
Replaygain: off         
  Duration: 00:00:02.46  

In:100%  00:00:02.46 [00:00:00.00] Out:19.7k [!=====|=====!]        Clip:0    
Done.

$ sox yosemitesam.wav -e unsigned ys.raw

$ play -t raw -r 8000 -e unsigned -b 8 -c 1 ys.raw 
play WARN alsa: can't encode 0-bit Unknown or not applicable

ys.raw:

 File Size: 19.7k     Bit Rate: 64.0k
  Encoding: Unsigned PCM  
  Channels: 1 @ 8-bit    
Samplerate: 8000Hz       
Replaygain: off         
  Duration: 00:00:02.46  

In:100%  00:00:02.46 [00:00:00.00] Out:19.7k [!=====|=====!]        Clip:0    
Done.


For reference, we use this file:
  https://github.com/googleapis/nodejs-speech/blob/master/samples/resources/audio.raw

which plays properly this way:

$ play -t raw -r 16000 -e signed -b 16 -c 1 nodejs-speech/samples/resources/audio.raw 
play WARN alsa: can't encode 0-bit Unknown or not applicable

nodejs-speech/samples/resources/audio.raw:

 File Size: 58.0k     Bit Rate: 256k
  Encoding: Signed PCM    
  Channels: 1 @ 16-bit   
Samplerate: 16000Hz      
Replaygain: off         
  Duration: 00:00:01.81  

In:100%  00:00:01.81 [00:00:00.00] Out:29.0k [ =====|===== ]        Clip:0    
Done.

However, from the PSTN we will get a sample rate of 8000 Hz. So we resampled it:

$ sox -t raw -r 16000 -e signed -b 16 -c 1 nodejs-speech/samples/resources/audio.raw -t raw -r 8000 -e signed -b 16 -c 1 a.raw

$ play -t raw -r 8000 -e signed -b 16 -c 1 a.raw
play WARN alsa: can't encode 0-bit Unknown or not applicable

a.raw:

 File Size: 29.0k     Bit Rate: 128k
  Encoding: Signed PCM    
  Channels: 1 @ 16-bit   
Samplerate: 8000Hz       
Replaygain: off         
  Duration: 00:00:01.81  

In:100%  00:00:01.81 [00:00:00.00] Out:14.5k [ =====|===== ]        Clip:0    
Done.


Also, we will usually receive it as mu-law.
So we tested conversion to mu-law and it worked:

$ sox -t raw -r 16000 -e signed -b 16 -c 1 nodejs-speech/samples/resources/audio.raw -t raw -r 8000 -e mu-law -b 8 -c 1 a2.raw

$ play -t raw -r 8000 -e mu-law -b 8 -c 1 a2.raw
play WARN alsa: can't encode 0-bit Unknown or not applicable

a2.raw:

 File Size: 14.5k     Bit Rate: 64.0k
  Encoding: u-law         
  Channels: 1 @ 14-bit   
Samplerate: 8000Hz       
Replaygain: off         
  Duration: 00:00:01.81  

In:100%  00:00:01.81 [00:00:00.00] Out:14.5k [ =====|===== ]        Clip:0    
Done.

$ node a.js 
Transcription: How old is the Brooklyn Bridge?

$ grep -i -E 'encoding|rate' a.js
  // The audio file's encoding, sample rate in hertz, and BCP-47 language code
    encoding: 'MULAW',
    sampleRateHertz: 8000,

=====================================================
2020/02/12 takeshi

About silence in RTP:
  https://serverfault.com/questions/771924/detect-silence-in-rtp-payload
We tested with 0x7F and 0xFF in PCMU and both worked (see research/google_sr_file_stream/)
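
A minimal sketch of what a PCMU silence packet could look like, assuming 20 ms framing (160 samples at 8000 Hz) and the fixed 12-byte RTP header from RFC 3550. The function names, sequence number, timestamp and SSRC are hypothetical; this is not the code under research/google_sr_file_stream/.

```javascript
// Assumption: 20 ms of audio per packet; PCMU uses one byte per sample at 8000 Hz.
const SAMPLES_PER_PACKET = 160

// Builds an RTP packet with the fixed 12-byte header (no CSRC, no extension).
function buildRtpPacket(seq, timestamp, ssrc, payload) {
  const header = Buffer.alloc(12)
  header[0] = 0x80 // version 2, no padding, no extension, CSRC count 0
  header[1] = 0x00 // marker 0, payload type 0 (PCMU)
  header.writeUInt16BE(seq & 0xffff, 2)
  header.writeUInt32BE(timestamp >>> 0, 4)
  header.writeUInt32BE(ssrc >>> 0, 8)
  return Buffer.concat([header, payload])
}

// 0xFF and 0x7F are the mu-law codes for the smallest positive/negative level.
function silencePayload(fill = 0xff) {
  return Buffer.alloc(SAMPLES_PER_PACKET, fill)
}

const silencePacket = buildRtpPacket(1, SAMPLES_PER_PACKET, 0x12345678, silencePayload())
console.log(silencePacket.length)
```

For consecutive packets the sequence number would grow by 1 and the timestamp by 160.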

=====================================================
2020/02/13 takeshi

Google TTS reference:
  https://cloud.google.com/text-to-speech/docs/reference/rest/v1/text/synthesize#audioconfig
We should use audioEncoding=LINEAR16 and (if possible) sampleRateHertz=8000.
It seems Google TTS doesn't support streaming, so we will dump the data to a temp file and read it back with fs.createReadStream().
Then we will read XXX bytes every 20 ms, convert them to mu-law (http://www.speech.cs.cmu.edu/comp.speech/Section2/Q2.7.html) and send them as RTP.
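
A minimal sketch of the frame arithmetic and the mu-law conversion, assuming the TTS result really is 8000 Hz mono LINEAR16 (little-endian). Under that assumption 20 ms is 160 samples, i.e. 320 bytes in and 160 bytes out. The encoder follows the widely used G.711 mu-law routine (bias 0x84, clip 32635); the function names are hypothetical.

```javascript
const SAMPLE_RATE = 8000   // Hz (assumption: sampleRateHertz=8000 was honoured)
const FRAME_MS = 20
const BYTES_PER_SAMPLE = 2 // LINEAR16
const LINEAR16_FRAME_BYTES = (SAMPLE_RATE * FRAME_MS / 1000) * BYTES_PER_SAMPLE

const BIAS = 0x84
const CLIP = 32635

// Encodes one signed 16-bit sample as one mu-law byte.
function linearToMulaw(sample) {
  const sign = (sample >> 8) & 0x80
  if (sign !== 0) sample = -sample
  if (sample > CLIP) sample = CLIP
  sample += BIAS

  // The exponent is the position of the highest set bit among bits 14..7.
  let exponent = 7
  for (let mask = 0x4000; (sample & mask) === 0 && exponent > 0; mask >>= 1) {
    exponent--
  }
  const mantissa = (sample >> (exponent + 3)) & 0x0f
  return ~(sign | (exponent << 4) | mantissa) & 0xff
}

// Converts a Buffer of little-endian LINEAR16 samples to a mu-law Buffer.
function linear16ToMulaw(pcm) {
  const n = Math.floor(pcm.length / 2)
  const out = Buffer.alloc(n)
  for (let i = 0; i < n; i++) {
    out[i] = linearToMulaw(pcm.readInt16LE(i * 2))
  }
  return out
}

// One 20 ms frame of digital silence becomes 160 bytes of 0xFF.
const mulawFrame = linear16ToMulaw(Buffer.alloc(LINEAR16_FRAME_BYTES))
console.log(LINEAR16_FRAME_BYTES, mulawFrame.length)
```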

Here is a reference about gstreamer:
  https://stackoverflow.com/questions/55474847/creating-a-mulaw-audio-file-from-l16


=====================================================
2020/02/13 takeshi

We were able to make Google TTS work. However, we can hear a 'click' at the beginning of the audio. This is caused by the WAV header that the API inserts in it:
  AudioEncoding: 
    LINEAR16:  Uncompressed 16-bit signed little-endian samples (Linear PCM). Audio content returned as LINEAR16 also contains a WAV header.
  https://cloud.google.com/text-to-speech/docs/reference/rest/v1/text/synthesize#AudioEncoding

So we need to deal with this header before processing the payload.

Parsing a wav file in c:
  http://truelogic.org/wordpress/2015/09/04/parsing-a-wav-file-in-c/

(however we might find a node module to decode wav files).

UPDATE: issue was solved: we used node module 'wav' to strip the wav header before saving the TTS result.
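
For reference, a dependency-free sketch of the same idea. This is not what we used (we used the 'wav' module); it is a manual walk over the RIFF chunks that returns only the body of the 'data' chunk. The function name stripWavHeader is hypothetical.

```javascript
// Returns the raw PCM payload of a RIFF/WAVE buffer, without any header chunks.
function stripWavHeader(buf) {
  if (buf.length < 12 ||
      buf.toString('ascii', 0, 4) !== 'RIFF' ||
      buf.toString('ascii', 8, 12) !== 'WAVE') {
    throw new Error('not a RIFF/WAVE buffer')
  }

  let offset = 12 // first chunk starts right after "RIFF<size>WAVE"
  while (offset + 8 <= buf.length) {
    const id = buf.toString('ascii', offset, offset + 4)
    const size = buf.readUInt32LE(offset + 4)
    const start = offset + 8
    if (id === 'data') {
      return buf.subarray(start, Math.min(start + size, buf.length))
    }
    // Chunk bodies are padded to an even number of bytes.
    offset = start + size + (size % 2)
  }
  throw new Error('no data chunk found')
}

// Usage with a synthetic 44-byte header followed by 4 payload bytes.
const wavHeader = Buffer.alloc(44)
wavHeader.write('RIFF', 0, 'ascii')
wavHeader.writeUInt32LE(36 + 4, 4)
wavHeader.write('WAVE', 8, 'ascii')
wavHeader.write('fmt ', 12, 'ascii')
wavHeader.writeUInt32LE(16, 16) // the fmt body occupies bytes 20..35
wavHeader.write('data', 36, 'ascii')
wavHeader.writeUInt32LE(4, 40)
const wavPayload = stripWavHeader(Buffer.concat([wavHeader, Buffer.from([1, 2, 3, 4])]))
console.log(wavPayload)
```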

=====================================================
2020/02/13 takeshi

About writing tests, we need to mock googleapis.
Check: https://github.com/googleapis/google-api-nodejs-client/issues/258

=====================================================
2020/02/18 takeshi

To support DTMF, we can use this module:
  https://github.com/noffle/goertzel-stream
However we will need to write something like a dtmf-stream on top of it.

Sample:

$ cat goertzel_stream_wav_test.js
var goertzel = require('goertzel-stream')

const fs = require('fs');
const wav = require('wav');

var rs = fs.createReadStream('./1234.wav')
var reader = new wav.Reader()

// 697 Hz (row) + 1209 Hz (column) is the DTMF digit '1'
const freqs = [697, 1209]


reader.on('format', function(format) {
        console.log(format)

        var detect = goertzel(freqs, {sampleRate: format.sampleRate})

        detect.on('toneStart', (tones) => {
                console.log('Start', tones)
        })

        detect.on('toneEnd', (tones) => {
                console.log('End', tones)
        })

        reader.on('data', (data) => {
                console.log('reader data')
                //console.log(data)

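                // NOTE: new Float32Array(data) copies each byte of the Buffer as one float value;
                // it does not reinterpret the bytes as 32-bit float samples.
                var a = new Float32Array(data)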

                // patching because goertzel-stream expects data for write() to come from AudioBuffer
                a.getChannelData = () => { return a }

                detect.write(a)
        })

        reader.on('end', () => {
                console.log('reader end')
        })
})


rs.pipe(reader)


$ sox --i 1234.wav 

Input File     : '1234.wav'
Channels       : 1
Sample Rate    : 16000
Precision      : 25-bit
Duration       : 00:00:01.40 = 22400 samples ~ 105 CDDA sectors
File Size      : 89.7k
Bit Rate       : 512k
Sample Encoding: 32-bit Floating Point PCM


$ node goertzel_stream_wav_test.js
{
  audioFormat: 3,                                                    
  endianness: 'LE',                           
  channels: 1,
  sampleRate: 16000,                                            
  byteRate: 64000,                                             
  blockAlign: 4,
  bitDepth: 32,
  signed: true,                        
  float: true                          
}
reader data
Start {
  '697': { start: 1.9500000000000015 },
  '1209': { start: 1.9500000000000015 }
}
End { '1209': { start: 1.9500000000000015, end: 2.339999999999994 } }
Start { '1209': { start: 2.339999999999994 } }
End {
  '697': { start: 1.9500000000000015, end: 2.4599999999999915 },
  '1209': { start: 2.339999999999994, end: 2.4599999999999915 }
}
Start {
  '697': { start: 2.7499999999999853 },
  '1209': { start: 2.7499999999999853 }
}
End { '697': { start: 2.7499999999999853, end: 2.8799999999999826 } }
Start { '697': { start: 2.8799999999999826 } }
End { '1209': { start: 2.7499999999999853, end: 3.0599999999999787 } }
Start { '1209': { start: 3.0599999999999787 } }
End {
  '697': { start: 2.8799999999999826, end: 3.2599999999999745 },
  '1209': { start: 3.0599999999999787, end: 3.2599999999999745 }
}
Start {
  '697': { start: 3.5499999999999683 },
  '1209': { start: 3.5499999999999683 }
}
End { '697': { start: 3.5499999999999683, end: 3.699999999999965 } }
Start { '697': { start: 3.699999999999965 } }
End {
  '697': { start: 3.699999999999965, end: 4.059999999999958 },
  '1209': { start: 3.5499999999999683, end: 4.059999999999958 }
}
reader data
Start {
  '697': { start: 4.349999999999952 },
  '1209': { start: 4.349999999999952 }
}
End {
  '697': { start: 4.349999999999952, end: 4.859999999999941 },
  '1209': { start: 4.349999999999952, end: 4.859999999999941 }
}
reader end


Notice we will need to:
  - combine frequencies and identify DTMF tones.
  - remove duplication of digits since some End events happen at the same moment a Start event happens.
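
A minimal sketch of those two steps, assuming event objects shaped like the toneStart/toneEnd output above. The class name DtmfCombiner and the 40 ms minGap threshold are hypothetical. A digit is reported when exactly one row and one column frequency are active, and a digit that restarts within minGap of its own end is treated as a continuation.

```javascript
// Standard DTMF keypad: row (low) frequencies x column (high) frequencies.
const ROWS = [697, 770, 852, 941]
const COLS = [1209, 1336, 1477, 1633]
const KEYS = ['123A', '456B', '789C', '*0#D']

class DtmfCombiner {
  constructor(minGap = 0.04) {
    this.minGap = minGap     // seconds; hypothetical de-duplication threshold
    this.active = new Set()  // frequencies currently sounding
    this.current = null      // digit currently held, if any
    this.lastDigit = null    // last digit that ended
    this.lastEnd = -Infinity // time at which it ended
    this.digits = []         // detected digits, in order
  }

  toneStart(tones) {
    let t = -Infinity
    for (const [freq, info] of Object.entries(tones)) {
      this.active.add(Number(freq))
      t = Math.max(t, info.start)
    }
    this._update(t)
  }

  toneEnd(tones) {
    let t = -Infinity
    for (const [freq, info] of Object.entries(tones)) {
      this.active.delete(Number(freq))
      t = Math.max(t, info.end)
    }
    this._update(t)
  }

  _digit() {
    const rows = ROWS.filter((f) => this.active.has(f))
    const cols = COLS.filter((f) => this.active.has(f))
    if (rows.length !== 1 || cols.length !== 1) return null
    return KEYS[ROWS.indexOf(rows[0])][COLS.indexOf(cols[0])]
  }

  _update(time) {
    const digit = this._digit()
    if (digit === this.current) return
    if (this.current !== null) {
      this.lastDigit = this.current
      this.lastEnd = time
    }
    if (digit !== null) {
      const repeat = digit === this.lastDigit && time - this.lastEnd < this.minGap
      if (!repeat) this.digits.push(digit)
    }
    this.current = digit
  }
}

// Usage: the first events from the output above (timestamps rounded).
const combiner = new DtmfCombiner()
combiner.toneStart({ '697': { start: 1.95 }, '1209': { start: 1.95 } })
combiner.toneEnd({ '1209': { start: 1.95, end: 2.34 } })
combiner.toneStart({ '1209': { start: 2.34 } }) // restarts at the same instant
combiner.toneEnd({ '697': { start: 1.95, end: 2.46 }, '1209': { start: 2.34, end: 2.46 } })
combiner.toneStart({ '697': { start: 2.75 }, '1209': { start: 2.75 } })
console.log(combiner.digits)
```

To wire it up, the goertzel-stream detector would need all eight DTMF frequencies, with its toneStart/toneEnd events forwarded to the combiner.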

=====================================================
2020/02/19 takeshi

In case some library gives us data in float PCM encoding (float 32), and we need to convert to LINEAR16:
  https://stackoverflow.com/questions/15087668/how-to-convert-pcm-samples-in-byte-array-as-floating-point-numbers-in-the-range 
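
A minimal sketch of that conversion, assuming the input is a Buffer of little-endian 32-bit float samples in the range [-1, 1]. The function name floatBytesToLinear16 is hypothetical. Samples are read with readFloatLE() so that the bytes are interpreted as floats, out-of-range values are clamped, and the result is written as little-endian LINEAR16.

```javascript
// Converts float32 PCM (little-endian, nominal range [-1, 1]) to LINEAR16.
function floatBytesToLinear16(floatBuf) {
  const n = Math.floor(floatBuf.length / 4)
  const out = Buffer.alloc(n * 2)
  for (let i = 0; i < n; i++) {
    // Clamp first so that out-of-range input cannot overflow 16 bits.
    const s = Math.max(-1, Math.min(1, floatBuf.readFloatLE(i * 4)))
    // Negative full scale is -32768, positive full scale is 32767.
    const v = s < 0 ? Math.round(s * 32768) : Math.round(s * 32767)
    out.writeInt16LE(v, i * 2)
  }
  return out
}

// Usage: five float samples, two of them out of range.
const floatInput = Buffer.alloc(5 * 4)
;[0, 1, -1, 2, -2].forEach((v, i) => floatInput.writeFloatLE(v, i * 4))
const linear16 = floatBytesToLinear16(floatInput)
const linear16Samples = []
for (let i = 0; i < linear16.length; i += 2) linear16Samples.push(linear16.readInt16LE(i))
console.log(linear16Samples)
```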

=====================================================
2020/04/18 takeshi

About TCP connection lifetime the RFC
  https://tools.ietf.org/html/rfc6787

says:
-----------------------------------------------------------------------------
   When the client wants to deallocate the resource from this session,
   it issues a new SDP offer, according to RFC 3264 [RFC3264], where the
   control "m=" line port MUST be set to 0.  This SDP offer is sent in a
   SIP re-INVITE request.  This deallocates the associated MRCPv2
   identifier and resource.  The server MUST NOT close the TCP or TLS
   connection if it is currently being shared among multiple MRCP
   channels.  When all MRCP channels that may be sharing the connection
   are released and/or the associated SIP dialog is terminated, the
   client or server terminates the connection.
-----------------------------------------------------------------------------

For now, we will not close MRCP/TCP connections: we will leave it to the client to do so.
And closing the MRCP/TCP connection should have no effect on SIP dialogs.

=====================================================
2020/08/02 takeshi

We will need to add support for DTMF speechrecog and speechsynth.

So let's refactor speechrecoger.js and speechsynther.js and move Google SR/SS specific code to:
  - google_sr_agent
  - google_ss_agent

Then we will add:
  - dtmf_sr_agent
  - dtmf_ss_agent

And later we can have:
  - julius_sr_agent
  - jtalk_ss_agent
  - mbrola_ss_agent
  - polly_ss_agent
and so on.

UPDATE: used a different approach and created Synth and Recog Streams for Google, DTMF and Morse.

=====================================================
2021/06/07 takeshi

Tried to add support for julius using julius_server.
(I had to change to BigEndian and resample from 8000 to 16000).
However, it didn't work: the recognition only happens when we Ctrl-C mrcp_client.
So I suspect silence is not right.
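
A minimal sketch of the format change mentioned above, assuming the input is already 8000 Hz LINEAR16 little-endian (mu-law input would have to be decoded first). The function name upsample2xToBE is hypothetical. It doubles the sample rate by linear interpolation, which is a crude stand-in for a proper resampler, and writes the result as big-endian.

```javascript
// 8000 Hz LINEAR16 LE -> 16000 Hz LINEAR16 BE, by inserting the midpoint
// between each pair of neighbouring samples.
function upsample2xToBE(pcmLE) {
  const n = Math.floor(pcmLE.length / 2)
  const out = Buffer.alloc(n * 4)
  for (let i = 0; i < n; i++) {
    const cur = pcmLE.readInt16LE(i * 2)
    // The last sample has no successor, so it is simply repeated.
    const next = i + 1 < n ? pcmLE.readInt16LE((i + 1) * 2) : cur
    out.writeInt16BE(cur, i * 4)
    out.writeInt16BE(Math.round((cur + next) / 2), i * 4 + 2)
  }
  return out
}

// Usage: two input samples become four output samples.
const juliusInput = Buffer.alloc(4)
juliusInput.writeInt16LE(0, 0)
juliusInput.writeInt16LE(100, 2)
const juliusOutput = upsample2xToBE(juliusInput)
const juliusSamples = []
for (let i = 0; i < juliusOutput.length; i += 2) juliusSamples.push(juliusOutput.readInt16BE(i))
console.log(juliusSamples)
```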