We regularly receive reports of bots misbehaving in ways that leave customers scratching their heads. Seen through the lens of the Channel Service, the following categories account for a high percentage of the problems encountered by customers' bots:
- 500 errors on bot
- Timeout (from channel to bot)
- Bot never responds
- Bot works in ABS (Web Chat Test) but doesn't work in Channel X
To help customers diagnose these issues quickly (and to enable UI/tools that surface them to customers), the approach is to enable Application Insights for the bot and enrich the event data that the Channel Service sends into the customer's Application Insights container, so that issues can be diagnosed end to end.
Here is the proposal (Note: Ownership is shared across teams):
(This proposal was sent to Yochay/DanDris/Artur/Carlos/etc on a different thread)
Item | Owner | Description |
---|---|---|
Enable Application Insights in ABS | Bot Portal Team | Currently, if you provision a bot through the portal, opt into Application Insights, and deploy the ABS template, Application Insights is not configured correctly. The portal team believes there is a way to enable this (through bot configuration and appsettings.json). |
Enable W3C distributed tracing | Bot Service Team | To correlate queries with the bot's Application Insights data. Update: Craig claims this should be working. |
Verify integration with Distributed Tracing | daveta/Bot Service Team | Need to test and verify that operation IDs are being transferred and correlated correctly. Ideally we set up an integration environment for testing. |
Ensure Request/Dependency/Event/Trace telemetry is logged in the bot and correlated | daveta | Ensure the appropriate initializers are installed. |
Ensure ILogger telemetry is logged in Application Insights Trace | daveta | Application Insights has the ability to intercept ILogger events. |
Log HTTP status code | Bot Service Team | To diagnose 500 errors on the bot. |
Log channel errors into the customer's Application Insights container | Bot Service Team | To diagnose "bot never responds" errors and Channel X errors. |
Queries/Documentation to detect 500 errors | daveta/Bot Service Team | Once we have correlated data, we can write documentation that demonstrates how to diagnose these errors. For true "500" bot errors, we can demonstrate with existing tools and some queries. For Channel "500" errors, this will be a little more difficult, depending on where the failure occurs within the channel. (Stretch) If a 500 error is due to an external dependency, clearly state the dependency and the timeout in the query result. For example, if CosmosDB is timing out, the result should say so. |
Queries/Documentation to detect Bot Timeout errors | daveta/Bot Service Team | Once we have correlated data, we can write documentation that demonstrates how to diagnose these errors. |
Queries/Documentation to detect Bot Never Responds errors | daveta/Bot Service Team | Once we have correlated data, we can write documentation that demonstrates how to diagnose these errors. |
Queries/Documentation to detect Bot Works in Web Chat but not in Channel X | daveta/Bot Service Team | Once we have correlated data, we can write documentation that demonstrates how to diagnose these errors. |
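
The distributed-tracing rows above depend on one identifier surviving the hop from the Channel Service to the bot. As a rough illustration, assuming the W3C Trace Context `traceparent` header (version `00`) is the carrier and that its trace ID is what ends up as the operation ID, the sketch below parses an incoming header and builds the header for an outgoing call in the same trace. The names `TraceContext`, `parse_traceparent`, and `child_traceparent` are hypothetical and are not part of any Bot Framework or Application Insights SDK.

```python
import re
import secrets
from typing import NamedTuple, Optional

# Version 00 layout: "00" "-" trace-id "-" parent-id "-" trace-flags
_TRACEPARENT = re.compile(r"00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})")


class TraceContext(NamedTuple):
    trace_id: str   # shared by every hop in one trace
    parent_id: str  # ID of the span that made the call
    flags: str      # e.g. "01" means sampled


def parse_traceparent(header: str) -> Optional[TraceContext]:
    """Return the parsed header, or None if it is malformed."""
    match = _TRACEPARENT.fullmatch(header.strip())
    if match is None:
        return None
    ctx = TraceContext(*match.groups())
    # All-zero trace-id and parent-id values are invalid.
    if set(ctx.trace_id) == {"0"} or set(ctx.parent_id) == {"0"}:
        return None
    return ctx


def child_traceparent(incoming: TraceContext) -> str:
    """Build the header for an outgoing call: same trace ID, fresh span ID."""
    span_id = secrets.token_hex(8)  # 8 random bytes -> 16 hex characters
    return f"00-{incoming.trace_id}-{span_id}-{incoming.flags}"


ctx = parse_traceparent("00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01")
outgoing = child_traceparent(ctx)
```

A verification test for the integration environment could send a request with a known trace ID and then check that the bot's request, dependency, and trace telemetry all carry the same value.
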
(Yochay's addition; John Taylor (?) is doing the work.) Enable telemetry to understand usage of QnA and LUIS, so that we can answer "How many distinct bots are using QnA or LUIS?"
(From Chris Mullins: Distinct? From Jim Lewallen: If we can! There are no PII concerns. From Carlos Castro: What about HTTP headers?)
Here is the proposal:
Item | Deliverable | Description |
---|---|---|
LUIS Recognizer modifications | Product | Modify the recognizer user agent string |
QnA Recognizer modifications | Product | Modify the recognizer user agent string |
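
One way to read "modify the recognizer user agent string" is that each recognizer appends its own product token to the `User-Agent` header on outgoing requests. The sketch below shows only the string handling. The function name `append_product_token` and the token values (`BotBuilder`, `LuisRecognizer`, the version numbers) are hypothetical placeholders, not the values the SDK actually sends.

```python
from typing import Optional


def append_product_token(user_agent: Optional[str], product: str, version: str) -> str:
    """Append "product/version" to a User-Agent value, without duplicating it."""
    token = f"{product}/{version}"
    base = (user_agent or "").strip()
    if token in base.split():
        return base  # already tagged, leave the header unchanged
    return f"{base} {token}".strip()


# Hypothetical values, for illustration only.
tagged = append_product_token("BotBuilder/4.0 (Node.js)", "LuisRecognizer", "1.0")
```

A token like this identifies the calling component. Counting distinct bots, as the question above asks, would additionally need some per-bot identifier, which this sketch does not attempt to define.
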
See the list of reports that enabling Application Insights should make available for bots.
Here is the proposal:
Item | Deliverable | Description |
---|---|---|
Simplify Application Insights registration for Bot Component | Product | Application Insights is a train wreck to configure. Customers need intimate knowledge of their startup sequences, which is a terrible experience. Create an integration appropriate to each platform (ASP.NET Core/WebAPI/Node.js). It should be usable with MVC patterns. (Some checks: ensure ILogger is hooked up, ensure appsettings.json is correct, ensure our telemetry initializer is registered, ensure our ASP.NET middleware is properly registered, etc.) |
Simple Disable of Application Insights Feature | Product (Docs/Samples) | Disabling Application Insights should be simple (ideally a single line), and any telemetry within the product should honor it. |
App Insights UserID/SessionID Telemetry | Product | Modifying the events to use bot user/session IDs enables other Application Insights reports, such as "Funnel", to show data in the appropriate user and session context. |
LUIS Recognizer Telemetry | Sample | Keep this as a sample until (A) we're confident that our reporting scenarios will satisfy 90%+, and/or (B) we create a flexible, configuration-based method of collecting data. (See below for details of a flexible storage system.) By choosing to distribute it as a sample, we avoid fielding endless customer enhancement requests of the form "Please add property X to the LUIS intent telemetry because I need it for my report". Customers can modify the telemetry themselves. |
QnA Recognizer Telemetry | Sample | See LUIS recognizer above. |
Bot Message Telemetry Middleware | Sample | See LUIS recognizer above. |
(Stretch) Configuration-based mapping collector | Product | A configuration-based method of specifying which properties to pull from specific objects would give customers the flexibility to change which properties are logged and where they are logged. This would enable storage into the Application Insights "Custom Events" and "Custom Metrics" tables. The feature should support 1:M relationships (for example, storing all intent data from a single LUIS result as separate rows). (I have doubts about whether this component should ever be developed; the bang for the buck doesn't seem worth it at this time.) |
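
To make the stretch item more concrete, here is a minimal sketch of a configuration-based collector, including the 1:M case of one row per LUIS intent. The config shape, the `collect` function, and the sample result layout are all assumptions made for illustration; they do not mirror an existing SDK type or the real LUIS response schema.

```python
from typing import Any, Dict, List

# Hypothetical mapping: which event to emit, which list to fan out over,
# and which properties to copy from the root object and from each list item.
MAPPING = {
    "event_name": "LuisIntent",
    "fan_out": "intents",                                    # 1:M, one row per element
    "root_properties": {"query": "query"},                   # output name -> source key
    "item_properties": {"intent": "name", "score": "score"},
}


def collect(source: Dict[str, Any], mapping: Dict[str, Any]) -> List[Dict[str, Any]]:
    """Turn one source object into one telemetry row per fan-out item."""
    shared = {out: source.get(key) for out, key in mapping["root_properties"].items()}
    rows = []
    for item in source.get(mapping["fan_out"], []):
        props = dict(shared)  # copy so rows do not share one dict
        props.update({out: item.get(key) for out, key in mapping["item_properties"].items()})
        rows.append({"name": mapping["event_name"], "properties": props})
    return rows


# Made-up recognizer result, used only to exercise the mapping.
luis_result = {
    "query": "book a flight",
    "intents": [
        {"name": "BookFlight", "score": 0.92},
        {"name": "None", "score": 0.05},
    ],
}
rows = collect(luis_result, MAPPING)  # two rows, one per intent
```

Each row could then be written as one custom event. Changing what gets logged becomes an edit to `MAPPING` rather than to code, which is the flexibility the row describes.
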
See the nice writeup from the Virtual Assistant team about the metrics they would like.
Item | Deliverable | Description |
---|---|---|
Telemetry Waterfall Dialog | Product | Enables logging to satisfy "Skills" analytics. We are more comfortable putting this into the product, given that the data being collected can be enhanced over time. |
Prioritize metrics | Doc | There are several metrics defined in the documentation. We need to prioritize them and document the order. |
Review metrics with the botmetrics@microsoft.com alias | IPA team/Darren Jefford's team | |
Evaluate LUIS/QnA Telemetry | Sample | Once the metrics are prioritized, reuse/modify the existing telemetry components. The resulting telemetry should serve both sets of reports (a superset). |
New sample demonstrating Telemetry Waterfall | Sample | Something to generate data to drive the new PowerBI template |
Build PowerBI Template | Sample | New reports showing off metrics |
Review PowerBI Template | Meeting | Review with botmetrics@microsoft.com |
(Stretch) Correlation/Causation for Skills | Sample | There are several items of the form "Surface to the user the reason why they aren't completing skills". This is loosely related to some LUIS scenarios. |
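
As an illustration of what a telemetry-enabled waterfall could log, the sketch below runs a list of steps and emits an event at the start, at each step, and at completion or cancellation. The `TelemetryWaterfall` class, the event and property names, and the use of `None` as a cancel signal are hypothetical stand-ins chosen for this example; a real implementation would send these events through the bot's telemetry client instead of an in-memory list.

```python
from typing import Any, Callable, Dict, List, Optional, Tuple

Step = Callable[[Any], Optional[Any]]  # returns the next value, or None to cancel
TrackEvent = Callable[[str, Dict[str, Any]], None]


class TelemetryWaterfall:
    """Runs steps in order and reports progress through track_event."""

    def __init__(self, dialog_id: str, steps: List[Step], track_event: TrackEvent):
        self.dialog_id = dialog_id
        self.steps = steps
        self.track_event = track_event

    def run(self, value: Any = None) -> Optional[Any]:
        self.track_event("WaterfallStart", {"DialogId": self.dialog_id})
        for index, step in enumerate(self.steps):
            props = {"DialogId": self.dialog_id, "StepIndex": index, "StepName": step.__name__}
            self.track_event("WaterfallStep", props)
            value = step(value)
            if value is None:  # assumption: None means the user abandoned the dialog
                self.track_event("WaterfallCancel", props)
                return None
        self.track_event("WaterfallComplete", {"DialogId": self.dialog_id})
        return value


def ask_city(_: Any) -> str:
    return "Seattle"


def ask_date(city: str) -> Optional[str]:
    return None  # simulate the user dropping out at this step


events: List[Tuple[str, Dict[str, Any]]] = []
dialog = TelemetryWaterfall("bookFlight", [ask_city, ask_date],
                            lambda name, props: events.append((name, props)))
result = dialog.run()
```

Because the cancel event carries the step name, a report built on this data can show at which step users stop, which is the kind of signal the stretch row on Skills completion asks for.
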
The LUIS team is very interested in building a SAT/DSAT detector, segmentation, a recommender, etc.
From @Cez:
Item | Deliverable | Description |
---|---|---|
Baseline | Design doc | Identification of apps we could use and collect/retain logs from. |
Close the loop | Design doc | Once we have a baseline measurement, attempt to close the loop by suggesting labels for utterances. |
Segmentation | Design doc | Additional feedback for improving the app and suggestions for content. |