Cosmos Diagnostics Logging Filters and optimizations. #39897
base: main
Conversation
Adds the ability to use logging.Filters on diagnostics logs based on the same filterable parameters as the cosmos diagnostics handler. Also adds some optimizations to reduce the time spent logging diagnostics. Lastly, improves the formatting of the diagnostics log to improve readability, especially when errors are logged.
API change check: APIView has identified API-level changes in this PR and created the following API reviews.
Awesome, thanks Bryan!
The main thing I'd like, though, is to add something similar to your PR description to the README section that talks about diagnostics in the SDK, since this is highly relevant and probably the only way that customers will realistically use the logging - otherwise it's too verbose. Adding a section below with explanations of the two different ways to filter, as well as links to the relevant samples, would be great.
LGTM. One suggestion would be to lower the number of headers we log by default. For example, I don't know how much value these headers have for debugging:

```
'Server': 'Microsoft-HTTPAPI/2.0'
'Access-Control-Allow-Origin': ''
'Access-Control-Allow-Credentials': 'REDACTED'
'Transfer-Encoding': 'chunked'
```
```python
# The same filter parameters that are found in the cosmos_diagnostics_handler can be used in the logging filter
class CosmosStatusCodeFilter(logging.Filter):
    def filter(self, record):
        return hasattr(record, 'status_code') and record.status_code >= 400
```
This code seems strange - what is the type of record here and why would it have a status_code attribute?
It is a logging `LogRecord`, which is where we add diagnostics-related information to the logs so that they can be filtered. In this update to diagnostics we pass extra attributes to the log record each time we log something; those attributes don't exist by default. Technically you could also just make a function and pass it as a log filter, but I inherited from logging.Filter to make this filter a sample of how to use it, with the knowledge that the log record will contain "status_code", "duration", etc.
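A minimal, self-contained sketch of that mechanism. The attribute names `status_code` and `duration` come from this PR; the capture handler and logger name are only for demonstration:

```python
import logging

records = []

class CaptureHandler(logging.Handler):
    """Collects emitted records so we can inspect them."""
    def emit(self, record):
        records.append(record)

logger = logging.getLogger("cosmos.extra.demo")
logger.setLevel(logging.INFO)
logger.addHandler(CaptureHandler())
# As noted above, a plain callable also works as a filter (truthy keeps the record).
logger.addFilter(lambda record: getattr(record, "status_code", 0) >= 400)

# Keys passed via `extra` become attributes on the LogRecord:
logger.info("throttled", extra={"status_code": 429, "duration": 0.012})
logger.info("ok", extra={"status_code": 200, "duration": 0.004})

assert len(records) == 1 and records[0].status_code == 429
```

The second call is dropped at the logger level, so attached handlers never see it.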
This removes the custom diagnostics handler and instead allows the use of logging filters.
```python
# Create a filter to filter out logs
class CustomFilter(logging.Filter):
    def filter(self, record):
        return hasattr(record, 'status_code') and record.status_code >= 400
```
I think the sample should apply a filter similar to how we do it in the Java default logger - filter based on status code plus duration - and for status code, filter out status codes that can easily be noisy (404/0, 409/0, 412/0).
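A sketch of what such a default filter could look like. The noisy status/sub-status pairs come from this comment; the `sub_status_code` attribute name, the duration threshold, and its units (seconds) are assumptions for illustration:

```python
import logging

NOISY_STATUS_CODES = {404, 409, 412}  # expected outcomes for many point operations

class DefaultDiagnosticsFilter(logging.Filter):
    """Keep failures and slow requests; skip noisy status codes with sub-status 0."""
    def __init__(self, slow_threshold_seconds=1.0):
        super().__init__()
        self.slow_threshold_seconds = slow_threshold_seconds

    def filter(self, record):
        status = getattr(record, "status_code", None)
        sub_status = getattr(record, "sub_status_code", 0) or 0
        duration = getattr(record, "duration", 0) or 0
        if status is None:
            return False
        # 404/0, 409/0, 412/0 are usually expected; only keep them if slow.
        if status in NOISY_STATUS_CODES and sub_status == 0:
            return duration > self.slow_threshold_seconds
        return status >= 400 or duration > self.slow_threshold_seconds
```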
```diff
@@ -214,7 +214,6 @@ def __init__(  # pylint: disable=too-many-statements
             logger=kwargs.pop("logger", None),
             enable_diagnostics_logging=kwargs.pop("enable_diagnostics_logging", False),
             global_endpoint_manager=self._global_endpoint_manager,
-            diagnostics_handler=kwargs.pop("diagnostics_handler", None),
```
Isn't this breaking?
This was/is a beta feature that was expected to have significant changes. It was released in a beta build only. The API review for it hasn't happened yet, and now that we won't use it, an API review won't be necessary.
```diff
 def _format_error(payload: str) -> str:
     output = json.loads(payload)
-    return output['message'].replace("\r", " ")
+    ret_str = "\n\t" + "Code: " + output['code'] + "\n"
+    message = output["message"].replace("\r\n", "\n\t\t").replace(",", ",\n\t\t")
```
The two replace instructions can be collapsed into one - this would reduce the overhead of this string manipulation.
I don't believe it would be possible to combine these into a single replace. The replace method takes a string literal, not a pattern, and replaces instances of that substring with the new one you specified. From my understanding, chaining is already the most efficient way to do this; doing it another way would involve the Python regex library, which might make it more complex and potentially add more overhead. Ultimately, the second replace trades some processing time for a much more human-readable diagnostics log, and whatever that overhead was, it was more than made up for by some of the other changes in this PR that optimized logging.
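For what it's worth, a single pass is possible with `re.sub` and a callback, though as the reply says it trades clarity for little gain. The sample payload string below is made up for illustration:

```python
import re

message = 'Message: {"Errors":["Resource Not Found"]},\r\nActivityId: abc,\r\nRequestUri: /dbs'

# Chained literal replaces, as in the PR:
chained = message.replace("\r\n", "\n\t\t").replace(",", ",\n\t\t")

# Single regex pass handling both patterns at once:
_SUBS = {"\r\n": "\n\t\t", ",": ",\n\t\t"}
single = re.sub(r"\r\n|,", lambda m: _SUBS[m.group()], message)

# Both produce the same output for this kind of payload.
assert chained == single
```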
```python
database_account: Optional[DatabaseAccount] = None,
*,
enable_diagnostics_logging: bool = False,
diagnostics_handler: Optional[Union[Callable, Mapping]] = None,
```
Breaking?
```python
# then read from kwargs (pop if that's the case)
# then use my instance logger
logger = request.context.setdefault("logger", options.pop("logger", self.logger))
if not logger.isEnabledFor(logging.INFO):
```
Should log level vary based on status code - see Java default logger to get an idea of what I mean
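One possible shape for that idea - the thresholds here are assumptions loosely modeled on typical practice, not the Java logger's exact rules:

```python
import logging

def level_for(status_code: int) -> int:
    """Map a response status code to a log level."""
    if status_code >= 500:
        return logging.ERROR
    if status_code >= 400:
        return logging.WARNING
    return logging.INFO

# The policy could then emit at a computed level instead of a fixed one, e.g.:
# logger.log(level_for(status_code), log_string, extra=cosmos_logger_attributes)
```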
```python
if value and value != HttpLoggingPolicy.REDACTED_PLACEHOLDER:
    logger.info(" %r: %r", header, value, extra=cosmos_logger_attributes)
if isinstance(http_request.body, types.GeneratorType):
    logger.info("File upload", extra=cosmos_logger_attributes)
```
What is "File upload" supposed to mean?
```python
pass
if http_request.body:
    log_string += "\nA body is sent with the request"
logger.info(log_string, extra=cosmos_logger_attributes)
```
Wouldn't it be more efficient to collapse all these log lines into 1?
Yes it would be, and we do that, but the parent class from Core has an environment variable that can be set to split it into multiple logs. In order to keep the same functionality while allowing us to add filterable attributes, I had to make sure I pass the same attribute values to all the logs that we make; otherwise we would not return all the needed information.
I think the perf results in the PR description are not very helpful - to understand the impact, it would be necessary to understand how much of the improvement is because logs are filtered out now. So instead, having two perf tests - one with a high percentage of requests actually being logged, and one where the percentage of records actually being logged is extremely low (could even be zero) - would be important.
Last but not least - throughput numbers would be good - like how many point reads per second can be done with each of the models, with reference to CPU, to ensure that better throughput is not explained by just higher CPU usage.
See comments
```python
# Test if we can log into from reading a database
database_id = "database_test-" + str(uuid.uuid4())
self.client_diagnostic.create_database(id=database_id)
assert all(m.levelname == 'INFO' for m in self.mock_handler_diagnostic.messages)
messages_request = self.mock_handler_diagnostic.messages[13].message.split("\n")
messages_response = self.mock_handler_diagnostic.messages[14].message.split("\n")
elapsed_time = self.mock_handler_diagnostic.messages[5].message.split("\n")
assert "/dbs" in messages_request[0]
assert messages_request[1] == "Request method: 'POST'"
assert 'Request headers:' in messages_request[2]
assert messages_request[15] == 'A body is sent with the request'
assert messages_response[0] == 'Response status: 201'
assert "Elapsed time in seconds:" in elapsed_time[0]
assert "Response headers" in messages_response[1]
messages_request = self.mock_handler_diagnostic.messages[1]
messages_response = self.mock_handler_diagnostic.messages[2]
elapsed_time = messages_request.duration
assert "databaseaccount" == messages_request.resource_type
assert messages_request.verb == "GET"
assert 200 == messages_request.status_code
assert "Read" == messages_request.operation_type
assert elapsed_time is not None
assert "Response headers" in messages_response.message
```
This is all changed because we do the initial background call to set up the database account - however, since we are also running the create_database method, would it be possible to extend the test to also check those entries, as opposed to only checking the initial read database call that happens in the background?
minor comment on expanding the test but this looks great, thanks Bryan!
```
@@ -23,9 +23,11 @@

#### Features Added
* Added ability to replace `computed_properties` through `replace_container` method. See [PR 39543](https://github.com/Azure/azure-sdk-for-python/pull/39543)
* Added ability to use Filters from Logging module on Diagnostics Logging based on Http request/response related attributes. See [PR 39897](https://github.com/Azure/azure-sdk-for-python/pull/39897)
```
Going to need to move these entries to the relevant newest one - we're probably running a release between today and tomorrow, so you might want to wait until then.
Is this file still needed? If so, it needs to be made private (underscore prefix).
```python
def on_request(self, request: PipelineRequest[HTTPRequestType]) -> None:
    verb = request.http_request.method
```

```python
self.logger: logging.Logger = logger or logging.getLogger("azure.cosmos.cosmos_http_logging_policy")
```
The name used here doesn't actually match the file name - not sure how important that is for customers filtering logs based on library/submodule?
```python
    'colls': 'container',
    'dbs': 'database'
}
super().__init__(logger, **kwargs)
```
It looks like `super()` is no longer being called at all - are we missing any logic here? Usually it's good to call `super()` regardless to make sure we don't get locked out of future Core updates to the policy - but maybe that's less important if we're not relying on `super()` for any behaviour.
For example - I see the `on_request`/`on_response` methods still call `super()` in the non-diagnostics case. If, in the future, the Core policy added a new attribute to `self` that was used in these methods, that could cause a failure.
That is an excellent point. Adding a call to super() at the beginning of the constructor shouldn't affect the rest of the logic of the logging policy, especially since the main concern was that we want to use our own Cosmos default logger, which we can still set up after calling super().
```python
cosmos_disallow_list = ["Authorization", "ProxyAuthorization", "Transfer-Encoding"]
cosmos_allow_list = [
    v for k, v in HttpHeaders.__dict__.items() if not k.startswith("_") and k not in cosmos_disallow_list
]
```
It might be worth adding an inline comment (not in the public docstring) to the `HttpHeaders` class to inform future devs/maintainers that headers they add to that class will be logged unless they also update the disallow list.
The Core policy by default operates on the opposite assumption - headers are considered unsafe unless opted into logging - so we should take extra documentation precautions to prevent a future leak.
```python
return
try:
    if isinstance(http_request.body, types.AsyncGeneratorType):
        logger.info("File upload", extra=cosmos_logger_attributes)
```
Is file upload actually a scenario in this SDK?
This might be here to duplicate the original core implementation - if that's the case, you could potentially simplify by removing things that really aren't going to apply to Cosmos payloads (as the Core policy was naturally written to accommodate any traffic).
```python
headers = request.http_request.headers
sub_status_str = http_response.headers.get("x-ms-substatus")
sub_status_code: Optional[int] = int(sub_status_str) if sub_status_str else None
url_obj = http_response.internal_response.url  # type: ignore[attr-defined, union-attr]
```
I think it might be safer to access the url from the request object. We really should avoid using `internal_response` as this object is unknown. Though it's not going to cause major problems as this policy already nicely handles all the errors - the worst-case scenario would be some logs going missing if you try a different transport - though if there's an easy alternative it might be best.
```python
logger = context.setdefault("logger", options.pop("logger", self.logger))

context["logger_attributes"] = log_data.copy()
self.on_request(request)
```
We're re-logging the request during the response?
We only log the request when we get to the response if the logger has any filters. Otherwise we wouldn't be able to filter related requests and responses together, as the request wouldn't have attributes like duration or status code. If no filters are applied, requests are logged on request.
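A simplified sketch of the dispatch being described - a hypothetical helper, not the actual policy code:

```python
import logging

def emit_request_log(logger: logging.Logger, message: str, attrs: dict) -> bool:
    """Log the request immediately only when no filters are attached.
    With filters present, defer so the record can later carry status_code
    and duration alongside the response. Returns True when deferred."""
    if logger.filters:
        return True  # deferred: emitted together with the response
    logger.info(message, extra=attrs)
    return False
```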
Description:
This PR adds the ability to use Filters from the Logging Module to filter Cosmos Diagnostics Logs based on the same filterable parameters as the Cosmos Diagnostics Handler. The Cosmos Diagnostics Filterable parameters are added to the Log Record of the Logger used to log diagnostics.
Each log can be filtered by the following attributes:
- `status_code`
- `duration`
- `verb`
- `database_name`
- `collection_name`
- `operation_type`
- `url`
- `resource_type`
- `is_request`
Example of using filters to filter diagnostics logs:
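A hedged sketch of such a filter, using the attribute names listed above (the duration threshold and its units are assumptions for illustration):

```python
import logging

class DiagnosticsFilter(logging.Filter):
    """Keep error responses and slow operations; drop routine fast successes."""
    def filter(self, record):
        status = getattr(record, "status_code", None)
        duration = getattr(record, "duration", None)
        if status is not None and status >= 400:
            return True  # always keep errors
        return duration is not None and duration > 1.0  # keep slow requests

logger = logging.getLogger("azure.cosmos")
logger.addFilter(DiagnosticsFilter())
```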
Additionally, this PR adds some optimizations that significantly reduce the time spent on logging diagnostics. The improvement is more noticeable when large numbers of requests are logged. As a reference, when using the old diagnostics to log 5000 requests, the time spent on logging diagnostics was over 10000 ms, while with the new optimizations it is reduced to between 3000 and 5000 ms using the same filterable parameters.
See the following profiles of old diagnostics logging and new diagnostics logging while making 3600 requests:
The logs have also changed in format to make error messages more readable, and some more diagnostic information has been added. Formerly redacted headers are now simply omitted to optimize logging.
Example: