Security Review of TLS1.3 0-RTT #1001
Comments
Viktor Dukhovni points out that some applications want to re-use tickets for several resumptions. In this case, there are two things worth noting:
@colmmacc: please provide a PR adding yourself to the Acknowledgements.
Added new section on anti-replay
Introduction
At the Eurocrypt/IACR TLS:DIV workshop on 2017-04-30, kindly facilitated by the Everest team, I presented the results of a security review of the TLS 1.3 0-RTT section. The security review was performed as part of the process of implementing TLS1.3 in s2n.
The review focused on two known issues: the absence of forward secrecy for all data, and the replayability of 0-RTT data. As it turns out, these issues can be worked around, and it is possible to provide 0-RTT, Forward Secrecy and anti-replayability (save for the Gillmor downgrade attack case) at the same time. Many thanks to Eric Rescorla for identifying how the workaround can be integrated with the existing TLS1.3 scheme, modulo a neat optimization that Eric also came up with.
However, TLS1.3 0-RTT is insecure by default, and based on the current draft, it is likely that TLS implementations not using workarounds will create real-world vulnerabilities. I believe that the attacks enabled by these vulnerabilities are more practical, and more serious, than is generally appreciated. Each attack enabled is more severe than other vulnerabilities that have been considered "must upgrade" for TLS implementations.
This issue is intended as a summary of the attacks, their implications, and the mitigations that an implementation may perform, as well as suggested changes to the specification that would reduce the risk related to these issues.
The most serious issues concern replays, and this summary includes five practical real-world attacks against applications using TLS1.3 as described in the draft. However, before discussing replays, it is helpful to understand how TLS1.3 and tickets interact.
TLS1.3 and tickets: STEKs remain a weakness in TLS
To support TLS resumption and 0-RTT, a server must know the session parameters to be resumed (PSK, authentication, etc.). The most common implementations of TLS tickets have the server using Session Ticket Encryption Keys (STEKs) to create an encrypted copy of the session parameters which is then stored by the client. When the client resumes, it supplies this encrypted copy, the server decrypts it, and has the parameters it needs to resume. The server need only remember the STEK.
If a STEK is disclosed to an adversary, then all of the data encrypted by sessions protected by the STEK may be decrypted by that adversary. STEKs are therefore a weak spot in the overall design of TLS; disclosure of a single small key can result in compromising an unbounded amount of data. While it is never possible to secure data that is transmitted during a compromise, it is a regression from forward secrecy that historical data transmitted prior to the compromise is not protected.
STEKs must be synchronized across a large number of hosts. The Akamai CDN, for example, consists of over 200,000 hosts. To be most effective for resumption, a STEK must be accessible on the subset of these hosts responsible for handling a domain. This subset is measurable as at least tens to hundreds of thousands of hosts based on DNS queries and host fingerprinting. In the case of some operators, hosts may also have the STEKs on disk, subject to risk of physical theft or seizure, depending on the architecture of the provider (though generally large providers such as Akamai do not store keys on disk).
This scope presents a large surface area to attackers. A single vulnerable host, or a vulnerability in how STEKs are synchronized, can lead to STEK disclosure. For the most part these security challenges are handled out of view of public audit, and it is difficult to capture how well best practices are applied. There has been some recent work by Springall, Durumeric and Halderman, "Measuring the Security Harm of TLS Crypto Shortcuts", which quantifies the use of overly long-lasting STEKs.
Separate from traditional host security risk, there is also a cryptographic risk. Attackers may record and replay a ticket to a server at will, and tickets are commonly encrypted using AES, an algorithm that is vulnerable to side-channel analysis in some situations. For example, if an attacker can gain a vantage point "close" enough to a non-constant-time, non-constant-memory-access implementation of AES (e.g. software encryption on an old host that does not support hardware AES acceleration) then they may be able to discover the STEK through side-channel analysis. At a minimum, server implementations of TLS would be wise to use an algorithm designed for side-channel resistance for ticket encryption, regardless of the encryption algorithm intended for the session itself.
TLS1.3 is much better, but critical data still lacks forward secrecy
TLS1.3 makes huge strides in reducing these security risks. Tickets are now associated with Resumption Pre-Shared Keys (RPSKs) which are not the same as the keys encrypting the original session. Additionally, upon resumption, TLS1.3 supports a key schedule that means the only user data protected by the RPSK is 0-RTT data, which is optional.
While 0-RTT is intended for a relatively small volume of data at the beginning of a connection, it is unfortunately very likely that this section of data will contain critical credentials: credit card numbers, passwords, cookies and other bearer tokens. Typically these requests, if compromised, can be used to generate the entire response. Thus, with TLS1.3, a large volume of critical user data remains secured ultimately by STEKs - which as we've seen are a weak spot. In practice, meaningful forward secrecy is not provided when 0-RTT is enabled.
Suggested mitigation: Support Forward Secrecy for 0-RTT data
An alternative to using STEKs and encrypted tickets is to use tickets as keys for a single-use session cache. When a server issues a ticket, it can store the session parameters in a cache. When a server receives a ticket for use, it looks up this ticket in the cache, which supports an atomic retrieve-and-delete operation.
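The retrieve-and-delete behaviour described above can be illustrated with a minimal in-memory sketch. The class and method names (`SingleUseSessionCache`, `issue_ticket`, `redeem_ticket`) are hypothetical, and a real deployment would use a networked cache with an equivalent atomic operation rather than a process-local dictionary.

```python
import secrets
import threading
from typing import Dict, Optional


class SingleUseSessionCache:
    """In-memory stand-in for a cache with atomic retrieve-and-delete."""

    def __init__(self) -> None:
        self._lock = threading.Lock()
        self._sessions: Dict[bytes, bytes] = {}

    def issue_ticket(self, session_params: bytes) -> bytes:
        # The ticket is only a random lookup key. It carries no encrypted
        # state, so there is no STEK to protect.
        ticket = secrets.token_bytes(16)
        with self._lock:
            self._sessions[ticket] = session_params
        return ticket

    def redeem_ticket(self, ticket: bytes) -> Optional[bytes]:
        # pop() under the lock makes retrieve-and-delete a single step:
        # a second presentation of the same ticket finds nothing.
        with self._lock:
            return self._sessions.pop(ticket, None)


cache = SingleUseSessionCache()
ticket = cache.issue_ticket(b"resumption-psk-and-params")
first = cache.redeem_ticket(ticket)    # session parameters are returned once
second = cache.redeem_ticket(ticket)   # the replayed ticket is unknown
```

Because the entry is gone after the first redemption, a later compromise of the cache reveals nothing about sessions that have already been resumed.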
This arrangement provides Forward Secrecy for all sessions successfully retrieved from the cache. If the server, or cache, is compromised then generally only data pertaining to future, yet-to-be-used, sessions is disclosed. Conveniently, operational and application errors favor security; for example downtime, crashes, and so on generally result in key erasure. Economic incentives also favor deleting keys (to make room for new ones) over keeping keys (as with STEKs, where a more long lived key is operationally cheaper).
The transactional requirements for a single-use cache and strike registers (an anti-replay mechanism) are also different. With strike registers, it is critical to know when a strike register was available and unavailable, to discard tickets from any period the strike register may not have recorded observations durably. To achieve this, all updates to a strike register must be sequential relative to a global checkpoint (i.e. all updates arriving prior to the checkpoint must be committed). A single-use cache is free to make concurrent updates that are unsequenced relative to each other or any checkpoint (but updates and reads to any single key must be sequenced).
If forward secrecy can be provided in this manner, why is this arrangement not common today? The easy answer is that this mode is not efficient in current versions of TLS, where tickets are intended for multiple uses. This arrangement also comes at a cost: a server must operate the cache. At least in my view, the operational costs are worth the security benefit of meaningful Forward Secrecy. The cost of memory has lowered considerably since the inception of tickets. For example, an AWS Elasticache Redis instance capable of caching millions of sessions costs as little as $0.017 / hour. A custom cache with a dedicated design can be implemented considerably cheaper again.
The second cost is latency; performing a lookup costs time. Within the data-center, this cost too is no longer significant. For example within the AWS EC2 VPC network it is possible to achieve lookup times measured in tens of microseconds. This time is not significant in the context of a TLS connection that would benefit from 0-RTT optimization, where latencies are usually over three orders of magnitude greater at tens to hundreds of milliseconds.
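As a rough worked example of that comparison, the sketch below plugs in two illustrative figures. Both numbers (a 30 microsecond cache lookup and a 100 millisecond client round trip) are assumptions chosen from within the ranges quoted above, not measurements.

```python
# Assumed, illustrative figures (not measurements).
cache_lookup_us = 30           # in-datacenter cache lookup, microseconds
client_round_trip_us = 100_000  # 100 ms WAN round trip, in microseconds

# How many times larger the round trip is than the lookup.
ratio = client_round_trip_us / cache_lookup_us

# Lookup cost as a percentage of one client round trip.
overhead_percent = 100 * cache_lookup_us / client_round_trip_us
```

With these assumed values the round trip is over three thousand times longer than the lookup, so the lookup adds a few hundredths of a percent to the latency that 0-RTT is trying to save.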
The third cost is that it is not feasible to share state across data centers or a wide geographic area (e.g. a global CDN). A ticket issued from a CDN node in one city would not help a user resume a connection if they are later routed to a different city. In my view (having built two CDNs), this is not a significant problem. In practice, when users are routed to different locations it is common for cache misses to occur, and for the related TCP Fast Open optimization (it also speeds up connections) to fail (due to different IP addresses when not using IP anycast). Operators already work hard to maximize user affinity to locations, and any cache misses at the TLS level can be very quickly repaired.
Note also that one of the implications of Krawczyk's 2005 paper on Perfect Forward Secrecy is that some transactionally mutable server-side state is required to provide Forward Secrecy for 0-RTT-like messages (prior to any handshake). For example, puncturable encryption, another technique aimed at providing 0-RTT forward secrecy, requires transactionally mutable server-side state. In the case where a 0-RTT section arrives on the other side of the globe from wherever the store is located, it takes hundreds of milliseconds to complete a transaction, defeating the point of the 0-RTT feature. Forward secrecy and global resumption are likely mutually incompatible in any scheme. We should favor the security of forward secrecy.
Suggested changes to TLS1.3 draft
Make multiple tickets from the same connection cryptographically independent
In today's TLS1.3 draft a server may issue multiple tickets on a connection, but these tickets are not cryptographically independent. Unfortunately this makes it impossible for a single-use session cache to distinguish between a ticket issued pre- and post-authentication, and it prevents servers from issuing meaningfully different tickets in order for a client to build up a pool of tickets.
With #998, Eric Rescorla has suggested an easy fix for this. I strongly support adopting this change, and thank Eric for the suggestion.
Change clients "SHOULD" use tickets once to "MUST"
The TLS 1.3 draft-20 specifies that:
To provide meaningful forward secrecy on client side, clients "MUST" use each ticket no more than once, and "MUST" delete any session parameters stored with the session.
There is also another reason for this "MUST": any client that attempts to use a ticket multiple times will also likely leak the obfuscated_ticket_age value, which is intended to be secret.
Designate or Discern STEK-less tickets
While today STEKs are a common practice, and many operators do have reasonable implementations in place, over time their presence as a weak spot may (hopefully!) lead to their eradication, similar to how non-PFS key agreement has been greatly diminished through tools such as the ssllabs.com ratings.
It is possible for a client (or eventually an ssllabs.com) to validate that a ticket cannot have used a STEK: this is the case if the size of the ticket data is smaller than the RPSK. However, today it is not possible for clients to ask for such a ticket, only to reject tickets.
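The size check can be sketched in a few lines. The function name `ticket_cannot_contain_rpsk` is hypothetical, and the sketch assumes the RPSK is as long as the output of the hash used by the negotiated cipher suite (32 bytes for SHA-256, 48 bytes for SHA-384).

```python
# Assumed RPSK lengths, taken to equal the cipher suite's hash output size.
RPSK_LEN_BY_HASH = {"sha256": 32, "sha384": 48}


def ticket_cannot_contain_rpsk(ticket: bytes, hash_name: str) -> bool:
    """True if the ticket is too short to hold an encrypted copy of the RPSK."""
    return len(ticket) < RPSK_LEN_BY_HASH[hash_name]


# A short random identifier cannot be a STEK-encrypted session.
short_ticket_ok = ticket_cannot_contain_rpsk(bytes(16), "sha256")

# A long ticket proves nothing either way; it may or may not use a STEK.
long_ticket_ok = ticket_cannot_contain_rpsk(bytes(200), "sha256")
```

The check is one-sided: a short ticket rules out a STEK-wrapped RPSK, but a long ticket is merely inconclusive.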
In practice it would be useful for clients to advertise or encourage support for STEK-less tickets by advertising a maximum ticket size supported, or by having a designated "STEK-free" ticket type. At a minimum, this prevents STEK-dependent servers from generating tickets that clients have no desire to keep.
Replay is a big problem, Replay is a big problem
A delete-on-use session cache robustly prevents any replay of a ticket (and hence replay of an associated 0-RTT data section); however, that is neither common nor required for tickets in TLS 1.3. To the contrary, the draft specification calls out that replay is expected:
This is unworkably insecure and will lead to many practical real-world security issues. As also noted in the draft:
This suggested mechanism is to validate the ticket age presented by a client. The ticket age presented should correspond to the time since the ticket was issued by the server plus the round-trip time between the client and server.
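A minimal sketch of such a ticket age check follows. The function names and the 10 second tolerance are illustrative assumptions. The de-obfuscation step assumes the draft's construction in which the client adds ticket_age_add to its ticket age modulo 2^32.

```python
TICKET_AGE_MODULUS = 2 ** 32  # obfuscated_ticket_age is a 32-bit value


def recover_client_age_ms(obfuscated_ticket_age: int, ticket_age_add: int) -> int:
    """Undo the client's obfuscation to get its claimed ticket age."""
    return (obfuscated_ticket_age - ticket_age_add) % TICKET_AGE_MODULUS


def ticket_age_plausible(issued_at_ms: int, now_ms: int,
                         client_age_ms: int, tolerance_ms: int) -> bool:
    """Accept if the client's claimed age is close to the server's own view."""
    server_view_age_ms = now_ms - issued_at_ms
    return abs(server_view_age_ms - client_age_ms) <= tolerance_ms


ticket_age_add = 0x9E3779B9   # per-ticket value chosen by the server (example)
obfuscated = (60_000 + ticket_age_add) % TICKET_AGE_MODULUS
claimed_age = recover_client_age_ms(obfuscated, ticket_age_add)

# Original attempt, arriving about 500 ms after the client sent it.
original = ticket_age_plausible(0, 60_500, claimed_age, tolerance_ms=10_000)
# The same bytes replayed 4.5 s later are still inside the window.
quick_replay = ticket_age_plausible(0, 65_000, claimed_age, tolerance_ms=10_000)
# Only a replay outside the window is rejected.
late_replay = ticket_age_plausible(0, 90_500, claimed_age, tolerance_ms=10_000)
```

The check bounds when a replay can happen, not how many replays fit inside the window.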
There are three problems with this mechanism. The first problem is that the maximum bound for a round-trip-time is quite high. RTTs of 500ms are not unusual for satellite internet, or rural broadband schemes, and these are the very use-cases 0-RTT most benefits. In a 500ms window an attacker can easily send tens of thousands of replays.
The second problem is that in the real world there is clock skew. Clocks and timers can drift based on factors such as temperature, CPU power saving, system hibernation and more. This is particularly true of low-power devices which are increasingly prevalent. Indeed the TLS1.3 draft suggests tolerating up to 10 seconds. In the real world, providers may go higher. The maximum lifetime for a ticket is 7 days. My own private experiment based on requests from low-power devices showed that over 7 days the 99th percentile clock drift was around 2 seconds, but the 99.9th percentile was around 40 seconds. A provider therefore has a natural and strong incentive to increase the window of tolerance, in order to permit more clients to resume and use 0-RTT. Regardless, even with a 10 second window, an attacker can send millions of replays.
The third problem is that both of these cases so far are per-single-destination. If many servers share a STEK, as is common, then it is possible to replay at these rates to each server. With CDNs consisting of hundreds of thousands of nodes, it is likely that the attacker's ability to generate replays is bounded only by the availability of bandwidth. In short: millions to billions of replays are possible within the scheme outlined by the TLS1.3 draft.
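The scaling across these three problems is simple multiplication, sketched below. The replay rate of 100,000 requests per second per server and the count of 1,000 servers sharing a STEK are assumed figures for illustration only. The 500 ms and 10 second windows are the ones discussed above.

```python
def replay_budget(replays_per_second_per_server: int,
                  window_ms: int, servers: int) -> int:
    """Replays that fit in the acceptance window across all servers."""
    return replays_per_second_per_server * window_ms * servers // 1000


ASSUMED_RATE = 100_000  # replays/second an attacker can deliver to one server

rtt_only = replay_budget(ASSUMED_RATE, window_ms=500, servers=1)
skew_window = replay_budget(ASSUMED_RATE, window_ms=10_000, servers=1)
shared_stek = replay_budget(ASSUMED_RATE, window_ms=10_000, servers=1_000)
```

Under these assumptions the three cases come to 50,000, 1,000,000 and 1,000,000,000 replays respectively. A higher per-server rate or a wider tolerance window scales each figure linearly.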
In practice, millions of replays are sufficient to exploit measurable cryptographic side-channels, if the underlying implementation is vulnerable to any. As we'll see, it also enables at least five types of serious attack.
Attack 1: HTTP is not replay tolerant
In the evolution of the TLS1.3 specification, it has been stated that web requests must already be replay tolerant, as browsers will retry interrupted connections. While the latter is true, it does not imply the former.
Firstly, not all use of TLS is for HTTP, and users of protocols other than HTTP are likely to desire and enable the benefits of 0-RTT. Secondly, not all HTTP requests are made by web browsers. In fact, to render a single browser request for amazon.com, Facebook, or a Google search often requires hundreds of "behind the scenes" HTTP requests to internal micro-services. These requests sometimes span data centers and can benefit from 0-RTT optimization. As the "Internet of things" develops, these same types of requests are now also common between industrial and home settings and cloud servers accessed via the WAN. These settings include networks that are not well-secured and are subject to relatively easy eavesdropping.
Many clients for these services do not retry by default, or pay tedious attention to how retries are implemented. This is especially true of requests that implement transactional applications. In some asynchronous transaction schemes, clients need to be careful to provide each commit attempt a unique ID, separate from the unique ID for the commit itself. These applications are often careful to preserve an order between related requests to resolve dead-locks and ties.
More common still are clients that form a "Try, Wait, Read, Retry-if-not-there" cycle to avoid creating duplicate entries. For example, a client may try a request, and if that request times out, it may wait a fixed time period, poll for success of the previous request (maybe it did succeed) and only then try it again. Applications such as these are generally not tolerant of even a single uncontrolled replay. Other applications eliminate retries entirely, and make requests at a fixed constant rate as an important measure designed to reduce the risk of overload or cascade outages during partial failure events.
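The "Try, Wait, Read, Retry-if-not-there" cycle can be sketched roughly as follows. This is only an illustration: `FlakyStore`, `create_once` and the record ID are hypothetical names, and the store is an in-memory stand-in whose first acknowledgement is lost.

```python
import time


class FlakyStore:
    """Hypothetical store: the first write lands, but its acknowledgement is lost."""

    def __init__(self):
        self.writes = []            # every write that reached the store
        self.lose_next_ack = True

    def create(self, record_id):
        self.writes.append(record_id)
        if self.lose_next_ack:
            self.lose_next_ack = False
            raise TimeoutError("acknowledgement lost")

    def exists(self, record_id):
        return record_id in self.writes


def create_once(store, record_id, wait_seconds=0.0, max_attempts=3):
    """Try, wait, read, and retry only if the record is not there.

    Returns the number of write attempts the client made.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            store.create(record_id)
            return attempt
        except TimeoutError:
            time.sleep(wait_seconds)      # wait before polling
            if store.exists(record_id):   # the earlier attempt did succeed
                return attempt
    raise RuntimeError("gave up after %d attempts" % max_attempts)


store = FlakyStore()
create_once(store, "order-1")   # the careful client writes exactly once
store.create("order-1")         # a replayed copy arrives outside the client's control
print(store.writes)             # the store now holds a duplicate entry
```

The client's care only covers retries it makes itself; a replay injected by a third party bypasses the poll step entirely.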
In 2015, Daniel Kahn Gillmor discovered a combined replay and downgrade attack against 0-RTT sections: if an active attacker can block server responses to a 0-RTT request, while also disabling the server's record (strike register) of observed 0-RTT sections (a DoS attack may achieve this), then the server may be forced to refuse 0-RTT data on the subsequent retry. This will force the client to downgrade and repeat the request as regular non-0-RTT data.
In fact, as noted earlier, it is never safe for clients to repeat a ticket if one is concerned about preserving the secrecy of the obfuscated ticket age. So reasonable clients may always retry with a non-0-RTT attempt, or use a different ticket if one is available to them. Although a provider may hash the entire 0-RTT section to derive a key for use with a strike register, this requires buffering, and it is more common to use a key derived from the ticket. Both of these factors make the Gillmor attack even more practical; the strike register is probably irrelevant anyway.
A single-use cache does not mitigate this attack, but notice that in the Gillmor attack the client is made aware of the original failure, and can control the nature of the retry. The client knows that the request may have failed, or may have succeeded, and so careful clients may enforce their retry requirements. Without a 0-RTT anti-replay mechanism, a request may be silently replayed millions of times without the client's knowledge. That is a materially different kind of attack that breaks existing systems in unexpected ways.
Attack 2: Exhausting tolerances
Many applications also fail to take into account fully correct REST design patterns, and implement non-idempotent GET requests. Cloudflare provide a great example in their blog post on Zero-RTT.
While it is true that browsers may retry a request like this today, as we've seen in attack 1, it is not true that only browsers make such requests.
More important is that while an attacker may cause a browser to retry this type of request perhaps tens to hundreds of times, that kind of attack is active, consumes the user's bandwidth and CPU, and must pass through the user's firewalls and other controls.
A 0-RTT replay attack, as we've seen, can be performed up to millions of times, and mostly out of band using the attacker's resources (though the attacker must also be passively in-band, to copy the original 0-RTT section). Repeating a request once, as the Gillmor attack permits, may occasionally trigger a manual refund process. Repeating the request millions of times may bankrupt a business. This is a materially different kind of risk.
How practical is it for applications to mitigate attacks 1 and 2?
For the purposes of attacks 1 and 2, consider what is necessary on the application's side to mitigate this kind of issue: it must make the requests themselves replay-safe. One popular approach is to make the request itself idempotent by adding an explicit, or synthesized, idempotency key that represents the invocation. See the Stripe blog post on just this topic. The key must be committed to a data store that can provide an atomic uniqueness guarantee, and since this commit must be concurrent to the operation itself, it must generally occur in the data store the application is mutating.
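As a minimal sketch of that requirement, the example below commits the idempotency key and the mutation in a single transaction, using Python's standard `sqlite3` module as a stand-in for the application's data store. The table names, the `debit` function and the account data are all hypothetical.

```python
import sqlite3


def open_store():
    """Create an in-memory store holding both the keys and the application data."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE idempotency_keys (key TEXT PRIMARY KEY)")
    conn.execute("CREATE TABLE accounts (name TEXT PRIMARY KEY, balance INTEGER)")
    conn.execute("INSERT INTO accounts VALUES ('alice', 100)")
    conn.commit()
    return conn


def debit(conn, idempotency_key, name, amount):
    """Apply the debit at most once per idempotency key.

    Returns True if the debit was applied, False if the key was already used.
    """
    try:
        with conn:  # one transaction: commit on success, roll back on any error
            conn.execute(
                "INSERT INTO idempotency_keys VALUES (?)", (idempotency_key,)
            )
            conn.execute(
                "UPDATE accounts SET balance = balance - ? WHERE name = ?",
                (amount, name),
            )
        return True
    except sqlite3.IntegrityError:
        return False  # PRIMARY KEY violation: this invocation was seen before


def balance(conn, name):
    row = conn.execute(
        "SELECT balance FROM accounts WHERE name = ?", (name,)
    ).fetchone()
    return row[0]
```

Because the key and the balance change commit or roll back together, a replayed request carrying the same key cannot debit the account a second time, and a failed debit does not burn the key.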
One immediate problem is that not all applications use such data stores. An eventually consistent data store does not provide these kinds of guarantees, though may provide a guarantee around when "eventually" the store is consistent. This is why some clients perform the "Try, poll, Try again" cycle.
It can be tempting to suggest that idempotency could be provided by a logically separate component, responsible only for preventing re-occurrences. As it turns out, it is not possible to effectively guarantee uniqueness from "outside" of the application's central data store. Consider a theoretical micro-service designed to provide "idempotency as a service": it could accept idempotency keys and commit them on behalf of the application while refusing duplicates. This naive arrangement breaks when the micro-service accepts and commits the key, but the application's own update to its data store fails. Then the user's operation fails but cannot ever be repeated. To resolve this, the micro-service and the application service must use a coupled and distributed transaction protocol, and things get complicated quickly.
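The failure mode of the naive arrangement can be shown with a small simulation. Everything here is hypothetical: `IdempotencyService` stands in for the external micro-service, and `AppStore` for an application data store whose first write fails.

```python
class IdempotencyService:
    """Hypothetical external service: records keys and refuses duplicates."""

    def __init__(self):
        self.seen = set()

    def claim(self, key):
        if key in self.seen:
            return False
        self.seen.add(key)
        return True


class AppStore:
    """Hypothetical application data store whose first write fails."""

    def __init__(self):
        self.orders = []
        self.fail_next_write = True

    def write(self, order):
        if self.fail_next_write:
            self.fail_next_write = False
            raise IOError("data store unavailable")
        self.orders.append(order)


def place_order(service, store, key, order):
    """Claim the key externally, then update the application's own store."""
    if not service.claim(key):
        return "duplicate"
    try:
        store.write(order)
    except IOError:
        return "failed"   # the key stays claimed, although nothing was written
    return "ok"


service, store = IdempotencyService(), AppStore()
first = place_order(service, store, "key-1", "1 widget")    # write fails
second = place_order(service, store, "key-1", "1 widget")   # honest retry is refused
print(first, second, store.orders)
```

The two commits are not atomic, so the legitimate retry is rejected as a duplicate even though the order was never recorded.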
To underscore how subtle and hard a problem this area of idempotency can be, it is worth looking at one of CloudFlare's 0-RTT anti-replay mechanisms. To make things easier for applications, CloudFlare adds an HTTP header to outgoing, proxied, requests that originated in 0-RTT sections:
An application can use this value as a convenient uniqueness key, to mitigate 0-RTT replay. However, this isn't quite sufficient. A retry request triggered by the Gillmor attack will not be associated with the same PSK binder, and so application-level idempotency is required anyway. Of course an application designer who has no idempotency key available to them may decide to use the CloudFlare-provided key pragmatically; at least they are now defended against 0-RTT mass-replays, and this is a sensible use of the neat feature. However, since applications often evolve to eventually include idempotency keys, an application may be left with a transition period where both uniqueness keys are required. Many NoSQL datastores are limited to a single index and do not provide for enforcing multiple uniqueness constraints.
Attack 3: Compromising secrecy with TLS1.3 0-RTT
To provide the security guarantee of secrecy, it is not sufficient that requests are idempotent and replay-tolerant. The requests must also be handled in a manner that is free of any observable side-effects. This is extremely difficult to achieve. This is a core focus of strong cryptography, where side-effect free programming errors are an area of constant research and frequent vulnerabilities. Higher level applications are generally not concerned with this challenge at all, and are poorly prepared for the implications of replays.
Take for example a simple side effect: caching. A read-only request is by definition idempotent, but if a cache is present, this cache can affect observable timings and response headers in ways that defeat the secrecy of the request.
Suppose a user fetches a piece of content from a CDN using a 0-RTT request, and that piece of content is prohibited and contrary to the principles of a totalitarian regime. Ordinarily only the size of the download of the content is disclosed to a man-in-the-middle attacker, and as we'll see later, TLS1.3 includes support for measures designed to help defeat traffic-analysis attacks that can use this size to identify the content.
But with replays it is now feasible to probe the CDN caches to determine what the content was. First the attacker copies the zero-RTT section and then replays it to a series of CDN nodes. The attacker can choose CDN nodes that are unlikely to have the content already (e.g. in different geographic regions), and replay the request. The CDN will then fetch, and cache, the content.
The attacker can then make probe requests for suspected illicit content and determine if it was cached or not (if it loads quickly, or slowly, or if a cache max-age header lines up with the replayed request). Note that the attacker can take their time with the probes and can spread probe requests over a relatively long time period. Any noise or uncertainty in the process can be countered by using additional replays to more nodes to increase confidence.
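The replay-then-probe sequence can be illustrated with a toy model. This is a sketch under strong simplifying assumptions: `CdnNode` is hypothetical, a dictionary lookup stands in for the node decrypting a 0-RTT section with a shared STEK, and an explicit HIT/MISS status stands in for the timing or header signal an attacker would actually measure.

```python
class CdnNode:
    """Toy CDN node that caches every path it serves."""

    def __init__(self, shared_keys):
        self.shared_keys = shared_keys   # stands in for a STEK shared by all nodes
        self.cache = set()

    def serve(self, path):
        """Serve a path, reporting whether it was already cached."""
        status = "HIT" if path in self.cache else "MISS"
        self.cache.add(path)
        return status

    def serve_0rtt(self, blob):
        """Serve a 0-RTT section; only the node can map the blob to a path."""
        return self.serve(self.shared_keys[blob])


def probe(node, candidate_paths):
    """Attacker side: request each candidate directly and keep the cache hits."""
    return [path for path in candidate_paths if node.serve(path) == "HIT"]


keys = {b"captured-0rtt-section": "/banned-article"}
far_node = CdnNode(keys)                       # a node unlikely to hold the content
far_node.serve_0rtt(b"captured-0rtt-section")  # replay: the node fetches and caches
print(probe(far_node, ["/weather", "/banned-article", "/sports"]))
```

The attacker never decrypts the captured section; the cache state left behind by the replay is what leaks the requested path.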
This is just one basic example using a typical CDN cache, but applications use caches at many other layers. It is likely that all of these caches can be probed in some way to reveal details of encrypted requests.
Attack 4: Exhausting throttles with TLS1.3 0-RTT
It is a common operational and security measure to throttle application requests. For example, a given customer may be permitted to perform as many as 10,000 requests/second but no more.
To avoid simple spoofing risks, many such systems perform throttling post-authentication. For example, the request may be signed cryptographically (see the AWS SIGv4 signing protocol or the OATH signing process), and that signature is verified prior to throttling. This post-authentication property is one reason why such protocols are designed to be extremely fast to verify, which often means as much cryptography as possible must be pre-computed, making random nonces infeasible in many cases.
For such systems, 0-RTT data means that legitimately signed requests that were previously considered to be secret and non-spoofable are now re-playable by attackers. This enables a new and realistic denial of service vulnerability capable of locking customers out of their accounts.
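A rough sketch of how a replayed, validly signed request can exhaust a post-authentication throttle follows. The HMAC scheme, the `SECRET` value and the `Throttle` class are illustrative assumptions, not the signing protocols named above.

```python
import hashlib
import hmac

SECRET = b"customer-secret"   # hypothetical per-customer signing key


def sign(body):
    """Sign a request body with the customer's key."""
    return hmac.new(SECRET, body, hashlib.sha256).hexdigest()


class Throttle:
    """Simple request budget, charged only after the signature verifies."""

    def __init__(self, limit):
        self.limit = limit
        self.used = 0

    def handle(self, body, signature):
        if not hmac.compare_digest(sign(body), signature):
            return "rejected"     # unauthenticated requests are not counted
        if self.used >= self.limit:
            return "throttled"
        self.used += 1
        return "ok"


throttle = Throttle(limit=5)
captured = (b"GET /status", sign(b"GET /status"))   # copied from a 0-RTT section
for _ in range(5):
    throttle.handle(*captured)                      # replays consume the budget
print(throttle.handle(b"POST /pay", sign(b"POST /pay")))
```

The attacker needs no key material: every replay carries a genuine signature, so it is charged against the victim's budget and the victim's own request is the one that gets throttled.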
Attack 5: Enabling easier Traffic Analysis with TLS1.3 0-RTT
With traffic analysis it is often possible for a passive attacker to decloak what content an encrypted session handled. For example, when a user browses Wikipedia, an attacker may be able to determine which page the user is viewing because the combination of HTML, image and CSS sizes on a particular Wikipedia page is highly likely to be unique; even though the content is encrypted, the attacker can observe the sizes. This type of attack has become slightly easier with the recent adoption of stream ciphersuites such as AES-GCM and ChaCha20-Poly1305 that do not mask content-length to at least a block size.
TLS1.3 includes a record layer padding mechanism, designed to make these kinds of attacks more difficult. However, 0-RTT replay also enables a new kind of traffic analysis attack. Today, traffic analysis is most effective against fixed-size responses, as in the Wikipedia example. With 0-RTT data, an attacker can repeat a request millions to billions of times, and by observing variability in response size and response times can gain additional information that may enable the attacker to decloak data.
Violation of layers and separation of actors
As we've seen, the problems of replay tolerance are resolvable only at the application layer, but solutions can be subtle and hard to reason through and test. In my experience, deliberately replaying requests will uncover surprising issues in many systems. Indeed there are race detectors, trashers and other testing tools for a variety of languages that have evolved to find these kinds of issues. But these bugs remain common, and many systems make the fair assumption that TLS provides anti-replay properties for their messaging/transport layer.
The approach with the TLS1.3 draft is to say that 0-RTT data is optional, that it should not be enabled without a careful analysis and that the application must be made aware that data was potentially replayed. In my view, in light of the above attacks, this advice is unworkable. It is not simple, or maybe even possible, to secure all applications against replay and measurable side effects such as cache timing and throttling. Fully-correct idempotency is very difficult and vanishingly rare.
But beyond that, let's examine the advice given:
There are also several reasons to believe that even this advice will not always be taken, indeed some existing experimental deployments do not follow it.
The first and strongest reason is that the benefits of 0-RTT are considerable, immediate, and measurable. By turning on 0-RTT a provider can save 100s of milliseconds, and it's been reported that savings of 100ms can impact revenue by as much as 1%. Providers also exist in a competitive landscape and are constantly trying to beat each other on every conceivable metric. At the same time, the security risks are non-obvious (indeed, this write up is coming very late in the TLS draft process) and hard to test for. In other words: providers have an extremely large incentive to turn 0-RTT on, de-prioritizing harder-to-measure security concerns.
The second practical problem is that the world of application authors, writing high level code for websites and applications, is very separate from the worlds of TLS implementors and server administrators. It is predictable that site administrators will enable 0-RTT without an appreciation for the risk to the application, whose authors are likely not even aware of a change. Indeed acceleration providers are already making this easy in the current experimental deployments of TLS1.3, where 0-RTT support is being offered to websites and applications in a backwards compatible way as a single stream of data towards their own servers.
If providers are to render the advice of the TLS draft moot, and to provide a single stream of data anyway, then arguably it would be better if the TLS1.3 RFC defined that as the default mode of operation. Maintaining separate "may be replayed" and "can't be replayed" sections is complexity that can clutter applications and increase risks of application-level state machine bugs.
Even a fully-aware and conscientious site administrator faces a practical difficulty: applications are often made up of many URLs and request paths. Some may be replay safe, and others not. But 0-RTT is enabled at a protocol level, for all requests. There are also "Layer 4" TLS proxies which accelerate TLS by terminating it at edge sites (similar to a CDN), or provide security benefits by handling certificate management, but are completely agnostic to the protocol being handled by TLS. Will administrators and providers in these situations resist the temptation to accelerate TLS with 0-RTT mode?
Furthermore the expectations and guarantees provided by layers are expected to be consistent across providers. A customer may legitimately use an API proxy offered by one provider, in combination with a load balancer offered by another, together with a CDN or edge-accelerator offered by another. A change in 0-RTT behavior on the part of any one provider can impact the security assumptions of the others. For example, the CDN layer administrator might enable 0-RTT for what appears to be an idempotent request pattern, without being aware that the API proxy implements request-level throttles. Now an attacker who happens to grab a single 0-RTT data request from an unsecured WiFi network can turn this into a broad denial-of-service attack that may lock the caller out in all locations.
At the core of this problem is that the proposed change with TLS1.3 violates the established layering boundaries of applications and transport protocols, and violates the principle of least surprise. This is somewhat ironic, as TLS itself benefits from important guarantees from its underlying protocol, TCP. For example the Lucky13 vulnerability was practical against DTLS, but not TLS. This is because UDP-based DTLS tolerates a certain amount of replays, while TLS does not, due to the reliable-transmission guarantees provided by TCP. Suppose that the TCP protocol WG were to decide that TCP would sometimes no longer provide reliable transmission, and that data may be missing or duplicated in a stream; would we be happy with that as TLS maintainers?
Suggested changes to TLS1.3 draft
Require implementations to robustly prevent Ticket re-use for 0-RTT
TLS1.3 should require that TLS implementations handling 0-RTT "MUST" provide a mechanism to prevent duplicate tickets from being used for 0-RTT data. Implementations could use strike-registers, single-use caches, or other mechanisms such as puncturable encryption, to achieve this effect, rejecting 0-RTT sections when uncertain of replay.
While this does leave open the small window of Gillmor-style attacks, these attacks are different in magnitude, consequence, and can be handled reasonably by clients in a manner that existing clients are used to.
Additionally, if TLS implementations are to provide replay protection as a built-in property, it is simpler for applications to expose all TLS plaintext data as a single stream. This appears to be what applications are doing anyway.
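The single-use requirement suggested in this section could look roughly like the sketch below. `StrikeRegister`, its `healthy` flag and the ticket identifiers are hypothetical, and the sketch covers a single server only; servers sharing a STEK would need to share or partition this state.

```python
class StrikeRegister:
    """Accept each ticket for 0-RTT at most once; reject when uncertain."""

    def __init__(self):
        self.seen = set()
        self.healthy = True   # set to False whenever state may have been lost

    def accept_early_data(self, ticket_id):
        """Return True only if this ticket has provably not been used before."""
        if not self.healthy:
            return False      # uncertain of replay: require a full handshake
        if ticket_id in self.seen:
            return False      # duplicate ticket: refuse the 0-RTT section
        self.seen.add(ticket_id)
        return True


register = StrikeRegister()
print(register.accept_early_data(b"ticket-1"))   # first use of the ticket
print(register.accept_early_data(b"ticket-1"))   # replayed ticket
```

Refusing early data is always safe here, because the client can fall back to sending the same request as ordinary non-0-RTT data.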
Partial mitigation for Gillmor attacks: deliberately duplicate 0-RTT data
If 0-RTT data and regular data are to remain separate streams, then another way to address Gillmor attacks is to intentionally duplicate 0-RTT sections. If 0-RTT sections are to be replayable, it is better that they should be replayed as an ordinary event. TLS implementations should occasionally intentionally duplicate zero-RTT data towards the application. This helps "inoculate" applications against idempotency bugs, triggering them early in a controlled way, before attackers do in an uncontrolled way.
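One way to picture deliberate duplication is the sketch below, in which a hypothetical delivery hook hands early data to the application handler and, with some small probability, hands it over a second time. The function name, the `duplicate_rate` parameter and its default value are assumptions for illustration only.

```python
import random


def deliver_early_data(handler, early_data, duplicate_rate=0.01, rng=random):
    """Deliver early_data to handler, occasionally delivering it twice.

    Returns the list of handler results (one entry, or two when duplicated).
    """
    results = [handler(early_data)]
    if rng.random() < duplicate_rate:
        results.append(handler(early_data))   # deliberate, controlled replay
    return results


received = []
deliver_early_data(received.append, b"GET /cart", duplicate_rate=1.0)
print(len(received))   # the handler saw the same early data twice
```

An application that misbehaves under this controlled duplication has an idempotency bug that an attacker could otherwise trigger at a time of their choosing.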
Require TLS proxies to operate 0-RTT transitively
Some, though not all, of the attacks outlined can be lessened by passing full knowledge of 0-RTT state end-to-end to applications. For example, a CDN or TLS accelerator could accept a 0-RTT data request only if the origin also supports 0-RTT. It could then match byte-for-byte the plaintext of the incoming 0-RTT section with an outgoing 0-RTT section. Rather than emulating a single stream, this would allow end applications to reason more precisely about exactly which data was originally replayable.
Conclusion
TLS 1.3 0-RTT is not secure by default, but it is possible to provide both anti-replay and forward-secrecy properties for 0-RTT data with workarounds. As long as TLS 1.3 is not secure by default it is likely to lead to exploitable vulnerabilities that can only be fixed at the application level, distant from the cause. In general, it is also very challenging to fix applications to be idempotent and side-effect free.
Instead of shifting the problem to applications, we should strongly consider modifying the TLS 1.3 draft to make TLS 0-RTT secure by default, at least against replays. While Gillmor-style retry attacks will persist, these attacks may be mitigated with reasonable client behavior, and in many cases the existing client behavior is already fault tolerant.
Lastly, to end on a positive note: in general, TLS1.3 is still a welcome and vast simplification over prior versions of TLS and improves the security posture of TLS generally, including much better forward-secrecy for all non-0-RTT data.