fast_io: A modified version of uasyncio

This version is a "drop in" replacement for official uasyncio. Existing applications should run under it unchanged and with essentially identical performance.

This version has the following features:

  • I/O can optionally be handled at a higher priority than other coroutines PR287.
  • Tasks can yield with low priority, running when nothing else is pending.
  • Callbacks can similarly be scheduled with low priority.
  • A bug whereby bidirectional devices such as UARTs can fail to handle concurrent input and output is fixed.
  • It is compatible with rtc_time.py for micro-power applications documented here.
  • An assertion failure is produced if create_task or run_until_complete is called with a generator function PR292. This traps a common coding error which otherwise results in silent failure.
  • The presence of the fast_io version can be tested at runtime.
  • The presence of an event loop instance can be tested at runtime.
  • run_until_complete(coro()) now returns the value returned by coro(), as in CPython (micropython-lib PR270).

Note that priority device drivers are written by using the officially supported technique for writing stream I/O drivers. Code using such drivers will run unchanged under the fast_io version. Using the fast I/O mechanism requires adding just one line of code. This implies that if official uasyncio acquires a means of prioritising I/O other than that in this version, application code changes should be minimal.

The high priority mechanism formerly provided in asyncio_priority.py was a workaround based on the expectation that stream I/O written in Python would remain unsupported. Such support is now available, so asyncio_priority.py is obsolete and should be deleted from your system. The facility for low priority coros formerly provided by asyncio_priority.py is now implemented here.

Contents

  1. Installation
    1.1 Benchmarks Benchmark and demo programs.
  2. Rationale
    2.1 Latency
      2.1.1 I/O latency
    2.2 Timing accuracy
    2.3 Polling in uasyncio
  3. The modified version
    3.1 Fast IO
    3.2 Low Priority
    3.3 Other Features
    3.4 Low priority yield
      3.4.1 Task Cancellation and Timeouts
    3.5 Low priority callbacks
  4. ESP Platforms
  5. Background
  6. Performance

1. Installation

Install and test uasyncio on the target hardware. Replace core.py and __init__.py with the files in the fast_io directory.

In MicroPython 1.9 uasyncio was implemented as a frozen module on the ESP8266. To install this version it is necessary to build the firmware with the above two files implemented as frozen bytecode. See ESP Platforms for general comments on the suitability of ESP platforms for systems requiring fast response.

It is possible to load modules in the filesystem in preference to frozen ones by modifying sys.path. However the ESP8266 probably has too little RAM for this to be useful.
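
The sys.path mechanism can be sketched as follows; the path '/lib' is purely hypothetical and stands in for wherever the filesystem copies of core.py and __init__.py live on the target.

```python
import sys

# Hypothetical example: '/lib' stands in for the directory holding the
# filesystem copies of core.py and __init__.py on the target board.
# Path entries are searched in order, so prepending makes a filesystem
# module shadow a frozen module of the same name.
sys.path.insert(0, '/lib')
print(sys.path[0])  # '/lib'
```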

1.1 Benchmarks

The benchmarks directory contains files demonstrating the performance gains offered by prioritisation. They also offer illustrations of the use of these features. Documentation is in the code.

  • benchmarks/latency.py Shows the effect on latency with and without low priority usage.
  • benchmarks/rate.py Shows the frequency with which uasyncio schedules minimal coroutines (coros).
  • benchmarks/rate_esp.py As above for ESP32 and ESP8266.
  • benchmarks/rate_fastio.py Measures the rate at which coros can be scheduled if the fast I/O mechanism is used but no I/O is pending.
  • benchmarks/call_lp.py Demos low priority callbacks.
  • benchmarks/overdue.py Demo of maximum overdue feature.
  • benchmarks/priority_test.py Cancellation of low priority coros.
  • fast_io/ms_timer.py Provides higher precision timing than sleep_ms().
  • fast_io/ms_timer_test.py Test/demo program for above.
  • fast_io/pin_cb.py Demo of an I/O device driver which causes a pin state change to trigger a callback.
  • fast_io/pin_cb_test.py Demo of above.

With the exceptions of call_lp, priority_test and rate_fastio, these benchmarks can be run against both the official and fast_io versions of uasyncio.

2. Rationale

MicroPython firmware now enables device drivers for stream devices to be written in Python, via uio.IOBase. This mechanism can be applied to any situation where a piece of hardware or an asynchronously set flag needs to be polled. Such polling is efficient because it is handled in C using select.poll, and because the coroutine accessing the device is descheduled until polling succeeds.

Unfortunately official uasyncio polls I/O with a relatively high degree of latency.

Applications may need to poll a hardware device or a flag set by an interrupt service routine (ISR). An overrun may occur if the scheduling of the polling coroutine (coro) is subject to excessive latency. Fast devices with interrupt driven drivers (such as the UART) need to buffer incoming data during any latency period. Lower latency reduces the buffer size requirement.

Further, a coro issuing await asyncio.sleep_ms(t) may block for much longer than t depending on the number and design of other coros which are pending execution. Delays can easily exceed the nominal value by an order of magnitude.

This variant mitigates this by providing a means of scheduling I/O at a higher priority than other coros: if an I/O queue is specified, I/O devices are polled on every iteration of the scheduler. This enables faster response to real time events and also enables higher precision millisecond-level delays to be realised.

The variant also enables coros to yield control in a way which prevents them from competing with coros which are ready for execution. Coros which have yielded in a low priority fashion will not be scheduled until all "normal" coros are waiting on a nonzero timeout. The benchmarks show that the improvement in the accuracy of time delays can exceed two orders of magnitude.

2.1 Latency

Coroutines in uasyncio which are pending execution are scheduled in a "fair" round-robin fashion. Consider these functions:

```python
async def foo():
    while True:
        yield
        # code which takes 4ms to complete

async def handle_isr():
    global isr_has_run
    while True:
        if isr_has_run:
            # read and process data
            isr_has_run = False
        yield
```

Assume a hardware interrupt handler sets the isr_has_run flag, and that we have ten instances of foo() and one instance of handle_isr(). When handle_isr() issues yield, its execution will pause for 40ms while each instance of foo() is scheduled and performs one iteration. This may be unacceptable: it may be necessary to poll and respond to the flag at a rate sufficient to avoid overruns.
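
The 40ms figure follows directly from the scheduling model; a minimal sketch, assuming each foo() iteration consumes 4ms of CPU:

```python
# Toy model of "fair" round-robin latency (assumption: each competing
# coro consumes work_ms of CPU per scheduler pass).
def round_robin_latency(n_competing, work_ms):
    # After handle_isr() yields, every competing coro runs once
    # before handle_isr() is scheduled again.
    return n_competing * work_ms

print(round_robin_latency(10, 4))  # 40 ms, as in the text
```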

In this version handle_isr() would be rewritten as a stream device driver which could be expected to run with latency of just over 4ms.

Alternatively this latency may be reduced by enabling the foo() instances to yield in a low priority manner. In the case where all coros other than handle_isr() are low priority, the latency is reduced to 300μs, about double the inherent latency of uasyncio.

The benchmark latency.py demonstrates this. Documentation is in the code; it can be run against both official and priority versions. This measures scheduler latency. Maximum application latency, measured relative to the incidence of an asynchronous event, will be 300μs plus the worst-case delay between yields of any one competing task.

2.1.1 I/O latency

The official version of uasyncio has even higher levels of latency for I/O scheduling. In the above case of ten coros using 4ms of CPU time between zero delay yields, the latency of an I/O driver would be 80ms.
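
The 80ms figure can be modelled on the assumption that the official scheduler services I/O only once per two complete passes of the run queue; this model is an illustration, not a description of the scheduler's internals:

```python
# Model of official uasyncio's I/O latency (assumption: I/O is polled
# only once per `passes_per_poll` complete traversals of the run queue,
# which reproduces the 80 ms figure quoted above).
def io_latency_official(n_tasks, work_ms, passes_per_poll=2):
    return passes_per_poll * n_tasks * work_ms

print(io_latency_official(10, 4))  # 80 ms
```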

2.2 Timing accuracy

Consider these functions:

```python
async def foo():
    while True:
        await asyncio.sleep(0)
        # code which takes 4ms to complete

async def fast():
    while True:
        # Code omitted
        await asyncio.sleep_ms(15)
        # Code omitted
```

Again assume ten instances of foo() and one of fast(). When fast() issues await asyncio.sleep_ms(15) it will not see a 15ms delay. During the 15ms period foo() instances will be scheduled. When the delay elapses, fast() will compete with pending foo() instances.

This results in variable delays up to 55ms (10 tasks * 4ms + 15ms). A MillisecTimer class is provided which uses stream I/O to achieve a relatively high precision delay:

```python
async def timer_test(n):
    timer = ms_timer.MillisecTimer()
    while True:
        await timer(30)  # More precise timing
        # Code
```

The test program fast_io/ms_timer_test.py runs three instances of a coro with a nominal 30ms timer delay, competing with ten coros each of which hogs the CPU for 10ms between zero-delay yields. Under normal scheduling the 30ms delay actually takes around 300ms. With fast I/O it is 30-34ms.
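
The 55ms worst case quoted above follows directly from the task counts:

```python
# Arithmetic behind the worst-case delay for fast()'s sleep_ms(15):
# on waking, fast() may wait behind every competing task once.
def worst_case_delay(n_tasks, work_ms, sleep_ms):
    return n_tasks * work_ms + sleep_ms

print(worst_case_delay(10, 4, 15))  # 55 ms
```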

2.3 Polling in uasyncio

The asyncio library provides various mechanisms for polling a device or flag. Aside from a polling loop these include awaitable classes and asynchronous iterators. If an awaitable class's __iter__() method simply returns the state of a piece of hardware, there is no performance gain over a simple polling loop.

This is because uasyncio schedules tasks which yield with a zero delay, together with tasks which have become ready to run, in a "fair" round-robin fashion. This means that a task waiting on a zero delay will be rescheduled only after the scheduling of all other such tasks (including timed waits whose time has elapsed).
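
The scheduling order can be illustrated with a toy model; this is a deliberately simplified sketch, not the actual uasyncio implementation:

```python
from collections import deque

# Toy round-robin scheduler (assumption: simplified model). A task
# yielding with zero delay goes to the back of the run queue, so it
# runs again only after every other ready task has had a turn.
def run(tasks, steps):
    q = deque(tasks)
    order = []
    for _ in range(steps):
        name, gen = q.popleft()
        next(gen)              # run one iteration of the task
        order.append(name)
        q.append((name, gen))  # requeue at the back: "fair" scheduling
    return order

def task():
    while True:
        yield

tasks = [(name, task()) for name in ('poller', 'a', 'b', 'c')]
print(run(tasks, 8))
# ['poller', 'a', 'b', 'c', 'poller', 'a', 'b', 'c']
```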

The fast_io version enables awaitable classes and asynchronous iterators to run with lower latency by designing them to use the stream I/O mechanism. The program fast_io/ms_timer.py provides an example.

Practical cases exist where the foo() tasks are not time-critical: in such cases the performance of time critical tasks may be enhanced by enabling foo() to submit for rescheduling in a way which does not compete with tasks requiring a fast response. In essence "slow" operations tolerate longer latency and longer time delays so that fast operations meet their performance targets. Examples are:

  • User interface code. A system with ten pushbuttons might have a coro running on each. A GUI touch detector coro needs to check a touch against a sequence of objects. Both may tolerate 100ms of latency before users notice any lag.
  • Networking code: a latency of 100ms may be dwarfed by that of the network.
  • Mathematical code: there are cases where time consuming calculations may take place which are tolerant of delays. Examples are statistical analysis, sensor fusion and astronomical calculations.
  • Data logging.

3. The modified version

The fast_io version adds ioq_len=0 and lp_len=0 arguments to get_event_loop. These determine the lengths of the I/O and low priority queues. The zero defaults mean the queues are not instantiated, in which case the scheduler operates as per the official version. If an I/O queue length > 0 is provided, I/O performed by StreamReader and StreamWriter objects is prioritised over other coros. If a low priority queue length > 0 is specified, tasks may optionally yield in such a way as to minimise their competition with other tasks.

Arguments to get_event_loop():

  1. runq_len=16 Length of normal queue. Default 16 tasks.
  2. waitq_len=16 Length of wait queue.
  3. ioq_len=0 Length of I/O queue. Default: no queue is created.
  4. lp_len=0 Length of low priority queue. Default: no queue.

3.1 Fast IO

Device drivers which are to be capable of running at high priority should be written to use stream I/O: see Writing streaming device drivers.

The fast_io version will schedule I/O whenever the ioctl reports a ready status. This implies that devices which become ready very soon after being serviced can hog execution. This is analogous to the case where an interrupt service routine is called at an excessive frequency.

This behaviour may be desired where short bursts of fast data are handled. Otherwise drivers of such hardware should be designed to avoid hogging, using techniques like buffering or timing.

3.2 Low Priority

The low priority solution is based on the notion of "after" implying a time delay which can be expected to be less precise than the asyncio standard calls. The fast_io version adds the following awaitable instances:

  • after(t) Low priority version of sleep(t).
  • after_ms(t) Low priority version of sleep_ms(t).

It adds the following event loop methods:

  • loop.call_after(t, callback, *args)
  • loop.call_after_ms(t, callback, *args)
  • loop.max_overdue_ms(t=None) This sets the maximum time a low priority task will wait before being scheduled. A value of 0 corresponds to no limit. The default arg None leaves the period unchanged. Always returns the period value. If there is no limit and a competing task runs a loop with a zero delay yield, the low priority yield will be postponed indefinitely.

See Low priority callbacks.

3.3 Other Features

Variable:

  • version Contains 'fast_io'. Enables the presence of this version to be determined at runtime.
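
A runtime feature test might look like this; the stand-in module object is purely illustrative (official uasyncio has no version attribute):

```python
import types

# Runtime check for the fast_io variant, assuming the `version`
# attribute described above.
def is_fast_io(asyncio_module):
    return getattr(asyncio_module, 'version', None) == 'fast_io'

# Illustrative stand-in for the real module:
fake = types.SimpleNamespace(version='fast_io')
print(is_fast_io(fake), is_fast_io(types))  # True False
```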

Function:

  • got_event_loop() No arg. Returns a bool: True if the event loop has been instantiated. Enables code using the event loop to raise an exception if the event loop was not instantiated:
```python
class Foo():
    def __init__(self):
        if asyncio.got_event_loop():
            loop = asyncio.get_event_loop()
            loop.create_task(self._run())
        else:
            raise OSError('Foo class requires an event loop instance')
```

This avoids subtle errors:

```python
import uasyncio as asyncio
bar = Bar()  # Constructor calls get_event_loop()
# and renders these args inoperative
loop = asyncio.get_event_loop(runq_len=40, waitq_len=40)
```

3.4 Low priority yield

Consider this code fragment:

```python
import uasyncio as asyncio
loop = asyncio.get_event_loop(lp_len=16)

async def foo():
    while True:
        # Do something
        await asyncio.after(1.5)  # Wait a minimum of 1.5s
        # code
        await asyncio.after_ms(20)  # Wait a minimum of 20ms
```

These await statements cause the coro to suspend execution for the minimum time specified. Low priority coros run in a mutually "fair" round-robin fashion. By default the coro will only be rescheduled when all "normal" coros are waiting on a nonzero time delay. A "normal" coro is one that has yielded by any other means.

This behaviour can be overridden to limit the degree to which they can become overdue. For the reasoning behind this consider this code:

```python
import uasyncio as asyncio
loop = asyncio.get_event_loop(lp_len=16)

async def foo():
    while True:
        # Do something
        await asyncio.after(0)
```

By default a coro yielding in this way will be re-scheduled only when there are no "normal" coros ready for execution i.e. when all are waiting on a nonzero delay. The implication of having this degree of control is that if a coro issues:

```python
while True:
    await asyncio.sleep(0)
    # Do something which does not yield to the scheduler
```

low priority tasks will never be executed. Normal coros must sometimes wait on a non-zero delay to enable the low priority ones to be scheduled. This is analogous to running an infinite loop without yielding.

This behaviour can be modified by issuing:

```python
loop = asyncio.get_event_loop(lp_len=16)
loop.max_overdue_ms(1000)
```

In this instance a task which has yielded in a low priority manner will be rescheduled, even in the presence of pending "normal" tasks, once it has become overdue by more than 1s.
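
The interaction between zero-delay "normal" yields and the overdue limit can be sketched with a toy model; this is a simplified illustration, not the scheduler's actual implementation:

```python
# Toy model of max_overdue_ms (assumption: simplified). A normal task
# yields with zero delay on every pass; the low priority task is
# promoted once its wait reaches the overdue limit. A limit of 0 means
# no limit, so the low priority task starves.
def schedule(steps, pass_ms, max_overdue_ms):
    order, waited = [], 0
    for _ in range(steps):
        if max_overdue_ms and waited >= max_overdue_ms:
            order.append('lowpri')
            waited = 0
        else:
            order.append('normal')
            waited += pass_ms
    return order

print(schedule(6, 250, 1000))
# ['normal', 'normal', 'normal', 'normal', 'lowpri', 'normal']
print(schedule(4, 250, 0))  # no limit: lowpri never runs
```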

3.4.1 Task Cancellation and Timeouts

Tasks which yield in a low priority manner may be subject to timeouts or be cancelled in the same way as normal tasks. See Task cancellation and Coroutines with timeouts.

3.5 Low priority callbacks

The following EventLoop methods enable callback functions to be scheduled to run when all normal coros are waiting on a delay or when max_overdue_ms has elapsed:

call_after(delay, callback, *args) Schedule a callback with low priority. Positional args:

  1. delay Minimum delay in seconds. May be a float or integer.
  2. callback The callback to run.
  3. *args Optional comma-separated positional args for the callback.

The delay specifies a minimum period before the callback will run and may have a value of 0. The period may be extended depending on other high and low priority tasks which are pending execution.

A simple demo of this is benchmarks/call_lp.py. Documentation is in the code.

call_after_ms(delay, callback, *args) Call with low priority. Positional args:

  1. delay Integer. Minimum delay in millisecs before callback runs.
  2. callback The callback to run.
  3. *args Optional positional args for the callback.

4. ESP Platforms

It should be noted that the response of the ESP8266 to hardware interrupts is remarkably slow. This also appears to apply to ESP32 platforms. Consider whether a response in the high hundreds of μs meets project requirements; also whether a priority mechanism is needed on hardware with such poor realtime performance.

5. Background

This has been discussed in detail in issue 2989.

A further discussion on the subject of using the ioread mechanism to achieve fast scheduling took place in issue 2664.

Support was finally added here.

6. Performance

This version is designed to enable existing applications to run without change to code and to minimise the effect on raw scheduler performance in the case where the added functionality is unused.

The benchmark rate.py measures the rate at which tasks can be scheduled. It was run (on a Pyboard V1.1) under official uasyncio V2, then under this version. The benchmark rate_fastio is identical except it instantiates an I/O queue and a low priority queue. Results were as follows.

| Script      | uasyncio version | Period (100 coros) | Overhead |
| ----------- | ---------------- | ------------------ | -------- |
| rate        | Official V2      | 156μs              | 0%       |
| rate        | fast_io          | 162μs              | 3.4%     |
| rate_fastio | fast_io          | 206μs              | 32%      |

If an I/O queue is instantiated I/O is polled on every scheduler iteration (that is its purpose). Consequently there is a significant overhead. In practice the overhead will increase with the number of I/O devices being polled and will be determined by the efficiency of their ioctl methods.