Working on the PROTOCOL documentation

gabrieledarrigo · Aug 31, 2020 · 31b0520
1 parent c56051a
commit 31b0520
Showing 1 changed file with 94 additions and 27 deletions.
diff --git a/README.md b/README.md
@@ -1,68 +1,123 @@
 # Ducky
-A quack quacky UDP cache server 🦆
+A quack quacky UDP cache server 🦆 developed for the Networking course of ["Sicurezza dei Sistemi e Delle Reti Informatiche"](http://sicurezzaonline.di.unimi.it/) bachelor's degree program.
 
 ![C/C++ CI](https://github.com/gabrieledarrigo/ducky/workflows/C/C++%20CI/badge.svg?branch=master)
 
-### Protocol
+## Rationale
 
-Clients of ducky communicate with the server through **UDP**.  
-Ducky listens on port **20017** for incoming messages; since ducky doesn't support TCP connection, clients simply open a UDP socket 
+Ducky is a network memory cache server.  
+It stores unstructured data sent from a client in memory, ready to be served when a client asks for it using a unique key.  
+Ducky's principal purpose is to reduce the networking and computation load on a server (for example, an API, a Web application, or a database): clients can cache frequently accessed data in Ducky, increasing the overall performance of the system.  
+Ducky is loosely inspired by projects like [Memcached](https://github.com/memcached/memcached) and [Redis](https://github.com/redis/redis).  
+While its approach to the "key-value" server implementation is naive, it worked well as a demonstrative and learning project on how to build a simple networking server.
+
+## Protocol
+
+Clients of Ducky communicate with the server through `UDP`.    
+Ducky listens on port `20017` for incoming messages; since Ducky doesn't support TCP connections, clients simply open a UDP socket 
 with the given port and send the commands within a datagram.  
 Ducky focuses on speed and bandwidth saving, avoiding the latency overhead of a classic TCP connection.  
 
-Data sent in ducky protocol (both requests and respondes) is in ASCII.  
+Data sent in the Ducky protocol (both requests and responses) is in `ASCII`.  
 Each message corresponds to a command that a client sends to the server or a response from the server;  
-a message is made up of the name of the command, optional command parameters, and the structured data that clients want to store or retrieve from ducky.  
-A message is always terminated by a `\n` characters that determines where the data blocks end.  
+a message is made up of the name of the command, optional command parameters, and the unstructured data that clients want to store in or retrieve from Ducky.  
+A message is always terminated by a `\n` character that determines where the data block ends.  
 Each response from the server has a status code that specifies the result of the command:  
-2xx for a successful operation, 5xx for an errored operation.  
+`2xx` for a successful operation, `5xx` for a failed one.  
+Ducky supports only two commands, `GET` and `SET`.  
 One client/server session has the following lifecycle:  
 
-1. The server listens on port 20017.
-2. A client opens a connection and sends a UDP datagram to the server.
+1. The server listens on port `20017`.
+2. A client opens a UDP socket and sends a datagram to the server.
 3. When the server receives the datagram, it tries to parse it into a known command.
 4. If the command is not recognized, the server responds with a proper status code/error message.
-5. If the command is recognized as a GET the server tries to return the request data to the client.
-    - If the operation is successful the server returns the data along with success status code. 
-    - Otherwise, it responds with a proper status code/error message.
-6. If the command is recognized as a SET the server tries to persist the data into the memory.
-    - If the GET operation is successful the server returns a response with a success status code,
+5. If the command is recognized as a `GET`, the server tries to return the requested data to the client.
+    - If the operation is successful, the server returns the data along with a success status code.
+    - Otherwise, it responds with a proper status code/error message.
+6. If the command is recognized as a `SET`, the server tries to store the data in memory.
+    - If the `SET` operation is successful, the server returns a response with a success status code.
     - Otherwise, it responds with a proper status code/error message.
+
+![Diagram](https://user-images.githubusercontent.com/1985555/91744013-14e12d00-ebb9-11ea-8300-7eb5b7331949.jpg)
+
+## Network I/O
+
+Designers of networking software can choose among various strategies for handling client connections and the input/output to and from them.  
+Notably, these strategies are:  
 
+- Fork multiple _child processes_, one per client, to achieve concurrency, and serve multiple requests.
+- Spawn multiple _threads_, one per client, to achieve concurrency, and serve multiple requests.
+- Use asynchronous, non-blocking I/O, using [kqueue(2)](https://man.openbsd.org/kqueue) or [epoll(7)](https://man7.org/linux/man-pages/man7/epoll.7.html).
+- Use synchronous I/O multiplexing, using [select(2)](https://man7.org/linux/man-pages/man2/select.2.html) or [poll(2)](https://man7.org/linux/man-pages/man2/poll.2.html).
 
-#### Keys and memory structure
+Even well-known, battle-tested, production-ready servers use different approaches.  
+For example, [Apache HTTP Server](https://httpd.apache.org/) can handle concurrency either by forking child processes (via the [prefork](https://httpd.apache.org/docs/2.4/mod/prefork.html) module) or by spawning multiple threads (via the [worker](https://httpd.apache.org/docs/2.4/mod/worker.html) module), one per incoming connection.  
+Other servers or backends use the third approach, basing their concurrency model on asynchronous, non-blocking I/O and using an [event loop](https://en.wikipedia.org/wiki/Event_loop) to handle requests from clients.   
+Usually, an event loop is implemented using a library or a framework that abstracts away the underlying kernel system calls (kqueue(2), epoll(7), event completions) and offers high-level APIs to handle events on file descriptors.  
+For example, Memcached uses [libevent](https://libevent.org/), an event notification library, to implement its event loop, 
+while Node.js, the famous JavaScript runtime built on Chrome's V8 JavaScript engine, uses [libuv](https://github.com/libuv/libuv).  
 
-Data stored in ducky is identified with the help of a key.  
-A key is a string which uniquely identifies the data for clients that wants to store and retrieve it.  
-Internally ducky uses a Hash Table data structure. It offers a O(1) algorithmic complexity for both store and retrieve data within a given key.
-The hash table implementation...@TBD
+While these solutions are far more efficient, well tested, and probably more elegant, I decided not to use any framework; Ducky uses the traditional select(2) system call to be notified when a file descriptor (a client connection) is ready for reading.  
+select(2) performs poorly in comparison to the other strategies just illustrated; its cost grows linearly, so the more file descriptors select(2) is required to handle, the slower the system gets.  
+Depending on the hardware, Ducky (like any application that uses select(2)) can reach a few hundred open file descriptors before merely waiting for file descriptor activity becomes a bottleneck.  
+Nevertheless, select(2) is highly portable (it's implemented almost everywhere) and served well in designing a simple memory cache server like Ducky.
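Note: the readiness check described above can be sketched as follows. This is a minimal illustration, not Ducky's actual code: the helper name `wait_any_readable` and its return convention are hypothetical. It shows why the cost is linear: the descriptor set is rebuilt before every call, and after select(2) returns the caller scans the set again to find which descriptor is ready.

```c
#include <assert.h>
#include <stddef.h>
#include <sys/select.h>
#include <sys/time.h>
#include <unistd.h>

/* Hypothetical helper (not taken from Ducky's sources).
 * Waits until one of `fds` is readable or `timeout_ms` elapses.
 * Returns the index of the first readable descriptor,
 * -1 on timeout, -2 on error. */
int wait_any_readable(const int *fds, int count, int timeout_ms) {
    fd_set readfds;
    struct timeval tv;
    int maxfd = -1;

    assert(fds != NULL && count >= 0 && timeout_ms >= 0);

    /* The set must be rebuilt on every call: select(2) overwrites it. */
    FD_ZERO(&readfds);
    for (int i = 0; i < count; i++) {
        if (fds[i] < 0 || fds[i] >= FD_SETSIZE) return -2;
        FD_SET(fds[i], &readfds);
        if (fds[i] > maxfd) maxfd = fds[i];
    }

    tv.tv_sec = timeout_ms / 1000;
    tv.tv_usec = (timeout_ms % 1000) * 1000;

    int ready = select(maxfd + 1, &readfds, NULL, NULL, &tv);
    if (ready < 0) return -2;   /* includes EINTR; a real server would retry */
    if (ready == 0) return -1;  /* timeout, nothing to read */

    /* select(2) reports only how many descriptors are ready,
     * so the caller scans the whole set to find them. */
    for (int i = 0; i < count; i++) {
        if (FD_ISSET(fds[i], &readfds)) return i;
    }
    return -1;
}
```

A server loop would call this helper repeatedly and then read from the descriptor at the returned index, for example with recvfrom(2) on a UDP socket.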
 
+## Keys and memory structure
+
+Data stored in Ducky is identified with the help of a key.  
+A key is a string that uniquely identifies the data for clients that want to store and retrieve it.  
 The maximum length limit of a key is **100** characters.  
-The maximum size for data to be stored is **1MB**.   
+The maximum size for a single item to be stored is **1MB**.   
+
+Internally, Ducky uses a [hash table](http://staff.ustc.edu.cn/~csli/graduate/algorithms/book6/chap12.htm) data structure.  
+It offers O(1) average-case complexity for both storing and retrieving data by a given key.  
+There are several ways to implement a hash table, which differ especially in how they handle collisions.  
+Ducky uses an _open-addressed_, _double-hashed_ hash table:  
+instead of using a linked list for each bucket of the hash table, an _open-addressed_ implementation stores the element in the hash table itself.  
+That is, each table bucket contains either an element of the dynamic set or `NULL`;  
+when searching for an item with a given key, the hash table is systematically examined, until the desired item is found in one of its buckets or it is clear that it is not in the table. 
+There are no lists or elements stored outside the table, as there are in chaining. 
+The index of the bucket in which an element is inserted is computed by a double hash function of the following form:
 
-#### Commands
+```
+h(k, i) = (h1(k) + i * h2(k)) mod m
+```
 
-Ducky supports only two commands:  
+where _h1_ and _h2_ are two hash functions, each of which:
+
+- Takes the string _k_ as an input.
+- Converts it into a large integer number.
+- Reduces the size of the integer to a fixed range, by taking its remainder mod m, where m is the number of buckets of the hash table.
+- Returns the reduced integer.
+
+When a collision happens, the collided item is placed in some other bucket in the hash table, depending on the result of the double hash function.  
+Note that the index of the new position depends on _i_, the number of collisions encountered so far.  
+While an open-addressed hash table can fill up, Ducky's implementation resizes itself when the load of the data structure rises above 70%.
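Note: the probing scheme and the 70% threshold described above can be sketched as follows. All names (`hash_str`, `probe_index`, `needs_resize`) and the prime bases 151 and 163 are assumptions made for illustration, not Ducky's actual implementation. One deliberate deviation from the formula as written: the sketch computes the second hash as `1 + (hash mod (m - 1))`, so the step is never zero and a probe sequence always advances; when `m` is prime, the sequence then visits every bucket.

```c
#include <assert.h>
#include <stddef.h>

/* Polynomial string hash reduced to the range [0, m).
 * The base `a` is an arbitrary prime chosen for this sketch. */
static size_t hash_str(const char *k, size_t a, size_t m) {
    size_t h = 0;
    for (; *k != '\0'; k++) {
        h = (h * a + (unsigned char)*k) % m;
    }
    return h;
}

/* h(k, i) = (h1(k) + i * h2(k)) mod m, where i is the number of
 * collisions so far and m is the number of buckets. */
size_t probe_index(const char *k, size_t i, size_t m) {
    assert(k != NULL && m > 1);
    size_t h1 = hash_str(k, 151, m);
    size_t h2 = 1 + hash_str(k, 163, m - 1); /* in [1, m - 1], never 0 */
    return (h1 + i * h2) % m;
}

/* True when the load (items / buckets) is strictly above 70%. */
int needs_resize(size_t items, size_t buckets) {
    return items * 100 > buckets * 70;
}
```

An insert would try `probe_index(key, 0, m)`, then `probe_index(key, 1, m)`, and so on until it finds an empty bucket, and would grow the table first whenever `needs_resize` reports true.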
+
+## Commands
+
+As stated above, Ducky supports only two commands:  
 SET to store some unstructured data identified by a key and GET to retrieve some data corresponding
 to a specific key.  
 The syntax is the following:  
 
-##### SET
+#### SET
 
 ```
 SET key data
 ```
 
-##### GET
+#### GET
 
 ```
 GET key
 ```
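Note: a client for the two commands above might look like the following sketch. The helper names (`ducky_format_set`, `ducky_format_get`, `ducky_request`) and the two-second receive timeout are assumptions made for illustration; they are not part of Ducky. Only the port `20017`, the 100-character key limit, the `\n` terminator, and the `SET key data` / `GET key` layout come from the README. Whether `data` may itself contain spaces is not specified here, so the sketch does not escape it.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>
#include <sys/types.h>
#include <sys/socket.h>
#include <sys/time.h>
#include <netinet/in.h>
#include <arpa/inet.h>
#include <unistd.h>

#define DUCKY_PORT 20017
#define DUCKY_MAX_KEY 100

/* Writes "SET key data\n" into buf. Returns the message length,
 * or -1 if the key is too long or buf is too small. */
int ducky_format_set(char *buf, size_t cap, const char *key, const char *data) {
    assert(buf != NULL && key != NULL && data != NULL);
    if (strlen(key) > DUCKY_MAX_KEY) return -1;
    int n = snprintf(buf, cap, "SET %s %s\n", key, data);
    return (n < 0 || (size_t)n >= cap) ? -1 : n;
}

/* Writes "GET key\n" into buf, with the same return convention. */
int ducky_format_get(char *buf, size_t cap, const char *key) {
    assert(buf != NULL && key != NULL);
    if (strlen(key) > DUCKY_MAX_KEY) return -1;
    int n = snprintf(buf, cap, "GET %s\n", key);
    return (n < 0 || (size_t)n >= cap) ? -1 : n;
}

/* Sends one datagram to host:20017 and waits for the reply.
 * Returns the reply length (reply is NUL-terminated) or -1 on failure.
 * UDP gives no delivery guarantee, hence the receive timeout. */
ssize_t ducky_request(const char *host, const char *msg, size_t len,
                      char *reply, size_t cap) {
    struct sockaddr_in addr;
    struct timeval tv = { 0 };
    ssize_t n = -1;

    if (cap == 0) return -1;
    int fd = socket(AF_INET, SOCK_DGRAM, 0);
    if (fd < 0) return -1;

    tv.tv_sec = 2; /* assumed timeout, pick what suits the client */
    setsockopt(fd, SOL_SOCKET, SO_RCVTIMEO, &tv, sizeof tv);

    memset(&addr, 0, sizeof addr);
    addr.sin_family = AF_INET;
    addr.sin_port = htons(DUCKY_PORT);
    if (inet_pton(AF_INET, host, &addr.sin_addr) == 1 &&
        sendto(fd, msg, len, 0, (struct sockaddr *)&addr, sizeof addr) == (ssize_t)len) {
        n = recvfrom(fd, reply, cap - 1, 0, NULL, NULL);
        if (n >= 0) reply[n] = '\0';
    }
    close(fd);
    return n;
}
```

A caller would format a message with `ducky_format_set(buf, sizeof buf, "user:1", "hello")` and pass the resulting length to `ducky_request("127.0.0.1", buf, len, reply, sizeof reply)`, assuming a Ducky server is running locally.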
 
-#### Response and status codes
+## Response and status codes
 
-Each ducky response is made up of a status code and an optional payload.  
+Each Ducky response is made up of a status code and an optional payload; parsing it is up to the client.  
 Status codes are divided into two families:
+
 - The 2xx status codes indicate that the request has succeeded.
 - The 5xx status codes indicate that the server encountered an error.  
 
@@ -128,3 +183,15 @@ The SET commands haven't attached data.
 ```
 506 ERR_NO_DATA
 ```
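Note: since parsing the response is left to the client, a minimal parser might look like the sketch below. It assumes that every response begins with a three-digit status code, as in `506 ERR_NO_DATA`; the function names are hypothetical and not part of Ducky.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

/* Returns the three-digit status code at the start of resp,
 * or -1 if resp does not begin with three digits. */
int ducky_status(const char *resp) {
    assert(resp != NULL);
    for (int i = 0; i < 3; i++) {
        /* Stops at the first non-digit, including the terminating NUL. */
        if (!isdigit((unsigned char)resp[i])) return -1;
    }
    return (resp[0] - '0') * 100 + (resp[1] - '0') * 10 + (resp[2] - '0');
}

/* 2xx means the request succeeded. */
int ducky_is_success(int status) {
    return status >= 200 && status <= 299;
}

/* 5xx means the server encountered an error. */
int ducky_is_error(int status) {
    return status >= 500 && status <= 599;
}
```

A client would call `ducky_status` on the reply buffer first and only then interpret whatever follows the code as the payload or the error name.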
+
+## References
+
+Developing Ducky was both fun and formative: it was an opportunity to learn how TCP and UDP networking works at the system level, and to gain a better understanding of subjects like concurrency models, blocking and non-blocking I/O, and data structures.  
+It would have been impossible to implement Ducky without some great books, articles, and other materials that I used to dig deeper into the subject:  
+
+- I must start with the great [C10K problem article](http://www.kegel.com/c10k.html) by Dan Kegel, full of concepts and references. It's a must-read.
+- The great [Poll vs Select article](https://daniel.haxx.se/docs/poll-vs-select.html) by Daniel Stenberg
+- The obvious [Cormen reference to the hash table](http://staff.ustc.edu.cn/~csli/graduate/algorithms/book6/chap12.htm) data structure
+- The super cool [C Hash Table implementation](https://github.com/jamesroutley/write-a-hash-table) by James 明良 Routley, which I used as an example to implement the internal memory cache of Ducky
+- A useful book on network programming: [Hands-On Network Programming with C](https://www.packtpub.com/networking-and-servers/hands-network-programming-c)
+- The essential [Beej's Guide to Network Programming](http://beej.us/guide/bgnet/), really worth a read!