* Two questions
  1. Designing a general-purpose interface between datacenter apps and programmable hardware
  2. Using hardware features to better schedule low-latency datacenter applications

* Intro
*** Datacenter servers have increasing amounts of programmable hardware and hardware acceleration
  - e.g., I/O virtualization, IOMMUs, programmable NICs/flash devices, FPGAs
  - as a result, the hardware keeps changing, and the hardware/software interface keeps changing with it
  - what does this mean for apps?
*** recent work has one-off solutions
  - Arrakis/IX for virtual I/O devices
  - FlexNIC, FlexTCP for programmable NICs
  - ... (find more related work)
*** but no general solution for app programmers
  - if an app programmer wants to make their app future-proof, what do they do?
*** Conclusion: we need new (lib)OS abstractions for datacenter applications
  - so hardware can change underneath apps, and the line between hardware and software can move

* Background: Why won't POSIX work?
  - POSIX aimed to abstract away the differences between devices (everything is a file)
  - not only has the hardware changed since POSIX was designed, the applications are very different too
***** Conclusion: we have new goals
*** datacenter apps need to move a lot of data, not perform computation (even ML apps are limited by moving data between phases)
  - move data from memory to the network (memcached, HTTP servers)
  - move data from SSD to the network (file servers)
***** Some apps don't even look at all of the data
***** many things can be passed off to hardware
  - e.g., offload checksumming to hardware hashing
  - ?
***** Conclusion
  - need to make moving data as cheap as possible
  - starting to look like a middlebox?
*** hardware is faster at moving data than the processor
  - I/O devices can now move data faster than the processor can copy it, so we need a zero-copy interface
  - POSIX fundamentally is not zero-copy: it's built around copying into app memory and then back out to a device (sketched below)
  - we need an interface that can hand pre-allocated buffers to the app and let the app hand a buffer off to another device, potentially without looking at some or any of the buffer
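
For contrast, a minimal sketch of the POSIX data path being criticized here: relaying a file to a socket drags every byte through a user-space buffer, one copy in and one copy out, even when the app never inspects the data. (Plain POSIX calls; the descriptors are placeholders.)

#+BEGIN_SRC c
#include <unistd.h>

/* POSIX forces data through an app buffer: one copy from the
 * kernel into buf on read(), another from buf back into the
 * kernel on write(), even if the app never looks at the bytes. */
ssize_t relay(int file_fd, int sock_fd)
{
    char buf[4096];
    ssize_t n;

    while ((n = read(file_fd, buf, sizeof(buf))) > 0)
        if (write(sock_fd, buf, n) != n)
            return -1;
    return n;   /* 0 on EOF, -1 on read error */
}
#+END_SRC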

* Zero-copy event queues
  - replaces the socket and file-descriptor abstractions
  - open(), accept() return an event queue (id)
  - has a concept of granularity (not just a stream of data)
  - moves data with pointers, not by streaming into a buffer
  - use COW every time a pointer is transferred to another address space, to avoid complex pointer hand-offs
  - use user-level page tables and a directed TLB shoot-down to reduce the cost of setting COW (sketch below)
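
A minimal sketch of the COW-on-transfer step, under loud assumptions: map_readonly() and directed_tlb_shootdown() are hypothetical primitives standing in for the user-level page tables and directed shoot-down above, and the fault-and-copy side is omitted.

#+BEGIN_SRC c
#include <stdint.h>
#include <stddef.h>

/* Hypothetical primitives; both names are assumptions. */
void map_readonly(uintptr_t va, size_t len);   /* clear write permission */
void directed_tlb_shootdown(int core, uintptr_t va, size_t len);

/* Mark a buffer copy-on-write when its pointer moves to another
 * address space: both sides keep read access, and the first write
 * by either side faults and copies the page, so no explicit
 * hand-off protocol is needed. */
void transfer_cow(int owner_core, uintptr_t buf, size_t len)
{
    map_readonly(buf, len);
    /* Only the previous owner's core can have this mapping cached,
     * so shoot down that one TLB instead of broadcasting IPIs. */
    directed_tlb_shootdown(owner_core, buf, len);
}
#+END_SRC
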
***** Interface
  - qid = open(file)
  - qid = listen()
  - qid = accept(qid)
  - insert(qid, scatter-gather array)
  - *sga = head(qid)
  - *sga = dequeue(qid)
  - filter(qid, *filter_func)
  - merge(qid, qid)
  - sort(qid, *sort_func)
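
One possible C rendering of this interface, with a toy echo loop: the sga layout and every signature are assumptions (the notes only name the operations), and the functions are prefixed q_ to avoid clashing with the POSIX calls they replace.

#+BEGIN_SRC c
#include <stddef.h>

/* Scatter-gather array: the unit of transfer. Layout is an
 * assumption; the notes only say data moves as pointers with
 * a notion of granularity. */
typedef struct {
    int num_segs;
    struct { void *buf; size_t len; } segs[8];
} sga_t;

typedef int qid_t;                    /* event-queue id, replaces the fd */

qid_t  q_open(const char *file);
qid_t  q_listen(void);
qid_t  q_accept(qid_t q);
int    q_insert(qid_t q, sga_t *sga);             /* enqueue by pointer */
sga_t *q_head(qid_t q);                           /* peek without removing */
sga_t *q_dequeue(qid_t q);
int    q_filter(qid_t q, int (*keep)(sga_t *));
qid_t  q_merge(qid_t a, qid_t b);
int    q_sort(qid_t q, int (*cmp)(sga_t *, sga_t *));

/* Toy echo server: buffers flow from the NIC queue back to the
 * NIC queue by pointer; the app never copies the payload. */
void echo(void)
{
    qid_t conn = q_accept(q_listen());
    for (;;) {
        sga_t *sga = q_dequeue(conn);   /* pointer to a NIC-filled buffer */
        if (sga)
            q_insert(conn, sga);        /* hand the same buffer back */
    }
}
#+END_SRC
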
* Benefits
*** No copying latency (at least 2K cycles for a 4K page)
*** Less cache pollution
  - only data that the app needs has to be brought into the cache
*** can be implemented in hardware, software, or both
  - even advanced filtering, merging, and sorting can be implemented in hardware easily

* Datacenters have increasingly demanding workloads (low latency, low tail latency)
  - these workloads are driving much of the programmable hardware and hardware acceleration
  - how can we use this hardware for these workloads?
*** Current solution: datacenters do not make good use of cores for these apps (sketched below)
  - context switches are expensive and increase tail latency, so operators pin apps to cores
  - interrupts are expensive and increase tail latency, so apps poll
  - both are terrible for CPU utilization
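
A minimal sketch of that pin-and-poll workaround, assuming a hypothetical poll_once() event handler: the thread claims a whole core and spins, which is exactly the utilization problem noted above.

#+BEGIN_SRC c
#define _GNU_SOURCE
#include <sched.h>

int poll_once(void);   /* hypothetical: process one event if available */

/* Today's workaround: pin the thread to one core and spin-poll,
 * trading a fully burned core for low, predictable latency. */
void pin_and_poll(int core)
{
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(core, &set);
    sched_setaffinity(0, sizeof(set), &set);   /* 0 = calling thread */

    for (;;)
        poll_once();   /* never blocks, never yields the core */
}
#+END_SRC
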
*** Key Observation: datacenter apps are event-based programs, not long-running serial programs
  - interrupt scheduling is ineffective for datacenter workloads when they have natural yield points
  - polling helps, but switching back takes too long, so it only works for low-latency workloads if they are pinned

* Cooperative Event Scheduling
*** Idea (see the sketch after this list)
  - yield between every event to check for higher-priority tasks
  - process the high-priority event with low latency, then go back to lower-priority processes
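
A minimal sketch of such a loop, assuming hypothetical priority queues and handlers; the point is only that every handler boundary is a scheduling opportunity, so a latency-critical event waits at most one handler rather than a full time slice.

#+BEGIN_SRC c
typedef struct event event_t;

/* Hypothetical helpers; both names are assumptions. */
event_t *try_dequeue(int prio);   /* returns NULL if that queue is empty */
void     handle(event_t *ev);

/* Cooperative event scheduling: re-check the high-priority queue
 * after every handler instead of waiting for an interrupt or a
 * timer tick. */
void event_loop(void)
{
    for (;;) {
        event_t *hi = try_dequeue(0);   /* 0 = high priority */
        if (hi) {
            handle(hi);                 /* low-latency path */
            continue;                   /* drain high priority first */
        }
        event_t *lo = try_dequeue(1);   /* 1 = low priority */
        if (lo)
            handle(lo);                 /* background work between checks */
    }
}
#+END_SRC
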
*** Design requirements
  - scheduling decisions must be fast
  - context switches must be cheap
*** Possible implementations
  - move scheduling into hardware based on queues (an IOCPU instead of an IOMMU?)
  - tagged TLBs and partitioned caches kept warm for low-latency apps
  - yielding between events means that old cached data might not be useful for the next event anyway (experiment, sketched below: flush the cache between every libevent/memcached handler and check performance)
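
A minimal sketch of the flush step for that experiment, using the x86 clflush intrinsic; the hook into each libevent/memcached handler is omitted, and a 64-byte cache line is assumed.

#+BEGIN_SRC c
#include <stdint.h>
#include <stddef.h>
#include <immintrin.h>

/* Evict a buffer from all cache levels one 64-byte line at a
 * time; calling this on the handler's working set between events
 * measures how much the warm cache was actually worth. */
void flush_range(const void *buf, size_t len)
{
    const char *p   = (const char *)((uintptr_t)buf & ~(uintptr_t)63);
    const char *end = (const char *)buf + len;

    for (; p < end; p += 64)
        _mm_clflush(p);
    _mm_mfence();   /* order the flushes before the next event runs */
}
#+END_SRC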

* Summary
  - we can't keep changing the hardware without abstractions that buffer apps from those changes
  - we can't effectively schedule low-latency apps without co-design between the app, the OS, and the hardware