finished pages, needs proofreading

ryanreede committed Jan 16, 2017
1 parent cccaff3 commit 195f94c
Showing 17 changed files with 79 additions and 226 deletions.
Binary file added images/irishvr/cam.jpg
Binary file added images/irishvr/oculus.jpg
Binary file added images/moc/bts.jpg
Binary file added images/moc/squad.jpg
Binary file added images/streaming/data.jpg
Binary file added images/streaming/parad.jpg
Binary file added images/streaming/stream.jpg
Binary file added images/yelp/bay.jpg
Binary file added images/yelp/bay2.jpg
2 changes: 1 addition & 1 deletion projects/data/images.html
@@ -127,7 +127,7 @@ <h1>Finding Similar Images with LS-Hash</h1>


<h2>Overview</h2>
<p>For our semester-long project in Algorithms (CSCI 3383) with <a href="http://www.jbento.net/">Prof. Bento Ayres Periera</a>, we were given the following description: </p>
<p>For our semester-long project in Algorithms (CSCI 3383) with <a href="http://www.jbento.net/">Prof. Bento Ayres Pereira</a>, we were given the following task: </p>
<ul>
<li>You have to develop an algorithm that, given a query image, finds the “closest” entries to it in a dataset of images</li>
<li>
22 changes: 19 additions & 3 deletions projects/data/streaming.html
@@ -124,9 +124,25 @@
<div class="inner">
<h1>Streaming Data</h1>
<span class="image main"></span>

<h2>Overview</h2>
<p>For Big Data Research Day 2016 at Boston College, I built a demonstration to break down how smartphone data is organized, streamed, and processed in real time for applications such as mobile VR. Here is the resulting paper from this research. By Ryan Reede and <a href="github.com/cam9">Cam Lunt</a>; the code is <a href="https://github.com/reedery/TangoStreamAndroid">on GitHub here</a>.</p>
<p>Smartphones and tablets have incredible hardware built into them, but more often than not, the software written for these devices doesn’t tap into it. This project explores methods by which massive numbers of data points from a mobile sensor can be reported quickly and reliably to a remote server. With our software, we set out to help the layman understand the practical applications of smartphone data streams by reliably transmitting them from a mobile device to a processing service. From mobile virtual and augmented reality to UI/UX research, gaming, and navigation, the applications for transmitting information from mobile devices to servers are endless. We wanted to process and visualize the data coming off the device wirelessly. From the get-go, a number of problems became clear to us.</p>
<hr /> <h2>Understanding the Data</h2>
<p>Of the four V’s that describe the difficulty of managing and working with ‘Big Data’ (volume, variety, veracity, and velocity), velocity was the biggest hurdle for us to overcome. The Android tablet we were working with was designed for applications such as mapping interiors, thanks to its extremely precise sensors and a 2.3 GHz quad-core mobile processor. As a byproduct of such powerful hardware, our model was capable of producing a huge amount of data in very little time. Beyond the speed at which the data was offloaded from the tablet, we needed a reliable way to transport the data stream from the Android device to a central server, where it could be processed and visualized such that a client with limited knowledge of data streams, linear transformations, and Euler angles could understand what the data coming off the tablet represented. Creating a proper visualization of the 3D axes and transformations became the final major issue for us to tackle, given the properties of our data. </p>
<img src="../../images/streaming/data.jpg" width="100%" style="margin: 5px 0px" >
<br><br>
<p>Although the Project Tango device is capable of producing nearly 250,000 data points per second, we understood the complexity associated with processing a live data stream. Additionally, we didn’t want to clog our project with extraneous data that wasn’t vital to our end goal of helping a non-technical person understand the practical applications of such sensor data. Our work focused on the rotational data, which came off the tablet in four parts (a quaternion). More info on quaternions (from <a href="http://www.cprogramming.com/tutorial/3d/quaternions.html">cprogramming</a>): </p>
<blockquote>A quaternion represents two things. It has an x, y, and z component, which represents the axis about which a rotation will occur. It also has a w component, which represents the amount of rotation which will occur about this axis. In short, a vector, and a float. With these four numbers, it is possible to build a matrix which will represent all the rotations perfectly, with no chance of gimbal lock.</blockquote>
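To make the quoted description concrete, here is a minimal sketch (illustrative only, not code from the project) of how a unit quaternion (x, y, z, w) builds the rotation matrix the quote mentions:

```python
def quat_to_matrix(x, y, z, w):
    """Convert a unit quaternion (x, y, z axis part; w rotation part)
    into the 3x3 rotation matrix it represents."""
    return [
        [1 - 2*(y*y + z*z), 2*(x*y - z*w),     2*(x*z + y*w)],
        [2*(x*y + z*w),     1 - 2*(x*x + z*z), 2*(y*z - x*w)],
        [2*(x*z - y*w),     2*(y*z + x*w),     1 - 2*(x*x + y*y)],
    ]

# The identity quaternion (no rotation) yields the identity matrix.
identity = quat_to_matrix(0.0, 0.0, 0.0, 1.0)
```

Because the matrix is built from all four components at once, there is no sequence of axis-by-axis rotations and hence no chance of gimbal lock.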
<hr />
<h2>Programming Paradigm</h2>

<img src="../../images/streaming/parad.jpg" width="100%" style="margin: 5px 0px" >
<br><br>
<p>There were multiple layers of communication involved in reliably transmitting the sensor data from the tablet to the processing server. Because of the size of the Kafka library, mobile devices are not recommended as Kafka producers; Kafka is intended to persist and manage messages, not to be the communication agent between devices. For that reason, all of the sensor readings were sent through a socket from the mobile device to a host Java instance on the remote server. Once the host Java instance received a sensor reading, it wrote the message to a Kafka topic. By writing to the Kafka topic as soon as the sensor data reaches the server, we ensure that no messages get lost and that the stream is robust. The Kafka consumer is a Jetty server process running on the same remote server. The Jetty server consumes Kafka messages from the sensor-data topic and relays them to the front-end JavaScript instance through a WebSocket; Kafka consumers cannot be implemented in front-end JavaScript, so this paradigm is required. Once the JavaScript instance running in an observer&rsquo;s browser receives the sensor data, it uses it to update a 3D graphic on the screen.</p>
<p>The producer&ndash;consumer paradigm gives us a very generalizable solution for multiple streams, and the system could easily be extended to receive several different types of data from the tablet. A queue-based design also allows for elasticity in the consumption of messages: if the end of the line of communication slows down or pauses, the system is safe thanks to Kafka&rsquo;s persistence of data. This durability ensures no messages get lost and makes the system pause-tolerant. Below is the Server class that sends Kafka KeyedMessages through the socket.</p>
<script src="https://gist.github.com/reedery/47c42ef89370a36e4aeef37f50bd8ce2.js"></script>
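The producer&ndash;consumer flow described above can be sketched language-agnostically, with a bounded in-memory queue standing in for the Kafka topic (this is an illustrative sketch, not the project's actual Java/Kafka code):

```python
import queue
import threading

# A bounded queue stands in for the Kafka topic: the producer (socket
# listener) enqueues sensor readings, and the consumer (Jetty relay)
# drains them at its own pace, so a slow consumer never loses messages.
topic = queue.Queue(maxsize=1000)
received = []

def producer(readings):
    for r in readings:
        topic.put(r)          # blocks if the consumer falls far behind
    topic.put(None)           # sentinel: end of stream

def consumer():
    while True:
        msg = topic.get()
        if msg is None:
            break
        received.append(msg)  # stand-in for "relay over the WebSocket"

readings = [(0.0, 0.0, 0.0, 1.0)] * 5   # dummy quaternion samples
t = threading.Thread(target=consumer)
t.start()
producer(readings)
t.join()
```

The queue decouples the two ends exactly as Kafka does in the real system: pausing the consumer only delays delivery, it never drops a message.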
<hr /> <h2>Analysis and Conclusion</h2>
<p>Although our project did not leverage machine-learning algorithms or make predictions of any sort, it was still an extremely valuable exercise for a number of reasons. Primarily, we became much more proficient in working with data streams and streaming objects, and in how they should be handled when communicating between servers and layers. In many enterprise instances of Apache Spark, a live data stream is coming in; rarely is the data a clean backlog in a .csv file. Likewise, we familiarized ourselves with Kafka, an industry-standard tool for breaking data streams into individually analyzable, fault-tolerant chunks. Our animation ended up introducing a noticeable amount of lag, but this too taught us a valuable lesson: Kafka is not optimized for real-time data visualization, and the flow of data through a system should pass through as few layers as possible. Had the development process gone a little more smoothly and quickly, it would have been nice to implement some sort of data analysis. With more time, the point in our architecture where the data is sent to the browser would have been the most natural place to add an instance of Spark, since the Kafka queuing was already in place. Additionally, we could have had the data sent to the browser come directly from the tablet instead of through the Kafka channel, which would have made the animation much smoother.</p>
</div>
</div>

15 changes: 12 additions & 3 deletions projects/data/yelp.html
@@ -124,9 +124,18 @@
<div class="inner">
<h1>Yelp Academic Dataset</h1>
<span class="image main"></span>
<h2>Overview</h2>
<p>For Business Intelligence and Analytics (Fall '14) with <a href="http://www.samransbotham.com/">Prof. Ransbotham,</a> our semester-long group project tasked us with analyzing and making data-based predictions on the Yelp Academic dataset. Below is the portion of the project I worked on, which decided how we would implement our own rankings feature for restaurants and attractions. </p>
<p><strong>(disclaimer: I knew nothing about Machine Learning at the time) </strong></p>
<hr />
<h2>Good or Popular?</h2>
<p>We needed to factor in both the popularity of a restaurant and its average review to find the (quantitatively speaking) best food in Austin. With 5 possible stars (in 0.5-star increments) to judge an overall dining experience, a simple upvote formula such as those found on Instagram or Reddit would not work well. Although heavily generalized, the idea there is to take a post’s view count and divide it by the number of upvotes to determine how good the post is, but that only works with a binary like/dislike voting system. We have 5 (or really 10) star levels for determining how ‘good’ a restaurant is, so a search on the web for some sort of weighted rank that factors this in turned up some great answers. </p>
<img src="../../images/yelp/bay.jpg" width="100%" style="margin: 3px 3px" ><br><br>
<p>On Math Stack Exchange, an answer showed how to use a Bayesian approach to compute this sort of weighted rank. We first applied the formula to our Yelp data as-is for testing, then tweaked the weights to make the results more reasonable for our data. For instance, in one test the popularity of a restaurant did not get enough recognition: restaurants averaging 5.00 stars with under 5 reviews ranked higher than restaurants averaging 4.5 stars with over 50 reviews. Even after heavy modification, we struggled to find a weighting that generalized well across all of our data. It was still clear to us, however, that a rank-based aggregate score built from multiple factors was the right way to approach this. Using R, we computed this data and adjusted the dataframe to store the new values. Once we had this value computed, we looked for ways to visualize it all in a meaningful way.</p>
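The weighted rank in question follows the standard Bayesian-average form, WR = (v/(v+m))·R + (m/(v+m))·C, where v is a restaurant's review count, R its average stars, m a minimum-review threshold, and C the mean rating over all restaurants. A small sketch (the threshold and prior values here are illustrative, not the weights we actually tuned):

```python
def bayesian_rank(avg_stars, num_reviews, prior_mean=3.7, min_reviews=10):
    """Weighted rank: pull sparsely reviewed ratings toward the global mean."""
    v, m = num_reviews, min_reviews
    return (v / (v + m)) * avg_stars + (m / (v + m)) * prior_mean

# A 5.00-star spot with 3 reviews vs. a 4.5-star spot with 60 reviews:
sparse = bayesian_rank(5.0, 3)    # pulled hard toward the prior mean
popular = bayesian_rank(4.5, 60)  # keeps most of its own average
```

With enough reviews the score converges to the restaurant's own average, which is exactly the popularity-vs-quality trade-off described above.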
<hr />
<h2>Data Visualization</h2>
<p>Since we were working with data so heavily dependent on location, presenting it with mapping in mind was key. We needed a simple way to let the valuable lat./long. data show how location can affect rankings. Tableau wound up being the tool we used. It was incredibly intuitive, recognized our input .csv file as geographic, and defaulted to a map view. To make the visualization even more telling, we adjusted the parameters of the map so that both the scale and the color temperature of the points reflected our data properly: the size of a point shows how many people reviewed the location, and a darker red indicates a higher score on the Bayesian weighted-rank scale.</p>
<img src="../../images/yelp/bay2.jpg" width="100%" style="margin: 3px 3px" ><br><br>
</div>
</div>
