
Commit 0eef63a

Will McGinnis committed
added two more examples and a runner script.
1 parent 3587eb9 commit 0eef63a

File tree

18 files changed: +1863 -3 lines changed


.idea/.name

Lines changed: 1 addition & 0 deletions

.idea/flink-python-examples.iml

Lines changed: 8 additions & 0 deletions

.idea/misc.xml

Lines changed: 14 additions & 0 deletions

.idea/modules.xml

Lines changed: 8 additions & 0 deletions

.idea/vcs.xml

Lines changed: 6 additions & 0 deletions

.idea/workspace.xml

Lines changed: 517 additions & 0 deletions

README.md

Lines changed: 142 additions & 1 deletion
@@ -1,4 +1,145 @@
Python Flink Examples
=====================

A collection of examples using Apache Flink's new python API. To set up your local environment with
the latest Flink build, see the guide [here](http://willmcginnis.com/2015/11/08/getting-started-with-python-and-apache-flink/).

The examples here use the v1.0 python API (they won't work with the current stable, pre-1.0 release), and
are meant to serve as demonstrations of simple use cases. Currently the python API supports a portion of the DataSet
API, which, from the user's perspective, offers functionality similar to Spark's.

To run the examples, I've included a runner script at the top level with a method for each example; simply
add in the path to your pyflink script and you should be good to go (as long as you have a Flink cluster running locally).
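The runner script itself isn't rendered in this excerpt of the commit, so the following is only a hypothetical sketch
of the idea; the launcher location is a placeholder, and it assumes the pyflink launcher syntax of the time, where
parameters for the script follow a `-` separator:

    import os
    import subprocess

    # placeholder: point this at the pyflink launcher inside your Flink build
    PYFLINK = '/path/to/flink/bin/pyflink3.sh'
    BASE_PATH = os.path.dirname(os.path.abspath(__file__))

    def run_example(name):
        # each example script takes the repo base path as its first argument
        script = os.path.join(BASE_PATH, name, name + '.py')
        subprocess.check_call([PYFLINK, script, '-', BASE_PATH])

    if __name__ == '__main__':
        run_example('data_enrichment')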

The currently included examples are:

Examples
========

A listing of the examples included here.

Word Count
----------

An extremely simple analysis program: it uses a simple string as its source, counts the occurrences of each word,
and outputs to a file on disk (using the overwrite functionality).
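The word count source itself isn't rendered in this excerpt; as a hedged sketch, a pipeline in the spirit of the
official Flink python word count example of the era (the input string and output path here are illustrative) looks like:

    from flink.plan.Environment import get_environment
    from flink.plan.Constants import INT, STRING, WriteMode
    from flink.functions.GroupReduceFunction import GroupReduceFunction

    class Adder(GroupReduceFunction):
        def reduce(self, iterator, collector):
            # each group holds (1, word) pairs for one word; sum the counts
            count, word = iterator.next()
            count += sum([x[0] for x in iterator])
            collector.collect((count, word))

    if __name__ == "__main__":
        env = get_environment()
        data = env.from_elements("a simple string source with some simple words")

        data \
            .flat_map(lambda x, c: [(1, word) for word in x.lower().split()], (INT, STRING)) \
            .group_by(1) \
            .reduce_group(Adder(), (INT, STRING), combinable=True) \
            .write_text('file:///tmp/wordcount_out.txt', write_mode=WriteMode.OVERWRITE)

        env.execute(local=True)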

Trending Hashtags
-----------------

A very similar example to word count, but it includes a filter step to keep only hashtags, and uses a different
source and sink. The input data in this case is read off of disk, and the output is written as a csv. The input file
is generated dynamically at run time, so you can play with different volumes of tweets to get an idea of Flink's
scalability and performance.
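As with word count, this example's source isn't rendered here; a minimal sketch of the described pipeline, with a
placeholder tweet file path, might look like:

    from flink.plan.Environment import get_environment
    from flink.plan.Constants import INT, STRING, WriteMode
    from flink.functions.GroupReduceFunction import GroupReduceFunction

    class Adder(GroupReduceFunction):
        # same summing group-reducer pattern as in the word count sketch
        def reduce(self, iterator, collector):
            count, tag = iterator.next()
            count += sum([x[0] for x in iterator])
            collector.collect((count, tag))

    if __name__ == "__main__":
        env = get_environment()
        tweets = env.read_text('file:///tmp/generated_tweets.txt')  # placeholder path

        tweets \
            .flat_map(lambda x, c: [(1, word) for word in x.split()], (INT, STRING)) \
            .filter(lambda x: x[1].startswith('#')) \
            .group_by(1) \
            .reduce_group(Adder(), (INT, STRING), combinable=True) \
            .write_csv('file:///tmp/trending_hashtags.csv', write_mode=WriteMode.OVERWRITE)

        env.execute(local=True)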

Data Enrichment
---------------

In this example, we have row-wise json in one file, with an attribute field that refers to a csv dimension table of
colors. So we load both datasets in, convert the json data into an ordered and typed tuple, and join the two together
to get a nice dataset of cars and their colors.
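As a concrete walk-through (the input row below is illustrative, since the repo's input data file isn't rendered in
this excerpt), the conversion-and-join behaves like:

    import json

    # hypothetical input row; the real input file isn't shown in this diff
    row = '{"car": "tesla", "attr": "3"}'
    fields = ['car', 'attr']
    as_tuple = tuple([str(json.loads(row).get(f, '')) for f in fields])
    # as_tuple == ('tesla', '3'); joined on attr against the dimension
    # row ('3', 'blue'), the final map yields 'This tesla is blue'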

Features
========

A quick listing of high-level features, and the examples that include them.

Text data-source (read\_text)
-----------------------------

* trending hashtags
* data enrichment

CSV data-source (read\_csv)
---------------------------

* data enrichment

String data-source (from\_elements)
-----------------------------------

* word count

Text output (write\_text)
-------------------------

* word count
* data enrichment

CSV output (write\_csv)
-----------------------

* trending hashtags

Log to stdout output (output)
-----------------------------

Transformations: Map
--------------------

* word count
* trending hashtags
* data enrichment

Transformations: FlatMap
------------------------

* word count
* trending hashtags

Transformations: MapPartition
-----------------------------

Transformations: Filter
-----------------------

* trending hashtags

Transformations: Reduce
-----------------------

Transformations: ReduceGroup
----------------------------

* word count
* trending hashtags

Transformations: Join
---------------------

* data enrichment

Transformations: CoGroup
------------------------

Transformations: Cross
----------------------

Transformations: Union
----------------------

Gotchas We've Found
===================

As we go through the process of building these examples with an extremely young library, we run across quirks, which
we will mention here and, if appropriate, report as bugs (if they turn out to be bugs, we will take these notes down
once they are fixed).

Using os.path to set file paths dynamically
-------------------------------------------

There is a tendency to want to write code without hard-coded paths, so we may be tempted to set the path to the output
file in the word count example as:

    import os
    output_path = os.path.dirname(os.path.abspath(__file__)) + os.sep + 'output_file.txt'

But this doesn't seem to work: some part of how pyflink executes the python code moves it, so the abspath
term evaluates to some temp directory.
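The workaround used in this repo's examples (see data_enrichment.py in this same commit) is to have the runner pass
the base path in as a command-line argument; the exact output path below is illustrative:

    import sys

    # the runner supplies the repo base path as the first script argument,
    # sidestepping __file__ entirely
    base_path = sys.argv[1]
    output_path = 'file://' + base_path + '/word_count/output_file.txt'  # illustrative path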

data_enrichment/__init__.py

Lines changed: 1 addition & 0 deletions
@@ -0,0 +1 @@
__author__ = 'willmcginnis'

data_enrichment/data_enrichment.py

Lines changed: 45 additions & 0 deletions
@@ -0,0 +1,45 @@
import os
import json
import sys
from flink.plan.Environment import get_environment
from flink.plan.Constants import INT, STRING, WriteMode
from flink.functions.GroupReduceFunction import GroupReduceFunction

__author__ = 'willmcginnis'


class Adder(GroupReduceFunction):
    def reduce(self, iterator, collector):
        # sum the counts within a group, keeping the group's word
        count, word = iterator.next()
        count += sum([x[0] for x in iterator])
        collector.collect((count, word))


def json_to_tuple(js, fields):
    # pull the given fields out of a parsed json dict, as an ordered tuple of strings
    return tuple([str(js.get(f, '')) for f in fields])

if __name__ == "__main__":
    # get the base path out of the runtime params
    base_path = sys.argv[1]

    # set up paths to input and output files on disk
    dim_file = 'file://' + base_path + '/data_enrichment/dimensional_data.csv'
    input_file = 'file://' + base_path + '/data_enrichment/input_data.csv'
    output_file = 'file://' + base_path + '/data_enrichment/out.txt'

    # remove the output file if one is already there (strip the file://
    # scheme first, since os.path operates on plain filesystem paths)
    if os.path.isfile(output_file.replace('file://', '')):
        os.remove(output_file.replace('file://', ''))

    # set up the environment with a text file source and a csv source
    env = get_environment()
    input_data = env.read_text(input_file)
    dimensional_data = env.read_csv(dim_file, types=[STRING, STRING])

    # parse each json row into a (car, attr) tuple, join attr against the
    # dimension table's key, and write out one sentence per match
    input_data \
        .map(lambda x: json_to_tuple(json.loads(x), ['car', 'attr']), (STRING, STRING)) \
        .join(dimensional_data).where(1).equal_to(0) \
        .map(lambda x: 'This %s is %s' % (x[0][0], x[1][1]), STRING) \
        .write_text(output_file, write_mode=WriteMode.OVERWRITE)

    env.execute(local=True)
data_enrichment/dimensional_data.csv

Lines changed: 3 additions & 0 deletions
@@ -0,0 +1,3 @@
1,red
2,green
3,blue
