Commit fdd643d

update docs
1 parent 7ab7fd5 commit fdd643d

12 files changed: +22 -22 lines changed


.env

Lines changed: 2 additions & 2 deletions
@@ -1,4 +1,4 @@
-es_host=b963fb901829469e968500b83b71ab81.us-central1.gcp.cloud.es.io
+es_host=391f53d8501748afaf1fbfc591ed43e1.us-central1.gcp.cloud.es.io
 es_username=elastic
 es_port=9243
-es_pw=ytuLqABRF5zE7gbwEqL1aR7l
+es_pw=VjLfWuuVTOnUZbFlI6a9eriq
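
These values are consumed by the ingestion code that run.sh (below) invokes. A minimal sketch of how they might be read, assuming python-dotenv and the elasticsearch 7.x client; the actual wiring in src/haystack/elasticsearch/db may differ:

import os

from dotenv import load_dotenv
from elasticsearch import Elasticsearch

load_dotenv()  # pulls es_host, es_username, es_port, es_pw into the environment

es = Elasticsearch(
    [os.environ["es_host"]],
    http_auth=(os.environ["es_username"], os.environ["es_pw"]),
    scheme="https",
    port=int(os.environ["es_port"]),
)
print(es.info())  # smoke test: returns cluster metadata if the credentials work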

prod.sh

Lines changed: 1 addition & 3 deletions
@@ -1,10 +1,8 @@
-echo "Starting to deploy docker image..."
-
-AWS_REGION=us-east-2
 DOCKER_CONTAINER_NAME=chatbox-nlp-api-gunicorn-container
 REPOSITORY_URI=public.ecr.aws/q0s5b2t6/chatbox-nlp-api
 DEPLOY_DOCKER_COMPOSE_FILE=/home/ec2-user/server/docker-compose.yml
 
+echo "Starting to deploy docker image..."
 echo "Stopping previous containers..."
 docker ps -q --filter "name=$DOCKER_CONTAINER_NAME" | grep -q . && docker stop $DOCKER_CONTAINER_NAME && docker rm -fv $DOCKER_CONTAINER_NAME
 if [[ "$(docker images -q $REPOSITORY_URI:latest 2> /dev/null)" != "" ]]; then

run.sh

Lines changed: 1 addition & 4 deletions
@@ -4,7 +4,4 @@ python -m src.dataset.start
 zip wiki_algos_text.zip src/dataset/general_kb/data/* -j
 
 # connect and write to the document store
-python -m src.haystack.elasticsearch.db > out.txt
-
-# refresh the dependency list
-pipenv lock -r > requirements.txt
+python -m src.haystack.elasticsearch.db

src/dataset/general_kb/data/B-tree.txt

Lines changed: 1 addition & 1 deletion
@@ -13,7 +13,7 @@ According to Knuth's definition, a B-tree of order m is a tree which satisfies t
 1. Every node has at most m children.
 2. Every internal node has at least ⌈m/2⌉ children.
 3. Every non-leaf node has at least two children.
-4. All leaves appear on the same level and carry no information.
+4. All leaves appear on the same level.
 5. A non-leaf node with k children contains k−1 keys.
 
 Each internal node's keys act as separation values which divide its subtrees. For example, if an internal node has 3 child nodes (or subtrees) then it must have 2 keys: a1 and a2. All values in the leftmost subtree will be less than a1, all values in the middle subtree will be between a1 and a2, and all values in the rightmost subtree will be greater than a2.
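
That separation-value lookup is easy to sketch in Python (a hypothetical Node with sorted keys and a children list, not code from this repo):

from bisect import bisect_right

class Node:
    # Hypothetical B-tree node: an internal node holds len(keys) + 1 children.
    def __init__(self, keys, children=None):
        self.keys = keys          # sorted separation values
        self.children = children  # None for a leaf

def search(node, value):
    # bisect_right picks the subtree: with keys a1 and a2, a value below a1
    # goes left (index 0), between a1 and a2 middle (1), above a2 right (2).
    i = bisect_right(node.keys, value)
    if i > 0 and node.keys[i - 1] == value:
        return True
    if node.children is None:
        return False
    return search(node.children[i], value)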

src/dataset/general_kb/data/Depth-first__search.txt

Lines changed: 1 addition & 0 deletions
@@ -88,6 +88,7 @@ Another possible implementation of iterative depth-first search uses a stack of
 
 procedure DFS_iterative(G, v) is
     let S be a stack
+    label v as discovered
     S.push(iterator of G.adjacentEdges(v))
     while S is not empty do
         if S.peek().hasNext() then
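
The stack-of-iterators variant translates almost directly to Python; a sketch (the adjacency-dict representation is an assumption), including the "label v as discovered" line added above:

def dfs_iterative(graph, start):
    # graph: dict mapping each vertex to an iterable of neighbors.
    discovered = {start}          # the added line: mark the start vertex up front
    order = [start]
    stack = [iter(graph[start])]  # the stack holds iterators, not vertices
    while stack:
        try:
            w = next(stack[-1])   # peek the top iterator and advance it
        except StopIteration:
            stack.pop()           # iterator exhausted: backtrack
            continue
        if w not in discovered:
            discovered.add(w)
            order.append(w)
            stack.append(iter(graph[w]))
    return order

Without the up-front label, the start vertex could be re-discovered through a cycle, which is exactly what the one-line change prevents.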

src/dataset/general_kb/data/Heapsort.txt

Lines changed: 1 addition & 1 deletion
@@ -1,6 +1,6 @@
 In computer science, heapsort is a comparison-based sorting algorithm. Heapsort can be thought of as an improved selection sort: like selection sort, heapsort divides its input into a sorted and an unsorted region, and it iteratively shrinks the unsorted region by extracting the largest element from it and inserting it into the sorted region. Unlike selection sort, heapsort does not waste time with a linear-time scan of the unsorted region; rather, heapsort maintains the unsorted region in a heap data structure to more quickly find the largest element in each step.
 
-Although somewhat slower in practice on most machines than a well-implemented quicksort, it has the advantage of a more favorable worst-case O(n log n) runtime. Heapsort is an in-place algorithm, but it is not a stable sort.
+Although somewhat slower in practice on most machines than a well-implemented quicksort, it has the advantage of a more favorable worst-case O(n log n) runtime (and as such is used by Introsort as a fallback should it detect that quicksort is becoming degenerate). Heapsort is an in-place algorithm, but it is not a stable sort.
 
 Heapsort was invented by J. W. J. Williams in 1964. This was also the birth of the heap, presented already by Williams as a useful data structure in its own right. In the same year, Robert W. Floyd published an improved version that could sort an array in-place, continuing his earlier research into the treesort algorithm.
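
The "improved selection sort" description maps to a few lines of Python; a minimal sketch with a max-heap maintained by sift-down (illustrative, not this repo's code):

def heapsort(a):
    # In-place heapsort: build a max-heap, then repeatedly move the max
    # into the growing sorted region at the end of the array.
    def sift_down(end, i):
        while (child := 2 * i + 1) < end:
            if child + 1 < end and a[child + 1] > a[child]:
                child += 1                  # take the larger child
            if a[i] >= a[child]:
                return
            a[i], a[child] = a[child], a[i]
            i = child

    n = len(a)
    for i in range(n // 2 - 1, -1, -1):     # heapify (Floyd's bottom-up construction)
        sift_down(n, i)
    for end in range(n - 1, 0, -1):
        a[0], a[end] = a[end], a[0]         # extract max into the sorted region
        sift_down(end, 0)
    return a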

src/dataset/general_kb/data/Longest__common__subsequence__problem.txt

Lines changed: 4 additions & 0 deletions
@@ -242,6 +242,10 @@ The third drawback is that of collisions. Since the checksum or hash is not guar
 
 If only the length of the LCS is required, the matrix can be reduced to a <math-expression>2\times \min(n,m)</math-expression> matrix with ease, or to a <math-expression>\min(m,n)+1</math-expression> vector (smarter) as the dynamic programming approach only needs the current and previous columns of the matrix. Hirschberg's algorithm allows the construction of the optimal sequence itself in the same quadratic time and linear space bounds.
 
+### Reduce cache misses
+
+Chowdhury and Ramachandran devised a quadratic-time linear-space algorithm for finding the LCS length along with an optimal sequence which runs faster than Hirschberg's algorithm in practice due to its superior cache performance. The algorithm has an asymptotically optimal cache complexity under the Ideal cache model. Interestingly, the algorithm itself is cache-oblivious, meaning that it does not make any choices based on the cache parameters (e.g., cache size and cache line size) of the machine.
+
 ### Further optimized algorithms
 
 Several algorithms exist that run faster than the presented dynamic programming approach. One of them is Hunt–Szymanski algorithm, which typically runs in <math-expression>O((n+r)\log(n))</math-expression> time (for <math-expression>n>m</math-expression>), where <math-expression>r</math-expression> is the number of matches between the two sequences. For problems with a bounded alphabet size, the Method of Four Russians can be used to reduce the running time of the dynamic programming algorithm by a logarithmic factor.
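
The two-row reduction described above fits in a few lines of Python (a sketch, not from this repo):

def lcs_length(x, y):
    # O(min(m, n)) extra space: the DP only ever needs the previous row.
    if len(x) < len(y):
        x, y = y, x                      # iterate over the longer string
    prev = [0] * (len(y) + 1)
    for xi in x:
        curr = [0]
        for j, yj in enumerate(y, 1):
            if xi == yj:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]

Recovering the subsequence itself within the same space bound is what Hirschberg's divide-and-conquer adds on top of this.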

src/dataset/general_kb/data/Quicksort.txt

Lines changed: 4 additions & 4 deletions
@@ -12,9 +12,7 @@ The quicksort algorithm was developed in 1959 by Tony Hoare while he was a visit
 
 Quicksort gained widespread adoption, appearing, for example, in Unix as the default library sort subroutine. Hence, it lent its name to the C standard library subroutine qsort and in the reference implementation of Java.
 
-Robert Sedgewick's PhD thesis in 1975 is considered a milestone in the study of Quicksort where he resolved many open problems related to the analysis of various pivot selection schemes including Samplesort, adaptive partitioning by Van Emden as well as derivation of expected number of comparisons and swaps. Jon Bentley and Doug McIlroy incorporated various improvements for use in programming libraries, including a technique to deal with equal elements and a pivot scheme known as pseudomedian of nine, where a sample of nine elements is divided into groups of three and then the median of the three medians from three groups is chosen. Bentley described another simpler and compact partitioning scheme in his book Programming Pearls that he attributed to Nico Lomuto. Later Bentley wrote that he used Hoare's version for years but never really understood it but Lomuto's version was simple enough to prove correct. Bentley described Quicksort as the "most beautiful code I had ever written" in the same essay. Lomuto's partition scheme was also popularized by the textbook Introduction to Algorithms although it is inferior to Hoare's scheme because it does three times more swaps on average and degrades to O(n²) runtime when all elements are equal.
-
-In 2009, Vladimir Yaroslavskiy proposed a new Quicksort implementation using two pivots instead of one. In the Java core library mailing lists, he initiated a discussion claiming his new algorithm to be superior to the runtime library's sorting method, which was at that time based on the widely used and carefully tuned variant of classic Quicksort by Bentley and McIlroy. Yaroslavskiy's Quicksort has been chosen as the new default sorting algorithm in Oracle's Java 7 runtime library after extensive empirical performance tests.
+Robert Sedgewick's PhD thesis in 1975 is considered a milestone in the study of Quicksort where he resolved many open problems related to the analysis of various pivot selection schemes including Samplesort, adaptive partitioning by Van Emden as well as derivation of expected number of comparisons and swaps. Jon Bentley and Doug McIlroy in 1993 incorporated various improvements for use in programming libraries, including a technique to deal with equal elements and a pivot scheme known as pseudomedian of nine, where a sample of nine elements is divided into groups of three and then the median of the three medians from three groups is chosen. Bentley described another simpler and compact partitioning scheme in his book Programming Pearls that he attributed to Nico Lomuto. Later Bentley wrote that he used Hoare's version for years but never really understood it but Lomuto's version was simple enough to prove correct. Bentley described Quicksort as the "most beautiful code I had ever written" in the same essay. Lomuto's partition scheme was also popularized by the textbook Introduction to Algorithms although it is inferior to Hoare's scheme because it does three times more swaps on average and degrades to O(n²) runtime when all elements are equal. McIlroy would further produce an AntiQuicksort (aqsort) function in 1998, which consistently drives even his 1993 variant of Quicksort into quadratic behavior by producing adversarial data on-the-fly.
 
 ## Algorithm
@@ -273,7 +271,7 @@ Another, less common, not-in-place, version of quicksort uses O(n) space for wor
273271

274272
Quicksort is a space-optimized version of the binary tree sort. Instead of inserting items sequentially into an explicit tree, quicksort organizes them concurrently into a tree that is implied by the recursive calls. The algorithms make exactly the same comparisons, but in a different order. An often desirable property of a sorting algorithm is stability – that is the order of elements that compare equal is not changed, allowing controlling order of multikey tables (e.g. directory or folder listings) in a natural way. This property is hard to maintain for in situ (or in place) quicksort (that uses only constant additional space for pointers and buffers, and O(log n) additional space for the management of explicit or implicit recursion). For variant quicksorts involving extra memory due to representations using pointers (e.g. lists or trees) or files (effectively lists), it is trivial to maintain stability. The more complex, or disk-bound, data structures tend to increase time cost, in general making increasing use of virtual memory or disk.
275273

276-
The most direct competitor of quicksort is heapsort. Heapsort's running time is O(n log n), but heapsort's average running time is usually considered slower than in-place quicksort. This result is debatable; some publications indicate the opposite. Introsort is a variant of quicksort that switches to heapsort when a bad case is detected to avoid quicksort's worst-case running time.
274+
The most direct competitor of quicksort is heapsort. Heapsort's running time is O(n log n), but heapsort's average running time is usually considered slower than in-place quicksort. This result is debatable; some publications indicate the opposite. Introsort is a variant of quicksort that switches to heapsort when a bad case is detected to avoid quicksort's worst-case running time. Major programming languages, such as C++ (in the GNU and LLVM implementations), use introsort.
277275

278276
Quicksort also competes with merge sort, another O(n log n) sorting algorithm. Mergesort is a stable sort, unlike standard in-place quicksort and heapsort, and has excellent worst-case performance. The main disadvantage of mergesort is that, when operating on arrays, efficient implementations require O(n) auxiliary space, whereas the variant of quicksort with in-place partitioning and tail recursion uses only O(log n) space.
279277

@@ -311,6 +309,8 @@ Also developed by Powers as an O(K) parallel PRAM algorithm. This is again a com
 
 In any comparison-based sorting algorithm, minimizing the number of comparisons requires maximizing the amount of information gained from each comparison, meaning that the comparison results are unpredictable. This causes frequent branch mispredictions, limiting performance. BlockQuicksort rearranges the computations of quicksort to convert unpredictable branches to data dependencies. When partitioning, the input is divided into moderate-sized blocks (which fit easily into the data cache), and two arrays are filled with the positions of elements to swap. (To avoid conditional branches, the position is unconditionally stored at the end of the array, and the index of the end is incremented if a swap is needed.) A second pass exchanges the elements at the positions indicated in the arrays. Both loops have only one conditional branch, a test for termination, which is usually taken.
 
+The BlockQuicksort technique is incorporated into LLVM's C++ STL implementation, libcxx, providing a 50% improvement on random integer sequences. Pattern-defeating quicksort (pdqsort), a version of introsort, also incorporates this technique.
+
 #### Partial and incremental quicksort
 
 Several variants of quicksort exist that separate the k smallest or largest elements from the rest of the input.
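
The "store unconditionally, bump the index conditionally" trick from the BlockQuicksort paragraph, sketched for one block in Python (illustrative only; real implementations run branch-free C++ over paired left and right buffers):

BLOCK = 64  # hypothetical block size, chosen so a block fits in the data cache

def collect_offsets(a, start, end, pivot):
    # One pass over a block: always write the candidate position, and only
    # advance the write index when the element actually needs to move.
    offsets = [0] * BLOCK
    n = 0
    for i in range(start, min(start + BLOCK, end)):
        offsets[n] = i          # unconditional store at the current end
        n += a[i] >= pivot      # bool is 0 or 1: keep the slot only on a hit
    return offsets[:n]          # positions that belong on the pivot's right side

A second, symmetric pass collects positions from a right-hand block, after which the elements at paired positions are swapped.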

src/dataset/general_kb/data/Red–black__tree.txt

Lines changed: 1 addition & 1 deletion
@@ -10,7 +10,7 @@ Tracking the color of each node requires only one bit of information per node be
 
 In 1972, Rudolf Bayer invented a data structure that was a special order-4 case of a B-tree. These trees maintained all paths from root to leaf with the same number of nodes, creating perfectly balanced trees. However, they were not binary search trees. Bayer called them a "symmetric binary B-tree" in his paper and later they became popular as 2–3–4 trees or just 2–4 trees.
 
-In a 1978 paper, "A Dichromatic Framework for Balanced Trees", Leonidas J. Guibas and Robert Sedgewick derived the red–black tree from the symmetric binary B-tree. The color "red" was chosen because it was the best-looking color produced by the color laser printer available to the authors while working at Xerox PARC. Another response from Guibas states that it was because of the red and black pens available to them to draw the trees. Author's name was Rudolf Bayer so he took the initials from his name that is R B and in colours, R means red and B means Black
+In a 1978 paper, "A Dichromatic Framework for Balanced Trees", Leonidas J. Guibas and Robert Sedgewick derived the red–black tree from the symmetric binary B-tree. The color "red" was chosen because it was the best-looking color produced by the color laser printer available to the authors while working at Xerox PARC. Another response from Guibas states that it was because of the red and black pens available to them to draw the trees.
 
 In 1993, Arne Andersson introduced the idea of a right leaning tree to simplify insert and delete operations.

src/dataset/general_kb/data/Travelling__salesman__problem.txt

Lines changed: 1 addition & 1 deletion
@@ -136,7 +136,7 @@ The most direct solution would be to try all permutations (ordered combinations)
 
 One of the earliest applications of dynamic programming is the Held–Karp algorithm that solves the problem in time <math-expression>O(n^{2}2^{n})</math-expression>. This bound has also been reached by Exclusion-Inclusion in an attempt preceding the dynamic programming approach.
 
-Improving these time bounds seems to be difficult. For example, it has not been determined whether a classical exact algorithm for TSP that runs in time <math-expression>O(1.9999^{n})</math-expression> exists.
+Improving these time bounds seems to be difficult. For example, it has not been determined whether a classical exact algorithm for TSP that runs in time <math-expression>O(1.9999^{n})</math-expression> exists. The currently best quantum exact algorithm for TSP due to Ambainis et al. runs in time <math-expression>O(1.728^{n})</math-expression>.
 
 Other approaches include:
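
The Held–Karp recurrence mentioned above, as a compact Python sketch (exact tour cost only; illustrative, and exponential in n by design):

from itertools import combinations

def held_karp(dist):
    # dist: n x n cost matrix; city 0 is the fixed start of the tour.
    # C[(S, j)]: cheapest path from 0 through the vertex set S (a bitmask
    # over cities 1..n-1), ending at city j.
    n = len(dist)
    C = {(1 << j, j): dist[0][j] for j in range(1, n)}
    for size in range(2, n):
        for subset in combinations(range(1, n), size):
            S = sum(1 << j for j in subset)
            for j in subset:
                C[(S, j)] = min(C[(S ^ (1 << j), k)] + dist[k][j]
                                for k in subset if k != j)
    full = sum(1 << j for j in range(1, n))
    return min(C[(full, j)] + dist[j][0] for j in range(1, n))

The bookkeeping makes the bound visible: 2^n subsets times n ending cities, each resolved by an O(n) minimum, gives O(n^2 * 2^n).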
