Glaucia edited this page May 11, 2019 · 4 revisions

Welcome to the task2stackRapidMiner wiki!

Process Overview

This process computes similarity indexes between the project tasks of a given dataset and evaluates whether the comparisons with the highest similarities refer to tasks that use the same Stack Overflow post. The implemented process receives a dataset of project tasks as input, computes the similarity between each pair of tasks, and evaluates the generated similarities by producing metric results.

Operators Description

The process is composed of operators that run sequentially. Each operator has a distinct responsibility, and the combination and order of operators can change the result of the process. In this process, the first operator loads the sample file with project tasks. Operators are connected by straight lines: each operator is a box that runs a single procedure, and its result is the input to the next operator. Every operator has semicircles that serve as input and output ports, except for the Retrieve operator, which has no input because it represents an already loaded file. These semicircles appear as labelled icons on the sides of the operators. The operator ports are:

  1. out: output port.
  2. ori: the original sample data.
  3. exa: the example set generated or modified by the operators.
  4. sim: the generated similarity table.
  5. lab: labelled data. A label applied to the example set is delivered through this port.
  6. per: the performance vector for the selected attributes.
  7. doc: a document or document set.
  8. res: the result connector that represents the end of the process.

Retrieve Operator: The first operator in the process is the Retrieve operator, which represents the import of the dataset into the process. According to the RapidMiner documentation, this operator loads an object from the repository into the process. It is necessary to tell the platform where the physical file is and to configure a few characteristics of the dataset: the encoding, the character that marks comments, and the column separator. RapidMiner provides a graphical user interface (GUI) that guides these configurations and also helps the user set the column types. The result of the configuration is a Retrieve operator referencing a configured data sample. After creating this operator, it is necessary to select which attributes of the dataset (the project task context elements) will be used to generate the similarities. The operator responsible for selecting the attributes is Select Attributes.
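Outside RapidMiner, the same import configuration (column separator, comment character, encoding) can be sketched in plain Python. The column names and sample rows below are hypothetical stand-ins for a project-task dataset:

```python
import csv
import io

# Hypothetical sample standing in for the project-task dataset file.
raw = io.StringIO(
    "# project task export\n"
    "task_id;summary;so_post\n"
    "1;fix login crash;101\n"
    "2;login fails on start;101\n"
)

# Mirror the Retrieve configuration: ';' as the column separator,
# '#' as the comment character (encoding would be set on open()).
rows = [line for line in raw if not line.startswith("#")]
reader = csv.DictReader(rows, delimiter=";")
dataset = list(reader)
print(dataset[0]["summary"])  # fix login crash
```

In RapidMiner the GUI performs this configuration; the sketch only shows what each setting does to the raw file.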

Select Attributes: The Select Attributes operator selects which attributes of the dataset will serve as the project task context elements for the similarity index extraction. This operator keeps a subset of the attributes of a dataset and discards the attributes that were not selected. Its exa input port is connected to the out port of the Retrieve operator. After the project task context elements are selected, they are submitted to text pre-processing: the exa output port of the Select Attributes operator is connected to the exa input port of the Process Documents from Data operator.

Process Documents from Data: this operator is a subprocess responsible for the text pre-processing transformations: converting characters to lowercase (Transform Cases), removing every character that is not alphanumeric (Tokenize), filtering stop words (Filter Stopwords) and, lastly, reducing inflected words to a base or root form (Stem).
The output of this operator, the text with all pre-processing transformations applied, is connected to the input port of the next operator, Set Role.
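The four pre-processing steps can be sketched as a small Python pipeline. The stop-word list is a tiny illustrative set, and the stemmer is a crude suffix-stripper standing in for the real Stem operator (RapidMiner offers proper stemmers such as Porter):

```python
import re

STOPWORDS = {"the", "a", "an", "on", "in", "of", "to", "is"}  # tiny illustrative list

def stem(token):
    # Crude suffix stripping; a stand-in for the Stem operator only.
    for suffix in ("ing", "ed", "s"):
        if token.endswith(suffix) and len(token) > len(suffix) + 2:
            return token[: -len(suffix)]
    return token

def preprocess(text):
    text = text.lower()                                  # Transform Cases
    tokens = re.findall(r"[a-z0-9]+", text)              # Tokenize (alphanumeric only)
    tokens = [t for t in tokens if t not in STOPWORDS]   # Filter Stopwords
    return [stem(t) for t in tokens]                     # Stem

print(preprocess("Fixing the login crashes on startup"))
```

Each line of `preprocess` corresponds to one inner operator of the subprocess, applied in the same order.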

Set Role: This operator changes the role of an attribute of the dataset, as required by the input of the next operator (Data to Similarity Data). It identifies which information from the input dataset is the dependent variable, that is, the information that will be suggested and submitted to evaluation further in the process.

Data to Similarity Data: this operator creates the similarity table. It receives as input the configured and selected attributes of the imported dataset and outputs the similarity table, containing the similarity indexes extracted from each project task comparison. The similarity table has four columns: Row No., FIRST_ID, SECOND_ID and SIMILARITY. Row No. is an identification number for each generated row. FIRST_ID identifies the row of the dataset used as the base for the comparison; this row is compared to the row given in the SECOND_ID column. The SIMILARITY column holds the similarity index resulting from the comparison between FIRST_ID and SECOND_ID. The Data to Similarity Data operator has two parameters: the measure type and the algorithm available for that measure type. The Measure Type parameter selects the type of measure used to calculate similarity; the available types are mixed measures, nominal measures, numerical measures and Bregman divergences. This parameter defines how distances are calculated for the attributes of the input dataset and is configured according to the dataset's characteristics. For this model, since the dataset has text columns only, the option selected for Measure Type is “Nominal Measures”. When this option is selected, the second parameter changes dynamically: the parameters tab shows a “Nominal Measures” label whose options are the algorithms suited to textual data. The nominal measure algorithms are described below. Considering “e” the number of attributes for which both examples have equal and non-zero values, “u” the number of attributes for which the examples have unequal values, and “z” the number of attributes for which both examples have zero values, the available algorithms are:

  1. NominalDistance: Distance of two values is 0 if both values are the same and 1 otherwise.
  2. DiceSimilarity: With the above-mentioned definitions the DiceSimilarity is 2e/(2e+u)
  3. JaccardSimilarity: With the above-mentioned definitions the JaccardSimilarity is e/(e+u)
  4. KulczynskiSimilarity: With the above-mentioned definitions the KulczynskiSimilarity is e/u
  5. RogersTanimotoSimilarity: With the above-mentioned definitions the RogersTanimotoSimilarity is (e+z)/(e+2*u+z)
  6. RussellRaoSimilarity: With the above-mentioned definitions the RussellRaoSimilarity is e/(e+u+z)
  7. SimpleMatchingSimilarity: With the above-mentioned definitions the SimpleMatchingSimilarity is (e+z)/(e+u+z)

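The e/u/z counts and a few of the measures above can be sketched in Python. The binary vectors are hypothetical term-presence encodings of two project tasks:

```python
def counts(x, y):
    # e: attributes equal and non-zero in both; u: unequal; z: zero in both.
    e = sum(1 for a, b in zip(x, y) if a == b and a != 0)
    u = sum(1 for a, b in zip(x, y) if a != b)
    z = sum(1 for a, b in zip(x, y) if a == b == 0)
    return e, u, z

def jaccard(x, y):
    e, u, _ = counts(x, y)
    return e / (e + u)

def dice(x, y):
    e, u, _ = counts(x, y)
    return 2 * e / (2 * e + u)

def simple_matching(x, y):
    e, u, z = counts(x, y)
    return (e + z) / (e + u + z)

# Hypothetical binary term-presence vectors for two project tasks.
x = [1, 1, 0, 1, 0]
y = [1, 0, 0, 1, 1]
print(jaccard(x, y), dice(x, y), simple_matching(x, y))
```

For these vectors e = 2, u = 2 and z = 1, so JaccardSimilarity is 2/4 = 0.5 while SimpleMatchingSimilarity, which also credits shared zeros, is 3/5 = 0.6.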
The Jaccard algorithm is broadly used in text similarity retrieval (LEVENSHTEIN, 1966) (YUNG-SHEN LIN et al., 2014) and is the algorithm most often used for document comparison (TAN et al., 2006). It compares two strings and retrieves an index that shows how similar they are. The similarity indexes produced by these algorithms range from 0 to 1 and can be interpreted as percentages. This operator compares each document to all other documents, so the number of comparisons is quadratic (n^2). For example, if there are 25 examples in the given dataset, there will be 625 (25 × 25) similarity comparisons in the resulting similarity table. This operator is connected to another Set Role operator, which has a different responsibility at this step of the process.
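The n^2 growth of the similarity table can be sketched with token-set Jaccard over a few hypothetical task descriptions, producing rows in the same (Row No., FIRST_ID, SECOND_ID, SIMILARITY) shape:

```python
def jaccard_tokens(a, b):
    # Jaccard similarity over the token sets of two task descriptions.
    sa, sb = set(a.split()), set(b.split())
    return len(sa & sb) / len(sa | sb)

def similarity_table(docs):
    # Compare every document to every other document: n^2 rows.
    table = []
    row_no = 0
    for i, first in enumerate(docs, start=1):
        for j, second in enumerate(docs, start=1):
            row_no += 1
            table.append((row_no, i, j, jaccard_tokens(first, second)))
    return table

docs = ["fix login crash", "login crash on start", "update build docs"]
table = similarity_table(docs)
print(len(table))  # 9 rows for 3 documents (3 * 3)
```

Tasks 1 and 2 share two of five distinct tokens, so their row carries a similarity of 0.4, while each self-comparison scores 1.0.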

**Set Role (2):** This operator sets roles for specific attributes. Its input is the similarity table with its four columns. In this Set Role (2) operator, the FIRST_ID column is set to “label” and the SECOND_ID column is set to “prediction”. The label attribute serves as the target of the comparison, and the prediction attribute is the prediction of the process; in other words, SECOND_ID is the expected prediction and FIRST_ID is the base information for that prediction. In this model, the information under study is Stack Overflow posts, so both columns should present the Stack Overflow post associated with the compared project task. This way we can evaluate whether project tasks with a high degree of similarity share the same Stack Overflow post, i.e., whether the posts in both columns are equal.

Filter Examples: this operator sets a threshold on the similarities. We defined a threshold of 50% similarity (similarity index >= 0.5).
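The thresholding is a plain filter over the similarity table rows; a minimal sketch with hypothetical rows:

```python
# Rows of the similarity table: (FIRST_ID, SECOND_ID, SIMILARITY).
rows = [(1, 2, 0.82), (1, 3, 0.31), (2, 3, 0.56), (3, 4, 0.50)]

# Keep only comparisons at or above the 50% similarity threshold.
kept = [r for r in rows if r[2] >= 0.5]
print(kept)
```

Note that the threshold is inclusive, so a row at exactly 0.50 is retained.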

Performance: the Performance operator is used for the statistical performance evaluation of classification tasks and delivers a list of performance criteria values. Here, the classification task is the similarity extraction, whose instances are the data we wish to evaluate. To use this operator, attributes of the similarity table must carry the “label” and “prediction” roles, which the Set Role (2) operator assigned. The “label” attribute stores the actual observed values, whereas the “prediction” attribute stores the values of the label predicted by the classification process under analysis. This operator is connected to the “res” port of the process, indicating the end of the process. Its output is a confusion matrix of the similarity table, from which all metrics are calculated. The confusion matrix has two dimensions, label and prediction, and allows visualizing the performance of the algorithm: each row (first dimension) represents the labels and each column (second dimension) represents the predictions (or vice versa).
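The core of this evaluation can be sketched with (label, prediction) pairs, here hypothetical Stack Overflow post ids attached to FIRST_ID and SECOND_ID of each retained comparison:

```python
from collections import Counter

def confusion_matrix(pairs):
    # Count occurrences of each (label, prediction) cell, e.g. the
    # Stack Overflow post ids linked to FIRST_ID and SECOND_ID.
    return Counter(pairs)

def accuracy(pairs):
    # Fraction of comparisons where label and prediction agree,
    # i.e. both tasks share the same Stack Overflow post.
    correct = sum(1 for label, pred in pairs if label == pred)
    return correct / len(pairs)

# Hypothetical post ids for four high-similarity comparisons.
pairs = [(101, 101), (101, 202), (202, 202), (303, 303)]
matrix = confusion_matrix(pairs)
print(accuracy(pairs))  # 0.75
```

Diagonal cells of the matrix (label equals prediction) are the correct cases; every metric the Performance operator reports is derived from these cell counts.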

Using the process

Basically, you have to:

  1. Upload a dataset
  2. Set the roles of the dependent variables
  3. Provide a stop-words file to upload in the Filter Stopwords operator inside the Process Documents from Data operator