[Question] How to implement MultiClassClassification with tree data structure using ML.Net #5720

Open

Description

System information

ML.NET 1.5.2
.NET Framework 4.7.2

Issue

I have hundreds of projects, each with a tree data structure like this:

A
  AA
     AAA
  BB
     BBB

Or like this:

A
  AA1
     AAA1
  BB2
     BBB2

Each project's tree is a modified version of a standard tree. I am trying to map each project's tree onto the standard tree, like this:

A          <--- A
  AA       <---   AA1
     AAA   <---      AAA1
  BB       <---   BB2
     BBB   <---      BBB2

Or like this:

(Image: mapping to standard tree)

(The mapping depends on each node's text rather than on its level.)
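To make the data concrete, here is a minimal Python sketch (not ML.NET code) of one way to turn a project tree into root-to-node paths, which is the unit being mapped. The nested-dict representation and the name `flatten_paths` are assumptions made for illustration; the real projects may store their trees differently.

```python
def flatten_paths(tree, prefix=()):
    """Return the root-to-node path of every node, in depth-first order.

    `tree` is a hypothetical nested dict: each key is a node's text and
    each value is the dict of that node's children.
    """
    paths = []
    for name, children in tree.items():
        path = prefix + (name,)
        paths.append(path)
        paths.extend(flatten_paths(children, path))
    return paths


# The second example tree from above.
project_tree = {"A": {"AA1": {"AAA1": {}}, "BB2": {"BBB2": {}}}}
project_paths = flatten_paths(project_tree)
```

Each resulting path, such as `("A", "AA1", "AAA1")`, is one node to be mapped to a node of the standard tree.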

I am currently using multiclass classification in ML.NET. First, I manually map the existing projects' trees to the standard tree and save the results in a database, like this:

| Label      | Level1         | Level2         | Level3         |
| --------   | -------------- | -------------- | -------------- |
| A          | A              |      *         |       *        |
| A-AA       | A              |      AA1       |       *        |
| A-AA-AAA   | A              |      AA1       |      AAA1      |
| A-BB       | A              |      BB2       |       *        |
| A-BB-BBB   | A              |      BB2       |      BBB2      |
| A          | A              |      *         |       *        |
| A-AA-AAA   | A              |      AAA1      |       *        |
| A-BB       | A              |      BB2       |       *        |
| A-BB-BBB   | A              |      BB2       |      BBB2      |  

Because a column value in ML.NET cannot be missing, I replace missing values with *. The table above shows only three levels, but my actual tree has 15 levels (feature columns).
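The padding step can be sketched in Python as follows (again not ML.NET code). The names `to_row`, `NUM_LEVELS`, and `PLACEHOLDER` are hypothetical, and joining the standard-tree path with `-` simply mirrors the Label column in the table above.

```python
NUM_LEVELS = 15      # number of feature columns described above
PLACEHOLDER = "*"    # stands in for a missing level


def to_row(label_path, project_path, num_levels=NUM_LEVELS):
    """Build one training row: [Label, Level1, ..., LevelN].

    `label_path` is the node's path in the standard tree and
    `project_path` is its path in the project tree.
    """
    if len(project_path) > num_levels:
        raise ValueError("path is deeper than the number of level columns")
    label = "-".join(label_path)
    padding = [PLACEHOLDER] * (num_levels - len(project_path))
    return [label] + list(project_path) + padding


# Reproduces the second table row, using three levels for brevity.
row = to_row(("A", "AA"), ("A", "AA1"), num_levels=3)
```

With `num_levels=3` this yields `["A-AA", "A", "AA1", "*"]`; with the default of 15, the row would carry thirteen placeholders.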

The multiclass classification algorithm I chose is SdcaMaximumEntropy. I hope to use its predictions to map the trees instead of doing so manually.

I got prediction working, but the results are very poor.

So my question is:

  1. Is this approach correct?
  2. If so, should I remove the duplicate rows, and should I keep replacing missing values with *?
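To make the second question concrete, here is a small Python sketch that counts exact duplicate rows and checks whether any identical feature tuple carries more than one label. It only inspects the data and does not settle whether duplicates should be removed. The name `summarize_rows` is hypothetical, and the rows are the nine from the table above.

```python
from collections import Counter, defaultdict


def summarize_rows(rows):
    """Return (duplicates, conflicts) for a list of (label, features) rows.

    duplicates: rows that occur more than once, with their counts.
    conflicts:  feature tuples that appear with more than one label.
    """
    counts = Counter(rows)
    labels_by_features = defaultdict(set)
    for label, features in rows:
        labels_by_features[features].add(label)
    duplicates = {r: n for r, n in counts.items() if n > 1}
    conflicts = {f: sorted(ls) for f, ls in labels_by_features.items() if len(ls) > 1}
    return duplicates, conflicts


rows = [
    ("A", ("A", "*", "*")),
    ("A-AA", ("A", "AA1", "*")),
    ("A-AA-AAA", ("A", "AA1", "AAA1")),
    ("A-BB", ("A", "BB2", "*")),
    ("A-BB-BBB", ("A", "BB2", "BBB2")),
    ("A", ("A", "*", "*")),
    ("A-AA-AAA", ("A", "AAA1", "*")),
    ("A-BB", ("A", "BB2", "*")),
    ("A-BB-BBB", ("A", "BB2", "BBB2")),
]
duplicates, conflicts = summarize_rows(rows)
```

For these nine rows, three rows occur twice and no feature tuple has conflicting labels.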

Thanks in advance.

Labels: P2, classification, question
