Why are distances calculated twice? #15

@NaserMonsefi

Hi,

Thanks for the great tutorial on document clustering. I am pretty new to text analytics and wanted to ask whether there is a reason distances are calculated twice for hierarchical document clustering.
First, here on `tfidf_matrix` using cosine distance:

from sklearn.metrics.pairwise import cosine_similarity
dist = 1 - cosine_similarity(tfidf_matrix)

and a second time here, on `dist`, through the `ward` function, which computes Euclidean distances before doing the Ward linkage:

from scipy.cluster.hierarchy import ward
linkage_matrix = ward(dist)
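
To make the question concrete, here is a minimal sketch of what I think is happening. It rests on my reading of the SciPy docs, which say that `ward` accepts either a condensed (1-D) distance matrix or a 2-D array of observation vectors. `toy_matrix` is a made-up stand-in for `tfidf_matrix`, and I use `pdist(..., metric="cosine")` plus `squareform` in place of `1 - cosine_similarity(...)` so the snippet only needs NumPy and SciPy. The variable names are mine, not the tutorial's.

```python
import numpy as np
from scipy.cluster.hierarchy import ward
from scipy.spatial.distance import pdist, squareform

# Hypothetical stand-in for tfidf_matrix: 6 "documents", 4 "terms".
rng = np.random.default_rng(0)
toy_matrix = rng.random((6, 4))

# Cosine distances in condensed (1-D) form, then expanded to a square matrix.
# The square matrix plays the role of dist = 1 - cosine_similarity(tfidf_matrix).
condensed_cosine = pdist(toy_matrix, metric="cosine")
dist = squareform(condensed_cosine)

# Passing the square matrix: each row is treated as an observation vector,
# so Euclidean distances are computed between rows of dist.
# SciPy may emit a ClusterWarning here about the input looking like a distance matrix.
linkage_square = ward(dist)

# The same thing written out explicitly.
linkage_rows = ward(pdist(dist, metric="euclidean"))
square_is_treated_as_rows = np.allclose(linkage_square, linkage_rows)

# Passing the condensed form: the cosine distances are used directly,
# with no second distance computation.
linkage_condensed = ward(condensed_cosine)
differs_from_condensed = not np.allclose(linkage_square, linkage_condensed)

print(square_is_treated_as_rows, differs_from_condensed)
```

If this reading is right, `ward(dist)` clusters documents by how similar their rows of cosine distances are, rather than by the cosine distances themselves. I am not sure which of the two the tutorial intends, and Ward linkage is usually described in terms of Euclidean distances, so the condensed variant may not be strictly appropriate either.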

Is this done specifically for text clustering?

Thanks again
