GitHub - matthewparmelee/lastfmapper: A utility for determining the similarity between top artists and tags on the last.fm music repository, as a personal experiment in large datasets and weighted metrics.

matthewparmelee / lastfmapper Public

Notifications You must be signed in to change notification settings
Fork 0
Star 0

A utility for determining the similarity between top artists and tags on the last.fm music repository, as a personal experiment in large datasets and weighted metrics.

0 stars 0 forks Branches Tags Activity

Star

Notifications

Branches Tags

Name		Name	Last commit message	Last commit date
Latest commit History 2 Commits
.gitattributes		.gitattributes
README.txt		README.txt
__init__.py		__init__.py
lastfm.py		lastfm.py
lastfm.pyc		lastfm.pyc
results.zip		results.zip
tagmetrics.py		tagmetrics.py

Repository files navigation

#################################
# Matthew Parmelee
# March 2015
# last.fm tag comparator
#################################

##### FILES

* lastfm.py - a class for interfacing with the LastFM API
* tagmetrics.py - the main execution class, performs all other functions
* __init__.py - allows inclusion of lastfm.py
* results.zip - contains data generated by tagmetrics.py
* tag_data.json - the generated dataset
* tag_list.json - a master list of all tags
* artist_list.json - a master list of all artists
* tag_results.json - resulting tag comparison data
* artist_results.json - resulting artist comparison data
* README.txt - this README

##### USAGE

At the command line, simply run:

python /path/to/this/directory/tagmetrics.py

##### OPTIONS

There are several options that can be set in tagmetrics.py. They are as follows:

* DEBUG_OUTPUT - Boolean. Controls verbosity of command-line output.
* SMALL_OUTPUT - Boolean. Removes nonexistent tag relationships from output to save space.
* COMPARE_ARTISTS - Boolean. Controls whether or not artist similarity is assessed and output.
* DATASET_FILENAME - String. Save path for the generated dataset.
* TAG_LIST_FILENAME - String. Save path for the master list of all tags.
* ARTIST_LIST_FILENAME - String. Save path for the master list of all artists.
* TAGS_FILENAME - String. Save path for final tag results.
* ARTISTS_FILENAME - String. Save path for final artist results.
* DATASET_SIZE - Integer. Number of artists for which a dataset should be generated.
* SIMILARITY_THRESHOLD - Integer. Search depth at which a relationship path should be abandoned.

##### DESIGN DECISIONS

## Dataset structure

It was anticipated that, for each artists, a majority of tags would be largely unique to that artist
alone, as the tags are user-submitted and are generally very specific. Since I wanted to approach this
as a graph search problem, determining the ultimate structure of the graph was paramount. Since the
nature of the tags implies a sparse graph, I elected to use the more suitable adjacency list over
an adjacency matrix.

## Depth-First Search

Choosing how to traverse the graph was an easier decision. Breadth-first search is useful for returning
the shortest path first with a bit of additional overhead, but it was a benefit that is not useful here
as all paths (underneath the threshold) are needed for scoring. As such, depth-first was selected.
Initially, a recursive implementation of DFS was used, before debugging presented a number of issues
that would harm the long-term maintenance of this script.

## Similarity Threshold

As mentioned, the bulk of tags are fully unique and determining the path to/from them to other tags
is either tedious or impossible. As such, a search threshold was introduced to allow the user to
select a point at which a search should be abandoned. Beyond a certain point the score of a path length
approaches zero, so it is beneficial to allow this option to save massive amounts of runtime.

## Output Size

Upon initial runs of the script, a small list of one hundred artists generated a results file that was
about 150MB. While, for modern machines, this is hardly an issue, it bothered me just enough that there
is now an option for SMALL_OUTPUT, which removes all zero-score tag relationships from the results which,
given the nature of tags, occupies the vast majority of space in the resulting data.

##### RESULTS

For the sake of time, test runs were performed with sets of 100 artists that took an average of eight minutes
to complete.

The results appear to be very coherent, and seem to provide useful data for large datasets given a reasonable
amount of runtime. I find the ratio of lines of code to runtime and output accuracy to be very satisfactory,
though I acknowledge a great deal of fine-tuning could likely increase performance overall.

##### GOING FORWARD

By far my largest concern is the runtime. I am continuing to look into runtime reduction strategies
such as mapreduce and threading, but in the meantime the results are sound despite the poorly-scaling
execution time. This will be the first tweak in future revisions.

This script was designed to be fairly generic, so much so that adding functionality to compare artists instead
of tags was fully trivial. As such, I feel like the application of the script to other datasets would be
informative as well as straightforward for someone unfamiliar with the code.

With regard to the generated data, the comparison data for both tags and artists is incredibly useful to
LastFM and services like it which derive a great deal of there userbase and income from an accurate recommendation
engine. While I am doubtful my solution is quite as robust as theirs, it does demonstrate the need for thinking
about data in new ways and produces meaningful data that draws new connections from seemingly unrelated points.

About

A utility for determining the similarity between top artists and tags on the last.fm music repository, as a personal experiment in large datasets and weighted metrics.

Readme

Activity

0 stars

1 watching

0 forks

Report repository