Skip to content

A utility for determining the similarity between top artists and tags on the last.fm music repository, as a personal experiment in large datasets and weighted metrics.

Notifications You must be signed in to change notification settings

matthewparmelee/lastfmapper

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

2 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

#################################
# Matthew Parmelee
# March 2015
# last.fm tag comparator
#################################

##### FILES

  * lastfm.py		    - a class for interfacing with the LastFM API
  * tagmetrics.py       - the main execution class, performs all other functions
  * __init__.py         - allows inclusion of lastfm.py
  * results.zip         - contains data generated by tagmetrics.py
    * tag_data.json       - the generated dataset
    * tag_list.json       - a master list of all tags
    * artist_list.json    - a master list of all artists
    * tag_results.json    - resulting tag comparison data
    * artist_results.json - resulting artist comparison data
  * README.txt		    - this README

##### USAGE

  At the command line, simply run:

  	python /path/to/this/directory/tagmetrics.py

##### OPTIONS

  There are several options that can be set in tagmetrics.py. They are as follows:

    * DEBUG_OUTPUT			- Boolean. Controls verbosity of command-line output.
	* SMALL_OUTPUT			- Boolean. Removes nonexistent tag relationships from output to save space.
	* COMPARE_ARTISTS		- Boolean. Controls whether or not artist similarity is assessed and output.
	* DATASET_FILENAME		- String. Save path for the generated dataset.
	* TAG_LIST_FILENAME		- String. Save path for the master list of all tags.
	* ARTIST_LIST_FILENAME  - String. Save path for the master list of all artists.
	* TAGS_FILENAME         - String. Save path for final tag results.
	* ARTISTS_FILENAME      - String. Save path for final artist results.
	* DATASET_SIZE          - Integer. Number of artists for which a dataset should be generated.
	* SIMILARITY_THRESHOLD  - Integer. Search depth at which a relationship path should be abandoned.

##### DESIGN DECISIONS

## Dataset structure

  It was anticipated that, for each artists, a majority of tags would be largely unique to that artist
  alone, as the tags are user-submitted and are generally very specific. Since I wanted to approach this
  as a graph search problem, determining the ultimate structure of the graph was paramount. Since the
  nature of the tags implies a sparse graph, I elected to use the more suitable adjacency list over
  an adjacency matrix.

## Depth-First Search

  Choosing how to traverse the graph was an easier decision. Breadth-first search is useful for returning
  the shortest path first with a bit of additional overhead, but it was a benefit that is not useful here
  as all paths (underneath the threshold) are needed for scoring. As such, depth-first was selected.
  Initially, a recursive implementation of DFS was used, before debugging presented a number of issues
  that would harm the long-term maintenance of this script.

## Similarity Threshold

  As mentioned, the bulk of tags are fully unique and determining the path to/from them to other tags
  is either tedious or impossible. As such, a search threshold was introduced to allow the user to
  select a point at which a search should be abandoned. Beyond a certain point the score of a path length
  approaches zero, so it is beneficial to allow this option to save massive amounts of runtime.

## Output Size

  Upon initial runs of the script, a small list of one hundred artists generated a results file that was
  about 150MB. While, for modern machines, this is hardly an issue, it bothered me just enough that there
  is now an option for SMALL_OUTPUT, which removes all zero-score tag relationships from the results which,
  given the nature of tags, occupies the vast majority of space in the resulting data.

##### RESULTS

  For the sake of time, test runs were performed with sets of 100 artists that took an average of eight minutes
  to complete.

  The results appear to be very coherent, and seem to provide useful data for large datasets given a reasonable
  amount of runtime. I find the ratio of lines of code to runtime and output accuracy to be very satisfactory,
  though I acknowledge a great deal of fine-tuning could likely increase performance overall.

##### GOING FORWARD

  By far my largest concern is the runtime. I am continuing to look into runtime reduction strategies
  such as mapreduce and threading, but in the meantime the results are sound despite the poorly-scaling
  execution time. This will be the first tweak in future revisions.

  This script was designed to be fairly generic, so much so that adding functionality to compare artists instead
  of tags was fully trivial. As such, I feel like the application of the script to other datasets would be
  informative as well as straightforward for someone unfamiliar with the code.

  With regard to the generated data, the comparison data for both tags and artists is incredibly useful to
  LastFM and services like it which derive a great deal of there userbase and income from an accurate recommendation
  engine. While I am doubtful my solution is quite as robust as theirs, it does demonstrate the need for thinking
  about data in new ways and produces meaningful data that draws new connections from seemingly unrelated points.

About

A utility for determining the similarity between top artists and tags on the last.fm music repository, as a personal experiment in large datasets and weighted metrics.

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published

Languages