-
Notifications
You must be signed in to change notification settings - Fork 0
matthewparmelee/lastfmapper
Folders and files
Name | Name | Last commit message | Last commit date | |
---|---|---|---|---|
Repository files navigation
################################# # Matthew Parmelee # March 2015 # last.fm tag comparator ################################# ##### FILES * lastfm.py - a class for interfacing with the LastFM API * tagmetrics.py - the main execution class, performs all other functions * __init__.py - allows inclusion of lastfm.py * results.zip - contains data generated by tagmetrics.py * tag_data.json - the generated dataset * tag_list.json - a master list of all tags * artist_list.json - a master list of all artists * tag_results.json - resulting tag comparison data * artist_results.json - resulting artist comparison data * README.txt - this README ##### USAGE At the command line, simply run: python /path/to/this/directory/tagmetrics.py ##### OPTIONS There are several options that can be set in tagmetrics.py. They are as follows: * DEBUG_OUTPUT - Boolean. Controls verbosity of command-line output. * SMALL_OUTPUT - Boolean. Removes nonexistent tag relationships from output to save space. * COMPARE_ARTISTS - Boolean. Controls whether or not artist similarity is assessed and output. * DATASET_FILENAME - String. Save path for the generated dataset. * TAG_LIST_FILENAME - String. Save path for the master list of all tags. * ARTIST_LIST_FILENAME - String. Save path for the master list of all artists. * TAGS_FILENAME - String. Save path for final tag results. * ARTISTS_FILENAME - String. Save path for final artist results. * DATASET_SIZE - Integer. Number of artists for which a dataset should be generated. * SIMILARITY_THRESHOLD - Integer. Search depth at which a relationship path should be abandoned. ##### DESIGN DECISIONS ## Dataset structure It was anticipated that, for each artists, a majority of tags would be largely unique to that artist alone, as the tags are user-submitted and are generally very specific. Since I wanted to approach this as a graph search problem, determining the ultimate structure of the graph was paramount. Since the nature of the tags implies a sparse graph, I elected to use the more suitable adjacency list over an adjacency matrix. ## Depth-First Search Choosing how to traverse the graph was an easier decision. Breadth-first search is useful for returning the shortest path first with a bit of additional overhead, but it was a benefit that is not useful here as all paths (underneath the threshold) are needed for scoring. As such, depth-first was selected. Initially, a recursive implementation of DFS was used, before debugging presented a number of issues that would harm the long-term maintenance of this script. ## Similarity Threshold As mentioned, the bulk of tags are fully unique and determining the path to/from them to other tags is either tedious or impossible. As such, a search threshold was introduced to allow the user to select a point at which a search should be abandoned. Beyond a certain point the score of a path length approaches zero, so it is beneficial to allow this option to save massive amounts of runtime. ## Output Size Upon initial runs of the script, a small list of one hundred artists generated a results file that was about 150MB. While, for modern machines, this is hardly an issue, it bothered me just enough that there is now an option for SMALL_OUTPUT, which removes all zero-score tag relationships from the results which, given the nature of tags, occupies the vast majority of space in the resulting data. ##### RESULTS For the sake of time, test runs were performed with sets of 100 artists that took an average of eight minutes to complete. The results appear to be very coherent, and seem to provide useful data for large datasets given a reasonable amount of runtime. I find the ratio of lines of code to runtime and output accuracy to be very satisfactory, though I acknowledge a great deal of fine-tuning could likely increase performance overall. ##### GOING FORWARD By far my largest concern is the runtime. I am continuing to look into runtime reduction strategies such as mapreduce and threading, but in the meantime the results are sound despite the poorly-scaling execution time. This will be the first tweak in future revisions. This script was designed to be fairly generic, so much so that adding functionality to compare artists instead of tags was fully trivial. As such, I feel like the application of the script to other datasets would be informative as well as straightforward for someone unfamiliar with the code. With regard to the generated data, the comparison data for both tags and artists is incredibly useful to LastFM and services like it which derive a great deal of there userbase and income from an accurate recommendation engine. While I am doubtful my solution is quite as robust as theirs, it does demonstrate the need for thinking about data in new ways and produces meaningful data that draws new connections from seemingly unrelated points.
About
A utility for determining the similarity between top artists and tags on the last.fm music repository, as a personal experiment in large datasets and weighted metrics.
Resources
Stars
Watchers
Forks
Releases
No releases published
Packages 0
No packages published