This week I have been working constantly: analysing my test results from the initial black-box user testing, finalising my report, and re-training my model for the 2nd iteration of black-box testing, for which I have now started collecting data. I also had an interview for a PhD entitled Music Discovery for Data Science and Recommendation at Queen Mary, University of London – a busy and productive week!
In our meeting Mick and I discussed my draft of the report and he mentioned that it is looking like it could be a publishable paper, which is good. He supported the idea of doing a 2nd iteration of testing with the refined / re-trained model.
Thank you to anyone who has been keeping up with these blogs! Some time after I have handed in the report on Tuesday, I hope to port my web app (with completed Spotify mappings) to my website so that people can have a go at discovering music through their favourite artists.
Until next time – keep on discovering.
This week I got the web app working for Black-box testing.
I asked users to fill out feedback on 3 recommendations (although half of them didn’t do all 3!). These were a mixture of friends and family, and I also paid for 12 tests from Amazon’s MTurk to gather a good dataset. I received a load of data from the Google Docs questionnaires and I’m now processing and analysing it in Pandas/Python.
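Since the questionnaire export is just tabular data, a minimal sketch of that processing step might look like this (the column names and ratings below are entirely hypothetical, simulating a CSV export of the responses):

```python
import io

import pandas as pd

# Hypothetical export of the questionnaire responses (made-up columns/values)
csv_data = io.StringIO(
    "user,rec_1_rating,rec_2_rating,rec_3_rating\n"
    "a,4,5,\n"          # this user skipped the 3rd recommendation
    "b,3,4,5\n"
    "c,5,,\n"
)

df = pd.read_csv(csv_data)
ratings = df[["rec_1_rating", "rec_2_rating", "rec_3_rating"]]

# Mean rating per recommendation slot, ignoring missing answers
means = ratings.mean()
# Fraction of users who actually rated all 3 recommendations
completion = ratings.notna().all(axis=1).mean()
```

Pandas skips the NaNs from the skipped questions by default, which is exactly what you want when half the users didn’t do all 3.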
I’ve also been working a lot on the report, aiming for ~14k words!
More next week if I have time!
This week I have succeeded in ‘linking’ my databases, and I’m very close to the black-box user testing phase which I have been striving towards and eagerly anticipating.
This was a challenging process: it involved creating multiple Python dictionaries using the mappings from music releases on CB to Spotify album URIs, then filtering by the MBIDs that were exclusive to the corpus used in my Doc2Vec model. Then I queried the batch of respective URIs against the Spotify API, filled a dictionary with the artist names, and finally de-duplicated them and sorted them alphabetically!
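As a toy illustration of that linking step (all MBIDs, URIs and artist names below are made up), the dictionary juggling boils down to a filter, a lookup, and a de-duplicating sort:

```python
# Hypothetical MBID -> Spotify album URI mapping (from the dump)
mbid_to_uri = {
    "mbid-1": "spotify:album:AAA",
    "mbid-2": "spotify:album:BBB",
    "mbid-3": "spotify:album:CCC",
}

# MBIDs that have a review vector in the Doc2Vec corpus
corpus_mbids = {"mbid-1", "mbid-3"}

# Keep only releases that exist in the model's corpus
linked = {m: u for m, u in mbid_to_uri.items() if m in corpus_mbids}

# Stand-in for the Spotify API lookup of album -> artist name
uri_to_artist = {"spotify:album:AAA": "Radiohead", "spotify:album:CCC": "Björk"}

# Distinct artist names, alphabetically ordered
artists = sorted({uri_to_artist[u] for u in linked.values()})
```

The set comprehension handles the “distinct values” part, and `sorted` the alphabetical ordering.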
What remains: list the artists in HTML, get the pickled Doc2Vec model working (causing me a problem at the moment), and serve the recommendations via nearest neighbours by cosine similarity to the seed artist’s (perhaps) most popular release.
It’s doubtful I will be able to do another iteration of user testing in the 2 weeks that I have left, but there are a couple of tasks that would make this good project a great one:
- Use the Spotify seed recommendations and do a ‘SoTA’ comparison.
- Implement another dimension for the recommendations by use of other MusicBrainz meta/acoustic data and do some white-box testing.
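For the ‘SoTA’ comparison, Spotify exposes its own recommendations endpoint, which takes seed artist IDs. A sketch of building the request URL (the seed ID below is a placeholder; the actual call also needs an OAuth token in an Authorization header):

```python
import urllib.parse


def recommendations_url(seed_artist_id: str, limit: int = 3) -> str:
    # Spotify's GET /v1/recommendations accepts comma-separated seed artist IDs
    params = urllib.parse.urlencode({"seed_artists": seed_artist_id, "limit": limit})
    return f"https://api.spotify.com/v1/recommendations?{params}"


url = recommendations_url("placeholder_artist_id")
```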
(Much) more progress next week!
This week I’ve been on the same task as last, working away in a seemingly perpetual manner. First, some good news though: Francesco and I just submitted our paper for ANTS 2018. The next research goal is for this project to be submitted to SAAM 2018!
So what’s been so difficult? Why?
I joined the MusicBrainz development IRC at the beginning of the week, although I ended up getting banned for the first day due to connection failures (tip: use the web client rather than the Mac OS X one!). I’ve been regularly trying to hack into the database to get artist names and Spotify URIs! Luckily, the helpful ‘iliekcomputers’ in the IRC managed to send me a dump of all the available MBID to Spotify URI mappings for releases! But after trying so hard to access the PostgreSQL MusicBrainz db via a virtual machine, I’ve moved on to a different strategy…
My goals remain the same, but my projected workflow is now a little different:
- Doc from corpus MBID,
- -> Spotify URI,
- -> list all distinct artists from the Spotify API/db by means of curl (btw, if you’re trying to get access tokens you need to register your app, then Base64-encode the client ID and secret as the combination <clientid:clientsecret>) and the syntactically strange pyjq to get only the artist ‘name’ field,
- -> list artists by order of popularity to user,
- -> upon selection get the most popular release,
- -> get and serve recommendation from nearest neighbour in the Doc2Vec cosine similarity neighbourhood(s).
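That access token step trips a lot of people up, so here is a sketch of the same client-credentials flow in Python rather than curl (the client ID and secret are placeholders; only `auth_header` runs without network access):

```python
import base64
import json
import urllib.request


def auth_header(client_id: str, client_secret: str) -> dict:
    # Spotify's client-credentials flow wants Base64("<clientid>:<clientsecret>")
    token = base64.b64encode(f"{client_id}:{client_secret}".encode()).decode()
    return {"Authorization": f"Basic {token}"}


def get_access_token(client_id: str, client_secret: str) -> str:
    # POST to the token endpoint; the response JSON carries 'access_token'
    req = urllib.request.Request(
        "https://accounts.spotify.com/api/token",
        data=b"grant_type=client_credentials",
        headers=auth_header(client_id, client_secret),
        method="POST",
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["access_token"]
```

The returned token then goes into a `Bearer` Authorization header on the actual API queries.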
Onwards and upwards, until next time, keep on hacking!
I’ve spent a lot of time over the past couple of weeks figuring out the best way to access a select set of MusicBrainz metadata fields for the set of releases which I have encapsulated in my Doc2Vec model. As this is my first ‘Big Data’ project, it’s inevitable that I will make some mistakes figuring out the best development process. For example:
On my external hard drive I have a data dump of ‘Releases’ which I uncompressed into a 100 GB JSON file. Obviously, this is beyond the RAM of every high-end laptop I can think of (I have 16 GB of RAM available on my MacBook Pro). This led me to thinking about how I can parse the JSON using a buffer, which I think I have managed to do (following https://stackoverflow.com/questions/21708192/how-do-i-use-the-json-module-to-read-in-one-json-object-at-a-time).
However, when I try to access select fields, the JSON object doesn’t read anything beyond the ID, and when I try to use exclusively the ‘id’ field, I still get a memory error…
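If the releases dump turns out to be newline-delimited JSON (one object per line – an assumption worth checking against the actual dump format), a streaming pass that keeps only the wanted fields never holds the 100 GB in RAM. The titles below are stand-ins:

```python
import io
import json

# Stand-in for the releases dump; in reality, open the file on disk
dump = io.StringIO(
    '{"id": "mbid-1", "title": "Album One", "artist-credit": []}\n'
    '{"id": "mbid-2", "title": "Album Two", "artist-credit": []}\n'
)

wanted_fields = ("id", "title")
records = []
for line in dump:  # one release object per line; never the whole file at once
    release = json.loads(line)
    # Keep only the fields we care about; the rest is garbage-collected
    records.append({k: release[k] for k in wanted_fields})
```

The key point is that each parsed object is discarded before the next line is read, so memory use stays proportional to one record, not the whole dump.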
Thoughts on the project’s framework so far:
Is pickling the Doc2Vec model really the best way to do things? For longevity, expansion and linking of the data it could be more beneficial to use some kind of database – is this realistic with latent vector spaces though? Perhaps not…
Page 1) Upon entry to the website… please select 3 artists you listen to from this list (where the list is retrieved from the UNIQUE artists in the intersection of releases that have a vectorised review in my model and the MusicBrainz release data dump).
Page 2) Pre-study survey.
Page 3) Serve Recommendations.
Page 4) Post-study survey for data collection.
Serve recommendations based on this schema. YouTube embeddings are going to be difficult – I will look into using the Spotify API to embed audio for the recommendations.
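One option that avoids YouTube entirely: Spotify serves an embeddable player from open.spotify.com, so the web app could render an iframe per recommended track. A sketch (the width/height values are arbitrary, and the track ID is a placeholder):

```python
def spotify_embed(track_id: str) -> str:
    # Spotify's embeddable player lives at open.spotify.com/embed/...
    return (
        f'<iframe src="https://open.spotify.com/embed/track/{track_id}" '
        'width="300" height="80" frameborder="0" '
        'allow="encrypted-media"></iframe>'
    )


html = spotify_embed("placeholder_track_id")
```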
More next time!
Over the past 2 weeks there has been much industrial strike action at my university. As a result, I have missed a weekly meeting with my supervisor Mick, and have spent a lot of time participating in the strike action by meeting and standing in solidarity with my lecturers, and discussing with my peers how best we can help the situation… I have also been unwell for a few days. Therefore, here I have summarised (consequently under) 2 weeks of work into Chapter 7 of my Major Project blog.
In this time I have been designing, testing from a technical perspective, and implementing the next stage of my successful (at least in my opinion) music recommendation system. In our meeting, Mick and I discussed the ways in which I can best user test the Doc2vec based MRS, discussing Psychology informed methods and Computing ones; we came to the conclusion that I should ‘pickle’ the Doc2vec model, and create a web app around the Python Flask framework that will serve the recommendations, and collect data from survey and user testing.
My plan of action was mostly informed by this useful book.
I have so far successfully pickled my Doc2Vec model ‘object’, implemented a simple Flask web app whereby a user can enter some data via a form, and proposed some initial pre-study survey questions to Mick, which are as follows (to be answered on a 1–5 Likert scale):
- How important is music to you?
- How often do you actively search for new music?
- How often do you discover music through being recommended tracks/releases?
- How eclectic (diverse) would you say your music taste is?
- Has your musical taste changed since two years ago?
- When you put music on, how deeply do you listen?
These do initially form a good basis for assessing a user’s background, although they need to be expanded upon.
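As a sketch of where this is heading – a minimal Flask route that renders two of the questions above as a form and reads the submitted Likert answers (the route name and markup are illustrative; in the real app the answers would go into SQLite):

```python
from flask import Flask, request

app = Flask(__name__)

# Two of the proposed pre-study questions, answered on a 1-5 Likert scale
QUESTIONS = [
    "How important is music to you?",
    "How often do you actively search for new music?",
]


@app.route("/survey", methods=["GET", "POST"])
def survey():
    if request.method == "POST":
        # Collect the submitted Likert ratings, keyed by question text
        answers = {q: request.form.get(f"q{i}") for i, q in enumerate(QUESTIONS)}
        return f"Thanks! Recorded {len(answers)} answers."
    inputs = "".join(
        f"<p>{q}<input type='number' name='q{i}' min='1' max='5'></p>"
        for i, q in enumerate(QUESTIONS)
    )
    return f"<form method='post'>{inputs}<button>Submit</button></form>"
```

Flask’s built-in test client makes it easy to exercise both branches without running a server.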
One of the biggest obstacles I’m overcoming here is the web app development learning curve; while I know some web development and am very willing to learn, I have little prior experience working with external servers and SQLite, which I am learning to use in order to store user data and integrate MusicBrainz data into the web app.
Next time, could it be, an (almost) completed web app? One that serves great recommendations and receives useful and informative data from user testing? Tune in next time to find out!