Major Project Chapter 6 – Some Promising Results

Having successfully trained my Doc2vec model (thanks to this blog post) on a corpus of 555 reviews, I have some promising results. For example, if I had been listening to 1: . It would have recommended 2: as the most cosine-similar review, which I think is a nice recommendation.
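To make "most cosine similar" concrete, here is a minimal sketch of the ranking step. It assumes each review has already been turned into a vector by the trained model; the three-dimensional vectors and the review names below are made up for illustration, and the helper names `cosine_similarity` and `most_similar` are my own rather than part of any library.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # treat an all-zero vector as similar to nothing
    return dot / (norm_a * norm_b)


def most_similar(query_id, vectors, top_n=1):
    """Rank every other review by cosine similarity to the query review."""
    query = vectors[query_id]
    scores = [
        (review_id, cosine_similarity(query, vector))
        for review_id, vector in vectors.items()
        if review_id != query_id
    ]
    scores.sort(key=lambda pair: pair[1], reverse=True)
    return scores[:top_n]


# Toy vectors standing in for real Doc2vec output (hypothetical values).
toy_vectors = {
    "review_a": [0.9, 0.1, 0.0],
    "review_b": [0.8, 0.2, 0.1],
    "review_c": [0.0, 0.1, 0.9],
}

recommendations = most_similar("review_a", toy_vectors)
print(recommendations)
```

With real document vectors the idea is the same: the review whose vector points in the closest direction to the one being listened to comes out on top.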

Comparing different examples, I think the model is only vectorising the first paragraph of each review, so next I need to concatenate the paragraphs. It also has a lovely ability to pick up on sentiment. If we look at the first paragraphs from the prior example:
This is all about Dvorak’s chamber music, and if you doubt me, just listen to the slow movement of the Cello Concerto: Jean-Guihen Queyras and the Prague Philharmonia in rapt communication…we’re just eavesdroppers on an intimate conversation. In the outer movements Queyras’s instinctive musicianship and ravishingly beautiful tone are bewitching, and Belohlavek follows every imaginative twist and turn in his soloist’s phrasing; they’re glued together like the finest ballroom dancers.
Long-term fans of Sigur Ros could be forgiven for feeling a little nervous about Jonsi Birgisson’s new project. The quartet he usually fronts possess many admirable qualities, but their international success owes much to a mystique greatly enhanced by lyrics that are gobbledegook to most. Singing in Icelandic has the useful effect of making you sound like the house band from a science fiction film, an in-built benefit Birgisson has decided to eschew with his first solo record, which is sung mostly in English. If the Sigur Ros spell is ever to be broken, this might be the moment.

You can see here that these paragraphs are imaginative and descriptive.

Perhaps the most (surprisingly) challenging task was extracting the review text and exporting it in batch to separate .txt training files. I tried a piece of software but it wanted me to pay for it, so I thought I’d have a go at writing my own Python script. Many thanks to this blog post, which let me implement it semi-successfully. My code so far works in batch per folder, but does not yet iterate through folders and sub-folders, although by my logic it should. Part of the challenge is dealing with some fairly odd hexadecimal directory organisation.

Things to fix:

  1. Get script to batch iterate through all review directories within the JSON dump(s).
  2. Concatenate multiple paragraphs.
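Both fixes could be sketched roughly as follows. This is not my current script, just one way it might look: `os.walk` visits every folder and sub-folder regardless of how the directories are named, and the paragraphs are joined before writing. The `"paragraphs"` key is a hypothetical stand-in for wherever the review text actually lives in the JSON, and the function names are my own.

```python
import json
import os


def collect_review_text(review):
    """Join all paragraphs of one review into a single string."""
    # "paragraphs" is a hypothetical key; adjust to the real JSON layout.
    paragraphs = review.get("paragraphs", [])
    return "\n\n".join(p.strip() for p in paragraphs if p.strip())


def export_reviews(dump_root, output_dir):
    """Walk every sub-folder of dump_root and write one .txt per JSON review."""
    os.makedirs(output_dir, exist_ok=True)
    written = []
    for folder, _subfolders, filenames in os.walk(dump_root):
        for name in sorted(filenames):
            if not name.endswith(".json"):
                continue
            source_path = os.path.join(folder, name)
            with open(source_path, encoding="utf-8") as handle:
                review = json.load(handle)
            text = collect_review_text(review)
            if not text:
                continue  # skip reviews with no usable text
            # Build the output name from the relative path so that files with
            # the same name in different folders do not overwrite each other.
            relative = os.path.relpath(source_path, dump_root)
            stem = os.path.splitext(relative)[0].replace(os.sep, "_")
            target_path = os.path.join(output_dir, stem + ".txt")
            with open(target_path, "w", encoding="utf-8") as handle:
                handle.write(text)
            written.append(target_path)
    return written
```

Calling `export_reviews` with the root of the dump and an output folder returns the list of .txt files it wrote, which makes it easy to check that no directory was skipped.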

Thoughts about text summarisation

Mick and I have been discussing text summarisation, as previously mentioned. Having done some research, I think it will not be necessary: key words such as the names of artists and tracks will most likely not survive the summarisation, and these are very important in giving the model a Collaborative Filtering type of influence. I am already doing some cleaning with the NLTK toolkit before training my Doc2vec model, so I think it would be more useful to optimise this instead (which, in a way, is a form of text summarisation).
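As a rough illustration of why cleaning suits this better than summarising, here is a small sketch of stop-word removal. It is not my actual NLTK pipeline: the tiny `STOP_WORDS` set is a hand-written stand-in for a proper stop-word list such as the one NLTK provides, and `clean_review` is a name I made up. The point is that filler words go while artist and ensemble names stay.

```python
import re

# Hand-written stand-in for a real stop-word list (an assumption for this sketch).
STOP_WORDS = {"a", "an", "and", "the", "of", "to", "in", "is", "are", "this"}


def clean_review(text):
    """Lower-case and tokenise a review, then drop stop words."""
    tokens = re.findall(r"[\w']+", text.lower())
    return [token for token in tokens if token not in STOP_WORDS]


sample = "Jean-Guihen Queyras and the Prague Philharmonia in rapt communication"
cleaned = clean_review(sample)
print(cleaned)
```

The names from the sample sentence all survive the cleaning, whereas a summariser gives no such guarantee.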

Thanks for reading, more progress next week!

