Note: there are two versions of the LSA exercise:
- The first one (“numpy”) is more programming-oriented
- The second one (“gensim”) is more “usage”-oriented, more descriptive
Exercise LSA version 1 (numpy)
- Write a Python script to build a term-document matrix on MR500train
- Build a LSA model
- Use numpy.linalg.svd
- How much time did it take?
- Check that “cat” and “dog” are closer to each other than either is to “talker”
- (Opt.) Evaluate on the MEN word similarity task (https://aclweb.org/aclwiki/MEN_Test_Collection_(State_of_the_art) )
- Use scipy.stats.spearmanr()
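The version-1 steps above can be sketched as follows. The three toy documents are placeholders for the MR500train reviews (the loading code is left out, since the file format is not specified here):

```python
import numpy as np
from collections import Counter

# Placeholder documents standing in for the MR500train reviews
docs = [
    "the cat sat on the mat",
    "the dog chased the cat",
    "a fast talker convinced the crowd",
]

# Build the vocabulary and the term-document count matrix
vocab = sorted({w for d in docs for w in d.split()})
index = {w: i for i, w in enumerate(vocab)}
X = np.zeros((len(vocab), len(docs)))
for j, d in enumerate(docs):
    for w, c in Counter(d.split()).items():
        X[index[w], j] = c

# Truncated SVD: keep the k largest singular values
U, s, Vt = np.linalg.svd(X, full_matrices=False)
k = 2
word_vecs = U[:, :k] * s[:k]  # one k-dimensional vector per word

def cos(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

print(cos(word_vecs[index["cat"]], word_vecs[index["dog"]]))
# For the optional MEN evaluation, compare model similarities with the
# human scores using: from scipy.stats import spearmanr
```

On the real corpus, time the numpy.linalg.svd call itself, since it dominates the runtime once the matrix is large.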
Exercise LSA version 2 (gensim)
- Load the MR500train data into a list of strings, one string per review, without the sentiment label
- Tokenize the lines with nltk.tokenize.word_tokenize()
Retrieve and remove stop words with:
  from nltk.corpus import stopwords
  en_stop = set(stopwords.words('english'))
Apply stemming with:
  from nltk.stem.porter import PorterStemmer
  p_stemmer = PorterStemmer()
  stem = p_stemmer.stem(word)
Build the term-document matrix with gensim (doc_clean is the list of preprocessed token lists):
  from gensim import corpora
  dictionary = corpora.Dictionary(doc_clean)
  doc_term_matrix = [dictionary.doc2bow(doc) for doc in doc_clean]
Train an LSA model with gensim:
  from gensim.models import LsiModel
  lsamodel = LsiModel(doc_term_matrix, num_topics=number_of_topics, id2word=dictionary)
  print(lsamodel.print_topics(num_topics=number_of_topics, num_words=words))
Look at the gensim documentation about “corpora.Dictionary” and “LsiModel.get_topics()” to compute the cosine distance between the topic vectors of “cat” vs. “dog”, and of “cat” vs. “foul”
Tutorial adapted from: https://www.datacamp.com/community/tutorials/discovering-hidden-topics-python