Note: there are two versions of the LSA exercise:
- The first one (“numpy”) is more programming-oriented
- The second one (“gensim”) is more “usage”-oriented, more descriptive
Exercise LSA version 1 (numpy)
- Write a Python script to build a term-document matrix on MR500train
- Build a LSA model
- Use numpy.linalg.svd
- How much time did it take?
- Check that “cat” and “dog” are closer to each other than either is to “talker”
- (Opt.) Evaluate on the MEN word similarity task (https://aclweb.org/aclwiki/MEN_Test_Collection_(State_of_the_art) )
- Use scipy.stats.spearmanr()
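The version-1 steps above can be sketched as follows. The three toy documents are placeholders for the MR500train reviews (the loading code is left out, since the file format is not specified here):

```python
import numpy as np
from collections import Counter

# Placeholder documents standing in for the MR500train reviews
docs = [
    "the cat sat on the mat",
    "the dog chased the cat",
    "a fast talker convinced the crowd",
]

# Build the vocabulary and the term-document count matrix
vocab = sorted({w for d in docs for w in d.split()})
index = {w: i for i, w in enumerate(vocab)}
X = np.zeros((len(vocab), len(docs)))
for j, d in enumerate(docs):
    for w, c in Counter(d.split()).items():
        X[index[w], j] = c

# Truncated SVD: keep the k largest singular values
U, s, Vt = np.linalg.svd(X, full_matrices=False)
k = 2
word_vecs = U[:, :k] * s[:k]  # one k-dimensional vector per word

def cos(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

print(cos(word_vecs[index["cat"]], word_vecs[index["dog"]]))
# For the optional MEN evaluation, compare model similarities with the
# human scores using: from scipy.stats import spearmanr
```

On the real corpus, time the numpy.linalg.svd call itself, since it dominates the runtime once the matrix is large.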
Exercise LSA version 2 (gensim)
- Load the MR500train data into a list of strings, one string per review, without the sentiment label
- Tokenize the lines with nltk.tokenize.word_tokenize()
Retrieve and remove stop words with:
  from nltk.corpus import stopwords
  en_stop = set(stopwords.words('english'))
Apply stemming with:
  from nltk.stem.porter import PorterStemmer
  p_stemmer = PorterStemmer()
  stem = p_stemmer.stem(word)
Build the term-document matrix with gensim (doc_clean is the list of preprocessed token lists):
  from gensim import corpora
  dictionary = corpora.Dictionary(doc_clean)
  doc_term_matrix = [dictionary.doc2bow(doc) for doc in doc_clean]
Train an LSA model with gensim:
  from gensim.models import LsiModel
  lsamodel = LsiModel(doc_term_matrix, num_topics=number_of_topics, id2word=dictionary)
  print(lsamodel.print_topics(num_topics=number_of_topics, num_words=words))
Look at the gensim documentation about “corpora.Dictionary” and “LsiModel.get_topics()” to compute the cosine distance between the topic vectors of “cat” vs. “dog”, and of “cat” vs. “foul”
Tutorial adapted from: https://www.datacamp.com/community/tutorials/discovering-hidden-topics-python