LSA Exercise


Note: there are two versions of the LSA exercise:

  • The first one (“numpy”) is more programming-oriented
  • The second one (“gensim”) is more “usage”-oriented, more descriptive

LSA exercise, version 1 (numpy)

  • Write a Python script to build a term-document matrix on MR500train
  • Build a LSA model
    • Use numpy.linalg.svd
    • How much time did it take?
    • Check that “cat” and “dog” are closer to each other than “cat” is to “talker”
  • (Opt.) Evaluate on the MEN word similarity task (https://aclweb.org/aclwiki/MEN_Test_Collection_(State_of_the_art) )
    • Use scipy.stats.spearmanr()
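The steps above can be sketched as follows. A tiny stand-in corpus is used here instead of MR500train (loading that data is left to the exercise); the vocabulary, the number of latent dimensions k, and the cosine helper are illustrative choices, not part of the assignment statement.

```python
import time
import numpy as np

# Stand-in corpus; the exercise uses MR500train instead.
docs = [
    "the cat sat on the mat",
    "the dog sat on the log",
    "a talker talked to the talker",
]

# Build the term-document count matrix: one row per word, one column per document.
vocab = sorted({w for d in docs for w in d.split()})
index = {w: i for i, w in enumerate(vocab)}
M = np.zeros((len(vocab), len(docs)))
for j, d in enumerate(docs):
    for w in d.split():
        M[index[w], j] += 1

# SVD of the term-document matrix, timing it as the exercise asks.
t0 = time.time()
U, s, Vt = np.linalg.svd(M, full_matrices=False)
print(f"SVD took {time.time() - t0:.3f}s")

# Keep k latent dimensions; rows of U[:, :k] * s[:k] are word vectors.
k = 2
word_vecs = U[:, :k] * s[:k]

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

print("cat vs dog:   ", cosine(word_vecs[index["cat"]], word_vecs[index["dog"]]))
print("cat vs talker:", cosine(word_vecs[index["cat"]], word_vecs[index["talker"]]))
```

For the optional MEN evaluation, the same cosine scores over MEN word pairs can be compared to the human judgments with scipy.stats.spearmanr().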

LSA exercise, version 2 (gensim)

  • Load the MR500train data into a list of strings, one string per review, without the sentiment label
  • Tokenize the lines with nltk.tokenize.word_tokenize()
  • Retrieve and remove stop words with:

    from nltk.corpus import stopwords
    en_stop = set(stopwords.words('english'))
  • Apply stemming, with:

    from nltk.stem.porter import PorterStemmer
    p_stemmer = PorterStemmer()
    stem = p_stemmer.stem(word)
  • Build the term-document matrix with gensim:

    from gensim import corpora
    dictionary = corpora.Dictionary(doc_clean)
    doc_term_matrix = [dictionary.doc2bow(doc) for doc in doc_clean]
  • Train an LSA model with gensim:

    from gensim.models import LsiModel
    lsamodel = LsiModel(doc_term_matrix, num_topics=number_of_topics, id2word=dictionary)
    print(lsamodel.print_topics(num_topics=number_of_topics, num_words=number_of_words))
  • Using the gensim documentation for “corpora.Dictionary” and “LsiModel.get_topics()”, compute the cosine distance between the topic vectors of “cat” vs. “dog”, and of “cat” vs. “foul”


Tutorial adapted from: https://www.datacamp.com/community/tutorials/discovering-hidden-topics-python

See also