Solved: gensim Error on Word2vec training on tweets

Hi,

I am trying to train word2vec embeddings on tweets. I defined the sentence generator as follows:

import codecs

from nltk.corpus import stopwords
from nltk.tokenize import TweetTokenizer

tknz = TweetTokenizer()  # assuming tknz is NLTK's TweetTokenizer
stop_words = set(stopwords.words('english'))  # build the stopword set once, not per line

def tokenize_tweets():
    for line in codecs.open('../data/sample_tweets.txt', encoding='utf-8'):
        # drop English stopwords before any further preprocessing
        tweet_text = ' '.join(token for token in tknz.tokenize(line) if token not in stop_words)
        try:
            mod_text = tokenize(tweet_text)  # custom preprocessing helper defined elsewhere
            tokens = tknz.tokenize(mod_text)
            if len(tokens) > 0:
                yield tokens  # already tokenized above; no need to tokenize again
            else:
                yield ['NULL']
        except UnicodeEncodeError:
            yield ['<NULL>']

Vocabulary building from this generator runs fine, but when I try running the train method, I get the following error:

ValueError: You must specify either total_examples or total_words, for proper alpha and progress calculations. The usual value is total_examples=model.corpus_count.

Not sure what is wrong with it.
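The calling code is roughly the following sketch (the hyperparameters are placeholders, written against the gensim 2.x/3.x-style API where the vector dimension is still called size):

from gensim.models import Word2Vec

vec_model = Word2Vec(size=100, min_count=1)   # placeholder hyperparameters
vec_model.build_vocab(tokenize_tweets())      # vocabulary building works fine
vec_model.train(tokenize_tweets())            # this call raises the ValueError above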

17 Answers

✔️ Accepted Answer

@shashankg7 Which version of Gensim are you using? As of the latest release, you need to explicitly pass the epochs parameter and an estimate of the corpus size when calling the train function, like this: vec_model.train(sentences, total_examples=vec_model.corpus_count, epochs=vec_model.iter). You could read about this here.
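To illustrate, here is a minimal sketch of the corrected call. It assumes the tokenize_tweets generator above and a gensim 2.x/3.x-style API (in gensim 4.x, model.iter was renamed model.epochs and size became vector_size). Since a generator is exhausted after one pass, the tweets are materialized into a list so that both build_vocab and train can iterate over them:

from gensim.models import Word2Vec

# Materialize the generator: a one-shot generator would already be empty by the time train() runs.
sentences = list(tokenize_tweets())

vec_model = Word2Vec(size=100, min_count=1)                # placeholder hyperparameters
vec_model.build_vocab(sentences)
vec_model.train(sentences,
                total_examples=vec_model.corpus_count,     # number of sentences seen by build_vocab
                epochs=vec_model.iter)                     # use vec_model.epochs on gensim 4.x

Alternatively, wrapping the generator in a class with an __iter__ method gives a restartable corpus, which avoids holding all tweets in memory.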
