I’m currently working with an emotion analysis dataset that contains 100000 docs.
Using sklearn’s
TfidfVectorizer, I transformed these docs into a (100000, 20000) matrix of vectors. It works fine with
LogisticRegression, though it does take several minutes to print the result.
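For reference, the TF-IDF setup described above looks roughly like the sketch below. The toy corpus and labels are invented for illustration, and max_features=20000 is only an assumption about how the 20000 columns were obtained:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# Toy corpus and labels, invented for illustration (1 = positive, 0 = negative).
docs = [
    "i love this movie",
    "what a wonderful happy day",
    "i hate this film",
    "what a terrible sad day",
]
labels = [1, 1, 0, 0]

# max_features caps the vocabulary size; 20000 is an assumed setting that
# mirrors the column count mentioned in the post.
vectorizer = TfidfVectorizer(max_features=20000)
X = vectorizer.fit_transform(docs)  # SciPy sparse matrix, shape (n_docs, n_terms)

clf = LogisticRegression(max_iter=1000)
clf.fit(X, labels)

print(X.shape)
print(clf.predict(vectorizer.transform(["a happy movie"])))
```

Note that fit_transform returns a sparse matrix here, so most of the entries are never stored in memory.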
Then I tried training a
gensim.models.doc2vec.Doc2Vec model to turn these docs into (10000, 20000) vectors.
Once the doc2vec vectors were generated, I converted them to a DataFrame like this:
train_doc2vec = model.dv.vectors[:8000]
test_doc2vec = model.dv.vectors[8000:]
train_df = pd.DataFrame(train_doc2vec)
train_df.head(5)
All of this works fine. But when it comes to
train_test_split, it runs for hours and still doesn’t finish the split. I’ve tried to
del train_doc2vec and model to free memory, but that doesn’t help either.
del train_doc2vec
del model
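To make the split step easier to reproduce, here is a minimal sketch that calls train_test_split directly on a NumPy array. The random array is only a small stand-in for model.dv.vectors, the labels are invented, and the size estimate assumes the vectors are dense float32:

```python
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Small stand-in for model.dv.vectors (hypothetical data, not the real vectors).
vectors = rng.standard_normal((1000, 50)).astype(np.float32)
labels = rng.integers(0, 2, size=1000)  # invented binary labels

# Split the array itself; no DataFrame conversion is needed for this step.
X_train, X_test, y_train, y_test = train_test_split(
    vectors, labels, test_size=0.2, random_state=0
)
print(X_train.shape, X_test.shape)

# Rough memory estimate for a dense (10000, 20000) float32 array.
est_bytes = 10000 * 20000 * np.dtype(np.float32).itemsize
print(f"{est_bytes / 1e9:.1f} GB")
```

Unlike the sparse TF-IDF matrix, a dense array of that shape stores every entry, and each copy made during a split or a DataFrame conversion needs that much memory again.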
Is there anything I can do to make the gensim-generated doc2vec vectors work faster, or am I doing something wrong?