
What’s the difference between a pd.DataFrame and an sklearn-transformed sparse matrix? Why is the sklearn-transformed sparse matrix so fast?


I’m currently working with an emotion analysis dataset that contains 100000 docs.

Using sklearn’s TfidfVectorizer, I transformed these docs into a (100000, 20000) sparse matrix of vectors. It works fine with train_test_split and LogisticRegression, though it does take several minutes to print the result.
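For reference, the TF-IDF pipeline described above looks roughly like the sketch below. The toy corpus, the labels, and the split parameters are hypothetical stand-ins, not the real dataset or settings.

```python
from scipy import sparse
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Hypothetical toy corpus and labels standing in for the real 100000 docs.
docs = [
    "i love this movie", "what a great day", "this is wonderful",
    "i am so happy", "i hate this movie", "what a terrible day",
    "this is awful", "i am so sad",
]
labels = [1, 1, 1, 1, 0, 0, 0, 0]

# fit_transform returns a scipy sparse matrix, not a dense array.
X = TfidfVectorizer().fit_transform(docs)

# train_test_split accepts the sparse matrix directly and keeps it sparse.
X_train, X_test, y_train, y_test = train_test_split(
    X, labels, test_size=0.25, random_state=0, stratify=labels
)

clf = LogisticRegression().fit(X_train, y_train)
accuracy = clf.score(X_test, y_test)
```

Note that no DataFrame is involved at any step here: the sparse matrix goes straight into the split and the classifier.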

Then I tried training a gensim.models.doc2vec.Doc2Vec model on these docs to get (10000, 20000) vectors.
Once the doc2vec vectors were generated, I converted them to a DataFrame like this:

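import pandas as pd

# model is the trained gensim Doc2Vec model
train_doc2vec = model.dv.vectors[:8000]
test_doc2vec = model.dv.vectors[8000:]

# wrap the dense training vectors in a DataFrame
train_df = pd.DataFrame(train_doc2vec)
train_df.head(5)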

All of this works fine. But when it comes to train_test_split, it runs for hours and still hasn’t finished splitting. I’ve tried deleting train_doc2vec and model to save memory, but that doesn’t help either:

del train_doc2vec
del model
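To make the comparison concrete, here is a minimal sketch of how the two representations differ in memory, and of splitting the dense vectors as a plain NumPy array instead of a DataFrame. The shapes, the 1% density, and the random data are assumptions chosen for illustration. They are not measurements from the real dataset, and this does not claim to pinpoint the cause of the slowdown.

```python
import numpy as np
from scipy import sparse
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Hypothetical stand-in for a TF-IDF matrix: roughly 1% of entries are non-zero.
tfidf_like = rng.random((1000, 2000))
tfidf_like[tfidf_like < 0.99] = 0.0
X_sparse = sparse.csr_matrix(tfidf_like)

# A CSR matrix stores only the non-zero values plus two index arrays.
sparse_bytes = (
    X_sparse.data.nbytes + X_sparse.indices.nbytes + X_sparse.indptr.nbytes
)

# Hypothetical stand-in for doc2vec output: every entry is stored.
X_dense = rng.standard_normal((1000, 2000)).astype(np.float32)
dense_bytes = X_dense.nbytes

print(f"sparse: {sparse_bytes} bytes, dense: {dense_bytes} bytes")

# The dense ndarray can be passed to train_test_split directly,
# without wrapping it in a DataFrame first.
y = rng.integers(0, 2, size=1000)
X_train, X_test, y_train, y_test = train_test_split(
    X_dense, y, test_size=0.2, random_state=0
)
```

With these assumed sizes the dense array holds all 2,000,000 values, while the sparse matrix holds only the non-zero ones, so the dense representation is many times larger.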

Is there anything I can do to make the gensim-generated doc2vec vectors work faster, or am I doing something wrong?



Source: https://stackoverflow.com/questions/70718678/whats-the-difference-between-pd-dataframe-and-sklearn-transformed-sparsematrix
