Incremental Training with Word2Vec/ParagraphVectors #579

Open
@bluelkc

Issue Description

I am trying to incrementally train a word2vec model and analyse the differences in training time and vector space compared to a model obtained through batch training. So far, the only relevant example I have found is the word2vec uptraining example, and I was wondering what the input data should be for each subsequent incremental training run after the first one.

In the Word2VecUptrainingExample, the same raw_text file is used for both the first and the second training. Am I right to say that for subsequent incremental trainings, the input data should always include the original data set plus whatever data has been newly added?

Also, is it possible to conduct incremental training on paragraph vectors? I have tried with a DocumentIterator and trainWordVector set to TRUE, but the nearestWords test shows document indices among the results.

Lastly, I find it very strange that in all my incremental trainings with a previously trained word2vec model loaded, the nearestWords test always shows the same results as the loaded word2vec model does. I am certainly missing something here; please advise.
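
One way to narrow down the last point is to check whether any vectors actually change during the second training, independently of the nearestWords output. Below is a minimal sketch of such a check in plain Java. It assumes the word vectors have been copied out of the model into `double[]` arrays before and after uptraining; the class `VectorDrift` and its methods are hypothetical helpers, not part of the deeplearning4j API. The "before" snapshot has to be a copy taken before training starts, otherwise both maps may end up pointing at the same updated arrays.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper (not a deeplearning4j class): compares two snapshots
// of word vectors taken before and after an incremental training run.
final class VectorDrift {

    // Cosine similarity between two vectors of equal length.
    // Returns 0.0 if either vector has zero norm.
    static double cosine(double[] a, double[] b) {
        if (a.length != b.length) {
            throw new IllegalArgumentException("dimension mismatch");
        }
        double dot = 0.0, normA = 0.0, normB = 0.0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
            normA += a[i] * a[i];
            normB += b[i] * b[i];
        }
        if (normA == 0.0 || normB == 0.0) {
            return 0.0;
        }
        return dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }

    // Counts words present in both snapshots whose vectors differ by more
    // than eps in at least one component (or differ in length).
    static int countChanged(Map<String, double[]> before,
                            Map<String, double[]> after,
                            double eps) {
        int changed = 0;
        for (Map.Entry<String, double[]> entry : before.entrySet()) {
            double[] u = entry.getValue();
            double[] v = after.get(entry.getKey());
            if (v == null) {
                continue; // word missing from the second snapshot
            }
            if (u.length != v.length) {
                changed++;
                continue;
            }
            for (int i = 0; i < u.length; i++) {
                if (Math.abs(u[i] - v[i]) > eps) {
                    changed++;
                    break;
                }
            }
        }
        return changed;
    }

    public static void main(String[] args) {
        // Toy data standing in for vectors exported from the two models.
        Map<String, double[]> before = new HashMap<>();
        before.put("day", new double[] {1.0, 0.0});
        before.put("night", new double[] {0.0, 1.0});

        Map<String, double[]> after = new HashMap<>();
        after.put("day", new double[] {1.0, 0.0});
        after.put("night", new double[] {0.6, 0.8});

        System.out.println("changed words: " + countChanged(before, after, 1e-9));
        System.out.println("cosine(night before, night after): "
                + cosine(before.get("night"), after.get("night")));
    }
}
```

If the count of changed words is zero after uptraining, the second run did not update the loaded weights at all, which would explain identical nearestWords results. If it is non-zero, the per-word cosine values give a rough measure of how far the vector space has moved.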
