How to Calculate the Distance Between Two Node2vec Models

August 20, 2024

I have two node2vec models trained at different timestamps, and I want to calculate the distance between them. Both models have the same vocabulary, and we update the models. My models are like thi

Solution 1:

Assuming you've used a standard word2vec library to train your models, each run bootstraps a wholly-separate model whose coordinates are not necessarily comparable to any other model.

(Due to some inherent randomness in the algorithm, or in the multi-threaded handling of training input, even running two training sessions on the exact same data will result in different models. They should each be about as useful for downstream applications, but individual tokens could be in arbitrarily-different positions.)

That said, you could try to synthesize some measures of how much two models are different. For example, you might:

Pick a bunch of random (or domain-significant) word-pairs. Check the similarity between each pair, in each model individually, then compare those values between models. (That is, compare model1.similarity(token_a, token_b) with model2.similarity(token_a, token_b).) Consider the difference-between-the-models as some weighted combination of all the tested similarity-differences.
For some significant set of relevant tokens, collect the top-N most-similar tokens in each model. Compare these lists via some sort of rank-correlation measure, to see how much one model has changed the 'neighborhoods' of each token.
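As a rough illustration of both measures, here is a minimal sketch. It assumes each model can be reduced to a plain dict mapping every token to its vector (a stand-in for whatever lookup your library exposes), and the function names are hypothetical. For the neighborhood comparison it uses a simple Jaccard overlap of the top-N sets instead of a true rank-correlation measure, which ignores the ordering within each list.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def mean_similarity_shift(model1, model2, pairs):
    """Average absolute change in pairwise similarity between the two models.

    Assumption: an unweighted mean; any other weighting could be substituted.
    """
    diffs = [
        abs(cosine(model1[a], model1[b]) - cosine(model2[a], model2[b]))
        for a, b in pairs
    ]
    return sum(diffs) / len(diffs)

def top_n(model, token, n):
    """The n tokens most similar to `token` within a single model."""
    others = [t for t in model if t != token]
    others.sort(key=lambda t: cosine(model[token], model[t]), reverse=True)
    return others[:n]

def mean_neighbor_overlap(model1, model2, tokens, n=10):
    """Average Jaccard overlap of top-n neighbor sets (1.0 = same neighborhoods)."""
    scores = []
    for tok in tokens:
        s1 = set(top_n(model1, tok, n))
        s2 = set(top_n(model2, tok, n))
        scores.append(len(s1 & s2) / len(s1 | s2))
    return sum(scores) / len(scores)

# Toy data: model_b is model_a rotated by 90 degrees, so every coordinate
# differs while all pairwise similarities stay the same.
model_a = {"a": [1.0, 0.0], "b": [0.0, 1.0], "c": [1.0, 1.0]}
model_b = {"a": [0.0, 1.0], "b": [-1.0, 0.0], "c": [-1.0, 1.0]}
pairs = [("a", "b"), ("a", "c"), ("b", "c")]

print(mean_similarity_shift(model_a, model_b, pairs))
print(mean_neighbor_overlap(model_a, model_b, ["a", "b"], n=1))
```

The rotated toy model is there to show why these measures are preferable to comparing raw coordinates: both work only with relationships inside each model, so they are unaffected by the arbitrary orientation each training run ends up with.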

For each of these, I'd suggest verifying their operation against a baseline case of the exact-same training data that's been shuffled and/or trained with a different starting random seed. Do they show such models as being "nearly equivalent"? If not, you'd need to adjust the training parameters or synthetic measure until they do show the expected result - that models from the same data are judged as alike, even though tokens have very different coordinates.
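One way to make that baseline check concrete is sketched below. It assumes you have already computed the same difference score for several pairs of models trained on identical data (different seeds or shuffles), and it flags an observed score as meaningful only if it sits well outside that baseline noise. The function name and the mean-plus-k-standard-deviations rule with k=3 are my own illustrative choices, not part of any library.

```python
import statistics

def exceeds_baseline(observed, baseline_scores, k=3.0):
    """Return True if `observed` is clearly larger than same-data noise.

    baseline_scores: difference scores between models trained on the SAME
    data with different seeds/shuffles (needs at least two values).
    k: hypothetical tolerance, in sample standard deviations above the mean.
    """
    mean = statistics.mean(baseline_scores)
    spread = statistics.stdev(baseline_scores)
    return observed > mean + k * spread

# Illustrative numbers only, not measurements from real models.
same_data_scores = [0.02, 0.03, 0.025, 0.028]
print(exceeds_baseline(0.30, same_data_scores))
print(exceeds_baseline(0.027, same_data_scores))
```

If the same-data scores are themselves large or widely scattered, that is the signal to revisit the training parameters or the measure before trusting any between-era comparison.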

Another option might be to train one giant combined model from a synthetic corpus where:

all the original unmodified 'texts' from both eras appear once
texts from each separate era appear again, but with some random proportion of their tokens modified with an era-specific modifier. (For example, 'foo' sometimes becomes 'foo_1' when in first-era texts, and sometimes becomes 'foo_2' in second-era texts.) You don't want to convert all tokens in any one text to era-specific tokens, because only tokens that co-appear with each other influence each other, and you thus want tokens from either era to sometimes appear with common/shared variants, but also often appear with era-specific variants.
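A minimal sketch of building such a corpus follows. It assumes each 'text' is a list of string tokens (for node2vec, presumably one random walk per text) and that each token in the repeated copies is tagged independently with probability tag_prob. The function name, the 0.5 default, and the '_1'/'_2' suffixes are illustrative assumptions.

```python
import random

def build_combined_corpus(era1_texts, era2_texts, tag_prob=0.5, seed=0):
    """Combine two eras into one training corpus.

    Every original text appears once unmodified, then once more with a
    random subset of its tokens given an era-specific suffix.
    tag_prob is a hypothetical knob; it should stay strictly between 0 and 1
    so tagged and shared variants keep co-occurring.
    """
    rng = random.Random(seed)
    # First pass: all original texts, untouched.
    corpus = [list(text) for text in list(era1_texts) + list(era2_texts)]
    # Second pass: partially tagged copies, one per original text.
    for suffix, texts in (("_1", era1_texts), ("_2", era2_texts)):
        for text in texts:
            corpus.append(
                [tok + suffix if rng.random() < tag_prob else tok for tok in text]
            )
    return corpus

era1 = [["foo", "bar", "baz"], ["bar", "qux"]]
era2 = [["foo", "qux", "baz"]]
for text in build_combined_corpus(era1, era2):
    print(text)
```

The resulting list of token lists can then be fed to the trainer as a single corpus, so that all three variants of each token end up in one shared coordinate space.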

At the end, the original token 'foo' will get three vectors: 'foo', 'foo_1', and 'foo_2'. They should all be quite similar, but the era-specific variants will be relatively more-influenced by the era-specific contexts. Thus the differences between those three (and relative movement in the now common coordinate space) will be an indication of the magnitude and kinds of changes that happened between the two eras' data.
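To read those differences off the combined model, something like the following sketch could rank tokens by how far their two era-specific vectors have moved apart. It assumes the trained vectors are available as a plain dict from token to vector and that era variants use the '_1'/'_2' suffixes from the example above. The function name and the use of cosine distance as the drift score are illustrative choices.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def era_drift(vectors, tokens):
    """Rank tokens by cosine distance between their '_1' and '_2' variants.

    Tokens missing either era-specific variant are skipped.
    Returns (token, distance) pairs, largest drift first.
    """
    drift = []
    for tok in tokens:
        v1 = vectors.get(tok + "_1")
        v2 = vectors.get(tok + "_2")
        if v1 is None or v2 is None:
            continue
        drift.append((tok, 1.0 - cosine(v1, v2)))
    return sorted(drift, key=lambda item: item[1], reverse=True)

# Toy vectors: 'foo' is stable across eras, 'bar' has moved.
toy = {
    "foo": [1.0, 0.0], "foo_1": [1.0, 0.0], "foo_2": [1.0, 0.0],
    "bar": [1.0, 1.0], "bar_1": [1.0, 0.0], "bar_2": [0.0, 1.0],
}
for token, distance in era_drift(toy, ["foo", "bar", "missing"]):
    print(token, round(distance, 3))
```

Because all variants were trained together, these distances are directly comparable across tokens, unlike distances between vectors taken from two separately trained models.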
