How to Calculate the Distance Between Two Node2vec Models
Solution 1:
Assuming you've used a standard word2vec library to train your models, each run bootstraps a wholly separate model whose coordinates are not necessarily comparable to those of any other model.
(Due to some inherent randomness in the algorithm, or in the multi-threaded handling of training input, even running two training sessions on the exact same data will result in different models. They should each be about as useful for downstream applications, but individual tokens could be in arbitrarily-different positions.)
That said, you could try to synthesize some measures of how much two models differ. For example, you might:
- Pick a bunch of random (or domain-significant) word pairs. Check the similarity between each pair in each model individually, then compare those values between models. (That is, compare model1.similarity(token_a, token_b) with model2.similarity(token_a, token_b).) Consider the difference between the models as some weighted combination of all the tested similarity differences.
- For some significant set of relevant tokens, collect the top-N most-similar tokens in each model. Compare these lists via some sort of rank-correlation measure, to see how much one model has changed the 'neighborhoods' of each token.
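The two measures above can be sketched in plain Python. This is only a minimal sketch under stated assumptions: each "model" is a plain dict mapping tokens to vectors (a stand-in for whatever your library exposes), the names similarity_gap and neighborhood_overlap are hypothetical, the pair differences are averaged with equal weights, and Jaccard overlap of the top-N sets is used as a simpler substitute for a true rank-correlation measure.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def similarity_gap(model1, model2, pairs):
    """Mean absolute difference in pair similarity between two models.

    Equal weighting is an assumption; any weighted combination would do.
    """
    gaps = [
        abs(cosine(model1[a], model1[b]) - cosine(model2[a], model2[b]))
        for a, b in pairs
    ]
    return sum(gaps) / len(gaps)


def top_n(model, token, n):
    """The n tokens most similar to `token` in `model`, excluding itself."""
    others = [t for t in model if t != token]
    others.sort(key=lambda t: cosine(model[token], model[t]), reverse=True)
    return others[:n]


def neighborhood_overlap(model1, model2, tokens, n):
    """Mean Jaccard overlap of top-n neighbor sets (1.0 means identical sets).

    Jaccard ignores ordering; it stands in for a rank-correlation measure.
    """
    scores = []
    for token in tokens:
        a = set(top_n(model1, token, n))
        b = set(top_n(model2, token, n))
        scores.append(len(a & b) / len(a | b))
    return sum(scores) / len(scores)


# Toy vectors, made up for illustration: 'b' and 'c' trade places in model2.
model1 = {"a": [1.0, 0.0], "b": [0.9, 0.1], "c": [0.0, 1.0], "d": [0.1, 0.9]}
model2 = {"a": [1.0, 0.0], "b": [0.0, 1.0], "c": [0.9, 0.1], "d": [0.1, 0.9]}

print(similarity_gap(model1, model2, [("a", "b"), ("c", "d")]))
print(neighborhood_overlap(model1, model2, ["a", "d"], 1))
```

With a library such as gensim, the dict lookups and the cosine helper would likely be replaced by the model's own similarity and most-similar queries; the comparison logic stays the same.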
For each of these, I'd suggest verifying their operation against a baseline case of the exact-same training data that's been shuffled and/or trained with a different starting random seed. Do they show such models as being "nearly equivalent"? If not, you'd need to adjust the training parameters or synthetic measure until it does have the expected result: that models from the same data are judged as alike, even though tokens have very different coordinates.
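To see why such a baseline can pass even when coordinates differ, here is a small idealized illustration. It assumes, purely for demonstration, that the second "model" is an exact rotation of the first; a genuinely retrained model will not be an exact rotation, so real gaps will be small rather than zero. The toy vectors and the helper names are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def rotate(vec, angle):
    """Rotate a 2-D vector counter-clockwise by `angle` radians."""
    x, y = vec
    c, s = math.cos(angle), math.sin(angle)
    return [x * c - y * s, x * s + y * c]


# Toy 2-D "model" and a rotated copy standing in for a retrained model.
base = {"a": [1.0, 0.0], "b": [0.8, 0.6], "c": [0.0, 1.0]}
rotated = {token: rotate(vec, 1.0) for token, vec in base.items()}

pairs = [("a", "b"), ("a", "c"), ("b", "c")]

# Largest change in any pair's similarity between the two models.
max_gap = max(
    abs(cosine(base[x], base[y]) - cosine(rotated[x], rotated[y]))
    for x, y in pairs
)

# Largest change in any single raw coordinate.
coord_shift = max(
    abs(p - q) for token in base for p, q in zip(base[token], rotated[token])
)

print("largest similarity change:", max_gap)
print("largest coordinate change:", coord_shift)
```

A measure that reports a near-zero similarity change alongside a large coordinate change is looking only at relative structure, which is the behavior wanted from the baseline check.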
Another option might be to train one giant combined model from a synthetic corpus where:
- all the original unmodified 'texts' from both eras appear once
- texts from each separate era appear again, but with some random proportion of their tokens modified with an era-specific modifier. (For example, 'foo' sometimes becomes 'foo_1' when in first-era texts, and sometimes becomes 'foo_2' in second-era texts.) You don't want to convert all tokens in any one text to era-specific tokens, because only tokens that co-appear with each other influence each other, and you thus want tokens from either era to sometimes appear with common/shared variants, but also often appear with era-specific variants.
At the end, the original token 'foo' will get three vectors: 'foo', 'foo_1', and 'foo_2'. They should all be quite similar, but the era-specific variants will be relatively more influenced by the era-specific contexts. Thus the differences between those three (and relative movement in the now common coordinate space) will be an indication of the magnitude and kinds of changes that happened between the two eras' data.
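The combined-corpus construction described above might look roughly like the following. This is a sketch under assumptions, not a definitive recipe: texts are taken to be lists of tokens, the helper names tag_tokens and build_combined_corpus are hypothetical, and the 0.5 tagging probability and the '_1'/'_2' suffixes are illustrative choices only.

```python
import random


def tag_tokens(text, suffix, prob, rng):
    """Copy `text`, appending `suffix` to each token with probability `prob`."""
    return [tok + suffix if rng.random() < prob else tok for tok in text]


def build_combined_corpus(era1_texts, era2_texts, prob=0.5, seed=0):
    """Build one corpus holding shared and era-tagged variants of each text."""
    rng = random.Random(seed)
    corpus = []
    # Every original text from both eras appears once, unmodified.
    corpus.extend(list(text) for text in era1_texts)
    corpus.extend(list(text) for text in era2_texts)
    # Every text appears again with a random share of tokens era-tagged,
    # so tagged and untagged variants co-occur within the same text.
    corpus.extend(tag_tokens(text, "_1", prob, rng) for text in era1_texts)
    corpus.extend(tag_tokens(text, "_2", prob, rng) for text in era2_texts)
    return corpus


# Toy texts, made up for illustration.
era1 = [["foo", "bar", "baz"]]
era2 = [["foo", "qux", "baz"]]

for text in build_combined_corpus(era1, era2):
    print(text)
```

The resulting list of token lists can then be handed to whichever word2vec trainer you use, after which the vectors for 'foo', 'foo_1', and 'foo_2' can be compared directly because they share one coordinate space.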