
Re-training Inception Google Cloud Stuck At Global Step 0

I am following the flowers tutorial for re-training Inception on Google Cloud ML. I can run the tutorial, train, and predict just fine. I then substituted the flowers dataset for a t

Solution 1:

Everything looks fine. My suspicion is that the problem is with your data. Specifically, I suspect TF is unable to read any data from your GCS files (are they empty?). As a result, when you invoke train, TF blocks while trying to read a batch of data that never arrives.

I would suggest adding logging statements around the call to session.run in Trainer.run_training. This will tell you whether that is the line where it is getting stuck.
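As a minimal sketch of that kind of logging, assuming nothing about the tutorial's code beyond the existence of a blocking call, the wrapper below logs before and after the call. The helper name `logged_call` and the logger name are hypothetical, not part of the tutorial or of TensorFlow. If the "entering" line appears and the "returned" line never does, that call is where the job is stuck.

```python
import logging
import time

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("trainer_debug")  # hypothetical logger name


def logged_call(label, fn, *args, **kwargs):
    """Log before and after fn so a hang shows up as a missing 'returned' line."""
    log.info("entering %s", label)
    start = time.time()
    result = fn(*args, **kwargs)
    log.info("%s returned after %.2fs", label, time.time() - start)
    return result


# Hypothetical usage inside the training loop (names assumed, not from the tutorial):
#   logged_call("session.run", session.run, train_op)
```

The same effect can be had by placing two plain logging statements directly around the call; the wrapper just keeps the timing in one place.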

I'd also suggest checking the sizes of your GCS files.
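One way to do that check is to list the files with `gsutil ls -l` and look for zero-byte entries. The sketch below assumes the usual listing layout of size, timestamp, and URL per line; the function name `find_empty_objects` is hypothetical, and the bucket paths in the sample are made up purely for illustration.

```python
def find_empty_objects(listing):
    """Return URLs of zero-byte objects from `gsutil ls -l` style output."""
    empty = []
    for line in listing.splitlines():
        # Object lines are assumed to look like: <size> <timestamp> <gs://url>
        parts = line.strip().split(None, 2)
        if len(parts) == 3 and parts[0].isdigit() and parts[2].startswith("gs://"):
            if int(parts[0]) == 0:
                empty.append(parts[2])
    return empty


# Made-up sample listing, for illustration only.
sample = """\
   2276224  2017-01-10T19:25:17Z  gs://my-bucket/data/train-00000
         0  2017-01-10T19:30:01Z  gs://my-bucket/data/train-00001
TOTAL: 2 objects, 2276224 bytes (2.17 MiB)
"""
print(find_empty_objects(sample))
```

Summary lines such as the `TOTAL:` row are skipped because their first field is not a number.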

TensorFlow also has an experimental RunOptions, which lets you specify a timeout for Session.run. Once this feature is ready, it could be useful for ensuring that code doesn't block forever.
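To illustrate the idea of bounding a blocking call, here is a plain standard-library stand-in. It is not TensorFlow's RunOptions mechanism: it runs the call in a daemon thread and gives up waiting after a deadline. The helper name `run_with_timeout` is hypothetical, and the abandoned thread is not cancelled; it simply stops holding up the caller.

```python
import queue
import threading


def run_with_timeout(fn, timeout_s, *args, **kwargs):
    """Run fn in a daemon thread; raise TimeoutError if it has not finished in time."""
    results = queue.Queue(maxsize=1)

    def worker():
        try:
            results.put((True, fn(*args, **kwargs)))
        except Exception as exc:  # hand failures back to the caller
            results.put((False, exc))

    threading.Thread(target=worker, daemon=True).start()
    try:
        ok, value = results.get(timeout=timeout_s)
    except queue.Empty:
        raise TimeoutError("call did not finish within %.1fs" % timeout_s)
    if not ok:
        raise value
    return value
```

A timeout raised here would suggest, as with the logging approach, that the wrapped call is waiting on input that never arrives.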
