[Solved] pytorch-lightning: How to log by epoch for both training and validation on 1.0.0rc4 / 1.0.0rc5 / 1.0.0
✔️ Accepted Answer
To clarify a bit further, I want to do:

```python
def training_epoch_end(self, outputs):
    some_val = ...
    self.log('step', self.trainer.current_epoch)
    self.log('some_val', {'train': some_val})

def validation_epoch_end(self, outputs):
    some_val = ...
    self.log('step', self.trainer.current_epoch)
    self.log('some_val', {'valid': some_val})
```
Expecting to get a graph where I see `some_val` for both training and validation, i.e. a single plot with one curve for train and one for valid.
It is useful for me to observe the same value for both training and validation in a single graph, at comparable points in time. I also want the x-axis to be the epoch, for several reasons. One of them is that I want to use `GradientAccumulationScheduler`, which means that the number of optimizer steps can differ from epoch to epoch. If I used the number of steps, the points on the x-axis would be unevenly distributed.
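For context, this is roughly how such a scheduler is attached to the `Trainer` (a minimal sketch; the schedule values are made up for illustration):

```python
from pytorch_lightning import Trainer
from pytorch_lightning.callbacks import GradientAccumulationScheduler

# Accumulate 4 batches per optimizer step for epochs 0-4, then 2 from
# epoch 5 on; the number of optimizer steps per epoch therefore varies.
accumulator = GradientAccumulationScheduler(scheduling={0: 4, 5: 2})
trainer = Trainer(callbacks=[accumulator], max_epochs=10)
```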
Other Answers:
I think that to log one value per epoch you can simply call

```python
self.log('metric_name', metric_value, on_step=False, on_epoch=True)
```

at each training step. This should automatically accumulate over the epoch and output the averaged value at epoch end. But true, then on the x-axis you will have the current step (not the epoch number).
I'm not sure you can override that from the `LightningModule.log` API. If that's very important, maybe you can directly access the logger in `self.logger.experiment` and use that?
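As a sketch of that suggestion, assuming the default `TensorBoardLogger` (whose `experiment` attribute is a `torch.utils.tensorboard.SummaryWriter`):

```python
def validation_epoch_end(self, outputs):
    some_val = ...
    # Bypass self.log and write straight to TensorBoard, using the
    # epoch number (rather than the global step) as the x-axis value.
    self.logger.experiment.add_scalar(
        'some_val/valid', some_val, global_step=self.current_epoch
    )
```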
- This:

```python
def training_step(self, batch, batch_idx):
    loss = ...
    return loss

def training_epoch_end(self, outs):
    # manually average the losses collected over the epoch
    avg_loss = torch.stack([o['loss'] for o in outs]).mean()
    self.log('avg_loss', avg_loss)
```

is the same as:

```python
def training_step(self, batch, batch_idx):
    loss = ...
    # let Lightning accumulate and average over the epoch
    self.log('avg_loss', loss, on_step=False, on_epoch=True)
    return loss
```
- If you still need to log something on epoch end, then just call `self.log`:

```python
def training_epoch_end(self, outs):
    some_val = ...
    self.log('some_val', some_val)
```
- Logging steps in validation makes no sense lol... the x-axis would be the batch idx, not time, so the curve means nothing. This is why PL makes a separate graph for each... because when done this way, it can be viewed as a change in distribution over time.
@williamFalcon thank you for the response. Please note that I am not interested in logging validation at each step. I completely agree, this does not make sense. I only want to log validation values in `validation_epoch_end`. In my example it is the loss, but that is not important; the same question holds for `some_val`.
Furthermore, if for both training and validation values are logged only at epoch end, as in the example, then both can be plotted on the same graph, precisely showing the change in distribution over time. Both can be plotted in the same graph because the values correspond to the same points in time (epoch end). This already happens automatically in PL with my example snippets at the top if the `self.log('step', ...)` lines are removed, but as you say the resulting x-axis does not make sense. For the plot to make sense I want to override the `step` to be the epoch instead of `global_step`. This overriding of `step` works for training but not for validation. If users are allowed to override `step` for training, then for consistency it should also be possible to override it for validation.
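A possible workaround along these lines, assuming the `TensorBoardLogger`: write both values under a shared main tag with `SummaryWriter.add_scalars`, using the epoch as the step, so train and valid end up as two curves in one plot with an epoch x-axis (the tag names here are illustrative):

```python
def training_epoch_end(self, outputs):
    some_val = ...
    self.logger.experiment.add_scalars(
        'some_val', {'train': some_val}, global_step=self.current_epoch
    )

def validation_epoch_end(self, outputs):
    some_val = ...
    self.logger.experiment.add_scalars(
        'some_val', {'valid': some_val}, global_step=self.current_epoch
    )
```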
What is your question?
I have been trying out pytorch-lightning 1.0.0rc5 and wanted to log only at epoch end for both training and validation, while having the epoch number on the x-axis. I noticed that `training_epoch_end` now does not allow returning anything. However, I noticed that for training I can achieve what I want by doing:
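```python
def training_epoch_end(self, outputs):
    some_val = ...
    self.log('step', self.trainer.current_epoch)
    self.log('some_val', {'train': some_val})
```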
It sets the `step` to be the epoch number, which is used for the x-axis just as I wanted. I have not found in the documentation whether this is how it is intended to be logged. I am also a bit confused about the result objects. Nevertheless, this code seems quite simple and logical, so I thought this could be one of the possible intended ways of logging per epoch. I tried to do the same for validation as follows:
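```python
def validation_epoch_end(self, outputs):
    some_val = ...
    self.log('step', self.trainer.current_epoch)
    self.log('some_val', {'valid': some_val})
```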
However, in the case of validation the x-axis is the batch index within validation, and an additional `step` graph appears in TensorBoard. Based on this I have some questions. Is this an intended way of logging per epoch? If yes, is the idea that the same behavior is obtained for both training and validation? If this is not the intended way of logging per epoch, where can I read about how this is planned for version 1.0.0?