[Solved] pytorch-lightning: How to log by epoch for both training and validation on 1.0.0rc4 / 1.0.0rc5 / 1.0.0

What is your question?

I have been trying out pytorch-lightning 1.0.0rc5 and want to log only at epoch end for both training and validation, with the epoch number on the x-axis. I noticed that training_epoch_end no longer allows returning anything. However, I found that for training I can achieve what I want by doing:

def training_epoch_end(self, outputs):
    loss = compute_epoch_loss_from_outputs(outputs)
    # Logging the special 'step' key overrides the x-axis value used for this log call
    self.log('step', self.trainer.current_epoch)
    self.log('loss', {'train': loss})

This sets the step to the epoch number, which is then used for the x-axis, just as I wanted. I have not found anything in the documentation saying whether this is how logging is intended to work. I am also a bit confused about the result objects. Nevertheless, this code seems simple and logical, so I thought it could be one of the intended ways of logging per epoch.
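For reference, compute_epoch_loss_from_outputs above is not a Lightning API but a user-defined helper. A minimal, torch-free sketch of what such a helper might do, assuming each entry in `outputs` carries a scalar loss value:

```python
def compute_epoch_loss_from_outputs(outputs):
    """Average per-batch losses into one epoch-level value.

    Assumes `outputs` is the list Lightning passes to *_epoch_end: one
    entry per step, each either a bare number or a dict with a 'loss' key.
    This is a hypothetical implementation for illustration only.
    """
    losses = [o['loss'] if isinstance(o, dict) else o for o in outputs]
    return sum(float(v) for v in losses) / len(losses)
```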

I tried to do the same for validation as follows:

def validation_epoch_end(self, outputs):
    loss = compute_epoch_loss_from_outputs(outputs)
    self.log('step', self.trainer.current_epoch)
    self.log('loss', {'valid': loss})

However, for validation the x-axis ends up being the number of validation batches, and an additional step graph appears in TensorBoard.

Based on this I have some questions. Is this an intended way of logging per epoch? If so, should the same behavior be obtained for both training and validation? If this is not the intended way of logging per epoch, where can I read about how this is planned for version 1.0.0?

What's your environment?

  • OS: Linux
  • Packaging: pip
  • Version: 1.0.0rc5
25 Answers

✔️Accepted Answer

To clarify a bit further, I want to do

def training_epoch_end(self, outputs):
    some_val = ...
    self.log('step', self.trainer.current_epoch)
    self.log('some_val', {'train': some_val})

def validation_epoch_end(self, outputs):
    some_val = ...
    self.log('step', self.trainer.current_epoch)
    self.log('some_val', {'valid': some_val})

I expect to get a graph where I see some_val for both training and validation, which would look like:

(screenshot, 2020-10-14: a single TensorBoard chart for some_val with 'train' and 'valid' curves plotted against epoch on the x-axis)

It is useful for me to observe the same value for both training and validation in a single graph, at comparable time intervals. I also want the x-axis to be the epoch for several reasons. One of them is that I want to use GradientAccumulationScheduler, which means the number of optimizer steps can differ from epoch to epoch. If I used the step count on the x-axis, the points would be unevenly distributed.
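To illustrate why a step-based x-axis becomes uneven under gradient accumulation, here is a small torch-free sketch. The schedule dict mirrors the format accepted by GradientAccumulationScheduler (keys are the epochs at which a new accumulation factor takes effect); the function itself is hypothetical, for illustration only:

```python
def steps_per_epoch(num_batches, schedule, num_epochs):
    """Optimizer steps taken in each epoch under an accumulation schedule.

    `schedule` maps an epoch index to the accumulation factor active from
    that epoch on, e.g. {0: 4, 4: 2} means: accumulate 4 batches per
    optimizer step for epochs 0-3, then 2 batches per step afterwards.
    """
    steps = []
    factor = 1
    for epoch in range(num_epochs):
        factor = schedule.get(epoch, factor)
        steps.append(num_batches // factor)
    return steps

# With 64 batches per epoch, later epochs take many more optimizer steps
# than early ones, so epoch boundaries land at uneven global-step positions.
print(steps_per_epoch(64, {0: 4, 4: 2, 6: 1}, 8))  # [16, 16, 16, 16, 32, 32, 64, 64]
```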

Other Answers:

I think that to log one value per epoch you can simply call

self.log('metric_name', metric_value, on_step=False, on_epoch=True)

at each training step. Lightning then accumulates the values over the epoch and logs the averaged value at epoch end. But true, the x-axis will then be the global step, not the epoch number.
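What on_epoch=True does here is, by default, a mean reduction over the values logged at each step. A torch-free sketch of that behavior (a toy class for illustration, assuming the default mean reduction):

```python
class EpochAccumulator:
    """Mimics self.log(..., on_step=False, on_epoch=True) with mean reduction."""

    def __init__(self):
        self.values = []

    def log(self, value):
        self.values.append(value)      # called once per training step

    def epoch_end(self):
        mean = sum(self.values) / len(self.values)
        self.values.clear()            # reset for the next epoch
        return mean

acc = EpochAccumulator()
for batch_loss in [1.0, 2.0, 3.0]:
    acc.log(batch_loss)
print(acc.epoch_end())  # prints 2.0
```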

I'm not sure you can override that through the LightningModule.log API. If it's very important, maybe you can access the underlying logger via self.logger.experiment and use that directly?

  1. This:
def training_step(self, batch, batch_idx):
    ...
    return loss

def training_epoch_end(self, outs):
    # outs is the list of per-step losses
    self.log('avg_loss', torch.stack(outs).mean())

is the same as:

def training_step(self, batch, batch_idx):
    ...
    self.log('avg_loss', loss, on_step=False, on_epoch=True)
    return loss
  2. If you still need to log something at epoch end, then just call self.log there:
def training_epoch_end(self, outs):
    some_val = ...
    self.log('some_val', some_val)
  3. Logging steps in validation makes no sense lol... the x-axis would be the batch idx, not time, so the curve means nothing. This is why PL makes a separate graph for each: when done this way, it can be viewed as a change in distribution over time.

@williamFalcon thank you for the response. Please note that I am not interested in logging validation in each step. I completely agree, this does not make sense. I only want to log validation values on validation_epoch_end. In my example it is for loss but that is not important, the same question holds for some_val.

Furthermore, if for both training and validation values are only logged at epoch end, as in the example, then both can be plotted on the same graph, precisely showing the change in distribution over time. They can share a graph because the values correspond to the same points in time (epoch end). PL already does this automatically with my snippets at the top if I remove the self.log('step', ...) call, but as you say the resulting x-axis does not make sense. For the plot to make sense I want to override the step to be the epoch instead of global_step. This override works for training but not for validation. If users are allowed to override the step for training, then for consistency it makes sense that it can also be overridden for validation.
