Feedback on @pytorch distributed training re: Metaflow

Feedback on @pytorch_parallel

Related to Netflix/metaflow#907 and the related docs draft.

1. Developer Ergonomics

To use the feature, the user must learn a brand new way of doing foreach. This adds a high degree of cognitive load: the user must remember that for this particular use case, and this use case only, they need to use self.next(..., num_parallel=...).

The API also makes it unclear where the parallelization is happening. For example, pytorch_lightning's Trainer takes an argument gpus=-1, which means it will use all available GPUs. In that case, what does num_parallel add? The user is left with a lot of cognitive overhead reasoning about where, and what kind of, parallelization is happening.
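To make the two layers concrete, here is a minimal sketch, with hypothetical step and attribute names, of what a flow using the feature looks like based on my reading of the docs draft. Note how parallelism is requested in two unrelated places: num_parallel on the transition and gpus=-1 on the Trainer.

    from metaflow import FlowSpec, step

    class TrainFlow(FlowSpec):
        # hypothetical sketch: step names and attributes are made up

        @step
        def start(self):
            # parallelism request #1: ask Metaflow for several copies of train
            self.next(self.train, num_parallel=2)

        @step
        def train(self):
            # per the docs draft, this step would also carry the
            # @pytorch_parallel decorator
            from pytorch_lightning import Trainer

            # parallelism request #2: ask Lightning for every GPU on this node
            trainer = Trainer(gpus=-1)
            # trainer.fit(...) would go here
            self.next(self.train_join)

        @step
        def train_join(self, inputs):
            self.next(self.end)

        @step
        def end(self):
            pass

    if __name__ == "__main__":
        TrainFlow()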

Furthermore, this is a leaky abstraction: the user still has to think about the dynamics of parallelization, as illustrated by this code:

       # do something with the model_path: only node 0 records the final
       # model path, so the user must reason about which worker is "primary"
       if current.parallel.node_index == 0:
           self.final_model_path = ...
       self.next(self.train_join)

Upon seeing this, the user must go down the rabbit hole of understanding how inter-process communication works and why this is even necessary. This is not an API that lets you "just scale up" and keep writing the same code you would write on a single node, which is presumably the benefit of using such a system in the first place.

Finally, the same snippet of code introduces yet another inconsistency with the traditional Metaflow design pattern of performing a merge step after a fan-out. Here, we are essentially doing the merge directly inside a foreach step. This forces the user to develop yet another mental model that is orthogonal to the general Metaflow mental model (fan-out -> merge/join), which once again greatly increases cognitive load.
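For contrast, here is the shape users already know, again with hypothetical step names: every branch of the fan-out does symmetric work, and any reconciliation happens in the join step, which receives the branches as inputs.

    from metaflow import FlowSpec, step

    class ClassicFanoutFlow(FlowSpec):
        # hypothetical sketch of the established fan-out -> join pattern

        @step
        def start(self):
            self.shards = ["a", "b", "c"]
            self.next(self.train, foreach="shards")

        @step
        def train(self):
            # every branch does the same work; no branch is special-cased
            self.model_path = f"model-{self.input}.pt"
            self.next(self.train_join)

        @step
        def train_join(self, inputs):
            # the merge logic lives here, not inside the fan-out step
            self.final_model_path = inputs[0].model_path
            self.next(self.end)

        @step
        def end(self):
            pass

    if __name__ == "__main__":
        ClassicFanoutFlow()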

2. Not Suited For Pragmatic "Scale-Up" But For Hyperscale (OpenAI / Google)

The most pragmatic approach to achieving distributed training is to provision a single node with multiple GPUs. AWS offers instances with up to 16 GPUs on a single node. This is often a good approach, especially given the communication overhead that starts to accumulate across multiple nodes. While this new Metaflow feature does not foreclose that avenue, it doesn't make single-node multi-GPU training any easier to use; in fact, I believe it makes it more complicated. If you are using a high-level API like pytorch_lightning, you can achieve multi-GPU training like so:

    from pytorch_lightning import Trainer

    trainer = Trainer(gpus=-1)  # gpus=-1: use every GPU on this node
    trainer.fit(...)

It appears the purported value-add of @pytorch_parallel is that you get a new control plane for provisioning a multi-node setup. On a single node, @pytorch_parallel seems to only complicate things when you are using a high-level API like pytorch_lightning. In fact, if you are using a single node, it appears you should not use this feature at all.

Therefore this feature has the following consequences:

  • Because this feature is about enabling a multi-node setup, it mainly caters to hyperscale use cases, which represent an extremely rare kind of user. We should guide people towards the more pragmatic approach of using a single node with multiple GPUs wherever possible, so much care should be taken when describing this feature.
  • It has the potential to confuse many users into a sub-optimal design pattern and workflow for multi-GPU training by inadvertently leading them to conduct multi-node training.

3. Documentation

PyTorch vs. PyTorch Lightning?

It is not clear from the documentation that this will work with vanilla PyTorch (even though I personally know it does!). It would be helpful if the docs specified more about the scope of the functionality and exactly how it works. For example, I don't think the API as designed is currently compatible with fastai, and users should be able to understand why.
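For example, my mental model of why vanilla PyTorch works is something like the sketch below. The helper name is made up, and the environment-variable rendezvous (MASTER_ADDR / MASTER_PORT) is an assumption on my part about what the decorator or the user would have to set up; spelling this out is exactly the kind of scope documentation that would help.

    import torch.distributed as dist
    from metaflow import current

    def init_vanilla_pytorch():
        # Hypothetical helper: map Metaflow's parallel context onto
        # torch.distributed. Assumes MASTER_ADDR / MASTER_PORT are already
        # set for the default env:// rendezvous (an assumption on my part).
        dist.init_process_group(
            backend="gloo",  # or "nccl" on GPU nodes
            rank=current.parallel.node_index,
            world_size=current.parallel.num_nodes,
        )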

Data Parallel vs. Model Parallel

I know that because we expose current.parallel.num_nodes and current.parallel.node_index, you can achieve whatever type of parallelization you want.

However, most people use a high-level API for training with PyTorch and may not be knowledgeable enough about this corner of PyTorch to figure this out. It would be helpful to discuss this more, because people will certainly ask.
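As a concrete illustration of the distinction (a sketch of plain PyTorch, not of anything the decorator does for you): data parallelism replicates the whole model per process and splits the data, while model parallelism means the user manually places pieces of the model on different devices.

    import torch.nn as nn
    from torch.nn.parallel import DistributedDataParallel

    model = nn.Sequential(nn.Linear(128, 64), nn.ReLU(), nn.Linear(64, 10))

    # Data parallel: each process holds a full copy of the model and sees a
    # different slice of the data; gradients are synchronized across processes.
    # (Assumes the process group from the previous sketch is initialized.)
    ddp_model = DistributedDataParallel(model)

    # Model parallel: the model itself is split, which the user arranges by
    # hand, e.g. by placing sub-modules on different GPUs:
    # part_a = nn.Sequential(nn.Linear(128, 64), nn.ReLU()).to("cuda:0")
    # part_b = nn.Linear(64, 10).to("cuda:1")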

Other Nits

There are critical things that are not documented. Some that are immediately obvious are:

  • the current module, particularly current.parallel.num_nodes and current.parallel.node_index. We need to explain what these mean, how they are calculated, and how they should be used.

  • image="pytorchlightning/pytorch_lightning" is an undocumented feature.
