OpenACC treats data differently according to whether it is in discrete or shared memory. For discrete memory, specific data operations are specified and implicit data clauses are defined. For shared memory, any data clauses that are present may be ignored, although an implementation may choose to use them as optimization hints. I have historically thought of these hints in terms of CUDA Unified/Managed Memory, with preferred-location and prefetching hints. A few cases that were brought to my attention suggest that this way of thinking may not be sufficient.
I have been made aware of an application that makes extensive use of the following pattern. A temporary array is allocated locally (in the example it is an automatic array), and dynamic data lifetimes are used to expose it to the device asynchronously. The function may return, deallocating the automatic array, before all operations on that array have completed. Supporting this pattern requires either that memory allocation and deallocation are stream-ordered or that some sort of garbage collection is implemented to clean up the present table lazily after all operations have completed.
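To make the pattern concrete, here is a minimal sketch of how I read it, assuming C with a variable-length array standing in for the automatic array. The function name scale_with_temp, the flag wait_before_exit, and the arithmetic are hypothetical and chosen only for illustration; only the name B and the enter data / async kernel / exit data shape come from the discussion. Built without OpenACC support, the pragmas are ignored and the code runs sequentially on the host.

```c
#include <assert.h>

/* Hypothetical illustration: doubles a[0:n] in place via a temporary B.
 * wait_before_exit == 0 mimics the pattern as described (nothing waits
 * before the function returns); wait_before_exit != 0 is a conservative
 * variant that drains queue 1 before B's dynamic lifetime is ended. */
void scale_with_temp(float *a, int n, int wait_before_exit)
{
    assert(n > 0);   /* a variable-length array needs a positive size */
    float B[n];      /* automatic temporary: gone when the function returns */

    /* Expose B to the device through a dynamic data lifetime. */
    #pragma acc enter data create(B[0:n]) async(1)

    /* Kernel queued on async queue 1; it may run after the host moves on. */
    #pragma acc parallel loop present(B[0:n]) copy(a[0:n]) async(1)
    for (int i = 0; i < n; ++i) {
        B[i] = 2.0f * a[i];
        a[i] = B[i];
    }

    if (wait_before_exit) {
        /* Conservative variant: make sure the kernel has finished first. */
        #pragma acc wait(1)
    }

    /* End of B's dynamic data lifetime. */
    #pragma acc exit data delete(B[0:n]) async(1)

    /* With wait_before_exit == 0, B is deallocated here while work that
     * uses it may still be queued, and the caller has to wait on queue 1
     * before reading a. */
}
```

With wait_before_exit set to 0 this is the questionable shape: the host reaches the exit data directive and the end of the function without waiting on queue 1.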
Some more text later in that paragraph:
"A data lifetime is the duration from when the data is first made available to the accelerator until it becomes unavailable."
and:
"For data not in shared memory, the data lifetime begins when it is made present and ends when it is no longer present."
The definition of "present data" from the glossary:
"data for which the sum of the structured and dynamic reference counters is greater than zero in a single device memory section".
As we discussed today, reference counting is synchronous with the host. Thus, in the first example above, the reference count of B becomes zero at the acc exit data directive. B is therefore no longer present after that point, its data lifetime ends, and it is no longer available to the accelerator. The kernel accessing B is then invalid, because its async clause means it might execute after that point. In other words, that program is invalid even for discrete memory. I'm just not sure any implementation will detect that violation.
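The argument above can be replayed with a toy model. This is only a sketch of my reading, not an OpenACC API and not how any implementation necessarily works: every name here (model_entry, model_enter_data, and so on) is hypothetical. It assumes the reference counters change at directive time on the host, while kernels queued with async may still be pending.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy present-table entry for one array (hypothetical, for illustration). */
typedef struct {
    int structured_refcount;  /* updated on the host at directive time */
    int dynamic_refcount;     /* updated on the host at directive time */
    int pending_kernels;      /* queued async kernels that use the array */
} model_entry;

/* "Present" per the glossary wording: the sum of the counters is > 0. */
bool model_is_present(const model_entry *e)
{
    return e->structured_refcount + e->dynamic_refcount > 0;
}

void model_enter_data(model_entry *e) { e->dynamic_refcount++; }

void model_exit_data(model_entry *e)
{
    if (e->dynamic_refcount > 0) e->dynamic_refcount--;
}

void model_launch_async(model_entry *e) { e->pending_kernels++; }

void model_wait(model_entry *e) { e->pending_kernels = 0; }

/* True when queued work may still touch data whose lifetime has ended. */
bool model_lifetime_violation(const model_entry *e)
{
    return e->pending_kernels > 0 && !model_is_present(e);
}

/* Replays enter data, async kernel, optional wait, exit data for B. */
bool model_replay_pattern(bool wait_before_exit)
{
    model_entry b = {0, 0, 0};
    model_enter_data(&b);
    assert(model_is_present(&b));   /* B is present once entered */
    model_launch_async(&b);
    if (wait_before_exit) model_wait(&b);
    model_exit_data(&b);
    return model_lifetime_violation(&b);
}
```

In this model, model_replay_pattern(false) reports a violation and model_replay_pattern(true) does not, which matches the reading that the kernel may run after B's data lifetime has ended unless something waits before the exit data directive.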
My understanding of what the spec says here keeps changing. Am I still misunderstanding? Does the spec need to be changed?