Release Highlights

Easier Application Porting

- Share GPUs across multiple threads
- Use all GPUs in the system concurrently from a single host thread
- No-copy pinning of system memory, a faster alternative to cudaMallocHost()
- C++ new/delete and support for virtual functions
- Support for inline PTX assembly
- Thrust library of templated performance primitives such as sort, reduce, etc.
- NVIDIA Performance
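To make the Thrust item above more concrete, here is a minimal sketch of the sort and reduce primitives it mentions. Thrust's algorithms are modeled on the C++ standard library, so the sketch is written to compile two ways: with nvcc it uses `thrust::device_vector`, `thrust::sort`, and `thrust::reduce`, and with a plain host compiler it falls back to `std::sort` and `std::accumulate` so it can be tried without a GPU. The function name `sort_and_sum` and the `__CUDACC__` switch are assumptions made for illustration; they are not taken from the release notes.

```cpp
// Illustrative sketch only: sort + reduce, on the GPU via Thrust when built
// with nvcc, or on the host via the STL when built with a regular C++ compiler.
#include <algorithm>
#include <cassert>
#include <numeric>
#include <vector>

#ifdef __CUDACC__
#include <thrust/copy.h>
#include <thrust/device_vector.h>
#include <thrust/reduce.h>
#include <thrust/sort.h>
#endif

// Hypothetical helper (not part of Thrust): sorts `values` ascending in place
// and returns the sum of its elements.
int sort_and_sum(std::vector<int>& values) {
    int sum = 0;
#ifdef __CUDACC__
    // Copy the host data to the device, then sort and reduce there.
    thrust::device_vector<int> d(values.begin(), values.end());
    thrust::sort(d.begin(), d.end());
    sum = thrust::reduce(d.begin(), d.end(), 0);
    // Copy the sorted result back into the caller's host vector.
    thrust::copy(d.begin(), d.end(), values.begin());
#else
    // Host-only fallback with the same call shapes as the Thrust path.
    std::sort(values.begin(), values.end());
    sum = std::accumulate(values.begin(), values.end(), 0);
#endif
    // Postcondition shared by both paths.
    assert(std::is_sorted(values.begin(), values.end()));
    return sum;
}
```

Calling `sort_and_sum` on a `std::vector<int>` leaves the vector sorted and returns the total. The point of the two paths is that the Thrust calls read almost exactly like their STL counterparts, with only the container type changing.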