[B! cuda] y_uuki's Bookmarks

GPU Programming in Rust: Implementing High Level Abstractions in a Systems Level Language

y_uuki 2016/05/03

Darknet: Open Source Neural Networks in C

Darknet is an open source neural network framework written in C and CUDA. It is fast, easy to install, and supports CPU and GPU computation. You can find the source on GitHub or you can read more about what Darknet can do right here: Installing Darknet Darknet is easy to install and run. This post will guide you through it. YOLO: Real-Time Object Detection You only look once (YOLO) is a state-of-t

y_uuki 2016/03/21

More modern gpu

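Explains why GPUs are fast, and which data structures, algorithms, and libraries can be used on them. In particular, it shows how non-uniform, discrete algorithms such as MapReduce can be implemented at high speed. Code used in the experiments: https://github.com/hillbig/gpuexperiments Seminar video: https://www.youtube.com/watch?v=WmETPBK3MOI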

y_uuki 2015/12/19

GPU
cuda

File Not Found: Indiana University

File Not Found. Sorry for the inconvenience, the page you requested could not be found.

y_uuki 2013/07/16

Introduction to PyCUDA

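11. Specs: Intel Xeon X5570 × 2 (64-bit, 4 cores); NVIDIA Tesla M2050 × 2 (448 GPU cores, 1.03 Tflops peak single precision, double precision supported); 22 GB memory; fast network (high-speed communication within the same Placement Group); CentOS 5.5. Buying a comparable machine would cost roughly ¥800,000 to ¥1,000,000? 12. Cost: On-Demand Instance: $2.1 / hour (¥178.5 / hour); 1 year ¥1,563,660, 3 years ¥4,690,980, … Reserved Instance: 1 year $5,630 (¥478,550), $0.65 / hour (¥55 / hour); 3 years $8,650 (¥735,250), $0.33 / hour (¥28 / hour). Plus EBS charges and the like (from $2 per month). Note: calculated at $1 = ¥85.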
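
The yen figures in the cost slide can be checked with a few lines of arithmetic. The following is a minimal sketch, assuming the instance runs nonstop (24 hours a day, 365 days a year) and using the slide's rate of $1 = ¥85; the function names `to_yen` and `on_demand_yen` are made up for illustration and do not come from the slides.

```python
from decimal import Decimal

YEN_PER_DOLLAR = Decimal("85")   # exchange rate stated in the slide
HOURS_PER_YEAR = 24 * 365        # assumption: the instance runs nonstop all year


def to_yen(dollars: str) -> Decimal:
    """Convert a dollar amount (given as a string) to yen at the slide's rate."""
    return Decimal(dollars) * YEN_PER_DOLLAR


def on_demand_yen(hourly_dollars: str, years: int) -> Decimal:
    """Cost in yen of running an On-Demand instance nonstop for `years` years."""
    return to_yen(hourly_dollars) * HOURS_PER_YEAR * years


print("On-Demand per hour:", to_yen("2.1"))
print("On-Demand 1 year:  ", on_demand_yen("2.1", 1))
print("On-Demand 3 years: ", on_demand_yen("2.1", 3))
print("Reserved upfront:  ", to_yen("5630"), "and", to_yen("8650"))
print("Reserved per hour: ", to_yen("0.65"), "and", to_yen("0.33"))
```

Decimal is used instead of float so that $2.1 × 85 comes out as exactly 178.5. The Reserved Instance hourly rates convert to ¥55.25 and ¥28.05, which the slide appears to round down to ¥55 and ¥28.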

y_uuki 2013/05/31

mydoc/gpu_profiler_counter.md at main · iwag/mydoc

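This is the day-21 article of GPU Advent Calendar 2012. Well, I know it's a bit late in the game, but I got interested in GPGPU and was looking into CUDA, and then I got asked to write something, so here I am. I didn't take up GPUs because there was anything in particular I wanted to build, but as long as you're using a GPU you want things to get faster, right? So I've been reading books, NVIDIA material, and resources on the web. But then it's all "this is bandwidth-limited", "this is coalescing", "shared memory has bank conflicts", "what is branching doing here". For a beginner like me, honestly it's tough to look at a program and say "here is where coalescing happens". If someone tells me to just build it and measure, I have no comeback, but I don't want to go back and forth just to make things faster!! And really, it's pathetic of me, but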

y_uuki 2013/05/26

Modern GPU

© 2013, NVIDIA CORPORATION. All rights reserved. Code and text by Sean Baxter, NVIDIA Research. (Click here for license. Click here for contact information.) Modern GPU is code and commentary intended to promote new and productive ways of thinking about GPU computing. This project is a library, an algorithms book, a tutorial, and a best-practices guide. If you are new to CUDA, start here. If you'r

y_uuki 2013/05/21

hgpu.org

In this paper, we focus on three sparse matrix operations that are relevant for machine learning applications, namely, the sparse-dense matrix multiplication (SPMM), the sampled dense-dense matrix multiplication (SDDMM), and the composition of the SDDMM with SPMM, also termed as FusedMM. We develop optimized implementations for SPMM, SDDMM, and FusedMM operations utilizing Intel oneAPI’s Explicit

y_uuki 2013/05/19

Using async memcopy without using cudaMallocHost/cudaHostAlloc?

y_uuki 2013/02/25

cuda
memory

Why is CUDA pinned memory so fast?

I observe substantial speedups in data transfer when I use pinned memory for CUDA data transfers. On linux, the underlying system call for achieving this is mlock. From the man page of mlock, it states that locking the page prevents it from being swapped out: mlock() locks pages in the address range starting at addr and continuing for len bytes. All pages that contain a part of the specified addre

y_uuki 2013/02/25

cuda
memory

CUDA Pinned Host Memory - Total Disclosure Site (the facts, as they are)

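What is pinned host memory? It is host memory that is never paged out and is suited for use with CUDA. It can be newly allocated with cudaHostAlloc, and existing host memory can be pinned with cudaHostRegister. Memory allocated with cudaHostAlloc can be freed with cudaFreeHost, and memory pinned with cudaHostRegister can be returned to non-pinned host memory with cudaHostUnregister. Pinned host memory allows fast transfers to and from the GPU, and can also be used as Mapped memory, described later. The transfer speed of pinned host memory can be checked easily by running the bandwidthTest sample program included in the CUDA SDK. Using ordinary non-pinned host memory: $ NVIDIA_GPU_Com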

y_uuki 2013/02/14

CUDA

How to Optimize Data Transfers in CUDA C/C++ | NVIDIA Technical Blog

In the previous three posts of this CUDA C & C++ series we laid the groundwork for the major thrust of the series: how to optimize CUDA C/C++ code. In this and the following post we begin our discussion of code optimization with how to efficiently transfer data between the host and device. The peak bandwidth between the device memory and the GPU is much higher (144 GB/s on the NVIDIA Tesla C2050,

y_uuki 2013/02/14

CUDA

The CUDA Handbook: Stream Callbacks

y_uuki 2013/02/13

CUDA

New Features in CUDA 5 (3): Submitting CPU Work to a Stream - Total Disclosure Site (the facts, as they are)

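This article is based on information from the RC version, and the release version is known to differ. It has been revised to match the release specification, but outdated information may remain. CUDA has a feature called Stream, which lets GPU kernels and memory transfers be managed in a queue that is independent of the CPU main thread. In CUDA 5 this Stream feature was extended so that CPU work can be submitted to a Stream. API: cudaError_t cudaStreamAddCallback(cudaStream_t, cudaStreamCallback_t, void *, unsigned int) This enqueues, on the queue of the Stream given as the first argument, a call to the function pointer given as the second argument, with the third argument passed to it as its argument. The fourth argument is a flag for the operation, and as soon as the CPU callback function is called, without blocking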
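
The queueing behaviour described in this excerpt can be pictured with a small toy model. The sketch below is plain Python and does not call CUDA at all: `ToyStream`, `enqueue`, `add_callback`, and `synchronize` are hypothetical names chosen for illustration, and the callback here receives only the user data, which is a simplification of the real `cudaStreamAddCallback` interface. It only aims to show that work queued on a stream runs in FIFO order on a queue separate from the main thread, so a callback runs after everything queued before it.

```python
import queue
import threading


class ToyStream:
    """Toy FIFO work queue, loosely modeling a CUDA stream (not the CUDA API)."""

    def __init__(self):
        self._tasks = queue.Queue()
        self._worker = threading.Thread(target=self._run, daemon=True)
        self._worker.start()

    def _run(self):
        # Execute tasks strictly in the order they were enqueued.
        while True:
            func, arg = self._tasks.get()
            try:
                func(arg)
            finally:
                self._tasks.task_done()

    def enqueue(self, func, arg):
        """Queue func(arg) and return immediately, like an asynchronous launch."""
        self._tasks.put((func, arg))

    def add_callback(self, callback, user_data):
        """Queue host-side work behind everything enqueued so far."""
        self.enqueue(callback, user_data)

    def synchronize(self):
        """Block the caller until every queued task has finished."""
        self._tasks.join()


log = []
stream = ToyStream()
stream.enqueue(log.append, "copy host to device")  # stands in for a memory transfer
stream.enqueue(log.append, "run kernel")           # stands in for a kernel launch
stream.add_callback(lambda data: log.append(f"callback({data})"), "user data")
stream.synchronize()
print(log)
```

In this model the main thread returns from `enqueue` and `add_callback` right away and only waits in `synchronize`, which is the property that makes it possible to chain host-side work after GPU work without stalling the main thread.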

y_uuki 2013/02/13

cuda

LLVM meets GPU again! - Qiita

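This is the day-8 article of the GPGPU Advent Calendar. LLVM meets GPU again! Since CUDA 4.1, when nvcc generates code for Compute Capability 2.0 or higher, it performs optimization and other passes via NVVM, a subset of LLVM IR, and then converts the result to PTX, the intermediate representation defined by NVIDIA. Earlier this year this work was merged into upstream LLVM, and it will make its official debut in version 3.2. The official release of LLVM 3.2 is scheduled for 2012/12/16, about a week from now, but a branch for 3.2 has already been cut in the repository. This time, let's get a head start and use LLVM 3.2 to generate PTX from LLVM IR! Historical background: the PTX backend implemented by NVIDIA that was merged into LLVM this time is called NVPTX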

y_uuki 2013/02/13

CUDA
LLVM

CUDA vs. Phi: Phi Programming for CUDA Developers

y_uuki 2013/02/13

cuda
xeonphi

Algorithm Implementation Using reduce, as Seen Through Thrust

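"The Great Day of His Wrath" by John Martin. Introduction: The world is now in the age of massive parallelism. Just as humanity made a dramatic leap once it got hold of fire, the parallel paradigm has made it possible to compute at speeds that were once thought utterly out of reach. However, every technique has its drawbacks, and actually implementing it is very hard. The reason is that parallelization comes with problems of its own, and a naive implementation can, far from being faster than the CPU, end up slower. Avoiding this requires properly understanding the internal structure of the GPU and then writing code specialized for each algorithm you want to implement. On reflection, though, this is odd. What we want to implement is a technique, not hardware. Why do we need to understand such details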

y_uuki 2013/01/31

CUDA 4.0 - cudaHostUnregister is slow

y_uuki 2013/01/31

CUDA

Total Disclosure Site (the facts, as they are)

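The purpose of this wiki is to disclose "raw" development data. Accordingly, specifications and other information are updated frequently, and more often than not they differ from the specifications of shipped products. Earlier versions of updated information can be viewed in the history. The "Benchmarks" section mainly covers applications that are often used with our products and newly supported operating systems. The "Technical Information" section publishes technical information treasured by HPC Systems. Feedback on this site is accepted only as posts to the site, and replies are likewise made only as posts to the site. Unless otherwise noted, information on this site is based on the situation at the time each page was written. Please be aware that circumstances may have changed since then and that the content may differ from the latest situation at the time of viewing. Featured information! Today's shipments 2009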