The unique characteristic of graph learning systems is that the preprocessing of graph structures and the collection of features are usually the bottlenecks. How to resolve those two bottlenecks? Quiver's core ideas are:

- Graph sampling is a latency-critical problem, so massive parallelism can hide the latency.
- Feature collection is a bandwidth-critical problem, so bandwidth should be optimized to achieve high performance.

Normally, we choose to use the CPU to sample the graph and collect features. This strategy not only causes performance issues in single-GPU training but also poor scalability, because of CPU resource sharing. Below is the benchmark of training ogbn-products with PyG and DGL, showing the scalability of multi-GPU training with CPU sampling and feature collection:

[Benchmark figure: multi-GPU scaling of PyG and DGL with CPU sampling and feature collection; not preserved in this copy.]

Quiver is a high-performance GNN training add-on which can fully utilize the hardware. Users get better end-to-end performance and better scalability in multi-GPU GNN training; with the help of NVLink, we even achieve super-linear scalability. Next we introduce our contributions to the optimization of graph sampling and feature collection.

2 Graph Sampling

2.1 Existing Approaches

Existing graph learning frameworks support both CPU and GPU sampling. CPU sampling has poor performance and bad scalability. GPU sampling needs to store the entire graph structure in GPU memory, which is a limitation when the graph is large.

2.2 Quiver's Approach

Quiver provides users with a UVA-based (Unified Virtual Addressing) graph sampling operator, which supports storing the graph structure in CPU memory and sampling it with the GPU when the graph is large. In this way, we not only get performance benefits beyond CPU sampling, but can also sample graphs whose size exceeds GPU memory.

We evaluate our sampling performance on the ogbn-products and reddit datasets, and we find that the performance of UVA-based sampling is much higher than CPU sampling. Our sampling metric is the sampling throughput (Sampled Edges Per Second, SEPS). We can observe that, without storing the graph on GPU, Quiver gets a 20x speedup on real datasets.

[Benchmark table: sampling throughput (SEPS) of UVA-based vs. CPU sampling on ogbn-products and reddit; not preserved in this copy.]

Meanwhile, Quiver's sampling can be configured to enable GPU-only sampling (mode='GPU'), which puts the whole graph structure in GPU memory to achieve higher performance. Compared to UVA mode, this sampling strategy gives a 30-40% speedup at the cost of extra GPU memory:

```python
dataset = PygNodePropPredDataset('ogbn-products', root)
csr_topo = quiver.CSRTopo(dataset[0].edge_index)
# You can set mode='GPU' to place the graph data in GPU memory
quiver_sampler = quiver.GraphSageSampler(csr_topo, sizes=[15, 10], device=0, mode='UVA')
```

Furthermore, UVA sampling does not consume CPU computation resources, which avoids contention when we train with multiple GPUs and gives better scalability. This part will be further explained in our end-to-end performance benchmark.

Note: Benchmark code here

3 Feature Collection

3.1 Existing Approaches

The feature size of one batch in GNN training is usually hundreds of MBs or even several GBs, so the feature collection time, especially in the CPU-to-GPU case, has a significant cost. The core idea of optimizing feature collection is to optimize the throughput of feature transfer. Existing approaches:

1. Collect the features on CPU, and transfer the result to GPU for training.
2. When the feature size is small, put all feature data on GPU and collect the batch features there.

Approach 1 has a much lower throughput, and it has poor scalability because multiple workers share CPU memory bandwidth.

3.2 Quiver's Approach

Quiver provides a feature collection component, quiver.Feature, with high throughput. The implementation of quiver.Feature is based on the two observations below:

1. Real graph datasets usually follow the power law, which means most of the edges in the graph are associated with a small portion of the nodes. Most sampling algorithms based on the graph structure will sample more of the nodes whose degrees are higher. We show in the table below that two datasets follow our observations.

[Table: per-dataset ratios #(degree > average_degree) / total_nodes_count and #edges(with node degree > average) / total_edges_count; values not preserved in this copy.]

2. The memory bandwidth relationship of a GPU server is: GPU Global Memory > GPU P2P with NVLink > Pinned Memory > Pageable Memory.

Considering the bandwidth hierarchy and the power law, Quiver's quiver.Feature can allocate features automatically to GPU memory and CPU pinned memory. It can also host hot data on GPU and cold data on CPU (if csr_topo is provided), and features are collected with the GPU.

First we show the performance of feature collection using one GPU. The table below shows the performance of quiver.Feature when we cache 20% hot data on GPU, compared with normal CPU feature collection.

[Table: single-GPU feature-collection throughput, quiver.Feature with a 20% hot-data cache vs. CPU feature collection; not preserved in this copy.]
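To make the sampling discussion above concrete, here is a minimal pure-Python sketch of GraphSAGE-style fanout sampling over a CSR graph, the kind of work a sampler such as Quiver's GraphSageSampler performs on GPU. This is not Quiver's implementation; the toy graph, function names, and fanout values are illustrative assumptions.

```python
import random

# Toy CSR graph (hypothetical data): indptr[v]..indptr[v+1] indexes into
# indices, giving the neighbor list of node v. Quiver builds the same kind
# of structure from an edge index via its CSRTopo class.
indptr = [0, 3, 4, 6, 7, 8]          # 5 nodes
indices = [1, 2, 3, 0, 0, 4, 0, 2]   # flattened adjacency lists

def sample_layer(seeds, fanout, rng):
    """For each seed, sample up to `fanout` neighbors without replacement."""
    sampled = []
    for v in seeds:
        nbrs = indices[indptr[v]:indptr[v + 1]]
        k = min(fanout, len(nbrs))
        sampled.extend(rng.sample(nbrs, k))
    return sampled

def sample_subgraph(seeds, sizes, rng):
    """Multi-hop sampling: `sizes` lists the per-hop fanout (e.g. [15, 10])."""
    frontier, layers = list(seeds), []
    for fanout in sizes:
        frontier = sample_layer(frontier, fanout, rng)
        layers.append(frontier)
    return layers

rng = random.Random(0)
layers = sample_subgraph([0], sizes=[2, 2], rng=rng)
print([len(layer) for layer in layers])  # per-hop frontier sizes
```

Each seed is expanded independently, which is why this workload parallelizes so well: the latency of the many small, random memory reads can be hidden by launching thousands of such expansions concurrently on GPU threads.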
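The hot/cold placement idea behind quiver.Feature can also be sketched in a few lines. This is a minimal simulation, not Quiver's code: "GPU" and "CPU" storage are plain dicts, and the degrees, cache ratio, and batch are hypothetical. It shows why caching the top fraction of nodes by degree serves most accesses, given power-law degrees and degree-biased sampling.

```python
# Minimal sketch (assumed names, toy data) of degree-based hot/cold feature
# placement: cache the highest-degree nodes' features in fast "GPU" storage
# and keep the rest in slower "CPU" storage.

def split_hot_cold(degrees, cache_ratio):
    """Return (hot, cold) node-id sets; top `cache_ratio` fraction by degree is hot."""
    order = sorted(range(len(degrees)), key=lambda v: degrees[v], reverse=True)
    n_hot = int(len(order) * cache_ratio)
    return set(order[:n_hot]), set(order[n_hot:])

def collect(batch, hot, hot_store, cold_store):
    """Gather features for a batch, counting fast-path (hot) hits."""
    feats, hits = [], 0
    for v in batch:
        if v in hot:
            feats.append(hot_store[v]); hits += 1
        else:
            feats.append(cold_store[v])
    return feats, hits

# Power-law-ish toy degrees: node 0 is a hub.
degrees = [50, 9, 8, 2, 1, 1, 1, 1, 1, 1]
hot, cold = split_hot_cold(degrees, cache_ratio=0.2)  # cache 20% of nodes
features = {v: [float(v)] for v in range(10)}
hot_store = {v: features[v] for v in hot}
cold_store = {v: features[v] for v in cold}

# Degree-biased batch: samplers visit high-degree nodes far more often.
batch = [0, 0, 1, 0, 2, 0]
feats, hits = collect(batch, hot, hot_store, cold_store)
print(hits, "/", len(batch))  # prints 5 / 6: most accesses hit the small hot cache
```

With only 20% of nodes cached, 5 of the 6 accesses in this degree-biased batch are served from the fast tier, which is the same effect the 20%-hot-cache benchmark above relies on.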