{"id":109238,"date":"2025-12-04T14:20:51","date_gmt":"2025-12-04T22:20:51","guid":{"rendered":"https:\/\/developer.nvidia.com\/blog\/?p=109238"},"modified":"2025-12-11T12:46:59","modified_gmt":"2025-12-11T20:46:59","slug":"simplify-gpu-programming-with-nvidia-cuda-tile-in-python","status":"publish","type":"post","link":"https:\/\/developer.nvidia.com\/blog\/simplify-gpu-programming-with-nvidia-cuda-tile-in-python\/","title":{"rendered":"Simplify GPU Programming with NVIDIA CUDA Tile in Python"},"content":{"rendered":"\n<p>The<a href=\"https:\/\/developer.nvidia.com\/blog\/nvidia-cuda-13-1-powers-next-gen-gpu-programming-with-nvidia-cuda-tile-and-performance-gains\"> release of NVIDIA CUDA 13.1<\/a> introduces <a href=\"https:\/\/developer.nvidia.com\/blog\/focus-on-your-algorithm-nvidia-cuda-tile-handles-the-hardware\">tile-based programming<\/a> for GPUs, making it one of the most fundamental additions to GPU programming since CUDA was invented. Writing GPU tile kernels enables you to write your algorithm at a higher level than a single-instruction multiple-thread (SIMT) model, while the compiler and runtime handle the partitioning of work onto threads under the covers. Tile kernels also help abstract away special-purpose hardware like tensor cores, and write code that&#8217;ll be compatible with future GPU architectures.&nbsp;With the launch of NVIDIA cuTile Python, you can write tile kernels in Python.<\/p>\n\n\n\n<h2 id=\"what_is_cutile_python\"  class=\"wp-block-heading\">What is cuTile Python?<a href=\"#what_is_cutile_python\" class=\"heading-anchor-link\"><i class=\"fas fa-link\"><\/i><\/a><\/h2>\n\n\n\n<p>cuTile Python is an expression of the CUDA Tile programming model in Python, built on top of the CUDA Tile IR specification. 
It enables you to express GPU kernels in Python using a tile-based model, instead of or in addition to the single instruction, multiple threads (SIMT) model.&nbsp;<\/p>\n\n\n\n<p>SIMT programming requires specifying each GPU thread of execution. In principle, each thread can operate independently and execute a code path entirely different from any other thread&#8217;s. In practice, to use GPU hardware effectively, it&#8217;s typical to program algorithms where each thread performs the same work on separate pieces of data.&nbsp;<\/p>\n\n\n\n<p>SIMT enables maximum flexibility and specificity, but can also require more manual tuning to achieve top performance. The tile model abstracts away some of the hardware intricacies. You can focus on your algorithm at a higher level, while the NVIDIA CUDA compiler and runtime handle partitioning your tile algorithm into threads and launching them onto the GPU.<\/p>\n\n\n\n<p>cuTile is a programming model for writing parallel kernels for NVIDIA GPUs. In this model:<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Arrays are the primary data structure.<\/li>\n\n\n\n<li>Tiles are subsets of arrays that kernels operate on.<\/li>\n\n\n\n<li>Kernels are functions that are executed in parallel by blocks.<\/li>\n\n\n\n<li>Blocks are subsets of the GPU; operations on tiles are parallelized across each block.<\/li>\n<\/ul>\n\n\n\n<p>cuTile automates block-level parallelism and asynchrony, memory movement, and other low-level details of GPU programming. It leverages the advanced capabilities of NVIDIA hardware (such as tensor cores, shared memory, and tensor memory accelerators) without requiring explicit programming. 
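<\/p>\n\n\n\n<p>As a rough mental model of the decomposition the compiler and runtime perform, consider the following sketch. This is illustrative plain-NumPy CPU code, not the cuTile API, and the function <code>tiled_add<\/code> is hypothetical: the grid holds <code>ceil(n \/ tile_size)<\/code> blocks, and block <code>bid<\/code> operates on the tile spanning indices <code>bid * tile_size<\/code> up to <code>(bid + 1) * tile_size<\/code>.<\/p>

```python
# Illustrative sketch only (plain NumPy on the CPU, not the cuTile API):
# each "block" handles one tile of the arrays, and the grid holds
# ceil(n / tile_size) blocks, mirroring how cuTile partitions work.
from math import ceil

import numpy as np

def tiled_add(a, b, tile_size):
    n = a.shape[0]
    c = np.empty_like(a)
    for bid in range(ceil(n / tile_size)):   # one iteration per block
        lo = bid * tile_size
        hi = min(lo + tile_size, n)          # last tile may be partial
        c[lo:hi] = a[lo:hi] + b[lo:hi]       # elementwise op on one tile
    return c

a = np.arange(8, dtype=np.float32)
b = np.ones(8, dtype=np.float32)
print(tiled_add(a, b, tile_size=4))  # [1. 2. 3. 4. 5. 6. 7. 8.]
```

<p>On a GPU, cuTile runs these per-tile operations in parallel across blocks rather than in a sequential loop, and it manages the data movement for you. 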
cuTile is portable across different NVIDIA GPU architectures, enabling you to use the latest hardware features without rewriting your code.<\/p>\n\n\n\n<h2 id=\"who_is_cutile_for\"  class=\"wp-block-heading\">Who is cuTile for?<a href=\"#who_is_cutile_for\" class=\"heading-anchor-link\"><i class=\"fas fa-link\"><\/i><\/a><\/h2>\n\n\n\n<p>cuTile is for general-purpose data-parallel GPU kernel authoring. Our efforts so far have focused on optimizing cuTile for the types of computations typically encountered in AI\/ML applications. We\u2019ll continue to evolve cuTile, adding functionality and performance features to expand the range of workloads it can optimize.<\/p>\n\n\n\n<p>You might be asking why you\u2019d use cuTile to write kernels when CUDA C++ or CUDA Python has worked well so far. We talk more about this in another <a href=\"https:\/\/developer.nvidia.com\/blog\/focus-on-your-algorithm-nvidia-cuda-tile-handles-the-hardware\">post<\/a> describing the CUDA tile model. The short answer is that as GPU hardware becomes more complex, we&#8217;re providing an abstraction layer so developers can focus more on algorithms and less on mapping each algorithm to specific hardware.&nbsp;<\/p>\n\n\n\n<p>Writing tile programs enables you to target tensor cores with code compatible with future GPU architectures. Just as Parallel Thread Execution (PTX) provides the virtual Instruction Set Architecture (ISA)&nbsp;that underlies the SIMT model for GPU programming, Tile IR provides the virtual ISA for tile-based programming. It enables higher-level algorithm expression, while the software and hardware transparently map that representation to tensor cores to deliver peak performance.<\/p>\n\n\n\n<h2 id=\"cutile_python_example\"  class=\"wp-block-heading\">cuTile Python example<a href=\"#cutile_python_example\" class=\"heading-anchor-link\"><i class=\"fas fa-link\"><\/i><\/a><\/h2>\n\n\n\n<p>What does cuTile Python code look like? 
If you\u2019ve learned CUDA C++, you\u2019ve probably encountered the canonical vector addition kernel. Assuming the data has been copied from the host to the device, a vector add kernel in CUDA SIMT looks something like the following. It takes two vectors and adds them together elementwise to produce a third vector. This is one of the simplest CUDA kernels you can write.<\/p>\n\n\n<div class=\"wp-block-syntaxhighlighter-code \"><pre class=\"brush: cpp; title: ; notranslate\" title=\"\">\n__global__ void vecAdd(float* A, float* B, float* C, int vectorLength)\n{\n \/* calculate my thread index *\/\n int workIndex = threadIdx.x + blockIdx.x*blockDim.x;\n\n if(workIndex &lt; vectorLength)\n {\n  \/* perform the vector addition *\/\n  C&#x5B;workIndex] = A&#x5B;workIndex] + B&#x5B;workIndex];\n }\n}\n<\/pre><\/div>\n\n\n<p>In this kernel, each thread\u2019s work is explicitly specified, and the programmer selects the number of blocks and threads when launching it.<\/p>\n\n\n\n<p>Now, let\u2019s look at the equivalent code written in cuTile Python. We don\u2019t need to specify what each thread does. We only have to break the data into tiles and specify the mathematical operations for each tile. 
Everything else is handled for us.<\/p>\n\n\n\n<p>The cuTile Python kernel looks as follows:<\/p>\n\n\n<div class=\"wp-block-syntaxhighlighter-code \"><pre class=\"brush: python; title: ; notranslate\" title=\"\">\nimport cuda.tile as ct\n\n@ct.kernel\ndef vector_add(a, b, c, tile_size: ct.Constant&#x5B;int]):\n  # Get the 1D block index\n  pid = ct.bid(0)\n\n  # Load input tiles\n  a_tile = ct.load(a, index=(pid,), shape=(tile_size,))\n  b_tile = ct.load(b, index=(pid,), shape=(tile_size,))\n\n  # Perform elementwise addition\n  result = a_tile + b_tile\n\n  # Store result\n  ct.store(c, index=(pid,), tile=result)\n<\/pre><\/div>\n\n\n<p><code>ct.bid(0)<\/code> is the function that obtains the block ID along the (in this case) zeroth axis.&nbsp;It\u2019s the tile-model counterpart of referencing <code>blockIdx.x<\/code> in a SIMT kernel; there\u2019s no equivalent of <code>threadIdx.x<\/code>, because individual threads are never specified.&nbsp;<code>ct.load()<\/code> is the function that loads a tile of data, with the requisite index and shape, from device memory.&nbsp;Once data is loaded into tiles, these tiles can be used in computations.&nbsp;When all the computations are complete, <code>ct.store()<\/code> puts the tile data back into GPU device memory.<\/p>\n\n\n\n<h2 id=\"putting_it_all_together\"  class=\"wp-block-heading\">Putting it all together<a href=\"#putting_it_all_together\" class=\"heading-anchor-link\"><i class=\"fas fa-link\"><\/i><\/a><\/h2>\n\n\n\n<p>Now we\u2019ll show how to call this <code>vector_add<\/code> kernel in Python using a complete Python script that you can try yourself.&nbsp;The following is the complete code, including the kernel and the main function.<\/p>\n\n\n<div class=\"wp-block-syntaxhighlighter-code \"><pre class=\"brush: python; title: ; notranslate\" title=\"\">\n&quot;&quot;&quot;\nExample demonstrating simple vector addition.\nShows how to perform elementwise operations on vectors.\n&quot;&quot;&quot;\n\nfrom math import ceil\n\nimport cupy as cp\nimport numpy as 
np\nimport cuda.tile as ct\n\n\n@ct.kernel\ndef vector_add(a, b, c, tile_size: ct.Constant&#x5B;int]):\n  # Get the 1D block index\n  pid = ct.bid(0)\n\n  # Load input tiles\n  a_tile = ct.load(a, index=(pid,), shape=(tile_size,))\n  b_tile = ct.load(b, index=(pid,), shape=(tile_size,))\n\n  # Perform elementwise addition\n  result = a_tile + b_tile\n\n  # Store result\n  ct.store(c, index=(pid,), tile=result)\n\ndef test():\n  # Create input data\n  vector_size = 2**12\n  tile_size = 2**4\n  grid = (ceil(vector_size \/ tile_size), 1, 1)\n\n  a = cp.random.uniform(-1, 1, vector_size)\n  b = cp.random.uniform(-1, 1, vector_size)\n  c = cp.zeros_like(a)\n\n  # Launch kernel\n  ct.launch(cp.cuda.get_current_stream(),\n       grid, # 1D grid of blocks\n       vector_add,\n       (a, b, c, tile_size))\n\n  # Copy to host only to compare\n  a_np = cp.asnumpy(a)\n  b_np = cp.asnumpy(b)\n  c_np = cp.asnumpy(c)\n\n  # Verify results\n  expected = a_np + b_np\n  np.testing.assert_array_almost_equal(c_np, expected)\n\n  print(&quot;\u2713 vector_add_example passed!&quot;)\n\nif __name__ == &quot;__main__&quot;:\n  test()\n<\/pre><\/div>\n\n\n<p>Assuming you\u2019ve already <a href=\"https:\/\/docs.nvidia.com\/cuda\/cutile-python\/quickstart.html\">installed all the requisite software<\/a>, including cuTile Python and CuPy, running this code is as simple as invoking Python.<\/p>\n\n\n<div class=\"wp-block-syntaxhighlighter-code \"><pre class=\"brush: bash; title: ; notranslate\" title=\"\">\n$ python3 VectorAdd_quickstart.py\n\u2713 vector_add_example passed!\n<\/pre><\/div>\n\n\n<p>Congratulations, you just ran your first cuTile Python program!<\/p>\n\n\n\n<h2 id=\"developer_tools\"  class=\"wp-block-heading\">Developer tools<a href=\"#developer_tools\" class=\"heading-anchor-link\"><i class=\"fas fa-link\"><\/i><\/a><\/h2>\n\n\n\n<p>cuTile kernels can be profiled with NVIDIA Nsight Compute in the same way as SIMT kernels.<\/p>\n\n\n<div 
class=\"wp-block-syntaxhighlighter-code \"><pre class=\"brush: bash; title: ; notranslate\" title=\"\">\n$ ncu -o VecAddProfile --set detailed python3 VectorAdd_quickstart.py\n<\/pre><\/div>\n\n\n<p>Once you\u2019ve created the profile and opened it with the graphical version of Nsight Compute:<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Select the <code>vector_add<\/code> kernel<\/li>\n\n\n\n<li>Choose the \u201cDetails\u201d tab<\/li>\n\n\n\n<li>Expand the \u201cTile Statistics\u201d report section&nbsp;<\/li>\n<\/ul>\n\n\n\n<p>You should see an image similar to Figure 1.<\/p>\n\n\n<div class=\"wp-block-image\">\n<figure data-wp-context=\"{&quot;imageId&quot;:&quot;69efefea876d5&quot;}\" data-wp-interactive=\"core\/image\" class=\"aligncenter size-full wp-lightbox-container\"><img loading=\"lazy\" decoding=\"async\" width=\"1600\" height=\"932\" data-wp-class--hide=\"state.isContentHidden\" data-wp-class--show=\"state.isContentVisible\" data-wp-init=\"callbacks.setButtonStyles\" data-wp-on-async--click=\"actions.showLightbox\" data-wp-on-async--load=\"callbacks.setButtonStyles\" data-wp-on-async-window--resize=\"callbacks.setButtonStyles\" src=\"https:\/\/developer-blogs.nvidia.com\/wp-content\/uploads\/2025\/11\/cuTile-Nsight.png\" alt=\"Image of the profile generated from Nsight Compute, showing the tile statistics for the vector_add kernel.\" class=\"wp-image-109241\" srcset=\"https:\/\/developer-blogs.nvidia.com\/wp-content\/uploads\/2025\/11\/cuTile-Nsight.png 1600w, https:\/\/developer-blogs.nvidia.com\/wp-content\/uploads\/2025\/11\/cuTile-Nsight-300x175.png 300w, https:\/\/developer-blogs.nvidia.com\/wp-content\/uploads\/2025\/11\/cuTile-Nsight-625x364.png 625w, https:\/\/developer-blogs.nvidia.com\/wp-content\/uploads\/2025\/11\/cuTile-Nsight-179x104.png 179w, https:\/\/developer-blogs.nvidia.com\/wp-content\/uploads\/2025\/11\/cuTile-Nsight-768x447.png 768w, 
https:\/\/developer-blogs.nvidia.com\/wp-content\/uploads\/2025\/11\/cuTile-Nsight-1536x895.png 1536w, https:\/\/developer-blogs.nvidia.com\/wp-content\/uploads\/2025\/11\/cuTile-Nsight-645x376.png 645w, https:\/\/developer-blogs.nvidia.com\/wp-content\/uploads\/2025\/11\/cuTile-Nsight-500x291.png 500w, https:\/\/developer-blogs.nvidia.com\/wp-content\/uploads\/2025\/11\/cuTile-Nsight-155x90.png 155w, https:\/\/developer-blogs.nvidia.com\/wp-content\/uploads\/2025\/11\/cuTile-Nsight-362x211.png 362w, https:\/\/developer-blogs.nvidia.com\/wp-content\/uploads\/2025\/11\/cuTile-Nsight-189x110.png 189w, https:\/\/developer-blogs.nvidia.com\/wp-content\/uploads\/2025\/11\/cuTile-Nsight-1024x596.png 1024w, https:\/\/developer-blogs.nvidia.com\/wp-content\/uploads\/2025\/11\/cuTile-Nsight-927x540.png 927w\" sizes=\"auto, (max-width: 1600px) 100vw, 1600px\" \/><button\n\t\t\tclass=\"lightbox-trigger\"\n\t\t\ttype=\"button\"\n\t\t\taria-haspopup=\"dialog\"\n\t\t\taria-label=\"Enlarge\"\n\t\t\tdata-wp-init=\"callbacks.initTriggerButton\"\n\t\t\tdata-wp-on-async--click=\"actions.showLightbox\"\n\t\t\tdata-wp-style--right=\"state.imageButtonRight\"\n\t\t\tdata-wp-style--top=\"state.imageButtonTop\"\n\t\t>\n\t\t\t<svg xmlns=\"http:\/\/www.w3.org\/2000\/svg\" width=\"12\" height=\"12\" fill=\"none\" viewBox=\"0 0 12 12\">\n\t\t\t\t<path fill=\"#fff\" d=\"M2 0a2 2 0 0 0-2 2v2h1.5V2a.5.5 0 0 1 .5-.5h2V0H2Zm2 10.5H2a.5.5 0 0 1-.5-.5V8H0v2a2 2 0 0 0 2 2h2v-1.5ZM8 12v-1.5h2a.5.5 0 0 0 .5-.5V8H12v2a2 2 0 0 1-2 2H8Zm2-12a2 2 0 0 1 2 2v2h-1.5V2a.5.5 0 0 0-.5-.5H8V0h2Z\" \/>\n\t\t\t<\/svg>\n\t\t<\/button><figcaption class=\"wp-element-caption\"><em>Figure 1. 
Profile generated from Nsight Compute, showing the tile statistics for the vector_add kernel<\/em><\/figcaption><\/figure><\/div>\n\n\n<p>Notice that the Tile Statistics report section includes the number of tile blocks specified, the block size (chosen by the compiler), and other tile-specific information.<\/p>\n\n\n\n<p>The Source page also supports cuTile kernels, showing performance metrics at the source-line level just as it does for CUDA C++ kernels.&nbsp;<\/p>\n\n\n\n<h2 id=\"how_developers_can_get_cutile\"  class=\"wp-block-heading\">How developers can get cuTile<a href=\"#how_developers_can_get_cutile\" class=\"heading-anchor-link\"><i class=\"fas fa-link\"><\/i><\/a><\/h2>\n\n\n\n<p>To run cuTile Python programs, you need the following:<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li>A GPU with compute capability 8.x, 10.x, 11.x, or 12.x (in future CUDA releases, we\u2019ll add support for additional GPU architectures)<\/li>\n\n\n\n<li>NVIDIA Driver R580 or later (R590 is required for tile-specific developer tools support)<\/li>\n\n\n\n<li>CUDA Toolkit 13.1 or later<\/li>\n\n\n\n<li>Python version 3.10 or higher<\/li>\n\n\n\n<li>The cuTile Python package: <code>pip install cuda-tile<\/code><\/li>\n<\/ul>\n\n\n\n<h2 id=\"get_started\"  class=\"wp-block-heading\">Get started<a href=\"#get_started\" class=\"heading-anchor-link\"><i class=\"fas fa-link\"><\/i><\/a><\/h2>\n\n\n\n<p>Check out a few videos to help you learn more:<\/p>\n\n\n\n<figure class=\"wp-block-embed aligncenter is-type-video is-provider-youtube wp-block-embed-youtube\"><div style=\"display: contents;\" >\n\n<div data-mode=\"normal\" data-oembed=\"1\" data-provider=\"youtube\" id=\"arve-youtube-cndbqfaoq9k\" style=\"max-width:900px;\" class=\"arve\">\n\t<div class=\"arve-inner\">\n\t\t<div style=\"aspect-ratio:500\/281\" class=\"arve-embed arve-embed--has-aspect-ratio\">\n\t\t\t<div class=\"arve-ar\" style=\"padding-top:56.200000%\"><\/div>\n\t\t\t<iframe allow=\"accelerometer &#039;none&#039;;autoplay 
&#039;none&#039;;bluetooth &#039;none&#039;;browsing-topics &#039;none&#039;;camera &#039;none&#039;;clipboard-read &#039;none&#039;;clipboard-write;display-capture &#039;none&#039;;encrypted-media &#039;none&#039;;gamepad &#039;none&#039;;geolocation &#039;none&#039;;gyroscope &#039;none&#039;;hid &#039;none&#039;;identity-credentials-get &#039;none&#039;;idle-detection &#039;none&#039;;keyboard-map &#039;none&#039;;local-fonts;magnetometer &#039;none&#039;;microphone &#039;none&#039;;midi &#039;none&#039;;otp-credentials &#039;none&#039;;payment &#039;none&#039;;picture-in-picture;publickey-credentials-create &#039;none&#039;;publickey-credentials-get &#039;none&#039;;screen-wake-lock &#039;none&#039;;serial &#039;none&#039;;summarizer &#039;none&#039;;sync-xhr;usb &#039;none&#039;;web-share;window-management &#039;none&#039;;xr-spatial-tracking &#039;none&#039;;\" allowfullscreen=\"\" class=\"arve-iframe fitvidsignore\" credentialless data-arve=\"arve-youtube-cndbqfaoq9k\" data-lenis-prevent=\"\" data-src-no-ap=\"https:\/\/www.youtube-nocookie.com\/embed\/cNDbqFaoQ9k?feature=oembed&amp;iv_load_policy=3&amp;modestbranding=1&amp;rel=0&amp;autohide=1&amp;playsinline=0&amp;autoplay=0\" frameborder=\"0\" height=\"505.8\" loading=\"lazy\" name=\"\" referrerpolicy=\"strict-origin-when-cross-origin\" sandbox=\"allow-scripts allow-same-origin allow-presentation allow-popups allow-popups-to-escape-sandbox\" scrolling=\"no\" src=\"https:\/\/www.youtube-nocookie.com\/embed\/cNDbqFaoQ9k?feature=oembed&#038;iv_load_policy=3&#038;modestbranding=1&#038;rel=0&#038;autohide=1&#038;playsinline=0&#038;autoplay=0\" title=\"\" width=\"900\"><\/iframe>\n\t\t\t\n\t\t<\/div>\n\t\t\n\t<\/div>\n\t\n\t\n\t\n\t\n<\/div>\n<\/div><figcaption class=\"wp-element-caption\"><em>Video 1. 
Getting started with cuTile Python<\/em><\/figcaption><\/figure>\n\n\n\n<ul class=\"wp-block-list\">\n<li><a href=\"https:\/\/www.youtube.com\/watch?v=Gf5xYeluBRU\">What is CUDA Tile?<\/a><\/li>\n\n\n\n<li><a href=\"https:\/\/www.youtube.com\/watch?v=UEdGJGz8Eyg\">The Future is Tiled<\/a><\/li>\n\n\n\n<li><a href=\"https:\/\/youtu.be\/YFrP03KuMZ8\">Deep Dive: How to Use cuTile Python<\/a><\/li>\n<\/ul>\n\n\n\n<p>Also, check out the <a href=\"http:\/\/docs.nvidia.com\/cuda\/cutile-python\">cuTile Python documentation<\/a>.<\/p>\n\n\n\n<p>You&#8217;re now ready to try the sample programs on <a href=\"http:\/\/github.com\/nvidia\/cutile-python\">GitHub<\/a> and start programming in cuTile Python.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>The release of NVIDIA CUDA 13.1 introduces tile-based programming for GPUs, making it one of the most fundamental additions to GPU programming since CUDA was invented. Writing GPU tile kernels enables you to write your algorithm at a higher level than a single-instruction multiple-thread (SIMT) model, while the compiler and runtime handle the partitioning of &hellip; <a href=\"https:\/\/developer.nvidia.com\/blog\/simplify-gpu-programming-with-nvidia-cuda-tile-in-python\/\">Continued<\/a><\/p>\n","protected":false},"author":2569,"featured_media":107703,"comment_status":"closed","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"_acf_changed":false,"publish_to_discourse":"","publish_post_category":"318","wpdc_auto_publish_overridden":"1","wpdc_topic_tags":"","wpdc_pin_topic":"","wpdc_pin_until":"","discourse_post_id":"1726490","discourse_permalink":"https:\/\/forums.developer.nvidia.com\/t\/simplify-gpu-programming-with-nvidia-cuda-tile-in-python\/353602","wpdc_publishing_response":"success","wpdc_publishing_error":"","nv_subtitle":"","ai_post_summary":"<ul><li>The release of NVIDIA CUDA 13.1 introduced tile-based programming for GPUs, allowing developers to write higher-level GPU tile kernels 
and abstracting away hardware details, such as tensor cores, for better compatibility with future GPU architectures.<\/li><li>cuTile Python enables writing tile kernels in Python, focusing on dividing arrays into tiles that can be operated on in parallel, with the NVIDIA CUDA compiler and runtime automating low-level GPU tasks like block-level parallelism, memory movement, and hardware feature usage.<\/li><li>cuTile targets data-parallel workloads, especially in AI\/ML applications, letting developers concentrate on algorithm design by providing an abstraction layer that simplifies kernel authoring compared to the traditional SIMT (single-instruction multiple-thread) model.<\/li><\/ul>","footnotes":"","_links_to":"","_links_to_target":""},"categories":[696,4146,1903],"tags":[4897,4896,453,61],"coauthors":[4325,287],"class_list":["post-109238","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-data-science","category-development","category-features","tag-cuda-tile","tag-cutile","tag-featured","tag-python","tagify_workload-data-science"],"acf":{"post_industry":["General"],"post_products":["CUDA"],"post_learning_levels":["Intermediate Technical"],"post_content_types":["Tutorial"],"post_collections":""},"jetpack_featured_media_url":"https:\/\/developer-blogs.nvidia.com\/wp-content\/uploads\/2025\/10\/floating-cubes.png","primary_category":{"category":"Data Science","link":"https:\/\/developer.nvidia.com\/blog\/category\/data-science\/","id":696,"data_source":""},"nv_translations":[{"language":"zh_CN","title":"\u5728\u00a0Python\u00a0\u4e2d\u501f\u52a9\u00a0NVIDIA CUDA Tile\u00a0\u7b80\u5316\u00a0GPU\u00a0\u7f16\u7a0b","post_id":16013},{"language":"ko_KR","title":"\ud30c\uc774\uc36c\uc5d0\uc11c NVIDIA CUDA Tile\ub85c GPU \ud504\ub85c\uadf8\ub798\ubc0d 
\uac04\uc18c\ud654","post_id":4584}],"jetpack_shortlink":"https:\/\/wp.me\/pcCQAL-spU","jetpack_likes_enabled":true,"jetpack_sharing_enabled":true,"_links":{"self":[{"href":"https:\/\/developer-blogs.nvidia.com\/wp-json\/wp\/v2\/posts\/109238","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/developer-blogs.nvidia.com\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/developer-blogs.nvidia.com\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/developer-blogs.nvidia.com\/wp-json\/wp\/v2\/users\/2569"}],"replies":[{"embeddable":true,"href":"https:\/\/developer-blogs.nvidia.com\/wp-json\/wp\/v2\/comments?post=109238"}],"version-history":[{"count":18,"href":"https:\/\/developer-blogs.nvidia.com\/wp-json\/wp\/v2\/posts\/109238\/revisions"}],"predecessor-version":[{"id":113868,"href":"https:\/\/developer-blogs.nvidia.com\/wp-json\/wp\/v2\/posts\/109238\/revisions\/113868"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/developer-blogs.nvidia.com\/wp-json\/wp\/v2\/media\/107703"}],"wp:attachment":[{"href":"https:\/\/developer-blogs.nvidia.com\/wp-json\/wp\/v2\/media?parent=109238"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/developer-blogs.nvidia.com\/wp-json\/wp\/v2\/categories?post=109238"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/developer-blogs.nvidia.com\/wp-json\/wp\/v2\/tags?post=109238"},{"taxonomy":"author","embeddable":true,"href":"https:\/\/developer-blogs.nvidia.com\/wp-json\/wp\/v2\/coauthors?post=109238"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}