In the Fall of 2024, I did a research project on CUDA. I first created a triangle-stripping algorithm in C++, as well as a vertex-reduction algorithm; both worked well and gave me a good baseline to beat. I also used the Thrust library to assist with data management, as it was much easier to let the library handle the GPU allocations than to track them myself. The reduction was fairly straightforward on CUDA, as it was not math-heavy and the data could be accessed in parallel without any issues. It was as simple as writing a custom method to first sort the vertex list by Pos, Col, and UV, then run custom comparators to find duplicates as I went, tracking each duplicate's location in the triangle list, since the final step was replacing each removed vertex's index in the triangle list. For these methods, I used Thrust to launch the CUDA threads, trusting it to pick the most optimal launch configuration.
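The reduction described above can be sketched on the host in plain C++ (the actual project ran it on the GPU through Thrust). The `Vertex` layout, function names, and exact-equality comparison here are assumptions for illustration, not the project's real code:

```cpp
#include <algorithm>
#include <cstdint>
#include <tuple>
#include <vector>

// Hypothetical vertex layout; the real project's Pos/Col/UV formats may differ.
struct Vertex {
    float px, py, pz;  // Pos
    float r, g, b;     // Col
    float u, v;        // UV
};

// Lexicographic comparator over Pos, then Col, then UV -- the sort order
// described in the text, so duplicate vertices end up adjacent.
static bool vertexLess(const Vertex& a, const Vertex& b) {
    auto key = [](const Vertex& w) {
        return std::tie(w.px, w.py, w.pz, w.r, w.g, w.b, w.u, w.v);
    };
    return key(a) < key(b);
}

static bool vertexEqual(const Vertex& a, const Vertex& b) {
    return !vertexLess(a, b) && !vertexLess(b, a);
}

// Deduplicate `verts` and rewrite `indices` (the triangle list) so every
// old index points at the surviving copy of its vertex.
void reduceVertices(std::vector<Vertex>& verts, std::vector<uint32_t>& indices) {
    const size_t n = verts.size();
    // Sort indirectly through an index array so original positions are
    // preserved for building the remap table.
    std::vector<uint32_t> order(n);
    for (uint32_t i = 0; i < n; ++i) order[i] = i;
    std::sort(order.begin(), order.end(),
              [&](uint32_t a, uint32_t b) { return vertexLess(verts[a], verts[b]); });

    std::vector<Vertex> unique;
    std::vector<uint32_t> remap(n);  // old vertex index -> new vertex index
    for (size_t i = 0; i < n; ++i) {
        if (i == 0 || !vertexEqual(verts[order[i]], verts[order[i - 1]]))
            unique.push_back(verts[order[i]]);
        remap[order[i]] = static_cast<uint32_t>(unique.size() - 1);
    }
    // Final step from the text: patch the triangle list.
    for (uint32_t& idx : indices) idx = remap[idx];
    verts = std::move(unique);
}
```

On the GPU, the sort and duplicate scan map naturally onto `thrust::sort` and neighboring-element comparisons, which is why Thrust was a good fit here.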
When it came to the triangle stripper, things became much more challenging due to the many race conditions involved. Generating the adjacency lists was simple; however, my initial method of finding the strips was not parallelizable in the slightest. On the C++ side, my method was to go through the triangles one by one and generate a strip from whatever triangles were available. It wasn't the best method, but it was far from the worst. When I began to parallelize it, I realized I could be marking the same triangle for multiple strips in the same cycle. Since this was an offline program, I chose a greedy algorithm instead: on the CUDA side, I would generate the longest possible strip for each triangle, choose the longest of those, mark its triangles as used, and repeat for the remaining triangles. This allowed me to not worry about the marking, since the only time triangles were marked in shared memory was while I was selecting the longest strips. This algorithm, while slow, produced the best ratio of triangles to vertices.
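The greedy selection loop above can be sketched as host-side C++ (in the project, the per-seed strip growth ran as one CUDA thread per triangle). The adjacency representation and function names are assumptions for illustration, and a real stripper must also respect winding order and which edge each neighbor shares; this sketch only captures the select-longest-then-mark structure:

```cpp
#include <utility>
#include <vector>

// Hypothetical adjacency: neighbors[t] lists triangles sharing an edge with t.

// Grow one candidate strip from `seed`, walking to any unused neighbor
// not already in the strip.
static std::vector<int> growStrip(int seed,
                                  const std::vector<std::vector<int>>& neighbors,
                                  const std::vector<bool>& used) {
    std::vector<int> strip{seed};
    std::vector<bool> inStrip(used.size(), false);
    inStrip[seed] = true;
    int cur = seed;
    for (;;) {
        int next = -1;
        for (int n : neighbors[cur])
            if (!used[n] && !inStrip[n]) { next = n; break; }
        if (next < 0) break;  // no extendable neighbor left
        strip.push_back(next);
        inStrip[next] = true;
        cur = next;
    }
    return strip;
}

// Greedy selection: build a candidate strip per unused triangle, keep the
// longest, mark its triangles used, and repeat until none remain. Triangles
// are only marked here, in the selection step, so the per-seed growth above
// can run in parallel without racing on the used flags.
std::vector<std::vector<int>> stripTriangles(
        const std::vector<std::vector<int>>& neighbors) {
    const int n = static_cast<int>(neighbors.size());
    std::vector<bool> used(n, false);
    std::vector<std::vector<int>> strips;
    int remaining = n;
    while (remaining > 0) {
        std::vector<int> best;
        for (int t = 0; t < n; ++t) {  // one CUDA thread per seed in the real version
            if (used[t]) continue;
            auto cand = growStrip(t, neighbors, used);
            if (cand.size() > best.size()) best = std::move(cand);
        }
        for (int t : best) used[t] = true;
        remaining -= static_cast<int>(best.size());
        strips.push_back(std::move(best));
    }
    return strips;
}
```

The quadratic outer loop is what makes the approach slow, but because the program is offline, trading runtime for a better triangle-to-vertex ratio is acceptable.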