[Driver] BUG: unable to handle page fault for address: ffffa7c13eaffff8 #2198
Comments
Is the IOMMU enabled? Would it be possible to test by adding the iommu=off parameter to the kernel command line? Thanks. |
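For anyone checking the same thing, here is a minimal sketch of listing the IOMMU-related parameters the running kernel was booted with. It assumes a Linux system exposing `/proc/cmdline`; the helper names `iommu_params` and `read_cmdline` are hypothetical and not part of any ROCm tooling.

```python
from pathlib import Path


def iommu_params(cmdline: str) -> dict[str, str]:
    """Return the kernel command-line parameters whose name contains 'iommu'."""
    found = {}
    for token in cmdline.split():
        # Split "name=value"; a bare flag yields an empty value.
        name, _, value = token.partition("=")
        if "iommu" in name.lower():
            found[name] = value
    return found


def read_cmdline(path: str = "/proc/cmdline") -> str:
    """Read the command line of the currently running kernel (Linux only)."""
    return Path(path).read_text()


if __name__ == "__main__":
    # Prints e.g. {'iommu': 'off'} if that parameter was passed at boot.
    print(iommu_params(read_cmdline()))
```

This only reports what was passed at boot; it does not show whether the firmware setting agrees with it.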
It's off in both BIOS and on cmdline. I also fixed some PCI-E issues with
|
Tried a different computer. This might be a different crash; I think this is the one that's fixed in the new kernel. But again, this is exactly your install instructions. Followed https://docs.amd.com/bundle/ROCm-Installation-Guide-v5.5/page/Introduction_to_ROCm_Installation_Guide_for_Linux.html to a T.
|
Switched to Linux tiny2 6.2.16-060216-generic, and now back to this crash. Same crash with
|
Ahh okay. https://community.amd.com/t5/knowledge-base/iommu-advisory-for-amd-instinct/ta-p/484601 It was suggested in our discord to turn the IOMMU on (not off), then use I got this, but I haven't seen the gart_map crash.
Stable for 3 minutes until
Attempting to reset the GPU caused a kernel panic.
|
I believe the following patch should fix your issue. Could you please try? Thanks. |
Sorry, I'm not working with AMD GPUs anymore. In addition to the amdgpu_gart_map issue, there are two others I've seen, with the MES and sdma0, that may be hardware issues. The driver and hardware don't have the stability I would need to feel okay selling them to tinybox customers, and without hardware documentation these issues are very hard to investigate. I recommend setting up a variety of systems constantly fuzzing from user space to catch kernel driver issues. I expect no amount of running demo apps in loops to crash either the kernel or the GPU, and it seems like it's a long way from there. Feel free to close my issues. |
@geohot Quite sad. I saw your RDNA3 stream the other day and I was excited to see that you were using AMD GPUs in your tinyboxes. It would probably have helped motivate AMD and the community to spend more resources on making ROCm mature in the first place. |
@MatPoliquin Even the official install instructions fail, which means the product is not ready for sale. To compare with AMD, take a look at this tweet. |
Next station, Intel oneAPI? |
While AMD automatically closes all tickets after a few months, I believe it would be beneficial to fix this anyway. |
the first thing I had to deal with in my work is the fact that this stuff is so deliberately arcane that it is structurally impossible to substantially modify from the outside. most people who are in the trenches of systems programming these days understand this sad reality. I hope things can improve structurally so that it is not always a matter of mindshare and collective resources for people to be able to try new things with existing technology. |
There is great and promising progress on AMD ROCm / HIP support in other projects. I think it's not the end of the red game yet: turboderp/exllama#7 |
I have had an RX 580 for the past few years. I bought it with the future in mind: it supported Vulkan, and AMD promised to improve the software in the near future, making it more open and better. Back then I got somewhat angry and disappointed, but then I thought, OK, maybe this is because the RX 580 is not powerful enough? But now? Really, AMD? Why not put in a little more effort and do it better? I hope AMD can really solve this; if not, I don't expect a better future for AMD in graphics and AI. |
We have to push AMD to give us support for mainstream cards. |
Fixed in ROCm 5.6.0 |
This one is the worst; it only seems to occur when I put two GPUs in the system.
Sometimes it just happens, but it is easy to reproduce by running https://github.com/RadeonOpenCompute/rocm_bandwidth_test
2x 7900XTX
ASROCK ROMED8-2T
EPYC 7662
Ubuntu 22.04, Kernel 6.2.14-060214-generic, ROCm 5.5
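The reproduction described above can be scripted as a loop. This is only a sketch under assumptions: the function name `run_until_failure` is hypothetical, and the default command `rocm-bandwidth-test` assumes the built binary has that name and is on `PATH`; pass a different command on the command line if it does not.

```python
import subprocess
import sys


def run_until_failure(cmd: list[str], max_runs: int = 100):
    """Run cmd repeatedly; return the 1-based number of the first failing run, or None."""
    for run in range(1, max_runs + 1):
        result = subprocess.run(
            cmd, stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL
        )
        if result.returncode != 0:
            return run
    return None


if __name__ == "__main__":
    # Assumption: the benchmark binary name; override via command-line arguments.
    cmd = sys.argv[1:] or ["rocm-bandwidth-test"]
    failed_at = run_until_failure(cmd)
    if failed_at is None:
        print("no failure observed")
    else:
        print(f"command failed on run {failed_at}")
```

A kernel page fault may hang or panic the machine instead of returning a non-zero exit code, so the kernel log still has to be checked separately.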