ILoveControl Posted October 14, 2024

Hi all. First, to get the versions out of the way: everything is on the latest version. The Director is an EA5, with two other EA5s. I run a C4 LEAF 6x6 matrix, which has been rock solid for years.

So what's the issue? Randomly (I haven't been able to figure out a pattern), while watching TV through the matrix (regardless of the input), the screen goes black but sound still comes through (via a Triad amp). All navigators go unresponsive for about 5 minutes, then all of a sudden they come back online (phone app, T4s, Neo remote, SR260), but the video signal is still lost (audio keeps playing through the speakers). I also notice on the navigators that what I was watching goes from "watching" to off, meaning I have to select the input again (Netflix, YouTube, cable, Apple TV, etc.). Then things carry on.

I don't want to influence what the issue could be, but it almost seems like a Director reboot (everything disconnecting). I haven't checked on the Director yet to see if it did reboot. I have rebooted the matrix; the same issue still happens, on all the inputs.

Any ideas on what I can check? I have checked the network side of things and everything seems to be in normal working order: there weren't any network outages on my system at the time, and nothing in any of the networking stack's logs indicates a network drop.
DLite Posted October 14, 2024

Are you using a Wattbox for power? The older Wattboxes have a problem where half the outlets randomly switch off temporarily and then come back online. Eventually, the whole Wattbox dies, but that initial "intermittent" stage can last for a good while.
ILoveControl (Author) Posted October 14, 2024

Nope, no Wattbox at all in my install, but you did make me go check power, and I can confirm that all power-related connections are tested and in good working condition. Thanks for the prompt, though.
RAV Posted October 14, 2024

Audio is output from the same matrix, right? And audio keeps playing through the problem?
ILoveControl (Author) Posted October 15, 2024

10 hours ago, RAV said: Audio is output from the same matrix right? And audio plays through the problem?

Yes, that's correct.
RAV Posted October 15, 2024

How do you get the image to come back: reboot the matrix, reboot the source, power the TV off, or does it come back when you reselect the room source? Is the matrix controlled via RS-232 or IP?
ILoveControl (Author) Posted October 15, 2024

2 hours ago, RAV said: How do you get the image to come back? reboot matrix, reboot source, power TV off, comes back when I reselect the room source? matrix is 232 or IP?

Great questions.

How do I get it back? I wait roughly 2-5 minutes for all the navigators to come back online (phone app, Neo, SR260, T4); they all go offline at the same time as the "blackout". I don't physically reboot anything. When they return, the sources (under Watch) are all marked as off, as if the room is off (but audio is still coming through the speakers). I then select the source again and the picture comes back as normal.

This is happening once, at most twice, a day, at random times; I can't find a pattern to attribute it to. The last three times I was watching Apple TV, but it has also happened with cable TV and with Netflix running off an Android Nvidia box, so it's across different sources. What I did notice on the Apple TV is that I completely lose C4 remote control via IP, and it never recovers until I reboot the Apple TV. That said, I'm on tvOS 18, so that could be attributed to the current known tvOS 18 bug, who knows.

Matrix control = IP.

It's just weird; this setup had been so stable.
South Africa C4 user Posted October 15, 2024

This does sound like a Director reboot to me. Have you checked uptime on your Director? Or programmed a push notification on startup of Director?
ILoveControl (Author) Posted October 16, 2024

14 hours ago, South Africa C4 user said: This does sound like a Director reboot to me. Have you checked uptime on your Director? Or programmed a push notification on startup of Director?

Man, I miss terminal access to the directors; I could have done this more easily with normal Linux commands...

So, I downloaded a snapshot of the EA5. Under system_info > uptime.txt:

#-------------------[ uptime ] -----------------------#
12:08:23 up 15:17, load average: 7.15, 7.10, 7.02
-------------------------------------------------------

I am assuming:
Up since = 12:08:23
Uptime = 15h 17m

Is this correct? Looking at the load average, it doesn't seem like this director is overworking itself, so if it's a reboot, it's something else...
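On reading that uptime line: in standard uptime output, the first field is the clock time at which the command ran (here, presumably when the snapshot was taken), and the value after "up" is how long the system has been running. The boot time is therefore the snapshot time minus 15:17, rather than 12:08:23 itself. A minimal Python sketch of that arithmetic follows. The snapshot date of October 16, 2024 is an assumption based on the date of this post, and parse_uptime and boot_time are illustrative names, not Control4 tools.

```python
import re
from datetime import date, datetime, timedelta

def parse_uptime(line: str):
    """Split an uptime line into (clock time string, uptime as timedelta)."""
    m = re.search(r"(\d{1,2}:\d{2}:\d{2})\s+up\s+(.*?),\s+load average", line)
    if not m:
        raise ValueError("unrecognised uptime line")
    clock, up = m.group(1), m.group(2)
    days = hours = minutes = 0
    d = re.search(r"(\d+)\s+day", up)      # e.g. "2 days, 3:04"
    hm = re.search(r"(\d+):(\d{2})", up)   # e.g. "15:17"
    mn = re.search(r"(\d+)\s+min", up)     # e.g. "42 min"
    if d:
        days = int(d.group(1))
    if hm:
        hours, minutes = int(hm.group(1)), int(hm.group(2))
    elif mn:
        minutes = int(mn.group(1))
    return clock, timedelta(days=days, hours=hours, minutes=minutes)

def boot_time(snapshot_day: date, line: str) -> datetime:
    """Boot time = moment the uptime command ran, minus the reported uptime."""
    clock, up = parse_uptime(line)
    ran_at = datetime.combine(snapshot_day, datetime.strptime(clock, "%H:%M:%S").time())
    return ran_at - up

LINE = "12:08:23 up 15:17, load average: 7.15, 7.10, 7.02"
# ASSUMPTION: the snapshot was taken on the day of this post.
booted = boot_time(date(2024, 10, 16), LINE)
print(booted)
```

Under that date assumption the boot time works out to 20:51:23 on October 15, which is the previous evening rather than midday.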
ILoveControl (Author) Posted October 16, 2024

LONG AND VERBOSE, but it could help someone in the future:

Gosh, gotta love ChatGPT. I uploaded the system snapshot, used some prompts, and now some interesting info is coming out. Two things in particular:

Thermals: easy to fix, the server room has cooling.
Out of Memory issues: complex.

Summary: it seems like the GStreamer service is causing very high memory utilization. That makes sense, since most of the C4 system is there for media switching. HOWEVER, this system has been the same for years and Out of Memory (OOM) issues haven't happened before. In addition, this was one TV with one source; the matrix wasn't being overused, and I can't see how one source could cause an OOM issue. Bug?

Confirming the reboot + Out of Memory:

The logs indicate that the system rebooted on October 15, 2024, at 20:51:15. This is evident from the log entries showing the boot process, including processor initialization and memory information.

Summary: The system was rebooted at 20:51:15. This aligns with the earlier discovery of out-of-memory (OOM) events, which likely caused the reboot.

Having said that, if anyone knows where in the snapshot to look for the log that would state why it rebooted, let me know.

Thermals:

The c4faultd.log reveals several important details:

Temperature sensor issues: On October 15, 2024, there was a failure related to the CPU temperature sensor:

ERROR: Failed Opening(/dev/thermal). Error(Device or resource busy)
ERROR: Error initializing temperature sensor for CPU1

This suggests that the system had issues accessing the thermal sensors, which might indicate overheating or sensor malfunctions.

Frequent system logging restarts: There are multiple instances of c4faultd (the fault management service) being restarted, including:

INFO: Initializing logging from /etc/logging/c4faultd.conf
DEBUG: MQTT - topic for this device: s1/c4/c4faultd/v1/control4_ea5_xxxxxxxxxxxx/status

This might indicate that the fault service was trying to reinitialize due to persistent issues.

Possible causes:

CPU or hardware overheating: The errors related to the temperature sensor failing to read the CPU temperature may indicate that the system was rebooting due to overheating or a hardware fault.
Device or resource busy errors: This could mean the system was struggling to manage resources, possibly leading to faults or crashes.

These issues could collectively cause the system to reboot.

I looked at the thermal file. Temps are high (I know the server room runs hot), averaging about 63 degrees Celsius, but for most equipment this isn't really high?
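For anyone who wants to check that average against their own thermal.txt, here is a minimal Python sketch. It assumes entries of the form shown in this post (controller=ea5,heatsink=..,cpu0=..,cpu1=..). The three sample entries are copied from that log, and summarize is an illustrative name rather than anything shipped by Control4.

```python
import re
from statistics import mean

# Matches the key=value tail of each thermal.txt entry as quoted in this thread.
ENTRY = re.compile(r"heatsink=(\d+),cpu0=(\d+),cpu1=(\d+)")

def summarize(text: str) -> dict:
    """Average and peak temperatures from thermal.txt-style text.

    finditer is used instead of line splitting so the function still works
    when the log has been pasted as one long flattened line.
    """
    rows = [tuple(int(v) for v in m.groups()) for m in ENTRY.finditer(text)]
    if not rows:
        raise ValueError("no thermal entries found")
    heatsink, cpu0, cpu1 = zip(*rows)
    return {
        "samples": len(rows),
        "heatsink_avg": mean(heatsink),
        "cpu0_avg": mean(cpu0),
        "cpu1_avg": mean(cpu1),
        "cpu_max": max(cpu0 + cpu1),
    }

SAMPLE = """\
2024-10-10 17:17:54 +0200 C4EA5-XXXXXXXXXXX [3788] INFO: controller=ea5,heatsink=47,cpu0=61,cpu1=64
2024-10-10 17:22:55 +0200 C4EA5-XXXXXXXXXXX [3788] INFO: controller=ea5,heatsink=47,cpu0=61,cpu1=62
2024-10-10 17:28:55 +0200 C4EA5-XXXXXXXXXXX [3788] INFO: controller=ea5,heatsink=47,cpu0=60,cpu1=62
"""

stats = summarize(SAMPLE)
print(stats)
```

Feeding it the whole file rather than three lines gives the averages over the full logging period, which makes it easier to see whether temperatures climb before a reboot.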
Example of the thermal log (thermal.txt):

2024-10-10 17:17:54 +0200 C4EA5-XXXXXXXXXXX [3788] INFO: controller=ea5,heatsink=47,cpu0=61,cpu1=64
2024-10-10 17:18:54 +0200 C4EA5-XXXXXXXXXXX [3788] INFO: controller=ea5,heatsink=47,cpu0=61,cpu1=64
2024-10-10 17:19:54 +0200 C4EA5-XXXXXXXXXXX [3788] INFO: controller=ea5,heatsink=47,cpu0=61,cpu1=64
2024-10-10 17:20:54 +0200 C4EA5-XXXXXXXXXXX [3788] INFO: controller=ea5,heatsink=47,cpu0=61,cpu1=64
2024-10-10 17:21:54 +0200 C4EA5-XXXXXXXXXXX [3788] INFO: controller=ea5,heatsink=47,cpu0=61,cpu1=64
2024-10-10 17:22:55 +0200 C4EA5-XXXXXXXXXXX [3788] INFO: controller=ea5,heatsink=47,cpu0=61,cpu1=62
2024-10-10 17:23:55 +0200 C4EA5-XXXXXXXXXXX [3788] INFO: controller=ea5,heatsink=47,cpu0=61,cpu1=64
2024-10-10 17:24:55 +0200 C4EA5-XXXXXXXXXXX [3788] INFO: controller=ea5,heatsink=47,cpu0=61,cpu1=64
2024-10-10 17:25:55 +0200 C4EA5-XXXXXXXXXXX [3788] INFO: controller=ea5,heatsink=47,cpu0=62,cpu1=64
2024-10-10 17:26:55 +0200 C4EA5-XXXXXXXXXXX [3788] INFO: controller=ea5,heatsink=47,cpu0=61,cpu1=64
2024-10-10 17:27:55 +0200 C4EA5-XXXXXXXXXXX [3788] INFO: controller=ea5,heatsink=47,cpu0=61,cpu1=64
2024-10-10 17:28:55 +0200 C4EA5-XXXXXXXXXXX [3788] INFO: controller=ea5,heatsink=47,cpu0=60,cpu1=62
2024-10-10 17:29:55 +0200 C4EA5-XXXXXXXXXXX [3788] INFO: controller=ea5,heatsink=47,cpu0=61,cpu1=64
2024-10-10 17:30:55 +0200 C4EA5-XXXXXXXXXXX [3788] INFO: controller=ea5,heatsink=47,cpu0=61,cpu1=64
2024-10-10 17:31:55 +0200 C4EA5-XXXXXXXXXXX [3788] INFO: controller=ea5,heatsink=47,cpu0=61,cpu1=64
2024-10-10 17:32:55 +0200 C4EA5-XXXXXXXXXXX [3788] INFO: controller=ea5,heatsink=47,cpu0=61,cpu1=62
2024-10-10 17:33:55 +0200 C4EA5-XXXXXXXXXXX [3788] INFO: controller=ea5,heatsink=47,cpu0=61,cpu1=62
2024-10-10 17:34:55 +0200 C4EA5-XXXXXXXXXXX [3788] INFO: controller=ea5,heatsink=47,cpu0=61,cpu1=64
2024-10-10 17:35:55 +0200 C4EA5-XXXXXXXXXXX [3788] INFO: controller=ea5,heatsink=47,cpu0=61,cpu1=64
2024-10-10 17:36:55 +0200 C4EA5-XXXXXXXXXXX [3788] INFO: controller=ea5,heatsink=47,cpu0=61,cpu1=62
2024-10-10 17:37:55 +0200 C4EA5-XXXXXXXXXXX [3788] INFO: controller=ea5,heatsink=47,cpu0=61,cpu1=62
2024-10-10 17:38:55 +0200 C4EA5-XXXXXXXXXXX [3788] INFO: controller=ea5,heatsink=47,cpu0=62,cpu1=62
2024-10-10 17:39:56 +0200 C4EA5-XXXXXXXXXXX [3788] INFO: controller=ea5,heatsink=47,cpu0=61,cpu1=61
2024-10-10 17:40:56 +0200 C4EA5-XXXXXXXXXXX [3788] INFO: controller=ea5,heatsink=47,cpu0=61,cpu1=62
2024-10-10 17:41:56 +0200 C4EA5-XXXXXXXXXXX [3788] INFO: controller=ea5,heatsink=47,cpu0=61,cpu1=62
2024-10-10 17:42:56 +0200 C4EA5-XXXXXXXXXXX [3788] INFO: controller=ea5,heatsink=47,cpu0=60,cpu1=62
2024-10-10 17:43:56 +0200 C4EA5-XXXXXXXXXXX [3788] INFO: controller=ea5,heatsink=47,cpu0=61,cpu1=62
2024-10-10 17:44:56 +0200 C4EA5-XXXXXXXXXXX [3788] INFO: controller=ea5,heatsink=47,cpu0=60,cpu1=62
2024-10-10 17:45:56 +0200 C4EA5-XXXXXXXXXXX [3788] INFO: controller=ea5,heatsink=47,cpu0=61,cpu1=64

Carrying on with some ChatGPT help:

The bootloader_replace.log shows that the bootloader replacement process was successful, with entries such as:

"Copy Bootloader To Flash Success!"
"Bootloader reflash success!!!"

There are no indications of errors or failures in this log, so it seems that the bootloader process itself was not a source of the problem. Since the bootloader log appears clean, I’ll look into any remaining logs for final confirmation or additional insights, particularly checking debug logs and other system-level files like dman.log.

Summary for thermals:

Summary of Findings:

Temperature Sensor Issues: There were errors indicating that the system failed to read from the CPU thermal sensor, which might have led to system instability or reboots due to overheating.
Frequent Restarts of the c4faultd Service: The fault management service was restarted multiple times on October 15, 2024, which aligns with system instability.
No Bootloader or Major Device Errors: Both the bootloader and dman logs don't indicate recent or significant failures, and they seem unrelated to the recent reboot.

Conclusion: The evidence points toward hardware-related issues, possibly related to the thermal sensor failure and potential overheating. This could have caused the system to reboot as a protective measure. The frequent restarts of fault management services (c4faultd) are likely a response to ongoing faults in the system.

Now, what was interesting was some out-of-memory issues, around the time I recall for the blackout.

Out of Memory:

The kernel log indicates that there were memory-related events and potential out-of-memory (OOM) issues around October 15, 2024. Specifically, there is a log entry that shows the system handling memory-related problems:

Out of memory warnings:

[<c10bc671>] ? out_of_memory+0x281/0x2c0

Per-CPU memory stats: The log also shows CPU-specific memory usage (DMA and Normal per-CPU stats), which are often associated with memory management under high load.

These logs suggest that the system might have been under heavy load or experiencing memory pressure, leading to the system running out of available memory. This could have triggered memory management functions like the out-of-memory (OOM) killer, which could force processes to terminate or lead to system instability.

I then asked if it could give me the specific times at which Out of Memory (OOM) was being experienced, to see if they matched when I recall the random blackout. For me it was around 20:00; look at the third event mentioned below.

The out-of-memory (OOM) events occurred at multiple times throughout October 15, 2024. Here are the key times:

First OOM event sequence: Between 01:33:43 and 01:34:15 (over a span of about 30 seconds). Several entries indicate that the system experienced memory exhaustion and related warnings during this time.
Second OOM event: At 18:37:19.
Third OOM event sequence: Between 20:50:02 and 20:50:39, with multiple entries showing memory-related issues.
Fourth OOM event: At 22:38:18.

These repeated instances of memory exhaustion indicate that the system faced high memory pressure multiple times on that day, suggesting that memory management might have been a critical issue contributing to the system's instability.

But it seems to cause a blackout only at certain OOM events, not all of them, as I cannot recall any issue at 22:38.

I then asked our friend ChatGPT to cross-reference the OOM events with reboots, and...

Let me cross-reference these logs for any evidence of a reboot around 20:00 and align it with the out-of-memory entries.

The sysmand.log shows evidence of a system restart around 20:50 on October 15, 2024, as indicated by the series of messages stating that multiple services were being restarted:

INFO: Started c4faultd
INFO: Started various services like appled, audio3client, avahi-daemon, broker, etc.

This indicates that the system likely rebooted at 20:50, and several services were initialized immediately afterward.

Cross-analysis with the Out-of-Memory (OOM) Events: We see OOM events starting around 20:50:02, which align closely with the time the system restarted. The OOM events likely led to the system becoming unstable, eventually triggering the reboot shortly afterward.

Conclusion: The system experienced multiple OOM events, particularly at 20:50, which caused memory exhaustion and contributed to a system reboot at approximately 20:50. This ties the out-of-memory conditions directly to the reboot you observed.

So we're getting somewhere... It looks like the system is running out of memory. Now to figure out WHY (whose badly written driver is causing this?). That's a joke, before someone has a sense of humor failure.

I'll start by analyzing the kernel log and related logs to see if any processes or services were reported with high memory consumption or being killed due to memory pressure around 20:50.
The kernel log entries from 20:50:02 show that a process named GstSystemClock (likely related to GStreamer, a multimedia framework) was active during the out-of-memory (OOM) event:

Process involved: GstSystemClock with PID: 21030.
Memory pressure warnings: The log indicates that the system was running out of memory (OOM) at this time.

Key Insights: The GstSystemClock process is associated with GStreamer, which is commonly used in media handling applications for tasks like audio and video processing. It's possible that a multimedia service or application was consuming excessive memory, leading to the OOM event.

Now I want to check which specific driver/service is calling on GStreamer and causing it to spike:

I will start by looking through other logs to identify which services or applications may have triggered GstSystemClock and analyze if there are any dependencies or interactions leading to memory overuse. Let’s inspect related logs like driver_debug.log and other debug logs for clues.

It seems there are no relevant entries in the driver_debug.log related to GstSystemClock, media services, or memory issues. This suggests that the driver log may not directly track interactions with the multimedia service causing the memory issue.

Next, I'll inspect additional logs like sysmand.log and any other potential debug logs that could capture interactions or specific applications triggering the GstSystemClock process, focusing on media or multimedia-related services.

The sysmand.log contains several references to audio3client and audio3server, which appear to be audio-related services. These services were:

Restarted at 01:34, and again enabled/disabled around 01:38.

The entries suggest that audio3client and audio3server were actively running and managed during the timeframe of the system instability. While these services were restarted earlier in the day, there could still be a connection between these audio services and the GstSystemClock process during media handling or streaming, leading to high memory usage.

This is where it starts becoming a dead end... I would have to do some real-time debugging on these services to see when they spike, so I'm going to try to figure out if I can enable this through Composer:

The audio3client and audio3server logs show regular management activity (starting, stopping, and enabling) during October 15, 2024. These services were:

Restarted and brought back online around 01:34.
Enabled and disabled around 01:38.

However, there are no specific memory spikes or warnings directly associated with these services based on the logs reviewed.

The results from the broader search show additional instances of the out-of-memory (OOM) killer being invoked, specifically related to the pan manager run process:

The OOM killer was triggered at 01:33:43 on October 15, 2024, involving the pan manager run process (PID: 4733). This is similar to the earlier memory pressure issue seen with GstSystemClock.

Top Services Involved in High Memory Usage:

GstSystemClock: As identified earlier, this multimedia process was involved in memory issues during media handling.
pan manager run: This process, which could be related to a media, display, or management task, was involved in the OOM event, as indicated by the logs.
Audio services (audio3client and audio3server): Although not directly flagged for high memory, these services were restarted and active around the time of the system instability.

Conclusion: GstSystemClock and pan manager run are the primary services consuming high memory and triggering the OOM killer. These services, along with media-related operations, seem to be the main contributors to the system running out of memory.
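The cross-referencing of OOM events against the reboot can also be checked by hand once the timestamps are known. Here is a minimal Python sketch using only the times quoted in this post (the OOM event times and the 20:51:15 reboot). The five-minute window and the function name oom_before_reboot are assumptions chosen for illustration.

```python
from datetime import datetime, timedelta

DAY = "2024-10-15"

def at(clock: str) -> datetime:
    """Build a datetime on October 15, 2024 from an HH:MM:SS string."""
    return datetime.strptime(f"{DAY} {clock}", "%Y-%m-%d %H:%M:%S")

# Start/end points of the OOM sequences and the single events reported above.
OOM_EVENTS = [at(t) for t in (
    "01:33:43", "01:34:15", "18:37:19", "20:50:02", "20:50:39", "22:38:18",
)]
REBOOT = at("20:51:15")

def oom_before_reboot(reboot, events, window=timedelta(minutes=5)):
    """Return the OOM events that fall within `window` before the reboot."""
    return [e for e in events if timedelta(0) <= reboot - e <= window]

preceding = oom_before_reboot(REBOOT, OOM_EVENTS)
gap = REBOOT - max(preceding) if preceding else None

for event in preceding:
    print("OOM shortly before reboot:", event.time())
print("gap between last OOM entry and reboot:", gap)
```

With these timestamps, only the 20:50 sequence falls inside the window, and the last OOM entry is 36 seconds before the reboot. The 22:38:18 event comes after the reboot and has no reboot recorded after it in this post, which fits with no blackout being noticed at that time.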
ILoveControl (Author) Posted November 2, 2024

So, I didn't give up; I eventually did some more diagnostics. Refresher: we know the system was rebooting due to Out Of Memory (OOM), see the thread above.

Screensaver service: {ol4.screensaver} com.control4.screensaver. "This is the highest memory consumer and might be contributing to memory pressure if it’s constantly active." Let's look into this:

9533 10019 1423m S {ol4.screensaver} com.control4.screensaver Xzygote /system/bin --zygote --start-system-server

Look at the memory the screensaver process is using: 1423M. Now let's see how much memory this EA5 has:

                      total     used     free  shared  buffers  cached
Mem:                1597772  1524828    72944       0    19664  117988
-/+ buffers/cache:           1387176   210596
Swap:               2097148   550676  1546472

Which means the com.control4.screensaver service (not that I use it on any of the navigators...) is using:

Percentage = (1423 / 1597) × 100 ≈ 89.1%

89.1% of the total system memory!!! Just for a screensaver.

So, either there is a memory leak in the screensaver service, OR with all the T3/T4s and five TVs that have access to the OSD, it's just too much for the poor little EA5. Now the question: do I remove the Screensaver Agent for now, or kill the com.control4.screensaver service?

Again, this research is in the name of science, and maybe to help someone in the future, OR, if any Control4 peeps are reading, to help debug this service.
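The percentage can be recomputed directly from the free output with consistent units. This is a minimal Python sketch under two assumptions that are not confirmed anywhere in this thread: the free figures are in KiB, and the "m" suffix on 1423m means MiB. One caveat: in BusyBox ps, the column before the state letter is the virtual size (VSZ). If that is the case here, 1423m is address space reserved by the process rather than RAM actually in use, and it overstates the screensaver's real footprint. The names parse_free and pct are illustrative.

```python
FREE_OUTPUT = """\
             total       used       free     shared    buffers     cached
Mem:       1597772    1524828      72944          0      19664     117988
-/+ buffers/cache:    1387176     210596
Swap:      2097148     550676    1546472
"""

def parse_free(text: str) -> dict:
    """Pull the totals needed for percentages out of classic `free` output."""
    stats = {}
    for line in text.splitlines():
        fields = line.split()
        if not fields:
            continue
        if fields[0] == "Mem:":
            stats["mem_total"], stats["mem_used"] = int(fields[1]), int(fields[2])
        elif fields[0] == "-/+":
            # Layout: "-/+", "buffers/cache:", used, free
            stats["used_minus_cache"] = int(fields[2])
        elif fields[0] == "Swap:":
            stats["swap_total"], stats["swap_used"] = int(fields[1]), int(fields[2])
    return stats

def pct(part: float, whole: float) -> float:
    """Percentage of `whole` represented by `part`, to one decimal place."""
    return round(100 * part / whole, 1)

stats = parse_free(FREE_OUTPUT)

# ASSUMPTION: "1423m" is MiB and `free` reports KiB.
screensaver_kib = 1423 * 1024

screensaver_share = pct(screensaver_kib, stats["mem_total"])
real_usage = pct(stats["used_minus_cache"], stats["mem_total"])
swap_usage = pct(stats["swap_used"], stats["swap_total"])

print("screensaver size vs. total RAM:", screensaver_share, "%")
print("RAM used excluding buffers/cache:", real_usage, "%")
print("swap in use:", swap_usage, "%")
```

With consistent units the screensaver figure comes out at about 91.2% of total RAM rather than 89.1%. Independently of which process is responsible, the free output itself shows about 86.8% of RAM in use after excluding buffers and cache, plus about 26.3% of swap, so the box is clearly under memory pressure.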
South Africa C4 user Posted November 2, 2024

Interesting… I regularly have issues with the screensaver. My current issue being that the photos don’t appear in the Media Agent / screensaver page. Essentially this means I can add photos (and this works) but I can’t delete photos (other than doing so directly on the USB). I wonder if your memory leak is my problem also…
ILoveControl (Author) Posted November 2, 2024

2 hours ago, South Africa C4 user said: Interesting… I regularly have issues with the screensaver. My current issue being that the photos don’t appear in the Media Agent / screensaver page. Essentially this means I can add photos (and this works) but I can’t delete photos (other than doing so directly on the USB). I wonder if your memory leak is my problem also…

Well, something could be related, but you have the beefy Core 10's, so I don't think it's experiencing the rebooting my EA5 is... I'm holding on to see what new HW is coming, as I am sure with OS4 they will bring some newer HW out. I won't upgrade to 4, since I am in no mood to spend a bazillion pesos on refreshing my lighting kit, but I do want to eventually upgrade the main Director/Controller.

Sadly, a screensaver service causing so much havoc is just not right... It's either a memory leak or it's not built for the EA5.
South Africa C4 user Posted November 2, 2024

2 hours ago, ILoveControl said: Well, something could be related, but you have the beefy Core 10's, so I don't think it's experiencing the rebooting my EA5 is... I'm holding on to see what new HW is coming, as I am sure with OS4 they will bring some newer HW out. I won't upgrade to 4, since I am in no mood to spend a bazillion pesos on refreshing my lighting kit, but I do want to eventually upgrade the main Director/Controller. Sadly, a screensaver service causing so much havoc is just not right... It's either a memory leak or it's not built for the EA5.

You are right… I don’t have the rebooting issue but I have had memory issues over the last couple of years. I’ve never thought to blame the Screensaver though.
South Africa C4 user Posted November 2, 2024

2 hours ago, ILoveControl said: Well, something could be related, but you have the beefy Core 10's, so I don't think it's experiencing the rebooting my EA5 is... I'm holding on to see what new HW is coming, as I am sure with OS4 they will bring some newer HW out. I won't upgrade to 4, since I am in no mood to spend a bazillion pesos on refreshing my lighting kit, but I do want to eventually upgrade the main Director/Controller. Sadly, a screensaver service causing so much havoc is just not right... It's either a memory leak or it's not built for the EA5.

I hope you are right on new hardware! The CA10 is a beast but it is also 5 years old so would be nice to have something bigger and better. Same goes for the T4s…
ILoveControl (Author) Posted November 3, 2024

17 hours ago, South Africa C4 user said: I hope you are right on new hardware! The CA10 is a beast but it is also 5 years old so would be nice to have something bigger and better. Same goes for the T4s…

Before we get the post army on here: I have no view, knowledge, or anything else that says new HW will come with 4. I am assuming, based on the system resources needed (as we can see) to do simple things like screensavers, and given that the new interface probably requires some horsepower to run, that there is newer, beefier HW on its way.

What I would have expected, since we are still on 3.x, is that some testing would have been done beforehand. As I have said many times above, this system was rock solid for years; this is a very recent issue (since the last 3.x update).

Over the last two updates (before the current 3.x update), the system would just hang and I would have to physically reboot it. There were some comments that the previous 3.x version had issues with the networking stack (drivers using a lot of networking) that would cause the hang. I upgraded to the current 3.x and that issue went away, but it introduced the system reboots. I don't know if their fix for the hang was an automatic reboot.
South Africa C4 user Posted November 3, 2024

In theory we will be getting one more OS3 release so hopefully it will bring decent stability to everything… I will probably go to X4 - assuming it can handle my system - but don’t feel all that excited about the need to upgrade loads of controllers for the new interface…
ILoveControl (Author) Posted November 3, 2024

3 hours ago, South Africa C4 user said: In theory we will be getting one more OS3 release so hopefully it will bring decent stability to everything… I will probably go to X4 - assuming it can handle my system - but don’t feel all that excited about the need to upgrade loads of controllers for the new interface…

At this pace I'm thinking of going two versions back, to the last version where it was rock solid. I have had three reboots in one bloody day, so it's super frustrating.
Esteban Delgado Posted January 10

Good afternoon. Has anyone made any progress on the problem raised here, and how did you manage to solve it? I currently have the same blocking problems with the driver. I have already updated it to the latest version, 3.4.3, but the problems still persist. Can anyone help me?