C4 Forums | Control4

Posted

Hi all -

First, to get the versions out of the way: everything is on the latest version. The Director is an EA5, with two other EA5s. I run a C4 LEAF 6x6 matrix, which has been rock solid for years.

So what's the issue:

Randomly (I haven't been able to figure out a pattern), while I'm watching TV through the matrix (regardless of the input), the screen goes black while sound still comes through (via a Triad amp). All navigators go unresponsive for about 5 minutes. Then all of a sudden the navigators come back online (phone app, T4s, Neo remote, SR260), but the video signal is still lost (audio still playing through the speakers). On the navigators, what I was watching has also gone from "watching" to off, meaning I have to select the input again (i.e. Netflix, YouTube, Cable, Apple TV etc.). Then things carry on.

I don't want to influence what the issue could be, but it almost seems like a Director reboot (everything disconnecting). I haven't checked on the Director yet to see if it did reboot. I have rebooted the matrix; the same issue still happens, and on all the inputs.

Any ideas on what I can check? I have checked the network side of things and everything seems to be in normal working order: there weren't any network outages on my system at the time, and nothing in any of the networking stack logs indicates a network drop.


Posted
Are you using a Wattbox for power?  The older Wattboxes have a problem where half the outlets randomly switch off temporarily and then come back online. Eventually, the whole Wattbox dies, but that initial "intermittent" stage can last for a good while.  

Posted

How do you get the image to come back?

Rebooting the matrix, rebooting the source, powering the TV off, or does it come back when you reselect the room source?
Is the matrix controlled via RS-232 or IP?

Posted
Great questions:

How do I get it back = I wait roughly 2-5 minutes for all the navigators to come back online (phone app, Neo, SR260, T4); they all go offline at the same time as the "blackout".

I don't physically reboot anything.

To get the picture back: the sources (under Watch) are all marked as off, as if the room is off (but audio is still coming through the speakers). I then select the source again and it comes back as normal.

This is happening once, at most twice, a day at random times; I can't find a pattern to attribute it to.

The last three times I was watching Apple TV, but it has also happened with cable TV and with Netflix running off an Android Nvidia, so it's across different sources.

What I did notice on Apple TV is that I completely lose C4 remote control via IP of the Apple TV, and it never recovers until I reboot the Apple TV. Having said that, I'm on tvOS 18, so that could be attributed to the current known tvOS 18 bug, who knows.

 

Matrix control = IP. 

It's just weird this setup had been so stable. 

Posted
14 hours ago, South Africa C4 user said:

This does sound like a Director reboot to me.  Have you checked uptime on your Director?  Or programmed a push notification on startup of Director?

Man, I miss terminal access to the Directors; I could have done this more easily with normal Linux commands...

So, I downloaded a snapshot of the EA5 and looked under system_info > uptime.txt:

#-------------------[ uptime ] -----------------------#

 12:08:23 up 15:17,  load average: 7.15, 7.10, 7.02

-------------------------------------------------------

I am assuming:

  1. Up since - 12:08:23
  2. Up time = 15h:17m 

Is this correct?

Looking at the load average, it doesn't seem like this Director is overworking itself, so if it's a reboot, it's caused by something else...
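A note for anyone reading that uptime line later: in standard `uptime` output the first field is the current time of day when the command ran, and the value after `up` is how long the system has been running. So 12:08:23 is the snapshot time, 15:17 is the uptime, and the boot time is the difference. The load average counts runnable (and, on Linux, uninterruptible) tasks, so a value around 7 is only "light" if the controller has at least that many cores. Below is a minimal Python sketch of that arithmetic; `parse_uptime` is a made-up helper name, and it only handles the `H:MM` and `N days, H:MM` forms of the line.

```python
import re
from datetime import date, datetime, timedelta

def parse_uptime(line):
    """Return (clock_time, uptime, load_averages) from an uptime line."""
    m = re.search(
        r"(\d{1,2}:\d{2}:\d{2})\s+up\s+(?:(\d+)\s+days?,\s*)?(\d{1,2}):(\d{2})",
        line,
    )
    if m is None:
        raise ValueError(f"unrecognised uptime line: {line!r}")
    clock = datetime.strptime(m.group(1), "%H:%M:%S").time()
    up = timedelta(
        days=int(m.group(2) or 0),
        hours=int(m.group(3)),
        minutes=int(m.group(4)),
    )
    loads = [float(x) for x in line.split("load average:")[1].split(",")]
    return clock, up, loads

line = " 12:08:23 up 15:17,  load average: 7.15, 7.10, 7.02"
clock, up, loads = parse_uptime(line)

# Boot time = time of the snapshot minus the uptime.
# The calendar date is a placeholder; only the clock time matters here.
boot = datetime.combine(date(2000, 1, 2), clock) - up
print("snapshot taken at:", clock)
print("uptime:", up)
print("booted at about:", boot.time())
print("load averages (1/5/15 min):", loads)
```

With the line from the snapshot this puts the last boot at roughly 20:51 the evening before the snapshot was taken.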

Posted

LONG AND VERBOSE - but could help someone in the future:

Gosh, gotta love ChatGPT :) I uploaded the system snapshot, used some prompts, and now some interesting info is coming out.

Two things specifically:

  1. Thermals - easy to fix - server room has cooling
  2. Out of Memory issues - complex.

Summary - 

It seems like the GStreamer service is causing very high memory utilisation, which makes sense since most of the C4 system is there for media switching. HOWEVER, this system has been the same for years and Out of Memory (OOM) issues haven't happened before. In addition, this was one TV with one source on the matrix, so it wasn't being overused, and I can't see how one source could cause an OOM issue. A bug?

Confirming the reboot + Out of Memory:

The logs indicate that the system rebooted on October 15, 2024, at 20:51:15. This is evident from the log entries showing the boot process, including processor initialization and memory information.

Summary:
The system was rebooted at 20:51:15.
This aligns with the earlier discovery of out-of-memory (OOM) events, which likely caused the reboot.

Thermals:

Having said that, if anyone knows where in the snapshot to look for the log that would state why it rebooted, let me know.

The c4faultd.log reveals several important details:

Temperature sensor issues:

On October 15, 2024, there was a failure related to the CPU temperature sensor:
ERROR: Failed Opening(/dev/thermal). Error(Device or resource busy)
ERROR: Error initializing temperature sensor for CPU1
This suggests that the system had issues accessing the thermal sensors, which might indicate overheating or sensor malfunctions.

Frequent system logging restarts:

There are multiple instances of c4faultd (the fault management service) being restarted, including:
INFO: Initializing logging from /etc/logging/c4faultd.conf
DEBUG: MQTT - topic for this device: s1/c4/c4faultd/v1/control4_ea5_xxxxxxxxxxxx/status
This might indicate that the fault service was trying to reinitialize due to persistent issues.

Possible causes:
CPU or hardware overheating: The errors related to the temperature sensor failing to read the CPU temperature may indicate that the system was rebooting due to overheating or a hardware fault.
Device or resource busy errors: This could mean the system was struggling to manage resources, possibly leading to faults or crashes.
These issues could collectively cause the system to reboot. If you'd like, I can continue checking other logs for further confirmation, or we can explore potential fixes for overheating and sensor errors. Let me know!

I looked at the thermal file. Temps are high (I know the server room runs hot), averaging about 63 degrees Celsius, but for most equipment that isn't really high, is it?

Example of the thermal log - thermal.txt

2024-10-10 17:17:54 +0200 C4EA5-XXXXXXXXXXX [3788] INFO: controller=ea5,heatsink=47,cpu0=61,cpu1=64
2024-10-10 17:18:54 +0200 C4EA5-XXXXXXXXXXX [3788] INFO: controller=ea5,heatsink=47,cpu0=61,cpu1=64
2024-10-10 17:19:54 +0200 C4EA5-XXXXXXXXXXX [3788] INFO: controller=ea5,heatsink=47,cpu0=61,cpu1=64
2024-10-10 17:20:54 +0200 C4EA5-XXXXXXXXXXX [3788] INFO: controller=ea5,heatsink=47,cpu0=61,cpu1=64
2024-10-10 17:21:54 +0200 C4EA5-XXXXXXXXXXX [3788] INFO: controller=ea5,heatsink=47,cpu0=61,cpu1=64
2024-10-10 17:22:55 +0200 C4EA5-XXXXXXXXXXX [3788] INFO: controller=ea5,heatsink=47,cpu0=61,cpu1=62
2024-10-10 17:23:55 +0200 C4EA5-XXXXXXXXXXX [3788] INFO: controller=ea5,heatsink=47,cpu0=61,cpu1=64
2024-10-10 17:24:55 +0200 C4EA5-XXXXXXXXXXX [3788] INFO: controller=ea5,heatsink=47,cpu0=61,cpu1=64
2024-10-10 17:25:55 +0200 C4EA5-XXXXXXXXXXX [3788] INFO: controller=ea5,heatsink=47,cpu0=62,cpu1=64
2024-10-10 17:26:55 +0200 C4EA5-XXXXXXXXXXX [3788] INFO: controller=ea5,heatsink=47,cpu0=61,cpu1=64
2024-10-10 17:27:55 +0200 C4EA5-XXXXXXXXXXX [3788] INFO: controller=ea5,heatsink=47,cpu0=61,cpu1=64
2024-10-10 17:28:55 +0200 C4EA5-XXXXXXXXXXX [3788] INFO: controller=ea5,heatsink=47,cpu0=60,cpu1=62
2024-10-10 17:29:55 +0200 C4EA5-XXXXXXXXXXX [3788] INFO: controller=ea5,heatsink=47,cpu0=61,cpu1=64
2024-10-10 17:30:55 +0200 C4EA5-XXXXXXXXXXX [3788] INFO: controller=ea5,heatsink=47,cpu0=61,cpu1=64
2024-10-10 17:31:55 +0200 C4EA5-XXXXXXXXXXX [3788] INFO: controller=ea5,heatsink=47,cpu0=61,cpu1=64
2024-10-10 17:32:55 +0200 C4EA5-XXXXXXXXXXX [3788] INFO: controller=ea5,heatsink=47,cpu0=61,cpu1=62
2024-10-10 17:33:55 +0200 C4EA5-XXXXXXXXXXX [3788] INFO: controller=ea5,heatsink=47,cpu0=61,cpu1=62
2024-10-10 17:34:55 +0200 C4EA5-XXXXXXXXXXX [3788] INFO: controller=ea5,heatsink=47,cpu0=61,cpu1=64
2024-10-10 17:35:55 +0200 C4EA5-XXXXXXXXXXX [3788] INFO: controller=ea5,heatsink=47,cpu0=61,cpu1=64
2024-10-10 17:36:55 +0200 C4EA5-XXXXXXXXXXX [3788] INFO: controller=ea5,heatsink=47,cpu0=61,cpu1=62
2024-10-10 17:37:55 +0200 C4EA5-XXXXXXXXXXX [3788] INFO: controller=ea5,heatsink=47,cpu0=61,cpu1=62
2024-10-10 17:38:55 +0200 C4EA5-XXXXXXXXXXX [3788] INFO: controller=ea5,heatsink=47,cpu0=62,cpu1=62
2024-10-10 17:39:56 +0200 C4EA5-XXXXXXXXXXX [3788] INFO: controller=ea5,heatsink=47,cpu0=61,cpu1=61
2024-10-10 17:40:56 +0200 C4EA5-XXXXXXXXXXX [3788] INFO: controller=ea5,heatsink=47,cpu0=61,cpu1=62
2024-10-10 17:41:56 +0200 C4EA5-XXXXXXXXXXX [3788] INFO: controller=ea5,heatsink=47,cpu0=61,cpu1=62
2024-10-10 17:42:56 +0200 C4EA5-XXXXXXXXXXX [3788] INFO: controller=ea5,heatsink=47,cpu0=60,cpu1=62
2024-10-10 17:43:56 +0200 C4EA5-XXXXXXXXXXX [3788] INFO: controller=ea5,heatsink=47,cpu0=61,cpu1=62
2024-10-10 17:44:56 +0200 C4EA5-XXXXXXXXXXX [3788] INFO: controller=ea5,heatsink=47,cpu0=60,cpu1=62
2024-10-10 17:45:56 +0200 C4EA5-XXXXXXXXXXX [3788] INFO: controller=ea5,heatsink=47,cpu0=61,cpu1=64
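Rather than eyeballing the averages, a log in this format is easy to summarise. The following is only a sketch, assuming every reading looks like the `heatsink=47,cpu0=61,cpu1=64` lines above; `summarise` is a hypothetical helper and the sample is three lines copied from the excerpt, not the whole file.

```python
import re
from statistics import mean

# Three lines copied from the thermal.txt excerpt above.
SAMPLE = """\
2024-10-10 17:17:54 +0200 C4EA5-XXXXXXXXXXX [3788] INFO: controller=ea5,heatsink=47,cpu0=61,cpu1=64
2024-10-10 17:22:55 +0200 C4EA5-XXXXXXXXXXX [3788] INFO: controller=ea5,heatsink=47,cpu0=61,cpu1=62
2024-10-10 17:25:55 +0200 C4EA5-XXXXXXXXXXX [3788] INFO: controller=ea5,heatsink=47,cpu0=62,cpu1=64
"""

def summarise(text):
    """Collect every sensor=value pair and return (min, mean, max) per sensor."""
    readings = {}
    for line in text.splitlines():
        for name, value in re.findall(r"(heatsink|cpu\d+)=(\d+)", line):
            readings.setdefault(name, []).append(int(value))
    return {name: (min(v), mean(v), max(v)) for name, v in readings.items()}

stats = summarise(SAMPLE)
for name, (lowest, average, highest) in stats.items():
    print(f"{name}: min={lowest} avg={average:.1f} max={highest}")
```

To run it over the real file, replace `SAMPLE` with the contents of thermal.txt from the snapshot. A sudden jump in `max`, or gaps in the timestamps, would be more telling than the average.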

 

Carrying on with some ChatGPT help - 

The bootloader_replace.log shows that the bootloader replacement process was successful, with entries such as:

"Copy Bootloader To Flash Success!"
"Bootloader reflash success!!!"
There are no indications of errors or failures in this log, so it seems that the bootloader process itself was not a source of the problem.

Since the bootloader log appears clean, I’ll look into any remaining logs for final confirmation or additional insights, particularly checking debug logs and other system-level files like dman.log

Summary for thermals:

Summary of Findings:
Temperature Sensor Issues: There were errors indicating that the system failed to read from the CPU thermal sensor, which might have led to system instability or reboots due to overheating.

Frequent Restarts of the c4faultd Service: The fault management service was restarted multiple times on October 15, 2024, which aligns with system instability.

No Bootloader or Major Device Errors: Both the bootloader and dman logs don't indicate recent or significant failures, and they seem unrelated to the recent reboot.

Conclusion:
The evidence points toward hardware-related issues, possibly related to the thermal sensor failure and potential overheating. This could have caused the system to reboot as a protective measure. The frequent restarts of fault management services (c4faultd) are likely a response to ongoing faults in the system.

Now, what was interesting was some out of memory issues... and around the time I recall for the blackout.

Out of Memory:

The kernel log indicates that there were memory-related events and potential out-of-memory (OOM) issues around October 15, 2024. Specifically, there is a log entry that shows the system handling memory-related problems:

Out of memory warnings:

[<c10bc671>] ? out_of_memory+0x281/0x2c0
Per-CPU memory stats: The log also shows CPU-specific memory usage (DMA and Normal per-CPU stats), which are often associated with memory management under high load.

These logs suggest that the system might have been under heavy load or experiencing memory pressure, leading to the system running out of available memory. This could have triggered memory management functions like the out-of-memory (OOM) killer, which could force processes to terminate or lead to system instability.
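For anyone who wants to check this without ChatGPT, the same search can be done with a few lines of Python. This is a sketch under assumptions: the only log line taken from the thread is the `out_of_memory+0x281/0x2c0` stack entry, while the timestamp prefix and the other two sample lines are invented for illustration. The `invoked oom-killer` and `Out of memory` strings are the usual Linux kernel OOM messages, but I can't confirm the exact wording in the EA5 kernel log.

```python
import re

# Markers that normally accompany a Linux OOM event.
OOM_PATTERN = re.compile(r"out_of_memory|invoked oom-killer|Out of memory")

# Illustrative sample only: the line prefixes are hypothetical.
SAMPLE = """\
Oct 15 20:50:02 kernel: GstSystemClock invoked oom-killer: gfp_mask=0x201da, order=0
Oct 15 20:50:02 kernel:  [<c10bc671>] ? out_of_memory+0x281/0x2c0
Oct 15 20:50:03 kernel: eth0: link up
"""

def oom_lines(text):
    """Return every log line that mentions an OOM marker."""
    return [line for line in text.splitlines() if OOM_PATTERN.search(line)]

hits = oom_lines(SAMPLE)
for line in hits:
    print(line)
```

Pointing `oom_lines` at the kernel log from the snapshot should list every OOM-related line with whatever timestamps the log actually carries.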

I then asked if it could give me the specific times when Out of Memory (OOM) events were being experienced, to see if they matched when I could recall the random blackout. For me it was around 20:00; look at the third event mentioned below.

The out-of-memory (OOM) events occurred at multiple times throughout October 15, 2024. Here are the key times:

First OOM event sequence: Between 01:33:43 and 01:34:15 (over a span of about 30 seconds). Several entries indicate that the system experienced memory exhaustion and related warnings during this time.

Second OOM event: At 18:37:19.

Third OOM event sequence: Between 20:50:02 and 20:50:39, with multiple entries showing memory-related issues.

Fourth OOM event: At 22:38:18.

These repeated instances of memory exhaustion indicate that the system faced high memory pressure multiple times on that day, suggesting that memory management might have been a critical issue contributing to the system's instability.

But it seems to only black out at certain OOM events, not all of them, as I cannot recall any issue at 22:38. I then asked our friend ChatGPT to cross-reference the OOM events with reboots and...

Let me cross-reference these logs for any evidence of a reboot around 20:00 and align it with the out-of-memory entries.

The sysmand.log shows evidence of a system restart around 20:50 on October 15, 2024, as indicated by the series of messages stating that multiple services were being restarted:

INFO: Started c4faultd
INFO: Started various services like appled, audio3client, avahi-daemon, broker, etc.
This indicates that the system likely rebooted at 20:50, and several services were initialized immediately afterward.

Cross-analysis with the Out-of-Memory (OOM) Events:
We see OOM events starting around 20:50:02, which align closely with the time the system restarted.
The OOM events likely led to the system becoming unstable, eventually triggering the reboot shortly afterward.
Conclusion:
The system experienced multiple OOM events, particularly at 20:50, which caused memory exhaustion and contributed to a system reboot at approximately 20:50. This ties the out-of-memory conditions directly to the reboot you observed.
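The cross-reference itself is simple enough to reproduce by hand. Here is a small Python sketch using only the times quoted in this thread (the start of each OOM event and the 20:51:15 reboot); the 5-minute window is my own arbitrary assumption about how close an OOM event has to be to count as a suspect.

```python
from datetime import datetime, timedelta

DAY = "2024-10-15 "
FMT = "%Y-%m-%d %H:%M:%S"

# Start times of the four OOM events listed above.
oom_events = [
    datetime.strptime(DAY + t, FMT)
    for t in ("01:33:43", "18:37:19", "20:50:02", "22:38:18")
]
# Reboot time reported from the boot messages in the logs.
reboot = datetime.strptime(DAY + "20:51:15", FMT)

def preceding(events, moment, window=timedelta(minutes=5)):
    """Events that happened at most `window` before `moment`."""
    return [e for e in events if timedelta(0) <= moment - e <= window]

suspects = preceding(oom_events, reboot)
for e in suspects:
    gap = (reboot - e).total_seconds()
    print(f"OOM at {e:%H:%M:%S}, {gap:.0f}s before the reboot")
```

Only the 20:50:02 event falls inside the window, 73 seconds before the reboot. The other three OOM events that day have no reboot next to them, which matches not every OOM event producing a blackout.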

So we're getting somewhere...

Looks like the system is running out of memory (now to figure out WHY, and whose badly written driver is causing this :) ). This is a joke, before someone has a sense of humor failure.

I'll start by analyzing the kernel log and related logs to see if any processes or services were reported with high memory consumption or being killed due to memory pressure around 20:50.

The kernel log entries from 20:50:02 show that a process named GstSystemClock (likely related to GStreamer, a multimedia framework) was active during the out-of-memory (OOM) event:

Process involved: GstSystemClock with PID: 21030.
Memory pressure warnings: The log indicates that the system was running out of memory (OOM) at this time.
Key Insights:
The GstSystemClock process is associated with GStreamer, which is commonly used in media handling applications for tasks like audio and video processing. It's possible that a multimedia service or application was consuming excessive memory, leading to the OOM event.

Now I want to check which specific driver/service is calling on GStreamer that is causing it to spike:

I will start by looking through other logs to identify which services or applications may have triggered GstSystemClock and analyze if there are any dependencies or interactions leading to memory overuse. Let’s inspect related logs like driver_debug.log and other debug logs for clues.

It seems there are no relevant entries in the driver_debug.log related to GstSystemClock, media services, or memory issues. This suggests that the driver log may not directly track interactions with the multimedia service causing the memory issue.

Next, I'll inspect additional logs like sysmand.log and any other potential debug logs that could capture interactions or specific applications triggering the GstSystemClock process, focusing on media or multimedia-related services. 

The sysmand.log contains several references to audio3client and audio3server, which appear to be audio-related services. These services were:

Restarted at 01:34 and again enabled/disabled around 01:38.
The entries suggest that audio3client and audio3server were actively running and managed during the timeframe of the system instability.
While these services were restarted earlier in the day, there could still be a connection between these audio services and the GstSystemClock process during media handling or streaming, leading to high memory usage.

This is where it starts becoming a dead end...

I would have to do some real-time debugging on these services to see when they spike etc., so I'm going to try to figure out if I can enable this through Composer:

The audio3client and audio3server logs show regular management activity (starting, stopping, and enabling) during October 15, 2024. These services were:

Restarted and brought back online around 01:34.
Enabled and disabled around 01:38.
However, there are no specific memory spikes or warnings directly associated with these services based on the logs reviewed.

The results from the broader search show additional instances of the out-of-memory (OOM) killer being invoked, specifically related to the pan manager run process:

The OOM killer was triggered at 01:33:43 on October 15, 2024, involving the pan manager run process (PID: 4733).
This is similar to the earlier memory pressure issue seen with GstSystemClock.
Top Services Involved in High Memory Usage:
GstSystemClock: As identified earlier, this multimedia process was involved in memory issues during media handling.
pan manager run: This process, which could be related to a media, display, or management task, was involved in the OOM event, as indicated by the logs.
Audio services (audio3client and audio3server): Although not directly flagged for high memory, these services were restarted and active around the time of the system instability.
Conclusion:
GstSystemClock and pan manager run are the primary services consuming high memory and triggering the OOM killer.
These services, along with media-related operations, seem to be the main contributors to the system running out of memory.

 

  • 3 weeks later...
Posted

So, I didn't give up ;) I eventually started doing some more diagnostics.
Refresher: we know the system was rebooting due to Out Of Memory (OOM), see the thread above.
 

Screensaver service:

{ol4.screensaver} com.control4.screensaver – This is the highest memory consumer and might be contributing to memory pressure if it’s constantly active.

Let's look into this:

9533 10019    1423m S    {ol4.screensaver} com.control4.screensaver Xzygote /system/bin --zygote --start-system-server

Look at the memory the screensaver process is using: 1423M. Now let's see how much memory this EA5 has:

             total       used       free     shared    buffers     cached
Mem:       1597772    1524828      72944          0      19664     117988
-/+ buffers/cache:    1387176     210596
Swap:      2097148     550676    1546472

Which means the com.control4.screensaver service (not that I use it on any of the navigators...) is using:

 

Percentage = (1423 / 1597) × 100 ≈ 89.1%

89.1% of the total system memory!!! Just for a screensaver.

So, either there is a memory leak in the screensaver service OR, with all the T3/T4s and five TVs that have access to the OSD, it's just too much for the poor little EA5.
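A quick sanity check of that percentage, as a hedged Python sketch. It parses the `free` output above (values in KiB) and assumes the `1423m` in the process listing means MiB. `parse_free` is a made-up helper. One caveat I can't resolve from the snapshot alone: if that column is virtual size (BusyBox `top` labels its memory column VSZ) rather than resident memory, the process's real RAM footprint could be lower than this suggests.

```python
FREE_OUTPUT = """\
             total       used       free     shared    buffers     cached
Mem:       1597772    1524828      72944          0      19664     117988
-/+ buffers/cache:    1387176     210596
Swap:      2097148     550676    1546472
"""

def parse_free(text):
    """Map each row label of old-style `free` output to its integer columns."""
    rows = {}
    for line in text.splitlines():
        label, sep, rest = line.partition(":")
        if sep and rest.split():
            rows[label.strip()] = [int(x) for x in rest.split()]
    return rows

rows = parse_free(FREE_OUTPUT)
total_kib = rows["Mem"][0]
used_excl_cache_kib = rows["-/+ buffers/cache"][0]

screensaver_kib = 1423 * 1024  # "1423m" from the listing, assumed to be MiB
share = 100 * screensaver_kib / total_kib
used_share = 100 * used_excl_cache_kib / total_kib

print(f"screensaver figure vs total RAM: {share:.1f}%")
print(f"RAM used by everything, excluding buffers/cache: {used_share:.1f}%")
```

With consistent units the screensaver figure works out to about 91% of total RAM, a little higher than the rough 89.1% above. It is also larger than the memory `free` reports as used by all processes combined, which hints that part of it is virtual or swapped-out memory rather than RAM in active use.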

 

- Now the question: do I remove the Screensaver Agent for now, or kill the com.control4.screensaver service...

Again, this research is in the name of science :) and maybe it will help someone in the future, OR, if any Control4 peeps are reading, help them debug this service.

Posted

Interesting… I regularly have issues with the screensaver.  My current issue being that the photos don’t appear in the Media Agent / screensaver page.  Essentially this means I can add photos (and this works) but I can’t delete photos (other than doing so directly on the USB).  I wonder if your memory leak is my problem also…

Posted
Well, something could be related, but you have the beefy Core 10s, so I don't think you're experiencing the rebooting that my EA5 is...

I'm holding on to see what new HW is coming, as I am sure with OS4 they will bring some newer HW out. I won't upgrade to 4 since I am in no mood to spend a bazillion pesos on refreshing my lighting kit, but I do want to eventually upgrade the main Director/Controller. Sadly, a screensaver service causing so much havoc is just not right... it's either a memory leak or it's not built for the EA5.

Posted
You are right… I don’t have the rebooting issue but I have had memory issues over the last couple of years.  I’ve never thought to blame the Screensaver though.

Posted
I hope you are right on new hardware! The CA10 is a beast but it is also 5 years old so would be nice to have something bigger and better.  Same goes for the T4s…

Posted
Before we get the post army on here: I have no view, knowledge or anything else suggesting that new HW will come with 4. I am assuming, based on the system requirements (as we can see) for simple things like screensavers, and given that the new interface probably requires some horsepower to run, that there is new, beefier HW on its way.

What I would have expected is that, since we are still on 3.x, some testing would have been done beforehand. As I have said many times above, this system was rock solid for years; this is a very recent issue (since the last 3.x update). In the two updates before the current 3.x update, the system would just hang and I would have to physically reboot it. There were some comments that the previous 3.x version had issues with the networking stack (drivers making heavy use of networking) that would cause the hang. I upgraded to the current 3.x and that issue went away, but it introduced the system reboots. I don't know if their fix for the hang was an automatic reboot :)

Posted

In theory we will be getting one more OS3 release so hopefully it will bring decent stability to everything…

I will probably go to X4 - assuming it can handle my system - but don’t feel all that excited about the need to upgrade loads of controllers for the new interface…

Posted
At this pace I'm thinking of going back two versions, to the last version where it was rock solid.

 

I have had three reboots in one bloody day, so it's super frustrating.

  • 2 months later...
Posted

Good afternoon. Has anyone made any progress on the problem raised here, and how did you manage to solve it? I currently have the same blocking problems with the driver. I have already updated to the latest version, 3.4.3, but the problems still persist. Can anyone help me?
