Cold Fusion Memory Leak
Cold
Fusion Memory Leak
The Problem
For the past year or so one
of our cold fusion server instances has been consuming more and more memory. I
thought it was because the application load had increased, but the main problem
seems to be a memory leak in cold fusion.
I increased the max heap
size and tweaked the jvm config until I was blue in the face, but this only had
the effect of delaying the inevitable. To compound the issue, one of the
servers in our cluster died, forcing the entire load through one server. It was
a kind of blessing in disguise, in that it forced the memory leak to show its
ugly little face more rapidly. The server would not make it through a 24 hour
period before it would eat up all its allocated memory and eventually crashed
horribly.
The Analysis of the
Problem
I started by enabling
metrics logging in the servers jrun.xml file. By doing that I could see the
state of the server at hang time and noticed that the memory was steadily
increasing up until it hung. It seemed like garbage collector was on holidays.
Ok .. so now I know that
it's memory usage problem, my initial reaction was to think that there must be
something wrong with my app. I start thinking about all the places where I'm
caching data and looking for areas where I can improve the memory management of
the app. I thought that soft references may be the answer to my memory woes
and read through these interesting articles with the view to using them for
caching.
http://www-128.ibm.com/developerworks/java/library/j-jtp11225/index.html
http://www-128.ibm.com/developerworks/java/library/j-jtp01246.html
The next question in my mind
is, What is using all the memory? I started looking for java memory profiling
tools. Hprof looked promising, but is really only useful for profiling J2SE applications,
not J2EE apps. HAT was also another option, but would only be useful for
analysing static heap dumps. What I needed was a real time memory profiler and
Borland's Optimizit has done the job nicely. There's a very good article here
on integrating Optimizit with Jrun 4.
http://info.borland.com/techpubs/optimizeit/optimizeit6/integ_guide/JRun4.html
A ten day trial version can
be downloaded here.
http://www.borland.com/downloads/download_optimizeit.html
Mike Schierberl's Blog
Post on ColdFusion Memory Leaks
Building on Mike's 3 posts
on the cold fusion memory leak he discovered, I created some more test cases
and can confidently agree with most of his conclusions. There is definitely a
memory leak in CF.
http://www.schierberl.com/cfblog/index.cfm/2006/10/12/ColdFusion_memoryLeak_profiler
http://www.schierberl.com/cfblog/index.cfm/2006/10/16/memoryLeak_variablesScope
http://www.schierberl.com/cfblog/index.cfm/2006/10/25/memoryLeaks_session
He suggest using BEA Jrockit
JVM in place of the built in Sun jvm in JRun. The only problem with that is
that it raised a few questions about whether or not it was the JVM that was
causing the issue or actually CF's problem. As I mentioned previously, I got
Borlands Optimizit profiler working with the built in jvm in Jrun. I ran the
same tests as in Mike's posts and have replicated his results.
The test case I created actually
shows that the problem is more severe that what Mike is suggesting. Under
the particular conditions of the test case, it appears as though any cfc's that
are instantiated in the variables scope of the calling cfm page are actually
kept in memory for the life of the app. Not only that, but the temporary class
created by CF are retained in memory for the life of the app. The problem seems
to be in the garbage collection of the temporary CFM classes.
Test Case
I tried to simplify Mark's
examples a little to remove any un-necessary complexity. The source code for my
test case is listed here:
testCase.cfm
------------------------------------------------------------------------------
<CFapplication NAME="memory_test"
sessionMANAGEMENT="YES"
sessionTIMEOUT=#CreateTimespan(0,0,0,10)#
SetClientCookies="no" clientmanagement="no">
<cfif not structKeyExists(application,"test") OR isDefined("URL.reset")>
<cfset
application.test = createObject("component","test").init() />
<cflocation url="#cgi.SCRIPT_NAME#" addtoken="false">
</cfif>
<cfset foo =
application.test />
<cfset fooB = createObject("component","testBBB") >
<cfset session[createUUID()] = fooB />
<cfoutput>
<a href="#cgi.SCRIPT_NAME#?reset=true">reset
application.test</a>
</cfoutput>
<!--- <cfset structClear(variables) /> --->
------------------------------------------------------------------------------
The cfc's I'm using are as
simple as can be, but you could use any cfc to demonstrate the issue. i.e.
<cfcomponent name="test">
<cffunction name="init">
<cfreturn this >
</cffunction>
</cfcomponent>
The first time you run this
script you get the following output from the profiler.

Note: I'm filtering the
class list using *cftest* on the classname
You can see that there are
two instances of testCase.cfm, 1 instance of test.cfc and 1 instance of
testBBB.cfc. After the session times out (after 10 seconds), the garbage
collector should take testCase.cfm and testBBB.cfc back to 0 instances. This
doesn't happen, It appears that the number of cfm instances retained is
directly proportional to the number of cfc instances retained.
After running it again you
get the following output. You can see that the cfm instance count has increase
by one and the testBBB.cfc count has increase by one. test.cfc stays the same
since it's being referenced from the existing instance in the application
scope.

After the session times out
again, the garbage collector does collect the new instances.

Clicking the link to reset
application.test produces the following.

Session times out, garbage
collection is done again and it returns to the base state.

Now for the tricky part ,
refresh the page 5 times and then click the link to reset the application.test
instance.

After the session times out
and the garbage is collected the leak is becoming more evident. The five
instances of test.cfm and testBBB.cfc we created before we reset application.test
are retained in memory.

Refresh the page again..

Garbage is collected ..

The leaked objects are
maintained on the heap.
The Solution
Now what we have done as per
Marks recommendation is to clear the variables scope on request end. This is
more of a work around rather than a solution, but it has stopped our servers
from hanging. Before this our server was maxing out at 500mb and crashing and
now the same server is humming along at around 250 mb heap usage. If this isn't
a leak I don't know what is.
One minor hiccup that I ran
into with this work around is that I had to get rid of application.cfc and
replace it with application.cfm .. onRequestEnd.cfm won't work in conjunction
with application.cfc. I did try using structclear("variables") in the
onRequestEnd method of application.cfc, but it was only clearing the local
scope of the cfc and not the variables scope of the request.
Conclusions
It has been a very time
consuming exercise in diagnosing this issue and I would much rather have spent
the time working on my actual development projects. However, the upside is that
I,ve learned a hell of a lot about Java memory management.
I am planning on passing
this report onto Adobe in the hopes that they acknowledge the issue and release
a fix. For now we will continue to operate with the work around.
Resources
Java Performance Tuning
Tools
http://www.javaperformancetuning.com/tools/index.shtml
HeapRoots
http://www.alphaworks.ibm.com/tech/heaproots
HProf
http://java.sun.com/developer/technicalArticles/Programming/HPROF.html
Java Performance Analysis
http://java.sun.com/developer/onlineTraining/Programming/JDCBook/perf3.html
HAT, Java Heap Analysis
Tool
https://hat.dev.java.net/doc/README.html

Well done.
This should fix the problem.
I think we owe Mike and Jason a huge vote of thanks for this - I don't even want to think about the number of hours they must have burned. Adobe, how about coughing up a couple of trips to MAX or something?
BTW, did anyone notice the date on that KB article? This fix was released last month along with 7.0.2 cumulative hot fix 2.
Jaime Metcher
BTW, are you saying you applied the hotfix? Also, it appears from the "Additional info" section of the technote (http://www.adobe.com/cfusion/knowledgebase/index.c... that you need to apply this fix as a manual extra step. Just confirming that you caught and did that.
The solution I found is to store all complex objects in server scope rather in Application scope or sesssion scope.
my suggestion is to store all complex object in server scope and you will see there is no memory leakage problem.
I am happy to help anybody who is struggling like these issues.
Thanks
Sana
are you using any application scope to store CFC or storing query results in Application scope.
If then use server scope rather Application, you don't need to call GC yourself, calling GC does not make any effects. If still problem persist then you can catch me on sanaullah73@gmail.com
Thanks
Sana
I have also spent a lot of time trying to resolve a memory issue with our CF server. It was about 6 months ago, on a CFMX6.1 server (W2K server, now W23K server). I turned on some extensive logging, and found that with my jvm tweaked settings, that the garbage collection was running about every 3-5 seconds or so. The data in the young generation would be promoted to the permanent generation when a request spanned that 5 second garbage collection gap. Meaning that if you had a long running page, the variables you instantiate would end up in the permanent generation.
Now, that's actually Ok, because a full garbage collection would run when the memory usage finally made it up to the max memory setting in your config file, and at that time the jvm would release that memory from the permanent generation for re-use. We noted here that the task manager still showed our max memory setting (~1.3Gb), even though the garbage collection logging showed that memory usage was released down to under 100Mb. I now no longer am concerned by seeing the 1.3Gb in the task manager like I used to be. In fact, we run the whole day just fine at our max, because in actual fact that only represents what's been allocated to the jrun.exe, and not what is actually in use by the jrun.exe. We do restart every night.
Ok, having said all that, our server, which had been running fine with my tweaked config for over 12 months, just started dying on us again. 4 hangs now in the last 3 weeks, and my client is getting worried, as am I. So, I start looking for the problem again, and find this blog post.
I just thought I'd mention that my previous issue looked like a CF memory leak, and smelled like a CF memory leak, but in actual fact was not technically a CF memory leak. My testing showed that the memory would be eaten up (i.e. transferred to the permanent generation) when the page request spanned the garbage collection interval - which of course is extremely easy to cause by hitting F5 5 times fast. Even a simple page would cause the young generation variables to make it into the permanent generation when run multiple times, which is quite a common occurrence with web programming in general (think multiple users - and especially if you use some CFLocks to protect inserts and the like). In this situation though, the full garbage collection should release that memory again when it runs, as long as those variables are not used again. Therefore, perhaps the issues you guys have been having are not in fact a memory leak inside CF, but incorrectly tuned jvm settings? Perhaps you've been too concerned about the instances not being immediately released, when instead you should be concerned about why they do not get released from a full garbage collection? ... I guess you could argue that what I have described IS what a memory leak is... I'm just saying that I don't think it's caused by a programmatical bug in CF more than it simply is just an incorrectly tuned jvm.
Also note that the posted hotfix does not actually apply to the problem as you've described it. It stated it was for a CFFile Action="Upload" problem that they had discovered, and not a generic variable scope instantiation not being release from active memory.
Back to my current problem, I am almost certain it is because we are now storing some configuration data in the application scope, and I didn't realize how much data it was going to be until I just saw how big it was today. A struct with 4,000 odd stucture of arrays... Although I don't think I can tell exactly how many bytes it is using, I'm fairly confident that it is the reason for my hangs now - that struct has (clearly) made it into the permanent generation, and is in constant use (it's there all the day long), so it will never be cleared even by the full garbage collection process. I either have to re-tweak my config, or more likely rethink that idea of using the application scope for this large struct.
I may give that StructClear(Variables) a try, but I actually don't think that will help my problem (mine's in the application scope, and I can't very well clear that every page request, can I?)! :)
I hope I didn't muddy things too much for anyone out there. I have been very happy with our (Macromedia-assisted) jvm settings, and in case you are interested, here they are:
java.args=-server -DJINTEGRA_NATIVE_MODE -DJINTEGRA_PREFETCH_ENUMS -Xms1284m -Xmx1284m -XX:NewSize=64m -XX:MaxNewSize=64m -XX:SurvivorRatio=3 -XX:PermSize=64m -XX:MaxPermSize=128m -XX:+UseParNewGC -XX:+DisableExplicitGC -Dsun.rmi.dgc.client.gcInterval=600000 -Dsun.rmi.dgc.server.gcInterval=600000 -Dsun.io.useCanonCaches=false -Xbootclasspath/a:"{application.home}/../lib/webchartsJava2D.jar" -Djavax.xml.parsers.SAXParserFactory=com.macromedia.crimson.jaxp.SAXParserFactoryImpl -Djavax.xml.parsers.DocumentBuilderFactory=com.macromedia.crimson.jaxp.DocumentBuilderFactoryImpl
And yes, I did mean Macromedia and not Adobe. Sarge helped us.
Feel free to disagree with me. I am just posting the fruits of my similar days-long trudge through the jvm memory settings swamp. I had thought I had left it behind me... :)
Good luck!
Steven Van Gemert
Steven Van Gemert
>Are you sure the full GC is only executed when the heap reaches the max heap size? I don't think this is the case.
I think you are right. I found that there was a gc happening often, and that it would clear the used memory back down to what it was before the page execution. However, when you run multiple pages, and cause objects to make it to the tenured generation (man, I may still be wrong with where it goes - I mean when it leaves the young generation because it is still in use), the overall memory usage rises. When it reaches the limit, at that point there is a more complete garbage collection performed because there is less than x% of the room left in the tenured generation (again now not sure of which generation). I think that x% was configurable, but I thought that at that point a more complete gc takes place, which would then release a huge chunk of memory back for availability to the jvm. This more complete one was only triggered because that threshhold was reached. Thus I called it a "full gc".
At least, that's what I saw. My memory creeped up over 5 or 6 bouts of holding down F5 for 30 seconds at a time, and when it reached the top, having had many (read 100) gc's occur but not clear out the memory, when it reached the top a final gc actually released a huge chunk (over a gig) of memory in one collection, that took quite a long time (5 seconds or so) compared to the others that took a few to a teen milliseconds each.
Again, that's what I remember, and now that I've thrown those log files away (oh, and it's a different server now too), I don't think I can verify it. Take it for what it's worth I guess...
>...it's requirement in CF7 to have -XX:+UseParallelGC enabled.
I recall finding some documentation that the new parallel gc was better than that. In fact, at one time I knew the exact reason why, and determined that I'd better use the new one or I'd crash and burn, hence the "XX:+UseParNewGC" in my jvm config.
>I'm also wondering why you have the -XX:MaxPermSize set so low. If as you say, "if you had a long running page, the variables you instantiate would end up in the permanent generation.", I think you're actually refering to the tenured generation.
Yeah I was referring to the tenured generation. My current problem might be that the tenured is making it to the permanent, and I don't have enough room for that (thanks for correcting me by the way).
My permanent so low? I thought it was quite large actually... I recall testing larger perm generation sizes and noting no performance difference in my testing. Of course, my testing might not simulate real-life usage, but it was what I went with. I think 128MB is quite a large permanent generation... you disagree?
Steven Van Gemert
I'd really like to see how your server handled things if you didn't restart it every day. Also, if your heap is in reality only using around 100mb, i think you could reduce your max heap size considerably and rely on the GC to clean things up more regularly.
-XX:+UseParNewGC is only going to clean up the young generation, so i think you'd find using -XX:+UseParallelGC will make your fullGC start working properly.
I think we have nailed down a similar problem and fixed it as part of our
efforts on performance and scalability.
We will see if we can get this available on CF7.0.2.
Thanks,
Hemant
There is also an issue with large numbers of DSN's on the same box.
I mentioned earlier that the hotfix cured our memory problems. Well, it cured the fast leak. A slow leak remained in both the tenured and permanent generations. Clearing the variables scope in onRequestEnd has cured the slow leak in the tenured generation. I haven't yet been able to reproduce the perm gen leak on my test environment.
Jaime Metcher
causes me to restart ColdFusion 2-3 times per day when site is under load.
This site uses lots of CFCs. Other sites that do not use CFC's have no problem.
Have installed 702 and the bundled hotfixes available as of today and not seen any resolution.
Have setup the workaround to clear out variables structure as well (which may have slowed the leak some).
Mostly just documenting this to be updated on any further developments, but if anyone has
any other ideas, please feel free to reply.
Anyone tested this on CF8 RC?
this article is very interesting! I'm investigating to resolve memory leak problems on my application, so I installed OptimizeIt. Can someone make to know me why I don't see cfc and cfm in class name list?
Thanks
I'm currently running java version 1.4.2_15. Has anyone tried to upgrade to 1.5? Grasping at straws here...
Regarding running cf7 on Java 1.5, it does work to a degree, but some things break. Don't remember exactly what, but when i tested it i ran into problems.
So I added the structClear(variables) statement to the bottom of our onRequestEnd.cfm. Immediately memory usage plateaued, and after a few hours it began to gradually decrease and level off at around 50%. I was stunned that this worked. I was ready to dance on my desk.
Problem is that after a 10 hour period, the memory started climbing again and the server needed a restart. Once we restart it, memory hangs around 50% usage for about 8-10 hours before it begins the climb to 100% again. The leak is about 1 meg every 6 mins for us.
I am so close to having this problem mitigated (resolved isn't the right word here). Anyone have a clue why this structClear technique is causing this new behavior? At least the problem has gotten a little better. Things have definately improved after adding that structClear, but we're still not quite stable.
I think I'll bring this up at the expert desk at MAX. Maybe they have a resolution.
Regardless, thank you for all the time you sunk into figuring out what the problem was. You've saved us countless hours.
I see a couple of you saying you don't have time to investigate--do you have time to let someone else investigate for you or with you? Sure, I realize you may not want to just give remote access to someone, myself included. But would you be willing to spend even a half hour working together over a remote shared desktop (like Adobe Connect) and phone call, both on my dime? I'm willing to offer that to help dig into this more. I'd love to help see it resolved (though it may be different things for different shops).
I have some ideas of things I'd like to explore, and rather than outline them all here, I'd prefer to work with someone experiencing this problem.
If you're interested, or if someone finds it in the future, drop me a note at charlie@carehart.org. If I get too many requests, I may not be able to take them all up, but certainly we'll report here if we find anything meaningful.
First, thanks Charlie for spending some time taking a look at our setup. After talking with him, we backed off our session variable timeout to 5 hours. This helped a little, but we were still slowly climbing to 100% memory usage.
We also installed the latest CF updater, which had no effect.
Then, at the suggestion of a colleague, we disabled "Maintain database connections" in the admin and restarted the server. Ever since that change, memory usage has been hanging around 20% for about a week.
We do have the latest macromedia_drivers.jar file installed (the 3.5 drivers), so I don't think they were the issue. It seems as if CF had a problem clearing out those connections from memory, resulting in the problem we were seeing.
Interesting to hear that it was the maintain connections setting. Hey Joe, can you report what DBMS it is (I can't remember). It could be useful as people consider this.
Joe, I've been running with persistent DB connections switched off for the past few weeks and it doesn't seem to have made any difference. It was worth a try, but it looks like the memory leak saga continues.
Charlie, sorry I haven't had a chance to catch up with you yet. I'll try contacting you later this week to see if we can have a look at the issue together.
Working around the issue here has leveled our JVM memory consumption and we have been running smoothly for days now.
The bottom line: NEVER set a component to the session scope from within another CFC.
I have a test case/code that will reproduce the problem which I can send anyone who might be interested analyzing it. Email me - brett@webflint.com
.brett
that IS only part of the problem .. it also applies to CFC's set in any other scope... i.e. variables, application, request ... but not var. In terms of good design practice, it's not a good idea to reference the shared scopes throughout your cfc's as it breaks encapsulation and creates bad dependencies. I use a session facade component to reference any variables/cfc's stored in the session scope.
The particular problem I found exists _even_ if you pass the SESSION through to the component as an argument, sorry I wasn't clear on that point before.
This would still lead me to believe your session facade is the problem, as it was in mine.
I have about 10 lines of code that will reproduce it if you want to look closer and see if it relates to what you are doing.
.brett
Patrick
Brett, I agree with you that the session Facade is part of the problem, but IT SHOULDN'T BE !! The problem also exists in plain old cfm files in the view layer where cfc's are referenced in the variables scope. This is clearly a memory leak and problem with CF.
Jason - Yes my issue is a clear bug where stranded component classes can live for the life of a session and are not released for garbage collection after the request or cfc function call is completed, even though they are never assigned to the Session scope. The build up of these stranded components lead to the JVM memory being consumed and CF hanging (10s of thousands of component classes in my case).
I have not found a work around other than not assigning components to the Session scope inside of other CFCs. It does not matter if you pass in the session scope as an argument either.
Again if someone want to look at my code to reproduce this let me know, I'd like a better understanding of it myself.
Using beans to hold each record from the database, I blew out the heap within a few thousands records. I spent several days writing and re-writing routines in procedural and OO code styles to see what worked best. I found that a fully procedural indexing routine was able to index 10,000 documents without ever exceeding 60 MB in the heap. Even the most carefully written use of components resulted in a heap size of 137 MB, while earlier iterations blew the heap out at 500 MB.
I just wrote a series of blog posts at blog.emergentpath.com on the subject if anyone wants to take a look at what I did. My general feeling is that CF needs a way to set objects to NULL and mark them for GC. Otherwise, as developers we will always be at the mercy of the underlying engine to correctly figure out which objects to retain and which to destroy.
All versions appear to have leaks and do not scale well out of the box.
To make the leaks go away and get good scalability before we start JVM tuning we do this:
- Implement your database connection pool with Apache DBCP. The Macromedia/Adobe pool is a horrid pig.
- Use type 4 JDBC drivers which do not come from Data Direct, Microsoft, or Macromedia/Adobe. SQLServer is particularly bad, and you have to use jTDS.
We do this for all our new configurations and its like night and day.
this one slipped through the cracks. We're hosting on windows so we just run a batch script using the 'sc' command to restart the CF service when the probe fails.