Malloc/free data structure corruption problem
Robert McConnell
2013-04-25 03:28:01 UTC
Recent revisions in a large mature software system have revealed a
malloc/free related problem in my code.

In the test configuration the code allocates a list of six successive
essentially identical small data blocks. Each allocation is separated by
a lot of code. Under user control I can choose to free any of the
blocks. When I try to free the third block, and only the third block, I
get a "task exception ... access violation" message. I'm assuming that
some type of buffer overrun or equivalent between the time I allocate
the memory and the time I try to free it is corrupting the allocation
data structures related to this block. I'm trying to track down the code
that is doing the corruption. To do so I'm trying to figure out if
there's anything I can monitor, preferably using the debugger, to
identify exactly when the corruption takes place in order to track down
the source.

Any suggestions would be greatly appreciated.

-rob
Paul S. Person
2013-04-25 16:50:52 UTC
Post by Robert McConnell
Recent revisions in a large mature software system have revealed a
malloc/free related problem in my code.
In the test configuration the code allocates a list of six successive
essentially identical small data blocks. Each allocation is separated by
a lot of code. Under user control I can choose to free any of the
blocks. When I try to free the third block, and only the third block, I
get a "task exception ... access violation" message. I'm assuming that
some type of buffer overrun or equivalent between the time I allocate
the memory and the time I try to free it is corrupting the allocation
data structures related to this block. I'm trying to track down the code
that is doing the corruption. To do so I'm trying to figure out if
there's anything I can monitor, preferably using the debugger, to
identify exactly when the corruption takes place in order to track down
the source.
Any suggestions would be greatly appreciated.
I have some vague ideas that may help.

The debugger might be able to show you the raw memory, containing the
pointer that won't work. A break on that memory to be activated when
its contents are changed to the bad value might work.

If you can't get that to work for some reason, there are two
alternative possibilities, which are basically a manual breakpoint.
You would want to make sure the test harness is clean before proceeding
to the actual code:

There are heap checking functions available in the library. They might
help pinpoint where the problem is, once you figure out how the
problem manifests itself to them (what error code they return).
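
As a rough illustration of the "what error code they return" part: in Open Watcom (and in Microsoft's C runtime) `_heapchk()` is declared in `<malloc.h>` and returns one of `_HEAPOK`, `_HEAPEMPTY`, `_HEAPBADBEGIN` or `_HEAPBADNODE`. The sketch below only turns that status into readable text. The names `heap_state`, `heap_state_now` and `heap_state_name` are made up for this example, and the branch for other compilers is an assumption added so the file still compiles where `_heapchk()` does not exist.

```c
#include <assert.h>
#include <string.h>

#if defined(__WATCOMC__) || defined(_MSC_VER)
#include <malloc.h>     /* _heapchk() and the _HEAP* status codes */
#endif

/* Hypothetical, compiler-neutral view of the heap checker's answer. */
typedef enum {
    HS_OK,          /* heap looks consistent                    */
    HS_EMPTY,       /* heap has not been used yet               */
    HS_BAD_BEGIN,   /* the start of the heap could not be found */
    HS_BAD_NODE,    /* a damaged entry was found in the heap    */
    HS_UNKNOWN      /* no checker available, or unexpected code */
} heap_state;

/* Ask the runtime library to walk the heap and classify the result. */
static heap_state heap_state_now(void)
{
#if defined(__WATCOMC__) || defined(_MSC_VER)
    switch (_heapchk()) {
    case _HEAPOK:       return HS_OK;
    case _HEAPEMPTY:    return HS_EMPTY;
    case _HEAPBADBEGIN: return HS_BAD_BEGIN;
    case _HEAPBADNODE:  return HS_BAD_NODE;
    default:            return HS_UNKNOWN;
    }
#else
    /* Assumption: this toolchain has no _heapchk(), so nothing is known. */
    return HS_UNKNOWN;
#endif
}

/* Short text for each state, suitable for a log line. */
static const char *heap_state_name(heap_state s)
{
    switch (s) {
    case HS_OK:        return "ok";
    case HS_EMPTY:     return "empty";
    case HS_BAD_BEGIN: return "bad begin";
    case HS_BAD_NODE:  return "bad node";
    default:           return "unknown";
    }
}
```

A corrupted allocation header would most likely show up as the "bad node" case, but that is worth confirming against the library documentation for the compiler in use.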

You could also try freeing the block at various points in the program
and so eventually identify the line of code that is causing the
problem.

Other people probably have much better ideas, however.
--
"Nature must be explained in
her own terms through
the experience of our senses."
Robert McConnell
2013-04-25 20:39:51 UTC
Paul,

Many thanks for the ideas. Some comments and further questions below.

-rob
Post by Paul S. Person
Post by Robert McConnell
Recent revisions in a large mature software system have revealed a
malloc/free related problem in my code.
In the test configuration the code allocates a list of six successive
essentially identical small data blocks. Each allocation is separated by
a lot of code. Under user control I can choose to free any of the
blocks. When I try to free the third block, and only the third block, I
get a "task exception ... access violation" message. I'm assuming that
some type of buffer overrun or equivalent between the time I allocate
the memory and the time I try to free it is corrupting the allocation
data structures related to this block. I'm trying to track down the code
that is doing the corruption. To do so I'm trying to figure out if
there's anything I can monitor, preferably using the debugger, to
identify exactly when the corruption takes place in order to track down
the source.
Any suggestions would be greatly appreciated.
I have some vague ideas that may help.
The debugger might be able to show you the raw memory, containing the
pointer that won't work. A break on that memory to be activated when
its contents are changed to the bad value might work.
I've tried that. The pointer never stops pointing to the original
address returned by malloc() and the values at that address and beyond
remain uncorrupted. Of course the possibility exists for corruption just
before it in memory. Do you know how many bytes the system reserves
before that address and what they contain, or alternatively where this
information might be found in the source code?
Post by Paul S. Person
If you can't get that to work for some reason, there are two
alternative possibilities, which are basically a manual breakpoint.
You would want to make sure the test harness is clean before proceeding
There are heap checking functions available in the library. They might
help pinpoint where the problem is, once you figure out how the
problem manifests itself to them (what error code they return).
I'll try that.
Post by Paul S. Person
You could also try freeing the block at various points in the program
and so eventually identify the line of code that is causing the
problem.
That's basically the approach I've been taking, but only where I can do
it without altering the code. I may have to bite the bullet and do the
code modifications.
Post by Paul S. Person
Other people probably have much better ideas, however.
Paul S. Person
2013-04-26 17:52:57 UTC
Post by Robert McConnell
Paul,
Many thanks for the ideas. Some comments and further questions below.
-rob
Post by Paul S. Person
Post by Robert McConnell
Recent revisions in a large mature software system have revealed a
malloc/free related problem in my code.
In the test configuration the code allocates a list of six successive
essentially identical small data blocks. Each allocation is separated by
a lot of code. Under user control I can choose to free any of the
blocks. When I try to free the third block, and only the third block, I
get a "task exception ... access violation" message. I'm assuming that
some type of buffer overrun or equivalent between the time I allocate
the memory and the time I try to free it is corrupting the allocation
data structures related to this block. I'm trying to track down the code
that is doing the corruption. To do so I'm trying to figure out if
there's anything I can monitor, preferably using the debugger, to
identify exactly when the corruption takes place in order to track down
the source.
Any suggestions would be greatly appreciated.
I have some vague ideas that may help.
The debugger might be able to show you the raw memory, containing the
pointer that won't work. A break on that memory to be activated when
its contents are changed to the bad value might work.
I've tried that. The pointer never stops pointing to the original
address returned by malloc() and the values at that address and beyond
remain uncorrupted. Of course the possibility exists for corruption just
before it in memory. Do you know how many bytes the system reserves
before that address and what they contain, or alternatively where this
information might be found in the source code?
IIRC, 8 bytes. But I could be wrong; I certainly was wrong to suggest
looking at "raw memory". If you look at "pointer memory" and move up
one line, the last X bytes will be what the heap uses. Any fill bytes
should also be visible. At least, that works here with wgml; however,
wgml is using trmem, something provided in the repository and so
generally available, which may affect what I am seeing (by providing
the fill bytes or adding an additional number of bytes before or after
the actual struct; I'm not sure how it works).

If you have the repository, and are willing to use trmem (it's pretty
invasive), it can catch all sorts of problems, including leaked
memory, more easily than the heap checking functions.
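
To make the idea of fill bytes around an allocation concrete, here is a minimal sketch of a guard-byte wrapper over `malloc`/`free`. It is not trmem and makes no claim about how trmem is implemented; the names `guard_malloc`, `guard_check` and `guard_free`, the fill value `0xA5` and the 8-byte guard length are all assumptions chosen for the example.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

#define GUARD_LEN  8      /* fill bytes on each side of the user block */
#define GUARD_BYTE 0xA5   /* arbitrary, easily recognised fill value   */
#define PREFIX     16     /* room for the size field plus front guard  */

/* Layout of one allocation (assumes sizeof(size_t) <= 8):
 *   [size, padded to 8 bytes][front guard][user bytes][rear guard]
 */
static void *guard_malloc(size_t n)
{
    unsigned char *raw;

    if (n > (size_t)-1 - PREFIX - GUARD_LEN)
        return NULL;                        /* request would overflow */
    raw = malloc(PREFIX + n + GUARD_LEN);
    if (raw == NULL)
        return NULL;
    memcpy(raw, &n, sizeof n);              /* remember the user size */
    memset(raw + PREFIX - GUARD_LEN, GUARD_BYTE, GUARD_LEN);
    memset(raw + PREFIX + n, GUARD_BYTE, GUARD_LEN);
    return raw + PREFIX;
}

/* Returns 0 if both guards are intact, 1 for an underrun,
 * 2 for an overrun, 3 if both guards were damaged. */
static int guard_check(const void *p)
{
    const unsigned char *user = p;
    const unsigned char *raw = user - PREFIX;
    size_t n, i;
    int damage = 0;

    memcpy(&n, raw, sizeof n);
    for (i = 0; i < GUARD_LEN; i++) {
        if (raw[PREFIX - GUARD_LEN + i] != GUARD_BYTE)
            damage |= 1;
        if (user[n + i] != GUARD_BYTE)
            damage |= 2;
    }
    return damage;
}

/* Checks the guards, releases the block, and reports what it found. */
static int guard_free(void *p)
{
    int damage;

    if (p == NULL)
        return 0;
    damage = guard_check(p);
    free((unsigned char *)p - PREFIX);
    return damage;
}
```

Calling `guard_check()` on a suspect block at several points in the program narrows down when its neighbourhood was first overwritten, without waiting for `free` to trap.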
Post by Robert McConnell
Post by Paul S. Person
If you can't get that to work for some reason, there are two
alternative possibilities, which are basically a manual breakpoint.
You would want to make sure the test harness is clean before proceeding
There are heap checking functions available in the library. They might
help pinpoint where the problem is, once you figure out how the
problem manifests itself to them (what error code they return).
I'll try that.
If you have the repository, ...\bld\wgml\research has a module
"heapchk" in both the c and h subdirectories that might help. Then
again, it is pretty much the example in the docs slightly adapted,
IIRC.
Post by Robert McConnell
Post by Paul S. Person
You could also try freeing the block at various points in the program
and so eventually identify the line of code that is causing the
problem.
That's basically the approach I've been taking but only where I can do
it without altering the code. I may have to bite the bullet and do the
code modifications.
Post by Paul S. Person
Other people probably have much better ideas, however.
--
"Nature must be explained in
her own terms through
the experience of our senses."
Robert McConnell
2013-04-26 18:54:19 UTC
Paul, Lynn, thanks to both for the help.

Repeated insertions of a small tool using _heapchk tracked the
problem down to a single incorrect digit in a minor addition I made a
month ago.

-rob
Lynn McGuire
2013-04-28 12:40:26 UTC
Post by Robert McConnell
Paul, Lynn, thanks to both for the help.
Repeated insertions of a small tool using _heapchk tracked the
problem down to a single incorrect digit in a minor addition I made a
month ago.
-rob
Good to hear.

Thanks,
Lynn
Wilton Helm
2013-05-10 17:14:42 UTC
I believe malloc places some housekeeping data just before the allocated
block, which free accesses with a negative offset on the pointer, but I
could be wrong. Checking the source code would show exactly what is going
on. It would be my guess that your code accidentally went backwards off of
the pointer and altered that information, causing free to access memory
other than what it was supposed to--hence the segmentation fault.

Wilton
Steven Levine
2013-05-11 21:15:02 UTC
On Fri, 10 May 2013 17:14:42 UTC, "Wilton Helm" <***@compuserve.com>
wrote:

Hi Wilton,
Post by Wilton Helm
I believe malloc places some housekeeping data just before the allocated
block, which free accesses with a negative offset on the pointer, but I
could be wrong.
This is how it works.
Post by Wilton Helm
It would be my guess that your code accidentally went backwards off of
the pointer and altered that information, causing free to access memory
other than what it was supposed to--hence the segmentation fault.
FWIW, my experience is that heap corruption is more often caused by
array overrun than underrun. Often the code that corrupted the heap
is totally unrelated to the code that trapped because of the
exception.

Fortunately, the heap corruption that caused the wifpc defect I have
been discussing was not one of these. A couple of well-placed
_heapchk() calls allowed Gregg to isolate the defective code relatively
quickly.

Steven
--
---------------------------------------------------------------------
Steven Levine <***@earthlink.bogus.net>
eCS/Warp/DIY etc. www.scoug.com www.ecomstation.com
---------------------------------------------------------------------
Wilton Helm
2013-05-21 11:32:44 UTC
Post by Steven Levine
FWIW, my experience is that heap corruption is more often caused by
array overrun than underrun. Often the code that corrupted the heap
is totally unrelated to the code that trapped because of the
exception.
Because it is corrupting the header of the next allocation. And yes, that
would be harder to track because it isn't correlated. The best (still
difficult) approach would be to set a breakpoint at a spot where the
allocation has been done, but is not yet corrupted and then set a data
breakpoint in the header that is getting corrupted. Alternately, knowing
what is getting corrupted could allow one to figure out what was allocated
just before that, which might be what is being overrun and focus attention
on code related to it.
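
The mechanism can be shown with a toy allocator. This is a sketch under stated assumptions, not the layout any real library uses: the arena, the 8-byte header with a magic value, and the names `toy_alloc`, `toy_header_ok` and `toy_overrun_demo` are all invented for the illustration.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define ARENA_SIZE 256
#define HDR_MAGIC  0xA110CA7Eu   /* arbitrary marker for a healthy header */

/* Housekeeping data stored immediately before every block. */
typedef struct {
    uint32_t magic;
    uint32_t size;
} toy_header;

static unsigned char arena[ARENA_SIZE];
static size_t arena_used = 0;

/* Hand out `size` bytes, preceded by a header, from a fixed arena. */
static void *toy_alloc(uint32_t size)
{
    toy_header h;
    unsigned char *p;

    if (arena_used + sizeof h + size > ARENA_SIZE)
        return NULL;
    h.magic = HDR_MAGIC;
    h.size = size;
    p = arena + arena_used;
    memcpy(p, &h, sizeof h);
    arena_used += sizeof h + size;
    return p + sizeof h;
}

/* Read the header back through a negative offset, as a free() might. */
static int toy_header_ok(const void *block)
{
    toy_header h;

    memcpy(&h, (const unsigned char *)block - sizeof h, sizeof h);
    return h.magic == HDR_MAGIC;
}

/* Returns 1 when overrunning block a damaged only block b's header. */
static int toy_overrun_demo(void)
{
    char *a, *b;

    arena_used = 0;
    a = toy_alloc(8);
    b = toy_alloc(8);
    if (a == NULL || b == NULL)
        return -1;
    memset(b, 'b', 8);
    memset(a, 'x', 8 + 2);      /* bug: 2 bytes past the end of a */

    /* a's own header and b's contents are untouched; b's header is not. */
    return toy_header_ok(a) && b[0] == 'b' && !toy_header_ok(b);
}
```

Here the block that fails to free (`b`) has intact contents and an unchanged pointer, and the damage sits just below its address, written by code that only ever meant to touch `a`.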

Wilton
Steve Fabian
2013-05-21 12:54:50 UTC
Wilton Helm wrote:
| Because it is corrupting the header of the next allocation. And yes,
| that would be harder to track because it isn't correlated. The best
| (still difficult) approach would be to set a breakpoint at a spot
| where the allocation has been done, but is not yet corrupted and then
| set a data breakpoint in the header that is getting corrupted.
| Alternately, knowing what is getting corrupted could allow one to
| figure out what was allocated just before that, which might be what
| is being overrun and focus attention on code related to it.

But beware: though much less likely, the overrun could come from data being
written into an earlier allocated block. That would require a much larger
overrun, though, even if the allocation immediately below the broken one
was small.
--
HTH, Steve
Paul S. Person
2013-05-21 17:00:57 UTC
On Tue, 21 May 2013 05:32:44 -0600, "Wilton Helm"
Post by Wilton Helm
Post by Steven Levine
FWIW, my experience is that heap corruption is more often caused by
array overrun than underrun. Often the code that corrupted the heap
is totally unrelated to the code that trapped because of the
exception.
Because it is corrupting the header of the next allocation. And yes, that
would be harder to track because it isn't correlated. The best (still
difficult) approach would be to set a breakpoint at a spot where the
allocation has been done, but is not yet corrupted and then set a data
breakpoint in the header that is getting corrupted. Alternately, knowing
what is getting corrupted could allow one to figure out what was allocated
just before that, which might be what is being overrun and focus attention
on code related to it.
If I am currently working on the code, then it is quite simple: I go
back and look at what I just changed. It is amazing how often that
works.

The heap-checking functions, suitably adapted, can be used to zero in
on exactly which statement in the program corrupts the heap. While
this can be tiresome, involving a seemingly-endless series of move the
heap check/compile/link/run cycles, it is often very helpful.

It also helps to keep in mind that the problem is far more likely to
be in my code than in the compiler or the library.
--
"Nature must be explained in
her own terms through
the experience of our senses."
Robert McConnell
2013-06-17 13:48:58 UTC
Post by Paul S. Person
On Tue, 21 May 2013 05:32:44 -0600, "Wilton Helm"
Post by Wilton Helm
Post by Steven Levine
FWIW, my experience is that heap corruption is more often caused by
array overrun than underrun. Often the code that corrupted the heap
is totally unrelated to the code that trapped because of the
exception.
Because it is corrupting the header of the next allocation. And yes, that
would be harder to track because it isn't correlated. The best (still
difficult) approach would be to set a breakpoint at a spot where the
allocation has been done, but is not yet corrupted and then set a data
breakpoint in the header that is getting corrupted. Alternately, knowing
what is getting corrupted could allow one to figure out what was allocated
just before that, which might be what is being overrun and focus attention
on code related to it.
If I am currently working on the code, then it is quite simple: I go
back and look at what I just changed. It is amazing how often that
works.
The heap-checking functions, suitably adapted, can be used to zero in
on exactly which statement in the program corrupts the heap. While
this can be tiresome, involving a seemingly-endless series of move the
heap check/compile/link/run cycles, it is often very helpful.
It also helps to keep in mind that the problem is far more likely to
be in my code than in the compiler or the library.
Since I was the original cause of this discussion I thought it might be
worth commenting re the "seemingly-endless series of move the heap check
..." - I was able to reduce the effort significantly by writing a simple
function that optionally reports back with its location only when the
heap is bad. By inserting calls to this function at multiple strategic
points the number of compile/link/run cycles was greatly diminished. The
result was a very quick isolation of my offending code.
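
A function of that kind might look roughly like the sketch below. This is not the code actually used; `heap_checkpoint`, `HEAP_CHECKPOINT`, `heap_probe` and the other names are hypothetical, and the branch for compilers without `_heapchk()` is an assumption that simply reports a healthy heap.

```c
#include <assert.h>
#include <stdio.h>

#if defined(__WATCOMC__) || defined(_MSC_VER)
#include <malloc.h>     /* _heapchk(), _HEAPOK, _HEAPEMPTY */
#endif

/* Returns nonzero when the runtime's heap checker reports damage. */
static int default_probe(void)
{
#if defined(__WATCOMC__) || defined(_MSC_VER)
    int rc = _heapchk();
    return rc != _HEAPOK && rc != _HEAPEMPTY;
#else
    return 0;   /* assumption: no checker here, so report "fine" */
#endif
}

/* Stand-in probe used only to demonstrate the reporting path. */
static int probe_always_bad(void)
{
    return 1;
}

static int (*heap_probe)(void) = default_probe;  /* replaceable probe     */
static int heap_checks_enabled = 1;              /* the "optionally" part */
static const char *first_bad_file = NULL;        /* earliest bad location */
static int first_bad_line = 0;

/* Stays silent while the heap is healthy; reports the call site once
 * the heap is found to be damaged. Returns 1 if damage was reported. */
static int heap_checkpoint(const char *file, int line)
{
    if (!heap_checks_enabled || !heap_probe())
        return 0;
    if (first_bad_file == NULL) {
        first_bad_file = file;
        first_bad_line = line;
    }
    fprintf(stderr, "heap damaged at or before %s:%d\n", file, line);
    return 1;
}

/* Drop this at strategic points; it records where it was called from. */
#define HEAP_CHECKPOINT() heap_checkpoint(__FILE__, __LINE__)
```

Because a checkpoint prints nothing while the heap is intact, many of them can be scattered through the code in a single build; the first one to report bounds the offending statement between itself and the checkpoint executed just before it.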

Lynn McGuire
2013-04-26 16:52:10 UTC
Recent revisions in a large mature software system have revealed a malloc/free related problem in my code.
In the test configuration the code allocates a list of six successive essentially identical small data blocks. Each allocation is
separated by a lot of code. Under user control I can choose to free any of the blocks. When I try to free the third block, and only
the third block, I get a "task exception ... access violation" message. I'm assuming that some type of buffer overrun or equivalent
between the time I allocate the memory and the time I try to free it is corrupting the allocation data structures related to this
block. I'm trying to track down the code that is doing the corruption. To do so I'm trying to figure out if there's anything I can
monitor, preferably using the debugger, to identify exactly when the corruption takes place in order to track down the source.
Any suggestions would be greatly appreciated.
-rob
Our fortran code uses malloc and free extensively.
Works fine for us. The largest number of mallocs
and frees that I have seen in a single run is
around 10,000 to 20,000 (hard to get a precise
count).

Lynn