Re: PPMC progress

Ray wrote:

> On Sun, 10 Dec 2000,  Jon wrote:
> > I now have all 3 axes that are supported in my versions of EMC
> > working from encoder input to DAC output.
>
> Good News.  I'd like to show one running something at NAMES.  Let's talk
> off list about this.
>
> > The crashing problem I had is actually an old one, having
> > something to do with either Linux 2.0.36 or the RT patch
> > and the TkEMC GUI.  I have had a number of versions
> > of EMC later than 15-Mar-2000 that would hang up the
> > XFree86 screen, keyboard and mouse, or the whole
> > system.  I tried running the xemc GUI instead of TkEMC,
> > and it continued to work for about 45 minutes, until I
> > glitched the hardware with an errant scope probe.
> > Ray, do you have any ideas about this?  I'd much
> > prefer to stay with the TkEMC, but if it will only work
> > with 1999 releases of EMC, that is kind of limiting.
>
> Yes, and I was using a 99 release at NAMES last year and got the same kind
> of lock up a couple times.
>
> Ho! Wah! (a Yooperism whose meaning varies greatly with the inflection
> used to express it.)  And it seems to get worse with the tcl lib that I've
> been building.  I thought that it might be related to the problems with
> mbuff but if you have increased the problems with your additions to 2.0.36
> then it is more fundamental than that.

No, I don't at all think my changes affected this problem.  I now have TWO
machines, very different, that have the same trouble!  One with an STG, and
one with my hardware.  It will lock up anywhere from 5 seconds to two
hours after starting EMC with the TkEMC GUI, but doesn't lock up with
xemc, at least on the new machine with my hardware.  I don't think I
tried it with xemc on the actual machine tool.

> I have a very hard time believing that Tcl/Tk all by itself can reach in
> and trash the mouse and keyboard IO.  There must be some overlap between
> the EMC code, the tcl interpreter, emcsh, and the commands being passed
> from one end of the programming stack to the other.

Well, I don't know that it does that, either.  The caps lock key, like everything
else, may be controlled by X Windows when the keyboard is connected
to the X window manager.  The only reason I think it is more serious than
just an X lockup is that I can't get to any of the console terminals with
stuff like ctrl/alt/F2 and ctrl/alt/F7.  But maybe when you are in X, those
commands are actually handled by X as well.  I do know that when it
locks up, the RT motion task keeps right on working, you just lose
control and the screen freezes.

> Another thing that I have noticed, though this is not a hard empirical
> conclusion, is that a newly installed EMC does it less frequently than does
> a used system.  Go ahead and laugh and flame but the software seems to
> degrade over many starts.  With the lib work, I've had to reinstall the
> 9-25 bin at least three times or after a few dozen starts, it will hardly
> run cds.ngc at all.

HMMMMMMMM.... that's a very interesting observation!
I can't give any empirical evidence to support it, but it DOES
fit the pattern I saw when I loaded 15-MAR-2000 on the
machine tool computer.  I got it running, and did some milling
for about 30 minutes, had it hang, but it completed the part
properly.  I then rebooted, and it croaked within 5 minutes.
I gave up at that point, and loaded 20-DEC-1999 back in.
I wish I'd tried xemc at that point, that would have been very
useful info.

I had thought this problem was related to the fixed interval scheduling
of the RT task, which certainly sounds plausible, as it was also
a change implemented at that time.  But, if xemc works and TkEMC
doesn't, that is not likely to be the problem.

But, I can see some possible ways TkEMC could cause real trouble.
Probably the biggest one (Ray, you know more about this than
anyone) is something to do with the process of initializing variables
in the RT task.  If anything in TkEMC scrambles the mechanism
that hands those variables across the shared memory, they could
literally trash any location in physical memory, by causing a wild
pointer in the RT task.  Now, this is a real stretch, since the
routines that accept these messages are pretty well put together.
On another tack, is there anything in the TCL interpreter and
its interface to the GUI that might be affected by a break in
program continuity?  I wonder if there is something maybe along
the lines of a graphics driver that, if interrupted by the RT scheduler,
could cause X to die, or maybe all of Linux?   Hmmm, one other
thing along that line.  Obviously, the RT scheduler has hooks in
all over the hardware, but especially the timer and interrupt
controller.  Maybe there is a microscopic chance that an unprotected
critical section in X is perfectly safe under plain Linux, but because
of the changes applied by the RT patch, interrupts can happen
when and where X expects they can't happen, and something
gets really trashed, like, maybe, the interrupt controller!  It
would have to be such that the interrupt controller was still servicing
the timer, but nothing else!  That would match the symptoms.
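
To make the wild-pointer idea concrete, here is a minimal sketch in Python.
It is not EMC code: the message layout, the names apply_param and gains,
and the MAX_AXES value are all made up for illustration.  It only shows the
kind of range check a receiver can apply to an index arriving across shared
memory before using it to address anything:

```python
import struct

# Hypothetical message layout, NOT the real EMC format:
# little-endian int32 axis index followed by a float64 value.
MSG_FORMAT = "<id"
MAX_AXES = 3  # assumed axis count for this sketch


def apply_param(table, raw):
    """Unpack one message and store its value, rejecting bad input."""
    # A message of the wrong size is treated as scrambled and ignored.
    if len(raw) != struct.calcsize(MSG_FORMAT):
        return False
    axis, value = struct.unpack(MSG_FORMAT, raw)
    # Range-check the index before it is used to address the table.
    if not 0 <= axis < MAX_AXES:
        return False
    table[axis] = value
    return True


gains = [0.0] * MAX_AXES
ok = apply_param(gains, struct.pack(MSG_FORMAT, 1, 2.5))      # valid index
bad = apply_param(gains, struct.pack(MSG_FORMAT, 1000, 9.9))  # scrambled index
```

In C the same unchecked index would turn into a write to an arbitrary
address, which is the failure I am speculating about above.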

> I thought for a while that the problem was that when the EMC shut down, it
> left some garbage around.  It will often leave mbuff in the core but that
> is only a post 2.0.37 problem.  Rebooting seems to help some but after a
> bunch of starts and stops, it gets almost impossible to work with.  Almost
> as irritating as BSOD.  I haven't reinstalled the rtlinux only the EMC.

Well, something is designed to stick around, and that is the program
variables file, and also the tool table.  Now,  there was a glitch, maybe
only in stepper versions that trashed motor parameters in the .ini
file, I think.  Maybe there's something else that gets changed that
wasn't noticed.

> My plan was to reinstall everything from Mandrake on but I'd like to get
> to the bottom of the real problem rather than just getting a new run at
> it.  Since I have a "cripple" here, I'd like us to study it.  But I have no
> clue what to do next.
>
> Could someone help with a debug routine that traces the whole system
> through?  I could run it here and post the results for study. Will tried a
> while back with the mbuff problem but I found no real clues from all of
> what he suggested.   As Jon points out below, any debug routine would have
> to save to a fifo during the run 'cause there is no good way back into the
> kernel.  Maybe a network, and I could set up one of those.

I have a $130,000 super high-end (but a little old) logic analyzer here
that I got for a song on eBay.  It has the capability to do 386 and 486
at least, and I think Pentium processors, too.  And, as it has just gone
obsolete at Tektronix, I can now get all the disassemblers for free from
them.  But, I would need a processor tap, at least, or possibly a processor
adaptor, depending on how Tek implemented it for the specific processor.
I will try to see what is needed.  Right now, I'm doing this development
on a 333 MHz Pentium II, but I could switch back to a 100 MHz
Pentium classic for the debugging, if I can get set up for that processor.
Does anyone have a PGA adaptor for the Pentium classic CPU?
This would plug into the CPU socket, and the CPU would then plug
into the adaptor.  The adaptor also brings out all the signal pins
from the CPU to headers where the logic analyzer probes are attached.

If I can get the full disassembler set up (may require a bonded-out
CPU) then I could get a full trace of 10,000+ instructions leading
up to the failure, assuming I can find something significant to trigger
on.  If I can't get disassembly, but just a PGA adaptor, I can still
get a full trace of all processor bus transactions.  That goes on the
assumption that anything going on in a Tk/TCL script or the
interpreter that hangs the whole system would be visible from the
outside of the CPU.
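
For anyone who has not used an analyzer this way, the capture works roughly
like the sketch below.  This is only an illustration in Python, not anything
from the Tek software: the class name, the depth of 4 and the "bus cycle"
strings are invented for the example.  The point is that only the most
recent events are kept, and the buffer freezes when the trigger is seen:

```python
from collections import deque


class PreTriggerTrace:
    """Keep only the most recent `depth` events; freeze on trigger."""

    def __init__(self, depth):
        # A deque with maxlen silently drops the oldest entry when full.
        self.buf = deque(maxlen=depth)
        self.triggered = False

    def record(self, event, is_trigger=False):
        # After the trigger, nothing more is stored, so the buffer holds
        # the history leading up to (and including) the trigger event.
        if self.triggered:
            return
        self.buf.append(event)
        if is_trigger:
            self.triggered = True

    def dump(self):
        return list(self.buf)


trace = PreTriggerTrace(depth=4)
for n in range(10):
    trace.record(f"bus cycle {n}", is_trigger=(n == 7))
history = trace.dump()  # the 4 events up to and including the trigger
```

The same idea would apply to a software trace written to a fifo: keep a
fixed-size window and stop overwriting it once the failure is detected.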

A problem that gets worse over time is a boon to the debugger,
unlike a problem that stays rare.

Jon



