Discussion:
gcc-4.8+ and R10000+
Joshua Kinard
2014-09-07 08:25:03 UTC
I've been banging my head on the desk over gcc PR61538 [1] the last few
months, and talking to the gcc people, I went looking through the R10000
manual again to try and see if some kind of errata sticks out. I found this
bit:

"""
Load Linked and Store Conditional instructions (LL, LLD,
SC, and SCD) do not implicitly perform SYNC operations in
the R10000. Any of the following events that occur between
a Load Linked and a Store Conditional will cause the Store
Conditional to fail: an exception; execution of an ERET,
a load, a store, a SYNC, a CacheOp, a prefetch, or an
external intervention/invalidation on the block containing
the linked address. Instruction cache misses do not cause
the Store Conditional to fail.
"""

The regression happens inside glibc's __lll_lock_wait_private routine:

void
__lll_lock_wait_private (int *futex)
{
  if (*futex == 2)
    lll_futex_wait (futex, 2, LLL_PRIVATE);

  while (atomic_exchange_acq (futex, 2) != 0)
    lll_futex_wait (futex, 2, LLL_PRIVATE);
}

It appears to hang forever on the "atomic_exchange_acq" function call.

Disassembling a statically-built copy of the "sln" binary generated by
glibc's compile phase, there are slight differences in how gcc-4.7 and
gcc-4.8 are compiling the __lll_lock_wait_private function. The key
differences in the output asm are as follows:

gcc-4.7:
x+4 <START>
...
x+24 bne v1,v0,<x+56>
...
x+32 0x7c03e83b /* rdhwr */
x+36 li a2,2
x+40 lw a1,-29832(v1)
x+44 move a3,zero
x+48 li v0,4238
x+52 syscall
* x+56 li v0,2
* x+60 ll v1,0(s0)
* x+64 move a0,v0
* x+68 sc a0,0(s0)
x+72 beqzl a0,<x+56>
x+76 nop
x+80 sync
x+84 bnez v1,<x+32>

gcc-4.8:
x+4 <START>
...
x+24 bne v1,v0,<x+56>
...
x+32 0x7c03e83b /* rdhwr */
x+36 li a2,2
x+40 lw a1,-29832(v1)
x+44 move a3,zero
x+48 li v0,4238
x+52 syscall
* x+56 ll v0,0(s0)
* x+60 li at,2
* x+64 sc at,0
x+68 beqzl at,<x+56>
x+72 nop
x+76 sync
x+80 bnez v0,<x+32>

Using gdb, if I step through 'sln', the gcc-4.7 copy never calls
__lll_lock_wait_private, so I have no idea how the insns are being executed.
But the 4.8 copy does get into this function, and stepping each instruction
at a time yields this execution path:

x+4 <START>
...
x+24 bne v1,v0,<x+56>
x+56 ll v0,0(s0)
x+68 beqzl at,<x+56> /* beqzl check fails -> x+76 */
x+76 sync
x+80 bnez v0,<x+32>
x+32 0x7c03e83b /* rdhwr */
x+36 li a2,2
x+40 lw a1,-29832(v1)
x+44 move a3,zero
x+48 li v0,4238
x+52 syscall
x+56 ll v0,0(s0)
<HANG>

Executing the 'bnez' insn puts us at the rdhwr insn (x+32); stepping onward,
the 'syscall' (x+52) returns and leaves us at the 'll' a second time, where
the program just hangs.

I am guessing at a few things here:

- Because ll/sc are atomic, gdb doesn't let you step through them, which is
why the instruction pointer jumps over the 'li' and 'sc' insns.

- The 'li' after 'll' triggers the 'sc' to fail on R10K.

Does this look correct for an R10000, given the above statement from the
manual? I'm not sure how or why this would cause the program to hang, but
it seems to directly correlate.

Anyone from Debian able to test building gcc-4.8 (or greater) and glibc-2.19
on an R10K system and see if it hangs at the end of glibc's compile phase
using the 'sln' binary to generate symlinks? I've run into this on R12000
and R14000 systems, and I assume it'll happen on an R10000 as well.

1: https://gcc.gnu.org/bugzilla/show_bug.cgi?id=61538
--
Joshua Kinard
Gentoo/MIPS
***@gentoo.org
4096R/D25D95E3 2011-03-28

"The past tempts us, the present confuses us, the future frightens us. And
our lives slip away, moment by moment, lost in that vast, terrible in-between."

--Emperor Turhan, Centauri Republic
Miod Vallat
2014-09-08 08:11:51 UTC
Post by Joshua Kinard
Disassembling a statically-built copy of the "sln" binary generated by
glibc's compile phase, there are slight differences in how gcc-4.7 and
gcc-4.8 are compiling the __lll_lock_wait_private function. The key
differences in the output asm are
[...]
Post by Joshua Kinard
x+4 <START>
...
x+24 bne v1,v0,<x+56>
...
x+32 0x7c03e83b /* rdhwr */
x+36 li a2,2
x+40 lw a1,-29832(v1)
x+44 move a3,zero
x+48 li v0,4238
x+52 syscall
* x+56 ll v0,0(s0)
* x+60 li at,2
* x+64 sc at,0
Note how the sc address is no longer 0(s0). Since the address does
not match the address used in the ll instruction, sc will always
fail on the R10k.

Miod
Joshua Kinard
2014-09-08 09:44:27 UTC
Post by Miod Vallat
Post by Joshua Kinard
Disassembling a statically-built copy of the "sln" binary generated by
glibc's compile phase, there are slight differences in how gcc-4.7 and
gcc-4.8 are compiling the __lll_lock_wait_private function. The key
differences in the output asm are
[...]
Post by Joshua Kinard
x+4 <START>
...
x+24 bne v1,v0,<x+56>
...
x+32 0x7c03e83b /* rdhwr */
x+36 li a2,2
x+40 lw a1,-29832(v1)
x+44 move a3,zero
x+48 li v0,4238
x+52 syscall
* x+56 ll v0,0(s0)
* x+60 li at,2
* x+64 sc at,0
Note how the sc address is no longer 0(s0). Since the address does
not match the address used in the ll instruction, sc will always
fail on the R10k.
That would be a typo on my part. I typed that out by hand and just missed it. It should read:

gcc-4.8:
x+4 <START>
...
x+24 bne v1,v0,<x+56>
...
x+32 0x7c03e83b /* rdhwr */
x+36 li a2,2
x+40 lw a1,-29832(v1)
x+44 move a3,zero
x+48 li v0,4238
x+52 syscall
* x+56 ll v0,0(s0)
* x+60 li at,2
* x+64 sc at,0(s0)
x+68 beqzl at,<x+56>
x+72 nop
x+76 sync
x+80 bnez v0,<x+32>

Thanks!
Joshua Kinard
2014-09-22 09:31:35 UTC
Post by Joshua Kinard
Post by Miod Vallat
Post by Joshua Kinard
Disassembling a statically-built copy of the "sln" binary generated by
glibc's compile phase, there are slight differences in how gcc-4.7 and
gcc-4.8 are compiling the __lll_lock_wait_private function. The key
differences in the output asm are
[...]
Post by Joshua Kinard
x+4 <START>
...
x+24 bne v1,v0,<x+56>
...
x+32 0x7c03e83b /* rdhwr */
x+36 li a2,2
x+40 lw a1,-29832(v1)
x+44 move a3,zero
x+48 li v0,4238
x+52 syscall
* x+56 ll v0,0(s0)
* x+60 li at,2
* x+64 sc at,0
Note how the sc address is no longer 0(s0). Since the address does
not match the address used in the ll instruction, sc will always
fail on the R10k.
x+4 <START>
...
x+24 bne v1,v0,<x+56>
...
x+32 0x7c03e83b /* rdhwr */
x+36 li a2,2
x+40 lw a1,-29832(v1)
x+44 move a3,zero
x+48 li v0,4238
x+52 syscall
* x+56 ll v0,0(s0)
* x+60 li at,2
* x+64 sc at,0(s0)
x+68 beqzl at,<x+56>
x+72 nop
x+76 sync
x+80 bnez v0,<x+32>
I did some more tracing. It seems the issue with glibc itself stems from the
addition of the __atomic_* builtins, added generally in gcc-4.7. From
ports/sysdeps/mips/bits/atomic.h (for 2.19) or sysdeps/mips/bits/atomic.h
(for 2.20):

/* The __atomic_* builtins are available in GCC 4.7 and later, but MIPS
support for their efficient implementation was added only in GCC 4.8.
We still want to use them even with GCC 4.7 for MIPS16 code where we
have no assembly alternative available and want to avoid the __sync_*
builtins if at all possible. */

#if __GNUC_PREREQ (4, 8) || (defined __mips16 && __GNUC_PREREQ (4, 7))
[snip]

This is why the assembly is different between the two gcc versions. This same
code is in the kernel's atomic.h copy under arch/mips/include/asm/ as well.

I tested by removing the top part of the #if macro and basically forcing the
inline versions only, then rebuilt glibc-2.20 with gcc-4.9.2 (20140921
prerelease), and lo and behold, sln executes and returns its usage information.
When using the gcc internal builtins, a futex gets used, which is why I wasn't
seeing futexes in 4.7-built copies of sln, only in 4.8 or greater-built copies.
This means that the gcc internal __atomic_* builtins may be somewhat to blame
for this problem on R1x000 systems.

I traced out the kernel side of the problem and found that when the futex is
taken by sln, the process gets frozen by the scheduler via a call to
freezable_schedule() in futex_wait_queue_me in kernel/futex.c. I added two
printk statements, one before freezable_schedule() and one after; the first
one executes (verified by reading /proc/kmsg directly, because dmesg itself
generates futexes), but the one after freezable_schedule() only executes when
I ctrl+C the frozen process and it exits out of the futex code.

I visually checked through include/linux/freezer.h and noticed that
freezable_schedule eventually calls freezing(), which executes an atomic_read()
on system_freezing_cnt. In the mips code, that just comes out as a pointer
dereference of a volatile variable. I'm not certain, though, if in gcc's case,
the use of volatile means it tries to use its builtin __atomic_ functions
again, and tries to take another futex /while it's trying to take a futex/.
Chicken and egg?

So, this could still very well be a gcc issue, or maybe it's something really
subtle in the kernel code; I am not sure which. I at least know of a specific
gcc commit that enables/disables the problem, and that's what points the
finger at gcc here.

Ideas?
Maciej W. Rozycki
2014-09-28 19:34:24 UTC
Joshua,
Post by Joshua Kinard
- Because ll/sc are atomic, gdb doesn't let you step through them, which is
why the instruction pointer jumps over the 'li' and 'sc' insns.
-- this is exactly the case, GDB tries to be smart enough and when it sees
an LL or LLD instruction it examines code that follows to find a matching
SC or SCD instruction and any other exit points from the atomic section
and sets internal breakpoints correctly to let the code fragment run at
the full speed even if single stepping. Otherwise the exception taken at
each single step would cause the conditional store instruction to always
fail -- which might not be a big issue if you were knowingly stepping code
e.g. with `stepi', but would cause big harm in implicit stepping through
unknown or unrelated code such as when a software watchpoint is active.

See `deal_with_atomic_sequence' in gdb/mips-tdep.c if curious about the
details.

Maciej
Joshua Kinard
2014-09-29 05:18:44 UTC
Post by Maciej W. Rozycki
Joshua,
Post by Joshua Kinard
- Because ll/sc are atomic, gdb doesn't let you step through them, which is
why the instruction pointer jumps over the 'li' and 'sc' insns.
-- this is exactly the case, GDB tries to be smart enough and when it sees
an LL or LLD instruction it examines code that follows to find a matching
SC or SCD instruction and any other exit points from the atomic section
and sets internal breakpoints correctly to let the code fragment run at
the full speed even if single stepping. Otherwise the exception taken at
each single step would cause the conditional store instruction to always
fail -- which might not be a big issue if you were knowingly stepping code
e.g. with `stepi', but would cause big harm in implicit stepping through
unknown or unrelated code such as when a software watchpoint is active.
See `deal_with_atomic_sequence' in gdb/mips-tdep.c if curious about the
details.
Ah ha, that does explain it! Though I don't think it's an issue with ll/sc in
the R10000; I think it's something with gcc's builtin __atomic_* functions,
and I still haven't ruled out the kernel yet, either. I have no way to step
through the kernel syscall to make that determination, though, so I'm focusing
more on gcc as time permits. Hopefully, the gcc maintainers will find time to
look into PR61538 some more soon.

Thanks!