Discussion:
[PATCH resend] MIPS: Allow FPU emulator to use non-stack area.
David Daney
2014-10-06 20:23:30 UTC
From: David Daney <***@cavium.com>

In order for MIPS to be able to support a non-executable stack, we
need to supply a method to specify a userspace area that can be used
for executing emulated branch delay slot instructions.

We add a new system call, sys_set_fpuemul_xol_area so that userspace
threads that are using the FPU can specify the location of the FPU
emulation out of line execution area.

Background:

MIPS floating point support requires that any instruction that cannot
be directly executed by the FPU, be emulated by the kernel. Part of
this emulation involves executing non-FPU instructions that fall in
the delay slots of FP branch instructions. Since the beginning of
MIPS/Linux time, this has been done by placing the instructions on the
userspace thread stack, and executing them there, as the instructions
must be executed in the MM context of the thread receiving the
emulation.

Because of this, the de facto MIPS Linux userspace ABI requires that
the userspace thread have an executable stack. It is de facto,
because it is not written anywhere that this must be the case, but it
is nevertheless a requirement.

Problem:

How do we get MIPS Linux to use a non-executable stack in the face of
the FPU emulation problem?

Since userspace desires to change the ABI, put some of the onus on the
userspace code. Any userspace thread desiring a non-executable stack,
must allocate a 4-byte aligned area at least 8 bytes long that
has read/write/execute permissions and pass the address of that area
to the kernel with the new sys_set_fpuemul_xol_area system call.

This is similar to how we require userspace to notify the kernel of
the value of the thread local pointer.

Signed-off-by: David Daney <***@cavium.com>
---

First attempt to libc-alpha@ failed due to anti-spam technology,
reattempting to a reduced list of recipients.

This patch has only been compile tested, and lacks the userspace
component. It is presented as an alternate approach to the recently
proposed MIPS non-executable stack patches posted here:

http://www.linux-mips.org/archives/linux-mips/2014-10/msg00024.html
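
For illustration only, the missing userspace component might look roughly
like the sketch below. It assumes a uapi header from a kernel with this
patch applied, so that __NR_set_fpuemul_xol_area is defined. The helper
names xol_alloc_page and xol_register are hypothetical and not part of
the patch, and the alignment check is done on the userspace side because
the syscall as posted does not validate its argument. Without the
patched header, registration simply fails with ENOSYS.

```c
#define _DEFAULT_SOURCE
#include <errno.h>
#include <stddef.h>
#include <stdint.h>
#include <sys/mman.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Hypothetical helper: map one anonymous read/write/execute page.
 * Returns NULL on failure. */
void *xol_alloc_page(void)
{
    long page = sysconf(_SC_PAGESIZE);
    void *p;

    if (page <= 0)
        return NULL;
    p = mmap(NULL, (size_t)page, PROT_READ | PROT_WRITE | PROT_EXEC,
             MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    return p == MAP_FAILED ? NULL : p;
}

/* Hypothetical helper: tell the kernel where the calling thread's
 * out-of-line area is. Returns 0 on success, -1 with errno set. */
int xol_register(void *area)
{
    /* The commit message asks for a 4-byte aligned area. */
    if (((uintptr_t)area & 3) != 0) {
        errno = EINVAL;
        return -1;
    }
#ifdef __NR_set_fpuemul_xol_area
    /* Only available with the patched uapi header (assumption). */
    return (int)syscall(__NR_set_fpuemul_xol_area, (unsigned long)area);
#else
    errno = ENOSYS;
    return -1;
#endif
}
```

A thread wanting a non-executable stack would call xol_alloc_page() once
and then xol_register() on a slot inside that page before using the FPU.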

arch/mips/include/asm/thread_info.h | 2 ++
arch/mips/include/uapi/asm/unistd.h | 15 +++++++++------
arch/mips/kernel/process.c | 1 +
arch/mips/kernel/scall32-o32.S | 1 +
arch/mips/kernel/scall64-64.S | 1 +
arch/mips/kernel/scall64-n32.S | 1 +
arch/mips/kernel/scall64-o32.S | 1 +
arch/mips/kernel/syscall.c | 8 ++++++++
arch/mips/math-emu/dsemul.c | 11 +++++++----
9 files changed, 31 insertions(+), 10 deletions(-)

diff --git a/arch/mips/include/asm/thread_info.h b/arch/mips/include/asm/thread_info.h
index 7de8658..20d47f6 100644
--- a/arch/mips/include/asm/thread_info.h
+++ b/arch/mips/include/asm/thread_info.h
@@ -26,6 +26,7 @@ struct thread_info {
struct exec_domain *exec_domain; /* execution domain */
unsigned long flags; /* low level flags */
unsigned long tp_value; /* thread pointer */
+ unsigned long fpu_emul_xol; /* FPU emul eXecute Out of Line VA */
__u32 cpu; /* current CPU */
int preempt_count; /* 0 => preemptable, <0 => BUG */

@@ -46,6 +47,7 @@ struct thread_info {
.task = &tsk, \
.exec_domain = &default_exec_domain, \
.flags = _TIF_FIXADE, \
+ .fpu_emul_xol = ~0ul, \
.cpu = 0, \
.preempt_count = INIT_PREEMPT_COUNT, \
.addr_limit = KERNEL_DS, \
diff --git a/arch/mips/include/uapi/asm/unistd.h b/arch/mips/include/uapi/asm/unistd.h
index fdb4923..f1270ee 100644
--- a/arch/mips/include/uapi/asm/unistd.h
+++ b/arch/mips/include/uapi/asm/unistd.h
@@ -375,16 +375,17 @@
#define __NR_seccomp (__NR_Linux + 352)
#define __NR_getrandom (__NR_Linux + 353)
#define __NR_memfd_create (__NR_Linux + 354)
+#define __NR_set_fpuemul_xol_area (__NR_Linux + 355)

/*
* Offset of the last Linux o32 flavoured syscall
*/
-#define __NR_Linux_syscalls 354
+#define __NR_Linux_syscalls 355

#endif /* _MIPS_SIM == _MIPS_SIM_ABI32 */

#define __NR_O32_Linux 4000
-#define __NR_O32_Linux_syscalls 354
+#define __NR_O32_Linux_syscalls 355

#if _MIPS_SIM == _MIPS_SIM_ABI64

@@ -707,16 +708,17 @@
#define __NR_seccomp (__NR_Linux + 312)
#define __NR_getrandom (__NR_Linux + 313)
#define __NR_memfd_create (__NR_Linux + 314)
+#define __NR_set_fpuemul_xol_area (__NR_Linux + 315)

/*
* Offset of the last Linux 64-bit flavoured syscall
*/
-#define __NR_Linux_syscalls 314
+#define __NR_Linux_syscalls 315

#endif /* _MIPS_SIM == _MIPS_SIM_ABI64 */

#define __NR_64_Linux 5000
-#define __NR_64_Linux_syscalls 314
+#define __NR_64_Linux_syscalls 315

#if _MIPS_SIM == _MIPS_SIM_NABI32

@@ -1043,15 +1045,16 @@
#define __NR_seccomp (__NR_Linux + 316)
#define __NR_getrandom (__NR_Linux + 317)
#define __NR_memfd_create (__NR_Linux + 318)
+#define __NR_set_fpuemul_xol_area (__NR_Linux + 319)

/*
* Offset of the last N32 flavoured syscall
*/
-#define __NR_Linux_syscalls 318
+#define __NR_Linux_syscalls 319

#endif /* _MIPS_SIM == _MIPS_SIM_NABI32 */

#define __NR_N32_Linux 6000
-#define __NR_N32_Linux_syscalls 318
+#define __NR_N32_Linux_syscalls 319

#endif /* _UAPI_ASM_UNISTD_H */
diff --git a/arch/mips/kernel/process.c b/arch/mips/kernel/process.c
index 636b074..6dde6bb 100644
--- a/arch/mips/kernel/process.c
+++ b/arch/mips/kernel/process.c
@@ -151,6 +151,7 @@ int copy_thread(unsigned long clone_flags, unsigned long usp,

if (clone_flags & CLONE_SETTLS)
ti->tp_value = regs->regs[7];
+ ti->fpu_emul_xol = ~0ul;

return 0;
}
diff --git a/arch/mips/kernel/scall32-o32.S b/arch/mips/kernel/scall32-o32.S
index 744cd10..8c19a39 100644
--- a/arch/mips/kernel/scall32-o32.S
+++ b/arch/mips/kernel/scall32-o32.S
@@ -579,3 +579,4 @@ EXPORT(sys_call_table)
PTR sys_seccomp
PTR sys_getrandom
PTR sys_memfd_create
+ PTR sys_set_fpuemul_xol_area /* 4355 */
diff --git a/arch/mips/kernel/scall64-64.S b/arch/mips/kernel/scall64-64.S
index 002b1bc..0b9f72e 100644
--- a/arch/mips/kernel/scall64-64.S
+++ b/arch/mips/kernel/scall64-64.S
@@ -434,4 +434,5 @@ EXPORT(sys_call_table)
PTR sys_seccomp
PTR sys_getrandom
PTR sys_memfd_create
+ PTR sys_set_fpuemul_xol_area /* 5315 */
.size sys_call_table,.-sys_call_table
diff --git a/arch/mips/kernel/scall64-n32.S b/arch/mips/kernel/scall64-n32.S
index ca6cbbe..48f1760 100644
--- a/arch/mips/kernel/scall64-n32.S
+++ b/arch/mips/kernel/scall64-n32.S
@@ -427,4 +427,5 @@ EXPORT(sysn32_call_table)
PTR sys_seccomp
PTR sys_getrandom
PTR sys_memfd_create
+ PTR sys_set_fpuemul_xol_area
.size sysn32_call_table,.-sysn32_call_table
diff --git a/arch/mips/kernel/scall64-o32.S b/arch/mips/kernel/scall64-o32.S
index 9e10d11..60def68 100644
--- a/arch/mips/kernel/scall64-o32.S
+++ b/arch/mips/kernel/scall64-o32.S
@@ -564,4 +564,5 @@ EXPORT(sys32_call_table)
PTR sys_seccomp
PTR sys_getrandom
PTR sys_memfd_create
+ PTR sys_set_fpuemul_xol_area /* 4355 */
.size sys32_call_table,.-sys32_call_table
diff --git a/arch/mips/kernel/syscall.c b/arch/mips/kernel/syscall.c
index 4a4f9dd..5f9d9e8 100644
--- a/arch/mips/kernel/syscall.c
+++ b/arch/mips/kernel/syscall.c
@@ -96,6 +96,14 @@ SYSCALL_DEFINE1(set_thread_area, unsigned long, addr)
return 0;
}

+SYSCALL_DEFINE1(set_fpuemul_xol_area, unsigned long, addr)
+{
+ struct thread_info *ti = task_thread_info(current);
+
+ ti->fpu_emul_xol = addr;
+ return 0;
+}
+
static inline int mips_atomic_set(unsigned long addr, unsigned long new)
{
unsigned long old, tmp;
diff --git a/arch/mips/math-emu/dsemul.c b/arch/mips/math-emu/dsemul.c
index 4f514f3..bf4ff61 100644
--- a/arch/mips/math-emu/dsemul.c
+++ b/arch/mips/math-emu/dsemul.c
@@ -34,6 +34,7 @@ struct emuframe {
int mips_dsemul(struct pt_regs *regs, mips_instruction ir, unsigned long cpc)
{
extern asmlinkage void handle_dsemulret(void);
+ struct thread_info *ti = task_thread_info(current);
struct emuframe __user *fr;
int err;

@@ -64,10 +65,12 @@ int mips_dsemul(struct pt_regs *regs, mips_instruction ir, unsigned long cpc)
* branches, but gives us a cleaner interface to the exception
* handler (single entry point).
*/
-
- /* Ensure that the two instructions are in the same cache line */
- fr = (struct emuframe __user *)
- ((regs->regs[29] - sizeof(struct emuframe)) & ~0x7);
+ if (ti->fpu_emul_xol != ~0ul)
+ fr = (struct emuframe *)ti->fpu_emul_xol;
+ else
+ /* Ensure that the two instructions are in the same cache line */
+ fr = (struct emuframe __user *)
+ ((regs->regs[29] - sizeof(struct emuframe)) & ~0x7);

/* Verify that the stack pointer is not competely insane */
if (unlikely(!access_ok(VERIFY_WRITE, fr, sizeof(struct emuframe))))
--
1.7.11.7
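
The address selection in the dsemul.c hunk above can be modeled in plain
C as a sketch. The function name dsemul_frame_addr and the explicit
frame_size parameter are assumptions made so the logic can be tried
outside the kernel; in the patch the size is sizeof(struct emuframe) and
the stack pointer comes from regs->regs[29].

```c
#include <stdint.h>

/* Sentinel meaning "no area registered", matching ~0ul in the patch. */
#define XOL_UNSET (~(uintptr_t)0)

/* Pick the emulation frame address: the registered area if there is
 * one, otherwise a slot just below the stack pointer, rounded down to
 * an 8-byte boundary as in the original code. */
uintptr_t dsemul_frame_addr(uintptr_t fpu_emul_xol, uintptr_t sp,
                            uintptr_t frame_size)
{
    if (fpu_emul_xol != XOL_UNSET)
        return fpu_emul_xol;
    return (sp - frame_size) & ~(uintptr_t)0x7;
}
```

With no area registered, sp = 0x7fff0004 and a 16-byte frame give
0x7ffefff0, so existing binaries keep the on-stack behavior.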
Rich Felker
2014-10-06 20:54:59 UTC
Post by David Daney
In order for MIPS to be able to support a non-executable stack, we
need to supply a method to specify a userspace area that can be used
for executing emulated branch delay slot instructions.
We add a new system call, sys_set_fpuemul_xol_area so that userspace
threads that are using the FPU can specify the location of the FPU
emulation out of line execution area.
MIPS floating point support requires that any instruction that cannot
be directly executed by the FPU, be emulated by the kernel. Part of
this emulation involves executing non-FPU instructions that fall in
the delay slots of FP branch instructions. Since the beginning of
MIPS/Linux time, this has been done by placing the instructions on the
userspace thread stack, and executing them there, as the instructions
must be executed in the MM context of the thread receiving the
emulation.
Because of this, the de facto MIPS Linux userspace ABI requires that
the userspace thread have an executable stack. It is de facto,
because it is not written anywhere that this must be the case, but it
is nevertheless a requirement.
How do we get MIPS Linux to use a non-executable stack in the face of
the FPU emulation problem?
Since userspace desires to change the ABI, put some of the onus on the
userspace code. Any userspace thread desiring a non-executable stack,
must allocate a 4-byte aligned area at least 8 bytes long that
has read/write/execute permissions and pass the address of that area
to the kernel with the new sys_set_fpuemul_xol_area system call.
This is similar to how we require userspace to notify the kernel of
the value of the thread local pointer.
Userspace should play no part in this; requiring userspace to help
make special accommodations for fpu emulation largely defeats the
purpose of fpu emulation. The kernel is perfectly capable of mapping
an appropriate page. The mapping should happen at exec time, and at
clone time with CLONE_VM unless the kernel is going to handle mutual
exclusion so that only one thread can be using the page at a time.
(Using one page for the whole process, and excluding simultaneous
execution of fpu emulation in multiple threads, may be the more
practical approach.)
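
The one-page-per-process idea with mutual exclusion can be sketched as a
userspace model. This only illustrates the ownership rule (at most one
thread holds the page at a time); the names struct xol_page,
xol_try_acquire and xol_release are hypothetical, and a real kernel
implementation would also have to deal with signals and with sleeping
rather than failing when the page is busy.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Model of the single per-process emulation page. */
struct xol_page {
    atomic_flag busy;   /* set while some thread is emulating */
};

#define XOL_PAGE_INIT { ATOMIC_FLAG_INIT }

/* Try to claim the page; returns true if the caller now owns it. */
bool xol_try_acquire(struct xol_page *p)
{
    return !atomic_flag_test_and_set_explicit(&p->busy,
                                              memory_order_acquire);
}

/* Give the page back once the emulated instruction has run. */
void xol_release(struct xol_page *p)
{
    atomic_flag_clear_explicit(&p->busy, memory_order_release);
}
```

A second thread that needs emulation while the flag is set has to wait
until xol_release() runs, which is the cost of sharing one page.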

As an alternative, if the space of possible instructions with a delay
slot is sufficiently small, all such instructions could be mapped as
immutable code in a shared mapping, each at a fixed offset in the
mapping. I suspect this would be borderline-impractical (multiple
megabytes?), but it is the cleanest solution otherwise.

Rich
David Daney
2014-10-06 21:18:19 UTC
Post by Rich Felker
Post by David Daney
In order for MIPS to be able to support a non-executable stack, we
need to supply a method to specify a userspace area that can be used
for executing emulated branch delay slot instructions.
We add a new system call, sys_set_fpuemul_xol_area so that userspace
threads that are using the FPU can specify the location of the FPU
emulation out of line execution area.
MIPS floating point support requires that any instruction that cannot
be directly executed by the FPU, be emulated by the kernel. Part of
this emulation involves executing non-FPU instructions that fall in
the delay slots of FP branch instructions. Since the beginning of
MIPS/Linux time, this has been done by placing the instructions on the
userspace thread stack, and executing them there, as the instructions
must be executed in the MM context of the thread receiving the
emulation.
Because of this, the de facto MIPS Linux userspace ABI requires that
the userspace thread have an executable stack. It is de facto,
because it is not written anywhere that this must be the case, but it
is nevertheless a requirement.
How do we get MIPS Linux to use a non-executable stack in the face of
the FPU emulation problem?
Since userspace desires to change the ABI, put some of the onus on the
userspace code. Any userspace thread desiring a non-executable stack,
must allocate a 4-byte aligned area at least 8 bytes long that
has read/write/execute permissions and pass the address of that area
to the kernel with the new sys_set_fpuemul_xol_area system call.
This is similar to how we require userspace to notify the kernel of
the value of the thread local pointer.
Userspace should play no part in this; requiring userspace to help
make special accommodations for fpu emulation largely defeats the
purpose of fpu emulation.
That is certainly one way of looking at it. Really it is opinion,
rather than fact though.

GLibc is full of code (see ld.so) that in earlier incarnations of
Unix/Linux was in kernel space, and was moved to userspace. Given that
there is a partitioning of code between kernel space and userspace, I
think it not totally unreasonable to consider doing some of this in
userspace.

Even on systems with hardware FPU, the architecture specification allows
for/requires emulation of certain cases (denormals, etc.) So it is
already a requirement that userspace cooperate by always having free
space below $SP for use by the kernel. So the current situation is that
userspace is providing services for the kernel FPU emulator.

My suggestion is to change the nature of the way these services are
provided by the userspace program.
Post by Rich Felker
The kernel is perfectly capable of mapping
an appropriate page. The mapping should happen at exec time, and at
clone time with CLONE_VM
Why? This adds overhead for threads that don't use the FPU. So this
suggestion adds at least one page of memory overhead for each thread in
the system (unless I misunderstand what you are saying).
Post by Rich Felker
unless the kernel is going to handle mutual
exclusion so that only one thread can be using the page at a time.
(Using one page for the whole process, and excluding simultaneous
execution of fpu emulation in multiple threads, may be the more
practical approach.)
As an alternative, if the space of possible instructions with a delay
slot is sufficiently small, all such instructions could be mapped as
immutable code in a shared mapping, each at a fixed offset in the
mapping. I suspect this would be borderline-impractical (multiple
megabytes?), but it is the cleanest solution otherwise.
Yes, there are 2^32 possible instructions. Each one is 4 bytes, plus
you need a way to exit after the instruction has executed, which would
require another instruction. So you would need 32GB of memory to hold
all those instructions, larger than the 32-bit virtual address space.
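
The arithmetic behind the 32GB figure, as a small sketch. The function
name xol_table_bytes is hypothetical; it counts one 4-byte instruction
plus one 4-byte exit instruction for every possible 32-bit encoding.

```c
#include <stdint.h>

/* Size in bytes of a table holding one entry per instruction encoding.
 * encoding_bits must be less than 64; every MIPS instruction here is
 * 4 bytes wide. */
uint64_t xol_table_bytes(unsigned encoding_bits, unsigned insns_per_entry)
{
    return ((uint64_t)1 << encoding_bits) * insns_per_entry * 4u;
}
```

xol_table_bytes(32, 2) is 2^32 * 2 * 4 = 2^35 bytes, i.e. 32GB, against
a 32-bit address space of 2^32 bytes.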
Post by Rich Felker
Rich
Rich Felker
2014-10-06 21:31:01 UTC
Post by David Daney
Post by Rich Felker
Userspace should play no part in this; requiring userspace to help
make special accommodations for fpu emulation largely defeats the
purpose of fpu emulation.
That is certainly one way of looking at it. Really it is opinion,
rather than fact though.
It's an opinion, yes, but it has substantial reason behind it.
Post by David Daney
GLibc is full of code (see ld.so) that in earlier incarnations of
Unix/Linux was in kernel space, and was moved to userspace. Given
that there is a partitioning of code between kernel space and
userspace, I think it not totally unreasonable to consider doing
some of this in userspace.
Even on systems with hardware FPU, the architecture specification
allows for/requires emulation of certain cases (denormals, etc.) So
it is already a requirement that userspace cooperate by always
having free space below $SP for use by the kernel. So the current
situation is that userspace is providing services for the kernel FPU
emulator.
My suggestion is to change the nature of the way these services are
provided by the userspace program.
But this isn't set up by the userspace program. It's set up by the
kernel on program entry. Despite that, though, I think it's an
unnecessary (and undocumented!) constraint; the fact that it requires
the stack to be executable makes it even more harmful and
inappropriate.
Post by David Daney
Post by Rich Felker
The kernel is perfectly capable of mapping
an appropriate page. The mapping should happen at exec time, and at
clone time with CLONE_VM
Why? This adds overhead for threads that don't use the FPU. So
this suggestion adds at least one page of memory overhead for each
thread in the system (unless I misunderstand what you are saying).
Yes, that's why I think the mutual-exclusion approach might be
preferred. But if you're going to use per-thread areas for this, they
MUST be allocated at thread-creation time, since that's the only time
you can handle error (by failing pthread_create). If you do it lazily,
it might fail and there's no way to recover. And there's no way to
know in advance whether a thread will invoke floating point code, so
you have to set it up for every thread.
Post by David Daney
Post by Rich Felker
unless the kernel is going to handle mutual
exclusion so that only one thread can be using the page at a time.
(Using one page for the whole process, and excluding simultaneous
execution of fpu emulation in multiple threads, may be the more
practical approach.)
As an alternative, if the space of possible instructions with a delay
slot is sufficiently small, all such instructions could be mapped as
immutable code in a shared mapping, each at a fixed offset in the
mapping. I suspect this would be borderline-impractical (multiple
megabytes?), but it is the cleanest solution otherwise.
Yes, there are 2^32 possible instructions. Each one is 4 bytes,
plus you need a way to exit after the instruction has executed,
which would require another instruction. So you would need 32GB of
memory to hold all those instructions, larger than the 32-bit
virtual address space.
There are not 2^32 instructions that have delay slots after them. Only
branch instructions have delay slots. The space of such instructions is
much smaller, probably on the order of 64-256 MB, not 32GB, but I
haven't looked at the instruction encoding tables to confirm this.

Rich
David Daney
2014-10-06 21:45:29 UTC
Post by Rich Felker
Post by David Daney
Post by Rich Felker
Userspace should play no part in this; requiring userspace to help
make special accommodations for fpu emulation largely defeats the
purpose of fpu emulation.
That is certainly one way of looking at it. Really it is opinion,
rather than fact though.
It's an opinion, yes, but it has substantial reason behind it.
Post by David Daney
GLibc is full of code (see ld.so) that in earlier incarnations of
Unix/Linux was in kernel space, and was moved to userspace. Given
that there is a partitioning of code between kernel space and
userspace, I think it not totally unreasonable to consider doing
some of this in userspace.
Even on systems with hardware FPU, the architecture specification
allows for/requires emulation of certain cases (denormals, etc.) So
it is already a requirement that userspace cooperate by always
having free space below $SP for use by the kernel. So the current
situation is that userspace is providing services for the kernel FPU
emulator.
My suggestion is to change the nature of the way these services are
provided by the userspace program.
But this isn't set up by the userspace program. It's set up by the
kernel on program entry. Despite that, though, I think it's an
unnecessary (and undocumented!) constraint; the fact that it requires
the stack to be executable makes it even more harmful and
inappropriate.
The management of the stack is absolutely done by userspace code. Any
time you do pthread_create(), userspace code does mmap() to allocate the
stack area, it then sets permissions on the area, and then it passes the
address of the area to clone(). Furthermore the userspace code has to
be very careful in its use of the $sp register, so that it doesn't store
data in places that will be used/clobbered by the kernel.

All of this is under the control of the userspace program and done with
userspace code.

I appreciate the fact that libc authors might prefer *not* to write more
code, but they could, especially if they wanted to add the feature of
non-executable stacks to their library implementation.

David Daney
Rich Felker
2014-10-06 21:58:13 UTC
Post by David Daney
Post by Rich Felker
Post by David Daney
Post by Rich Felker
Userspace should play no part in this; requiring userspace to help
make special accommodations for fpu emulation largely defeats the
purpose of fpu emulation.
That is certainly one way of looking at it. Really it is opinion,
rather than fact though.
It's an opinion, yes, but it has substantial reason behind it.
Post by David Daney
GLibc is full of code (see ld.so) that in earlier incarnations of
Unix/Linux was in kernel space, and was moved to userspace. Given
that there is a partitioning of code between kernel space and
userspace, I think it not totally unreasonable to consider doing
some of this in userspace.
Even on systems with hardware FPU, the architecture specification
allows for/requires emulation of certain cases (denormals, etc.) So
it is already a requirement that userspace cooperate by always
having free space below $SP for use by the kernel. So the current
situation is that userspace is providing services for the kernel FPU
emulator.
My suggestion is to change the nature of the way these services are
provided by the userspace program.
But this isn't set up by the userspace program. It's set up by the
kernel on program entry. Despite that, though, I think it's an
unnecessary (and undocumented!) constraint; the fact that it requires
the stack to be executable makes it even more harmful and
inappropriate.
The management of the stack is absolutely done by userspace code.
Any time you do pthread_create(), userspace code does mmap() to
allocate the stack area, it then sets permissions on the area, and
then it passes the address of the area to clone().
This is hardly management.
Post by David Daney
Furthermore the
userspace code has to be very careful in its use of the $sp
register, so that it doesn't store data in places that will be
used/clobbered by the kernel.
This is not "being careful". The stack pointer can never become
invalid unless you do wacky things in asm or invoke UB.
Post by David Daney
All of this is under the control of the userspace program and done
with userspace code.
For the most part it just happens by default. There is no particular
intentionality needed, and certainly no hideous MIPS-specific hacks
needed.
Post by David Daney
I appreciate the fact that libc authors might prefer *not* to write
more code, but they could, especially if they wanted to add the
feature of non-executable stacks to their library implementation.
So your position is that:

1. A non-exec-stack system can only run new code produced to do extra
stuff in userspace.

2. The startup code needs to do special work in userspace on MIPS to
set up an executable area for fpu emulation.

3. Every call to clone/CLONE_VM needs to be accompanied by a call to
mmap and this new syscall to set the address, and every call to
SYS_exit needs to be accompanied by a call to munmap for the
corresponding mapping.

This is a huge ill-designed mess.

Rich
David Daney
2014-10-06 22:17:03 UTC
Post by Rich Felker
Post by David Daney
Post by Rich Felker
Post by David Daney
Post by Rich Felker
Userspace should play no part in this; requiring userspace to help
make special accommodations for fpu emulation largely defeats the
purpose of fpu emulation.
That is certainly one way of looking at it. Really it is opinion,
rather than fact though.
It's an opinion, yes, but it has substantial reason behind it.
Post by David Daney
GLibc is full of code (see ld.so) that in earlier incarnations of
Unix/Linux was in kernel space, and was moved to userspace. Given
that there is a partitioning of code between kernel space and
userspace, I think it not totally unreasonable to consider doing
some of this in userspace.
Even on systems with hardware FPU, the architecture specification
allows for/requires emulation of certain cases (denormals, etc.) So
it is already a requirement that userspace cooperate by always
having free space below $SP for use by the kernel. So the current
situation is that userspace is providing services for the kernel FPU
emulator.
My suggestion is to change the nature of the way these services are
provided by the userspace program.
But this isn't set up by the userspace program. It's set up by the
kernel on program entry. Despite that, though, I think it's an
unnecessary (and undocumented!) constraint; the fact that it requires
the stack to be executable makes it even more harmful and
inappropriate.
The management of the stack is absolutely done by userspace code.
Any time you do pthread_create(), userspace code does mmap() to
allocate the stack area, it then sets permissions on the area, and
then it passes the address of the area to clone().
This is hardly management.
Post by David Daney
Furthermore the
userspace code has to be very careful in its use of the $sp
register, so that it doesn't store data in places that will be
used/clobbered by the kernel.
This is not "being careful". The stack pointer can never become
invalid unless you do wacky things in asm or invoke UB.
Post by David Daney
All of this is under the control of the userspace program and done
with userspace code.
For the most part it just happens by default. There is no particular
intentionality needed, and certainly no hideous MIPS-specific hacks
needed.
Yes, it happens by default. But it wasn't magic. It took careful work
by the ABI and toolchain designers to make it work.
Post by Rich Felker
Post by David Daney
I appreciate the fact that libc authors might prefer *not* to write
more code, but they could, especially if they wanted to add the
feature of non-executable stacks to their library implementation.
It is not really a position that I have. Rather a proposal for one
possible way to make non-executable stacks work on MIPS.
Post by Rich Felker
1. A non-exec-stack system can only run new code produced to do extra
stuff in userspace.
Any non-executable stack solution for MIPS will require changes to the
toolchain/libc. So it is merely a question of what form the change
should take.
Post by Rich Felker
2. The startup code needs to do special work in userspace on MIPS to
set up an executable area for fpu emulation.
Yes. Similar to how startup code has to do special work to set up the
TLS areas, and load shared libraries.
Post by Rich Felker
3. Every call to clone/CLONE_VM needs to be accompanied by a call to
mmap and this new syscall to set the address, and every call to
SYS_exit needs to be accompanied by a call to munmap for the
corresponding mapping.
No, we don't have to mmap() on each thread creation. Many threads
(perhaps 512) could be handled by a single page, so the normal case
would be a single mmap() for the life of the program.
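
The "perhaps 512" figure follows from carving one page into minimum-size
areas. A sketch, assuming a 4096-byte page and the 8-byte minimum area
size from the patch description; the names XOL_PAGE_SIZE, XOL_SLOT_SIZE,
xol_slots_per_page and xol_slot_addr are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

#define XOL_PAGE_SIZE 4096u  /* assumed page size */
#define XOL_SLOT_SIZE 8u     /* minimum area size named in the patch */

/* Number of per-thread areas that fit in one page. */
size_t xol_slots_per_page(void)
{
    return XOL_PAGE_SIZE / XOL_SLOT_SIZE;
}

/* Address of the area for thread number `index` within the page,
 * or 0 if the page is already full. */
uintptr_t xol_slot_addr(uintptr_t page_base, size_t index)
{
    if (index >= xol_slots_per_page())
        return 0;
    return page_base + index * XOL_SLOT_SIZE;
}
```

Each new thread would take the next free index and pass
xol_slot_addr(base, index) to the new syscall; only the 513th thread
would force a second mmap().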
Post by Rich Felker
This is a huge ill-designed mess.
Have you seen the alternatives?

Have you ever wondered why MIPS doesn't have non-executable stack support?
Post by Rich Felker
Rich
Rich Felker
2014-10-06 23:08:56 UTC
Post by David Daney
Post by Rich Felker
Post by David Daney
Furthermore the
userspace code has to be very careful in its use of the $sp
register, so that it doesn't store data in places that will be
used/clobbered by the kernel.
This is not "being careful". The stack pointer can never become
invalid unless you do wacky things in asm or invoke UB.
Post by David Daney
All of this is under the control of the userspace program and done
with userspace code.
For the most part it just happens by default. There is no particular
intentionality needed, and certainly no hideous MIPS-specific hacks
needed.
Yes, it happens by default. But it wasn't magic. It took careful
work by the ABI and toolchain designers to make it work.
Here I disagree. All of these things are completely universal, not
MIPS-specific.
Post by David Daney
Post by Rich Felker
Post by David Daney
I appreciate the fact that libc authors might prefer *not* to write
more code, but they could, especially if they wanted to add the
feature of non-executable stacks to their library implementation.
It is not really a position that I have. Rather a proposal for one
possible way to make non-executable stacks work on MIPS.
Post by Rich Felker
1. A non-exec-stack system can only run new code produced to do extra
stuff in userspace.
Any non-executable stack solution for MIPS will require changes to
the toolchain/libc. So it is merely a question of what form the
change should take.
I disagree with this, at least for the most part. If the kernel does
the fpu emulation correctly, there's no reason it shouldn't be
possible to run existing binaries on a hardened kernel that does not
even support executable stack.
Post by David Daney
Post by Rich Felker
2. The startup code needs to do special work in userspace on MIPS to
set up an executable area for fpu emulation.
Yes. Similar to how startup code has to do special work to set up
the TLS areas,
Yes. Actually the simple way to implement this in userspace would be
with a page-sized, page-aligned object in TLS and a special call to
mprotect and your new syscall. One thing I'm not clear on: should this
memory have permissions r-x or rwx? If it has rwx, that defeats a lot
of the purpose of non-executable-stack. Hopefully it's r-x and the
kernel bypasses the non-writability to write to it.
Post by David Daney
and load shared libraries.
Dynamic linking is completely a separate matter. Not all programs are
even dynamic-linked.
Post by David Daney
Post by Rich Felker
3. Every call to clone/CLONE_VM needs to be accompanied by a call to
mmap and this new syscall to set the address, and every call to
SYS_exit needs to be accompanied by a call to munmap for the
corresponding mapping.
No, we don't have to mmap() on each thread creation. Many threads
(perhaps 512) could be handled by a single page, so the normal case
would be a single mmap() for the life of the program.
That's nice from a standpoint of avoiding memory waste, but it's
problematic if ...
Post by David Daney
Post by Rich Felker
This is a huge ill-designed mess.
Have you seen the alternatives?
I proposed a couple and I think they're much less ugly. Could you
point me to the others?

But perhaps you could clarify one thing for me: why is any of this
even needed? A delay slot only exists for branch instructions, and I
can't see any reason the kernel can't just emulate the branch
instruction at the same time. This is a very restricted class of
instructions that should not require any complex emulation of memory
permissions, just manipulation of the resulting program counter value
after the floating point instruction finishes. Or am I missing
something?
Post by David Daney
Have you ever wondered why MIPS doesn't have non-executable stack support?
I wasn't even aware that it didn't until your email.

Rich
Andy Lutomirski
2014-10-06 23:38:15 UTC
Post by Rich Felker
Post by David Daney
Post by Rich Felker
Post by David Daney
Post by Rich Felker
Userspace should play no part in this; requiring userspace to help
make special accommodations for fpu emulation largely defeats the
purpose of fpu emulation.
That is certainly one way of looking at it. Really it is opinion,
rather than fact though.
It's an opinion, yes, but it has substantial reason behind it.
Post by David Daney
GLibc is full of code (see ld.so) that in earlier incarnations of
Unix/Linux was in kernel space, and was moved to userspace. Given
that there is a partitioning of code between kernel space and
userspace, I think it not totally unreasonable to consider doing
some of this in userspace.
Even on systems with hardware FPU, the architecture specification
allows for/requires emulation of certain cases (denormals, etc.) So
it is already a requirement that userspace cooperate by always
having free space below $SP for use by the kernel. So the current
situation is that userspace is providing services for the kernel FPU
emulator.
My suggestion is to change the nature of the way these services are
provided by the userspace program.
But this isn't setup by the userspace program. It's setup by the
kernel on program entry. Despite that, though, I think it's an
unnecessary (and undocumented!) constraint; the fact that it requires
the stack to be executable makes it even more harmful and
inappropriate.
The management of the stack is absolutely done by userspace code.
Any time you do pthread_create(), userspace code does mmap() to
allocate the stack area, it then sets permissions on the area, and
then it passes the address of the area to clone().
This is hardly management.
Post by David Daney
Furthermore the
userspace code has to be very careful in its use of the $sp
register, so that it doesn't store data in places that will be
used/clobbered by the kernel.
This is not "being careful". The stack pointer can never become
invalid unless you do wacky things in asm or invoke UB.
I disagree a bit here. There are runtimes that aren't libc or even C at
all. See, for example, Go. (Ugh!)

What happens if a signal happens while executing from this magic
trampoline? Allocation of another one? Crash on return from the outer
trampoline invocation?

Also, if this ends up being solved with a hack of this type, please do
it right: have *two* aliases of the trampoline, one writable, and one
executable (unless the MIPS kernel can bypass write-protection).
Post by Rich Felker
Post by David Daney
All of this is under the control of the userspace program and done
with userspace code.
For the most part it just happens by default. There is no particular
intentionality needed, and certainly no hideous MIPS-specific hacks
needed.
Post by David Daney
I appreciate the fact that libc authors might prefer *not* to write
more code, but they could, especially if they wanted to add the
feature of non-executable stacks to their library implementation.
1. A non-exec-stack system can only run new code produced to do extra
stuff in userspace.
2. The startup code needs to do special work in userspace on MIPS to
setup an executable area for fpu emulation.
3. Every call to clone/CLONE_VM needs to be accompanied by a call to
mmap and this new syscall to set the address, and every call to
SYS_exit needs to be accompanied by a call to munmap for the
corresponding mapping.
This is a huge ill-designed mess.
Amen.

Can the kernel not just emulate the instructions directly? Can it
single-step through them in place?

FWIW, I have considered playing trampoline games like this on x86. It's
a giant bloody mess, and it will almost certainly never happen, even
though the performance win is dramatic. No, you don't want to know why. [1]

[1] If you actually want to know, imagine returning from a page fault
with sysret.

--Andy
David Daney
2014-10-06 23:48:52 UTC
Permalink
[...]
Post by Andy Lutomirski
Post by Rich Felker
This is a huge ill-designed mess.
Amen.
Can the kernel not just emulate the instructions directly?
In theory it could, but since there can be implementation defined
instructions, there is no way to achieve full instruction set coverage
for all possible machines.
Post by Andy Lutomirski
Can it single-step through them in place?
No. If it could, we wouldn't be having this informative discussion.
Andy Lutomirski
2014-10-06 23:54:06 UTC
Permalink
Post by David Daney
[...]
Post by Andy Lutomirski
Post by Rich Felker
This is a huge ill-designed mess.
Amen.
Can the kernel not just emulate the instructions directly?
In theory it could, but since there can be implementation defined
instructions, there is no way to achieve full instruction set coverage for
all possible machines.
Can modern user code just avoid constructs that require this kind of
trampoline hack? If so, can this be solved the same way that x86
added no-exec stacks? (I.e. mark all the binaries as supporting
non-executable stacks and letting them crash if they screw it up.)

Knowing very little about MIPS, it sounds like this is the kernel
compensating for a dumb assembler.

--Andy
Rich Felker
2014-10-07 00:05:14 UTC
Permalink
Post by David Daney
[...]
Post by Andy Lutomirski
Post by Rich Felker
This is a huge ill-designed mess.
Amen.
Can the kernel not just emulate the instructions directly?
In theory it could, but since there can be implementation defined
instructions, there is no way to achieve full instruction set
coverage for all possible machines.
Is the issue really implementation-defined instructions with delay
slots? If so it sounds like a made-up issue. They're not going to
occur in real binaries. Certainly a compiler is not going to generate
implementation-defined instructions, and if you're writing the asm by
hand, you just don't put floating point instructions in the delay
slot.

Rich
Andrew Pinski
2014-10-07 00:11:38 UTC
Permalink
Post by Rich Felker
Post by David Daney
[...]
Post by Andy Lutomirski
Post by Rich Felker
This is a huge ill-designed mess.
Amen.
Can the kernel not just emulate the instructions directly?
In theory it could, but since there can be implementation defined
instructions, there is no way to achieve full instruction set
coverage for all possible machines.
Is the issue really implementation-defined instructions with delay
slots? If so it sounds like a made-up issue. They're not going to
occur in real binaries. Certainly a compiler is not going to generate
implementation-defined instructions, and if you're writing the asm by
hand, you just don't put floating point instructions in the delay
slot.
It is not the instruction with delay slot but rather the instruction
in the delay slot itself.

Thanks,
Andrew
Rich Felker
2014-10-07 00:21:47 UTC
Permalink
Post by Andrew Pinski
Post by Rich Felker
Post by David Daney
[...]
Post by Andy Lutomirski
Post by Rich Felker
This is a huge ill-designed mess.
Amen.
Can the kernel not just emulate the instructions directly?
In theory it could, but since there can be implementation defined
instructions, there is no way to achieve full instruction set
coverage for all possible machines.
Is the issue really implementation-defined instructions with delay
slots? If so it sounds like a made-up issue. They're not going to
occur in real binaries. Certainly a compiler is not going to generate
implementation-defined instructions, and if you're writing the asm by
hand, you just don't put floating point instructions in the delay
slot.
It is not the instruction with delay slot but rather the instruction
in the delay slot itself.
An instruction in the delay slot for the instruction being emulated?
How would that arise? Are there floating point instructions with delay
slots?

Rich
Andrew Pinski
2014-10-07 00:28:02 UTC
Permalink
Post by Rich Felker
Post by Andrew Pinski
Post by Rich Felker
Post by David Daney
[...]
Post by Andy Lutomirski
Post by Rich Felker
This is a huge ill-designed mess.
Amen.
Can the kernel not just emulate the instructions directly?
In theory it could, but since there can be implementation defined
instructions, there is no way to achieve full instruction set
coverage for all possible machines.
Is the issue really implementation-defined instructions with delay
slots? If so it sounds like a made-up issue. They're not going to
occur in real binaries. Certainly a compiler is not going to generate
implementation-defined instructions, and if you're writing the asm by
hand, you just don't put floating point instructions in the delay
slot.
It is not the instruction with delay slot but rather the instruction
in the delay slot itself.
An instruction in the delay slot for the instruction being emulated?
How would that arise? Are there floating point instructions with delay
slots?
Yes branches.
Andy Lutomirski
2014-10-07 00:29:41 UTC
Permalink
Post by Andrew Pinski
Post by Rich Felker
Post by Andrew Pinski
Post by Rich Felker
Post by David Daney
[...]
Post by Andy Lutomirski
Post by Rich Felker
This is a huge ill-designed mess.
Amen.
Can the kernel not just emulate the instructions directly?
In theory it could, but since there can be implementation defined
instructions, there is no way to achieve full instruction set
coverage for all possible machines.
Is the issue really implementation-defined instructions with delay
slots? If so it sounds like a made-up issue. They're not going to
occur in real binaries. Certainly a compiler is not going to generate
implementation-defined instructions, and if you're writing the asm by
hand, you just don't put floating point instructions in the delay
slot.
It is not the instruction with delay slot but rather the instruction
in the delay slot itself.
An instruction in the delay slot for the instruction being emulated?
How would that arise? Are there floating point instructions with delay
slots?
Yes branches.
I admit I have no idea what's going on here, but I find it hard to
believe that having the kernel fix this up for new code is desirable.
Unless MIPS can round-trip a trap *very* quickly, performance will be
awful for any code that has this problem.

--Andy
David Daney
2014-10-07 00:32:34 UTC
Permalink
Post by Andy Lutomirski
Post by Andrew Pinski
Post by Rich Felker
Post by Andrew Pinski
Post by Rich Felker
Post by David Daney
[...]
Post by Andy Lutomirski
Post by Rich Felker
This is a huge ill-designed mess.
Amen.
Can the kernel not just emulate the instructions directly?
In theory it could, but since there can be implementation defined
instructions, there is no way to achieve full instruction set
coverage for all possible machines.
Is the issue really implementation-defined instructions with delay
slots? If so it sounds like a made-up issue. They're not going to
occur in real binaries. Certainly a compiler is not going to generate
implementation-defined instructions, and if you're writing the asm by
hand, you just don't put floating point instructions in the delay
slot.
It is not the instruction with delay slot but rather the instruction
in the delay slot itself.
An instruction in the delay slot for the instruction being emulated?
How would that arise? Are there floating point instructions with delay
slots?
Yes branches.
I admit I have no idea what's going on here, but I find it hard to
believe that having the kernel fix this up for new code is desirable.
Unless MIPS can round-trip a trap *very* quickly, performance will be
awful for any code that has this problem.
It is FPU *emulation*, of course the performance will suck. We don't
care about performance, we just want it to execute correctly.

David Daney
David Daney
2014-10-07 00:33:18 UTC
Permalink
Post by Rich Felker
Post by David Daney
[...]
Post by Andy Lutomirski
Post by Rich Felker
This is a huge ill-designed mess.
Amen.
Can the kernel not just emulate the instructions directly?
In theory it could, but since there can be implementation defined
instructions, there is no way to achieve full instruction set
coverage for all possible machines.
Is the issue really implementation-defined instructions with delay
slots?
It is the instructions in the delay slots, not the branch instructions
themselves that are of interest. But, for the sake of the arguments,
this is not a critical point.
Post by Rich Felker
If so it sounds like a made-up issue.
It is not a made up issue.

If you want an architecture that has a well defined instruction set,
stick with x86, Intel will tell you what is good for you and you will
take whatever they give you.

If you want an architecture where you can add implementation defined
instructions to do whatever you want, then you use an architecture like
MIPS.
Post by Rich Felker
They're not going to
occur in real binaries. Certainly a compiler is not going to generate
implementation-defined instructions,
Why not? It will emit any instructions we care to make it emit. If we
want it to emit crypto instructions with patented algorithms, then it
will do that. But we would still like to use a generic kernel with
generic FPU support.

The most straight forward way (and the currently implemented way) of
doing this is to execute the instructions in question out-of-line (on
the userspace stack).

The question here is: What is the best way to get to a non-executable
stack.

The consensus among MIPS developers is that we should continue using the
out-of-line execution trick, but do it somewhere other than in stack memory.

One way of doing this is to have the kernel magically generate thread
local memory regions.

Another option is to have userspace manage the out-of-line execution areas.

As is often the case, each approach has different pluses and minuses.
Andy Lutomirski
2014-10-07 00:48:55 UTC
Permalink
Post by David Daney
Post by Rich Felker
Post by David Daney
[...]
Post by Andy Lutomirski
Post by Rich Felker
This is a huge ill-designed mess.
Amen.
Can the kernel not just emulate the instructions directly?
In theory it could, but since there can be implementation defined
instructions, there is no way to achieve full instruction set
coverage for all possible machines.
Is the issue really implementation-defined instructions with delay
slots?
It is the instructions in the delay slots, not the branch instructions
themselves that are of interest. But, for the sake of the arguments, this
is not a critical point.
Post by Rich Felker
If so it sounds like a made-up issue.
It is not a made up issue.
If you want an architecture that has a well defined instruction set, stick
with x86, Intel will tell you what is good for you and you will take
whatever they give you.
If you want an architecture where you can add implementation defined
instructions to do whatever you want, then you use an architecture like
MIPS.
Post by Rich Felker
They're not going to
occur in real binaries. Certainly a compiler is not going to generate
implementation-defined instructions,
Why not? It will emit any instructions we care to make it emit. If we want
it to emit crypto instructions with patented algorithms, then it will do
that. But we would still like to use a generic kernel with generic FPU
support.
The most straight forward way (and the currently implemented way) of doing
this is to execute the instructions in question out-of-line (on the
userspace stack).
The question here is: What is the best way to get to a non-executable
stack.
The consensus among MIPS developers is that we should continue using the
out-of-line execution trick, but do it somewhere other than in stack memory.
One way of doing this is to have the kernel magically generate thread local
memory regions.
Another option is to have userspace manage the out-of-line execution areas.
As is often the case, each approach has different pluses and minuses.
Your patch is still buggy. Imagine this sequence:

Daft userspace code does:

emulated fp branch to elsewhere (not taken)
insn 1
insn 2

The kernel shoves insn1 and insn2 in this magic trampoline and
re-enters user code there.

An asynchronous signal happens before insn1 executes.

The signal handler runs similar daft code, gets fixed up and returns
*to the now-overwritten trampoline*. Boom. This kind of failure mode
is why using any kind of magic trampoline sucks on all architectures.

Even the current code might have the same bug for all I know -- are
you really updating the stack pointer when you emulate these instructions?
Do you have a redzone for exactly this purpose? Does the MIPS signal
delivery code check to see whether you're executing off the stack
outside of the ABI-protected region?


Given that this is documented as an ABI change, I'll ask again: can
you demand that user code that wants the ABI-breaking non-executable
stack must not do this? IOW, binaries that claim to work with
non-executable stacks must not have fp branches (or alternatively must
not have anything other than nops in the delay slots of possibly
emulated FP branches)? Or you could be polite and explicitly define
the set of instructions that are safe in fp branch delay slots.

(Also, seriously, fp branches have usable delay slots? Wow!)

--Andy
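The failure sequence described in this message can be replayed with a toy model in C. This is not kernel code: it assumes a single per-thread trampoline that the emulator overwrites on every emulated FP branch, and the struct, function names and instruction words are all placeholders made up for illustration.

```c
#include <stdint.h>

/* Toy model: one per-thread trampoline holding the two relocated words. */
struct xol_slot {
    uint32_t insn[2];
};

/* What the emulator does on each emulated FP branch: overwrite the slot. */
static void xol_fill(struct xol_slot *slot, uint32_t first, uint32_t second)
{
    slot->insn[0] = first;
    slot->insn[1] = second;
}

/*
 * Replays the sequence: the main code's words are placed in the slot, a
 * signal arrives before they run, the handler triggers a second emulation
 * in the same slot, and the handler then returns.
 * Returns 1 if the slot still holds the main code's words when the
 * interrupted context resumes, 0 if they were clobbered.
 */
static int xol_survives_nested_signal(struct xol_slot *slot)
{
    const uint32_t main_a = 0x11111111u, main_b = 0x22222222u; /* placeholders */
    const uint32_t sig_a  = 0x33333333u, sig_b  = 0x44444444u; /* placeholders */

    xol_fill(slot, main_a, main_b); /* trampoline set up for the main code */
    /* asynchronous signal delivered here; saved PC points at slot->insn[0] */
    xol_fill(slot, sig_a, sig_b);   /* handler hits its own emulated branch */
    /* sigreturn: execution resumes at slot->insn[0] */
    return slot->insn[0] == main_a && slot->insn[1] == main_b;
}
```

In this model the function returns 0: the interrupted context resumes into the handler's words. A scheme that allocates a fresh area for each nested emulation, as the stack-based approach can do by moving below the current stack pointer, would not have this property.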
Rich Felker
2014-10-07 00:49:15 UTC
Permalink
Post by David Daney
Post by Rich Felker
Post by David Daney
[...]
Post by Andy Lutomirski
Post by Rich Felker
This is a huge ill-designed mess.
Amen.
Can the kernel not just emulate the instructions directly?
In theory it could, but since there can be implementation defined
instructions, there is no way to achieve full instruction set
coverage for all possible machines.
Is the issue really implementation-defined instructions with delay
slots?
It is the instructions in the delay slots, not the branch
instructions themselves that are of interest. But, for the sake of
the arguments, this is not a critical point.
I think it's an important distinction. It means the problem domain is
supporting all possible instructions, not instructions which can
reasonably have delay slots.
Post by David Daney
Post by Rich Felker
If so it sounds like a made-up issue.
It is not a made up issue.
If you want an architecture that has a well defined instruction set,
stick with x86, Intel will tell you what is good for you and you
will take whatever they give you.
If you want an architecture where you can add implementation defined
instructions to do whatever you want, then you use an architecture
like MIPS.
The ability to add arbitrary instructions does not mean that arbitrary
uses of those instructions have to be supported by the ABI. It's
completely reasonable for the ABI to say they cannot be used in delay
slots for coprocessor-conditional branches.

And of course once you're in the realm of custom hardware and software
written to depend on that custom hardware, you know whether you have
an fpu or not anyway. If you have an fpu, you can ignore the
restriction. If you don't, you should follow it. Note that "partial
fpu emulation" (e.g. just denormals) is not relevant here; the issue
only arises if the coprocessor branch instructions have to be
emulated, which means "there's no fpu at all".
Post by David Daney
Post by Rich Felker
They're not going to
occur in real binaries. Certainly a compiler is not going to generate
implementation-defined instructions,
Why not? It will emit any instructions we care to make it emit. If
we want it to emit crypto instructions with patented algorithms,
then it will do that. But we would still like to use a generic
kernel with generic FPU support.
The most straight forward way (and the currently implemented way) of
doing this is to execute the instructions in question out-of-line
(on the userspace stack).
The question here is: What is the best way to get to a
non-executable stack.
The consensus among MIPS developers is that we should continue using
My experience has been that hardware and software developers focused
on a particular hardware target are generally unqualified to make
decisions that affect the design and operation of libc or the kernel.
They are not experts in these areas. It was apparent early on in this
thread, when you mentioned the idea that "not all threads would need
fpu support", that you were thinking from a standpoint of custom
low-level software and not a general purpose libc that cannot read the
application author's mind. It seems nobody had thought of the
impossibility of doing lazy setup (inability to handle failure) and
the necessity of always initializing this stuff at pthread_create
time, either. Design issues like this should be run by experts in the
libc area early on, not as an afterthought.
Post by David Daney
the out-of-line execution trick, but do it somewhere other than in
stack memory.
How do you answer Andy Lutomirski's question about what happens when a
signal handler interrupts execution while the program counter is
pointing at this "out-of-line execution" trampoline? This seems like a
show-stopper for using anything other than the stack.
Post by David Daney
One way of doing this is to have the kernel magically generate
thread local memory regions.
Another option is to have userspace manage the out-of-line execution areas.
As is often the case, each approach has different pluses and minuses.
Having the kernel magically do it would be better, but I'm doubtful
that solution works anyway due to the above signal handler/nesting
issue.

Rich
David Daney
2014-10-07 04:50:47 UTC
Permalink
[...]
Post by Rich Felker
Post by David Daney
Why not? It will emit any instructions we care to make it emit. If
we want it to emit crypto instructions with patented algorithms,
then it will do that. But we would still like to use a generic
kernel with generic FPU support.
The most straight forward way (and the currently implemented way) of
doing this is to execute the instructions in question out-of-line
(on the userspace stack).
The question here is: What is the best way to get to a
non-executable stack.
The consensus among MIPS developers is that we should continue using
My experience has been that hardware and software developers focused
on a particular hardware target are generally unqualified to make
decisions that affect the design and operation of libc or the kernel.
They are not experts in these areas. It was apparent early on in this
thread, when you mentioned the idea that "not all threads would need
fpu support", that you were thinking from a standpoint of custom
low-level software and not a general purpose libc that cannot read the
application author's mind.
Not at all, I was thinking of soft-float ABIs, as they never execute FP
instructions, and are often used on systems with no FPU. In fact many
non-FPU systems never execute any hard-float code. So those systems
should not suffer large performance regressions from any change made to
support a non-executable stack.

We use GLibC on many soft-float only systems, and I would posit that
GLibC is a general purpose libc.
Post by Rich Felker
It seems nobody had thought of the
impossibility of doing lazy setup (inability to handle failure) and
the necessity of always initializing this stuff at pthread_create
time, either. Design issues like this should be run by experts in the
libc area early on, not as an afterthought.
I bow down to the experts, as obviously I know nothing about:

1) The Linux kernel
2) The MIPS architecture.
3) Library design.
4) C libraries and their interaction with the kernel, linker and compiler.
Post by Rich Felker
Post by David Daney
the out-of-line execution trick, but do it somewhere other than in
stack memory.
How do you answer Andy Lutomirski's question about what happens when a
signal handler interrupts execution while the program counter is
pointing at this "out-of-line execution" trampoline? This seems like a
show-stopper for using anything other than the stack.
It would be nice to support, but not doing so would not be a regression
from current behavior.
Post by Rich Felker
Post by David Daney
One way of doing this is to have the kernel magically generate
thread local memory regions.
Another option is to have userspace manage the out-of-line execution areas.
As is often the case, each approach has different pluses and minuses.
Having the kernel magically do it would be better, but I'm doubtful
that solution works anyway due to the above signal handler/nesting
issue.
So the perfect is the enemy of the good? No non-executable stack for
you, MIPS.
Post by Rich Felker
Rich
Matthew Fortune
2014-10-07 09:13:22 UTC
Permalink
Post by David Daney
Post by Rich Felker
Post by David Daney
the out-of-line execution trick, but do it somewhere other than in
stack memory.
How do you answer Andy Lutomirski's question about what happens when a
signal handler interrupts execution while the program counter is
pointing at this "out-of-line execution" trampoline? This seems like a
show-stopper for using anything other than the stack.
It would be nice to support, but not doing so would not be a regression
from current behavior.
It seems appropriate to mention another issue which should be addressed as
part of the overall FPU emulation work...
From what I can see the out-of-line execution of delay slot instructions
will break micromips R3 addiupc, and all MIPS32r6 and MIPS64r6 PC-relative
instructions (inc load/store) as they will have the wrong base. Is there
anything in the current set of proposals that can address this (beyond
adding restrictions to what is ABI allowed in FPU branch delay slots)?

This is an issue whether the stack is executable or not but does directly
relate to the topic of FPU emulation. It sounds like the kernel would not
be able to emulate a pc-relative load/store even if it was a special case
as it would not run in the correct MM context? [be gentle, I'm no expert
in this area].

Matthew
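The "wrong base" problem raised here can be shown with a few lines of C. The sketch assumes the MIPS32 Release 6 ADDIUPC behaviour of adding a sign-extended 19-bit immediate, shifted left by 2, to the PC of the instruction. The function names and the addresses used in the note after the code are invented for illustration.

```c
#include <stdint.h>

/* Sign-extend the low 19 bits of v. */
static int32_t sext19(uint32_t v)
{
    v &= 0x7FFFFu;
    return (int32_t)v - ((v & 0x40000u) ? 0x80000 : 0);
}

/*
 * Result of an ADDIUPC-style add: pc + (sign-extended imm19 << 2),
 * with 32-bit wraparound. The answer depends on where the instruction
 * is executed from, which is the whole problem for out-of-line execution.
 */
static uint32_t pcrel_add(uint32_t pc, uint32_t imm19)
{
    return pc + (uint32_t)sext19(imm19) * 4u;
}
```

With an immediate of 4, executing in place at a text address such as 0x00400100 yields 0x00400110, while the same word copied to a trampoline at 0x7fff0000 yields 0x7fff0010. An emulator that knows the original PC can compute the first value directly, without executing the instruction anywhere.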
James Hogan
2014-10-07 10:52:34 UTC
Permalink
Post by Matthew Fortune
Post by David Daney
Post by Rich Felker
Post by David Daney
the out-of-line execution trick, but do it somewhere other than in
stack memory.
How do you answer Andy Lutomirski's question about what happens when a
signal handler interrupts execution while the program counter is
pointing at this "out-of-line execution" trampoline? This seems like a
show-stopper for using anything other than the stack.
It would be nice to support, but not doing so would not be a regression
from current behavior.
It seems appropriate to mention another issue which should be addressed as
part of the overall FPU emulation work...
From what I can see the out-of-line execution of delay slot instructions
will break micromips R3 addiupc, and all MIPS32r6 and MIPS64r6 PC-relative
instructions (inc load/store) as they will have the wrong base. Is there
anything in the current set of proposals that can address this (beyond
adding restrictions to what is ABI allowed in FPU branch delay slots)?
This is an issue whether the stack is executable or not but does directly
relate to the topic of FPU emulation. It sounds like the kernel would not
be able to emulate a pc-relative load/store even if it was a special case
as it would not run in the correct MM context? [be gentle, I'm no expert
in this area].
I think special casing and emulating them in the kernel would work in
these cases, since it'd be a known set of instructions rather than
arbitrary unknown instructions, the kernel needs to read/write safely
into the user address space all the time for system calls.

Cheers
James
Rich Felker
2014-10-07 11:19:00 UTC
Permalink
Post by Matthew Fortune
From what I can see the out-of-line execution of delay slot instructions
will break micromips R3 addiupc, and all MIPS32r6 and MIPS64r6 PC-relative
instructions (inc load/store) as they will have the wrong base. Is there
anything in the current set of proposals that can address this (beyond
adding restrictions to what is ABI allowed in FPU branch delay slots)?
Yes. If a trampoline is being generated to replace the delay slot
instruction, it can just contain more complex code to duplicate what
the PC-relative instruction would have done. Since the ABI already
assumes a stack is available, it can use the stack to backup registers
it needs for scratch space and restore them.
Post by Matthew Fortune
This is an issue whether the stack is executable or not but does directly
relate to the topic of FPU emulation. It sounds like the kernel would not
be able to emulate a pc-relative load/store even if it was a special case
as it would not run in the correct MM context? [be gentle, I'm no expert
in this area].
Really everything should be done in the kernel, and it's not as hard
as people are making it look. The kernel _already_ has to enforce MM
context permissions for every syscall that reads or writes user memory
(e.g. futex with PI mutexes or FUTEX_WAKE_OP, or even simple things
like read/write) so there's no reason it can't do emulated
loads/stores the exact same way.

Rich
David Daney
2014-10-07 16:04:36 UTC
Permalink
Post by Matthew Fortune
Post by David Daney
Post by Rich Felker
Post by David Daney
the out-of-line execution trick, but do it somewhere other than in
stack memory.
How do you answer Andy Lutomirski's question about what happens when a
signal handler interrupts execution while the program counter is
pointing at this "out-of-line execution" trampoline? This seems like a
show-stopper for using anything other than the stack.
It would be nice to support, but not doing so would not be a regression
from current behavior.
It seems appropriate to mention another issue which should be addressed as
part of the overall FPU emulation work...
From what I can see the out-of-line execution of delay slot instructions
will break micromips R3 addiupc, and all MIPS32r6 and MIPS64r6 PC-relative
instructions (inc load/store) as they will have the wrong base. Is there
anything in the current set of proposals that can address this (beyond
adding restrictions to what is ABI allowed in FPU branch delay slots)?
This is an issue whether the stack is executable or not but does directly
relate to the topic of FPU emulation. It sounds like the kernel would not
be able to emulate a pc-relative load/store even if it was a special case
as it would not run in the correct MM context? [be gentle, I'm no expert
in this area].
I haven't studied the r6 ISA in depth. But you are correct, the r6 ISA
cannot be supported with the eXecute-Out-of-Line tricks due to the PC
relative instructions.

So probably the best path forward is to abandon the current method, and
bite the bullet and write an entire instruction set emulator. It
doesn't have to be fast.

David Daney
Post by Matthew Fortune
Matthew
Leonid Yegoshin
2014-10-07 18:32:58 UTC
Permalink
Well, I am not a subscriber to mail-list, so I read it the first time
and some notes:

1) David's approach would likely work for FPU emulation but unlikely
works for MIPS Rel 2/Rel 1/ MIPS I emulation in MIPS R6 architecture.
The reason is that the first MIPS R2 instruction (removed from MIPS R6)
can be hit long before GLIBC/bionic/etc can determine how to use
properly a new system call. And that instruction needs to be emulated. I
actually hit this problem with ssh-keygen first and referred to FPU
emulation because I got it later, during my attempt to salvage a situation.

2) The issue of uMIPS ADDIUPC and similar instructions is overblown in
my opinion. None of them are memory-related and their emulation in
BD-slot can be easily done in kernel and that actually accelerates an
emulation. Look at piece of code which I wrote to accelerate an
emulation of some instructions in BD-slot of JR instruction:

switch (MIPSInst_OPCODE(ir)) {
case addiu_op:
        if (MIPSInst_RT(ir))
                regs->regs[MIPSInst_RT(ir)] =
                        (s32)regs->regs[MIPSInst_RS(ir)] +
                        (s32)MIPSInst_SIMM(ir);
        return(0);
#ifdef CONFIG_64BIT
case daddiu_op:
        if (MIPSInst_RT(ir))
                regs->regs[MIPSInst_RT(ir)] =
                        (s64)regs->regs[MIPSInst_RS(ir)] +
                        (s64)MIPSInst_SIMM(ir);
        return(0);
#endif

Five lines per instruction.

3) The signal happened during execution of emulated instruction -
signals are under control of kernel and we can easily delay a signal
during execution of emulated instruction until return from do_dsemulret.
It is not a big deal - nor code, nor performance. Thank you for good point.

4) The argument for doing all instruction emulation in the kernel: it is
not the MIPS business model to force a customer to make details of all
Coprocessor 2 instructions public. We provide an interface and the rest
is the customer's business. Besides that, it is really painful to
differentiate between Cavium Octeon and some other CPU's instructions
with the same opcode. On the other side, leaving emulation of their
instructions to them is not wise after having had a good way of doing
that for multiple years.

- Leonid.
David Daney
2014-10-07 18:43:10 UTC
Permalink
Post by Leonid Yegoshin
Well, I am not a subscriber to mail-list, so I read it the first time
1) David's approach would likely work for FPU emulation but unlikely
works for MIPS Rel 2/Rel 1/ MIPS I emulation in MIPS R6 architecture.
The reason is that the first MIPS R2 instruction (removed from MIPS R6)
can be hit long before GLIBC/bionic/etc can determine how to use
properly a new system call. And that instruction needs to be emulated. I
actually hit this problem with ssh-keygen first and referred to FPU
emulation because I got it later, during my attempt to salvage a situation.
2) The issue of uMIPS ADDIUPC and similar instructions are overblown in
my opinion. Never of them are memory-related and their emulation in
BD-slot can be easily done in kernel and that actually accelerates an
emulation. Look at piece of code which I wrote to accelerate an
emulation of some instructions in BD-slot of JR instruction:
switch (MIPSInst_OPCODE(ir)) {
case addiu_op:
        if (MIPSInst_RT(ir))
                regs->regs[MIPSInst_RT(ir)] =
                        (s32)regs->regs[MIPSInst_RS(ir)] +
                        (s32)MIPSInst_SIMM(ir);
        return(0);
#ifdef CONFIG_64BIT
case daddiu_op:
        if (MIPSInst_RT(ir))
                regs->regs[MIPSInst_RT(ir)] =
                        (s64)regs->regs[MIPSInst_RS(ir)] +
                        (s64)MIPSInst_SIMM(ir);
        return(0);
#endif
Five lines per instruction.
But you must be able to emulate them, so you need an emulator, not XOL.
Post by Leonid Yegoshin
3) The signal happened during execution of emulated instruction -
signals are under control of kernel and we can easily delay a signal
during execution of emulated instruction until return from do_dsemulret.
It is not a big deal - nor code, nor performance. Thank you for good point.
The problem is what to do with synchronous signals (SIGSEGV). Those
cannot be held off, and you really want the EPC value saved in the
register state to be correct.
Post by Leonid Yegoshin
4) The voice for doing any instruction emulation in kernel - it is not
a MIPS business model to force customer to put details of all
Coprocessor 2 instructions public. We provide an interface and the rest
is a customer business. Besides that it is really painful to make a
differentiation between Cavium Octeon and some another CPU instructions
with the same opcode. On other side, leaving emulation of their
instructions to them is not a wise after having some good way doing that
multiple years.
With all the new information we have begun to understand, it seems like
the only sane thing to do is get rid of this XOL approach and go to full
emulation of the entire instruction set.

David Daney
Leonid Yegoshin
2014-10-07 19:13:37 UTC
Permalink
(Repeating this because of an e-mail failure, sorry.)
Post by David Daney
Post by Leonid Yegoshin
Five lines per instruction.
But you must be able to emulate them, so you need an emulator, not XOL.
I feel I wasn't clear: emulation of ADDIUPC (after a second look it
is the only instruction that requires special handling) is FIVE LINES OF
CODE. At least in MIPS R2 it would require 5 lines. In the MIPS R2 emulator
I have a routine (50 lines) which checks the BD-slot instruction for some
popular opcodes, emulates those, and leaves other opcodes to dsemul().

The same can be done for the FPU emulator.
Post by David Daney
The problem is what to do with synchronous signals (SIGSEGV) Those
cannot be held off, and you really want the EPC value saved in the
register state to be correct.
A synchronous exception is not a problem: we know that emulation in the
VDSO (read: today, the stack) is running and can take care of it. We can
easily change EPC before we start delivering the signal and pretend that
the problem happened in the correct place.

The async signals seem to be some problem... for now... until I finish
looking into the common Linux kernel code, I think.
Post by David Daney
What happens if one of those out-of-line instructions causes a synchronous trap?
If we need to return that as a signal then we change EPC to the proper
value from emulframe->epc. If it is a nested emulation, we continue.
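
The EPC fix-up described here can be sketched in isolation. This is a hedged illustration, not the kernel's code: `struct xol_frame`, its fields and `epc_for_signal()` are hypothetical names; only the idea of substituting the saved `epc` from the emulation frame comes from the message above.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical record kept for one out-of-line emulation. */
struct xol_frame {
    uintptr_t slot_addr; /* where the copied instruction was placed */
    uintptr_t slot_len;  /* size of the out-of-line area in bytes */
    uintptr_t orig_epc;  /* address of the instruction in the real program */
};

/*
 * Pick the EPC to report in a signal context. If the trap happened
 * inside the out-of-line area, report the original program address
 * instead, so the handler sees the fault "in the correct place".
 */
static uintptr_t epc_for_signal(const struct xol_frame *f, uintptr_t trap_epc)
{
    assert(f);
    if (trap_epc >= f->slot_addr && trap_epc - f->slot_addr < f->slot_len)
        return f->orig_epc;
    return trap_epc; /* fault was in ordinary program code */
}
```

A trap outside the out-of-line area is passed through unchanged, so ordinary faults are not affected by the rewrite.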
Post by David Daney
What if SIGSTOP arrives before ret?
I am looking into a way to delay asynchronous signals until an emulated
instruction is finished. Signals are not time accurate and never have been,
so it is not a big deal to delay them.
Post by David Daney
What if another thread removes the magic ret sequence?
It can't do that in my approach: emulation is done in a write-protected
area, and it is done in per-thread memory space.

- Leonid.
Andy Lutomirski
2014-10-07 18:44:35 UTC
Permalink
On Tue, Oct 7, 2014 at 11:32 AM, Leonid Yegoshin
Well, I am not a subscriber to mail-list, so I read it the first time and
3) The signal happened during execution of emulated instruction - signals
are under control of kernel and we can easily delay a signal during
execution of emulated instruction until return from do_dsemulret. It is not
a big deal - nor code, nor performance. Thank you for good point.
If you go down this particular rabbit hole, you will never come back out.

What happens if one of those out-of-line instructions causes a
synchronous trap? What if SIGSTOP arrives before ret? What if
another thread removes the magic ret sequence?
4) The voice for doing any instruction emulation in kernel - it is not a
MIPS business model to force customer to put details of all Coprocessor 2
instructions public. We provide an interface and the rest is a customer
business. Besides that it is really painful to make a differentiation
between Cavium Octeon and some another CPU instructions with the same
opcode. On other side, leaving emulation of their instructions to them is
not a wise after having some good way doing that multiple years.
IMO this is all backwards. If MIPS customers put proprietary
instructions into their ISA, they leave out the FPU, and they put a
proprietary insn in a branch delay slot, then I think that they
deserve a fatal signal.

There's a really easy solution for new systems: fix the toolchain.
Teach the assembler to disallow any proprietary instructions in an FP
branch delay slot.

--Andy
David Daney
2014-10-07 18:50:18 UTC
Permalink
Post by Andy Lutomirski
On Tue, Oct 7, 2014 at 11:32 AM, Leonid Yegoshin
Well, I am not a subscriber to mail-list, so I read it the first time and
3) The signal happened during execution of emulated instruction - signals
are under control of kernel and we can easily delay a signal during
execution of emulated instruction until return from do_dsemulret. It is not
a big deal - nor code, nor performance. Thank you for good point.
If you go down this particular rabbit hole, you will never come back out.
What happens if one of those out-of-line instructions causes a
synchronous trap? What if SIGSTOP arrives before ret? What if
another thread removes the magic ret sequence?
4) The voice for doing any instruction emulation in kernel - it is not a
MIPS business model to force customer to put details of all Coprocessor 2
instructions public. We provide an interface and the rest is a customer
business. Besides that it is really painful to make a differentiation
between Cavium Octeon and some another CPU instructions with the same
opcode. On other side, leaving emulation of their instructions to them is
not a wise after having some good way doing that multiple years.
IMO this is all backwards. If MIPS customers put proprietary
instructions into their ISA, they leave out the FPU, and they put a
proprietary insn in a branch delay slot, then I think that they
deserve a fatal signal.
There's a really easy solution for new systems: fix the toolchain.
Teach the assembler to disallow any proprietary instructions in an FP
branch delay slot.
Yes, gas for MIPS already has an instruction attribute for instructions
that cannot be placed in delay slots. It should be a fairly simple
matter to extend this to instructions that cannot be emulated.

Thanks,
David Daney
Post by Andy Lutomirski
--Andy
Rich Felker
2014-10-07 19:09:43 UTC
Permalink
Post by Andy Lutomirski
4) The voice for doing any instruction emulation in kernel - it is not a
MIPS business model to force customer to put details of all Coprocessor 2
instructions public. We provide an interface and the rest is a customer
business. Besides that it is really painful to make a differentiation
between Cavium Octeon and some another CPU instructions with the same
opcode. On other side, leaving emulation of their instructions to them is
not a wise after having some good way doing that multiple years.
IMO this is all backwards. If MIPS customers put proprietary
instructions into their ISA, they leave out the FPU, and they put a
proprietary insn in a branch delay slot, then I think that they
deserve a fatal signal.
I agree completely here. We should not break things (or, as it seems,
leave them broken) for common usage cases that affect everyone just to
coddle proprietary vendor-specific instructions. The latter just
should not be used in delay slots unless the chip vendor also promises
to provide fpu branch in hardware.

Rich
Leonid Yegoshin
2014-10-07 19:16:59 UTC
Permalink
Post by Rich Felker
I agree completely here. We should not break things (or, as it seems,
leave them broken) for common usage cases that affect everyone just to
coddle proprietary vendor-specific instructions. The latter just
should not be used in delay slots unless the chip vendor also promises
to provide fpu branch in hardware. Rich
And what do you propose: remove the current in-stack emulation? And you
still think that doesn't break the status quo?
Rich Felker
2014-10-07 19:21:07 UTC
Permalink
Post by Leonid Yegoshin
Post by Rich Felker
I agree completely here. We should not break things (or, as it
seems, leave them broken) for common usage cases that affect
everyone just to coddle proprietary vendor-specific instructions.
The latter just should not be used in delay slots unless the chip
vendor also promises to provide fpu branch in hardware. Rich
And what do you propose - remove a current in-stack emulation and
you still think it doesn't break a status-quo?
The in-stack trampoline support could be left but used only for
emulating instructions the kernel doesn't know. This would make all
normal binaries immediately usable with non-executable stack, and
would avoid the only potential source of regressions. Ultimately I
think the "xol" stuff should be removed, but that could be a long term
goal.

Rich
Leonid Yegoshin
2014-10-07 19:27:48 UTC
Permalink
Post by Rich Felker
The in-stack trampoline support could be left but used only for
emulating instructions the kernel doesn't know. This would make all
normal binaries immediately usable with non-executable stack, and
would avoid the only potential source of regressions. Ultimately I
think the "xol" stuff should be removed, but that could be a long term
goal.
Thank you, that is exactly what I am doing in the patch series named
"[PATCH 0/3] MIPS executable stack protection".
I just set up a special stack for that.
Andy Lutomirski
2014-10-07 19:28:18 UTC
Permalink
Post by Rich Felker
Post by Leonid Yegoshin
Post by Rich Felker
I agree completely here. We should not break things (or, as it
seems, leave them broken) for common usage cases that affect
everyone just to coddle proprietary vendor-specific instructions.
The latter just should not be used in delay slots unless the chip
vendor also promises to provide fpu branch in hardware. Rich
And what do you propose - remove a current in-stack emulation and
you still think it doesn't break a status-quo?
The in-stack trampoline support could be left but used only for
emulating instructions the kernel doesn't know. This would make all
normal binaries immediately usable with non-executable stack, and
would avoid the only potential source of regressions. Ultimately I
think the "xol" stuff should be removed, but that could be a long term
goal.
Does anything break if the xol stuff is disabled for PT_GNU_STACK tasks?
Post by Rich Felker
Rich
--
Andy Lutomirski
AMA Capital Management, LLC
David Daney
2014-10-07 20:03:32 UTC
Permalink
Post by Andy Lutomirski
Post by Rich Felker
Post by Leonid Yegoshin
Post by Rich Felker
I agree completely here. We should not break things (or, as it
seems, leave them broken) for common usage cases that affect
everyone just to coddle proprietary vendor-specific instructions.
The latter just should not be used in delay slots unless the chip
vendor also promises to provide fpu branch in hardware. Rich
And what do you propose - remove a current in-stack emulation and
you still think it doesn't break a status-quo?
The in-stack trampoline support could be left but used only for
emulating instructions the kernel doesn't know. This would make all
normal binaries immediately usable with non-executable stack, and
would avoid the only potential source of regressions. Ultimately I
think the "xol" stuff should be removed, but that could be a long term
goal.
Does anything break if the xol stuff is disabled for PT_GNU_STACK tasks?
The instructions must be executed. If you turn on a non-executable
stack, you cannot execute them on the stack, so they must be handled in
another way, which is the subject of this thread.

Options:

1a) XOL kernel manages the memory
1b) XOL userspace manages the memory
2) Emulate the instructions.
3) I don't think there is a 3rd. option.

As the imgtec people have said, you have to do #2 for their new r6 ISA,
as it uses PC relative instructions.

I really think we should bite the bullet and do #2 for everything; it
will be the cleanest long-term solution.

David Daney
Andy Lutomirski
2014-10-08 00:22:01 UTC
Permalink
Post by Andy Lutomirski
Post by Rich Felker
Post by Leonid Yegoshin
Post by Rich Felker
I agree completely here. We should not break things (or, as it
seems, leave them broken) for common usage cases that affect
everyone just to coddle proprietary vendor-specific instructions.
The latter just should not be used in delay slots unless the chip
vendor also promises to provide fpu branch in hardware. Rich
And what do you propose - remove a current in-stack emulation and
you still think it doesn't break a status-quo?
The in-stack trampoline support could be left but used only for
emulating instructions the kernel doesn't know. This would make all
normal binaries immediately usable with non-executable stack, and
would avoid the only potential source of regressions. Ultimately I
think the "xol" stuff should be removed, but that could be a long term
goal.
Does anything break if the xol stuff is disabled for PT_GNU_STACK tasks?
The instructions must be executed, if you turn on a non-executable stack, you cannot execute them on the stack, so they must be handled in another way, which is the subject of this thread.
1a) XOL kernel manages the memory
1b) XOL userspace manages the memory
2) Emulate the instructions.
3) I don't think there is a 3rd. option.
4) SIGILL

5) single-step or use an HW breakpoint if available


But, yes, 3 seems reasonable.

--Andy
Matthew Fortune
2014-10-07 19:40:27 UTC
Permalink
Post by Andy Lutomirski
4) The voice for doing any instruction emulation in kernel - it is not a
MIPS business model to force customer to put details of all Coprocessor 2
instructions public. We provide an interface and the rest is a customer
business. Besides that it is really painful to make a differentiation
between Cavium Octeon and some another CPU instructions with the same
opcode. On other side, leaving emulation of their instructions to them is
not a wise after having some good way doing that multiple years.
IMO this is all backwards. If MIPS customers put proprietary
instructions into their ISA, they leave out the FPU, and they put a
proprietary insn in a branch delay slot, then I think that they
deserve a fatal signal.
There's a really easy solution for new systems: fix the toolchain.
Teach the assembler to disallow any proprietary instructions in an FP
branch delay slot.
I think I'd be mostly in favour of this from a toolchain perspective but
only from the perspective of FP branch instructions. This still leaves a
problem for normal branches should any of them get removed and need emulating.
The general form of bltzal and bgezal would be the example here of branches
which are removed in R6 (The special case of using $0 remains). This is
really niche but my point is more about how we would deal with such a thing
if it happened. The answer may be just to scream and shout and discourage the
removal of such instructions from the architecture.

Matthew
Rich Felker
2014-10-07 11:11:02 UTC
Permalink
Post by David Daney
Post by Rich Felker
Post by David Daney
the out-of-line execution trick, but do it somewhere other than in
stack memory.
How do you answer Andy Lutomirski's question about what happens when a
signal handler interrupts execution while the program counter is
pointing at this "out-of-line execution" trampoline? This seems like a
show-stopper for using anything other than the stack.
It would be nice to support, but not doing so would not be a
regression from current behavior.
It's not just "nice" to support, it's mandatory. Otherwise you will
execute essentially *random instructions* in this case, providing a
very nice attack vector that can almost certainly be elevated to
arbitrary code execution via timing of signals during floating point
code.

The current behavior in regards to this is correct: because you have a
*stack*, each trampoline is pushed onto the stack in its own context,
and popped when it's no longer needed. You can have arbitrarily many
such trampolines up to the stack size. Note that each nested signal
handler already requires sizeof(ucontext_t) in stack space, so these
trampolines are a negligible additional cost without major effects on
the number of signal handlers you can nest without overflowing the
stack.
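
The nesting property described above can be illustrated with a toy model. It is a sketch only: `struct fake_stack`, `struct tramp_frame` and the push/pop helpers are hypothetical, the 256-byte array stands in for a downward-growing thread stack, and the second word of each frame is a placeholder rather than a real trap encoding.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define FAKE_STACK_SIZE 256

/* Hypothetical model of a thread stack that grows downward. */
struct fake_stack {
    uint8_t mem[FAKE_STACK_SIZE];
    size_t sp; /* offset of the current top of stack within mem */
};

/* Hypothetical trampoline frame: delay-slot instruction plus a trap back. */
struct tramp_frame {
    uint32_t insn;
    uint32_t trap_back;
};

static void fake_stack_init(struct fake_stack *s)
{
    s->sp = FAKE_STACK_SIZE;
}

/* Push one frame below sp; returns its offset, or -1 on overflow. */
static long push_trampoline(struct fake_stack *s, uint32_t insn)
{
    struct tramp_frame f = { insn, 0 /* placeholder, not a real encoding */ };

    if (s->sp < sizeof f)
        return -1;
    s->sp -= sizeof f;
    memcpy(&s->mem[s->sp], &f, sizeof f);
    return (long)s->sp;
}

/* Release the most recently pushed frame. */
static void pop_trampoline(struct fake_stack *s)
{
    assert(s->sp + sizeof(struct tramp_frame) <= FAKE_STACK_SIZE);
    s->sp += sizeof(struct tramp_frame);
}
```

A nested push (for example from a signal handler that itself needs emulation) lands below the outer frame and leaves it intact, which is the property a single fixed out-of-line slot would not have.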
Post by David Daney
Post by Rich Felker
Post by David Daney
One way of doing this is to have the kernel magically generate
thread local memory regions.
Another option is to have userspace manage the out-of-line execution areas.
As is often the case, each approach has different pluses and minuses.
Having the kernel magically do it would be better, but I'm doubtful
that solution works anyway due to the above signal handler/nesting
issue.
So the perfect is the enemy of the good? No non-executable stack
for you, MIPS.
No, regressions that make the situation worse than executable-stack
are not "good" to begin with, even if it weren't for the other design
issues and dumping everything on userspace for the sake of being lazy
in the kernel.

Rich
David Daney
2014-10-07 16:08:51 UTC
Permalink
Post by Rich Felker
Post by David Daney
Post by Rich Felker
Post by David Daney
the out-of-line execution trick, but do it somewhere other than in
stack memory.
How do you answer Andy Lutomirski's question about what happens when a
signal handler interrupts execution while the program counter is
pointing at this "out-of-line execution" trampoline? This seems like a
show-stopper for using anything other than the stack.
It would be nice to support, but not doing so would not be a
regression from current behavior.
It's not just "nice" to support, it's mandatory. Otherwise you will
execute essentially *random instructions* in this case, providing a
very nice attack vector that can almost certainly be elevated to
arbitrary code execution via timing of signals during floating point
code.
The current behavior in regards to this is correct: because you have a
*stack*, each trampoline is pushed onto the stack in its own context,
and popped when it's no longer needed. You can have arbitrarily many
such trampolines up to the stack size. Note that each nested signal
handler already requires sizeof(ucontext_t) in stack space, so these
trampolines are a negligible additional cost without major effects on
the number of signal handlers you can nest without overflowing the
stack.
Yes, the stack takes care of the allocations, but the current
implementation has many problems:

1) Signals clobber the emulation area.
2) Signals caused by the emulation have incorrect saved machine state.

We have a low bar to pass: any new solution doesn't have to be perfect,
it only has to be an improvement.

Keep in mind that we are not starting from a clean slate; there are many
years of legacy code that have built up here.

David Daney
Andy Lutomirski
2014-10-07 18:16:54 UTC
Permalink
Post by David Daney
Post by Rich Felker
Post by David Daney
Post by Rich Felker
Post by David Daney
the out-of-line execution trick, but do it somewhere other than in
stack memory.
How do you answer Andy Lutomirski's question about what happens when a
signal handler interrupts execution while the program counter is
pointing at this "out-of-line execution" trampoline? This seems like a
show-stopper for using anything other than the stack.
It would be nice to support, but not doing so would not be a
regression from current behavior.
It's not just "nice" to support, it's mandatory. Otherwise you will
execute essentially *random instructions* in this case, providing a
very nice attack vector that can almost certainly be elevated to
arbitrary code execution via timing of signals during floating point
code.
The current behavior in regards to this is correct: because you have a
*stack*, each trampoline is pushed onto the stack in its own context,
and popped when it's no longer needed. You can have arbitrarily many
such trampolines up to the stack size. Note that each nested signal
handler already requires sizeof(ucontext_t) in stack space, so these
trampolines are a negligible additional cost without major effects on
the number of signal handlers you can nest without overflowing the
stack.
1) Signals clobber the emulation area.
2) Signals caused by the emulation, have incorrect saved machine state.
We have a low bar to pass, any new solution doesn't have to be perfect, it only has to be an improvement.
Keep in mind that we are not starting from a clean slate, there are many years of legacy code that has built up here.
A lesson I learned when doing the x86 vsyscall stuff: Don't waste time
improving legacy crap without a really good reason. Especially don't
extend the interface. Deprecate it (without breaking working user
code) and move on.

--Andy
Post by David Daney
David Daney
Ralf Baechle
2014-10-07 23:20:20 UTC
Permalink
Post by Rich Felker
As an alternative, if the space of possible instruction with a delay
slot is sufficiently small, all such instructions could be mapped as
immutable code in a shared mapping, each at a fixed offset in the
mapping. I suspect this would be borderline-impractical (multiple
megabytes?), but it is the cleanest solution otherwise.
Yes, there are 2^32 possible instructions. Each one is 4 bytes, plus you
need a way to exit after the instruction has executed, which would require
another instruction. So you would need 32GB of memory to hold all those
instructions, larger than the 32-bit virtual address space.
Plus errata support for some older CPUs requires that no other instructions
that might cause an exception be present in the same cache line, inflating
the size to 32 bytes per instruction.
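
The arithmetic above is easy to check mechanically. The sketch below assumes one fixed-size slot per 32-bit encoding; `table_bytes()` and `bytes_to_gib()` are hypothetical helper names, and sizes are reported in binary gigabytes (GiB).

```c
#include <assert.h>
#include <stdint.h>

/* Bytes needed for a table with one fixed-size slot per 32-bit encoding. */
static uint64_t table_bytes(uint64_t bytes_per_slot)
{
    const uint64_t encodings = UINT64_C(1) << 32;

    /* guard against overflowing 64 bits in the multiplication */
    assert(bytes_per_slot <= (UINT64_MAX >> 32));
    return encodings * bytes_per_slot;
}

/* Convert a byte count to whole GiB (2^30 bytes). */
static uint64_t bytes_to_gib(uint64_t bytes)
{
    return bytes >> 30;
}
```

With 8 bytes per slot (instruction plus exit instruction) this gives 32 GiB, and with one 32-byte cache line per slot it gives 128 GiB; both exceed the 4 GiB of a 32-bit address space.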

I've contemplated a full emulation - but that would require an emulator that
is capable of most of the instruction set. With all the random ASEs around
that would be hard to implement while the FPU emulator trampoline as currently
used has the advantage of automatically supporting ASEs, known and unknown.
So it's a huge bonus for maintenance.

Ralf
David Daney
2014-10-07 23:59:03 UTC
Permalink
Post by Ralf Baechle
Post by Rich Felker
As an alternative, if the space of possible instruction with a delay
slot is sufficiently small, all such instructions could be mapped as
immutable code in a shared mapping, each at a fixed offset in the
mapping. I suspect this would be borderline-impractical (multiple
megabytes?), but it is the cleanest solution otherwise.
Yes, there are 2^32 possible instructions. Each one is 4 bytes, plus you
need a way to exit after the instruction has executed, which would require
another instruction. So you would need 32GB of memory to hold all those
instructions, larger than the 32-bit virtual address space.
Plus errata support for some older CPUs requires no other instructions
that might cause an exception to be present in the same cache line inflating
the size to 32 bytes per instruction.
I've contemplated a full emulation - but that would require an emulator that
is capable of most of the instruction set. With all the random ASEs around
that would be hard to implement while the FPU emulator trampoline as currently
used has the advantage of automatically supporting ASEs, known and unknown.
So it's a huge bonus for maintenance.
Unfortunately it breaks when our friends at Imgtec introduce their PC
relative instructions in mipsr6, so an emulator may be unavoidable.

David Daney
Chuck Ebbert
2014-10-08 00:18:33 UTC
Permalink
On Tue, 7 Oct 2014 16:59:03 -0700
Post by David Daney
Post by Ralf Baechle
Post by Rich Felker
As an alternative, if the space of possible instruction with a delay
slot is sufficiently small, all such instructions could be mapped as
immutable code in a shared mapping, each at a fixed offset in the
mapping. I suspect this would be borderline-impractical (multiple
megabytes?), but it is the cleanest solution otherwise.
Yes, there are 2^32 possible instructions. Each one is 4 bytes, plus you
need a way to exit after the instruction has executed, which would require
another instruction. So you would need 32GB of memory to hold all those
instructions, larger than the 32-bit virtual address space.
Plus errata support for some older CPUs requires no other instructions
that might cause an exception to be present in the same cache line inflating
the size to 32 bytes per instruction.
I've contemplated a full emulation - but that would require an emulator that
is capable of most of the instruction set. With all the random ASEs around
that would be hard to implement while the FPU emulator trampoline as currently
used has the advantage of automatically supporting ASEs, known and unknown.
So it's a huge bonus for maintenance.
Unfortunately it breaks when our friends at Imgtec introduce their PC
relative instructions in mipsr6, so an emulator may be unavoidable.
The x86 kprobes code deals with executing relocated insns with
PC-relative offsets by adjusting the offset in a relocated instruction
before executing it.

See arch/x86/kernel/kprobes/core.c::__copy_instruction()
Rich Felker
2014-10-08 02:37:33 UTC
Permalink
Post by Chuck Ebbert
On Tue, 7 Oct 2014 16:59:03 -0700
Post by David Daney
Post by Ralf Baechle
Post by Rich Felker
As an alternative, if the space of possible instruction with a delay
slot is sufficiently small, all such instructions could be mapped as
immutable code in a shared mapping, each at a fixed offset in the
mapping. I suspect this would be borderline-impractical (multiple
megabytes?), but it is the cleanest solution otherwise.
Yes, there are 2^32 possible instructions. Each one is 4 bytes, plus you
need a way to exit after the instruction has executed, which would require
another instruction. So you would need 32GB of memory to hold all those
instructions, larger than the 32-bit virtual address space.
Plus errata support for some older CPUs requires no other instructions
that might cause an exception to be present in the same cache line inflating
the size to 32 bytes per instruction.
I've contemplated a full emulation - but that would require an emulator that
is capable of most of the instruction set. With all the random ASEs around
that would be hard to implement while the FPU emulator trampoline as currently
used has the advantage of automatically supporting ASEs, known and unknown.
So it's a huge bonus for maintenance.
Unfortunately it breaks when our friends at Imgtec introduce their PC
relative instructions in mipsr6, so an emulator may be unavoidable.
The x86 kprobes code deals with executing relocated insns with
PC-relative offsets by adjusting the offset in a relocated instruction
before executing it.
See arch/x86/kernel/kprobes/core.c::__copy_instruction()
This only works if you have an ISA that can represent the full range
of possible relative addresses. It does not work for MIPS where the
range is quite restricted by the fixed instruction size.
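
To make the range argument concrete, here is a small sketch of the check a relocating approach would need. The function name `reloc_fits()` and its parameters are hypothetical; the field widths used in the note after the code are illustrative assumptions (a 16-bit word-scaled field resembles classic MIPS branch offsets, a 32-bit byte-scaled field resembles an x86 rel32 displacement), not a statement about any particular R6 instruction.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * Can a PC-relative reference to `target` still be encoded once the
 * instruction has been moved to `new_pc`?
 *   field_bits  - width of the signed immediate field
 *   scale_shift - log2 of the unit the field counts in (2 = 4-byte words)
 */
static bool reloc_fits(int64_t new_pc, int64_t target,
                       unsigned field_bits, unsigned scale_shift)
{
    assert(field_bits >= 1 && field_bits <= 32 && scale_shift < 8);

    int64_t disp = target - new_pc;
    int64_t unit = (int64_t)1 << scale_shift;
    int64_t max = (((int64_t)1 << (field_bits - 1)) - 1) * unit;
    int64_t min = -((int64_t)1 << (field_bits - 1)) * unit;

    /* displacement must be a whole number of units and within range */
    return disp % unit == 0 && disp >= min && disp <= max;
}
```

For a target at 0x00400000, `reloc_fits(0x00400100, 0x00400000, 16, 2)` holds, but moving the instruction to a stack-like address such as 0x7fff0000 does not fit in a 16-bit word-scaled field, while it would fit in a 32-bit byte-scaled one.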

Rich
Paul Burton
2014-10-08 10:31:49 UTC
Permalink
Post by David Daney
Post by Ralf Baechle
Post by Rich Felker
As an alternative, if the space of possible instruction with a delay
slot is sufficiently small, all such instructions could be mapped as
immutable code in a shared mapping, each at a fixed offset in the
mapping. I suspect this would be borderline-impractical (multiple
megabytes?), but it is the cleanest solution otherwise.
Yes, there are 2^32 possible instructions. Each one is 4 bytes, plus you
need a way to exit after the instruction has executed, which would require
another instruction. So you would need 32GB of memory to hold all those
instructions, larger than the 32-bit virtual address space.
Plus errata support for some older CPUs requires no other instructions
that might cause an exception to be present in the same cache line inflating
the size to 32 bytes per instruction.
I've contemplated a full emulation - but that would require an emulator that
is capable of most of the instruction set. With all the random ASEs around
that would be hard to implement while the FPU emulator trampoline as currently
used has the advantage of automatically supporting ASEs, known and unknown.
So it's a huge bonus for maintenance.
Unfortunately it breaks when our friends at Imgtec introduce their PC
relative instructions in mipsr6, so an emulator may be unavoidable.
David Daney
Just to note, this was also discussed when I submitted my much older
patch with a similar goal:

http://patchwork.linux-mips.org/patch/6125/

...and the conclusion there also began converging towards full ISA
emulation (or at least, the subset of the ISA which userland can
execute):

http://www.linux-mips.org/archives/linux-mips/2014-07/msg00034.html

For the record my preference is for emulation. It is in some ways more
work, but it's also much cleaner. Given that more instructions will need
to be emulated to run pre-R6 binaries on R6 systems anyway, the emulator
would only become increasingly useful.

Paul

Kevin D. Kissell
2014-10-07 01:02:20 UTC
Permalink
Post by David Daney
In order for MIPS to be able to support a non-executable stack, we
need to supply a method to specify a userspace area that can be used
for executing emulated branch delay slot instructions.
We add a new system call, sys_set_fpuemul_xol_area so that userspace
threads that are using the FPU can specify the location of the FPU
emulation out of line execution area.
MIPS floating point support requires that any instruction that cannot
be directly executed by the FPU, be emulated by the kernel. Part of
this emulation involves executing non-FPU instructions that fall in
the delay slots of FP branch instructions. Since the beginning of
MIPS/Linux time, this has been done by placing the instructions on the
userspace thread stack, and executing them there, as the instructions
must be executed in the MM context of the thread receiving the
emulation.
Well, actually it doesn't go back to the beginning of MIPS/Linux time -
I was the b*astard who took the user-mode functional emulator from
Algorithmics and got it to work in the MIPS Linux kernel context, some
time in the early 2000s. It was all pretty straightforward, except for
the delay-slot-of-an-emulated-FP-conditional-branch problem. As you
note, it may be a load or store (though not a branch), so it needs to be
done in the user's MM context, and the user stack has nice properties of
being intrinsically per-thread and re-entrancy tolerant.
Post by David Daney
Because of this, the de facto MIPS Linux userspace ABI requires that
the userspace thread have an executable stack. It is de facto,
because it is not written anywhere that this must be the case, but it
is nevertheless a requirement.
IIRC, when I first did the code, we already needed this for signal
trampolines. I just extended it. Is it no longer required for signal
support? If not, how are signal trampolines now done, and could this
not be re-extended to the FP branch delay slot emulation problem?
Post by David Daney
How do we get MIPS Linux to use a non-executable stack in the face of
the FPU emulation problem?
Since userspace desires to change the ABI, put some of the onus on the
userspace code. Any userspace thread desiring a non-executable stack,
must allocate a 4-byte aligned area at least 8 bytes long with that
has read/write/execute permissions and pass the address of that area
to the kernel with the new sys_set_fpuemul_xol_area system call.
This is similar to how we require userspace to notify the kernel of
the value of the thread local pointer.
It's easy for me to criticise, since I'm no longer responsible for
maintenance, but I hope you'll excuse me for commenting that, while this
seems like a small enough and neat enough patch per se, I find it
disagreeable to break legacy binaries and to see a mechanism whose name
and implementation is so strictly tied to the one purpose. Even if it's
only used for the FP delay slot emulation today, shouldn't it be
designed/coded/documented as a sort of generic trampoline scratchpad?
And shouldn't we try to design things so that legacy code with FP but no
new magic system call "just works"? e.g. auto-allocate and initialize
the space for a thread the first time it actually needs to emulate an FP
branch?

/K.
Post by David Daney
---
reattempting to a reduced list of recipients.
This patch has only been compile tested, and lacks the userspace
component. It is presented as an alternate approach to the recently
http://www.linux-mips.org/archives/linux-mips/2014-10/msg00024.html
arch/mips/include/asm/thread_info.h | 2 ++
arch/mips/include/uapi/asm/unistd.h | 15 +++++++++------
arch/mips/kernel/process.c | 1 +
arch/mips/kernel/scall32-o32.S | 1 +
arch/mips/kernel/scall64-64.S | 1 +
arch/mips/kernel/scall64-n32.S | 1 +
arch/mips/kernel/scall64-o32.S | 1 +
arch/mips/kernel/syscall.c | 8 ++++++++
arch/mips/math-emu/dsemul.c | 11 +++++++----
9 files changed, 31 insertions(+), 10 deletions(-)
diff --git a/arch/mips/include/asm/thread_info.h b/arch/mips/include/asm/thread_info.h
index 7de8658..20d47f6 100644
--- a/arch/mips/include/asm/thread_info.h
+++ b/arch/mips/include/asm/thread_info.h
@@ -26,6 +26,7 @@ struct thread_info {
struct exec_domain *exec_domain; /* execution domain */
unsigned long flags; /* low level flags */
unsigned long tp_value; /* thread pointer */
+ unsigned long fpu_emul_xol; /* FPU emul eXecute Out of Line VA */
__u32 cpu; /* current CPU */
int preempt_count; /* 0 => preemptable, <0 => BUG */
@@ -46,6 +47,7 @@ struct thread_info {
.task = &tsk, \
.exec_domain = &default_exec_domain, \
.flags = _TIF_FIXADE, \
+ .fpu_emul_xol = ~0ul, \
.cpu = 0, \
.preempt_count = INIT_PREEMPT_COUNT, \
.addr_limit = KERNEL_DS, \
diff --git a/arch/mips/include/uapi/asm/unistd.h b/arch/mips/include/uapi/asm/unistd.h
index fdb4923..f1270ee 100644
--- a/arch/mips/include/uapi/asm/unistd.h
+++ b/arch/mips/include/uapi/asm/unistd.h
@@ -375,16 +375,17 @@
#define __NR_seccomp (__NR_Linux + 352)
#define __NR_getrandom (__NR_Linux + 353)
#define __NR_memfd_create (__NR_Linux + 354)
+#define __NR_set_fpuemul_xol_area (__NR_Linux + 355)
/*
* Offset of the last Linux o32 flavoured syscall
*/
-#define __NR_Linux_syscalls 354
+#define __NR_Linux_syscalls 355
#endif /* _MIPS_SIM == _MIPS_SIM_ABI32 */
#define __NR_O32_Linux 4000
-#define __NR_O32_Linux_syscalls 354
+#define __NR_O32_Linux_syscalls 355
#if _MIPS_SIM == _MIPS_SIM_ABI64
@@ -707,16 +708,17 @@
#define __NR_seccomp (__NR_Linux + 312)
#define __NR_getrandom (__NR_Linux + 313)
#define __NR_memfd_create (__NR_Linux + 314)
+#define __NR_set_fpuemul_xol_area (__NR_Linux + 315)
/*
* Offset of the last Linux 64-bit flavoured syscall
*/
-#define __NR_Linux_syscalls 314
+#define __NR_Linux_syscalls 315
#endif /* _MIPS_SIM == _MIPS_SIM_ABI64 */
#define __NR_64_Linux 5000
-#define __NR_64_Linux_syscalls 314
+#define __NR_64_Linux_syscalls 315
#if _MIPS_SIM == _MIPS_SIM_NABI32
@@ -1043,15 +1045,16 @@
#define __NR_seccomp (__NR_Linux + 316)
#define __NR_getrandom (__NR_Linux + 317)
#define __NR_memfd_create (__NR_Linux + 318)
+#define __NR_set_fpuemul_xol_area (__NR_Linux + 319)
/*
* Offset of the last N32 flavoured syscall
*/
-#define __NR_Linux_syscalls 318
+#define __NR_Linux_syscalls 319
#endif /* _MIPS_SIM == _MIPS_SIM_NABI32 */
#define __NR_N32_Linux 6000
-#define __NR_N32_Linux_syscalls 318
+#define __NR_N32_Linux_syscalls 319
#endif /* _UAPI_ASM_UNISTD_H */
diff --git a/arch/mips/kernel/process.c b/arch/mips/kernel/process.c
index 636b074..6dde6bb 100644
--- a/arch/mips/kernel/process.c
+++ b/arch/mips/kernel/process.c
@@ -151,6 +151,7 @@ int copy_thread(unsigned long clone_flags, unsigned long usp,
if (clone_flags & CLONE_SETTLS)
ti->tp_value = regs->regs[7];
+ ti->fpu_emul_xol = ~0ul;
return 0;
}
diff --git a/arch/mips/kernel/scall32-o32.S b/arch/mips/kernel/scall32-o32.S
index 744cd10..8c19a39 100644
--- a/arch/mips/kernel/scall32-o32.S
+++ b/arch/mips/kernel/scall32-o32.S
@@ -579,3 +579,4 @@ EXPORT(sys_call_table)
PTR sys_seccomp
PTR sys_getrandom
PTR sys_memfd_create
+ PTR sys_set_fpuemul_xol_area /* 4355 */
diff --git a/arch/mips/kernel/scall64-64.S b/arch/mips/kernel/scall64-64.S
index 002b1bc..0b9f72e 100644
--- a/arch/mips/kernel/scall64-64.S
+++ b/arch/mips/kernel/scall64-64.S
@@ -434,4 +434,5 @@ EXPORT(sys_call_table)
PTR sys_seccomp
PTR sys_getrandom
PTR sys_memfd_create
+ PTR sys_set_fpuemul_xol_area /* 5315 */
.size sys_call_table,.-sys_call_table
diff --git a/arch/mips/kernel/scall64-n32.S b/arch/mips/kernel/scall64-n32.S
index ca6cbbe..48f1760 100644
--- a/arch/mips/kernel/scall64-n32.S
+++ b/arch/mips/kernel/scall64-n32.S
@@ -427,4 +427,5 @@ EXPORT(sysn32_call_table)
PTR sys_seccomp
PTR sys_getrandom
PTR sys_memfd_create
+ PTR sys_set_fpuemul_xol_area
.size sysn32_call_table,.-sysn32_call_table
diff --git a/arch/mips/kernel/scall64-o32.S b/arch/mips/kernel/scall64-o32.S
index 9e10d11..60def68 100644
--- a/arch/mips/kernel/scall64-o32.S
+++ b/arch/mips/kernel/scall64-o32.S
@@ -564,4 +564,5 @@ EXPORT(sys32_call_table)
PTR sys_seccomp
PTR sys_getrandom
PTR sys_memfd_create
+ PTR sys_set_fpuemul_xol_area /* 4355 */
.size sys32_call_table,.-sys32_call_table
diff --git a/arch/mips/kernel/syscall.c b/arch/mips/kernel/syscall.c
index 4a4f9dd..5f9d9e8 100644
--- a/arch/mips/kernel/syscall.c
+++ b/arch/mips/kernel/syscall.c
@@ -96,6 +96,14 @@ SYSCALL_DEFINE1(set_thread_area, unsigned long, addr)
return 0;
}
+SYSCALL_DEFINE1(set_fpuemul_xol_area, unsigned long, addr)
+{
+ struct thread_info *ti = task_thread_info(current);
+
+ ti->fpu_emul_xol = addr;
+ return 0;
+}
+
static inline int mips_atomic_set(unsigned long addr, unsigned long new)
{
unsigned long old, tmp;
diff --git a/arch/mips/math-emu/dsemul.c b/arch/mips/math-emu/dsemul.c
index 4f514f3..bf4ff61 100644
--- a/arch/mips/math-emu/dsemul.c
+++ b/arch/mips/math-emu/dsemul.c
@@ -34,6 +34,7 @@ struct emuframe {
int mips_dsemul(struct pt_regs *regs, mips_instruction ir, unsigned long cpc)
{
extern asmlinkage void handle_dsemulret(void);
+ struct thread_info *ti = task_thread_info(current);
struct emuframe __user *fr;
int err;
@@ -64,10 +65,12 @@ int mips_dsemul(struct pt_regs *regs, mips_instruction ir, unsigned long cpc)
* branches, but gives us a cleaner interface to the exception
* handler (single entry point).
*/
-
- /* Ensure that the two instructions are in the same cache line */
- fr = (struct emuframe __user *)
- ((regs->regs[29] - sizeof(struct emuframe)) & ~0x7);
+ if (ti->fpu_emul_xol != ~0ul)
+ fr = (struct emuframe __user *)ti->fpu_emul_xol;
+ else
+ /* Ensure that the two instructions are in the same cache line */
+ fr = (struct emuframe __user *)
+ ((regs->regs[29] - sizeof(struct emuframe)) & ~0x7);
/* Verify that the stack pointer is not competely insane */
if (unlikely(!access_ok(VERIFY_WRITE, fr, sizeof(struct emuframe))))
Rich Felker
2014-10-07 01:38:47 UTC
Post by Kevin D. Kissell
Post by David Daney
In order for MIPS to be able to support a non-executable stack, we
need to supply a method to specify a userspace area that can be used
for executing emulated branch delay slot instructions.
We add a new system call, sys_set_fpuemul_xol_area so that userspace
threads that are using the FPU can specify the location of the FPU
emulation out of line execution area.
MIPS floating point support requires that any instruction that cannot
be directly executed by the FPU, be emulated by the kernel. Part of
this emulation involves executing non-FPU instructions that fall in
the delay slots of FP branch instructions. Since the beginning of
MIPS/Linux time, this has been done by placing the instructions on the
userspace thread stack, and executing them there, as the instructions
must be executed in the MM context of the thread receiving the
emulation.
Well, actually it doesn't go back to the beginning of MIPS/Linux
time - I was the b*astard who took the user-mode functional emulator
from Algorithmics and got it to work in the MIPS Linux kernel
context, some time in the early 2000s. It was all pretty
straightforward, except for the
delay-slot-of-an-emulated-FP-conditional-branch problem. As you
note, it may be a load or store (though not a branch), so it needs
to be done in the user's MM context, and the user stack has nice
properties of being intrinsically per-thread and re-entrancy
tolerant.
If the space of possible instructions that need to run in the user's
MM context is sufficiently small, perhaps we could emulate the rest in
kernelspace and have a fixed code mapping exposed to userspace
containing each possible MM-context-dependent instruction combination.

As an alternative, the kernel could expose emulator code to run in
userspace as part of the vdso or other magic kernel-provided pages,
and this code would be capable of emulating arbitrary instructions,
which would of course take place in the user MM context.

This does not solve the problem for hardware with custom instructions,
but I still believe it's totally reasonable to say that the ABI does
not allow putting custom instructions in delay slots for coprocessor
branches.
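To make the first suggestion above a little more concrete, here is a minimal sketch of the classification step it would need: deciding from the major opcode whether a delay-slot instruction is a plain load or store (and so has to run against the user's address space) or something that could, in principle, be emulated entirely in the kernel. The function names are hypothetical, and the opcode list is an illustrative pre-R6 MIPS32 subset, not a complete decoder and not code from any posted patch.

```c
#include <stdbool.h>
#include <stdint.h>

/* The major opcode occupies bits 31..26 of a MIPS32 instruction word. */
static inline unsigned mips_major_opcode(uint32_t insn)
{
    return insn >> 26;
}

/*
 * Hypothetical helper: return true if insn is one of the common integer
 * or FP loads/stores. Illustrative subset only; it ignores COP2,
 * 64-bit-only and implementation-specific encodings.
 */
bool delay_slot_touches_memory(uint32_t insn)
{
    switch (mips_major_opcode(insn)) {
    case 0x20: /* lb   */
    case 0x21: /* lh   */
    case 0x22: /* lwl  */
    case 0x23: /* lw   */
    case 0x24: /* lbu  */
    case 0x25: /* lhu  */
    case 0x26: /* lwr  */
    case 0x28: /* sb   */
    case 0x29: /* sh   */
    case 0x2a: /* swl  */
    case 0x2b: /* sw   */
    case 0x2e: /* swr  */
    case 0x31: /* lwc1 */
    case 0x35: /* ldc1 */
    case 0x39: /* swc1 */
    case 0x3d: /* sdc1 */
        return true;
    default:
        return false;
    }
}
```

For example, 0x8fa80000 (lw $t0, 0($sp)) would be classified as touching memory, while 0x27bdfff8 (addiu $sp, $sp, -8) would not. Anything the classifier does not recognise would still need some fallback, which is where the custom-instruction caveat comes in.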
Post by Kevin D. Kissell
Post by David Daney
Because of this, the de facto MIPS Linux userspace ABI requires that
the userspace thread have an executable stack. It is de facto,
because it is not written anywhere that this must be the case, but it
is never the less a requirement.
IIRC, when I first did the code, we already needed this for signal
trampolines. I just extended it. Is it no longer required for
signal support? If not, how are signal trampolines now done, and
could this not be re-extended to the FP branch delay slot emulation
problem?
Signal trampolines were nonsense to begin with. The code needed is
fixed, not variable per-signal-instance, so it can be provided by libc
or by the kernel in the vdso page or similar.
Post by Kevin D. Kissell
Post by David Daney
How do we get MIPS Linux to use a non-executable stack in the face of
the FPU emulation problem?
Since userspace desires to change the ABI, put some of the onus on the
userspace code. Any userspace thread desiring a non-executable stack,
must allocate a 4-byte aligned area at least 8 bytes long with that
has read/write/execute permissions and pass the address of that area
to the kernel with the new sys_set_fpuemul_xol_area system call.
This is similar to how we require userspace to notify the kernel of
the value of the thread local pointer.
It's easy for me to criticise, since I'm no longer responsible for
maintenance, but I hope you'll excuse me for commenting that, while
this seems like a small enough and neat enough patch per se, I find
it disagreeable to break legacy binaries and to see a mechanism
whose name and implementation is so strictly tied to the one
purpose. Even if it's only used for the FP delay slot emulation
today, shouldn't it be designed/coded/documented as a sort of
generic trampoline scratchpad? And shouldn't we try to design
things so that legacy code with FP but no new magic system call
"just works"? e.g. auto-allocate and initialize the space for a
thread the first time it actually needs to emulate an FP branch?
"First time it actually needs to emulate" does not work, since it may
be impossible to allocate at that time, and then there's no way the
program can proceed. The allocation must be done at a time when you
can report failure, which means at the time of execve (for the main
thread) and clone (for other threads).

Rich
David Daney
2014-10-07 04:32:45 UTC
Post by Kevin D. Kissell
Post by David Daney
In order for MIPS to be able to support a non-executable stack, we
need to supply a method to specify a userspace area that can be used
for executing emulated branch delay slot instructions.
We add a new system call, sys_set_fpuemul_xol_area so that userspace
threads that are using the FPU can specify the location of the FPU
emulation out of line execution area.
MIPS floating point support requires that any instruction that cannot
be directly executed by the FPU, be emulated by the kernel. Part of
this emulation involves executing non-FPU instructions that fall in
the delay slots of FP branch instructions. Since the beginning of
MIPS/Linux time, this has been done by placing the instructions on the
userspace thread stack, and executing them there, as the instructions
must be executed in the MM context of the thread receiving the
emulation.
Well, actually it doesn't go back to the beginning of MIPS/Linux time
- I was the b*astard who took the user-mode functional emulator from
Algorithmics and got it to work in the MIPS Linux kernel context, some
time in the early 2000s. It was all pretty straightforward, except
for the delay-slot-of-an-emulated-FP-conditional-branch problem. As
you note, it may be a load or store (though not a branch), so it needs
to be done in the user's MM context, and the user stack has nice
properties of being intrinsically per-thread and re-entrancy tolerant.
Post by David Daney
Because of this, the de facto MIPS Linux userspace ABI requires that
the userspace thread have an executable stack. It is de facto,
because it is not written anywhere that this must be the case, but it
is never the less a requirement.
IIRC, when I first did the code, we already needed this for signal
trampolines. I just extended it. Is it no longer required for signal
support? If not, how are signal trampolines now done, and could this
not be re-extended to the FP branch delay slot emulation problem?
I moved signal trampolines off the stack quite a few years ago. This is
the only thing blocking non-executable stack.

The problem with the FP branch delay slot emulation is that the code
that needs to be executed varies. The signal trampoline code is known
at kernel build time.
Post by Kevin D. Kissell
Post by David Daney
How do we get MIPS Linux to use a non-executable stack in the face of
the FPU emulation problem?
Since userspace desires to change the ABI, put some of the onus on the
userspace code. Any userspace thread desiring a non-executable stack,
must allocate a 4-byte aligned area at least 8 bytes long with that
has read/write/execute permissions and pass the address of that area
to the kernel with the new sys_set_fpuemul_xol_area system call.
This is similar to how we require userspace to notify the kernel of
the value of the thread local pointer.
It's easy for me to criticise, since I'm no longer responsible for
maintenance, but I hope you'll excuse me for commenting that, while
this seems like a small enough and neat enough patch per se, I find
it disagreeable to break legacy binaries
It doesn't break legacy binaries. They continue to use an executable
stack and the emulation is done there.

This would only change new binaries that explicitly ask for a
non-executable stack.
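Since the patch was posted without its userspace half, here is a rough sketch of what an opting-in thread might do: map a small read/write/execute region and hand its address to the kernel. This is an assumption about usage, not tested code. xol_area_alloc() and xol_area_register() are hypothetical names, and __NR_set_fpuemul_xol_area only exists in headers with the patch applied, so the sketch falls back to an invalid syscall number that a kernel without the patch would be expected to reject with ENOSYS.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <stddef.h>
#include <sys/mman.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Hypothetical: only defined by headers that carry the proposed patch. */
#ifndef __NR_set_fpuemul_xol_area
#define __NR_set_fpuemul_xol_area (-1L)
#endif

/*
 * Map one page with read/write/execute permissions. A page is far more
 * than the 8 bytes the proposal asks for, but mmap() works in pages and
 * the result is page aligned, which also satisfies the 4-byte alignment.
 * Returns NULL on failure so the caller can report it up front.
 */
void *xol_area_alloc(void)
{
    long page = sysconf(_SC_PAGESIZE);
    void *p;

    if (page <= 0)
        return NULL;

    p = mmap(NULL, (size_t)page, PROT_READ | PROT_WRITE | PROT_EXEC,
             MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    return p == MAP_FAILED ? NULL : p;
}

/*
 * Tell the kernel where the calling thread's out-of-line area is.
 * Returns 0 on success or a negative errno value on failure.
 */
int xol_area_register(void *area)
{
    if (syscall(__NR_set_fpuemul_xol_area, (unsigned long)area) == -1)
        return -errno;
    return 0;
}
```

Each thread would need its own area and its own call, much as each thread sets its own thread pointer. Doing the allocation at thread start means a failure can be reported before any FP branch needs emulating.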
Post by Kevin D. Kissell
and to see a mechanism whose name and implementation is so strictly
tied to the one purpose. Even if it's only used for the FP delay slot
emulation today, shouldn't it be designed/coded/documented as a sort
of generic trampoline scratchpad? And shouldn't we try to design
things so that legacy code with FP but no new magic system call "just
works"? e.g. auto-allocate and initialize the space for a thread the
first time it actually needs to emulate an FP branch?
The binaries have to be tagged as non-executable stack because GCC
can, and does, generate trampolines on the stack as part of its
normal code generation strategy.

That said, there are many problems with both the current code, and my
proposal.

The main issue, as mentioned by another commenter, is the problem of
signals and nested emulations.

If the emulated instruction raises a synchronous exception that is
converted to a signal, what is the EPC in the register state on the
stack? Should it be the original location of the instruction, or the
out-of-line location used by emulation? Are there userspace runtime
systems that care about this?

If we are emulating on the stack, a signal stack state could clobber the
emulation location.

If the kernel automatically allocated the emulation locations, what
would happen if there were a signal that interrupted the emulation, and
the signal handler did a longjmp to somewhere else? How would we clean
up the now unused emulation memory allocations?
James Hogan
2014-10-07 11:53:13 UTC
Post by David Daney
If the kernel automatically allocated the emulation locations, what
would happen if there were a signal that interrupted the emulation, and
the signal handler did a longjmp to somewhere else? How would we clean
up the now unused emulation memory allocations?
AFAICT, Leonid's implementation also has this problem, and that has a
separate stack of emuframes per thread managed completely by the kernel.

Essentially the kernel doesn't manage the stack, userland does, and
userland can choose to skip over sigframes and emuframes with siglongjmp
without telling the kernel.

Userland can even switch between contexts (which includes stack) with
setcontext (coroutines etc) which breaks the assumption in Leonid's
patches that emuframes will be completed in reverse order to them being
started, again demonstrating that it is essentially userland that
manages the stack.

I think any attempt by the kernel to keep track of user stacks (e.g. by
storing a stack pointer along with the emuframe so that unused emuframes
can be discarded later when stack pointer goes high again) will be
foiled by setcontext.

Hmm, I can't see a way forward that doesn't involve invasive userland
handling & ABI changes other than giving up with non-executable stacks
or limiting permitted instructions in delay slots to those Linux knows
how to emulate directly.

Cheers
James
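As a toy illustration of why stack-pointer tracking is fragile, the following model records the user stack pointer when an emuframe is started and treats a frame as abandoned once the current stack pointer is above the recorded one. Every name here is hypothetical; this is a simplified userspace model of the heuristic described above, not code from Leonid's patches or from the kernel.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_EMUFRAMES 8

/* One tracked emulation frame: the user sp when emulation began. */
struct tracked_emuframe {
    uintptr_t sp_at_start;
    bool live;
};

struct emuframe_table {
    struct tracked_emuframe slot[MAX_EMUFRAMES];
};

/* Record a new emulation; returns the slot index, or -1 if full. */
int emuframe_begin(struct emuframe_table *t, uintptr_t sp)
{
    for (int i = 0; i < MAX_EMUFRAMES; i++) {
        if (!t->slot[i].live) {
            t->slot[i].sp_at_start = sp;
            t->slot[i].live = true;
            return i;
        }
    }
    return -1;
}

/*
 * Heuristic: the stack grows down, so a frame whose recorded sp is
 * below the current sp is presumed to have been unwound past.
 * Returns the number of frames discarded.
 */
size_t emuframe_reap(struct emuframe_table *t, uintptr_t cur_sp)
{
    size_t reaped = 0;

    for (int i = 0; i < MAX_EMUFRAMES; i++) {
        if (t->slot[i].live && cur_sp > t->slot[i].sp_at_start) {
            t->slot[i].live = false;
            reaped++;
        }
    }
    return reaped;
}
```

The heuristic does the right thing after a siglongjmp that unwinds the same stack. It goes wrong when a frame is started on one stack (say sp = 0x7fff0000) and the thread then uses setcontext to run a coroutine whose stack happens to sit at a higher address (say sp = 0x7fff8000): emuframe_reap() discards the first frame even though the original context can still be resumed.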
James Hogan
2014-10-07 12:22:18 UTC
Post by James Hogan
Post by David Daney
If the kernel automatically allocated the emulation locations, what
would happen if there were a signal that interrupted the emulation, and
the signal handler did a longjmp to somewhere else? How would we clean
up the now unused emulation memory allocations?
AFAICT, Leonid's implementation also has this problem, and that has a
separate stack of emuframes per thread managed completely by the kernel.
Essentially the kernel doesn't manage the stack, userland does, and
userland can choose to skip over sigframes and emuframes with siglongjmp
without telling the kernel.
Userland can even switch between contexts (which includes stack) with
setcontext (coroutines etc) which breaks the assumption in Leonid's
patches that emuframes will be completed in reverse order to them being
started, again demonstrating that it is essentially userland that
manages the stack.
I think any attempt by the kernel to keep track of user stacks (e.g. by
storing a stack pointer along with the emuframe so that unused emuframes
can be discarded later when stack pointer goes high again) will be
foiled by setcontext.
Hmm, I can't see a way forward that doesn't involve invasive userland
handling & ABI changes other than giving up with non-executable stacks
or limiting permitted instructions in delay slots to those Linux knows
how to emulate directly.
Would it work for a signal encountered during branch delay slot
emulation (maybe where the PC is pointing at that magic location the
kernel uses for emulation) to be treated as a return from emulation, but
leaving the user PC pointing to the original branch (with Cause.BD=1 I
suppose) prior to handling the signal, so that no more than one emuframe
is needed by each thread at a time?

Cheers
James
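A rough model of the idea in the question above, with all types and names hypothetical: if a signal arrives while the user PC sits on the out-of-line slot, cancel the emulation and report the original branch as the interrupted PC with the BD flag set, so the branch and its delay slot are simply re-executed after the handler returns. The sketch only covers the case where the delay-slot instruction has not yet executed. What to do once it has run, but control has not yet returned to the kernel, is left open here just as it is in the question.

```c
#include <stdbool.h>

/* Hypothetical, simplified stand-ins for the real register state. */
struct model_regs {
    unsigned long epc;   /* user PC at the time of the signal */
    bool cause_bd;       /* models Cause.BD */
};

struct model_emuframe {
    unsigned long xol_addr;   /* where the delay-slot copy lives */
    unsigned long branch_pc;  /* address of the original FP branch */
    bool active;              /* an emulation is in flight */
};

/*
 * If the signal interrupted a pending delay-slot emulation, rewind to
 * the original branch and release the frame. Returns true if a rewind
 * happened, false if the register state was left untouched.
 */
bool rewind_emulation_for_signal(struct model_regs *regs,
                                 struct model_emuframe *fr)
{
    if (!fr->active || regs->epc != fr->xol_addr)
        return false;

    regs->epc = fr->branch_pc;  /* signal frame shows the branch itself */
    regs->cause_bd = true;      /* next instruction is a delay slot */
    fr->active = false;         /* only one frame per thread is needed */
    return true;
}
```

Under this model a handler that calls siglongjmp or setcontext never leaves a stale frame behind, because the frame is already released by the time the handler runs.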