Discussion:
[PATCH v2 0/3] MIPS executable stack protection
Leonid Yegoshin
2014-10-09 20:00:01 UTC
The following series implements executable stack protection on MIPS.

It sets up a per-thread 'VDSO' page with the appropriate TLB support.
The page is write-protected from user space and is maintained via its
kernel VA. MIPS FPU emulation is moved to this new page, and the stack
loses its execute permission, as all data pages already do in the default
setup during ELF binary initialization. The real protection is controlled
by GLIBC; it can now make the stack protected as is done on other
architectures, and I learned today that the GLIBC team is ready for this.

Note: actual execute protection depends on HW capability, of course.

This patch series is required for MIPS32/64 R2 emulation on the MIPS R6
architecture. Without it, 'ssh-keygen' crashes pretty quickly when it
attempts to execute an instruction on the stack.

v2 changes:
- Added an optimization during mmap switch - don't switch if the same
thread is rescheduled and no other thread intervenes (Peter Zijlstra)
- Fixed uMIPS support (Paul Burton)
- Added unwinding of the VDSO emulation stack at signal handler invocation,
hiding the emulation page (noted by Andy Lutomirski in comments on another patch)

---

Leonid Yegoshin (3):
MIPS: mips_flush_cache_range is added
MIPS: Setup an instruction emulation in VDSO protected page instead of user stack
MIPS: set stack/data protection as non-executable


arch/mips/include/asm/cacheflush.h | 3 +
arch/mips/include/asm/fpu_emulator.h | 2
arch/mips/include/asm/mmu.h | 3 +
arch/mips/include/asm/page.h | 2
arch/mips/include/asm/processor.h | 2
arch/mips/include/asm/switch_to.h | 14 +++
arch/mips/include/asm/thread_info.h | 3 +
arch/mips/include/asm/tlbmisc.h | 1
arch/mips/include/asm/vdso.h | 3 +
arch/mips/kernel/process.c | 7 ++
arch/mips/kernel/signal.c | 4 +
arch/mips/kernel/vdso.c | 41 +++++++++
arch/mips/math-emu/cp1emu.c | 8 +-
arch/mips/math-emu/dsemul.c | 153 ++++++++++++++++++++++++++++------
arch/mips/mm/c-octeon.c | 8 ++
arch/mips/mm/c-r3k.c | 8 ++
arch/mips/mm/c-r4k.c | 43 ++++++++++
arch/mips/mm/c-tx39.c | 9 ++
arch/mips/mm/cache.c | 4 +
arch/mips/mm/fault.c | 5 +
arch/mips/mm/tlb-r4k.c | 42 +++++++++
21 files changed, 333 insertions(+), 32 deletions(-)
--
Signed-off-by: Leonid Yegoshin <***@imgtec.com>
Leonid Yegoshin
2014-10-09 20:00:10 UTC
A new function, mips_flush_cache_range(), is added.
It flushes the D-cache via the kernel VA and the I-cache via the user VA,
which matters on systems with cache aliasing.
It can be used to flush a short sequence of newly written code out to
user space and is especially useful in ptrace() and dsemul().
Today ptrace() flushes a full page via flush_cache_page().
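
A minimal caller-side sketch (illustrative only, not part of the diff below;
'kaddr', 'uaddr', 'page', 'vma' and 'len' are placeholder names):

	/* Assuming 'len' bytes of code were just written through the kernel
	 * mapping 'kaddr' of 'page', and the user will execute them at
	 * 'uaddr' inside 'vma', only that range needs flushing: D-cache via
	 * the kernel VA, I-cache via the user VA (the r4k implementation
	 * skips the I-cache part when !(vma->vm_flags & VM_EXEC)).
	 */
	mips_flush_data_cache_range(vma, uaddr, page,
				    (unsigned long)kaddr, len);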

Signed-off-by: Leonid Yegoshin <***@imgtec.com>
---
arch/mips/include/asm/cacheflush.h | 3 +++
arch/mips/mm/c-octeon.c | 8 +++++++
arch/mips/mm/c-r3k.c | 8 +++++++
arch/mips/mm/c-r4k.c | 43 ++++++++++++++++++++++++++++++++++++
arch/mips/mm/c-tx39.c | 9 ++++++++
arch/mips/mm/cache.c | 4 +++
6 files changed, 75 insertions(+), 0 deletions(-)

diff --git a/arch/mips/include/asm/cacheflush.h b/arch/mips/include/asm/cacheflush.h
index e08381a..8305241 100644
--- a/arch/mips/include/asm/cacheflush.h
+++ b/arch/mips/include/asm/cacheflush.h
@@ -94,6 +94,9 @@ extern void (*flush_cache_sigtramp)(unsigned long addr);
extern void (*flush_icache_all)(void);
extern void (*local_flush_data_cache_page)(void * addr);
extern void (*flush_data_cache_page)(unsigned long addr);
+extern void (*mips_flush_data_cache_range)(struct vm_area_struct *vma,
+ unsigned long vaddr, struct page *page, unsigned long addr,
+ unsigned long size);

/*
* This flag is used to indicate that the page pointed to by a pte
diff --git a/arch/mips/mm/c-octeon.c b/arch/mips/mm/c-octeon.c
index 05b1d7c..38349d1 100644
--- a/arch/mips/mm/c-octeon.c
+++ b/arch/mips/mm/c-octeon.c
@@ -178,6 +178,13 @@ static void octeon_flush_kernel_vmap_range(unsigned long vaddr, int size)
BUG();
}

+static void octeon_flush_data_cache_range(struct vm_area_struct *vma,
+ unsigned long vaddr, struct page *page, unsigned long addr,
+ unsigned long size)
+{
+ octeon_flush_cache_page(vma, addr, page_to_pfn(page));
+}
+
/**
* Probe Octeon's caches
*
@@ -292,6 +299,7 @@ void octeon_cache_init(void)
flush_cache_sigtramp = octeon_flush_cache_sigtramp;
flush_icache_all = octeon_flush_icache_all;
flush_data_cache_page = octeon_flush_data_cache_page;
+ mips_flush_data_cache_range = octeon_flush_data_cache_range;
flush_icache_range = octeon_flush_icache_range;
local_flush_icache_range = local_octeon_flush_icache_range;

diff --git a/arch/mips/mm/c-r3k.c b/arch/mips/mm/c-r3k.c
index 135ec31..93b4810 100644
--- a/arch/mips/mm/c-r3k.c
+++ b/arch/mips/mm/c-r3k.c
@@ -273,6 +273,13 @@ static void r3k_flush_data_cache_page(unsigned long addr)
{
}

+static void r3k_mips_flush_data_cache_range(struct vm_area_struct *vma,
+ unsigned long vaddr, struct page *page, unsigned long addr,
+ unsigned long size)
+{
+ r3k_flush_cache_page(vma, addr, page_to_pfn(page));
+}
+
static void r3k_flush_cache_sigtramp(unsigned long addr)
{
unsigned long flags;
@@ -322,6 +329,7 @@ void r3k_cache_init(void)
__flush_cache_all = r3k___flush_cache_all;
flush_cache_mm = r3k_flush_cache_mm;
flush_cache_range = r3k_flush_cache_range;
+ mips_flush_data_cache_range = r3k_mips_flush_data_cache_range;
flush_cache_page = r3k_flush_cache_page;
flush_icache_range = r3k_flush_icache_range;
local_flush_icache_range = r3k_flush_icache_range;
diff --git a/arch/mips/mm/c-r4k.c b/arch/mips/mm/c-r4k.c
index ad6ff7b..ee014e4 100644
--- a/arch/mips/mm/c-r4k.c
+++ b/arch/mips/mm/c-r4k.c
@@ -636,6 +636,48 @@ static void r4k_flush_data_cache_page(unsigned long addr)
r4k_on_each_cpu(local_r4k_flush_data_cache_page, (void *) addr);
}

+struct mips_flush_data_cache_range_args {
+ struct vm_area_struct *vma;
+ unsigned long vaddr;
+ unsigned long start;
+ unsigned long len;
+};
+
+static inline void local_r4k_mips_flush_data_cache_range(void *args)
+{
+ struct mips_flush_data_cache_range_args *f_args = args;
+ unsigned long vaddr = f_args->vaddr;
+ unsigned long start = f_args->start;
+ unsigned long len = f_args->len;
+ struct vm_area_struct * vma = f_args->vma;
+
+ blast_dcache_range(start, start + len);
+
+ if ((vma->vm_flags & VM_EXEC) && !cpu_has_ic_fills_f_dc) {
+ wmb();
+
+ /* vma is given for exec check only, mmap is current,
+ so - no non-current vma page flush, just user or kernel */
+ protected_blast_icache_range(vaddr, vaddr + len);
+ }
+}
+
+/* flush dirty kernel data and a corresponding user instructions (if needed).
+ used in copy_to_user_page() */
+static void r4k_mips_flush_data_cache_range(struct vm_area_struct *vma,
+ unsigned long vaddr, struct page *page, unsigned long start,
+ unsigned long len)
+{
+ struct mips_flush_data_cache_range_args args;
+
+ args.vma = vma;
+ args.vaddr = vaddr;
+ args.start = start;
+ args.len = len;
+
+ r4k_on_each_cpu(local_r4k_mips_flush_data_cache_range, (void *)&args);
+}
+
struct flush_icache_range_args {
unsigned long start;
unsigned long end;
@@ -1656,6 +1698,7 @@ void r4k_cache_init(void)
flush_icache_all = r4k_flush_icache_all;
local_flush_data_cache_page = local_r4k_flush_data_cache_page;
flush_data_cache_page = r4k_flush_data_cache_page;
+ mips_flush_data_cache_range = r4k_mips_flush_data_cache_range;
flush_icache_range = r4k_flush_icache_range;
local_flush_icache_range = local_r4k_flush_icache_range;

diff --git a/arch/mips/mm/c-tx39.c b/arch/mips/mm/c-tx39.c
index 8d909db..9316e92 100644
--- a/arch/mips/mm/c-tx39.c
+++ b/arch/mips/mm/c-tx39.c
@@ -230,6 +230,13 @@ static void tx39_flush_data_cache_page(unsigned long addr)
tx39_blast_dcache_page(addr);
}

+static void local_flush_data_cache_range(struct vm_area_struct *vma,
+ unsigned long vaddr, struct page *page, unsigned long addr,
+ unsigned long size)
+{
+ flush_cache_page(vma, addr, page_to_pfn(page));
+}
+
static void tx39_flush_icache_range(unsigned long start, unsigned long end)
{
if (end - start > dcache_size)
@@ -371,6 +378,7 @@ void tx39_cache_init(void)

flush_cache_sigtramp = (void *) tx39h_flush_icache_all;
local_flush_data_cache_page = (void *) tx39h_flush_icache_all;
+ mips_flush_data_cache_range = (void *) local_flush_data_cache_range;
flush_data_cache_page = (void *) tx39h_flush_icache_all;

_dma_cache_wback_inv = tx39h_dma_cache_wback_inv;
@@ -402,6 +410,7 @@ void tx39_cache_init(void)

flush_cache_sigtramp = tx39_flush_cache_sigtramp;
local_flush_data_cache_page = local_tx39_flush_data_cache_page;
+ mips_flush_data_cache_range = (void *) local_flush_data_cache_range;
flush_data_cache_page = tx39_flush_data_cache_page;

_dma_cache_wback_inv = tx39_dma_cache_wback_inv;
diff --git a/arch/mips/mm/cache.c b/arch/mips/mm/cache.c
index 7e3ea77..b7bdd01 100644
--- a/arch/mips/mm/cache.c
+++ b/arch/mips/mm/cache.c
@@ -44,10 +44,14 @@ void (*__invalidate_kernel_vmap_range)(unsigned long vaddr, int size);
void (*flush_cache_sigtramp)(unsigned long addr);
void (*local_flush_data_cache_page)(void * addr);
void (*flush_data_cache_page)(unsigned long addr);
+void (*mips_flush_data_cache_range)(struct vm_area_struct *vma,
+ unsigned long vaddr, struct page *page, unsigned long addr,
+ unsigned long size);
void (*flush_icache_all)(void);

EXPORT_SYMBOL_GPL(local_flush_data_cache_page);
EXPORT_SYMBOL(flush_data_cache_page);
+EXPORT_SYMBOL(mips_flush_data_cache_range);
EXPORT_SYMBOL(flush_icache_all);

#if defined(CONFIG_DMA_NONCOHERENT) || defined(CONFIG_DMA_MAYBE_COHERENT)
Leonid Yegoshin
2014-10-09 20:00:17 UTC
Historically, during FPU emulation MIPS runs the live BD-slot instruction on
the stack. This was needed because it was the only way to correctly handle
branch exceptions with unknown COP2 or ASE instructions in the BD-slot. Now
there is an eXecuteInhibit feature and it is desirable to protect the stack
from execution for security reasons.
This patch moves FPU emulation from the stack area to a VDSO-located page
which is write-protected from application access. The VDSO page itself is now
per-thread and its addresses and offsets are stored in thread_info.
A small stack of emulation blocks is supported, because nested traps are
possible when MIPS32/64 R6 emulation is mixed with FPU emulation.
A signal that happens while running in an emulation block is handled
properly - the EPC is changed either to the point before the emulated jump or
to the jump target address, depending on where the signal hit.

Explanation of the problem (current state before this patch):

If I set an eXecute-disabled stack during ELF binary initialisation, GLIBC
ignores it and may change the stack protection when a library is loaded. If
that library has an eXecute-disabled stack then everything is OK, but if this
section (PT_GNU_STACK) is simply missing, as is usual, then GLIBC applies its
own default == an eXecute-enabled stack.
So, ssh-keygen is built explicitly with an eXecute-disabled stack, but GLIBC
ignores that and sets the stack executable. Because of that, everything
works - FPU emulation and hacker tools alike.
However, if all *.so libraries carry an eXecute-disabled stack in their
PT_GNU_STACK section, GLIBC keeps the stack non-executable but things fail
later at FPU emulation.

These are two issues bound together, and to fix the incorrect GLIBC behaviour
(ignoring ssh-keygen's eXecute-disabled intention) they have to be split
apart. So, I made the kernel emulation protected and moved it out of the
stack.
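
For reference, the kernel side of this decision is roughly the following (a
simplified sketch of the generic ELF loader's PT_GNU_STACK handling, not part
of this patch; 'elf_ex' and 'elf_phdata' stand for the already-read ELF header
and program header table):

	struct elf_phdr *elf_ppnt = elf_phdata;
	int executable_stack = EXSTACK_DEFAULT;
	int i;

	for (i = 0; i < elf_ex->e_phnum; i++, elf_ppnt++)
		if (elf_ppnt->p_type == PT_GNU_STACK) {
			if (elf_ppnt->p_flags & PF_X)
				executable_stack = EXSTACK_ENABLE_X;
			else
				executable_stack = EXSTACK_DISABLE_X;
			break;
		}
	/*
	 * EXSTACK_DEFAULT (no PT_GNU_STACK at all) falls back to the
	 * architecture default stack flags, which are derived from
	 * VM_DATA_DEFAULT_FLAGS - the flags patch 3 of this series makes
	 * non-executable.
	 */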

Signed-off-by: Leonid Yegoshin <***@imgtec.com>
---
arch/mips/include/asm/fpu_emulator.h | 2
arch/mips/include/asm/mmu.h | 3 +
arch/mips/include/asm/processor.h | 2
arch/mips/include/asm/switch_to.h | 14 +++
arch/mips/include/asm/thread_info.h | 3 +
arch/mips/include/asm/tlbmisc.h | 1
arch/mips/include/asm/vdso.h | 3 +
arch/mips/kernel/process.c | 7 ++
arch/mips/kernel/signal.c | 4 +
arch/mips/kernel/vdso.c | 41 +++++++++
arch/mips/math-emu/cp1emu.c | 8 +-
arch/mips/math-emu/dsemul.c | 153 ++++++++++++++++++++++++++++------
arch/mips/mm/fault.c | 5 +
arch/mips/mm/tlb-r4k.c | 42 +++++++++
14 files changed, 257 insertions(+), 31 deletions(-)

diff --git a/arch/mips/include/asm/fpu_emulator.h b/arch/mips/include/asm/fpu_emulator.h
index 0195745..8ba644e 100644
--- a/arch/mips/include/asm/fpu_emulator.h
+++ b/arch/mips/include/asm/fpu_emulator.h
@@ -60,7 +60,7 @@ do { \
#endif /* CONFIG_DEBUG_FS */

extern int mips_dsemul(struct pt_regs *regs, mips_instruction ir,
- unsigned long cpc);
+ unsigned long cpc, unsigned long bpc, unsigned long r31);
extern int do_dsemulret(struct pt_regs *xcp);
extern int fpu_emulator_cop1Handler(struct pt_regs *xcp,
struct mips_fpu_struct *ctx, int has_fpu,
diff --git a/arch/mips/include/asm/mmu.h b/arch/mips/include/asm/mmu.h
index c436138..621a513 100644
--- a/arch/mips/include/asm/mmu.h
+++ b/arch/mips/include/asm/mmu.h
@@ -3,7 +3,10 @@

typedef struct {
unsigned long asid[NR_CPUS];
+ unsigned long vdso_asid[NR_CPUS];
+ struct page *vdso_page[NR_CPUS];
void *vdso;
+ struct vm_area_struct *vdso_vma;
} mm_context_t;

#endif /* __ASM_MMU_H */
diff --git a/arch/mips/include/asm/processor.h b/arch/mips/include/asm/processor.h
index 05f0843..c87b1da 100644
--- a/arch/mips/include/asm/processor.h
+++ b/arch/mips/include/asm/processor.h
@@ -40,7 +40,7 @@ extern unsigned int vced_count, vcei_count;
* A special page (the vdso) is mapped into all processes at the very
* top of the virtual memory space.
*/
-#define SPECIAL_PAGES_SIZE PAGE_SIZE
+#define SPECIAL_PAGES_SIZE (PAGE_SIZE * 2)

#ifdef CONFIG_32BIT
#ifdef CONFIG_KVM_GUEST
diff --git a/arch/mips/include/asm/switch_to.h b/arch/mips/include/asm/switch_to.h
index b928b6f..cda8889 100644
--- a/arch/mips/include/asm/switch_to.h
+++ b/arch/mips/include/asm/switch_to.h
@@ -17,6 +17,7 @@
#include <asm/dsp.h>
#include <asm/cop2.h>
#include <asm/msa.h>
+#include <asm/mmu_context.h>

struct task_struct;

@@ -104,6 +105,18 @@ do { \
disable_msa(); \
} while (0)

+static inline void flush_vdso_page(void)
+{
+ int cpu = raw_smp_processor_id();
+
+ if (current->mm && cpu_context(cpu, current->mm) &&
+ (current->mm->context.vdso_page[cpu] != current_thread_info()->vdso_page) &&
+ (current->mm->context.vdso_asid[cpu] == cpu_asid(cpu, current->mm))) {
+ local_flush_tlb_page(current->mm->mmap, (unsigned long)current->mm->context.vdso);
+ current->mm->context.vdso_asid[cpu] = 0;
+ }
+}
+
#define finish_arch_switch(prev) \
do { \
u32 __c0_stat; \
@@ -118,6 +131,7 @@ do { \
__restore_dsp(current); \
if (cpu_has_userlocal) \
write_c0_userlocal(current_thread_info()->tp_value); \
+ flush_vdso_page(); \
__restore_watch(); \
} while (0)

diff --git a/arch/mips/include/asm/thread_info.h b/arch/mips/include/asm/thread_info.h
index 7de8658..8d16003 100644
--- a/arch/mips/include/asm/thread_info.h
+++ b/arch/mips/include/asm/thread_info.h
@@ -34,6 +34,8 @@ struct thread_info {
* 0x7fffffff for user-thead
* 0xffffffff for kernel-thread
*/
+ unsigned long vdso_offset;
+ struct page *vdso_page;
struct restart_block restart_block;
struct pt_regs *regs;
};
@@ -49,6 +51,7 @@ struct thread_info {
.cpu = 0, \
.preempt_count = INIT_PREEMPT_COUNT, \
.addr_limit = KERNEL_DS, \
+ .vdso_page = NULL, \
.restart_block = { \
.fn = do_no_restart_syscall, \
}, \
diff --git a/arch/mips/include/asm/tlbmisc.h b/arch/mips/include/asm/tlbmisc.h
index 3a45228..abd7bf6 100644
--- a/arch/mips/include/asm/tlbmisc.h
+++ b/arch/mips/include/asm/tlbmisc.h
@@ -6,5 +6,6 @@
*/
extern void add_wired_entry(unsigned long entrylo0, unsigned long entrylo1,
unsigned long entryhi, unsigned long pagemask);
+int install_vdso_tlb(void);

#endif /* __ASM_TLBMISC_H */
diff --git a/arch/mips/include/asm/vdso.h b/arch/mips/include/asm/vdso.h
index cca56aa..77056fc 100644
--- a/arch/mips/include/asm/vdso.h
+++ b/arch/mips/include/asm/vdso.h
@@ -11,6 +11,9 @@

#include <linux/types.h>

+void mips_thread_vdso(struct thread_info *ti);
+void arch_release_thread_info(struct thread_info *info);
+void vdso_epc_adjust(struct pt_regs *xcp);

#ifdef CONFIG_32BIT
struct mips_vdso {
diff --git a/arch/mips/kernel/process.c b/arch/mips/kernel/process.c
index 636b074..762738a 100644
--- a/arch/mips/kernel/process.c
+++ b/arch/mips/kernel/process.c
@@ -42,6 +42,7 @@
#include <asm/isadep.h>
#include <asm/inst.h>
#include <asm/stacktrace.h>
+#include <asm/vdso.h>

#ifdef CONFIG_HOTPLUG_CPU
void arch_cpu_idle_dead(void)
@@ -59,6 +60,8 @@ void start_thread(struct pt_regs * regs, unsigned long pc, unsigned long sp)
{
unsigned long status;

+ mips_thread_vdso(current_thread_info());
+
/* New thread loses kernel privileges. */
status = regs->cp0_status & ~(ST0_CU0|ST0_CU1|ST0_FR|KU_MASK);
status |= KU_USER;
@@ -75,6 +78,7 @@ void start_thread(struct pt_regs * regs, unsigned long pc, unsigned long sp)

void exit_thread(void)
{
+ arch_release_thread_info(current_thread_info());
}

void flush_thread(void)
@@ -103,6 +107,9 @@ int copy_thread(unsigned long clone_flags, unsigned long usp,

preempt_enable();

+ ti->vdso_page = NULL;
+ mips_thread_vdso(ti);
+
/* set up new TSS. */
childregs = (struct pt_regs *) childksp - 1;
/* Put the stack after the struct pt_regs. */
diff --git a/arch/mips/kernel/signal.c b/arch/mips/kernel/signal.c
index 1d57605..9f8fd64 100644
--- a/arch/mips/kernel/signal.c
+++ b/arch/mips/kernel/signal.c
@@ -559,6 +559,10 @@ static void handle_signal(struct ksignal *ksig, struct pt_regs *regs)
regs->regs[0] = 0; /* Don't deal with this again. */
}

+ /* adjust emulation stack if signal happens during emulation */
+ if (current_thread_info()->vdso_page)
+ vdso_epc_adjust(regs);
+
if (sig_uses_siginfo(&ksig->ka))
ret = abi->setup_rt_frame(vdso + abi->rt_signal_return_offset,
ksig, regs, oldset);
diff --git a/arch/mips/kernel/vdso.c b/arch/mips/kernel/vdso.c
index 0f1af58..7b31bba 100644
--- a/arch/mips/kernel/vdso.c
+++ b/arch/mips/kernel/vdso.c
@@ -19,6 +19,8 @@

#include <asm/vdso.h>
#include <asm/uasm.h>
+#include <asm/cacheflush.h>
+#include <asm/tlbflush.h>

/*
* Including <asm/unistd.h> would give use the 64-bit syscall numbers ...
@@ -88,14 +90,18 @@ int arch_setup_additional_pages(struct linux_binprm *bprm, int uses_interp)

ret = install_special_mapping(mm, addr, PAGE_SIZE,
VM_READ|VM_EXEC|
- VM_MAYREAD|VM_MAYWRITE|VM_MAYEXEC,
+ VM_MAYREAD|VM_MAYEXEC,
&vdso_page);

if (ret)
goto up_fail;

mm->context.vdso = (void *)addr;
+ /* if cache aliasing - use a different cache flush later */
+ if (cpu_has_rixi && cpu_has_dc_aliases)
+ mm->context.vdso_vma = find_vma(mm,addr);

+ mips_thread_vdso(current_thread_info());
up_fail:
up_write(&mm->mmap_sem);
return ret;
@@ -107,3 +113,36 @@ const char *arch_vma_name(struct vm_area_struct *vma)
return "[vdso]";
return NULL;
}
+
+void mips_thread_vdso(struct thread_info *ti)
+{
+ struct page *vdso;
+ unsigned long addr;
+
+ if (cpu_has_rixi && ti->task->mm && !ti->vdso_page) {
+ vdso = alloc_page(GFP_USER);
+ if (!vdso)
+ return;
+ ti->vdso_page = vdso;
+ ti->vdso_offset = PAGE_SIZE;
+ addr = (unsigned long)page_address(vdso);
+ copy_page((void *)addr, (void *)page_address(vdso_page));
+ if (!cpu_has_ic_fills_f_dc)
+ flush_data_cache_page(addr);
+ /* any vma in mmap is used, just to get ASIDs back from mm */
+ local_flush_tlb_page(ti->task->mm->mmap,addr);
+ }
+}
+
+void arch_release_thread_info(struct thread_info *info)
+{
+ if (info->vdso_page) {
+ if (info->task->mm) {
+ /* any vma in mmap is used, just to get ASIDs */
+ local_flush_tlb_page(info->task->mm->mmap,(unsigned long)page_address(info->vdso_page));
+ info->task->mm->context.vdso_asid[smp_processor_id()] = 0;
+ }
+ __free_page(info->vdso_page);
+ info->vdso_page = NULL;
+ }
+}
diff --git a/arch/mips/math-emu/cp1emu.c b/arch/mips/math-emu/cp1emu.c
index 7a47277..c648b43 100644
--- a/arch/mips/math-emu/cp1emu.c
+++ b/arch/mips/math-emu/cp1emu.c
@@ -703,6 +703,8 @@ static int cop1Emulate(struct pt_regs *xcp, struct mips_fpu_struct *ctx,
struct mm_decoded_insn dec_insn, void *__user *fault_addr)
{
unsigned long contpc = xcp->cp0_epc + dec_insn.pc_inc;
+ unsigned long r31;
+ unsigned long s_epc;
unsigned int cond, cbit;
mips_instruction ir;
int likely, pc_inc;
@@ -720,6 +722,8 @@ static int cop1Emulate(struct pt_regs *xcp, struct mips_fpu_struct *ctx,
if (!cpu_has_mmips && dec_insn.micro_mips_mode)
unreachable();

+ s_epc = xcp->cp0_epc;
+ r31 = xcp->regs[31];
/* XXX NEC Vr54xx bug workaround */
if (delay_slot(xcp)) {
if (dec_insn.micro_mips_mode) {
@@ -998,7 +1002,7 @@ emul:
* Single step the non-CP1
* instruction in the dslot.
*/
- return mips_dsemul(xcp, ir, contpc);
+ return mips_dsemul(xcp, ir, contpc, s_epc, r31);
}
} else
contpc = (xcp->cp0_epc + (contpc << 2));
@@ -1042,7 +1046,7 @@ emul:
* Single step the non-cp1
* instruction in the dslot
*/
- return mips_dsemul(xcp, ir, contpc);
+ return mips_dsemul(xcp, ir, contpc, s_epc, r31);
} else if (likely) { /* branch not taken */
/*
* branch likely nullifies
diff --git a/arch/mips/math-emu/dsemul.c b/arch/mips/math-emu/dsemul.c
index 4f514f3..c521498 100644
--- a/arch/mips/math-emu/dsemul.c
+++ b/arch/mips/math-emu/dsemul.c
@@ -3,6 +3,7 @@
#include <asm/fpu_emulator.h>
#include <asm/inst.h>
#include <asm/mipsregs.h>
+#include <asm/vdso.h>
#include <asm/uaccess.h>

#include "ieee754.h"
@@ -29,13 +30,19 @@ struct emuframe {
mips_instruction badinst;
mips_instruction cookie;
unsigned long epc;
+ unsigned long bpc;
+ unsigned long r31;
};
+/* round structure size to N*8 to force a fit two instructions in a single cache line */
+#define EMULFRAME_ROUNDED_SIZE ((sizeof(struct emuframe) + 0x7) & ~0x7)

-int mips_dsemul(struct pt_regs *regs, mips_instruction ir, unsigned long cpc)
+int mips_dsemul(struct pt_regs *regs, mips_instruction ir, unsigned long cpc,
+ unsigned long bpc, unsigned long r31)
{
extern asmlinkage void handle_dsemulret(void);
struct emuframe __user *fr;
int err;
+ unsigned char *pg_addr;

if ((get_isa16_mode(regs->cp0_epc) && ((ir >> 16) == MM_NOP16)) ||
(ir == 0)) {
@@ -48,7 +55,7 @@ int mips_dsemul(struct pt_regs *regs, mips_instruction ir, unsigned long cpc)
pr_debug("dsemul %lx %lx\n", regs->cp0_epc, cpc);

/*
- * The strategy is to push the instruction onto the user stack
+ * The strategy is to push the instruction onto the user stack/VDSO page
* and put a trap after it which we can catch and jump to
* the required address any alternative apart from full
* instruction emulation!!.
@@ -65,36 +72,81 @@ int mips_dsemul(struct pt_regs *regs, mips_instruction ir, unsigned long cpc)
* handler (single entry point).
*/

- /* Ensure that the two instructions are in the same cache line */
- fr = (struct emuframe __user *)
- ((regs->regs[29] - sizeof(struct emuframe)) & ~0x7);
+ if (current_thread_info()->vdso_page) {
+ /*
+ * Use VDSO page and fill structure via kernel VA,
+ * user write is disabled
+ */
+ pg_addr = (unsigned char *)page_address(current_thread_info()->vdso_page);
+ fr = (struct emuframe __user *)
+ (pg_addr + current_thread_info()->vdso_offset -
+ EMULFRAME_ROUNDED_SIZE);
+
+ /* verify that we don't overflow into trampoline areas */
+ if ((unsigned char *)fr < (unsigned char *)(((struct mips_vdso *)pg_addr) + 1)) {
+ MIPS_FPU_EMU_INC_STATS(errors);
+ return SIGBUS;
+ }
+
+ current_thread_info()->vdso_offset -= EMULFRAME_ROUNDED_SIZE;

- /* Verify that the stack pointer is not competely insane */
- if (unlikely(!access_ok(VERIFY_WRITE, fr, sizeof(struct emuframe))))
- return SIGBUS;
+ if (get_isa16_mode(regs->cp0_epc)) {
+ *(u16 *)&fr->emul = (u16)(ir >> 16);
+ *((u16 *)(&fr->emul) + 1) = (u16)(ir & 0xffff);
+ *((u16 *)(&fr->emul) + 2) = (u16)(BREAK_MATH >> 16);
+ *((u16 *)(&fr->emul) + 3) = (u16)(BREAK_MATH &0xffff);
+ } else {
+ fr->emul = ir;
+ fr->badinst = (mips_instruction)BREAK_MATH;
+ }
+ fr->cookie = (mips_instruction)BD_COOKIE;
+ fr->epc = cpc;
+ fr->bpc = bpc;
+ fr->r31 = r31;

- if (get_isa16_mode(regs->cp0_epc)) {
- err = __put_user(ir >> 16, (u16 __user *)(&fr->emul));
- err |= __put_user(ir & 0xffff, (u16 __user *)((long)(&fr->emul) + 2));
- err |= __put_user(BREAK_MATH >> 16, (u16 __user *)(&fr->badinst));
- err |= __put_user(BREAK_MATH & 0xffff, (u16 __user *)((long)(&fr->badinst) + 2));
+ /* fill CP0_EPC with user VA */
+ regs->cp0_epc = ((unsigned long)(current->mm->context.vdso +
+ current_thread_info()->vdso_offset)) |
+ get_isa16_mode(regs->cp0_epc);
+ if (cpu_has_dc_aliases)
+ mips_flush_data_cache_range(current->mm->context.vdso_vma,
+ regs->cp0_epc, current_thread_info()->vdso_page,
+ (unsigned long)fr, sizeof(struct emuframe));
+ else
+ /* it is a less expensive on CPU with correct SYNCI */
+ flush_cache_sigtramp((unsigned long)fr);
} else {
- err = __put_user(ir, &fr->emul);
- err |= __put_user((mips_instruction)BREAK_MATH, &fr->badinst);
- }
+ /* Ensure that the two instructions are in the same cache line */
+ fr = (struct emuframe __user *)
+ ((regs->regs[29] - sizeof(struct emuframe)) & ~0x7);

- err |= __put_user((mips_instruction)BD_COOKIE, &fr->cookie);
- err |= __put_user(cpc, &fr->epc);
+ /* Verify that the stack pointer is not competely insane */
+ if (unlikely(!access_ok(VERIFY_WRITE, fr, sizeof(struct emuframe))))
+ return SIGBUS;

- if (unlikely(err)) {
- MIPS_FPU_EMU_INC_STATS(errors);
- return SIGBUS;
- }
+ if (get_isa16_mode(regs->cp0_epc)) {
+ err = __put_user(ir >> 16, (u16 __user *)(&fr->emul));
+ err |= __put_user(ir & 0xffff, (u16 __user *)((long)(&fr->emul) + 2));
+ err |= __put_user(BREAK_MATH >> 16, (u16 __user *)(&fr->badinst));
+ err |= __put_user(BREAK_MATH & 0xffff, (u16 __user *)((long)(&fr->badinst) + 2));
+ } else {
+ err = __put_user(ir, &fr->emul);
+ err |= __put_user((mips_instruction)BREAK_MATH, &fr->badinst);
+ }
+
+ err |= __put_user((mips_instruction)BD_COOKIE, &fr->cookie);
+ err |= __put_user(cpc, &fr->epc);

- regs->cp0_epc = ((unsigned long) &fr->emul) |
- get_isa16_mode(regs->cp0_epc);
+ if (unlikely(err)) {
+ MIPS_FPU_EMU_INC_STATS(errors);
+ return SIGBUS;
+ }

- flush_cache_sigtramp((unsigned long)&fr->badinst);
+ regs->cp0_epc = ((unsigned long) &fr->emul) |
+ get_isa16_mode(regs->cp0_epc);
+
+ flush_cache_sigtramp((unsigned long)&fr->badinst);
+ }

return SIGILL; /* force out of emulation loop */
}
@@ -132,7 +184,10 @@ int do_dsemulret(struct pt_regs *xcp)
}
err |= __get_user(cookie, &fr->cookie);

- if (unlikely(err || (insn != BREAK_MATH) || (cookie != BD_COOKIE))) {
+ if (unlikely(err || (insn != BREAK_MATH) || (cookie != BD_COOKIE) ||
+ (current_thread_info()->vdso_page &&
+ ((xcp->cp0_epc & PAGE_MASK) !=
+ (unsigned long)current->mm->context.vdso)))) {
MIPS_FPU_EMU_INC_STATS(errors);
return 0;
}
@@ -156,8 +211,54 @@ int do_dsemulret(struct pt_regs *xcp)
return 0;
}

+ if (current_thread_info()->vdso_page) {
+ /* restore VDSO stack level */
+ current_thread_info()->vdso_offset += EMULFRAME_ROUNDED_SIZE;
+ if (current_thread_info()->vdso_offset > PAGE_SIZE) {
+ /* This is not a good situation to be in */
+ current_thread_info()->vdso_offset -= EMULFRAME_ROUNDED_SIZE;
+ force_sig(SIGBUS, current);
+
+ return 0;
+ }
+ }
/* Set EPC to return to post-branch instruction */
xcp->cp0_epc = epc;

return 1;
}
+
+/* check and adjust an emulation stack before start a signal handler */
+void vdso_epc_adjust(struct pt_regs *xcp)
+{
+ struct emuframe __user *fr;
+ unsigned long epc;
+ unsigned long r31;
+
+ while (current_thread_info()->vdso_offset < PAGE_SIZE) {
+ epc = msk_isa16_mode(xcp->cp0_epc);
+ if ((epc >= ((unsigned long)current->mm->context.vdso + PAGE_SIZE)) ||
+ (epc < (unsigned long)((struct mips_vdso *)current->mm->context.vdso + 1)))
+ return;
+
+ fr = (struct emuframe __user *)
+ ((unsigned long)current->mm->context.vdso +
+ current_thread_info()->vdso_offset);
+
+ /*
+ * epc must point to emul instruction or badinst
+ * in case of emul - it is not executed, so return to start
+ * and restore GPR31
+ * in case of badinst - instruction is executed, return to destination
+ */
+ if (epc == (unsigned long)&fr->emul) {
+ __get_user(r31, &fr->r31);
+ xcp->regs[31] = r31;
+ __get_user(epc, &fr->bpc);
+ } else {
+ __get_user(epc, &fr->epc);
+ }
+ xcp->cp0_epc = epc;
+ current_thread_info()->vdso_offset += EMULFRAME_ROUNDED_SIZE;
+ }
+}
diff --git a/arch/mips/mm/fault.c b/arch/mips/mm/fault.c
index becc42b..516045a 100644
--- a/arch/mips/mm/fault.c
+++ b/arch/mips/mm/fault.c
@@ -26,6 +26,7 @@
#include <asm/uaccess.h>
#include <asm/ptrace.h>
#include <asm/highmem.h> /* For VMALLOC_END */
+#include <asm/tlbmisc.h>
#include <linux/kdebug.h>

/*
@@ -138,6 +139,9 @@ good_area:
#endif
goto bad_area;
}
+ if (((address & PAGE_MASK) == (unsigned long)(mm->context.vdso)) &&
+ install_vdso_tlb())
+ goto up_return;
} else {
if (!(vma->vm_flags & (VM_READ | VM_WRITE | VM_EXEC)))
goto bad_area;
@@ -186,6 +190,7 @@ good_area:
}
}

+up_return:
up_read(&mm->mmap_sem);
return;

diff --git a/arch/mips/mm/tlb-r4k.c b/arch/mips/mm/tlb-r4k.c
index fa6ebd4..de5dcc8 100644
--- a/arch/mips/mm/tlb-r4k.c
+++ b/arch/mips/mm/tlb-r4k.c
@@ -350,6 +350,48 @@ void __update_tlb(struct vm_area_struct * vma, unsigned long address, pte_t pte)
local_irq_restore(flags);
}

+int install_vdso_tlb(void)
+{
+ int tlbidx;
+ int cpu;
+ unsigned long flags;
+
+ if (!current_thread_info()->vdso_page)
+ return(0);
+
+ local_irq_save(flags);
+ cpu = smp_processor_id();
+ write_c0_entryhi(((unsigned long)current->mm->context.vdso & (PAGE_MASK << 1)) |
+ cpu_asid(cpu, current->mm));
+
+ mtc0_tlbw_hazard();
+ tlb_probe();
+ tlb_probe_hazard();
+ tlbidx = read_c0_index();
+#if defined(CONFIG_64BIT_PHYS_ADDR) && defined(CONFIG_CPU_MIPS32)
+ write_c0_entrylo0(pte_val(pfn_pte(
+ page_to_pfn(current_thread_info()->vdso_page),
+ __pgprot(_page_cachable_default|_PAGE_VALID)))>>32);
+#else
+ write_c0_entrylo0(pte_to_entrylo(pte_val(pfn_pte(
+ page_to_pfn(current_thread_info()->vdso_page),
+ __pgprot(_page_cachable_default|_PAGE_VALID)))));
+#endif
+ write_c0_entrylo1(0);
+ mtc0_tlbw_hazard();
+ if (tlbidx < 0)
+ tlb_write_random();
+ else
+ tlb_write_indexed();
+ tlbw_use_hazard();
+
+ current->mm->context.vdso_asid[cpu] = cpu_asid(cpu, current->mm);
+ current->mm->context.vdso_page[cpu] = current_thread_info()->vdso_page;
+ local_irq_restore(flags);
+
+ return(1);
+}
+
void add_wired_entry(unsigned long entrylo0, unsigned long entrylo1,
unsigned long entryhi, unsigned long pagemask)
{
James Hogan
2014-10-09 22:43:04 UTC
Hi Leonid,
Post by Leonid Yegoshin
Signal happend during run in emulation block is handled properly - EPC is
changed to before an emulated jump or to target address, depending from point of
signal.
Great!
Post by Leonid Yegoshin
Small stack of emulation blocks is supported because nested traps are possible
in MIPS32/64 R6 emulation mix with FPU emulation.
Could you please clarify how this nesting of emulation blocks could
happen now that signals are handled more cleanly.

I.e. isn't the emuframe stuff only required for instructions in branch
delay slots, and branches shouldn't be in branch delay slots anyway, so
I don't get how they could nest.

Thanks
James
Leonid Yegoshin
2014-10-09 23:10:15 UTC
Post by James Hogan
Post by Leonid Yegoshin
Small stack of emulation blocks is supported because nested traps are possible
in MIPS32/64 R6 emulation mix with FPU emulation.
Could you please clarify how this nesting of emulation blocks could
happen now that signals are handled more cleanly.
I.e. isn't the emuframe stuff only required for instructions in branch
delay slots, and branches shouldn't be in branch delay slots anyway, so
I don't get how they could nest.
It may happen with a mix of FPU and MIPS R6 emulation. I keep both
emulators as separate as possible, but I assume, without proof, that they
may stack - some rollback is needed to join both, and that may (probably)
cause a double emulation setup, with dsemul being called twice for the
same pair of instructions. Honestly, I haven't seen that yet, and you may
be right.

And as for signals - that is a different issue: a signal may happen before
or after the emulated instruction in the emulation block, and I do see
that. But I only see the 'before' case, because its probability is a lot
higher. Unwinding is needed because a signal handler may not return but
instead longjmp somewhere.
James Hogan
2014-10-09 23:40:44 UTC
Hi Leonid,
Post by Leonid Yegoshin
Post by James Hogan
Post by Leonid Yegoshin
Small stack of emulation blocks is supported because nested traps are possible
in MIPS32/64 R6 emulation mix with FPU emulation.
Could you please clarify how this nesting of emulation blocks could
happen now that signals are handled more cleanly.
I.e. isn't the emuframe stuff only required for instructions in branch
delay slots, and branches shouldn't be in branch delay slots anyway, so
I don't get how they could nest.
It may be a case for mix of FPU and MIPS R6 emulations. I just keep both
emulators separate as much as possible but I assume that without prove
it may be stackable - some rollback is needed to join both and it may
(probably) cause a double emulation setup - dsemul may be called twice
for the same pair of instructions. I didn't see that yet, honestly and
you may be right.
If the only time they're used is for emulation of a branch delay slot
instruction which should never be another branch, and signals always
undo the emuframe before being handled (btw, should the BD bit in cause
get set if rewinding for signal handlers/gdb?), then it stands to reason
it should never nest.

You could then avoid the whole stack and per-thread thing and just have
a maximum of one emuframe dedicated to each thread or allocated on
demand, and if there genuinely is a use case for nesting later on, worry
about it then.

So long as the kernel handles a long sequence of sequential emulated
branches gracefully (not necessarily correctly).

Cheers
James
Leonid Yegoshin
2014-10-10 00:07:11 UTC
Post by James Hogan
Post by Leonid Yegoshin
It may be a case for mix of FPU and MIPS R6 emulations. I just keep both
emulators separate as much as possible but I assume that without prove
it may be stackable - some rollback is needed to join both and it may
(probably) cause a double emulation setup - dsemul may be called twice
for the same pair of instructions. I didn't see that yet, honestly and
you may be right.
If the only time they're used is for emulation of a branch delay slot
instruction which should never be another branch, and signals always
undo the emuframe before being handled (btw, should the BD bit in cause
get set if rewinding for signal handlers/gdb?), then it stands to reason
it should never nest.
I don't want to take the chance. If it proves excessive, then slashing
it takes 15 minutes; it does no harm.

OK, I will spend some time looking into that; it makes sense to
reconsider it after the signal unwinding.
Post by James Hogan
You could then avoid the whole stack and per-thread thing and just have
a maximum of one emuframe dedicated to each thread or allocated on
demand, and if there genuinely is a use case for nesting later on, worry
about it then.
As I understand it, you propose allocating some space in mmap.

That requires machinery to handle allocation of user space beyond VMAs.
It may also be painful during thread creation, stopping and subsequent
cloning, because that memory allocator has to service those events too,
and that may not be easy if emulation blocks are packed into a page. If
they are not packed, it wastes user address space and puts an additional
constraint on the number of threads in a single mmap.

Some cooperation with GLIBC may be needed to prevent reuse of that user
address space; at the moment I am not sure of the extent of it.

I estimated that it can be much more troublesome.
Post by James Hogan
So long as the kernel handles a long sequence of sequential emulated
branches gracefully (not necessarily correctly).
I don't understand the question. Each pair/single instruction is emulated
separately, but there is some pipelining of that, even in the FPU
emulator; it is just not an issue for this patch.
James Hogan
2014-10-10 10:03:34 UTC
Hi Leonid,
Post by Leonid Yegoshin
Post by James Hogan
You could then avoid the whole stack and per-thread thing and just have
a maximum of one emuframe dedicated to each thread or allocated on
demand, and if there genuinely is a use case for nesting later on, worry
about it then.
As I understand, you propose to allocate some space in mmap.
No, sorry if I wasn't very clear. I just mean that you can get away with
a single kernel managed page per mm, with an emuframe allocated
per-thread which that thread always uses, since they never nest, which I
think simplifies the whole thing significantly.

The allocation could be smarter than that of course in case you have
thousands of threads and only a subset doing lots of FP branches, but a
single thread should never need more than one at a time since the new
signal behaviour effectively makes the delay slot emulation sort of
atomic from the point of view of usermode, and the kernel knows for sure
whether BD emulation is in progress from the PC.

(If there is some other way than signals that I haven't taken into
account that the emulation could be pre-empted then please let me know!)
Post by Leonid Yegoshin
Post by James Hogan
So long as the kernel handles a long sequence of sequential emulated
branches gracefully (not necessarily correctly).
I don't understand a question. Each pair/single instruction is emulated
separately but there is some pipeline of that, even in FPU emulator, it
is just not this patch issue.
I just mean an (illegal/undefined) sequence of FPU branch instructions
in one another's delay slots shouldn't be able to crash the kernel.

Actually 2 of them would be enough to verify the kernel didn't get too
confused. Maybe the second will be detected & ignored, or maybe it
doesn't matter if the first emuframe gets overwritten by the second one
from the kernel's point of view.

Cheers
James
Peter Zijlstra
2014-10-10 10:24:06 UTC
Post by James Hogan
Hi Leonid,
Post by Leonid Yegoshin
Post by James Hogan
You could then avoid the whole stack and per-thread thing and just have
a maximum of one emuframe dedicated to each thread or allocated on
demand, and if there genuinely is a use case for nesting later on, worry
about it then.
As I understand, you propose to allocate some space in mmap.
No, sorry if I wasn't very clear. I just mean that you can get away with
a single kernel managed page per mm, with an emuframe allocated
per-thread which that thread always uses, since they never nest, which I
think simplifies the whole thing significantly.
The allocation could be smarter than that of course in case you have
thousands of threads and only a subset doing lots of FP branches, but a
single thread should never need more than one at a time since the new
signal behaviour effectively makes the delay slot emulation sort of
atomic from the point of view of usermode, and the kernel knows for sure
whether BD emulation is in progress from the PC.
(If there is some other way than signals that I haven't taken into
account that the emulation could be pre-empted then please let me know!)
Right, look at uprobes, it does exactly all this with a single page.
Slot allocation will block waiting for a free slot when all are in use.

If you need to support nesting, you need to do greedy slot allocation,
which is possible with limited nesting.
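
Roughly the shape of it, as a from-memory sketch rather than the actual
uprobes code (which lives in kernel/events/uprobes.c); all names below are
made up:

	#include <linux/bitops.h>
	#include <linux/wait.h>
	#include <linux/atomic.h>
	#include <linux/mm.h>

	#define SLOT_SIZE	64		/* e.g. a rounded emuframe */
	#define SLOTS_PER_PAGE	(PAGE_SIZE / SLOT_SIZE)

	struct slot_area {			/* one shared page per mm */
		unsigned long		bitmap[BITS_TO_LONGS(SLOTS_PER_PAGE)];
		atomic_t		slot_count;	/* slots in use */
		wait_queue_head_t	wq;
		unsigned long		vaddr;	/* user VA of the shared page */
	};

	/* claim a free slot, sleeping until one is released if the page is full */
	static int take_slot(struct slot_area *area)
	{
		int slot;

		for (;;) {
			slot = find_first_zero_bit(area->bitmap, SLOTS_PER_PAGE);
			if (slot < SLOTS_PER_PAGE &&
			    !test_and_set_bit(slot, area->bitmap)) {
				atomic_inc(&area->slot_count);
				return slot;	/* VA: area->vaddr + slot * SLOT_SIZE */
			}
			wait_event(area->wq,
				   atomic_read(&area->slot_count) < SLOTS_PER_PAGE);
		}
	}

	/* release a slot, e.g. from do_dsemulret() or a signal unwind */
	static void put_slot(struct slot_area *area, int slot)
	{
		clear_bit(slot, area->bitmap);
		atomic_dec(&area->slot_count);
		wake_up(&area->wq);
	}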
Leonid Yegoshin
2014-10-10 22:47:56 UTC
Post by James Hogan
I just mean an (illegal/undefined) sequence of FPU branch instructions
in one anothers delay slots shouldn't be able to crash the kernel.
Actually 2 of them would be enough to verify the kernel didn't get too
confused. Maybe the second will be detected & ignored, or maybe it
doesn't matter if the first emuframe gets overwritten by the second
one from the kernels point of view.
Yes, I am looking into those sequences. I try to keep both emulators
isolated from the rest of the kernel and from each other as much as
possible, but cross-calls via illegal combinations are still possible.
Post by Peter Zijlstra
Right, look at uprobes, it does exactly all this with a single page.
Slot allocation will block waiting for a free slot when all are in use.
I don't see a reason to change my 300-line design into much lengthier
code. That code has more links to the rest of the kernel and a high
likelihood of executing atomic operations/locks/mutexes/etc. - I can't do
that while emulating MIPS locking instructions.

- Leonid.
David Daney
2014-10-10 22:56:56 UTC
Post by Leonid Yegoshin
Post by James Hogan
I just mean an (illegal/undefined) sequence of FPU branch instructions
in one anothers delay slots shouldn't be able to crash the kernel.
Actually 2 of them would be enough to verify the kernel didn't get too
confused. Maybe the second will be detected & ignored, or maybe it
doesn't matter if the first emuframe gets overwritten by the second
one from the kernels point of view.
Yes, I am looking into that sequences. I try to keep both emulators
isolated from the rest of kernel and from each other as much as possible
but intercalls via illegal combinations are still possible.
Post by James Hogan
Right, look at uprobes, it does exactly all this with a single page.
Slot allocation will block waiting for a free slot when all are in use.
I don't see a reason to change my 300 lines design into much more
lengthy code. That code has more links to the rest of kernel and high
possibility to execute atomic operation/locks/mutex/etc - I can't do it
for emulation of MIPS locking instructions.
It isn't just the number of lines of code that is important.

Doesn't your solution consume an extra page for each thread requiring
emulation? That could be a significant amount of memory in a system
with many threads.

Are you using this to emulate atomic operations in addition to FPU
branch delay slot instructions? Where is the code that does that?

David Daney
Post by Leonid Yegoshin
- Leonid.
Leonid Yegoshin
2014-10-10 23:40:26 UTC
Post by David Daney
Post by Leonid Yegoshin
Post by Peter Zijlstra
Right, look at uprobes, it does exactly all this with a single page.
Slot allocation will block waiting for a free slot when all are in
use.
I don't see a reason to change my 300 lines design into much more
lengthy code. That code has more links to the rest of kernel and high
possibility to execute atomic operation/locks/mutex/etc - I can't do it
for emulation of MIPS locking instructions.
It isn't just the number of lines of code that is important.
Doesn't your solution consume an extra page for each thread requiring
emulation? That could be a significant amount of memory in a system
with many threads.
Yes, you are right. However, per-thread memory is useful for many purposes.
Post by David Daney
Are you are using this to emulate atomic operations in addition to FPU
branch delay slot instructions? Where is the code that does that?
Yes, in the MIPS R2 emulator - MIPS R6 changed the LL/LLD/SC/SCD opcodes
and the offset size.
The code in development is in
ssh://git.linux-mips.org/pub/scm/yegoshin/mips.git, branch
android-linux-mti-3.10.14

- Leonid.

Leonid Yegoshin
2014-10-09 20:00:26 UTC
This is the last step of the 3 patches which shift FPU emulation off the
stack into a protected area. It disables the default executable stack.

Additionally, it makes the default data area non-executable.

Signed-off-by: Leonid Yegoshin <***@imgtec.com>
---
arch/mips/include/asm/page.h | 2 +-
1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/arch/mips/include/asm/page.h b/arch/mips/include/asm/page.h
index 3be8180..d49ba81 100644
--- a/arch/mips/include/asm/page.h
+++ b/arch/mips/include/asm/page.h
@@ -230,7 +230,7 @@ extern int __virt_addr_valid(const volatile void *kaddr);
#define virt_addr_valid(kaddr) \
__virt_addr_valid((const volatile void *) (kaddr))

-#define VM_DATA_DEFAULT_FLAGS (VM_READ | VM_WRITE | VM_EXEC | \
+#define VM_DATA_DEFAULT_FLAGS (VM_READ | VM_WRITE | \
VM_MAYREAD | VM_MAYWRITE | VM_MAYEXEC)

#define UNCAC_ADDR(addr) ((addr) - PAGE_OFFSET + UNCAC_BASE)
David Daney
2014-10-09 21:42:51 UTC
Post by Leonid Yegoshin
The following series implements an executable stack protection in MIPS.
It sets up a per-thread 'VDSO' page and appropriate TLB support.
Page is set write-protected from user and is maintained via kernel VA.
MIPS FPU emulation is shifted to new page and stack is relieved for
execute protection as is as all data pages in default setup during ELF
binary initialization. The real protection is controlled by GLIBC and
it can do stack protected now as it is done in other architectures and
I learned today that GLIBC team is ready for this.
What does it mean to be 'ready'? If they committed patches before there
was kernel support, that is putting the cart before the horse. GlibC's
state cannot be used as a valid reason for committing major kernel
changes. There would be no regression in any GLibC-based system as a
result of not merging this patch.
Post by Leonid Yegoshin
Note: actual execute-protection depends from HW capability, of course.
This patch is required for MIPS32/64 R2 emulation on MIPS R6 architecture.
Without it 'ssh-keygen' crashes pretty fast on attempt to execute instruction
in stack.
There is much more blocking MIPS32/64 R2 emulation on MIPS R6 than just
this patch, isn't there?

Also, if you are supporting MIPS R6, this patch doesn't even work,
because it doesn't handle PC relative instructions at all.
Post by Leonid Yegoshin
- Added an optimization during mmap switch - doesn't switch if the same
thread is rescheduled and other threads don't intervene (Peter Zijlstra)
- Fixed uMIPS support (Paul Burton)
- Added unwinding of VDSO emulation stack at signal handler invocation,
hiding an emulation page (Andy Lutomirski note in other patch comments)
---
MIPS: mips_flush_cache_range is added
MIPS: Setup an instruction emulation in VDSO protected page instead of user stack
MIPS: set stack/data protection as non-executable
The recent discussions on this subject, including many comments from
Imgtec e-mail addresses, brought to light the need to use an instruction
set emulator for newer MIPSr6 ISA processors.

In light of this, why does it make sense to merge this patch, instead of
taking the approach of emulating the instructions in the delay slot?

David Daney
Post by Leonid Yegoshin
arch/mips/include/asm/cacheflush.h | 3 +
arch/mips/include/asm/fpu_emulator.h | 2
arch/mips/include/asm/mmu.h | 3 +
arch/mips/include/asm/page.h | 2
arch/mips/include/asm/processor.h | 2
arch/mips/include/asm/switch_to.h | 14 +++
arch/mips/include/asm/thread_info.h | 3 +
arch/mips/include/asm/tlbmisc.h | 1
arch/mips/include/asm/vdso.h | 3 +
arch/mips/kernel/process.c | 7 ++
arch/mips/kernel/signal.c | 4 +
arch/mips/kernel/vdso.c | 41 +++++++++
arch/mips/math-emu/cp1emu.c | 8 +-
arch/mips/math-emu/dsemul.c | 153 ++++++++++++++++++++++++++++------
arch/mips/mm/c-octeon.c | 8 ++
arch/mips/mm/c-r3k.c | 8 ++
arch/mips/mm/c-r4k.c | 43 ++++++++++
arch/mips/mm/c-tx39.c | 9 ++
arch/mips/mm/cache.c | 4 +
arch/mips/mm/fault.c | 5 +
arch/mips/mm/tlb-r4k.c | 42 +++++++++
21 files changed, 333 insertions(+), 32 deletions(-)
Leonid Yegoshin
2014-10-09 22:18:56 UTC
Post by David Daney
Post by Leonid Yegoshin
The following series implements an executable stack protection in MIPS.
It sets up a per-thread 'VDSO' page and appropriate TLB support.
Page is set write-protected from user and is maintained via kernel VA.
MIPS FPU emulation is shifted to new page and stack is relieved for
execute protection as is as all data pages in default setup during ELF
binary initialization. The real protection is controlled by GLIBC and
it can do stack protected now as it is done in other architectures and
I learned today that GLIBC team is ready for this.
What does it mean to be 'ready'? If they committed patches before
there was kernel support, that it putting the cart before the horse.
GlibC's state cannot be used as valid reason for committing major
kernel changes. There would be no regression in any GLibC based
system as a result of not merging this patch.
Rich Fuhler told me that they discussed it internally and have a
solution for their problem (ignoring PT_GNU_STACK on first library
load - they need to sort out the logic). But we need to split the two
issues - right now the stack can't be protected because of emulation. If
they make the stack protected, emulation fails on CPUs without an FPU.
Post by David Daney
Post by Leonid Yegoshin
Note: actual execute-protection depends from HW capability, of course.
This patch is required for MIPS32/64 R2 emulation on MIPS R6
architecture.
Without it 'ssh-keygen' crashes pretty fast on attempt to execute instruction
in stack.
There is much more blocking MIPS32/64 R2 emulation on MIPS R6 than
just this patch isn't there?
This one is critical - ssh-keygen crashes when running MIPS R2 code. I
have a patch in my R6 repository, but GLIBC still cannot protect the
stack, and security suffers.
Post by David Daney
Also, if you are supporting MIPS R6, this patch doesn't even work,
because it doesn't handle PC relative instructions at all.
It seems you missed my statement - adding support for a PC-relative
instruction is just 5 lines of code. I am just refraining from that until
the toolchain starts generating such instructions.

Besides that, this version 2 of the patch just passed 20-22 hours on
P5600 and Virtuoso (no FPU on either) under a SOAK test; it gets about
one signal per hour landing right at an emulated instruction in the VDSO,
and the unwinding works (as I can see in the debug prints).
Post by David Daney
The recent discussions on this subject, including many comments from
Imgtec e-mail addresses, brought to light the need to use an
instruction set emulator for newer MIPSr6 ISA processors.
At Imgtec I am the only one who works on MIPS R6 SW and FPU branch
emulation, and I tell you - it is not needed; this solution is enough.
Post by David Daney
In light of this, why does it make sense to merge this patch, instead
of taking the approach of emulating the instructions in the delay slot?
Well, because it exists now. But a full MIPS emulator... for all
ASEs... for any MIPS vendor... I don't even want to estimate the amount
of time and code it would take to develop.

Besides that, you missed another statement of mine - we don't force
customers to disclose all details of their COP2 instructions.

- Leonid
Paul Burton
2014-10-09 22:28:54 UTC
Post by David Daney
The recent discussions on this subject, including many comments from
Imgtec e-mail addresses, brought to light the need to use an instruction
set emulator for newer MIPSr6 ISA processors.
Post by Leonid Yegoshin
In Imgtec I am only one who works on MIPS R6 SW and FPU branch emulation and
I say you - it is not needed, this solution is enough.
On R6, yes, for some time now, but you still have not submitted that
code for review. Your having worked on something does not automatically
make it correct.

On the FP branch delay slots, clearly not, given that I submitted a
similar patch almost a year ago...

Paul
David Daney
2014-10-09 22:59:26 UTC
Post by Leonid Yegoshin
Post by David Daney
Post by Leonid Yegoshin
The following series implements an executable stack protection in MIPS.
It sets up a per-thread 'VDSO' page and appropriate TLB support.
Page is set write-protected from user and is maintained via kernel VA.
MIPS FPU emulation is shifted to new page and stack is relieved for
execute protection as is as all data pages in default setup during ELF
binary initialization. The real protection is controlled by GLIBC and
it can do stack protected now as it is done in other architectures and
I learned today that GLIBC team is ready for this.
What does it mean to be 'ready'? If they committed patches before
there was kernel support, that it putting the cart before the horse.
GlibC's state cannot be used as valid reason for committing major
kernel changes. There would be no regression in any GLibC based
system as a result of not merging this patch.
Rich Fuhler said me that they discussed it internally and have a
solution to fix their problem (ignoring PT_GNU_STACK on first library
load - they need to sort out the logic). But we need to split both issue
- right now stack can't be protected because of emulation. If they set
stack protected then emulation fails on CPU without FPU.
Yes, we understand why a non-executable stack is not compatible with the
FPU emulator.
Post by Leonid Yegoshin
Post by David Daney
Post by Leonid Yegoshin
Note: actual execute-protection depends from HW capability, of course.
This patch is required for MIPS32/64 R2 emulation on MIPS R6
architecture.
Without it 'ssh-keygen' crashes pretty fast on attempt to execute instruction
in stack.
There is much more blocking MIPS32/64 R2 emulation on MIPS R6 than
just this patch isn't there?
This one is critical - ssh-keygen crashes during running MIPS R2. I have
a patch in my R6 repository but GLIBC still can't set stack executable
and security suffers.
But is the R6 code already in the lmo or kernel.org repositories?

If not, then the lack of this patch is not a gating issue. If this
patch is really needed for R6 support, why not submit the R6
prerequisite patches first?

If this patch has nothing to do with MIPS R6, then state that.
Post by Leonid Yegoshin
Post by David Daney
Also, if you are supporting MIPS R6, this patch doesn't even work,
because it doesn't handle PC relative instructions at all.
It seems like you missed my statement - adding support for PC-relative
instruction is just 5 lines of code. I just refrain from this until
toolchain starts generating that.
How can it be just 5 lines of code? You have to emulate all those
instructions:

ADDIUPC
AUIPC
ALUIPC
LDPC
LWPC
LWUPC

I think that is all of them. You can emulate all of those in 5 lines of
code?

We need to support everything the toolchain could produce in the
I don't think it makes sense to add all this stuff when it is well
known that it doesn't solve the problem for MIPS R6, especially when the
justification for the patch is that it is needed for R6.

I understand what your goals are here; I have spent many months working
towards a non-executable stack (see the patches that moved the signal
trampolines off the stack). But I am worried that there are many cases
that it will not handle.
Post by Leonid Yegoshin
Besides that, this version 2 of patch just passed 20-22 hours on P5600
and Virtuoso (no FPU on both) under SOAK test and it gets around 1 per
hour of signal right at emulated instruction in VDSO and unwind works
(as I can see in debug prints).
I'm not saying that the patch doesn't work under your highly constrained
test conditions; I believe that it does.

I am not familiar with the SOAK test. Does it really put faulting
instructions in the delay slots of FP branch instructions, catch the
resulting signal, and then throw an exception from the signal handler?
Post by Leonid Yegoshin
Post by David Daney
The recent discussions on this subject, including many comments from
Imgtec e-mail addresses, brought to light the need to use an
instruction set emulator for newer MIPSr6 ISA processors.
In Imgtec I am only one who works on MIPS R6 SW and FPU branch emulation
and I say you - it is not needed, this solution is enough.
It can't be true that PC-relative support is not needed; why did you add
the PC-relative instructions if you didn't want to use them in Linux
userspace?

Really, what I was talking about was a wider audience - the people who
will write tools and code that target userspace. They will want a
solution without a bunch of restrictions forced on them by the
limitations of the FPU emulator.
Post by Leonid Yegoshin
Post by David Daney
In light of this, why does it make sense to merge this patch, instead
of taking the approach of emulating the instructions in the delay slot?
Well, because it does exist now. But full MIPS emulator... for all
ASEs... for any MIPS vendor... I even doesn't want to estimate an amount
of time and code size to develop it.
Besides that, you missed my another statement - we don't force customer
to disclose all details of their COP2 instructions.
Here is my proposal:

1) Add an emulator for all documented MIPS R6 instructions that can
appear in a Linux userspace delay slot.

2) Document placing COP2 instructions in FP branch delay slots as
unsupported.

3) Get rid of this execute-out-of-line code in the FPU emulator
altogether.

4) Enable a non-executable stack.

In order to have full MIPS R6 support in the kernel, you will need an
emulator for a subset of the instructions anyhow. Going to a full ISA
emulator will be a little more work, but it shouldn't be too hard.


David Daney
Leonid Yegoshin
2014-10-09 23:48:12 UTC
Post by David Daney
Post by Leonid Yegoshin
Post by David Daney
Post by Leonid Yegoshin
Note: actual execute-protection depends from HW capability, of course.
This patch is required for MIPS32/64 R2 emulation on MIPS R6 architecture.
Without it 'ssh-keygen' crashes pretty fast on attempt to execute instruction
in stack.
There is much more blocking MIPS32/64 R2 emulation on MIPS R6 than
just this patch isn't there?
This one is critical - ssh-keygen crashes during running MIPS R2. I have
a patch in my R6 repository but GLIBC still can't set stack executable
and security suffers.
But is the R6 code already in the lmo or kernel.org repositories?
If not, then the lack of this patch is not a gating issue. If this
patch is really needed for R6 support, why not submit the R6
prerequisite patches first?
Because:

1) The security concern still exists for MIPS R5 (MIPS R2 has no RI/XI
support; it was defined in MIPS R3, but for simplicity it is referred to
as "MIPS R2").
2) GLIBC needs this to start development.
Post by David Daney
If this patch has nothing to do with MIPS R6, then state that.
It has value for both - MIPS R5 and MIPS R6.
Post by David Daney
Post by Leonid Yegoshin
Post by David Daney
Also, if you are supporting MIPS R6, this patch doesn't even work,
because it doesn't handle PC relative instructions at all.
It seems like you missed my statement - adding support for PC-relative
instruction is just 5 lines of code. I just refrain from this until
toolchain starts generating that.
How can it be just 5 lines of code? You have to emulate all those
ADDIUPC
AUIPC
ALUIPC
LDPC
LWPC
LWUPC
I think that is all of them. You can emulate all of those in 5 lines
of code?
You misread my statement - 5 lines of code per PC-relative instruction.
And only ADDIUPC is part of microMIPS R2, which I can emulate.
But we are discussing something insignificant; the MIPS R6 load
instructions take more, of course, but definitely less than
LWL/LWR/LDL/LDR, which I have to emulate anyway and do.
Post by David Daney
We need to support everything the toolchain could product in the
future. I don't think it makes sense to add all this stuff when it is
well known that it doesn't solve the problem for MIPS R6, especially
when the justification for the patch is that it is needed for R6.
I understand what your goals are here, I have spend many months
working towards a non-executable stack (see the patches that moved the
signal trampolines off the stack). But I am worried that there are
many cases that it will not handle.
Post by Leonid Yegoshin
Besides that, this version 2 of patch just passed 20-22 hours on P5600
and Virtuoso (no FPU on both) under SOAK test and it gets around 1 per
hour of signal right at emulated instruction in VDSO and unwind works
(as I can see in debug prints).
Post by David Daney
I'm not saying that the patch doesn't work under your highly
constrained test conditions, I believe that it does.
I am not familiar with the SOAK test. Does it really put faulting
instructions the delay slots of FP branch instructions, catch the
resulting signal, and then throw an exception from the signal handler?
Yes, the debug output shows me that. "from the signal handler" -> "to
the signal handler"?
Post by David Daney
Post by Leonid Yegoshin
Post by David Daney
The recent discussions on this subject, including many comments from
Imgtec e-mail addresses, brought to light the need to use an
instruction set emulator for newer MIPSr6 ISA processors.
In Imgtec I am only one who works on MIPS R6 SW and FPU branch emulation
and I say you - it is not needed, this solution is enough.
It can't be true the PC relative support is not needed, why did you
add the PC relative instructions, if you didn't want to use them in
Linux userspace?
Sorry, I misunderstood you here - I assumed you were talking about a FULL
INSTRUCTION SET emulator. Of course, some emulation is needed, such as
PC-relative instructions, but not a full instruction set. I never said
that PC-relative instructions don't require emulation.

But see your point (1) below; if you are retracting from that HERE, please
clarify the difference - do you want a full instruction set emulator, or
are you speaking only about PC-relative instructions?
Post by David Daney
1) Add an emulator for all documented MIPS R6 instructions that can
appear in a Linux userspace delay slot.
2) Document placing COP2 instructions in FP branch delay slots as
unsupported.
3) Get rid of this execute-out-of-line code in the FPU emulator
altogether.
4) Enable a non-executable stack.
In order to have full MIPS R6 support in the kernel, you will need an
emulator for a subset of the instructions anyhow. Going to a full ISA
emulator will be a little more work, but it shouldn't be too hard.
It is too restrictive and kills the idea of a customised processor.

- Leonid.