| guess | measured time |
| aaaa | \(100\pm5\) |
| baaa | \(103\pm4\) |
| caaa | \(102\pm6\) |
| daaa | \(111\pm5\) |
| eaaa | \(99 \pm6\) |
| faaa | \(101\pm7\) |
| gaaa | \(104\pm4\) |
| … | … |
| guess | measured time |
| daaa | \(102\pm5\) |
| dbaa | \(99\pm4\) |
| dcaa | \(104\pm4\) |
| ddaa | \(100\pm6\) |
| deaa | \(102\pm4\) |
| dfaa | \(109\pm7\) |
| dgaa | \(103\pm4\) |
| … | … |
From Ross Anderson, Security Engineering, Third Edition
From Kuhn, ‘‘Electromagnetic Eavesdropping Risks of Flat-Panel Displays’’ (PET 2004)
say we have two 64-bit integers \(x\), \(y\)
one way to multiply:
divide \(x\), \(y\) into 32-bit parts: \(x=x_1\cdot2^{32}+x_0\) and \(y=y_1\cdot2^{32}+y_0\)
then \(xy = x_1y_1\cdot2^{64}+x_1y_0\cdot2^{32}+x_0y_1\cdot2^{32}+x_0y_0\)
can extend this idea to arbitrarily large numbers
number of smaller multiplies depends on size of numbers!
naive multiplication idea:
problem: sometimes the value of the number is a secret
oops! revealed through timing
early versions of OpenSSL (TLS implementation) had timing attack
attacker could figure out bits of private key from timing
why? variable-time multiplication and modulus operations
Figure 3a from Brumley and Boneh, ‘‘Remote Timing Attacks are Practical’’
<script>
var the_color = window.getComputedStyle(
document.querySelector('a[href*="foo.com"]')
).color
if (the_color == ...) { ... }
</script>
turns out, other webpages create distinct ‘‘signatures’’
Figure from Cook et al, ‘‘There’s Always a Bigger Fish: Clarifying Analysis of a Machine-Learning-Assisted Side-Channel Attack’’ (ISCA ’22)
observing machine operation, indirectly
common types of indirect observations
side channels can be security and privacy issues
suppose I time accesses to array of chars:
what could cause this difference?
some pseudocode:
array[128] is slower to access
*other_address loaded into cache + evicted it
what do we know about other_address? (select all that apply)
| A. same cache tag | B. same cache index | C. same cache offset |
| D. diff. cache tag | E. diff. cache index | F. diff. cache offset |
caches often use physical, not virtual addresses
storing/processing timings evicts things in the cache
processor ‘‘pre-fetching’’ may load things into cache before access is timed
some L3 caches use a simple hash function to select index instead of index bits
pointer is 0x1000188
0x1000188: … 0000 0001 1000 1000
array[0] starts at multiple of cache size — index 0, offset 0
array[0b1 1000 0000] = array[0x180]
… 0000 1000 0000 0000 → cache index = 0x20
what X makes 0x200400 + X have cache index 0x20?

| 0x200400 | …0000 0100 0000 0000 |
| \(+\) X | |
| 0x200400 + X | …?000 1000 0000 0000 |
previous idea: learn bits of mystery that correspond to index bits
array[0x6*BLOCK_SIZE] is slow to access
from Osvik, Shamir, and Tromer, ‘‘Cache Attacks and Countermeasures: the Case of AES’’ (2004)
early AES implementation used lookup tables
goal: detect index into lookup table
tricks they did to make this work
from Osvik, Shamir, and Tromer, ‘‘Cache Attacks and Countermeasures: the Case of AES’’ (2004)
if (... /*something false*/) {
access *pointer;
}
mystery despite if
could cache access for *pointer still happen?
yes, if:
*pointer starts before segfault detected
operations in virtual memory lookup:
Intel processors: looks like these were separate steps, so…
Prime();
if (something false) {
int value = ReadMemoryMarkedNonReadableInPageTable();
access other_array[value * ...];
}
Probe();
from Lipp et al, ‘‘Meltdown: Reading Kernel Memory from User Space’’
// %rcx = kernel address
// %rbx = array to load from to cause eviction
xor %rax, %rax // rax <- 0
retry:
// rax <- memory[kernel address] (segfaults)
// but check for segfault done out-of-order on Intel at time
movb (%rcx), %al
// rax <- memory[kernel address] * 4096 [speculated]
shl $0xC, %rax
jz retry // not-taken branch
// access array[memory[kernel address] * 4096]
mov (%rbx, %rax), %rbx
space out accesses by 4096
ensure separate cache sets and
avoid triggering prefetcher
repeat access if zero
apparently value of zero speculatively read
when real value not yet available
access cache to allow measurement later
in paper with FLUSH+RELOAD instead of PRIME+PROBE
segfault actually happens eventually
option 1: okay, just start a new process each time
option 2: suppress segfault
(paper used (obscure) transactional memory support,
conceptually, could have used mispredicted branch instead)
HW: permissions check done with/before physical address lookup
SW: separate page tables for kernel and user space
mystery despite if
previous idea: learn bits of mystery that correspond to index bits
array[0x6*BLOCK_SIZE] is slow to access
array[0x100*BLOCK_SIZE] is slow to access
suppose this C code is run with extra privileges
assume x chosen by attacker
(example from original Spectre paper)
0x1000000 and 0x103F000
array1[x] accesses first byte of secret?
array2[0] is char stored in cache set 0
what can we learn about array1[x]?
array2[array1[x] * 4096] accesses set with index =
(array1[x] * 4096 / BLOCK_SIZE) mod 8192
know that set number is 128 from probing
array1[x] * 64 = 128 (mod 8192)
\(\rightarrow\) array1[x] * 64 = 128 + 8192 * K
array1[x] = 2 + 128 * K
/* exploiting pseudocode */
/* step 1: mistrain branch predictor */
for (a lot) {
systemCallHandler(0 /* less than array1_size */);
}
/* step 2: evict from cache using misprediction */
Prime();
systemCallHandler((targetAddress - array1Address) / A1ElemSize);
int evictedSet = ProbeAndFindEviction();
int targetValue = (evictedSet - array2StartSet) / setsPer4KA2Elem;
times 4096 shifts so we can get lower bits of target value
void SomeSystemCallHandler(int index) {
if (index >= some_table_size)
return ERROR;
int kind = table[index];
switch (other_table[kind].foo) {
...
}
}
if (x < array1_size) {
y = array2[array1[x]];
}
limited in what address we can learn about based on how big entries in tables are
need to adjust calculations to actual addresses / array element sizes / etc.
jmp *%rax — predict target with cache:
| bottom 12 bits of jmp address | last seen target |
| 0x0–0x7 | 0x200000 |
| 0x8–0xF | 0x440004 |
| 0x10–0x17 | 0x4CD894 |
| 0x18–0x1F | 0x510194 |
| 0x20–0x27 | 0x4FF194 |
| … | … |
| 0xFF8–0xFFF | 0x3F8403 |
Intel Haswell CPU did something similar to this
1: find some kernel function with jmp *%rax
2: mistrain branch target predictor for it to jump to chosen code
3: have chosen code be attack code (e.g. array access)
4: run the kernel function
showed Spectre variant 1 (array bounds), 2 (indirect jump)
other possible variations:
could cause other things to be mispredicted
could use side-channel other than data cache changes
replace array[x] with array[x & ComputeMask(x, size)]
ComputeMask(x, size) = 0xFFFF..F if x \(\le\) size, else 0
for indirect branches:
with hardware help:
without hardware help:
jmp *(%rax), etc. into code that does jmp *%rax
char *array, *other_array;
// PRIME
posix_memalign((void **)&array, CACHE_SIZE, CACHE_SIZE);
AccessAllOf(array);
// (some code we don't control)
other_array[mystery * N] += 1; // previously: * BLOCK_SIZE
// PROBE
for (int i = 0; i < CACHE_SIZE; i += BLOCK_SIZE) {
if (CheckIfSlowToAccess(&array[i])) {
...
}
}
other_array at 0x4000000 (index 0, offset 0)
mystery if N = 1? N = 32 * 64?
\[\begin{align*}
\left\lfloor\text{mystery} * N / \text{BLOCK_SIZE}\right\rfloor~\text{mod}~1024 & = & 32 \\
\left\lfloor\text{mystery} * N / \text{BLOCK_SIZE}\right\rfloor & = & 32 + 1024K \\
\end{align*}\]
let offset be some number in [0,BLOCK_SIZE):
\[\begin{align*}
\text{mystery} * N & = & \text{BLOCK_SIZE}\times(32+1024Z) + \text{offset}\\
\text{mystery} & = & \left(\text{BLOCK_SIZE}\times(32+1024Z) + \text{offset}\right) / N \\
\text{mystery} & = & \left(64\times(32+1024Z)+\text{offset}\right) / N \\
\end{align*}\]
N=1: mystery = \(2048\), \(2049\), \(2050\), …, \(2048+63\), \(64\cdot1024+2048\), \(64\cdot1024+2048+1\), …
N = 32 * 64:
\[\begin{align*}
\text{mystery}\cdot 32\cdot 64 & = & 64(32+1024Z) + \text{offset} \\
& = & 64\cdot32 + 65536Z + \text{offset}\\
\text{mystery} & = & 1 + \frac{65536}{64\cdot32}Z + \frac{\text{offset}}{64\cdot32} = 1+32Z \\
\end{align*}\]
other_array at 0x4000000 — mystery?
NUM_SETS = 64KB/64B = 1K (1024) sets
array[0x800] has cache index 0x800/BLOCK_SIZE mod NUM_SETS
know other_array[mystery * BLOCK_SIZE] had same index
other_array[0] at cache index 0
recall have found:
other_array[0] at index 0
other_array[mystery*BLOCK_SIZE] has index 32 (same as array[0x800])
other_array[X] at cache index (0 + X/BLOCK_SIZE mod NUM_SETS)
other_array starts at 0x4001440
then other_array[mystery * BLOCK_SIZE] at cache index
(0x51 + mystery * BLOCK_SIZE / BLOCK_SIZE) mod NUM_SETS = 32
mystery = -49 or 975 or 1999 or …
char *array, *other_array;
// PRIME
posix_memalign((void **)&array, CACHE_SIZE, CACHE_SIZE);
AccessAllOf(array);
// (some code we don't control)
other_array[mystery * BLOCK_SIZE] += 1;
// PROBE
for (int i = 0; i < CACHE_SIZE; i += BLOCK_SIZE) {
if (CheckIfSlowToAccess(&array[i])) { ... }
}
array[0x8800] is slow
if 4-way 64KB cache w/64B blocks and something from cache set 32 evicted,
then where could slow access be?
A. i=0x400, i=0x800, i=0x8400, i=0x8800
B. i=0x800, i=0x8800, i=0x10800, i=0x18800
C. i=0x800, i=0x4800, i=0x8800, i=0xc800
D. i=0x800, i=0x4800, i=0x8800, i=0x10800
E. something else
unsigned char *probe_array;
posix_memalign(&probe_array, CACHE_SIZE, CACHE_SIZE);
access OTHER things to evict all of probe_array
if (something false) {
read probe_array[mystery * BLOCK_SIZE];
}
check which value from probe_array is faster
uint8_t* probe_array = new uint8_t[256 * 4096];
// ... Make sure probe_array is not cached
uint8_t kernel_memory_val = *(uint8_t*)(kernel_address);
uint64_t final_kernel_memory = kernel_memory_val * 4096;
uint8_t dummy = probe_array[final_kernel_memory];
// ... catch page fault
// ... in signal handler, determine which of 256 slots in probe_array is cached