System Developers Glossary (also for dummies)

Do you know what a table, a bitset or the stack is, or what the highs and lows are? Yes, no, not quite sure … either way, this post might come in handy. This glossary is aimed at Intel x86 and x86_64 and any subsequent solutions built on the PC architecture.

Preface

The internet is full of information about what is what in computer science. It’s easy to find the answers you need if the topic is popular enough (think web development). It’s hard to find the small missing bits that only a few enthusiasts or professionals in large companies are talking about. It’s even harder to find the missing links between one thing and another. And it’s especially hard to find anything if you’ve never studied computer science – you don’t know where to start or where to go next. Most tutorials are written for a specific task, and very few connect these tasks together. This is MY story so far, and whilst I’m on my journey (of self-taught system programming), I’m marking down the important stuff and publishing it here, in the hope that some day I’ll write a whole step-by-step “Build Your Own OS” tutorial.

Simple data types

bit

The smallest unit of data in modern electronics (and programming). It’s just 1 or 0.

You can think about it as:            1               0
a switch                              On              Off
a lie detector (logical expression)   True            False
a simple answer                       Yes             No
a programmer                          Set             Clear
a hardware engineer                   Current flows   Current doesn’t flow

byte

The smallest available chunk of data. And by available I mean – you can not access a single bit directly in memory.

A byte is an array of 8 bits. It can hold numeric values from -128 up to 127, or from 0 to 255 if you don’t need the sign and use its unsigned brother. The difference is only in the human-readable interpretation of the byte; for the CPU it’s still just an array of 8 bits. The sign is determined by the 8th bit, which is the most significant one (1 – negative, 0 – positive). If you’re confused about bit numbering, please read the “Data structures” section further down.
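To make the signed vs. unsigned point concrete, here is a minimal sketch. The function names are made up for illustration, and it assumes two’s complement, which is what x86 uses. The same 8 bits are read once as an unsigned and once as a signed number.

```c
#include <stdint.h>

/* Interpret the same 8 bits as a signed (two's complement) value. */
int as_signed(uint8_t bits)
{
    /* Patterns with the top bit set land in the negative range. */
    return bits >= 0x80 ? (int)bits - 256 : (int)bits;
}

/* The sign bit is the most significant bit (bit 7 when counting from 0). */
int sign_bit(uint8_t bits)
{
    return (bits >> 7) & 1;
}
```

For example, the bit pattern 11111111 is 255 when read unsigned, while as_signed(0xFF) gives -1. The bits in memory are identical in both cases.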

In popular programming languages it’s also known as char or int8, and its unsigned counterparts are called unsigned char, uchar, uint8, etc. This little unit of data has a lot of history, and because of its history it’s still here. It’s also used in computer systems to encode ASCII characters. More details on Wikipedia.

word

Also known as short integer or short for short (pun!?), int16, unsigned short, ushort, unsigned int16 and uint16. It’s a chunk of data  that consists of 16 bits (or you can say 2 bytes glued together). It can hold values from -32768 up to +32767 or 0 to 65535.

Actually, word used to mean the largest chunk of bits the CPU can swallow in a single instruction, but since the department of marketing has completely lost it – on x86 it’s stuck at 16 bits even in the 64 bit world. A word is also the size of the address offsets used in Real Mode. More details on Wikipedia.

dword

There we have it, we’re calling things double to make them twice as big. This chunk of data, also known as integer or int for short, long, int32, unsigned int, unsigned long, ulong, uint32, etc. is the 32 bit integer (4 bytes). Range? -2147483648 to +2147483647 or 0 to 4294967295. This is the default addressing size in Protected Mode. Be careful with long, though: it is 32 bits wide on Windows and on 32 bit systems, but 64 bits wide on 64 bit Linux and macOS.

qword

Can you spot the pattern? – this one is quad word – 64bits (8 bytes), a.k.a. long long, int64, unsigned long long, uint64, etc. Range: -9223372036854775808 to +9223372036854775807 or 0 to 18446744073709551615. Astronomers are starting to pay attention at this point. :) This is the default addressing size in Long Mode. I think in the next 20 years we’ll see something like oword (for octa word) and hdword for (hexade word). :)
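Since the plain C names (int, long, …) change size between compilers, a safer way to get exactly these sizes is the fixed-width types from the standard header <stdint.h>. The sketch below maps them onto the glossary names. The byte_t … qword_t aliases and the max_unsigned helper are my own hypothetical names, not something from a standard.

```c
#include <stdint.h>

/* Hypothetical aliases: glossary names on top of <stdint.h> types. */
typedef uint8_t  byte_t;   /* byte  :  8 bits */
typedef uint16_t word_t;   /* word  : 16 bits */
typedef uint32_t dword_t;  /* dword : 32 bits */
typedef uint64_t qword_t;  /* qword : 64 bits */

/* Largest unsigned value that fits into `bits` bits (0..64). */
uint64_t max_unsigned(unsigned bits)
{
    /* Shifting a 64 bit value by 64 is undefined in C, so handle it apart. */
    return bits >= 64 ? UINT64_MAX : ((UINT64_C(1) << bits) - 1);
}
```

Calling max_unsigned(16) returns 65535 and max_unsigned(32) returns 4294967295, matching the unsigned ranges listed above.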

Now to the funky brothers.

float

This one is called a floating point number. It’s an array of 32 bits (4 bytes) and it’s a funky one in many ways:

  • it can go in a range from -3.4e+38 up to +3.4e+38 (that is 3.4 multiplied by 1000…(38 zeros)). Astronomers are getting excited.
  • that’s not all, it can be as small as 1.175e-38 (that is 1.175 divided by 1000…(38 zeros)), and even smaller if you accept reduced precision. Now quantum physicists get excited as well.
  • Also it can store some special case numbers like infinity and “not-a-number” (NaN).

Float actually consists of 3 numbers:

  1. the fraction (23 bits) – the significant digits of the number, the part that gets multiplied by those long strings of zeros
  2. the exponent (8 bits) – the number that creates those many zeroes (indirectly, as it is a power of 2, not of 10)
  3. the sign – a single bit at the very top (the most significant bit) that says whether it’s a positive or a negative value
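The three parts can be pulled out of a float with a few shifts and masks. This is only a sketch: it assumes the usual IEEE 754 single precision layout (which is what x86 uses), and float_fields_t and split_float are hypothetical names of mine.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical helper type holding the three parts of a 32 bit float. */
typedef struct {
    uint32_t sign;      /*  1 bit,  bit 31 */
    uint32_t exponent;  /*  8 bits, bits 23..30, stored with a bias of 127 */
    uint32_t fraction;  /* 23 bits, bits 0..22 */
} float_fields_t;

float_fields_t split_float(float f)
{
    uint32_t bits;
    float_fields_t out;

    /* Copy the raw 32 bits; a plain cast would convert the value instead. */
    memcpy(&bits, &f, sizeof bits);

    out.sign     = bits >> 31;
    out.exponent = (bits >> 23) & 0xFF;
    out.fraction = bits & 0x7FFFFF;
    return out;
}
```

For 1.0f this gives sign 0, exponent 127 (which means 2 to the power of 0 once the bias is subtracted) and fraction 0.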

It’s a hell of a number, and it actually takes a hell of a lot of computing cycles to process, but it’s worth it. For one thing – it lets you divide stuff without dropping the decimal part, i.e. 2 / 3 = 0.6666…, not 0, and 3 / 2 = 1.5, not 1. But as the point is called “floating”, it loses its precision as the value increases. For example, taking a number as big as 1 000 (9 zeros or more) 000 and adding 0.000 (some more zeros) 1 will still give the first number, as the difference in exponents is way too big to sum these two values.
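The precision loss described above is easy to reproduce. The sketch below assumes IEEE 754 single precision floats, and is_absorbed is a name I made up. It reports whether the small number simply disappears when added to the big one.

```c
/* Returns 1 when adding `small` to `big` leaves `big` unchanged. */
int is_absorbed(float big, float small)
{
    /* volatile forces the sum to be rounded to a real 32 bit float
       instead of being kept at a higher precision by the compiler. */
    volatile float sum = big + small;
    return sum == big;
}
```

With these assumptions is_absorbed(1e9f, 0.0001f) returns 1, because the gap between neighbouring floats near one billion is much bigger than 0.0001, while is_absorbed(1.0f, 0.5f) returns 0.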

double and long double

These are the big brothers of float. Double consists of 64 bits (52 bit fraction and 11 bit exponent + the sign bit) and on x86 long double is 80 bits long. Now astronomers and quantum physicists are dancing together, cheering, naked and intoxicated. A double precision number, for example, can reach values with 308 zeros. And the 80 bit one (64 bit fraction that includes an explicit leading integer bit, 15 bit exponent + the sign bit) … do the math.

Data structures

Array

I don’t want to dig deep into the theory, but as the definition says it’s a systematic arrangement of objects. In computer science terms it’s an arrangement of items at equally spaced addresses in computer memory. So if you want to use “an array of bytes”, that means that it will hold some amount of bytes just one after another. If you want to mix bytes and some larger data elements then the equal spacing will be aligned to the largest of elements and it will actually be an array of these largest elements, but you’ll access the lowest parts of these large elements (we’ll get to that high and low talk soon).

Now, if you’ve already done some programming yourself (and I think you have, because no ordinary person would have read this far), then you know that arrays, and basically everything you can split up “equally spaced”, are so called “zero based“. That just means that the 1st element of an array is accessed by number 0, not number 1. It’s because everywhere you pass an array around you’re actually passing the memory address of its first element, and the index number is just an addition to the first element’s address, multiplied by the size of an element. Thus the first element is located at [address of first element + (0 * element size)], the second one is located at [address of first element + (1 * element size)] and in general it’s [address of first element + (element number * element size)].
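The address formula from the paragraph above, written out as a tiny function. element_address is a hypothetical name; a C compiler does this same calculation for you every time you write array[index].

```c
#include <stddef.h>
#include <stdint.h>

/* [address of first element + (element number * element size)] */
uintptr_t element_address(uintptr_t base, size_t index, size_t element_size)
{
    return base + index * element_size;
}
```

For an array of dwords (4 bytes each) starting at address 0x1000, element 0 sits at 0x1000 and element 3 sits at 0x100C.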

Structures

This one is important in C – as it’s the magic variable group that allows you to build up a chunk of memory representing a more advanced data structure than an array (as it’s not mandatory to equally space its elements). I don’t know how much I can say about it, but it’s da-shit you should definitely love and cherish.

For example, in x86 architecture there are a lot of structures in memory – and these structures are there, just floating in this big soup of bytes. You could just access a single byte or a word at a desired memory address, but you can build a struct template in your source code and point a variable (a pointer) of this struct type to the starting address of this structure in memory – Voila! All the variables are now at your fingertips – all that you need is to find the specification of the structure somewhere on the internet or in the official developer’s manuals.

As a practical example I can give you something from ACPI: the RSDP – the main pointer of ACPI pointers. In C it looks like this:

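// uint8, uint32 and uint64 are assumed to be this project's own
// typedefs for fixed-width unsigned integers
typedef struct RSDP_struct {
	// Version 1.0
	char signature[8];	// "RSD PTR " (note the trailing space)
	uint8 checksum;		// makes the first 20 bytes sum to 0
	char oem_id[6];
	uint8 revision;
	uint32 RSDT_address;
	// Version 2.0
	uint32 length;
	uint64 XSDT_address;
	uint8 extended_checksum;	// makes the whole structure sum to 0
	uint8 reserved[3];
} __attribute__((packed)) RSDP_t;	// packed (GCC/Clang): no padding added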

Now all you have to do is to define a variable as a pointer to this structure like this:

RSDP_t *rsdp = (RSDP_t *)0x80000; // The EBDA lives somewhere in 0x80000-0x9FFFF, so we start there

and scan through the memory (that is assigning an address to the pointer incremented by some value on every iteration) to find a valid signature in the signature field (the char signature[8]).

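do {
	// test if we've found it and exit this loop if we have
	// (memcmp is assumed to be available; in a bare kernel you bring your own)
	if (memcmp(rsdp->signature, "RSD PTR ", 8) == 0)
		break; // a real implementation should verify the checksum too
	rsdp = (RSDP_t *)((uint64)rsdp + 0x10); // the RSDP is always 16 byte aligned
} while ((uint64)rsdp < 0x100000); // Up until 1MB mark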

Once you’ve found it – Jackpot! You have full access to the ACPI data structures, as this parent pointer has pointers to other structure pointers and so on.
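A matching signature alone can be a coincidence, so the ACPI specification also defines a checksum: all bytes of the version 1.0 part (the first 20 bytes) added together must give 0 in the lowest byte. Here is a sketch of that check. The function names are mine, and it works on a raw pointer so it does not depend on the struct definition above.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Add up `len` bytes, keeping only the lowest 8 bits of the sum. */
uint8_t acpi_byte_sum(const void *data, size_t len)
{
    const uint8_t *p = data;
    uint8_t sum = 0;

    while (len--)
        sum = (uint8_t)(sum + *p++);
    return sum;
}

/* Signature matches and the first 20 bytes (ACPI 1.0 part) sum to 0. */
int rsdp_looks_valid(const void *candidate)
{
    return memcmp(candidate, "RSD PTR ", 8) == 0
        && acpi_byte_sum(candidate, 20) == 0;
}
```

Inside the scanning loop you would call rsdp_looks_valid(rsdp) and stop at the first address where it returns 1. For revision 2 and later the same sum over the whole structure (the length field says how many bytes) should also be 0.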

Bitset and usage

It’s an array of bits. A word, for example, is an array of 16 bits. Bitsets are a great choice if you want to optimize memory usage when you have a lot of yes or no variables. As I said previously – a byte is the smallest unit of memory you can access directly. If you chose a whole byte to store each yes or no (1 or 0), you’d use 8 times more space than you would with bitsets. Think of a byte as a bitset of 8 one-bit variables.
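A minimal sketch of a byte used as a bitset of 8 flags. The function names are hypothetical, and bits are numbered from 0, starting at the lowest (rightmost) bit.

```c
#include <stdint.h>

/* Return a copy of `set` with bit `n` (0..7) switched on. */
uint8_t bit_set(uint8_t set, unsigned n)
{
    return (uint8_t)(set | (1u << n));
}

/* Return a copy of `set` with bit `n` (0..7) switched off. */
uint8_t bit_clear(uint8_t set, unsigned n)
{
    return (uint8_t)(set & ~(1u << n));
}

/* Return 1 if bit `n` (0..7) is on, otherwise 0. */
int bit_test(uint8_t set, unsigned n)
{
    return (set >> n) & 1u;
}
```

Eight separate yes or no answers now fit into a single byte: bit_set(0, 3) gives 00001000, and bit_test on that value returns 1 for bit 3 and 0 for every other bit.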

Real world examples can be found throughout the Intel Architectures Software Developer’s Manual. The CPUID instruction, for example, returns bitsets in the EAX, EBX, ECX and EDX registers, where each bit lets you determine whether a specific feature is available or not.

When using bitsets you’ll also need to understand bitwise operations and boolean logic. Here are some practical (although not so logical in human terms) examples:

AND – is an operator used to test two values and return third one which consists of all the bits (set) in BOTH values. In reality it’s used to filter out unwanted bits or test bits whether they are set or cleared.

  • Let’s say we have a byte (8 bit bitset) 00010011 and we want to test whether the 2nd bit is set – we’ll do an AND operation with 00000010 and if the result is equal to 00000010, then the 2nd bit was set.
  • Filtering out is similar: value 1 = 10101010 and value 2 = 11110000, performing AND on these you’ll get 10100000 – the lowest 4 bits (the 4 on the right) are cleared

Now I know I touched on the bit order here, but bear with me, the explanations will come soon.

OR – is an operator used to set bits. Simple as that. Imagine that you have a byte of decimal value 123, that is 01111011 in binary. Now if you do an OR operation with 00000100, the resulting value will be 01111111, which is 127 in decimal.

Ok, if we’re still in bitwise boolean world I’ll just drop in some more info:

NOT – think about it as an annoying kid who’s always doing the opposite of what they’ve been told. NOT takes a single value and swaps every bit around: set bits become cleared and cleared bits become set. So, if you have a value of 11110000 and you perform a NOT on it, the result is the completely inverted value 00001111. Tadaa! (Flipping only selected bits, by pairing the value with a mask such as 11111111, is actually a job for XOR.)

XOR – I don’t know what to say about this, because the only way I’ve ever used it is for clearing a value (setting it to 0), as the Intel Architectures Optimization Manual says it’s faster to XOR a register with itself than to move the value 0 into it. So if you XOR one value with the same value, you’ll get 0. In depth, it tests whether a bit is set in only one of the two values: if the bits match it returns 0, if not then 1.
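Here are the four operators as C code, using the C operators &, |, ~ and ^. The wrapper functions are hypothetical and exist only so each operation has a name. Values are kept to 8 bits to match the examples in the text.

```c
#include <stdint.h>

uint8_t and8(uint8_t a, uint8_t b) { return a & b; }         /* bits set in BOTH     */
uint8_t or8(uint8_t a, uint8_t b)  { return a | b; }         /* bits set in EITHER   */
uint8_t not8(uint8_t a)            { return (uint8_t)~a; }   /* every bit flipped    */
uint8_t xor8(uint8_t a, uint8_t b) { return a ^ b; }         /* bits set in ONLY ONE */
```

With the numbers from the text: and8(0xAA, 0xF0) is 0xA0 (10100000), or8(123, 0x04) is 127, not8(0xF0) is 0x0F, and xor8 of any value with itself is 0. The inversion 11110000 to 00001111 can be written either as a NOT or as an XOR with 11111111.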

Bit and byte order (high and low, bottom and top)

Now this is the part I’ve been burning with impatience to talk about, as in the x86 architecture, in combination with our left-to-right writing/reading, it’s a little bit confusing. First things first:

  • bottom – the first one, as the memory is said to grow upwards;
  • low – the one closer to the beginning or lower value (bottom);
  • high – the one closer to the end or higher value (top);
  • top – the last one.

Second. As far as the computer architecture goes, everything can be and is accessed using a base address – the location of the start in memory (a.k.a. the bottom address) – and then accessed relative to this base address. I’ve already touched on this in the “Array” section – this thing called “zero based“, which means that the first element of an array is actually not accessed by number 1, but by 0. Remember? The computer knows the starting address of an array and that is exactly where the first element of it resides, thus its address is [base address + 0].

OK, now let’s get to the point. The x86 architecture is little-endian, which means that the bytes of a multi-byte data unit are stored in memory in ascending order of significance, lowest address first (a lot of swearing words, I know, bear with me). Endianness is about bytes only: single bits have no address of their own, so there is no bit order in memory to worry about. Least significant byte (or LSB) vs. most significant byte (or MSB) is like a cent vs. a dollar. You’re happier with a larger number of dollars in your pocket, thus their significance is more important than a few pennies. So a 32 bit integer (which consists of 4 bytes) is actually stored backwards compared to how we write it, starting from the bottom up (the same way as the memory goes).

To imagine that visually, the hexadecimal representation of numbers comes in handy. A single hex digit always represents 4 bits, so a single byte can always be written as a 2 digit hex number, whereas in decimal it takes 1 to 3 digits – that’s why programmers use hex, it’s actually easier to read. For example, a decimal unsigned byte of value 255 can also be written as 0xFF. Now let’s take the 32 bit integer 0x12345678 (305419896 decimal) and store it in memory. The result will be (bottom-to-top): 0x78, 0x56, 0x34, 0x12. That’s little-endian, kids.

To be more informative I’ll just add these examples:

  1. a single byte (dec: 196, hex: 0xC4, bin: 11000100). If you look at the binary representation (that’s how it’s displayed on the screen), you see that the LSB is actually on the right side; you can also see that binary value 0100 is equal to 0x4 and 1100 is equal to 0xC.
  2. a word (dec: 1610, hex: 0x064A, bin: 0000011001001010). Again you can see how the hex and bin parts are ordered in the way we’re used to reading them, but in your computer’s memory the bytes are stored in reverse order: 0x4A and 0x06.
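The byte order can be worked out with shifts, without touching memory at all. In the sketch below (both function names are hypothetical) the first function says which byte ends up at which offset when a value is stored little-endian, and the second one peeks at real memory to tell whether the machine running it is little-endian.

```c
#include <stdint.h>

/* Byte found at offset `index` (0..3) when `value` is stored little-endian.
   Offset 0 is the lowest address and holds the least significant byte. */
uint8_t le32_byte_at(uint32_t value, unsigned index)
{
    return (uint8_t)(value >> (8 * index));
}

/* Returns 1 if this machine stores the least significant byte first. */
int host_is_little_endian(void)
{
    uint16_t probe = 1;
    /* Look at the byte at the lowest address of a 16 bit value of 1. */
    return *(const unsigned char *)&probe == 1;
}
```

For 0x12345678 the offsets 0, 1, 2 and 3 give 0x78, 0x56, 0x34 and 0x12, the same bottom-to-top order as in the text. On any x86 machine host_is_little_endian() should return 1.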

A little philosophical note – as you might know our left-to-right languages have adopted one thing from right-to-left languages – our numbering system. And this is just my theory, but that might explain confusion. We’re writing our numbers in a big-endian order, but words in little-endian order (I think every word ends with a punchline ;) ). I mean – we write and read most significant digits in the number first.

With x86 manuals and tutorials on the net it’s the complete opposite, except when we view memory in some editors or access separate bytes of larger data structures – there the values are displayed in big-endian order (the way we’re used to reading them). While we’re used to reading everything from the top-left corner down to the bottom-right, in every software developer’s manual you’ll see bitset and other data structure layouts drawn in the diagonally opposite direction (the LSB or bottom bits/bytes being in the bottom-right corner).

x86 and CPU Specifics

Tables

One more confusion that needs to be clarified. Tables are arrays or structures, period! This was a big confusion for me, as the word is used for both. For example, page tables are arrays – fixed in size, in Long Mode they consist of exactly 512 64 bit entries (actually pointers to other tables or pages, with their lowest bits used as bitsets). ACPI tables, on the other hand, are structures.
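Here is a rough sketch of the page table case, assuming Long Mode with 4 KB pages. The macro and function names are mine. In an entry, bit 0 says whether it is present, bit 1 whether writing is allowed, and bits 12 to 51 hold the address it points to. The remaining flag bits are left out.

```c
#include <stdint.h>

#define PTE_PRESENT   (UINT64_C(1) << 0)             /* bit 0: entry is valid     */
#define PTE_WRITABLE  (UINT64_C(1) << 1)             /* bit 1: writes are allowed */
#define PTE_ADDR_MASK UINT64_C(0x000FFFFFFFFFF000)   /* bits 12..51: the pointer  */

/* A page table is an array: 512 entries * 8 bytes = 4096 bytes. */
typedef uint64_t page_table_t[512];

int      pte_present(uint64_t entry)  { return (entry & PTE_PRESENT) != 0; }
int      pte_writable(uint64_t entry) { return (entry & PTE_WRITABLE) != 0; }
uint64_t pte_address(uint64_t entry)  { return entry & PTE_ADDR_MASK; }
```

An entry of 0x1003 is therefore a pointer to address 0x1000 with the present and writable bits set. The addresses are always aligned to 4 KB, which is why the low 12 bits are free to be used as a bitset.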

Registers

Confusion does not stop here. Registers – once you start learning about assembly, the first thing that you learn is that a CPU has registers, they have their names, they are variables on the CPU die. They are the fastest storage locations in the whole system as the instructions are performed directly on them (RAM is slow compared to registers and HDD … well it’s like a pigeon post vs e-mail).

Real Mode CPU registers

16 bit  8 bit high  8 bit low  Usage
AX      AH          AL         Stores the results of arithmetic operations (Accumulator)
BX      BH          BL         Used as a memory access register (Base address location)
CX      CH          CL         Can be used as a counter with LOOP instruction (Counter)
DX      DH          DL         BIOS functions that access IO use these (Data register)
SP      -           -          Stack pointer
BP      -           -          Base pointer (usually the stack frame base, also usable for address arithmetic)
SI      -           -          Source index in array operations
DI      -           -          Destination index in array operations
CS      -           -          Code segment register (I won’t go into segmentation)
DS      -           -          Data segment register
SS      -           -          Stack segment register
ES      -           -          Extra segment register (use it as you want)

The AH, AL, BH, BL, etc. actually are the high and low parts of their 16 bit counterparts so setting AL with some value will change AX as well. First four registers are also known as general purpose, but they still have their special purposes.
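The relation between AX, AH and AL can be modelled in C with shifts and masks. These are ordinary variables, not real registers, and the function names are hypothetical.

```c
#include <stdint.h>

/* Read the low (AL) and high (AH) halves of a 16 bit value (AX). */
uint8_t low8(uint16_t ax)  { return (uint8_t)(ax & 0xFF); }
uint8_t high8(uint16_t ax) { return (uint8_t)(ax >> 8); }

/* Writing AL replaces only the low byte of AX. */
uint16_t with_low8(uint16_t ax, uint8_t al)
{
    return (uint16_t)((ax & 0xFF00) | al);
}

/* Writing AH replaces only the high byte of AX. */
uint16_t with_high8(uint16_t ax, uint8_t ah)
{
    return (uint16_t)((ax & 0x00FF) | ((uint16_t)ah << 8));
}
```

If AX holds 0x1234 then AH is 0x12 and AL is 0x34, and writing 0xFF into AL turns AX into 0x12FF.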

Protected Mode CPU registers

In Protected Mode you have all of the registers from Real Mode, but the 32 bit versions are prefixed with the letter E (for extended, as they are extended to 32 bits).

32 bit  16 bit  8 bit high  8 bit low  Usage
EAX     AX      AH          AL         Stores the results of arithmetic operations (Accumulator)
EBX     BX      BH          BL         Used as a memory access register (Base address location)
ECX     CX      CH          CL         Can be used as a counter with LOOP instruction (Counter)
EDX     DX      DH          DL         BIOS functions that access IO use these (Data register)
ESP     SP      -           -          Stack pointer
EBP     BP      -           -          Base pointer (usually the stack frame base, also usable for address arithmetic)
ESI     SI      -           -          Source index in array operations
EDI     DI      -           -          Destination index in array operations
-       CS      -           -          Code segment register (I won’t go into segmentation)
-       DS      -           -          Data segment register
-       SS      -           -          Stack segment register
-       ES      -           -          Extra segment register (use it as you want)
-       FS      -           -          Same as ES
-       GS      -           -          Same as ES

Now up to ?X registers everything is the same as in Real Mode, but as you noticed E?X registers do not have a high 16 bit register available – there are only low 16 bit registers that can be split up in high and low 8 bit registers. But the principle remains the same – once you write something into AH, for example, it also changes high 8 bits of low 16 bits of EAX register.

Long Mode CPU registers

x86_64 or AMD64 or Intel 64 introduced a bunch of new registers:

64 bit  32 bit  16 bit  8 bit high  8 bit low  Usage
RAX     EAX     AX      AH          AL         Stores the results of arithmetic operations (Accumulator)
RBX     EBX     BX      BH          BL         Used as a memory access register (Base address location)
RCX     ECX     CX      CH          CL         Can be used as a counter with LOOP instruction (Counter)
RDX     EDX     DX      DH          DL         BIOS functions that access IO use these (Data register)
RSP     ESP     SP      -           SPL        Stack pointer
RBP     EBP     BP      -           BPL        Base pointer (usually the stack frame base, also usable for address arithmetic)
RSI     ESI     SI      -           SIL        Source index in array operations
RDI     EDI     DI      -           DIL        Destination index in array operations
R8      R8D     R8W     -           R8B        General purpose (new in Long Mode)
R9      R9D     R9W     -           R9B        General purpose (new in Long Mode)
R10     R10D    R10W    -           R10B       General purpose (new in Long Mode)
R11     R11D    R11W    -           R11B       General purpose (new in Long Mode)
R12     R12D    R12W    -           R12B       General purpose (new in Long Mode)
R13     R13D    R13W    -           R13B       General purpose (new in Long Mode)
R14     R14D    R14W    -           R14B       General purpose (new in Long Mode)
R15     R15D    R15W    -           R15B       General purpose (new in Long Mode)
-       -       CS      -           -          Code segment register (I won’t go into segmentation)
-       -       DS      -           -          Data segment register
-       -       SS      -           -          Stack segment register
-       -       ES      -           -          Extra segment register (use it as you want)
-       -       FS      -           -          Same as ES
-       -       GS      -           -          Same as ES

That’s it right? …

… no. Everything turns upside down once you start learning about other hardware and I/O. You see, hardware engineers ran out of words to name stuff, so they just thought: if a CPU has registers, why can’t we call our specially assigned memory locations registers too? So that’s it – if you hear about a PCI register, then it’s just a location in memory, which actually isn’t there – it’s mapped there. And mapped means that everything you write to that memory location won’t land in RAM, but will be redirected to some device that’s listening on that specific address.
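In C such a mapped register is usually accessed through a volatile pointer, so that the compiler really performs every single read and write. Below is a minimal sketch with hypothetical names. Since there is no real device here, an ordinary variable called fake_device_register stands in for the mapped address.

```c
#include <stdint.h>

/* Stand-in for a device register; on real hardware the address would
   come from the device's documentation or from a PCI/ACPI lookup. */
static uint32_t fake_device_register;

/* volatile tells the compiler that every access matters: it must not be
   cached in a CPU register, merged with another access or optimized out. */
uint32_t mmio_read32(uintptr_t address)
{
    return *(volatile uint32_t *)address;
}

void mmio_write32(uintptr_t address, uint32_t value)
{
    *(volatile uint32_t *)address = value;
}
```

Usage would look like mmio_write32((uintptr_t)&fake_device_register, 1), followed by mmio_read32 on the same address. With a real device the value read back may differ from what was written, as it is the device that answers, not RAM.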

Why am I telling you this – well, for starters, when I started this system programming business I thought that the CPU has direct access to any hardware component, but it soon turned out that it doesn’t – all it really does is read and write addresses (memory and I/O ports), and it’s the hardware behind those addresses that responds. In Real Mode you have the BIOS, which you can call using software interrupts to get something out of the hardware, but there’s this 1MB limit and segments which lock you in 64KB chunks of memory, etc. And once you exit Real Mode into Protected Mode or Long Mode you are on your own. And then the fun part starts – you have to look up ACPI tables and try to enumerate the PCI bus (through registers mapped in memory). Scary and fun at the same time – as working with numbers is fun and challenging (there is almost no human readable response – it’s just – a number goes in and a number comes out).

Stack

This one is an array, but a funky one at that. While normal arrays grow upward (as does all memory), this one grows downward on every PUSH and shrinks back up on every POP. The confusing part is that the top of the stack is at the bottom of its memory (yep, it’s the complete reverse of what I just told you about highs and lows, except that bits and bytes are still little-endian ordered). But as with any data structure it has its goodies. The stack is controlled by a single register – the stack pointer (a.k.a. SP, ESP, RSP) – that tells the CPU where the last pushed value resides. As with (almost) all CPU registers, it’s possible to modify its value directly – you can re-position your stack anywhere in memory, and you can also pop or push on your own. You can completely avoid pushing and popping (like GCC often does) and just decrement the stack pointer by as much as you need and fill the gap yourself using simple MOV instructions.
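The downward growth is easier to see in a toy model. The sketch below is a simulation in plain C, not the real CPU stack. All names are hypothetical and there are no overflow checks. It keeps a pointer that plays the role of RSP: push moves it down and stores, pop loads and moves it back up.

```c
#include <stdint.h>

#define STACK_SLOTS 16                       /* tiny stack, 16 qwords */

static uint64_t stack_memory[STACK_SLOTS];

/* Like RSP: starts just past the highest slot, so the stack is empty. */
static uint64_t *sp = stack_memory + STACK_SLOTS;

/* PUSH: first move the pointer down, then store the value there. */
void push(uint64_t value) { *--sp = value; }

/* POP: first load the value, then move the pointer back up. */
uint64_t pop(void) { return *sp++; }

/* Number of values currently on the stack. */
int depth(void) { return (int)((stack_memory + STACK_SLOTS) - sp); }
```

After push(1) and push(2) the pointer sits two slots below where it started, and the first pop() returns 2, the value that was pushed last.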

Remember though – there are some instructions that manipulate the stack on their own, so be careful. For example, the CALL instruction automatically pushes the return address (the address of the instruction right after the CALL) on the stack before jumping to the subroutine (function). RET, on the other hand, assumes that the value on top of the stack is that return address, pops it and jumps back there (that is, to the instruction following the CALL that got you into the current function in the first place).

Conclusion

I know I’m not writing this post in a scientifically correct manner, but it’s intended to represent my understanding of how stuff works. It’s good to have a different opinion or differently structured sentences on the same subject as it might make things a lot clearer to somebody seeking for answers.

System programming is basically just number mangling in memory, but with these special numbers you can get wonderful results. Just look around you – Windows 7, Linux, Mac OS X – three marvelous operating systems that are the work of brilliant engineers. They all start with numbers and end up in beautiful GUIs.

So, as always, if you spot an error or have an opinion of your own, please share it with the rest of the readers in the comments section.

Thank you! Peace out! And till the next post.