
Generating executable files from scratch

Cameron Swinoga edited this page Jun 9, 2017 · 18 revisions

Introduction

The first step in getting the operating system to execute arbitrary assembly is to figure out which executable file formats it supports. Since I was developing my compiler on Linux, I chose the Executable and Linkable Format (ELF). There is a wealth of information on the composition of ELF files, but because the standard is large and wide-ranging, it is not easy to pick out what is needed for a bare-minimum working example. This page is a compendium of the research and piecing together I did in order to write YABFC.

Starting at the top

Looking through the documentation for the system header elf.h, there are a few structures that we can use to set up the executable file. I will be using the 64-bit version of the ELF format rather than the 32-bit version. The first few lines of setup are straightforward and rigorously defined:

Elf64_Ehdr ELFHeader; // Declare the ELF header (fields are filled in below)

ELFHeader.e_ident[EI_MAG0]       = 0x7f; // Magic numbers
ELFHeader.e_ident[EI_MAG1]       = 'E';
ELFHeader.e_ident[EI_MAG2]       = 'L';
ELFHeader.e_ident[EI_MAG3]       = 'F';
ELFHeader.e_ident[EI_CLASS]      = ELFCLASS64;    // 64 bit ELF
ELFHeader.e_ident[EI_DATA]       = ELFDATA2LSB;   // little-endian
ELFHeader.e_ident[EI_VERSION]    = EV_CURRENT;    // Current version
ELFHeader.e_ident[EI_OSABI]      = ELFOSABI_SYSV; // UNIX System V ABI
ELFHeader.e_ident[EI_ABIVERSION] = 0x0;           // ABI version needs to be 0

for (int i = EI_PAD; i < EI_NIDENT; i++) ELFHeader.e_ident[i] = 0x0; // Zero padding

ELFHeader.e_type    = ET_EXEC;            // Executable file
ELFHeader.e_machine = EM_X86_64;          // AMD x86-64
ELFHeader.e_version = EV_CURRENT;         // Current version
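The same setup can be written so that no field is left uninitialized, by zeroing the whole header first. This is only a sketch: the helper name init_elf_ident is hypothetical and not part of YABFC, while ELFMAG and SELFMAG are the magic-number constants that elf.h already provides.

```c
#include <elf.h>
#include <string.h>

/* Hypothetical helper: zero the header, then set the identification bytes
 * and the three fixed fields shown above. */
static void init_elf_ident(Elf64_Ehdr *hdr)
{
    memset(hdr, 0, sizeof *hdr);             /* also zeroes EI_ABIVERSION and the padding */
    memcpy(hdr->e_ident, ELFMAG, SELFMAG);   /* 0x7f 'E' 'L' 'F' */
    hdr->e_ident[EI_CLASS]   = ELFCLASS64;   /* 64 bit ELF */
    hdr->e_ident[EI_DATA]    = ELFDATA2LSB;  /* little-endian */
    hdr->e_ident[EI_VERSION] = EV_CURRENT;
    hdr->e_ident[EI_OSABI]   = ELFOSABI_SYSV;

    hdr->e_type    = ET_EXEC;
    hdr->e_machine = EM_X86_64;
    hdr->e_version = EV_CURRENT;
}
```

Zeroing up front means every field not set explicitly holds 0 rather than whatever happened to be on the stack.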

After this, things get a little more complicated. We need to configure the program's entry point, the program and section header table offsets, and the header sizes. The ELF specification does not dictate where the different sections are placed in the file, as long as the offsets recorded in the headers point at the correct data. An important distinction here is between where data sits in the file and where it sits in the program's runtime memory, hereafter referred to as file location (_FILE_LOC) and memory location (_MEM_LOC).

A file location is a byte offset from the start of the file that you are creating (an offset, because the operating system abstracts away where the file physically lives). It tells the operating system where to look in your file in order to read the correct data. That data is then loaded into the program's runtime memory (its virtual address space), where the program can read it. With that in mind, we need to start by picking a virtual address at which to base the program.
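The relationship between the two kinds of location can be sketched as a small translation function. The name file_offset_to_vaddr is hypothetical; the only assumption is that the byte in question lies inside a loadable segment described by an Elf64_Phdr.

```c
#include <elf.h>
#include <stdbool.h>

/* Hypothetical helper: translate a file offset into the virtual address it is
 * loaded at, using one program header. Returns false when the offset is not
 * backed by this segment's file contents. */
static bool file_offset_to_vaddr(const Elf64_Phdr *ph, Elf64_Off off, Elf64_Addr *out)
{
    if (off < ph->p_offset || off - ph->p_offset >= ph->p_filesz)
        return false;
    *out = ph->p_vaddr + (off - ph->p_offset);
    return true;
}
```

In other words, a byte keeps its distance from the start of its segment; only the base changes from p_offset to p_vaddr.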

Arbitrary numbers and where to find them

I initially planned to make a 32-bit ELF, so early poking around on various threads led me to the magical "somewhere above 0x8048000" number, which is about 128 MiB. In 64-bit land, the 0x4000000 address seemed to be used, so this is the origin memory address used going forward. A thread that explains some of the magic behind these numbers is here.

The structure of our ELF file will be as follows:

#define ORG (0x4000000) // Origin memory address

#define PGM_HEADER_TBL_LOC (sizeof(Elf64_Ehdr)) // Program header location
#define PGM_HEADER_SIZE (sizeof(Elf64_Phdr))    // Program header size
#define SEC_HEADER_SIZE (sizeof(Elf64_Shdr))    // Section header size
#define PGM_HEADER_NUM (2)                      // Number of program headers
#define SEC_HEADER_NUM (4)                      // Number of section headers

#define TEXT_FILE_LOC (PGM_HEADER_TBL_LOC + (PGM_HEADER_NUM * PGM_HEADER_SIZE)) // .text file location
#define TEXT_MEM_LOC (ORG + TEXT_FILE_LOC)                                      // .text memory location

#define ENTRY_POINT TEXT_MEM_LOC // Executable entry point
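To make the arithmetic concrete: on x86-64 Linux an Elf64_Ehdr is 64 bytes and an Elf64_Phdr is 56 bytes, so .text starts at file offset 64 + 2 * 56 = 176 (0xB0) and is loaded at 0x40000B0. The sketch below repeats the relevant macros and checks those numbers at compile time; print_layout is a hypothetical helper added only for illustration.

```c
#include <elf.h>
#include <stddef.h>
#include <stdio.h>

#define ORG (0x4000000)
#define PGM_HEADER_TBL_LOC (sizeof(Elf64_Ehdr))
#define PGM_HEADER_SIZE (sizeof(Elf64_Phdr))
#define PGM_HEADER_NUM (2)
#define TEXT_FILE_LOC (PGM_HEADER_TBL_LOC + (PGM_HEADER_NUM * PGM_HEADER_SIZE))
#define TEXT_MEM_LOC (ORG + TEXT_FILE_LOC)

/* Compile-time checks of the sizes the layout depends on (C11). */
_Static_assert(sizeof(Elf64_Ehdr) == 64, "ELF64 header should be 64 bytes");
_Static_assert(sizeof(Elf64_Phdr) == 56, "ELF64 program header should be 56 bytes");
_Static_assert(TEXT_FILE_LOC == 0xB0, ".text should start at file offset 0xB0");
_Static_assert(TEXT_MEM_LOC == 0x40000B0, ".text should load at 0x40000B0");

/* Hypothetical helper: print the computed locations. */
static void print_layout(void)
{
    printf(".text file location:   %#zx\n", (size_t)TEXT_FILE_LOC);
    printf(".text memory location: %#zx\n", (size_t)TEXT_MEM_LOC);
}
```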

File layout:

| Section | Notes |
| --- | --- |
| ELF Header | How the file is laid out |
| Program Header Table: Program Header 1 | For the .text section |
| Program Header Table: Program Header 2 | For the .data section |
| .text section | x86 Assembly code |
| .data section | General data storage |
| .shstrtab section | String table |
| Section Header table: Section Header 1 | Mandatory null section |
| Section Header table: Section Header 2 | For the .text section |
| Section Header table: Section Header 3 | For the .data section |
| Section Header table: Section Header 4 | For the .shstrtab section |

We can now start to plug some values into the ELF header on where data is located.

ELFHeader.e_entry   = ENTRY_POINT;        // Entry point of program
ELFHeader.e_phoff   = PGM_HEADER_TBL_LOC; // Program header table offset
ELFHeader.e_shoff   = 0x0;                // Section header table offset
ELFHeader.e_flags   = 0x0;                // Processor specific flags
ELFHeader.e_ehsize  = sizeof(Elf64_Ehdr); // ELF Header size

As the section header table is at the end of the file, and we don't actually have any data in the file yet, we hold off setting the section header table offset for now.
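Once the lengths of the three sections are known, the deferred offset is just the sum of everything that precedes the section header table in the file layout. A minimal sketch, assuming a hypothetical helper named section_header_table_offset and the same length variables used elsewhere on this page:

```c
#include <elf.h>
#include <stddef.h>

#define PGM_HEADER_NUM (2)
#define TEXT_FILE_LOC (sizeof(Elf64_Ehdr) + (PGM_HEADER_NUM * sizeof(Elf64_Phdr)))

/* Hypothetical helper: the section header table follows the .text, .data and
 * string table contents, so its offset is the sum of their lengths added to
 * the start of .text. */
static Elf64_Off section_header_table_offset(size_t lengthoftext,
                                             size_t lengthofdata,
                                             size_t lengthofstringtable)
{
    return TEXT_FILE_LOC + lengthoftext + lengthofdata + lengthofstringtable;
}
```

The result would then be assigned to ELFHeader.e_shoff before the header is written out.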

Section Sizing

Next up is to tell the ELF header how many program and section headers we have and how big each entry is.

ELFHeader.e_phentsize = PGM_HEADER_SIZE; // Size of each program header
ELFHeader.e_phnum = PGM_HEADER_NUM;      // Number of entries in program header table
ELFHeader.e_shentsize = SEC_HEADER_SIZE; // Section header size, in bytes
ELFHeader.e_shnum = SEC_HEADER_NUM;      // Number of entries in section header

ELFHeader.e_shstrndx = SHN_UNDEF; // Section header table index of the entry associated with the section name string table, to be set later
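With the section order used on this page (null, .text, .data, string table), the string table is the fourth section header, so its index is 3. One way to keep that index and the section count in sync is an enum; the names below are hypothetical and not taken from YABFC.

```c
#include <elf.h>

/* Hypothetical names for the section header indices, in table order. */
enum section_index {
    SEC_NULL,      /* 0: mandatory null section */
    SEC_TEXT,      /* 1 */
    SEC_DATA,      /* 2 */
    SEC_SHSTRTAB,  /* 3: section name string table */
    SEC_COUNT      /* 4: number of section headers */
};

/* Hypothetical helper: fill in the section-related counts of the ELF header. */
static void set_section_counts(Elf64_Ehdr *hdr)
{
    hdr->e_shentsize = sizeof(Elf64_Shdr);
    hdr->e_shnum     = SEC_COUNT;
    hdr->e_shstrndx  = SEC_SHSTRTAB;
}
```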

Program Header Table

Let's assume that we already have an array of bytes that represents our assembly code (I will talk about some gotchas with this later). We now need a program header to describe this data. As we have two segments of data (.text and .data), we need two program headers. Fortunately, these are relatively easy to set up.

Elf64_Phdr programHeaderText;
programHeaderText.p_type   = PT_LOAD;
programHeaderText.p_flags  = PF_R + PF_X;      // Segment permissions
programHeaderText.p_offset = TEXT_FILE_LOC;    // File offset for the contents of the segment
programHeaderText.p_vaddr  = TEXT_MEM_LOC;     // Virtual address where the segment will be loaded
programHeaderText.p_paddr  = TEXT_MEM_LOC;     // Physical address; not relevant for ordinary programs, so mirror p_vaddr
programHeaderText.p_filesz = lengthoftext;     // Length of segment in bytes
programHeaderText.p_memsz  = lengthoftext;
programHeaderText.p_align  = 0x0;              // No alignment

Elf64_Phdr programHeaderData;
programHeaderData.p_type   = PT_LOAD;
programHeaderData.p_flags  = PF_R + PF_W + PF_X;           // Segment permissions
programHeaderData.p_offset = TEXT_FILE_LOC + lengthoftext; // File offset for the contents of the segment
programHeaderData.p_vaddr  = TEXT_MEM_LOC + lengthoftext;  // Virtual address where the segment will be loaded
programHeaderData.p_paddr  = TEXT_MEM_LOC + lengthoftext;  // Same as p_vaddr for "reasons"
programHeaderData.p_filesz = lengthofdata;                 // Length of segment in bytes
programHeaderData.p_memsz  = lengthofdata;
programHeaderData.p_align  = 0x0;                          // No alignment

Notice how we place the .data segment directly after the .text segment, both in the file and in memory, and how the permissions differ between the two. The .text segment can be read and executed but not written, while the .data segment can be read, written, and executed.
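The permission bits are easier to check when rendered in the familiar rwx form. A small sketch, assuming a hypothetical helper named pflags_to_str; PF_R, PF_W and PF_X are the flag constants from elf.h.

```c
#include <elf.h>

/* Hypothetical helper: render p_flags as a three-character permission string.
 * out must have room for 4 bytes (three characters plus the terminator). */
static void pflags_to_str(Elf64_Word flags, char out[4])
{
    out[0] = (flags & PF_R) ? 'r' : '-';
    out[1] = (flags & PF_W) ? 'w' : '-';
    out[2] = (flags & PF_X) ? 'x' : '-';
    out[3] = '\0';
}
```

With the flags used on this page, the .text segment renders as r-x and the .data segment as rwx.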

Section Header Table

The section headers are much the same as the program headers: they fill in the 'where' and 'how much' of the data in our file. One (semi) important piece of information is the sh_name value. This is the byte offset into the string table at which the section's NUL-terminated name begins. For example, if the string data is composed of

"\0.text\0.data\0.shstrtab\0"

The sh_name values will be as follows:

| Section | sh_name |
| --- | --- |
| Null | 0 |
| .text | 1 |
| .data | 7 |
| .shstrtab | 13 |
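Rather than counting bytes by hand, the offsets can be looked up by walking the string table one NUL-terminated name at a time. This is a sketch only: the function name shstrtab_offset is hypothetical, and the table uses the conventional spelling .shstrtab for the string table's own name.

```c
#include <stddef.h>
#include <string.h>

/* The string literal's implicit terminator supplies the final NUL byte. */
static const char shstrtab[] = "\0.text\0.data\0.shstrtab";

/* Hypothetical helper: return the byte offset of name within the table,
 * or -1 if the table does not contain it. */
static long shstrtab_offset(const char *tab, size_t tab_len, const char *name)
{
    size_t off = 0;
    while (off < tab_len) {
        if (strcmp(tab + off, name) == 0)
            return (long)off;
        off += strlen(tab + off) + 1;  /* skip this name and its NUL */
    }
    return -1;
}
```

For example, shstrtab_offset(shstrtab, sizeof shstrtab, ".data") gives the value to store in the .data header's sh_name.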

Since there are four section headers, I will only highlight the differences.

Same for all:

sectionHeaderNull.sh_link      = 0; // Currently unused
sectionHeaderNull.sh_info      = 0;
sectionHeaderNull.sh_addralign = 0;
sectionHeaderNull.sh_entsize   = 0;

Null Header

Elf64_Shdr sectionHeaderNull;
sectionHeaderNull.sh_name      = 0;
sectionHeaderNull.sh_type      = SHT_NULL;
sectionHeaderNull.sh_flags     = 0;
sectionHeaderNull.sh_addr      = 0;
sectionHeaderNull.sh_offset    = 0;
sectionHeaderNull.sh_size      = 0;

.text Header

Elf64_Shdr sectionHeaderText;
sectionHeaderText.sh_name      = 1;
sectionHeaderText.sh_type      = SHT_PROGBITS;              // Type of the section
sectionHeaderText.sh_flags     = SHF_ALLOC + SHF_EXECINSTR; // Section permissions
sectionHeaderText.sh_addr      = TEXT_MEM_LOC;              // Section memory location
sectionHeaderText.sh_offset    = TEXT_FILE_LOC;             // Section file location
sectionHeaderText.sh_size      = lengthoftext;              // Section size

.data Header

Elf64_Shdr sectionHeaderData;
sectionHeaderData.sh_name      = 7;
sectionHeaderData.sh_type      = SHT_PROGBITS;
sectionHeaderData.sh_flags     = SHF_ALLOC + SHF_WRITE;
sectionHeaderData.sh_addr      = TEXT_MEM_LOC + lengthoftext;
sectionHeaderData.sh_offset    = TEXT_FILE_LOC + lengthoftext;
sectionHeaderData.sh_size      = lengthofdata;

.shstrtab Header

Elf64_Shdr sectionHeaderShstrtab;
sectionHeaderShstrtab.sh_name      = 13;
sectionHeaderShstrtab.sh_type      = SHT_STRTAB;
sectionHeaderShstrtab.sh_flags     = 0;
sectionHeaderShstrtab.sh_addr      = 0;
sectionHeaderShstrtab.sh_offset    = TEXT_FILE_LOC + lengthoftext + lengthofdata; // String table follows .data in the file
sectionHeaderShstrtab.sh_size      = lengthofstringtable;
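Putting the pieces together, the whole file can be written out in the order given in the file layout. The following is only a sketch under stated assumptions, not the YABFC implementation: write_minimal_elf, make_phdr and make_shdr are hypothetical helpers, the payload is a hand-assembled exit(42) for x86-64 Linux, and the .data contents are eight placeholder zero bytes. It only writes the bytes; it does not check that a given kernel will load the result.

```c
#include <elf.h>
#include <stdio.h>
#include <string.h>

#define ORG 0x4000000UL

static const unsigned char text[] = {
    0xbf, 0x2a, 0x00, 0x00, 0x00,  /* mov edi, 42 */
    0xb8, 0x3c, 0x00, 0x00, 0x00,  /* mov eax, 60 (exit) */
    0x0f, 0x05                     /* syscall */
};
static const unsigned char data[8] = {0};  /* placeholder data */
static const char shstrtab[] = "\0.text\0.data\0.shstrtab";

static Elf64_Phdr make_phdr(Elf64_Word flags, Elf64_Off off, Elf64_Xword size)
{
    Elf64_Phdr p;
    memset(&p, 0, sizeof p);
    p.p_type = PT_LOAD;
    p.p_flags = flags;
    p.p_offset = off;
    p.p_vaddr = p.p_paddr = ORG + off;  /* memory location = ORG + file location */
    p.p_filesz = p.p_memsz = size;
    return p;
}

static Elf64_Shdr make_shdr(Elf64_Word name, Elf64_Word type, Elf64_Xword flags,
                            Elf64_Addr addr, Elf64_Off off, Elf64_Xword size)
{
    Elf64_Shdr s;
    memset(&s, 0, sizeof s);  /* sh_link, sh_info, sh_addralign, sh_entsize = 0 */
    s.sh_name = name;  s.sh_type = type;  s.sh_flags = flags;
    s.sh_addr = addr;  s.sh_offset = off; s.sh_size = size;
    return s;
}

/* Returns 0 on success, -1 on any I/O error. */
int write_minimal_elf(const char *path)
{
    const Elf64_Off text_off = sizeof(Elf64_Ehdr) + 2 * sizeof(Elf64_Phdr);
    const Elf64_Off data_off = text_off + sizeof text;
    const Elf64_Off str_off  = data_off + sizeof data;

    Elf64_Ehdr eh;
    memset(&eh, 0, sizeof eh);
    memcpy(eh.e_ident, ELFMAG, SELFMAG);
    eh.e_ident[EI_CLASS]   = ELFCLASS64;
    eh.e_ident[EI_DATA]    = ELFDATA2LSB;
    eh.e_ident[EI_VERSION] = EV_CURRENT;
    eh.e_ident[EI_OSABI]   = ELFOSABI_SYSV;
    eh.e_type = ET_EXEC;  eh.e_machine = EM_X86_64;  eh.e_version = EV_CURRENT;
    eh.e_entry = ORG + text_off;
    eh.e_phoff = sizeof(Elf64_Ehdr);
    eh.e_shoff = str_off + sizeof shstrtab;  /* table sits at the end of the file */
    eh.e_ehsize = sizeof(Elf64_Ehdr);
    eh.e_phentsize = sizeof(Elf64_Phdr);  eh.e_phnum = 2;
    eh.e_shentsize = sizeof(Elf64_Shdr);  eh.e_shnum = 4;
    eh.e_shstrndx = 3;                       /* index of the string table header */

    Elf64_Phdr ph[2] = {
        make_phdr(PF_R | PF_X, text_off, sizeof text),
        make_phdr(PF_R | PF_W | PF_X, data_off, sizeof data)
    };
    Elf64_Shdr sh[4] = {
        make_shdr(0, SHT_NULL, 0, 0, 0, 0),
        make_shdr(1, SHT_PROGBITS, SHF_ALLOC | SHF_EXECINSTR, ORG + text_off, text_off, sizeof text),
        make_shdr(7, SHT_PROGBITS, SHF_ALLOC | SHF_WRITE, ORG + data_off, data_off, sizeof data),
        make_shdr(13, SHT_STRTAB, 0, 0, str_off, sizeof shstrtab)
    };

    FILE *f = fopen(path, "wb");
    if (!f)
        return -1;
    int ok = fwrite(&eh, sizeof eh, 1, f) == 1
          && fwrite(ph, sizeof ph, 1, f) == 1
          && fwrite(text, sizeof text, 1, f) == 1
          && fwrite(data, sizeof data, 1, f) == 1
          && fwrite(shstrtab, sizeof shstrtab, 1, f) == 1
          && fwrite(sh, sizeof sh, 1, f) == 1;
    return (fclose(f) == 0 && ok) ? 0 : -1;
}
```

Calling write_minimal_elf("a.out") and marking the result executable gives a file that tools such as readelf can be pointed at to compare each header against the values described above.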

List of resources
