Fucking nerd.
What on earth is source code anyway?
Source code is the (nerd) human readable form of software. To run it you put it through a thing called a compiler that translates it into the ones and zeros the computer processor can actually process, and those raw ones and zeros are how proprietary software is generally distributed.
Having the source code means it's easy (for nerds) to change the software or to run it on other kinds of processors.
Hm ok. So one writes source code in a coding language, it gets turned into 1s and 0s. Why can't you go back? Source code gets compiled into a specific order of 1s and 0s, but the same set of 1s and 0s could be made from different types of source code?
it's pretty hard to un-bake a cake
It's like trying to figure out the exact tools used to build a house by looking at the finished house. You can figure out some tools (a hammer, a paintbrush, etc.) but it's hard to know exactly. Programs depend so heavily on the components that make them up that guessing isn't a good solution.
Like others said, you sort of can. But I also want to add that things like function names, or comments explaining how a function works, are not needed by your computer when running the program, and thus they get lost after compiling. After running a program designed to reverse engineer a compiled program, you'll be able to see a very dumbed-down version: no meaningful function or variable names, and no comments explaining the code. You have to figure those out all by yourself.
Add to that the fact that some companies/programmers make parts of the program difficult to read on purpose, so you have even more guesswork to do when reverse engineering. You've got a giant task ahead of you reverse engineering even small games.
On a side note, the original source code can also just be interesting or funny to read. Valve's source code comments come to mind.
> Why can't you go back
You sort of can. There are decompilers like Ghidra that can help with this, but it usually takes a lot of manual effort to properly decode the result.
> the same set of 1s and 0s could be made from different types of source code
Yeah, basically. Companies will also take extra steps to make it so people can't get source code from software, since it's their proprietary IP and whatever.
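To make the "different source, same ones and zeros" point concrete, here's a minimal sketch. The function names are made up for illustration, and whether a given compiler really emits identical machine code for all three depends on the compiler and its optimisation settings, but an optimising compiler will commonly do so:

```c++
#include <cstdint>

// Three different ways a programmer might write "double this number".
// All of these names are hypothetical, chosen just for this example.
auto doubleByMultiplying(std::uint32_t x) -> std::uint32_t { return x * 2u; }
auto doubleByAdding(std::uint32_t x) -> std::uint32_t { return x + x; }
auto doubleByShifting(std::uint32_t x) -> std::uint32_t { return x << 1; }
```

All three always give the same answer, so the compiler is free to pick whichever instruction it likes best for each of them. Someone looking only at the binary has no way to tell which version the programmer originally typed.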
You can go back but it's very difficult. Only the biggest nerds can do it, with great dedication and time. That process is called reverse engineering.
For a very simple example, suppose I wrote some code to add how many apples Jack and Jill have together. The source code might look like
jackApples = 3
jillApples = 4
numApples = jackApples + jillApples
But the computer doesn't care about Jack, or Jill, or apples for that matter. It only cares about numbers. So when the compiler puts it into ones and zeros all those useful names get dropped. And when I decompile the binary (what we call those ones and zeros) what I get back might look more like
var1 = 3
var2 = 4
var3 = var1 + var2
And if I want to change how many apples Jill has it's a whole process of trial and error to figure out which variable is Jill's number of apples.
Now expand that to thousands or millions of lines of code and you begin to see why nerds want source code instead of binaries.
The compiler will see that `var3` is just two numbers added together and replace it with 7, which saves having to do an addition every time you run through that code, and is therefore faster. `var1` and `var2` may be removed from the output as well; shorter code runs faster since you can fit more in the cache. In fact, since `var3` is just a number, you can replace every place that it's used with a 7 as well; if you have some functions:
```c++
// be careful! if the number of apples is less than six then the UI will not line up properly
auto getTheNumberOfApples() -> int {
    auto jackApples = 3;
    auto jillApples = 4;
    return jackApples + jillApples;
}

auto appleWeight() -> float {
    return 0.2 * getTheNumberOfApples();
}
```
... then the compiler will look at all that, delete the lot, and just use `1.4f` wherever the `appleWeight()` function was called. Comment is gone, the decision making is gone, it's impossible to go backwards any more.
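If you want to see for yourself that the compiler can work these values out before the program ever runs, here's a rough sketch using `constexpr` and `static_assert` (it assumes C++14 or later). Note that this is not quite the same thing as the optimiser's constant folding: `constexpr` explicitly asks for compile-time evaluation, while the optimiser does it on its own whenever it can prove it's safe.

```c++
// Same apple functions as above, marked constexpr so the compiler is
// allowed (and, inside static_assert, required) to evaluate them itself.
constexpr auto getTheNumberOfApples() -> int {
    constexpr auto jackApples = 3;
    constexpr auto jillApples = 4;
    return jackApples + jillApples;
}

constexpr auto appleWeight() -> float {
    return 0.2f * getTheNumberOfApples();
}

// These checks happen during compilation, so the answers must already be
// known to the compiler: no addition or multiplication is left for runtime.
static_assert(getTheNumberOfApples() == 7, "apples fold to a constant");
static_assert(appleWeight() > 1.39f && appleWeight() < 1.41f, "weight is roughly 1.4");
```

If either `static_assert` were wrong, the code would simply refuse to compile.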
I'm not a professional programmer and just a hobbyist, but if you also had a set function that changes jackApples to an input integer, what happens at compilation?
That disables a whole pile of the potential optimisations, of course. You could define `jackApples` as a "static variable" (as opposed to making it e.g. a field in a class or struct):
```c++
namespace {
auto jackApples = 3;
}

auto setJackApples(int newJackApples) -> void {
    jackApples = newJackApples;
}
```
The most obvious consequence of this is that `jackApples` now has an address in memory, which you could find out with `&jackApples`. Executable programs are arranged into a sequence of blocks when they're compiled, which have some historical names based on what they used to be for:

- `text` section, which contains all of the executable code, and which might be made read-only by the OS.
- `data` section, which contains variables that have a known value at startup.
- `bss` section, which contains variables that we know will exist but that weren't given an initial value. On mainstream systems it's zeroed out when the program is loaded, which is how C and C++ can promise that such variables start at zero.

Because it's statically allocated and starts with a value, `jackApples` will be in the `data` section; if you opened up the executable with a hex editor, you'd see a 3 there.
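Here's a small sketch of the `data` versus `bss` distinction. `jillApples` and `addressOfJackApples()` are names I've added for illustration, and the section placement in the comments is what typical toolchains do rather than something the C++ language itself guarantees:

```c++
namespace {
auto jackApples = 3;  // has an initial value: typically ends up in .data
int jillApples;       // no initialiser: typically ends up in .bss, starts at 0
}

auto setJackApples(int newJackApples) -> void { jackApples = newJackApples; }

// The variable lives at one address for the whole run of the program,
// so every call returns the same pointer.
auto addressOfJackApples() -> const int* { return &jackApples; }

// The compiler can no longer fold this to a constant, because
// setJackApples() may have changed jackApples in the meantime.
auto getTheNumberOfApples() -> int { return jackApples + jillApples; }
```

Calling `setJackApples(10)` and then reading through `addressOfJackApples()` gives 10, because both are looking at the same spot in memory.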
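`getTheNumberOfApples()` will be optimised by the compiler to return the contents of the memory address plus 4. That still counts as a very simple and short function, and it's quite likely that the compiler would inline it and remove the initial function. The actual process of calling a function is, roughly, to put the arguments where the function expects them, save the address to come back to, jump to the function's code, run it, and then jump back with the result.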
That takes a while, and worse - modern CPUs will try to "pipeline" all the instructions that they know are coming so that it all runs faster. Jumping to a function might break that pipeline, causing a "stall", which slows things down enormously. Much better to inline short functions - the fact that the value is "number in memory address plus four" might be optimised away a little wherever it's used, too.
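As a rough picture of what inlining does, here's a sketch with the "before" and "after" written out by hand. `appleWeightInlined()` is a made-up name; a real compiler does this transformation internally and doesn't produce a second function you can see in the source:

```c++
namespace {
auto jackApples = 3;
}

auto setJackApples(int newJackApples) -> void { jackApples = newJackApples; }

// What the programmer wrote: one short function calling another.
auto getTheNumberOfApples() -> int { return jackApples + 4; }
auto appleWeight() -> float { return 0.2f * getTheNumberOfApples(); }

// Roughly what an inlining compiler might turn appleWeight() into: the call
// is gone and the body of getTheNumberOfApples() is pasted in its place.
auto appleWeightInlined() -> float { return 0.2f * (jackApples + 4); }
```

Both versions compute the same weight for any value of `jackApples`, but the second one never has to jump anywhere.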
To add on to what the others have said, the compiler will also optimise your code (which is why professional coders write in common patterns as much as possible, so the compiler can recognise them and optimise).
So many times, you literally won't even have the same program.
Also, machine-understandable code (assembly or 1s and 0s) is different depending on the processor used. You could give me machine code made for a RISC-V processor and I could reconstruct a C program that made it. But if I had the same program compiled for an x86 processor ...