Early computing required programming in languages very close to the machine: machine code and assembly languages. Because instructions were written directly at a low level, the resulting programs were very efficient. However, they were also very complex to develop.
Today, we can use high-level languages that handle details such as memory access and register allocation for us, so we do not have to worry about them. The main drawback of this abstraction is that it makes it easy to write highly inefficient algorithms or programs without noticing.
An algorithm is a set of instructions that are followed to achieve a goal or produce a result. The term does not apply only to computing. For example, everyday tasks as simple as brushing your teeth, washing your hands or following the instruction manual to assemble a piece of furniture can be seen as algorithms. In programming, an algorithm is a set of computer instructions, typically packaged as a function, that solves a specific task.
Let's look at an example of a very simple algorithm that allows us to define a program to calculate the area of a triangle:
Process CalculateTriangleArea(base, height)
    Multiply the base by the height and store the result in product;
    Divide product by 2 and store the result in area;
    Write "The area is", area;
EndProcess
Once the algorithm has been defined, we can implement it in a programming language such as Python:
def calculate_triangle_area(base, height):
    # Area of a triangle: (base * height) / 2
    product = base * height
    area = product / 2
    return f"The area is {area}"

print(calculate_triangle_area(20, 15))
A simple problem can be solved using many different algorithms. Some solutions simply take less time and space than others. But how do we know which solutions are more efficient?
Time complexity is the number of operations an algorithm performs to complete its task, assuming that each operation takes the same amount of time. The algorithm that completes the task with the fewest operations is considered the most efficient in terms of time complexity. The languages most commonly used in data analysis, such as Python, R or Julia, and their libraries try to keep the computational cost of their built-in operations as low as possible, which is one of the reasons why some developers prefer one language over another.
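To make the idea of counting operations concrete, here is a minimal sketch that solves the same problem (finding a value in a sorted list) with two different algorithms. The function names and the `steps` counter are illustrative assumptions, and counting one step per inspected element is a simplification of what "operation" means.

```python
def linear_search(sorted_values, target):
    """Inspect elements one by one; returns (index, steps)."""
    steps = 0
    for index, value in enumerate(sorted_values):
        steps += 1  # one element inspected
        if value == target:
            return index, steps
    return -1, steps


def binary_search(sorted_values, target):
    """Halve the search range on every step; returns (index, steps)."""
    steps = 0
    low, high = 0, len(sorted_values) - 1
    while low <= high:
        middle = (low + high) // 2
        steps += 1  # one element inspected
        if sorted_values[middle] == target:
            return middle, steps
        if sorted_values[middle] < target:
            low = middle + 1
        else:
            high = middle - 1
    return -1, steps


values = list(range(1_000))
linear_result = linear_search(values, 999)  # (999, 1000)
binary_result = binary_search(values, 999)  # (999, 10)
print(linear_result, binary_result)
```

Both functions return the same index, but for the last element of a 1,000-item list the linear version inspects 1,000 elements while the binary version inspects 10.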
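Time complexity is usually expressed in Big O notation, which describes how the number of operations grows with the size of the input. Common classes include constant O(1), logarithmic O(log n), linear O(n) and quadratic O(n²).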
There are many more measures for classifying the efficiency of statements and, therefore, of algorithms. The following graph compares the most common ones:
It can be clearly seen that the ideal scenario is to have algorithms composed of statements of the lowest possible complexity, ideally constant or logarithmic. Python and its main libraries and packages already implement many of these optimizations, so the built-in functions we call usually have low complexity. What remains entirely our responsibility, and directly affects the efficiency of the code, is following good programming practices.
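As a small illustration of why lower-complexity statements are preferable, the sketch below computes the sum of the first n integers in two ways. The function names are hypothetical and chosen only for this example.

```python
def sum_linear(n):
    """O(n): performs one addition for every number from 1 to n."""
    total = 0
    for number in range(1, n + 1):
        total += number
    return total


def sum_constant(n):
    """O(1): the closed-form formula does the same work for any n."""
    return n * (n + 1) // 2


print(sum_linear(100), sum_constant(100))  # 5050 5050
```

Both functions give the same answer, but the work done by the first grows with n, while the second always performs one multiplication, one addition and one division.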
As hardware advances, so does software. As processors and graphics cards become faster and more capable, programming languages evolve too. A first principle for any developer looking for efficient code is to keep the programming language and its libraries up to date. For example, CPython 3.11 is reported to be on average about 25% faster than CPython 3.10 on the pyperformance benchmark suite, so part of the needed efficiency can be gained simply by upgrading the interpreter.
Sooner or later you will want to compute something in Python, such as an average, and you will set out to write it by hand: first adding the numbers and then dividing by the sample size. Did you know that many libraries can do this in a single line? In addition, the code behind these library functions is usually highly optimized, so using them is generally more efficient than writing your own. The resulting code is also cleaner, easier to understand and usually more scalable.
✅ Do this | ❌ Don't do this |
---|---|
np.array([1, 2, 3]).mean() | def mean(elements): sum = elements.sum() n = len(elements) return sum/n mean(np.array([1, 2, 3])) |
names['Gender'] = names['Gender'].replace('female', 'FEMALE') | names['Gender'].loc[names.Gender=='female'] = 'FEMALE' |
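The first row of the table can be sketched as a runnable comparison. To keep the example free of third-party dependencies, it uses `statistics.mean` from the standard library instead of NumPy, which is an assumption of this sketch. `manual_mean` is a hypothetical hand-written version.

```python
import statistics


def manual_mean(elements):
    """Hand-written average: works, but re-implements existing functionality."""
    total = 0
    for value in elements:
        total += value
    return total / len(elements)


samples = [1, 2, 3]
library_result = statistics.mean(samples)  # one call, already tested and documented
manual_result = manual_mean(samples)
print(library_result, manual_result)
```

Both approaches agree on the result. The library call is shorter and handles edge cases consistently: for example, `statistics.mean` raises `StatisticsError` on empty input.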
This also applies to projects or applications that you want to make. Maybe they have already been done before and you can start from that project to make your own. Do you want to make a calculator? Do some research, see if someone has already made one in Python and use it as a starting point.
Here are some of the most used and necessary tricks in day-to-day work with data: https://www.turing.com/kb/22-hottest-python-tricksfor-efficient-coding
Python provides many mechanisms for performing tasks efficiently in terms of both computation and time, as shown in the following examples:
✅ Do this | ❌ Don't do this |
---|---|
def good_list(elements): return [value for value in range(elements)] | def bad_list(elements): my_list = [] for value in range(elements): my_list.append(value) return my_list |
def good_string_joiner(elements): return "".join(elements) | def bad_string_joiner(elements): final_string = "" for value in elements: final_string += value return final_string |
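Because table cells cannot preserve indentation, here is a runnable sketch of the same four functions, with return statements added so the results can be compared. The `timeit` measurement at the end is only indicative: actual timings depend on the machine and the Python version, so no expected output is shown.

```python
import timeit


def good_list(elements):
    # List comprehension: builds the list in a single expression
    return [value for value in range(elements)]


def bad_list(elements):
    # Explicit loop with repeated append calls
    my_list = []
    for value in range(elements):
        my_list.append(value)
    return my_list


def good_string_joiner(elements):
    # str.join assembles the final string in one pass
    return "".join(elements)


def bad_string_joiner(elements):
    # Repeated += may create a new string object on each iteration
    final_string = ""
    for value in elements:
        final_string += value
    return final_string


words = ["data", "science", "python"]
print(good_string_joiner(words) == bad_string_joiner(words))  # True

# Rough timing comparison (values vary between runs and machines)
comprehension_time = timeit.timeit(lambda: good_list(1_000), number=200)
append_time = timeit.timeit(lambda: bad_list(1_000), number=200)
print(f"comprehension: {comprehension_time:.4f}s, append: {append_time:.4f}s")
```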
There are many ways to optimize your code, like the ones shown above: list comprehensions, string accumulation using join, collections, itertools...
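As a brief illustration of the last two modules mentioned, the sketch below uses `collections.Counter` to count occurrences and `itertools.chain` with `itertools.islice` to walk several lists lazily. The sample data is invented for this example.

```python
from collections import Counter
from itertools import chain, islice

# Counter replaces a hand-written dictionary of counts
words = ["python", "data", "python", "ml", "data", "python"]
word_counts = Counter(words)
top_word = word_counts.most_common(1)
print(top_word)  # [('python', 3)]

# chain.from_iterable walks nested lists without building an intermediate list;
# islice stops after the first five items
batches = [[1, 2], [3, 4], [5, 6]]
first_five = list(islice(chain.from_iterable(batches), 5))
print(first_five)  # [1, 2, 3, 4, 5]
```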
More information on how you can make your code as efficient as possible by taking advantage of Python's native tools and packages here: https://khuyentran1401.github.io/Efficient_Python_tricks_and_tools_for_data_scientists/README.html
In any programming language, variables and objects take up memory, so a good way to keep your code efficient is to remove variables that you no longer need. In Python, you can check how much memory a variable occupies with the sys.getsizeof(variable) function of the sys module. If you notice that a variable is considerably large, consider deleting it with del so that it does not load the memory of the execution environment unnecessarily: the more saturated the memory is, the worse the program will perform.
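A minimal sketch of this workflow follows. Note that `sys.getsizeof` reports only the size of the object itself (here, the list and its internal array of references), not the objects it refers to. The exact number of bytes depends on the platform and Python version, so no value is shown.

```python
import sys

numbers = list(range(100_000))
size_in_bytes = sys.getsizeof(numbers)
print(f"The list object occupies {size_in_bytes} bytes")

# Remove the name once the data is no longer needed; the memory can be
# reclaimed when no other references to the object remain
del numbers

try:
    numbers
except NameError:
    name_removed = True  # the name no longer exists
else:
    name_removed = False
print(name_removed)  # True
```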
We often need inspiration from others. Perhaps before reading this content you did not know that mechanisms such as list comprehensions or the NumPy function for calculating the average existed. One of the best ways to make your code more efficient is therefore to learn from the code of other developers. Experience is the best path to efficiency.
Besides the best practices that make code more efficient in terms of time and resources, there are also best practices that make code more understandable and standardized, which makes it easier for developers to share knowledge and follow a common standard. There are many proposals and guides for writing Python code, but the best known is PEP 8 - Style Guide for Python Code, which you can read here: https://peps.python.org/pep-0008/
This document provides coding conventions for the Python code comprising the standard library in the main Python distribution. The guide evolves over time as additional conventions are identified and older ones become obsolete with changes in the language.
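As a short, non-exhaustive illustration of a few PEP 8 conventions (snake_case function names, spaces around operators and after commas, four-space indentation), the triangle example from earlier could be written as follows. The non-compliant variant is an invented counterexample, shown as a comment.

```python
# Not PEP 8 style (kept as a comment for comparison):
# def CalcArea(B,H): return B*H/2


def calculate_triangle_area(base, height):
    """Return the area of a triangle given its base and height."""
    return base * height / 2


area = calculate_triangle_area(20, 15)
print(area)  # 150.0
```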