
Importing Static Files with Pandas

Loading files in Python

Below are code examples showing how to load different types of files. The examples will not run as they are, since they expect the files to exist in your working directory, but you can use them as a reference.

CSV

A CSV (Comma-Separated Values) file represents information in table format: columns are usually separated by a comma (,), although other delimiters are also supported, and rows are separated by a line break.

To work with a CSV we normally load it into a Pandas DataFrame, which the following code does:

In [ ]:
import pandas as pd

# Set the path to the file you want to read
file = "input.csv"

# The 'read_csv' function allows this reading to be carried out, transforming the file into a DataFrame
df = pd.read_csv(file, sep = ",")

The Pandas read_csv function has many parameters for adapting the read to the characteristics of the file, such as the delimiter, the header row, or the column types. They are described in the Pandas documentation for read_csv.
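As a rough illustration of those parameters, the sketch below reads a semicolon-separated table that uses a decimal comma. The sample data and column names are made up for this example, and io.StringIO stands in for a file on disk so that the snippet runs on its own.

```python
import io

import pandas as pd

# Hypothetical sample data: semicolon-separated, with a decimal comma
raw = "id;name;price\n1;apple;1,50\n2;pear;2,25\n"

# StringIO lets read_csv treat the string as if it were a file
df = pd.read_csv(
    io.StringIO(raw),
    sep=";",                  # column delimiter used by this data
    decimal=",",              # character used as the decimal point
    usecols=["id", "price"],  # load only these columns
    dtype={"id": "int64"},    # force the type of a column
)

print(df)
```

With a real file, the first argument would simply be the path, as in the cell above.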

Excel (XLSX, XLS)

A Microsoft Excel file (with XLSX or XLS extension) stores its data as tables arranged in sheets, so a sheet can also be loaded into a Pandas DataFrame:

In [ ]:
import pandas as pd

# Set the path to the file you want to read
file = "input.xlsx"

# The 'read_excel' function reads one sheet of the workbook into a DataFrame
# 'sheet_name' must match the name of the sheet inside your file
df = pd.read_excel(file, sheet_name = "Sheet1")

The Pandas read_excel function has many parameters for adapting the read to the characteristics of the file, such as the sheet to read or the columns to load. They are described in the Pandas documentation for read_excel.

JSON

A JSON (JavaScript Object Notation) file is a format for transmitting structured information. Its organization is similar to that of XML, but the notation is different. In a JSON file, the elements are hierarchical.

This type of file can be read in several ways, since JSON objects map directly onto Python dictionaries. We can also transform it into a Pandas DataFrame if it has a regular structure.

1. File to dictionary

Assume, for example, the following structure:

{
    "filename": "invoice.pdf",
    "numPages": 3,
    "fields": {
        "customerName": "Telefónica S.A.",
        "invoiceNumber": "1234ABCD",
        "totalAmount": "15.000",
        "currency": "EUR"
    }
}

A JSON document like this one is best read as a Python dictionary. It does not make sense to read it as a Pandas DataFrame since it does not have a tabular structure, as we will see later. We can read this file with Python's json package:

In [ ]:
import json

# Set the path to the file you want to read
file = "input.json"

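# JSON files are normally UTF-8 encoded; being explicit avoids platform-dependent defaults
with open(file, "r", encoding="utf-8") as f: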
    data = json.load(f)
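Once loaded, the content behaves like any other Python dictionary. The sketch below inlines the example structure as a string so it can run without a file; the "taxId" key is hypothetical and only serves to show how a missing key can be handled.

```python
import json

# The example structure from above, inlined as a string for illustration
raw = """
{
    "filename": "invoice.pdf",
    "numPages": 3,
    "fields": {
        "customerName": "Telefónica S.A.",
        "invoiceNumber": "1234ABCD",
        "totalAmount": "15.000",
        "currency": "EUR"
    }
}
"""

# json.loads parses a string; json.load (used above) parses an open file
data = json.loads(raw)

# Top-level values are accessed by key
filename = data["filename"]
num_pages = data["numPages"]

# Nested objects become nested dictionaries
customer = data["fields"]["customerName"]

# .get returns a default instead of raising KeyError when a key is missing
tax_id = data["fields"].get("taxId", "unknown")

print(filename, num_pages, customer, tax_id)
```

Note that JSON numbers become Python numbers (numPages is an int), while quoted values such as totalAmount remain strings.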

2. File to DataFrame

Suppose, for example, the following structure:

{
    "files": [
        {
            "filename": "invoice1.pdf",
            "numPages": 3
        },
        {
            "filename": "invoice2.docx",
            "numPages": 10
        },
        {
            "filename": "invoice3.pdf",
            "numPages": 2
        }
    ],
    "status": 200
}

This JSON example replicates a response from a server after a query has been sent to it. Part of its content (the part we are interested in) has a tabular structure: each element of the "files" list (a dictionary) represents a row, and each key of those dictionaries represents a column. We can therefore transform it into a DataFrame:

In [ ]:
import json

import pandas as pd

# Set the path to the file you want to read
file = "input.json"

# First we read the JSON content
with open(file, "r") as f:
    data = json.load(f)

# Only the "files" list is tabular: each dictionary becomes a row and each key a column
df = pd.DataFrame(data["files"])
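If the top-level fields should be kept next to the rows, one possible approach is pd.json_normalize, which flattens a list of records and can attach outer values to each row. The sketch below inlines the example response as a string so that it runs without a file.

```python
import json

import pandas as pd

# The example response from above, inlined as a string for illustration
raw = """
{
    "files": [
        {"filename": "invoice1.pdf", "numPages": 3},
        {"filename": "invoice2.docx", "numPages": 10},
        {"filename": "invoice3.pdf", "numPages": 2}
    ],
    "status": 200
}
"""
data = json.loads(raw)

# Each dictionary in "files" becomes a row; "status" is repeated on every row
df = pd.json_normalize(data, record_path="files", meta=["status"])

print(df)
```

The resulting DataFrame has three rows and the columns filename, numPages, and status.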

TXT

A TXT file (plain text) is a flat file that can hold structured or unstructured information. Since its content can follow the CSV or JSON layouts, among others, the reading approaches seen previously also apply. To read the raw content of a text file, Python offers a very simple approach:

In [ ]:
# Set the path to the file you want to read
file = "input.txt"

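# Each method is called on a freshly opened file so that it starts from the beginning;
# called one after another on the same handle, the later ones would find nothing left to read
with open(file, "r") as f:
    data = f.read()

with open(file, "r") as f:
    data = f.readline(10)

with open(file, "r") as f:
    data = f.readlines()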

In the above example, three methods are used, each with a different result. Suppose the file had the following contents:

Hello, how are you?
This file is an example document
To read it through Python

read() function

This method returns the entire contents of the file as a single string, including the line-break characters ("\n"). In the above example, the result would be:

"Hello, how are you?\nThis file is an example document\nTo read it through Python"

readline(10) function

This method reads a single line, and its optional argument caps the number of characters returned; it never reads past the end of the current line. In the above example, the result would be:

"Hello, how"

Since we pass 10 as the argument, it returns the first 10 characters of the first line.

readlines() function

This method reads the whole file and returns it as a list with one string per line. Each string keeps its trailing line break, except the last one when the file does not end with a line break. In the above example, the result would be:

["Hello, how are you?\n", "This file is an example document\n", "To read it through Python"]
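To try the three methods without preparing a file by hand, the following sketch writes the example text to a temporary file (a stand-in for input.txt) and reads it back on a single handle. It assumes seek(0) is used to rewind between reads, since each read leaves the position where it stopped.

```python
import os
import tempfile

# Write the three example lines to a temporary file (stand-in for input.txt)
text = "Hello, how are you?\nThis file is an example document\nTo read it through Python"
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False, encoding="utf-8") as tmp:
    tmp.write(text)
    path = tmp.name

with open(path, "r", encoding="utf-8") as f:
    whole = f.read()          # entire file as one string
    f.seek(0)                 # rewind to the start before reading again
    first = f.readline(10)    # at most 10 characters of the first line
    f.seek(0)
    lines = f.readlines()     # list of lines, each keeping its "\n"

# Strip the line breaks if only the text of each line is needed
clean = [line.rstrip("\n") for line in lines]

# Remove the temporary file once we are done with it
os.remove(path)

print(repr(whole), repr(first), lines, clean, sep="\n")
```

Without the seek(0) calls, readline and readlines would return an empty string and an empty list, because read() has already consumed the whole file.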