← Back to Lessons
Editar en Github

Connecting to SQL Databases

How to connect to SQL databases using Python

As machine learning engineers we will probably have to interact with SQL databases in order to access data. SQL means Structured Query Language. The key difference between SQL and Python is that developers use SQL to access and extract data from a database, whereas developers use Python to analyze and manipulate data by running regression tests, time series tests and other data processing computations.

Some popular SQL databases are SQLite, PostgreSQL, MySQL. SQLite is best known for being an integrated database. This means that we don't have to install an extra application or use a separate server to run the database. It moves fast but has limited functionality, so if we dont' need a ton of data storage space, we will want to use a SQLite databse. On the other hand, PostgreSQL and MySQL have database types that are great for enterprise solutions. If we need to scale fast, MySQL and PostgreSQL are the best bet. They'll provide long-term infrastructure, bolster security, and handle high performance activities.

In this lecture we'll see how Python and some SQL databases interact. Why should we care about connecting Python and a SQL database?

Perhaps, as machine learning engineers, we will need to build an automated ETL pipeline. Connecting Python to a SQL database will allow us to use Python for its automation capabilities. We'll also be able to communicate between different data sources. We won't have to switch between different programming languages, we'll be able to use our Python skills to manipulate data from a SQL database. We won't need a CSV file.

The important thing to remember is that Python can integrate with each database type. Python and SQL databases connect through custom Python libraries. You can import these libraries into your Python script.

The following is a code example on how to connect to a SQL database:

from dbmodule import connect #Create a connection object CONNECTION = CONNECT('databse name', 'username','password') #Create a cursor object CURSOR = CONNECTION.CURSOR() #Run queries CURSOR.EXECUTE('select * from mytable') RESULTS = CURSOR.FETCHALL() #Free resources CURSOR.CLOSE()

Example code to connect to a PostreSQL database and store data in a pandas dataframe:

In this case, we chose AWS Redshift. We will import the psycopg library. This library translates the Python code we write to speak to the PostgreSQL database (AWS Redshift).

Otherwise, AWS Redshift would not understand our Python code. But because of the psycopg library, youwe will now speak a language AWS Redshift can understand.

#Library for connecting to AWS Redshift import psycopg #Library for reading the config file, which is in JSON import json #Data manipulation library import pandas as pd

\ We imported JSON because creating a JSON config file is a secure way to store your database credentials. We don't want anyone else eyeing those! The json.load() function reads the JSON file so we can access our database credentials in the next step.

config_file = open(r"C:\Users\yourname\config.json") config = json.load(config_file)

\ Now we want to create a databse connection. We'll need to read and use the credentials from our config file:

con = psycopg2.connect(dbname= "db_name", host=config[hostname], port = config["port"],user=config["user_id"], password=config["password_key"]) cur = con.cursor()

Creating and connecting to a SQLite database using Python

As we already mentioned, SQLite is a Relation Database Management System that is lightweight and easy to set up. SQLite is serverless, which is its biggest advantage. It does not require a server to run a database, unlike other RDMS like MySQL or PostgreSQL. So we don’t need any installation setup.

SQLite databases are stored locally, with files stored in the disk. This makes accessing and managing the data in the database is remarkably fast.

Example code to create a database:

import sqlite3 connection = sqlite3.connect('shows.db') #creating a database with name: cursor = connection.cursor() #create a cursor object in order to create a table cursor.execute('''CREATE TABLE IF NOT EXISTS Shows (Title TEXT, Director TEXT, Year INT)''') #create a table with column names and data types connection.commit() #commit the changes in the database connection.close() #close the connection

After running the file, in your current project directory, one file is created called shows.db. This is the SQLite database file generated by Python.

Example code to connect to the databse:

from sqlalchemy import create_engine import pandas as pd # Create engine: engine engine = create_engine('sqlite:///databse_name.sqlite') # Save the table names to a list: table_names table_names = engine.table_names() # Print the table names to the shell print(table_names) # Open engine connection: con, and select specified columns and number of rows with engine.connect() as con: ab = con.execute("SELECT Title, Director FROM Shows") df = pd.DataFrame(ab.fetchmany(size=5)) df.columns = ab.keys() # Close connection con.close() # Print first rows of dataframe print(df.head())

Connecting to a DB2 Database

IBM Db2 is a family of data management products, including the Db2 relational database. The free plan provides 200 MB of data storage on the cloud. In order to practice creating a SQL database and writing SQL queries, this is a good place to start.

We can create our tables on the cloud or directly from our notebook using Python. In order to do it with python, we first need to connect to our cloud database using the credentials provided to us in the moment the database instance was created.

To connect to a DB2, it requires the following information:

  • controler name
  • database name
  • DNS host name or IP
  • Host port
  • Connection protocole
  • User id
  • Password

Example to create a database connection:

#Create database connection dsn = ( "Driver = { {IBM DB2 ODBC DRIVER}};" "Database = {0};" "HOSTNAME = {1};" "PORT = {2};" "PROTOCOL = TCPIP;" "UID = {3};" "PWD = {4};").format(dsn_database, dsn_hostname, dsn_port, dsn_uid, dsn_pwd) try: conn = ibm_db.connect(dsn, " ", " ") print("Connected!") except: print("Unable to connect to database") #Close the database connection ibm_db.close(conn) #Note: It is always important to close the connections to avoid non used connectors taking resources.

How to create a table from python

ibm_db.exec_inmediate() --> function of the ibm_db API

Parameters for the function:

  • connection
  • statement
  • options

Example: Creating a table called CARS in Python


stmt = ibm_db.exec_inmediate(conn, "CREATE TABLE Cars( serial_no VARCHAR(20) PRIMARY KEY NOT NULL, make varchar(20) NOT NULL, model VARCHAR(20) NOT NULL, car_class VARCHAR(20) NOT NULL)" )


stmt = ibm_db.exec_inmediate(conn, "INSERT INTO Cars( serial_no, make, model, car_class) VALUES('A2345453','Ford','Mustang','class3');")


stmt = ibm_db.exec_inmediate(conn, "SELECT * FROM Cars") ibm_db.fetch_both(stmt)

Using pandas to retrieve data from the tables


import pandas import ibm_db_dbi pconn = ibm_db_dbi.connection(conn) df = pandas.read_sql('SELECT * FROM Cars', pconn) df #Example of a plot import matplotlib.pyplot as plt %matplotlib inline import seaborn as sns #categorical scatterplot plot = sns.swarmplot(x="Category", y="Calcium", data=df) plt.setp(plot.get_xticklabels(), rotation=70) plt.title('Calcium content') plt.show() #Making a boxplot #A boxplot is a graph that indicates the distribution of 1 or more variables. The box captures the median 50% of the data. # The line and dots indicate possible outliers and not normal values. plot = sns.set_style('Whitegrid') ax = sns.boxplot(x=df['glucose level']) plt.show()

Getting the properties

DB2 ---> syscat.tables

SQL Server ---> information=schema.tables

Oracle ---> all_tables or user_tables


#Getting table properties from DB2 SELECT * FROM syscat.tables #(this will show too many tables) SELECT tabschema, tabname, create_time FROM syscat.tables WHERE tabschema = 'ABC12345' #---> replace with your own DB2 username #Getting a list of columns in database SELECT * FROM syscat.columns WHERE tabname = 'Cats' #To obtain specific column properties: %sql SELECT DISTINCT(name), coltype, length FROM sysibm.syscolumns WHERE tbname = 'Cats' %sql SELECT DISTINCT(name), coltype, length FROM sysibm.syscolumns WHERE tbname = 'Miami_crime_data'



All rights reserved, 4Geeks LLC 2021. Read more about 4Geeks and what we are going here.