← Back to Lessons
Edit on Github

Connecting to SQL Databases

As machine learning engineers we will probably have to interact with SQL databases in order to access data. SQL means Structured Query Language. The key difference between SQL and Python is that developers use SQL to access and extract data from a database, whereas developers use Python to analyze and manipulate data by running regression tests, time series tests and other data processing computations.

Some popular SQL databases are SQLite, PostgreSQL, MySQL. SQLite is best known for being an integrated database. This means that we don't have to install an extra application or use a separate server to run the database. It moves fast but has limited functionality, so if we dont' need a ton of data storage space, we will want to use a SQLite databse. On the other hand, PostgreSQL and MySQL have database types that are great for enterprise solutions. If we need to scale fast, MySQL and PostgreSQL are the best bet. They'll provide long-term infrastructure, bolster security, and handle high performance activities.

In this lecture we'll see how Python and some SQL databases interact. Why should we care about connecting Python and a SQL database?

Perhaps, as machine learning engineers, we will need to build an automated ETL pipeline. Connecting Python to a SQL database will allow us to use Python for its automation capabilities. We'll also be able to communicate between different data sources. We won't have to switch between different programming languages, we'll be able to use our Python skills to manipulate data from a SQL database. We won't need a CSV file.

The important thing to remember is that Python can integrate with each database type. Python and SQL databases connect through custom Python libraries. You can import these libraries into your Python script.

The following is a code example on how to connect to a SQL database:

1from dbmodule import connect 2 3#Create a connection object 4 5CONNECTION = CONNECT('databse name', 'username','password') 6 7#Create a cursor object 8 9CURSOR = CONNECTION.CURSOR() 10 11#Run queries 12 13CURSOR.EXECUTE('select * from mytable') 14RESULTS = CURSOR.FETCHALL() 15 16#Free resources 17CURSOR.CLOSE()

Example code to connect to a PostreSQL database and store data in a pandas dataframe:

In this case, we chose AWS Redshift. We will import the psycopg library. This library translates the Python code we write to speak to the PostgreSQL database (AWS Redshift).

Otherwise, AWS Redshift would not understand our Python code. But because of the psycopg library, youwe will now speak a language AWS Redshift can understand.

1#Library for connecting to AWS Redshift 2import psycopg 3 4#Library for reading the config file, which is in JSON 5import json 6 7#Data manipulation library 8import pandas as pd

We imported JSON because creating a JSON config file is a secure way to store your database credentials. We don't want anyone else eyeing those! The json.load() function reads the JSON file so we can access our database credentials in the next step.

1config_file = open(r"C:\Users\yourname\config.json") 2config = json.load(config_file)

Now we want to create a databse connection. We'll need to read and use the credentials from our config file:

1con = psycopg2.connect(dbname= "db_name", host=config[hostname], port = config["port"],user=config["user_id"], password=config["password_key"]) 2cur = con.cursor()

Creating and connecting to a SQLite database using Python

As we already mentioned, SQLite is a Relation Database Management System that is lightweight and easy to set up. SQLite is serverless, which is its biggest advantage. It does not require a server to run a database, unlike other RDMS like MySQL or PostgreSQL. So we don’t need any installation setup.

SQLite databases are stored locally, with files stored in the disk. This makes accessing and managing the data in the database is remarkably fast.

Example code to create a database:

1import sqlite3 2 3connection = sqlite3.connect('shows.db') #creating a database with name: 4cursor = connection.cursor() #create a cursor object in order to create a table 5cursor.execute('''CREATE TABLE IF NOT EXISTS Shows 6 (Title TEXT, Director TEXT, Year INT)''') #create a table with column names and data types 7 8connection.commit() #commit the changes in the database 9connection.close() #close the connection

After running the file, in your current project directory, one file is created called shows.db. This is the SQLite database file generated by Python.

Example code to connect to the databse:

1 2from sqlalchemy import create_engine 3import pandas as pd 4 5# Create engine: engine 6engine = create_engine('sqlite:///databse_name.sqlite') 7 8# Save the table names to a list: table_names 9table_names = engine.table_names() 10 11# Print the table names to the shell 12print(table_names) 13 14# Open engine connection: con, and select specified columns and number of rows 15 16with engine.connect() as con: 17 ab = con.execute("SELECT Title, Director FROM Shows") 18 df = pd.DataFrame(ab.fetchmany(size=5)) 19 df.columns = ab.keys() 20 21# Close connection 22con.close() 23 24# Print first rows of dataframe 25print(df.head())

Connecting to a DB2 Database

IBM Db2 is a family of data management products, including the Db2 relational database. The free plan provides 200 MB of data storage on the cloud. In order to practice creating a SQL database and writing SQL queries, this is a good place to start.

We can create our tables on the cloud or directly from our notebook using Python. In order to do it with python, we first need to connect to our cloud database using the credentials provided to us in the moment the database instance was created.

To connect to a DB2, it requires the following information:

  • controler name
  • database name
  • DNS host name or IP
  • Host port
  • Connection protocole
  • User id
  • Password

Example to create a database connection:

1#Create database connection 2 3dsn = ( 4 "Driver = { {IBM DB2 ODBC DRIVER}};" 5 "Database = {0};" 6 "HOSTNAME = {1};" 7 "PORT = {2};" 8 "PROTOCOL = TCPIP;" 9 "UID = {3};" 10 "PWD = {4};").format(dsn_database, dsn_hostname, dsn_port, dsn_uid, dsn_pwd) 11 12try: 13 conn = ibm_db.connect(dsn, " ", " ") 14 print("Connected!") 15 16except: 17 print("Unable to connect to database") 18 19#Close the database connection 20 21ibm_db.close(conn) 22 23#Note: It is always important to close the connections to avoid non used connectors taking resources. 24

How to create a table from python

ibm_db.exec_inmediate() --> function of the ibm_db API

Parameters for the function:

  • connection
  • statement
  • options

Example: Creating a table called CARS in Python


1stmt = ibm_db.exec_inmediate(conn, "CREATE TABLE Cars( 2 serial_no VARCHAR(20) PRIMARY KEY NOT NULL, 3 make varchar(20) NOT NULL, 4 model VARCHAR(20) NOT NULL, 5 car_class VARCHAR(20) NOT NULL)" 6 )


1stmt = ibm_db.exec_inmediate(conn, "INSERT INTO Cars( 2 serial_no, make, model, car_class) 3 VALUES('A2345453','Ford','Mustang','class3');")


1stmt = ibm_db.exec_inmediate(conn, "SELECT * FROM Cars") 2 3ibm_db.fetch_both(stmt)

Using pandas to retrieve data from the tables


1 2import pandas 3import ibm_db_dbi 4pconn = ibm_db_dbi.connection(conn) 5 6df = pandas.read_sql('SELECT * FROM Cars', pconn) 7df 8 9#Example of a plot 10 11import matplotlib.pyplot as plt 12%matplotlib inline 13import seaborn as sns 14 15#categorical scatterplot 16 17plot = sns.swarmplot(x="Category", y="Calcium", data=df) 18plt.setp(plot.get_xticklabels(), rotation=70) 19plt.title('Calcium content') 20plt.show() 21 22#Making a boxplot 23#A boxplot is a graph that indicates the distribution of 1 or more variables. The box captures the median 50% of the data. 24# The line and dots indicate possible outliers and not normal values. 25 26plot = sns.set_style('Whitegrid') 27ax = sns.boxplot(x=df['glucose level']) 28plt.show()

Getting the properties

DB2 ---> syscat.tables

SQL Server ---> information=schema.tables

Oracle ---> all_tables or user_tables


1 2#Getting table properties from DB2 3 4SELECT * FROM syscat.tables 5#(this will show too many tables) 6 7SELECT tabschema, tabname, create_time 8FROM syscat.tables 9WHERE tabschema = 'ABC12345' #---> replace with your own DB2 username 10 11#Getting a list of columns in database 12 13SELECT * FROM syscat.columns 14WHERE tabname = 'Cats' 15 16#To obtain specific column properties: 17 18%sql SELECT DISTINCT(name), coltype, length 19 FROM sysibm.syscolumns 20 WHERE tbname = 'Cats' 21 22%sql SELECT DISTINCT(name), coltype, length 23 FROM sysibm.syscolumns 24 WHERE tbname = 'Miami_crime_data' 25