As machine learning engineers we will probably have to interact with SQL databases in order to access data. SQL means Structured Query Language. The key difference between SQL and Python is that developers use SQL to access and extract data from a database, whereas developers use Python to analyze and manipulate data by running regression tests, time series tests and other data processing computations.
Some popular SQL databases are SQLite, PostgreSQL, MySQL. SQLite is best known for being an integrated database. This means that we don't have to install an extra application or use a separate server to run the database. It moves fast but has limited functionality, so if we dont' need a ton of data storage space, we will want to use a SQLite databse. On the other hand, PostgreSQL and MySQL have database types that are great for enterprise solutions. If we need to scale fast, MySQL and PostgreSQL are the best bet. They'll provide long-term infrastructure, bolster security, and handle high performance activities.
In this lecture we'll see how Python and some SQL databases interact. Why should we care about connecting Python and a SQL database?
Perhaps, as machine learning engineers, we will need to build an automated ETL pipeline. Connecting Python to a SQL database will allow us to use Python for its automation capabilities. We'll also be able to communicate between different data sources. We won't have to switch between different programming languages, we'll be able to use our Python skills to manipulate data from a SQL database. We won't need a CSV file.
The important thing to remember is that Python can integrate with each database type. Python and SQL databases connect through custom Python libraries. You can import these libraries into your Python script.
The following is a code example on how to connect to a SQL database:
Example code to connect to a PostreSQL database and store data in a pandas dataframe:
In this case, we chose AWS Redshift. We will import the psycopg library. This library translates the Python code we write to speak to the PostgreSQL database (AWS Redshift).
Otherwise, AWS Redshift would not understand our Python code. But because of the psycopg library, youwe will now speak a language AWS Redshift can understand.
We imported JSON because creating a JSON config file is a secure way to store your database credentials. We don't want anyone else eyeing those! The json.load() function reads the JSON file so we can access our database credentials in the next step.
Now we want to create a databse connection. We'll need to read and use the credentials from our config file:
As we already mentioned, SQLite is a Relation Database Management System that is lightweight and easy to set up. SQLite is serverless, which is its biggest advantage. It does not require a server to run a database, unlike other RDMS like MySQL or PostgreSQL. So we don’t need any installation setup.
SQLite databases are stored locally, with files stored in the disk. This makes accessing and managing the data in the database is remarkably fast.
Example code to create a database:
After running the file, in your current project directory, one file is created called shows.db. This is the SQLite database file generated by Python.
Example code to connect to the databse:
IBM Db2 is a family of data management products, including the Db2 relational database. The free plan provides 200 MB of data storage on the cloud. In order to practice creating a SQL database and writing SQL queries, this is a good place to start.
We can create our tables on the cloud or directly from our notebook using Python. In order to do it with python, we first need to connect to our cloud database using the credentials provided to us in the moment the database instance was created.
To connect to a DB2, it requires the following information:
Example to create a database connection:
How to create a table from python
ibm_db.exec_inmediate() --> function of the ibm_db API
Parameters for the function:
Example: Creating a table called CARS in Python
#LOAD DATA IN TABLE
#FETCH DATA FROM CARS TABLE
Using pandas to retrieve data from the tables
Getting the properties
DB2 ---> syscat.tables
SQL Server ---> information=schema.tables
Oracle ---> all_tables or user_tables