PB & J, Cookies n’ Cream — Pandas and SQLite3

Olushola Olojo
Published in DataDrivenInvestor
6 min read · Feb 8, 2021


Photo by Jake Ingle on Unsplash

As a data scientist, you’d think I’d jump out of bed every day eager to process data or write intricate technical content about Algorithms, Machine Learning, or Data Migration.

The cliché:

Five Steps to Building a Linear Regression Model

Nonetheless, there are some extremely insightful pieces on the Towards Data Science publication backed by phenomenal writers, which I often reference when tackling my data projects.

But for the life of me, I do not find technical writing particularly glamorous.

I’d rather get tangled in a piece about Social Justice, geared towards dismantling society’s perceptions of race, gender, and social class — or better yet, a piece about who’d win in a fight between Goku and Superman!

Before you jump to Superman’s corner, remember that Goku can also fly.

I can’t help but feel that I avoid technical writing because it terrifies me. Perhaps it might unravel deep-seated doubts about whether I’m cut out for this world of Data Science.

Argh, I doubt it would prove useful anyway.

Nonsense. Communication on all fronts is paramount to becoming an indispensable Data Scientist.

Technical writing can help you develop the prowess to distill stubborn code into digestible chunks for less data-savvy folks and to solve real-world business challenges.

With that being said, I’ve decided to actually take my own advice.

“Do it Scared.”

“Don’t let your action betray your words”

— Inky Johnson

Big Data

Before we get it cracking, let’s explore what is meant by “Big Data”. Plugging this loosely defined term into Google generates a whole host of varying definitions.

Big data is a field that treats ways to analyze, systematically extract information from, or otherwise, deal with data sets that are too large or complex to be dealt with by traditional data-processing application software.

— Wikipedia

Nowadays, leveraging data for insights seems to be the top priority for companies seeking to gain a competitive edge.

Data Scientists are the individuals tasked with making sense of all this noise, generating actionable insights from complex datasets.

Yay, that’s me!

The Assignment

Whittled down from a candidate pool of 300+ applicants, I had made it to the final stage of a Senior Data Analyst role at a Media & Entertainment firm.

Now there were four!

We were presented with a highly convoluted dataset and asked to devise a strategy for scaling this data beyond its existing Excel form. In addition to being scalable, the solution should demonstrate ways in which the data could be easily accessed, manipulated, and, if possible, exported into a BI tool.

Initially rattled, I began searching for ways to deliver a solution that was not only comprehensive but also specifically tailored to the firm’s industry.

Great, this brings us up to speed with the premise of this piece.

My answer was Python’s Pandas library used in conjunction with an SQLite3 database, offering something that is both scalable and highly versatile.

The Dynamic Duo

A couple of quick definitions for Pandas and SQLite3.

Pandas is a highly popular Python library that lets you load data into DataFrames, which can then be easily manipulated. It also plays well with multiple visualization libraries, allowing interactive exploration of big data.

SQLite3 is a module that has been included in Python’s standard library since Python 2.5. It adds a lightweight relational database element, enabling you to write queries in conventional SQL syntax and fetch data with relative ease.

Married together, they offer a dynamic duo not seen since Frodo and Sam halted Dark Lord Sauron’s attempts to take over the world.

Step 1: Import Libraries

Pretty straightforward. Opening Jupyter Notebook, I began by importing all the relevant libraries for this assignment. Duh, Pandas & SQLite3 are obvious choices.

However, “os” offers a portable way to navigate through folders in Jupyter, whilst “Matplotlib” provides an opportunity to carry out some basic data visualizations on the fly.
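The original import cell didn’t survive, but it would have looked something like this minimal sketch:

```python
import os                        # portable way to navigate folders from Jupyter
import sqlite3                   # lightweight relational database, bundled with Python
import pandas as pd              # DataFrames for loading and manipulating the data
import matplotlib.pyplot as plt  # basic visualizations on the fly
```

Nothing exotic here — everything except Pandas and Matplotlib ships with Python itself.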

Step 2: Load Data

Prior to loading, I had performed some rudimentary data cleaning in Excel to achieve a more streamlined look. The data was then converted into CSV format for quicker processing in Jupyter.

Data cleaning is 70% of the battle, and you don’t want to know what this dataset looked like beforehand.

However, such is the power of Pandas that a single line of code is enough to load a large dataset and save it into a DataFrame.
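A sketch of that one-liner. The file and column names below are hypothetical stand-ins for the assignment’s dataset, and the demo writes a tiny CSV first so the snippet is self-contained:

```python
import pandas as pd

# Self-contained demo: write a tiny stand-in for the cleaned dataset.
# The file and column names are hypothetical — substitute your own CSV.
with open("social_media_data.csv", "w") as f:
    f.write("platform,post,comments\n"
            "LinkedIn,Product launch recap,812\n"
            "Twitter,Weekly poll results,143\n")

# The one-liner: load the entire CSV into a DataFrame
df = pd.read_csv("social_media_data.csv")
print(df.head())  # quick sanity check on the first five rows
```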

How dope is that?!

Step 3: Create SQLite3 Connection

Establishing an SQL connection was next on the agenda. The first line of code creates a connection object in the default directory to store the data.

The second line converts the DataFrame into an SQLite3 database.
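Those two lines would have looked roughly like this — the database file and table names are hypothetical, and a small stand-in DataFrame replaces the one loaded from the CSV:

```python
import sqlite3
import pandas as pd

# Stand-in DataFrame (in the real assignment, df came from pd.read_csv)
df = pd.DataFrame({"platform": ["LinkedIn", "Twitter"],
                   "comments": [812, 143]})

# First line: create a connection object — the .db file lands in the
# current working directory
conn = sqlite3.connect("social_media.db")

# Second line: write the DataFrame out as an SQLite3 table
df.to_sql("posts", conn, if_exists="replace", index=False)
```

The `if_exists="replace"` argument means re-running the notebook simply overwrites the table rather than erroring out.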

A bit of a disclaimer here: SQLite3 is a serverless database engine and shouldn’t be considered a replacement for a more robust client-server database like MySQL or PostgreSQL — at least from a Data Warehousing standpoint.

Step 4: Read SQL With Pandas

Using the “read_sql” function, we can now read the data and run queries against the database.

This is the simplest SQL query we can punch in: return all the columns from the table and confirm the code is running successfully.
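Something along these lines — here an in-memory database and a one-row table stand in for the file created earlier:

```python
import sqlite3
import pandas as pd

# In-memory database as a stand-in for the .db file created earlier
conn = sqlite3.connect(":memory:")
pd.DataFrame({"platform": ["LinkedIn"], "comments": [812]}).to_sql(
    "posts", conn, index=False
)

# The simplest possible query: return every column from the table
df = pd.read_sql("SELECT * FROM posts", conn)
print(df)
```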

Step 5: Write Some More Queries

Now you can really play about with it and write some more complex queries. Here, I am aiming to return all the columns on “LinkedIn” posts that garnered more than “500 comments”. I’d then like to display the “Top 15” results, ordering these from biggest to smallest.
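A sketch of that query, again against a small in-memory stand-in for the real table (table and column names are hypothetical):

```python
import sqlite3
import pandas as pd

# Small in-memory stand-in for the real table
conn = sqlite3.connect(":memory:")
pd.DataFrame({
    "platform": ["LinkedIn", "LinkedIn", "Twitter", "LinkedIn"],
    "comments": [812, 431, 950, 677],
}).to_sql("posts", conn, index=False)

# LinkedIn posts with more than 500 comments, biggest first, top 15
query = """
    SELECT *
    FROM posts
    WHERE platform = 'LinkedIn'
      AND comments > 500
    ORDER BY comments DESC
    LIMIT 15;
"""
top_posts = pd.read_sql(query, conn)
print(top_posts)
```

Because `read_sql` hands back a regular DataFrame, the query result drops straight into the rest of the Pandas workflow.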

Voila!

Step 6: Convert Back to CSV

You’ve got to love the one-liners.

This code saves the DataFrame back into a CSV format. Now it’s ready to be exported and kicked into a BI Platform like Tableau or Power BI.
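The one-liner in question, with a stand-in DataFrame and a hypothetical output file name:

```python
import pandas as pd

# Stand-in for the query results from the previous step
df = pd.DataFrame({"platform": ["LinkedIn"], "comments": [812]})

# The one-liner: dump the DataFrame back out as a CSV,
# ready to be kicked into Tableau or Power BI
df.to_csv("query_results.csv", index=False)
```

`index=False` keeps Pandas’ row index out of the file, so the CSV contains only the original columns.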

Step 7: Visualization

As a self-proclaimed data visualization geek, I knew a simple bar chart using Matplotlib wouldn’t suffice here. Using the exported dataset, I cranked this task up a notch by employing Tableau and Figma (UX/UI Design) to create a compelling viz on Social Media Analytics.

Here we go.

Interactive link to the full version: https://tabsoft.co/3rsgGBk

Summary

Sex is cool and all, but have you ever closed all your search tabs after finishing a project?

Jokes aside, we’ve successfully completed the assignment and explored one of Python’s many possibilities in a data science capacity by combining Pandas and SQLite3.

Please feel free to connect with me on LinkedIn or Twitter and share some of your own interesting discoveries working with unlikely pairings in the Python ecosystem.

“Don’t ask for advice, ask for feedback”

— Nana Nyantekyi, Mentor

Ayyy, not bad for a first crack at technical writing — and not as scary as I anticipated. Again, any constructive feedback would be greatly appreciated.

On the job front, I’ve been shortlisted for the final two. I’ll keep you guys posted on how that pans out.

In closing, I’ll leave you with one of my favourite quotes about fear and why we cannot allow it to stifle our progress.

“Fear is the relinquishment of logic, the willing relinquishing of reasonable patterns. We yield to it or we fight it, but we cannot meet it halfway.”

— Steve Crain
