Starting Your Data Science Portfolio With Three Simple Examples

By Mr. Data Science


This article is for those who are just getting started in Data Science and want to build their skills and begin to establish a data science portfolio. The three projects we will discuss introduce critical skills that every data scientist needs to have:

Pre-Processing Data: Real-world datasets are almost always imperfect: values are missing, outliers creep in, and the data is rarely neatly structured. As a data scientist, you need to know how to manipulate data into a format that is useful.

Exploratory Data Analysis: Once your data is ready to be used, you will have questions about what it contains. Any good data scientist is curious. Exploratory data analysis involves the transformation of data into useful information.

Predictive Models: The final project we will discuss involves using data to make predictions. This fundamental problem is at the root of many requests in our field. If you master it, you will never be jobless.

As you work through these problems, I want you to make them your own. Find a dataset online and follow along, tailoring our approach to a specific problem you are interested in. To get you started, here are ten sources of free data:

Note: different organizations use different licenses; always check the license before using the data, especially if you intend to monetize your analysis.

Once you identify your dataset, you will need to install a few Python libraries; this article uses pandas, seaborn, scikit-learn, and NumPy.
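If you are unsure which packages to install, the imports below cover everything used in this article:

import pandas as pd                                    # dataframes and CSV loading
import numpy as np                                     # used later for the RMSE calculation
import seaborn as sns                                  # correlation heatmap and plotting
from sklearn.model_selection import train_test_split  # train/test split
from sklearn.linear_model import LinearRegression     # the predictive model
from sklearn import metrics                            # error metrics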

Example 1: Pre-Processing Data

The first dataset we will use is available on Kaggle. It contains information about the 500 richest people in 2021. The first step is to download the data and then import it into pandas. This process creates a data object called a dataframe; think of it as an Excel spreadsheet with rows and columns. The original dataset is a .csv file.

import pandas as pd

To load the data, pass the file path to the pandas read_csv function as shown below:

df_1 = pd.read_csv('projects/500 richest people 2021.csv')

Once it is loaded, let’s take a look at the data. The “head” function returns the first N rows of your dataset; in our example, N is 3:

df_1.head(3)

The output appears to have squashed everything into a single column. The issue is that the default delimiter for a CSV file is a comma, but our dataset uses a semicolon (“;”) instead. We can fix this by re-importing and specifying the delimiter:

df_1 = pd.read_csv('projects/500 richest people 2021.csv', sep=';')
df_1.head(3)

That is much better. Note the columns on the right; every value in them seems to be NaN. NaN stands for “Not a Number” and is used to represent values that are undefined or unrepresentable. In our case, it indicates missing data. We can count the missing values in each column with one line of code:

df_1.isna().sum()

Rank                   4
Name                   4
Total Net Worth        4
$ Last Change          4
$ YTD Change           4
Country                4
Industry               4
Unnamed: 7           503
Unnamed: 8           503
Unnamed: 9           503
Unnamed: 10          503
dtype: int64

To get the total number of rows and columns in the dataframe, use:

df_1.shape

(503, 11)

So there are 503 rows in df_1; since the four columns on the right each have 503 missing values, they are completely empty, and we can drop them:

columns = ['Unnamed: 7', 'Unnamed: 8', 'Unnamed: 9', 'Unnamed: 10']

df_1 = df_1.drop(columns, axis=1)
df_1.head(2)

Now we can investigate the other rows; we’ll create a dataframe that is a subset of df_1. Each row in this new dataframe will have at least one missing value:

df_nan = df_1[df_1.isna().any(axis=1)]
df_nan.head()

There are four blank rows; in this case, the best thing to do is drop the empty rows:

df_1 = df_1.dropna()

The pandas documentation on missing data has some useful information I encourage you to take a look at.

If we run df_1.shape again, we should now have 499 rows and 7 columns:

df_1.shape

(499, 7)

Sometimes, dropping the missing values is undesirable. It is possible to fill the gaps with substitute values instead; these could be the mean or median of the column for numeric data, for example, or a placeholder like ‘n/a’ for text columns.
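As a minimal sketch of those options, assuming a dataframe df with a hypothetical numeric column ‘age’ and a hypothetical text column ‘city’, the filling could look like this:

# Replace missing numeric values with the column median (use .mean() for the mean instead)
df['age'] = df['age'].fillna(df['age'].median())

# Replace missing text values with a placeholder string
df['city'] = df['city'].fillna('n/a')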

The next dataset we will look at is available on the London government data site.

df_2 = pd.read_csv('projects/tfl-journeys-type.csv')
df_2.head(2)
df_2.isna().sum()

Period and Financial year         0
Reporting Period                  0
Days in period                    0
Period beginning                  0
Period ending                     0
Bus journeys (m)                  0
Underground journeys (m)          0
DLR Journeys (m)                  0
Tram Journeys (m)                 0
Overground Journeys (m)           7
Emirates Airline Journeys (m)    29
TfL Rail Journeys (m)            66
dtype: int64
df_2.shape

(143, 12)

In this dataset, about 5% of the Overground Journeys (m) column values are missing. To fill these we’ll use the column mean:

df_2['Overground Journeys (m)'] = df_2['Overground Journeys (m)'].fillna(df_2['Overground Journeys (m)'].mean())

Then check the column again:

df_2.isna().sum()

Period and Financial year         0
Reporting Period                  0
Days in period                    0
Period beginning                  0
Period ending                     0
Bus journeys (m)                  0
Underground journeys (m)          0
DLR Journeys (m)                  0
Tram Journeys (m)                 0
Overground Journeys (m)           0
Emirates Airline Journeys (m)    29
TfL Rail Journeys (m)            66
dtype: int64

As expected, the missing values have been replaced. There are no absolute rules for when to use dropna and when to use fillna. You will need to experiment and find the solution that delivers the best results.

The next example uses another dataset, also available on Kaggle.

Another common type of data problem you might encounter is duplicate rows. Sometimes it is easy to find the duplicates because the rows are identical; other times, it takes a bit more detective work.

duplicates_df = pd.read_csv('projects/Shootings_Dataset.csv', encoding="ISO-8859-1", parse_dates=['Date'])
duplicates_df.head(2)

The dataset presents each row as a separate gun crime. We can test this by first sorting on different columns:

duplicates_df.sort_values(by=['Date','S#','Latitude','Longitude']).head(15)

Rows 385 and 386 (by row index, the leftmost value) are not separate incidents; both rows refer to the same incident. However, some of the details, including Latitude and Longitude, are slightly different. Perhaps the person who created the dataset scraped several news sources and accidentally picked up duplicate stories; although the rows are duplicates, different news outlets may report slightly different details of an incident. This type of duplicate is more challenging to detect, but with any project, you need to set aside some time early on to investigate and become familiar with your data.
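One way to do that detective work is sketched below: exact duplicates can be flagged with pandas’ duplicated method, and near-duplicates can be approximated by rounding the coordinates before comparing (the choice of two decimal places is an assumption you would tune for your own data):

# Exact duplicates: rows where every column matches another row
exact_dupes = duplicates_df[duplicates_df.duplicated(keep=False)]

# Near-duplicates: same date and roughly the same location
rounded = duplicates_df.copy()
rounded['Latitude'] = rounded['Latitude'].round(2)
rounded['Longitude'] = rounded['Longitude'].round(2)
near_dupes = rounded[rounded.duplicated(subset=['Date', 'Latitude', 'Longitude'], keep=False)]
near_dupes.sort_values(by='Date').head()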

That leads us into the next project…

Example 2: Exploratory Data Analysis

The weather data in this example comes from the British Meteorological Office; the dataset has monthly totals for sunshine, rainfall, maximum temperature, and so on. The second dataset is daily bicycle rentals in London. The challenge this time is reshaping the data: the weather dataset has monthly averages and totals, but the bicycle dataset has daily rental totals, so the daily totals need to be converted to monthly totals before combining the two datasets. Because the weather dataset has only year and month while the bicycle rental dataset has the full date, we also need to extract the year and month from the full date and create two new columns in the bicycle rental dataframe, one for year and one for month. The two datasets can then be merged on these columns.

weather = pd.read_csv('projects/weather.csv')
weather.head(2)
hires = pd.read_csv('projects/bicycle_hires.csv')
hires.head(2)

To extract the year and month from the date, we need to make sure the date column is of datetime type.

hires['date'] = pd.to_datetime(hires['date'])
hires['year'] = pd.DatetimeIndex(hires['date']).year
hires['month'] = pd.DatetimeIndex(hires['date']).month
hires.head(1)

Group by year and month and sum the daily totals:

df1 = hires.groupby(['year','month'], as_index=False).agg({'total': 'sum'})
df1.head()

Merge the two datasets on year and month:

merged = weather.merge(df1, on=['year', 'month'])
merged.head()

We can now do some analysis. For example, we can plot a correlation heatmap:

import seaborn as sns
corr = merged.corr()

# plot the heatmap
sns.heatmap(corr, xticklabels=corr.columns, yticklabels=corr.columns);

This shows a strong negative correlation between the number of days with an air frost (af) and the total bicycle hires. In other words, people don’t like to hire bicycles on frosty days. There is also a strong positive correlation between the maximum temperature and the total hires, so more people hire bikes on warm or hot days. The sun column is the hours of sunshine per month; this is also positively correlated with the total number of hires. To sum up, the total number of bicycle hires per month seems to be strongly correlated with temperature, sunshine, and rainfall but less correlated with the month.
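If you want the exact numbers behind the heatmap, you can pull out the correlations with the total column directly (a quick sketch using the corr dataframe computed above):

# Correlation of every column with the monthly hire total, strongest first
corr['total'].drop('total').sort_values(ascending=False)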

These findings illustrate the potential of combining datasets to derive information from raw data. It does require more work, but it may be worth the effort in some cases. The ability to combine datasets is also a more impressive addition to a portfolio. If you had data on road accidents involving bicycles in London, for example, you could combine that data and investigate any correlation between weather conditions, total bicycle rentals, and accidents.

You should know that data visualization is an essential element in turning raw data into information; check out the Seaborn gallery for examples and inspiration.
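For instance, a single scatter plot of maximum temperature against the monthly total makes the relationship from the heatmap much easier to see (a minimal sketch using the merged dataframe from above):

# Scatter plot of maximum temperature against total monthly hires
sns.scatterplot(data=merged, x='tmax', y='total');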

In the next example, we will use the same dataset to build a predictive model.

Example 3: Predictive Models

The last skill we will discuss in this article is the ability to turn data into predictions. Let’s assume we have access to long-range weather forecasts for the next month. Could we use that data to predict the number of bicycle hires for the month?

The first decision is which algorithm to use. We have a very small dataset, so a deep learning approach is unlikely to work well, since such models require large datasets for reasonable accuracy. We could use linear regression to tackle this problem. Linear regression is a simple predictive model, but it can be used in very sophisticated ways; many examples can be found in the fields of finance [1] and natural language processing [2]. Let’s start by loading some libraries from scikit-learn.

from sklearn.model_selection import train_test_split 
from sklearn.linear_model import LinearRegression
from sklearn import metrics

First we need to separate the data into the target variable ‘y’ and the feature variables ‘X’:

merged.columns

Index(['year', 'month', 'tmax', 'tmin', 'af', 'rain', 'sun', 'total'], dtype='object')

X = merged.drop('total', axis=1)
y = merged['total']

Then use scikit-learn to split the data into training and testing sets.

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

Next we need to train the model:

regressor = LinearRegression()  
regressor.fit(X_train, y_train)
LinearRegression()

Now we can test the predictions:

y_pred = regressor.predict(X_test)
df_pred = pd.DataFrame({'Actual': y_test, 'Predicted': y_pred})
df_pred

The predicted results are the correct order of magnitude, sometimes above and sometimes below the actual values. Using scikit-learn’s metrics module, we can compute some error values:

import numpy as np
print('Mean Absolute Error:', metrics.mean_absolute_error(y_test, y_pred))
print('Mean Squared Error:', metrics.mean_squared_error(y_test, y_pred))
print('Root Mean Squared Error:', np.sqrt(metrics.mean_squared_error(y_test, y_pred)))
Mean Absolute Error: 57280.97002866678
Mean Squared Error: 5087644123.157183
Root Mean Squared Error: 71327.72338408946
mean_total = merged['total'].mean()
print(mean_total)
869922.8055555555

So the Root Mean Squared Error is about 8% of the mean value of the ‘total’ column. That is a reasonable error given the small amount of data we had to work with. Using this model and weather forecasts, we could predict demand for the rental bicycles, for example.
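As a quick check of that 8% figure, you can compute the ratio directly (reusing y_test, y_pred, and mean_total from above):

# Express the RMSE as a fraction of the mean monthly total
rmse = np.sqrt(metrics.mean_squared_error(y_test, y_pred))
print(f'RMSE / mean: {rmse / mean_total:.1%}')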

A quick review of what you’ve learned: if you’ve made it this far, you should have a fairly good understanding of how to pre-process messy data (missing values, empty columns, and duplicates), how to combine and explore datasets to uncover correlations, and how to build and evaluate a simple linear regression model.

If you have any feedback or suggestions for improving this article, we would love to hear from you.

References:

Connect With Mr. Data Science:

MrDataScience.com, GitHub, Medium,

I’m just a nerdy engineer that has too much time on his hands and I’ve decided to help people around the world learn about data science!
