By Mr. Data Science

Photo by Luke Chesser on Unsplash

This article is for those who are just getting started in Data Science and want to build their skills and begin to establish a data science portfolio. The three projects we will discuss in this article introduce critical skills that every data scientist needs to have. Those critical skills include:

Pre-Processing Data: Real-world datasets are almost always imperfect. Values are often missing, outliers are present, and the data is rarely neatly structured. As a data scientist, you need to know how to manipulate data to get it into a useful format.

Exploratory Data Analysis: Once your data…
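As a rough illustration of the pre-processing skill described above, the sketch below fills a missing value and filters out an outlier with pandas. The column name `height_cm`, the sample values, and the 1.5 × IQR rule are assumptions chosen for illustration; they are not taken from the projects themselves.

```python
import pandas as pd

# Hypothetical toy data: one missing value and one implausible reading
df = pd.DataFrame({"height_cm": [170.0, 165.0, None, 180.0, 172.0, 999.0]})

# Fill the missing value with the median, which is less sensitive to outliers than the mean
median = df["height_cm"].median()
df["height_cm"] = df["height_cm"].fillna(median)

# Flag outliers with the common 1.5 * IQR rule (an assumed choice, not the only option)
q1, q3 = df["height_cm"].quantile([0.25, 0.75])
iqr = q3 - q1
df["is_outlier"] = (df["height_cm"] < q1 - 1.5 * iqr) | (df["height_cm"] > q3 + 1.5 * iqr)

# Keep only the rows that were not flagged
clean = df[~df["is_outlier"]].reset_index(drop=True)
print(clean)
```

Whether to drop, cap, or keep an outlier depends on the dataset; dropping is used here only because it is the simplest option to show.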


By Mr. Data Science

Photo by Brooke Denevan on Unsplash

Throughout this article, we will analyze some data on UFO sightings. Recent press releases from the Pentagon have sparked new interest in the topic of UFOs/UAPs, so it is a trendy and interesting way to introduce some data science and data analytics concepts. However, we need to be realistic about what we can discover from publicly available datasets on this topic. These datasets usually consist of eyewitness accounts; therefore, the data should be considered low quality from a scientific perspective. Science does not put as much faith in eyewitness accounts as the legal system does; reference…


By Mr. Data Science

Photo by AJ Colores on Unsplash

In law enforcement, different types of policing exist. There is active policing, such as crowd control and traffic control, and there is preventative policing, where the police make themselves visible to deter crime. Finally, there is reactive policing, where a crime occurs and the police respond, investigate, and apprehend criminals. In this article, I’ll demonstrate how Data Science can use pattern detection to predict where and when crimes might happen. This capability could enable reactive policing to become more proactive and preventative. Police forces have been plotting crimes on maps to look for crime patterns for…


By Mr. Data Science

Photo by Jefferson Santos on Unsplash

Throughout this article, we will explore some background terminology and data analytics for Cyber Security applications. I’ll demonstrate how you could use a classification algorithm to identify cyber attacks and show how you can detect anomalies within datasets, specifically using the isolation forest algorithm.
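To give a feel for the anomaly detection idea mentioned above, here is a minimal sketch using scikit-learn's `IsolationForest`. The synthetic two-feature data, the hand-placed anomalies, and the `contamination` value are all assumptions made for illustration; they do not come from the article's dataset.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(42)

# Hypothetical features: 200 "normal" observations clustered near the origin
normal = rng.normal(loc=0.0, scale=1.0, size=(200, 2))
# Two hand-placed anomalies far from the normal cluster
anomalies = np.array([[8.0, 8.0], [-9.0, 7.5]])
X = np.vstack([normal, anomalies])

# contamination is the assumed share of anomalies in the data
model = IsolationForest(n_estimators=100, contamination=0.02, random_state=0)
labels = model.fit_predict(X)  # -1 marks an anomaly, 1 marks a normal point

print("Flagged rows:", np.where(labels == -1)[0])
```

An isolation forest scores points by how few random splits are needed to isolate them, so points far from the bulk of the data tend to be flagged first. In practice the `contamination` value usually has to be estimated or tuned.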

Background on Cyber Security and Data Science:

Cyber security and data science are commonly used to combat hackers. This type of work even has a name: CSDS (cyber security data science).

CSDS is still relatively new and there are many opportunities for improvement. Some of the challenges include:

  • Non-stationary nature of cyber security threats. Hackers are constantly developing and…


By Mr. Data Science

Photo by Arseny Togulev on Unsplash

Until recently, many Natural Language Processing (NLP) tasks, including natural language generation (NLG), used Recurrent Neural Networks (RNNs) and long short-term memory (LSTM) networks. One of the biggest drawbacks of these models is the long training time they require. In recent years, an architecture called the “transformer” has emerged in the data science world, leading to great improvements in model accuracy and a significant reduction in model training times. In this article, we will briefly go over how to use a transformer for NLG using common Python libraries.

A brief introduction to transformers and GPT-2:

If…


By Mr. Data Science

Photo by NOAA on Unsplash

In this article we will look at plotting data that has a geospatial component. This process has different names, including geovisualization and cartographic visualization. Specifically, we will look at some examples of the different types of geospatial visualizations, including point maps and choropleths. There are many different Python libraries that can help with geovisualization, including:

  • geopandas
  • plotly
  • ipyleaflet
  • folium

We will primarily use geopandas in this tutorial.

Background on Geospatial Data Plotting:

Geospatial data plotting has been in use since at least the nineteenth century. Dr. John Snow used point maps to show a link between the spread of cholera and…


By Mr. Data Science

Photo by Hush Naidoo on Unsplash

Data Science has had a huge impact on the field of medical science. Some of the areas where it is making a difference include:

  • Medical image analysis
  • Genetics and Genomics research
  • Creating new drugs/Drug Discovery with Data Science
  • Predictive Analytics in Healthcare
  • Data Analysis of healthcare data

Topics like the discovery of new drugs are a little beyond the scope of this article but we can still take a look at some examples of predictive and exploratory data analysis of healthcare data. In example 1 we’ll look at data on cancer and how we could approach…


By Mr. Data Science

Photo by Greg Rakozy on Unsplash

An excellent way to learn data science is to do data science: get some data and start analyzing it. The techniques used in this article can be applied to any data, and some of the issues we will encounter are typical of the challenges real-world data analysis throws up.

This article will investigate some data on asteroids to find out whether there is a threat of collision. Example 3 will use machine learning to classify asteroids as potential threats. …


Photo by Jason Rosewell on Unsplash

Natural language generation (NLG) is the process of creating text using software. In general, it can be divided into a few subgroups[1]:

  • text-to-text generation, such as machine translation
  • text summarization
  • open-domain conversation response generation
  • data-to-text generation

NLG is growing in popularity because it has so many applications in areas such as journalism, business, and law. With NLG, you can complete tasks such as writing product descriptions, engaging with users, and writing investigative reports. NLG is frequently used to generate social media posts, such as on Twitter, and to retroactively caption images throughout the web.

A brief background on natural language generation

One important concept in natural language generation is…


By Mr. Data Science

Photo by Atul Pandey on Unsplash

A Brief Overview:

Throughout this article, we will explore migration data to gain a better understanding of migration drivers. Since migration remains a contentious political issue, we will refrain from giving opinions and focus on the data instead. To investigate migration drivers, we will use a couple of datasets (all of them CSV files):

  • The country data set was downloaded from Kaggle
  • The happiness reports (5 files) were also downloaded from Kaggle

The goals for this article are to:

  1. demonstrate some useful data science techniques such as combining datasets, generating correlation heat maps, and applying k-means to a dataset
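The techniques named in this goal can be sketched on toy data as follows. The two small data frames below are hypothetical stand-ins for the Kaggle CSV files; the column names, values, and the choice of two clusters are assumptions made only for illustration.

```python
import pandas as pd
from sklearn.cluster import KMeans

# Hypothetical stand-ins for the country and happiness CSV files
countries = pd.DataFrame({
    "country": ["A", "B", "C", "D", "E", "F"],
    "gdp_per_capita": [1.0, 1.2, 0.9, 9.8, 10.1, 10.3],
    "net_migration": [-4.0, -3.5, -4.2, 5.0, 5.5, 4.8],
})
happiness = pd.DataFrame({
    "country": ["A", "B", "C", "D", "E", "F"],
    "happiness_score": [3.9, 4.1, 3.8, 7.2, 7.4, 7.1],
})

# Combine the datasets on their shared key
merged = countries.merge(happiness, on="country", how="inner")

# Correlation matrix: the table that a correlation heat map would visualize
features = merged.drop(columns="country")
corr = features.corr()
print(corr.round(2))

# k-means with two clusters (an assumed value) on the numeric columns
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)
merged["cluster"] = kmeans.fit_predict(features)
print(merged[["country", "cluster"]])
```

With real data, the numeric columns would normally be scaled before clustering, and the number of clusters would be chosen with a method such as the elbow plot rather than fixed in advance.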

Mr. Data Science

I’m just a nerdy engineer that has too much time on his hands and I’ve decided to help people around the world learn about data science!
