Photo by Jason Rosewell on Unsplash

Natural language generation (NLG) is the process of creating text using software. In general, it can be divided into a few subgroups[1]:

  • text-to-text generation, such as machine translation
  • text summarization
  • open-domain conversation response generation
  • data-to-text generation

NLG is growing in popularity because it has so many applications in areas such as journalism, business, and law. With NLG, you can complete tasks such as writing product descriptions, engaging with users, and writing investigative reports. NLG is frequently used to generate social media posts, such as on Twitter, and to retroactively caption images throughout the web.

One important concept in natural language generation is…


By Mr. Data Science

Photo by Atul Pandey on Unsplash

Throughout this article, we will explore migration data to gain a better understanding of migration drivers. Since migration remains a contentious political issue, we will refrain from giving opinions and focus on the data instead. To investigate migration drivers, we will use a couple of datasets (all of them CSV files):

  • The country data set was downloaded from Kaggle
  • The happiness reports (5 files) were also downloaded from Kaggle

The goals for this article are to:

  1. demonstrate some useful data science techniques such as combining datasets, generating correlation heat maps, and applying k-means to a dataset
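
As a rough illustration of the first two techniques named above (combining datasets and computing the correlations behind a heat map), here is a minimal sketch. The two small tables and all of their column names are hypothetical stand-ins, not the actual Kaggle files.

```python
import pandas as pd

# Hypothetical stand-ins for the two Kaggle files; the real column names may differ.
countries = pd.DataFrame({
    "country": ["A", "B", "C", "D"],
    "net_migration": [4.0, -2.0, 1.0, -5.0],
    "gdp_per_capita": [40000, 8000, 25000, 3000],
})
happiness = pd.DataFrame({
    "country": ["A", "B", "C", "E"],
    "happiness_score": [7.2, 4.9, 6.1, 5.5],
})

# An inner join keeps only the countries present in both tables.
combined = countries.merge(happiness, on="country", how="inner")

# Pairwise Pearson correlations between the numeric columns:
# this is the matrix a correlation heat map would display.
corr = combined.select_dtypes(include="number").corr()
print(corr.round(2))
```

Passing `corr` to a plotting function such as seaborn's `heatmap` would be one way to draw the heat map itself.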


Use Python to find when your favorite superhero character appeared

A pile of comic books.
Photo by Erik Mclean on Unsplash

There’s a famous quote by the American engineer/statistician W. Edwards Deming:

“Without data, you’re just another person with an opinion.”

One of the first steps you take when working with a new data set is to perform exploratory data analysis (EDA). The overarching objective of EDA is to help data scientists understand what the data contains and what types of questions the data will be able to answer. Note: EDA doesn’t attempt to answer any single question. It’s an investigative tool in your belt. Throughout this article, we’ll use a variety of EDA techniques on Marvel versus DC Comics data.
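
To make the idea concrete, a first EDA pass often just asks what columns exist, how much is missing, and how the records are distributed. The sketch below does that with the standard library only; the sample rows and the column names (`name`, `publisher`, `year`) are made up for illustration and are not the real Marvel/DC data.

```python
import csv
import io
from collections import Counter

# Hypothetical sample rows; the real Marvel/DC files and their columns may differ.
raw = """name,publisher,year
Hero A,Marvel,1962
Hero B,DC,1939
Hero C,Marvel,1962
Hero D,DC,
"""

rows = list(csv.DictReader(io.StringIO(raw)))

# What does the data contain? Column names and missing values per column.
columns = list(rows[0].keys())
missing = {col: sum(1 for r in rows if not r[col]) for col in columns}

# How are the records distributed across publishers and first-appearance years?
by_publisher = Counter(r["publisher"] for r in rows)
by_year = Counter(r["year"] for r in rows if r["year"])

print(len(rows), columns)
print(missing)
print(by_publisher.most_common())
print(by_year.most_common())
```

None of these summaries answers a specific question; they only show which questions the data could support.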

1. Getting Started: Preprocessing the Data


Want to publish your story on The Data Science Publication?

Just leave a comment on this article expressing your interest and we will review your previous articles. If you meet…


Use a Kaggle dataset and a few Python libraries to get started

Person watching Netflix
Photo by Mollie Sivaram on Unsplash.

Recommender systems are used on large online platforms like Netflix and YouTube to recommend movies, shows, or videos based on what you have watched in the past. Recommender systems are also commonly used in the online retail space. One commonly cited statistic is that Amazon makes about one-third of its sales from recommended products. If the other two-thirds are taken as the baseline, that is 50% more revenue. Just imagine making 50% more money than you currently do.

If you ask me, learning how to implement a recommender system is well worth the time commitment.

The recommender system described in this article will be simple but will demonstrate the fundamental problems that need…


By Mr. Data Science


Throughout this article, I will use the MNIST dataset to show you how to reduce image noise using a simple autoencoder. First, I will demonstrate how you can artificially inject noise into your images. Next, I will describe the process for creating an autoencoder, and finally, I will test the autoencoder on images with a few different signal-to-noise ratios (SNRs) to assess the model’s robustness. Note that the goal of this article is to introduce you to the concept of noise reduction with autoencoders, not to teach you the nuances of autoencoder architectures and design.

According to Wikipedia…
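
The noise-injection step mentioned above can be sketched as follows. This is one possible approach (additive white Gaussian noise scaled to a target SNR in decibels), not necessarily the method used in the full article; the function name `add_gaussian_noise` is hypothetical, and a random array stands in for a real MNIST digit.

```python
import numpy as np

def add_gaussian_noise(image, snr_db, rng):
    """Return a copy of `image` with white Gaussian noise at roughly `snr_db` decibels."""
    signal_power = np.mean(image ** 2)
    # SNR(dB) = 10 * log10(signal_power / noise_power), solved for noise_power.
    noise_power = signal_power / (10 ** (snr_db / 10))
    noise = rng.normal(0.0, np.sqrt(noise_power), size=image.shape)
    # Clip so pixel values stay in the [0, 1] range used for normalized images.
    return np.clip(image + noise, 0.0, 1.0)

rng = np.random.default_rng(0)
# Stand-in for one normalized 28x28 MNIST digit; this is not real MNIST data.
image = rng.random((28, 28))
noisy = add_gaussian_noise(image, snr_db=10, rng=rng)
print(noisy.shape, float(noisy.min()), float(noisy.max()))
```

Because of the clipping, the SNR actually achieved is only approximately the requested one. Lower `snr_db` values produce noisier images, which is how a range of test conditions could be generated.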


By Mr. Data Science

k-Nearest Neighbor (KNN) is a classification algorithm and should not be confused with k-Means; the two are very different algorithms with very different uses. k-Means is an unsupervised clustering algorithm: given some data, it groups that data into k clusters, where k is a positive integer. k-Nearest Neighbor is a supervised classification algorithm. Note that a supervised algorithm learns from labelled training data, whereas an unsupervised algorithm works on data that has no labels. We …
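
To show the supervised side of that distinction, here is a minimal k-Nearest Neighbor sketch in plain Python: it classifies a new point by majority vote among the k closest labelled examples. The function name `knn_predict` and the toy data are assumptions made for illustration.

```python
import math
from collections import Counter

def knn_predict(train_points, train_labels, query, k=3):
    """Classify `query` by majority vote among its k nearest labelled training points."""
    distances = sorted(
        (math.dist(point, query), label)
        for point, label in zip(train_points, train_labels)
    )
    nearest_labels = [label for _, label in distances[:k]]
    return Counter(nearest_labels).most_common(1)[0][0]

# Made-up labelled training data: two well-separated groups.
points = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1), (5.0, 5.0), (5.2, 4.9), (4.8, 5.1)]
labels = ["small", "small", "small", "large", "large", "large"]

print(knn_predict(points, labels, (1.1, 1.0)))
print(knn_predict(points, labels, (4.9, 5.0)))
```

Note that there is no separate training step: the labelled examples themselves are the model, which is what makes the labels essential.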


By Mr. Data Science

Photo by Tachina Lee on Unsplash

In this article, we will attempt to use machine learning, specifically ensemble learning, to detect fake news. First, let’s define what is meant by the term ‘fake news’, at least within this article. A statement such as ‘NASA discovers an alien civilisation living on the moon’ is fake news in the sense that it is factually incorrect. There is another use of the term, where some people attempt to dismiss anything that challenges their world view as fake news; we will not be using that definition. …


By Mr. Data Science

Throughout this article, we will describe how you can use decision trees and random forest classifiers to predict the cause of wildfires. First, we will use SQLite to import the data into a pandas DataFrame. Next, we will do some preprocessing and data exploration to better understand the dataset. Finally, we will apply a random forest classifier to the complete dataset, as well as to a subset (California wildfires). The concepts described in this article are applicable to a wide range of problems. If you have any feedback, we look forward to hearing from you.

Fundamentally, a…
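
The first step described above, loading SQLite data into a pandas DataFrame, might look roughly like this. An in-memory database with a hypothetical `fires` table and made-up columns stands in for the real wildfire file, whose schema is not shown here.

```python
import sqlite3
import pandas as pd

# An in-memory database with a hypothetical table stands in for the wildfire SQLite file.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE fires (state TEXT, cause TEXT, size_acres REAL)")
conn.executemany(
    "INSERT INTO fires VALUES (?, ?, ?)",
    [("CA", "Lightning", 120.5), ("CA", "Arson", 3.2), ("TX", "Debris Burning", 0.8)],
)
conn.commit()

# read_sql_query runs the SQL and returns the result set as a DataFrame.
df = pd.read_sql_query("SELECT state, cause, size_acres FROM fires", conn)
conn.close()

# Selecting a subset, such as one state's fires, is then ordinary DataFrame filtering.
california = df[df["state"] == "CA"]
print(df.shape, len(california))
```

With a real file, `sqlite3.connect` would take the path to the database instead of `":memory:"`.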


By Mr. Data Science

Let’s say we are data scientists working for a retail company and our boss wants to create a targeted marketing campaign. In order to focus the campaign, we have to divide the set of customers into smaller subsets based on the features in our customer dataset. Features are just the columns in the dataset and each row represents a unique customer. So as the data science team, our job is to somehow find those groups.

This task is different from many other machine learning tasks in that we don’t have any labelled data so we can’t…
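
Finding groups without labels is a clustering problem. As a minimal sketch, assuming k-means is the technique of choice, the plain NumPy implementation below alternates between assigning customers to the nearest centroid and moving each centroid to the mean of its customers. The `kmeans` function, the feature columns, and the customer values are all made up for illustration.

```python
import numpy as np

def kmeans(data, k, n_iters=20, seed=0):
    """Plain Lloyd's algorithm: alternate between assigning points and moving centroids."""
    rng = np.random.default_rng(seed)
    # Start from k distinct rows chosen at random.
    centroids = data[rng.choice(len(data), size=k, replace=False)]
    for _ in range(n_iters):
        # Distance from every point to every centroid, shape (n_points, k).
        dists = np.linalg.norm(data[:, None, :] - centroids[None, :, :], axis=2)
        assignments = dists.argmin(axis=1)
        # Move each centroid to the mean of its points; keep it in place if it has none.
        centroids = np.array([
            data[assignments == j].mean(axis=0) if np.any(assignments == j) else centroids[j]
            for j in range(k)
        ])
    return assignments, centroids

# Made-up customers: columns are annual spend and visits per month.
customers = np.array([
    [200.0, 1.0], [220.0, 2.0], [210.0, 1.0],
    [900.0, 8.0], [950.0, 9.0], [920.0, 7.0],
])
assignments, centroids = kmeans(customers, k=2)
print(assignments)
```

In practice the features would usually be scaled first, since here annual spend dominates the distance calculation, and a library implementation such as scikit-learn's `KMeans` would be the more common choice.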

Mr. Data Science

I’m just a nerdy engineer who has too much time on his hands, and I’ve decided to help people around the world learn about data science!
