March 19, 2018 · EN snippets

Turn categorical variables into numerics

In data science we need to work with numbers, regardless of what a variable contains.

If our data includes categorical columns, we can convert them into numerical ones.

When the data is loaded with pandas, this conversion takes very little effort.

In the screenshot below, consider Source Port as the categorical column. It holds three distinct values (80, 443 and 0).
We need to split them into separate columns so that no value outweighs the others: left as raw numbers, 443 would look larger than 80 to a model, even though port numbers are only labels.

We'll use one-hot encoding. This method splits a categorical column with N distinct values into N columns, each containing only 1s and 0s. The following Python line does this.

pandas.get_dummies(data, columns=['column_name'])

The following screenshot shows the output, where the Source Port column is split into three columns.
If a row has 80 as its Source Port value, it is marked with 1 in the column for 80 and with 0 in the other two.
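
For readers without the screenshots, here is a minimal sketch of the same step on a small, made-up DataFrame. The sample rows and the Packets column are assumptions for illustration only; just the Source Port values (80, 443 and 0) come from the text above. Passing dtype=int is also a choice made here so that the new columns hold 1 and 0 rather than booleans, which newer pandas versions return by default.

```python
import pandas as pd

# Hypothetical sample data; only the port values 80, 443 and 0 come from the post.
data = pd.DataFrame({
    "Source Port": [80, 443, 0, 80],
    "Packets": [10, 3, 7, 1],  # illustrative numeric column, left untouched
})

# One-hot encode Source Port: one new column per distinct value.
# dtype=int forces 1/0 output instead of True/False.
encoded = pd.get_dummies(data, columns=["Source Port"], dtype=int)

print(encoded)
```

The original Source Port column is dropped and replaced by one column per distinct value, named with the original column name as a prefix. Columns that are not listed in columns= pass through unchanged.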

With one-hot encoding applied to every applicable column, it becomes possible to feed the data into neural networks.
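
As a rough sketch of that last step, the snippet below encodes several categorical columns in one call and converts the result to a float array, the kind of input most neural network libraries expect. The Protocol and Packets columns, the sample rows and the choice of float32 are all assumptions made for illustration, not part of the original data.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data with two categorical columns and one numeric column.
data = pd.DataFrame({
    "Source Port": [80, 443, 0],
    "Protocol": ["tcp", "tcp", "udp"],  # assumed column, for illustration
    "Packets": [10, 3, 7],              # assumed column, for illustration
})

# List every column that should be treated as categorical.
categorical_columns = ["Source Port", "Protocol"]

# Encode all of them at once; dtype=int gives 1/0 instead of booleans.
encoded = pd.get_dummies(data, columns=categorical_columns, dtype=int)

# Convert to a purely numeric array that a neural network library can consume.
features = encoded.to_numpy(dtype=np.float32)

print(encoded.columns.tolist())
print(features.shape)
```

Each categorical column contributes one output column per distinct value, so the width of the feature array grows with the number of categories.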