Text Matching for Data Manipulation in Pandas using Fuzzywuzzy

Photo by Boitumelo Phetla on Unsplash

Since I started my journey to become a Data Scientist, I have learned a ton of useful things that are crucial when dealing with data from multiple sources. Among them, I find text/string matching to be one of the most useful techniques to know.

Where do we even need it?

Note that the “State” column has more than the 17 states shown here. It actually has 35 states (I have counted the UTs of India as states as well).

This Data Frame has two columns. The “State” column contains all the administrative territories of India, and the “Total IPC Crimes” column contains the total number of crimes registered under the Indian Penal Code in the year 2011.

But the problem here is that these absolute crime numbers aren’t much help when we want to know how severe crime is in each of these states. In such cases, a statistical measure like crimes per million with respect to population is much more useful. To compute that, we need the population data for India in the year 2011.

So, I searched the web and found a data set containing the Indian Population census for the year 2011. I cleaned and loaded the data into a Data Frame called “Indian_Population”. Take a look at it below.

Note that the “State” column has more than the 17 states shown here. It actually has 35 states (I have counted the UTs of India as states as well).

Now, all we have to do is merge the Indian_Crime_Data and Indian_Population Data Frames on the “State” column, and we can easily calculate the crimes per million. But if we take a closer look at the “State” columns of both Data Frames, we will notice a problem: since the two Data Frames were built from different data sources, the names of some states are written differently even though they are the same states.

For example, the very first state, “ANDAMAN AND NICOBAR ISLANDS”, in Indian_Population is written as “A & N ISLANDS” in the Indian_Crime_Data Data Frame. Such differences in the naming of the same states will make it hard to merge the two Data Frames.
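To see why this matters, here is a minimal sketch of the problem (the two-row Data Frames and their values are illustrative assumptions, not the real figures): an exact merge on “State” silently drops every state whose spelling differs between the two sources.

```python
import pandas as pd

# Illustrative sample: the same state spelled two different ways.
crime = pd.DataFrame({'State': ['A & N ISLANDS', 'ANDHRA PRADESH'],
                      'Total IPC Crimes': [900, 190000]})
population = pd.DataFrame({'State': ['ANDAMAN AND NICOBAR ISLANDS', 'ANDHRA PRADESH'],
                           'Population': [380000, 84000000]})

# An exact merge keeps only the states whose names match exactly.
merged = crime.merge(population, on='State')
print(merged)  # only ANDHRA PRADESH survives; A & N ISLANDS is dropped
```

With real data, a silent drop like this is easy to miss, which is exactly why the spelling mismatch has to be fixed before merging.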

Fuzzywuzzy to the rescue

But thanks to the hard work of some wonderful people, we don’t need to know string-similarity algorithms, nor do we need to implement them from scratch, as Python has several libraries built on such algorithms. One such library is called Fuzzywuzzy. It is based on the Levenshtein distance algorithm, which we absolutely don’t need to understand in order to use the library.

Installing Fuzzywuzzy

pip install fuzzywuzzy

Solving our Problem using Fuzzywuzzy

from fuzzywuzzy import process

Now, all we need are these four lines of code to solve our problem.

# for loop to match each state in Indian_Crime_Data with Indian_Population
for state in Indian_Crime_Data['State']:
    match = process.extract(state, Indian_Population['State'], limit=1)
    Indian_Crime_Data['State'] = Indian_Crime_Data['State'].str.replace(state, match[0][0])

Let me explain what process.extract() does. It takes three arguments:

  1. A query string that we compare against a collection of strings
  2. The collection of strings to match against
  3. A limit = n (integer) argument specifying how many of the best matches to return in the list

After passing these arguments and executing process.extract(), it returns a list of tuples, each containing a match along with its similarity score. The tuples are sorted in descending order by score. Check the example below to see how process.extract() works.

Since we passed the argument limit = 1, only one tuple will be present in the list.

The Final Data Frame:

# Merging the Data Frames on State column
Indian_crime = Indian_Crime_Data.merge(Indian_Population,on = 'State')
Note that the “State” column has more than the 16 states shown here. It actually has 35 states (I have counted the UTs of India as states as well).
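From here, the crimes-per-million measure mentioned earlier is a one-line computation on the merged Data Frame. A minimal sketch, assuming the merged frame has a “Population” column alongside “Total IPC Crimes” (the column names and the numbers below are illustrative assumptions, not the real census or crime figures):

```python
import pandas as pd

# A stand-in for the merged Data Frame, with illustrative values only.
Indian_crime = pd.DataFrame({
    'State': ['ANDHRA PRADESH', 'ARUNACHAL PRADESH'],
    'Total IPC Crimes': [190000, 2800],
    'Population': [84000000, 1380000],
})

# Crimes per million = total crimes / (population in millions)
Indian_crime['Crimes Per Million'] = (
    Indian_crime['Total IPC Crimes'] / (Indian_crime['Population'] / 1_000_000)
)
print(Indian_crime)
```

Dividing by population in millions puts every state on the same scale, so their crime severity can be compared directly.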

This is just one of many scenarios where knowing Fuzzywuzzy can be quite handy. With this, I would like to thank you for reading my very first article on this platform and I hope to see you again soon.

An aspiring Data Scientist( Yes, it’s a classy way of saying I am a Noob. LOL) happy to share my struggles and knowledge.