Since I started my journey to become a Data Scientist, I have learned a ton of useful techniques for dealing with data from multiple sources. Among them, I find text/string matching to be one of the most useful to know.
Where do we even need it?
To explain the importance of Text/String matching let me present you with a scenario. I have a Data Frame named Indian_Crime_Data representing India’s Total Crimes under the Indian Penal Code in each State in the year 2011.
This Data Frame has two columns. The “State” column lists all the administrative territories of India, and the “Total IPC Crimes” column gives the total number of registered crimes under the Indian Penal Code in the year 2011.
But the problem here is that these absolute crime numbers aren’t much help when we want to know how severe crime in these States is. In such cases, a normalized measure like crimes per million people is far more useful. To calculate it, we need India’s population data for the year 2011.
So, I searched the web and found a data set containing the Indian Population census for the year 2011. I cleaned and loaded the data into a Data Frame called “Indian_Population”. Take a look at it below.
Now, all we have to do is merge the Indian_Crime_Data and Indian_Population Data Frames on the “State” column, and we can easily calculate the crimes per million. But if we take a closer look at the “State” columns of both Data Frames, we will notice a problem. Because these two Data Frames were built from different data sources, the names of some states are written differently even though they refer to the same states.
For example, the very first state, “ANDAMAN AND NICOBAR ISLANDS”, in Indian_Population is written as “A & N ISLANDS” in the Indian_Crime_Data Data Frame. Such differences in the naming of the same states in the “State” columns of these two Data Frames make it hard to merge them.
Fuzzywuzzy to the rescue
In scenarios such as these, text matching based on minimum edit distance algorithms will help us. These algorithms match strings based on a score, calculated from the minimum number of edits (insertions, deletions, and substitutions) required to transform one string into another.
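To make the idea concrete, here is a minimal sketch of such an edit-distance computation written from scratch (my own illustration, not code from any library):

```python
def edit_distance(a: str, b: str) -> int:
    """Minimum number of insertions, deletions, and substitutions
    needed to turn string a into string b (Levenshtein distance)."""
    # prev[j] holds the distance between the current prefix of a and b[:j]
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete a character from a
                            curr[j - 1] + 1,      # insert a character into a
                            prev[j - 1] + cost))  # substitute (or keep) a character
        prev = curr
    return prev[-1]

print(edit_distance("kitten", "sitting"))  # → 3
```

The lower the distance, the more similar the strings, so the libraries below turn this distance into a similarity score.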
But thanks to the hard work of some wonderful people, we don't need to know such algorithms, nor do we need to implement them from scratch, as Python has several libraries built on them. One such library is called Fuzzywuzzy. It uses the Levenshtein distance algorithm, which we absolutely don't need to understand in order to use the library.
pip install fuzzywuzzy
Solving our Problem using Fuzzywuzzy
First, import the “process” module from Fuzzywuzzy
from fuzzywuzzy import process
Now, all we need is these few lines of code to solve our problem.
# Match each state name in Indian_Crime_Data to its closest
# counterpart in Indian_Population and replace it
for state in Indian_Crime_Data['State']:
    match = process.extract(state, Indian_Population['State'], limit=1)
    Indian_Crime_Data['State'] = Indian_Crime_Data['State'].str.replace(
        state, match[0][0], regex=False)
Let me explain what process.extract() does. It takes three arguments:
- A string that we compare against a collection of strings
- The collection of strings to compare against
- A limit = n (integer) argument specifying how many of the best matches should be returned
After passing these arguments and executing process.extract(), it returns a list of tuples, each containing a match along with its score. These tuples are sorted in descending order of score. Check the example below to see how process.extract() works.
The Final Data Frame:
After we are done with this text/string matching using Fuzzywuzzy, we can easily merge the data on the ‘State’ column.
# Merging the Data Frames on State column
Indian_crime = Indian_Crime_Data.merge(Indian_Population,on = 'State')
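Once the Data Frames are merged, the crimes-per-million figure mentioned earlier is a one-liner. Here is a minimal sketch using toy stand-in data; the column name 'Population' and the numbers themselves are illustrative assumptions, not taken from the actual census Data Frame:

```python
import pandas as pd

# Toy stand-in for the merged Data Frame described above
# (the 'Population' column name and all figures are illustrative)
Indian_crime = pd.DataFrame({
    'State': ['A & N ISLANDS', 'DELHI'],
    'Total IPC Crimes': [700, 54000],
    'Population': [380581, 16787941],
})

# Crimes per million people
Indian_crime['Crimes Per Million'] = (
    Indian_crime['Total IPC Crimes'] / Indian_crime['Population'] * 1_000_000
)
print(Indian_crime[['State', 'Crimes Per Million']])
```

This normalized column finally lets us compare crime severity across states of very different sizes.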
This is just one of many scenarios where knowing Fuzzywuzzy can be quite handy. With this, I would like to thank you for reading my very first article on this platform and I hope to see you again soon.