paint-brush
Master Company Name Matching in Minutes with Python's Ready-Made Librariesby@ismailkarankin

Master Company Name Matching in Minutes with Python's Ready-Made Libraries

by Ismail KarankinJanuary 6th, 2025
Read on Terminal Reader
Read this story w/o Javascript
tldt arrow

Too Long; Didn't Read

Company matching is the process of comparing similar names to determine the degree of similarity between different records in a data source. The goal of matching is to identify companies that may be represented under different names in various sources.
featured image - Master Company Name Matching in Minutes with Python's Ready-Made Libraries
Ismail Karankin HackerNoon profile picture

Company matching is the process of comparing similar names to determine the degree of similarity between different records in a data source. The goal of matching is to identify companies that may be represented under different names in various sources.

How Name Identification Works: An Example

Let's take Apple as an example for our experiment and imagine we already have a data source with different names for this company. Then, if we encounter names like "Epl," "AAPL," "Apple Inc," we should identify them as the same company by comparing their string units. It's worth noting that in real situations, companies can have much more complex spelling variants and more variability than in the example above. In such cases, more advanced matching algorithms are used that meet specific requirements.

Matching Methods

As mentioned earlier, we have several types of matching and finding correspondence between records. The choice of method depends on the scenarios we're working with and the requirements related to the nature of the data. In this article, we'll use the string method, which is effective when comparing ready-made text strings. However, if we have multiple feature data, not just strings — for example, images — then in such cases, it's worth using a machine learning-based algorithm. In general, we can identify two main data matching methods:


String Algorithm: Used for comparing string sequences Machine Learning Algorithm: Used when it's necessary to consider multiple different factors or data types.

Matching Difficulties

The main difficulty of any matching is data search and processing. Data mining is an entire separate field with unique characteristics. One of the problems we may encounter when matching company names includes:

  • Typos in company names
  • Data taken from different sources that have inconsistencies or outdated company names
  • Changes in real company data over time
  • Ambiguity of company names and vagueness


When working with data, we often encounter various problems; this is normal practice, so before working with real data, you need to be confident in complete data standardization.

Data Generation

First, we need to generate test data. We will generate it in 2 txt files. The code first imports the necessary libraries: pandas for data manipulation, and random and string for generating random names. Then a function generate_company_name is defined, which creates random company names. Each word is formed from a random number of letters (from 5 to 10). The generated data is saved in two files (companies_a.txt and companies_b.txt).


Here's the complete code:

import pandas as pd
import random
import string

def generate_company_name():
    prefix = random.choice(['Tech', 'Global', 'International', 'Innovative', 'Advanced'])
    suffix = random.choice(['Corp.', 'Ltd.', 'Inc.', 'Group', 'Holdings'])
    word_count = random.randint(1, 3)
    words = [''.join(random.choice(string.ascii_uppercase + string.ascii_lowercase) for _ in range(random.randint(5, 10))) for _ in range(word_count)]
    return f'{prefix} {" ".join(words)} {suffix}'

data_a = {'Company name': [generate_company_name() for _ in range(100)]}
data_b = {'name': [generate_company_name() for _ in range(100)]}

df_companies_a = pd.DataFrame(data_a)
df_companies_b = pd.DataFrame(data_b)

df_companies_a.to_csv('companies_a.txt', index=False, header=False)
df_companies_b.to_csv('companies_b.txt', index=False, header=False)

print("Data saved in companies_a.txt and companies_b.txt")  

Preprocessing

First, the data needs to be cleaned and put in order, remove spaces at the edges, use 'lower()' for converting to lowercase. Typos can also be automatically corrected. Using pandas library.


pandas is a high-level Python library for data analysis. It is built on top of the lower-level NumPy library (written in C), which is a big advantage for performance.

import pandas as pd

df_companies_a = pd.read_csv('companies_a.txt', header=None, names=['Company name'])
df_companies_b = pd.read_csv('companies_b.txt', header=None, names=['name'])

df_companies_a['Company name'] = df_companies_a['Company name'].str.strip().str.lower()
df_companies_b['name'] = df_companies_b['name'].str.strip().str.lower()

df_companies_b['name'] = df_companies_b['name'].replace({
    'amerika': 'america', 
    'hscb': 'hsbc',
    'credite': 'credit'
}, regex=True)

print("DataFrame A after cleaning:")
print(df_companies_a)
print("\nDataFrame B after cleaning:")
print(df_companies_b)


You can remove punctuation marks from company names to make the text cleaner and more unified.

df_companies_a['Company name'] = df_companies_a['Company name'].str.replace(r'\W\s', ' ').str.strip().str.lower()
df_companies_b['name'] = df_companies_b['name'].str.replace(r'\W\s', ' ').str.strip().str.lower()


Remove unnecessary digits.

df_companies_a['Company name'] = df_companies_a['Company name'].str.replace(r'\d', '').str.strip().str.lower()
df_companies_b['name'] = df_companies_b['name'].str.replace(r'\d', '').str.strip().st


Also apply lemmatization and stemming

from nltk.stem import PorterStemmer
stemmer = PorterStemmer()

df_companies_a['Company name'] = df_companies_a['Company name'].apply(lambda x: ' '.join([stemmer.stem(word) for word in x.split()]))
df_companies_b['name'] = df_companies_b['name'].apply(lambda x: ' '.join([stemmer.stem(word) for word in x.split()]))


This code uses the NLTK library's PorterStemmer to perform word stemming on company names, reducing words to their root form.

Implementation of the Jaro-Winkler Method

Jaro Similarity is also a good method for solving the task at hand. The Jaro distance value ranges from 0 to 1, where 1 means the strings are equal, and 0 means there is no similarity between the two strings.


For example,

string1 = "WINKLER"

string2 = "WINSTER"

Jaro Similarity = 0.809524


The following is an example of Jaro-Winkler implementation:


Companies_a.txt contains entries like:

Innovative YNLPejYn Group
International nDzRM xaDeLg MVJVRgyY Group
International wFZYgs rYwYBWMnG Ltd.
...


Companies_b.txt contains entries like:

International MGQTEGqM AuTAJCq eiWIKZ Ltd.
Advanced TFOzjbP wSQeJ Holdings
Global YmIBVsO Ltd.
...


The advantages of the Jaro Winkler method are as follows.


First, let's import the required libraries.

pandas as pd
from fuzzywuzzy import fuzz

def jaro_winkler_similarity(s1, s2):
    return fuzz.ratio(s1, s2)


Convert company names to a uniform format (to lowercase).

df_companies_a = pd.read_csv('companies_a.txt', header=None, names=['Company name'])
df_companies_b = pd.read_csv('companies_b.txt', header=None, names=['name'])

df_companies_a['Company name'] = df_companies_a['Company name'].str.lower()
df_companies_b['name'] = df_companies_b['name'].str.lower()


Create a similarity matrix and DataFrame for each pair of companies.

df_similarity = pd.DataFrame(similarity_matrix,
    columns=df_companies_b['name'], index=df_companies_a['Company name'])


Print the result.

print(df_similarity)


If implementing the jaro_winkler method manually, the code will look as follows:


If strings are equal:

from math import floor

def jaro_similarity(str1, str2):
    if str1 == str2:
        return 1.0

    len1, len2 = len(str1), len(str2)
    max_dist = floor(max(len1, len2) / 2) - 1
    match = 0
    hash_str1 = [0] * len(str1)
    hash_str2 = [0] * len(str2)


After this we count matches:

for i in range(len1):
    for j in range(max(0, i - max_dist), min(len2, i + max_dist + 1)):
        if str1[i] == str2[j] and hash_str2[j] == 0:
            hash_str1[i] = 1
            hash_str2[j] = 1
            match += 1
            break


If there are no matches:

if match == 0:
    return 0.0

t_point = 0.0


After this we count the number of transpositions and return the Jaro Similarity coefficient.

for i in range(len1):
    if hash_str1[i]:
        while hash_str2[point] == 0:
            point += 1
        if str1[i] != str2[point]:
            t += 1
        point += 1
t = t // 2

return (match / len1 + match / len2 + (match - t) / match) / 3.0

string1 = "WINKLER"
string2 = "WINSTER"
print(round(jaro_similarity(s1, s2), 6))


The advantages of the Jaro Winkler method are as follows. First, it works effectively with short and medium-length strings, which are common in company names. Additionally, it allows users to set a similarity threshold to determine what level of similarity is considered a match, providing flexibility depending on matching requirements. The Jaro Similarity metric also measures the closeness between two strings and assigns a value indicating the degree of similarity, which can be useful for ranking or categorizing matches.

Implementation of Cosine Similarity

Cosine similarity is a metric that measures the angle between two vectors in data (in our case, it's textual information, company names). Cosine similarity evaluates the similarity between these vectors by calculating the cosine of the angle they form in some space.


If cosine similarity equals 1, it means maximum similarity between vectors, while 0 indicates no similarity.


Implementation in Python looks as follows (using the same data):


Import libraries:

import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity

df_companies_a = pd.read_csv('companies_a.txt', header=None, names=['Company name'])
df_companies_b = pd.read_csv('companies_b.txt', header=None, names=['name'])

df_companies_a['Company name'] = df_companies_a['Company name'].str.lower()
df_companies_b['name'] = df_companies_b['name'].str.lower()


For simplicity, we combine both dataframes into one dataframe:

df_combined = pd.concat([df_companies_a, df_companies_b], axis=1)


Process text data:

vectorizer = CountVectorizer().fit_transform(df_combined['Company name'].values.astype(U))
cosine_similarities = cosine_similarity(vectorizer)


Output similarity scores for each pair of companies:

print("Cosine Similarity matrix:")
print(cosine_similarities)

for i in range(len(df_companies_a)):
    for j in range(len(df_companies_b)):
        print(f"Similarity between '{df_companies_a['Company name'][i]}' and '{df_companies_b['name'][j]}': {cosine_similarities[i, j]}")


Cosine similarity is advantageous for comparing text data since it measures the cosine of the angle between vectors, making it insensitive to vector length. However, Cosine similarity does not take into account word order, and all measurements are considered equally important.

Implementation of Levenshtein Distance

Levenshtein distance is a similarity metric between two string sequences. The larger the distance, the more different the strings are. For two identical sequences, the distance is zero.


For example, between the words "ключ" and "клюв" the Levenshtein distance = 1. Because only one letter needs to be replaced.


Example of Python implementation without external libraries:


First, we initialize the distance matrix and the first column and row of the matrix:

def levenshtein_distance(str1, str2):
    len_str1 = len(str1) + 1
    len_str2 = len(str2) + 1
    
    matrix = [[0 for _ in range(len_str2)] for _ in range(len_str1)]
    
    for i in range(len_str1):
        matrix[i][0] = i
    for j in range(len_str2):
        matrix[0][j] = j


Then we fill the matrix, calculate the distance:

for i in range(1, len_str1):
    for j in range(1, len_str2):
        cost = 0 if str1[i - 1] == str2[j - 1] else 1
        matrix[i][j] = min(
            matrix[i - 1][j] + 1,      # deletion
            matrix[i][j - 1] + 1,      # insertion
            matrix[i - 1][j - 1] + cost # substitution
        )


Example:

str1 = "клюв"
str2 = "ключ"
distance = levenshtein_distance(str1, str2)
print(f"Levenshtein distance between '{str1}' and '{str2}': {distance}")


Output:

Levenshtein distance between 'клюв' and 'ключ': 1
Process finished with exit code 0


It should be noted that the Levenshtein distance method is sensitive to string length.

Conclusion

We studied the company matching process, which involves comparing and identifying similar records across different data sources. We learned to use string methods and libraries, such as name_matcher in Python, for effective text string comparison. We understood that company matching is important for removing duplicates, ensuring data analysis accuracy, and creating a unified database. We also examined matching challenges, such as typos and data ambiguity, and learned to account for them when working with real data.