Company matching is the process of comparing records from different data sources to determine how similar they are. The goal of matching is to identify companies that may be represented under different names in various sources.
Let's take Apple as an example and imagine we already have a data source with different names for this company. If we encounter names like "Epl," "AAPL," or "Apple Inc," we should identify them as the same company by comparing the strings themselves. It's worth noting that in real situations, company names have far more complex spelling variants and more variability than in this example; in such cases, more advanced matching algorithms are used to meet specific requirements.
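As a quick illustration, such a comparison can be sketched with SequenceMatcher from Python's standard difflib module (the sample names here are illustrative; real pipelines use more robust metrics, discussed below):

```python
from difflib import SequenceMatcher

def name_similarity(a: str, b: str) -> float:
    # ratio() returns a similarity score between 0 (no similarity) and 1 (equal)
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

# Illustrative variants of the same company name
variants = ["Apple Inc", "AAPL", "Epl"]
canonical = "Apple"
for v in variants:
    print(f"{v!r}: {name_similarity(canonical, v):.2f}")
```

Even this naive score ranks "Apple Inc" closer to "Apple" than a heavier misspelling like "Epl".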
As mentioned earlier, there are several ways of matching and finding correspondence between records. The choice of method depends on the scenario and on the nature of the data. In this article, we'll use the string approach, which is effective when comparing ready-made text strings. However, if the data contains multiple feature types, not just strings (for example, images), it's worth using a machine learning-based algorithm. In general, we can identify two main data matching methods:
String algorithms: used for comparing string sequences.
Machine learning algorithms: used when multiple factors or data types must be taken into account.
The main difficulty of any matching is finding and processing the data; data mining is an entire separate field with its own characteristics. Typical problems when matching company names include typos, abbreviations, different legal forms (Inc., Ltd., Corp.), and inconsistent formatting. Encountering such problems is normal practice, so before working with real data you need to be confident that it is fully standardized.
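As one concrete standardization step, legal-form suffixes (Inc., Ltd., Corp. and so on) often differ between sources, so it can help to strip them before comparison. The sketch below is illustrative; the suffix list is a small assumption and is far from exhaustive:

```python
import re

# Illustrative, non-exhaustive list of legal-form suffixes
LEGAL_SUFFIXES = r'\b(inc|ltd|corp|group|holdings|llc)\.?$'

def normalize_name(name: str) -> str:
    name = name.strip().lower()
    name = re.sub(r'[^\w\s]', ' ', name)      # drop punctuation
    name = re.sub(r'\s+', ' ', name).strip()  # collapse whitespace
    name = re.sub(LEGAL_SUFFIXES, '', name)   # drop a trailing legal form
    return name.strip()

print(normalize_name("  Apple, Inc. "))  # -> "apple"
print(normalize_name("APPLE INC"))       # -> "apple"
```

After this step, "Apple, Inc." and "APPLE INC" compare as the same string.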
First, we need to generate test data, which we will save to two .txt files. The code first imports the necessary libraries: pandas for data manipulation, and random and string for generating random names. Then a function generate_company_name is defined, which creates random company names; each word is formed from a random number of letters (from 5 to 10). The generated data is saved in two files (companies_a.txt and companies_b.txt).
Here's the complete code:
import pandas as pd
import random
import string
def generate_company_name():
    prefix = random.choice(['Tech', 'Global', 'International', 'Innovative', 'Advanced'])
    suffix = random.choice(['Corp.', 'Ltd.', 'Inc.', 'Group', 'Holdings'])
    word_count = random.randint(1, 3)
    words = [''.join(random.choice(string.ascii_letters) for _ in range(random.randint(5, 10))) for _ in range(word_count)]
    return f'{prefix} {" ".join(words)} {suffix}'
data_a = {'Company name': [generate_company_name() for _ in range(100)]}
data_b = {'name': [generate_company_name() for _ in range(100)]}
df_companies_a = pd.DataFrame(data_a)
df_companies_b = pd.DataFrame(data_b)
df_companies_a.to_csv('companies_a.txt', index=False, header=False)
df_companies_b.to_csv('companies_b.txt', index=False, header=False)
print("Data saved in companies_a.txt and companies_b.txt")
First, the data needs to be cleaned and normalized: strip spaces at the edges, convert to lowercase with lower(), and optionally auto-correct common typos. We will use the pandas library for this.
pandas is a high-level Python library for data analysis. It is built on top of the lower-level NumPy library (written in C), which is a big advantage for performance.
import pandas as pd
df_companies_a = pd.read_csv('companies_a.txt', header=None, names=['Company name'])
df_companies_b = pd.read_csv('companies_b.txt', header=None, names=['name'])
df_companies_a['Company name'] = df_companies_a['Company name'].str.strip().str.lower()
df_companies_b['name'] = df_companies_b['name'].str.strip().str.lower()
df_companies_b['name'] = df_companies_b['name'].replace({
'amerika': 'america',
'hscb': 'hsbc',
'credite': 'credit'
}, regex=True)
print("DataFrame A after cleaning:")
print(df_companies_a)
print("\nDataFrame B after cleaning:")
print(df_companies_b)
You can remove punctuation marks from company names to make the text cleaner and more unified.
df_companies_a['Company name'] = df_companies_a['Company name'].str.replace(r'[^\w\s]', ' ', regex=True).str.strip().str.lower()
df_companies_b['name'] = df_companies_b['name'].str.replace(r'[^\w\s]', ' ', regex=True).str.strip().str.lower()
Remove unnecessary digits.
df_companies_a['Company name'] = df_companies_a['Company name'].str.replace(r'\d', '', regex=True).str.strip().str.lower()
df_companies_b['name'] = df_companies_b['name'].str.replace(r'\d', '', regex=True).str.strip().str.lower()
We can also apply stemming (lemmatization, which reduces words to their dictionary form, can be used in the same way):
from nltk.stem import PorterStemmer
stemmer = PorterStemmer()
df_companies_a['Company name'] = df_companies_a['Company name'].apply(lambda x: ' '.join([stemmer.stem(word) for word in x.split()]))
df_companies_b['name'] = df_companies_b['name'].apply(lambda x: ' '.join([stemmer.stem(word) for word in x.split()]))
This code uses the NLTK library's PorterStemmer to perform word stemming on company names, reducing words to their root form.
Jaro Similarity is also a good method for solving the task at hand. The Jaro distance value ranges from 0 to 1, where 1 means the strings are equal, and 0 means there is no similarity between the two strings.
For example, for
string1 = "WINKLER"
string2 = "WINSTER"
the matching characters are W, I, N, E, R (m = 5), there are no transpositions (t = 0), and both strings have length 7, so Jaro Similarity = (5/7 + 5/7 + 5/5) / 3 ≈ 0.809524.
The following is an example of a similarity computation on our generated data (a manual implementation of Jaro similarity is shown later in the article):
companies_a.txt contains entries like:
Innovative YNLPejYn Group
International nDzRM xaDeLg MVJVRgyY Group
International wFZYgs rYwYBWMnG Ltd.
...
companies_b.txt contains entries like:
International MGQTEGqM AuTAJCq eiWIKZ Ltd.
Advanced TFOzjbP wSQeJ Holdings
Global YmIBVsO Ltd.
...
First, let's import the required libraries.
import pandas as pd
from fuzzywuzzy import fuzz

def jaro_winkler_similarity(s1, s2):
    # Note: fuzz.ratio is a Levenshtein-based similarity on a 0-100 scale,
    # used here as a simple stand-in; a manual Jaro implementation is shown below
    return fuzz.ratio(s1, s2)
Convert company names to a uniform format (to lowercase).
df_companies_a = pd.read_csv('companies_a.txt', header=None, names=['Company name'])
df_companies_b = pd.read_csv('companies_b.txt', header=None, names=['name'])
df_companies_a['Company name'] = df_companies_a['Company name'].str.lower()
df_companies_b['name'] = df_companies_b['name'].str.lower()
Create a similarity matrix for each pair of companies and wrap it in a DataFrame.

similarity_matrix = [
    [jaro_winkler_similarity(a, b) for b in df_companies_b['name']]
    for a in df_companies_a['Company name']
]
df_similarity = pd.DataFrame(similarity_matrix,
    columns=df_companies_b['name'], index=df_companies_a['Company name'])
Print the result.
print(df_similarity)
If implementing the jaro_winkler method manually, the code will look as follows:
If the strings are equal, we immediately return 1:

from math import floor

def jaro_similarity(str1, str2):
    if str1 == str2:
        return 1.0
    len1, len2 = len(str1), len(str2)
    # characters only count as matching within this window
    max_dist = floor(max(len1, len2) / 2) - 1
    match = 0
    hash_str1 = [0] * len1
    hash_str2 = [0] * len2

After this we count matches:

    for i in range(len1):
        for j in range(max(0, i - max_dist), min(len2, i + max_dist + 1)):
            if str1[i] == str2[j] and hash_str2[j] == 0:
                hash_str1[i] = 1
                hash_str2[j] = 1
                match += 1
                break

If there are no matches, the similarity is 0. Otherwise we count the number of transpositions and return the Jaro Similarity coefficient:

    if match == 0:
        return 0.0
    t = 0
    point = 0
    for i in range(len1):
        if hash_str1[i]:
            # advance to the next matched character in str2
            while hash_str2[point] == 0:
                point += 1
            if str1[i] != str2[point]:
                t += 1
            point += 1
    # each transposition involves two characters
    t = t // 2
    return (match / len1 + match / len2 + (match - t) / match) / 3.0
string1 = "WINKLER"
string2 = "WINSTER"
print(round(jaro_similarity(string1, string2), 6))  # 0.809524
The advantages of the Jaro-Winkler method are as follows. First, it works effectively with short and medium-length strings, which are common in company names. Additionally, it allows users to set a similarity threshold to determine what level of similarity is considered a match, providing flexibility depending on matching requirements. The Jaro Similarity metric also measures the closeness between two strings and assigns a value indicating the degree of similarity, which can be useful for ranking or categorizing matches.
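For instance, a threshold can separate matches from non-matches. The sketch below uses SequenceMatcher from Python's standard difflib module as the similarity function; the threshold of 0.8 and the sample names are illustrative assumptions:

```python
from difflib import SequenceMatcher

def similarity(a: str, b: str) -> float:
    # 0-to-1 similarity score between two strings
    return SequenceMatcher(None, a, b).ratio()

names_a = ["tech corp", "global holdings", "advanced group"]
names_b = ["tech corp.", "global holding", "innovative ltd"]

THRESHOLD = 0.8  # illustrative: pairs at or above this count as matches
matches = [
    (a, b, round(similarity(a, b), 3))
    for a in names_a
    for b in names_b
    if similarity(a, b) >= THRESHOLD
]
for a, b, score in matches:
    print(f"{a!r} ~ {b!r}: {score}")
```

Raising the threshold yields fewer, higher-confidence matches; lowering it catches more variants at the cost of false positives.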
Cosine similarity is a metric that measures the similarity between two vectors (in our case, company names represented as word-count vectors) by calculating the cosine of the angle they form in vector space.
If cosine similarity equals 1, it means maximum similarity between vectors, while 0 indicates no similarity.
Implementation in Python looks as follows (using the same data):
Import libraries:
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity
df_companies_a = pd.read_csv('companies_a.txt', header=None, names=['Company name'])
df_companies_b = pd.read_csv('companies_b.txt', header=None, names=['name'])
df_companies_a['Company name'] = df_companies_a['Company name'].str.lower()
df_companies_b['name'] = df_companies_b['name'].str.lower()
For simplicity, we combine the names from both dataframes into a single sequence:

all_names = pd.concat([df_companies_a['Company name'], df_companies_b['name']], ignore_index=True)

Process text data, vectorizing all names together and comparing the rows from A against the rows from B:

vectors = CountVectorizer().fit_transform(all_names.values.astype('U'))
cosine_similarities = cosine_similarity(vectors[:len(df_companies_a)], vectors[len(df_companies_a):])
Output similarity scores for each pair of companies:
print("Cosine Similarity matrix:")
print(cosine_similarities)
for i in range(len(df_companies_a)):
for j in range(len(df_companies_b)):
print(f"Similarity between '{df_companies_a['Company name'][i]}' and '{df_companies_b['name'][j]}': {cosine_similarities[i, j]}")
Cosine similarity is advantageous for comparing text data because it depends only on the angle between vectors, making it insensitive to vector length. However, cosine similarity does not take word order into account, and every word is weighted equally.
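To make the length-insensitivity property concrete, here is a minimal standard-library sketch of cosine similarity over word counts (the helper name cosine_sim and the sample strings are illustrative assumptions):

```python
import math
from collections import Counter

def cosine_sim(text_a: str, text_b: str) -> float:
    # Represent each string as a word-count vector
    va, vb = Counter(text_a.split()), Counter(text_b.split())
    dot = sum(va[w] * vb[w] for w in set(va) & set(vb))
    norm_a = math.sqrt(sum(c * c for c in va.values()))
    norm_b = math.sqrt(sum(c * c for c in vb.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

# Doubling a string changes the vector's length but not its direction,
# so the cosine similarity stays at 1
print(round(cosine_sim("tech corp", "tech corp tech corp"), 6))  # 1.0
print(cosine_sim("tech corp", "global holdings"))                # 0.0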
Levenshtein distance is a similarity metric between two string sequences. The larger the distance, the more different the strings are. For two identical sequences, the distance is zero.
For example, the Levenshtein distance between the Russian words "ключ" (key) and "клюв" (beak) is 1, because only one letter needs to be replaced.
Example of Python implementation without external libraries:
First, we initialize the distance matrix and the first column and row of the matrix:
def levenshtein_distance(str1, str2):
    len_str1 = len(str1) + 1
    len_str2 = len(str2) + 1
    matrix = [[0 for _ in range(len_str2)] for _ in range(len_str1)]
    for i in range(len_str1):
        matrix[i][0] = i
    for j in range(len_str2):
        matrix[0][j] = j

Then we fill the matrix and calculate the distance; the answer ends up in the bottom-right cell, which we return:

    for i in range(1, len_str1):
        for j in range(1, len_str2):
            cost = 0 if str1[i - 1] == str2[j - 1] else 1
            matrix[i][j] = min(
                matrix[i - 1][j] + 1,  # deletion
                matrix[i][j - 1] + 1,  # insertion
                matrix[i - 1][j - 1] + cost  # substitution
            )
    return matrix[len_str1 - 1][len_str2 - 1]
Example:
str1 = "клюв"
str2 = "ключ"
distance = levenshtein_distance(str1, str2)
print(f"Levenshtein distance between '{str1}' and '{str2}': {distance}")
Output:
Levenshtein distance between 'клюв' and 'ключ': 1
It should be noted that the Levenshtein distance is sensitive to string length: longer strings naturally accumulate larger distances, so raw distances from pairs of different lengths are not directly comparable.
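A common remedy, sketched below, is to normalize the distance by the length of the longer string to get a similarity score between 0 and 1 (the helper names are illustrative; the normalization itself is a standard convention rather than part of any specific library):

```python
def levenshtein(str1: str, str2: str) -> int:
    # Classic dynamic-programming implementation
    m, n = len(str1), len(str2)
    matrix = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        matrix[i][0] = i
    for j in range(n + 1):
        matrix[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if str1[i - 1] == str2[j - 1] else 1
            matrix[i][j] = min(matrix[i - 1][j] + 1,       # deletion
                               matrix[i][j - 1] + 1,       # insertion
                               matrix[i - 1][j - 1] + cost)  # substitution
    return matrix[m][n]

def levenshtein_similarity(str1: str, str2: str) -> float:
    if not str1 and not str2:
        return 1.0
    # Normalize by the longer string so short and long pairs are comparable
    return 1 - levenshtein(str1, str2) / max(len(str1), len(str2))

print(levenshtein_similarity("ключ", "клюв"))  # 0.75
```

With this normalization, a one-letter difference in a four-letter word scores the same as a three-letter difference in a twelve-letter name.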
We studied the company matching process, which involves comparing and identifying similar records across different data sources. We learned to use string methods and Python libraries such as pandas, fuzzywuzzy, NLTK, and scikit-learn for effective text string comparison. We saw that company matching is important for removing duplicates, ensuring data analysis accuracy, and creating a unified database. We also examined matching challenges, such as typos and data ambiguity, and learned to account for them when working with real data.