barilog

エンジニアリング見習い雑記

Bigquery上で行う文字列類似度計算 〜Levenshtein Distance〜

名寄せなどを行うために、しばしばlevenshtein distance通称編集距離による類似度比較が行われます。
私の場合bigquery上でこの処理を行いたかったのですが探しても探しても該当コードが見あたらなかったため簡単に関数を自作しました。
標準SQL文のみだと実装がめんどくさかったので、javascriptの関数ベースで実装しました。
幸いlevenshtein distance用のjavascript関数が

https://github.com/trekhleb/javascript-algorithms/blob/master/src/algorithms/string/levenshtein-distance/levenshteinDistance.js
こちらに掲載されておりました。ただこちらでは結果の正規化がなされていなかったので、本コードでは正規化コードを追記しました。


```
CREATE TEMP FUNCTION NormalizedLevenshteinDistance(a STRING, b STRING)
RETURNS FLOAT64
LANGUAGE js AS '''
const distanceMatrix = Array(b.length + 1).fill(null).map*1;
// Fill the first row of the matrix.
// If this is first row then we're transforming empty string to a.
// In this case the number of transformations equals to size of a substring.
for (let i = 0; i <= a.length; i += 1) {
distanceMatrix[0][i] = i;
}

// Fill the first column of the matrix.
// If this is first column then we're transforming empty string to b.
// In this case the number of transformations equals to size of b substring.
for (let j = 0; j <= b.length; j += 1) {
distanceMatrix[j][0] = j;
}

for (let j = 1; j <= b.length; j += 1) {
for (let i = 1; i <= a.length; i += 1) {
const indicator = a[i - 1] === b[j - 1] ? 0 : 1;
distanceMatrix[j][i] = Math.min(
distanceMatrix[j][i - 1] + 1, // deletion
distanceMatrix[j - 1][i] + 1, // insertion
distanceMatrix[j - 1][i - 1] + indicator, // substitution
);
}
}
 let maxLen = Math.max(a.length, b.length)

return distanceMatrix[b.length][a.length] / maxLen;
''';
```


実際に使用する場合には以下のように使用してください。
```

CREATE TEMP FUNCTION NormalizedLevenshteinDistance(a STRING, b STRING)
RETURNS FLOAT64
LANGUAGE js AS '''
const distanceMatrix = Array(b.length + 1).fill(null).map*2;
// Fill the first row of the matrix.
// If this is first row then we're transforming empty string to a.
// In this case the number of transformations equals to size of a substring.
for (let i = 0; i <= a.length; i += 1) {
distanceMatrix[0][i] = i;
}

// Fill the first column of the matrix.
// If this is first column then we're transforming empty string to b.
// In this case the number of transformations equals to size of b substring.
for (let j = 0; j <= b.length; j += 1) {
distanceMatrix[j][0] = j;
}

for (let j = 1; j <= b.length; j += 1) {
for (let i = 1; i <= a.length; i += 1) {
const indicator = a[i - 1] === b[j - 1] ? 0 : 1;
distanceMatrix[j][i] = Math.min(
distanceMatrix[j][i - 1] + 1, // deletion
distanceMatrix[j - 1][i] + 1, // insertion
distanceMatrix[j - 1][i - 1] + indicator, // substitution
);
}
}
 let maxLen = Math.max(a.length, b.length)

return distanceMatrix[b.length][a.length] / maxLen;
''';

WITH strings AS
(SELECT "おはよう" AS x, "おはよa" as y
UNION ALL
SELECT "おはよう" AS x, "おはうよ" as y
UNION ALL
SELECT "おはよう" as x, "おはう" as y
UNION ALL
SELECT "おはよう" as x , "おうはよ"
UNION ALL
SELECT "おはよう" as x, "おaaa" as y)
SELECT x,y, NormalizedLevenshteinDistance(x, y) as `NormalizedLevenshteinDistance`
FROM strings;

```

*1:) => Array(a.length + 1).fill(null

*2:) => Array(a.length + 1).fill(null