secure_inner_join.lsh module
This implements Locality-Sensitive Hashing for dates and zip2-codes.
- secure_inner_join.lsh.encode(day, month, year, zip4_code)[source]
Encodes day, month, year and zip2 to a Tuple.
- Parameters:
day (int) – day of birth
month (int) – month of birth
year (int) – year of birth
zip4_code (int) – the four digits of the postal code
- Return type:
Tuple[int, int, int, int]
- Returns:
encoded representation
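As a rough illustration of what such an encoding could look like, the sketch below maps a birth date and a four-digit postal code onto integers that fit the sampling ranges documented for lsh_hash. The name encode_sketch and every mapping in it (unchanged day, zero-based month, two-digit year, leading two postal digits as zip2) are assumptions inferred from those ranges, not the module's actual implementation.

```python
from typing import Tuple


def encode_sketch(day: int, month: int, year: int, zip4_code: int) -> Tuple[int, int, int, int]:
    """Hypothetical stand-in for encode(); all mappings below are assumptions."""
    zip2 = zip4_code // 100  # assumed: zip2 is the leading two digits, e.g. 2595 -> 25
    return (
        day,         # assumed unchanged; 1..31 lies within [0, 62)
        month - 1,   # assumed zero-based, so it lies within [0, 12)
        year % 100,  # assumed two-digit year, so it lies within [0, 100)
        zip2,        # lies within [10, 100) for postal codes 1000..9999
    )


encoded = encode_sketch(day=17, month=3, year=1984, zip4_code=2595)
# encoded == (17, 2, 84, 25)
```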
- secure_inner_join.lsh.get_hyper_planes(amount=2000, seed=42, mask=False)[source]
Construct a specified number of hyper planes with a set seed. We assume the following order: (day, month, year, zip2-code).
- Parameters:
amount (int) – number of hyper planes to construct
seed (int) – seed to use for the random generator
mask (bool) – set to true to generate a bit mask to use for masking
- Return type:
Union[ndarray[Any, dtype[int64]], Tuple[ndarray[Any, dtype[int64]], bitarray]]
- Returns:
array containing the random hyper planes
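The construction can be pictured as drawing, for each of the four attributes, a fixed number of random thresholds from that attribute's range using a seeded generator. The sketch below uses only the standard library and plain lists rather than an ndarray. The function name, the one-row-per-attribute layout and the value ranges (taken from the lsh_hash parameter description) are assumptions, and mask generation is left out.

```python
import random
from typing import List

# Assumed half-open ranges in the order (day, month, year, zip2).
RANGES = ((0, 62), (0, 12), (0, 100), (10, 100))


def sample_hyper_planes(amount: int = 2000, seed: int = 42) -> List[List[int]]:
    """Hypothetical sketch: one row of `amount` integer thresholds per attribute."""
    rng = random.Random(seed)  # a fixed seed makes the result reproducible
    return [[rng.randrange(lo, hi) for _ in range(amount)] for lo, hi in RANGES]


planes = sample_hyper_planes(amount=8, seed=42)
```

Calling the function twice with the same seed yields identical thresholds, which is what allows hashes computed separately to be compared afterwards.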
- secure_inner_join.lsh.lsh_hash(day, month, year, zip4_code, hyper_planes, bit_mask=None)[source]
Computes a hash encoding for a given encoded input and a collection of hyperplanes.
- Parameters:
day (int) – day of birth
month (int) – month of birth
year (int) – year of birth
zip4_code (int) – the four digits of the postal code
hyper_planes (ndarray[Any, dtype[int64]]) – \(n\) hyperplanes sampled from \([0,62)\times[0,12)\times[0,100)\times[10,100)\)
bit_mask (Optional[bitarray]) – masking to apply to the hashing
- Return type:
bitarray
- Returns:
an encoded hash: the first \(n\) bits belong to day, the second \(n\) bits belong to month, etc.
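The bit layout described above can be sketched as follows: each attribute value is compared against its \(n\) thresholds, and the four resulting bit segments are concatenated in the order (day, month, year, zip2). This is a minimal illustration, not the module's code. The function name, the use of a plain list of 0/1 integers in place of a bitarray, and the "value on or above threshold" comparison are assumptions, and the optional bit mask is omitted.

```python
from typing import List, Sequence


def lsh_hash_sketch(encoded: Sequence[int], planes: Sequence[Sequence[int]]) -> List[int]:
    """Hypothetical sketch: one bit per threshold, attribute segments concatenated."""
    bits: List[int] = []
    for value, thresholds in zip(encoded, planes):
        # A bit is 1 when the value lies on or above the threshold (assumed convention).
        bits.extend(1 if value >= t else 0 for t in thresholds)
    return bits


# Three hand-picked thresholds per attribute, order (day, month, year, zip2).
planes = [[10, 20, 40], [3, 6, 9], [25, 50, 75], [30, 55, 80]]
hash_bits = lsh_hash_sketch((15, 4, 85, 35), planes)
# hash_bits == [1, 0, 0,  1, 0, 0,  1, 1, 1,  1, 0, 0]
```

Here \(n = 3\), so bits 0 to 2 describe the day, bits 3 to 5 the month, bits 6 to 8 the year and bits 9 to 11 the zip2 value.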
- secure_inner_join.lsh.weighted_hamming_distance(hash_1, hash_2)[source]
If score ~= 1, then we expect at most one element to be one-off. The score represents the actual distance between two encodings if the number of buckets is large enough.
- Parameters:
hash_1 (bitarray) – first hash
hash_2 (bitarray) – second hash
- Return type:
Tuple[float, Tuple[float, float, float, float]]
- Returns:
an x-off distance score, and a tuple of x-off distances per (day, month, year, zip2)
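One way to read the weighting: if an attribute's \(n\) thresholds are spread uniformly over a range of width \(w\), two values that differ by \(d\) disagree on roughly \(n \cdot d / w\) bits, so multiplying the per-attribute Hamming distance by \(w / n\) recovers an estimate of \(d\). The sketch below implements that reading on plain 0/1 lists. The function name, the extra widths parameter, the width values (derived from the ranges in the lsh_hash parameter description) and summing the four distances into the overall score are all assumptions rather than the module's confirmed behaviour.

```python
from typing import List, Sequence, Tuple

# Assumed range widths for (day, month, year, zip2): 62, 12, 100 and 100 - 10.
WIDTHS = (62, 12, 100, 90)


def weighted_hamming_sketch(
    hash_1: Sequence[int], hash_2: Sequence[int], widths: Sequence[int] = WIDTHS
) -> Tuple[float, Tuple[float, ...]]:
    """Hypothetical sketch: per-attribute Hamming distance scaled by width / n."""
    if len(hash_1) != len(hash_2) or len(hash_1) % len(widths) != 0:
        raise ValueError("hashes must have equal length, a multiple of the attribute count")
    n = len(hash_1) // len(widths)  # bits per attribute
    distances: List[float] = []
    for i, width in enumerate(widths):
        segment = range(i * n, (i + 1) * n)
        differing = sum(hash_1[j] != hash_2[j] for j in segment)
        # With n uniform thresholds, a gap d flips about n * d / width bits.
        distances.append(differing * width / n)
    return sum(distances), tuple(distances)


h1 = [1, 1, 0, 0,  1, 0, 0, 0,  1, 1, 1, 0,  1, 1, 0, 0]
h2 = [1, 1, 1, 0,  1, 0, 0, 0,  1, 1, 1, 0,  1, 0, 0, 0]
score, per_attribute = weighted_hamming_sketch(h1, h2)
# score == 38.0, per_attribute == (15.5, 0.0, 0.0, 22.5)
```

With only four bits per attribute a single flipped bit already counts as 15.5 days, which shows why the estimate is only meaningful when the number of thresholds is large: the resolution per bit is \(w / n\).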