The Universal Sentence Encoder and Locality Sensitive Hashing
improving search time and file size for sentence embedding searches
(Yet another way to browse Sailor Moon)
the universal sentence encoder
The Universal Sentence Encoder is a model trained on a variety of language data sources. It is able to turn a sentence into a 512-dimension vector (also called an embedding). The goal is to have semantically similar sentences produce vectors that point in a similar direction.
Embeddings are a common way for neural nets to represent messy, high-dimensional real-world data, and they are especially common in language processing. Consider how many possible word combinations there are in sentences, and the variety of sentence lengths; it's hard to compare that information programmatically, which is what makes embeddings so useful.
So a sentence is fed into the network and an array of 512 numbers is produced, which can be compared to other arrays created from other sentences. These embeddings can be used for a variety of natural language processing tasks; in this case, we'll use them to compare sentence similarity.
The neural network is capable of doing this as it was trained on sentence
pairs that had their semantic textual similarity scored by humans.
Let's see how some sentences compare for fun.
The image below shows several sentences compared to each other and scored based on their
semantic similarity using the universal sentence encoder model.

Comparing Embeddings
Let's take a look at an example embedding. This one is for the sentence "praise the sun":
[ -0.01643446274101734, -0.047866735607385635, 0.01116135623306036, -0.032162975519895554, -0.04119051992893219, -0.049514010548591614, 0.05056055262684822, 0.01861172914505005, 0.02079540677368641, 0.03788421303033829, 0.055882297456264496, 0.052787601947784424, -0.04854540899395943, 0.0047414726577699184, -0.053531713783741, -0.017291227355599403, 0.053345393389463425, 0.05225902795791626, 0.044134367257356644, 0.034594688564538956, 0.02990477718412876, 0.01906985603272915, 0.04899447783827782, 0.055814746767282486, -0.05602262541651726, 0.05459308251738548, 0.05091241002082825, -0.05024508386850357, -0.05322225019335747, 0.020622193813323975, 0.022310415282845497, -0.05342493951320648, -0.03910081088542938, -0.054072655737400055, 0.04943077266216278, -0.05404074117541313, -0.05312085151672363, 0.048237964510917664, 0.052785295993089676, -0.04522813856601715, 0.053926967084407806, 0.03370456025004387, 0.05236560106277466, 0.0015304642729461193, -0.044569939374923706, -0.04093872010707855, 0.0400158166885376, 0.05586504936218262, 0.040547121316194534, 0.0036303112283349037, 0.05467100441455841, -0.05315506085753441, 0.05460590869188309, -0.05341360718011856, 0.019216563552618027, 0.014945510774850845, 0.008523264899849892, -0.05486554652452469, -0.05196134373545647, -0.0503271222114563, -0.054707158356904984, -0.03312518447637558, -0.05008922889828682, 0.0006174301961436868, -0.03885316103696823, 0.05602826178073883, -0.05558229237794876, -0.03280400112271309, 0.0557483471930027, -0.020747335627675056, -0.04034963250160217, -0.05323438718914986, 0.05147049203515053, 0.05478457361459732, 0.050157900899648666, -0.05495397374033928, 0.040603939443826675, 0.056042611598968506, 0.043434031307697296, -0.0055681923404335976, -0.03814433142542839, -0.03673834726214409, -0.05543023720383644, -0.05131136626005173, -0.05583127215504646, 0.04594847932457924, 0.05276202782988548, -0.05573916807770729, -0.022053785622119904, 0.0015039695426821709, -0.0377194806933403, 0.023686300963163376, -0.04683661460876465, 0.04619279131293297, -0.04374907910823822, -0.027344444766640663, 0.038678839802742004, -0.020303141325712204, 0.051446061581373215, -0.021707113832235336, -0.05371095985174179, 0.055392805486917496, 0.05516625568270683, 0.056031569838523865, -0.046048443764448166, -0.052267421036958694, 0.036351095885038376, 0.042035214602947235, -0.030406028032302856, -0.027976704761385918, -0.024354681372642517, -0.03456025943160057, -0.048526324331760406, -0.054300062358379364, 0.03975285217165947, 0.03405265137553215, -0.0008716466836631298, 0.04143061488866806, -0.017759624868631363, 0.04933466389775276, 0.04317356273531914, -0.05370345711708069, 0.046637848019599915, -0.05604122579097748, -0.04903582111001015, -0.008755053393542767, 0.02665846049785614, 0.05143268033862114, 0.015641331672668457, 0.053754791617393494, 0.03222831338644028, -0.05548969656229019, -0.05565796047449112, 0.05579647049307823, -0.025518905371427536, -0.05387597158551216, 0.02367117628455162, 0.043358735740184784, -0.032881706953048706, 0.04977487027645111, -0.025500651448965073, -0.01830417476594448, 0.054875925183296204, 0.05603192746639252, -0.0505065955221653, -0.023587802425026894, 0.028091568499803543, -0.05503155291080475, 0.03317264840006828, -0.05222945660352707, 0.0002477589878253639, -0.05411040037870407, -0.04951809346675873, -0.003709644079208374, 0.05603909492492676, 0.00677878875285387, 0.03319043666124344, -0.052565574645996094, -0.05374324321746826, 0.05126998946070671, -0.010412076488137245, -0.016527077183127403, 
0.055314358323812485, -0.05408173426985741, 0.008594626560807228, 0.055709823966026306, 0.013491012156009674, -0.0006980478647165, 0.0053011467680335045, 0.045773740857839584, -0.05204932764172554, -0.023253854364156723, -0.05288105458021164, 0.045529551804065704, -0.055810973048210144, 0.006248749326914549, -0.052872903645038605, 0.012982008047401905, 0.04929075390100479, 0.051177412271499634, 0.024482639506459236, -0.0023270556703209877, 0.05582268536090851, -0.05187274515628815, 0.05578025430440903, -0.02887865900993347, -0.055860910564661026, 0.049252405762672424, 0.027567410841584206, 0.004360144957900047, -0.03805725648999214, 0.04781678318977356, 0.02349051833152771, -0.05355745181441307, -0.0010168420849367976, 0.019375935196876526, -0.048489831387996674, 0.05290268734097481, -0.0514141209423542, -0.05219649523496628, 0.053786858916282654, 0.04703744500875473, -0.05548414960503578, -0.05426885187625885, 0.04358344152569771, -0.04335813596844673, 0.055667418986558914, -0.05191067233681679, 0.050331346690654755, 0.052997346967458725, -0.05290118604898453, 0.05595087260007858, 0.02919776737689972, 0.0279071107506752, -0.05300073325634003, -0.025899618864059448, 0.05566255375742912, 0.01740713231265545, 0.05597109720110893, 0.05386003479361534, 0.04719238728284836, -0.0413287952542305, 0.05515514686703682, -0.05304962396621704, -0.0282291192561388, -0.016887780278921127, -0.02695431374013424, 0.018786640837788582, 0.05564965307712555, -0.05578072369098663, 0.02692331001162529, 0.042407725006341934, -0.05515687167644501, 0.05270160362124443, -0.011442351154983044, 0.055934712290763855, 0.05354756489396095, 0.03316701203584671, 0.05588717758655548, 0.05604216828942299, -0.008422136306762695, 0.043577276170253754, -0.008080055005848408, 0.016125492751598358, 0.046945225447416306, 0.043736327439546585, -0.023303015157580376, -0.050779227167367935, 0.04854461923241615, -0.0560176819562912, -0.04963786527514458, -0.054972484707832336, 0.051782578229904175, -0.05325941741466522, 0.01675291731953621, 0.05401768162846565, -0.029279962182044983, 0.029603365808725357, 0.05384274199604988, 0.05026647076010704, -0.05591430142521858, 0.014729233458638191, 0.004318914841860533, 0.04583499953150749, -0.028199832886457443, 0.05517265945672989, 0.05195918306708336, 0.03361981362104416, -0.005860268138349056, 0.054360758513212204, -0.04790390655398369, -0.04412402957677841, 0.05526090785861015, -0.04397204518318176, 0.0510873906314373, 0.054974354803562164, -0.05430081859230995, 0.0511348582804203, -0.05075996369123459, -0.017456840723752975, -0.054586976766586304, 0.04988223314285278, -0.031048662960529327, -0.0019140191143378615, -0.00025244030985049903, -0.03073504939675331, -0.008409513160586357, 0.04650219529867172, 0.0459686703979969, -0.018416931852698326, -0.05508637800812721, 0.05005917325615883, -0.03366953879594803, -0.02187909558415413, 0.041992828249931335, -0.05555545166134834, 0.05599233880639076, 0.03982987254858017, 0.04466022178530693, 0.04570598155260086, 0.055467091500759125, 0.05319790542125702, -0.05507420003414154, 0.055999744683504105, -0.041691623628139496, -0.01942344382405281, -0.018287384882569313, 0.05570831894874573, -0.028560396283864975, -0.052859265357255936, 0.02748016081750393, -0.05497709661722183, 0.05500137433409691, 0.0538204200565815, 0.05325619876384735, -0.05123851075768471, -0.05441600829362869, -0.04787558317184448, 0.05543545261025429, 0.04792202636599541, 0.04059022665023804, -0.047015149146318436, 0.01801333576440811, -0.04497704282402992, 
0.05537839233875275, -0.04872656241059303, -0.0498863160610199, -0.05248422548174858, 0.004491743631660938, -0.056007225066423416, 0.04945643991231918, -0.05558210611343384, -0.053780682384967804, 0.036105845123529434, 0.05598032474517822, 0.053630754351615906, -0.04550294950604439, 0.05588221922516823, -0.03173701465129852, 0.026102842763066292, -0.032342758029699326, 0.054514095187187195, 0.05471456050872803, 0.0535542294383049, 0.024944543838500977, 0.022099237889051437, 0.05573274940252304, 0.03772395849227905, -0.006928583141416311, 0.055451977998018265, 0.05255221948027611, -0.020383965224027634, 0.014840695075690746, -0.052972737699747086, 0.04729835316538811, 0.041253626346588135, -0.04857562109827995, -0.03430145978927612, 0.030324283987283707, 0.051337189972400665, 0.05483871325850487, 0.05280350521206856, -0.024308906868100166, -0.055778954178094864, -0.027072394266724586, -0.04774164780974388, 0.055680446326732635, -0.022918766364455223, -0.031212477013468742, -0.055910054594278336, -0.022955654188990593, -0.0385383665561676, 0.04803907498717308, -0.050335753709077835, 0.0509420782327652, 0.05588535591959953, -0.007357464637607336, 0.052812933921813965, 0.022859547287225723, 0.05572576820850372, -0.003088396741077304, 0.03805335611104965, -0.05365729704499245, 0.05554977431893349, 0.00746613135561347, -0.05584201589226723, 0.05582189932465553, -0.008294692263007164, -0.041224587708711624, -0.03395610302686691, -0.03327474743127823, -0.03447003290057182, 0.023300718516111374, 0.03380630537867546, 0.03800996020436287, 0.05538610741496086, 0.05519676208496094, -0.056026309728622437, -0.055212780833244324, 0.05570850148797035, 0.029411407187581062, 0.055867381393909454, -0.041572559624910355, 0.054851919412612915, -0.04785359278321266, -0.041999075561761856, 0.042050108313560486, -0.023656176403164864, 0.01810332201421261, -0.028811370953917503, -0.04434119910001755, -0.022894995287060738, -0.04691595211625099, 0.028853129595518112, 0.05067988112568855, -0.05385778471827507, -0.05336437001824379, -0.028338856995105743, 0.055960699915885925, 0.0556451715528965, 0.053908269852399826, -0.05120284855365753, -0.03537750989198685, -0.055464960634708405, -0.056009646505117416, -0.048401106148958206, 0.0520816408097744, 0.05341281741857529, 0.042371753603219986, -0.05149541050195694, 0.02432170696556568, -0.05589159578084946, -0.0253126323223114, -0.055962808430194855, 0.05288317799568176, -0.055515848100185394, -0.04950021579861641, 0.053132157772779465, 0.05560009181499481, -0.05382295697927475, 0.03887653723359108, 0.05554954335093498, 0.046343084424734116, 0.05601463466882706, -0.046690721064805984, 0.05519827827811241, 0.05464799702167511, -0.054447636008262634, 0.05284092575311661, 0.054841265082359314, 0.03104795143008232, -0.055980488657951355, 0.054428581148386, -0.02225598320364952, -0.04590776562690735, -0.0538906455039978, 0.018541937693953514, -0.05409042909741402, -0.008213928900659084, 0.048824261873960495, 0.04403650760650635, -0.05584526062011719, -0.0467953197658062, 0.023316213861107826, 0.0553462840616703, 0.05479169264435768, -0.03851236775517464, 0.0545991025865078, -0.050144799053668976, -0.052886318415403366, 0.05447651073336601, 0.04183284565806389, 0.03253893554210663, 0.04657915607094765, -0.046109311282634735, 0.05548405647277832, 0.05580862984061241, 0.03811436891555786, -0.05542745813727379, 0.03589120879769325, 0.00815376453101635, -0.055900316685438156, -0.05602550506591797, -0.006971255410462618, -0.05076229199767113, 0.05467640981078148, 
0.0029950763564556837, 0.04073617234826088, -0.05401226133108139, 0.011864435859024525, -0.010817749425768852, 0.04748326912522316, -0.007380371447652578, 0.04826263338327408, 0.05093942582607269, 0.004084303975105286, 0.05559725686907768, 0.016932904720306396, 0.025586623698472977, 0.05064704641699791, -0.047627706080675125, 0.020730800926685333, -0.05533415824174881, 0.052814748138189316, -0.04870741814374924, -0.05315487086772919, 0.05516745150089264, 0.012046034447848797, -0.040742047131061554, -0.056016404181718826, 0.04769644886255264, 0.05575522035360336]
This array is a bit daunting to look at (and tbh could probably be rounded to fewer decimal places), so for our visualizations let's start with an array that has a length of 2 instead of 512, which is much easier to plot on a chart.
Check out the point below, with x and y as our two dimensions. We aren't thinking of this array so much as a point, but more as a vector, with its direction and magnitude given by its length from 0 and the angle it forms. Click the button to visualize it as a vector lol
All graphs were created using jsxGraph, which is honestly worth a write-up of its own.
Let's call our vector a. The length or magnitude of a would be √(a₁² + a₂²), or written with its x and y components, √(aₓ² + aᵧ²). This is also denoted as |a|.
An example in 2D space is much easier to visualize than the 512D space our embeddings live in, but it is very useful because the math stays the same at higher dimensions. If a were a 512-dimensional vector, the value of |a| would be √(a₁² + a₂² + a₃² + ... + a₅₁₂²).
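If you'd rather check that with code than on paper, here is a tiny numpy sketch; the name embedding is just a stand-in for the 512-float list printed above.
import numpy as np

#a minimal sketch: 'embedding' stands in for the 512-float list printed above
a = np.array(embedding)
#the magnitude |a| is the square root of the sum of the squared components
magnitude = np.sqrt(np.sum(a ** 2))
#numpy has this built in as the vector norm, the two values should match
assert np.isclose(magnitude, np.linalg.norm(a))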
The universal sentence encoder model was trained so that semantically similar sentences produce embeddings with high angular similarity (the angle between them is small/they point in the same direction).
Take a look at the 3 vectors below. The Euclidean distance between the ends of a and c is smaller, but we are comparing the angle between them as vectors, so a and b would be more similar than a and c.
It can be costly to compute the angular similarity between embeddings, so it is more common to calculate their cosine similarity, which is just cosθ for the angle between them. Like angular similarity, if two vectors have a high cosine similarity they will be pointing in a similar direction.
If you're like me and forget all basic trigonometry every few years, cosθ = adjacent/hypotenuse for a right-angle triangle. And acos (usually cos⁻¹ on your calculator) is the magic function we run cosθ through to get an angle.
That acos function is the part that is costly to compute, so it is what we skip when checking only cosine similarity.
Take a look below at how cosθ, or the cosine similarity, compares to the actual angle between two vectors. Drag the points around and see how the values change.
We can see that cosθ ranges from 1 when the angle is 0° (high cosine and angular similarity) to -1 when the angle is 180° (low cosine and angular similarity). This can be seen in a plot of the cosine function.
However, cosθ doesn't scale linearly with the angular distance, so it is not a good measure of the strength of the association between two vectors, but it works just as well for proportional association (seeing how close vectors are compared to other vectors).
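To make that concrete, here is a quick sketch (just a few sample angles, not from the interactive above) showing that cosθ falls from 1 to -1 but not in a straight line:
import numpy as np

#cosθ for a few angles, from identical direction to opposite direction
for angle in (0, 45, 90, 135, 180):
    print(angle, round(np.cos(np.radians(angle)), 3))
#prints: 0 1.0, 45 0.707, 90 0.0, 135 -0.707, 180 -1.0
#equal 45° steps in angle do not give equal steps in cosθ, but the ordering is preserved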
Now let's take a look at two vectors again, but we will lock one to 0° and change its magnitude to consistently create a right-angle triangle, to show how we could calculate cosθ between them.
Of course it would be very easy to find the cosine similarity between vectors if they always formed a nice right-angle triangle with each other, with one vector as the adjacent and the other the hypotenuse, but sadly this isn't the case. As in the first example, the vectors could have any magnitude, and we need a way to find the cosine similarity between vectors regardless of their magnitudes. Consider the image below. To find cosθ we need to find the adjacent and hypotenuse of a right-angle triangle formed by our angle. In this image the hypotenuse is easy to find, it is |a|, but we need to find L given only the information for a and b.

Here is an example where both vectors can be moved
Fortunately, there is a formula that lets us find cosθ (and from that, L) using something called the dot product.
If we have two vectors of length 512, let's call them a and b again:
a = [a₁, a₂, a₃, …, a₅₁₂] and b = [b₁, b₂, b₃, …, b₅₁₂]
Then their dot product is given by:
a · b = a₁b₁ + a₂b₂ + a₃b₃ + … + a₅₁₂b₅₁₂
For vectors of length 2, their dot product would be a₁b₁ + a₂b₂, or aₓbₓ + aᵧbᵧ.
\[\vec{a} \cdot \vec{b} = {\sum_{i=1}^{n}(a_i b_i)}\]
The dot product is written using a dot "·", as in a · b; don't confuse this with a times b. The formula which relates two vectors and their dot product is
a · b = |a||b|cos(θ)
where |a| and |b| are the lengths of a and b and θ is the angle between them.
Remember, cos(θ) is the adjacent over the hypotenuse of a right-angle triangle formed by this angle, so we could write this as
a · b = |a| × |b| × L/|a|
The |a|'s on the right cancel out, so we have
a · b = |b| × L
Divide both sides by |b| and we get
(a · b)/|b| = L
This would allow us to solve for L from the image above.
Oh, and of course we can solve for cosθ, which is what we really need since that is our cosine similarity: (a · b)/(|a||b|) = cosθ
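Here's a small sketch that checks both of those relationships with two made-up 2D vectors (the numbers are arbitrary, just for illustration):
import numpy as np

a = np.array([3.0, 1.0])
b = np.array([2.0, 2.0])

#cosine similarity straight from the formula above
cos_theta = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
#the adjacent side L from the image above, L = (a · b)/|b|
L = np.dot(a, b) / np.linalg.norm(b)
#and cosθ should equal adjacent/hypotenuse, L/|a|
assert np.isclose(cos_theta, L / np.linalg.norm(a))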
This formula can be proven with the law of cosines for 2D vectors if you want to relate it to something more familiar, like I always do. I swear, if you @ me I just might write it out.
When writing a program, calculating the cosine similarity will usually be done by a math library with a cosine similarity function, so you don't really need to understand the math, just have an idea of what it represents. But it is nice to know, since geometry is fun, and remember,
you need math
(profanity warning)
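One example of such a library function is scipy's cosine distance, which is 1 minus the cosine similarity, so it should agree with the formula we worked out. A quick self-contained sketch with made-up 2D vectors:
import numpy as np
from scipy.spatial.distance import cosine

a = np.array([3.0, 1.0])
b = np.array([2.0, 2.0])

manual = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
#scipy's cosine() returns the cosine distance, which is 1 - cosine similarity
library = 1 - cosine(a, b)
assert np.isclose(manual, library)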
The dot product has a few interesting properties, one being that it is negative if the vectors are more than 90° apart and positive otherwise; this will come in handy later.
Also, here is an excellent video on the dot product. Highly recommend watching it.
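That sign property is easy to see with a few small examples (made-up 2D vectors again):
import numpy as np

a = np.array([1.0, 0.0])
print(np.dot(a, np.array([1.0, 1.0])))    #1.0, the vectors are 45° apart, positive
print(np.dot(a, np.array([0.0, 1.0])))    #0.0, exactly 90° apart
print(np.dot(a, np.array([-1.0, 1.0])))   #-1.0, 135° apart, negative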
Searching/comparing embeddings
Now that we have an understanding of the universal sentence encoder
and how to compare the sentence embeddings it generates using their cosine
similarity, let's look into some issues with searching through
large sets of embeddings.
In Elden Ring, players can create messages from lists of words and templates; see some examples below.



Messages often provide insight into the surrounding environment, though more often than not they are just someone being silly.





And sometimes they can make surprisingly clever use of the limited template system

If you aren't an Elden Ring or Dark Souls fan, don't worry: I've also created examples with sentences from the Harry Potter books and, of course, with dialogue from the show Sailor Moon. The links are at the end of this article.
There are 368 words and 25 templates, so for single sentences in Elden Ring (without getting into combining two sentences with conjunctions) we have 368 × 25 = 9200 possible different sentences. The embeddings are created ahead of time and stored, but when it's time to search, that's 9200 arrays of 512 floats. Checking the cosine similarity of one embedding against all of them is not an outlandish task, especially when utilizing the GPU, but depending on the device running the search it can take upwards of a minute; computing the magnitudes and dot products of vectors involves a lot of operations at higher dimensions.
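To make the cost concrete, a brute-force search looks roughly like this (a sketch only; query and tensors are stand-ins for the search-term embedding and the stored embeddings):
import numpy as np

def cosine_similarity(a, b):
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

def brute_force_search(query, tensors, top_n=10):
    #score the query against every stored embedding: 9200 dot products and norms over 512 dimensions
    scores = [cosine_similarity(query, t) for t in tensors]
    #highest cosine similarity first
    return np.argsort(scores)[::-1][:top_n]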
Fortunately, there is a way to narrow our search results before
comparing the embeddings.
Locality Sensitive Hashes
Locality sensitive hashing involves generating random vectors with the same number of dimensions as our sentence embeddings and using them to divide the embeddings into groups or bins.
We divide embeddings into groups by checking whether the dot product between the randomly generated vector (or projection) and the sentence embedding is positive or negative, giving its hash a 1 or a 0 respectively.
Remember, vectors with an angular difference of more than 90° will have a negative dot product, and a positive dot product otherwise.
Embeddings with different hashes are in different bins.
Let's look at one embedding, a, and one projection, p₁
Move them around to see how the hash for embedding a changes.
We can then use multiple projections to group embeddings into bins. We do this by again giving the embeddings a 1 or a 0 for a positive or negative dot product (within 90° or not) for each projection. If an embedding is within 90° of the first projection (positive dot product) we add a 1 to its hash, then if it is not within 90° of the second projection (negative dot product) we add a 0 to its hash, and so on. The number of projections we have will be our hash length, since an embedding gets a 0 or 1 in its hash for each projection.
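As a tiny preview of the full example further down, hashing a single embedding with k random projections looks something like this (all the values here are randomly generated stand-ins):
import numpy as np

k, dims = 5, 512
projections = np.random.randn(k, dims)   #k random 512-dimensional projections
embedding = np.random.randn(dims)        #stand-in for a real sentence embedding

#one bit per projection: 1 for a positive dot product, 0 for a negative one
hash_bits = ''.join('1' if np.dot(p, embedding) > 0 else '0' for p in projections)
print(hash_bits)   #e.g. '10110', embeddings with the same 5 bits land in the same bin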
Now let's look at two embeddings, a and b, and three projections p₁, p₂, p₃.
If two sentence embeddings (vectors) point in a similar direction, they are more likely to end up in the same bin. If the vectors are 180° apart, or have a cosine similarity of -1, they will never be in the same bin; if they are 90° apart, or have a cosine similarity of 0, there is a 50% chance they will be in the same bin; and there is a 100% chance if they are 0° apart and have a cosine similarity of 1.
The formula for the probability of two vectors being in the same bin would be 1 - d/dc, where d is the distance between the vectors and dc is the max distance (d/dc is the probability a projection separates them, so 1 - d/dc is the probability it doesn't).
We could use angular distance for this and see 1 - 90/180 = 0.5, or 50%. We could also normalize this to a max distance of 1 to view it more easily on a graph.
The probability of two independent events both happening is (probability of event 1) × (probability of event 2); rolling a 4 on a 6-sided die twice in a row would be 1/6 × 1/6.
So we can take the formula above, 1 - d/dc, and multiply it by itself once for each projection we have to see how the probability of two embeddings being in the same bin changes with their distance. This gives us (1 - d/dc)^k, where k is the number of projections. Increasing k reduces the chance of a collision (two embeddings having the same hash), since there are more opportunities for a projection to divide them into separate bins.
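Plugging a few numbers into that formula shows how quickly the odds drop as k grows (a small sketch with the max distance normalized to 1):
#probability that two embeddings at normalized distance d share a bin across k projections
def same_bin_probability(d, k):
    return (1 - d) ** k

for k in (1, 5, 20):
    print(k, round(same_bin_probability(0.2, k), 3))
#with k = 1 two embeddings at distance 0.2 share a bin 80% of the time,
#with k = 5 that drops to about 33%, and with k = 20 to about 1%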
With LSH the idea is that embeddings which are close to each other have a good chance of ending up in the same bin, though with only one hash table there is still a chance that close embeddings get divided by a projection. We will look at using multiple hash tables to deal with this soon.
Change the number of projections below to see how the probability of two embeddings being in the same bin changes based on their distance.
This will help us speed up searches: we run a new embedding (one not in our collection of sentences from Elden Ring, Harry Potter, or Sailor Moon) through our projections and then only compare it with embeddings that have the same hash. We can adjust k to tune the probability of a collision based on the distance between the embeddings.
Let's take a look at a rough Python example we could use to create a hash table for our sentence tensors.
import json
import numpy as np

hashtableLength = 5 #will also be our number of projections
inputArrayDims = 512
projections = []
hashTable = {}

#these are our previously generated sentence embeddings
with open('Tensors.json') as f:
    tensors = json.load(f)

#add an empty array to hold each projection
for i in range(hashtableLength):
    projections.append([])

#to each empty projection array add 512 (same length as our tensors) random floats
for j in range(hashtableLength):
    for k in range(inputArrayDims):
        projections[j].append(np.random.randn())

#now projections is a 2d array of shape hashtableLength*inputArrayDims
#it is a set of made up projections/tensors/vectors, the same length as our sentence vectors

#for each tensor
for i, t in enumerate(tensors):
    #create an empty string to hold our hash
    hash = ''
    #then compare it to each projection in the hash table
    for k in range(hashtableLength):
        #positive dot product (less than 90° distance) we add a '1'
        if np.dot(t, projections[k]) > 0:
            hash += '1'
        #and negative (greater than 90° distance) we add a '0'
        else:
            hash += '0'
    #if the hash string is already in our hash table,
    #add the index for this tensor to the list the hash points to
    if hash in hashTable:
        hashTable[hash].append(i)
    #otherwise create a new list with just that index to start
    else:
        hashTable[hash] = [i]
The 'hashTable' dictionary should now have each hash as a key, with a list of the indices of the tensors that have that hash as the value.
We would then save the hash table dict and our projections, and use them to generate a hash for a search term. Then we only check the cosine similarity of the search term tensor against the sentence tensors that have the same hash, and we can get the list of those indices quickly since we have our hash table dict.
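Saving them is nothing fancy; something along these lines would do (the file names here are just placeholders for this sketch):
import json

#persist the projections and the hash table so the web app can load them later
with open('projections.json', 'w') as f:
    json.dump(projections, f)
with open('hashTable.json', 'w') as f:
    json.dump(hashTable, f)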
Let's look at an example of this in JavaScript. Apologies for hopping between languages, but it's easy to write utility programs in Python, and obviously I want this to be a web app, so all the other stuff is JS.
//generated from python file
projections = [...]
hashTable = {}
sentences = ["","",...]
tensors = [[...],[...],...]

searchTerm = document.getElementById("searchTerm").value;
queryEmbedding = [];
queryEmbedding.push((await model.embed(searchTerm)).unstack()[0]);

//these need to match the lengths used in the python file
hashtableLength = 5;
inputArrayDims = 512;

hash = '';
//for each projection in the hash table
for(j = 0; j < hashtableLength; j += 1){
    //check the dot product of the search term embedding with the projection
    if(math.dot(queryEmbedding[0].arraySync(), projections[j]) > 0){
        //if positive (within 90°) hash gets a '1'
        hash += '1';
    }
    //otherwise if negative (not within 90°) hash gets a '0'
    else{
        hash += '0';
    }
}

//now we have built the hash for the query embedding by running it through the same projections
//we ran all the sentence embeddings through to create our hash table,
//so we can use the hash to look up the indices in our hash table that share it!
indicesToCompare = [];
//there is a chance our search term doesn't share its hash with any of the tensors, so we check the hash exists
if(hash in hashTable){
    indicesToCompare = hashTable[hash];
}

//check the cosine distance for each tensor that had the same hash
scores = new Array(tensors.length).fill(1);
for(i = 0; i < indicesToCompare.length; i += 1){
    score = tf.losses.cosineDistance(tensors[indicesToCompare[i]], queryEmbedding[0], 0);
    scores[indicesToCompare[i]] = score.arraySync();
}
We will then have the 'scores' array, which holds the cosine distance between the query and each of the tensors from the indices-to-compare list; the rest, which we didn't compare, stay at 1. We can sort these values to find the sentence tensors that are closest to our search term.
We can also create multiple hash tables to increase the chance of collision, since just one gives us a high chance of false negatives: even if two vectors have fairly high angular similarity, there is still a chance that one of the projections divides them. With several hash tables there is a much better chance that vectors with higher angular similarity will have a bin collision in at least one table.
Take a look at the chart above and adjust the number of projections to 5. This gives us a good chance of not having a collision for embeddings with a distance greater than 0.6, but ideally we would like the line to shoot up more quickly as the distance gets smaller, since at a distance of 0.2, which is quite close, there is still only a slightly better than 25% chance we will have a collision. We will be able to fix this with multiple hash tables.
So looking at our formula, (1 - d/dc)^k is our probability of collision (Pc) for one hash table of length k. 1 - Pc is the probability that we don't have a collision for one table, (1 - Pc)^L is the probability we don't have a collision for L tables, and finally 1 - (1 - Pc)^L is the probability we do have a collision, so that's 1 - (1 - (1 - d/dc)^k)^L.
At least, that is my understanding of how the probability works out, and I get the same looking graph as in this paper: see page 1116.
Visualizing this below helps us get a good idea of the probability based on distance, hash length, and number of hash tables. Here is a YouTube video I managed to find while trying to understand all of this as well!
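The full formula is also simple enough to play with directly; here's a sketch that reproduces a couple of the cases discussed in this article (max distance normalized to 1):
#probability of at least one collision across L hash tables, each with k projections
def collision_probability(d, k, L):
    p_same_bin = (1 - d) ** k            #one table of k projections
    return 1 - (1 - p_same_bin) ** L     #at least one collision across L tables

print(round(collision_probability(0.2, 5, 1), 3))      #~0.33 with a single table of length 5
print(round(collision_probability(0.2, 5, 16), 3))     #~0.998 with 16 tables
print(round(collision_probability(0.25, 20, 2000), 3)) #~0.998 even with 20 projections, given 2000 tables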
If we were to go with 20 projections and 2000 hash tables, we are pretty much guaranteed to have collisions for embeddings with a distance less than 0.25, but the trade-off is that the information for that many hash tables would take up a huge amount of space, so we still want to find a balance between accuracy and performance.
Now let's look at some code examples similar to the ones above, but adjusted to use multiple hash tables. First, the Python code to create the hash tables and projections.
import json
import numpy as np

numberOfHashtables = 16
hashtableLength = 5
inputArrayDims = 512
projections = []
hashTables = []

#one empty dictionary per hash table
for i in range(numberOfHashtables):
    hashTables.append({})

#one set of empty projection arrays per hash table
for i in range(numberOfHashtables):
    projections.append([])
    for j in range(hashtableLength):
        projections[i].append([])

#fill every projection with 512 random floats
for i in range(numberOfHashtables):
    for j in range(hashtableLength):
        for k in range(inputArrayDims):
            projections[i][j].append(np.random.randn())

#previously generated sentence embeddings
with open('Tensors.json') as f:
    tensors = json.load(f)

#hash every tensor once per hash table, using that table's projections
for i, t in enumerate(tensors):
    for j in range(numberOfHashtables):
        hash = ''
        for k in range(hashtableLength):
            if np.dot(t, projections[j][k]) > 0:
                hash += '1'
            else:
                hash += '0'
        if hash in hashTables[j]:
            hashTables[j][hash].append(i)
        else:
            hashTables[j][hash] = [i]
And the JavaScript:
//generated from python file
projections = [...]
hashTables = [...]
sentences = ["","",...]
tensors = [[...],[...],...]

searchTerm = document.getElementById("searchTerm").value;
queryEmbedding = [];
queryEmbedding.push((await model.embed(searchTerm)).unstack()[0]);

//these need to match the lengths used when generating the
//hash tables and projections
numberOfHashtables = 16;
hashtableLength = 5;
inputArrayDims = 512;

//build one hash for the query per hash table, using that table's projections
queryHashes = [];
for(i = 0; i < numberOfHashtables; i += 1){
    hash = '';
    for(j = 0; j < hashtableLength; j += 1){
        if(math.dot(queryEmbedding[0].arraySync(), projections[i][j]) > 0){
            hash += '1';
        }
        else{
            hash += '0';
        }
    }
    queryHashes.push(hash);
}

//collect the indices that collide with the query in any of the hash tables
indicesToCompare = [];
for(i = 0; i < queryHashes.length; i += 1){
    if(queryHashes[i] in hashTables[i]){
        indicesToCompare.push(...hashTables[i][queryHashes[i]]);
    }
}
//turn the list into a set to remove duplicates
indicesToCompare = [...new Set(indicesToCompare)];

//check the cosine distance for each tensor that collided at least once
scores = new Array(tensors.length).fill(1);
for(i = 0; i < indicesToCompare.length; i += 1){
    score = tf.losses.cosineDistance(tensors[indicesToCompare[i]], queryEmbedding[0], 0);
    scores[indicesToCompare[i]] = score.arraySync();
}
I'll include all the full code that I used on this project's GitHub page.
So as we increase the number of hash tables we have more chances for embeddings to fall into the same bin at least once, and embeddings that are close are likely to fall into the same bin even more often.
This leads us to a final bit of optimization: we can increase the number of hash tables and use the frequency of bin collisions as our similarity score, removing the need to check cosine similarity entirely. This also eliminates the need to store the sentence embeddings for comparison; we only need to create an embedding for the search term and hash it against our projections for each hash table.
projections = [...]
hashTables = [...]
sentences = ["","",...]

searchTerm = document.getElementById("searchTerm").value;
queryEmbedding = [];
queryEmbedding.push((await model.embed(searchTerm)).unstack()[0]);

numberOfHashtables = 300;
hashtableLength = 5;
inputArrayDims = 512;

//build one hash for the query per hash table
queryHashes = [];
for(i = 0; i < numberOfHashtables; i += 1){
    hash = '';
    for(j = 0; j < hashtableLength; j += 1){
        if(math.dot(queryEmbedding[0].arraySync(), projections[i][j]) > 0){
            hash += '1';
        }
        else{
            hash += '0';
        }
    }
    queryHashes.push(hash);
}

//we run the query through each hash table and collect the indices of the embeddings it collides with
//if two embeddings are very close they could collide in every hash table, so we would see them up to 300 times in indicesToCompare
//if two embeddings are very far apart we probably won't see them in the list at all
indicesToCompare = [];
for(i = 0; i < queryHashes.length; i += 1){
    if(queryHashes[i] in hashTables[i]){
        indicesToCompare.push(...hashTables[i][queryHashes[i]]);
    }
}

//this snazzy bit of js takes the indicesToCompare list and returns a dictionary called map
//with each index that occurred at least once as the key and the number of occurrences as the value
var map = indicesToCompare.reduce(function(p, c) {
    p[c] = (p[c] || 0) + 1;
    return p;
}, {});

//this takes that dictionary and sorts the keys (the indices of the embeddings)
//by the values (the number of times that embedding collided with our query hash across the hash tables)
var results = Object.keys(map).sort(function(a, b) {
    return map[b] - map[a];
});

//now results is a list of the indices of the sentence embeddings sorted descending by
//the number of collisions they had with the query embedding, so sorted from best match going down
//we can reference those indices back to our sentences list to display them
table = document.getElementById("results");
table.innerHTML = "";
//insert rows from the worst of the top matches up to the best, so the best match ends up in row 0
for(var i = Math.min(results.length, 500) - 1; i >= 0; i -= 1){
    var row = table.insertRow(0);
    var cell1 = row.insertCell(0);
    cell1.innerHTML = sentences[results[i]];
}
This will result in a loss of accuracy, since collisions are probability based as we saw above, but with enough tables we can achieve decent results, especially for something as subjective as comparing sentences. Less disk space is needed and search time is further improved. I don't have a fantastic way to decide what values to use to achieve that, but from looking at the charts and trying things out I've found that around 200-300 hash tables with a length of 5 each produces good search results for hash-only searching, for my sentence sets.
Another way to understand this idea of scoring results by hash collisions only is to first take a look back at this chart and change the number of projections (the length of each hash table) to 5 and the number of hash tables to 1.
Take a look at the curve formed, showing the probability of getting one match with our one hash table.
Then change the number of hash tables to 100; this curve shows the probability of getting at least one match, but we could get up to 100, so you can think of it as stacking the probability of the last curve 100 times.
For this last method of scoring, where the frequency of bin/hash-table collisions is our similarity score, we are just checking the 100 randomly generated hash tables, each one with the probability you saw for the first curve (5 projections, 1 table), and adding up how many times we get a hash collision as our score.
If you're curious about this blog or still don't fully understand, feel free to shoot me an email at colourofloosemetal@gmail.com; I would be happy to chat about this project.
I haven't talked much about the actual universal sentence encoder model, since it isn't really the main topic of this article, but it is worth mentioning that all the examples use the "lite" version of the model, which is many times smaller than the main version; this allows it to run in your browser. In the future I might try to run the larger model locally to see how the results improve. The lite version still gives excellent results, but it probably won't pick up on things like idioms and is more likely to interpret text literally if you try some.
Live Demos
Do not try to search before the loading message goes away. If the loading message seems to be taking an absurdly long time to disappear, try the example on desktop or refresh the page.
Elden Ring Messages
If you've played Elden Ring this is fantastic for finding messages, but with fewer than 10k messages to search from, the results can be lacking.
Harry Potter books: with close to 10x the sentences of the Elden Ring messages, there are some interesting results to be found.
Sailor Moon
💯
Here are some interesting results from each searcher.
Starting with Elden Ring


Harry Potter




And of course Sailor Moon





Please also email me screenshots like these if you get any fun results: colourofloosemetal@gmail.com