70 lines
3.2 KiB
Markdown
70 lines
3.2 KiB
Markdown
|
Bloom filters
|
||
|
-------------
|
||
|
|
||
|
[![Master Build Status](https://secure.travis-ci.org/willf/bloom.png?branch=master)](https://travis-ci.org/willf/bloom?branch=master)
|
||
|
[![Coverage Status](https://coveralls.io/repos/github/willf/bloom/badge.svg?branch=master)](https://coveralls.io/github/willf/bloom?branch=master)
|
||
|
[![Go Report Card](https://goreportcard.com/badge/github.com/willf/bloom)](https://goreportcard.com/report/github.com/willf/bloom)
|
||
|
[![GoDoc](https://godoc.org/github.com/willf/bloom?status.svg)](http://godoc.org/github.com/willf/bloom)
|
||
|
|
||
|
A Bloom filter is a representation of a set of _n_ items, where the main
|
||
|
requirement is to make membership queries; _i.e._, whether an item is a
|
||
|
member of a set.
|
||
|
|
||
|
A Bloom filter has two parameters: _m_, a maximum size (typically a reasonably large multiple of the cardinality of the set to represent) and _k_, the number of hashing functions on elements of the set. (The actual hashing functions are important, too, but this is not a parameter for this implementation). A Bloom filter is backed by a [BitSet](https://github.com/willf/bitset); a key is represented in the filter by setting the bits at each value of the hashing functions (modulo _m_). Set membership is done by _testing_ whether the bits at each value of the hashing functions (again, modulo _m_) are set. If so, the item is in the set. If the item is actually in the set, a Bloom filter will never fail (the true positive rate is 1.0); but it is susceptible to false positives. The art is to choose _k_ and _m_ correctly.
|
||
|
|
||
|
In this implementation, the hashing functions used is [murmurhash](github.com/spaolacci/murmur3), a non-cryptographic hashing function.
|
||
|
|
||
|
This implementation accepts keys for setting and testing as `[]byte`. Thus, to
|
||
|
add a string item, `"Love"`:
|
||
|
|
||
|
n := uint(1000)
|
||
|
filter := bloom.New(20*n, 5) // load of 20, 5 keys
|
||
|
filter.Add([]byte("Love"))
|
||
|
|
||
|
Similarly, to test if `"Love"` is in bloom:
|
||
|
|
||
|
if filter.Test([]byte("Love"))
|
||
|
|
||
|
For numeric data, I recommend that you look into the encoding/binary library. But, for example, to add a `uint32` to the filter:
|
||
|
|
||
|
i := uint32(100)
|
||
|
n1 := make([]byte, 4)
|
||
|
binary.BigEndian.PutUint32(n1, i)
|
||
|
filter.Add(n1)
|
||
|
|
||
|
Finally, there is a method to estimate the false positive rate of a particular
|
||
|
bloom filter for a set of size _n_:
|
||
|
|
||
|
if filter.EstimateFalsePositiveRate(1000) > 0.001
|
||
|
|
||
|
Given the particular hashing scheme, it's best to be empirical about this. Note
|
||
|
that estimating the FP rate will clear the Bloom filter.
|
||
|
|
||
|
Discussion here: [Bloom filter](https://groups.google.com/d/topic/golang-nuts/6MktecKi1bE/discussion)
|
||
|
|
||
|
Godoc documentation: https://godoc.org/github.com/willf/bloom
|
||
|
|
||
|
## Installation
|
||
|
|
||
|
```bash
|
||
|
go get -u github.com/willf/bloom
|
||
|
```
|
||
|
|
||
|
## Contributing
|
||
|
|
||
|
If you wish to contribute to this project, please branch and issue a pull request against master ("[GitHub Flow](https://guides.github.com/introduction/flow/)")
|
||
|
|
||
|
This project include a Makefile that allows you to test and build the project with simple commands.
|
||
|
To see all available options:
|
||
|
```bash
|
||
|
make help
|
||
|
```
|
||
|
|
||
|
## Running all tests
|
||
|
|
||
|
Before committing the code, please check if it passes all tests using (note: this will install some dependencies):
|
||
|
```bash
|
||
|
make deps
|
||
|
make qa
|
||
|
```
|