liskin · June 14, 2021 10:18 · liskin · Mar 24, 2020
diff --git a/corona-bloom.lhs b/corona-bloom.lhs
 Faster COVID-19 testing using PCR and hashing (bloom filter)
 ============================================================

 According to Israeli scientists, [it's possible][pcr-pooling] to merge
 multiple swab samples into one tube for PCR without losing much accuracy.
 Their method involves a divide and conquer approach, starting by pooling
 samples into groups of (up to) 32 and then eliminating groups with no positive
 samples.

 [pcr-pooling]: https://medium.com/@dinber19/more-with-less-using-pooling-to-detect-coronavirus-with-fewer-tests-8ba1a2cd8b67

 I'd like to suggest an alternative approach based on hashing, essentially a
 biological [Bloom filter](https://en.wikipedia.org/wiki/Bloom_filter).
 The core idea is that we don't just group samples, we first split each sample
 and only then group it with several other samples, and we do this in a way
 that allows identification of individual positives/negatives, assuming the
 ratio of positives is low (1%, ideally even less).

 A bloom filter uses k distinct hash functions and m bits. Adding an input
 value involves hashing it using each hash function and then setting the
 resulting bit position to 1.

 A biological bloom filter uses tubes instead of bits. Adding a positive sample
 contaminates k tubes with the virus, whereas adding a negative sample only
 slightly dilutes k tubes (essentially a no-op). Then PCR is done on all those
 tubes, and the bit values are revealed. We can then use it as a traditional
 bloom filter and check (on a computer) which samples are in the positive set.
 Provided the ratio of positive samples is low (which it should be if we're
 going to widely test asymptomatic people), the false positive rate should be
 acceptable and we should be able to identify infected persons within a single
 PCR run.

 Assumptions:

 * we're able to split a sample into up to 10 equal parts (I hope so)
 * we can merge samples and still detect the virus (verified in the lab)
 * it's possible to correctly put thousands of sub-samples into tubes (???)
 * ratio of positive vs tested is low (so far it has been)

 What follows is a Haskell program to simulate the process and calculate ideal
 values of k (number of sub-samples we need) and false positivity rates.

 See results after the code.

 > import Data.Csv (encode)
 > import Data.List
 > import Text.Printf
 > import qualified Data.BloomFilter as Bloom
 > import qualified Data.BloomFilter.Hash as Bloom
 > import qualified Data.ByteString.Lazy as B

 > pcrTubes = 96
 > maxSamplesInTube = 50

 > kHashesCandidates = [2..20]
 > salts = [0..100]

 > samplesInTube kHashes nAll = fromIntegral nAll * fromIntegral kHashes / fromIntegral pcrTubes

 > genSamples :: Int -> [Int]
 > genSamples salt = map (salt +) [1..]

 > bloom samples kHashes nRealPositive =
 >     Bloom.fromList (Bloom.hashes kHashes) pcrTubes (take nRealPositive samples)

 > tryBloom samples nRealPositive nAll =
 >     [ nTestedPositive filt - nRealPositive
 >     | kHashes <- kHashesCandidates, let filt = bloom samples kHashes nRealPositive
 >     , samplesInTube kHashes nAll <= maxSamplesInTube ]
 >   where
 >     nTestedPositive filt = length $ filter id [n `Bloom.elem` filt | n <- take nAll samples]

 > optimalBloom nRealPositive nAll = fmap (\(_, avg, k) -> (avg, k)) optimum
 >   where
 >     allFalsePositives = transpose [ tryBloom samples nRealPositive nAll | samples <- map genSamples salts ]
 >     avgFalsePositives = map avg allFalsePositives
 >     sumSquareFalsePositives = map (sum . map (^2)) allFalsePositives
 >     optimum = if null allFalsePositives
 >         then Nothing
 >         else Just $ minimum $ zip3 sumSquareFalsePositives avgFalsePositives kHashesCandidates

 > avg :: (Num a, Integral a) => [a] -> Double
 > avg xs = fromIntegral (sum xs) / fromIntegral (length xs)

 > table :: [(Int, Int, Int, String, String)]
 > table =
 >     [ (nRealPositive, nAll, kHashes, fmt (samplesInTube kHashes nAll), fmt avgFalsePositives)
 >     | nRealPositive <- [1, 5, 10, 15, 20, 30]
 >     , nAll <- [500, 1000, 2000, 5000]
 >     , Just (avgFalsePositives, kHashes) <- [optimalBloom nRealPositive nAll] ]
 >   where
 >     fmt :: Double -> String
 >     fmt d = printf "%.2f" d

 > main = do
 >     B.putStr (encode [("nRealPositive", "nAll", "kHashes", "samplesInTube", "avgFalsePositives")])
 >     B.putStr (encode table)

 $ ./corona-bloom | column -s, -t
 nRealPositive  nAll  kHashes  samplesInTube  avgFalsePositives
 1              500   3        15.62          0.00
 1              1000  4        41.67          0.00
 1              2000  2        41.67          0.41
 5              500   7        36.46          0.02
 5              1000  4        41.67          0.50
 5              2000  2        41.67          10.48
 10             500   7        36.46          1.31
 10             1000  4        41.67          4.83
 10             2000  2        41.67          39.64
 15             500   4        20.83          7.41
 15             1000  4        41.67          17.58
 15             2000  2        41.67          81.89
 20             500   4        20.83          17.95
 20             1000  4        41.67          40.67
 20             2000  2        41.67          134.20
 30             500   4        20.83          57.10
 30             1000  3        31.25          114.93
 30             2000  2        41.67          265.82

 These ones look interesting:

 nRealPositive  nAll  kHashes  samplesInTube  avgFalsePositives
 1              500   3        15.62          0.00
 1              1000  4        41.67          0.00
 1              2000  2        41.67          0.41
 5              500   7        36.46          0.02
 5              1000  4        41.67          0.50
 5              2000  2        41.67          10.48
 10             500   7        36.46          1.31
 10             1000  4        41.67          4.83
 15             500   4        20.83          7.41
 15             1000  4        41.67          17.58
 20             500   4        20.83          17.95
 20             1000  4        41.67          40.67

 This means that if we expect to have up to 5 infected in a thousand samples, we
 can test them all in a single 96-tube PCR run and identify the infected ones
 almost exactly. As the ratio of infected samples rises, the false positive rate
 goes up rapidly, but up to 20 infected in 1000 samples, it should fit in one
 additional 96-tube PCR run. But we must be able to split a single swab into 5
 pieces.
diff --git a/Makefile b/Makefile
 STACK_EXEC = stack exec \
 	--package bloomfilter \
 	--package bytestring \
 	--package cassava

 corona-bloom: corona-bloom.lhs
 	$(STACK_EXEC) -- ghc --make $<

 ghci: corona-bloom.lhs
 	$(STACK_EXEC) -- ghci $<

 ghcid: corona-bloom.lhs
 	$(STACK_EXEC) -- ghcid $<
	Faster COVID-19 testing using PCR and hashing (bloom filter)
	============================================================

	According to Israeli scientists, [it's possible][pcr-pooling] to merge
	multiple swab samples into one tube for PCR without losing much accuracy.
	Their method involves a divide and conquer approach, starting by pooling
	samples into groups of (up to) 32 and then eliminating groups with no positive
	samples.

	[pcr-pooling]: https://medium.com/@dinber19/more-with-less-using-pooling-to-detect-coronavirus-with-fewer-tests-8ba1a2cd8b67

	I'd like to suggest an alternative approach based on hashing, essentially a
	biological [Bloom filter](https://en.wikipedia.org/wiki/Bloom_filter).
	The core idea is that we don't just group samples, we first split each sample
	and only then group it with several other samples, and we do this in a way
	that allows identification of individual positives/negatives, assuming the
	ratio of positives is low (1%, ideally even less).

	A bloom filter uses k distinct hash functions and m bits. Adding an input
	value involves hashing it using each hash function and then setting the
	resulting bit position to 1.

	A biological bloom filter uses tubes instead of bits. Adding a positive sample
	contaminates k tubes with the virus, whereas adding a negative sample only
	slightly dilutes k tubes (essentially a no-op). Then PCR is done on all those
	tubes, and the bit values are revealed. We can then use it as a traditional
	bloom filter and check (on a computer) which samples are in the positive set.
	Provided the ratio of positive samples is low (which it should be if we're
	going to widely test asymptomatic people), the false positive rate should be
	acceptable and we should be able to identify infected persons within a single
	PCR run.

	Assumptions:

	* we're able to split a sample into up to 10 equal parts (I hope so)
	* we can merge samples and still detect the virus (verified in the lab)
	* it's possible to correctly put thousands of sub-samples into tubes (???)
	* ratio of positive vs tested is low (so far it has been)

	What follows is a Haskell program to simulate the process and calculate ideal
	values of k (number of sub-samples we need) and false positivity rates.

	See results after the code.

	> import Data.Csv (encode)
	> import Data.List
	> import Text.Printf
	> import qualified Data.BloomFilter as Bloom
	> import qualified Data.BloomFilter.Hash as Bloom
	> import qualified Data.ByteString.Lazy as B

	> pcrTubes = 96
	> maxSamplesInTube = 50

	> kHashesCandidates = [2..20]
	> salts = [0..100]

	> samplesInTube kHashes nAll = fromIntegral nAll * fromIntegral kHashes / fromIntegral pcrTubes

	> genSamples :: Int -> [Int]
	> genSamples salt = map (salt +) [1..]

	> bloom samples kHashes nRealPositive =
	> Bloom.fromList (Bloom.hashes kHashes) pcrTubes (take nRealPositive samples)

	> tryBloom samples nRealPositive nAll =
	> [ nTestedPositive filt - nRealPositive
	> \| kHashes <- kHashesCandidates, let filt = bloom samples kHashes nRealPositive
	> , samplesInTube kHashes nAll <= maxSamplesInTube ]
	> where
	> nTestedPositive filt = length $ filter id [n `Bloom.elem` filt \| n <- take nAll samples]

	> optimalBloom nRealPositive nAll = fmap (\(_, avg, k) -> (avg, k)) optimum
	> where
	> allFalsePositives = transpose [ tryBloom samples nRealPositive nAll \| samples <- map genSamples salts ]
	> avgFalsePositives = map avg allFalsePositives
	> sumSquareFalsePositives = map (sum . map (^2)) allFalsePositives
	> optimum = if null allFalsePositives
	> then Nothing
	> else Just $ minimum $ zip3 sumSquareFalsePositives avgFalsePositives kHashesCandidates

	> avg :: (Num a, Integral a) => [a] -> Double
	> avg xs = fromIntegral (sum xs) / fromIntegral (length xs)

	> table :: [(Int, Int, Int, String, String)]
	> table =
	> [ (nRealPositive, nAll, kHashes, fmt (samplesInTube kHashes nAll), fmt avgFalsePositives)
	> \| nRealPositive <- [1, 5, 10, 15, 20, 30]
	> , nAll <- [500, 1000, 2000, 5000]
	> , Just (avgFalsePositives, kHashes) <- [optimalBloom nRealPositive nAll] ]
	> where
	> fmt :: Double -> String
	> fmt d = printf "%.2f" d

	> main = do
	> B.putStr (encode [("nRealPositive", "nAll", "kHashes", "samplesInTube", "avgFalsePositives")])
	> B.putStr (encode table)

	$ ./corona-bloom \| column -s, -t
	nRealPositive nAll kHashes samplesInTube avgFalsePositives
	1 500 3 15.62 0.00
	1 1000 4 41.67 0.00
	1 2000 2 41.67 0.41
	5 500 7 36.46 0.02
	5 1000 4 41.67 0.50
	5 2000 2 41.67 10.48
	10 500 7 36.46 1.31
	10 1000 4 41.67 4.83
	10 2000 2 41.67 39.64
	15 500 4 20.83 7.41
	15 1000 4 41.67 17.58
	15 2000 2 41.67 81.89
	20 500 4 20.83 17.95
	20 1000 4 41.67 40.67
	20 2000 2 41.67 134.20
	30 500 4 20.83 57.10
	30 1000 3 31.25 114.93
	30 2000 2 41.67 265.82

	These ones look interesting:

	nRealPositive nAll kHashes samplesInTube avgFalsePositives
	1 500 3 15.62 0.00
	1 1000 4 41.67 0.00
	1 2000 2 41.67 0.41
	5 500 7 36.46 0.02
	5 1000 4 41.67 0.50
	5 2000 2 41.67 10.48
	10 500 7 36.46 1.31
	10 1000 4 41.67 4.83
	15 500 4 20.83 7.41
	15 1000 4 41.67 17.58
	20 500 4 20.83 17.95
	20 1000 4 41.67 40.67

	This means that if we expect to have up to 5 infected in a thousand samples, we
	can test them all in a single 96-tube PCR run and identify the infected ones
	almost exactly. As the ratio of infected samples rises, the false positive rate
	goes up rapidly, but up to 20 infected in 1000 samples, it should fit in one
	additional 96-tube PCR run. But we must be able to split a single swab into 5
	pieces.
	STACK_EXEC = stack exec \
	--package bloomfilter \
	--package bytestring \
	--package cassava

	corona-bloom: corona-bloom.lhs
	$(STACK_EXEC) -- ghc --make $<

	ghci: corona-bloom.lhs
	$(STACK_EXEC) -- ghci $<

	ghcid: corona-bloom.lhs
	$(STACK_EXEC) -- ghcid $<