TOPIC DISCOVERY
1. Introduction

Each node in the Ethereum network has a set of tags called "topics" by
which other nodes may choose to associate with it. Topics may indicate
capability, responsibility for certain information or functionality, or
any other attribute by which the node may wish to belong to a connected
subnetwork of the Ethereum network. This document outlines an
infrastructure by which nodes covered by a certain topic may discover
one another.

The presented infrastructure can be used for two distinct purposes:
nodes covered by the same topic can use it to form and maintain a
connected network, while nodes not covered by a certain topic might use
it to find nodes covered by that topic in order to use a service
provided by such nodes.

Topics are assumed to potentially cover arbitrarily large or small parts
of the Ethereum network, meaning that certain topics may contain only
one node while others may span the entire network. Irrespective of the
number of nodes covered by a topic, the discovery infrastructure should
provide a suitable set of bootstrap nodes for new nodes wishing to join
the network, or for those wishing to use the services of nodes covered
by said topic.

Furthermore, it is assumed that each node is covered by at most a few
dozen topics.
2. Topic structure

On the lowest level, topics are assumed to be unstructured, denoted by a
string of printable characters. On a higher level, two possible
hierarchical topic structures are defined: one in which nodes covered by
a topic and all its subtopics are searched, and another in which nodes
covered by a topic and all its supertopics are searched. Subtopics are
denoted by a string that has the parent topic as its prefix, followed by
some delimiting character.

The two higher-level hierarchies are handled as follows. In the first
case, when nodes covered by subtopics need to be discoverable, nodes
must be tagged by their topic and all their supertopics (e.g. a node
covered by "animal.mammal.dog" is also covered by "animal.mammal" and
"animal", so that a search for "animal.mammal" will find it). In the
second case, when nodes covered by supertopics need to be discoverable,
the node doing the discovery also searches for all the supertopics.
Apart from this, the directory is agnostic to topic structure.

The DHT distance measure between a topic and a node's DHT address is
defined as the XOR of the latter with the hash of the topic string.
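
The following Go sketch illustrates this scheme. The hash function
(SHA-256 here), the 256-bit address width and the '.' delimiter are
illustrative assumptions, not fixed by this document.

    package topicdisc

    import (
        "crypto/sha256"
        "strings"
    )

    // TopicHash maps a topic string onto the DHT address space.
    func TopicHash(topic string) [32]byte {
        return sha256.Sum256([]byte(topic))
    }

    // Distance is the XOR of a node's DHT address with the topic hash.
    // Comparing distances as big-endian integers orders nodes by their
    // closeness to the topic.
    func Distance(nodeAddr, topicHash [32]byte) (d [32]byte) {
        for i := range d {
            d[i] = nodeAddr[i] ^ topicHash[i]
        }
        return d
    }

    // Supertopics expands "animal.mammal.dog" into
    // ["animal", "animal.mammal", "animal.mammal.dog"], assuming '.'
    // as the delimiting character.
    func Supertopics(topic string) []string {
        parts := strings.Split(topic, ".")
        out := make([]string, 0, len(parts))
        for i := 1; i <= len(parts); i++ {
            out = append(out, strings.Join(parts[:i], "."))
        }
        return out
    }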
3. Architecture

Two states of discovery are clearly distinguished: before and after the
searching node has made its first successful handshake with a node
covered by the searched topic.

In the first state, the node searches the topic directory, organized as
a DHT and described later in more detail. From the topic directory, it
obtains the contact details (network address, port and public key) of a
relatively small set of matching nodes to which it attempts to connect.
Upon the first successful handshake, it transitions to the second state.

In the second state, nodes covered by a certain topic introduce each
other to the searching node. The strategy followed in this state may
well depend on the requirements of the topic and the desired
(sub-)network topology; the discussion of this second-state search
therefore falls outside the scope of this document. Nodes transition
back to the first state if for any reason they disconnect from all other
nodes covered by their topic.
Each node participating in the DHT stores records for a (potentially)
large number of topics and at most a small number of records for each
topic. Each record binds one node to one topic. In addition to the
topic, a record contains the network address, the port and the public
key of the node, as well as a timestamp and a digital signature over the
rest of the record, produced by the node the record refers to.
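
A record might look like the following Go struct. The field types and
the canonical encoding used for signing are assumptions; the document
does not fix a wire format or signature scheme.

    package topicdisc

    import (
        "net"
        "time"
    )

    // TopicRecord binds one node to one topic, as stored by directory nodes.
    type TopicRecord struct {
        Topic     string
        IP        net.IP    // network address of the registered node
        Port      uint16
        PublicKey []byte    // public key of the registered node
        Timestamp time.Time // time of registration
        Signature []byte    // signature by the registered node over all fields above
    }

    // signedBytes is a hypothetical canonical encoding of the signed
    // fields (everything except the signature itself).
    func (r *TopicRecord) signedBytes() []byte {
        b := []byte(r.Topic)
        b = append(b, r.IP...)
        b = append(b, byte(r.Port>>8), byte(r.Port))
        b = append(b, r.PublicKey...)
        ts, _ := r.Timestamp.MarshalBinary()
        return append(b, ts...)
    }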
4. Registration

Nodes regularly register themselves for each topic covering them by
pushing the corresponding records into the DHT. Only registrations with
a very recent timestamp and a correct signature are accepted. If the DHT
is direct (no content forwarding), the originator network address is
also checked, as nodes are only allowed to register themselves.
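
Continuing the sketch above, a directory node might validate an incoming
registration as follows. The freshness bound and the pluggable signature
check are assumptions made for illustration.

    // maxRecordAge is an assumed freshness bound; the document only
    // requires a "very recent" timestamp.
    const maxRecordAge = 5 * time.Minute

    // acceptRegistration applies the checks described above: a recent
    // timestamp, a valid signature and, in a direct (non-forwarding) DHT,
    // an originator address matching the record itself.
    func acceptRegistration(rec *TopicRecord, origin net.IP, directDHT bool,
        verify func(pub, msg, sig []byte) bool) bool {

        if time.Since(rec.Timestamp) > maxRecordAge {
            return false // stale registration
        }
        if directDHT && !origin.Equal(rec.IP) {
            return false // nodes may only register themselves
        }
        return verify(rec.PublicKey, rec.signedBytes(), rec.Signature)
    }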
In order to prevent overload on nodes with addresses close to popular
topics, the time interval between registrations is controlled as
follows: upon registration, the timestamp of the most recent
registration is returned to the registering node, and the repeated
registration is scheduled with a delay inversely proportional to the age
of that most recent record. If the capacity for records for one topic is
full, a random record for the topic in question is deleted. As a
security measure, nodes with overly frequent registrations may be
blacklisted.
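
A minimal sketch of both sides of this scheme follows, in the same
illustrative package as above. The per-topic capacity, the directory
type and the tuning constant of the registration delay are assumptions;
the document does not specify them.

    package topicdisc

    import (
        "math/rand"
        "time"
    )

    const (
        maxRecordsPerTopic = 16    // assumed per-topic capacity
        regTuning          = 600.0 // seconds^2, an assumed tuning constant
    )

    type directory struct {
        records map[string][]*TopicRecord // stored records, keyed by topic
    }

    // store inserts a record, evicting a random existing record if the
    // topic is at capacity, and returns the timestamp of the most recent
    // record held for that topic before the insertion.
    func (d *directory) store(rec *TopicRecord) time.Time {
        if d.records == nil {
            d.records = make(map[string][]*TopicRecord)
        }
        recs := d.records[rec.Topic]
        var latest time.Time
        for _, r := range recs {
            if r.Timestamp.After(latest) {
                latest = r.Timestamp
            }
        }
        if len(recs) >= maxRecordsPerTopic {
            i := rand.Intn(len(recs)) // drop a random record
            recs[i] = recs[len(recs)-1]
            recs = recs[:len(recs)-1]
        }
        d.records[rec.Topic] = append(recs, rec)
        return latest
    }

    // nextRegistration is run by the registering node: the delay before
    // the repeated registration is inversely proportional to the age of
    // the most recent record, so registrations back off while the topic
    // is being written frequently.
    func nextRegistration(latest time.Time) time.Duration {
        age := time.Since(latest).Seconds()
        if age < 1 {
            age = 1 // avoid division by zero and absurdly long delays
        }
        return time.Duration(regTuning / age * float64(time.Second))
    }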
5. Retrieval

First, a DHT node sufficiently close to the topic is found. Upon request
(i.e. a search by topic), all sufficiently recent records are returned.
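
On the directory side, answering such a request can be as simple as
filtering by freshness, reusing the assumed maxRecordAge bound and the
directory type from the registration sketches above.

    package topicdisc

    import "time"

    // lookup returns all sufficiently recent records stored for a topic.
    func (d *directory) lookup(topic string) []*TopicRecord {
        var out []*TopicRecord
        for _, r := range d.records[topic] {
            if time.Since(r.Timestamp) <= maxRecordAge {
                out = append(out, r)
            }
        }
        return out
    }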
If the DHT is implemented without query and data forwarding, this might
result in excessive hammering of nodes close to popular topics. For such
implementations, it is advisable for registering nodes to register
themselves at every step of the DHT lookup, not only at the nodes
closest to the topic. This way, retrieval lookups are likely to find
registrations closer to themselves and further away from the node
closest to the topic.

Another possible defense against hammering by retrieval requests is to
use them sparingly: both topic member nodes and client nodes must strive
to remain connected to the topic network by querying their neighbors for
further connections, in order to maintain a healthy redundancy of
connectivity without resorting to discovery.

Forwarding DHT implementations, however, avoid the issue altogether by
propagating and caching records along the path between the node on which
a record was found and the querying node.
6. Proposed implementation strategies

a) Leave the UDP-based Kademlia discovery intact and oblivious of
topics, its only purpose being to maintain a connected network of
assorted Ethereum nodes. The mechanism proposed above would be
implemented as a p2p subprotocol (named, e.g., DIR) with a forwarding
Kademlia similar to that of Swarm, based largely on the Swarm codebase.
Instead of storing one Swarm chunk per key, it would store a fixed
number of topic discovery records per key.

b) Add topic discovery directly to the existing UDP-based node discovery
Kademlia and implement the anti-hammering measures proposed in the
previous section.