I compiled these notes while learning how to run Apache Pulsar. They are written from the perspective of a Go developer who works primarily with Kafka, RabbitMQ, and NATS. Some of this may not be entirely correct. Sorry!
- Concept of “bundles”
- Each broker is in charge of certain bundles
- A bundle represents multiple topics
- Pulsar has a tunable automatic load shedder
- Possible to configure to auto shed bundles to less loaded brokers
- Bundles are automatically split when under heavy load
- Bundle is split into more bundles
- ^ How do you know that has happened?
- Pulsar’s built-in dashboard sucks
- Does not contain up-to-date information (stats lag by 1m+)
- Unable to view actual messages
- Use pulsar manager instead
- UPDATE: Oh, dashboard is deprecated. That explains it.
- By default, retention is not enabled for namespaces/topics
- Can be set via `pulsar-admin` or `pulsarctl`
- Partitions
- How do partitions per topic work?
- Same as Kafka - partitions are a unit of parallelism
- By default, a topic is created with 1 partition (known as a "non-partitioned topic")
- Partitioned topics must be explicitly created via admin API (`pulsarctl` etc)
- If you use a regular topic (with 1 partition) - you can have as many consumers as you want
- If you use a partitioned topic - it falls under the same requirements as Kafka - one consumer per partition.
- What is the ledger?
- A bookkeeper concept
- Pulsar stores data in bookkeeper ledgers
- Ledger contains metadata about the topics underneath it
- Unlike Kafka, Pulsar does not store data on brokers - data is stored on bookkeeper nodes
- Ledger == unit of storage in bookkeeper
- Pulsar supports server-side schema registry
- Does it support protobuf?
- Looks like it is possible to create a producer with protobuf support in golang
- Haven’t checked consumer - but probably yes
- Connectors
- Same as Kafka
- “source” == get data INTO pulsar
- “sink” == get data OUT of pulsar
- Replication factor
- Specify the replication factor via `pulsarctl` when creating the topic
- You can update replication factor via `pulsarctl`
- There is no concept of "replication" in the Kafka sense
- Data is stored on bookies (bookkeeper concept) based on satisfying quorum
- Possible to disable/enable "replication" per topic (maybe per namespace?) via pulsar admin api
- Can you set message rate limits per topic???
- No, doesn't seem like it
- What is a “cursor”?
- Same as “current offset” in Kafka
- You can change the cursor via pulsar admin api
- You can define “interceptors” for producers (in golang client)!!!
- i.e. like a custom roundtripper in net/http
- Could attach some sort of basic validation - neat!
- Has concept of persistent and non-persistent topics
- Non-partitioned topics are automatically deleted after 60 seconds of inactivity - nice!
- There is no way to convert non-partitioned topics -> partitioned topics... - lame
- Have to delete non-partitioned topic and re-create as partitioned topic
- Pulsar comes with `pulsar-perf` - a tool to test pulsar performance
- Pulsar has a concept of "transactions"
- Think DB transactions - emit 5 messages as part of the same atomic operation.
- Pulsar golang client
- The pulsar golang client lib by default sets a 64MB memory limit!!!! - nice!

```go
// Limit of client memory usage (in byte). The 64M default can guarantee a high producer throughput.
// Config less than 0 indicates off memory limit.
MemoryLimitBytes int64
```

- It gets better, `client.CreateProducer()` returns an interface already - niiiiice
- The native Go client is really well thought out. Has batching, has chunking support, has automatic (tunable) broker reconnect, has schema (incl. protobuf) and much more.
- It is possible to deliver messages with a delay (`DeliverAfter`) or at a specific time (`DeliverAt`)!
- Downside - the library does not contain management functionality (i.e. can't manage topics, subscriptions, etc.) - need to use another lib (`pulsarctl`)
- Pulsar has server-side dedupe
Recent findings after writing producer and consumer code:
- For high-speed production, async send works best
- To avoid hitting send timeouts, set `SendTimeout` when instantiating the producer
- The golang lib doesn't include admin API support - need to use a separate lib if you want to programmatically manage topics, subscriptions, etc.
- Not able to produce more than ~6K/s on a default k8s deployment (3 node cluster via official helm) -- something needs to be tuned probably
Rudimentary Pulsar producer code here: https://github.com/batchcorp/event-generator/blob/main/output/pulsar.go#L41