RohanAwhad/late_chunking_with_recursive_splitting.py

RohanAwhad · 2024-09-03T22:44:56Z

Usage Example:

separators = ['\n\n', '\n']
text = """At a YC event last week Brian Chesky gave a talk that everyone who was there will remember. Most founders I talked to afterward said it was the best they'd ever heard. Ron Conway, for the first time in his life, forgot to take notes. I'm not going to try to reproduce it here. Instead I want to talk about a question it raised.

The theme of Brian's talk was that the conventional wisdom about how to run larger companies is mistaken. As Airbnb grew, well-meaning people advised him that he had to run the company in a certain way for it to scale. Their advice could be optimistically summarized as "hire good people and give them room to do their jobs." He followed this advice and the results were disastrous. So he had to figure out a better way on his own, which he did partly by studying how Steve Jobs ran Apple. So far it seems to be working. Airbnb's free cash flow margin is now among the best in Silicon Valley.

The audience at this event included a lot of the most successful founders we've funded, and one after another said that the same thing had happened to them. They'd been given the same advice about how to run their companies as they grew, but instead of helping their companies, it had damaged them.

Why was everyone telling these founders the wrong thing? That was the big mystery to me. And after mulling it over for a bit I figured out the answer: what they were being told was how to run a company you hadn't founded — how to run a company if you're merely a professional manager. But this m.o. is so much less effective that to founders it feels broken. There are things founders can do that managers can't, and not doing them feels wrong to founders, because it is.

In effect there are two different ways to run a company: founder mode and manager mode. Till now most people even in Silicon Valley have implicitly assumed that scaling a startup meant switching to manager mode. But we can infer the existence of another mode from the dismay of founders who've tried it, and the success of their attempts to escape from it.

There are as far as I know no books specifically about founder mode. Business schools don't know it exists. All we have so far are the experiments of individual founders who've been figuring it out for themselves. But now that we know what we're looking for, we can search for it. I hope in a few years founder mode will be as well understood as manager mode. We can already guess at some of the ways it will differ.

The way managers are taught to run companies seems to be like modular design in the sense that you treat subtrees of the org chart as black boxes. You tell your direct reports what to do, and it's up to them to figure out how. But you don't get involved in the details of what they do. That would be micromanaging them, which is bad.

Hire good people and give them room to do their jobs. Sounds great when it's described that way, doesn't it? Except in practice, judging from the report of founder after founder, what this often turns out to mean is: hire professional fakers and let them drive the company into the ground.

One theme I noticed both in Brian's talk and when talking to founders afterward was the idea of being gaslit. Founders feel like they're being gaslit from both sides — by the people telling them they have to run their companies like managers, and by the people working for them when they do. Usually when everyone around you disagrees with you, your default assumption should be that you're mistaken. But this is one of the rare exceptions. VCs who haven't been founders themselves don't know how founders should run companies, and C-level execs, as a class, include some of the most skillful liars in the world. [1]

Whatever founder mode consists of, it's pretty clear that it's going to break the principle that the CEO should engage with the company only via his or her direct reports. "Skip-level" meetings will become the norm instead of a practice so unusual that there's a name for it. And once you abandon that constraint there are a huge number of permutations to choose from.

For example, Steve Jobs used to run an annual retreat for what he considered the 100 most important people at Apple, and these were not the 100 people highest on the org chart. Can you imagine the force of will it would take to do this at the average company? And yet imagine how useful such a thing could be. It could make a big company feel like a startup. Steve presumably wouldn't have kept having these retreats if they didn't work. But I've never heard of another company doing this. So is it a good idea, or a bad one? We still don't know. That's how little we know about founder mode. [2]

Obviously founders can't keep running a 2000 person company the way they ran it when it had 20. There's going to have to be some amount of delegation. Where the borders of autonomy end up, and how sharp they are, will probably vary from company to company. They'll even vary from time to time within the same company, as managers earn trust. So founder mode will be more complicated than manager mode. But it will also work better. We already know that from the examples of individual founders groping their way toward it.

Indeed, another prediction I'll make about founder mode is that once we figure out what it is, we'll find that a number of individual founders were already most of the way there — except that in doing what they did they were regarded by many as eccentric or worse. [3]

Curiously enough it's an encouraging thought that we still know so little about founder mode. Look at what founders have achieved already, and yet they've achieved this against a headwind of bad advice. Imagine what they'll do once we can tell them how to run their companies like Steve Jobs instead of John Sculley."""

chunks = recursive_splitter(text, separators, 300)
chunk_embeddings = embed_using_late_chunking(chunks)

	from transformers import AutoTokenizer, AutoModel
	import torch


	# MODEL CKPT is downloaded from: "jinaai/jina-embeddings-v2-base-en" # has context len of 8192
	MODEL_CKPT = "/Users/rohan/3_Resources/ai_models/jina-embeddings-v2-base-en"

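	def recursive_splitter(text: str, separators: list[str], chunk_size: int) -> list[str]:
	    """Split on each separator in turn, then cap every piece at chunk_size words."""
	    if len(separators) == 0:
	        # Base case: no separators left, so fall back to fixed-size word windows.
	        # split() with no argument returns [] for blank text, so no empty chunks are produced.
	        words = text.split()
	        return [' '.join(words[i:i+chunk_size]) for i in range(0, len(words), chunk_size)]
	    ret = []
	    first_sep = separators[0]
	    # Split on the coarsest separator, then recurse with the remaining ones.
	    for chunk in text.split(first_sep):
	        ret.extend(recursive_splitter(chunk, separators[1:], chunk_size))
	    return ret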

	def embed_using_late_chunking(chunks):
	    tokenizer = AutoTokenizer.from_pretrained(MODEL_CKPT)  # this is a simple BERT tokenizer
	    # Remove the CLS and SEP tokens from the start and end of each chunk.
	    inp_tokens = [x[1:-1] for x in tokenizer(chunks)['input_ids']]
	    # offsets[k] is where chunk k starts in the concatenated sequence; index 0 holds CLS.
	    offsets = [1]
	    all_tokens = [tokenizer.cls_token_id]
	    for toks in inp_tokens:
	        offsets.append(offsets[-1] + len(toks))
	        all_tokens.extend(toks)
	    all_tokens.append(tokenizer.sep_token_id)

	    model = AutoModel.from_pretrained(MODEL_CKPT, trust_remote_code=True)
	    model.eval()
	    # input_ids must have shape (batch, seq_len), so the batch dimension goes at position 0.
	    with torch.no_grad():
	        outputs = model(input_ids=torch.tensor(all_tokens).unsqueeze(0))
	    # Mean-pool the token embeddings inside each chunk's span to get one vector per chunk.
	    return [outputs.last_hidden_state[0, i:j, :].mean(dim=-2).detach().numpy().tolist() for i, j in zip(offsets, offsets[1:])]
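
The offset bookkeeping in `embed_using_late_chunking` can be checked without downloading the model. The sketch below is a minimal, dependency-free illustration and is not part of the original gist: `build_spans` and `mean_pool_spans` are hypothetical helper names, the token ids (including the CLS/SEP ids 101 and 102) are made up, and the "hidden states" are toy 2-d vectors standing in for `last_hidden_state`.

```python
def build_spans(chunk_token_ids, cls_id, sep_id):
    """Concatenate per-chunk token ids into one sequence: CLS, chunk 0, chunk 1, ..., SEP.

    Returns the full sequence and a (start, end) span for each chunk.
    The first span starts at 1 because position 0 holds the CLS token.
    """
    all_tokens = [cls_id]
    spans = []
    for toks in chunk_token_ids:
        start = len(all_tokens)
        all_tokens.extend(toks)
        spans.append((start, len(all_tokens)))
    all_tokens.append(sep_id)
    return all_tokens, spans


def mean_pool_spans(hidden_states, spans):
    """Average the per-token vectors inside each span, giving one vector per chunk."""
    pooled = []
    for start, end in spans:
        rows = hidden_states[start:end]
        dim = len(rows[0])
        pooled.append([sum(r[d] for r in rows) / len(rows) for d in range(dim)])
    return pooled


# Made-up token ids for two chunks; 101 and 102 are placeholder CLS/SEP ids.
all_tokens, spans = build_spans([[7, 8, 9], [10, 11]], cls_id=101, sep_id=102)

# Toy "hidden states": one 2-d vector per position in all_tokens.
hidden = [[float(i), float(i) * 2.0] for i in range(len(all_tokens))]
chunk_vectors = mean_pool_spans(hidden, spans)
```

In the real function the pooled vectors differ from embedding each chunk separately, because every token's hidden state has already attended to the whole concatenated document before pooling. That only works while the concatenated sequence fits within the model's context length (8192 according to the comment next to `MODEL_CKPT`).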
