First, I'd like to say hello to everyone and thank you for coming.
I joined SH not that long ago (about 2 months) as a PS team member. I haven't had a chance to gather enough knowledge about how stuff works here, but let me begin.
While working on some PS projects I ran into many things that should be done only once, and were not. I want to talk about one of them in this talk.
There is a lot of duplicate code across different projects.
Even worse, there are duplicate ideas, methods, approaches and techniques.
In many projects there is custom logic for running spiders. Usually it's an entry script that is scheduled with Dash, takes some sort of input (config, args), and runs spiders.
It's possible thanks to the SC2.0 facility for executing arbitrary Python scripts and its simple cron-style scheduler. We can improve this.
Some custom code that implements scheduling
The DS team did a decent job with it. It provides snippets (some code + config: see BaseManager in ds-toolbox) to create a producer-consumer spider chain. Custom scripts (Managers) run within a single Dash project and control spider execution within another Dash project.
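To make the producer-consumer idea concrete, here is a minimal sketch using only the Python standard library. It is not the ds-toolbox BaseManager code: `producer_spider`, `consumer_spider` and `run_chain` are hypothetical names, and plain functions stand in for real spiders.

```python
import queue
import threading


def producer_spider(out_queue, seed_urls):
    """Hypothetical producer: discovers work items and hands them on."""
    for url in seed_urls:
        out_queue.put(url)
    out_queue.put(None)  # sentinel: tells the consumer there is no more work


def consumer_spider(in_queue, results):
    """Hypothetical consumer: processes items until the sentinel arrives."""
    while True:
        url = in_queue.get()
        if url is None:
            break
        results.append("scraped:" + url)  # stand-in for real scraping


def run_chain(seed_urls):
    """Plays the role of a 'manager': wires the two stages together."""
    work = queue.Queue()
    results = []
    consumer = threading.Thread(target=consumer_spider, args=(work, results))
    consumer.start()
    producer_spider(work, seed_urls)
    consumer.join()
    return results
```

Calling `run_chain(["a", "b"])` pushes both items through the chain in order. In a real setup each stage would be a separate job, and the queue would be whatever storage the two jobs share.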
Question: why not "open source" it to the whole company and even go one step further?
WE NEED A STANDALONE DEDICATED WORKFLOW MANAGER.
In my opinion, the whole company would greatly benefit from having one. Let's create a company-wide workflow manager installation so that anyone in any team can access and use it!
Do not write it from scratch (at least, at the moment)
It really doesn't matter which one we install (google for luigi, oozie, azkaban). There are many of them.
My suggestion is Airflow. I'm naming it to make the proposal sound more concrete, rather than just chit-chatting about intangible stuff.
- Code/infrastructure reusability -> don't reinvent the wheel
- One place for all workflows -> central registry
- Tons of nice features coming with it -> GUI, flexibility, etc. (in case of airflow)
- Another nice feature we can sell to customers (a service that executes large, complex workflows to support the data retrieval process)
- Right now in PS projects I'm seeing a couple of projects with really complex workflow logic: chaining/branching/dependencies. Such projects would benefit a lot.
- More flexibility for monitoring: check not only spider state but look at results.
- The DS team could enhance their managers. Or simplify them. Or get rid of them. Given that the team has done a lot to unify the codebase, it would not be that difficult to switch to a new platform. Though they usually don't need such complex logic, just the producer/consumer pattern (most of the time, and I could be mistaken here)
- All ETLs within the company would be done right. A workflow manager is the heart of a company's ETL processes; it's designed for them.
- Most simple spiders would also benefit: easy to implement retry/monitor logic. Or any other extra functionality for free.
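The chaining/dependency and retry points from the list above can be sketched in a few lines of plain Python. This is only a toy illustration of what a workflow manager does for you, not Airflow code: `run_workflow` and its parameters are hypothetical, and `graphlib` needs Python 3.9 or newer.

```python
from graphlib import TopologicalSorter


def run_workflow(tasks, dependencies, max_retries=2):
    """Run callables in dependency order, retrying failed ones.

    tasks:        dict mapping task name -> zero-argument callable
    dependencies: dict mapping task name -> set of names that must run first
    Returns a list of (task name, attempt on which it succeeded).
    """
    # Make sure tasks without dependencies are part of the graph too.
    graph = {name: dependencies.get(name, set()) for name in tasks}
    log = []
    for name in TopologicalSorter(graph).static_order():
        for attempt in range(1, max_retries + 2):
            try:
                tasks[name]()
                log.append((name, attempt))
                break
            except Exception:
                # Give up once the first try plus all retries have failed.
                if attempt == max_retries + 1:
                    raise
    return log
```

A dedicated workflow manager adds scheduling, a GUI, distributed workers and monitoring on top of this core loop, which is exactly the part we keep rewriting by hand.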
Open source, scalable Python workflow manager, now under the Apache incubator. Originally developed by Airbnb. Gaining momentum pretty fast. Pretty new, yet it is already past the early adoption phase. A workflow (== DAG, directed acyclic graph) is Python code: flexibility!
Consists of:
- Webserver (GUI + manual input)
- Scheduler (decides what to run when)
- Workers (inprocess/local/celery/mesos executors)
Local installation with docker-airflow with the DAGs (directed acyclic graphs) folder mounted. Just open your browser and watch it run. Actually, my previous experience already serves as a PoC.
Let's get a little dirty with some code
The main point of my talk is not Airflow but a common workflow process. I used Airflow to be as concrete as possible (not just an idea and/or talk but a proposal, demonstration, PoC).
In the end I want to ask for feedback. Do we really need it or am I mistaken here? Maybe some work has already been done within some team/project/anything.
Another thing I want to ask is to spread the word. Shubtalks is a nice place to start, but we have more than 35-40 people. It would be nice to have them in the loop :)
I won't go into very fundamental topics that are worth researching by themselves, such as:
- Sleep
- Physical activity/wellness
- Personal burnout
I just want to share my regular routine that helps me stay on track and get the job done.
The Pomodoro technique is the core of my work process. For those who are not familiar with it, it's just a smart way to split your time into intervals of work and rest.
Rest is a first-class citizen of productive work. It's one of those things we all know about but rarely pay due attention to.
How do I use it: I use a 25-minute work interval and a 5-minute rest, so my pomodoro is 30 minutes long. Every four pomodoros you take a long break (I use 15 minutes).
Key points here are:
- Really take breaks! I use short breaks for stretching, having a glass of water, moving around a bit. These small things contribute a lot to my physical condition. The idea here is to have some activity that speeds up blood flow, which helps brain cells get extra fuel: nutrients, oxygen, whatever. That boosts your mental performance.
- Ultra focus on one task per interval. No distractions (phone, social media, chats, nothing: use them either during breaks or in a dedicated pomodoro). For better focus I use ambient sounds in headphones (more on this later).
- A small trick is to leave for a break while you still don't want to. That way your brain keeps working in the background on what you were doing, generating insights while you consciously do other things (test that! this is really cool!)
Find more about it in this awesome Lifehacker article. It's great as an entry point.
This is a very controversial point. With the Pomodoro technique you can vary the work/rest time to fit you personally. I used to use a 45-15 split because I felt not so productive working only 25 minutes. But (!) everything changed once I switched to a height-adjustable desk.
Right now a 25-5 split fits much more naturally because, well, it's harder to stand than to sit. You tend to focus much harder for shorter periods of time (or maybe it's only me).
There is no research that really shows standing is more or less beneficial than sitting, so I won't advocate for it. I switched because it felt like a cool experiment and it stuck with me.
My main point in using a standing desk is to force myself to be more active throughout the day. Again, it's harder to stand, so you will accept small movements more easily. You're already standing: just go grab some water. Compare it to sitting: man, I'm too lazy to stand up and go do something, let's do it in an hour.
This is great for getting your brain spinning. There actually IS research that shows how the right amount of background noise affects your ability to concentrate.
So I use it. There are services that generate such noises (my choice is nosili.com). There is an intriguing service called brain.fm (AI-generated sounds/music for relaxation, productivity, etc. Let me know if we can have a paid account there). Music actually distracts me and is good only for routine tasks.
Work expands so as to fill the time available for its completion.
Limit yourself. Life is a marathon, not a sprint. Your goal is to perform steadily for a long time. Unless you know what you're doing.