Last active
November 19, 2021 10:12
Scraping a full Facebook group page from a browser
These are a few commands that can be used to scrape a full group page
from Facebook. One could use the Graph API instead, but some users would
be hidden there. The JS commands should be run in a browser console; they
scroll through the page, opening up hidden content and comments. I used
Chrome. Once enough content is opened, save the page as you would any
other and analyse its contents.
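// 1. Load the group page in the browser and open the developer console
// 2. Start scrolling. Every 3 seconds this removes all images, to keep the
// page small in memory, and scrolls to the bottom so that more posts load.
// document.querySelectorAll is used instead of the console-only $$ helper,
// which may not be available inside timer callbacks.
var scrollTimer = setInterval(function() {
  var imgs = document.querySelectorAll("img");
  for (var i = 0; i < imgs.length; i++) imgs[i].parentNode.removeChild(imgs[i]);
  window.scrollTo(0, document.body.scrollHeight);
}, 3000);
// 3. Stop scrolling when satisfied
clearInterval(scrollTimer);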
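// 4. Add a guard against reloading the page. If a reload is triggered, this
// stops the uncover timer from step 5 and makes the browser ask for confirmation
window.onbeforeunload = function() {
  clearInterval(window.uncover);
  return "Loading hidden comments stopped.";
};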
// 5. Load hidden comments and posts. Expanding some posts may try to reload
// the page. In that case the guard above stops this timer and the browser
// asks for confirmation: press Cancel to stay on the page, then run this
// command again
window.uncover = setInterval(function() {
  var imgs = document.querySelectorAll("img");
  for (var i = 0; i < imgs.length; i++) imgs[i].parentNode.removeChild(imgs[i]);
  // The exact-match selectors skip links already marked "passed" below
  var a = document.querySelectorAll("a[class='see_more_link']");
  if (a.length > 0) {
    a[0].target = "_blank";
    a[0].click();
    a[0].className = "see_more_link passed";
  }
  var b = document.querySelectorAll("a[class='UFIPagerLink']");
  if (b.length > 0) {
    b[0].click();
    b[0].className = "UFIPagerLink passed";
  }
  // Number of "see more" links and comment pager links found on this pass
  console.log(a.length + " " + b.length);
}, 1000);
// 6. When everything is loaded, stop the comment/post uncover process
clearInterval(window.uncover);
// 7. Save the page from the browser
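Step 3 relies on stopping the scroll timer by hand. The following is a minimal sketch of one way to automate that, assuming the end of the group has been reached once `document.body.scrollHeight` stops growing for several ticks in a row. The names `makeStallDetector`, `maxStalls` and `autoScroll`, and the threshold of 5 ticks, are hypothetical and not part of the original commands.

```javascript
// Hypothetical helper: returns a function that is fed the current page
// height on each tick and reports true once the height has stayed the
// same for `maxStalls` consecutive ticks.
function makeStallDetector(maxStalls) {
  var lastHeight = -1;
  var stalls = 0;
  return function (height) {
    if (height === lastHeight) {
      stalls++;
    } else {
      stalls = 0;
      lastHeight = height;
    }
    return stalls >= maxStalls;
  };
}

// Browser-only usage, guarded so the helper can also be loaded elsewhere.
if (typeof document !== "undefined") {
  var stalled = makeStallDetector(5); // assumption: 5 unchanged ticks means done
  var autoScroll = setInterval(function () {
    window.scrollTo(0, document.body.scrollHeight);
    if (stalled(document.body.scrollHeight)) {
      clearInterval(autoScroll);
      console.log("Page height stopped growing; scrolling stopped.");
    }
  }, 3000);
}
```

A slow connection can make the page look finished when it is not, so a larger threshold or a manual check before saving may be safer.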
$$ wasn't working for me whilst inside setInterval (but worked fine when run manually) due to scoping; binding it to querySelectorAll on the document manually before trying to run setInterval fixed it.
const $$ = document.querySelectorAll.bind(document);
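The `.bind(document)` part matters because `querySelectorAll` needs `this` to be the document, and a bare reference to the method loses it. Here is a minimal sketch of the same effect that runs outside a browser; `fakeDocument` is a hypothetical stand-in object, not a real DOM API.

```javascript
// Hypothetical stand-in for `document`: a method that depends on `this`.
var fakeDocument = {
  nodes: ["a", "img", "a"],
  querySelectorAll: function (selector) {
    return this.nodes.filter(function (n) { return n === selector; });
  }
};

// A bare reference loses `this`, so the call throws.
var unbound = fakeDocument.querySelectorAll;
var unboundFailed = false;
try {
  unbound("a");
} catch (err) {
  unboundFailed = true;
}

// bind() fixes `this` to the object, mirroring the one-liner above.
var bound = fakeDocument.querySelectorAll.bind(fakeDocument);
var matches = bound("a");
```

In a real browser the unbound call fails in the same way, which is why the alias has to be created with `bind`.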
Hello,
I use my own Facebook group scraping tool (written in Java), which runs every Z minutes and scrapes the last N posts in the specified group.
For my needs, the posts are stored in Firebase.
I am going to publish this tool for everyone; I just need to prepare some executable files.
The program will be available here - http://bit.ly/3bbtJA0
Meanwhile, you may request an output format or some extra logic that you need added to the tool.
Here are my contacts
email - [email protected]
telegram - https://t.me/postullat
Skype - [email protected]
Feel free to contact me
Best regards,
Vova