Social_Media_Crawler_2024

This repository contains my new Social Media Crawler for data analytics. It was written for German social media pages but is easily transferable to other languages.
All crawlers are written in Python. I primarily use Selenium as the framework, but I've also created some smaller crawlers using Playwright.

Currently, the project covers the platforms Facebook, Instagram, LinkedIn, TikTok, X/Twitter and YouTube. Some features, such as capturing comments, are not yet fully implemented for every platform. Updates for these features will follow in the coming days.

The functionalities of the crawlers include:

Automated logins
Several ways to get around cookie banners
Searching for and navigating to a profile (not included in all files)
Collecting the profile stats
Scrolling down the feed until a specific date is reached
Scraping the date, content, images, links, likes, number of comments and number of shares of every post
Saving the data in DataFrames
Exporting the DataFrames to an Excel file
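
The "scroll until a specific date" step from the list above can be sketched independently of the browser. This is only a minimal illustration, not the code used in this repository: `collect_until`, `make_fake_feed`, and the post fields `link` and `date` are hypothetical names. In a real crawler, `load_batch` would scroll the page once (for example with Selenium) and return all posts currently visible.

```python
from datetime import datetime
from typing import Callable, Dict, List

Post = Dict[str, object]


def collect_until(load_batch: Callable[[], List[Post]],
                  cutoff: datetime,
                  max_rounds: int = 50) -> List[Post]:
    """Call load_batch() once per scroll step and stop as soon as a post
    older than cutoff shows up, no new posts arrive, or max_rounds is hit."""
    collected: List[Post] = []
    seen = set()
    for _ in range(max_rounds):
        # Keep only posts that were not visible in an earlier round.
        new_posts = [p for p in load_batch() if p["link"] not in seen]
        if not new_posts:
            break  # feed exhausted or the page stopped loading
        for post in new_posts:
            seen.add(post["link"])
            collected.append(post)
        if min(p["date"] for p in new_posts) < cutoff:
            break  # scrolled past the target date
    # Drop the posts that are older than the cutoff.
    return [p for p in collected if p["date"] >= cutoff]


def make_fake_feed(posts: List[Post], batch_size: int = 2) -> Callable[[], List[Post]]:
    """Simulate a feed in which every scroll reveals batch_size more posts."""
    state = {"shown": 0}

    def load_batch() -> List[Post]:
        state["shown"] = min(state["shown"] + batch_size, len(posts))
        return posts[:state["shown"]]

    return load_batch


# Made-up example data, newest post first.
feed = [
    {"link": "post/1", "date": datetime(2024, 5, 10)},
    {"link": "post/2", "date": datetime(2024, 5, 1)},
    {"link": "post/3", "date": datetime(2024, 4, 20)},
    {"link": "post/4", "date": datetime(2024, 4, 1)},
    {"link": "post/5", "date": datetime(2024, 3, 15)},
]
result = collect_until(make_fake_feed(feed), cutoff=datetime(2024, 4, 15))
print([p["link"] for p in result])
```

The resulting list of dictionaries can be passed straight to `pandas.DataFrame(result)` and written out with `DataFrame.to_excel("posts.xlsx", index=False)`. The file name is a placeholder, and writing `.xlsx` files requires an engine such as openpyxl.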

Issues

On Facebook, the post date and time are no longer displayed as plain text, so I have to fall back on less precise methods such as text scraping, screenshots, and image reading (using Pillow/Pytesseract).
Some of the links are provided only in XML format, for example: <use xlink:href="#gid111" xmlns:xlink="http://www.w3.org/1999/xlink"></use>. As a result, I cannot scrape their text content directly. I devised a method to associate the visible dates with posts, but this approach introduces some inaccuracies.
Some crawlers require a headed (non-headless) browser because I use PyAutoGUI to get around bot blocking and handle unusual page settings.
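
Once a date string has been recovered by text scraping or image reading (for example from `pytesseract.image_to_string`), it still has to be turned into a timestamp. The following is a rough sketch of that step for German date text. It is not the repository's implementation: the function name `parse_german_date` and the supported formats ("12. März 2024", "24. Dezember", "3 Std.", "vor 2 Tagen") are assumptions about what such strings might look like.

```python
import re
from datetime import datetime, timedelta
from typing import Optional

# First three letters of the German month names (assumed spelling).
MONTH_PREFIXES = {
    "jan": 1, "feb": 2, "mär": 3, "apr": 4, "mai": 5, "jun": 6,
    "jul": 7, "aug": 8, "sep": 9, "okt": 10, "nov": 11, "dez": 12,
}

# German unit prefix -> timedelta keyword argument.
RELATIVE_UNITS = {
    "min": "minutes", "std": "hours", "stunde": "hours",
    "tag": "days", "woche": "weeks",
}


def parse_german_date(text: str, now: datetime) -> Optional[datetime]:
    """Parse a relative or absolute German date string; return None on failure."""
    text = text.strip().lower()

    # Relative form, e.g. "3 Std." or "vor 2 Tagen".
    m = re.match(r"(?:vor\s+)?(\d+)\s*(min|std|stunde|tag|woche)", text)
    if m:
        return now - timedelta(**{RELATIVE_UNITS[m.group(2)]: int(m.group(1))})

    # Absolute form, e.g. "12. März 2024" or "24. Dezember" (year optional).
    m = re.match(r"(\d{1,2})\.\s*([a-zäöü]+)\.?\s*(\d{4})?", text)
    if not m:
        return None
    month = MONTH_PREFIXES.get(m.group(2)[:3])
    if month is None:
        return None
    try:
        if m.group(3):
            return datetime(int(m.group(3)), month, int(m.group(1)))
        # No year given: assume the current year, or the previous year
        # if that would place the post in the future.
        candidate = datetime(now.year, month, int(m.group(1)))
        if candidate > now:
            candidate = candidate.replace(year=now.year - 1)
        return candidate
    except ValueError:
        return None  # impossible date, e.g. an OCR misread such as "31. Februar"


now = datetime(2024, 6, 15, 12, 0)
for sample in ["12. März 2024", "24. Dezember", "3 Std.", "vor 2 Tagen", "kein Datum"]:
    print(sample, "->", parse_german_date(sample, now))
```

Passing `now` explicitly keeps the function deterministic. Returning `None` on failure lets the caller mark a post's date as uncertain instead of guessing.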

I appreciate any suggested solutions to overcome these challenges.

Folders and files

.idea
.gitignore
Facebook_Crawler_Sel.py
Facebook_Crawler_Sel_2025.py
Instagram_Crawler_Sel.py
LICENSE
LinkedIn_Crawler_Sel.py
LinkedIn_post_Crawler_PW_sync.py
README.md
TikTok_Crawler_Sel.py
X_Crawler_Sel.py
YouTube_Crawler_Sel.py
aggregate_analyse_data.py
crawler_functions.py
exclude_words.txt
requirements.txt

AndreM92/Social_Media_Crawler_2024
