The Best Web Crawler for Scraping Reddit
Do I need to spell out the benefits of scraping Reddit, a social platform with more than 430 million monthly active users? Not really, especially when 50% of its traffic comes from the USA. If you are into data science, you already know that platforms like Reddit, Quora, StackExchange, Facebook groups, LinkedIn, Pinterest, Instagram, etc. are a data goldmine that lets your brand:
- Conduct consumer research
- Track customer sentiment
- Track your competitors
- Research the pain points of your target customers
- Better understand your target customers’ lingo to launch effective marketing campaigns
- Stay on top of market trends
- Monitor the impact of your marketing campaigns
- Generate new leads
- Build NLP applications
A lot more is possible with web scraping. In this insight, we walk you through a cost- and time-efficient method to scrape reddit.com. Let’s get started.
What should be scraped from Reddit?
You can scrape the following data points from reddit.com:
- List of subreddits
- Submissions for each subreddit
- Content of each submission [posts]
You can also scrape any other information from the subreddits that is relevant to your industry or business. For example:
- A scraping company can scrape all the questions from the subreddit r/webscraping, analyze the posts, and accordingly plan what topics to consider for content strategy.
- A fashion brand can scrape all links, comment texts, titles, captions, images, etc. in fashion subreddits like r/streetwear, then run text analytics and machine learning algorithms to identify color trends, discover the pain points fashionistas have with various brands, and devise the right pricing strategy.
- Trading and investing firms can scrape “stock market” related subreddits to devise an investing plan by analyzing which stocks are being discussed and preparing a ticker list accordingly.
- A job aggregator can scrape Reddit posts for collecting info around new vacancies. Similarly, HR firms can scrape Reddit posts to find candidates looking for new jobs.
- News and journalism players can scrape author posts with blog links to train ML algorithms for auto text summarization.
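To make the trading example above a little more concrete, here is a minimal Python sketch of building a ticker list from scraped post titles. It assumes tickers are written as cashtags such as $TSLA, which is only one convention; the function name and the sample titles are made up for illustration.

```python
import re
from collections import Counter

# Assumption: tickers appear as cashtags, e.g. "$TSLA" (1 to 5 capital letters).
TICKER_RE = re.compile(r"\$([A-Z]{1,5})\b")

def count_tickers(titles):
    """Count in how many titles each ticker is mentioned."""
    counts = Counter()
    for title in titles:
        # Use a set so a ticker repeated within one title counts once.
        counts.update(set(TICKER_RE.findall(title)))
    return counts

# Hypothetical titles standing in for scraped Reddit posts.
titles = [
    "$TSLA to the moon",
    "Is $AAPL or $TSLA the better buy?",
    "No tickers here",
]
print(count_tickers(titles).most_common())  # prints [('TSLA', 2), ('AAPL', 1)]
```

A real pipeline would also need to filter out false positives and validate matches against an actual list of listed symbols.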
How to scrape Reddit.com?
There are five ways to scrape reddit.com:
- Manual Scraping
- Reddit API
- Sugar Coated Third-Party APIs
- Custom Scraping Scripts
- Using “Click Once & Scrape Repetitively” Web Scraping Tools
Benefits & challenges of using different scraping techniques:
- Manual scraping is the easiest method but the least efficient in terms of speed and cost. It does, however, yield highly consistent data.
- The Reddit API provides data easily, but you need at least basic coding skills to use it. The API also limits the number of posts it returns for any Reddit listing to 1,000, so it’s not possible to scrape anything beyond the top 1,000 posts with it.
- Third-party API services are an effective and scalable approach, but they are not cost-efficient.
- Custom scraping scripts are highly customizable and scalable, but they require strong programming skills and are cost-intensive, just like third-party API services.
- Enter “click and scrape” tools. These are scalable and require only basic know-how of using a mouse. Knowledge of XPath or RegEx is beneficial but not required, as good “click and scrape” tools have built-in functionality to extract XPath or generate RegEx.
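As a rough illustration of the API route and its 1,000-post ceiling, the sketch below pages through a subreddit’s top posts using Reddit’s JSON listing format, following the `after` cursor from page to page. It is a sketch under assumptions, not an official client: it assumes the unauthenticated `top.json` endpoint is reachable from your network (Reddit may require OAuth credentials or rate-limit such requests), and the function names, the User-Agent string, and the selected fields are illustrative.

```python
import json
import urllib.parse
import urllib.request

def listing_url(subreddit, after=None, limit=100):
    """Build the URL for one page of a subreddit's all-time top listing."""
    params = {"t": "all", "limit": limit}
    if after:
        params["after"] = after  # cursor returned by the previous page
    query = urllib.parse.urlencode(params)
    return f"https://www.reddit.com/r/{subreddit}/top.json?{query}"

def fetch_json(url):
    """Download and decode one listing page (needs network access)."""
    request = urllib.request.Request(
        url, headers={"User-Agent": "reddit-listing-demo/0.1"}  # illustrative
    )
    with urllib.request.urlopen(request, timeout=30) as response:
        return json.load(response)

def iter_top_posts(subreddit, fetch=fetch_json, max_posts=1000):
    """Yield posts page by page until the listing ends or max_posts is hit."""
    after, seen = None, 0
    while seen < max_posts:
        page = fetch(listing_url(subreddit, after))["data"]
        for child in page["children"]:
            post = child["data"]
            yield {
                "title": post["title"],
                "score": post["score"],
                "permalink": post["permalink"],
            }
            seen += 1
            if seen >= max_posts:
                return
        after = page.get("after")
        if not after:  # no cursor means there are no further pages
            return
```

Calling `list(iter_top_posts("technology"))` would collect the posts; the `fetch` parameter exists so the network call can be swapped out, for example when testing.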
Which method should you choose for scraping reddit.com?
- If your Reddit scraping requirements are small, go for manual scraping. For example, if you only need to scrape three or four Reddit threads on a particular topic, manual scraping is the better choice.
- For large Reddit scraping requirements, you must leverage automated scraping methodologies like custom scripts, API services, or “click and scrape” tools.
- Prefer hiring web scraping developers and data testing, cleansing, and validation engineers if you have a high budget and your daily Reddit scraping requirements are way past a few million posts. Remember, this is a resource-intensive option. You’ll need human, computing, and networking resources on top of web-scraping-specific resources, e.g., proxy services, databases, etc.
- If your daily scraping requirements are within a few million posts or rows of data, then using “click and scrape” tools would be more cost & resource-efficient.
Scraping Reddit.com with Octoparse:
New reddit.com has an infinite scroll feature, which makes it tricky to scrape, but with Octoparse scraping new Reddit is a cakewalk. It hardly takes 5 minutes to set up and start scraping data from Reddit. For the demo in this insight, we shall set up scraping for old Reddit.
What shall we scrape? Ummm.. let’s scrape all-time top posts in the technology subreddit:
Getting Started With The Reddit Scraper:
- We shall be using Octoparse advanced mode for this tutorial to generate the template for scraping the old Reddit.
PS: If you want to scrape other social media websites, we have ready-to-use templates for 100+ popular websites like Amazon, Walmart, Booking, Facebook, Instagram, Yelp, Tripadvisor, YouTube, Crunchbase, etc. Below is a screenshot of the pre-built templates for different social media websites.
Enter the URL in the Website field. You’ll get this screen after you log in and click on the “Task” button under advanced mode. This is the starting URL for your Reddit scraping template.
Then click on the Save button. If you don’t see it on the screen, you may need to scroll down a bit.
Creating Pagination For The Reddit Scraper:
Scraping old Reddit is damn easy, but we do need to create pagination. In new Reddit, you would instead need to set “Scroll Down” to crawl all the posts from a Reddit URL, as it has an infinite scroll feature.
After saving the target URL, you should see the following screen:
The left panel has links to scraping resources. The right component has two halves:
- The top half is a template configuration component.
- The bottom half lets us interact and set up the scraping steps.
To create the pagination, scroll down to the bottom of the page and locate a button with “next” as the button text.
We need to loop click this next button to create pagination. Click on the Next button and in “Action Tips” select “Loop click this element”.
By default, you now get the screen below, as Octoparse detects AJAX on the Reddit website:
But on old Reddit, clicking the next button triggers a full-page reload rather than an AJAX request. So, we need to uncheck two boxes:
- Load the page with AJAX
- Click Items in the Loop
Our screen should look like this:
Note: if you’re extracting new Reddit pages, which load with AJAX, check these boxes. An AJAX timeout anywhere between 2 and 6 is recommended; if both your own and Reddit’s network bandwidth are good and consistent, you can keep the AJAX timeout at 3 or 4. If you’re going for cloud extraction with Octoparse, where you get high bandwidth, a low AJAX timeout should work well.
Click on “Customize” to use a custom XPath for the clicked element.
To compute the XPath for the Next button, use Chrome developer tools if you’re acquainted with them. Otherwise, try the built-in XPath tool provided by Octoparse, as you can see in the image below:
Press OK. Press Save.
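For readers curious about what this “loop click” pagination amounts to outside Octoparse, here is a minimal Python sketch. It assumes that old Reddit wraps the next link in a `<span class="next-button">` element (check this in your browser’s developer tools, as the markup may change), and it uses tiny well-formed stand-in pages because the standard-library `xml.etree.ElementTree` parser cannot handle real-world HTML; an HTML-aware parser would be needed for live pages. The names `next_page_url`, `crawl`, and `load_page` are hypothetical.

```python
import xml.etree.ElementTree as ET

# Assumed markup: old Reddit wraps the "next" link in <span class="next-button">.
NEXT_LINK_PATH = ".//span[@class='next-button']/a"

def next_page_url(page_root):
    """Return the href of the "next" link, or None on the last page."""
    link = page_root.find(NEXT_LINK_PATH)
    return link.get("href") if link is not None else None

def crawl(start_url, load_page, max_pages=40):
    """Follow "next" links page by page, like a loop click with full reloads.

    load_page is a hypothetical callable that takes a URL and returns the
    parsed page as an Element.
    """
    visited, url = [], start_url
    while url and len(visited) < max_pages:
        visited.append(url)
        url = next_page_url(load_page(url))
    return visited

# Tiny stand-in pages so the sketch runs without network access.
PAGES = {
    "page1": ET.fromstring(
        '<div><span class="next-button"><a href="page2">next</a></span></div>'
    ),
    "page2": ET.fromstring("<div><p>no more posts</p></div>"),
}
print(crawl("page1", PAGES.__getitem__))  # prints ['page1', 'page2']
```

The `max_pages` guard plays the same role as a loop limit in the tool: it stops the crawl even if the site keeps offering a next link.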
Extracting data from Reddit:
Before we start extracting data, we need to perform two actions:
- Click on Go to web page
- Click on Pagination Box [Not on “click to paginate”].
Now we need to extract data from the Reddit post listings, i.e., the recurring HTML elements that contain Reddit posts.
The approach is to:
- Select the boxes or components with all the data points and
- Select the data-points within those boxes.
Refer to this link for more on extracting data from recurring HTML elements using Octoparse.
Next, we rename the data fields to our desired names. This is how our crawler looks now:
The last thing we need to do is set the “Loop Item” to a variable list by using the following XPath: //div[contains(@id,"siteTable")]/div[contains(@data-promoted,"false")]
This is important because Reddit post listings include promoted posts that won’t always be in the same listing position, so our XPath must be customized. Also, the loop must be terminated after 25 items, as that’s the default number of posts on a single Reddit post listing page. You can set “End loop when -> Execution times reach -> 25”. Otherwise, it becomes an infinite loop and never exits.
Click on “OK”, then save the template by clicking on “Save”.
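To see what the loop-item XPath selects, the following Python sketch applies the same logic by hand: find the container whose id contains "siteTable", keep only its direct child divs whose data-promoted attribute contains "false", and stop after 25 items. The standard-library `xml.etree.ElementTree` module does not support `contains()`, so the two conditions are written as plain Python checks; the sample markup is a simplified, made-up stand-in for a real listing page, and the function names are illustrative.

```python
import xml.etree.ElementTree as ET

# Simplified stand-in for an old Reddit listing page (not real markup).
SAMPLE = """
<div id="siteTable">
  <div class="thing" data-promoted="true"><a class="title">Sponsored link</a></div>
  <div class="thing" data-promoted="false"><a class="title">First organic post</a></div>
  <div class="clearleft"></div>
  <div class="thing" data-promoted="false"><a class="title">Second organic post</a></div>
</div>
""".strip()

def loop_items(root, max_items=25):
    """Return non-promoted post boxes, capped at max_items per page."""
    items = []
    for container in root.iter("div"):
        # Equivalent of //div[contains(@id,"siteTable")]
        if "siteTable" not in container.get("id", ""):
            continue
        # Equivalent of /div[contains(@data-promoted,"false")], direct children only
        for child in container.findall("div"):
            if "false" in child.get("data-promoted", ""):
                items.append(child)
                if len(items) >= max_items:
                    return items
    return items

def extract_title(item):
    """Pull one data point out of a selected box."""
    return item.find("a[@class='title']").text

root = ET.fromstring(SAMPLE)
print([extract_title(item) for item in loop_items(root)])
# prints ['First organic post', 'Second organic post']
```

This mirrors the two-step approach described earlier: first select the recurring boxes, then select the data points inside each box.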
Now, let’s start scraping Reddit posts by clicking on “Start extraction”.
On the next screen, choose “Local Extraction”. You may also choose cloud extraction, create an API, schedule your Reddit crawler, etc. For the demo, let’s stick to “Local Extraction”.
Yayy!!!! It starts scraping. Here is how it looks:
I force-stopped the data extraction after 100 posts. You can scrape as much as you want. Do observe that the data:
- Looks consistent, and
- Got scraped at admirable speed.
In this insight, we explored how scraping Reddit can be useful to businesses, looked at five different ways of scraping Reddit, and discussed how to decide which method to adopt. Finally, we scraped old reddit.com using Octoparse. Was it not a breeze? Here are a few reasons why I recommend Octoparse as your go-to scraping tool:
- It’s cost-effective.
- Scalable scraping.
- Easy to use solution for clicking and scraping data.
- Small learning curve.
- Extensively documented, well-detailed tutorials.
- Ready to use templates for scraping popular websites across industry verticals.
- Lets you configure browser user-agents and block ads.
- Provides you with options to customize your templates with XPath & RegEx.
- Supports scraping AJAX websites.
- Has in-built support for handling anti-scraping website configurations.
If you have any questions or need help setting up a Reddit crawler, please contact us.