Preventing Content Scraping on Your WordPress Site
Learn how to protect your WordPress content from automated scraping and theft. Implement effective measures against content copying and republishing.
Content scraping threatens websites by stealing valuable content for unauthorized republishing. Protecting your WordPress content requires multiple defensive layers against automated and manual copying.
Understanding Content Scraping
Scrapers use various methods to steal content:
- Automated bots crawling your site
- RSS feed harvesting
- API exploitation
- Manual copy-paste operations
- Browser automation tools
Why Scrapers Target Your Content
- Building competing websites
- Creating spam sites for ads
- Training AI models
- SEO manipulation
- Republishing for profit
Identifying Scraping Activity
Signs of Scraping
- Unusual traffic patterns
- High bandwidth usage
- Rapid sequential page requests
- Requests without typical browser headers
- Content appearing on other sites
Monitoring Tools
- Server access logs analysis
- Google Alerts for content
- Copyscape for plagiarism detection
- Traffic analytics for patterns
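As a sketch of the log-analysis step, the snippet below counts requests per IP across a slice of access-log lines and flags addresses that exceed a threshold, a crude way to surface the rapid sequential requests described above. The sample log lines and the threshold are illustrative, not real traffic.

```javascript
// Count requests per IP in common-log-format lines and flag
// heavy hitters. The first whitespace-separated field of each
// line is assumed to be the client IP.
function findHeavyHitters(logLines, threshold) {
  const counts = {};
  for (const line of logLines) {
    const ip = line.split(' ')[0];
    if (ip) counts[ip] = (counts[ip] || 0) + 1;
  }
  return Object.keys(counts).filter(ip => counts[ip] >= threshold);
}

// Example: three rapid hits from one IP within the sampled window
const sample = [
  '203.0.113.9 - - [10/Oct/2025:13:55:36] "GET /post-1/ HTTP/1.1" 200',
  '203.0.113.9 - - [10/Oct/2025:13:55:37] "GET /post-2/ HTTP/1.1" 200',
  '203.0.113.9 - - [10/Oct/2025:13:55:38] "GET /post-3/ HTTP/1.1" 200',
  '198.51.100.4 - - [10/Oct/2025:13:55:40] "GET /about/ HTTP/1.1" 200',
];
console.log(findHeavyHitters(sample, 3)); // → [ '203.0.113.9' ]
```

In practice you would run this over a recent window of your server's access log and feed the flagged IPs into whatever blocking mechanism you use.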
Technical Prevention Methods
Rate Limiting
// Simple per-IP rate limiting using transients.
// Note: behind a proxy or CDN, REMOTE_ADDR may hold the proxy's
// address; use your platform's forwarded-IP header instead.
add_action('init', function() {
    $ip  = $_SERVER['REMOTE_ADDR'] ?? '';
    $key = 'rate_limit_' . md5($ip);
    $requests = (int) get_transient($key);
    if ($requests >= 60) { // limit: 60 requests per minute
        wp_die('Too many requests. Please slow down.', 'Rate Limited', 429);
    }
    // Each hit refreshes the 60-second window, so an active scraper
    // stays throttled until it backs off.
    set_transient($key, $requests + 1, 60);
});
Bot Detection
// Block requests whose user agent matches a known scraper
add_action('init', function() {
    $user_agent = $_SERVER['HTTP_USER_AGENT'] ?? '';
    $bad_bots = array(
        'HTTrack', 'WebCopier', 'Offline Explorer',
        'SiteSucker', 'WebReaper', 'Teleport',
    );
    foreach ($bad_bots as $bot) {
        if (stripos($user_agent, $bot) !== false) {
            wp_die('Access denied', 'Forbidden', 403);
        }
    }
});
Honeypot Traps
Create hidden links that humans won't see but bots will follow:
<!-- Hidden link in the footer: humans never see it, crawlers follow it -->
<a href="/trap-page/" style="display:none" rel="nofollow">Click here</a>
// Record IPs that visit the trap page
add_action('template_redirect', function() {
    if (is_page('trap-page')) {
        $ip = $_SERVER['REMOTE_ADDR'] ?? '';
        $blocked = get_option('blocked_scrapers', []);
        if ($ip && !in_array($ip, $blocked, true)) {
            $blocked[] = $ip;
            update_option('blocked_scrapers', $blocked);
        }
    }
});
// Enforce the blocklist on every request
add_action('init', function() {
    $ip = $_SERVER['REMOTE_ADDR'] ?? '';
    if (in_array($ip, get_option('blocked_scrapers', []), true)) {
        wp_die('Access denied', 'Forbidden', 403);
    }
});
RSS Feed Protection
Limit Feed Content
// Show excerpts only in feeds
add_filter('the_content_feed', function($content) {
    global $post;
    return '<p>' . get_the_excerpt($post) . '</p>';
});
Add Attribution to Feeds
// Append a source link to each feed item
add_filter('the_content_feed', function($content) {
    $link = esc_url(get_permalink());
    $content .= '<p>Originally published at <a href="' . $link . '">' . $link . '</a></p>';
    return $content;
});
JavaScript-Based Protection
Disable Right-Click
// Discourage casual copying
document.addEventListener('contextmenu', function(e) {
    e.preventDefault();
    alert('Content is protected');
});
document.addEventListener('selectstart', function(e) {
    e.preventDefault();
});
Note: This only deters casual copying. Determined scrapers bypass JavaScript easily.
Lazy Loading Content
Load content after the initial page load to complicate scraping (note that this can also hide the content from search engines that don't execute your scripts):
// Load content via AJAX after page load
// ('/api/content/' stands in for an endpoint you expose server-side)
jQuery(document).ready(function($) {
    $('.protected-content').each(function() {
        var container = $(this);
        $.get('/api/content/?id=' + container.data('id'), function(data) {
            container.html(data);
        });
    });
});
robots.txt Configuration
# Block known scrapers
User-agent: HTTrack
Disallow: /
User-agent: WebCopier
Disallow: /
User-agent: Offline Explorer
Disallow: /
# Limit crawl rate for all bots
User-agent: *
Crawl-delay: 10
Keep in mind that robots.txt is purely advisory: well-behaved crawlers honor it, scrapers generally ignore it, and Google ignores Crawl-delay entirely.
Legal Protections
- Display clear copyright notices
- Include terms of service
- Use DMCA takedown procedures
- Register important content with copyright office
Content Watermarking
Embed invisible markers in content:
- Zero-width characters between words
- Invisible spans with tracking codes
- Unique word variations per visitor
- Image watermarks
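The zero-width-character technique can be sketched in plain JavaScript: encode a per-visitor ID as invisible characters (U+200B and U+200C) injected into the text, then decode it from any copied passage to trace the source. The bit-encoding scheme below is an illustrative assumption, not a standard.

```javascript
const ZW0 = '\u200B'; // zero-width space      -> bit 0
const ZW1 = '\u200C'; // zero-width non-joiner -> bit 1

// Encode a numeric visitor ID as an invisible bit string.
function idToMarker(id, bits = 16) {
  let marker = '';
  for (let i = bits - 1; i >= 0; i--) {
    marker += ((id >> i) & 1) ? ZW1 : ZW0;
  }
  return marker;
}

// Inject the marker after the first word; the visible text is unchanged.
function watermark(text, id) {
  return text.replace(' ', ' ' + idToMarker(id));
}

// Recover the ID from (possibly copied) watermarked text.
function extractId(text, bits = 16) {
  const zw = text.split('').filter(c => c === ZW0 || c === ZW1);
  if (zw.length < bits) return null;
  return zw.slice(0, bits).reduce((id, c) => (id << 1) | (c === ZW1 ? 1 : 0), 0);
}

const marked = watermark('Protect your content today', 42);
console.log(extractId(marked)); // → 42
```

If scraped text later appears elsewhere, extracting the ID tells you which visitor (or session) the copy was served to. Be aware that scrapers who know the trick can strip zero-width characters with a single regex.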
Cloudflare Protection
Use Cloudflare or similar services for:
- Bot detection and blocking
- Rate limiting
- Challenge pages for suspicious traffic
- Traffic analytics
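Cloudflare's WAF custom rules use a filter-expression language. As a rough sketch (the exact user-agent list is an assumption to adapt, and you would pick an action such as Managed Challenge or Block in the dashboard), a rule targeting the same scraper user agents as the PHP snippet above could look like:

```
(http.user_agent contains "HTTrack")
or (http.user_agent contains "WebCopier")
or (http.user_agent contains "SiteSucker")
```

Pairing this with Cloudflare's verified-bot signal (`cf.client.bot`) lets you exempt legitimate crawlers such as Googlebot while challenging everything else that matches.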
Conclusion
No solution completely prevents determined scrapers, but combining multiple methods significantly reduces content theft. Focus on detection, deterrence, and legal recourse for comprehensive protection.
Written by Sarah Chen
WP Folder Shield Team