Parse a website with regex and urllib in Python
Web scraping is a powerful technique for extracting data from websites, enabling automated data collection and analysis. Python provides several tools to make web scraping tasks easier through its robust ecosystem of libraries. Two commonly used libraries for web scraping are urllib and re (regular expressions).
The urllib module enables fetching web content, processing URLs, and sending HTTP requests. It provides a straightforward way to interact with web servers and retrieve HTML from web pages. The re module supports regular expressions, which are character sequences used to create search patterns for extracting specific data.
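Before working with live pages, the core idea can be illustrated on a static HTML string. The sample markup below is assumed purely for demonstration; re.findall() returns every non-overlapping match of the pattern, with each capture group's text as a list element.

```python
import re

# Sample HTML string (assumed for demonstration only)
html = '<html><head><title>Sample Page</title></head><body><a href="/home">Home</a></body></html>'

# Capture the text between <title> and </title>; (.*?) matches lazily,
# so it stops at the first closing tag
titles = re.findall(r"<title>(.*?)</title>", html)
print(titles)  # ['Sample Page']
```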
In this article, we'll focus on using urllib and re to parse websites and extract relevant information. We will examine two methods that demonstrate how to use regular expressions to obtain specific data from a webpage's HTML content −
Extracting Website Title Using urllib and Regex
This example demonstrates how to fetch HTML content using urllib and extract the website title using regular expressions −
Algorithm
Step 1 Import the required libraries urllib and re.
Step 2 Open the URL using urlopen() from urllib.request and retrieve the HTML content.
Step 3 Define the regular expression pattern for the <title> tag.
Step 4 Search for all occurrences of the pattern.
Step 5 Process and print the matching titles.
Example
import urllib.request
import re
# Open URL and retrieve HTML content
url = "https://www.tutorialspoint.com/index.htm"
response = urllib.request.urlopen(url)
html_content = response.read().decode()
# Define the regular expression pattern for title
pattern = r"<title>(.*?)</title>"
# Search for all occurrences of the pattern
matches = re.findall(pattern, html_content, re.IGNORECASE)
# Process extracted data
for match in matches:
    print("Title:", match.strip())
Output
Title: Online Courses and eBooks Library
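One caveat worth noting: by default, . in a regular expression does not match newline characters, so a <title> tag whose content spans multiple lines would be missed. Passing re.DOTALL handles that case. A small sketch, using an assumed multi-line snippet:

```python
import re

# Assumed snippet where the title content spans multiple lines
html = "<title>\n  Online Courses\n</title>"

# Without re.DOTALL, '.' does not match '\n', so nothing is found
print(re.findall(r"<title>(.*?)</title>", html))  # []

# With re.DOTALL, '.' matches newlines as well
matches = re.findall(r"<title>(.*?)</title>", html, re.DOTALL)
print(matches[0].strip())  # Online Courses
```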
Extracting URLs Using urllib and Regex
This example shows how to extract all URLs from anchor tags on a webpage using regular expressions −
Algorithm
Step 1 Import urllib and re libraries.
Step 2 Fetch HTML content using urlopen().
Step 3 Define regex pattern for href attributes.
Step 4 Find all matching URL patterns.
Step 5 Display the extracted URLs.
Example
import urllib.request
import re
# Open URL and retrieve HTML content
url = "https://www.tutorialspoint.com/index.htm"
response = urllib.request.urlopen(url)
html_content = response.read().decode()
# Define regex pattern for extracting URLs from anchor tags
pattern = r'<a\s+[^>]*href=["\']([^"\']+)["\'][^>]*>'
# Search for all occurrences of the pattern
matches = re.findall(pattern, html_content, re.IGNORECASE)
# Display first 5 URLs to avoid too much output
print("Found", len(matches), "URLs:")
for i, match in enumerate(matches[:5]):
    print(f"URL {i+1}:", match)
if len(matches) > 5:
    print("... and", len(matches) - 5, "more URLs")
Output
Found 127 URLs:
URL 1: /index.htm
URL 2: /about/about_careers.htm
URL 3: /questions/index.php
URL 4: /online_dev_tools.htm
URL 5: /codingground.htm
... and 122 more URLs
Key Considerations
When using urllib and regex for web scraping, keep these important points in mind −
Error Handling Always handle potential network errors and invalid URLs.
Regex Limitations Regular expressions work well for simple patterns but struggle with complex nested HTML.
Case Sensitivity Use the re.IGNORECASE flag for case-insensitive matching.
Performance For complex parsing tasks, consider using dedicated libraries like BeautifulSoup or lxml.
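To illustrate the error-handling point, here is a minimal sketch of wrapping urlopen() in try/except. The URL uses the reserved .invalid top-level domain, so the request is guaranteed to fail and exercise the error branch. Note that urllib.error.HTTPError is a subclass of URLError, so it must be caught first.

```python
import urllib.request
import urllib.error

def fetch(url):
    """Fetch a page's HTML, returning None on any network error."""
    try:
        with urllib.request.urlopen(url, timeout=5) as response:
            return response.read().decode()
    except urllib.error.HTTPError as e:
        # The server responded, but with an error status (404, 500, ...)
        print("Server returned an error:", e.code)
    except urllib.error.URLError as e:
        # The server could not be reached at all (bad host, no network, ...)
        print("Failed to reach the server:", e.reason)
    return None

# The reserved .invalid TLD never resolves, so this returns None
html_content = fetch("https://nonexistent.invalid/index.htm")
```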
Conclusion
Using urllib and regex provides a lightweight solution for simple web scraping tasks. While this approach works well for extracting basic patterns like titles and URLs, consider more robust HTML parsing libraries for complex scraping requirements.
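To illustrate that last point without a third-party install, Python's standard library includes html.parser, which processes HTML structurally rather than with text patterns and is not confused by attribute order or extra whitespace. A minimal sketch, using sample markup assumed for demonstration:

```python
from html.parser import HTMLParser

class LinkCollector(HTMLParser):
    """Collect the href value of every <a> tag encountered."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs for the tag's attributes
        if tag == "a":
            for name, value in attrs:
                if name == "href":
                    self.links.append(value)

parser = LinkCollector()
parser.feed('<a href="/index.htm">Home</a> <a href="/about.htm">About</a>')
print(parser.links)  # ['/index.htm', '/about.htm']
```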
