Parse a website with regex and urllib in Python
Web scraping is a powerful technique for extracting data from websites, enabling automated data collection and analysis. Python provides several tools to make web scraping tasks easier through its robust ecosystem of libraries. Two commonly used libraries for web scraping are urllib and re (regular expressions).
The urllib module enables fetching web content, processing URLs, and sending HTTP requests. It provides a straightforward way to interact with web servers and retrieve HTML from web pages. The re module supports regular expressions, which are character sequences used to create search patterns for extracting specific data.
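Before working with live pages, the core idea can be illustrated on a static HTML string. The sample markup below is assumed purely for demonstration; re.findall() returns every non-overlapping match of the pattern, with each capture group's text as a list element.

```python
import re

# Sample HTML string (assumed for demonstration only)
html = '<html><head><title>Sample Page</title></head><body><a href="/home">Home</a></body></html>'

# Capture the text between <title> and </title>; (.*?) matches lazily,
# so it stops at the first closing tag
titles = re.findall(r"<title>(.*?)</title>", html)
print(titles)  # ['Sample Page']
```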
In this article, we'll focus on using urllib and re to parse websites and extract relevant information. We will examine two methods that demonstrate how to use regular expressions to obtain specific data from a webpage's HTML content −
Extracting Website Title Using urllib and Regex
This example demonstrates how to fetch HTML content using urllib and extract the website title using regular expressions −
Algorithm
Step 1 Import the required libraries urllib and re.
Step 2 Open the URL using urlopen() from urllib.request and retrieve the HTML content.
Step 3 Define the regular expression pattern for the <title> tag.
Step 4 Search for all occurrences of the pattern.
Step 5 Process and print the matching titles.
Example
import urllib.request
import re
# Open URL and retrieve HTML content
url = "https://www.tutorialspoint.com/index.htm"
response = urllib.request.urlopen(url)
html_content = response.read().decode()
# Define the regular expression pattern for title
pattern = r"<title>(.*?)</title>"
# Search for all occurrences of the pattern
matches = re.findall(pattern, html_content, re.IGNORECASE)
# Process extracted data
for match in matches:
    print("Title:", match.strip())
Output
Title: Online Courses and eBooks Library
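One caveat worth noting: by default, . in a regular expression does not match newline characters, so a <title> tag whose content spans multiple lines would be missed. Passing re.DOTALL handles that case. A small sketch, using an assumed multi-line snippet:

```python
import re

# Assumed snippet where the title content spans multiple lines
html = "<title>\n  Online Courses\n</title>"

# Without re.DOTALL, '.' does not match '\n', so nothing is found
print(re.findall(r"<title>(.*?)</title>", html))  # []

# With re.DOTALL, '.' matches newlines as well
matches = re.findall(r"<title>(.*?)</title>", html, re.DOTALL)
print(matches[0].strip())  # Online Courses
```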
Extracting URLs Using urllib and Regex
This example shows how to extract all URLs from anchor tags on a webpage using regular expressions −
Algorithm
Step 1 Import urllib and re libraries.
Step 2 Fetch HTML content using urlopen().
Step 3 Define regex pattern for href attributes.
Step 4 Find all matching URL patterns.
Step 5 Display the extracted URLs.
Example
import urllib.request
import re
# Open URL and retrieve HTML content
url = "https://www.tutorialspoint.com/index.htm"
response = urllib.request.urlopen(url)
html_content = response.read().decode()
# Define regex pattern for extracting URLs from anchor tags
pattern = r'<a\s+[^>]*href=["\']([^"\']+)["\'][^>]*>'
# Search for all occurrences of the pattern
matches = re.findall(pattern, html_content, re.IGNORECASE)
# Display first 5 URLs to avoid too much output
print("Found", len(matches), "URLs:")
for i, match in enumerate(matches[:5]):
    print(f"URL {i+1}:", match)
if len(matches) > 5:
    print("... and", len(matches) - 5, "more URLs")
Output
Found 127 URLs:
URL 1: /index.htm
URL 2: /about/about_careers.htm
URL 3: /questions/index.php
URL 4: /online_dev_tools.htm
URL 5: /codingground.htm
... and 122 more URLs
Key Considerations
When using urllib and regex for web scraping, keep these important points in mind −
Error Handling Always handle potential network errors and invalid URLs.
Regex Limitations Regular expressions work well for simple patterns but struggle with complex nested HTML.
Case Sensitivity Use the re.IGNORECASE flag for case-insensitive matching.
Performance For complex parsing tasks, consider using dedicated libraries like BeautifulSoup or lxml.
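To illustrate the error-handling point, here is a minimal sketch of wrapping urlopen() in try/except. The URL uses the reserved .invalid top-level domain, so the request is guaranteed to fail and exercise the error branch. Note that urllib.error.HTTPError is a subclass of URLError, so it must be caught first.

```python
import urllib.request
import urllib.error

def fetch(url):
    """Fetch a page's HTML, returning None on any network error."""
    try:
        with urllib.request.urlopen(url, timeout=5) as response:
            return response.read().decode()
    except urllib.error.HTTPError as e:
        # The server responded, but with an error status (404, 500, ...)
        print("Server returned an error:", e.code)
    except urllib.error.URLError as e:
        # The server could not be reached at all (bad host, no network, ...)
        print("Failed to reach the server:", e.reason)
    return None

# The reserved .invalid TLD never resolves, so this returns None
html_content = fetch("https://nonexistent.invalid/index.htm")
```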
Conclusion
Using urllib and regex provides a lightweight solution for simple web scraping tasks. While this approach works well for extracting basic patterns like titles and URLs, consider more robust HTML parsing libraries for complex scraping requirements.
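To illustrate that last point without a third-party install, Python's standard library includes html.parser, which processes HTML structurally rather than with text patterns and is not confused by attribute order or extra whitespace. A minimal sketch, using sample markup assumed for demonstration:

```python
from html.parser import HTMLParser

class LinkCollector(HTMLParser):
    """Collect the href value of every <a> tag encountered."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs for the tag's attributes
        if tag == "a":
            for name, value in attrs:
                if name == "href":
                    self.links.append(value)

parser = LinkCollector()
parser.feed('<a href="/index.htm">Home</a> <a href="/about.htm">About</a>')
print(parser.links)  # ['/index.htm', '/about.htm']
```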
