Scrape
CSS selector-based extraction for pyfetcher.
- Purpose:
Provide ergonomic functions for extracting data from HTML using CSS selectors via BeautifulSoup. Covers common patterns: selecting elements, extracting text, extracting attributes, and parsing HTML tables.
Examples
>>> html = "<div class='item'>Hello</div><div class='item'>World</div>"
>>> extract_text(html, ".item")
['Hello', 'World']
- pyfetcher.scrape.selectors.select(html, selector)
Select all elements matching a CSS selector.
- Parameters:
- Returns:
A list of matching bs4.Tag objects.
- Return type:
list[Tag]
Examples
>>> html = "<ul><li>A</li><li>B</li></ul>"
>>> tags = select(html, "li")
>>> len(tags)
2
- pyfetcher.scrape.selectors.select_one(html, selector)
Select the first element matching a CSS selector.
- Parameters:
- Returns:
The first matching bs4.Tag, or None if not found.
- Return type:
Tag | None
Examples
>>> html = "<h1>Title</h1><h1>Subtitle</h1>"
>>> tag = select_one(html, "h1")
>>> tag.get_text()
'Title'
- pyfetcher.scrape.selectors.extract_text(html, selector, *, strip=True)
Extract text content from all elements matching a CSS selector.
- Parameters:
- Returns:
A list of text strings from matching elements.
- Return type:
list[str]
Examples
>>> html = "<p>Hello</p><p>World</p>"
>>> extract_text(html, "p")
['Hello', 'World']
- pyfetcher.scrape.selectors.extract_attrs(html, selector, *, attrs=None)
Extract attributes from all elements matching a CSS selector.
If attrs is not specified, all attributes of each element are returned. If attrs is a list of attribute names, only those attributes are included (with None for missing attributes).
- Parameters:
- Returns:
A list of dictionaries mapping attribute names to values.
- Return type:
Examples
>>> html = '<a href="/about">About</a><a href="/home">Home</a>'
>>> extract_attrs(html, "a", attrs=["href"])
[{'href': '/about'}, {'href': '/home'}]
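The attrs rule described above can be illustrated on plain dictionaries, without pyfetcher or BeautifulSoup. This is a minimal sketch: `filter_attrs` is a hypothetical helper written for illustration, not part of the library, and it assumes attribute values are plain strings.

```python
from typing import Optional


def filter_attrs(
    element_attrs: dict[str, str],
    attrs: Optional[list[str]] = None,
) -> dict[str, Optional[str]]:
    """Hypothetical helper mirroring the documented `attrs` behaviour."""
    if attrs is None:
        # No filter given: return every attribute of the element.
        return dict(element_attrs)
    # Filter given: keep only the requested names, with None for missing ones.
    return {name: element_attrs.get(name) for name in attrs}


link = {"href": "/about", "class": "nav"}
print(filter_attrs(link))                     # {'href': '/about', 'class': 'nav'}
print(filter_attrs(link, ["href", "title"]))  # {'href': '/about', 'title': None}
```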
- pyfetcher.scrape.selectors.extract_table(html, selector='table', *, include_headers=True)
Extract data from an HTML table as a list of rows.
Parses the first <table> element matching the selector. If include_headers is True, the first row will contain header cell (<th>) text. Subsequent rows contain data cell (<td>) text.
- Parameters:
- Returns:
A list of rows, where each row is a list of cell text strings.
- Return type:
list[list[str]]
Examples
>>> html = "<table><tr><th>Name</th></tr><tr><td>Alice</td></tr></table>"
>>> extract_table(html)
[['Name'], ['Alice']]
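A list of rows with the header first is often easier to work with as a list of records. The sketch below pairs each data row with the header row. `rows_to_records` is a hypothetical helper, and it assumes the input has the shape returned with include_headers=True.

```python
def rows_to_records(rows: list[list[str]]) -> list[dict[str, str]]:
    """Pair each data row with the header row (assumed to be rows[0])."""
    if not rows:
        return []
    headers, *data = rows
    # zip() stops at the shorter sequence, so ragged rows lose extra cells.
    return [dict(zip(headers, row)) for row in data]


rows = [["Name", "Role"], ["Alice", "Admin"], ["Bob", "User"]]
records = rows_to_records(rows)
print(records[0])  # {'Name': 'Alice', 'Role': 'Admin'}
```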
Link extraction for pyfetcher.
- Purpose:
Harvest and normalize links from HTML documents, supporting filtering by domain, scheme, and link attributes.
Examples
>>> html = '<a href="https://example.com">Example</a>'
>>> links = extract_links(html, base_url="https://example.com")
>>> links[0].url
'https://example.com'
- class pyfetcher.scrape.links.LinkInfo(url, text, rel, is_external)
Extracted link information.
- Parameters:
Examples
>>> link = LinkInfo(
...     url="https://example.com", text="Example",
...     rel=None, is_external=False,
... )
>>> link.url
'https://example.com'
- pyfetcher.scrape.links.extract_links(html, *, base_url=None, same_domain_only=False, include_fragments=False)
Extract and normalize links from HTML.
Parses all <a> tags with href attributes and resolves relative URLs against base_url. Optionally filters to same-domain links only and controls whether fragment-only links are included.
- Parameters:
html (str) – Raw HTML string to parse.
base_url (str | None) – Base URL for resolving relative hrefs. Required for accurate is_external detection and relative URL resolution.
same_domain_only (bool) – If True, only return links pointing to the same domain as base_url.
include_fragments (bool) – If True, include fragment-only links (e.g. #section). Defaults to False.
- Returns:
A list of LinkInfo objects, one per extracted link.
- Return type:
list[LinkInfo]
Examples
>>> html = '<a href="/about">About</a><a href="https://other.com">Other</a>'
>>> links = extract_links(html, base_url="https://example.com")
>>> len(links)
2
>>> links = extract_links(html, base_url="https://example.com", same_domain_only=True)
>>> len(links)
1
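Relative URL resolution and the external-link check can be sketched with the standard library alone. The helper below is hypothetical and is not pyfetcher's implementation. In particular, it assumes "same domain" means an identical network location (host and port), which may differ from the rule the library applies.

```python
from urllib.parse import urljoin, urlparse


def resolve_link(href: str, base_url: str) -> tuple[str, bool]:
    """Return (absolute_url, is_external) for a single href."""
    # urljoin leaves absolute hrefs untouched and resolves relative ones.
    absolute = urljoin(base_url, href)
    # Assumption: a link is external when its netloc differs from the base.
    is_external = urlparse(absolute).netloc != urlparse(base_url).netloc
    return absolute, is_external


base = "https://example.com"
print(resolve_link("/about", base))             # ('https://example.com/about', False)
print(resolve_link("https://other.com", base))  # ('https://other.com', True)
```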
HTML form extraction for pyfetcher.
- Purpose:
Parse <form> elements from HTML and extract their fields, making it easy to build form submission requests programmatically.
Examples
>>> html = '<form action="/login" method="post"><input name="user"/></form>'
>>> forms = extract_forms(html, base_url="https://example.com")
>>> forms[0].action
'https://example.com/login'
- class pyfetcher.scrape.forms.FormField(name, type, value, options=<factory>)
A single form input field.
- Parameters:
Examples
>>> field = FormField(name="user", type="text", value="")
>>> field.name
'user'
- class pyfetcher.scrape.forms.FormData(action, method, fields, id=None, name=None)
Parsed HTML form.
- Parameters:
Examples
>>> form = FormData(action="https://example.com/login", method="POST", fields=[])
>>> form.method
'POST'
- to_dict()
Convert form fields to a submission dictionary.
Returns a dictionary mapping field names to their default values, suitable for use as POST data or query parameters.
Examples
>>> form = FormData(
...     action="/submit", method="POST",
...     fields=[FormField(name="q", type="text", value="hello")],
... )
>>> form.to_dict()
{'q': 'hello'}
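A dictionary of this shape can be turned into a query string or a POST body with the standard library. The sketch below uses a hand-written dictionary in place of a real to_dict() result. The field names and the target URL are illustrative assumptions.

```python
from urllib.parse import urlencode

# Stand-in for the output of FormData.to_dict(); values are illustrative.
form_values = {"q": "hello world", "page": "2"}

# application/x-www-form-urlencoded encoding, usable as a query string
# for GET forms or as the request body for POST forms.
encoded = urlencode(form_values)
print(encoded)  # q=hello+world&page=2

get_url = "https://example.com/search?" + encoded
post_body = encoded.encode("ascii")
```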
- pyfetcher.scrape.forms.extract_forms(html, *, base_url=None)
Extract all forms from HTML.
Parses <form> elements and their input fields (<input>, <textarea>, <select>) to produce structured form data.
- Parameters:
- Returns:
A list of FormData objects.
- Return type:
list[FormData]
Examples
>>> html = '<form action="/search"><input name="q" value=""/></form>'
>>> forms = extract_forms(html, base_url="https://example.com")
>>> forms[0].action
'https://example.com/search'
Robots.txt parser for pyfetcher.
- Purpose:
Parse robots.txt files and check URL access permissions for a given user-agent. Supports Allow, Disallow, Crawl-delay, and Sitemap directives.
Examples
>>> txt = "User-agent: *\nDisallow: /admin"
>>> rules = parse_robots_txt(txt)
>>> is_allowed(rules, "/admin", user_agent="*")
False
- class pyfetcher.scrape.robots.RobotsRules(rules=<factory>, sitemaps=<factory>, crawl_delays=<factory>)
Parsed robots.txt rules.
- Parameters:
Examples
>>> rules = RobotsRules()
>>> rules.sitemaps
[]
- pyfetcher.scrape.robots.parse_robots_txt(content)
Parse the content of a robots.txt file.
Extracts User-agent, Allow, Disallow, Crawl-delay, and Sitemap directives into a structured RobotsRules object.
- Parameters:
content (str) – The raw text content of a robots.txt file.
- Returns:
A RobotsRules object containing parsed directives.
- Return type:
RobotsRules
Examples
>>> txt = "User-agent: *\nDisallow: /secret\nAllow: /public"
>>> rules = parse_robots_txt(txt)
>>> len(rules.rules.get("*", []))
2
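Line-oriented parsing of this kind can be sketched as follows. This is not pyfetcher's parser: `parse_lines` and its return shape (a dict of per-agent rule lists plus a list of sitemap URLs) are assumptions made for illustration, and Crawl-delay handling is omitted for brevity.

```python
def parse_lines(content: str) -> tuple[dict[str, list[tuple[str, str]]], list[str]]:
    """Group Allow/Disallow rules by user-agent and collect Sitemap URLs."""
    groups: dict[str, list[tuple[str, str]]] = {}
    sitemaps: list[str] = []
    current_agents: list[str] = []
    last_was_agent = False
    for raw in content.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop trailing comments
        if ":" not in line:
            continue
        key, value = (part.strip() for part in line.split(":", 1))
        key = key.lower()
        if key == "user-agent":
            # Consecutive User-agent lines share one rule group.
            if not last_was_agent:
                current_agents = []
            current_agents.append(value)
            groups.setdefault(value, [])
            last_was_agent = True
            continue
        last_was_agent = False
        if key in ("allow", "disallow"):
            for agent in current_agents:
                groups[agent].append((key, value))
        elif key == "sitemap":
            sitemaps.append(value)
    return groups, sitemaps


txt = "User-agent: *\nDisallow: /secret\nAllow: /public\nSitemap: https://example.com/sitemap.xml"
groups, sitemaps = parse_lines(txt)
print(groups["*"])  # [('disallow', '/secret'), ('allow', '/public')]
print(sitemaps)     # ['https://example.com/sitemap.xml']
```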
- pyfetcher.scrape.robots.is_allowed(rules, path, *, user_agent='*')
Check if a path is allowed for the given user-agent.
Evaluates the parsed robots.txt rules for the most specific matching user-agent. The longest matching path prefix wins; when an Allow and a Disallow rule are equally specific, Allow takes precedence.
- Parameters:
rules (RobotsRules) – Parsed robots.txt rules from parse_robots_txt().
path (str) – The URL path to check (e.g. '/admin/settings').
user_agent (str) – The user-agent string to check against. Defaults to '*' (wildcard).
- Returns:
True if the path is allowed, False if disallowed.
- Return type:
bool
Examples
>>> txt = "User-agent: *\nDisallow: /admin\nAllow: /admin/public"
>>> rules = parse_robots_txt(txt)
>>> is_allowed(rules, "/admin/settings")
False
>>> is_allowed(rules, "/admin/public")
True
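The precedence rule described above (longest prefix wins, Allow wins ties) can be sketched as a single pass over the rules for one user-agent. `is_path_allowed` and the (directive, prefix) tuple format are hypothetical. The sketch also assumes plain prefix matching without wildcard support, and that a path with no matching rule is allowed.

```python
def is_path_allowed(rules: list[tuple[str, str]], path: str) -> bool:
    """Apply longest-prefix matching; Allow beats Disallow on equal length."""
    best_len = -1
    allowed = True  # assumption: no matching rule means the path is allowed
    for directive, prefix in rules:
        # An empty prefix (e.g. "Disallow:") restricts nothing.
        if not prefix or not path.startswith(prefix):
            continue
        is_allow = directive == "allow"
        if len(prefix) > best_len or (len(prefix) == best_len and is_allow):
            best_len = len(prefix)
            allowed = is_allow
    return allowed


rules = [("disallow", "/admin"), ("allow", "/admin/public")]
print(is_path_allowed(rules, "/admin/settings"))  # False
print(is_path_allowed(rules, "/admin/public"))    # True
print(is_path_allowed(rules, "/blog"))            # True
```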
Sitemap parser for pyfetcher.
- Purpose:
Parse XML sitemaps (both sitemap index files and URL set files) and extract URL entries with their metadata.
Examples
>>> xml = '<?xml version="1.0"?><urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"><url><loc>https://example.com/</loc></url></urlset>'
>>> entries = parse_sitemap(xml)
>>> entries[0].loc
'https://example.com/'
- class pyfetcher.scrape.sitemap.SitemapEntry(loc, lastmod=None, changefreq=None, priority=None, is_sitemap=False)
A single URL entry from a sitemap.
- Parameters:
loc (str) – The URL location.
lastmod (str | None) – The last modification date string, if present.
changefreq (str | None) – The change frequency hint, if present.
priority (str | None) – The priority value as a string, if present.
is_sitemap (bool) – Whether this entry is a sitemap index reference.
Examples
>>> entry = SitemapEntry(loc="https://example.com/")
>>> entry.loc
'https://example.com/'
- pyfetcher.scrape.sitemap.parse_sitemap(xml_content)
Parse an XML sitemap or sitemap index.
Handles both <urlset> (URL sitemaps) and <sitemapindex> (sitemap index files). Returns a flat list of entries with the is_sitemap flag set for index entries.
- Parameters:
xml_content (str) – Raw XML string content of the sitemap.
- Returns:
A list of SitemapEntry objects.
- Return type:
list[SitemapEntry]
Examples
>>> xml = (
...     '<?xml version="1.0"?>'
...     '<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">'
...     '<url><loc>https://example.com/</loc><priority>1.0</priority></url>'
...     '</urlset>'
... )
>>> entries = parse_sitemap(xml)
>>> entries[0].priority
'1.0'
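The distinction between a URL set and a sitemap index can be sketched with xml.etree.ElementTree from the standard library. `list_locs` is a hypothetical helper that returns (loc, is_sitemap) pairs rather than SitemapEntry objects. How pyfetcher parses the XML internally is not stated here, so treat this as one possible approach.

```python
import xml.etree.ElementTree as ET

# Namespace defined by the sitemaps.org protocol.
NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def list_locs(xml_content: str) -> list[tuple[str, bool]]:
    """Return (loc, is_sitemap) pairs from a urlset or sitemapindex."""
    root = ET.fromstring(xml_content)
    # The root tag looks like "{namespace}urlset" or "{namespace}sitemapindex".
    is_index = root.tag.endswith("sitemapindex")
    child_path = "sm:sitemap" if is_index else "sm:url"
    pairs = []
    for node in root.findall(child_path, NS):
        loc = node.findtext("sm:loc", default="", namespaces=NS)
        pairs.append((loc.strip(), is_index))
    return pairs


xml = (
    '<?xml version="1.0"?>'
    '<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">'
    '<url><loc>https://example.com/</loc></url>'
    '</urlset>'
)
print(list_locs(xml))  # [('https://example.com/', False)]
```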
Content extraction for pyfetcher.
- Purpose:
Extract readable text content from HTML by stripping scripts, styles, and navigation elements to isolate the main body text.
Examples
>>> html = "<html><body><p>Hello World</p><script>var x=1;</script></body></html>"
>>> extract_readable_text(html)
'Hello World'
- pyfetcher.scrape.content.extract_readable_text(html, *, strip_tags=None, selector=None)
Extract readable text content from HTML.
Removes scripts, styles, navigation, and other non-content elements from the HTML, then extracts and normalizes the text content. Optionally targets a specific element via CSS selector.
- Parameters:
html (str) – Raw HTML string to process.
strip_tags (frozenset[str] | None) – Set of tag names to remove before text extraction. Defaults to script, style, noscript, iframe, svg, nav, footer, and header.
selector (str | None) – Optional CSS selector to narrow extraction to a specific element (e.g. 'article', 'main', '.content').
- Returns:
Cleaned, readable text with normalized whitespace.
- Return type:
str
Examples
>>> html = "<div><p>First.</p><p>Second.</p><script>x=1</script></div>"
>>> extract_readable_text(html)
'First.\nSecond.'
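The strip-then-extract idea can be sketched with the standard library's html.parser instead of BeautifulSoup. `TextCollector` and `readable_text` are hypothetical names, and the sketch makes two simplifying assumptions: every text node becomes its own line, and there is no CSS selector support.

```python
from html.parser import HTMLParser

# Default tag set taken from the strip_tags description above.
DEFAULT_STRIP = frozenset(
    {"script", "style", "noscript", "iframe", "svg", "nav", "footer", "header"}
)


class TextCollector(HTMLParser):
    """Collect text nodes that are not inside a stripped element."""

    def __init__(self, strip_tags: frozenset = DEFAULT_STRIP) -> None:
        super().__init__()
        self.strip_tags = strip_tags
        self.depth = 0  # number of stripped elements currently open
        self.chunks: list[str] = []

    def handle_starttag(self, tag, attrs):
        if tag in self.strip_tags:
            self.depth += 1

    def handle_endtag(self, tag):
        if tag in self.strip_tags and self.depth:
            self.depth -= 1

    def handle_data(self, data):
        text = " ".join(data.split())  # collapse runs of whitespace
        if text and not self.depth:
            self.chunks.append(text)


def readable_text(html: str) -> str:
    parser = TextCollector()
    parser.feed(html)
    parser.close()
    return "\n".join(parser.chunks)


html = "<div><p>First.</p><p>Second.</p><script>x=1</script></div>"
print(repr(readable_text(html)))  # 'First.\nSecond.'
```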