Scraping
========

pyfetcher provides a comprehensive set of scraping utilities built on
BeautifulSoup.

CSS Selector Extraction
-----------------------

.. code-block:: python

   from pyfetcher.scrape.selectors import extract_text, extract_attrs, extract_table

   # Extract text from elements
   titles = extract_text(html, "h1")

   # Extract attributes
   links = extract_attrs(html, "a", attrs=["href", "title"])

   # Parse HTML tables
   rows = extract_table(html, "table.data")

Link Harvesting
---------------

.. code-block:: python

   from pyfetcher.scrape.links import extract_links

   links = extract_links(html, base_url="https://example.com")
   internal = [l for l in links if not l.is_external]

Form Parsing
------------

.. code-block:: python

   from pyfetcher.scrape.forms import extract_forms

   forms = extract_forms(html, base_url="https://example.com")
   login_form = forms[0]
   print(login_form.action, login_form.method)
   print(login_form.to_dict())  # Field names -> default values

Robots.txt
----------

.. code-block:: python

   from pyfetcher.scrape.robots import parse_robots_txt, is_allowed

   rules = parse_robots_txt(robots_txt_content)
   if is_allowed(rules, "/admin", user_agent="MyBot"):
       print("Path is allowed")

Sitemap Parsing
---------------

.. code-block:: python

   from pyfetcher.scrape.sitemap import parse_sitemap

   entries = parse_sitemap(sitemap_xml)
   for entry in entries:
       print(entry.loc, entry.lastmod)

Content Extraction
------------------

.. code-block:: python

   from pyfetcher.scrape.content import extract_readable_text

   # Strips scripts, styles, nav, footer
   text = extract_readable_text(html)

   # Target specific element
   text = extract_readable_text(html, selector="article")