How to scrape article content with Python and Newspaper3k

TL;DR - newspaper3k Quick Start
Newspaper3k is a Python library that makes web scraping news articles dead simple. Here's the typical workflow:
- Create an Article object with a URL
- Download the HTML content
- Parse the downloaded content to extract article data
- Access the extracted information (text, authors, images, etc.)
- Optionally run NLP for keywords and summaries
Install:
pip3 install newspaper3k
Basic usage:
from newspaper import Article
url = 'https://example.com/article'
article = Article(url)
article.download()
article.parse()
print(article.text) # Full text
print(article.authors) # Authors
print(article.publish_date)
print(article.top_image) # Main image
With NLP (keywords & summary):
article.nlp()
print(article.keywords)
print(article.summary)
Bulk scraping:
import newspaper
site = newspaper.build('https://cnn.com')
for article in site.articles:
    article.download()
    article.parse()
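One gotcha worth knowing: build() caches the article URLs it has already seen, so a second run against the same source can come back nearly empty. Pass memoize_articles=False to disable that cache:
site = newspaper.build('https://cnn.com', memoize_articles=False)  # disable URL caching between runs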
That's it! newspaper3k handles extraction and cleaning. You can also easily do some basic NLP with it.
What is newspaper3k?
Newspaper3k is a Python library for extracting and parsing news articles from the web. It has an easy-to-use API and is specifically optimized for content-rich pages like articles and blog posts.
Key Features
- Content Extraction: It identifies and extracts the main article content, automatically filtering out anything that isn't part of the article
- Metadata Parsing: It can identify common metadata like authors, publish dates, and meta descriptions (a quick sketch of these fields follows this list)
- Multi-language Support: Works with a long list of languages, and you can add support for a new language by supplying its stopword list
- Built-in NLP: It provides some basic NLP functionality like keyword extraction out of the box
- Multi-threaded Downloads: Includes support for concurrent article downloads and extraction to speed things up
- HTML Preservation: Newspaper3k can preserve the HTML it used to extract the content. This is useful in case you need to do any additional processing on it
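To make the metadata point concrete, here's a minimal sketch of the meta fields an Article exposes after parsing (the URL is a placeholder):
from newspaper import Article
article = Article('https://example.com/article')  # placeholder URL
article.download()
article.parse()
print(article.meta_description)  # page's meta description tag
print(article.meta_keywords)     # meta keywords, if the page sets them
print(article.meta_lang)         # language declared in the page's meta tags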
How It Works
Before you start, create a new folder, install Newspaper3k and create an empty index.py file with the following commands:
mkdir scraping_project && cd scraping_project
pip install newspaper3k
touch index.py
Using Newspaper3k usually consists of three steps: downloading the HTML, parsing the content, and finally processing it.
1. Download the HTML
from newspaper import Article
article = Article('https://transformy.io/guides/python-newspaper3k-scraping-tutorial/')
article.download()
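Once download() returns, the raw page source is available on article.html, which is a quick way to confirm the request actually succeeded:
print(article.html[:200])  # first 200 characters of the raw HTML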
2. Parsing/extracting structured data
# After downloading, parse the content
article.parse()
# Now all article data is available
print(article.title)
print(article.publish_date)
print(article.text)
# ...
The parser intelligently identifies article elements using language-specific rules and patterns.
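Beyond the title, date, and text, the parsed Article also exposes media and author fields; a quick look at the common ones:
print(article.top_image)  # URL of the main image
print(article.images)     # set of all image URLs found on the page
print(article.movies)     # list of embedded video URLs
print(article.authors)    # list of author names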
3. NLP Analysis
article.nlp()
print(article.keywords)
print(article.summary)
The article summary feature is OK but definitely not the best, and there are better tools out there for this. If you need something really good, you should probably just use an LLM for this part.
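One practical note: nlp() relies on NLTK's tokenizers, so on a fresh environment you may hit a LookupError until you download the punkt data once:
import nltk
nltk.download('punkt')  # one-time download of the tokenizer data nlp() needs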
Advanced usage
Language-Specific Parsing
Newspaper3k automatically detects the article's language, but you can specify one explicitly for better accuracy:
# For Dutch articles:
dutch_article = Article(url, language='nl')
dutch_article.download()
dutch_article.parse()
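If you're not sure whether a language is supported, newspaper can print the full list of language codes it ships with:
import newspaper
newspaper.languages()  # prints the supported language codes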
Error Handling
from newspaper import Article, ArticleException
try:
    article = Article(url)
    article.download()
    article.parse()
except ArticleException as e:
    print(f"Failed: {e}")
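Downloads fail transiently in the real world, so a small retry helper is often worth the few extra lines. A minimal sketch (the fetch_article name and the retry/delay values are my own choices, not part of newspaper3k):
import time
from newspaper import Article, ArticleException
def fetch_article(url, retries=3, delay=2):
    # Attempt a download/parse cycle a few times before giving up
    for attempt in range(retries):
        try:
            article = Article(url)
            article.download()
            article.parse()
            return article
        except ArticleException as e:
            print(f"Attempt {attempt + 1} failed: {e}")
            time.sleep(delay)
    return None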
Preserving the HTML of the article content
As mentioned earlier, another useful feature is the ability to preserve the article's HTML structure:
from newspaper import Article
# Enable HTML preservation
article = Article(url, keep_article_html=True)
article.download()
article.parse()
# Access the clean HTML with semantic structure preserved
print(article.article_html)
This comes in handy if you want to do additional parsing with something like BeautifulSoup.
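For example, here's a minimal sketch that feeds the preserved HTML into BeautifulSoup to pull out the subheadings (assumes beautifulsoup4 is installed):
from bs4 import BeautifulSoup
soup = BeautifulSoup(article.article_html, 'html.parser')
for heading in soup.find_all(['h2', 'h3']):
    print(heading.get_text(strip=True))  # each subheading in the article body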
Processing Specific Article Lists
You can also download a custom list of articles concurrently:
from newspaper import Article
import newspaper
# Create a list of specific article URLs
article_urls = [
'https://transformy.io/guides/python-newspaper3k-scraping-tutorial/',
'https://transformy.io/guides/python-trafilatura-tutorial/',
'https://transformy.io/guides/beuatifulsoup-tutorial/'
]
# Create article objects
articles = [Article(url) for url in article_urls]
# Download articles using thread pool
newspaper.news_pool.set(articles, threads_per_source=2)
newspaper.news_pool.join()
# Parse and display article information
for article in articles:
    try:
        article.parse()
        article.nlp()  # Extract keywords and summary
        print(f"Title: {article.title}")
        print(f"URL: {article.url}")
        print()
        # ...
    except Exception as e:
        print(f"\nFailed to process URL: {article.url}")
        print(f"Error: {e}")
Some Performance Tips
I also have some final recommendations to help you optimize your newspaper3k extraction and avoid getting rate-limited (a combined sketch follows the list).
- Thread Count: Use 1-2 threads per source to avoid overwhelming servers (or use a proxy service)
- Batch Size: Process articles in batches if dealing with thousands of URLs
- Error Handling: Always wrap parsing in try-except blocks, things can go wrong
- Rate Limiting: Consider adding delays between requests to avoid being rate-limited
- Memory: For very large operations, process and store results incrementally
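Here's a rough sketch tying those tips together: batching, a polite delay, per-article error handling, and incremental writes. The batch size, delay, and output file are arbitrary choices for illustration:
import json
import time
from newspaper import Article
def scrape_in_batches(urls, batch_size=50, delay=1.0, out_path='results.jsonl'):
    # Process URLs in fixed-size batches, appending results as we go
    with open(out_path, 'a') as out:
        for i in range(0, len(urls), batch_size):
            for url in urls[i:i + batch_size]:
                try:
                    article = Article(url)
                    article.download()
                    article.parse()
                    out.write(json.dumps({'url': url, 'title': article.title}) + '\n')
                except Exception as e:
                    print(f"Failed {url}: {e}")
                time.sleep(delay)  # small delay between requests to avoid rate limits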
The multi-threading feature makes newspaper3k really effective for large-scale scraping, but with great power comes great responsibility!