How to scrape article content with Python and Newspaper3k

TL;DR - newspaper3k Quick Start
Newspaper3k is a Python library that makes web scraping news articles dead simple. Here's the typical workflow:
- Create an Article object with a URL
- Download the HTML content
- Parse the downloaded content to extract article data
- Access the extracted information (text, authors, images, etc.)
- Optionally run NLP for keywords and summaries
Install:
pip3 install newspaper3k
Basic usage:
from newspaper import Article
url = 'https://example.com/article'
article = Article(url)
article.download()
article.parse()
print(article.text) # Full text
print(article.authors) # Authors
print(article.publish_date)
print(article.top_image) # Main image
With NLP (keywords & summary):
article.nlp()
print(article.keywords)
print(article.summary)
Bulk scraping:
import newspaper
site = newspaper.build('https://cnn.com')
for article in site.articles:
    article.download()
    article.parse()
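One gotcha worth knowing: build() caches the article URLs it has already seen, so a second run against the same source can come back nearly empty. Pass memoize_articles=False to disable that cache:
site = newspaper.build('https://cnn.com', memoize_articles=False)  # disable URL caching between runs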
That's it! newspaper3k handles extraction and cleaning. You can also easily do some basic NLP with it.
What is newspaper3k?
Newspaper3k is a Python library for extracting and parsing news articles from the web. It has an easy-to-use API and is specifically optimized for content-rich pages like articles and blog posts.
Key Features
- Content Extraction: It identifies and extracts the main article content, automatically filtering out anything that isn't part of the article
- Metadata Parsing: It can identify common metadata like authors, publish dates, and meta descriptions (a quick sketch of these fields follows this list)
- Multi-language Support: Works with a long list of languages, and you can add support for a new language by supplying its stopword list
- Built-in NLP: It provides some basic NLP functionality like keyword extraction out of the box
- Multi-threaded Downloads: Includes support for concurrent article downloads and extraction to speed things up
- HTML Preservation: Newspaper3k can preserve the HTML it used to extract the content. This is useful in case you need to do any additional processing on it
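To make the metadata point concrete, here's a minimal sketch of the meta fields an Article exposes after parsing (the URL is a placeholder):
from newspaper import Article
article = Article('https://example.com/article')  # placeholder URL
article.download()
article.parse()
print(article.meta_description)  # page's meta description tag
print(article.meta_keywords)     # meta keywords, if the page sets them
print(article.meta_lang)         # language declared in the page's meta tags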
How It Works
Before you start, create a new folder, install Newspaper3k and create an empty index.py file with the following commands:
mkdir scraping_project && cd scraping_project
pip install newspaper3k
touch index.py
Using Newspaper3k usually consists of three steps: downloading the HTML, parsing the content, and finally processing it.
1. Download the HTML
from newspaper import Article
article = Article('https://transformy.io/guides/python-newspaper3k-scraping-tutorial/')
article.download()
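Once download() returns, the raw page source is available on article.html, which is a quick way to confirm the request actually succeeded:
print(article.html[:200])  # first 200 characters of the raw HTML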
2. Parsing/extracting structured data
# After downloading, parse the content
article.parse()
# Now all article data is available
print(article.title)
print(article.publish_date)
print(article.text)
# ...
The parser intelligently identifies article elements using language-specific rules and patterns.
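Beyond the title, date, and text, the parsed Article also exposes media and author fields; a quick look at the common ones:
print(article.top_image)  # URL of the main image
print(article.images)     # set of all image URLs found on the page
print(article.movies)     # list of embedded video URLs
print(article.authors)    # list of author names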
3. NLP Analysis
article.nlp()
print(article.keywords)
print(article.summary)
The article summary feature is OK but definitely not the best, and there are better tools out there for this. If you need something really good, you should probably just use an LLM for this part.
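One practical note: nlp() relies on NLTK's tokenizers, so on a fresh environment you may hit a LookupError until you download the punkt data once:
import nltk
nltk.download('punkt')  # one-time download of the tokenizer data nlp() needs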
Advanced usage
Language-Specific Parsing
Newspaper3k automatically detects the article's language, but you can specify one explicitly for better accuracy:
# For Dutch articles:
dutch_article = Article(url, language='nl')
dutch_article.download()
dutch_article.parse()
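If you're not sure whether a language is supported, newspaper can print the full list of language codes it ships with:
import newspaper
newspaper.languages()  # prints the supported language codes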
Error Handling
from newspaper import Article, ArticleException
try:
    article = Article(url)
    article.download()
    article.parse()
except ArticleException as e:
    print(f"Failed: {e}")
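Downloads fail transiently in the real world, so a small retry helper is often worth the few extra lines. A minimal sketch (the fetch_article name and the retry/delay values are my own choices, not part of newspaper3k):
import time
from newspaper import Article, ArticleException
def fetch_article(url, retries=3, delay=2):
    # Attempt a download/parse cycle a few times before giving up
    for attempt in range(retries):
        try:
            article = Article(url)
            article.download()
            article.parse()
            return article
        except ArticleException as e:
            print(f"Attempt {attempt + 1} failed: {e}")
            time.sleep(delay)
    return None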
Preserving the HTML of the article content
As mentioned earlier, another useful feature is the ability to preserve the article's HTML structure:
from newspaper import Article
# Enable HTML preservation
article = Article(url, keep_article_html=True)
article.download()
article.parse()
# Access the clean HTML with semantic structure preserved
print(article.article_html)
This comes in handy if you want to do additional parsing with something like BeautifulSoup.
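For example, here's a minimal sketch that feeds the preserved HTML into BeautifulSoup to pull out the subheadings (assumes beautifulsoup4 is installed):
from bs4 import BeautifulSoup
soup = BeautifulSoup(article.article_html, 'html.parser')
for heading in soup.find_all(['h2', 'h3']):
    print(heading.get_text(strip=True))  # each subheading in the article body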
Processing Specific Article Lists
You can also download a custom list of articles concurrently:
from newspaper import Article
import newspaper
# Create a list of specific article URLs
article_urls = [
'https://transformy.io/guides/python-newspaper3k-scraping-tutorial/',
'https://transformy.io/guides/python-trafilatura-tutorial/',
'https://transformy.io/guides/beuatifulsoup-tutorial/'
]
# Create article objects
articles = [Article(url) for url in article_urls]
# Download articles using thread pool
newspaper.news_pool.set(articles, threads_per_source=2)
newspaper.news_pool.join()
# Parse and display article information
for article in articles:
    try:
        article.parse()
        article.nlp()  # Extract keywords and summary
        print(f"Title: {article.title}")
        print(f"URL: {article.url}")
        print()
        # ...
    except Exception as e:
        print(f"\nFailed to process URL: {article.url}")
        print(f"Error: {e}")
Some Performance Tips
I also have some final recommendations to help you optimize your newspaper3k extraction and avoid getting rate-limited (a combined sketch follows the list).
- Thread Count: Use 1-2 threads per source to avoid overwhelming servers (or use a proxy service)
- Batch Size: Process articles in batches if dealing with thousands of URLs
- Error Handling: Always wrap parsing in try-except blocks, things can go wrong
- Rate Limiting: Consider adding delays between requests to avoid being rate-limited
- Memory: For very large operations, process and store results incrementally
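Here's a rough sketch tying those tips together: batching, a polite delay, per-article error handling, and incremental writes. The batch size, delay, and output file are arbitrary choices for illustration:
import json
import time
from newspaper import Article
def scrape_in_batches(urls, batch_size=50, delay=1.0, out_path='results.jsonl'):
    # Process URLs in fixed-size batches, appending results as we go
    with open(out_path, 'a') as out:
        for i in range(0, len(urls), batch_size):
            for url in urls[i:i + batch_size]:
                try:
                    article = Article(url)
                    article.download()
                    article.parse()
                    out.write(json.dumps({'url': url, 'title': article.title}) + '\n')
                except Exception as e:
                    print(f"Failed {url}: {e}")
                time.sleep(delay)  # small delay between requests to avoid rate limits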
The multi-threading feature makes newspaper3k really effective for large-scale scraping, but with great power comes great responsibility!