Scratching a Tiny Itch in Tiny Steps - A Python Side Project
How to build something tiny and useful with Python and Flask.
Every week I send a new issue of my newsletter This Week's Worth, and every week I commit to the boring effort of estimating the read times of each article I share with it. My subscribers are all saints and they deserve to know if they are clicking on a link that will take them 31 minutes to read.
When I say boring I really mean it. This was the recipe I followed every single time:
- Open the article.
- Copy the article.
- Open Read-O-Meter and paste the article.
- Get an estimate.
- Be unsatisfied with a fixed reading speed of 200 wpm.
- Divide the total number of words by 220.
And I'm not even considering the times where opening the article a second time would trigger a paywall.
What I wanted was to provide a URL to an article and get a number of minutes in return. And that seemed like the perfect-sized challenge for a side project, where, once again, Python should come to the rescue.
Breaking a project into milestones
I want a web application where anyone can provide their links and receive an estimated reading time. I want to be able to configure a Words-Per-Minute speed. I want it to be easy to use and minimalistic. I want it to be fast. All of these are great goals to have for the last iteration of this project. Not the first.
I plan to build this project in tiny steps.
- A Python script that returns the article text from a given URL.
- A Python script that returns the reading time of a given URL.
- Previous step + accepts a URL and wpm speed as parameters.
- An API that returns the reading time of a given URL.
- Previous step + accepts a URL and wpm speed as parameters.
- A web application that estimates the reading time of a given URL.
- Previous step + what could pass as "nice UI".
- Previous step + speed improvements.
Defining these bite-sized milestones from the get-go provides at least two considerable advantages (if you discount the futurology and stay agile):
- Added motivation by completing goals with higher frequency;
- Higher clarity on the immediate objective and what lies ahead.
Step 1: How to get an article out of a URL
I wanted this as the first step for two main reasons: warm-up and separate functionality. I knew counting words in a body of text would be easy, but since my web scraping experience was limited, extracting the words out of the HTML would require some investigation and would be a nice warm-up exercise coming back to Python. The other reason is related to what I could do with the end result. I would use it for this project, but it could also live on its own, for the next time I need to extract the body of articles.
And just like that, the first line I wrote for this project was not Python in a text editor, but "python extract article from html" on Google.
What I discovered was this very nice Python library called Newspaper3k, which, in six very short lines of code, would give me the complete text of any article.
    from newspaper import Article
    URL = 'https://filipesilva.me/blog/when-to-use-the-java-this-keyword/'
    article = Article(URL)
    article.download()
    article.parse()
    print(article.text)
That's it. Even if you are not familiar with Python syntax I think it's pretty self-explanatory what each line does.
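To get a feel for what the library is saving us from, here is a deliberately naive sketch that uses only the standard library and keeps whatever sits inside paragraph tags. This is not how Newspaper3k works internally. The names ParagraphExtractor and naive_article_text are made up for illustration, and real pages (navigation, comments, ads) will trip it up quickly.

```python
from html.parser import HTMLParser


class ParagraphExtractor(HTMLParser):
    """Collects the text found inside <p> tags and ignores everything else."""

    def __init__(self):
        super().__init__()
        self._depth = 0        # how many <p> tags we are currently inside
        self._current = []     # text chunks of the paragraph being read
        self.paragraphs = []   # finished paragraphs, in document order

    def handle_starttag(self, tag, attrs):
        if tag == "p":
            self._depth += 1

    def handle_endtag(self, tag):
        if tag == "p" and self._depth > 0:
            self._depth -= 1
            if self._depth == 0:
                # Collapse runs of whitespace into single spaces.
                text = " ".join("".join(self._current).split())
                if text:
                    self.paragraphs.append(text)
                self._current = []

    def handle_data(self, data):
        if self._depth > 0:
            self._current.append(data)


def naive_article_text(html):
    """Return the paragraph text of an HTML string, one blank line between paragraphs."""
    parser = ParagraphExtractor()
    parser.feed(html)
    parser.close()
    return "\n\n".join(parser.paragraphs)


sample = (
    "<html><head><title>T</title></head><body><nav>Menu</nav>"
    "<p>Hello <b>world</b>.</p><p>Second   one</p></body></html>"
)
print(naive_article_text(sample))
```

The title and the navigation text are dropped because they are not inside a paragraph. Everything beyond that (choosing the right container, handling broken markup) is where a dedicated library earns its keep.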
Step 2: How to get the read time of an article
In this step, I have to count the words in the article text and then just apply simple math. We can get the number of words in an article with these two lines:
    words = article.text.split()
    word_count = len(words)
But there's a problem. We are splitting the text on whitespace, so if some punctuation appears isolated, for example in lines of code, we get symbols counted as words.
To solve this small issue I resorted, for now, to extracting the words with a regular expression. Instead of excluding the characters I don't want, it only matches words made of the ones I need. You know them: numbers, letters, dashes, underscores, and apostrophes.
    import re
    words = re.findall(r"[0-9a-zA-Z_'-]+", article.text)
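A quick way to see the difference between the two approaches is to run both over a line that mixes prose with code-like symbols. The sample sentence below is made up for illustration.

```python
import re

sample = "Set x = y + 1 and you're done ."

# Splitting on whitespace keeps the isolated symbols as "words".
split_words = sample.split()

# The regular expression only keeps runs of letters, digits,
# underscores, apostrophes and dashes.
regex_words = re.findall(r"[0-9a-zA-Z_'-]+", sample)

print(len(split_words), split_words)
print(len(regex_words), regex_words)
```

The whitespace split counts 10 items, including "=", "+" and the lone ".", while the regular expression counts 7. One caveat: a dash or an apostrophe standing on its own still matches the pattern, so "a - b" is counted as three words.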
Then, it's just dividing the number of words by a given Words-Per-Minute constant (I use 220). I also decided to round down the estimation because on the Internet everyone skims.
    import math
    WPM = 220  # words per minute
    word_count = len(words)
    estimated_read_time = math.floor(word_count / WPM)
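Put together, the counting and the division fit in two small functions. This is a sketch under one assumption: it takes the article text directly instead of a URL, so it runs without any download. The function names are my own choice for illustration.

```python
import math
import re

WPM = 220  # default reading speed, in words per minute


def count_words(text):
    """Count runs of letters, digits, underscores, apostrophes and dashes."""
    return len(re.findall(r"[0-9a-zA-Z_'-]+", text))


def estimate_read_time(text, wpm=WPM):
    """Return the estimated read time in whole minutes, rounded down."""
    if wpm <= 0:
        raise ValueError("wpm must be a positive number")
    return math.floor(count_words(text) / wpm)


article_text = "word " * 1000                 # stand-in for a 1000-word article
print(estimate_read_time(article_text))       # 1000 / 220 = 4.54..., floored to 4
print(estimate_read_time(article_text, 200))  # 1000 / 200 = 5
```

A side effect of rounding down is that anything shorter than 220 words is reported as 0 minutes. If that looks odd in a newsletter, wrapping the result in max(1, ...) would be one possible tweak.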
I could end this here and it would already help with the newsletter writing process. But...
Step 3: How to pass arguments to a Python script
I see three levels in this step. In order of importance:
- I want the script to accept one URL.
- I want the script to accept one URL and one WPM number.
- I want the script to accept a list of URLs.
Here, after some brief research, I discovered the argparse Python module, which promised an easy implementation of command-line interfaces. It delivered.
    import argparse
    parser = argparse.ArgumentParser()
    parser.add_argument("--url", help="Define URL for the article to estimate read time.", required=True)
    args = parser.parse_args()
    article = Article(args.url)
In five lines I was able to set up the beginning of what could be a read-time-estimating CLI. I can define the name of the URL argument (as "url"), offer a brief description of what it means, and even mark it as required for the estimation to work.
Now, for each new argument, and in this particular case Words-Per-Minute, I just need to add another line like this:
parser.add_argument("--wpm", help="Define the words-per-minute speed.")
This is not required for our little program to work, because we will always have a default value to estimate with.
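Two argparse details matter for an optional number like this one: values arrive as strings unless a type is given, and an omitted optional argument is None unless a default is given. A small sketch of both settings, assuming 220 as the default speed (the constant name DEFAULT_WPM and the example.com URL are placeholders):

```python
import argparse

DEFAULT_WPM = 220  # assumed default, matching the speed used earlier

parser = argparse.ArgumentParser()
parser.add_argument("--url", required=True,
                    help="Define URL for the article to estimate read time.")
parser.add_argument("--wpm", type=int, default=DEFAULT_WPM,
                    help="Define the words-per-minute speed.")

# parse_args accepts an explicit list, which is handy for trying things out.
with_flag = parser.parse_args(["--url", "https://example.com/a", "--wpm", "300"])
without_flag = parser.parse_args(["--url", "https://example.com/a"])

print(with_flag.wpm)     # 300, already converted to an int
print(without_flag.wpm)  # 220, the default
```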
Lastly, we need to be able to pass as many URLs as we want. The solution comes in the form of a parameter appropriately named 'nargs'. We pass it an '*' to group all the URLs we may throw at it into a list, so that we can then iterate through them and apply our estimation algorithm.
    parser.add_argument("--url", nargs='*', required=True, help="Define URLs for the articles to estimate read time.")
    args = parser.parse_args()
    for url in args.url:
        ...  # call the read time estimation function here
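The three levels of this step can be sketched as one small script. Treat it as an outline under assumptions rather than the finished tool: build_parser, main and the fetch_text parameter are names I made up, and fetch_text exists only so the example can run on canned text instead of downloading anything.

```python
import argparse
import math
import re

DEFAULT_WPM = 220


def fetch_article_text(url):
    """Download and extract the article body with Newspaper3k, as in step 1."""
    from newspaper import Article  # third-party; imported only when needed

    article = Article(url)
    article.download()
    article.parse()
    return article.text


def estimate_read_time(text, wpm):
    """Whole minutes needed to read the text at the given speed, rounded down."""
    words = re.findall(r"[0-9a-zA-Z_'-]+", text)
    return math.floor(len(words) / wpm)


def build_parser():
    parser = argparse.ArgumentParser(description="Estimate article read times.")
    parser.add_argument("--url", nargs="*", required=True,
                        help="Define URLs for the articles to estimate read time.")
    parser.add_argument("--wpm", type=int, default=DEFAULT_WPM,
                        help="Define the words-per-minute speed.")
    return parser


def main(argv=None, fetch_text=fetch_article_text):
    """Parse the arguments, estimate every URL and return (url, minutes) pairs."""
    args = build_parser().parse_args(argv)
    results = []
    for url in args.url:
        minutes = estimate_read_time(fetch_text(url), args.wpm)
        print(f"{minutes} min  {url}")
        results.append((url, minutes))
    return results


# Example run with canned text so nothing is downloaded.
canned = {
    "https://example.com/a": "word " * 440,
    "https://example.com/b": "word " * 100,
}
results = main(["--url", *canned, "--wpm", "220"], fetch_text=canned.get)
```

To use it from a shell, call main() with no arguments so argparse reads the real command line. Also worth knowing: nargs='*' accepts an empty list, while nargs='+' insists on at least one URL.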
This is the end of step 3 and of the 'script-only' part of this project. Time for some refactoring and a refresh on Flask, a Python (micro) web framework that I sympathize with.
Step 4: How to get the article read time via API
I'm changing the paradigm. Now I want to give other applications the capability to get an estimation. The most basic example would be for me to visit some address in my web browser and get a read time. It could also feed a browser extension, a mobile app, etcetera. As long as we can answer HTTP requests asking nicely for our magic algorithm, we are good to go.
That's where Flask comes into the picture.
    from flask import Flask

    WPM = 220
    app = Flask(__name__)

    @app.route('/readtime')
    def get_read_time():
        url = "https://filipesilva.me/blog/12-great-ideas-for-programming-projects-that-people-will-use/"
        return str(estimate_read_time(url, WPM))
Re-using the existing code, I can very easily deploy on my machine a server that handles a specific HTTP request and estimates the read time of the URL defined. When I open http://localhost:5000/readtime in my browser I see magic.
Step 5: How to pass parameters to a Flask API
Our API needs to receive from our clients the actual articles they want estimated. To do that we are going to change our GET /readtime endpoint to a POST. This allows us to stay in the good graces of HTTP specifications and what the majority of developers have come to expect.
We will be expecting a JSON payload with a list of URLs (the links for the articles) and a WPM number. The last change from the previous step is how we respond to this request. The jsonify function will convert to JSON a list of pairs, each made of a URL and its corresponding read time.
    from flask import Flask, request, jsonify

    app = Flask(__name__)

    @app.route('/readtime', methods=['POST'])
    def get_read_time():
        request_payload = request.json
        urls = request_payload['urls']
        wpm = request_payload['wpm']
        return jsonify(estimate_read_times(urls, wpm))
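To make the shape of the request and the response concrete, here is a sketch that leaves Flask out and works on plain JSON strings. It rests on a few assumptions: estimate_read_times gets an extra fetch_text parameter (not in the endpoint above) so the example can use canned text, parse_payload is a hypothetical helper, and falling back to 220 when "wpm" is missing is my choice rather than something the endpoint above does.

```python
import json
import math
import re

DEFAULT_WPM = 220
WORD_PATTERN = re.compile(r"[0-9a-zA-Z_'-]+")


def estimate_read_times(urls, wpm, fetch_text):
    """Return a list of [url, minutes] pairs, one per URL, in input order.

    fetch_text is a stand-in callable (url -> article text); in this
    project that job belongs to Newspaper3k.
    """
    results = []
    for url in urls:
        word_count = len(WORD_PATTERN.findall(fetch_text(url)))
        results.append([url, math.floor(word_count / wpm)])
    return results


def parse_payload(payload):
    """Pull 'urls' and 'wpm' out of the decoded JSON body, with a default speed."""
    urls = payload["urls"]
    wpm = payload.get("wpm", DEFAULT_WPM)
    return urls, wpm


# Example round trip with canned text instead of real downloads.
fake_articles = {
    "https://example.com/short": "word " * 220,
    "https://example.com/long": "word " * 700,
}
request_body = json.dumps({"urls": list(fake_articles), "wpm": 220})

urls, wpm = parse_payload(json.loads(request_body))
response_body = json.dumps(estimate_read_times(urls, wpm, fake_articles.get))
print(response_body)
```

The response is a JSON array of two-item arrays, which is what a list of pairs turns into once serialized: the 220-word text comes back as 1 minute and the 700-word one as 3.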
(to be continued)