Beautiful Soup

Mert Barbaros
5 min read · Mar 31, 2021

This post was originally published on the Barbaros Blog; don't forget to subscribe to my newsletter.

Unicode Characters & Strings

ASCII stands for American Standard Code for Information Interchange. Computers can only understand numbers, so an ASCII code is the numerical representation of a character such as 'a' or '@', or of an action of some sort. ASCII was developed a long time ago, and the non-printing characters are now rarely used for their original purpose. If you look at the table, to print 'Hello world' the computer uses 72 for 'H' and 10 for a new line. The problem is that ASCII only defines 128 codes (0–127), so you can't fit every character into it.

Representing Simple Strings

In Python, each character is represented by a number between 0 and 255 stored in 8 bits of memory. 8 bits of memory is a "byte" of memory, as in a 1 terabyte disk drive. The ord() function in Python tells us the numeric value of a single character.

print("Order of H is: ",ord('H'))
print("Order of e is: ", ord('e'))
print("Order of a new line is: ", ord('\n'))
#new line is a single character
Output:
Order of H is: 72
Order of e is: 101
Order of a new line is: 10

Please note that the ordinal values of lowercase letters are greater than those of uppercase letters. In the 1960s, people assumed that one byte was one character. The problem is that in this system you can't represent Arabic or Japanese characters, which brings us to Unicode. As the Unicode site puts it, "everyone in the world should be able to use their own language on phones and computers". Japanese computers should be able to talk to European computers, and of course emojis matter too. To represent the wide range of characters computers must handle, we represent characters with more than one byte.

UTF-16: Two bytes (fixed)
UTF-32: Four bytes (fixed)
UTF-8: 1–4 bytes (variable)
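
A quick way to see the size difference is to encode a single character and count the bytes. This is just a small sketch using Python 3's built-in encode(); the UTF-16 and UTF-32 counts include the byte-order mark.

# Encode one Japanese character in each encoding and count the bytes
for enc in ['utf-8', 'utf-16', 'utf-32']:
    print(enc, len('あ'.encode(enc)), 'bytes')
# utf-8 3 bytes
# utf-16 4 bytes (2 for the character + 2 for the byte-order mark)
# utf-32 8 bytes (4 for the character + 4 for the byte-order mark)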

UTF-8 is the recommended practice for encoding data to be exchanged between systems, because it is the most economical of the three. In Python 3, all strings are Unicode. In Python 2, a byte string (x = b'h') and a regular string (x = 'h') were the same thing, and only something like x = u'あ' was treated as Unicode. In Python 3, however, byte strings and regular strings are different types. This is why we should decode the data we pull from outside our computer. You can check out the Networked Technology refresher post. Note that 99.9% of the data you will work with is UTF-8. Here is a sample decoding operation in Python.

while True:
    # mysock is an already-connected socket (see the Networked Technology post)
    data = mysock.recv(512)   # data is bytes
    if len(data) < 1:
        break
    mystring = data.decode()  # mystring is Unicode (str)
    print(mystring)

With the decoding operation, we represent the characters properly in Python 3. In the decode() function you can tell Python which character set to use, but by default it is UTF-8 (of which ASCII is a subset). decode() goes from bytes (data) to Unicode (mystring). When we send data, we use encode(), which turns strings into bytes; it also uses UTF-8 by default. That is why we encode before the send and decode after the receive (recv) operation.
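
Here is a minimal round-trip sketch, separate from the socket example above, using nothing beyond Python 3's built-in str and bytes types: encode() before sending, decode() after receiving.

outgoing = 'Merhaba, dünya!'      # str (Unicode) in Python 3
data = outgoing.encode()          # str -> bytes, UTF-8 by default
print(type(data), data)           # <class 'bytes'> b'Merhaba, d\xc3\xbcnya!'
incoming = data.decode()          # bytes -> str, UTF-8 by default
print(type(incoming), incoming)   # <class 'str'> Merhaba, dünya!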

Retrieving Web Pages

Urllib Library

Because HTTP has become so common, Python gained the urllib library, which lets you treat a web page almost like a file.

import urllib.request, urllib.error, urllib.parse

fhand = urllib.request.urlopen('https://barbaros.blog/about')
for line in fhand:
    print(line.decode().strip())

Output (limited):

<!DOCTYPE html>
<html lang="en-US">
<head>
<meta charset="UTF-8">
<meta name="viewport" content="width=device-width, initial-scale=1, maximum-scale=1">
<title>About – Barbaros Blog</title>
<link href="https://fonts.googleapis.com/css?family=Lato%3A%2C400%2C700%2C900%7CRoboto%3A%2C400%2C700%2C900"

As you can see, urllib returns HTML. Let's also work with a txt file.

import urllib.request, urllib.error, urllib.parse

# open the file
fhand = urllib.request.urlopen('http://data.pr4e.org/romeo.txt')

# let's count the words
counts = dict()
for line in fhand:
    words = line.decode().split()
    for word in words:
        counts[word] = counts.get(word, 0) + 1
print(counts)

This example reads the txt file and returns the frequency of each word.
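
If you want just the most frequent word, one way (a small sketch that assumes the counts dictionary built above) is:

# Pick the key with the largest count
bigword = max(counts, key=counts.get)
print(bigword, counts[bigword])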

Parsing Web Pages

We will use web scraping to parse web pages. People scrape for various reasons: to pull data, to get their own data back when a site offers no export capability, to monitor a site for new information, or to spider the web to build a search engine. Some sites, like Amazon, can be snippy about scraping, so read the terms of service before you do it if you don't want to be blocked by the site. To parse web pages we could do plain string searches, but there is an easier way: the Beautiful Soup library, a free library built for web scraping. Don't forget to read its documentation, linked under Further Readings & References.
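
To see what Beautiful Soup saves us from, here is a rough hand-rolled string search for href values. The one-line page variable is made up for illustration; real HTML is far messier, which is exactly why a parser is the better tool.

# Manually scan the HTML string for href="..." values
page = '<p>Read <a href="https://barbaros.blog/about">about</a> me</p>'
pos = 0
while True:
    start = page.find('href="', pos)
    if start == -1:
        break
    start = start + len('href="')
    end = page.find('"', start)
    print(page[start:end])   # https://barbaros.blog/about
    pos = end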

Beautiful Soup

Let's create an anchor tag crawler with Beautiful Soup.

import urllib.request, urllib.error, urllib.parse
from bs4 import BeautifulSoup

url = input('Enter -: ')
html = urllib.request.urlopen(url).read()
soup = BeautifulSoup(html, 'html.parser')

# Retrieve all of the anchor tags
tags = soup('a')
for tag in tags:
    print(tag.get('href', None))

Output (limited):

Enter -: https://barbaros.blog
https://barbaros.blog/
http://barbaros.blog/
About
https://barbaros.blog/category/markets/
#
https://barbaros.blog/category/analytics/refresher/
https://barbaros.blog/category/analytics/growht-algorithms/
https://barbaros.blog/category/strategy/
https://barbaros.blog/
Tweets by barbarosblog1
https://instagram.com/barbarosblog
...

Let’s make a summary:

  1. TCP/IP gives us pipes/sockets between applications.
  2. We designed application protocols to make use of these pipes.
  3. HTTP is a simple yet powerful protocol.
  4. Python has good support for sockets, HTTP, and HTML parsing.

If you work with SSL-certified (HTTPS) sites, it is handy to add a way to ignore SSL certificate errors.

import urllib.request, urllib.error, urllib.parse
from bs4 import BeautifulSoup
import ssl

# Ignore SSL certificate errors
ctx = ssl.create_default_context()
ctx.check_hostname = False
ctx.verify_mode = ssl.CERT_NONE

# open
url = input('Enter -: ')
html = urllib.request.urlopen(url, context=ctx).read()
soup = BeautifulSoup(html, 'html.parser')

# Retrieve all of the anchor tags
tags = soup('a')
for tag in tags:
    print(tag.get('href', None))

In the SSL part, we ignore the SSL-related certificate errors. You can do this when you work with SSL-certified sites; it's a little hack.

Further Readings & References

ASCII Table
Unicode
Networked Technology, Barbaros Blog
Beautiful Soup Library, Crummy
