Highest Paid K-Drama Actors 2021

Oluwafunmilayo C. Sofuwa
5 min read · Oct 19, 2021

A web-scraping approach

I have a love affair with Korean dramas. The first K-drama I ever watched was Boys Over Flowers (the same goes for most people I know) and since then… let’s just say the rest is history!

I decided to combine my love for data and K-dramas into an interesting project, and I took a web-scraping approach.

In this article, I will discuss:

  1. What web scraping is
  2. Resources I used in learning
  3. Summary of how I approached my project
  4. Who made the list?
  5. Points to note
  6. Links to my other pages

What is ‘Web Scraping’?

Wikipedia defines web scraping as ‘data scraping used for extracting data from websites. While web scraping can be done manually by a software user, the term typically refers to automated processes implemented using a bot or web crawler. It is a form of copying in which specific data is gathered and copied from the web, typically into a central local database or spreadsheet, for later retrieval or analysis.’

Zyte lists what a general DIY web scraping process looks like:

  1. Identify the target website
  2. Collect the URLs of the pages you want to extract data from
  3. Make a request to these URLs to get the HTML of the page
  4. Use locators to find the data in the HTML
  5. Save the data in a JSON or CSV file or some other structured format

This process, however, suits small projects rather than scraping that needs to scale.
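As a rough illustration of steps 3 to 5, here is a minimal sketch that uses only Python's standard library rather than Beautiful Soup. The HTML snippet, the `actor` and `fee` class names, and the placeholder values are all made up for the example. On a real site you would first download the page (for example with `urllib.request.urlopen`) and inspect its markup to find the right locators.

```python
import csv
import io
from html.parser import HTMLParser

# Made-up HTML standing in for a downloaded page (step 3).
SAMPLE_HTML = """
<ul>
  <li><span class="actor">Actor A</span> <span class="fee">Fee A</span></li>
  <li><span class="actor">Actor B</span> <span class="fee">Fee B</span></li>
</ul>
"""


class ActorFeeParser(HTMLParser):
    """Collects the text of <span class="actor"> and <span class="fee">."""

    def __init__(self):
        super().__init__()
        self.rows = []       # finished rows: {"actor": ..., "fee": ...}
        self._field = None   # field currently being read, if any
        self._row = {}       # row being built for the current <li>

    def handle_starttag(self, tag, attrs):
        css_class = dict(attrs).get("class")
        if tag == "span" and css_class in ("actor", "fee"):
            self._field = css_class

    def handle_data(self, data):
        if self._field:
            self._row[self._field] = data.strip()

    def handle_endtag(self, tag):
        if tag == "span":
            self._field = None
        elif tag == "li" and self._row:
            self.rows.append(self._row)
            self._row = {}


def rows_to_csv(rows):
    """Step 5: serialise the rows as CSV text."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=["actor", "fee"])
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


parser = ActorFeeParser()
parser.feed(SAMPLE_HTML)             # step 4: locate the data in the HTML
csv_text = rows_to_csv(parser.rows)  # step 5: structured output
print(csv_text)
```

With Beautiful Soup the locating step is much shorter, which is a big part of why the tutorials below use it.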

Resources I used

To learn about web scraping, I used three YouTube videos:

  1. Web Scraping with Python — Beautiful Soup Crash Course
  2. Comprehensive Python Beautiful Soup Web Scraping Tutorial! (find/find_all, css select, scrape table)
  3. Solving real world data science tasks with Python Beautiful Soup! (movie dataset creation)

Project Summary

Initially I wanted to scrape American movies, but since that was fairly common, I decided to go another route. My plan was to get the highest paid Korean actors and then get only the television dramas they have featured in. I found various websites listing the highest paid Korean drama actors; however, there were slight differences between them. The first website I started scraping had an issue with its page, so I ended up using the information from SeoulSpace. I particularly like the way the information is presented on their website, which is similar to the website I originally wanted to use. They give a sort of backstory for each actor as well as their fee per episode and estimated total earnings from K-dramas. I scraped the backstory and the fee per episode.

Once that was done, my next plan of action was to get each actor’s Wikipedia page or filmography page. From their Wikipedia pages, I extracted their date of birth and age. My original next step was to get the list of series they have acted in and then get more information about each series using the OMDb API. I couldn’t move forward with this method for a number of reasons:

  • Some Korean series share a title with another movie or series, so the OMDb API would have given me the wrong information
  • Some Korean series have alternative titles
  • Information on some Korean series does not exist in the OMDb API
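To make the first problem concrete, here is a small sketch of how a lookup by title differs from a lookup by IMDb ID. It only builds the request URLs and does not call the API. The `t`, `i`, `type`, and `apikey` query parameters are the ones OMDb documents; the API key, the title, and the ID below are placeholders, not real values, and the function names are my own.

```python
from urllib.parse import urlencode

OMDB_ENDPOINT = "https://www.omdbapi.com/"


def omdb_url_by_title(title, api_key, media_type="series"):
    """Build a lookup by title.

    Titles are not unique, so this can match a different movie or
    series that happens to share the name.
    """
    query = urlencode({"t": title, "type": media_type, "apikey": api_key})
    return f"{OMDB_ENDPOINT}?{query}"


def omdb_url_by_imdb_id(imdb_id, api_key):
    """Build a lookup by IMDb ID, which identifies exactly one title."""
    query = urlencode({"i": imdb_id, "apikey": api_key})
    return f"{OMDB_ENDPOINT}?{query}"


# Placeholder values for illustration only.
print(omdb_url_by_title("Some Title", api_key="YOUR_KEY"))
print(omdb_url_by_imdb_id("tt0000000", api_key="YOUR_KEY"))
```

A lookup by ID avoids the ambiguity, but it only helps once you already know the ID, which is one reason going to IMDb directly can be simpler.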

With this realization, I chose to get the series information entirely from IMDb. The information I wanted included the series summary, alternative title, year of release, and IMDb rating, as well as the series image (the series poster). I will briefly list each step I took.

  1. I started from IMDb’s advanced search URL. With it, I built a search URL for each actor and picked the first item from the search results.
  2. The first item in the search results contains the actor’s IMDb URL. I had to use this method because IMDb assigns a unique number to every series, movie, actor, etc. on its website.
  3. Using the actor’s IMDb URL, I extracted the actor’s name, the names of the series they have acted in, and the series URLs.
  4. With each series URL, I got the series’ alternative title, year of release, summary, genre, and IMDb rating.
  5. Once that was done, I needed to scrape the series images. To do this, I created a unique list of the series URLs from step 3.
  6. Most of these URLs contain an image for the series. Each image has a link that takes you to another IMDb page with the picture and some other information. I extracted these image links.
  7. From each image link, I got the final link to where the image was uploaded. These final links led me to the series images, which I downloaded.
  8. Lastly, I downloaded all my data tables as CSV files to local storage on my laptop, along with a zip folder of the images I scraped.
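Two of the steps above are easy to sketch in isolation: pulling the unique ID out of an IMDb URL and building the unique list of series URLs. The URLs below are placeholders with made-up IDs, and the regular expression assumes the usual `/title/tt<digits>/` and `/name/nm<digits>/` URL patterns. This is an illustration, not the exact code from my project.

```python
import re

# Matches the ID part of IMDb title and name URLs.
IMDB_ID_PATTERN = re.compile(r"/(?:title|name)/((?:tt|nm)\d+)")


def imdb_id(url):
    """Return the IMDb ID found in the URL, or None if there is none."""
    match = IMDB_ID_PATTERN.search(url)
    return match.group(1) if match else None


def canonical_title_url(url):
    """Strip tracking parameters so the same series always has one URL."""
    title_id = imdb_id(url)
    if title_id is None or not title_id.startswith("tt"):
        return None
    return f"https://www.imdb.com/title/{title_id}/"


def unique_in_order(items):
    """Remove duplicates while keeping the first-seen order."""
    return list(dict.fromkeys(items))


# Placeholder URLs: two actors can appear in the same series.
scraped_urls = [
    "https://www.imdb.com/title/tt0000001/?ref_=actor_a",
    "https://www.imdb.com/title/tt0000002/",
    "https://www.imdb.com/title/tt0000001/?ref_=actor_b",
]
series_urls = unique_in_order(canonical_title_url(u) for u in scraped_urls)
print(series_urls)
```

Normalising the URLs before removing duplicates matters because the same series can be reached through links that differ only in their query string.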

Who made the list?

  1. Top of the list was Kim Soo-hyun. If you have watched Moon Embracing the Sun, My Love from the Star, or It’s Okay to Not Be Okay, you should know him.
  2. So Ji-sub, known for his roles in I’m Sorry, I Love You, Something Happened in Bali, and Master’s Sun.
  3. Crash Landing on You actor, Hyun Bin.
  4. One of my all-time favorites from Boys Over Flowers, Lee Min-ho.
  5. Ji Chang-wook - Some of his top Korean dramas are Suspicious Partner, Empress Ki, and Healer.
  6. Jo In-sung
  7. Yoo Ah-in
  8. Lee Jong-suk
  9. Lee Seung-gi
  10. Song Joong-ki

Points to note

  • I did a lot of back and forth between my code and the websites I was scraping.
  • I was working on a small project, not on data collection that needed to scale.
  • I noticed that there were times when the HTML I passed to Beautiful Soup was incomplete because the page had not loaded properly, which affected the information I could get from that website. In such instances, I had to rerun my code multiple times before I got the information I needed. Stack Overflow gave me this insight, as I initially thought my code had an issue. It also stated that Selenium can be used to avoid this problem, which I plan to learn about soon.
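Rerunning the whole script by hand can be replaced with a small retry loop. The sketch below is one possible way to do it, not what I did in the project: `fetch` is a hypothetical function that returns a page's HTML, and `looks_complete` is a hypothetical check for whatever marker tells you the data you need is present.

```python
import time


def fetch_with_retries(fetch, url, looks_complete, attempts=3, delay_seconds=1.0):
    """Call fetch(url) until looks_complete(html) is true or attempts run out.

    Returns the HTML on success, or None if every attempt failed.
    """
    for attempt in range(1, attempts + 1):
        try:
            html = fetch(url)
        except OSError:
            # Network errors such as urllib's URLError are OSError subclasses.
            html = None
        if html is not None and looks_complete(html):
            return html
        if attempt < attempts:
            # Wait a little longer after each failed attempt.
            time.sleep(delay_seconds * attempt)
    return None


# Usage with a stand-in fetch function that succeeds on the second call.
pages = iter(["<html></html>", "<html><table id='data'></table></html>"])
html = fetch_with_retries(
    fetch=lambda url: next(pages),
    url="https://example.com/",
    looks_complete=lambda page: "id='data'" in page,
    delay_seconds=0,
)
print(html)
```

If the missing content is filled in by JavaScript after the page loads, retrying will not help, because the raw HTML never contains it. That is the case where a browser-driving tool such as Selenium comes in.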

Links to other pages

I hope this article is helpful to you or that you just enjoyed reading it. Here are links to the code for this project and to my other pages:

Project Link

Project Visualization

GitHub

Tableau

LinkedIn
