Question
# EDGAR - Reading Information Tables in Text Format - Advanced Text Mining
# Do not use Beautiful Soup or any other parser for this assignment. Do not use NumPy or Pandas either.
#### So far, we have collected CIKs for each of the mutual funds, then looked up the links of all the 13F-HRs and the Information Tables, and identified them as either text tables or XML tables. In class, we obtained the information from the XML tables. In this assignment we will obtain information from the text files, which are not as nicely structured as the XML output. In a CSV (HW_Mutual_Fund_Info_Table_Link.csv) you will find a few of these links. Your goal for this part of the homework is to obtain the links to the text files from the attached file (HW_Mutual_Fund_Info_Table_Link.csv). Then you will write code that goes to each of the linked text files and extracts some columns. Do not use Beautiful Soup for this assignment. We provide some initial code to guide your initial steps:
import urllib
text_link = []
text_Name = []
text_date = []
input_file = open('HW_Mutual_Fund_Info_Table_Link.csv', 'r')
rows = input_file.readlines()
input_file.close()
# Your code for building the lists goes here. The result should be three lists:
# text_link, text_Name, and text_date. Each should have length 122.
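A minimal sketch of one way to populate the three lists, assuming each row of the CSV holds the link, fund name, and date as its first three comma-separated fields, with no header row (inspect the file and adjust the indices, or skip rows[0], if your CSV differs):
```
for row in rows:
    fields = row.strip().split(',')
    if len(fields) < 3:
        continue  # skip blank or malformed lines
    # Column order (link, name, date) is an assumption; check the CSV.
    text_link.append(fields[0])
    text_Name.append(fields[1])
    text_date.append(fields[2])

print(len(text_link), len(text_Name), len(text_date))  # each should be 122
```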
#### Keep only the links that correspond to a date after 2008 (don't include 2008; start at 2009). Hint: you can use the datetime library.
from datetime import datetime
#Use the following list to store the filtered values.
filtered_dates = []
filtered_name = []
filtered_link = []
# Enter code here to keep only the dates corresponding to filings after 2008 (i.e., 2009 onward).
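A minimal sketch of the filtering step, assuming the entries in text_date are strings like 2013-02-14 (the '%Y-%m-%d' format is an assumption; adjust the strptime pattern to match your CSV):
```
for date_str, name, link in zip(text_date, text_Name, text_link):
    # Parse the date string; the format below is assumed, not confirmed.
    filing_date = datetime.strptime(date_str.strip(), '%Y-%m-%d')
    if filing_date.year >= 2009:  # keep 2009 and later; 2008 is excluded
        filtered_dates.append(date_str)
        filtered_name.append(name)
        filtered_link.append(link)
```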
#### Your filtered list should now have 11 elements, representing 3 mutual funds. The first has CIK 1311981 (you can find this in the link, at data/1311981). The second has CIK 813470. The third has CIK 1432353.
['https://www.sec.gov/Archives/edgar/data/1311981/000116204413000513/0001162044-13-000513.txt',
'https://www.sec.gov/Archives/edgar/data/813470/000081347013000006/0000813470-13-000006.txt',
'https://www.sec.gov/Archives/edgar/data/813470/000081347013000001/0000813470-13-000001.txt',
'https://www.sec.gov/Archives/edgar/data/813470/000081347012000023/0000813470-12-000023.txt',
'https://www.sec.gov/Archives/edgar/data/813470/000081347012000019/0000813470-12-000019.txt',
'https://www.sec.gov/Archives/edgar/data/813470/000081347012000014/0000813470-12-000014.txt',
'https://www.sec.gov/Archives/edgar/data/813470/000081347012000003/0000813470-12-000003.txt',
'https://www.sec.gov/Archives/edgar/data/813470/000081347009000009/0000813470-09-000009.txt',
'https://www.sec.gov/Archives/edgar/data/813470/000081347009000005/0000813470-09-000005.txt',
'https://www.sec.gov/Archives/edgar/data/813470/000081347009000001/0000813470-09-000001.txt',
'https://www.sec.gov/Archives/edgar/data/1432353/000114420411008428/0001144204-11-008428.txt']
#### Next, for each text link, extract the name of issuer, the CUSIP, and the quantity of shares. You will also want to keep track of the mutual fund name as well as the filing report date.
#### Your output file should have 5 columns. The first is the issue date of the form, which can be found in the filtered_dates list (this will be repeated for rows from the same form). The second is the mutual fund name, which can be found in the filtered_name list (this will also be repeated). The third, fourth, and fifth are the name of issuer, CUSIP, and shares, respectively. Make sure to account for the fact that while some of the text files share the same formatting, others do not; this means you will have to look through them to make sure your code works for each text file. (Please use one chunk of code to process all the URLs, i.e., if we were to change the list of URLs your code should still work for the new URLs. Do not process the URLs separately.) A sketch of one possible approach appears after the starter lists below.
import requests
import random
headers_list = [
    # Firefox 77 Mac
    {
        "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10.15; rv:77.0) Gecko/20100101 Firefox/77.0",
        "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8",
        "Accept-Language": "en-US,en;q=0.5",
        "Referer": "https://www.google.com/",
        "DNT": "1",
        "Connection": "keep-alive",
        "Upgrade-Insecure-Requests": "1"
    },
    # Chrome 92.0 Win10
    {
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/92.0.4515.107 Safari/537.36",
        "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8",
        "Accept-Language": "en-US,en;q=0.5",
        "Accept-Encoding": "gzip, deflate, br",
        "Referer": "https://www.google.com/",
        "DNT": "1",
        "Connection": "keep-alive",
        "Upgrade-Insecure-Requests": "1"
    },
    # Chrome 91.0 Win10
    {
        "Connection": "keep-alive",
        "DNT": "1",
        "Upgrade-Insecure-Requests": "1",
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36",
        "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9",
        "Sec-Fetch-Site": "none",
        "Sec-Fetch-Mode": "navigate",
        "Sec-Fetch-Dest": "document",
        "Referer": "https://www.google.com/",
        "Accept-Encoding": "gzip, deflate, br",
        "Accept-Language": "en-GB,en-US;q=0.9,en;q=0.8"
    },
    # Firefox 90.0 Win10
    {
        "Connection": "keep-alive",
        "Upgrade-Insecure-Requests": "1",
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:90.0) Gecko/20100101 Firefox/90.0",
        "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9",
        "Sec-Fetch-Site": "same-origin",
        "Sec-Fetch-Mode": "navigate",
        "Sec-Fetch-User": "?1",
        "Sec-Fetch-Dest": "document",
        "Referer": "https://www.google.com/",
        "Accept-Encoding": "gzip, deflate, br",
        "Accept-Language": "en-US,en;q=0.9"
    }
]
Hint: If you wish to obtain the raw text content for a particular URL, you can use the following code:
```
headers = random.choice(headers_list)   # pick a random header set from above
session = requests.Session()
session.headers = headers
content = session.get(url).text         # url is the link you want to fetch
```
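For example, applied to the first filtered filing (note that EDGAR's fair-access guidelines ask requesters to declare a User-Agent that identifies them, typically with a contact email, so consider editing the headers above accordingly):
```
headers = random.choice(headers_list)
session = requests.Session()
session.headers = headers
print(session.get(filtered_link[0]).text[:300])  # peek at the SEC header block
```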
issue_date = []
mutual_fund_name = []
name_of_issuer = []
CUSIP = [] # CUSIP number
shares = [] # No. of Shares of the company in the Mutual Fund
# Your code goes here
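Below is a minimal sketch of one way the extraction loop could look. The regular expression assumes the common plain-text layout where each holdings row lists the issuer (possibly followed by the class title), a 9-character CUSIP, the market value, and then the share count; real filings deviate from this, so treat the pattern as a starting point to tune against each file, not a complete solution. The output filename holdings_output.csv is a placeholder.
```
import re
import csv
import time

# Loose pattern for one holdings row: issuer text, a 9-character CUSIP,
# a value column (often reported in $1000s, not kept here), then the
# share count. This column order is an assumption based on common 13F
# text layouts; some filings will need their own handling.
row_pattern = re.compile(
    r'^(?P<issuer>\S.*?)\s+'        # name of issuer (may absorb class title)
    r'(?P<cusip>[0-9A-Z]{9})\s+'    # 9-character CUSIP
    r'(?P<value>[\d,]+)\s+'         # market value column, skipped
    r'(?P<shares>[\d,]+)\b'         # quantity of shares
)

for date_val, fund, link in zip(filtered_dates, filtered_name, filtered_link):
    session = requests.Session()
    session.headers = random.choice(headers_list)
    filing_text = session.get(link).text
    time.sleep(0.5)  # be polite to EDGAR between requests

    for line in filing_text.splitlines():
        match = row_pattern.match(line.strip())
        if match:
            issue_date.append(date_val)          # repeated per form
            mutual_fund_name.append(fund)        # repeated per form
            name_of_issuer.append(match.group('issuer').strip())
            CUSIP.append(match.group('cusip'))
            shares.append(match.group('shares').replace(',', ''))

# Write the five columns with the standard-library csv module (no Pandas).
with open('holdings_output.csv', 'w', newline='') as f:
    writer = csv.writer(f)
    writer.writerow(['issue_date', 'mutual_fund_name', 'name_of_issuer',
                     'CUSIP', 'shares'])
    writer.writerows(zip(issue_date, mutual_fund_name, name_of_issuer,
                         CUSIP, shares))
```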