ChineseFundReports (crawling fund reports from CNINFO & EastMoney and extracting target items)


(R4/8/27) Updates

cninf_orgid and cninf_crawler now support crawling information for stocks. Only a few stocks have been tested, so some of the POST data may need to be changed if a request returns an error.
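The POST data mentioned above can be sketched as a small payload builder. Everything below is a hypothetical illustration: the field names, values, and the sample IDs are assumptions, not the project's actual parameters, which live in cninf_crawler.

```python
# Hypothetical sketch of assembling the POST form data for a stock query.
# All field names and values below are assumptions for illustration; the
# real parameters are defined in cninf_crawler and may differ.
def build_stock_payload(stock_code: str, org_id: str, page: int = 1) -> dict:
    """Assemble form data for an announcement-query request."""
    return {
        "stock": f"{stock_code},{org_id}",  # code and orgID joined by a comma
        "pageNum": page,
        "pageSize": 30,
        "category": "",  # left empty here: no category filter
    }

payload = build_stock_payload("000001", "9900001234")  # both IDs are made up
```

Keeping payload assembly in one helper makes it easy to tweak the fields when a particular stock returns an error.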

(R4/8/24) Updates

The methods for fetching the orgID, which is required when posting data to the server, have been moved from the crawler class into a new class named cninf_orgid_finder; the workflow is now simpler and the code more compact.

(R4/8/19) Updates

The main logic for counting word & sentence frequencies has been moved into a new module named chinese_counter to keep the main classes under Word_freq tidier.

Introduction

This is a project to:

  • ✔️ crawl open-end fund reports, mainly from two sources (CNINFO and EastMoney);
  • ✔️ convert PDF reports to .txt files and extract the expectation part from each report;
  • ✔️ train a word2vec model on a corpus built from the fund reports and retrieve words similar to those in the dictionaries at hand;
  • ✔️ calculate word frequencies based on a specific dictionary.

Structure of the project

1. CNINF CRAWLER

A crawler is constructed to crawl info and reports from both report sources. A pdf-to-txt converter class specific to my task is also introduced; it can be slightly revised to fit other, similar needs.

This part consists of 3 classes:

  1. cninf_crawler: the class to crawl info and reports from the CNINF website;
  2. pdf2txt: the class to convert fund reports from .pdf to .txt format;
  3. extract_expc_from_report: the class to extract the expectation part from a report, if present.
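The extraction step in class 3 can be sketched as a regex that captures the text between an "outlook" heading and the next numbered heading. The heading patterns below are assumptions for illustration; real reports may label the section differently, and extract_expc_from_report holds the actual rules.

```python
import re
from typing import Optional

# Minimal sketch of extracting the expectation ("brief outlook") section
# from a converted .txt report. The heading patterns are assumptions; see
# extract_expc_from_report for the project's actual extraction rules.
def extract_expectation(text: str) -> Optional[str]:
    # Capture from a heading mentioning 简要展望 (brief outlook) up to the
    # next numbered heading such as "4.5 ..." or "五、...".
    pattern = re.compile(
        r"(简要展望.*?)(?=\n\s*(?:\d+\.\d+|[一二三四五六七八九十]+、))",
        re.S,
    )
    m = pattern.search(text)
    return m.group(1).strip() if m else None

sample = (
    "4.4 管理人对宏观经济、证券市场的简要展望\n"
    "预计市场将震荡上行。\n"
    "4.5 报告期内基金的投资策略"
)
expc = extract_expectation(sample)
```

Returning None when no heading matches lets the caller record reports whose expectation part is absent.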

2. WORD2VEC

This part provides two ways to train a word2vec model on the corpus at hand:

  1. method I:
    1. process_1 reads the available reports one by one and aggregates all processed reports into a single file, where each line is a sentence;
    2. process_2 imports the file and removes stop words;
    3. word2vec_train trains word2vec on the single file derived in step 2 and finds synonyms for the words in our preliminary Chinese LM dicts.
  2. method II: instead of first aggregating all texts in the corpus and then training, iter_train uses an iterable to feed content to the model during training, the approach recommended by gensim. Refer to the gensim documentation for details.
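Method II's streaming idea can be sketched as an iterable that yields one tokenised sentence at a time, so the model never needs the whole corpus in memory. The file layout (one pre-segmented, space-separated sentence per line) is an assumption matching the file method I produces; real segmentation would happen beforehand (e.g. with jieba).

```python
import os
import tempfile

# Sketch of a streaming corpus: gensim's Word2Vec accepts any iterable that
# yields lists of tokens, and re-iterates it once per training epoch.
class SentenceStream:
    def __init__(self, path):
        self.path = path

    def __iter__(self):
        with open(self.path, encoding="utf-8") as f:
            for line in f:
                tokens = line.split()  # pre-segmented, space-separated tokens
                if tokens:             # skip blank lines
                    yield tokens

# Quick demo on a throwaway two-sentence corpus:
with tempfile.NamedTemporaryFile(
    "w", suffix=".txt", delete=False, encoding="utf-8"
) as f:
    f.write("基金 收益 上涨\n\n市场 震荡\n")
    demo_path = f.name
sentences = list(SentenceStream(demo_path))
os.unlink(demo_path)

# With gensim installed, training would look like:
#   from gensim.models import Word2Vec
#   model = Word2Vec(sentences=SentenceStream("corpus.txt"), vector_size=100)
```

Because `__iter__` reopens the file each time, the same object can be passed for every training epoch.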

After the model params are set, use the get_similar class to get similar words for the dicts at hand. This class provides two ways (implemented with 4 methods) to get similar words. Check the comments for detailed usage.
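What "getting similar words" amounts to can be illustrated with a stand-alone cosine-similarity ranking. The toy 2-d vectors below are made up; in the project the vectors come from the trained word2vec model (gensim's `model.wv.most_similar` performs this ranking directly).

```python
import math

# Illustrative stand-in for similar-word retrieval: rank candidate words by
# cosine similarity to a seed word's vector. Toy vectors only.
def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def most_similar(seed, vectors, topn=2):
    ranked = sorted(
        ((w, cosine(vectors[seed], vec)) for w, vec in vectors.items() if w != seed),
        key=lambda pair: pair[1],
        reverse=True,
    )
    return [w for w, _ in ranked[:topn]]

# 上涨 "rise" should land closer to 走高 "climb" than to 下跌 "fall".
toy = {"上涨": [1.0, 0.1], "走高": [0.9, 0.2], "下跌": [-1.0, 0.0]}
```

Candidate words whose similarity exceeds a chosen threshold can then be appended to the dictionary.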

3. EASTMONEY CRAWLER

This module is a crawler built by another member of the team and adopted as-is. EastMoney does not host as many reports as CNINF, so I have hardly used it since my cninf crawler was built. The code is not easy to read, and I highly recommend using the cninf crawler instead, as it is better organised.

4. WORD FREQ CALCULATOR

This part provides two classes to extract text and construct panel data from existing report files (in .txt format).

  • Since some of the texts under the expectation part of a report are exported directly from a WIND terminal, I developed the module extract_expc_panel to extract panel info directly from an Excel table in which the expectation texts and the fund info are saved.
  • After saving the reports downloaded by the crawler and converting them to .txt format (refer to the Crawler repository), use extract_full_panel to generate an Excel table that summarises the report info.
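The full-panel step can be sketched as a walk over the .txt reports that emits one summary row per file. The column names and the filename convention (fundcode_period.txt) are assumptions for illustration, and CSV is used to keep the sketch dependency-free; the project writes Excel tables.

```python
import csv
import os
import tempfile

# Rough sketch of building a report-level panel: one row per .txt report.
# Columns and filename convention are illustrative assumptions.
def build_panel(report_dir: str, out_csv: str) -> int:
    rows = []
    for name in sorted(os.listdir(report_dir)):
        if not name.endswith(".txt"):
            continue
        fund_code, _, _ = name.partition("_")  # assumed fundcode_period.txt
        with open(os.path.join(report_dir, name), encoding="utf-8") as f:
            text = f.read()
        rows.append({"fund_code": fund_code, "file": name, "n_chars": len(text)})
    with open(out_csv, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=["fund_code", "file", "n_chars"])
        writer.writeheader()
        writer.writerows(rows)
    return len(rows)

# Demo with one fake report in a temporary directory:
tmp = tempfile.mkdtemp()
with open(os.path.join(tmp, "000001_2021A.txt"), "w", encoding="utf-8") as f:
    f.write("本报告期内，基金运作平稳。")
n_rows = build_panel(tmp, os.path.join(tmp, "panel.csv"))
```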

After we get the two panels, use the frequency-calculator classes to calculate word and sentence frequencies based on the content of the reports. The utils module holds some funcs that I think should be isolated from the two classes mentioned above to keep the modules tidier.
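The word- and sentence-frequency counting can be sketched as follows. Chinese text has no word boundaries, so term frequencies can be taken as substring counts, and a sentence "hits" if it contains any dictionary term; the two-word dictionary and the sentence splitter below are illustrative only, not chinese_counter's actual implementation.

```python
# Illustrative sketch of dictionary-based counting over Chinese report text.
def count_terms(text, dictionary):
    # Term frequency as substring count (no tokenisation needed).
    return {term: text.count(term) for term in dictionary}

def count_hit_sentences(text, dictionary):
    # Split on common Chinese end-of-sentence punctuation.
    for mark in ("！", "？"):
        text = text.replace(mark, "。")
    sentences = [s for s in text.split("。") if s]
    # A sentence counts as a hit if it contains at least one dictionary term.
    return sum(any(term in s for term in dictionary) for s in sentences)

report = "市场情绪乐观。基金净值上涨。风险犹存。"
freqs = count_terms(report, ["乐观", "上涨"])
hits = count_hit_sentences(report, ["乐观", "上涨"])
```

Plain substring counting can over-match when one dictionary term is contained in another, which is one reason to keep this logic in its own module where it can be refined.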

(R4/8/19) Updates: the main logic for counting word & sentence frequencies has been isolated into a new module named chinese_counter.

Finally, after linking the expectation results to the full panel, use fill_empty_expc to fill in reports where the expectation part is missing. The filling rules are summarised in the Chinese-version logs, which are not open to the public.
