
Datawhale_day2

程序员文章站 2022-07-14 23:11:47

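Chapter exercises

  1. Assume characters 3750, 900, and 648 are sentence punctuation marks. On average, how many sentences does each news article in the competition data contain?
  2. Find the most frequent character in each news category.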

————————————————————————————————————————————

Exercise 1 code:

import os
from collections import Counter

import pandas as pd


# Path to the training set, relative to the current working directory.
data_set = os.path.join(os.getcwd(), "数据集\\train_set.csv\\train_set.csv")
print(data_set)
train_df = pd.read_csv(data_set, sep='\t')

# Character ids that the exercise treats as sentence punctuation.
punctuation = ("3750", "900", "648")

sum_sentences, lines = 0, 0
for content in train_df["text"]:
    lines += 1
    # Count how often each character id occurs in this article.
    num_dict = Counter(content.split(" "))
    # Total number of punctuation marks (a Counter returns 0 for missing keys).
    num_marks = sum(num_dict[p] for p in punctuation)
    # An article without any punctuation still counts as one sentence.
    sum_sentences += num_marks if num_marks > 0 else 1
print(sum_sentences / lines)

 ————————————————————————————————————————————

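Exercise 2 approach:

Use a dict to count how often each character occurs within each class, then pick the character with the highest count in each class.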
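
The approach above could be sketched roughly as follows. This is only a minimal sketch, not the original author's code: the function name `most_common_char_per_class`, the optional `exclude` parameter, and the sample rows are all hypothetical, and it assumes each row pairs a class label with the space-separated text.

```python
from collections import Counter, defaultdict


def most_common_char_per_class(rows, exclude=()):
    """Return {label: (char, count)} for the most frequent char in each class.

    rows    -- iterable of (label, text) pairs; text holds space-separated char ids
    exclude -- char ids to ignore, e.g. the punctuation ids (hypothetical option)
    """
    counts = defaultdict(Counter)  # one Counter per class label
    for label, text in rows:
        counts[label].update(ch for ch in text.split() if ch not in exclude)
    # most_common(1) yields a one-element list holding the top (char, count) pair.
    return {label: c.most_common(1)[0] for label, c in counts.items() if c}


# Tiny made-up sample, for illustration only (not real competition data).
sample = [
    (0, "3750 12 12 900 12"),
    (0, "12 7 3750"),
    (1, "5 5 648 9"),
]
print(most_common_char_per_class(sample))
```

With the DataFrame from Exercise 1, the rows could be supplied as `zip(train_df["label"], train_df["text"])`, assuming the label column is named `label`. Passing the three punctuation ids as `exclude` would keep them from dominating the result, if that is the intended reading of the exercise.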

 
